Re: 64-bit XIDs in deleted nbtree pages

2021-02-09 Thread Heikki Linnakangas

On 10/02/2021 00:14, Peter Geoghegan wrote:

There is a long standing problem with the way that nbtree page
deletion places deleted pages in the FSM for recycling: The use of a
32-bit XID within the deleted page (in the special
area's/BTPageOpaqueData struct's btpo.xact field) is not robust
against XID wraparound, which can lead to permanently leaking pages in
a variety of scenarios. The problems became worse with the addition of
the INDEX_CLEANUP option in Postgres 12 [1]. And, using a 32-bit XID
in this context creates risk for any further improvements in VACUUM
that similarly involve skipping whole indexes. For example, Masahiko
has been working on a patch that teaches VACUUM to skip indexes that
are known to have very little garbage [2].

Attached patch series fixes the issue once and for all. This is
something that I'm targeting for Postgres 14, since it's more or less
a bug fix.


Thanks for picking this up!


The first patch teaches nbtree to use 64-bit transaction IDs here, and
so makes it impossible to leak deleted nbtree pages. This patch is the
nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
due to exactly the same set of problems. The first patch also makes
the level field stored in nbtree page's special area/BTPageOpaqueData
reliably store the level, even in a deleted page. This allows me to
consistently use the level field within amcheck, including even within
deleted pages.


Is it really worth the trouble to maintain 'level' on deleted pages? All 
you currently do with it is check that the BTP_LEAF flag is set iff 
"level == 0", which seems pointless. I guess there could be some 
forensic value in keeping 'level', but meh.



The second patch in the series adds new information to VACUUM VERBOSE.
This makes it easy to understand what's going on here. Index page
deletion related output becomes more useful. It might also help with
debugging the first patch.

Currently, VACUUM VERBOSE output for an index that has some page
deletions looks like this:

"38 index pages have been deleted, 38 are currently reusable."

With the second patch applied, we might see this output at the same
point in VACUUM VERBOSE output instead:

"38 index pages have been deleted, 0 are newly deleted, 38 are
currently reusable."

This means that out of the 38 pages that were found to be
marked deleted in the index, 0 were deleted by the VACUUM operation
whose output we see here. That is, there were 0 nbtree pages that were
newly marked BTP_DELETED within _bt_unlink_halfdead_page() during
*this particular* VACUUM -- the VACUUM operation that we see
instrumentation about here. It follows that the 38 deleted pages that
we encountered must have been marked BTP_DELETED by some previous
VACUUM operation.

In practice the "%u are currently reusable" output should never
include newly deleted pages, since there is no way that a page marked
BTP_DELETED can be put in the FSM during the same VACUUM operation --
that's unsafe (we need all of this recycling/XID indirection precisely
because we need to delay recycling until it is truly safe, of course).
Note that the "%u index pages have been deleted" output includes both
pages deleted by some previous VACUUM operation, and newly deleted
pages (no change there).

Note that the new "newly deleted" output is instrumentation about this
particular *VACUUM operation*. In contrast, the other two existing
output numbers ("deleted" and "currently reusable") are actually
instrumentation about the state of the *index as a whole* at a point
in time (barring concurrent recycling of pages counted in VACUUM by
some random _bt_getbuf() call in another backend). This fundamental
distinction is important here. All 3 numbers/stats that we output can
have different values, which can be used to debug the first patch. You
can directly observe uncommon cases just from the VERBOSE output, like
when a long running transaction holds up recycling of a deleted page
that was actually marked BTP_DELETED in an *earlier* VACUUM operation.
And so if the first patch had any bugs, there'd be a pretty good
chance that you could observe them using multiple VACUUM VERBOSE
operations -- you might notice something inconsistent or contradictory
just by examining the output over time, how things change, etc.


The full message on master is:

INFO:  index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL:  25 index row versions were removed.
2056 index pages have been deleted, 1370 are currently reusable.

How about:

INFO:  index "foo_pkey" now contains 250001 row versions in 2745 pages
DETAIL:  25 index row versions and 686 pages were removed.
2056 index pages are now unused, 1370 are currently reusable.

The idea is that the first DETAIL line now says what the VACUUM did this 
round, and the last line says what the state of the index is now. One 
concern with that phrasing is that it might not be clear what "686 pages 
were removed" means. We don't actually shrink the fil

pg_get_first_normal_oid()?

2021-02-09 Thread Joel Jacobson
Hi,

I need to filter out any system catalog objects from SQL,
and I've learned it's not possible to simply filter based on namespace name,
since there are objects such as pg_am entries that don't belong to any
namespace, except indirectly via their handler. And since you can define a
new access method using an existing handler from the system catalog, there
is no way to distinguish a user-created access method from the system
catalog access methods based on namespace-name filtering alone.

After some digging I found this

   #define FirstNormalObjectId 16384

in src/include/access/transam.h, which pg_dump.c and 14 other files are using 
at 27 different places in the sources.

Seems to be a popular and important fellow.

I see this value doesn't change often, it was added back in 2005-04-13 in 
commit 2193a856a229026673cbc56310cd0bddf7b5ea25.

Is it safe to just hard-code this cut-off value in application code that
needs to know it?

Or will we have a Bill Gates "640K ought to be enough for anybody" moment in 
the foreseeable future,
where this limit needs to be increased?

If there is a risk we will, then maybe we should add a function such as 
$SUBJECT to expose this value to SQL users who need it?

I see there has been a related discussion in the thread "Identifying 
user-created objects"


https://www.postgresql.org/message-id/flat/CA%2Bfd4k7Zr%2ByQLYWF3O_KjAJyYYUZFBZ_dFchfBvq5bMj9GgKQw%40mail.gmail.com

However, that thread focused on security and on knowing whether a specific
oid is user-defined or not.

I think pg_get_first_normal_oid() would be more useful than
pg_is_user_object(oid), since with pg_get_first_normal_oid() you could do
filtering that makes use of indexes on oid columns.

Compare e.g.:

SELECT * FROM pg_class WHERE oid >= pg_get_first_normal_oid()

with..

SELECT * FROM pg_class WHERE pg_is_user_object(oid) IS TRUE

The first query could use the index on pg_class.oid, whereas, if I'm not
mistaken, the second query would need a seq scan to evaluate
pg_is_user_object() for each oid.
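
To illustrate, a stopgap I could use today (the wrapper and its name are
made up here, not an existing API) is to put the hard-coded constant in an
inlinable SQL function, so the cut-off lives in one place while the oid
index can still be used, since the immutable function is folded to a
constant at plan time:

CREATE FUNCTION app_first_normal_oid() RETURNS oid
LANGUAGE sql IMMUTABLE PARALLEL SAFE
AS $$ SELECT 16384::oid $$;  -- FirstNormalObjectId as of today

SELECT * FROM pg_class WHERE oid >= app_first_normal_oid();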

/Joel

Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-09 Thread Amit Langote
On Wed, Feb 10, 2021 at 1:35 PM Greg Nancarrow  wrote:
> On Wed, Feb 10, 2021 at 2:39 PM Amit Langote  wrote:
> >
> > > The code check that you have identified above ensures that the INSERT
> > > has an underlying SELECT, because the planner won't (and shouldn't
> > > anyway) generate a parallel plan for INSERT...VALUES, so there is no
> > > point doing any parallel-safety checks in this case.
> > > It just so happens that the problem test case uses INSERT...VALUES -
> > > and it shouldn't have triggered the parallel-safety checks for
> > > parallel INSERT for this case anyway, because INSERT...VALUES can't
> > > (and shouldn't) be parallelized.
> >
> > AFAICS, max_parallel_hazard() path never bails from doing further
> > safety checks based on anything other than finding a query component
> > whose hazard level crosses context->max_interesting.
>
> It's parallel UNSAFE because it's not safe or even possible to have a
> parallel plan for this.
> (as UNSAFE is the max hazard level, no point in referencing
> context->max_interesting).
> And there are existing cases of bailing out and not doing further
> safety checks (even your v15_delta.diff patch placed code - for
> bailing out on "ON CONFLICT ... DO UPDATE" - underneath one such
> existing case in max_parallel_hazard_walker()):
>
> else if (IsA(node, Query))
> {
> Query  *query = (Query *) node;
>
> /* SELECT FOR UPDATE/SHARE must be treated as unsafe */
> if (query->rowMarks != NULL)
> {
> context->max_hazard = PROPARALLEL_UNSAFE;
> return true;
> }

In my understanding, the max_parallel_hazard() query tree walk is to
find constructs that are parallel unsafe in that they call code that
can't run in parallel mode.  For example, FOR UPDATE/SHARE on
traditional heap AM tuples calls AssignTransactionId() which doesn't
support being called in parallel mode.  Likewise ON CONFLICT ... DO
UPDATE calls heap_update() which doesn't support parallelism.  I'm not
aware of any such hazards downstream of ExecValuesScan().

> >You're trying to
> > add something that bails based on second-guessing that a parallel path
> > would not be chosen, which I find somewhat objectionable.
> >
> > If the main goal of bailing out is to avoid doing the potentially
> > expensive modification safety check on the target relation, maybe we
> > should try to somehow make the check less expensive.  I remember
> > reading somewhere in the thread about caching the result of this check
> > in relcache, but haven't closely studied the feasibility of doing so.
> >
>
> There's no "second-guessing" involved here.
> There is no underlying way of dividing up the VALUES data of
> "INSERT...VALUES" amongst the parallel workers, even if the planner
> was updated to produce a parallel-plan for the "INSERT...VALUES" case
> (apart from the fact that spawning off parallel workers to insert that
> data would almost always result in worse performance than a
> non-parallel plan...)
> The division of work for parallel workers is part of the table AM
> (scan) implementation, which is not invoked for "INSERT...VALUES".

I don't disagree that the planner would not normally assign a parallel
path simply to pull values out of a VALUES list mentioned in the
INSERT command, but deciding something based on the certainty of it in
an earlier planning phase seems odd to me.  Maybe that's just me
though.

--
Amit Langote
EDB: http://www.enterprisedb.com




Re: Support for NSS as a libpq TLS backend

2021-02-09 Thread Michael Paquier
On Tue, Feb 09, 2021 at 10:30:52AM +0100, Daniel Gustafsson wrote:
> It can be, it's not the most pressing patch scope reduction but everything
> helps of course.

Okay.  I have spent some time on this one and finished it.

> Thanks.  That patch is slightly more interesting in terms of reducing scope
> here, but I also think it makes the test code a bit easier to digest when
> certificate management is abstracted into the API rather than the job of the
> testfile to perform.

That's my impression.  Still, I am wondering if there could be a
different approach.  I need to think more about that first.
--
Michael




Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)

2021-02-09 Thread Michael Paquier
On Wed, Feb 10, 2021 at 01:44:12PM +0900, Kyotaro Horiguchi wrote:
> It seems to me that the above just means the caller must provide a
> digest buffer that fits the use. In the above example the digest buffer
> just must be 64 bytes.  If Coverity complains about this, what should be
> done about the complaint is to fix the caller to provide a digest buffer
> of the correct size.
> 
> Could you show the detailed context where Coverity complained?

FWIW, the community Coverity instance is not complaining here, so
I have no idea what kind of configuration it uses to generate this
report.  That said, this is just the same idea as cfc40d3 for
base64.c and aef8948 for hex.c, where we provide the length of the
result buffer so that any overflow can be caught.  It is a safety
belt that keeps a caller from overwriting memory when the allocated
buffer is smaller than the digest that will be written into it.

> So it doesn't seem to me to be the right direction. Even if we are going
> to make pg_cryptohash_final take the buffer length, it should error out
> or assert out if the length is too small rather than copy a part of the
> digest bytes. (In short, it would only be for assertion use.)

Yes, we could be more defensive here, and considering libpq I think
that this had better be an error rather than an assertion to remain on
the safe side.  The patch proposed is incomplete on several points:
- cryptohash_openssl.c is not touched, so this patch will fail to
compile with --with-ssl=openssl (or --with-openssl if you want).
- There is nothing actually checked in the final function.  As we
already know the size of the result digest, we just need to make sure
that the size of the output is at least the size of the digest, so we
can just add a check based on MD5_DIGEST_LENGTH and such.  There is no
need to touch the internal functions of MD5/SHA1/SHA2 for the
non-OpenSSL case.  For the OpenSSL case, and looking at digest.c in
the upstream code, we would need a similar check, as
EVP_DigestFinal_ex() would happily overwrite the area if the caller is
not careful (note that the third argument of the function reports the
number of bytes written, *after* the fact).
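
For illustration, the kind of check I have in mind could be as small as
this (a sketch only: the helper name and the extra length argument are
made up here; the constants come from common/md5.h and common/sha2.h):

static bool
digest_fits(pg_cryptohash_type type, size_t len)
{
    switch (type)
    {
        case PG_MD5:
            return len >= MD5_DIGEST_LENGTH;
        case PG_SHA256:
            return len >= PG_SHA256_DIGEST_LENGTH;
        case PG_SHA512:
            return len >= PG_SHA512_DIGEST_LENGTH;
        default:
            /* the remaining algorithms would be checked the same way */
            return false;
    }
}

pg_cryptohash_final() would then fail cleanly when the caller's buffer is
too small, instead of letting the digest overrun it.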

I don't see much point in complicating scram_HMAC_final() and
scram_H() here, nor in the manipulations done for SCRAM_KEY_LEN in
scram-common.h.
--
Michael




Re: New IndexAM API controlling index vacuum strategies

2021-02-09 Thread Peter Geoghegan
On Tue, Feb 9, 2021 at 6:14 PM Masahiko Sawada  wrote:
> Thanks. I think that's very good if we resolve this recycling stuff
> first and then try the new approach to skip index vacuum in more cases.
> That way, even if the vacuum strategy stuff took a very long time to
> get committed over several major versions, we would not be affected by
> the deleted nbtree page recycling problem (at least for built-in index
> AMs). Also, the approach of 6655a7299d8 itself is a good improvement
> and seems straightforward to me.

I'm glad that you emphasized this issue, because I came up with a
solution that turns out to not be very invasive. At the same time it
has unexpected advantages, like improving amcheck coverage for
deleted pages.

-- 
Peter Geoghegan




Re: 64-bit XIDs in deleted nbtree pages

2021-02-09 Thread Peter Geoghegan
On Tue, Feb 9, 2021 at 5:53 PM Peter Geoghegan  wrote:
> Here is a plan that allows us to stop storing any kind of XID in the
> metapage in all cases:

Attached is v2, which deals with the metapage 32-bit
XID/btm_oldest_btpo_xact issue using the approach I described earlier.
We don't store an XID in the metapage anymore in v2. This seems to
work well, as I expected it would.

-- 
Peter Geoghegan


v2-0001-Use-full-64-bit-XID-for-nbtree-page-deletion.patch
Description: Binary data


v2-0002-Add-pages_newly_deleted-to-VACUUM-VERBOSE.patch
Description: Binary data


RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-09 Thread tsunakawa.ta...@fujitsu.com
From: tsunakawa.ta...@fujitsu.com 
> From: Hou, Zhijie/侯 志杰 
> > It did have a performance gain, but I think it's not large enough to ignore
> > the extra index cost.
> > What do you think?
> 
> Yes... as you suspect, I'm afraid the benefit from parallel bitmap scan may 
> not
> compensate for the loss of the parallel insert operation.
> 
> The loss is probably due to 1) more index page splits, 2) more buffer writes
> (table and index), and 3) internal locks for things such as relation extension
> and page content protection.  To investigate 3), we should want something
> like [1], which tells us the wait event statistics (wait count and time for 
> each
> wait event) per session or across the instance like Oracle, MySQL and EDB
> provide.  I want to continue this in the near future.

What would the result look like if you turn off parallel_leader_participation?  
If the leader is freed from reading/writing the table and index, the index page 
splits and internal lock contention may decrease enough to recover part of the 
loss.
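
For example, the rerun might look like this (just a sketch: it assumes the
parallel INSERT patch is applied, and the table name and predicate are
placeholders standing in for the earlier test):

SET parallel_leader_participation = off;
SET max_parallel_workers_per_gather = 4;
EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO testscan SELECT a FROM x WHERE a < 80000;  -- placeholder predicate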

https://www.postgresql.org/docs/devel/parallel-plans.html

"In a parallel bitmap heap scan, one process is chosen as the leader. That 
process performs a scan of one or more indexes and builds a bitmap indicating 
which table blocks need to be visited. These blocks are then divided among the 
cooperating processes as in a parallel sequential scan. In other words, the 
heap scan is performed in parallel, but the underlying index scan is not."


BTW, the following sentences seem to need revisiting, because "the work to be 
done" is not the same for parallel INSERT as for serial INSERT - the order of 
rows stored, table and index sizes, and what else?

https://www.postgresql.org/docs/devel/using-explain.html#USING-EXPLAIN-ANALYZE

"It's worth noting that although the data-modifying node can take a 
considerable amount of run time (here, it's consuming the lion's share of the 
time), the planner does not currently add anything to the cost estimates to 
account for that work. That's because the work to be done is the same for every 
correct query plan, so it doesn't affect planning decisions."


Regards
Takayuki Tsunakawa




Re: Preserve attstattarget on REINDEX CONCURRENTLY

2021-02-09 Thread Justin Pryzby
On Fri, Feb 05, 2021 at 11:17:48AM +0900, Michael Paquier wrote:
> Let's copy this data in index_concurrently_swap() instead.  The
> attached patch does that, and adds a test cheaper than what was
> proposed.  There is a minor release planned for next week, so I may be

> +++ b/src/test/regress/sql/create_index.sql
> @@ -1103,6 +1104,13 @@ SELECT starelid::regclass, count(*) FROM pg_statistic 
> WHERE starelid IN (
>'concur_exprs_index_pred'::regclass,
>'concur_exprs_index_pred_2'::regclass)
>GROUP BY starelid ORDER BY starelid::regclass::text;
> +-- attstattarget should remain intact
> +SELECT attrelid::regclass, attnum, attstattarget
> +  FROM pg_attribute WHERE attrelid IN (
> +'concur_exprs_index_expr'::regclass,
> +'concur_exprs_index_pred'::regclass,
> +'concur_exprs_index_pred_2'::regclass)
> +  ORDER BY 'concur_exprs_index_expr'::regclass::text, attnum;

If I'm not wrong, you meant to ORDER BY attrelid::regclass::text, attnum;

-- 
Justin




Re: Improvements and additions to COPY progress reporting

2021-02-09 Thread Bharath Rupireddy
On Tue, Feb 9, 2021 at 6:02 PM Matthias van de Meent
 wrote:
> > Also, you can add this to the current commitfest.
>
> See https://commitfest.postgresql.org/32/2977/
>
> On Tue, 9 Feb 2021 at 12:53, Josef Šimánek  wrote:
> >
> > OK, would you mind to integrate my regression test initial patch as
> > well in v3 or should I submit it later in a separate way?
>
> Attached, with minor fixes

Why do we need to have a new test file progress.sql for the test
cases? Can't we add them into existing copy.sql or copy2.sql? Or do
you have a plan to add test cases into progress.sql for other progress
reporting commands?

IMO, it's better not to add any new test file but to add the tests to existing files.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: repeated decoding of prepared transactions

2021-02-09 Thread Amit Kapila
On Wed, Feb 10, 2021 at 11:45 AM Ajin Cherian  wrote:
>
> On Wed, Feb 10, 2021 at 3:43 PM Ashutosh Bapat
>  wrote:
>
>
> > I think we need to treat a prepared transaction slightly different from an 
> > uncommitted transaction when sending downstream. We need to send a whole 
> > uncommitted transaction downstream again because previously applied changes 
> > must have been aborted and hence lost by the downstream and thus it needs 
> > to get all of those again. But when a downstream prepares a transaction, 
> > even if it's not committed, those changes are not lost even after restart 
> > of a walsender. If the downstream confirms that it has "flushed" PREPARE, 
> > there is no need to send all the changes again.
>
> But the other side of the problem is that, without this, if the
> prepared transaction is prior to a consistent snapshot when decoding
> starts/restarts, then only the "commit prepared" is sent to downstream
> (as seen in the test scenario I shared above), and downstream has to
> error out on the commit prepared because it does not have the
> corresponding prepared transaction.
>

I think it is not only about simple error handling; it is required for
data consistency. We need to send the transactions whose commits are
encountered after a consistent snapshot is reached.

-- 
With Regards,
Amit Kapila.




Re: repeated decoding of prepared transactions

2021-02-09 Thread Ajin Cherian
On Wed, Feb 10, 2021 at 3:43 PM Ashutosh Bapat
 wrote:


> I think we need to treat a prepared transaction slightly different from an 
> uncommitted transaction when sending downstream. We need to send a whole 
> uncommitted transaction downstream again because previously applied changes 
> must have been aborted and hence lost by the downstream and thus it needs to 
> get all of those again. But when a downstream prepares a transaction, even if 
> it's not committed, those changes are not lost even after restart of a 
> walsender. If the downstream confirms that it has "flushed" PREPARE, there is 
> no need to send all the changes again.

But the other side of the problem is that, without this, if the
prepared transaction is prior to a consistent snapshot when decoding
starts/restarts, then only the "commit prepared" is sent to downstream
(as seen in the test scenario I shared above), and downstream has to
error out on the commit prepared because it does not have the
corresponding prepared transaction. We did not find an easy way to
distinguish between these two scenarios for prepared transactions:
a. A consistent snapshot being formed in between a prepare and a
commit prepared for the first time.
b. Decoder restarting between a prepare and a commit prepared.

For plugins to be able to handle this, we have added a special
callback "Begin Prepare" as explained in [1] section 49.6.4.10

"The required begin_prepare_cb callback is called whenever the start
of a prepared transaction has been decoded. The gid field, which is
part of the txn parameter can be used in this callback to check if the
plugin has already received this prepare in which case it can skip the
remaining changes of the transaction. This can only happen if the user
restarts the decoding after receiving the prepare for a transaction
but before receiving the commit prepared say because of some error."

The pgoutput plugin is also being updated to be able to handle this
situation of duplicate prepared transactions.
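
For illustration only, a plugin-side sketch might look like this (the
state struct and the duplicate-tracking helper are hypothetical; only the
callback signature and txn->gid come from the actual API):

static void
my_begin_prepare_cb(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
{
    MyDecodingState *state = (MyDecodingState *) ctx->output_plugin_private;

    if (plugin_already_flushed_prepare(state, txn->gid))
    {
        /* Remember to skip the remaining changes of this transaction. */
        state->skip_current_txn = true;
        return;
    }

    state->skip_current_txn = false;
    /* otherwise emit the plugin's "begin prepare" message as usual */
}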

regards,
Ajin Cherian
Fujitsu Australia




Re: Libpq support to connect to standby server as priority

2021-02-09 Thread vignesh C
Thanks for the comments Greg, please find my comments inline below.

On Tue, Feb 9, 2021 at 2:27 PM Greg Nancarrow  wrote:
>
> On Mon, Feb 8, 2021 at 8:17 PM vignesh C  wrote:
> >
> > >
> > > I think what we want to do is mark default_transaction_read_only as
> > > GUC_REPORT, instead.  That will give a reliable report of what the
> > > state of its GUC is, and you can combine it with is_hot_standby
> > > to decide whether the session should be considered read-only.
> > > If you don't get those two GUC values during connection, then you
> > > can fall back on "SHOW transaction_read_only".
> > >
> >
> > I have made a patch for the above with the changes suggested and
> > rebased it with the head code.
> > Attached v21 patch which has the changes for the same.
> > Thoughts?
>
> Further to my other doc change feedback, I can only spot the following
> minor things (otherwise the changes that you have made seem OK to me).
>
> 1) doc/src/sgml/protocol.sgml
>
>default_transaction_read_only  and
>in_hot_standby were not reported by releases before
>14.)
>
> should be:
>
>default_transaction_read_only  and
>in_hot_standby were not reported by releases before
>14.0)
>

Modified.

> 2) doc/src/sgml/high-availability,sgml
>
>
> During hot standby, the parameter in_hot_standby and
> default_transaction_read_only are always true and may
> not be changed.
>
> should be:
>
>
> During hot standby, the parameters in_hot_standby and
> transaction_read_only are always true and may
> not be changed.
>
>
> [I believe that there's only checks on attempts to change
> "transaction_read_only" when in hot_standby, not
> "default_transaction_read_only"; see  check_transaction_read_only()]
>

Modified.

> 3) src/interfaces/libpq/fe-connect.c
>
> In rejectCheckedReadOrWriteConnection() and
> rejectCheckedStandbyConnection(), now that host and port info are
> emitted separately and are not included in each error message string
> (as parameters in a format string), I think those functions should use
> appendPQExpBufferStr() instead of appendPQExpBuffer(), as it's more
> efficient if there is just a single string argument.
>

Modified.
These comments are handled in v22 patch posted in my earlier mail.

Regards,
Vignesh




Re: Libpq support to connect to standby server as priority

2021-02-09 Thread vignesh C
On Tue, Feb 9, 2021 at 5:47 AM Greg Nancarrow  wrote:
>
> On Mon, Feb 8, 2021 at 8:17 PM vignesh C  wrote:
> >
> > >
> > > I think what we want to do is mark default_transaction_read_only as
> > > GUC_REPORT, instead.  That will give a reliable report of what the
> > > state of its GUC is, and you can combine it with is_hot_standby
> > > to decide whether the session should be considered read-only.
> > > If you don't get those two GUC values during connection, then you
> > > can fall back on "SHOW transaction_read_only".
> > >
> >
> > I have made a patch for the above with the changes suggested and
> > rebased it with the head code.
> > Attached v21 patch which has the changes for the same.
> > Thoughts?
> >
>
> I'm still looking at the patch code, but I noticed that the
> documentation update describing how support of read-write transactions
> is determined isn't quite right and it isn't clear how the parameters
> work.
> I'd suggest something like the following (you'd need to fix the line
> lengths and line-wrapping appropriately) - please check it for
> correctness:
>
>
> The support of read-write transactions is determined by the value of 
> the
> default_transaction_read_only and
> in_hot_standby configuration parameters,
> which, if supported,
> are reported by the server upon successful connection. If the
> value of either
> of these parameters is on, it means the
> server doesn't support
> read-write transactions. If either/both of these parameters
> are not reported,
> then the support of read-write transactions is determined by
> an explicit query,
> by sending SHOW transaction_read_only after
> successful
> connection; if it returns on, it means the
> server doesn't
> support read-write transactions. The standby mode state is
> determined by either
> the value of the in_hot_standby configuration
> parameter, that is reported by the server (if supported) upon
> successful connection, or is otherwise explicitly queried by sending
> SELECT pg_is_in_recovery() after successful
> connection; if it returns t, it means the server is
> in hot standby mode.
>

Thanks Greg for the comments. Please find attached the v22 patch
with the fix for the same.
Thoughts?
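
For reference, with this patch the new values are used directly in the
connection string, e.g. (host names and ports below are placeholders):

host=node1,node2 port=5432,5432 dbname=postgres target_session_attrs=prefer-standby
postgresql://node1:5432,node2:5432/postgres?target_session_attrs=standby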

Regards,
Vignesh
From db9b894d8a8c51de80e6b3b7b8c660dda5374ff9 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Mon, 8 Feb 2021 11:23:31 +0530
Subject: [PATCH v22] Enhance the libpq "target_session_attrs" connection
 parameter.

Enhance the connection parameter "target_session_attrs" to support new values:
read-only/primary/standby/prefer-standby.
Add a new "read-only" target_session_attrs option value, to support connecting
to a read-only server if available from the list of hosts (otherwise the
connection attempt fails).
Add a new "primary" target_session_attrs option value, to support connecting to
a server which is not in hot-standby mode, if available from the list of hosts
(otherwise the connection attempt fails).
Add a new "standby" target_session_attrs option value, to support connecting to
a server which is in hot-standby mode, if available from the list of hosts
(otherwise the connection attempt fails).
Add a new "prefer-standby" target_session_attrs option value, to support
connecting to a server which is in hot-standby mode, if available from the list
of hosts (otherwise connect to a server which is not in hot-standby mode).

Discussion: https://www.postgresql.org/message-id/flat/caf3+xm+8-ztokav9ghij3wfgentq97qcjxqt+rbfq6f7onz...@mail.gmail.com
---
 doc/src/sgml/high-availability.sgml   |  16 +-
 doc/src/sgml/libpq.sgml   |  73 +-
 doc/src/sgml/protocol.sgml|   5 +-
 src/backend/utils/misc/guc.c  |   3 +-
 src/interfaces/libpq/fe-connect.c | 432 ++
 src/interfaces/libpq/fe-exec.c|  18 +-
 src/interfaces/libpq/libpq-fe.h   |   3 +-
 src/interfaces/libpq/libpq-int.h  |  52 +++-
 src/test/recovery/t/001_stream_rep.pl |  79 ++-
 src/tools/pgindent/typedefs.list  |   2 +
 10 files changed, 602 insertions(+), 81 deletions(-)

diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index f49f5c0..2bbd52c 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1700,14 +1700,14 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'

 

-During hot standby, the parameter transaction_read_only is always
-true and may not be changed.  But as long as no attempt is made to modify
-the database, connections during hot standby will act much like any other
-database connection.  If failover or switchover occurs, the database will
-switch to normal processing mode.  Sessions will remain connected while the
-server changes mode.  Once hot standby finishes, it will be possible to
-initiate read-write transactions (eve

Re: Online checksums patch - once again

2021-02-09 Thread Michael Paquier
On Tue, Feb 09, 2021 at 10:54:50AM +0200, Heikki Linnakangas wrote:
> (I may have said this before, but) My overall high-level impression of this
> patch is that it's really cmmplex for a feature that you use maybe once in
> the lifetime of a cluster. I'm happy to review but I'm not planning to
> commit this myself. I don't object if some other committer picks this up
> (Magnus?).

I was just looking at the latest patch set as a matter of curiosity,
and I share the feeling.  I think that this is a lot of
complication in-core for what would be a one-time operation,
particularly knowing that there are other ways to do it already with
the offline checksum tool, even if that is more costly:
- Involve logical replication after initializing the new instance with
--data-checksums, or in an upgrade scenario with pg_upgrade.
- Involve physical replication: stop the standby cleanly, enable
checksums on it and do a switchover.
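
For instance, the physical replication route above boils down to something
like this (paths are placeholders):
    pg_ctl stop -D /path/to/standby -m fast       # stop the standby cleanly
    pg_checksums --enable -D /path/to/standby     # enable checksums offline
    pg_ctl start -D /path/to/standby
    # ...then do the switchover (promote the standby, repoint clients)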

Another thing we could do is to improve pg_checksums with a parallel
mode.  The main design question would be how to distribute the I/O,
and that would mean balancing at least across tablespaces.
--
Michael




RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-09 Thread tsunakawa.ta...@fujitsu.com
From: Hou, Zhijie/侯 志杰 
> Sorry, maybe the table name is a little confusing,
> but 'testscan_index' is actually a table's name.
...
> Did I miss something ?

No, I don't think so.  I just wanted to know the facts correctly.  Your EXPLAIN 
output shows that the target table is testscan as follows.  How does 
testscan_index relate to testscan?

   ->  Insert on public.testscan  (cost=3272.20..1260119.35 rows=0 width=0) 
(actual time=378.227..378.229 rows=0 loops=5)


> It did have a performance gain, but I think it's not large enough to ignore
> the extra index cost.
> What do you think?

Yes... as you suspect, I'm afraid the benefit from parallel bitmap scan may not 
compensate for the loss of the parallel insert operation.

The loss is probably due to 1) more index page splits, 2) more buffer writes 
(table and index), and 3) internal locks for things such as relation extension 
and page content protection.  To investigate 3), we should want something like 
[1], which tells us the wait event statistics (wait count and time for each 
wait event) per session or across the instance like Oracle, MySQL and EDB 
provide.  I want to continue this in the near future.


[1]
Add accumulated statistics for wait event
https://commitfest.postgresql.org/28/2332/


Regards
Takayuki Tsunakawa




Re: [patch] bit XOR aggregate functions

2021-02-09 Thread Kyotaro Horiguchi
At Tue, 9 Feb 2021 15:25:19 +, Alexey Bashtanov  wrote 
in 
> I personally use it as a checksum for a large unordered set, where
> performance and simplicity is prioritized over collision resilience.
> Maybe there are other ways to use them.

FWIW the BIT_XOR can be created using CREATE AGGREGATE.

CREATE OR REPLACE AGGREGATE BIT_XOR(IN v smallint) (SFUNC = int2xor, STYPE = smallint);
CREATE OR REPLACE AGGREGATE BIT_XOR(IN v int4) (SFUNC = int4xor, STYPE = int4);
CREATE OR REPLACE AGGREGATE BIT_XOR(IN v bigint) (SFUNC = int8xor, STYPE = bigint);
CREATE OR REPLACE AGGREGATE BIT_XOR(IN v bit) (SFUNC = bitxor, STYPE = bit);
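
As a usage example of the kind Alexey mentions (table and column names are
made up), the aggregate gives an order-independent checksum of a set:

SELECT BIT_XOR(item_id) AS set_checksum FROM inventory_snapshot;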

The bit_and/bit_or aggregates date back to 2004; that commit says:

> commit 8096fe45cee42ce02e602cbea08e969139a77455
> Author: Bruce Momjian 
> Date:   Wed May 26 15:26:28 2004 +
...
>(2) bitwise integer aggregates named bit_and and bit_or for
>int2, int4, int8 and bit types. They are not standard, but I find
>them useful. I needed them once.

We already had CREATE AGGREGATE at the time, so BIT_XOR can be thought
of as falling into the same category as BIT_AND and BIT_OR; that is,
we may as well have BIT_XOR as an intrinsic aggregate?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Single transaction in the tablesync worker?

2021-02-09 Thread Amit Kapila
On Wed, Feb 10, 2021 at 7:41 AM Peter Smith  wrote:
>
> On Tue, Feb 9, 2021 at 10:38 AM Peter Smith  wrote:
> >
>
> PSA v2 of this WalRcvExecResult patch (it is the same as v1 but includes
> some PG doc updates).
> This applies OK on top of v30 of the main patch.
>

Thanks, I have integrated these changes into the main patch and
additionally made some changes to comments and docs. I have also fixed
the function name inconsistency issue you reported and ran pgindent.

-- 
With Regards,
Amit Kapila.


v31-0001-Allow-multiple-xacts-during-table-sync-in-logica.patch
Description: Binary data


RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-09 Thread Hou, Zhijie
> > --
> > postgres=# select pg_size_pretty(pg_indexes_size('testscan_index'));
> >  pg_size_pretty
> > 
> >  4048 kB
> > (1 row)
> >
> > postgres=# select pg_size_pretty(pg_relation_size('testscan_index'));
> >  pg_size_pretty
> > 
> >  4768 kB
> > (1 row)
> 
> Which of the above shows the table size?  What does pg_indexes_size()
> against an index (testscan_index) return?

Sorry, maybe the table name is a little confusing,
but 'testscan_index' is actually a table's name.

pg_indexes_size will return the total size of the indexes attached to the table.
pg_relation_size will return the table's size.

Did I miss something ?



> > IMO, due to the difference of inserts with parallel execution, the
> > btree insert's cost is more than serial.
> >
> > At the same time, the parallel does not have a huge performance gain
> > with bitmapscan, So the extra cost of btree index will result in
> > performance degradation.
> 
> How did you know that the parallelism didn't have a huge performance gain
> with bitmap scan?
> 
> [serial]
>->  Bitmap Heap Scan on public.x  (cost=3272.20..3652841.26 rows=79918
> width=8) (actual time=8.096..41.005 rows=12 loops=1)
> 
> [parallel]
>  ->  Parallel Bitmap Heap Scan on public.x
> (cost=3272.20..1260119.35 rows=19980 width=8) (actual time=5.832..14.787
> rows=26000 loops=5)

I tested the case without the insert (just the query using a bitmap scan):

[serial]:
postgres=# explain analyze select a from x where a<8 or (a%2=0 and 
a>19990);
QUERY PLAN
--
 Bitmap Heap Scan on x  (cost=3258.59..3647578.53 rows=81338 width=4) (actual 
time=8.091..34.222 rows=12 loops=1)
   Recheck Cond: ((a < 8) OR (a > 19990))
   Filter: ((a < 8) OR (((a % 2) = 0) AND (a > 19990)))
   Rows Removed by Filter: 5
   Heap Blocks: exact=975
   ->  BitmapOr  (cost=3258.59..3258.59 rows=173971 width=0) (actual 
time=7.964..7.965 rows=0 loops=1)
 ->  Bitmap Index Scan on x_a_idx  (cost=0.00..1495.11 rows=80872 
width=0) (actual time=3.451..3.451 rows=7 loops=1)
   Index Cond: (a < 8)
 ->  Bitmap Index Scan on x_a_idx  (cost=0.00..1722.81 rows=93099 
width=0) (actual time=4.513..4.513 rows=10 loops=1)
   Index Cond: (a > 19990)
 Planning Time: 0.108 ms
 Execution Time: 38.136 ms

[parallel]
postgres=# explain analyze select a from x where a<8 or (a%2=0 and 
a>19990);
   QUERY PLAN

 Gather  (cost=4258.59..1266704.42 rows=81338 width=4) (actual 
time=9.177..22.592 rows=12 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Parallel Bitmap Heap Scan on x  (cost=3258.59..1257570.62 rows=20334 
width=4) (actual time=6.402..12.882 rows=26000 loops=5)
 Recheck Cond: ((a < 8) OR (a > 19990))
 Filter: ((a < 8) OR (((a % 2) = 0) AND (a > 19990)))
 Rows Removed by Filter: 1
 Heap Blocks: exact=1
 ->  BitmapOr  (cost=3258.59..3258.59 rows=173971 width=0) (actual 
time=8.785..8.786 rows=0 loops=1)
   ->  Bitmap Index Scan on x_a_idx  (cost=0.00..1495.11 rows=80872 
width=0) (actual time=3.871..3.871 rows=7 loops=1)
 Index Cond: (a < 8)
   ->  Bitmap Index Scan on x_a_idx  (cost=0.00..1722.81 rows=93099 
width=0) (actual time=4.914..4.914 rows=10 loops=1)
 Index Cond: (a > 19990)
 Planning Time: 0.158 ms
 Execution Time: 26.951 ms
(15 rows)

It did have a performance gain, but I think it's not large enough to ignore
the extra index cost.
What do you think?

Best regards,
houzj




RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-09 Thread tsunakawa.ta...@fujitsu.com
From: Hou, Zhijie/侯 志杰 
> table and index's size after parallel insert
> --
> postgres=# select pg_size_pretty(pg_indexes_size('testscan_index'));
>  pg_size_pretty
> 
>  4048 kB
> (1 row)
> 
> postgres=# select pg_size_pretty(pg_relation_size('testscan_index'));
>  pg_size_pretty
> 
>  4768 kB
> (1 row)

Which of the above shows the table size?  What does pg_indexes_size() against 
an index (testscan_index) return?


> IMO, due to the difference of inserts with parallel execution,
> the btree insert's cost is more than serial.
> 
> At the same time, the parallel does not have a huge performance gain with
> bitmapscan,
> So the extra cost of btree index will result in performance degradation.

How did you know that the parallelism didn't have a huge performance gain with 
bitmap scan?

[serial]
   ->  Bitmap Heap Scan on public.x  (cost=3272.20..3652841.26 rows=79918 
width=8) (actual time=8.096..41.005 rows=12 loops=1)

[parallel]
 ->  Parallel Bitmap Heap Scan on public.x  (cost=3272.20..1260119.35 
rows=19980 width=8) (actual time=5.832..14.787 rows=26000 loops=5)


Regards
Takayuki Tsunakawa




Re: TRUNCATE on foreign table

2021-02-09 Thread Ashutosh Bapat
On Tue, Feb 9, 2021 at 7:45 PM Kazutaka Onishi  wrote:
>
> > IIUC, "truncatable" would be set to "false" for relations which do not
> > have physical storage e.g. views but will be true for regular tables.
>
> "truncatable" option is just for the foreign table and it's not related with 
> whether it's on a physical storage or not.
> "postgres_fdw" already has "updatable" option to make the table read-only.
> However, "updatable" is for DML, and it's not suitable for TRUNCATE.
> Therefore new options "truncatable" was added.
>
> Please refer to this message for details.
> https://www.postgresql.org/message-id/20200128040346.GC1552%40paquier.xyz
>
> > DELETE is very different from TRUNCATE. Application may want to DELETE
> > based on a join with a local table and hence it can not be executed on
> > a foreign server. That's not true with TRUNCATE.
>
> Yeah, as you say, applications don't need TRUNCATE.
> We're focusing on analytical use, namely operating on huge data.
> TRUNCATE can erase rows faster than DELETE.

The question is: why can't that TRUNCATE be run on the foreign server
itself rather than via the local server?



-- 
Best Wishes,
Ashutosh Bapat




Re: [HACKERS] Custom compression methods

2021-02-09 Thread Dilip Kumar
On Wed, Feb 10, 2021 at 1:42 AM Robert Haas  wrote:
>
> Please remember to trim unnecessary quoted material.

Okay, I will.

> On Sun, Feb 7, 2021 at 6:45 AM Dilip Kumar  wrote:
> > [ a whole lot of quoted stuff ]
> >
> > Conclusion:
> > 1. In most cases lz4 is faster and doing better compression as well.
> > 2. In Test2 when small data is incompressible then lz4 tries to
> > compress whereas pglz doesn't try so there is some performance loss.
> > But if we want we can fix
> > it by setting some minimum limit of size for lz4 as well, maybe the
> > same size as pglz?
>
> So my conclusion here is that perhaps there's no real problem. It
> looks like externalizing is so expensive compared to compression that
> it's worth trying to compress even though it may not always pay off.
> If, by trying to compress, we avoid externalizing, it's a huge win
> (~5x). If we try to compress and don't manage to avoid externalizing,
> it's a small loss (~6%). It's probably reasonable to expect that
> compressible data is more common than incompressible data, so not only
> is the win a lot bigger than the loss, but we should be able to expect
> it to happen a lot more often. It's not impossible that somebody could
> get bitten, but it doesn't feel like a huge risk to me.

I agree with this.  That said, maybe we could also test the performance
of pglz by lowering/removing the minimum compression limit, but maybe
that should be an independent change.

> One thing that does occur to me is that it might be a good idea to
> skip compression if it doesn't change the number of chunks that will
> be stored into the TOAST table. If we compress the value but still
> need to externalize it, and the compression didn't save enough to
> reduce the number of chunks, I suppose we ideally would externalize
> the uncompressed version. That would save decompression time later,
> without really costing anything. However, I suppose that would be a
> separate improvement from this patch.

Yeah, this seems like a good idea and we can work on that in a different thread.
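
For what it's worth, the check being suggested could be as small as this
(a sketch with a made-up helper name, not actual toast code;
TOAST_MAX_CHUNK_SIZE is the existing chunk size):

static bool
compression_saves_chunks(int32 rawsize, int32 compressed_size)
{
    int32   raw_chunks = (rawsize + TOAST_MAX_CHUNK_SIZE - 1) / TOAST_MAX_CHUNK_SIZE;
    int32   cmp_chunks = (compressed_size + TOAST_MAX_CHUNK_SIZE - 1) / TOAST_MAX_CHUNK_SIZE;

    /* keep the compressed copy only if it saves at least one chunk */
    return cmp_chunks < raw_chunks;
}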

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)

2021-02-09 Thread Kyotaro Horiguchi
At Tue, 9 Feb 2021 22:01:45 -0300, Ranier Vilela  wrote in 
> Hi Hackers,
> 
> Per Coverity.
> 
> Coverity complains about the pg_cryptohash_final function.
> And I agree with Coverity, it's a bad design.
> It allows this:
> 
> #define MY_RESULT_LENGTH 32
> 
> static void pgtest(const char *buffer, size_t len)
> {
> pg_cryptohash_ctx *ctx;
> uint8 digest[MY_RESULT_LENGTH];
> 
> ctx = pg_cryptohash_create(PG_SHA512);
> pg_cryptohash_init(ctx);
> pg_cryptohash_update(ctx, (const uint8 *) buffer, len);
> pg_cryptohash_final(ctx, digest); /* <--  CID 1446240 (#1 of 1):
>  * Out-of-bounds access (OVERRUN) */
> pg_cryptohash_free(ctx);
> }

It seems to me that the above just means the caller must provide a
digest buffer that fits the use. In the above example the digest buffer
just must be 64 bytes.  If Coverity complains about this, what should be
done about the complaint is to fix the caller to provide a digest buffer
of the correct size.

Could you show the detailed context where Coverity complained?

> Attached has a patch with suggestions to make things better.

So it doesn't seem to me to be the right direction. Even if we are going
to make pg_cryptohash_final take the buffer length, it should error out
or assert out if the length is too small rather than copy a part of the
digest bytes. (In short, it would only be for assertion use.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: repeated decoding of prepared transactions

2021-02-09 Thread Ashutosh Bapat
On Wed, Feb 10, 2021 at 8:02 AM Amit Kapila  wrote:

> On Wed, Feb 10, 2021 at 12:08 AM Robert Haas 
> wrote:
> >
> > On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila 
> wrote:
> > > I think similar happens without any of the work done in PG-14 as well
> > > if we restart the apply worker before the commit completes on the
> > > subscriber. After the restart, we will send the start_decoding_at
> > > point based on some previous commit which will make publisher send the
> > > entire transaction again. I don't think restart of WAL sender or WAL
> > > receiver is such a common thing. It can only happen due to some bug in
> > > code or user wishes to stop the nodes or some crash happened.
> >
> > Really? My impression is that the logical replication protocol is
> > supposed to be designed in such a way that once a transaction is
> > successfully confirmed, it won't be sent again. Now if something is
> > not confirmed then it has to be sent again. But if it is confirmed
> > then it shouldn't happen.
> >
>
> If by successfully confirmed, you mean that once the subscriber node
> has received, it won't be sent again then as far as I know that is not
> true. We rely on the flush location sent by the subscriber to advance
> the decoding locations. We update the flush locations after we apply
> the transaction's commit successfully. Also, after the restart, we use
> the replication origin's last flush location as a point from where we
> need the transactions and the origin's progress is updated at commit
> time.
>
> OTOH, If by successfully confirmed, you mean that once the subscriber
> has applied the complete transaction (including commit), then you are
> right that it won't be sent again.
>

I think we need to treat a prepared transaction slightly different from an
uncommitted transaction when sending downstream. We need to send a whole
uncommitted transaction downstream again because previously applied changes
must have been aborted and hence lost by the downstream and thus it needs
to get all of those again. But when a downstream prepares a transaction,
even if it's not committed, those changes are not lost even after restart
of a walsender. If the downstream confirms that it has "flushed" PREPARE,
there is no need to send all the changes again.

--
Best Wishes,
Ashutosh


Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-09 Thread Greg Nancarrow
On Wed, Feb 10, 2021 at 2:39 PM Amit Langote  wrote:
>
> > The code check that you have identified above ensures that the INSERT
> > has an underlying SELECT, because the planner won't (and shouldn't
> > anyway) generate a parallel plan for INSERT...VALUES, so there is no
> > point doing any parallel-safety checks in this case.
> > It just so happens that the problem test case uses INSERT...VALUES -
> > and it shouldn't have triggered the parallel-safety checks for
> > parallel INSERT for this case anyway, because INSERT...VALUES can't
> > (and shouldn't) be parallelized.
>
> AFAICS, max_parallel_hazard() path never bails from doing further
> safety checks based on anything other than finding a query component
> whose hazard level crosses context->max_interesting.

It's parallel UNSAFE because it's not safe or even possible to have a
parallel plan for this.
(as UNSAFE is the max hazard level, no point in referencing
context->max_interesting).
And there are existing cases of bailing out and not doing further
safety checks (even your v15_delta.diff patch placed code - for
bailing out on "ON CONFLICT ... DO UPDATE" - underneath one such
existing case in max_parallel_hazard_walker()):

else if (IsA(node, Query))
{
Query  *query = (Query *) node;

/* SELECT FOR UPDATE/SHARE must be treated as unsafe */
if (query->rowMarks != NULL)
{
context->max_hazard = PROPARALLEL_UNSAFE;
return true;
}


>You're trying to
> add something that bails based on second-guessing that a parallel path
> would not be chosen, which I find somewhat objectionable.
>
> If the main goal of bailing out is to avoid doing the potentially
> expensive modification safety check on the target relation, maybe we
> should try to somehow make the check less expensive.  I remember
> reading somewhere in the thread about caching the result of this check
> in relcache, but haven't closely studied the feasibility of doing so.
>

There's no "second-guessing" involved here.
There is no underlying way of dividing up the VALUES data of
"INSERT...VALUES" amongst the parallel workers, even if the planner
was updated to produce a parallel-plan for the "INSERT...VALUES" case
(apart from the fact that spawning off parallel workers to insert that
data would almost always result in worse performance than a
non-parallel plan...)
The division of work for parallel workers is part of the table AM
(scan) implementation, which is not invoked for "INSERT...VALUES".

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Preserve attstattarget on REINDEX CONCURRENTLY

2021-02-09 Thread Michael Paquier
On Sun, Feb 07, 2021 at 09:39:36AM +0900, Michael Paquier wrote:
> I'll see about applying this stuff after the next version is tagged
> then.

The new versions have been tagged, so done as of bd12080 and
back-patched.  I have added a note in the commit log about the
approach to use index_create() instead for HEAD.
--
Michael




Re: Is Recovery actually paused?

2021-02-09 Thread Dilip Kumar
On Wed, Feb 10, 2021 at 8:19 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 9 Feb 2021 12:27:21 +0530, Bharath Rupireddy 
>  wrote in
> > What I meant was that if we were to add waiting logic inside
> > pg_wal_replay_pause, we should also have a timeout with some default
> > value, to avoid pg_wal_replay_pause waiting forever in the waiting
> > loop. Within that timeout, if the recovery isn't paused,
> > pg_wal_replay_pause will return probably a warning and a false(this
> > requires us to change the return value of the existing
> > pg_wal_replay_pause)?
>
> I thought that rm_redo finishes shortly unless any trouble
> happens. But on second thought, I found that I forgot a case of a
> recovery-conflict. So as you pointed out, pg_wal_replay_pause() needs
> a flag 'wait' to wait for a pause established. And the flag can be
> turned into "timeout".
>
> # And the previous version had another silly bug.
>
> > To avoid changing the existing API and return type, a new function
> > pg_get_wal_replay_pause_state is introduced.
>
> I was referring to IN parameters, not OUTs. IN parameters can be
> optional to accept existing usage. pg_wal_replay_pause() is changed
> that way in the attached.
>
> If all of you still disagree with my proposal, I withdraw it.

I don't find any problem with this approach either, but I personally
feel that the other approach, where we don't wait in any API and just
return the recovery pause state, is much simpler and more flexible.  So
I will make the pending changes in that patch, and let's see what the
other opinions are and conclude based on that.  Thanks for the
patch.
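
To make the two shapes concrete (argument names and return values below
are illustrative only, not from a committed patch):

-- (a) make the existing function optionally wait, possibly with a timeout
SELECT pg_wal_replay_pause(wait => true);

-- (b) keep pg_wal_replay_pause() as-is and poll a separate state function
SELECT pg_wal_replay_pause();
SELECT pg_get_wal_replay_pause_state();  -- e.g. 'pause requested' vs 'paused'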


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Single transaction in the tablesync worker?

2021-02-09 Thread Ajin Cherian
On Tue, Feb 9, 2021 at 10:38 AM Peter Smith  wrote:

> PSA an alternative patch. This one adds a new member to
> WalRcvExecResult and so is able to detect the "slot does not exist"
> error. This patch also applies on top of V28, if you want it.

Did some testing with this patch on top of v29. I could see that now,
while dropping the subscription, if the tablesync slot does not exist
on the publisher, then it gives a warning
but the command does not fail.

postgres=# CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost
dbname=postgres port=6972' PUBLICATION tap_pub WITH (enabled = false);
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
postgres=# ALTER SUBSCRIPTION tap_sub enable;
ALTER SUBSCRIPTION
postgres=# ALTER SUBSCRIPTION tap_sub disable;
ALTER SUBSCRIPTION
=== here, the tablesync slot exists on the publisher but I go and
=== manually drop it.

postgres=# drop subscription tap_sub;
WARNING:  could not drop the replication slot
"pg_16401_sync_16389_6927117142022745645" on publisher
DETAIL:  The error was: ERROR:  replication slot
"pg_16401_sync_16389_6927117142022745645" does not exist
NOTICE:  dropped replication slot "tap_sub" on publisher
DROP SUBSCRIPTION

I have a minor comment on the error message: the "The error was:" part
seems a bit redundant here. Maybe remove it, so that it looks like:

WARNING:  could not drop the replication slot
"pg_16401_sync_16389_6927117142022745645" on publisher
DETAIL:  ERROR:  replication slot
"pg_16401_sync_16389_6927117142022745645" does not exist

regards,
Ajin Cherian
Fujitsu Australia




Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-09 Thread John Naylor
On Tue, Feb 9, 2021 at 4:22 PM Heikki Linnakangas  wrote:
>
> On 09/02/2021 22:08, John Naylor wrote:
> > Maybe there's a smarter way to check for zeros in C. Or maybe be more
> > careful about cache -- running memchr() on the whole input first might
> > not be the best thing to do.
>
> The usual trick is the haszero() macro here:
> https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord. That's
> how memchr() is typically implemented, too.

Thanks for that. Checking with that macro each loop iteration gives a small
boost:

v1, but using memcpy()

 mixed | ascii
---+---
   601 |   129

with haszero()

 mixed | ascii
---+---
   583 |   105

remove zero-byte check:

 mixed | ascii
---+---
   588 |    93
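
For reference, the 32-bit form of that macro from the linked page is:

#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)

(the 64-bit variant just widens the constants to 0x0101...01 and 0x8080...80).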

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-09 Thread Greg Nancarrow
On Wed, Jan 6, 2021 at 7:39 PM Antonin Houska  wrote:
>
>
> @@ -252,6 +252,7 @@ set_plan_references(PlannerInfo *root, Plan *plan)
> PlannerGlobal *glob = root->glob;
> int rtoffset = list_length(glob->finalrtable);
> ListCell   *lc;
> +   Plan   *finalPlan;
>
> /*
>  * Add all the query's RTEs to the flattened rangetable.  The live 
> ones
> @@ -302,7 +303,17 @@ set_plan_references(PlannerInfo *root, Plan *plan)
> }
>
> /* Now fix the Plan tree */
> -   return set_plan_refs(root, plan, rtoffset);
> +   finalPlan = set_plan_refs(root, plan, rtoffset);
> +   if (finalPlan != NULL && IsA(finalPlan, Gather))
> +   {
> +   Plan   *subplan = outerPlan(finalPlan);
> +
> +   if (IsA(subplan, ModifyTable) && castNode(ModifyTable, 
> subplan)->returningLists != NULL)
> +   {
> +   finalPlan->targetlist = 
> copyObject(subplan->targetlist);
> +   }
> +   }
> +   return finalPlan;
>  }
>
> I'm not sure if the problem of missing targetlist should be handled here (BTW,
> NIL is the constant for an empty list, not NULL). Obviously this is a
> consequence of the fact that the ModifyTable node has no regular targetlist.
>
> Actually I don't quite understand why (in the current master branch) the
> targetlist initialized in set_plan_refs()
>
> /*
>  * Set up the visible plan targetlist as being the same as
>  * the first RETURNING list. This is for the use of
>  * EXPLAIN; the executor won't pay any attention to the
>  * targetlist.  We postpone this step until here so that
>  * we don't have to do set_returning_clause_references()
>  * twice on identical targetlists.
>  */
> splan->plan.targetlist = copyObject(linitial(newRL));
>
> is not used. Instead, ExecInitModifyTable() picks the first returning list
> again:
>
> /*
>  * Initialize result tuple slot and assign its rowtype using the first
>  * RETURNING list.  We assume the rest will look the same.
>  */
> mtstate->ps.plan->targetlist = (List *) 
> linitial(node->returningLists);
>
> So if you set the targetlist in create_modifytable_plan() (according to
> best_path->returningLists), or even in create_modifytable_path(), and ensure
> that it gets propagated to the Gather node (generate_gather_pahs currently
> uses rel->reltarget), then you should no longer need to tweak
> setrefs.c. Moreover, ExecInitModifyTable() would no longer need to set the
> targetlist. However I don't guarantee that this is the best approach - some
> planner expert should speak up.
>


I've had a bit closer look at this particular issue.
I can see what you mean about the ModifyTable targetlist (that is set
in set_plan_refs()) getting overwritten by ExecInitModifyTable() -
which also contradicts the comment in set_plan_refs() that claims the
targetlist being set is used by EXPLAIN (which it is not). So the
current Postgres master branch does seem to be broken in that respect.

I did try your suggestion (and also removed my tweak), but I found that
in the T_Gather case of set_plan_refs() it does some processing (see
set_upper_references()) of the current Gather targetlist and subplan's
targetlist (and will then overwrite the Gather targetlist after that),
but in doing that processing it produces the error:
ERROR:  variable not found in subplan target list
I think that one of the fundamental problems is that, up to now,
ModifyTable has always been the top node in a (non-parallel) plan, but
now for Parallel INSERT we have a Gather node with ModifyTable in its
subplan. So the expected order of processing and node configuration
has changed.
For the moment (until perhaps a planner expert DOES speak up) I've
parked my temporary "fix" at the bottom of set_plan_refs(), simply
pointing the Gather node's targetlist to subplan's ModifyTable
targetlist.

if (nodeTag(plan) == T_Gather)
{
Plan   *subplan = plan->lefttree;

if (IsA(subplan, ModifyTable) &&
castNode(ModifyTable, subplan)->returningLists != NIL)
{
plan->targetlist = subplan->targetlist;
}
}

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-09 Thread Amit Langote
On Tue, Feb 9, 2021 at 10:30 AM Greg Nancarrow  wrote:
> On Tue, Feb 9, 2021 at 1:04 AM Amit Langote  wrote:
> > That brings me to to this part of the hunk:
> >
> > +   /*
> > +* If there is no underlying SELECT, a parallel table-modification
> > +* operation is not possible (nor desirable).
> > +*/
> > +   hasSubQuery = false;
> > +   foreach(lc, parse->rtable)
> > +   {
> > +   rte = lfirst_node(RangeTblEntry, lc);
> > +   if (rte->rtekind == RTE_SUBQUERY)
> > +   {
> > +   hasSubQuery = true;
> > +   break;
> > +   }
> > +   }
> > +   if (!hasSubQuery)
> > +   return PROPARALLEL_UNSAFE;
> >
> > The justification for this given in:
> >
> > https://www.postgresql.org/message-id/CAJcOf-dF9ohqub_D805k57Y_AuDLeAQfvtaax9SpwjTSEVdiXg%40mail.gmail.com
> >
> > seems to be that the failure of a test case in
> > partition-concurrent-attach isolation suite is prevented if finding no
> > subquery RTEs in the query is flagged as parallel unsafe, which in
> > turn stops max_parallel_hazard_modify() from locking partitions for
> > safety checks in such cases.  But it feels unprincipled to have this
> > code to work around a specific test case that's failing.  I'd rather
> > edit the failing test case to disable parallel execution as
> > Tsunakawa-san suggested.
> >
>
> The code was not changed because of the test case (though it was
> fortunate that the test case worked after the change).

Ah, I misread then, sorry.

> The code check that you have identified above ensures that the INSERT
> has an underlying SELECT, because the planner won't (and shouldn't
> anyway) generate a parallel plan for INSERT...VALUES, so there is no
> point doing any parallel-safety checks in this case.
> It just so happens that the problem test case uses INSERT...VALUES -
> and it shouldn't have triggered the parallel-safety checks for
> parallel INSERT for this case anyway, because INSERT...VALUES can't
> (and shouldn't) be parallelized.

AFAICS, max_parallel_hazard() path never bails from doing further
safety checks based on anything other than finding a query component
whose hazard level crosses context->max_interesting.  You're trying to
add something that bails based on second-guessing that a parallel path
would not be chosen, which I find somewhat objectionable.

If the main goal of bailing out is to avoid doing the potentially
expensive modification safety check on the target relation, maybe we
should try to somehow make the check less expensive.  I remember
reading somewhere in the thread about caching the result of this check
in relcache, but haven't closely studied the feasibility of doing so.

--
Amit Langote
EDB: http://www.enterprisedb.com




Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

2021-02-09 Thread Andy Fan
On Wed, Feb 10, 2021 at 11:18 AM Andy Fan  wrote:

> Hi:
>
> This patch is the first patch in the UniqueKey patch series[1]. Since I
> need to revise that series many times, but the first patch would not
> change that often, I'd like to submit this one for review first so that
> I don't need to maintain it again and again.
>
> v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch
>
> Introduce a notnullattrs field in RelOptInfo to indicate which attrs are
> not null in the current query. The not-null property is judged by
> checking pg_attribute and the query's restrictinfo. The info is only
> maintained at the base RelOptInfo and partition RelOptInfo level so far.
>
>
> Any thoughts?
>
> [1]
> https://www.postgresql.org/message-id/CAKU4AWr1BmbQB4F7j22G%2BNS4dNuem6dKaUf%2B1BK8me61uBgqqg%40mail.gmail.com
>
> --
> Best Regards
> Andy Fan (https://www.aliyun.com/)
>

Add the missed patch..

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch
Description: Binary data


Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

2021-02-09 Thread Andy Fan
Hi:

This patch is the first patch in the UniqueKey patch series[1]. Since I need
to revise that series many times, but the first patch would not change that
often, I'd like to submit this one for review first so that I don't need to
maintain it again and again.

v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch

Introduce a notnullattrs field in RelOptInfo to indicate which attrs are not
null in the current query. The not-null property is judged by checking
pg_attribute and the query's restrictinfo. The info is only maintained at the
base RelOptInfo and partition RelOptInfo level so far.
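
To illustrate the idea, the pg_attribute part could look roughly like this
(just a sketch with illustrative names, assuming a Bitmapset field on
RelOptInfo; the restrictinfo-derived part and the actual patch details are
not shown):

static void
set_baserel_notnullattrs(RelOptInfo *rel, RangeTblEntry *rte)
{
	Relation	relation = table_open(rte->relid, NoLock);
	TupleDesc	tupdesc = RelationGetDescr(relation);

	for (int i = 0; i < tupdesc->natts; i++)
	{
		Form_pg_attribute att = TupleDescAttr(tupdesc, i);

		/* attnotnull columns can never produce NULLs in this query */
		if (att->attnotnull && !att->attisdropped)
			rel->notnullattrs =
				bms_add_member(rel->notnullattrs,
							   att->attnum - FirstLowInvalidHeapAttributeNumber);
	}

	table_close(relation, NoLock);
}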


Any thoughts?

[1]
https://www.postgresql.org/message-id/CAKU4AWr1BmbQB4F7j22G%2BNS4dNuem6dKaUf%2B1BK8me61uBgqqg%40mail.gmail.com

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)

2021-02-09 Thread Kyotaro Horiguchi
At Wed, 10 Feb 2021 12:13:44 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Tue, 9 Feb 2021 22:01:45 -0300, Ranier Vilela  wrote 
> in 
> > Hi Hackers,
> > 
> > Per Coverity.
> > 
> > Coverity complains about the pg_cryptohash_final function.
> > And I agree with Coverity, it's a bad design.
> > It allows this:
> > 
> > #define MY_RESULT_LENGTH 32
> > 
> > void pgtest(char *buffer, size_t len)
> > {
> > pg_cryptohash_ctx *ctx;
> > uint8 digest[MY_RESULT_LENGTH];  /* too small for a SHA-512 digest (64 bytes) */
> > 
> > ctx = pg_cryptohash_create(PG_SHA512);
> > pg_cryptohash_init(ctx);
> > pg_cryptohash_update(ctx, (uint8 *) buffer, len);
> > pg_cryptohash_final(ctx, digest); /* <-- CID 1446240 (#1 of 1): Out-of-bounds access (OVERRUN) */
> > pg_cryptohash_free(ctx);
> > }
> >
> > Attached has a patch with suggestions to make things better.
> 
> I'm not sure about the details, but it looks like broken.
> 
> make complains for inconsistent prototypes abd cryptohahs.c and sha1.c
> doesn't seem to agree on its interface.

Sorry, my message was broken.

make complains about inconsistent prototypes, and cryptohash.c and
sha1.c don't seem to agree on the interface of pg_sha1_final.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)

2021-02-09 Thread Kyotaro Horiguchi
At Tue, 9 Feb 2021 22:01:45 -0300, Ranier Vilela  wrote in 
> Hi Hackers,
> 
> Per Coverity.
> 
> Coverity complains about the pg_cryptohash_final function.
> And I agree with Coverity, it's a bad design.
> It allows this:
> 
> #define MY_RESULT_LENGTH 32
> 
> void pgtest(char *buffer, size_t len)
> {
> pg_cryptohash_ctx *ctx;
> uint8 digest[MY_RESULT_LENGTH];  /* too small for a SHA-512 digest (64 bytes) */
> 
> ctx = pg_cryptohash_create(PG_SHA512);
> pg_cryptohash_init(ctx);
> pg_cryptohash_update(ctx, (uint8 *) buffer, len);
> pg_cryptohash_final(ctx, digest); /* <-- CID 1446240 (#1 of 1): Out-of-bounds access (OVERRUN) */
> pg_cryptohash_free(ctx);
> }
>
> Attached has a patch with suggestions to make things better.

I'm not sure about the details, but it looks like broken.

make complains for inconsistent prototypes abd cryptohahs.c and sha1.c
doesn't seem to agree on its interface.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Is Recovery actually paused?

2021-02-09 Thread Kyotaro Horiguchi
At Tue, 9 Feb 2021 12:27:21 +0530, Bharath Rupireddy 
 wrote in 
> What I meant was that if we were to add waiting logic inside
> pg_wal_replay_pause, we should also have a timeout with some default
> value, to avoid pg_wal_replay_pause waiting forever in the waiting
> loop. Within that timeout, if the recovery isn't paused,
> pg_wal_replay_pause will return probably a warning and a false(this
> requires us to change the return value of the existing
> pg_wal_replay_pause)?

I thought that rm_redo finishes quickly unless some trouble
happens. But on second thought, I realized that I had forgotten the case
of a recovery conflict. So, as you pointed out, pg_wal_replay_pause()
needs a 'wait' flag to wait for the pause to be established. And the flag
can be turned into a "timeout".

# And the previous version had another silly bug.

> To avoid changing the existing API and return type, a new function
> pg_get_wal_replay_pause_state is introduced.

I was talking about IN parameters, not OUT ones. IN parameters can be
made optional so that existing usage still works. pg_wal_replay_pause()
is changed that way in the attached.

If all of you still disagree with my proposal, I withdraw it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1ab31a9056..7eb93f74dd 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25320,14 +25320,19 @@ postgres=# SELECT * FROM 
pg_walfile_name_offset(pg_stop_backup());
 
  pg_wal_replay_pause
 
-pg_wal_replay_pause ()
+pg_wal_replay_pause (
+ timeout integer
+ )
 void


 Pauses recovery.  While recovery is paused, no further database
 changes are applied.  If hot standby is active, all new queries will
 see the same consistent snapshot of the database, and no further query
-conflicts will be generated until recovery is resumed.
+conflicts will be generated until recovery is resumed.  A zero or
+positive timeout value means the function raises an error if that many
+milliseconds elapse before recovery is paused (the default is -1,
+meaning wait forever).


 This function is restricted to superusers by default, but other users
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 8e3b5df7dc..8fd614cded 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6072,10 +6072,61 @@ RecoveryIsPaused(void)
return recoveryPause;
 }
 
+/*
+ * Pauses recovery.
+ *
+ * It is guaranteed that no WAL replay happens after this function returns. If
+ * timeout is zero or positive, emits ERROR when the timeout is reached before
+ * recovery is paused.
+ */
 void
-SetRecoveryPause(bool recoveryPause)
+SetRecoveryPause(bool recoveryPause, int timeout)
 {
+   TimestampTz finish_time = 0;
+   TimestampTz now;
+   int sleep_ms;
+
SpinLockAcquire(&XLogCtl->info_lck);
+
+   /* No need of timeout in the startup process */
+   Assert(!InRecovery || timeout < 0);
+
+   /*
+    * Wait for any concurrent rm_redo() to finish, so that no records will
+    * be applied after this function returns. No need to wait while
+    * resuming. Since we are requesting a recovery pause anyway, we don't
+    * mind the possible slowdown of recovery caused by taking info_lck
+    * here.  We don't need to wait in the startup process since no
+    * concurrent rm_redo() runs there.
+    */
+   while(!InRecovery &&
+ recoveryPause && !XLogCtl->recoveryPause &&
+ XLogCtl->replayEndRecPtr != XLogCtl->lastReplayedEndRecPtr)
+   {
+   SpinLockRelease(&XLogCtl->info_lck);
+   now = GetCurrentTimestamp();
+
+   if (timeout >= 0)
+   {
+   if (timeout > 0 && finish_time == 0)
+   finish_time = TimestampTzPlusMilliseconds(now, timeout);
+
+   if (finish_time < now)
+   ereport(ERROR,
+   (errcode(ERRCODE_SQL_STATEMENT_NOT_YET_COMPLETE),
+errmsg("could not pause recovery: timed out")));
+   }
+
+   CHECK_FOR_INTERRUPTS();
+
+   sleep_ms = 10;  /* sleep in 10 ms steps */
+
+   /* finish_time may be reached in less than 10 ms */
+   if (finish_time > 0)
+   sleep_ms = Min(sleep_ms, TimestampDifferenceMilliseconds(now, finish_time));
+
+   pg_usleep(sleep_ms * 1000L);
+   SpinLockAcquire(&XLogCtl->info_lck);
+   }
XLogCtl->recoveryPause = recoveryPause;
SpinLockRelease(&XLogCtl->info_lck);
 }
@@ -6270,7 +6321,7 @@ RecoveryRequiresIntParameter(const char *param_name, int 
currValue, int minValue

Re: pg_replication_origin_drop API potential race condition

2021-02-09 Thread Amit Kapila
On Tue, Feb 9, 2021 at 5:53 PM Alvaro Herrera  wrote:
>
> On 2021-Feb-09, Amit Kapila wrote:
>
> > On Tue, Feb 9, 2021 at 4:16 PM Alvaro Herrera  
> > wrote:
>
> > > By all means let's get the bug fixed.
> >
> > I am planning to push this in HEAD only as there is no user reported
> > problem and this is actually more about giving correct information to
> > the user rather than some misleading message. Do you see any need to
> > back-patch this change?
>
> master-only sounds OK.
>

Pushed!

-- 
With Regards,
Amit Kapila.




RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-09 Thread Hou, Zhijie
> > Till now, what I found is that:
> > With tang's conf, when doing parallel insert, the walrecord is more
> > than serial insert (IMO, this is the main reason why it has
> > performance degradation) See the attatchment for the plan info.
> >
> > I have tried alter the target table to unlogged and then the
> > performance degradation will not happen any more.
> >
> > And the additional walrecord seems related to the index on the target
> table.
> > If the target table does not have any index, the wal record is the
> > same between parallel plan and serial plan.
> > Also, it does not have performance degradation without index.
> 
> 
> [serial]
>  Insert on public.testscan  (cost=3272.20..3652841.26 rows=0 width=0)
> (actual time=360.474..360.476 rows=0 loops=1)
>Buffers: shared hit=392569 read=3 dirtied=934 written=933
>WAL: records=260354 bytes=16259841
> 
> [parallel]
>->  Insert on public.testscan  (cost=3272.20..1260119.35 rows=0
> width=0) (actual time=378.227..378.229 rows=0 loops=5)
>  Buffers: shared hit=407094 read=4 dirtied=1085 written=1158
>  WAL: records=260498 bytes=17019359
> 
> 
> More pages are dirtied and written in the parallel execution.  Aren't the
> index and possibly the target table bigger with parallel execution than
> with serial execution?  That may be due to the difference of inserts of
> index keys.

Yes, the table size and index size are bigger with parallel execution.

table and index's size after parallel insert
--
postgres=# select pg_size_pretty(pg_indexes_size('testscan_index'));
 pg_size_pretty

 4048 kB
(1 row)

postgres=#
postgres=# select pg_size_pretty(pg_relation_size('testscan_index'));
 pg_size_pretty

 4768 kB
(1 row)
--

table and index's size after serial insert 
--
postgres=# select pg_size_pretty(pg_indexes_size('testscan_index'));
 pg_size_pretty

 2864 kB
(1 row)

postgres=# select pg_size_pretty(pg_relation_size('testscan_index'));
 pg_size_pretty

 4608 kB
--


To Amit:
> I think you might want to see which exact WAL records are extra by using 
> pg_waldump?

Yes, thanks for the hint. I was doing that, and the result is as follows:

The heap WAL records are the same between parallel and serial (12, which is
the row count of the query result).

parallel Btree walrecord(130500 record):
--
INSERT_LEAF:129500
INSERT_UPPER:497
SPLIT_L:172
SPLIT_R:328
INSERT_POST:0
DEDUP:0
VACUUM:0
DELETE:0
MARK_PAGE_HALFDEAD:0
UNLINK_PAGE:0
UNLINK_PAGE_META:0
NEWROOT:3
REUSE_PAGE:0
META_CLEANUP:0
--

serial Btree walrecord(130355 record):
--
INSERT_LEAF:129644
INSERT_UPPER:354
SPLIT_L:0
SPLIT_R:355
INSERT_POST:0
DEDUP:0
VACUUM:0
DELETE:0
MARK_PAGE_HALFDEAD:0
UNLINK_PAGE:0
UNLINK_PAGE_META:0
NEWROOT:2
REUSE_PAGE:0
META_CLEANUP:0
--

IMO, due to the different pattern of index insertions with parallel execution,
the btree insert cost is higher than in the serial case.

At the same time, parallelism does not give a huge performance gain with
bitmapscan, so the extra btree index cost results in performance degradation.
Does that make sense?

Best regards,
Houzj





Re: repeated decoding of prepared transactions

2021-02-09 Thread Amit Kapila
On Wed, Feb 10, 2021 at 12:08 AM Robert Haas  wrote:
>
> On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila  wrote:
> > I think similar happens without any of the work done in PG-14 as well
> > if we restart the apply worker before the commit completes on the
> > subscriber. After the restart, we will send the start_decoding_at
> > point based on some previous commit which will make publisher send the
> > entire transaction again. I don't think restart of WAL sender or WAL
> > receiver is such a common thing. It can only happen due to some bug in
> > code or user wishes to stop the nodes or some crash happened.
>
> Really? My impression is that the logical replication protocol is
> supposed to be designed in such a way that once a transaction is
> successfully confirmed, it won't be sent again. Now if something is
> not confirmed then it has to be sent again. But if it is confirmed
> then it shouldn't happen.
>

If by successfully confirmed, you mean that once the subscriber node
has received, it won't be sent again then as far as I know that is not
true. We rely on the flush location sent by the subscriber to advance
the decoding locations. We update the flush locations after we apply
the transaction's commit successfully. Also, after the restart, we use
the replication origin's last flush location as a point from where we
need the transactions and the origin's progress is updated at commit
time.

OTOH, If by successfully confirmed, you mean that once the subscriber
has applied the complete transaction (including commit), then you are
right that it won't be sent again.

-- 
With Regards,
Amit Kapila.




Re: New IndexAM API controlling index vacuum strategies

2021-02-09 Thread Masahiko Sawada
On Sat, Feb 6, 2021 at 5:02 AM Peter Geoghegan  wrote:
>
> On Wed, Feb 3, 2021 at 8:18 PM Masahiko Sawada  wrote:
> > > I'm starting to think that the right short term goal should not
> > > directly involve bottom-up index deletion. We should instead return to
> > > the idea of "unifying" the vacuum_cleanup_index_scale_factor feature
> > > with the INDEX_CLEANUP feature, which is kind of where this whole idea
> > > started out at. This short term goal is much more than mere
> > > refactoring. It is still a whole new user-visible feature. The patch
> > > would teach VACUUM to skip doing any real index work within both
> > > ambulkdelete() and amvacuumcleanup() in many important cases.
> > >
> >
> > I agree to cut the scope. I've also been thinking about the impact of
> > this patch on users.
>
> It's probably also true that on balance users care more about the
> "~99.9% append-only table" case (as well as the HOT updates workload I
> brought up in response to Victor on February 2) than making VACUUM
> very sensitive to how well bottom-up deletion is working. Yes, it's
> annoying that VACUUM still wastes effort on indexes where bottom-up
> deletion alone can do all required garbage collection. But that's not
> going to be a huge surprise to users. Whereas the "~99.9% append-only
> table" case causes huge surprises to users -- users hate this kind of
> thing.

Agreed.

>
> > If vacuum skips both ambulkdelete and amvacuumcleanup in that case,
> > I'm concerned that this could increase users who are affected by the
> > known issue of leaking deleted pages. Currently, users who are
> > affected by that is only users who use INDEX_CLEANUP off. But if we
> > enable this feature by default, all users potentially are affected by
> > that issue.
>
> FWIW I think that it's unfair to blame INDEX_CLEANUP for any problems
> in this area. The truth is that the design of the
> deleted-page-recycling stuff has always caused leaked pages, even with
> workloads that should not be challenging to the implementation in any
> way. See my later remarks.
>
> > To improve index tuple deletion in that case, skipping bulkdelete is
> > also a good idea whereas the retail index deletion is also a good
> > solution. I have thought the retail index deletion would be
> > appropriate to this case but since some index AM cannot support it
> > skipping index scans is a good solution anyway.
>
> The big problem with retail index tuple deletion is that it is not
> possible once heap pruning takes place (opportunistic pruning, or
> pruning performed by VACUUM). Pruning will destroy the information
> that retail deletion needs to find the index tuple (the column
> values).

Right.

>
> I think that we probably will end up using retail index tuple
> deletion, but it will only be one strategy among several. We'll never
> be able to rely on it, even within nbtree. My personal opinion is that
> completely eliminating VACUUM is not a useful long term goal.

Totally agreed. We are not able to rely on it. It would be a good way
to delete small amount index garbage tuples but the usage is limited.

>
> > > Here is how the triggering criteria could work: maybe skipping
> > > accessing all indexes during VACUUM happens when less than 1% or
> > > 10,000 of the items from the table are to be removed by VACUUM --
> > > whichever is greater. Of course this is just the first thing I thought
> > > of. It's a starting point for further discussion.
> >
> > I also prefer your second idea :)
>
> Cool. Yeah, I always like it when the potential downside of a design
> is obviously low, and the potential upside is obviously very high. I
> am much less concerned about any uncertainty around when and how users
> will get the big upside. I like it when my problems are "luxury
> problems".  :-)
>
> > As I mentioned above, I'm still concerned that the extent of users who
> > are affected by the issue of leaking deleted pages could get expanded.
> > Currently, we don't have a way to detect how many index pages are
> > leaked. If there are potential cases where many deleted pages are
> > leaked this feature would make things worse.
>
> The existing problems here were significant even before you added
> INDEX_CLEANUP. For example, let's say VACUUM deletes a page, and then
> later recycles it in the normal/correct way -- this is the simplest
> possible case for page deletion. The page is now in the FSM, ready to
> be recycled during the next page split. Or is it? Even in this case
> there are no guarantees! This is because _bt_getbuf() does not fully
> trust the FSM to give it a 100% recycle-safe page for its page split
> caller -- _bt_getbuf() checks the page using _bt_page_recyclable()
> (which is the same check that VACUUM does to decide a deleted page is
> now recyclable). Obviously this means that the FSM can "leak" a page,
> just because there happened to be no page splits before wraparound
> occurred (and so now _bt_page_recyclable() thinks the very old page is
> very new

Re: Single transaction in the tablesync worker?

2021-02-09 Thread Peter Smith
On Tue, Feb 9, 2021 at 10:38 AM Peter Smith  wrote:
>
> On Mon, Feb 8, 2021 at 11:42 AM Peter Smith  wrote:
> >
> > On Sun, Feb 7, 2021 at 2:38 PM Peter Smith  wrote:
> > >
> > > On Sat, Feb 6, 2021 at 2:10 AM Petr Jelinek
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > Some minor comments about code:
> > > >
> > > > > + else if (res->status == WALRCV_ERROR && missing_ok)
> > > > > + {
> > > > > + /* WARNING. Error, but missing_ok = true. */
> > > > > + ereport(WARNING,
> > > >
> > > > I wonder if we need to add error code to the WalRcvExecResult and check
> > > > for the appropriate ones here. Because this can for example return error
> > > > because of timeout, not because slot is missing. Not sure if it matters
> > > > for current callers though (but then maybe don't call the param
> > > > missign_ok?).
> > >
> > > You are right. The way we are using this function has evolved beyond
> > > the original intention.
> > > Probably renaming the param to something like "error_ok" would be more
> > > appropriate now.
> > >
> >
> > PSA a patch (apply on top of V28) to change the misleading param name.
> >
>
> PSA an alternative patch. This one adds a new member to
> WalRcvExecResult and so is able to detect the "slot does not exist"
> error. This patch also applies on top of V28, if you want it.
>

PSA v2 of this WalRcvExecResult patch (it is the same as v1 but includes
some PG doc updates).
This applies OK on top of v30 of the main patch.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v2-0001-ReplicationSlotDropAtPubNode-detect-slot-does-not.patch
Description: Binary data


Re: 64-bit XIDs in deleted nbtree pages

2021-02-09 Thread Peter Geoghegan
On Tue, Feb 9, 2021 at 2:14 PM Peter Geoghegan  wrote:
> The first patch teaches nbtree to use 64-bit transaction IDs here, and
> so makes it impossible to leak deleted nbtree pages. This patch is the
> nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
> due to exactly the same set of problems.

There is an unresolved question for my deleted page XID patch: what
should it do about the vacuum_cleanup_index_scale_factor feature,
which added an XID to the metapage (its btm_oldest_btpo_xact field). I
refer to the work done by commit 857f9c36cda for Postgres 11 by
Masahiko. It would be good to get your opinion on this as the original
author of that feature, Masahiko.

To recap, btm_oldest_btpo_xact is supposed to be the oldest XID among
all deleted pages in the index, so clearly it needs to be carefully
considered in my patch to make the XIDs 64-bit. Even still, v1 of my
patch from today more or less ignores the issue -- it just gets a
32-bit version of the oldest value as determined by the oldestBtpoXact
XID tracking stuff (which is largely unchanged, except that it works
with 64-bit Full Transaction Ids now).

Obviously it is still possible for the 32-bit btm_oldest_btpo_xact
field to wrap around in v1 of my patch. The obvious thing to do here
is to add a new epoch metapage field, effectively making
btm_oldest_btpo_xact 64-bit. However, I don't think that that's a good
idea. The only reason that we have the btm_oldest_btpo_xact field in
the first place is to ameliorate the problem that the patch
comprehensively solves! We should stop storing *any* XIDs in the
metapage. (Besides, adding a new "epoch" field to the metapage would
be relatively messy.)

Here is a plan that allows us to stop storing any kind of XID in the
metapage in all cases:

1. Stop maintaining the oldest XID among all deleted pages in the
entire nbtree index during VACUUM. So we can remove all of the
BTVacState.oldestBtpoXact XID tracking stuff, which is currently
something that even _bt_pagedel() needs special handling for.

2. Stop considering the btm_oldest_btpo_xact metapage field in
_bt_vacuum_needs_cleanup() -- now the "Cleanup needed?" logic only
cares about maintaining reasonably accurate statistics for the index.
Which is really how the vacuum_cleanup_index_scale_factor feature was
intended to work all along, anyway -- ISTM that the oldestBtpoXact
stuff was always just an afterthought to paper-over this annoying
32-bit XID issue.

3. We cannot actually remove the btm_oldest_btpo_xact XID field from
the metapage, because of course that would change the BTMetaPageData
struct layout, which breaks on-disk compatibility. But why not use it
for something useful instead? _bt_update_meta_cleanup_info() can use
the same field to store the number of "newly deleted" pages from the
last btbulkdelete() instead. (See my email from earlier for the
definition of "newly deleted".)

4. Now _bt_vacuum_needs_cleanup() can once again consider the
btm_oldest_btpo_xact metapage field -- except in a totally different
way, because now it means something totally different: "newly deleted
pages during last btbulkdelete() call" (per item 3). If this # pages
is very high then we probably should do a full call to btvacuumscan()
-- _bt_vacuum_needs_cleanup() will return true to make that happen.

It's unlikely but still possible that a high number of "newly deleted
pages during the last btbulkdelete() call" is in itself a good enough
reason to do a full btvacuumscan() call when the question of calling
btvacuumscan() is considered within _bt_vacuum_needs_cleanup(). Item 4
here conservatively covers that. Maybe the 32-bit-XID-in-metapage
triggering condition had some non-obvious value due to a natural
tendency for it to limit the number of deleted pages that go
unrecycled for a long time. (Or maybe there never really was any such
natural tendency -- still seems like a good idea to make the change
described by item 4.)

Even though we are conservative (at least in this sense I just
described), we nevertheless don't actually care about very old deleted
pages that we have not yet recycled -- provided there are not very
many of them. I'm thinking of "~2% of index" as the new "newly deleted
during last btbulkdelete() call" threshold applied within
_bt_vacuum_needs_cleanup(). There is no good reason why older
deleted-but-not-yet-recycled pages should be considered more valuable
than any other page that can be used when there is a page split.
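
To make item 4 concrete, the triggering check could end up looking something
like this (just a sketch of the shape of it -- "newly_deleted_pages" stands
in for the reinterpreted metapage field, and the 2% threshold is the value
floated above, not something settled):

/* Sketch only: inside _bt_vacuum_needs_cleanup(), after reading the metapage */
if (newly_deleted_pages > 0.02 * RelationGetNumberOfBlocks(info->index))
	return true;	/* many deleted pages await recycling -- do a full btvacuumscan() */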

Observations about on-disk compatibility with my patch + this 4 point scheme:

A. It doesn't matter that pg_upgrade'd indexes will have an XID value
in btm_oldest_btpo_xact that now gets incorrectly interpreted as
"newly deleted pages during last btbulkdelete() call" under the 4
point scheme I just outlined.

The spurious value will get cleaned up on the next VACUUM anyway
(whether VACUUM goes through btbulkdelete() or through
btvacuumcleanup()). Besides, most indexes always have a
btm_oldest_btpo_xact value of 0.

Re: adding wait_start column to pg_locks

2021-02-09 Thread Fujii Masao



On 2021/02/09 23:31, torikoshia wrote:

On 2021-02-09 22:54, Fujii Masao wrote:

On 2021/02/09 19:11, Fujii Masao wrote:



On 2021/02/09 18:13, Fujii Masao wrote:



On 2021/02/09 17:48, torikoshia wrote:

On 2021-02-05 18:49, Fujii Masao wrote:

On 2021/02/05 0:03, torikoshia wrote:

On 2021-02-03 11:23, Fujii Masao wrote:

64-bit fetches are not atomic on some platforms. So spinlock is necessary when updating 
"waitStart" without holding the partition lock? Also GetLockStatusData() needs spinlock 
when reading "waitStart"?


Also it might be worth thinking to use 64-bit atomic operations like
pg_atomic_read_u64(), for that.


Thanks for your suggestion and advice!

In the attached patch I used pg_atomic_read_u64() and pg_atomic_write_u64().

waitStart is TimestampTz, i.e. int64, but it seems pg_atomic_read_xxx and
pg_atomic_write_xxx only support unsigned integers, so I cast the type.

I may be using these functions not correctly, so if something is wrong, I would 
appreciate any comments.


About the documentation, since your suggestion seems better than v6, I used it 
as is.


Thanks for updating the patch!

+    if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &now));

pg_atomic_read_u64() is really necessary? I think that
"pg_atomic_write_u64(&MyProc->waitStart, now)" is enough.

+    deadlockStart = get_timeout_start_time(DEADLOCK_TIMEOUT);
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &deadlockStart));

Same as above.

+    /*
+ * Record waitStart by reusing the deadlock timeout timer.
+ *
+ * Ideally this would be done synchronously together with updating the
+ * lock information. However, since holding partitionLock longer has a
+ * performance impact, we do it here asynchronously.
+ */

IMO it's better to comment why we reuse the deadlock timeout timer.

 proc->waitStatus = waitStatus;
+    pg_atomic_init_u64(&MyProc->waitStart, 0);

pg_atomic_write_u64() should be used instead? Because waitStart can be
accessed concurrently there.

I updated the patch and addressed the above review comments. Patch attached.
Barring any objection, I will commit this version.


Thanks for modifying the patch!
I agree with your comments.

BTW, I ran pgbench several times before and after applying
this patch.

The environment is a virtual machine (CentOS 8), so this is
just for reference, but there was no significant difference
in latency or tps (both differences are below 1%).


Thanks for the test! I pushed the patch.


But I reverted the patch because buildfarm members rorqual and
prion don't like the patch. I'm trying to investigate the cause
of this failures.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2021-02-09%2009%3A20%3A10


-    relation | locktype |    mode
--+--+-
- test_prepared_1 | relation | RowExclusiveLock
- test_prepared_1 | relation | AccessExclusiveLock
-(2 rows)
-
+ERROR:  invalid spinlock number: 0

"rorqual" reported that the above error happened in the server built with
--disable-atomics --disable-spinlocks when reading pg_locks after
the transaction was prepared. The cause of this issue is that "waitStart"
atomic variable in the dummy proc created at the end of prepare transaction
was not initialized. I updated the patch so that pg_atomic_init_u64() is
called for the "waitStart" in the dummy proc for prepared transaction.
Patch attached. I confirmed that the patched server built with
--disable-atomics --disable-spinlocks passed all the regression tests.


Thanks for fixing the bug, I also tested v9.patch configured with
--disable-atomics --disable-spinlocks on my environment and confirmed
that all tests have passed.


Thanks for the test!

I found another bug in the patch. InitProcess() initializes "waitStart",
but InitAuxiliaryProcess() previously did not. This could cause an "invalid
spinlock number" error when reading pg_locks on a standby server.
I fixed that. Attached is the updated version of the patch.
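
For clarity, the pattern the patch has to maintain everywhere is simply this
(a sketch of the idea, not the attached patch itself):

/*
 * At proc setup time -- InitProcess(), InitAuxiliaryProcess(), and the
 * dummy proc created for a prepared transaction.  With --disable-atomics
 * --disable-spinlocks the 64-bit atomic falls back to an emulated
 * spinlock embedded in the variable, and reading it uninitialized is
 * what raises "invalid spinlock number: 0".
 */
pg_atomic_init_u64(&proc->waitStart, 0);

/* writer side, when the backend starts waiting for a lock */
pg_atomic_write_u64(&MyProc->waitStart, (uint64) GetCurrentTimestamp());

/* reader side, e.g. GetLockStatusData() */
waitStart = (TimestampTz) pg_atomic_read_u64(&proc->waitStart);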

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/contrib/amcheck/expected/check_btree.out 
b/contrib/amcheck/expected/check_btree.out
index 13848b7449..5a3f1ef737 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -97,8 +97,8 @@ SELECT bt_index_parent_check('bttest_b_idx');
 SELECT * FROM pg_locks
 WHERE relation = ANY(ARRAY['bttest_a', 'bttest_a_idx', 'bttest_b', 
'bttest_b_idx']::regclass[])
 AND pid = pg_backend_pid();
- locktype | database | relation | page | tuple | virtualxid | transactionid | 
classid | objid | objsubid | virtualtransaction | pid | mode | granted | 
fastpath 
---+--+--+--+---++-

Re: Clean up code

2021-02-09 Thread Euler Taveira
On Mon, Feb 8, 2021, at 7:09 PM, CK Tan wrote:
> Hi, can someone point me to the code that cleans up temp files should a query 
> crash unexpectedly? Thanks!
See this patch [1].


[1] 
https://www.postgresql.org/message-id/CAH503wDKdYzyq7U-QJqGn%3DGm6XmoK%2B6_6xTJ-Yn5WSvoHLY1Ww%40mail.gmail.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/


RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-09 Thread tsunakawa.ta...@fujitsu.com
From: Hou, Zhijie/侯 志杰 
> Till now, what I found is that:
> With tang's conf, when doing parallel insert, the walrecord is more than 
> serial
> insert (IMO, this is the main reason why it has performance degradation) See
> the attatchment for the plan info.
> 
> I have tried alter the target table to unlogged and then the performance
> degradation will not happen any more.
> 
> And the additional walrecord seems related to the index on the target table.
> If the target table does not have any index, the wal record is the same 
> between
> parallel plan and serial plan.
> Also, it does not have performance degradation without index.


[serial]
 Insert on public.testscan  (cost=3272.20..3652841.26 rows=0 width=0) (actual 
time=360.474..360.476 rows=0 loops=1)
   Buffers: shared hit=392569 read=3 dirtied=934 written=933
   WAL: records=260354 bytes=16259841

[parallel]
   ->  Insert on public.testscan  (cost=3272.20..1260119.35 rows=0 width=0) 
(actual time=378.227..378.229 rows=0 loops=5)
 Buffers: shared hit=407094 read=4 dirtied=1085 written=1158
 WAL: records=260498 bytes=17019359


More pages are dirtied and written in the parallel execution.  Aren't the index 
and possibly the target table bigger with parallel execution than with serial 
execution?  That may be due to the difference of inserts of index keys.


Regards
Takayuki Tsunakawa




pg_cryptohash_final possible out-of-bounds access (per Coverity)

2021-02-09 Thread Ranier Vilela
Hi Hackers,

Per Coverity.

Coverity complains about the pg_cryptohash_final function.
And I agree with Coverity, it's a bad design.
It allows this:

#define MY_RESULT_LENGTH 32

void pgtest(char *buffer, size_t len)
{
	pg_cryptohash_ctx *ctx;
	uint8 digest[MY_RESULT_LENGTH];	/* too small for a SHA-512 digest (64 bytes) */

	ctx = pg_cryptohash_create(PG_SHA512);
	pg_cryptohash_init(ctx);
	pg_cryptohash_update(ctx, (uint8 *) buffer, len);
	pg_cryptohash_final(ctx, digest); /* <-- CID 1446240 (#1 of 1): Out-of-bounds access (OVERRUN) */
	pg_cryptohash_free(ctx);
}

Attached has a patch with suggestions to make things better.
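
For comparison, a caller that sizes its buffer from the algorithm's digest
length (a sketch only, not the attached patch) cannot hit the overrun even
with the current interface:

#include "common/cryptohash.h"
#include "common/sha2.h"

static int
pgtest_safe(const uint8 *buffer, size_t len, uint8 out[PG_SHA512_DIGEST_LENGTH])
{
	pg_cryptohash_ctx *ctx = pg_cryptohash_create(PG_SHA512);

	if (ctx == NULL ||
		pg_cryptohash_init(ctx) < 0 ||
		pg_cryptohash_update(ctx, buffer, len) < 0 ||
		pg_cryptohash_final(ctx, out) < 0)	/* out is 64 bytes, matching SHA-512 */
	{
		pg_cryptohash_free(ctx);
		return -1;
	}

	pg_cryptohash_free(ctx);
	return 0;
}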

regards,
Ranier Vilela


pg_cryptohash.patch
Description: Binary data


Re: Single transaction in the tablesync worker?

2021-02-09 Thread Peter Smith
On Tue, Feb 9, 2021 at 8:32 PM Amit Kapila  wrote:
>
> On Tue, Feb 9, 2021 at 12:02 PM Peter Smith  wrote:
> >
> > Here are my feedback comments for the V29 patch.
> >
>
> Thanks.
>
> >
> > 3.
> > Previously the tablesync origin name format was encapsulated in a
> > common function. IMO it was cleaner/safer how it was before, instead
> > of the same "pg_%u_%u" cut/paste and scattered in many places.
> > (same comment applies multiple places, in this file and in tablesync.c)

OK. I confirmed it is fixed in V30.

But I noticed that the new function name is not quite consistent with
the existing function for the slot name, e.g.
ReplicationSlotNameForTablesync versus
ReplicationOriginNameForTableSync (note "TableSync" instead of
"Tablesync").

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: WIP: BRIN multi-range indexes

2021-02-09 Thread Tomas Vondra




On 2/9/21 3:46 PM, John Naylor wrote:
On Wed, Feb 3, 2021 at 7:54 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:

 >
 > [v-20210203]

Hi Tomas,

I have some random comments from reading the patch, but haven't gone 
into detail in the newer aspects. I'll do so in the near future.


The cfbot seems to crash on this patch during make check, but it doesn't 
crash for me. I'm not even sure what date that cfbot status is from.




Yeah, I noticed that too, and I'm investigating.

I tried running the regression tests on a 32-bit machine (rpi4), which 
sometimes uncovers strange failures, and that indeed fails. There are 
two or three bugs.


Firstly, the allocation optimization patch does this:

MAXALIGN(sizeof(ScanKey) * scan->numberOfKeys * natts)

instead of

MAXALIGN(sizeof(ScanKey) * scan->numberOfKeys) * natts

and that sometimes produces the wrong result, triggering an assert.


Secondly, there seems to be an issue with cross-type bloom indexes. 
Imagine you have an int8 column, with a bloom index on it, and then you 
do this:


   WHERE column = '122'::int4;

Currently, we end up passing this to the consistent function, which
tries to call hashint8 on the int4 datum. That succeeds on 64 bits
(because both types are byval there), but fails on 32 bits (where int8
is byref, so hashint8 misinterprets the byval int4 datum), which causes
a segfault.


I think I included those cross-type operators as a copy-paste from 
minmax indexes, but I see hash indexes don't allow this. And removing 
those cross-type rows from pg_amop.dat makes the crashes go away.


It's also possible I simplified the get_strategy_procinfo a bit too 
much. I see the minmax variant has subtype, so maybe we could do this 
instead (I recall the integer types should have "compatible" results).


There are a couple of failures where the index does not produce the right 
number of results, though. I haven't investigated that yet. Once I fix 
this, I'll post an updated patch - hopefully that'll make cfbot happy.




regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Tightening up allowed custom GUC names

2021-02-09 Thread Noah Misch
On Tue, Feb 09, 2021 at 05:34:37PM -0500, Tom Lane wrote:
> Now granting that the best answer is just to forbid these cases,
> there are still a couple of decisions about how extensive the
> prohibition ought to be:
> 
> * We could forbid these characters only when you try to actually
> put such a GUC into pg_db_role_setting, and otherwise allow them.
> That seems like a weird nonorthogonal choice though, so I'd
> rather just forbid them period.

Agreed.

> * A case could be made for tightening things up a lot more, and not
> allowing anything that doesn't look like an identifier.  I'm not
> pushing for that, as it seems more likely to break existing
> applications than the narrow restriction proposed here.  But I could
> live with it if people prefer that way.

I'd prefer that.  Characters like backslash, space, and double quote have
significant potential to reveal bugs, while having negligible application
beyond revealing bugs.




Tightening up allowed custom GUC names

2021-02-09 Thread Tom Lane
Over in [1] it was noted that the system behaves rather oddly if
you try to do ALTER USER/DATABASE SET with a custom GUC name
containing "=" or "-".  I think we should just disallow such cases.
Relaxing the restriction is harder than it might seem:

* The convention for entries in pg_db_role_setting is just
"name=value" with no quoting rule, so GUC names containing "="
can't work.  We could imagine installing some kind of quoting rule,
but that would break client-side code that looks at this catalog;
pg_dump, for one, does so.  On balance it seems clearly not worth
changing that.

* The problem with using "-" is that we parse pg_db_role_setting
entries with ParseLongOption(), which converts "-" to "_" because
that's what makes sense to do in the context of command-line switches
such as "-c work-mem=42MB".  We could imagine adjusting the code to
not do that in the pg_db_role_setting case, but you'd still be left
with a GUC that cannot be set via PGOPTIONS="-c custom.my-guc=42".
To avoid that potential confusion, it seems best to ban "-" as well
as "=".

Now granting that the best answer is just to forbid these cases,
there are still a couple of decisions about how extensive the
prohibition ought to be:

* We could forbid these characters only when you try to actually
put such a GUC into pg_db_role_setting, and otherwise allow them.
That seems like a weird nonorthogonal choice though, so I'd
rather just forbid them period.

* A case could be made for tightening things up a lot more, and not
allowing anything that doesn't look like an identifier.  I'm not
pushing for that, as it seems more likely to break existing
applications than the narrow restriction proposed here.  But I could
live with it if people prefer that way.

Anyway, attached is a proposed patch that implements the restriction
as stated.  I'm inclined to propose this for HEAD only and not
worry about the issue in the back branches.

Thoughts?

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/20210209144059.GA21360%40depesz.com

diff --git a/src/backend/utils/misc/guc-file.l b/src/backend/utils/misc/guc-file.l
index 7885a169bb..43cf0c4434 100644
--- a/src/backend/utils/misc/guc-file.l
+++ b/src/backend/utils/misc/guc-file.l
@@ -282,7 +282,7 @@ ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel)
 		 * Try to find the variable; but do not create a custom placeholder if
 		 * it's not there already.
 		 */
-		record = find_option(item->name, false, elevel);
+		record = find_option(item->name, false, true, elevel);
 
 		if (record)
 		{
@@ -306,7 +306,8 @@ ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel)
 			/* Now mark it as present in file */
 			record->status |= GUC_IS_IN_FILE;
 		}
-		else if (strchr(item->name, GUC_QUALIFIER_SEPARATOR) == NULL)
+		else if (strchr(item->name, GUC_QUALIFIER_SEPARATOR) == NULL ||
+ strcspn(item->name, GUC_DISALLOWED_NAME_CHARS) != strlen(item->name))
 		{
 			/* Invalid non-custom variable, so complain */
 			ereport(elevel,
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index eafdb1118e..959099122b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -5289,12 +5289,23 @@ add_placeholder_variable(const char *name, int elevel)
 }
 
 /*
- * Look up option NAME.  If it exists, return a pointer to its record,
- * else return NULL.  If create_placeholders is true, we'll create a
- * placeholder record for a valid-looking custom variable name.
+ * Look up option "name".  If it exists, return a pointer to its record.
+ * Otherwise, if create_placeholders is true and name is a valid-looking
+ * custom variable name, we'll create and return a placeholder record.
+ * Otherwise, if skip_errors is true, then we silently return NULL for
+ * an unrecognized or invalid name.  Otherwise, the error is reported at
+ * error level elevel (and we return NULL if that's less than ERROR).
+ *
+ * Note: internal errors, primarily out-of-memory, draw an elevel-level
+ * report and NULL return regardless of skip_errors.  Hence, callers must
+ * handle a NULL return whenever elevel < ERROR, but they should not need
+ * to emit any additional error message.  (In practice, internal errors
+ * can only happen when create_placeholders is true, so callers passing
+ * false need not think terribly hard about this.)
  */
 static struct config_generic *
-find_option(const char *name, bool create_placeholders, int elevel)
+find_option(const char *name, bool create_placeholders, bool skip_errors,
+			int elevel)
 {
 	const char **key = &name;
 	struct config_generic **res;
@@ -5322,19 +5333,38 @@ find_option(const char *name, bool create_placeholders, int elevel)
 	for (i = 0; map_old_guc_names[i] != NULL; i += 2)
 	{
 		if (guc_name_compare(name, map_old_guc_names[i]) == 0)
-			return find_option(map_old_guc_names[i + 1], false, elevel);
+			return find_option(map_old_guc_names[i + 1], false,
+	

64-bit XIDs in deleted nbtree pages

2021-02-09 Thread Peter Geoghegan
There is a long standing problem with the way that nbtree page
deletion places deleted pages in the FSM for recycling: The use of a
32-bit XID within the deleted page (in the special
area's/BTPageOpaqueData struct's btpo.xact field) is not robust
against XID wraparound, which can lead to permanently leaking pages in
a variety of scenarios. The problems became worse with the addition of
the INDEX_CLEANUP option in Postgres 12 [1]. And, using a 32-bit XID
in this context creates risk for any further improvements in VACUUM
that similarly involve skipping whole indexes. For example, Masahiko
has been working on a patch that teaches VACUUM to skip indexes that
are known to have very little garbage [2].

Attached patch series fixes the issue once and for all. This is
something that I'm targeting for Postgres 14, since it's more or less
a bug fix.

The first patch teaches nbtree to use 64-bit transaction IDs here, and
so makes it impossible to leak deleted nbtree pages. This patch is the
nbtree equivalent of commit 6655a729, which made GiST use 64-bit XIDs
due to exactly the same set of problems. The first patch also makes
the level field stored in nbtree page's special area/BTPageOpaqueData
reliably store the level, even in a deleted page. This allows me to
consistently use the level field within amcheck, including even within
deleted pages.

Of course it will still be possible for the FSM to leak deleted nbtree
index pages with the patch -- in general the FSM isn't crash safe.
That isn't so bad with the patch, though, because a subsequent VACUUM
will eventually notice the really old deleted pages, and add them back
to the FSM once again. This will always happen because
VACUUM/_bt_getbuf()/_bt_page_recyclable() can no longer become
confused about the age of deleted pages, even when they're really old.

The second patch in the series adds new information to VACUUM VERBOSE.
This makes it easy to understand what's going on here. Index page
deletion related output becomes more useful. It might also help with
debugging the first patch.

Currently, VACUUM VERBOSE output for an index that has some page
deletions looks like this:

"38 index pages have been deleted, 38 are currently reusable."

With the second patch applied, we might see this output at the same
point in VACUUM VERBOSE output instead:

"38 index pages have been deleted, 0 are newly deleted, 38 are
currently reusable."

This means that out of the 38 of the pages that were found to be
marked deleted in the index, 0 were deleted by the VACUUM operation
whose output we see here. That is, there were 0 nbtree pages that were
newly marked BTP_DELETED within _bt_unlink_halfdead_page() during
*this particular* VACUUM -- the VACUUM operation that we see
instrumentation about here. It follows that the 38 deleted pages that
we encountered must have been marked BTP_DELETED by some previous
VACUUM operation.

In practice the "%u are currently reusable" output should never
include newly deleted pages, since there is no way that a page marked
BTP_DELETED can be put in the FSM during the same VACUUM operation --
that's unsafe (we need all of this recycling/XID indirection precisely
because we need to delay recycling until it is truly safe, of course).
Note that the "%u index pages have been deleted" output includes both
pages deleted by some previous VACUUM operation, and newly deleted
pages (no change there).

Note that the new "newly deleted" output is instrumentation about this
particular *VACUUM operation*. In contrast, the other two existing
output numbers ("deleted" and "currently reusable") are actually
instrumentation about the state of the *index as a whole* at a point
in time (barring concurrent recycling of pages counted in VACUUM by
some random _bt_getbuf() call in another backend). This fundamental
distinction is important here. All 3 numbers/stats that we output can
have different values, which can be used to debug the first patch. You
can directly observe uncommon cases just from the VERBOSE output, like
when a long running transaction holds up recycling of a deleted page
that was actually marked BTP_DELETED in an *earlier* VACUUM operation.
And so if the first patch had any bugs, there'd be a pretty good
chance that you could observe them using multiple VACUUM VERBOSE
operations -- you might notice something inconsistent or contradictory
just by examining the output over time, how things change, etc.

[1] 
https://postgr.es/m/ca+tgmoyd7xpr1dwewwxxiw4-wc1nbjf3rb9d2qgpvyh9ejz...@mail.gmail.com
[2] 
https://postgr.es/m/cah2-wzmkebqpd4mvguptos9bmfvp9mds5crtcosv1rqj3jc...@mail.gmail.com
--
Peter Geoghegan
From 39ef90d96d0c061b2e537c4cdc9899e4770c3023 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 27 Aug 2019 11:44:17 -0700
Subject: [PATCH v1 1/2] Use full 64-bit XID for nbtree page deletion.

Otherwise, after a deleted page gets even older, it becomes unrecyclable
again.  This is the nbtree equivalent of commit 6655a729, which did the
same thing wit

Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-09 Thread John Naylor
I wrote:
>
> On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas  wrote:
> One of his earlier demos [1] (in simdutf8check.h) had a version that used
mostly SSE2 with just three intrinsics from SSSE3. That's widely available
by now. He measured that at 0.7 cycles per byte, which is still good
compared to AVX2 0.45 cycles per byte [2].
>
> Testing for three SSSE3 intrinsics in autoconf is pretty easy. I would
assume that if that check (and the corresponding runtime check) passes, we
can assume SSE2. That code has three licenses to choose from -- Apache 2,
Boost, and MIT. Something like that might be straightforward to start from.
I think the only obstacles to worry about are license and getting it to fit
into our codebase. Adding more than zero high-level comments with a good
description of how it works in detail is also a bit of a challenge.

I double checked, and it's actually two SSSE3 intrinsics and one SSE4.1,
but the 4.1 one can be emulated with a few SSE2 intrinsics. But we could
probably fold all three into the SSE4.2 CRC check and have a single symbol
to save on boilerplate.

I hacked that demo [1] into wchar.c (very ugly patch attached), and got the
following:

master

 mixed | ascii
---+---
   757 |   366

Lemire demo:

 mixed | ascii
---+---
   172 |   168

This one lacks an ascii fast path, but the AVX2 version in the same file
has one that could probably be easily adapted. With that, I think this
would be worth adapting to our codebase and license. Thoughts?
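
For the record, the ASCII fast path itself is tiny -- per 16-byte block it is
essentially this (an SSE2-only sketch, not the demo's code):

#include <emmintrin.h>		/* SSE2 */
#include <stdbool.h>

/*
 * If no byte in the 16-byte block has its high bit set, the block is pure
 * ASCII and the full UTF-8 validation can be skipped for it.
 */
static inline bool
block_is_ascii(const unsigned char *p)
{
	__m128i		chunk = _mm_loadu_si128((const __m128i *) p);

	return _mm_movemask_epi8(chunk) == 0;
}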

The advantage of this demo is that it's not buried in a mountain of modern
C++.

Simdjson can use AVX -- do you happen to know which target it got compiled
to? AVX vectors are 256 bits wide and that requires OS support. The OSes we
care most about were updated 8-12 years ago, but that would still be
something to check, in addition to more configure checks.

[1] https://github.com/lemire/fastvalidate-utf-8/tree/master/include

--
John Naylor
EDB: http://www.enterprisedb.com


utf-sse42-demo.patch
Description: Binary data


Re: [HACKERS] Custom compression methods

2021-02-09 Thread Robert Haas
On Tue, Feb 9, 2021 at 3:37 PM Justin Pryzby  wrote:
> I think you misunderstood: I mean that the WIP patch should default to
> --enable-lz4, to exercise on a few CI.  It's hardly useful to run CI with the
> feature disabled.  I assume that the patch would be committed with default
> --disable-lz4.

Oh, I see. I guess we could do that.

> Right, it's not one-time, it's also whenever setting a non-default compression
> method.  I say it should go into 0001 to avoid a whole bunch of churn in
> src/test/regress, and then more churn (and rebase conflicts in other patches)
> while adding HIDE_COMPRESSAM in 0002.

Hmm, I guess that makes some sense, too.

I'm not sure either one is completely critical, but it does make sense
to me now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2021-02-09 Thread Justin Pryzby
I see the thread got broken somehow (or cfbot thought it did), so I added the
new thread, and this is now passing all tests.  (I think using the v22
patches).  http://cfbot.cputube.org/dilip-kumar.html

On Fri, Feb 05, 2021 at 11:07:53AM -0500, Robert Haas wrote:
> > Also, I think we may want to make enable-lz4 the default *for testing
> > purposes*, now that the linux and BSD environments include that.
> 
> My guess was that would annoy some hackers whose build environments
> got broken. If everyone thinks otherwise I'm willing to be persuaded,
> but it's going to take more than 1 vote...

I think you misunderstood: I mean that the WIP patch should default to
--enable-lz4, to exercise on a few CI.  It's hardly useful to run CI with the
feature disabled.  I assume that the patch would be committed with default
--disable-lz4.

On Fri, Feb 05, 2021 at 11:23:46AM -0500, Robert Haas wrote:
> On Fri, Feb 5, 2021 at 11:07 AM Robert Haas  wrote:
> > Personally, my preference is to just update the test outputs. It's not
> > important whether many people look closely to verify the differences;
> > we just need to look them over on a one-time basis to see if they seem
> > OK. After that it's 0 effort, vs. having to maintain HIDE_COMPRESSAM
> > forever.
> 
> Oh, I guess you're thinking about the case where someone wants to run
> the tests with a different default. That might be a good reason to
> have this. But then those changes should go in 0002.

Right, it's not one-time, it's also whenever setting a non-default compression
method.  I say it should go into 0001 to avoid a whole bunch of churn in
src/test/regress, and then more churn (and rebase conflicts in other patches)
while adding HIDE_COMPRESSAM in 0002.

-- 
Justin




Re: Is txid_status() actually safe? / What is 011_crash_recovery.pl testing?

2021-02-09 Thread Robert Haas
On Mon, Feb 8, 2021 at 4:52 PM Andres Freund  wrote:
> The 011_crash_recovery.pl test test starts a transaction, creates a
> table, fetches the transaction's xid. Then shuts down the server in
> immediate mode. It then asserts that after crash recovery the previously
> assigned xid is shown as aborted, and that new xids are assigned after
> the xid.
>
> But as far as I can tell there's no guarantee that that is the case.

I think you are right.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-09 Thread Heikki Linnakangas

On 09/02/2021 22:08, John Naylor wrote:
Maybe there's a smarter way to check for zeros in C. Or maybe be more 
careful about cache -- running memchr() on the whole input first might 
not be the best thing to do.


The usual trick is the haszero() macro here: 
https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord. That's 
how memchr() is typically implemented, too.
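
For reference, here is a minimal, self-contained sketch of that word-at-a-time
zero-byte test (my own illustration with made-up names, not code from any
posted patch; the constants are the usual ones from the page above):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True if any byte of the 64-bit word v is zero (the "haszero" bit trick). */
#define HAS_ZERO_BYTE(v) \
	(((v) - UINT64_C(0x0101010101010101)) & ~(v) & UINT64_C(0x8080808080808080))

/* Return true if buf[0 .. len-1] contains a zero byte. */
static bool
contains_zero_byte(const char *buf, size_t len)
{
	size_t		i = 0;

	/* Check eight bytes at a time. */
	for (; i + 8 <= len; i += 8)
	{
		uint64_t	chunk;

		memcpy(&chunk, buf + i, 8);		/* memcpy avoids alignment issues */
		if (HAS_ZERO_BYTE(chunk))
			return true;
	}

	/* Check any remaining tail bytes one by one. */
	for (; i < len; i++)
	{
		if (buf[i] == '\0')
			return true;
	}
	return false;
}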


- Heikki




Re: [HACKERS] Custom compression methods

2021-02-09 Thread Robert Haas
Please remember to trim unnecessary quoted material.

On Sun, Feb 7, 2021 at 6:45 AM Dilip Kumar  wrote:
> [ a whole lot of quoted stuff ]
>
> I have tested the performance, pglz vs lz4
>
> Test1: With a small simple string, pglz doesn't select compression but
> lz4 select as no min limit
> Table: 100 varchar column
> Test: Insert 1000 tuple, each column of 25 bytes string (32 is min
> limit for pglz)
> Result:
> pglz: 1030 ms (doesn't attempt compression so externalize),
> lz4: 212 ms
>
> Test2: With small incompressible string, pglz don't select compression
> lz4 select but can not compress
> Table: 100 varchar column
> Test: Insert 1000 tuple, each column of 25 bytes string (32 is min
> limit for pglz)
> Result:
> pglz: 1030 ms (doesn't attempt compression so externalize),
> lz4: 1090 ms (attempt to compress but externalize):
>
> Test3: Test a few columns with large random data
> Table: 3 varchar column
> Test: Insert 1000 tuple  3 columns size(3500 byes, 4200 bytes, 4900bytes)
> pglz: 150 ms (compression ratio: 3.02%),
> lz4: 30 ms (compression ratio : 2.3%)
>
> Test4: Test3 with different large random slighly compressible, need to
> compress + externalize:
> Table: 3 varchar column
> Insert: Insert 1000 tuple  3 columns size(8192, 8192, 8192)
> CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS
> 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
> Test: insert into t1 select large_val(), large_val(), large_val() from
> generate_series(1,1000);
> pglz: 2000 ms
> lz4: 1500 ms
>
> Conclusion:
> 1. In most cases lz4 is faster and doing better compression as well.
> 2. In Test2 when small data is incompressible then lz4 tries to
> compress whereas pglz doesn't try so there is some performance loss.
> But if we want we can fix
> it by setting some minimum limit of size for lz4 as well, maybe the
> same size as pglz?

So my conclusion here is that perhaps there's no real problem. It
looks like externalizing is so expensive compared to compression that
it's worth trying to compress even though it may not always pay off.
If, by trying to compress, we avoid externalizing, it's a huge win
(~5x). If we try to compress and don't manage to avoid externalizing,
it's a small loss (~6%). It's probably reasonable to expect that
compressible data is more common than incompressible data, so not only
is the win a lot bigger than the loss, but we should be able to expect
it to happen a lot more often. It's not impossible that somebody could
get bitten, but it doesn't feel like a huge risk to me.
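
For what it's worth, plugging the rough per-1000-row timings from the tests
upthread into a quick back-of-the-envelope calculation (my own arithmetic,
not a measurement from the patch) suggests the break-even point is far into
incompressible territory:

#include <stdio.h>

int
main(void)
{
	/* Timings per 1000 rows reported upthread, in milliseconds. */
	double		win = 1030.0 - 212.0;	/* compressible: externalizing avoided */
	double		loss = 1090.0 - 1030.0; /* incompressible: wasted attempt */

	/*
	 * Always trying to compress is a net win on average as long as the
	 * fraction f of incompressible values satisfies f * loss < (1 - f) * win,
	 * i.e. f < win / (win + loss).
	 */
	printf("break-even incompressible fraction: %.0f%%\n",
		   100.0 * win / (win + loss));		/* prints roughly 93% */
	return 0;
}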

One thing that does occur to me is that it might be a good idea to
skip compression if it doesn't change the number of chunks that will
be stored into the TOAST table. If we compress the value but still
need to externalize it, and the compression didn't save enough to
reduce the number of chunks, I suppose we ideally would externalize
the uncompressed version. That would save decompression time later,
without really costing anything. However, I suppose that would be a
separate improvement from this patch. Maybe the possibility of
compressing smaller values makes it slightly more important, but I'm
not sure that it's worth getting excited about.

If anyone feels otherwise on either point, it'd be good to hear about it.
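
As a toy illustration of that chunk-count heuristic (nothing here is from a
posted patch; the helper name is invented and chunk_size stands in for
TOAST_MAX_CHUNK_SIZE):

#include <stdbool.h>
#include <stddef.h>

/*
 * Would storing the compressed form use fewer TOAST chunks than storing
 * the raw form?  If not, externalizing the uncompressed version saves
 * decompression time later without costing extra storage.
 */
static bool
compression_reduces_chunks(size_t rawsize, size_t compressed_size,
						   size_t chunk_size)
{
	size_t		raw_chunks = (rawsize + chunk_size - 1) / chunk_size;
	size_t		comp_chunks = (compressed_size + chunk_size - 1) / chunk_size;

	return comp_chunks < raw_chunks;
}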

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-09 Thread John Naylor
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas  wrote:
>
> I also tested the fallback implementation from the simdjson library
> (included in the patch, if you uncomment it in simdjson-glue.c):
>
>   mixed | ascii
> ---+---
> 447 |46
> (1 row)
>
> I think we should at least try to adopt that. At a high level, it looks
> pretty similar your patch: you load the data 8 bytes at a time, check if
> there are all ASCII. If there are any non-ASCII chars, you check the
> bytes one by one, otherwise you load the next 8 bytes. Your patch should
> be able to achieve the same performance, if done right. I don't think
> the simdjson code forbids \0 bytes, so that will add a few cycles, but
> still.

That fallback is very similar to my "inline C" case upthread, and they both
actually check 16 bytes at a time (the comment is wrong in the patch you
shared). I can work back and show how the performance changes with each
difference (just MacOS, clang 10 here):

master

 mixed | ascii
---+---
   757 |   366

v1, but using memcpy()

 mixed | ascii
---+---
   601 |   129

remove zero-byte check:

 mixed | ascii
---+---
   588 |93

inline ascii fastpath into pg_utf8_verifystr()

 mixed | ascii
---+---
   595 |71

use 16-byte stride

 mixed | ascii
---+---
   652 |49

With this CPU/compiler, v1 is fastest on the mixed input, all else being
equal.

Maybe there's a smarter way to check for zeros in C. Or maybe be more
careful about cache -- running memchr() on the whole input first might not
be the best thing to do.
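
To make the shape of that fast path concrete, here is a rough sketch of a
16-byte-stride ASCII check (my own illustration, not the actual patch, and it
deliberately leaves out the zero-byte check discussed above):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True if all 16 bytes at s are ASCII, i.e. have the high bit clear. */
static bool
is_ascii_16(const unsigned char *s)
{
	uint64_t	half1;
	uint64_t	half2;

	memcpy(&half1, s, 8);		/* memcpy keeps the loads alignment-safe */
	memcpy(&half2, s + 8, 8);

	return ((half1 | half2) & UINT64_C(0x8080808080808080)) == 0;
}

/*
 * Consume a leading run of ASCII bytes 16 at a time and return how many
 * bytes were skipped; the caller continues with byte-at-a-time multibyte
 * validation from that point.
 */
static size_t
ascii_prefix_len(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	while (i + 16 <= len && is_ascii_16(s + i))
		i += 16;
	return i;
}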

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2021-02-09 Thread Zhihong Yu
Hi,
Minor comment:

+   if (errflag == true)
+   ereport(ERROR,

I think 'if (errflag)' should suffice.

Cheers

On Tue, Feb 9, 2021 at 10:44 AM Pavel Borisov 
wrote:

> Tue, 9 Feb 2021 at 01:46, Mark Dilger :
>
>>
>>
>> > On Feb 8, 2021, at 2:46 AM, Pavel Borisov 
>> wrote:
>> >
>> > 0002 - is a temporary hack for testing. It will allow inserting
>> duplicates in a table even if an index with the exact name "idx" has a
>> unique constraint (generally it is prohibited to insert). Then a new
>> amcheck will tell us about these duplicates. It's a pity but testing cannot
>> be done automatically, as it needs a core recompile. For testing I'd
>> recommend a protocol similar to the following:
>> >
>> > - Apply patch 0002
>> > - Set autovacuum = off in postgresql.conf
>>
>> Thanks Pavel and Anastasia for working on this!
>>
>> Updating pg_catalog directly is ugly, but the following seems a simpler
>> way to set up a regression test than having to recompile.  What do you
>> think?
>>
>> Very nice idea, thanks!
> I've made a regression test based on it. PFA v.2 of a patch. Now it
> doesn't need anything external for testing.
>
> --
> Best regards,
> Pavel Borisov
>
> Postgres Professional: http://postgrespro.com 
>


Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2021-02-09 Thread Pavel Borisov
To make the tests stable I also removed the lsn output at warning level. PFA v3
of the patch.

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


v3-0001-Make-amcheck-checking-UNIQUE-constraint-for-btree.patch
Description: Binary data


Re: [PATCH] Improve amcheck to also check UNIQUE constraint in btree index.

2021-02-09 Thread Pavel Borisov
Tue, 9 Feb 2021 at 01:46, Mark Dilger :

>
>
> > On Feb 8, 2021, at 2:46 AM, Pavel Borisov 
> wrote:
> >
> > 0002 - is a temporary hack for testing. It will allow inserting
> duplicates in a table even if an index with the exact name "idx" has a
> unique constraint (generally it is prohibited to insert). Then a new
> amcheck will tell us about these duplicates. It's a pity but testing cannot
> be done automatically, as it needs a core recompile. For testing I'd
> recommend a protocol similar to the following:
> >
> > - Apply patch 0002
> > - Set autovacuum = off in postgresql.conf
>
> Thanks Pavel and Anastasia for working on this!
>
> Updating pg_catalog directly is ugly, but the following seems a simpler
> way to set up a regression test than having to recompile.  What do you
> think?
>
> Very nice idea, thanks!
I've made a regression test based on it. PFA v.2 of a patch. Now it doesn't
need anything external for testing.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


v2-0001-Make-amcheck-checking-UNIQUE-constraint-for-btree.patch
Description: Binary data


Re: repeated decoding of prepared transactions

2021-02-09 Thread Robert Haas
On Tue, Feb 9, 2021 at 6:57 AM Amit Kapila  wrote:
> I think similar happens without any of the work done in PG-14 as well
> if we restart the apply worker before the commit completes on the
> subscriber. After the restart, we will send the start_decoding_at
> point based on some previous commit which will make publisher send the
> entire transaction again. I don't think restart of WAL sender or WAL
> receiver is such a common thing. It can only happen due to some bug in
> code or user wishes to stop the nodes or some crash happened.

Really? My impression is that the logical replication protocol is
supposed to be designed in such a way that once a transaction is
successfully confirmed, it won't be sent again. Now if something is
not confirmed then it has to be sent again. But if it is confirmed
then it shouldn't happen.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add MAIN_RELATION_CLEANUP and SECONDARY_RELATION_CLEANUP options to VACUUM

2021-02-09 Thread Bossart, Nathan
On 2/8/21, 9:19 PM, "Michael Paquier"  wrote:
> On Mon, Feb 08, 2021 at 06:59:45PM +, Bossart, Nathan wrote:
>> These suggestions seem reasonable to me.  I've applied them in v9.
>
> Sounds good to me, so applied.

Thanks!

Nathan



Re: Perform COPY FROM encoding conversions in larger chunks

2021-02-09 Thread Heikki Linnakangas

On 09/02/2021 15:40, John Naylor wrote:
On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas  wrote:

 >
 > On 02/02/2021 23:42, John Naylor wrote:
 > >
 > > In copyfromparse.c, this is now out of date:
 > >
 > >   * Read the next input line and stash it in line_buf, with conversion to
 > >   * server encoding.
This comment for CopyReadLine() is still there. Conversion already 
happened by now, so I think this comment is outdated.


Other than that, I think this is ready for commit.


Fixed. And also fixed one more bug in allocating raw_buf_size, the "+ 1" 
somehow went missing again. That was causing a failure on Windows at 
cfbot.cputube.org.


I'll read through this one more time with fresh eyes tomorrow or the day 
after, and push. Thanks for all the review!


- Heikki




Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-02-09 Thread David G. Johnston
On Thu, Feb 4, 2021 at 4:45 PM Masahiro Ikeda 
wrote:

> I pgindented the patches.
>
>
... XLogWrite, which is invoked during an
XLogFlush request (see ...).  This is also incremented
by the WAL receiver during replication.

("which normally called" should be "which is normally called" or "which
normally is called" if you want to keep true to the original)

You missed adding the space before an opening parenthesis here and
elsewhere (probably copy-paste)

is ether -> is either

"This parameter is off by default as it will repeatedly query the operating
system..."
", because" -> "as"

wal_write_time and the sync items also need the note: "This is also
incremented by the WAL receiver during replication."

"The number of times it happened..." -> " (the tally of this event is
reported in wal_buffers_full in) This is undesirable because ..."

I notice that the patch for WAL receiver doesn't require explicitly
computing the sync statistics but does require computing the write
statistics.  This is because of the presence of issue_xlog_fsync but
absence of an equivalent pg_xlog_pwrite.  Additionally, I observe that the
XLogWrite code path calls pgstat_report_wait_*() while the WAL receiver
path does not.  It seems technically straight-forward to refactor here to
avoid the almost-duplicated logic in the two places, though I suspect there
may be a trade-off for not adding another function call to the stack given
the importance of WAL processing (though that seems marginalized compared
to the cost of actually writing the WAL).  Or, as Fujii noted, go the other
way and don't have any shared code between the two but instead implement
the WAL receiver one to use pg_stat_wal_receiver instead.  In either case,
this half-and-half implementation seems undesirable.

David J.


[patch] bit XOR aggregate functions

2021-02-09 Thread Alexey Bashtanov

Hi,

I personally use it as a checksum for a large unordered set, where
performance and simplicity are prioritized over collision resilience.

Maybe there are other ways to use them.

Best, Alex
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1ab31a9056..f33358f8db 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19192,6 +19192,32 @@ SELECT NULLIF(value, '(none)') ...
Yes
   
 
+  
+   
+
+ bit_xor
+
+bit_xor ( smallint )
+smallint
+   
+   
+bit_xor ( integer )
+integer
+   
+   
+bit_xor ( bigint )
+bigint
+   
+   
+bit_xor ( bit )
+bit
+   
+   
+Computes the bitwise exclusive OR of all non-null input values.
+   
+   Yes
+  
+
   

 
diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat
index 5c1f962251..0d8c5a922a 100644
--- a/src/include/catalog/pg_aggregate.dat
+++ b/src/include/catalog/pg_aggregate.dat
@@ -505,18 +505,26 @@
   aggcombinefn => 'int2and', aggtranstype => 'int2' },
 { aggfnoid => 'bit_or(int2)', aggtransfn => 'int2or', aggcombinefn => 'int2or',
   aggtranstype => 'int2' },
+{ aggfnoid => 'bit_xor(int2)', aggtransfn => 'int2xor', aggcombinefn => 'int2xor',
+  aggtranstype => 'int2' },
 { aggfnoid => 'bit_and(int4)', aggtransfn => 'int4and',
   aggcombinefn => 'int4and', aggtranstype => 'int4' },
 { aggfnoid => 'bit_or(int4)', aggtransfn => 'int4or', aggcombinefn => 'int4or',
   aggtranstype => 'int4' },
+{ aggfnoid => 'bit_xor(int4)', aggtransfn => 'int4xor', aggcombinefn => 'int4xor',
+  aggtranstype => 'int4' },
 { aggfnoid => 'bit_and(int8)', aggtransfn => 'int8and',
   aggcombinefn => 'int8and', aggtranstype => 'int8' },
 { aggfnoid => 'bit_or(int8)', aggtransfn => 'int8or', aggcombinefn => 'int8or',
   aggtranstype => 'int8' },
+{ aggfnoid => 'bit_xor(int8)', aggtransfn => 'int8xor', aggcombinefn => 'int8xor',
+  aggtranstype => 'int8' },
 { aggfnoid => 'bit_and(bit)', aggtransfn => 'bitand', aggcombinefn => 'bitand',
   aggtranstype => 'bit' },
 { aggfnoid => 'bit_or(bit)', aggtransfn => 'bitor', aggcombinefn => 'bitor',
   aggtranstype => 'bit' },
+{ aggfnoid => 'bit_xor(bit)', aggtransfn => 'bitxor', aggcombinefn => 'bitxor',
+  aggtranstype => 'bit' },
 
 # xml
 { aggfnoid => 'xmlagg', aggtransfn => 'xmlconcat2', aggtranstype => 'xml' },
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4e0c9be58c..95577094bb 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -7992,24 +7992,36 @@
 { oid => '2237', descr => 'bitwise-or smallint aggregate',
   proname => 'bit_or', prokind => 'a', proisstrict => 'f', prorettype => 'int2',
   proargtypes => 'int2', prosrc => 'aggregate_dummy' },
+{ oid => '8452', descr => 'bitwise-xor smallint aggregate',
+  proname => 'bit_xor', prokind => 'a', proisstrict => 'f', prorettype => 'int2',
+  proargtypes => 'int2', prosrc => 'aggregate_dummy' },
 { oid => '2238', descr => 'bitwise-and integer aggregate',
   proname => 'bit_and', prokind => 'a', proisstrict => 'f',
   prorettype => 'int4', proargtypes => 'int4', prosrc => 'aggregate_dummy' },
 { oid => '2239', descr => 'bitwise-or integer aggregate',
   proname => 'bit_or', prokind => 'a', proisstrict => 'f', prorettype => 'int4',
   proargtypes => 'int4', prosrc => 'aggregate_dummy' },
+{ oid => '8453', descr => 'bitwise-xor integer aggregate',
+  proname => 'bit_xor', prokind => 'a', proisstrict => 'f', prorettype => 'int4',
+  proargtypes => 'int4', prosrc => 'aggregate_dummy' },
 { oid => '2240', descr => 'bitwise-and bigint aggregate',
   proname => 'bit_and', prokind => 'a', proisstrict => 'f',
   prorettype => 'int8', proargtypes => 'int8', prosrc => 'aggregate_dummy' },
 { oid => '2241', descr => 'bitwise-or bigint aggregate',
   proname => 'bit_or', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
   proargtypes => 'int8', prosrc => 'aggregate_dummy' },
+{ oid => '8454', descr => 'bitwise-xor bigint aggregate',
+  proname => 'bit_xor', prokind => 'a', proisstrict => 'f', prorettype => 'int8',
+  proargtypes => 'int8', prosrc => 'aggregate_dummy' },
 { oid => '2242', descr => 'bitwise-and bit aggregate',
   proname => 'bit_and', prokind => 'a', proisstrict => 'f', prorettype => 'bit',
   proargtypes => 'bit', prosrc => 'aggregate_dummy' },
 { oid => '2243', descr => 'bitwise-or bit aggregate',
   proname => 'bit_or', prokind => 'a', proisstrict => 'f', prorettype => 'bit',
   proargtypes => 'bit', prosrc => 'aggregate_dummy' },
+{ oid => '8455', descr => 'bitwise-xor bit aggregate',
+  proname => 'bit_xor', prokind => 'a', proisstrict => 'f', prorettype => 'bit',
+  proargtypes => 'bit', prosrc => 'aggregate_dummy' },
 
 # formerly-missing interval + datetime operators
 { oid => '2546',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 55ca

RE: Support tab completion for upper character inputs in psql

2021-02-09 Thread Tang, Haiying
At Sun, 07 Feb 2021 13:55:00 -0500, Tom Lane  wrote in 
>
> This looks like you're trying to force case-insensitive behavior 
> whether that is appropriate or not.  Does not sound like a good idea.

I'm still confused about the APPROPRIATE behavior of tab completion.
It seems ALTER table/tablespace  SET/RESET is already case-insensitive.

For example
# alter tablespace dbspace set(e[tab]
# alter tablespace dbspace set(effective_io_concurrency

# alter tablespace dbspace set(E[tab]
# alter tablespace dbspace set(EFFECTIVE_IO_CONCURRENCY

The above behavior is exactly the same as what the patch(attached in the 
following message) did for SET/RESET etc.
https://www.postgresql.org/message-id/flat/a63cbd45e3884cf9b3961c2a6a95dcb7%40G08CNEXMBPEKD05.g08.fujitsu.local

If anyone can share some cases which show inappropriate scenarios of forcing
case-insensitive inputs in psql, I'd be grateful for that.

Regards,
Tang







Re: WIP: BRIN multi-range indexes

2021-02-09 Thread John Naylor
On Wed, Feb 3, 2021 at 7:54 PM Tomas Vondra 
wrote:
>
> [v-20210203]

Hi Tomas,

I have some random comments from reading the patch, but haven't gone into
detail in the newer aspects. I'll do so in the near future.

The cfbot seems to crash on this patch during make check, but it doesn't
crash for me. I'm not even sure what date that cfbot status is from.

> BLOOM
> -

Looks good, but make sure you change the commit message -- it still refers
to sorted mode.

+ * not entirely clear how to distrubute the space between those columns.

s/distrubute/distribute/

> MINMAX-MULTI
> 

> c) 0007 - A hybrid approach, using a buffer that is multiple of the
> user-specified value, with some safety min/max limits. IMO this is what
> we should use, although perhaps with some tuning of the exact limits.

That seems like a good approach.

+#include "access/hash.h" /* XXX strange that it fails because of
BRIN_AM_OID without this */

I think you want #include "catalog/pg_am.h" here.

> Attached is a spreadsheet with benchmark results for each of those three
> approaches, on different data types (byval/byref), data set types, index
> parameters (pages/values per range) etc. I think 0007 is a reasonable
> compromise overall, with performance somewhere in between 0005 and 0006.
> Of course, there are cases where it's somewhat slow, e.g. for data types
> with expensive comparisons and data sets forcing frequent compactions,
> in which case it's ~10x slower compared to regular minmax (in most cases
> it's ~1.5x). Compared to btree, it's usually much faster - ~2-3x as fast
> (except for some extreme cases, of course).
>
>
> As for the opclasses for indexes without "natural" distance function,
> implemented in 0008, I think we should drop it. In theory it works, but

Sounds reasonable.

> The other thing we were considering was using the new multi-minmax
> opclasses as default ones, replacing the existing minmax ones. IMHO we
> shouldn't do that either. For existing minmax indexes that's useless
> (the opclass seems to be working, otherwise the index would be dropped).
> But even for new indexes I'm not sure it's the right thing, so I don't
> plan to change this.

Okay.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: adding wait_start column to pg_locks

2021-02-09 Thread torikoshia

On 2021-02-09 22:54, Fujii Masao wrote:

On 2021/02/09 19:11, Fujii Masao wrote:



On 2021/02/09 18:13, Fujii Masao wrote:



On 2021/02/09 17:48, torikoshia wrote:

On 2021-02-05 18:49, Fujii Masao wrote:

On 2021/02/05 0:03, torikoshia wrote:

On 2021-02-03 11:23, Fujii Masao wrote:
64-bit fetches are not atomic on some platforms. So spinlock is 
necessary when updating "waitStart" without holding the 
partition lock? Also GetLockStatusData() needs spinlock when 
reading "waitStart"?


Also it might be worth thinking to use 64-bit atomic operations like
pg_atomic_read_u64(), for that.


Thanks for your suggestion and advice!

In the attached patch I used pg_atomic_read_u64() and 
pg_atomic_write_u64().


waitStart is TimestampTz i.e., int64, but it seems 
pg_atomic_read_xxx and pg_atomic_write_xxx only supports unsigned 
int, so I cast the type.


I may be using these functions not correctly, so if something is 
wrong, I would appreciate any comments.



About the documentation, since your suggestion seems better than 
v6, I used it as is.


Thanks for updating the patch!

+    if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &now));


pg_atomic_read_u64() is really necessary? I think that
"pg_atomic_write_u64(&MyProc->waitStart, now)" is enough.

+    deadlockStart = get_timeout_start_time(DEADLOCK_TIMEOUT);
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &deadlockStart));


Same as above.

+    /*
+ * Record waitStart reusing the deadlock timeout timer.
+ *
+ * It would be ideal this can be synchronously done with updating
+ * lock information. Howerver, since it gives performance impacts
+ * to hold partitionLock longer time, we do it here asynchronously.
+ */

IMO it's better to comment why we reuse the deadlock timeout timer.

 proc->waitStatus = waitStatus;
+    pg_atomic_init_u64(&MyProc->waitStart, 0);

pg_atomic_write_u64() should be used instead? Because waitStart can be
accessed concurrently there.

I updated the patch and addressed the above review comments. Patch 
attached.

Barring any objection, I will commit this version.


Thanks for modifying the patch!
I agree with your comments.

BTW, I ran pgbench several times before and after applying
this patch.

The environment is virtual machine(CentOS 8), so this is
just for reference, but there were no significant difference
in latency or tps(both are below 1%).


Thanks for the test! I pushed the patch.


But I reverted the patch because buildfarm members rorqual and
prion don't like the patch. I'm trying to investigate the cause
of this failures.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2021-02-09%2009%3A20%3A10


-relation | locktype |mode
--+--+-
- test_prepared_1 | relation | RowExclusiveLock
- test_prepared_1 | relation | AccessExclusiveLock
-(2 rows)
-
+ERROR:  invalid spinlock number: 0

"rorqual" reported that the above error happened in the server built 
with

--disable-atomics --disable-spinlocks when reading pg_locks after
the transaction was prepared. The cause of this issue is that 
"waitStart"
atomic variable in the dummy proc created at the end of prepare 
transaction
was not initialized. I updated the patch so that pg_atomic_init_u64() 
is

called for the "waitStart" in the dummy proc for prepared transaction.
Patch attached. I confirmed that the patched server built with
--disable-atomics --disable-spinlocks passed all the regression tests.


Thanks for fixing the bug, I also tested v9.patch configured with
--disable-atomics --disable-spinlocks on my environment and confirmed
that all tests have passed.



BTW, while investigating this issue, I found that pg_stat_wal_receiver also
could cause this error even in the current master (without the patch).
I will report that in separate thread.



https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-09%2009%3A13%3A16


"prion" reported the following error. But I'm not sure how the changes 
of
pg_locks caused this error. I found that Heikki also reported at [1] 
that

"prion" failed with the same error but was not sure how it happened.
This makes me think for now that this issue is not directly related to
the pg_locks changes.


Thanks! I was wondering how these errors were related to the commit.


Regards,

--
Atsushi Torikoshi


-
pg_dump: error: query failed: ERROR:  missing chunk number 0 for toast
value 1 in pg_toast_2619
pg_dump: error: query was: SELECT
a.attnum,
a.attname,
a.atttypmod,
a.attstattarget,
a.attstorage,
t.typstorage,
a.attnotnull,
a.atthasdef,
a.attisdropped,
a.attlen,
a.attalign,
a.attislocal,
pg_catalog.format_type(t.oid, a.atttypmod) AS atttypname,
array_to_string(a

Re: Routine usage information schema tables

2021-02-09 Thread Erik Rijkers
> On 02/09/2021 3:06 PM Peter Eisentraut  
> wrote:
> 
>  
> Several information schema views track dependencies between 
> functions/procedures and objects used by them.  These had not been

> [0001-Routine-usage-information-schema-tables.patch]

Spotted one typo:

included here ony if
included here only if


Erik Rijkers




ERROR: invalid spinlock number: 0

2021-02-09 Thread Fujii Masao

Hi,

Commit 3b733fcd04 caused the buildfarm member "rorqual" to report
the following error and fail the regression test. The cause of this issue
is a bug in that commit.

ERROR:  invalid spinlock number: 0

But while investigating the issue, I found that this error could happen
even in the current master (without commit 3b733fcd04). The error can
be easily reproduced by reading pg_stat_wal_receiver view before
walreceiver starts up, in the server built with --disable-atomics 
--disable-spinlocks.
Furthermore if you try to read pg_stat_wal_receiver again,
that gets stuck. This is not good.

ISTM that the commit 2c8dd05d6c caused this issue. The commit changed
pg_stat_get_wal_receiver() so that it reads "writtenUpto" by using
pg_atomic_read_u64(). But since "writtenUpto" is initialized only when
walreceiver starts up, reading "writtenUpto" before the startup of
walreceiver can cause the error.

Also pg_stat_get_wal_receiver() calls pg_atomic_read_u64() while
a spinlock is being held. Maybe this causes the process to get stuck
in the --disable-atomics case, I guess.
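
To sketch the shape of one possible fix (an illustration only; the struct
and function names are invented stand-ins for what is described above, and
whether this is the right place to do it is exactly the open question):

#include "postgres.h"

#include "access/xlogdefs.h"
#include "port/atomics.h"
#include "storage/spin.h"

/* Invented stand-in for the relevant parts of the shared walreceiver state. */
typedef struct SketchWalRcvData
{
	slock_t		mutex;
	pg_atomic_uint64 writtenUpto;
} SketchWalRcvData;

/*
 * Initialize the atomic when the shared struct is created, not when the
 * walreceiver process starts, so that a backend reading the view before
 * any walreceiver has ever run sees a valid (zero) value.
 */
static void
sketch_shmem_init(SketchWalRcvData *walrcv)
{
	SpinLockInit(&walrcv->mutex);
	pg_atomic_init_u64(&walrcv->writtenUpto, 0);
}

/*
 * Read the atomic outside any spinlock section, so that emulated atomics
 * cannot end up taking a spinlock while another one is already held.
 */
static XLogRecPtr
sketch_read_written_upto(SketchWalRcvData *walrcv)
{
	return (XLogRecPtr) pg_atomic_read_u64(&walrcv->writtenUpto);
}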

Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: TRUNCATE on foreign table

2021-02-09 Thread Kazutaka Onishi
> IIUC, "truncatable" would be set to "false" for relations which do not
> have physical storage e.g. views but will be true for regular tables.

"truncatable" option is just for the foreign table and it's not related
with whether it's on a physical storage or not.
"postgres_fdw" already has "updatable" option to make the table read-only.
However, "updatable" is for DML, and it's not suitable for TRUNCATE.
Therefore new options "truncatable" was added.

Please refer to this message for details.
https://www.postgresql.org/message-id/20200128040346.GC1552%40paquier.xyz

> DELETE is very different from TRUNCATE. Application may want to DELETE
> based on a join with a local table and hence it can not be executed on
> a foreign server. That's not true with TRUNCATE.

Yeah, as you say, applications don't need that for TRUNCATE.
We're focusing on analytical use, namely operating on huge data.
TRUNCATE can erase rows faster than DELETE.


Routine usage information schema tables

2021-02-09 Thread Peter Eisentraut
Several information schema views track dependencies between 
functions/procedures and objects used by them.  These had not been
implemented so far because PostgreSQL doesn't track objects used in a
function body.  However, formally, these also show dependencies used
in parameter default expressions, which PostgreSQL does support and
track.  So for the sake of completeness, we might as well add these.
If dependency tracking for function bodies is ever implemented, these
views will automatically work correctly.

I developed this as part of the patch "SQL-standard function body", 
where it would become more useful, but I'm sending it now separately to 
not bloat the other patch further.


[0]: https://www.postgresql.org/message-id/flat/1c11f1eb-f00c-43b7-799d-2d44132c0...@2ndquadrant.com
From f3a62c680760f57693dac2be2e7ba7c8cfa04fd4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 9 Feb 2021 07:42:25 +0100
Subject: [PATCH] Routine usage information schema tables

Several information schema views track dependencies between
functions/procedures and objects used by them.  These had not been
implemented so far because PostgreSQL doesn't track objects used in a
function body.  However, formally, these also show dependencies used
in parameter default expressions, which PostgreSQL does support and
track.  So for the sake of completeness, we might as well add these.
If dependency tracking for function bodies is ever implemented, these
views will automatically work correctly.
---
 doc/src/sgml/information_schema.sgml  | 443 ++
 src/backend/catalog/information_schema.sql| 100 +++-
 src/backend/catalog/sql_features.txt  |   2 +-
 .../regress/expected/create_function_3.out|  38 ++
 src/test/regress/sql/create_function_3.sql|  24 +
 5 files changed, 602 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/information_schema.sgml 
b/doc/src/sgml/information_schema.sgml
index 36ec17a4c6..9d55120024 100644
--- a/doc/src/sgml/information_schema.sgml
+++ b/doc/src/sgml/information_schema.sgml
@@ -4841,6 +4841,126 @@ role_usage_grants 
Columns
   
  
 
+ 
+  routine_column_usage
+
+  
+   The view routine_column_usage is meant to identify all
+   columns that are used by a function or procedure.  This information is
+   currently not tracked by PostgreSQL.
+  
+
+  
+   routine_column_usage Columns
+
+   
+
+ 
+  
+   Column Type
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   specific_catalog sql_identifier
+  
+  
+   Name of the database containing the function (always the current 
database)
+  
+ 
+
+ 
+  
+   specific_schema sql_identifier
+  
+  
+   Name of the schema containing the function
+  
+ 
+
+ 
+  
+   specific_name sql_identifier
+  
+  
+   The specific name of the function.  See  for more information.
+  
+ 
+
+ 
+  
+   routine_catalog sql_identifier
+  
+  
+   Name of the database containing the function (always the current 
database)
+  
+ 
+
+ 
+  
+   routine_schema sql_identifier
+  
+  
+   Name of the schema containing the function
+  
+ 
+
+ 
+  
+   routine_name sql_identifier
+  
+  
+   Name of the function (might be duplicated in case of overloading)
+  
+ 
+
+ 
+  
+   table_catalog sql_identifier
+  
+  
+   Name of the database that contains the table that is used by the
+   function (always the current database)
+  
+ 
+
+ 
+  
+   table_schema sql_identifier
+  
+  
+   Name of the schema that contains the table that is used by the function
+  
+ 
+
+ 
+  
+   table_name sql_identifier
+  
+  
+   Name of the table that is used by the function
+  
+ 
+
+ 
+  
+   column_name sql_identifier
+  
+  
+   Name of the column that is used by the function
+  
+ 
+
+   
+  
+ 
+
  
   routine_privileges
 
@@ -4960,6 +5080,329 @@ routine_privileges 
Columns
   
  
 
+ 
+  routine_routine_usage
+
+  
+   The view routine_routine_usage is meant to identify all
+   functions or procedures that are used by another (or the same) function or
+   procedure, either in the body or in parameter default expressions.
+   Currently, only functions used in parameter default expressions are
+   tracked.  An entry is included here ony if the used function is owned by a
+   currently enabled role.  (There is no such restriction on the using
+   function.)
+  
+
+  
+   Note that the entries for both functions in the view refer to the
+   specific name of the routine, even though the column names
+   are used in a way that is inconsistent with other information schema views
+   about routines.  This is per SQL standard, although it is arguably a
+   misdesign.  See  for more information
+   a

Re: adding wait_start column to pg_locks

2021-02-09 Thread Fujii Masao



On 2021/02/09 19:11, Fujii Masao wrote:



On 2021/02/09 18:13, Fujii Masao wrote:



On 2021/02/09 17:48, torikoshia wrote:

On 2021-02-05 18:49, Fujii Masao wrote:

On 2021/02/05 0:03, torikoshia wrote:

On 2021-02-03 11:23, Fujii Masao wrote:

64-bit fetches are not atomic on some platforms. So spinlock is necessary when updating 
"waitStart" without holding the partition lock? Also GetLockStatusData() needs spinlock 
when reading "waitStart"?


Also it might be worth thinking to use 64-bit atomic operations like
pg_atomic_read_u64(), for that.


Thanks for your suggestion and advice!

In the attached patch I used pg_atomic_read_u64() and pg_atomic_write_u64().

waitStart is TimestampTz i.e., int64, but it seems pg_atomic_read_xxx and 
pg_atomic_write_xxx only supports unsigned int, so I cast the type.

I may be using these functions not correctly, so if something is wrong, I would 
appreciate any comments.


About the documentation, since your suggestion seems better than v6, I used it 
as is.


Thanks for updating the patch!

+    if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &now));

pg_atomic_read_u64() is really necessary? I think that
"pg_atomic_write_u64(&MyProc->waitStart, now)" is enough.

+    deadlockStart = get_timeout_start_time(DEADLOCK_TIMEOUT);
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &deadlockStart));

Same as above.

+    /*
+ * Record waitStart reusing the deadlock timeout timer.
+ *
+ * It would be ideal this can be synchronously done with updating
+ * lock information. Howerver, since it gives performance impacts
+ * to hold partitionLock longer time, we do it here asynchronously.
+ */

IMO it's better to comment why we reuse the deadlock timeout timer.

 proc->waitStatus = waitStatus;
+    pg_atomic_init_u64(&MyProc->waitStart, 0);

pg_atomic_write_u64() should be used instead? Because waitStart can be
accessed concurrently there.

I updated the patch and addressed the above review comments. Patch attached.
Barring any objection, I will commit this version.


Thanks for modifying the patch!
I agree with your comments.

BTW, I ran pgbench several times before and after applying
this patch.

The environment is virtual machine(CentOS 8), so this is
just for reference, but there were no significant difference
in latency or tps(both are below 1%).


Thanks for the test! I pushed the patch.


But I reverted the patch because buildfarm members rorqual and
prion don't like the patch. I'm trying to investigate the cause
of this failures.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2021-02-09%2009%3A20%3A10


-relation | locktype |mode
--+--+-
- test_prepared_1 | relation | RowExclusiveLock
- test_prepared_1 | relation | AccessExclusiveLock
-(2 rows)
-
+ERROR:  invalid spinlock number: 0

"rorqual" reported that the above error happened in the server built with
--disable-atomics --disable-spinlocks when reading pg_locks after
the transaction was prepared. The cause of this issue is that "waitStart"
atomic variable in the dummy proc created at the end of prepare transaction
was not initialized. I updated the patch so that pg_atomic_init_u64() is
called for the "waitStart" in the dummy proc for prepared transaction.
Patch attached. I confirmed that the patched server built with
--disable-atomics --disable-spinlocks passed all the regression tests.
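
In sketch form, the pattern being fixed looks roughly like this (an
illustration only, using an invented stand-in struct rather than the real
PGPROC; the calls mirror what is described in this thread):

#include "postgres.h"

#include "datatype/timestamp.h"
#include "port/atomics.h"

/* Invented stand-in for the part of PGPROC discussed here. */
typedef struct SketchProc
{
	pg_atomic_uint64 waitStart;		/* when the lock wait started, or 0 */
} SketchProc;

/*
 * Every proc entry that pg_locks can read must have the atomic initialized,
 * including the dummy proc created for a prepared transaction; otherwise a
 * build with --disable-atomics --disable-spinlocks fails with
 * "invalid spinlock number: 0" the first time the emulated atomic is used.
 */
static void
sketch_init_proc(SketchProc *proc)
{
	pg_atomic_init_u64(&proc->waitStart, 0);
}

/* Later writers and readers then go through the atomic API. */
static void
sketch_record_wait_start(SketchProc *proc, TimestampTz now)
{
	if (pg_atomic_read_u64(&proc->waitStart) == 0)
		pg_atomic_write_u64(&proc->waitStart, (uint64) now);
}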

BTW, while investigating this issue, I found that pg_stat_wal_receiver also
could cause this error even in the current master (without the patch).
I will report that in separate thread.



https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-09%2009%3A13%3A16


"prion" reported the following error. But I'm not sure how the changes of
pg_locks caused this error. I found that Heikki also reported at [1] that
"prion" failed with the same error but was not sure how it happened.
This makes me think for now that this issue is not directly related to
the pg_locks changes.

-
pg_dump: error: query failed: ERROR:  missing chunk number 0 for toast value 
1 in pg_toast_2619
pg_dump: error: query was: SELECT
a.attnum,
a.attname,
a.atttypmod,
a.attstattarget,
a.attstorage,
t.typstorage,
a.attnotnull,
a.atthasdef,
a.attisdropped,
a.attlen,
a.attalign,
a.attislocal,
pg_catalog.format_type(t.oid, a.atttypmod) AS atttypname,
array_to_string(a.attoptions, ', ') AS attoptions,
CASE WHEN a.attcollation <> t.typcollation THEN a.attcollation ELSE 0 END AS 
attcollation,
pg_catalog.array_to_string(ARRAY(SELECT pg_catalog.quote_ident(option_name) || 
' ' || pg_catalog.quote_literal(option_value) FROM 
pg_catalog.pg_options_to_table(attfdwoptions) ORDER BY option_name), E',
') AS 

Re: Perform COPY FROM encoding conversions in larger chunks

2021-02-09 Thread John Naylor
On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas  wrote:
>
> On 02/02/2021 23:42, John Naylor wrote:
> >
> > In copyfromparse.c, this is now out of date:
> >
> >   * Read the next input line and stash it in line_buf, with conversion to
> >   * server encoding.

This comment for CopyReadLine() is still there. Conversion already happened
by now, so I think this comment is outdated.

Other than that, I think this is ready for commit.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Parallel INSERT (INTO ... SELECT ...)

2021-02-09 Thread Amit Kapila
On Tue, Feb 9, 2021 at 5:49 PM Hou, Zhijie  wrote:
>
> > > > postgres=# explain verbose insert into testscan select a from x
> > > > where
> > > > a<8 or (a%2=0 and a>19990);
> > > > QUERY PLAN
> > > >
> > 
> > > > --
> > > > -
> > > >  Gather  (cost=4346.89..1281204.64 rows=81372 width=0)
> > > >Workers Planned: 4
> > > >->  Insert on public.testscan  (cost=3346.89..1272067.44 rows=0
> > > > width=0)
> > > >  ->  Parallel Bitmap Heap Scan on public.x1
> > > > (cost=3346.89..1272067.44 rows=20343 width=8)
> > > >Output: x1.a, NULL::integer
> > > >Recheck Cond: ((x1.a < 8) OR (x1.a > 19990))
> > > >Filter: ((x1.a < 8) OR (((x1.a % 2) = 0) AND
> > > > (x1.a >
> > > > 19990)))
> > > >->  BitmapOr  (cost=3346.89..3346.89 rows=178808
> > > > width=0)
> > > >  ->  Bitmap Index Scan on x1_a_idx
> > > > (cost=0.00..1495.19 rows=80883 width=0)
> > > >Index Cond: (x1.a < 8)
> > > >  ->  Bitmap Index Scan on x1_a_idx
> > > > (cost=0.00..1811.01 rows=97925 width=0)
> > > >Index Cond: (x1.a > 19990)
> > > >
> > > > PSA is my postgresql.conf file, maybe you can have a look. Besides,
> > > > I didn't do any parameters tuning in my test session.
> > >
> > > I reproduced this on my machine.
> > >
> > > I think we'd better do "analyze" before insert which helps reproduce this
> > easier.
> > > Like:
> > >
> > > -
> > > analyze;
> > > explain analyze verbose insert into testscan select a from x where
> > > a<8 or (a%2=0 and a>19990);
> > > -
> >
> > OK then.
> > Can you check if just the underlying SELECTs are run (without INSERT), is
> > there any performance degradation when compared to a non-parallel scan?
>
> It seems there is no performance degradation without insert.
>
> So far, what I have found is that:
> With Tang's conf, when doing parallel insert, more WAL records are generated
> than with serial insert
> (IMO, this is the main reason why it has performance degradation).
> See the attachment for the plan info.
>
> I have tried altering the target table to unlogged and
> then the performance degradation does not happen any more.
>
> And the additional WAL records seem related to the index on the target table.
>

I think you might want to see which exact WAL records are extra by
using pg_waldump?

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] Custom compression methods

2021-02-09 Thread Dilip Kumar
On Tue, Feb 9, 2021 at 2:08 PM Dilip Kumar  wrote:
>
> On Sun, Feb 7, 2021 at 5:15 PM Dilip Kumar  wrote:
> >
> > On Fri, Feb 5, 2021 at 8:11 PM Dilip Kumar  wrote:
> > >
> > > On Wed, Feb 3, 2021 at 2:07 AM Robert Haas  wrote:
> > > >
> > > > Even more review comments, still looking mostly at 0001:
> > > >
> > > > If there's a reason why parallel_schedule is arranging to run the
> > > > compression test in parallel with nothing else, the comment in that
> > > > file should explain the reason. If there isn't, it should be added to
> > > > a parallel group that doesn't have the maximum number of tests yet,
> > > > probably the last such group in the file.
> > > >
> > > > serial_schedule should add the test in a position that roughly
> > > > corresponds to where it appears in parallel_schedule.
> > >
> > > Done
> > >
> > > > I believe it's relatively standard practice to put variable
> > > > declarations at the top of the file. compress_lz4.c and
> > > > compress_pglz.c instead put those declarations nearer to the point of
> > > > use.
> > >
> > > Do you mean pglz_compress_methods and lz4_compress_methods ? I
> > > followed that style from
> > > heapam_handler.c.  If you think that doesn't look good then I can move it 
> > > up.
> > >
> > > > compressamapi.c has an awful lot of #include directives for the code
> > > > it actually contains. I believe that we should cut that down to what
> > > > is required by 0001, and other patches can add more later as required.
> > > > In fact, it's tempting to just get rid of this .c file altogether and
> > > > make the two functions it contains static inline functions in the
> > > > header, but I'm not 100% sure that's a good idea.
> > >
> > > I think it looks better to move them to compressamapi.h so done that.
> > >
> > > > The copyright dates in a number of the file headers are out of date.
> > >
> > > Fixed
> > >
> > > > binary_upgrade_next_pg_am_oid and the related changes to
> > > > CreateAccessMethod don't belong in 0001, because it doesn't support
> > > > non-built-in compression methods. These changes and the related
> > > > pg_dump change should be moved to the patch that adds support for
> > > > that.
> > >
> > > Fixed
> > >
> > > > The comments added to dumpTableSchema() say that "compression is
> > > > assigned by ALTER" but don't give a reason. I think they should. I
> > > > don't know how much they need to explain about what the code does, but
> > > > they definitely need to explain why it does it. Also, isn't this bad?
> > > > If we create the column with the wrong compression setting initially
> > > > and then ALTER it, we have to rewrite the table. If it's empty, that's
> > > > cheap, but it'd still be better not to do it at all.
> > >
> > > Yeah, actually that part should go in 0003 patch where we implement
> > > the custom compression method.
> > > in that patch we need to alter and set because we want to keep the
> > > preserved method as well
> > > So I will add it there
> > >
> > > > I'm not sure it's a good idea for dumpTableSchema() to leave out
> > > > specifying the compression method if it happens to be pglz. I think we
> > > > definitely shouldn't do it in binary-upgrade mode. What if we changed
> > > > the default in a future release? For that matter, even 0002 could make
> > > > the current approach unsafe I think, anyway.
> > >
> > > Fixed
> > >
> > >
> > > > The changes to pg_dump.h look like they haven't had a visit from
> > > > pgindent. You should probably try to do that for the whole patch,
> > > > though it's a bit annoying since you'll have to manually remove
> > > > unrelated changes to the same files that are being modified by the
> > > > patch. Also, why the extra blank line here?
> > >
> > > Fixed, ran pgindent for other files as well.
> > >
> > > > GetAttributeCompression() is hard to understand. I suggest changing
> > > > the comment to "resolve column compression specification to an OID"
> > > > and somehow rejigger the code so that you aren't using one not-NULL
> > > > test and one NULL test on the same variable. Like maybe change the
> > > > first part to if (!IsStorageCompressible(typstorage)) { if
> > > > (compression == NULL) return InvalidOid; ereport(ERROR, ...); }
> > >
> > > Done
> > >
> > > > It puzzles me that CompareCompressionMethodAndDecompress() calls
> > > > slot_getallattrs() just before clearing the slot. It seems like this
> > > > ought to happen before we loop over the attributes, so that we don't
> > > > need to call slot_getattr() every time. See the comment for that
> > > > function. But even if we didn't do that for some reason, why would we
> > > > do it here? If it's already been done, it shouldn't do anything, and
> > > > if it hasn't been done, it might overwrite some of the values we just
> > > > poked into tts_values. It also seems suspicious that we can get away
> > > > with clearing the slot and then again marking it valid. I'm not sure
> > > > it really works like that. Like, can't clearin

Re: Improvements and additions to COPY progress reporting

2021-02-09 Thread Matthias van de Meent
On Tue, 9 Feb 2021 at 08:12, Bharath Rupireddy
 wrote:
>
> On Tue, Feb 9, 2021 at 12:06 AM Matthias van de Meent
>  wrote:
> > With [0] we got COPY progress reporting. Before the column names of
> > this newly added view are effectively set in stone with the release of
> > pg14, I propose the following set of relatively small patches. These
> > are v2, because it is a patchset that is based on a set of patches
> > that I previously posted in [0].
>
> Thanks for working on the patches. Here are some comments:
>
> 0001 - +1 to add tuples_excluded and the patch LGTM.
>
> 0002 - Yes, the tuples_processed or tuples_excluded makes more sense
> to me than lines_processed and lines_excluded. The patch LGTM.
>
> 0003 - Instead of just adding the progress reporting to "See also"
> sections in the footer of the respective pages analyze, cluster and
> others, it would be nice if we have a mention of it in the description
> as pg_basebackup has something like below:
>  
>Whenever pg_basebackup is taking a base
>backup, the server's pg_stat_progress_basebackup
>view will report the progress of the backup.
>See  for details.

Added

> 0004 -
> 1) How about PROGRESS_COPY_COMMAND_TYPE instead of
> PROGRESS_COPY_COMMAND? The names looks bit confusing with the existing
> PROGRESS_COMMAND_COPY.

The current name is consistent with the naming of the other
command-reporting progress views; CREATEIDX and CLUSTER both use
*_COMMAND as this column's internal name.

> 0005 -
> 1) How about
> +   or CALLBACK (used in the table
> synchronization background
> +   worker).
> instead of
> +   or CALLBACK (used in the tablesync background
> +   worker).
> Because "table synchronization" is being used in logical-replication.sgml.

Fixed

> 2) I think cstate->copy_src = COPY_CALLBACK is assigned after the
> switch case added in copyfrom.c
> if (data_source_cb)
> {
> cstate->copy_src = COPY_CALLBACK;
> cstate->data_source_cb = data_source_cb;
> }

Yes, I noticed this too while working on the patchset, but apparently
didn't act on this... Fixed in the attached version.

> Also, you can add this to the current commitfest.

See https://commitfest.postgresql.org/32/2977/

On Tue, 9 Feb 2021 at 12:53, Josef Šimánek  wrote:
>
> OK, would you mind to integrate my regression test initial patch as
> well in v3 or should I submit it later in a separate way?

Attached, with minor fixes


With regards,

Matthias van de Meent
From 720f58890c2910bba7829c380ccb5a022596a6c8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Josef=20=C5=A0im=C3=A1nek?= 
Date: Tue, 9 Feb 2021 13:06:45 +0100
Subject: [PATCH v3 6/6] Add progress reporting regression tests

This tests some basic features of copy progress reporting, as this is
possible to implement in one session using triggers. Other views may follow.
---
 src/test/regress/expected/progress.out | 47 ++
 src/test/regress/parallel_schedule |  2 +-
 src/test/regress/sql/progress.sql  | 43 +++
 3 files changed, 91 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/progress.out
 create mode 100644 src/test/regress/sql/progress.sql

diff --git a/src/test/regress/expected/progress.out b/src/test/regress/expected/progress.out
new file mode 100644
index 00..8feed9df42
--- /dev/null
+++ b/src/test/regress/expected/progress.out
@@ -0,0 +1,47 @@
+-- setup for COPY progress testing
+CREATE TEMP TABLE test_progress_with_trigger (
+  a int,
+  b text
+) ;
+CREATE function notice_after_progress_reporting() RETURNS trigger AS
+$$
+DECLARE report record;
+BEGIN
+  SELECT INTO report * FROM pg_stat_progress_copy report WHERE pid = pg_backend_pid();
+  raise info 'progress datname: %', report.datname;
+  raise info 'progress command: %', report.command;
+  raise info 'progress io_target: %', report.io_target;
+  raise info 'progress bytes_processed: %', report.bytes_processed;
+  raise info 'progress bytes_total: %', report.bytes_total;
+  raise info 'progress tuples_processed: %', report.tuples_processed;
+  raise info 'progress tuples_excluded: %', report.tuples_excluded;
+
+  RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER check_after_progress_reporting
+AFTER INSERT ON test_progress_with_trigger
+FOR EACH ROW
+EXECUTE FUNCTION notice_after_progress_reporting();
+-- simple COPY from STDIN
+COPY test_progress_with_trigger (a, b) FROM STDIN;
+INFO:  progress datname: regression
+INFO:  progress command: COPY FROM
+INFO:  progress io_target: STDIO
+INFO:  progress bytes_processed: 9
+INFO:  progress bytes_total: 0
+INFO:  progress tuples_processed: 1
+INFO:  progress tuples_excluded: 0
+-- COPY from STDIN with WHERE skipping lines
+COPY test_progress_with_trigger (a, b) FROM STDIN WHERE a > 1;
+INFO:  progress datname: regression
+INFO:  progress command: COPY FROM
+INFO:  progress io_target: STDIO
+INFO:  progress bytes_processed: 18
+INFO:  progress bytes_total: 0
+INFO

Re: TRUNCATE on foreign table

2021-02-09 Thread Ashutosh Bapat
On Tue, Feb 9, 2021 at 5:49 PM Bharath Rupireddy
 wrote:
>
> On Tue, Feb 9, 2021 at 5:31 PM Ashutosh Bapat
>  wrote:
> > Why would one want to truncate a foreign table instead of truncating
> > actual table wherever it is?
>
> I think when the deletion on foreign tables (which actually deletes
> rows from the remote table?) is allowed, it does make sense to have a
> way to truncate the remote table via foreign table. Also, it can avoid
> going to each and every remote server and doing the truncation
> instead.

DELETE is very different from TRUNCATE. Application may want to DELETE
based on a join with a local table and hence it can not be executed on
a foreign server. That's not true with TRUNCATE.

-- 
Best Wishes,
Ashutosh Bapat




Re: pg_replication_origin_drop API potential race condition

2021-02-09 Thread Alvaro Herrera
On 2021-Feb-09, Amit Kapila wrote:

> On Tue, Feb 9, 2021 at 4:16 PM Alvaro Herrera  wrote:

> > By all means let's get the bug fixed.
> 
> I am planning to push this in HEAD only as there is no user reported
> problem and this is actually more about giving correct information to
> the user rather than some misleading message. Do you see any need to
> back-patch this change?

master-only sounds OK.

-- 
Álvaro Herrera   Valdivia, Chile
"Cuando mañana llegue pelearemos segun lo que mañana exija" (Mowgli)




Re: TRUNCATE on foreign table

2021-02-09 Thread Bharath Rupireddy
On Tue, Feb 9, 2021 at 5:31 PM Ashutosh Bapat
 wrote:
> Why would one want to truncate a foreign table instead of truncating
> actual table wherever it is?

I think when the deletion on foreign tables (which actually deletes
rows from the remote table?) is allowed, it does make sense to have a
way to truncate the remote table via foreign table. Also, it can avoid
going to each and every remote server and doing the truncation
instead.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




RE: Parallel INSERT (INTO ... SELECT ...)

2021-02-09 Thread Hou, Zhijie
> > > postgres=# explain verbose insert into testscan select a from x
> > > where
> > > a<8 or (a%2=0 and a>19990);
> > > QUERY PLAN
> > >
> 
> > > --
> > > -
> > >  Gather  (cost=4346.89..1281204.64 rows=81372 width=0)
> > >Workers Planned: 4
> > >->  Insert on public.testscan  (cost=3346.89..1272067.44 rows=0
> > > width=0)
> > >  ->  Parallel Bitmap Heap Scan on public.x1
> > > (cost=3346.89..1272067.44 rows=20343 width=8)
> > >Output: x1.a, NULL::integer
> > >Recheck Cond: ((x1.a < 8) OR (x1.a > 19990))
> > >Filter: ((x1.a < 8) OR (((x1.a % 2) = 0) AND
> > > (x1.a >
> > > 19990)))
> > >->  BitmapOr  (cost=3346.89..3346.89 rows=178808
> > > width=0)
> > >  ->  Bitmap Index Scan on x1_a_idx
> > > (cost=0.00..1495.19 rows=80883 width=0)
> > >Index Cond: (x1.a < 8)
> > >  ->  Bitmap Index Scan on x1_a_idx
> > > (cost=0.00..1811.01 rows=97925 width=0)
> > >Index Cond: (x1.a > 19990)
> > >
> > > PSA is my postgresql.conf file, maybe you can have a look. Besides,
> > > I didn't do any parameters tuning in my test session.
> >
> > I reproduced this on my machine.
> >
> > I think we'd better do "analyze" before insert which helps reproduce this
> easier.
> > Like:
> >
> > -
> > analyze;
> > explain analyze verbose insert into testscan select a from x where
> > a<8 or (a%2=0 and a>19990);
> > -
> 
> OK then.
> Can you check if just the underlying SELECTs are run (without INSERT), is
> there any performance degradation when compared to a non-parallel scan?

It seems there is no performance degradation without insert.

So far, what I have found is that:
With Tang's conf, when doing parallel insert, more WAL records are generated
than with serial insert
(IMO, this is the main reason why it has performance degradation).
See the attachment for the plan info.

I have tried altering the target table to unlogged and
then the performance degradation does not happen any more.

And the additional WAL records seem related to the index on the target table.
If the target table does not have any index, the WAL records are the same between
the parallel plan and the serial plan.
Also, it does not have performance degradation without the index.

I am still looking at this problem; if someone can think of something about it,
I'd be very grateful if you could share it with me.

Best regards,
houzj





postgres=# explain (ANALYZE, BUFFERS, VERBOSE, WAL)  insert into testscan 
select a from x where a<8 or (a%2=0 and a>19990);
   QUERY PLAN

 Insert on public.testscan  (cost=3272.20..3652841.26 rows=0 width=0) (actual 
time=360.474..360.476 rows=0 loops=1)
   Buffers: shared hit=392569 read=3 dirtied=934 written=933
   WAL: records=260354 bytes=16259841
   ->  Bitmap Heap Scan on public.x  (cost=3272.20..3652841.26 rows=79918 
width=8) (actual time=8.096..41.005 rows=12 loops=1)
 Output: x.a, NULL::integer
 Recheck Cond: ((x.a < 8) OR (x.a > 19990))
 Filter: ((x.a < 8) OR (((x.a % 2) = 0) AND (x.a > 19990)))
 Rows Removed by Filter: 5
 Heap Blocks: exact=975
 Buffers: shared hit=1475
 ->  BitmapOr  (cost=3272.20..3272.20 rows=174813 width=0) (actual 
time=7.975..7.976 rows=0 loops=1)
   Buffers: shared hit=500
   ->  Bitmap Index Scan on x_a_idx  (cost=0.00..1468.38 rows=79441 
width=0) (actual time=3.469..3.470 rows=7 loops=1)
 Index Cond: (x.a < 8)
 Buffers: shared hit=222
   ->  Bitmap Index Scan on x_a_idx  (cost=0.00..1763.86 rows=95372 
width=0) (actual time=4.499..4.499 rows=10 loops=1)
 Index Cond: (x.a > 19990)
 Buffers: shared hit=278
 Planning:
   Buffers: shared hit=10
 Planning Time: 0.126 ms
 Execution Time: 360.547 ms
(22 rows)
postgres=# explain (ANALYZE, BUFFERS, VERBOSE, WAL)  insert into testscan 
select a from x where a<8 or (a%2=0 and a>19990);
   QUERY PLAN
-
 Gather  (cost=4272.20..1269111.15 rows=79918 width=0) (actual 
time=381.764..382.715 rows=0 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   Buffers: shared hit=407094 read=4 dirtied=1085 written=1158
   WAL: records=260498 bytes=17019359
   ->  Insert on public.testscan  (cost=3272.20..1260119.35 rows=0 widt

Re: TRUNCATE on foreign table

2021-02-09 Thread Ashutosh Bapat
IIUC, "truncatable" would be set to "false" for relations which do not
have physical storage, e.g. views, but will be true for regular tables.
When we are importing a schema we need to set "truncatable"
appropriately. Is that something we will support with this patch?
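
For example, something like the following (hypothetical syntax: the server and
table names are made up, and this assumes the patch exposes "truncatable" as a
postgres_fdw option, analogous to the existing "updatable" option):

ALTER SERVER loopback OPTIONS (ADD truncatable 'true');
ALTER FOREIGN TABLE ft_view OPTIONS (ADD truncatable 'false');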

Why would one want to truncate a foreign table instead of truncating the
actual table wherever it is?

On Sun, Feb 7, 2021 at 6:06 PM Kazutaka Onishi  wrote:
>
> Thank you for your comment! :D
> I have fixed it and attached the revised patch.
>
> regards,
>
>
>
> 2021年2月7日(日) 2:08 Zhihong Yu :
>>
>> Hi,
>> +   if (strcmp(defel->defname, "truncatable") == 0)
>> +   server_truncatable = defGetBoolean(defel);
>>
>> Looks like we can break out of the loop when the condition is met.
>>
>> +   /* ExecForeignTruncate() is invoked for each server */
>>
>> The method name in the comment is slightly different from the actual method 
>> name.
>>
>> +   if (strcmp(defel->defname, "truncatable") == 0)
>> +   truncatable = defGetBoolean(defel);
>>
>> We can break out of the loop when the condition is met.
>>
>> Cheers
>>
>> On Sat, Feb 6, 2021 at 5:11 AM Kazutaka Onishi  wrote:
>>>
>>> Hello,
>>>
>>> The attached patch is for supporting "TRUNCATE" on  foreign tables.
>>>
>>> This patch includes:
>>> * Adding "ExecForeignTruncate" function into FdwRoutine.
>>> * Enabling "postgres_fdw" to use TRUNCATE.
>>>
>>> This patch was proposed by Kaigai-san in March 2020,
>>> but it was returned because it could not be applied to the latest source code.
>>>
>>> Please refer to the discussion.
>>> https://www.postgresql.org/message-id/flat/CAOP8fzb-t3WVNLjGMC%2B4sV4AZa9S%3DMAQ7Q6pQoADMCf_1jp4ew%40mail.gmail.com#3b6c6ff85ff5c722b36c7a09b2dd7165
>>>
>>> I have fixed the patch in order to submit it to Commit Fest 2021-03.
>>>
>>> regards,
>>>
>>> --
>>> --
>>> Kazutaka Onishi
>>> (oni...@heterodb.com)
>
>
>
> --
> --
> Kazutaka Onishi
> (oni...@heterodb.com)
>
>
> --
> --
> Kazutaka Onishi
> (oni...@heterodb.com)



-- 
Best Wishes,
Ashutosh Bapat




Re: repeated decoding of prepared transactions

2021-02-09 Thread Amit Kapila
On Tue, Feb 9, 2021 at 11:29 AM Ashutosh Bapat
 wrote:
>
> On Tue, Feb 9, 2021 at 8:32 AM Amit Kapila  wrote:
> >
> > On Mon, Feb 8, 2021 at 8:36 PM Markus Wanner
> >  wrote:
> > >
> > > Hello Amit,
> > >
> > > thanks for your very quick response.
> > >
> > > On 08.02.21 11:13, Amit Kapila wrote:
> > > > /*
> > > >   * It is possible that this transaction is not decoded at prepare time
> > > >   * either because by that time we didn't have a consistent snapshot or 
> > > > it
> > > >   * was decoded earlier but we have restarted. We can't distinguish 
> > > > between
> > > >   * those two cases so we send the prepare in both the cases and let
> > > >   * downstream decide whether to process or skip it. We don't need to
> > > >   * decode the xact for aborts if it is not done already.
> > > >   */
> > >
> > > The way I read the surrounding code, the only case a 2PC transaction
> > > does not get decoded a prepare time is if the transaction is empty.  Or
> > > are you aware of any other situation that might currently happen?
> > >
> >
> > We also skip decoding at prepare time if we haven't reached a
> > consistent snapshot by that time. See below code in DecodePrepare().
> > DecodePrepare()
> > {
> > ..
> > /* We can't start streaming unless a consistent state is reached. */
> > if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT)
> > {
> > ReorderBufferSkipPrepare(ctx->reorder, xid);
> > return;
> > }
> > ..
> > }
>
> Can you please provide steps which can lead to this situation?
>

Ajin has already shared the example with you.

> If
> there is an earlier discussion which has example scenarios, please
> point us to the relevant thread.
>

It started in the email [1] and from there you can read later emails
to know more about this.

> If we are not sending PREPARED transactions that's fine,
>

Hmm, I am not sure if that is fine because if the output plugin sets
the two-phase-commit option, it would expect all prepared xacts to
arrive, not only some of them.

> but sending
> the same prepared transaction as many times as the WAL sender is
> restarted between sending prepare and commit prepared is a waste of
> network bandwidth.
>

I think something similar happens without any of the work done in PG-14 as
well, if we restart the apply worker before the commit completes on the
subscriber. After the restart, we will send the start_decoding_at
point based on some previous commit, which will make the publisher send the
entire transaction again. I don't think a restart of the WAL sender or WAL
receiver is such a common thing. It can only happen due to some bug in the
code, or because the user wishes to stop the nodes, or because of a crash.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2Bd3gzCyzsYjt1m6sfGf_C_uFmo9JK%3D3Wafp6yR8Mg8uQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Improvements and additions to COPY progress reporting

2021-02-09 Thread 0010203112132233
On Tue, 9 Feb 2021 at 09:32, Josef Šimánek  wrote:
>
> po 8. 2. 2021 v 19:35 odesílatel Matthias van de Meent
>  napsal:
> > Lastly, 0005 adds 'io_target' to the reported information, that is,
> > FILE, PROGRAM, STDIO or CALLBACK. Although this can relatively easily
> > be determined based on the commands in pg_stat_activity, it is
> > reasonably something that a user would want to query on, as the
> > origin/target of COPY has security and performance implications,
> > whereas other options (e.g. format) are less interesting for clients
> > that are not executing that specific COPY command.
>
> I took a little deeper look and I'm not sure if I understand FILE and
> STDIO. I have finally tried to finalize some initial regress testing
> of COPY command progress using triggers. I have attached the initial
> patch  applicable to your changes. As you can see COPY FROM STDIN is
> reported as FILE. That's probably expected, but it is a little
> confusing for me since STDIN and STDIO sound similar. What is the
> purpose of STDIO? When is the COPY command reported with io_target of
> STDIO?

I checked for the type of the copy_src before it was correctly set,
therefore only reporting FILE type, but this will be fixed shortly in
v3.
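
Once that is fixed, io_target should be straightforward to query, e.g.
(io_target is the column name proposed in 0005; the other columns already
exist in pg_stat_progress_copy):

SELECT pid, relid, io_target, bytes_processed FROM pg_stat_progress_copy;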

Matthias




Re: Improvements and additions to COPY progress reporting

2021-02-09 Thread Josef Šimánek
út 9. 2. 2021 v 12:51 odesílatel 0010203112132233  napsal:
>
> On Tue, 9 Feb 2021 at 09:32, Josef Šimánek  wrote:
> >
> > po 8. 2. 2021 v 19:35 odesílatel Matthias van de Meent
> >  napsal:
> > > Lastly, 0005 adds 'io_target' to the reported information, that is,
> > > FILE, PROGRAM, STDIO or CALLBACK. Although this can relatively easily
> > > be determined based on the commands in pg_stat_activity, it is
> > > reasonably something that a user would want to query on, as the
> > > origin/target of COPY has security and performance implications,
> > > whereas other options (e.g. format) are less interesting for clients
> > > that are not executing that specific COPY command.
> >
> > I took a little deeper look and I'm not sure if I understand FILE and
> > STDIO. I have finally tried to finalize some initial regress testing
> > of COPY command progress using triggers. I have attached the initial
> > patch  applicable to your changes. As you can see COPY FROM STDIN is
> > reported as FILE. That's probably expected, but it is a little
> > confusing for me since STDIN and STDIO sound similar. What is the
> > purpose of STDIO? When is the COPY command reported with io_target of
> > STDIO?
>
> I checked for the type of the copy_src before it was correctly set,
> therefore only reporting FILE type, but this will be fixed shortly in
> v3.

OK, would you mind integrating my initial regression test patch in v3 as
well, or should I submit it later separately?

> Matthias




Re: Clean up code

2021-02-09 Thread Heikki Linnakangas

On 09/02/2021 00:09, CK Tan wrote:
Hi, can someone point me to the code that cleans up temp files should a 
query crash unexpectedly? Thanks!


Autovacuum does that. Search for "orphan" in do_autovacuum() function.

- Heikki




Re: Single transaction in the tablesync worker?

2021-02-09 Thread Amit Kapila
On Tue, Feb 9, 2021 at 12:02 PM Peter Smith  wrote:
>
> Here are my feedback comments for the V29 patch.
>
> 
>
> FILE: logical-replication.sgml
>
> +slots have generated names:
> pg_%u_sync_%u_%llu
> +(parameters: Subscription oid,
> +Table relid, system
> identifiersysid)
> +   
>
> 1.
> There is a missing space before the sysid parameter.
>
> =
>
> FILE: subscriptioncmds.c
>
> + * SUBREL_STATE_FINISHEDCOPY. The apply worker can also
> + * concurrently try to drop the origin and by this time the
> + * origin might be already removed. For these reasons,
> + * passing missing_ok = true from here.
> + */
> + snprintf(originname, sizeof(originname), "pg_%u_%u", sub->oid, relid);
> + replorigin_drop_by_name(originname, true, false);
> + }
>
> 2.
> Don't really need to say "from here".
> (same comment applies multiple places, in this file and in tablesync.c)
>
> 3.
> Previously the tablesync origin name format was encapsulated in a
> common function. IMO it was cleaner/safer how it was before, instead
> of the same "pg_%u_%u" cut/paste and scattered in many places.
> (same comment applies multiple places, in this file and in tablesync.c)
>

Fixed all the three above comments.

> 4.
> Calls like replorigin_drop_by_name(originname, true, false); make it
> unnecessarily hard to read code when the boolean params are neither
> named as variables nor commented. I noticed on another thread [et0205]
> there was an idea that having no name/comments is fine because anyway
> it is not difficult to figure out when using a "modern IDE", but since
> my review tools are only "vi" and "meld" I beg to differ with that
> justification.
> (same comment applies multiple places, in this file and in tablesync.c)
>

Already responded to it separately. I went ahead and removed such
comments from other places in the patch.

> [et0205] 
> https://www.postgresql.org/message-id/c1d9833f-eeeb-40d5-89ba-87674e1b7ba3%40www.fastmail.com
>
> =
>
> FILE: tablesync.c
>
> 5.
> Previously there was a function tablesync_replorigin_drop which was
> encapsulating the tablesync origin name formatting. I thought that was
> better than the V29 code which now has the same formatting scattered
> over many places.
> (same comment applies for worker_internal.h)
>

I am not sure what you are expecting here that is different from point-3?

> + * Determine the tablesync slot name.
> + *
> + * The name must not exceed NAMEDATALEN - 1 because of remote node 
> constraints
> + * on slot name length. We do append system_identifier to avoid slot_name
> + * collision with subscriptions in other clusters. With current scheme
> + * pg_%u_sync_%u_UINT64_FORMAT (3 + 10 + 6 + 10 + 20 + '\0'), the maximum
> + * length of slot_name will be 50.
> + *
> + * The returned slot name is either:
> + * - stored in the supplied buffer (syncslotname), or
> + * - palloc'ed in current memory context (if syncslotname = NULL).
> + *
> + * Note: We don't use the subscription slot name as part of tablesync slot 
> name
> + * because we are responsible for cleaning up these slots and it could become
> + * impossible to recalculate what name to cleanup if the subscription slot 
> name
> + * had changed.
> + */
> +char *
> +ReplicationSlotNameForTablesync(Oid suboid, Oid relid, char
> syncslotname[NAMEDATALEN])
> +{
> + if (syncslotname)
> + sprintf(syncslotname, "pg_%u_sync_%u_" UINT64_FORMAT, suboid, relid,
> + GetSystemIdentifier());
> + else
> + syncslotname = psprintf("pg_%u_sync_%u_" UINT64_FORMAT, suboid, relid,
> + GetSystemIdentifier());
> +
> + return syncslotname;
> +}
>
> 6.
> "We do append" --> "We append"
> "With current scheme" -> "With the current scheme"
>

Fixed.

> 7.
> Maybe consider just assigning GetSystemIdentifier() to a static
> instead of calling that function for every slot?
> static uint64 sysid = GetSystemIdentifier();
> IIUC the sysid value is never going to change for a process, right?
>

Already responded.

> FILE: alter_subscription.sgml
>
> 8.
> +  
> +   Commands ALTER SUBSCRIPTION ... REFRESH .. and
> +   ALTER SUBSCRIPTION ... SET PUBLICATION .. with refresh
> +   option as true cannot be executed inside a transaction block.
> +  
>
> My guess is those two lots of double dots ("..") were probably meant
> to be ellipsis ("...")
>

Fixed; for the first one I completed the command by adding PUBLICATION.

>
> When looking at the DropSubscription code I noticed that there is a
> small difference between the HEAD code and the V29 code when slot_name
> = NONE.
>
> HEAD does
> --
> if (!slotname)
> {
> table_close(rel, NoLock);
> return;
> }
> --
>
> V29 does
> --
> if (!slotname)
> {
> /* be tidy */
> list_free(rstates);
> return;
> }
> --
>
> Isn't the V29 code missing a table_close(rel, NoLock) call there?
>

Yes, good catch. Fixed.

-- 
With Regards,
Amit Kapila.


v30-0001-Make-pg_replication_origin_drop-safe-against-con.patch

Re: pg_replication_origin_drop API potential race condition

2021-02-09 Thread Amit Kapila
On Tue, Feb 9, 2021 at 4:16 PM Alvaro Herrera  wrote:
>
> On 2021-Feb-09, Amit Kapila wrote:
>
> > Now, we can do this optimization if we want, but I am not sure if
> > origin_drop would be a frequent enough operation to justify adding such an
> > optimization. For now, I have added a note in the comments so that if
> > we find any such use case we can implement such an optimization in the
> > future. What do you think?
>
> By all means let's get the bug fixed.
>

I am planning to push this to HEAD only, as there is no user-reported
problem and this is actually more about giving correct information to
the user rather than some misleading message. Do you see any need to
back-patch this change?

-- 
With Regards,
Amit Kapila.




Re: libpq debug log

2021-02-09 Thread Alvaro Herrera
On 2021-Feb-09, Kyotaro Horiguchi wrote:

> This looks like a fusion of PQtrace and PQtraceEX. By the way, the
> timestamp flag is needed at log emission. So we can change the state
> anytime.
> 
> PQtrace(conn, of);
> PQtraceSetFlags(conn, PQTRACE_SUPPRESS_TIMESTAMPS);

I like this idea better than PQtraceEx().

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: libpq debug log

2021-02-09 Thread 'Alvaro Herrera'
On 2021-Feb-04, tsunakawa.ta...@fujitsu.com wrote:

> From: 'Alvaro Herrera' 
> > > (41)
> > > +void
> > > +PQtraceEx(PGconn *conn, FILE *debug_port, int flags)
> > > +{
> > 
> > I'm not really sure about making this a separate API call. We could just
> > make it PQtrace() and increment the libpq so version.  I don't think
> > it's a big deal, frankly.
> 
> If we change the function signature, we have to bump the so major version and 
> thus the soname.  libpq's current so major version is 5, which hasn't been 
> changed since 2006.  I'm hesitant to change it for this feature.  If you 
> think we can bump the version to 6, I think we can go ahead.

I think it's pretty clear we would not change the so-version for this
feature.


-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: pg_replication_origin_drop API potential race condition

2021-02-09 Thread Alvaro Herrera
On 2021-Feb-09, Amit Kapila wrote:

> > IIUC, you are suggesting to use lock for the particular origin instead
> > of locking the corresponding catalog table in functions
> > pg_replication_origin_advance and replorigin_drop_by_name.

Right.

> I think it won't be that straightforward, as we don't have the origin_id.
> So what we instead need to do is first acquire a lock on
> ReplicationOriginRelationId, get the origin_id, lock the specific
> origin, and then re-check if the origin still exists. I feel some
> similar changes might be required in pg_replication_origin_advance.

Hmm, ok.

> Now, we can do this optimization if we want, but I am not sure if
> origin_drop would be a frequent enough operation to justify adding such an
> optimization. For now, I have added a note in the comments so that if
> we find any such use case we can implement such an optimization in the
> future. What do you think?

By all means let's get the bug fixed.  Then, in another patch, we can
optimize further, if there really is a problem.

-- 
Álvaro Herrera39°49'30"S 73°17'W
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)




Re: adding wait_start column to pg_locks

2021-02-09 Thread Fujii Masao




On 2021/02/09 18:13, Fujii Masao wrote:



On 2021/02/09 17:48, torikoshia wrote:

On 2021-02-05 18:49, Fujii Masao wrote:

On 2021/02/05 0:03, torikoshia wrote:

On 2021-02-03 11:23, Fujii Masao wrote:

64-bit fetches are not atomic on some platforms. So spinlock is necessary when updating 
"waitStart" without holding the partition lock? Also GetLockStatusData() needs spinlock 
when reading "waitStart"?


Also it might be worth thinking to use 64-bit atomic operations like
pg_atomic_read_u64(), for that.


Thanks for your suggestion and advice!

In the attached patch I used pg_atomic_read_u64() and pg_atomic_write_u64().

waitStart is TimestampTz, i.e., int64, but it seems pg_atomic_read_xxx and 
pg_atomic_write_xxx only support unsigned integers, so I cast the type.

I may not be using these functions correctly, so if something is wrong, I would 
appreciate any comments.


About the documentation, since your suggestion seems better than v6, I used it 
as is.


Thanks for updating the patch!

+    if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &now));

pg_atomic_read_u64() is really necessary? I think that
"pg_atomic_write_u64(&MyProc->waitStart, now)" is enough.

+    deadlockStart = get_timeout_start_time(DEADLOCK_TIMEOUT);
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &deadlockStart));

Same as above.

+    /*
+ * Record waitStart reusing the deadlock timeout timer.
+ *
+ * It would be ideal this can be synchronously done with updating
+ * lock information. However, since it gives performance impacts
+ * to hold partitionLock longer time, we do it here asynchronously.
+ */

IMO it's better to comment why we reuse the deadlock timeout timer.

 proc->waitStatus = waitStatus;
+    pg_atomic_init_u64(&MyProc->waitStart, 0);

pg_atomic_write_u64() should be used instead? Because waitStart can be
accessed concurrently there.

I updated the patch and addressed the above review comments. Patch attached.
Barring any objection, I will commit this version.


Thanks for modifying the patch!
I agree with your comments.

BTW, I ran pgbench several times before and after applying
this patch.

The environment is a virtual machine (CentOS 8), so this is
just for reference, but there was no significant difference
in latency or tps (both are below 1%).


Thanks for the test! I pushed the patch.


But I reverted the patch because buildfarm members rorqual and
prion don't like the patch. I'm trying to investigate the cause
of these failures.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2021-02-09%2009%3A20%3A10
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-09%2009%3A13%3A16

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Single transaction in the tablesync worker?

2021-02-09 Thread Amit Kapila
On Tue, Feb 9, 2021 at 1:37 PM Peter Smith  wrote:
>
> Looking at the V29 style tablesync slot names now they appear like this:
>
> WARNING:  could not drop tablesync replication slot
> "pg_16397_sync_16389_6927117142022745645"
> That is in the order subid +  relid + sysid
>
> Now that I see it in a message it seems a bit strange with the sysid
> just tacked onto the end like that.
>
> I am wondering if reordering of parent to child might be more natural.
> e.g sysid + subid + relid gives a more intuitive name IMO.
>
> So in this example it would be "pg_sync_6927117142022745645_16397_16389"
>

I have kept the order based on the importance of each parameter. Say
when the user sees this message in the server log of the subscriber,
either for the purpose of tracking the origin's progress or for errors,
the sysid parameter won't be of much use and they will mostly be
looking at subid and relid. OTOH, if for some reason this parameter
appears in the publisher logs, then sysid might be helpful.

Petr, anyone else, do you have any opinion on this matter?

-- 
With Regards,
Amit Kapila.




Re: Single transaction in the tablesync worker?

2021-02-09 Thread Amit Kapila
On Tue, Feb 9, 2021 at 12:02 PM Peter Smith  wrote:
>
> Here are my feedback comments for the V29 patch.
>

Thanks.

>
> 3.
> Previously the tablesync origin name format was encapsulated in a
> common function. IMO it was cleaner/safer how it was before, instead
> of the same "pg_%u_%u" cut/paste and scattered in many places.
> (same comment applies multiple places, in this file and in tablesync.c)
>
> 4.
> Calls like replorigin_drop_by_name(originname, true, false); make it
> unnecessarily hard to read code when the boolean params are neither
> named as variables nor commented. I noticed on another thread [et0205]
> there was an idea that having no name/comments is fine because anyway
> it is not difficult to figure out when using a "modern IDE", but since
> my review tools are only "vi" and "meld" I beg to differ with that
> justification.
> (same comment applies multiple places, in this file and in tablesync.c)
>

It would be a bit more convenient for you, but for most others, I think it
would be noise. Personally, I find the code more readable without such
name comments; they just break the flow of the code unless you want to
study the value of each param in detail.

> [et0205] 
> https://www.postgresql.org/message-id/c1d9833f-eeeb-40d5-89ba-87674e1b7ba3%40www.fastmail.com
>
> =
>
> FILE: tablesync.c
>
> 5.
> Previously there was a function tablesync_replorigin_drop which was
> encapsulating the tablesync origin name formatting. I thought that was
> better than the V29 code which now has the same formatting scattered
> over many places.
> (same comment applies for worker_internal.h)
>

Isn't this the same as what you want to say in point-3?

>
> 7.
> Maybe consider just assigning GetSystemIdentifier() to a static
> instead of calling that function for every slot?
> static uint64 sysid = GetSystemIdentifier();
> IIUC the sysid value is never going to change for a process, right?
>

That's right, but I am not sure if there is much value in saving one
call here by introducing an extra variable.

I'll fix other comments raised by you.

-- 
With Regards,
Amit Kapila.




Re: Support for NSS as a libpq TLS backend

2021-02-09 Thread Daniel Gustafsson
> On 9 Feb 2021, at 07:47, Michael Paquier  wrote:
> 
> On Tue, Feb 09, 2021 at 12:08:37AM +0100, Daniel Gustafsson wrote:
>> Attached is a new patchset where I've tried to split the patches even further
>> to try and separate out changes for easier review.  While not a perfect split
>> I'm sure, and clearly only for review purposes, I do hope it helps a little.
>> There is one hunk in 0002 which moves some OpenSSL specific code from
>> underneath USE_SSL, but thats about the only non-NSS change left in this
>> patchset AFAICS.
> 
> I would have imagined 0010 to be either a 0001 or a 0002 :)

Well, 0010 is a 2 in binary =) Jokes aside, I just didn't want to have a patch
referencing files added by later patches in the series.

> errmsg("hostssl record cannot match because SSL is not supported by this 
> build"),
> -errhint("Compile with --with-ssl=openssl to use SSL connections."),
> +errhint("Compile with --with-ssl to use SSL connections."),
> Actually, we could change that directly on HEAD as you suggest.  This
> code area is surrounded with USE_SSL so there is no need to mention
> openssl at all.

We could; the only reason it says =openssl today is that it's the only possible
value, but that's an implementation detail.  Changing it now before it's shipped
anywhere means the translation will be stable even if another library is
supported.

> 0002 could just be committed as-is.

It can be, it's not the most pressing patch scope reduction but everything
helps of course.

> I am also looking at 0003 a bit.

Thanks.  That patch is slightly more interesting in terms of reducing scope
here, but I also think it makes the test code a bit easier to digest when
certificate management is abstracted into the API rather than being the job of
the test file to perform.

--
Daniel Gustafsson   https://vmware.com/



Re: adding wait_start column to pg_locks

2021-02-09 Thread Fujii Masao




On 2021/02/09 17:48, torikoshia wrote:

On 2021-02-05 18:49, Fujii Masao wrote:

On 2021/02/05 0:03, torikoshia wrote:

On 2021-02-03 11:23, Fujii Masao wrote:

64-bit fetches are not atomic on some platforms. So spinlock is necessary when updating 
"waitStart" without holding the partition lock? Also GetLockStatusData() needs spinlock 
when reading "waitStart"?


Also it might be worth thinking to use 64-bit atomic operations like
pg_atomic_read_u64(), for that.


Thanks for your suggestion and advice!

In the attached patch I used pg_atomic_read_u64() and pg_atomic_write_u64().

waitStart is TimestampTz, i.e., int64, but it seems pg_atomic_read_xxx and 
pg_atomic_write_xxx only support unsigned integers, so I cast the type.

I may not be using these functions correctly, so if something is wrong, I would 
appreciate any comments.


About the documentation, since your suggestion seems better than v6, I used it 
as is.


Thanks for updating the patch!

+    if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &now));

pg_atomic_read_u64() is really necessary? I think that
"pg_atomic_write_u64(&MyProc->waitStart, now)" is enough.

+    deadlockStart = get_timeout_start_time(DEADLOCK_TIMEOUT);
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &deadlockStart));

Same as above.

+    /*
+ * Record waitStart reusing the deadlock timeout timer.
+ *
+ * It would be ideal this can be synchronously done with updating
+ * lock information. Howerver, since it gives performance impacts
+ * to hold partitionLock longer time, we do it here asynchronously.
+ */

IMO it's better to comment why we reuse the deadlock timeout timer.

 proc->waitStatus = waitStatus;
+    pg_atomic_init_u64(&MyProc->waitStart, 0);

pg_atomic_write_u64() should be used instead? Because waitStart can be
accessed concurrently there.

I updated the patch and addressed the above review comments. Patch attached.
Barring any objection, I will commit this version.


Thanks for modifying the patch!
I agree with your comments.

BTW, I ran pgbench several times before and after applying
this patch.

The environment is a virtual machine (CentOS 8), so this is
just for reference, but there was no significant difference
in latency or tps (both are below 1%).


Thanks for the test! I pushed the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: libpq compression

2021-02-09 Thread Konstantin Knizhnik




On 08.02.2021 22:23, Daniil Zakhlystov wrote:

Hi everyone,

I’ve been making some experiments with an on-the-fly compression switch lately 
and have some updates.

...
pg_restore of IMDB database test results

Chunked compression with only CopyData or DataRow compression (second approach):
time:
real2m27.947s
user0m45.453s
sys 0m3.113s
RX bytes diff, human:  1.8837M
TX bytes diff, human:  1.2810G

Permanent compression:
time:
real2m15.810s
user0m42.822s
sys 0m2.022s
RX bytes diff, human:  2.3274M
TX bytes diff, human:  1.2761G

Without compression:
time:
real2m38.245s
user0m18.946s
sys 0m2.443s
RX bytes diff, human:  5.6117M
TX bytes diff, human:  3.8227G


Also, I’ve run pgbench tests and measured the CPU load. Since chunked 
compression did not compress any messages except for CopyData or DataRow, it 
demonstrated lower CPU usage compared to permanent compression; the full report 
with graphs is available in the Google doc above.

Pull request with the second approach implemented:
https://github.com/postgrespro/libpq_compression/pull/7

Also, in this pull request, I’ve made the following changes:
- extracted the general-purpose streaming compression API into a separate 
structure (ZStream) so it can be used in other places without tx_func and 
rx_func; maybe the other compression patches can utilize it?
- made some refactoring of ZpqStream
- moved the SSL and ZPQ buffered read data checks into a separate function, 
pqReadPending

What do you think of the results above? I think that the implemented approach 
is viable, but maybe I missed something in my tests.


Sorry, but my interpretation of your results is completely different:
permanent compression is faster than chunked compression (2m15 vs. 2m27) 
and consumes less CPU (44 vs. 48 sec).

The RX size is slightly larger (by about 0.5 MB), but the TX size is smaller 
(by about 5 MB). So permanent compression is better from all points of view: 
it is faster, consumes less CPU, and reduces network traffic!


From my point of view, your results just prove my original opinion that the 
possibility to control compression on the fly and to use different compression 
algorithms for TX/RX data just complicates the implementation and gives no 
significant advantage.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Libpq support to connect to standby server as priority

2021-02-09 Thread Greg Nancarrow
On Mon, Feb 8, 2021 at 8:17 PM vignesh C  wrote:
>
> >
> > I think what we want to do is mark default_transaction_read_only as
> > GUC_REPORT, instead.  That will give a reliable report of what the
> > state of its GUC is, and you can combine it with is_hot_standby
> > to decide whether the session should be considered read-only.
> > If you don't get those two GUC values during connection, then you
> > can fall back on "SHOW transaction_read_only".
> >
>
> I have made a patch for the above with the changes suggested and
> rebased it with the head code.
> Attached v21 patch which has the changes for the same.
> Thoughts?

Further to my other doc change feedback, I can only spot the following
minor things (otherwise the changes that you have made seem OK to me).

1) doc/src/sgml/protocol.sgml

   default_transaction_read_only  and
   in_hot_standby were not reported by releases before
   14.)

should be:

   default_transaction_read_only  and
   in_hot_standby were not reported by releases before
   14.0)

2) doc/src/sgml/high-availability,sgml

   
During hot standby, the parameter in_hot_standby and
default_transaction_read_only are always true and may
not be changed.

should be:

   
During hot standby, the parameters in_hot_standby and
transaction_read_only are always true and may
not be changed.


[I believe that there are only checks on attempts to change
"transaction_read_only" when in hot_standby, not
"default_transaction_read_only"; see check_transaction_read_only()]


3) src/interfaces/libpq/fe-connect.c

In rejectCheckedReadOrWriteConnection() and
rejectCheckedStandbyConnection(), now that host and port info are
emitted separately and are not included in each error message string
(as parameters in a format string), I think those functions should use
appendPQExpBufferStr() instead of appendPQExpBuffer(), as it's more
efficient if there is just a single string argument.
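
For example, something of this shape (just a sketch; the message text here is a
placeholder, not the wording actually used in the patch):

appendPQExpBufferStr(&conn->errorMessage,
 libpq_gettext("server is in hot standby mode\n"));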


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Online checksums patch - once again

2021-02-09 Thread Heikki Linnakangas
(I may have said this before, but) My overall high-level impression of 
this patch is that it's really complex for a feature that you use maybe 
once in the lifetime of a cluster. I'm happy to review, but I'm not 
planning to commit this myself. I don't object if some other committer 
picks this up (Magnus?).


Now to the latest patch version:

On 03/02/2021 18:15, Daniel Gustafsson wrote:

The previous v35 had a tiny conflict in pg_class.h, the attached v36 (which is
a squash of the 2 commits in v35) fixes that.  No other changes are introduced
in this version.




/*
 * Check to see if my copy of RedoRecPtr is out of date. If so, may have
 * to go back and have the caller recompute everything. This can only
 * happen just after a checkpoint, so it's better to be slow in this 
case
 * and fast otherwise.
 *
 * Also check to see if fullPageWrites or forcePageWrites was just 
turned
 * on, or if we are in the process of enabling checksums in the cluster;
 * if we weren't already doing full-page writes then go back and 
recompute.
 *
 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
 * affect the contents of the XLOG record, so we'll update our local 
copy
 * but not force a recomputation.  (If doPageWrites was just turned off,
 * we could recompute the record without full pages, but we choose not 
to
 * bother.)
 */
if (RedoRecPtr != Insert->RedoRecPtr)
{
Assert(RedoRecPtr < Insert->RedoRecPtr);
RedoRecPtr = Insert->RedoRecPtr;
}
doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites || 
DataChecksumsOnInProgress());


Why does this use DataChecksumsOnInProgress() instead of 
DataChecksumsNeedWrite()? If checksums are enabled, you always need 
full-page writes, don't you? If not, then why is it needed in the 
inprogress-on state?


We also set doPageWrites in InitXLOGAccess(). That should match the 
condition above (although it doesn't matter for correctness).



/*
 * DataChecksumsNeedVerify
 *  Returns whether data checksums must be verified or not
 *
 * Data checksums are only verified if they are fully enabled in the cluster.
 * During the "inprogress-on" and "inprogress-off" states they are only
 * updated, not verified (see datachecksumsworker.c for a longer discussion).
 *
 * This function is intended for callsites which have read data and are about
 * to perform checksum validation based on the result of this. To avoid
 * the risk of the checksum state changing between reading and performing the
 * validation (or not), interrupts must be held off. This implies that calling
 * this function must be performed as close to the validation call as possible
 * to keep the critical section short. This is in order to protect against
 * time of check/time of use situations around data checksum validation.
 */
bool
DataChecksumsNeedVerify(void)
{
Assert(InterruptHoldoffCount > 0);
return (LocalDataChecksumVersion == PG_DATA_CHECKSUM_VERSION);
}


What exactly is the intended call pattern here? Something like this?

smgrread() a data page
HOLD_INTERRUPTS();
if (DataChecksumsNeedVerify())
{
  if (pg_checksum_page((char *) page, blkno) != expected)
elog(ERROR, "bad checksum");
}
RESUME_INTERRUPTS();

That seems to be what the code currently does. What good does holding 
interrupts do here? If checksums were not fully enabled at the 
smgrread() call, the page might have incorrect checksums, and if the 
state transitions from inprogress-on to on between the smgrread() call 
and the DataChecksumsNeedVerify() call, you'll get an error. I think you 
need to hold the interrupts *before* the smgrread() call.
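
That is, something along these lines (same placeholder notation as in the 
sketch above):

HOLD_INTERRUPTS();
smgrread() a data page
if (DataChecksumsNeedVerify())
{
  if (pg_checksum_page((char *) page, blkno) != expected)
    elog(ERROR, "bad checksum");
}
RESUME_INTERRUPTS();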



/*
 * Set checksum for a page in private memory.
 *
 * This must only be used when we know that no other process can be modifying
 * the page buffer.
 */
void
PageSetChecksumInplace(Page page, BlockNumber blkno)
{
HOLD_INTERRUPTS();
/* If we don't need a checksum, just return */
if (PageIsNew(page) || !DataChecksumsNeedWrite())
{
RESUME_INTERRUPTS();
return;
}

((PageHeader) page)->pd_checksum = pg_checksum_page((char *) page, 
blkno);
RESUME_INTERRUPTS();
}


The checksum state might change just after this call, before the caller 
has actually performed the smgrwrite() or smgrextend() call. The caller 
needs to hold interrupts across this function and the 
smgrwrite/smgrextend() call. It is a bad idea to HOLD_INTERRUPTS() here, 
because that just masks bugs where the caller isn't holding the 
interrupts. Same in PageSetChecksumCopy().
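
I.e., roughly like this (a sketch; the smgrextend() arguments other than the 
page are placeholders):

HOLD_INTERRUPTS();
PageSetChecksumInplace(page, blkno);
smgrextend(reln, forknum, blkno, (char *) page, false);
RESUME_INTERRUPTS();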


- Heikki



