Re: automatically generating node support functions

2021-07-18 Thread Peter Eisentraut

On 07.06.21 22:27, Peter Eisentraut wrote:
I wrote a script to automatically generate the node support functions 
(copy, equal, out, and read, as well as the node tags enum) from the 
struct definitions.
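
To give an idea of what is being generated, a copy function for a
hypothetical node type would come out following the existing copyfuncs.c
conventions, roughly like this (FooBar and its fields are made-up names,
shown only for illustration):

    static FooBar *
    _copyFooBar(const FooBar *from)
    {
        FooBar     *newnode = makeNode(FooBar);

        COPY_SCALAR_FIELD(num);
        COPY_STRING_FIELD(name);
        COPY_NODE_FIELD(child);

        return newnode;
    }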


The first eight patches are to clean up various inconsistencies to make 
parsing or generation easier.


Are there any concerns about the patches 0001 through 0008?  Otherwise, 
maybe we could get those out of the way.





Re: [HACKERS] logical decoding of two-phase transactions

2021-07-18 Thread Amit Kapila
On Mon, Jul 19, 2021 at 9:19 AM Peter Smith  wrote:
>
> On Mon, Jul 19, 2021 at 12:43 PM Amit Kapila  wrote:
> >
> > On Mon, Jul 19, 2021 at 1:55 AM Tom Lane  wrote:
> > >
> > > Amit Kapila  writes:
> > > > Pushed.
> > >
> > > I think you'd be way better off making the gid fields be "char *"
> > > and pstrdup'ing the result of pq_getmsgstring.  Another possibility
> > > perhaps is to use strlcpy, but I'd only go that way if it's important
> > > to constrain the received strings to 200 bytes.
> > >
> >
> > I think it is important to constrain length to 200 bytes for this case
> > as here we receive a prepared transaction identifier which according
> > to docs [1] has a max length of 200 bytes. Also, in
> > ParseCommitRecord() and ParseAbortRecord(), we are using strlcpy with
> > 200 as max length to copy prepare transaction identifier. So, I think
> > it is better to use strlcpy here unless you or Peter feels otherwise.
> >
>
> OK. I have implemented a fix for this reported [1] potential buffer overrun
> using the length-constraining strlcpy, because the GID limitation of 200
> bytes is already mentioned in the documentation [2].
>

This will work, but I think it is better to use sizeof of the gid buffer, as
we do in ParseCommitRecord() and ParseAbortRecord(). That way, if for some
unforeseen reason we later change the size of the gid buffer to be different
from GIDSIZE, it will still work seamlessly.
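
For illustration, the difference is just (hypothetical field names, not the
actual patch):

    strlcpy(begin_data->gid, pq_getmsgstring(in), GIDSIZE);

versus

    strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));

so the copy is always bounded by the real destination buffer, even if the
field's declared size ever diverges from GIDSIZE.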

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2021-07-18 Thread Masahiko Sawada
On Sat, Jul 17, 2021 at 12:02 AM Masahiko Sawada  wrote:
>
> On Wed, Jul 14, 2021 at 5:14 PM Masahiko Sawada  wrote:
> >
> > On Mon, Jul 12, 2021 at 8:52 PM Amit Kapila  wrote:
> > >
> > > On Mon, Jul 12, 2021 at 11:13 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Mon, Jul 12, 2021 at 1:15 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Mon, Jul 12, 2021 at 9:37 AM Alexey Lesovsky  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Jul 12, 2021 at 8:36 AM Amit Kapila 
> > > > > >  wrote:
> > > > > >>
> > > > > >> >
> > > > > >> > Ok, looks nice. But I am curious how this will work in the case 
> > > > > >> > when there are two (or more) errors in the same subscription, 
> > > > > >> > but different relations?
> > > > > >> >
> > > > > >>
> > > > > >> We can't proceed unless the first error is resolved, so there
> > > > > >> shouldn't be multiple unresolved errors.
> > > > > >
> > > > > >
> > > > > > Ok. I thought multiple errors are possible when many tables are 
> > > > > > initialized using parallel workers (with 
> > > > > > max_sync_workers_per_subscription > 1).
> > > > > >
> > > > >
> > > > > Yeah, that is possible but that covers under the second condition
> > > > > mentioned by me and in such cases I think we should have separate rows
> > > > > for each tablesync. Is that right, Sawada-san or do you have something
> > > > > else in mind?
> > > >
> > > > Yeah, I agree to have separate rows for each table sync. The table
> > > > should not be processed by both the table sync worker and the apply
> > > > worker at a time so the pair of subscription OID and relation OID will
> > > > be unique. I think that we have a boolean column in the view,
> > > > indicating whether the error entry is reported by the table sync
> > > > worker or the apply worker, or maybe we also can have the action
> > > > column show "TABLE SYNC" if the error is reported by the table sync
> > > > worker.
> > > >
> > >
> > > Or similar to backend_type (text) in pg_stat_activity, we can have
> > > something like error_source (text) which will display apply worker or
> > > tablesync worker? I think if we have this column then even if there is
> > > a chance that both apply and sync worker operates on the same
> > > relation, we can identify it via this column.
> >
> > Sounds good. I'll incorporate this in the next version patch that I'm
> > planning to submit this week.
>
> Sorry, I could not make it this week. I'll submit them early next week.
> While updating the patch I thought we need to have more design
> discussion on two points of clearing error details after the error is
> resolved:
>
> 1. How to clear apply worker errors. IIUC we've discussed that once
> the apply worker skipped the transaction we leave the error entry
> itself but clear its fields except for some fields such as failure
> counts. But given that the stats messages could be lost, how can we
> ensure to clear those error details? For table sync workers’ error, we
> can have autovacuum workers periodically check entires of
> pg_subscription_rel and clear the error entry if the table sync worker
> completes table sync (i.g., checking if srsubstate = ‘r’). But there
> is no such information for the apply workers and subscriptions. In
> addition to sending the message clearing the error details just after
> skipping the transaction, I thought that we can have apply workers
> periodically send the message clearing the error details but it seems
> not good.

I think that the motivation behind the idea of leaving error entries
and clearing some of their fields is that users can check whether the
error has been successfully resolved and the worker is working fine.
But we can also check that in another way, for example by checking the
pg_stat_subscription view. So is it worth considering just leaving the
apply worker errors as they are?

>
> 2. Do we really want to leave the table sync worker even after the
> error is resolved and the table sync completes? Unlike the apply
> worker error, the number of table sync worker errors could be very
> large, for example, if a subscriber subscribes to many tables. If we
> leave those errors in the stats view, it uses more memory space and
> could affect writing and reading stats file performance. If such left
> table sync error entries are not helpful in practice I think we can
> remove them rather than clear some fields. What do you think?
>

I've attached the updated version patch that incorporated all comments
I got so far except for the clearing error details part I mentioned
above. After getting a consensus on those parts, I'll incorporate the
idea into the patches.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
From 4a2abc82db9ab37699f09df9be86f150c58db3cf Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Mon, 28 Jun 2021 13:18:58 +0900
Subject: [PATCH v2 3/3] Add skip_xid option to ALTER SUBSCRIPTION.

---
 doc/src/sgml/logical-replication.sgml  |  33 ++-
 doc/src/sgml/ref/alter_subscription.sgml   |  47 +++-

Re: Skipping logical replication transactions on subscriber side

2021-07-18 Thread Masahiko Sawada
On Mon, Jul 19, 2021 at 2:22 PM Amit Kapila  wrote:
>
> On Fri, Jul 16, 2021 at 8:33 PM Masahiko Sawada  wrote:
> >
> > On Wed, Jul 14, 2021 at 5:14 PM Masahiko Sawada  
> > wrote:
> > >
> > > Sounds good. I'll incorporate this in the next version patch that I'm
> > > planning to submit this week.
> >
> > Sorry, I could not make it this week. I'll submit them early next week.
> >
>
> No problem.
>
> > While updating the patch I thought we need to have more design
> > discussion on two points of clearing error details after the error is
> > resolved:
> >
> > 1. How to clear apply worker errors. IIUC we've discussed that once
> > the apply worker skipped the transaction we leave the error entry
> > itself but clear its fields except for some fields such as failure
> > counts. But given that the stats messages could be lost, how can we
> > ensure to clear those error details? For table sync workers’ errors, we
> > can have autovacuum workers periodically check entries of
> > pg_subscription_rel and clear the error entry if the table sync worker
> > completes table sync (i.e., checking if srsubstate = ‘r’). But there
> > is no such information for the apply workers and subscriptions.
> >
>
> But won't the corresponding subscription (pg_subscription) have the
> XID as InvalidTransactionId once the xid is skipped, or at least a
> different XID than we would have in the pg_stat view? Can we use that
> to reset the entry via vacuum?

I think the XID is InvalidTransactionId until the user specifies it. So
I think we cannot know whether we're before or after skipping by the
transaction ID alone. No?

>
> > In
> > addition to sending the message clearing the error details just after
> > skipping the transaction, I thought that we can have apply workers
> > periodically send the message clearing the error details but it seems
> > not good.
> >
>
> Yeah, such things should be a last resort.
>
> > 2. Do we really want to leave the table sync worker even after the
> > error is resolved and the table sync completes? Unlike the apply
> > worker error, the number of table sync worker errors could be very
> > large, for example, if a subscriber subscribes to many tables. If we
> > leave those errors in the stats view, it uses more memory space and
> > could affect writing and reading stats file performance. If such left
> > table sync error entries are not helpful in practice I think we can
> > remove them rather than clear some fields. What do you think?
> >
>
> Sounds reasonable to me. One might think to update the subscription
> error count by including table_sync errors but not sure if that is
> helpful and even if that is helpful, we can extend it later.

Agreed.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2021-07-18 Thread Ronan Dunklau
> Looks like I did a sloppy job of that.  I messed up the condition in
> standard_qp_callback() that sets the ORDER BY aggregate pathkeys so
> that it accidentally set them when there was an unsortable GROUP BY
> clause, as highlighted by the postgres_fdw tests failing.  I've now
> added a comment to explain why the condition is the way it is so that
> I don't forget again.
> 
> Here's a cleaned-up version that passes make check-world.
> 

I've noticed that when the ORDER BY is a grouping key (which to be honest
doesn't seem to make much sense to me), the sort key is duplicated, as
demonstrated by one of the modified tests (partition_aggregate.sql).

This leads to additional sort nodes being added when there is no need to do
so. In the case of sort and index paths, the duplicate keys are not
considered; I think the same should apply here.

It means the logic for appending the order by pathkeys to the existing group 
by pathkeys would ideally need to remove the redundant group by keys from the 
order by keys, considering this example:

regression=# explain select sum(unique1 order by ten, two), sum(unique1 order 
by four), sum(unique1 order by two, four) from tenk2 group by ten;
   QUERY PLAN   

 GroupAggregate  (cost=1109.39..1234.49 rows=10 width=28)
   Group Key: ten
   ->  Sort  (cost=1109.39..1134.39 rows=1 width=16)
 Sort Key: ten, ten, two
 ->  Seq Scan on tenk2  (cost=0.00..445.00 rows=1 width=16)


We would ideally like to sort on ten, two, four to satisfy the first and last 
aggref at once. Stripping the common prefix (ten) would eliminate this problem. 

Also, existing regression tests cover the first problem (order by a grouping
key), but I feel like they should be extended with a case similar to the above
to check which pathkeys are used in the "multiple ordered aggregates + group
by" cases.


-- 
Ronan Dunklau






Re: Replace remaining castNode(…, lfirst(…)) and friends calls with l*_node()

2021-07-18 Thread Peter Eisentraut

On 15.07.21 13:12, Dagfinn Ilmari Mannsåker wrote:

ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:


Hi hackers,

We've had node-casting versions of the list extraction macros since
2017, but several cases of the written-out version have been added since
then (even by Tom, who added the l*_node() macros).

Here's a patch to convert the remaining ones.  The macros were
back-patched to all supported branches, so this won't create any
new back-patching hazards.
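
For anyone unfamiliar with the macros, the conversion is purely mechanical;
for example (illustrative code, not a specific hunk from the patch):

    foreach(lc, targetlist)
    {
        TargetEntry *tle = castNode(TargetEntry, lfirst(lc));
        ...
    }

becomes

    foreach(lc, targetlist)
    {
        TargetEntry *tle = lfirst_node(TargetEntry, lc);
        ...
    }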


Added to the 2021-09 commit fest: https://commitfest.postgresql.org/34/3253/


committed




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-18 Thread Masahiko Sawada
Sorry for the late reply.

On Sat, Jul 10, 2021 at 11:55 AM Andres Freund  wrote:
>
> Hi,
>
> On 2021-07-09 10:17:49 -0700, Andres Freund wrote:
> > On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > > Currently, the TIDs of dead tuples are stored in an array that is
> > > collectively allocated at the start of lazy vacuum and TID lookup uses
> > > bsearch(). There are the following challenges and limitations:
> >
> > > So I prototyped a new data structure dedicated to storing dead tuples
> > > during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
> > > The authors provide an implementation of Roaring Bitmap[3]  (Apache
> > > 2.0 license). But I've implemented this idea from scratch because we
> > > need to integrate it with Dynamic Shared Memory/Area to support
> > > parallel vacuum and need to support ItemPointerData, 6-bytes integer
> > > in total, whereas the implementation supports only 4-bytes integers.
> > > Also, when it comes to vacuum, we neither need to compute the
> > > intersection, the union, nor the difference between sets, but need
> > > only an existence check.
> > >
> > > The data structure is somewhat similar to TIDBitmap. It consists of
> > > the hash table and the container area; the hash table has entries per
> > > block and each block entry allocates its memory space, called a
> > > container, in the container area to store its offset numbers. The
> > > container area is actually an array of bytes and can be enlarged as
> > > needed. In the container area, the data representation of offset
> > > numbers varies depending on their cardinality. It has three container
> > > types: array, bitmap, and run.
> >
> > How are you thinking of implementing iteration efficiently for rtbm? The
> > second heap pass needs that obviously... I think the only option would
> > be to qsort the whole thing?
>
> I experimented further, trying to use an old radix tree implementation I
> had lying around to store dead tuples. With a bit of trickery that seems
> to work well.

Thank you for experimenting with another approach.

>
> The radix tree implementation I have basically maps an int64 to another
> int64. Each level of the radix tree stores 6 bits of the key, and uses
> those 6 bits to index a 1<<6 long array leading to the next level.
>
> My first idea was to use itemptr_encode() to convert tids into an int64
> and store the lower 6 bits in the value part of the radix tree. That
> turned out to work well performance wise, but awfully memory usage
> wise. The problem is that we at most use 9 bits for offsets, but reserve
> 16 bits for it in the ItemPointerData. Which means that there's often a
> lot of empty "tree levels" for those 0 bits, making it hard to get to a
> decent memory usage.
>
> The simplest way to address that was to simply compress out those
> guaranteed-to-be-zero bits. That results in memory usage that's quite
> good - nearly always beating array, occasionally beating rtbm. It's an
> ordered datastructure, so the latter isn't too surprising. For lookup
> performance the radix approach is commonly among the best, if not the
> best.
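
(For illustration, the "compress out those bits" idea amounts to packing the
key roughly like this -- a sketch with made-up names, not the actual
implementation:

    /*
     * ItemPointerData reserves 16 bits for the offset number, but at most
     * 9 bits are ever used, so place the block number directly above a
     * 9-bit offset field instead of keeping the full 16-bit field.
     */
    static inline uint64
    tid_to_radix_key(BlockNumber blkno, OffsetNumber offnum)
    {
        return ((uint64) blkno << 9) | ((uint64) offnum & 0x1FF);
    }

That way consecutive TIDs map to densely packed keys, so the upper tree
levels are actually shared.)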

How did both its lookup performance and memory usage compare to
intset? I guess the performance trends of those two approaches are
similar since both consist of a tree. Intset encodes uint64 values with
Simple-8b encoding, so I'm also interested in the comparison in terms
of memory usage.

>
> A variation of the storage approach is to just use the block number as
> the index, and store the tids as the value. Even with the absolutely
> naive approach of just using a Bitmapset that reduces memory usage
> substantially - at a small cost to search performance. Of course it'd be
> better to use an adaptive approach like you did for rtbm, I just thought
> this is good enough.
>
>
> This largely works well, except when there are a large number of evenly
> spread out dead tuples. I don't think that's a particularly common
> situation, but it's worth considering anyway.
>
> The reason the memory usage can be larger for sparse workloads is obviously
> that they can lead to tree nodes with only one child. As those nodes are
> quite large (1<<6 pointers to further children), that can then lead to a
> large increase in memory usage.

Interesting. How big was it in such workloads compared to other data
structures?

I personally like adaptive approaches, especially in the context of
vacuum improvements. We know common patterns of dead tuple
distribution, but they don’t necessarily hold, since the distribution
depends on the data and on the timing of autovacuum etc., even with the
same workload. And we might be able to provide a new approach that
works well in 95% of use cases, but if things get worse than before in
the other 5%, I don’t think it is a good approach. Ideally, it should
be better in common cases and at least no worse than before in other
cases.

BTW is the implementation of the radix tree approach available
somewhere? If so I'd like to experiment with that too.

>
> I have toyed

Re: corruption of WAL page header is never reported

2021-07-18 Thread Kyotaro Horiguchi
Hello.

At Sun, 18 Jul 2021 04:55:05 +0900, Yugo NAGATA  wrote in 
> Hello,
> 
> I found that any corruption of the WAL page header found during recovery is
> never reported in log messages. If the WAL page header is broken, it is
> detected in XLogReaderValidatePageHeader called from XLogPageRead, but the
> error messages are always reset and never reported.

Good catch!  Currently recovery stops showing no reason if it is
stopped by page-header errors.

> I attached a patch to fix it in this way.

However, it is a kind of roof-over-a-roof.  What we should do is
simply restrict the check in XLogPageRead to standby mode.

> Or, if we don't want to report an error for each check, and what we want to
> check here is just for old recycled WAL rather than header corruption itself,
> I wonder whether we could check just xlp_pageaddr instead of calling
> XLogReaderValidatePageHeader.

I'm not sure. But as described in the commit message, the commit
intended to cover a common case, and there's no obvious reason to (or
not to) restrict the check to only the page address. So the attached
keeps using the established checking function.

I was tempted to adjust the comment just above by adding "while in
standby mode", but "so that we can retry immediately" already suggests
that, so I didn't do that in the attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 30033d810bcc784da55600792484603e1c46b3d7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 19 Jul 2021 14:49:34 +0900
Subject: [PATCH v1] Don't forget message of page-header errors while not in
 standby mode

Commit 0668719801 intended to suppress page-header error messages only
while in standby mode, but actually they are always discarded.  As a
result, the message at the end of a crash recovery lacks the reason for
the stop. Fix that by doing the additional check only while in standby
mode.
---
 src/backend/access/transam/xlog.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ee9515139..79513fb8b5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12317,7 +12317,8 @@ retry:
 	 * Validating the page header is cheap enough that doing it twice
 	 * shouldn't be a big deal from a performance point of view.
 	 */
-	if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+	if (StandbyMode &&
+		!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
 	{
 		/* reset any error XLogReaderValidatePageHeader() might have set */
 		xlogreader->errormsg_buf[0] = '\0';
-- 
2.27.0



Re: RFC: Logging plan of the running query

2021-07-18 Thread Fujii Masao




On 2021/07/19 11:28, torikoshia wrote:

Agreed. Updated the patch.


Thanks for updating the patch!

+bool
+SendProcSignalForLogInfo(pid_t pid, ProcSignalReason reason)

I don't think that procsignal.c is the proper place to check the permission
and check whether the specified PID indicates a PostgreSQL server process,
etc., because procsignal.c just provides fundamental routines for interprocess
signaling. Isn't it better to move the function to signalfuncs.c or elsewhere?


+   ExplainQueryText(es, ActivePortal->queryDesc);
+   ExplainPrintPlan(es, ActivePortal->queryDesc);
+   ExplainPrintJITSummary(es, ActivePortal->queryDesc);

When text format is used, ExplainBeginOutput() and ExplainEndOutput()
do nothing. So (I guess) you thought that they don't need to be called and
implemented the code in that way. But IMO it's better to comment
why they don't need to be called, or to just call both of them
even if they do nothing in text format.


+   ExplainPrintJITSummary(es, ActivePortal->queryDesc);

It's better to check es->costs before calling this function,
like explain_ExecutorEnd() and ExplainOnePlan() do?
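
Putting these two comments together, the call site might end up looking
something like this (just a sketch, not tested):

    ExplainBeginOutput(es);
    ExplainQueryText(es, ActivePortal->queryDesc);
    ExplainPrintPlan(es, ActivePortal->queryDesc);
    if (es->costs)
        ExplainPrintJITSummary(es, ActivePortal->queryDesc);
    ExplainEndOutput(es);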


+   result = SendProcSignalForLogInfo(pid, PROCSIG_LOG_CURRENT_PLAN);
+
+   PG_RETURN_BOOL(result);

Currently SendProcSignalForLogInfo() calls PG_RETURN_BOOL() in some cases,
but instead it should just return true/false because pg_log_current_query_plan()
expects that?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [PATCH] Use optimized single-datum tuplesort in ExecSort

2021-07-18 Thread Ronan Dunklau
Le samedi 17 juillet 2021, 10:36:09 CEST David Rowley a écrit :
>  On Sat, 17 Jul 2021 at 01:14, James Coleman  wrote:
> > The only remaining question I have is whether or not costing needs to
> > change, given the very significant speedup for datum sort.
> 
> I'm looking at cost_tuplesort and the only thing that I think might
> make sense would be to adjust how the input_bytes value is calculated.
> For now, that's done with the following function that's used in quite
> a number of places.
> 
> static double
> relation_byte_size(double tuples, int width)
> {
> return tuples * (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
> }
> 
> It seems, at least in the case of Sort, that using SizeofHeapTupleHeader
> is just always wrong as it should be SizeofMinimalTupleHeader. I know
> that's also the case for Memoize too. I've not checked the other
> locations.
> 
> The only thing I can really see that we might do would be not add the
> MAXALIGN(SizeofHeapTupleHeader) when there's just a single column.
> We'd need to pass down the number of attributes from
> create_sort_path() so we'd know when and when not to add that. I'm not
> saying that we should do this. I'm just saying that I don't really see
> what else we might do.
> 
> I can imagine another patch might just want to do a complete overhaul
> of all locations that use relation_byte_size().  There are various
> things that function just does not account for. e.g, the fact that we
> allocate chunks in powers of 2 and that there's a chunk header added
> on.  Of course, "width" is just an estimate, so maybe trying to
> calculate something too precisely wouldn't be too wise. However,
> there's a bit of a chicken and the egg problem there as there'd be
> little incentive to improve "width" unless we started making more
> accurate use of the value.
> 
> Anyway, none of the above take into account that the Datum sort is
> just a little faster. The only thing that exists in the existing cost
> model that we could use to adjust the cost of an in-memory sort is the
> comparison_cost.  The problem there is that the comparison is exactly
> the same in both Datum and Tuple sorts. The only thing that really
> changes between Datum and Tuple sort is the fact that we don't make a
> MinimalTuple when doing a Datum sort.  The cost model, unfortunately,
> does not account for that.   That kinda makes me think that we should
> do nothing as if we start to account for making MinimalTuples then
> we'll just penalise Tuple sorts and that might cause someone to be
> upset.
> 
Thank you for taking the time to perform that analysis. I agree with you, and
it looks to me that if we were to start accounting for it, we would have to
make the change almost transparent for tuple sorts so that it stays roughly
the same, which is impossible since we don't apply the comparison cost to all
tuples but only to the number of tuples we actually expect to compare.

On the other hand, if we don't change the sorting cost and it just ends up 
being faster in some cases I doubt anyone would complain.
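
For reference, the header-size adjustment discussed above would presumably be
as small as this (a sketch with a hypothetical function name, not from any
posted patch):

    static double
    sort_input_bytes(double tuples, int width)
    {
        /* Sort stores MinimalTuples, so use the minimal tuple header size */
        return tuples * (MAXALIGN(width) + MAXALIGN(SizeofMinimalTupleHeader));
    }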


-- 
Ronan Dunklau






Re: [HACKERS] logical decoding of two-phase transactions

2021-07-18 Thread Greg Nancarrow
On Wed, Jul 14, 2021 at 6:33 PM Peter Smith  wrote:
>
> Please find attached the latest patch set v97*
>

I couldn't spot any significant issues in the v97-0001 patch, but
do have the following trivial feedback comments:

(1) doc/src/sgml/protocol.sgml
Suggestion:

BEFORE:
+   contains a Stream Prepare or Stream Commit or Stream Abort message.
AFTER:
+   contains a Stream Prepare, Stream Commit or Stream Abort message.


(2) src/backend/replication/logical/worker.c
It seems a bit weird to add a forward declaration here, without a
comment, like for the one immediately above it

/* Compute GID for two_phase transactions */
static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char
*gid, int szgid);
-
+static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);


(3) src/backend/replication/logical/worker.c
Other DEBUG1 messages don't end with "."

+ elog(DEBUG1, "apply_handle_stream_prepare: replayed %d
(all) changes.", nchanges);


Regards,
Greg Nancarrow
Fujitsu Australia




Re: logical replication empty transactions

2021-07-18 Thread Peter Smith
Hi Ajin,

I have reviewed the v7 patch and given my feedback comments below.

Apply OK
Build OK
make check OK
TAP (subscriptions) make check OK
Build PG Docs (html) OK

Although I made lots of review comments below, the important point is
that none of them are functional - they are only minor re-wordings
and some code refactoring that I thought would make the code simpler
and/or easier to read. YMMV, so please feel free to disagree with any
of them.

//

1a. Commit Comment - wording

BEFORE
This patch addresses the above problem by postponing the BEGIN / BEGIN
PREPARE message until the first change.

AFTER
This patch addresses the above problem by postponing the BEGIN / BEGIN
PREPARE messages until the first change is encountered.

--

1b. Commit Comment - wording

BEFORE
While processing a COMMIT message or a PREPARE message, if there is no
other change for that transaction, do not send COMMIT message or
PREPARE message.

AFTER
If (when processing a COMMIT / PREPARE message) we find there had been
no other change for that transaction, then do not send the COMMIT /
PREPARE message.

--

2. doc/src/sgml/logicaldecoding.sgml - wording

@@ -884,11 +884,19 @@ typedef void (*LogicalDecodePrepareCB) (struct
LogicalDecodingContext *ctx,
   The required commit_prepared_cb callback is called
   whenever a transaction COMMIT PREPARED has
been decoded.
   The gid field, which is part of the
-  txn parameter, can be used in this callback.
+  txn parameter, can be used in this callback. The
+  parameters prepare_end_lsn and
+  prepare_time can be used to check if the plugin
+  has received this PREPARE TRANSACTION in which case
+  it can commit the transaction, otherwise, it can skip the commit. The
+  gid alone is not sufficient because the downstream
+  node can have a prepared transaction with the same identifier.

=>

(some minor rewording of the last part)

AFTER:

The parameters prepare_end_lsn and
prepare_time can be used to check if the plugin
has received this PREPARE TRANSACTION or not. If
yes, it can commit the transaction, otherwise, it can skip the commit.
The gid alone is not sufficient to determine
this because the downstream node may already have a prepared
transaction with the same identifier.


--

3. src/backend/replication/logical/proto.c - whitespace

@@ -244,12 +248,16 @@ logicalrep_read_commit_prepared(StringInfo in,
LogicalRepCommitPreparedTxnData *
  elog(ERROR, "unrecognized flags %u in commit prepared message", flags);

  /* read fields */
+ prepare_data->prepare_end_lsn = pq_getmsgint64(in);
+ if (prepare_data->prepare_end_lsn == InvalidXLogRecPtr)
+ elog(ERROR,"prepare_end_lsn is not set in commit prepared message");

=>

There is missing space before the 2nd elog param.

--

4. src/backend/replication/logical/worker.c - comment typos

  /*
- * Update origin state so we can restart streaming from correct position
- * in case of crash.
+ * It is possible that we haven't received the prepare because
+ * the transaction did not have any changes relevant to this
+ * subscription and was essentially an empty prepare. In which case,
+ * the walsender is optimized to drop the empty transaction and the
+ * accompanying prepare. Silently ignore if we don't find the prepared
+ * transaction.
  */

4a. =>

"and was essentially an empty prepare" --> "so was essentially an empty prepare"

4b. =>

"In which case" --> "In this case"

--

5. src/backend/replication/pgoutput/pgoutput.c - pgoutput_begin_txn

@@ -410,10 +417,32 @@ pgoutput_startup(LogicalDecodingContext *ctx,
OutputPluginOptions *opt,
 static void
 pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)
 {
+ PGOutputTxnData*data = MemoryContextAllocZero(ctx->context,
+ sizeof(PGOutputTxnData));
+
+ /*
+ * Don't send BEGIN message here. Instead, postpone it until the first
+ * change. In logical replication, a common scenario is to replicate a set
+ * of tables (instead of all tables) and transactions whose changes were on
+ * table(s) that are not published will produce empty transactions. These
+ * empty transactions will send BEGIN and COMMIT messages to subscribers,
+ * using bandwidth on something with little/no use for logical replication.
+ */
+ data->sent_begin_txn = false;
+ txn->output_plugin_private = data;
+}

=>

I felt that since this message postponement is now the new behaviour
of this function, this should probably all be a function-level
comment instead of a comment in the body of the function.
--

6. src/backend/replication/pgoutput/pgoutput.c - pgoutput_begin

+
+static void
+pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)

=>

Even though it is kind of obvious, it is probably better to provide a
function comment here too
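
Something along these lines would do (suggested wording only):

    /*
     * Send the BEGIN message that pgoutput_begin_txn() postponed, now that
     * the transaction's first change is about to be sent.
     */
    static void
    pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn)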

--

7. src/backend/replication/pgoutput/pgoutput.c - pgoutput_commit_txn

@@ -428,8 +457,22 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *

Re: speed up verifying UTF-8

2021-07-18 Thread Amit Khandekar
On Sat, 17 Jul 2021 at 04:48, John Naylor  wrote:
> v17-0001 is the same as v14. 0002 is a stripped-down implementation of Amit's
> chunk idea for multibyte, and it's pretty good on x86. On Power8, not so
> much. 0003 and 0004 are shot-in-the-dark guesses to improve it on Power8,
> with some success, but end up making x86 weirdly slow, so I'm afraid that
> could happen on other platforms as well.

Thanks for trying the chunk approach. I tested your v17 versions on
Arm64. For the chinese characters, v17-0002 gave some improvement over
v14. But for all the other character sets, there was around 10%
degradation w.r.t. v14. I thought maybe the hhton64 call and memcpy()
for each mb character might be the culprit, so I tried iterating over
all the characters in the chunk within the same pg_utf8_verify_one()
function by left-shifting the bits. But that worsened the figures. So
I gave up that idea.

Here are the numbers on Arm64 :

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1781 |  1095 |   628 |     944 |   1151

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     852 |   484 |   144 |     584 |    971


v17-0001+2:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     731 |   520 |   152 |     645 |   1118


Haven't looked at your v18 patch set yet.




Re: Skipping logical replication transactions on subscriber side

2021-07-18 Thread Amit Kapila
On Fri, Jul 16, 2021 at 8:33 PM Masahiko Sawada  wrote:
>
> On Wed, Jul 14, 2021 at 5:14 PM Masahiko Sawada  wrote:
> >
> > Sounds good. I'll incorporate this in the next version patch that I'm
> > planning to submit this week.
>
> Sorry, I could not make it this week. I'll submit them early next week.
>

No problem.

> While updating the patch I thought we need to have more design
> discussion on two points of clearing error details after the error is
> resolved:
>
> 1. How to clear apply worker errors. IIUC we've discussed that once
> the apply worker skipped the transaction we leave the error entry
> itself but clear its fields except for some fields such as failure
> counts. But given that the stats messages could be lost, how can we
> ensure to clear those error details? For table sync workers’ errors, we
> can have autovacuum workers periodically check entries of
> pg_subscription_rel and clear the error entry if the table sync worker
> completes table sync (i.e., checking if srsubstate = ‘r’). But there
> is no such information for the apply workers and subscriptions.
>

But won't the corresponding subscription (pg_subscription) have the
XID as InvalidTransactionId once the xid is skipped, or at least a
different XID than we would have in the pg_stat view? Can we use that
to reset the entry via vacuum?

> In
> addition to sending the message clearing the error details just after
> skipping the transaction, I thought that we can have apply workers
> periodically send the message clearing the error details but it seems
> not good.
>

Yeah, such things should be a last resort.

> 2. Do we really want to leave the table sync worker even after the
> error is resolved and the table sync completes? Unlike the apply
> worker error, the number of table sync worker errors could be very
> large, for example, if a subscriber subscribes to many tables. If we
> leave those errors in the stats view, it uses more memory space and
> could affect writing and reading stats file performance. If such left
> table sync error entries are not helpful in practice I think we can
> remove them rather than clear some fields. What do you think?
>

Sounds reasonable to me. One might think to update the subscription
error count by including table_sync errors but not sure if that is
helpful and even if that is helpful, we can extend it later.

-- 
With Regards,
Amit Kapila.




Re: O_DIRECT on macOS

2021-07-18 Thread Thomas Munro
On Mon, Jul 19, 2021 at 4:42 PM Tom Lane  wrote:
> prairiedog thinks that Assert is too optimistic about whether all
> those flags exist.

Fixed.

(Huh, I received no -committers email for 2dbe8905.)




Re: O_DIRECT on macOS

2021-07-18 Thread Tom Lane
Thomas Munro  writes:
> Agreed.  Pushed!

prairiedog thinks that Assert is too optimistic about whether all
those flags exist.

regards, tom lane




Re: Doc necessity for superuser privileges to execute pg_import_system_collations()

2021-07-18 Thread Fujii Masao




On 2021/07/19 11:45, torikoshia wrote:

Hi.

It seems that only superusers can execute pg_import_system_collations(), but 
this is not mentioned in the manual.

Since other functions that require superuser privileges describe that in the 
manual, I think it would be better to do the same for consistency.

Thoughts?


LGTM.

IMO it's better to back-patch this to v10
where pg_import_system_collations() was added.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




RE: Added schema level support for publication.

2021-07-18 Thread tanghy.f...@fujitsu.com
On Friday, July 16, 2021 6:10 PM vignesh C 
> On Wed, Jul 14, 2021 at 6:25 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > Wednesday, July 14, 2021 6:17 PM vignesh C  wrote:
> > > On Tue, Jul 13, 2021 at 12:06 PM tanghy.f...@fujitsu.com
> > >  wrote:
> > > >
> > > > On Monday, July 12, 2021 5:36 PM vignesh C 
> wrote:
> > > > >
> > > > > Thanks for reporting this issue, this issue is fixed in the v10
> > > > > patch attached at [1].
> > > > > [1] - https://www.postgresql.org/message-id/CALDaNm2%2BtR%2B8R-
> > > > > sD1CSyMbZcZbkintZE-avefjsp7LCkm6HMmw%40mail.gmail.com
> > > >
> > > > Thanks for fixing it.
> > > >
> > > > By applying your V10 patch, I saw three problems, please have a look.
> > > >
> > > > 1. An issue about pg_dump.
> > > > When public schema was published, the publication was created in the
> > > > output file, but public schema was not added to it. (Other schemas
> > > > could be added as expected.)
> > > >
> > > > I looked into it and found that the selectDumpableNamespace function marks
> > > > DUMP_COMPONENT_DEFINITION as needless when the schema is public, which leads
> > > > to the public schema being ignored in getPublicationSchemas. So we'd better
> > > > check whether schemas should be dumped in another way.
> > > >
> > > > I tried to fix it with the following change, please have a look.
> > > > (Maybe we also need to add some comments for it.)
> > > >
> > > > diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
> > > > index f6b4f12648..a327d2568b 100644
> > > > --- a/src/bin/pg_dump/pg_dump.c
> > > > +++ b/src/bin/pg_dump/pg_dump.c
> > > > @@ -4206,7 +4206,8 @@ getPublicationSchemas(Archive *fout,
> > > NamespaceInfo nspinfo[], int numSchemas)
> > > >  * Ignore publication membership of schemas whose
> > > definitions are not
> > > >  * to be dumped.
> > > >  */
> > > > -   if (!(nsinfo->dobj.dump &
> > > DUMP_COMPONENT_DEFINITION))
> > > > +   if (!((nsinfo->dobj.dump &
> > > DUMP_COMPONENT_DEFINITION)
> > > > +   || (strcmp(nsinfo->dobj.name, "public") == 0
> > > > + && nsinfo->dobj.dump != DUMP_COMPONENT_NONE)))
> > > > continue;
> > > >
> > > > pg_log_info("reading publication membership for schema
> > > > \"%s\"",
> > >
> > > I felt it is intentionally done like that as the public schema is created
> > > by default, hence it is not required to dump, else we will get errors
> > > while restoring. Thoughts?
> >
> > Thanks for the new patches and I also looked at this issue.
> >
> > For user defined schema and publication:
> > --
> > create schema s1;
> > create publication pub2 for SCHEMA s1;
> > --
> >
> > pg_dump will only generate the following SQLs:
> >
> > --pg_dump result--
> > CREATE PUBLICATION pub2 WITH (publish = 'insert, update, delete, truncate');
> > ALTER PUBLICATION pub2 ADD SCHEMA s1;
> > --
> >
> > But for the public schema:
> > --
> > create publication pub for SCHEMA public;
> > --
> >
> > pg_dump will only generate the following SQL:
> >
> > --pg_dump result--
> > CREATE PUBLICATION pub WITH (publish = 'insert, update, delete, truncate');
> > --
> >
> > It didn't generate SQL like "ALTER PUBLICATION pub ADD SCHEMA public;" which
> > means the public schema won't be published after restoring. So, I think we'd
> > better let the pg_dump generate the ADD SCHEMA public SQL. Thoughts ?
> 
> Thanks for reporting this issue, this issue is fixed in the v12 patch 
> attached.
> 

I tested your v12 patch and found a problem in the following case.

Step 1:
postgres=# create schema s1;
CREATE SCHEMA
postgres=# create table s1.t1 (a int);
CREATE TABLE
postgres=# create publication pub_t for table s1.t1;
CREATE PUBLICATION
postgres=# create publication pub_s for schema s1;
CREATE PUBLICATION

Step 2:
pg_dump -N s1

I dumped and excluded schema s1, pg_dump generated the following SQL:
---
ALTER PUBLICATION pub_s ADD SCHEMA s1;

I think it was not expected because SQL like "ALTER PUBLICATION pub_t ADD TABLE 
s1.t1" was not generated in my case. Thoughts?

Regards
Tang


Re: [HACKERS] logical decoding of two-phase transactions

2021-07-18 Thread Peter Smith
On Mon, Jul 19, 2021 at 12:43 PM Amit Kapila  wrote:
>
> On Mon, Jul 19, 2021 at 1:55 AM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > Pushed.
> >
> > I think you'd be way better off making the gid fields be "char *"
> > and pstrdup'ing the result of pq_getmsgstring.  Another possibility
> > perhaps is to use strlcpy, but I'd only go that way if it's important
> > to constrain the received strings to 200 bytes.
> >
>
> I think it is important to constrain length to 200 bytes for this case
> as here we receive a prepared transaction identifier which according
> to docs [1] has a max length of 200 bytes. Also, in
> ParseCommitRecord() and ParseAbortRecord(), we are using strlcpy with
> 200 as max length to copy prepare transaction identifier. So, I think
> it is better to use strlcpy here unless you or Peter feels otherwise.
>

OK. I have implemented a fix for this reported [1] potential buffer overrun
using the length-constraining strlcpy, because the GID limitation of 200
bytes is already mentioned in the documentation [2].

PSA.

--
[1] https://www.postgresql.org/message-id/161029.1626639923%40sss.pgh.pa.us
[2] https://www.postgresql.org/docs/devel/sql-prepare-transaction.html

Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Fix-potential-buffer-overruns.patch
Description: Binary data


Re: Added documentation for cascade and restrict option of drop statistics

2021-07-18 Thread Michael Paquier
On Sun, Jul 18, 2021 at 12:37:48PM +0900, Michael Paquier wrote:
> Indeed, good catch.  The other commands document that, so let's fix
> it.

Applied.
--
Michael


signature.asc
Description: PGP signature


RE: Added schema level support for publication.

2021-07-18 Thread tanghy.f...@fujitsu.com
On Friday, July 16, 2021 6:13 PM vignesh C  wrote:
>
> On Fri, Jul 16, 2021 at 9:25 AM Greg Nancarrow  wrote:
> >
> > Also, there seems to be an issue with ALTER PUBLICATION ... SET SCHEMA ...
> > (PubType is getting set to 'e' instead of 's'
> >
> > test_pub=# create publication pub1;
> > CREATE PUBLICATION
> > test_pub=# create table myschema.test(i int);
> > CREATE TABLE
> > test_pub=# alter publication pub1 set schema myschema;
> > ALTER PUBLICATION
> > test_pub=# \dRp pub1
> >                              List of publications
> >  Name | Owner | All tables | Inserts | Updates | Deletes | Truncates | Via root | PubType
> > ------+-------+------------+---------+---------+---------+-----------+----------+---------
> >  pub1 | gregn | f          | t       | t       | t       | t         | f        | e
> > (1 row)
> >
> > test_pub=# alter publication pub1 add table test;
> > ALTER PUBLICATION
> > test_pub=# \dRp pub1
> >                              List of publications
> >  Name | Owner | All tables | Inserts | Updates | Deletes | Truncates | Via root | PubType
> > ------+-------+------------+---------+---------+---------+-----------+----------+---------
> >  pub1 | gregn | f          | t       | t       | t       | t         | f        | t
> > (1 row)
> >
> >
> > When I use "ADD SCHEMA" instead of "SET SCHEMA" on an empty
> > publication, it seems OK.
>
> Modified.
>

Thanks for your patch. But there is a problem about "ALTER PUBLICATION … SET
TABLE …", which is similar to the issue Greg reported at [1].

For example:
postgres=# CREATE TABLE public.t1 (a int);
CREATE TABLE
postgres=# CREATE PUBLICATION pub1;
CREATE PUBLICATION
postgres=# ALTER PUBLICATION pub1 SET TABLE public.t1;
ALTER PUBLICATION
postgres=# \dRp
                                 List of publications
 Name |  Owner   | All tables | Inserts | Updates | Deletes | Truncates | Via root | PubType
------+----------+------------+---------+---------+---------+-----------+----------+---------
 pub1 | postgres | f          | t       | t       | t       | t         | f        | e
(1 row)

I think PubType in this case should be 't' instead of 'e'. Please have a look.

[1] - https://www.postgresql.org/message-id/CAJcOf-ddXvY%3DOFC54CshdMa1bswzFjG9qokjC0aFeiS%3D6CNRzw%40mail.gmail.com

Regards,
Tang




Doc necessity for superuser privileges to execute pg_import_system_collations()

2021-07-18 Thread torikoshia

Hi.

It seems that only superusers can execute pg_import_system_collations(), 
but this is not mentioned in the manual.


Since other functions that require superuser privileges describe that in 
the manual, I think it would be better to do the same for consistency.


Thoughts?

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ac6347a971..819f2c8914 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26563,6 +26563,7 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 be pg_catalog, but that is not a requirement; the
 collations could be installed into some other schema as well.  The
 function returns the number of new collation objects it created.
+This function is restricted to superusers.

   
  


Re: [HACKERS] logical decoding of two-phase transactions

2021-07-18 Thread Amit Kapila
On Mon, Jul 19, 2021 at 1:55 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > Pushed.
>
> I think you'd be way better off making the gid fields be "char *"
> and pstrdup'ing the result of pq_getmsgstring.  Another possibility
> perhaps is to use strlcpy, but I'd only go that way if it's important
> to constrain the received strings to 200 bytes.
>

I think it is important to constrain length to 200 bytes for this case
as here we receive a prepared transaction identifier which according
to docs [1] has a max length of 200 bytes. Also, in
ParseCommitRecord() and ParseAbortRecord(), we are using strlcpy with
200 as max length to copy prepare transaction identifier. So, I think
it is better to use strlcpy here unless you or Peter feels otherwise.

[1] - https://www.postgresql.org/docs/devel/sql-prepare-transaction.html

-- 
With Regards,
Amit Kapila.




Re: RFC: Logging plan of the running query

2021-07-18 Thread torikoshia
On Tue, Jul 13, 2021 at 11:11 PM Masahiro Ikeda 
 wrote:



When I tried this feature, I realized two things. So, I share them.


Thanks for your review!


(1) About output contents

> The format of the query plan is the same as when FORMAT TEXT
> and VERBOSE are used in the EXPLAIN command.
> For example:



I think the above needs to mention the COSTS and SETTINGS options too, and
it's better to use an example where the SETTINGS option has an effect, like
the following.


Agreed. Updated the patch.


(2) About the EXPLAIN "BUFFERS" option

When I checked the EXPLAIN options, I found there is another option,
"BUFFERS", which can be used without the "ANALYZE" option.

I'm not sure it's useful, because your target use-case is analyzing a
long-running query, not its planning phase. If so, the planning buffer
usage is not that useful. But since the overhead of outputting buffer
usage is not high and it's used for debugging use cases, I wonder if it's
not a bad idea to output buffer usage too. Thoughts?


As you pointed out, I also think it would be useful when queries are 
taking a long time in the planning phase.
However, as far as I read ExplainOneQuery(), the buffer usages in the 
planner phase are not retrieved by default. They are retrieved only when 
BUFFERS is specified in the EXPLAIN.


If we change it to always get the buffer usages and expose them as a 
global variable, we can get them through pg_log_current_plan(), but I 
think it doesn't pay.



Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
From 7ad9a280b6f74e4718293863716046c02b0a3835 Mon Sep 17 00:00:00 2001
From: atorik 
Date: Mon, 19 Jul 2021 10:33:02 +0900
Subject: [PATCH v6] Add function to log the untruncated query string and its
 plan for the query currently running on the backend with the specified
 process ID.

Currently, we have to wait for the query execution to finish
before we check its plan. This is not so convenient when
investigating long-running queries on production environments
where we cannot use debuggers.
To improve this situation, this patch adds
pg_log_current_query_plan() function that requests to log the
plan of the specified backend process.

Only superusers are allowed to request to log the plans because
allowing any users to issue this request at an unbounded rate
would cause lots of log messages and which can lead to denial
of service.

On receipt of the request, at the next CHECK_FOR_INTERRUPTS(),
the target backend logs its plan at LOG_SERVER_ONLY level, so
that these plans will appear in the server log but not be sent
to the client.

Since some code and comments of pg_log_current_query_plan()
are the same as those of pg_log_backend_memory_contexts(), this
patch also refactors them to make them common.

Reviewed-by: Bharath Rupireddy, Fujii Masao, Dilip Kumar, Masahiro Ikeda
---
 doc/src/sgml/func.sgml   | 46 
 src/backend/commands/explain.c   | 79 
 src/backend/storage/ipc/procsignal.c | 66 
 src/backend/tcop/postgres.c  |  4 +
 src/backend/utils/adt/mcxtfuncs.c| 53 ++---
 src/backend/utils/init/globals.c |  1 +
 src/include/catalog/pg_proc.dat  |  6 ++
 src/include/commands/explain.h   |  2 +
 src/include/miscadmin.h  |  1 +
 src/include/storage/procsignal.h |  3 +
 src/test/regress/expected/misc_functions.out | 16 +++-
 src/test/regress/sql/misc_functions.sql  | 12 +--
 12 files changed, 232 insertions(+), 57 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ac6347a971..21cd8c9fbb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24940,6 +24940,27 @@ SELECT collation for ('foo' COLLATE "de_DE");

   
 
+  
+   
+
+ pg_log_current_query_plan
+
+pg_log_current_query_plan ( pid integer )
+boolean
+   
+   
+Requests to log the plan of the query currently running on the
+backend with specified process ID along with the untruncated
+query string.
+They will be logged at LOG message level and
+will appear in the server log based on the log
+configuration set (See 
+for more information), but will not be sent to the client
+regardless of .
+Only superusers can request to log plan of the running query.
+   
+  
+
   

 
@@ -25053,6 +25074,31 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
 because it may generate a large number of log messages.

 
+   
+pg_log_current_query_plan can be used
+to log the plan of a backend process. For example:
+
+postgres=# SELECT pg_log_current_query_plan(17793);
+ pg_log_current_query_plan
+---
+ t
+(1 row)
+
+The format of the query plan is the same as when VERBOSE,
+COSTS, SETTINGS and
+FORMAT TEXT are used in the EXPLAIN
+command. 

Re: pgbench: using prepared BEGIN statement in a pipeline could cause an error

2021-07-18 Thread Yugo NAGATA
Hello Fabien,

On Sat, 17 Jul 2021 07:03:01 +0200 (CEST)
Fabien COELHO  wrote:

> 
> Hello Yugo-san,
> 
> > [...] One way to avoid these errors is to send Parse messages before 
> > pipeline mode starts. I attached a patch to fix to prepare commands at 
> > starting of a script instead of at the first execution of the command.
> 
> > What do you think?
> 
> ISTM that moving prepare out of command execution is a good idea, so I'm 
> in favor of this approach: the code is simpler and cleaner.
> 
> ISTM that a minor impact is that the preparation is not counted in the 
> command performance statistics. I do not think that it is a problem, even 
> if it would change detailed results under -C -r -M prepared.

I agree with you. Currently, whether prepares are sent in pipeline mode or
not depends on whether the first SQL command is placed between \startpipeline
and \endpipeline, regardless of whether the other commands are executed in a
pipeline or not. ISTM this behavior would not be intuitive for users.
Therefore, I think preparing all statements without using pipeline mode is
not a problem for now.
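
For illustration, at the protocol level the fix amounts to the equivalent of
the following libpq sequence -- prepare everything before entering pipeline
mode (a sketch only; pgbench drives this through its own command machinery,
and the statement names here are made up):

    /* prepare outside the pipeline, with normal Parse/Sync handling */
    PQclear(PQprepare(conn, "P0_1", "BEGIN", 0, NULL));
    PQclear(PQprepare(conn, "P0_2", "SELECT 1", 0, NULL));

    /* then execute the already-prepared statements inside the pipeline */
    PQenterPipelineMode(conn);
    PQsendQueryPrepared(conn, "P0_1", 0, NULL, NULL, NULL, 0);
    PQsendQueryPrepared(conn, "P0_2", 0, NULL, NULL, NULL, 0);
    PQpipelineSync(conn);
    /* ... consume results with PQgetResult(), then PQexitPipelineMode() ... */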

If some users would like to send prepares in a pipeline, I think it would be
better to provide a more simple and direct way. For example, we could prepare
statements in a pipeline if the user uses an option, or if the script has at
least one \startpipeline in it. Maybe a --pipeline option would be useful for
users who want to use pipeline mode for all queries in scripts, including
built-in ones. However, these features seem to be out of scope for the patch
proposed in this thread.

> Patch applies & compiles cleanly, global & local make check ok. However 
> the issue is not tested. I think that the patch should add a tap test case 
> for the problem being addressed.

Ok. I'll add a tap test to confirm the error I found is avoidable.

> I'd suggest to move the statement preparation call in the 
> CSTATE_CHOOSE_SCRIPT case.

I thought so at first, but I noticed we cannot do it at least if we are
using -C because the connection may not be established in the
CSTATE_CHOOSE_SCRIPT state. 

> In comments: not yet -> needed.

Thanks. I'll fix it.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: speed up verifying UTF-8

2021-07-18 Thread John Naylor
I wrote:

> On Fri, Jul 16, 2021 at 1:44 AM Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:
> >
> > Have you considered shift-based DFA for a portable implementation
https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 ?
>
> I did consider some kind of DFA a while back and it was too slow.

I took a closer look at this "shift-based DFA", and it seemed pretty
straightforward to implement this on top of my DFA attempt from some months
ago. The DFA technique is not a great fit with our API, since we need to
return how many bytes we found valid. On x86 (not our target for the
fallback, but convenient to test) all my attempts were either worse than
HEAD in multiple cases, or showed no improvement for the important ASCII
case. On Power8, it's more compelling, and competitive with v14, so I'll
characterize it on that platform as I describe the patch series:

0001 is a pure DFA, and has decent performance on multibyte, but terrible
on ascii.
0002 dispatches on the leading byte category, unrolls the DFA loop
according to how many valid bytes we need, and only checks the DFA state
afterwards. It's good on multibyte (3-byte, at least) but still terrible on
ascii.
0003 adds a 1-byte ascii fast path -- while robust on all inputs, it still
regresses a bit on ascii.
0004 uses the same 8-byte ascii check as previous patches do.
0005 and 0006 use combinations of 1- and 8-byte ascii checks similar to in
v17.

0005 seems the best on Power8, and is very close to v4. FWIW, v14's
measurements seem lucky and fragile -- if I change any little thing, even

- return -1;
+ return 0;

it easily loses 100-200ms on non-pure-ascii tests. That said, v14 still
seems the logical choice, unless there is some further tweak on top of v17
or v18 that gives some non-x86 platform a significant boost.
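
For anyone wanting to see the mechanism in isolation: in a shift-based DFA
every input byte selects a 64-bit table row, and the current state is a shift
amount, so a single shift-and-mask performs the transition. A toy example for
a trivial alternating "abab..." language -- not UTF-8, just the mechanics:

    #include <stdint.h>
    #include <stddef.h>

    /* states are bit offsets into a 64-bit row, 6 bits per state field */
    #define S_EXPECT_A  0
    #define S_EXPECT_B  6
    #define S_ERR      12

    static uint64_t dfa_table[256];

    static void
    build_dfa_table(void)
    {
        uint64_t all_err = (uint64_t) S_ERR |
            ((uint64_t) S_ERR << S_EXPECT_B) | ((uint64_t) S_ERR << S_ERR);

        for (int i = 0; i < 256; i++)
            dfa_table[i] = all_err;
        dfa_table['a'] = (uint64_t) S_EXPECT_B |
            ((uint64_t) S_ERR << S_EXPECT_B) | ((uint64_t) S_ERR << S_ERR);
        dfa_table['b'] = (uint64_t) S_ERR |
            ((uint64_t) S_EXPECT_A << S_EXPECT_B) | ((uint64_t) S_ERR << S_ERR);
    }

    static int
    is_valid_ab(const unsigned char *s, size_t len)
    {
        uint64_t state = S_EXPECT_A;

        for (size_t i = 0; i < len; i++)
            state = (dfa_table[s[i]] >> state) & 63;    /* branch-free step */
        return state == S_EXPECT_A;
    }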

Power8, gcc 4.8:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v18-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1257 |  1681 |  1385 |    1744 |   2018

v18-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     951 |  1381 |  1217 |    1469 |   1172

v18-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     911 |       |   942 |    1112 |    865

v18-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     987 |   730 |   222 |    1325 |   2306

v18-0005:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     962 |   664 |   180 |     928 |   1179

v18-0006:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     908 |   663 |   244 |    1026 |   1464

and for comparison,

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     888 |   607 |   179 |     777 |   1328

v17-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1205 |   662 |   180 |     767 |   1256


Macbook, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v18-0001:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1334 |  2713 |  2802 |    2665 |   2541

v18-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     733 |  1212 |  1064 |    1034 |   1007

v18-0003:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     653 |   560 |   370 |     420 |    465

v18-0004:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     574 |   402 |    88 |     584 |   1033

v18-0005:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1345 |   730 |   334 |     578 |    909

v18-0006:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   485 |   153 |     594 |    989

and for comparison,

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     674 |   346 |    78 |     309 |    504

v17-0002:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     516 |   324 |    78 |     331 |    544

-- 
John Naylor
EDB: http://www.enterprisedb.com


v18-0001-Use-pure-DFA.patch
Description: Binary data


v18-0002-Unroll-loop-in-DFA.patch
Description: Binary data


v18-0004-Check-ascii-8-bytes-at-a-time-with-bitwise-opera.patch
Description: Binary data


v18-0003-Add-ascii-fast-path-before-resorting-to-DFA.patch
Description: Binary data


v18-0005-Do-1-byte-ascii-check-if-8-byte-check-fails.patch
Description: Binary data


v18-0006-Do-8-byte-check-only-if-1-byte-check-succeeds.patch
Description: Binary data


Re: 回复: Why is XLOG_FPI_FOR_HINT always need backups?

2021-07-18 Thread Kyotaro Horiguchi
At Sat, 17 Jul 2021 00:14:34 +0900, Fujii Masao  
wrote in 
> Thanks for updating the patch! It basically looks good to me.
> 
>* Full-page image (FPI) records contain nothing else but a 
> backup
>* block (or multiple backup blocks). Every block reference must
>* include a full-page image - otherwise there would be no 
> point in
>* this record.
> 
> The above comment also needs to be updated?

In short, no.  In contrast to the third paragraph, the first paragraph
should be read as describing XLOG_FPI.  However, that is admittedly not
super obvious, so it's better to make it clearer. In addition, it seems
to me (yes, to *me*) that the paragraph is somewhat confused between "block
reference", "backup block" and "full-page image". So I'd like to
adjust the paragraph as follows.

> * XLOG_FPI records contain nothing else but one or more block
> * references. Every block reference must include a full-page image
> * regardless of the full_page_writes setting - otherwise there would
> * be no point in this record.

FYI (or, for the record), the first paragraph got into its current shape
with commit 9155580fd5, where XLOG_FPI was modified to be able to
hold more than one block reference.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From cd54ae028878fac0a97baebf71e39cb9b518e885 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 7 Jul 2021 15:34:41 +0900
Subject: [PATCH v3] Make FPI_FOR_HINT follow standard FPI emitting policy

Commit 2c03216d831160bedd72d45f712601b6f7d03f1c changed
XLogSaveBufferForHint to enforce FPI but the caller didn't intend to
do so.  Restore the intended behavior that FPI_FOR_HINT follows
full_page_writes.
---
 src/backend/access/transam/xlog.c   | 30 +
 src/backend/access/transam/xloginsert.c |  5 +
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ee9515139..ff10fef0d3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10148,7 +10148,10 @@ xlog_redo(XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 	XLogRecPtr	lsn = record->EndRecPtr;
 
-	/* in XLOG rmgr, backup blocks are only used by XLOG_FPI records */
+	/*
+	 * in XLOG rmgr, backup blocks are only used by XLOG_FPI and
+	 * XLOG_FPI_FOR_HINT records.
+	 */
 	Assert(info == XLOG_FPI || info == XLOG_FPI_FOR_HINT ||
 		   !XLogRecHasAnyBlockRefs(record));
 
@@ -10356,25 +10359,34 @@ xlog_redo(XLogReaderState *record)
 	else if (info == XLOG_FPI || info == XLOG_FPI_FOR_HINT)
 	{
 		/*
-		 * Full-page image (FPI) records contain nothing else but a backup
-		 * block (or multiple backup blocks). Every block reference must
-		 * include a full-page image - otherwise there would be no point in
-		 * this record.
+		 * XLOG_FPI records contain nothing else but one or more block
+		 * references. Every block reference must include a full-page image
+		 * regardless of the full_page_writes setting - otherwise there would
+		 * be no point in this record.
 		 *
 		 * No recovery conflicts are generated by these generic records - if a
 		 * resource manager needs to generate conflicts, it has to define a
 		 * separate WAL record type and redo routine.
 		 *
 		 * XLOG_FPI_FOR_HINT records are generated when a page needs to be
-		 * WAL- logged because of a hint bit update. They are only generated
-		 * when checksums are enabled. There is no difference in handling
-		 * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
-		 * code just to distinguish them for statistics purposes.
+		 * WAL-logged because of a hint bit update. They are only generated
+		 * when checksums and/or wal_log_hints are enabled. The only difference
+		 * in handling XLOG_FPI and XLOG_FPI_FOR_HINT records is that the
+		 * latter is allowed to be missing actual block image. In that case the
+		 * record is effectively a NOP record and should not even try to read
+		 * the page from disk.
 		 */
 		for (uint8 block_id = 0; block_id <= record->max_block_id; block_id++)
 		{
 			Buffer		buffer;
 
+			if (!XLogRecHasBlockImage(record, block_id))
+			{
+				if (info == XLOG_FPI)
+					elog(ERROR, "missing full page image in XLOG_FPI record");
+				continue;
+			}
+
 			if (XLogReadBufferForRedo(record, block_id, &buffer) != BLK_RESTORED)
 elog(ERROR, "unexpected XLogReadBufferForRedo result when restoring backup block");
 			UnlockReleaseBuffer(buffer);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 3d2c9c3e8c..e596a0470a 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -287,8 +287,6 @@ XLogRegisterBlock(uint8 block_id, RelFileNode *rnode, ForkNumber forknum,
 {
 	registered_buffer *regbuf;
 
-	/* This is currently only used to WAL-log a full-page image of a page */
-	As

Re: Remove redundant strlen call in ReplicationSlotValidateName

2021-07-18 Thread Ranier Vilela
On Sun, 18 Jul 2021 at 21:23, Peter Smith wrote:

>
>
> On Sun, Jul 18, 2021 at 11:09 PM Ranier Vilela 
> wrote:
>
>> ...
>>
>> I did the patch, but to my surprise, the results weren't so good.
>> Despite claiming only a tiny improvement in performance, I didn't expect
>> any slowdown.
>> I put a counter in pg_regress.c, summing the results of each test and did
>> it three times for HEAD and for the patch.
>> Some tests were better, but others were bad.
>> Tests like comments, for example, show 180%, combocid 174%, dbsize 165%,
>> xmlmap 136%, lock 134%.
>>
>> ... ...
>>
>> So I'm posting the patch here, merely as an illustration of my findings.
>> Perhaps someone with a better understanding of the process of translating
>> C to asm can have an explanation.
>> Is it worth it to change only where there has been improvement?
>>
>>
> My guess is that your hypothetical performance improvement has been
> completely swamped by the natural variations of each run.
>
> For example,
> drop_if_exists 115 84 83 94
> 138 63 57 86 109,30%
>
> Those numbers are all over the place, so I doubt the results are really
> saying anything at all about what is better/worse, because I think you have
> zero chance to notice a couple of nanoseconds of improvement within the
> noise when each run is varying from 57 to 138 ms.
>
> IMO the only conclusion you can draw from your results is that any
> performance gain is too small to be observable.
>
Thanks Peter for your explanations.

I can conclude, then, that the test results are not a reliable reference for
performance regressions.
So the patch serves as a refactoring only, without any further performance claim.

regards,
Ranier Vilela


Re: O_DIRECT on macOS

2021-07-18 Thread Thomas Munro
On Tue, Jul 13, 2021 at 1:56 PM Andres Freund  wrote:
> On 2021-07-13 13:25:50 +1200, Thomas Munro wrote:
> > I'm planning to go with that idea (#1), if there are no objections.
>
> The only other viable approach I see is to completely separate our
> internal flag representation from the OS representation and do the whole
> mapping inside fd.c - but that seems like a too big hammer right now.

Agreed.  Pushed!

For the record, Solaris has directio() that could be handled the same
way.  I'm not planning to look into that myself, but patches welcome.
Illumos (née OpenSolaris) got with the programme and added O_DIRECT.
Of our 10-or-so target systems I guess that'd leave just HPUX (judging
by an old man page found on the web) and OpenBSD with no direct I/O
support.
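
For context, the macOS approach works by disabling kernel caching on an
already-open descriptor rather than via an open(2) flag. A minimal sketch of
that idea (hypothetical helper, not the committed fd.c code):

/*
 * Minimal sketch, not the committed code: macOS has no O_DIRECT, but a
 * similar effect is available per-descriptor via F_NOCACHE.
 */
#include <fcntl.h>
#include <unistd.h>

static int
open_uncached(const char *path)
{
	int			fd = open(path, O_RDWR);

	if (fd < 0)
		return -1;
#if defined(F_NOCACHE)
	if (fcntl(fd, F_NOCACHE, 1) < 0)
	{
		close(fd);
		return -1;
	}
#endif
	return fd;
}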




Re: Implementing Incremental View Maintenance

2021-07-18 Thread Yugo NAGATA
On Wed, 14 Jul 2021 21:22:37 +0530
vignesh C  wrote:

> On Mon, May 17, 2021 at 10:08 AM Yugo NAGATA  wrote:
> >
> > On Fri, 7 May 2021 14:14:16 +0900
> > Yugo NAGATA  wrote:
> >
> > > On Mon, 26 Apr 2021 16:03:48 +0900
> > > Yugo NAGATA  wrote:
> > >
> > > > On Mon, 26 Apr 2021 15:46:21 +0900
> > > > Yugo NAGATA  wrote:
> > > >
> > > > > On Tue, 20 Apr 2021 09:51:34 +0900
> > > > > Yugo NAGATA  wrote:
> > > > >
> > > > > > On Mon, 19 Apr 2021 17:40:31 -0400
> > > > > > Tom Lane  wrote:
> > > > > >
> > > > > > > Andrew Dunstan  writes:
> > > > > > > > This patch (v22c) just crashed for me with an assertion failure 
> > > > > > > > on
> > > > > > > > Fedora 31. Here's the stack trace:
> > > > > > >
> > > > > > > > #2  0x0094a54a in ExceptionalCondition
> > > > > > > > (conditionName=conditionName@entry=0xa91dae 
> > > > > > > > "queryDesc->sourceText !=
> > > > > > > > NULL", errorType=errorType@entry=0x99b468 "FailedAssertion",
> > > > > > > > fileName=fileName@entry=0xa91468
> > > > > > > > "/home/andrew/pgl/pg_head/src/backend/executor/execMain.c",
> > > > > > > > lineNumber=lineNumber@entry=199) at
> > > > > > > > /home/andrew/pgl/pg_head/src/backend/utils/error/assert.c:69
> > > > > > >
> > > > > > > That assert just got added a few days ago, so that's why the patch
> > > > > > > seemed OK before.
> > > > > >
> > > > > > Thank you for letting me know. I'll fix it.
> > > > >
> > > > > Attached is the fixed patch.
> > > > >
> > > > > queryDesc->sourceText cannot be NULL after commit b2668d8,
> > > > > so now we pass an empty string "" for refresh_matview_datafill() 
> > > > > instead NULL
> > > > > when maintaining views incrementally.
> > > >
> > > > I am sorry, I forgot to include a fix for 8aba9322511.
> > > > Attached is the fixed version.
> > >
> > > Attached is the rebased patch (for 6b8d29419d).
> >
> > I attached a rebased patch.
> 
> The patch does not apply on Head anymore, could you rebase and post a
> patch. I'm changing the status to "Waiting for Author".

Ok. I'll update the patch in a few days.

Regards,
Yugo Nagata


-- 
Yugo NAGATA 




Re: Remove redundant strlen call in ReplicationSlotValidateName

2021-07-18 Thread Peter Smith
On Sun, Jul 18, 2021 at 11:09 PM Ranier Vilela  wrote:

> ...
>
> I did the patch, but to my surprise, the results weren't so good.
> Despite claiming only a tiny improvement in performance, I didn't expect
> any slowdown.
> I put a counter in pg_regress.c, summing the results of each test and did
> it three times for HEAD and for the patch.
> Some tests were better, but others were bad.
> Tests like comments, for example, show 180%, combocid 174%, dbsize 165%,
> xmlmap 136%, lock 134%.
>
> ... ...
>
> So I'm posting the patch here, merely as an illustration of my findings.
> Perhaps someone with a better understanding of the process of translating
> C to asm can have an explanation.
> Is it worth it to change only where there has been improvement?
>
>
My guess is that your hypothetical performance improvement has been
completely swamped by the natural variations of each run.

For example,
drop_if_exists 115 84 83 94
138 63 57 86 109,30%

Those numbers are all over the place, so I doubt the results are really
saying anything at all about what is better/worse, because I think you have
zero chance to notice a couple of nanoseconds of improvement within the
noise when each run is varying from 57 to 138 ms.

IMO the only conclusion you can draw from your results is that any
performance gain is too small to be observable.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


Re: [PATCH] Allow multiple recursive self-references

2021-07-18 Thread David Fetter
On Thu, Jul 15, 2021 at 09:18:00AM +0200, Denis Hirn wrote:
> > The patch does not apply on Head anymore, could you rebase and post a patch.
> 
> Sure thing. Here's the new patch.
> 
> Best wishes,

Thanks for updating this.

In the next version of the patch, would you be so kind as to include
updated documentation of the feature and at least one regression test
of same?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Failure with 004_logrotate in prairiedog

2021-07-18 Thread Tom Lane
Michael Paquier  writes:
> On Sun, Jul 18, 2021 at 01:42:18AM -0400, Tom Lane wrote:
>> Awhile back we discovered that old macOS versions have non-atomic rename
>> [1].  I eventually shut down dromedary because that was causing failures
>> often enough to be annoying.  I'd not seen such a failure before on
>> prairiedog, but it sure seems plausible that this is one.

> Thanks for the pointer.  This indeed looks like the same problem.

For context, dromedary (now florican) is a dual-CPU machine while
prairiedog has but one CPU.  I'd thought that maybe not being
multi-CPU insulated prairiedog from the non-atomic-rename problem.
But now it looks like it's merely a lot less probable on that hardware.

regards, tom lane




Re: Failure with 004_logrotate in prairiedog

2021-07-18 Thread Michael Paquier
On Sun, Jul 18, 2021 at 01:42:18AM -0400, Tom Lane wrote:
> Awhile back we discovered that old macOS versions have non-atomic rename
> [1].  I eventually shut down dromedary because that was causing failures
> often enough to be annoying.  I'd not seen such a failure before on
> prairiedog, but it sure seems plausible that this is one.

Thanks for the pointer.  This indeed looks like the same problem.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] logical decoding of two-phase transactions

2021-07-18 Thread Tom Lane
Amit Kapila  writes:
> Pushed.

Coverity thinks this has security issues, and I agree.

/srv/coverity/git/pgsql-git/postgresql/src/backend/replication/logical/proto.c: 
144 in logicalrep_read_begin_prepare()
143 /* read gid (copy it into a pre-allocated buffer) */
>>> CID 1487517:  Security best practices violations  (STRING_OVERFLOW)
>>> You might overrun the 200-character fixed-size string "begin_data->gid" 
>>> by copying the return value of "pq_getmsgstring" without checking the 
>>> length.
144 strcpy(begin_data->gid, pq_getmsgstring(in));

200 /* read gid (copy it into a pre-allocated buffer) */
>>> CID 1487515:  Security best practices violations  (STRING_OVERFLOW)
>>> You might overrun the 200-character fixed-size string 
>>> "prepare_data->gid" by copying the return value of "pq_getmsgstring" 
>>> without checking the length.
201 strcpy(prepare_data->gid, pq_getmsgstring(in));

256 /* read gid (copy it into a pre-allocated buffer) */
>>> CID 1487516:  Security best practices violations  (STRING_OVERFLOW)
>>> You might overrun the 200-character fixed-size string 
>>> "prepare_data->gid" by copying the return value of "pq_getmsgstring" 
>>> without checking the length.
257 strcpy(prepare_data->gid, pq_getmsgstring(in));

316 /* read gid (copy it into a pre-allocated buffer) */
>>> CID 1487519:  Security best practices violations  (STRING_OVERFLOW)
>>> You might overrun the 200-character fixed-size string 
>>> "rollback_data->gid" by copying the return value of "pq_getmsgstring" 
>>> without checking the length.
317 strcpy(rollback_data->gid, pq_getmsgstring(in));

I think you'd be way better off making the gid fields be "char *"
and pstrdup'ing the result of pq_getmsgstring.  Another possibility
perhaps is to use strlcpy, but I'd only go that way if it's important
to constrain the received strings to 200 bytes.
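
For illustration, the strlcpy variant would amount to something like the
following sketch (hypothetical struct and field names, not the actual
protocol structs):

/*
 * Hedged sketch of the bounded-copy idea: constrain the received string
 * to the fixed-size buffer instead of strcpy'ing it blindly.  Everything
 * here is illustrative; strlcpy truncates silently, so a caller that
 * wants an error on overlength input must check the length itself.
 */
#include "postgres.h"

#define SKETCH_GID_LEN 200

typedef struct GidHolderSketch
{
	char		gid[SKETCH_GID_LEN];
} GidHolderSketch;

static void
copy_gid(GidHolderSketch *data, const char *received)
{
	strlcpy(data->gid, received, sizeof(data->gid));
}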

regards, tom lane




Re: unnesting multirange data types

2021-07-18 Thread Alexander Korotkov
On Thu, Jul 15, 2021 at 10:27 PM Jonathan S. Katz  wrote:
> On 7/15/21 12:26 PM, Alexander Korotkov wrote:
> > On Thu, Jul 15, 2021 at 6:47 PM Tom Lane  wrote:
> >> Yeah, that seems pretty horrid.  I still don't like the way the
> >> array casts were done, but I'd be okay with pushing the unnest
> >> addition.
> >
> > +1 for just unnest().
>
> ...which was my original ask at the beginning of the thread :) So +1.

Thanks for the feedback.  I've pushed the unnest() patch to master and
pg14.  I initially forgot to change catversion.h for master, so I did
that in an additional commit.

I've double-checked that "make check-world" passes on my machine for
both master and pg14.  And I'm keeping my fingers crossed while watching
the buildfarm.

--
Regards,
Alexander Korotkov




Re: cleaning up PostgresNode.pm

2021-07-18 Thread Andrew Dunstan

On 7/18/21 11:48 AM, Andrew Dunstan wrote:
> On 7/16/21 3:32 PM, Andrew Dunstan wrote:
>> On 6/28/21 1:02 PM, Andrew Dunstan wrote:
>>> On 4/24/21 3:14 PM, Alvaro Herrera wrote:
 On 2021-Apr-24, Andrew Dunstan wrote:

> I would like to undertake some housekeeping on PostgresNode.pm.
>
> 1. OO modules in perl typically don't export anything. We should remove
> the export settings. That would mean that clients would have to call
> "PostgresNode->get_new_node()" (but see item 2) and
> "PostgresNode::get_free_port()" instead of the unadorned calls they use 
> now.
 +1

> 2. There are two constructors, new() and get_new_node(). AFAICT nothing
> in our tests uses new(), and they almost certainly shouldn't anyway.
> get_new_node() calls new() to do some work, and I'd like to merge these
> two. The name of a constructor in perl is conventionally "new" as it is
> in many other OO languages, although in perl this can't apply where a
> class provides more than one constructor. Still, if we're merging them
> then the preference would be to call the merged function "new". Since
> we'd proposing to modify the calls anyway (see item 1) this shouldn't
> impose a huge extra workload.
 +1 on "new".  I think we weren't 100% clear on where we wanted it to go
 initially, but it's now clear that get_new_node() is the constructor,
 and that new() is merely a helper.  So let's rename them in a sane way.

> Another item that needs looking at is the consistent use of Carp.
> PostgresNode, TestLib and RecursiveCopy all use the Carp module, but
> contain numerous calls to "die" where they should probably have calls to
> "croak" or "confess".
 I wonder if it would make sense to think of PostgresNode as a feeder of
 sorts to Test::More and the like, so make it use diag(), note(),
 explain().

>>>
>>> Here is a set of small(ish) patches that does most of the above and then
>>> some.
>>>
>>>
>>> Patch 1 adds back the '-w' flag to pg_ctl in the start() method. It's
>>> redundant on modern versions of Postgres but it's harmless, and helps
>>> with subclassing for older versions where it wasn't the default.
>>>
>>> Patch 2 adds a method for altering config files as opposed to just
>>> appending to them. Again, this helps a lot in subclassing for older
>>> versions, which can call the parent's init() and then adjust whatever
>>> doesn't work.
>>>
>>> Patch 3 unifies the constructor methods and stops exporting a
>>> constructor. There is one constructor: PostgresNode::new()
>>>
>>> Patch 4 removes what's left of Exporter in PostgresNode, so it becomes a
>>> pure OO style module.
>>>
>>> Patch 5 adds a method for getting the major version string from a
>>> PostgresVersion object, again useful in subclassing.
>>>
>>> Patch 6 adds a method for getting the install_path of a PostgresNode
>>> object. While not strictly necessary it's consistent with other fields
>>> that have getter methods. Clients should not pry into the internals of
>>> objects. Experience has shown this method to be useful.
>>>
>>> Patches 7 8 and 9 contain additions to Patch 3 for things that I
>>> overlooked or that were not present when I originally prepared the
>>> patches. They would be applied alongside Patch 3, not separately.
>>>
>>>
>>>
>>> These patches are easily broken by e.g. the addition of a new TAP test
>>> or the modification of an existing test. So I'm hoping to get these
>>> added soon. I will add this email to the CF.
>>>
>>>
>> New version with a small change to fix bitrot.
>>
>>
>
> New set with fixups incorporated and review comments attended to. I'm
> intending to apply this later this week.
>


This time without a missing comma.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

>From 96949ade39118aa748210ae2e642661658797160 Mon Sep 17 00:00:00 2001
From: Andrew Dunstan 
Date: Sun, 18 Jul 2021 14:11:44 -0400
Subject: [PATCH 1/6] Add -w back to the flags for pg_ctl (re)start in
 PostgresNode

This is now the default for pg_ctl, but having the flag here explicitly
does no harm and helps with backwards compatibility of the PostgresNode
module.
---
 src/test/perl/PostgresNode.pm | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index ed5b4a1c4b..eb8b18d232 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -805,7 +805,9 @@ sub start
 
 	# Note: We set the cluster_name here, not in postgresql.conf (in
 	# sub init) so that it does not get copied to standbys.
-	$ret = TestLib::system_log('pg_ctl', '-D', $self->data_dir, '-l',
+	# -w is now the default but having it here does no harm and helps
+	# compatibility with older versions.
+	$ret = TestLib::system_log('pg_ctl', '-w', '-D', $self->data_dir, '-l',
 		$self->logfile, '-o', "--cluster-name=$name", 'start');
 
 	if ($ret != 0)
@@ -919,7 +921,

Re: slab allocator performance issues

2021-07-18 Thread Tomas Vondra

On 7/18/21 3:06 AM, Andres Freund wrote:

Hi,

On 2021-07-17 16:10:19 -0700, Andres Freund wrote:

Instead of populating a linked list with all chunks upon creation of a block -
which requires touching a fair bit of memory - keep a per-block pointer (or an
offset) into "unused" area of the block. When allocating from the block and
theres still "unused" memory left, use that, instead of bothering with the
freelist.

I tried that, and it nearly got slab up to the allocation/freeing performance
of aset.c (while winning after allocation, due to the higher memory density).


Combining that with limiting the number of freelists, and some
microoptimizations, allocation performance is now on par.

Freeing still seems to be a tad slower, mostly because SlabFree()
practically is immediately stalled on fetching the block, whereas
AllocSetFree() can happily speculate ahead and do work like computing
the freelist index. And then aset only needs to access memory inside the
context - which is much more likely to be in cache than a freelist
inside a block (there are many more).

But that's ok, I think. It's close and it's only a small share of the
overall runtime of my workload...
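
For illustration, the per-block "unused area" idea above amounts to something
like this sketch (hypothetical struct and field names, not the actual slab.c
code):

/*
 * Hedged sketch only: each block remembers how much of its chunk area
 * has never been handed out; allocation prefers that space and falls
 * back to the freelist of returned chunks.
 */
#include <stddef.h>

typedef struct SlabBlockSketch
{
	char	   *start;			/* start of the chunk area */
	int			chunk_size;		/* size of each chunk in bytes */
	int			nchunks;		/* chunks the block can hold */
	int			unused_off;		/* index of first never-used chunk */
	void	   *freelist;		/* freed chunks, linked through first word */
} SlabBlockSketch;

static void *
sketch_alloc_from_block(SlabBlockSketch *block)
{
	/* Prefer untouched space: nothing to populate or walk. */
	if (block->unused_off < block->nchunks)
	{
		char	   *chunk = block->start +
			(size_t) block->chunk_size * block->unused_off;

		block->unused_off++;
		return chunk;
	}

	/* Otherwise pop a previously freed chunk, if any. */
	if (block->freelist != NULL)
	{
		void	   *chunk = block->freelist;

		block->freelist = *(void **) chunk;
		return chunk;
	}

	return NULL;				/* block is full */
}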



Sounds great! Thanks for investigating this and for the improvements.

It might be good to do some experiments to see how the changes affect 
memory consumption for practical workloads. I'm willing to spend some
time on that, if needed.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: cleaning up PostgresNode.pm

2021-07-18 Thread Andrew Dunstan

On 7/16/21 3:32 PM, Andrew Dunstan wrote:
> On 6/28/21 1:02 PM, Andrew Dunstan wrote:
>> On 4/24/21 3:14 PM, Alvaro Herrera wrote:
>>> On 2021-Apr-24, Andrew Dunstan wrote:
>>>
 I would like to undertake some housekeeping on PostgresNode.pm.

 1. OO modules in perl typically don't export anything. We should remove
 the export settings. That would mean that clients would have to call
 "PostgresNode->get_new_node()" (but see item 2) and
 "PostgresNode::get_free_port()" instead of the unadorned calls they use 
 now.
>>> +1
>>>
 2. There are two constructors, new() and get_new_node(). AFAICT nothing
 in our tests uses new(), and they almost certainly shouldn't anyway.
 get_new_node() calls new() to do some work, and I'd like to merge these
 two. The name of a constructor in perl is conventionally "new" as it is
 in many other OO languages, although in perl this can't apply where a
 class provides more than one constructor. Still, if we're merging them
 then the preference would be to call the merged function "new". Since
 we'd proposing to modify the calls anyway (see item 1) this shouldn't
 impose a huge extra workload.
>>> +1 on "new".  I think we weren't 100% clear on where we wanted it to go
>>> initially, but it's now clear that get_new_node() is the constructor,
>>> and that new() is merely a helper.  So let's rename them in a sane way.
>>>
 Another item that needs looking at is the consistent use of Carp.
 PostgresNode, TestLib and RecursiveCopy all use the Carp module, but
 contain numerous calls to "die" where they should probably have calls to
 "croak" or "confess".
>>> I wonder if it would make sense to think of PostgresNode as a feeder of
>>> sorts to Test::More and the like, so make it use diag(), note(),
>>> explain().
>>>
>>
>>
>> Here is a set of small(ish) patches that does most of the above and then
>> some.
>>
>>
>> Patch 1 adds back the '-w' flag to pg_ctl in the start() method. It's
>> redundant on modern versions of Postgres but it's harmless, and helps
>> with subclassing for older versions where it wasn't the default.
>>
>> Patch 2 adds a method for altering config files as opposed to just
>> appending to them. Again, this helps a lot in subclassing for older
>> versions, which can call the parent's init() and then adjust whatever
>> doesn't work.
>>
>> Patch 3 unifies the constructor methods and stops exporting a
>> constructor. There is one constructor: PostgresNode::new()
>>
>> Patch 4 removes what's left of Exporter in PostgresNode, so it becomes a
>> pure OO style module.
>>
>> Patch 5 adds a method for getting the major version string from a
>> PostgresVersion object, again useful in subclassing.
>>
>> Patch 6 adds a method for getting the install_path of a PostgresNode
>> object. While not strictly necessary it's consistent with other fields
>> that have getter methods. Clients should not pry into the internals of
>> objects. Experience has shown this method to be useful.
>>
>> Patches 7 8 and 9 contain additions to Patch 3 for things that I
>> overlooked or that were not present when I originally prepared the
>> patches. They would be applied alongside Patch 3, not separately.
>>
>>
>>
>> These patches are easily broken by e.g. the addition of a new TAP test
>> or the modification of an existing test. So I'm hoping to get these
>> added soon. I will add this email to the CF.
>>
>>
>
> New version with a small change to fix bitrot.
>
>


New set with fixups incorporated and review comments attended to. I'm
intending to apply this later this week.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

>From 4bf7696112af995b110e6f9f6dab21d3ec40077f Mon Sep 17 00:00:00 2001
From: Andrew Dunstan 
Date: Sun, 18 Jul 2021 11:29:18 -0400
Subject: [PATCH 1/6] Add -w back to the flags for pg_ctl (re)start in
 PostgresNode

This is now the default for pg_ctl, but having the flag here explicitly
does no harm and helps with backwards compatibility of the PostgresNode
module.
---
 src/test/perl/PostgresNode.pm | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index ed5b4a1c4b..aefb685dae 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -805,7 +805,9 @@ sub start
 
 	# Note: We set the cluster_name here, not in postgresql.conf (in
 	# sub init) so that it does not get copied to standbys.
-	$ret = TestLib::system_log('pg_ctl', '-D', $self->data_dir, '-l',
+	# -w is now the default but having it here does no harm and helps
+	# compatibility with older versions.
+	$ret = TestLib::system_log('pg_ctl', '-w', '-D', $self->data_dir, '-l',
 		$self->logfile, '-o', "--cluster-name=$name", 'start');
 
 	if ($ret != 0)
@@ -919,7 +921,9 @@ sub restart
 
 	print "### Restarting node \"$name\"\n";
 
-	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
+	# -w is now the default b

Re: Toast compression method options

2021-07-18 Thread Dilip Kumar
On Wed, Jul 14, 2021 at 5:35 PM vignesh C  wrote:
>
> On Thu, May 6, 2021 at 7:24 PM Dilip Kumar  wrote
>
> The patch does not apply on Head anymore, could you rebase and post a
> patch. I'm changing the status to "Waiting for Author".

Okay, I will rebase and send it by next week.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: corruption of WAL page header is never reported

2021-07-18 Thread Yugo NAGATA
On Sat, 17 Jul 2021 18:40:02 -0300
Ranier Vilela  wrote:

> On Sat, 17 Jul 2021 at 16:57, Yugo NAGATA wrote:
> 
> > Hello,
> >
> > I found that any corruption of WAL page header found during recovery is
> > never
> > reported in log messages. If wal page header is broken, it is detected in
> > XLogReaderValidatePageHeader called from  XLogPageRead, but the error
> > messages
> > are always reset and never reported.
> >
> > if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
> > readBuf))
> > {
> >/* reset any error XLogReaderValidatePageHeader() might
> > have set */
> >xlogreader->errormsg_buf[0] = '\0';
> >goto next_record_is_invalid;
> > }
> >
> > Since the commit 06687198018, we call XLogReaderValidatePageHeader here so
> > that
> > we can check a page header and retry immediately if it's invalid, but the
> > error
> > message is reset immediately and not reported. I guess the reason why the
> > error
> > message is reset is because we might get the right WAL after some retries.
> > However, I think it is better to report the error for each check in order
> > to let
> > users know the actual issues founded in the WAL.
> >
> > I attached a patch to fix it in this way.
> >
> I think that, to keep the same behavior as before, it is necessary to always run:
> 
> /* reset any error XLogReaderValidatePageHeader() might have set */
> xlogreader->errormsg_buf[0] = '\0';
> 
> not?

If we are not in StandbyMode, the check is not retried, and an error is returned
immediately. So, I think we don't have to display an error message in such
cases, nor reset it. Instead, it would be better to leave the error message
handling to the caller of XLogReadRecord.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: CREATE COLLATION - check for duplicate options and error out if found one

2021-07-18 Thread Dean Rasheed
On Sat, 17 Jul 2021 at 05:24, Bharath Rupireddy
 wrote:
>
> I also think that if it is specified as CREATE FUNCTION ... STRICT
> STRICT, CREATE FUNCTION ... CALLED ON NULL INPUT RETURNS NULL ON NULL
> INPUT etc. isn't the syntaxer catching that error while parsing the
> SQL text, similar to CREATE COLLATION mycoll1 FROM FROM "C";?

No, they're processed quite differently. The initial parsing of CREATE
FUNCTION allows an arbitrary list of things like STRICT, CALLED ON
NULL INPUT, etc., which it turns into a list of DefElem that is only
checked later on. That's the most natural way to do it, since this is
really just a list of options that can appear in any order, much like
the version of CREATE COLLATION that allows options in parentheses,
which is quite different from the version that takes a single FROM.
Reading the relevant portions of gram.y is probably the easiest way to
understand it.
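
As a rough illustration, the later check over such a DefElem list typically
looks something like the sketch below (hypothetical option name and
simplified error reporting, not the committed code):

/*
 * Hedged sketch of walking a DefElem list and rejecting duplicates;
 * the option name and error wording are illustrative only.
 */
#include "postgres.h"

#include "nodes/parsenodes.h"
#include "nodes/pg_list.h"

static void
process_options_sketch(List *parameters)
{
	DefElem    *strictEl = NULL;
	ListCell   *pl;

	foreach(pl, parameters)
	{
		DefElem    *defel = lfirst_node(DefElem, pl);

		if (strcmp(defel->defname, "strict") == 0)
		{
			if (strictEl != NULL)
				ereport(ERROR,
						(errcode(ERRCODE_SYNTAX_ERROR),
						 errmsg("option \"%s\" specified more than once",
								defel->defname)));
			strictEl = defel;
		}
		/* ... other recognized options would be handled similarly ... */
	}
}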

It's actually quite instructive to search for "makeDefElem" in gram.y,
and see all the places that create a DefElem that doesn't match the
user-entered syntax. There are quite a few of them, and there may be
others elsewhere.

Regards,
Dean




Re: CREATE COLLATION - check for duplicate options and error out if found one

2021-07-18 Thread Dean Rasheed
On Fri, 16 Jul 2021 at 12:17, Dean Rasheed  wrote:
>
> On Fri, 16 Jul 2021 at 10:26, Bharath Rupireddy
>  wrote:
> >
> > Thanks. The v5 patch LGTM.
>
> OK, I'll push that in a while.
>

Pushed, with some additional tidying up.

In particular, I decided it was neater to follow the style of the code
in typecmds.c, and just do a single check for duplicates at the end of
the loop, since that makes for a significantly smaller patch, with
less code duplication. That, of course, means duplicate "from" options
are handled the same as any other option, but that's arguably more
consistent, and not particularly important anyway, since it's not a
documented syntax.

Regards,
Dean