Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?

2024-07-10 Thread Masahiko Sawada
On Wed, Jul 10, 2024 at 4:14 PM Fujii Masao  wrote:
>
>
>
> On 2024/07/10 12:13, Masahiko Sawada wrote:
> > On Sat, Jul 6, 2024 at 4:06 PM Fujii Masao  
> > wrote:
> >>
> >> Hi,
> >>
> >> I noticed that ALTER TABLE MERGE PARTITIONS and SPLIT PARTITION commands
> >> always create new partitions in the default tablespace, regardless of
> >> the parent's tablespace. However, the indexes of these new partitions 
> >> inherit
> >> the tablespaces of their parent indexes. This inconsistency seems odd.
> >> Is this an oversight or intentional?
> >>
> >> Here are the steps I used to test this:
> >>
> >> ---
> >> CREATE TABLESPACE tblspc LOCATION '/tmp/tblspc';
> >> CREATE TABLE t (i int PRIMARY KEY USING INDEX TABLESPACE tblspc)
> >> PARTITION BY RANGE (i) TABLESPACE tblspc;
> >>
> >> CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
> >> CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);
> >>
> >> ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;
> >>
> >> SELECT tablename, tablespace FROM pg_tables WHERE tablename IN ('t', 
> >> 'tp_0_2') ORDER BY tablename;
> >>  tablename | tablespace
> >> -----------+------------
> >>  t         | tblspc
> >>  tp_0_2    | (null)
> >> (2 rows)
> >>
> >> SELECT indexname, tablespace FROM pg_indexes WHERE tablename IN ('t', 
> >> 'tp_0_2') ORDER BY indexname;
> >>   indexname  | tablespace
> >> -------------+------------
> >>  t_pkey      | tblspc
> >>  tp_0_2_pkey | tblspc
> >> ---
> >>
> >>
> >> If it's an oversight, I've attached a patch to ensure these commands create
> >> new partitions in the parent's tablespace.
> >
> > +1
> >
> > Since creating a child table through the CREATE TABLE statement sets
> > its parent table's tablespace as the child table's tablespace, it is
> > logical to set the parent table's tablespace as the merged table's
> > tablespace.
>
> Thanks for the review!
>
>
> > While the patch does not include test cases for SPLIT PARTITIONS,
> > which is understandable as these commands use the common function that
> > we have fixed, I believe it would be prudent to test SPLIT PARTITIONS
> > as well since we could change it in the future development.
>
> Unless I'm mistaken, the patch already includes tests for the split case.
> Could you please check the tests added to partition_split.sql?
>

Oops, sorry, I missed that part for some reason. So the patch looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?

2024-07-09 Thread Masahiko Sawada
On Sat, Jul 6, 2024 at 4:06 PM Fujii Masao  wrote:
>
> Hi,
>
> I noticed that ALTER TABLE MERGE PARTITIONS and SPLIT PARTITION commands
> always create new partitions in the default tablespace, regardless of
> the parent's tablespace. However, the indexes of these new partitions inherit
> the tablespaces of their parent indexes. This inconsistency seems odd.
> Is this an oversight or intentional?
>
> Here are the steps I used to test this:
>
> ---
> CREATE TABLESPACE tblspc LOCATION '/tmp/tblspc';
> CREATE TABLE t (i int PRIMARY KEY USING INDEX TABLESPACE tblspc)
>PARTITION BY RANGE (i) TABLESPACE tblspc;
>
> CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
> CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);
>
> ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;
>
> SELECT tablename, tablespace FROM pg_tables WHERE tablename IN ('t', 
> 'tp_0_2') ORDER BY tablename;
>  tablename | tablespace
> -----------+------------
>  t         | tblspc
>  tp_0_2    | (null)
> (2 rows)
>
> SELECT indexname, tablespace FROM pg_indexes WHERE tablename IN ('t', 
> 'tp_0_2') ORDER BY indexname;
>   indexname  | tablespace
> -------------+------------
>  t_pkey      | tblspc
>  tp_0_2_pkey | tblspc
> ---
>
>
> If it's an oversight, I've attached a patch to ensure these commands create
> new partitions in the parent's tablespace.

+1

Since creating a child table through the CREATE TABLE statement sets
its parent table's tablespace as the child table's tablespace, it is
logical to set the parent table's tablespace as the merged table's
tablespace.
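
For illustration, here is a minimal sketch of that existing CREATE TABLE
behavior, reusing the tablespace and parent table from the example above.
Per the behavior described, the new partition is expected to end up in
tblspc rather than the default tablespace:

CREATE TABLE tp_2_3 PARTITION OF t FOR VALUES FROM (2) TO (3);
-- expected to show 'tblspc', i.e. the partition inherits the parent's tablespace
SELECT tablename, tablespace FROM pg_tables WHERE tablename = 'tp_2_3';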

While the patch does not include test cases for SPLIT PARTITION,
which is understandable as these commands use the common function that
we have fixed, I believe it would be prudent to test SPLIT PARTITION
as well since we could change it in future development.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Parallel heap vacuum

2024-07-08 Thread Masahiko Sawada
On Fri, Jun 28, 2024 at 9:06 PM Amit Kapila  wrote:
>
> On Fri, Jun 28, 2024 at 9:44 AM Masahiko Sawada  wrote:
> >
> > # Benchmark results
> >
> > * Test-1: parallel heap scan on the table without indexes
> >
> > I created 20GB table, made garbage on the table, and run vacuum while
> > changing parallel degree:
> >
> > create unlogged table test (a int) with (autovacuum_enabled = off);
> > insert into test select generate_series(1, 6); --- 20GB table
> > delete from test where a % 5 = 0;
> > vacuum (verbose, parallel 0) test;
> >
> > Here are the results (total time and heap scan time):
> >
> > PARALLEL 0: 21.99 s (single process)
> > PARALLEL 1: 11.39 s
> > PARALLEL 2:   8.36 s
> > PARALLEL 3:   6.14 s
> > PARALLEL 4:   5.08 s
> >
> > * Test-2: parallel heap scan on the table with one index
> >
> > I used a similar table to the test case 1 but created one btree index on it:
> >
> > create unlogged table test (a int) with (autovacuum_enabled = off);
> > insert into test select generate_series(1, 6); --- 20GB table
> > create index on test (a);
> > delete from test where a % 5 = 0;
> > vacuum (verbose, parallel 0) test;
> >
> > I've measured the total execution time as well as the time of each
> > vacuum phase (from left heap scan time, index vacuum time, and heap
> > vacuum time):
> >
> > PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
> > PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
> > PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
> > PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
> > PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)
> >
> > Overall, I can see the parallel heap scan in lazy vacuum has a decent
> > scalability; In both test-1 and test-2, the execution time of heap
> > scan got ~4x faster with 4 parallel workers. On the other hand, when
> > it comes to the total vacuum execution time, I could not see much
> > performance improvement in test-2 (45.11 vs. 35.84). Looking at the
> > results PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
> > (21.89 vs. 12.75) whereas index vacuum got slower (16.74 vs. 22.04),
> > and heap scan in case 2 was not as fast as in case 1 with 1 parallel
> > worker (12.75 vs. 11.39).
> >
> > I think the reason is the shared TidStore is not very scalable since
> > we have a single lock on it. In all cases in the test-1, we don't use
> > the shared TidStore since all dead tuples are removed during heap
> > pruning. So the scalability was better overall than in test-2. In
> > parallel 0 case in test-2, we use the local TidStore, and from
> > parallel degree of 1 in test-2, we use the shared TidStore and
> > parallel worker concurrently update it. Also, I guess that the lookup
> > performance of the local TidStore is better than the shared TidStore's
> > lookup performance because of the differences between a bump context
> > and an DSA area. I think that this difference contributed the fact
> > that index vacuuming got slower (16.74 vs. 22.04).
> >

Thank you for the comments!

> > There are two obvious improvement ideas to improve overall vacuum
> > execution time: (1) improve the shared TidStore scalability and (2)
> > support parallel heap vacuum. For (1), several ideas are proposed by
> > the ART authors[1]. I've not tried these ideas but it might be
> > applicable to our ART implementation. But I prefer to start with (2)
> > since it would be easier. Feedback is very welcome.
> >
>
> Starting with (2) sounds like a reasonable approach. We should study a
> few more things like (a) the performance results where there are 3-4
> indexes,

Here are the results with 4 indexes (and restarting the server before
the benchmark):

PARALLEL 0: 115.48 s (32.76, 64.46, 18.24)
PARALLEL 1:  74.88 s (17.11, 44.43, 13.25)
PARALLEL 2:  71.15 s (14.13, 44.82, 12.12)
PARALLEL 3:  46.78 s (10.74, 24.50, 11.43)
PARALLEL 4:  46.42 s (8.95, 24.96, 12.39) (launched 4 workers for heap
scan and 3 workers for index vacuum)

> (b) What is the reason for performance improvement seen with
> only heap scans. We normally get benefits of parallelism because of
> using multiple CPUs but parallelizing scans (I/O) shouldn't give much
> benefits. Is it possible that you are seeing benefits because most of
> the data is either in shared_buffers or in memory? We can probably try
> vacuuming tables by restarting the nodes to ensure the data is not in
> memory.

I think it depends on the storage performance. FYI I use an EC2
instance (m6id.metal).

I've run the same benchmark script (table with no indexes), restarting
the server before executing the vacuum, and here are the results:

PARALLEL 0:  32.75 s
PARALLEL 1:  17.46 s
PARALLEL 2:  13.41 s
PARALLEL 3:  10.31 s
PARALLEL 4:   8.48 s

With the above two tests, I used the updated patch that I just submitted[1].

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoAWHHnCg9OvtoEJnnvCc-3isyOyAGn%2B2KYoSXEv%3DvXauw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Parallel heap vacuum

2024-07-08 Thread Masahiko Sawada
On Fri, Jul 5, 2024 at 6:51 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> > The parallel vacuum we have today supports only for index vacuuming.
> > Therefore, while multiple workers can work on different indexes in
> > parallel, the heap table is always processed by the single process.
> > I'd like to propose $subject, which enables us to have multiple
> > workers running on the single heap table. This would be helpful to
> > speedup vacuuming for tables without indexes or tables with
> > INDEX_CLEANUP = off.
>
> Sounds great. IIUC, vacuuming is still one of the main weak point of postgres.
>
> > I've attached a PoC patch for this feature. It implements only
> > parallel heap scans in lazyvacum. We can extend this feature to
> > support parallel heap vacuum as well in the future or in the same
> > patch.
>
> Before diving into deep, I tested your PoC but found unclear point.
> When the vacuuming is requested with parallel > 0 with almost the same 
> workload
> as yours, only the first page was scanned and cleaned up.
>
> When parallel was set to zero, I got:
> ```
> INFO:  vacuuming "postgres.public.test"
> INFO:  finished vacuuming "postgres.public.test": index scans: 0
> pages: 0 removed, 2654868 remain, 2654868 scanned (100.00% of total)
> tuples: 12000 removed, 48000 remain, 0 are dead but not yet removable
> removable cutoff: 752, which was 0 XIDs old when operation ended
> new relfrozenxid: 739, which is 1 XIDs ahead of previous value
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item 
> identifiers removed
> avg read rate: 344.639 MB/s, avg write rate: 344.650 MB/s
> buffer usage: 2655045 hits, 2655527 misses, 2655606 dirtied
> WAL usage: 1 records, 1 full page images, 937 bytes
> system usage: CPU: user: 39.45 s, system: 20.74 s, elapsed: 60.19 s
> ```
>
> This meant that all pages were surely scanned and dead tuples were removed.
> However, when parallel was set to one, I got another result:
>
> ```
> INFO:  vacuuming "postgres.public.test"
> INFO:  launched 1 parallel vacuum worker for table scanning (planned: 1)
> INFO:  finished vacuuming "postgres.public.test": index scans: 0
> pages: 0 removed, 2654868 remain, 1 scanned (0.00% of total)
> tuples: 12 removed, 0 remain, 0 are dead but not yet removable
> removable cutoff: 752, which was 0 XIDs old when operation ended
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
> index scan not needed: 0 pages from table (0.00% of total) had 0 dead item 
> identifiers removed
> avg read rate: 92.952 MB/s, avg write rate: 0.845 MB/s
> buffer usage: 96 hits, 660 misses, 6 dirtied
> WAL usage: 1 records, 1 full page images, 937 bytes
> system usage: CPU: user: 0.05 s, system: 0.00 s, elapsed: 0.05 s
> ```
>
> It looked like that only a page was scanned and 12 tuples were removed.
> It looks very strange for me...
>
> Attached script emulate my test. IIUC it was almost the same as yours, but
> the instance was restarted before vacuuming.
>
> Can you reproduce and see the reason? Based on the requirement I can
> provide further information.

Thank you for the test!

I could reproduce this issue, and it's a bug: the scan skipped even
non-all-visible pages. I've attached a new version of the patch.

BTW, since we compute the number of parallel workers for the heap scan
based on the table size, it's possible that we launch multiple workers
even if most blocks are all-visible. It seems better to calculate it
based on (relpages - relallvisible).
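
To make that concrete, both numbers are per-table fields in pg_class, so
the quantity I have in mind is roughly the following (using the test
table from the earlier benchmark; this is just a sketch of the inputs,
not of the worker-computation code itself):

SELECT relpages, relallvisible, relpages - relallvisible AS pages_to_scan
FROM pg_class WHERE relname = 'test';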

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


parallel_heap_vacuum_scan_v2.patch
Description: Binary data


Re: Conflict Detection and Resolution

2024-07-04 Thread Masahiko Sawada
On Mon, Jul 1, 2024 at 6:54 PM Amit Kapila  wrote:
>
> On Mon, Jul 1, 2024 at 1:35 PM Masahiko Sawada  wrote:
> >
> > Setting resolvers at table-level and subscription-level sounds good to
> > me. DDLs for setting resolvers at subscription-level would need the
> > subscription name to be specified?
> >
>
> Yes, it should be part of the ALTER/CREATE SUBSCRIPTION command. One
> idea could be to have syntax as follows:
>
> ALTER SUBSCRIPTION name SET CONFLICT RESOLVER 'conflict_resolver' FOR
> 'conflict_type';
> ALTER SUBSCRIPTION name RESET CONFLICT RESOLVER FOR 'conflict_type';
>
> CREATE SUBSCRIPTION subscription_name CONNECTION 'conninfo'
> PUBLICATION publication_name [, ...] CONFLICT RESOLVER
> 'conflict_resolver' FOR 'conflict_type';

Looks good to me.
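
For example, usage with the proposed syntax could look like the
following (the resolver and conflict type names are taken from earlier
in the thread and are only illustrative, since neither list is final):

ALTER SUBSCRIPTION sub1 SET CONFLICT RESOLVER 'skip' FOR 'delete_missing';
ALTER SUBSCRIPTION sub1 RESET CONFLICT RESOLVER FOR 'delete_missing';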

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Conflict Detection and Resolution

2024-07-01 Thread Masahiko Sawada
On Thu, Jun 27, 2024 at 1:50 PM shveta malik  wrote:
>
> On Wed, Jun 26, 2024 at 2:33 PM Amit Kapila  wrote:
> >
> > On Tue, Jun 25, 2024 at 3:39 PM shveta malik  wrote:
> > >
> > > On Tue, Jun 25, 2024 at 3:12 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Jun 24, 2024 at 1:47 PM shveta malik  
> > > > wrote:
> > > > >
> > > > > On Thu, Jun 20, 2024 at 6:41 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > > >> In the second patch, we can implement simple built-in resolution
> > > > > > >> strategies like apply and skip (which can be named as 
> > > > > > >> remote_apply and
> > > > > > >> keep_local, see [3][4] for details on these strategies) with 
> > > > > > >> ERROR or
> > > > > > >> LOG being the default strategy. We can allow these strategies to 
> > > > > > >> be
> > > > > > >> configured at the global and table level.
> > > > >
> > > > > Before we implement resolvers, we need a way to configure them. Please
> > > > > find the patch002 which attempts to implement Global Level Conflict
> > > > > Resolvers Configuration.  Note that patch002 is dependent upon
> > > > > Conflict-Detection patch001 which is reviewed in another thread [1].
> > > > > I have attached patch001 here for convenience and to avoid CFBot
> > > > > failures. But please use [1] if you have any comments on patch001.
> > > > >
> > > > > New DDL commands in patch002 are:
> > > > >
> > > > > To set global resolver for given conflcit_type:
> > > > > SET CONFLICT RESOLVER 'conflict_resolver' FOR 'conflict_type'
> > > > >
> > > > > To reset to default resolver:
> > > > > RESET CONFLICT RESOLVER FOR 'conflict_type'
> > > > >
> > > >
> > > > Does setting up resolvers have any meaning without subscriptions? I am
> > > > wondering whether we should allow to set up the resolvers at the
> > > > subscription level. One benefit is that users don't need to use a
> > > > different DDL to set up resolvers. The first patch gives a conflict
> > > > detection option at the subscription level, so it would be symmetrical
> > > > to provide a resolver at the subscription level. Yet another benefit
> > > > could be that it provides users facility to configure different
> > > > resolvers for a set of tables belonging to a particular
> > > > publication/node.
> > >
> > > There can be multiple tables included in a publication with varying
> > > business use-cases and thus may need different resolvers set, even
> > > though they all are part of the same publication.
> > >
> >
> > Agreed but this is the reason we are planning to keep resolvers at the
> > table level. Here, I am asking to set resolvers at the subscription
> > level rather than at the global level.
>
> Okay, got it. I misunderstood earlier that we want to replace table
> level resolvers with subscription ones.
> Having global configuration has one benefit that if the user has no
> requirement to set different resolvers for different subscriptions or
> tables, he may always set one global configuration and be done with
> it. OTOH, I also agree with benefits coming with subscription level
> configuration.

Setting resolvers at table-level and subscription-level sounds good to
me. Would DDLs for setting resolvers at subscription-level need the
subscription name to be specified? And another question: does a
table-level resolver setting take precedence over all subscription-level
resolver settings in the database?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Conflict Detection and Resolution

2024-07-01 Thread Masahiko Sawada
Hi,

On Thu, May 23, 2024 at 3:37 PM shveta malik  wrote:
>
> DELETE
> 
> Conflict Type:
> 
> delete_missing: An incoming delete is trying to delete a row on a
> target node which does not exist.

IIUC the 'delete_missing' conflict doesn't cover the case where an
incoming delete message is trying to delete a row that has already
been updated locally or by another node. I think in update/delete
conflict situations, we need to resolve the conflicts based on commit
timestamps like we do for update/update and insert/update conflicts.

For example, suppose there are two nodes, node-A and node-B, set up with
bi-directional replication, and suppose further that both have the row
with id = 1. Consider the following sequence of events:

09:00:00  DELETE ... WHERE id = 1 on node-A.
09:00:05  UPDATE ... WHERE id = 1 on node-B.
09:00:10  node-A received the update message from node-B.
09:00:15  node-B received the delete message from node-A.

At 09:00:10 on node-A, an update_deleted conflict is generated since
the row on node-A has already been deleted locally. Suppose that we use
the 'apply_or_skip' resolution for this conflict: we convert the update
message into an insertion, so node-A now has the row with id = 1. At
09:00:15 on node-B, the incoming delete message is applied and deletes
the row with id = 1, even though the row has already been modified
locally. Node-A and node-B are now inconsistent. This inconsistency can
be avoided by using the 'skip' resolution for the 'update_deleted'
conflict on node-A, and 'skip' is actually the default resolution for
that conflict. However, if we handle it as 'update_missing', the
'apply_or_skip' resolution is used by default.

IIUC, with the proposed architecture, DELETE always takes precedence
over UPDATE since neither 'update_deleted' nor 'update_missing' uses
commit timestamps to resolve the conflict. As long as that is true, I
think there is no use case for the 'apply_or_skip' and 'apply_or_error'
resolutions in update/delete conflict cases. In short, I think we need
something like a 'delete_differ' conflict type as well. FYI, PGD and
Oracle GoldenGate seem to have this conflict type [1][2].

The 'delete_differ' conflict type would have at least the
'latest_timestamp_wins' resolution. With the timestamp-based
resolution method, we would deal with update/delete conflicts as
follows:

09:00:00: DELETE ... WHERE id = 1 on node-A.
09:00:05: UPDATE ... WHERE id = 1 on node-B.
- the updated row doesn't have the origin since it's a local change.
09:00:10: node-A received the update message from node-B.
- the incoming update message has the origin of node-B whereas the
local row has already been removed locally.
- an 'update_deleted' conflict is generated.
- do the insert of the new row instead, because the commit
timestamp of the UPDATE is newer than the DELETE's.
09:00:15: node-B received the delete message from node-A.
- the incoming delete message has the origin of node-A whereas the
(updated) row doesn't have the origin.
- a 'delete_differ' conflict is generated.
- discard the DELETE, because the commit timestamp of the UPDATE is
newer than the DELETE's.

As a result, both nodes have the new version of the row.
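
To make the scenario concrete, here is a minimal sketch of the
conflicting statements (the table definition is hypothetical; only the
row with id = 1 matters):

-- on both nodes beforehand: CREATE TABLE tbl (id int PRIMARY KEY, val text);
-- 09:00:00 on node-A:
DELETE FROM tbl WHERE id = 1;
-- 09:00:05 on node-B:
UPDATE tbl SET val = 'new' WHERE id = 1;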

Regards,

[1] 
https://www.enterprisedb.com/docs/pgd/latest/consistency/conflicts/#updatedelete-conflicts
[2] 
https://docs.oracle.com/goldengate/c1230/gg-winux/GWUAD/configuring-conflict-detection-and-resolution.htm
(see DELETEROWEXISTS conflict type)

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Parallel heap vacuum

2024-06-27 Thread Masahiko Sawada
I've measured the total execution time as well as the time of each
vacuum phase (from left: heap scan time, index vacuum time, and heap
vacuum time):

PARALLEL 0: 45.11 s (21.89, 16.74, 6.48)
PARALLEL 1: 42.13 s (12.75, 22.04, 7.23)
PARALLEL 2: 39.27 s (8.93, 22.78, 7.45)
PARALLEL 3: 36.53 s (6.76, 22.00, 7.65)
PARALLEL 4: 35.84 s (5.85, 22.04, 7.83)

Overall, I can see that the parallel heap scan in lazy vacuum has decent
scalability; in both test-1 and test-2, the execution time of the heap
scan got ~4x faster with 4 parallel workers. On the other hand, when it
comes to the total vacuum execution time, I could not see much
performance improvement in test-2 (45.11 vs. 35.84). Looking at the
results of PARALLEL 0 vs. PARALLEL 1 in test-2, the heap scan got faster
(21.89 vs. 12.75) whereas the index vacuum got slower (16.74 vs. 22.04),
and the heap scan in test-2 was not as fast as in test-1 with 1 parallel
worker (12.75 vs. 11.39).

I think the reason is that the shared TidStore is not very scalable
since we have a single lock on it. In all cases in test-1, we don't use
the shared TidStore since all dead tuples are removed during heap
pruning, so the scalability was better overall than in test-2. In the
parallel 0 case in test-2, we use the local TidStore, and from a
parallel degree of 1 in test-2, we use the shared TidStore and parallel
workers concurrently update it. Also, I guess that the lookup
performance of the local TidStore is better than the shared TidStore's
because of the differences between a bump context and a DSA area. I
think that this difference contributed to the fact that index vacuuming
got slower (16.74 vs. 22.04).

There are two obvious ideas to improve the overall vacuum execution
time: (1) improve the shared TidStore scalability and (2) support
parallel heap vacuum. For (1), several ideas are proposed by the ART
authors [1]. I've not tried these ideas, but they might be applicable to
our ART implementation. However, I prefer to start with (2) since it
would be easier. Feedback is very welcome.

Regards,

[1] https://db.in.tum.de/~leis/papers/artsync.pdf

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6f8b1b7929..cf8c6614cd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2630,6 +2630,12 @@ static const TableAmRoutine heapam_methods = {
 	.relation_copy_data = heapam_relation_copy_data,
 	.relation_copy_for_cluster = heapam_relation_copy_for_cluster,
 	.relation_vacuum = heap_vacuum_rel,
+
+	.parallel_vacuum_compute_workers = heap_parallel_vacuum_compute_workers,
+	.parallel_vacuum_estimate = heap_parallel_vacuum_estimate,
+	.parallel_vacuum_initialize = heap_parallel_vacuum_initialize,
+	.parallel_vacuum_scan_worker = heap_parallel_vacuum_scan_worker,
+
 	.scan_analyze_next_block = heapam_scan_analyze_next_block,
 	.scan_analyze_next_tuple = heapam_scan_analyze_next_tuple,
 	.index_build_range_scan = heapam_index_build_range_scan,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3f88cf1e8e..4ccf15ffe3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -49,6 +49,7 @@
 #include "common/int.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
+#include "optimizer/paths.h"
 #include "pgstat.h"
 #include "portability/instr_time.h"
 #include "postmaster/autovacuum.h"
@@ -117,10 +118,22 @@
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
 /*
- * Macro to check if we are in a parallel vacuum.  If true, we are in the
- * parallel mode and the DSM segment is initialized.
+ * DSM keys for heap parallel vacuum scan. Unlike other parallel execution code,
+ * we don't need to worry about DSM keys conflicting with plan_node_id, but need to
+ * avoid conflicting with DSM keys used in vacuumparallel.c.
+ */
+#define LV_PARALLEL_SCAN_SHARED			0x0001
+#define LV_PARALLEL_SCAN_DESC			0x0002
+#define LV_PARALLEL_SCAN_DESC_WORKER	0x0003
+
+/*
+ * Macro to check if we are in a parallel vacuum.  If ParallelVacuumIsActive() is
+ * true, we are in the parallel mode, meaning that we do either parallel index
+ * vacuuming or parallel table vacuuming, or both. If ParallelHeapVacuumIsActive()
+ * is true, we do at least parallel table vacuuming.
  */
 #define ParallelVacuumIsActive(vacrel) ((vacrel)->pvs != NULL)
+#define ParallelHeapVacuumIsActive(vacrel) ((vacrel)->phvstate != NULL)
 
 /* Phases of vacuum during which we report error context. */
 typedef enum
@@ -133,6 +146,80 @@ typedef enum
 	VACUUM_ERRCB_PHASE_TRUNCATE,
 } VacErrPhase;
 
+/*
+ * Relation statistics collected during heap scanning that need to be shared among
+ * parallel vacuum workers.
+ */
+typedef struct LVRelCounters
+{
+	BlockNumber scanned_pages;	/* # pages examined (not skipped via VM) */
+	BlockNumber removed_pages;	/* # pages removed

Re: Track the amount of time waiting due to cost_delay

2024-06-26 Thread Masahiko Sawada
On Mon, Jun 24, 2024 at 7:50 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Sat, Jun 22, 2024 at 12:48:33PM +, Bertrand Drouvot wrote:
> > 1. vacuuming indexes time has been longer on master because with v2, the 
> > leader
> > has been interrupted 342605 times while waiting, then making v2 "faster".
> >
> > 2. the leader being interrupted while waiting is also already happening on 
> > master
> > due to the pgstat_progress_parallel_incr_param() calls in
> > parallel_vacuum_process_one_index() (that have been added in
> > 46ebdfe164). It has been the case "only" 36 times during my test case.
> >
> > I think that 2. is less of a concern but I think that 1. is something that 
> > needs
> > to be addressed because the leader process is not honouring its cost delay 
> > wait
> > time in a noticeable way (at least during my test case).
> >
> > I did not think of a proposal yet, just sharing my investigation as to why
> > v2 has been faster than master during the vacuuming indexes phase.

Thank you for the benchmarking and for analyzing the results! I agree
with your analysis and was surprised by the fact that the more times the
workers go to sleep, the more times the leader wakes up.

>
> I think that a reasonable approach is to make the reporting from the parallel
> workers to the leader less aggressive (means occur less frequently).
>
> Please find attached v3, that:
>
> - ensures that there is at least 1 second between 2 reports, per parallel 
> worker,
> to the leader.
>
> - ensures that the reported delayed time is still correct (keep track of the
> delayed time between 2 reports).
>
> - does not add any extra pg_clock_gettime_ns() calls (as compare to v2).
>

Sounds good to me. I think it's better to keep the logic for throttling
the reporting of the delay message simple. It's an important
consideration, but parallel vacuum with cost delays is unlikely to be
used much in practice.
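
For context, the case in question is a parallel vacuum run with a
non-zero cost delay, along the lines of the following (the values and
table name are purely illustrative):

SET vacuum_cost_delay = '2ms';
SET vacuum_cost_limit = 200;
VACUUM (PARALLEL 2, VERBOSE) test;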

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: New standby_slot_names GUC in PG 17

2024-06-26 Thread Masahiko Sawada
On Wed, Jun 26, 2024 at 6:15 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, June 26, 2024 12:49 PM Bertrand Drouvot 
>  wrote:
> >
> > Hi,
> >
> > On Wed, Jun 26, 2024 at 04:17:45AM +, Zhijie Hou (Fujitsu) wrote:
> > > On Wednesday, June 26, 2024 9:40 AM Masahiko Sawada
> >  wrote:
> > > >
> > > > On Tue, Jun 25, 2024 at 5:32 PM Amit Kapila
> > > > 
> > > > wrote:
> > > > >
> > > > > I feel synchronized better indicates the purpose because we ensure
> > > > > such slots are synchronized before we process changes for logical
> > > > > failover slots. We already have a 'failover' option for logical
> > > > > slots which could make things confusing if we add 'failover' where
> > > > > physical slots need to be specified.
> > > >
> > > > Agreed. So +1 for synchronized_standby_slots.
> > >
> > > +1.
> > >
> > > Since there is a consensus on this name, I am attaching the patch to
> > > rename the GUC to synchronized_standby_slots. I have confirmed that
> > > the regression tests and pgindent passed for the patch.
> > A few comments:
>
> Thanks for the comments!
>
> > 1 
> >
> > In the commit message:
> >
> > "
> > The standby_slot_names GUC is intended to allow specification of physical
> > standby slots that must be synchronized before they are visible to
> > subscribers
> > "
> >
> > Not sure that wording is correct, if we feel the need to explain the GUC, 
> > maybe
> > repeat some wording from bf279ddd1c?
>
> I intentionally copied some words from release note of this GUC which was
> also part of the content in the initial email of this thread. I think it
> would be easy to understand than the original commit msg. But others may
> have different opinion, so I would leave the decision to the committer. (I 
> adjusted
> a bit the word in this version).
>
> >
> > 2 
> >
> > Should we rename StandbySlotNamesConfigData too?
> >
> > 3 
> >
> > Should we rename SlotExistsInStandbySlotNames too?
> >
> > 4 
> >
> > Should we rename validate_standby_slots() too?
> >
>
> Renamed these to the names suggested by Amit.
>
> Attach the v2 patch set which addressed above and removed
> the changes in release-17.sgml according to the comment from Amit.
>

Thank you for updating the patch. The v2 patch looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Vacuum statistics

2024-06-26 Thread Masahiko Sawada
On Fri, May 31, 2024 at 4:19 AM Andrei Zubkov  wrote:
>
> Hi,
>
> Th, 30/05/2024 at 10:33 -0700, Alena Rybakina wrote:
> > I suggest gathering information about vacuum resource consumption for
> > processing indexes and tables and storing it in the table and index
> > relationships (for example, PgStat_StatTabEntry structure like it has
> > realized for usual statistics). It will allow us to determine how
> > well
> > the vacuum is configured and evaluate the effect of overhead on the
> > system at the strategic level, the vacuum has gathered this
> > information
> > already, but this valuable information doesn't store it.
> >
> It seems a little bit unclear to me, so let me explain a little the
> point of a proposition.
>
> As the vacuum process is a backend it has a workload instrumentation.
> We have all the basic counters available such as a number of blocks
> read, hit and written, time spent on I/O, WAL stats and so on.. Also,
> we can easily get some statistics specific to vacuum activity i.e.
> number of tuples removed, number of blocks removed, number of VM marks
> set and, of course the most important metric - time spent on vacuum
> operation.

I've not reviewed the patch closely, but it sounds helpful for users. I
would like to add a statistic: the high-water mark of memory usage of
dead tuple TIDs. Since the amount of memory used by TidStore is hard
to predict, I think showing the high-water mark would help users decide
how much memory to set for maintenance_work_mem.
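
As a point of reference, the per-vacuum progress view already exposes
the current (not high-water) memory usage of dead-item storage, assuming
the v17 column names; the suggestion above is about keeping a cumulative
high-water mark beyond the lifetime of a single VACUUM:

SELECT relid::regclass, dead_tuple_bytes, max_dead_tuple_bytes
FROM pg_stat_progress_vacuum;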

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: New standby_slot_names GUC in PG 17

2024-06-25 Thread Masahiko Sawada
On Tue, Jun 25, 2024 at 5:32 PM Amit Kapila  wrote:
>
> On Tue, Jun 25, 2024 at 12:30 PM Masahiko Sawada  
> wrote:
> >
> > On Tue, Jun 25, 2024 at 1:54 PM Amit Kapila  wrote:
> > >
> >
> > > So, my
> > > preference is in order as follows: synchronized_standby_slots,
> > > wait_for_standby_slots, logical_replication_wait_slots,
> > > logical_replication_synchronous_slots, and
> > > logical_replication_synchronous_standby_slots.
> >
> > I also prefer synchronized_standby_slots.
> >
> > From a different angle just for discussion, is it worth considering
> > the term 'failover' since the purpose of this feature is to ensure a
> > standby to be ready for failover in terms of logical replication? For
> > example, failover_standby_slot_names?
> >
>
> I feel synchronized better indicates the purpose because we ensure
> such slots are synchronized before we process changes for logical
> failover slots. We already have a 'failover' option for logical slots
> which could make things confusing if we add 'failover' where physical
> slots need to be specified.

Agreed. So +1 for synchronized_standby_slots.
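
For what it's worth, usage under the new name would then look something
like this (the slot names are placeholders for the physical standbys'
slots):

ALTER SYSTEM SET synchronized_standby_slots = 'physical_slot_1, physical_slot_2';
SELECT pg_reload_conf();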

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: New standby_slot_names GUC in PG 17

2024-06-25 Thread Masahiko Sawada
On Tue, Jun 25, 2024 at 1:54 PM Amit Kapila  wrote:
>
> On Tue, Jun 25, 2024 at 8:20 AM Masahiko Sawada  wrote:
> >
> > On Tue, Jun 25, 2024 at 11:21 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > I agree the current name seems too generic and the suggested ' 
> > > synchronized_standby_slots '
> > > is better than the current one.
> > >
> > > Some other ideas could be:
> > >
> > > synchronize_slots_on_standbys: it indicates that the standbys that enabled
> > > slot sync should be listed in this GUC.
> > >
> > > logical_replication_wait_slots: it means the logical replication(logical
> > > Walsender process) will wait for these slots to advance the confirm flush
> > > lsn before proceeding.
> >
> > I feel that the name that has some connection to "logical replication"
> > also sounds good. Let me add some ideas:
> >
> > - logical_replication_synchronous_standby_slots (might be too long)
> > - logical_replication_synchronous_slots
> >
>
> I see your point about keeping logical_replication in the name but
> that could also lead one to think that this list can contain logical
> slots.

Right.

>  OTOH, there is some value in keeping '_standby_' in the name as
> that is more closely associated with physical standby's and this list
> contains physical slots corresponding to physical standby's.

Agreed.

> So, my
> preference is in order as follows: synchronized_standby_slots,
> wait_for_standby_slots, logical_replication_wait_slots,
> logical_replication_synchronous_slots, and
> logical_replication_synchronous_standby_slots.

I also prefer synchronized_standby_slots.

From a different angle, just for discussion: is it worth considering
the term 'failover', since the purpose of this feature is to ensure a
standby is ready for failover in terms of logical replication? For
example, failover_standby_slot_names?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: New standby_slot_names GUC in PG 17

2024-06-24 Thread Masahiko Sawada
On Tue, Jun 25, 2024 at 11:21 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Saturday, June 22, 2024 5:47 PM Amit Kapila  
> wrote:
> >
> > On Sat, Jun 22, 2024 at 1:49 AM Nathan Bossart
> >  wrote:
> > >
> > > On Fri, Jun 21, 2024 at 03:50:00PM -0400, Tom Lane wrote:
> > > >>>>> Allow specification of physical standbys that must be
> > > >>>>> synchronized before they are visible to subscribers (Hou Zhijie,
> > > >>>>> Shveta Malik)
> > > >
> > > > it seems like the name ought to have some connection to
> > > > synchronization.  Perhaps something like "synchronized_standby_slots"?
> > >
> > > IMHO that might be a bit too close to synchronous_standby_names.
> > >
> >
> > Right, but better than the current one. The other possibility could be
> > wait_for_standby_slots.
>
> I agree the current name seems too generic and the suggested ' 
> synchronized_standby_slots '
> is better than the current one.
>
> Some other ideas could be:
>
> synchronize_slots_on_standbys: it indicates that the standbys that enabled
> slot sync should be listed in this GUC.
>
> logical_replication_wait_slots: it means the logical replication(logical
> Walsender process) will wait for these slots to advance the confirm flush
> lsn before proceeding.

I feel that a name that has some connection to "logical replication"
also sounds good. Let me add some ideas:

- logical_replication_synchronous_standby_slots (might be too long)
- logical_replication_synchronous_slots

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-20 Thread Masahiko Sawada
On Thu, Jun 20, 2024 at 4:58 PM John Naylor  wrote:
>
> On Thu, Jun 20, 2024 at 8:12 AM Masahiko Sawada  wrote:
>
> > On Thu, Jun 20, 2024 at 7:54 AM Tom Lane  wrote:
> > >
> > > I don't know if there's any reason why the current order
> > > is preferable.)
> >
> > IIUC there is no particular reason for the current order in RT_NODE_48.
>
> Yeah. I found that simply swapping them enables clang to avoid
> double-initialization, but gcc still can't figure it out and must be
> told to stop at slot_idxs[]. I'd prefer to do it that way and document
> that slot_idxs is purposefully the last member of the fixed part of
> the struct.

+1

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: suspicious valgrind reports about radixtree/tidstore on arm64

2024-06-19 Thread Masahiko Sawada
Hi,

On Thu, Jun 20, 2024 at 7:54 AM Tom Lane  wrote:
>
> I wrote:
> > I've reproduced what looks like about the same thing on
> > my Pi 4 using Fedora 38: just run "make installcheck-parallel"
> > under valgrind, and kaboom.  Definitely needs investigation.
>
> The problem appears to be that RT_ALLOC_NODE doesn't bother to
> initialize the chunks[] array when making a RT_NODE_16 node.
> If we fill fewer than RT_FANOUT_16_MAX of the chunks[] entries,
> then when RT_NODE_16_SEARCH_EQ applies vector operations that
> read the entire array, it's operating partially on uninitialized
> data.  Now, that's actually OK because of the "mask off invalid
> entries" step, but aarch64 valgrind complains anyway.
>
> I hypothesize that the reason we're not seeing equivalent failures
> on x86_64 is one of
>
> 1. x86_64 valgrind is stupider than aarch64, and fails to track that
> the contents of the SIMD registers are only partially defined.
>
> 2. x86_64 valgrind is smarter than aarch64, and is able to see
> that the "mask off invalid entries" step removes all the
> potentially-uninitialized bits.
>
> The first attached patch, "radixtree-fix-minimal.patch", is enough
> to stop the aarch64 valgrind failure for me.  However, I think
> that the coding here is pretty penny-wise and pound-foolish,
> and that what we really ought to do is the second patch,
> "radixtree-fix-proposed.patch".  I do not believe that asking
> memset to zero the three-byte RT_NODE structure produces code
> that's either shorter or faster than having it zero 8 bytes
> (as for RT_NODE_4) or having it do that and then separately
> zero some more stuff (as for the larger node types).  Leaving
> RT_NODE_4's chunks[] array uninitialized is going to bite us
> someday, too, even if it doesn't right now.  So I think we
> ought to just zero the whole fixed-size part of the nodes,
> which is what radixtree-fix-proposed.patch does.

I agree with radixtree-fix-proposed.patch. Even if we zero more fields
in the node, it would not add noticeable overhead.

>
> (The RT_NODE_48 case could be optimized a bit if we cared
> to swap the order of its slot_idxs[] and isset[] arrays;
> then the initial zeroing could just go up to slot_idxs[].
> I don't know if there's any reason why the current order
> is preferable.)

IIUC there is no particular reason for the current order in RT_NODE_48.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Revive num_dead_tuples column of pg_stat_progress_vacuum

2024-06-18 Thread Masahiko Sawada
On Sat, Jun 15, 2024 at 8:47 PM Robert Treat  wrote:
>
> On Thu, Jun 13, 2024 at 9:22 PM Masahiko Sawada  wrote:
> > On Fri, Jun 14, 2024 at 9:57 AM Masahiko Sawada  
> > wrote:
> > > On Fri, Jun 14, 2024 at 9:41 AM Michael Paquier  
> > > wrote:
> > > > On Thu, Jun 13, 2024 at 08:38:05PM -0400, Tom Lane wrote:
> > > > > Masahiko Sawada  writes:
> > > > >> I was about to push the patch but let me confirm just in case: is it
> > > > >> okay to bump the catversion even after post-beta1?
> > > > >
> > > > > Yes, that happens somewhat routinely.
> > > >
> > > > Up to RC, even after beta2.  This happens routinely every year because
> > > > tweaks are always required for what got committed.  And that's OK to
> > > > do so now.
> > >
> > > Thank you both for confirmation. I'll push it shortly.
> > >
> >
> > Pushed. Thank you for giving feedback and reviewing the patch!
> >
>
> One minor side effect of this change is the original idea of comparing
> pg_stat_progress.num_dead_tuples to pg_stat_all_tables.n_dead_tup
> column becomes less obvious. I presume the release notes for
> pg_stat_progress_vacuum will be updated to also include this column
> name change as well, so maybe that's enough for folks to figure things
> out?

The release note has been updated, and I think it would help users
understand the change.

> At least I couldn't find anywhere in the docs where we have
> described the relationship between these columns before. Thoughts?

It would be a good idea to improve the documentation, but I think we
cannot simply compare these two numbers since what these fields count is
slightly different. For instance, pg_stat_all_tables.n_dead_tup includes
dead tuples that are going to be HOT-pruned.
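
For reference, the comparison in question would look roughly like the
query below, and the two numbers are expected to diverge for exactly
that reason:

SELECT p.relid::regclass, p.num_dead_item_ids, s.n_dead_tup
FROM pg_stat_progress_vacuum p
JOIN pg_stat_all_tables s ON s.relid = p.relid;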

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Logical Replication of sequences

2024-06-17 Thread Masahiko Sawada
On Fri, Jun 14, 2024 at 4:04 PM Amit Kapila  wrote:
>
> On Thu, Jun 13, 2024 at 6:14 PM Masahiko Sawada  wrote:
> >
> > On Thu, Jun 13, 2024 at 7:06 PM Amit Kapila  wrote:
> > >
> > > On Thu, Jun 13, 2024 at 1:09 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Wed, Jun 12, 2024 at 6:59 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > >
> > > > > Yeah, starting with a single worker sounds good for now. Do you think
> > > > > we should sync all the sequences in a single transaction or have some
> > > > > threshold value above which a different transaction would be required
> > > > > or maybe a different sequence sync worker altogether? Now, having
> > > > > multiple sequence-sync workers requires some synchronization so that
> > > > > only a single worker is allocated for one sequence.
> > > > >
> > > > > The simplest thing is to use a single sequence sync worker that syncs
> > > > > all sequences in one transaction but with a large number of sequences,
> > > > > it could be inefficient. OTOH, I am not sure if it would be a problem
> > > > > in reality.
> > > >
> > > > I think that we can start with using a single worker and one
> > > > transaction, and measure the performance with a large number of
> > > > sequences.
> > > >
> > >
> > > Fair enough. However, this raises the question Dilip and Vignesh are
> > > discussing whether we need a new relfilenode for sequence update even
> > > during initial sync? As per my understanding, the idea is that similar
> > > to tables, the CREATE SUBSCRIPTION command (with copy_data = true)
> > > will create the new sequence entries in pg_subscription_rel with the
> > > state as 'i'. Then the sequence-sync worker would start a transaction
> > > and one-by-one copy the latest sequence values for each sequence (that
> > > has state as 'i' in pg_subscription_rel) and mark its state as ready
> > > 'r' and commit the transaction. Now if there is an error during this
> > > operation it will restart the entire operation. The idea of creating a
> > > new relfilenode is to handle the error so that if there is a rollback,
> > > the sequence state will be rolled back to 'i' and the sequence value
> > > will also be rolled back. The other option could be that we update the
> > > sequence value without a new relfilenode and if the transaction rolled
> > > back then only the sequence's state will be rolled back to 'i'. This
> > > would work with a minor inconsistency that sequence values will be
> > > up-to-date even when the sequence state is 'i' in pg_subscription_rel.
> > > I am not sure if that matters because anyway, they can quickly be
> > > out-of-sync with the publisher again.
> >
> > I think it would be fine in many cases even if the sequence value is
> > up-to-date even when the sequence state is 'i' in pg_subscription_rel.
> > But the case we would like to avoid is where suppose the sequence-sync
> > worker does both synchronizing sequence values and updating the
> > sequence states for all sequences in one transaction, and if there is
> > an error we end up retrying the synchronization for all sequences.
> >
>
> The one idea to avoid this is to update sequences in chunks (say 100
> or some threshold number of sequences in one transaction). Then we
> would only redo the sync for the last and pending set of sequences.

That could be one idea.

>
> > >
> > > Now, say we don't want to maintain the state of sequences for initial
> > > sync at all then after the error how will we detect if there are any
> > > pending sequences to be synced? One possibility is that we maintain a
> > > subscription level flag 'subsequencesync' in 'pg_subscription' to
> > > indicate whether sequences need sync. This flag would indicate whether
> > > to sync all the sequences in pg_susbcription_rel. This would mean that
> > > if there is an error while syncing the sequences we will resync all
> > > the sequences again. This could be acceptable considering the chances
> > > of error during sequence sync are low. The benefit is that both the
> > > REFRESH PUBLICATION SEQUENCES and CREATE SUBSCRIPTION can use the same
> > > idea and sync all sequences without needing a new relfilenode. Users
> > > can always refer 'subsequencesync' flag in 'pg_subscription' to see if
> > > all the sequences are synced a

Re: Revive num_dead_tuples column of pg_stat_progress_vacuum

2024-06-13 Thread Masahiko Sawada
On Fri, Jun 14, 2024 at 9:57 AM Masahiko Sawada  wrote:
>
> On Fri, Jun 14, 2024 at 9:41 AM Michael Paquier  wrote:
> >
> > On Thu, Jun 13, 2024 at 08:38:05PM -0400, Tom Lane wrote:
> > > Masahiko Sawada  writes:
> > >> I was about to push the patch but let me confirm just in case: is it
> > >> okay to bump the catversion even after post-beta1?
> > >
> > > Yes, that happens somewhat routinely.
> >
> > Up to RC, even after beta2.  This happens routinely every year because
> > tweaks are always required for what got committed.  And that's OK to
> > do so now.
>
> Thank you both for confirmation. I'll push it shortly.
>

Pushed. Thank you for giving feedback and reviewing the patch!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Revive num_dead_tuples column of pg_stat_progress_vacuum

2024-06-13 Thread Masahiko Sawada
On Fri, Jun 14, 2024 at 9:41 AM Michael Paquier  wrote:
>
> On Thu, Jun 13, 2024 at 08:38:05PM -0400, Tom Lane wrote:
> > Masahiko Sawada  writes:
> >> I was about to push the patch but let me confirm just in case: is it
> >> okay to bump the catversion even after post-beta1?
> >
> > Yes, that happens somewhat routinely.
>
> Up to RC, even after beta2.  This happens routinely every year because
> tweaks are always required for what got committed.  And that's OK to
> do so now.

Thank you both for confirmation. I'll push it shortly.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Revive num_dead_tuples column of pg_stat_progress_vacuum

2024-06-13 Thread Masahiko Sawada
On Tue, Jun 11, 2024 at 10:38 PM Masahiko Sawada  wrote:
>
> On Fri, Jun 7, 2024 at 10:22 AM Masahiko Sawada  wrote:
> >
> > On Wed, Jun 5, 2024 at 7:19 PM Andrey M. Borodin  
> > wrote:
> > >
> > >
> > >
> > > > On 4 Jun 2024, at 00:26, Masahiko Sawada  wrote:
> > >
> > > Thank you! Vacuum enhancement is a really good step forward, and this 
> > > small change would help a lot of observability tools.
> > >
> > >
> > > > On 4 Jun 2024, at 00:49, Peter Geoghegan  wrote:
> > > >
> > > > Can we rename this to num_dead_item_ids (or something similar) in
> > > > passing?
> > >
> > > I do not insist, but many tools will have to adapt to this change [0,1]. 
> > > However, most of tools will have to deal with removed max_dead_tuples 
> > > anyway [2], so this is not that big problem.
> >
> > True, this incompatibility would not be a big problem.
> >
> > num_dead_item_ids seems good to me. I've updated the patch that
> > incorporated the comment from Álvaro[1].
>
> I'm going to push the v2 patch in a few days if there is no further comment.
>

I was about to push the patch but let me confirm just in case: is it
okay to bump the catversion even after post-beta1? This patch
reintroduces a previously-used column to pg_stat_progress_vacuum so it
requires bumping the catversion.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Logical Replication of sequences

2024-06-13 Thread Masahiko Sawada
On Thu, Jun 13, 2024 at 7:06 PM Amit Kapila  wrote:
>
> On Thu, Jun 13, 2024 at 1:09 PM Masahiko Sawada  wrote:
> >
> > On Wed, Jun 12, 2024 at 6:59 PM Amit Kapila  wrote:
> > >
> > >
> > > Yeah, starting with a single worker sounds good for now. Do you think
> > > we should sync all the sequences in a single transaction or have some
> > > threshold value above which a different transaction would be required
> > > or maybe a different sequence sync worker altogether? Now, having
> > > multiple sequence-sync workers requires some synchronization so that
> > > only a single worker is allocated for one sequence.
> > >
> > > The simplest thing is to use a single sequence sync worker that syncs
> > > all sequences in one transaction but with a large number of sequences,
> > > it could be inefficient. OTOH, I am not sure if it would be a problem
> > > in reality.
> >
> > I think that we can start with using a single worker and one
> > transaction, and measure the performance with a large number of
> > sequences.
> >
>
> Fair enough. However, this raises the question Dilip and Vignesh are
> discussing whether we need a new relfilenode for sequence update even
> during initial sync? As per my understanding, the idea is that similar
> to tables, the CREATE SUBSCRIPTION command (with copy_data = true)
> will create the new sequence entries in pg_subscription_rel with the
> state as 'i'. Then the sequence-sync worker would start a transaction
> and one-by-one copy the latest sequence values for each sequence (that
> has state as 'i' in pg_subscription_rel) and mark its state as ready
> 'r' and commit the transaction. Now if there is an error during this
> operation it will restart the entire operation. The idea of creating a
> new relfilenode is to handle the error so that if there is a rollback,
> the sequence state will be rolled back to 'i' and the sequence value
> will also be rolled back. The other option could be that we update the
> sequence value without a new relfilenode and if the transaction rolled
> back then only the sequence's state will be rolled back to 'i'. This
> would work with a minor inconsistency that sequence values will be
> up-to-date even when the sequence state is 'i' in pg_subscription_rel.
> I am not sure if that matters because anyway, they can quickly be
> out-of-sync with the publisher again.

I think it would be fine in many cases even if the sequence value is
up-to-date while the sequence state is 'i' in pg_subscription_rel.
But the case we would like to avoid is where the sequence-sync worker
does both synchronizing the sequence values and updating the sequence
states for all sequences in one transaction, and if there is an error we
end up retrying the synchronization for all sequences.

>
> Now, say we don't want to maintain the state of sequences for initial
> sync at all then after the error how will we detect if there are any
> pending sequences to be synced? One possibility is that we maintain a
> subscription level flag 'subsequencesync' in 'pg_subscription' to
> indicate whether sequences need sync. This flag would indicate whether
> to sync all the sequences in pg_susbcription_rel. This would mean that
> if there is an error while syncing the sequences we will resync all
> the sequences again. This could be acceptable considering the chances
> of error during sequence sync are low. The benefit is that both the
> REFRESH PUBLICATION SEQUENCES and CREATE SUBSCRIPTION can use the same
> idea and sync all sequences without needing a new relfilenode. Users
> can always refer 'subsequencesync' flag in 'pg_subscription' to see if
> all the sequences are synced after executing the command.

I think that REFRESH PUBLICATION {SEQUENCES} can be executed even
while the sequence-sync worker is synchronizing sequences. In this
case, the worker might not see new sequences added by the concurrent
REFRESH PUBLICATION {SEQUENCES} command since it's already running.
The worker could end up marking subsequencesync as completed without
having synchronized these new sequences.

>
> > > > Or yet another idea I came up with is that a tablesync worker will
> > > > synchronize both the table and sequences owned by the table. That is,
> > > > after the tablesync worker caught up with the apply worker, the
> > > > tablesync worker synchronizes sequences associated with the target
> > > > table as well. One benefit would be that at the time of initial table
> > > > sync being completed, the table and its sequence data are consistent.
> >
> > Correction; it's not guaranteed that the sequence data and table data
> > are consistent even in

Re: Logical Replication of sequences

2024-06-13 Thread Masahiko Sawada
On Wed, Jun 12, 2024 at 6:59 PM Amit Kapila  wrote:
>
> On Wed, Jun 12, 2024 at 10:44 AM Masahiko Sawada  
> wrote:
> >
> > On Tue, Jun 11, 2024 at 7:36 PM vignesh C  wrote:
> > >
> > > 1) CREATE PUBLICATION syntax enhancement:
> > > CREATE PUBLICATION ... FOR ALL SEQUENCES;
> > > The addition of a new column titled "all sequences" in the
> > > pg_publication system table will signify whether the publication is
> > > designated as all sequences publication or not.
> > >
> >
> > The first approach sounds like we don't create entries for sequences
> > in pg_subscription_rel. In this case, how do we know all sequences
> > that we need to refresh when executing the REFRESH PUBLICATION
> > SEQUENCES command you mentioned below?
> >
>
> As per my understanding, we should be creating entries for sequences
> in pg_subscription_rel similar to tables. The difference would be that
> we won't need all the sync_states (i = initialize, d = data is being
> copied, f = finished table copy, s = synchronized, r = ready) as we
> don't need any synchronization with apply workers.

Agreed.

>
> > > 2)  CREATE SUBSCRIPTION -- no syntax change.
> > > Upon creation of a subscription, the following additional steps will
> > > be managed by the subscriber:
> > > i) The subscriber will retrieve the list of sequences associated with
> > > the subscription's publications.
> > > ii) For each sequence: a) Retrieve the sequence value from the
> > > publisher by invoking the pg_sequence_state function. b) Set the
> > > sequence with the value obtained from the publisher. iv) Once the
> > > subscription creation is completed, all sequence values will become
> > > visible at the subscriber's end.
> >
> > Sequence values are always copied from the publisher? or does it
> > happen only when copy_data = true?
> >
>
> It is better to do it when "copy_data = true" to keep it compatible
> with the table's behavior.

+1

>
> > Probably we can
> > start with a single worker and extend it to have multiple workers.
>
> Yeah, starting with a single worker sounds good for now. Do you think
> we should sync all the sequences in a single transaction or have some
> threshold value above which a different transaction would be required
> or maybe a different sequence sync worker altogether? Now, having
> multiple sequence-sync workers requires some synchronization so that
> only a single worker is allocated for one sequence.
>
> The simplest thing is to use a single sequence sync worker that syncs
> all sequences in one transaction but with a large number of sequences,
> it could be inefficient. OTOH, I am not sure if it would be a problem
> in reality.

I think that we can start with using a single worker and one
transaction, and measure the performance with a large number of
sequences.

> > Or yet another idea I came up with is that a tablesync worker will
> > synchronize both the table and sequences owned by the table. That is,
> > after the tablesync worker caught up with the apply worker, the
> > tablesync worker synchronizes sequences associated with the target
> > table as well. One benefit would be that at the time of initial table
> > sync being completed, the table and its sequence data are consistent.

Correction; it's not guaranteed that the sequence data and table data
are consistent even in this case since the tablesync worker could get
on-disk sequence data that might have already been updated.

> > As soon as new changes come to the table, it would become inconsistent
> > so it might not be helpful much, though. Also, sequences that are not
> > owned by any table will still need to be synchronized by someone.
> >
>
> The other thing to consider in this idea is that we somehow need to
> distinguish the sequences owned by the table.

I think we can check pg_depend. The owned sequences reference the table.
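
For example, a rough sketch of such a pg_depend lookup (deptype 'a' for
owned sequences, 'i' for identity sequences) could be:

SELECT d.objid::regclass AS sequence, d.refobjid::regclass AS owning_table
FROM pg_depend d
JOIN pg_class c ON c.oid = d.objid
WHERE c.relkind = 'S'
  AND d.classid = 'pg_class'::regclass
  AND d.refclassid = 'pg_class'::regclass
  AND d.deptype IN ('a', 'i');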

>
> > >
> > > 3) Refreshing the sequence can be achieved through the existing
> > > command: ALTER SUBSCRIPTION ... REFRESH PUBLICATION(no syntax change
> > > here).
> > > The subscriber identifies stale sequences, meaning sequences present
> > > in pg_subscription_rel but absent from the publication, and removes
> > > them from the pg_subscription_rel system table. The subscriber also
> > > checks for newly added sequences in the publisher and synchronizes
> > > their values from the publisher using the steps outlined in the
> > > subscription creation process. It's worth noting that previously
> > > synchronized sequences won't be 

Re: Conflict Detection and Resolution

2024-06-13 Thread Masahiko Sawada
On Wed, Jun 5, 2024 at 3:32 PM Zhijie Hou (Fujitsu)
 wrote:
>
> Hi,
>
> This time at PGconf.dev[1], we had some discussions regarding this
> project. The proposed approach is to split the work into two main
> components. The first part focuses on conflict detection, which aims to
> identify and report conflicts in logical replication. This feature will
> enable users to monitor the unexpected conflicts that may occur. The
> second part involves the actual conflict resolution. Here, we will provide
> built-in resolutions for each conflict and allow user to choose which
> resolution will be used for which conflict(as described in the initial
> email of this thread).

I agree with this direction that we focus on conflict detection (and
logging) first and then develop conflict resolution on top of that.

>
> Of course, we are open to alternative ideas and suggestions, and the
> strategy above can be changed based on ongoing discussions and feedback
> received.
>
> Here is the patch of the first part work, which adds a new parameter
> detect_conflict for CREATE and ALTER subscription commands. This new
> parameter will decide if subscription will go for conflict detection. By
> default, conflict detection will be off for a subscription.
>
> When conflict detection is enabled, additional logging is triggered in the
> following conflict scenarios:
>
> * updating a row that was previously modified by another origin.
> * The tuple to be updated is not found.
> * The tuple to be deleted is not found.
>
> While there exist other conflict types in logical replication, such as an
> incoming insert conflicting with an existing row due to a primary key or
> unique index, these cases already result in constraint violation errors.

What does detect_conflict being true actually mean to users? I
understand that detect_conflict being true could introduce some
overhead to detect conflicts. But in terms of conflict detection, even
if detect_conflict is false, we detect some conflicts such as
concurrent inserts with the same key. Once we introduce the complete
conflict detection feature, I'm not sure there is a case where a user
wants to detect only some particular types of conflict.
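
For reference, the option proposed in the patch would be used roughly as
follows (a sketch; the connection string is made up and the option name
and default are still under discussion):

CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres' PUBLICATION pub1
    WITH (detect_conflict = true);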

> Therefore, additional conflict detection for these cases is currently
> omitted to minimize potential overhead. However, the pre-detection for
> conflict in these error cases is still essential to support automatic
> conflict resolution in the future.

I feel that we should log all types of conflict in a uniform way. For
example, with detect_conflict being true, the update_differ conflict
is reported as "conflict %s detected on relation "%s"", whereas
concurrent inserts with the same key are reported as "duplicate key
value violates unique constraint "%s"", which could confuse users.
Ideally, I think that we should log such conflict detection details
(table name, column name, conflict type, etc.) somewhere (e.g., a table
or the server log) so that users can resolve them manually.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Logical Replication of sequences

2024-06-11 Thread Masahiko Sawada
On Tue, Jun 11, 2024 at 7:36 PM vignesh C  wrote:
>
> On Tue, 11 Jun 2024 at 12:38, Masahiko Sawada  wrote:
> >
> > On Tue, Jun 11, 2024 at 12:25 PM vignesh C  wrote:
> > >
> > > On Mon, 10 Jun 2024 at 14:48, Amit Kapila  wrote:
> > > >
> > > > On Mon, Jun 10, 2024 at 12:43 PM Masahiko Sawada 
> > > >  wrote:
> > > > >
> > > > > On Mon, Jun 10, 2024 at 3:14 PM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Jun 7, 2024 at 7:30 PM Amit Kapila 
> > > > > >  wrote:
> > > > > > >
> > > > > > >
> > > > > > > Are you imagining the behavior for sequences associated with 
> > > > > > > tables
> > > > > > > differently than the ones defined by the CREATE SEQUENCE .. 
> > > > > > > command? I
> > > > > > > was thinking that users would associate sequences with 
> > > > > > > publications
> > > > > > > similar to what we do for tables for both cases. For example, they
> > > > > > > need to explicitly mention the sequences they want to replicate by
> > > > > > > commands like CREATE PUBLICATION ... FOR SEQUENCE s1, s2, ...; 
> > > > > > > CREATE
> > > > > > > PUBLICATION ... FOR ALL SEQUENCES, or CREATE PUBLICATION ... FOR
> > > > > > > SEQUENCES IN SCHEMA sch1;
> > > > > > >
> > > > > > > In this, variants FOR ALL SEQUENCES and SEQUENCES IN SCHEMA sch1
> > > > > > > should copy both the explicitly defined sequences and sequences
> > > > > > > defined with the tables. Do you think a different variant for just
> > > > > > > copying sequences implicitly associated with tables (say for 
> > > > > > > identity
> > > > > > > columns)?
> > > > > >
> > > > > > Oh, I was thinking that your proposal was to copy literally all
> > > > > > sequences by REPLICA/REFRESH SEQUENCE command.
> > > > > >
> > > >
> > > > I am trying to keep the behavior as close to tables as possible.
> > > >
> > > > > > But it seems to make
> > > > > > sense to explicitly specify the sequences they want to replicate. It
> > > > > > also means that they can create a publication that has only 
> > > > > > sequences.
> > > > > > In this case, even if they create a subscription for that 
> > > > > > publication,
> > > > > > we don't launch any apply workers for that subscription. Right?
> > > > > >
> > > >
> > > > Right, good point. I had not thought about this.
> > > >
> > > > > > Also, given that the main use case (at least as the first step) is
> > > > > > version upgrade, do we really need to support SEQUENCES IN SCHEMA 
> > > > > > and
> > > > > > even FOR SEQUENCE?
> > > > >
> > > >
> > > > At the very least, we can split the patch to move these variants to a
> > > > separate patch. Once the main patch is finalized, we can try to
> > > > evaluate the remaining separately.
> > >
> > > I engaged in an offline discussion with Amit about strategizing the
> > > division of patches to facilitate the review process. We agreed on the
> > > following split: The first patch will encompass the setting and
> > > getting of sequence values (core sequence changes). The second patch
> > > will cover all changes on the publisher side related to "FOR ALL
> > > SEQUENCES." The third patch will address subscriber side changes aimed
> > > at synchronizing "FOR ALL SEQUENCES" publications. The fourth patch
> > > will focus on supporting "FOR SEQUENCE" publication. Lastly, the fifth
> > > patch will introduce support for  "FOR ALL SEQUENCES IN SCHEMA"
> > > publication.
> > >
> > > I will work on this and share an updated patch for the same soon.
> >
> > +1. Sounds like a good plan.
>
> Amit and I engaged in an offline discussion regarding the design and
> contemplated that it could be like below:
> 1) CREATE PUBLICATION syntax enhancement:
> CREATE PUBLICATION ... FOR ALL SEQUENCES;
> The addition of a new column titled "all sequences" in the
> pg_publication system table will signify whether the publication is
> designated as all sequences publication or not.

Re: Revive num_dead_tuples column of pg_stat_progress_vacuum

2024-06-11 Thread Masahiko Sawada
On Fri, Jun 7, 2024 at 10:22 AM Masahiko Sawada  wrote:
>
> On Wed, Jun 5, 2024 at 7:19 PM Andrey M. Borodin  wrote:
> >
> >
> >
> > > On 4 Jun 2024, at 00:26, Masahiko Sawada  wrote:
> >
> > Thank you! Vacuum enhancement is a really good step forward, and this small 
> > change would help a lot of observability tools.
> >
> >
> > > On 4 Jun 2024, at 00:49, Peter Geoghegan  wrote:
> > >
> > > Can we rename this to num_dead_item_ids (or something similar) in
> > > passing?
> >
> > I do not insist, but many tools will have to adapt to this change [0,1]. 
> > However, most of tools will have to deal with removed max_dead_tuples 
> > anyway [2], so this is not that big problem.
>
> True, this incompatibility would not be a big problem.
>
> num_dead_item_ids seems good to me. I've updated the patch that
> incorporated the comment from Álvaro[1].

I'm going to push the v2 patch in a few days if there is no further comment.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Logical Replication of sequences

2024-06-11 Thread Masahiko Sawada
On Tue, Jun 11, 2024 at 12:25 PM vignesh C  wrote:
>
> On Mon, 10 Jun 2024 at 14:48, Amit Kapila  wrote:
> >
> > On Mon, Jun 10, 2024 at 12:43 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Jun 10, 2024 at 3:14 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Fri, Jun 7, 2024 at 7:30 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > >
> > > > > Are you imagining the behavior for sequences associated with tables
> > > > > differently than the ones defined by the CREATE SEQUENCE .. command? I
> > > > > was thinking that users would associate sequences with publications
> > > > > similar to what we do for tables for both cases. For example, they
> > > > > need to explicitly mention the sequences they want to replicate by
> > > > > commands like CREATE PUBLICATION ... FOR SEQUENCE s1, s2, ...; CREATE
> > > > > PUBLICATION ... FOR ALL SEQUENCES, or CREATE PUBLICATION ... FOR
> > > > > SEQUENCES IN SCHEMA sch1;
> > > > >
> > > > > In this, variants FOR ALL SEQUENCES and SEQUENCES IN SCHEMA sch1
> > > > > should copy both the explicitly defined sequences and sequences
> > > > > defined with the tables. Do you think a different variant for just
> > > > > copying sequences implicitly associated with tables (say for identity
> > > > > columns)?
> > > >
> > > > Oh, I was thinking that your proposal was to copy literally all
> > > > sequences by REPLICA/REFRESH SEQUENCE command.
> > > >
> >
> > I am trying to keep the behavior as close to tables as possible.
> >
> > > > But it seems to make
> > > > sense to explicitly specify the sequences they want to replicate. It
> > > > also means that they can create a publication that has only sequences.
> > > > In this case, even if they create a subscription for that publication,
> > > > we don't launch any apply workers for that subscription. Right?
> > > >
> >
> > Right, good point. I had not thought about this.
> >
> > > > Also, given that the main use case (at least as the first step) is
> > > > version upgrade, do we really need to support SEQUENCES IN SCHEMA and
> > > > even FOR SEQUENCE?
> > >
> >
> > At the very least, we can split the patch to move these variants to a
> > separate patch. Once the main patch is finalized, we can try to
> > evaluate the remaining separately.
>
> I engaged in an offline discussion with Amit about strategizing the
> division of patches to facilitate the review process. We agreed on the
> following split: The first patch will encompass the setting and
> getting of sequence values (core sequence changes). The second patch
> will cover all changes on the publisher side related to "FOR ALL
> SEQUENCES." The third patch will address subscriber side changes aimed
> at synchronizing "FOR ALL SEQUENCES" publications. The fourth patch
> will focus on supporting "FOR SEQUENCE" publication. Lastly, the fifth
> patch will introduce support for  "FOR ALL SEQUENCES IN SCHEMA"
> publication.
>
> I will work on this and share an updated patch for the same soon.

+1. Sounds like a good plan.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Track the amount of time waiting due to cost_delay

2024-06-11 Thread Masahiko Sawada
Hi,

On Mon, Jun 10, 2024 at 3:05 PM Bertrand Drouvot
 wrote:
>
> Hi hackers,
>
> During the last pgconf.dev I attended Robert’s presentation about autovacuum 
> and
> it made me remember of an idea I had some time ago: $SUBJECT
>
> Please find attached a patch doing so by adding a new field (aka 
> "time_delayed")
> to the pg_stat_progress_vacuum view.
>
> Currently one can change [autovacuum_]vacuum_cost_delay and
> [auto vacuum]vacuum_cost_limit but has no reliable way to measure the impact 
> of
> the changes on the vacuum duration: one could observe the vacuum duration
> variation but the correlation to the changes is not accurate (as many others
> factors could impact the vacuum duration (load on the system, i/o 
> latency,...)).
>
> This new field reports the time that the vacuum has to sleep due to cost 
> delay:
> it could be useful to 1) measure the impact of the current cost_delay and
> cost_limit settings and 2) when experimenting new values (and then help for
> decision making for those parameters).
>
> The patch is relatively small thanks to the work that has been done in
> f1889729dd (to allow parallel worker to report to the leader).

Thank you for the proposal and the patch. I understand the motivation
of this patch. Besides the point Nathan mentioned, I'm slightly worried
that a massive number of messages could be sent by parallel workers to
the leader process when the cost_limit value is low.

FWIW, when I want to confirm the vacuum delay effect, I often use the
information from the DEBUG2 log message in the VacuumUpdateCosts()
function. Exposing this data (per-worker dobalance, cost_limit,
cost_delay, active, and failsafe) somewhere in a view might also be
helpful for users to check vacuum delay effects. It wouldn't directly
measure the impact of the changes on the vacuum duration, though.
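
For example, assuming the time_delayed column proposed in this patch,
the cumulative sleep time of a running VACUUM could be watched with a
query like the following (a sketch; the column name may still change):

SELECT pid, relid::regclass AS relation, phase, time_delayed
FROM pg_stat_progress_vacuum;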

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Logical Replication of sequences

2024-06-10 Thread Masahiko Sawada
On Mon, Jun 10, 2024 at 3:14 PM Masahiko Sawada  wrote:
>
> On Fri, Jun 7, 2024 at 7:30 PM Amit Kapila  wrote:
> >
> > On Fri, Jun 7, 2024 at 7:55 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Jun 6, 2024 at 6:40 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Thu, Jun 6, 2024 at 11:10 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Wed, Jun 5, 2024 at 9:30 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > >
> > > > > > To achieve this, we can allow sequences to be copied during
> > > > > > the initial CREATE SUBSCRIPTION command similar to what we do for
> > > > > > tables. And then later by new/existing command, we re-copy the 
> > > > > > already
> > > > > > existing sequences on the subscriber.
> > > > > >
> > > > > > The options for the new command could be:
> > > > > > Alter Subscription ... Refresh Sequences
> > > > > > Alter Subscription ... Replicate Sequences
> > > > > >
> > > > > > In the second option, we need to introduce a new keyword Replicate.
> > > > > > Can you think of any better option?
> > > > >
> > > > > Another idea is doing that using options. For example,
> > > > >
> > > > > For initial sequences synchronization:
> > > > >
> > > > > CREATE SUBSCRIPTION ... WITH (copy_sequence = true);
> > > > >
> > > >
> > > > How will it interact with the existing copy_data option? So copy_data
> > > > will become equivalent to copy_table_data, right?
> > >
> > > Right.
> > >
> > > >
> > > > > For re-copy (or update) sequences:
> > > > >
> > > > > ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy_sequence = 
> > > > > true);
> > > > >
> > > >
> > > > Similar to the previous point it can be slightly confusing w.r.t
> > > > copy_data. And would copy_sequence here mean that it would copy
> > > > sequence values of both pre-existing and newly added sequences, if so,
> > > > that would make it behave differently than copy_data?  The other
> > > > possibility in this direction would be to introduce an option like
> > > > replicate_all_sequences/copy_all_sequences which indicates a copy of
> > > > both pre-existing and new sequences, if any.
> > >
> > > Copying sequence data works differently than replicating table data
> > > (initial data copy and logical replication). So I thought the
> > > copy_sequence option (or whatever better name) always does both
> > > updating pre-existing sequences and adding new sequences. REFRESH
> > > PUBLICATION updates the tables to be subscribed, so we also update or
> > > add sequences associated to these tables.
> > >
> >
> > Are you imagining the behavior for sequences associated with tables
> > differently than the ones defined by the CREATE SEQUENCE .. command? I
> > was thinking that users would associate sequences with publications
> > similar to what we do for tables for both cases. For example, they
> > need to explicitly mention the sequences they want to replicate by
> > commands like CREATE PUBLICATION ... FOR SEQUENCE s1, s2, ...; CREATE
> > PUBLICATION ... FOR ALL SEQUENCES, or CREATE PUBLICATION ... FOR
> > SEQUENCES IN SCHEMA sch1;
> >
> > In this, variants FOR ALL SEQUENCES and SEQUENCES IN SCHEMA sch1
> > should copy both the explicitly defined sequences and sequences
> > defined with the tables. Do you think a different variant for just
> > copying sequences implicitly associated with tables (say for identity
> > columns)?
>
> Oh, I was thinking that your proposal was to copy literally all
> sequences by REPLICA/REFRESH SEQUENCE command. But it seems to make
> sense to explicitly specify the sequences they want to replicate. It
> also means that they can create a publication that has only sequences.
> In this case, even if they create a subscription for that publication,
> we don't launch any apply workers for that subscription. Right?
>
> Also, given that the main use case (at least as the first step) is
> version upgrade, do we really need to support SEQUENCES IN SCHEMA and
> even FOR SEQUENCE?

Also, I guess that specifying individual sequences might not be easy
for users in some cases. For sequences owned by a column of a
table, users might want to specify them altogether, rather than
separately. For example, CREATE PUBLICATION ... FOR TABLE tab1 WITH
SEQUENCES means to add the table tab1 and its sequences to the
publication. For other sequences (i.e., not owned by any tables),
users might want to specify them individually.
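
For illustration, sketches of the syntax discussed above (none of this
exists yet; the publication and sequence names are only examples):

CREATE PUBLICATION pub_tab1 FOR TABLE tab1 WITH SEQUENCES;
CREATE PUBLICATION pub_misc FOR SEQUENCE standalone_seq;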

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Logical Replication of sequences

2024-06-10 Thread Masahiko Sawada
On Fri, Jun 7, 2024 at 7:30 PM Amit Kapila  wrote:
>
> On Fri, Jun 7, 2024 at 7:55 AM Masahiko Sawada  wrote:
> >
> > On Thu, Jun 6, 2024 at 6:40 PM Amit Kapila  wrote:
> > >
> > > On Thu, Jun 6, 2024 at 11:10 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Wed, Jun 5, 2024 at 9:30 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > >
> > > > > To achieve this, we can allow sequences to be copied during
> > > > > the initial CREATE SUBSCRIPTION command similar to what we do for
> > > > > tables. And then later by new/existing command, we re-copy the already
> > > > > existing sequences on the subscriber.
> > > > >
> > > > > The options for the new command could be:
> > > > > Alter Subscription ... Refresh Sequences
> > > > > Alter Subscription ... Replicate Sequences
> > > > >
> > > > > In the second option, we need to introduce a new keyword Replicate.
> > > > > Can you think of any better option?
> > > >
> > > > Another idea is doing that using options. For example,
> > > >
> > > > For initial sequences synchronization:
> > > >
> > > > CREATE SUBSCRIPTION ... WITH (copy_sequence = true);
> > > >
> > >
> > > How will it interact with the existing copy_data option? So copy_data
> > > will become equivalent to copy_table_data, right?
> >
> > Right.
> >
> > >
> > > > For re-copy (or update) sequences:
> > > >
> > > > ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy_sequence = true);
> > > >
> > >
> > > Similar to the previous point it can be slightly confusing w.r.t
> > > copy_data. And would copy_sequence here mean that it would copy
> > > sequence values of both pre-existing and newly added sequences, if so,
> > > that would make it behave differently than copy_data?  The other
> > > possibility in this direction would be to introduce an option like
> > > replicate_all_sequences/copy_all_sequences which indicates a copy of
> > > both pre-existing and new sequences, if any.
> >
> > Copying sequence data works differently than replicating table data
> > (initial data copy and logical replication). So I thought the
> > copy_sequence option (or whatever better name) always does both
> > updating pre-existing sequences and adding new sequences. REFRESH
> > PUBLICATION updates the tables to be subscribed, so we also update or
> > add sequences associated to these tables.
> >
>
> Are you imagining the behavior for sequences associated with tables
> differently than the ones defined by the CREATE SEQUENCE .. command? I
> was thinking that users would associate sequences with publications
> similar to what we do for tables for both cases. For example, they
> need to explicitly mention the sequences they want to replicate by
> commands like CREATE PUBLICATION ... FOR SEQUENCE s1, s2, ...; CREATE
> PUBLICATION ... FOR ALL SEQUENCES, or CREATE PUBLICATION ... FOR
> SEQUENCES IN SCHEMA sch1;
>
> In this, variants FOR ALL SEQUENCES and SEQUENCES IN SCHEMA sch1
> should copy both the explicitly defined sequences and sequences
> defined with the tables. Do you think a different variant for just
> copying sequences implicitly associated with tables (say for identity
> columns)?

Oh, I was thinking that your proposal was to copy literally all
sequences by REPLICA/REFRESH SEQUENCE command. But it seems to make
sense to explicitly specify the sequences they want to replicate. It
also means that they can create a publication that has only sequences.
In this case, even if they create a subscription for that publication,
we don't launch any apply workers for that subscription. Right?

Also, given that the main use case (at least as the first step) is
version upgrade, do we really need to support SEQUENCES IN SCHEMA and
even FOR SEQUENCE? The WIP patch Vignesh recently submitted is more
than 6k lines. I think we can cut the scope for the first
implementation so as to make the review easy.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Logical Replication of sequences

2024-06-06 Thread Masahiko Sawada
On Thu, Jun 6, 2024 at 6:40 PM Amit Kapila  wrote:
>
> On Thu, Jun 6, 2024 at 11:10 AM Masahiko Sawada  wrote:
> >
> > On Wed, Jun 5, 2024 at 9:30 PM Amit Kapila  wrote:
> > >
> >
> > > To achieve this, we can allow sequences to be copied during
> > > the initial CREATE SUBSCRIPTION command similar to what we do for
> > > tables. And then later by new/existing command, we re-copy the already
> > > existing sequences on the subscriber.
> > >
> > > The options for the new command could be:
> > > Alter Subscription ... Refresh Sequences
> > > Alter Subscription ... Replicate Sequences
> > >
> > > In the second option, we need to introduce a new keyword Replicate.
> > > Can you think of any better option?
> >
> > Another idea is doing that using options. For example,
> >
> > For initial sequences synchronization:
> >
> > CREATE SUBSCRIPTION ... WITH (copy_sequence = true);
> >
>
> How will it interact with the existing copy_data option? So copy_data
> will become equivalent to copy_table_data, right?

Right.

>
> > For re-copy (or update) sequences:
> >
> > ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy_sequence = true);
> >
>
> Similar to the previous point it can be slightly confusing w.r.t
> copy_data. And would copy_sequence here mean that it would copy
> sequence values of both pre-existing and newly added sequences, if so,
> that would make it behave differently than copy_data?  The other
> possibility in this direction would be to introduce an option like
> replicate_all_sequences/copy_all_sequences which indicates a copy of
> both pre-existing and new sequences, if any.

Copying sequence data works differently than replicating table data
(initial data copy and logical replication). So I thought the
copy_sequence option (or whatever better name) always does both
updating pre-existing sequences and adding new sequences. REFRESH
PUBLICATION updates the tables to be subscribed, so we also update or
add sequences associated to these tables.

>
> If we want to go in the direction of having an option such as
> copy_(all)_sequences then do you think specifying that copy_data is
> just for tables in the docs would be sufficient? I am afraid that it
> would be confusing for users.

I see your point. But I guess it would not be very problematic as it
doesn't break the current behavior and copy_(all)_sequences is
primarily for upgrade use cases.

>
> > >
> > > In addition to the above, the command Alter Subscription .. Refresh
> > > Publication will fetch any missing sequences similar to what it does
> > > for tables.
> >
> > On the subscriber side, do we need to track which sequences are
> > created via CREATE/ALTER SUBSCRIPTION?
> >
>
> I think so unless we find some other way to know at refresh
> publication time which all new sequences need to be part of the
> subscription. What should be the behavior w.r.t sequences when the
> user performs ALTER SUBSCRIPTION ... REFRESH PUBLICATION? I was
> thinking similar to tables, it should fetch any missing sequence
> information from the publisher.

It seems to make sense to me. But I have one question: do we want to
support replicating sequences that are not associated with any tables?
If yes, what if we refresh two different subscriptions that subscribe
to different tables on the same database? On the other hand, if not
(i.e., replicating only sequences owned by tables), can we know which
sequences to replicate by checking the subscribed tables?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Revive num_dead_tuples column of pg_stat_progress_vacuum

2024-06-06 Thread Masahiko Sawada
On Wed, Jun 5, 2024 at 7:19 PM Andrey M. Borodin  wrote:
>
>
>
> > On 4 Jun 2024, at 00:26, Masahiko Sawada  wrote:
>
> Thank you! Vacuum enhancement is a really good step forward, and this small 
> change would help a lot of observability tools.
>
>
> > On 4 Jun 2024, at 00:49, Peter Geoghegan  wrote:
> >
> > Can we rename this to num_dead_item_ids (or something similar) in
> > passing?
>
> I do not insist, but many tools will have to adapt to this change [0,1]. 
> However, most of tools will have to deal with removed max_dead_tuples anyway 
> [2], so this is not that big problem.

True, this incompatibility would not be a big problem.

num_dead_item_ids seems good to me. I've updated the patch that
incorporated the comment from Álvaro[1].

Regards,

[1] 
https://www.postgresql.org/message-id/202406041535.pmyty3ci4pfd%40alvherre.pgsql

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From 24838307afc52dfc00317579ad88a59ba5c00192 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Tue, 4 Jun 2024 06:17:25 +0900
Subject: [PATCH v2] Reintroduce dead tuple counter in pg_stat_progress_vacuum.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Commit 667e65aac3 changed both num_dead_tuples and max_dead_tuples
columns to dead_tuple_bytes and max_dead_tuple_bytes columns,
respectively. But as per discussion, the number of dead tuples
collected still can provide meaningful insights for users.

This change reintroduces the column for the count of dead tuples,
renamed as num_dead_item_ids to avoid confusion with the number of
dead tuples removed by VACUUM, which includes dead heap-only tuples,
but excludes any pre-existing LP_DEAD items left behind by
opportunistic pruning.

XXX: bump catalog version.

Reviewed-by: Peter Geoghegan, Álvaro Herrera, Andrey Borodin
Discussion: https://postgr.es/m/CAD21AoBL5sJE9TRWPyv%2Bw7k5Ee5QAJqDJEDJBUdAaCzGWAdvZw%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml |  9 +
 src/backend/access/heap/vacuumlazy.c | 12 +---
 src/backend/catalog/system_views.sql |  3 ++-
 src/include/commands/progress.h  |  5 +++--
 src/test/regress/expected/rules.out  |  5 +++--
 5 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 053da8d6e4..b2ad9b446f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6268,6 +6268,15 @@ FROM pg_stat_get_backend_idset() AS backendid;
   
  
 
+ 
+  
+  num_dead_item_ids bigint
+  
+  
+   Number of dead item identifiers collected since the last index vacuum cycle.
+  
+ 
+
  
   
indexes_total bigint
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8145ea8fc3..ef1df35afa 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2883,13 +2883,19 @@ dead_items_add(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
 			   int num_offsets)
 {
 	TidStore   *dead_items = vacrel->dead_items;
+	const int	prog_index[2] = {
+		PROGRESS_VACUUM_NUM_DEAD_ITEM_IDS,
+		PROGRESS_VACUUM_DEAD_TUPLE_BYTES
+	};
+	int64		prog_val[2];
 
 	TidStoreSetBlockOffsets(dead_items, blkno, offsets, num_offsets);
 	vacrel->dead_items_info->num_items += num_offsets;
 
-	/* update the memory usage report */
-	pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- TidStoreMemoryUsage(dead_items));
+	/* update the progress information */
+	prog_val[0] = vacrel->dead_items_info->num_items;
+	prog_val[1] = TidStoreMemoryUsage(dead_items);
+	pgstat_progress_update_multi_param(2, prog_index, prog_val);
 }
 
 /*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 53047cab5f..efb29adeb3 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1221,7 +1221,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
 S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
 S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
 S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
-S.param8 AS indexes_total, S.param9 AS indexes_processed
+S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
+S.param10 AS indexes_processed
 FROM pg_stat_get_progress_info('VACUUM') AS S
 LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 82a8fe6bd1..5616d64523 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -25,8 +25,9 @@
 #define PROGRESS_VACUUM_NUM_INDEX_VACUUMS		4
 #define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES	5
 #define PROGRESS_VACUUM_DEAD_TUPLE_BYTES		6
-#define PROGRESS_VACUUM_INDEXES_TOTAL			7
-#define PROGRESS_V

Re: Logical Replication of sequences

2024-06-05 Thread Masahiko Sawada
On Wed, Jun 5, 2024 at 9:30 PM Amit Kapila  wrote:
>
> On Wed, Jun 5, 2024 at 12:51 PM Peter Eisentraut  wrote:
> >
> > On 04.06.24 12:57, Amit Kapila wrote:
> > > 2. Provide a command say Alter Subscription ...  Replicate Sequences
> > > (or something like that) which users can perform before shutdown of
> > > the publisher node during upgrade. This will allow copying all the
> > > sequences from the publisher node to the subscriber node directly.
> > > Similar to previous approach, this could also be inconvenient for
> > > users.
> >
> > I would start with this.  In any case, you're going to need to write
> > code to collect all the sequence values, send them over some protocol,
> > apply them on the subscriber.  The easiest way to start is to trigger
> > that manually.  Then later you can add other ways to trigger it, either
> > by timer or around shutdown, or whatever other ideas there might be.
> >
>
> Agreed.

+1

> To achieve this, we can allow sequences to be copied during
> the initial CREATE SUBSCRIPTION command similar to what we do for
> tables. And then later by new/existing command, we re-copy the already
> existing sequences on the subscriber.
>
> The options for the new command could be:
> Alter Subscription ... Refresh Sequences
> Alter Subscription ... Replicate Sequences
>
> In the second option, we need to introduce a new keyword Replicate.
> Can you think of any better option?

Another idea is doing that using options. For example,

For initial sequences synchronization:

CREATE SUBSCRIPTION ... WITH (copy_sequence = true);

For re-copy (or update) sequences:

ALTER SUBSCRIPTION ... REFRESH PUBLICATION WITH (copy_sequence = true);

>
> In addition to the above, the command Alter Subscription .. Refresh
> Publication will fetch any missing sequences similar to what it does
> for tables.

On the subscriber side, do we need to track which sequences are
created via CREATE/ALTER SUBSCRIPTION?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Logical Replication of sequences

2024-06-05 Thread Masahiko Sawada
On Wed, Jun 5, 2024 at 12:43 PM Amit Kapila  wrote:
>
> On Tue, Jun 4, 2024 at 8:56 PM Yogesh Sharma
>  wrote:
> >
> > On 6/4/24 06:57, Amit Kapila wrote:
> >
> > > 2. Provide a command say Alter Subscription ...  Replicate Sequences
> > > (or something like that) which users can perform before shutdown of
> > > the publisher node during upgrade. This will allow copying all the
> > > sequences from the publisher node to the subscriber node directly.
> > > Similar to previous approach, this could also be inconvenient for
> > > users.
> >
> > This is similar to option 1 except that it is a SQL command now.
> >
>
> Right, but I would still prefer a command as it provides clear steps
> for the upgrade. Users need to perform (a) Replicate Sequences for a
> particular subscription (b) Disable that subscription (c) Perform (a)
> and (b) for all the subscriptions corresponding to the publisher we
> want to shut down for upgrade.
>
> I agree there are some manual steps involved here but it is advisable
> for users to ensure that they have received the required data on the
> subscriber before the upgrade of the publisher node, otherwise, they
> may not be able to continue replication after the upgrade. For
> example, see the "Prepare for publisher upgrades" step in pg_upgrade
> docs [1].
>
> >
> > > 3. Replicate published sequences via walsender at the time of shutdown
> > > or incrementally while decoding checkpoint record. The two ways to
> > > achieve this are: (a) WAL log a special NOOP record just before
> > > shutting down checkpointer. Then allow the WALsender to read the
> > > sequence data and send it to the subscriber while decoding the new
> > > NOOP record. (b) Similar to the previous idea but instead of WAL
> > > logging a new record directly invokes a decoding callback after
> > > walsender receives a request to shutdown which will allow pgoutput to
> > > read and send required sequences. This approach has a drawback that we
> > > are adding more work at the time of shutdown but note that we already
> > > waits for all the WAL records to be decoded and sent before shutting
> > > down the walsender during shutdown of the node.
> >
> > At the time of shutdown a) most logical upgrades don't necessarily call
> > for shutdown
> >
>
> Won't the major version upgrade expect that the node is down? Refer to
> step "Stop both servers" in [1].

I think the idea is that the publisher is the old version and the
subscriber is the new version, and changes generated on the publisher
are replicated to the subscriber via logical replication. And at some
point, we change the application (or a router) settings so that no
more transactions come to the publisher, do the last upgrade
preparation work (e.g. copying the latest sequence values if
required), and then change the application so that new transactions
come to the subscriber.
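
For illustration, the final cut-over on each subscription could then be
a sketch like the following, using the command syntax proposed upthread
(the first command does not exist yet):

ALTER SUBSCRIPTION sub1 REPLICATE SEQUENCES;  -- or REFRESH SEQUENCES
ALTER SUBSCRIPTION sub1 DISABLE;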

I remember the blog post about Knock doing a similar process to
upgrade the clusters with minimal downtime[1].

Regards,

[1] https://knock.app/blog/zero-downtime-postgres-upgrades


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Revive num_dead_tuples column of pg_stat_progress_vacuum

2024-06-03 Thread Masahiko Sawada
Hi,

Commit 667e65aac3 changed num_dead_tuples and max_dead_tuples columns
to dead_tuple_bytes and max_dead_tuple_bytes columns, respectively.
But at PGConf.dev, I got feedback from multiple people that the
num_dead_tuples information can still provide meaningful insights for
users to understand the vacuum progress. One use case is to compare
num_dead_tuples to pg_stat_all_tables.n_dead_tup column.
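
For example, assuming the revived column, the comparison could look
like this (a sketch):

SELECT p.relid::regclass AS relation,
       p.num_dead_tuples,
       s.n_dead_tup
FROM pg_stat_progress_vacuum p
JOIN pg_stat_all_tables s ON s.relid = p.relid;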

I've attached a patch to revive the num_dead_tuples column in the
pg_stat_progress_vacuum view. This requires bumping the catalog
version. We're post-beta1, but it should be okay as it's only for PG17.

Feedback is very welcome.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


0001-Revive-num_dead_tuples-column-of-pg_stat_progress_va.patch
Description: Binary data


Re: First draft of PG 17 release notes

2024-05-21 Thread Masahiko Sawada
Hi,

On Thu, May 9, 2024 at 1:03 PM Bruce Momjian  wrote:
>
> I have committed the first draft of the PG 17 release notes;  you can
> see the results here:
>
> https://momjian.us/pgsql_docs/release-17.html
>
> It will be improved until the final release.  The item count is 188,
> which is similar to recent releases:
>

I found a typo:

s/pg_statstatement/pg_stat_statement/

I've attached a patch to fix it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_pg_stat_statements.patch
Description: Binary data


Re: PostgreSQL 17 Beta 1 release announcement draft

2024-05-20 Thread Masahiko Sawada
On Mon, May 20, 2024 at 8:47 PM Jonathan S. Katz  wrote:
>
> On 5/20/24 2:58 AM, John Naylor wrote:
> > Hi Jon,
> >
> > Regarding vacuum "has shown up to a 6x improvement in overall time to
> > complete its work" -- I believe I've seen reported numbers close to
> > that only 1) when measuring the index phase in isolation or maybe 2)
> > the entire vacuum of unlogged tables with one, perfectly-correlated
> > index (testing has less variance with WAL out of the picture). I
> > believe tables with many indexes would show a lot of improvement, but
> > I'm not aware of testing that case specifically. Can you clarify where
> > 6x came from?
>
> Sawada-san showed me the original context, but I can't rapidly find it
> in the thread. Sawada-san, can you please share the numbers behind this?
>

I referenced the numbers that I measured during the development[1]
(test scripts are here[2]). IIRC I used unlogged tables and indexes,
and these numbers were the entire vacuum execution time including heap
scanning, index vacuuming and heap vacuuming.

FYI today I've run the same script with PG17 and measured the
execution times. Here are results:

monotonically ordered int column index:
system usage: CPU: user: 1.72 s, system: 0.47 s, elapsed: 2.20 s

uuid column index:
system usage: CPU: user: 3.62 s, system: 0.89 s, elapsed: 4.52 s

int & uuid indexes in parallel:
system usage: CPU: user: 2.24 s, system: 0.44 s, elapsed: 5.01 s

These numbers are better than the ones I measured with the v62 patch
set, as we have since introduced some optimizations into tidstore
(8a1b31e6 and f35bd9b).

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoBci3Hujzijubomo1tdwH3XtQ9F89cTNQ4bsQijOmqnEw%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CANWCAZYqWibTRCWs5mV57mLj1A0nbKX-eV5G%2Bd-KmBOGHTVY-w%40mail.gmail.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Pgoutput not capturing the generated columns

2024-05-20 Thread Masahiko Sawada
Hi,

On Wed, May 8, 2024 at 4:14 PM Shubham Khanna
 wrote:
>
> On Wed, May 8, 2024 at 11:39 AM Rajendra Kumar Dangwal
>  wrote:
> >
> > Hi PG Hackers.
> >
> > We are interested in enhancing the functionality of the pgoutput plugin by 
> > adding support for generated columns.
> > Could you please guide us on the necessary steps to achieve this? 
> > Additionally, do you have a platform for tracking such feature requests? 
> > Any insights or assistance you can provide on this matter would be greatly 
> > appreciated.
>
> The attached patch has the changes to support capturing generated
> column data using ‘pgoutput’ and’ test_decoding’ plugin. Now if the
> ‘include_generated_columns’ option is specified, the generated column
> information and generated column data also will be sent.

As Euler mentioned earlier, I think it was a deliberate decision not to
replicate generated columns because we don't know whether the target
table on the subscriber has the same expression, and there could be
locale issues even if it looks the same. I can see that a benefit of
this proposal would be to save the cost of computing generated column
values if the user wants the target table on the subscriber to have
exactly the same data as the publisher's. Are there other benefits or
use cases?

>
> Usage from pgoutput plugin:
> CREATE TABLE gencoltable (a int PRIMARY KEY, b int GENERATED ALWAYS AS
> (a * 2) STORED);
> CREATE publication pub1 for all tables;
> SELECT 'init' FROM pg_create_logical_replication_slot('slot1', 'pgoutput');
> SELECT * FROM pg_logical_slot_peek_binary_changes('slot1', NULL, NULL,
> 'proto_version', '1', 'publication_names', 'pub1',
> 'include_generated_columns', 'true');
>
> Usage from test_decoding plugin:
> SELECT 'init' FROM pg_create_logical_replication_slot('slot2', 
> 'test_decoding');
> CREATE TABLE gencoltable (a int PRIMARY KEY, b int GENERATED ALWAYS AS
> (a * 2) STORED);
> INSERT INTO gencoltable (a) VALUES (1), (2), (3);
> SELECT data FROM pg_logical_slot_get_changes('slot2', NULL, NULL,
> 'include-xids', '0', 'skip-empty-xacts', '1',
> 'include_generated_columns', '1');
>
> Currently it is not supported as a subscription option because table
> sync for the generated column is not possible as copy command does not
> support getting data for the generated column. If this feature is
> required we can remove this limitation from the copy command and then
> add it as a subscription option later.
> Thoughts?

I think that if we want to support an option to replicate generated
columns, the initial tablesync should support it too. Otherwise, we
end up filling the target columns with NULL during the initial
tablesync but with replicated data during the streaming of changes.
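
As a rough sketch of one possible approach for the initial tablesync
(reusing the gencoltable example from above): while the plain table
form of COPY does not include generated columns, the query form does,
so the sync worker could fall back to something like:

COPY (SELECT a, b FROM gencoltable) TO STDOUT;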

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Lowering the minimum value for maintenance_work_mem

2024-05-19 Thread Masahiko Sawada
On Fri, May 17, 2024 at 5:55 AM Andres Freund  wrote:
>
> Hi,
>
> In the subthread at [1] I needed to trigger multiple rounds of index vacuuming
> within one vacuum.
>
> It turns out that with the new dead tuple implementation, that got actually
> somewhat expensive. Particularly if all tuples on all pages get deleted, the
> representation is just "too dense". Normally that's obviously very good, but
> for testing, not so much:
>
> With the minimum setting of maintenance_work_mem=1024kB, a simple table with
> narrow rows, where all rows are deleted, the first cleanup happens after
> 3697812 dead tids. The table for that has to be > ~128MB.
>
> Needing a ~128MB table to be able to test multiple cleanup passes makes it
> much more expensive to test and consequently will lead to worse test coverage.
>
> I think we should consider lowering the minimum setting of
> maintenance_work_mem to the minimum of work_mem.

+1 for lowering the minimum value of maintenance_work_mem. I've faced
the same situation.

Even if a shared tidstore is empty, TidStoreMemoryUsage() returns
256kB because it's the minimum segment size of DSA, i.e.
DSA_MIN_SEGMENT_SIZE. So we can lower the minimum maintenance_work_mem
down to 256kB, from a vacuum perspective.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-05-14 Thread Masahiko Sawada
On Sat, May 4, 2024 at 7:36 AM Joe Conway  wrote:
>
> On 5/3/24 11:44, Peter Eisentraut wrote:
> > On 03.05.24 16:13, Tom Lane wrote:
> >> Peter Eisentraut  writes:
> >>> On 30.04.24 19:29, Tom Lane wrote:
> >>>> Also, the bigger picture here is the seeming assumption that "if
> >>>> we change pg_trgm then it will be safe to replicate from x86 to
> >>>> arm".  I don't believe that that's a good idea and I'm unwilling
> >>>> to promise that it will work, regardless of what we do about
> >>>> char signedness.  That being the case, I don't want to invest a
> >>>> lot of effort in the signedness issue.  Option (1) is clearly
> >>>> a small change with little if any risk of future breakage.
> >>
> >>> But note that option 1 would prevent some replication that is currently
> >>> working.
> >>
> >> The point of this thread though is that it's working only for small
> >> values of "work".  People are rightfully unhappy if it seems to work
> >> and then later they get bitten by compatibility problems.
> >>
> >> Treating char signedness as a machine property in pg_control would
> >> signal that we don't intend to make it work, and would ensure that
> >> even the most minimal testing would find out that it doesn't work.
> >>
> >> If we do not do that, it seems to me we have to buy into making
> >> it work.  That would mean dealing with the consequences of an
> >> incompatible change in pg_trgm indexes, and then going through
> >> the same dance again the next time(s) similar problems are found.
> >
> > Yes, that is understood.  But anecdotally, replicating between x86-64 arm64 
> > is
> > occasionally used for upgrades or migrations.  In practice, this appears to 
> > have
> > mostly worked.  If we now discover that it won't work with certain index
> > extension modules, it's usable for most users. Even if we say, you have to
> > reindex everything afterwards, it's probably still useful for these 
> > scenarios.
>
> +1

+1

How about extending amcheck to support GIN and GiST indexes so that it
can detect potential data incompatibility due to changing 'char' to
'unsigned char'? I think these new tests would also be useful for
users to check whether they really need to reindex due to such
changes. Also, we could fix pg_trgm so that it uses 'unsigned char' in
PG18. Users who upgraded to PG18 could then run the new amcheck tests
on the primary as well as the standby.
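
For reference, today's amcheck usage is limited to B-tree (and heap)
checks, e.g.:

CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('my_btree_index'::regclass, true);

A gin_index_check()/gist_index_check() with a similar interface is only
hypothetical at this point; the index name above is also just an
example.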

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: First draft of PG 17 release notes

2024-05-14 Thread Masahiko Sawada
On Thu, May 9, 2024 at 10:48 PM Bruce Momjian  wrote:
>
> On Thu, May  9, 2024 at 02:17:12PM +0900, Masahiko Sawada wrote:
> > Hi,
> >
>
> > Also, please consider the following item:
> >
> > - Improve eviction algorithm in ReorderBuffer using max-heap for many
> > subtransactions (5bec1d6bc)
>
> I looked at that item and I don't have a generic "make logical
> replication apply faster" item to merge it into, and many
> subtransactions seemed like enough of an edge-case that I didn't think
> mentioning it make sense.  Can you see a good place to add it?

I think that cases with many subtransactions are no longer edge cases
these days, so we needed to improve them, and it might be helpful for
users to mention it. How about the following item, for example?

Improve logical decoding performance in cases where there are many
subtransactions.

>
> > Finally, should we mention the following commit in the release note?
> > It's not a user-visible change but added a new regression test module.
> >
> > - Add tests for XID wraparound (e255b646a)
>
> I don't normally add testing infrastructure changes unless they are
> major.

I've seen that we had such an item, for example in the PG14 release notes:

Add a test module for the regular expression package (Tom Lane)

But if our policy has already changed, I'm okay with not mentioning
the xid_wraparound test in the PG17 release note.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Fix parallel vacuum buffer usage reporting

2024-05-10 Thread Masahiko Sawada
On Fri, May 10, 2024 at 7:26 PM Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> Thank you for working on this!
>
> On Wed, 1 May 2024 at 06:37, Masahiko Sawada  wrote:
> >
> > Thank you for further testing! I've pushed the patch.
>
> I realized a behaviour change while looking at 'Use pgBufferUsage for
> block reporting in analyze' thread [1]. Since that change applies here
> as well, I thought it is better to mention it here.
>
> Before this commit, VacuumPageMiss did not count the blocks if its
> read was already completed by other backends [2]. Now,
> 'pgBufferUsage.local_blks_read + pgBufferUsage.shared_blks_read'
> counts every block attempted to be read; possibly double counting if
> someone else has already completed the read.

True. IIUC there is such a difference only in HEAD but not in v15 and
v16. The following comment in WaitReadBuffers() says that it's
traditional behavior to count blocks as read even if someone else has
already completed the I/O:

/*
 * We count all these blocks as read by this backend.  This is traditional
 * behavior, but might turn out to be not true if we find that someone
 * else has beaten us and completed the read of some of these blocks.  In
 * that case the system globally double-counts, but we traditionally don't
 * count this as a "hit", and we don't have a separate counter for "miss,
 * but another backend completed the read".
 */

So I think using pgBufferUsage for (parallel) vacuum is a legitimate
usage and makes it more consistent with other parallel operations.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: First draft of PG 17 release notes

2024-05-08 Thread Masahiko Sawada
Hi,

On Thu, May 9, 2024 at 1:03 PM Bruce Momjian  wrote:
>
> I have committed the first draft of the PG 17 release notes;  you can
> see the results here:
>
> https://momjian.us/pgsql_docs/release-17.html

Thank you for working on that!

I'd like to mention some of my works. I think we can add the vacuum
performance improvements by the following commits:

- Add template for adaptive radix tree (ee1b30f1)
- Add TIDStore, to store sets of TIDs (ItemPointerData) efficiently (30e144287)
- Use TidStore for dead tuple TIDs storage during lazy vacuum (667e65aac)

Also, please consider the following item:

- Improve eviction algorithm in ReorderBuffer using max-heap for many
subtransactions (5bec1d6bc)

Finally, should we mention the following commit in the release note?
It's not a user-visible change but added a new regression test module.

- Add tests for XID wraparound (e255b646a)

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Fix parallel vacuum buffer usage reporting

2024-05-08 Thread Masahiko Sawada
On Fri, May 3, 2024 at 3:41 PM Anthonin Bonnefoy
 wrote:
>
> On Wed, May 1, 2024 at 5:37 AM Masahiko Sawada  wrote:
>>
>> Thank you for further testing! I've pushed the patch.
>
> Thanks!
>
> Here is the rebased version for the follow-up patch removing VacuumPage 
> variables. Though I'm not sure if I should create a dedicated mail thread 
> since the bug was fixed and the follow-up is more of a refactoring. What do 
> you think?

I'd suggest starting a new thread or changing the subject as the
current subject no longer matches what we're discussing.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-05-07 Thread Masahiko Sawada
On Wed, May 1, 2024 at 4:29 PM John Naylor  wrote:
>
> On Thu, Apr 25, 2024 at 8:36 AM Masahiko Sawada  wrote:
> >
> > On Mon, Apr 15, 2024 at 6:12 PM John Naylor  wrote:
>
> > > - RT_KEY_GET_SHIFT is not covered for key=0:
> > >
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803
> > >
> > > That should be fairly simple to add to the tests.
> >
> > There are two paths to call RT_KEY_GET_SHIFT():
> >
> > 1. RT_SET() -> RT_KEY_GET_SHIFT()
> > 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT()
> >
> > In both cases, it's called when key > tree->ctl->max_val. Since the
> > minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called
> > when key=0.
>
> Ah, right, so it is dead code. Nothing to worry about, but it does
> point the way to some simplifications, which I've put together in the
> attached.

Thank you for the patch. It looks good to me.

+   /* compute the smallest shift that will allowing storing the key */
+   start_shift = pg_leftmost_one_pos64(key) / RT_SPAN * RT_SPAN;

The comment is moved from RT_KEY_GET_SHIFT() but I think s/will
allowing storing/will allow storing/.

>
> > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:
> > >
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644
> > >
> > > That should be easy to add.
> >
> > Agreed. The patch is attached.
>
> LGTM
>
> > > - TidStoreCreate* has some memory clamps that are not covered:
> > >
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
> > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234
> > >
> > > Maybe we could experiment with using 1MB for shared, and something
> > > smaller for local.
> >
> > I've confirmed that the local and shared tidstore with small max sizes
> > such as 4kB and 1MB worked. Currently the max size is hard-coded in
> > test_tidstore.c but if we use work_mem as the max size, we can pass
> > different max sizes for local and shared in the test script.
>
> Seems okay, do you want to try that and see how it looks?

I've attached a simple patch for this. In test_tidstore.sql, we used
to create two local tidstores and one shared tidstore. I thought of
specifying small work_mem values for these three cases, but that would
remove the normal test cases. So I created a separate tidstore for this
test. Also, the new test only checks whether a tidstore can be created
with such a small size, but it might be a good idea to add some TIDs
to check that it really works fine.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


use_work_mem_as_max_bytes.patch
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-04-30 Thread Masahiko Sawada
On Tue, Apr 30, 2024 at 3:34 PM Anthonin Bonnefoy
 wrote:
>
> I've done some additional tests to validate the reported numbers. Using 
> pg_statio, it's possible to get the minimum number of block hits (Full script 
> attached).
>
> -- Save block hits before vacuum
> SELECT pg_stat_force_next_flush();
> SELECT heap_blks_hit, idx_blks_hit FROM pg_statio_all_tables where 
> relname='vestat' \gset
> vacuum (verbose, index_cleanup on) vestat;
> -- Check the difference
> SELECT pg_stat_force_next_flush();
> SELECT heap_blks_hit - :heap_blks_hit as delta_heap_hit,
>idx_blks_hit - :idx_blks_hit as delta_idx_hit,
>heap_blks_hit - :heap_blks_hit + idx_blks_hit - :idx_blks_hit as sum
> FROM pg_statio_all_tables where relname='vestat';
>
> Output:
> ...
> buffer usage: 14676 hits, 0 misses, 667 dirtied
> buffer usage (new): 16081 hits, 0 misses, 667 dirtied
> ...
>  -[ RECORD 1 ]--+--
> delta_heap_hit | 9747
> delta_idx_hit  | 6325
> sum| 16072
>
> From pg_statio, we had 16072 blocks for the relation + indexes.
> Pre-patch, we are under reporting with 14676.
> Post-patch, we have 16081. The 9 additional block hits come from vacuum 
> accessing catalog tables like pg_class or pg_class_oid_index.
>

Thank you for further testing! I've pushed the patch.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-04-30 Thread Masahiko Sawada
On Wed, May 1, 2024 at 2:29 AM Tom Lane  wrote:
>
> Alexander Korotkov  writes:
> > I agree that storing char signedness might seem weird.  But it appears
> > that we already store indexes that depend on char signedness.  So,
> > it's effectively property of bits-on-disk even though it affects
> > indirectly.  Then I see two options to make the picture consistent.
> > 1) Assume that char signedness is somehow a property of bits-on-disk
> > even though it's weird.  Then pg_trgm indexes are correct, but we need
> > to store char signedness in pg_control.
> > 2) Assume that char signedness is not a property of bits-on-disk.
> > Then pg_trgm indexes are buggy and need to be fixed.
> > What do you think?
>
> The problem with option (2) is the assumption that pg_trgm's behavior
> is the only bug of this kind, either now or in the future.  I think
> that's just about an impossible standard to meet, because there's no
> realistic way to test whether char signedness is affecting things.
> (Sure, you can compare results across platforms, but maybe you
> just didn't test the right case.)
>
> Also, the bigger picture here is the seeming assumption that "if
> we change pg_trgm then it will be safe to replicate from x86 to
> arm".  I don't believe that that's a good idea and I'm unwilling
> to promise that it will work, regardless of what we do about
> char signedness.  That being the case, I don't want to invest a
> lot of effort in the signedness issue.

I think that the char signedness issue is also an issue for developers
(and extension authors), since it could lead to confusion and
potential bugs in the future. x86 developers would think of char as
always being signed and write code that misbehaves on arm machines.
For example, since logical replication should behave correctly even in
cross-arch replication, all developers need to be aware of this. I
thought of using the -funsigned-char (or -fsigned-char) compiler flag
to avoid that, but it would have a broader impact, not only on
indexes.
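
Just to illustrate the trap with a stand-alone example (not taken from
any patch): the same comparison gives different results depending on
whether plain char is signed (x86) or unsigned (arm):

    #include <stdio.h>

    int
    main(void)
    {
        /* a byte >= 0x80, e.g. taken from a multibyte character */
        char        c = (char) 0xE0;

        /* prints 0 on x86 (signed char) but 1 on arm (unsigned char) */
        printf("%d\n", c > 0);
        return 0;
    }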

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-04-30 Thread Masahiko Sawada
On Tue, Apr 30, 2024 at 12:37 PM Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > On Tue, Apr 23, 2024 at 11:57 PM Tom Lane  wrote:
> >> Reject as not a bug.  Discourage people from thinking that physical
> >> replication will work across architectures.
>
> > While cross-arch physical replication is not supported, I think having
> > architecture-dependent differences is not good, and it's legitimate to
> > fix them. FYI, the 'char' data type comparison is done as though char
> > is unsigned. I've attached a small patch to fix it. What do you think?
>
> I think this will break existing indexes that are working fine.
> Yeah, it would have been better to avoid the difference, but
> it's too late now.

True. So it will be a PG18 item.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-04-29 Thread Masahiko Sawada
On Tue, Apr 23, 2024 at 11:57 PM Tom Lane  wrote:
>
> "Guo, Adam"  writes:
> > I would like to report an issue with the pg_trgm extension on
> > cross-architecture replication scenarios. When an x86_64 standby
> > server is replicating from an aarch64 primary server or vice versa,
> > the gist_trgm_ops opclass returns different results on the primary
> > and standby.
>
> I do not think that is a supported scenario.  Hash functions and
> suchlike are not guaranteed to produce the same results on different
> CPU architectures.  As a quick example, I get
>
> regression=# select hashfloat8(34);
>  hashfloat8
> 
>21570837
> (1 row)
>
> on x86_64 but
>
> postgres=# select hashfloat8(34);
>  hashfloat8
> 
>  -602898821
> (1 row)
>
> on ppc32 thanks to the endianness difference.
>
> > Given that this has problem has come up before and seems likely to
> > come up again, I'm curious what other broad solutions there might be
> > to resolve it?
>
> Reject as not a bug.  Discourage people from thinking that physical
> replication will work across architectures.

While cross-arch physical replication is not supported, I think having
architecture-dependent differences is not good, and it's legitimate to
fix them. FYI, the 'char' data type comparison is done as though char
is unsigned. I've attached a small patch to fix it. What do you think?
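
The idea of the fix is simply to compare the bytes through unsigned
char so that bytes >= 0x80 sort the same way on every platform,
roughly like below (a sketch only; the helper name is made up, and the
attached patch may express this differently, e.g. inside the existing
comparison macros):

    /* compare two trigram bytes independently of char signedness */
    static inline int
    trgm_char_cmp(char a, char b)
    {
        unsigned char ua = (unsigned char) a;
        unsigned char ub = (unsigned char) b;

        return (ua > ub) - (ua < ub);
    }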

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


fix_signedness_issue_in_pg_trgm.patch
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-04-29 Thread Masahiko Sawada
On Fri, Apr 26, 2024 at 9:12 PM Alena Rybakina  wrote:
>
> Hi!
>>
>> The same script was run, but using vacuum verbose analyze, and I saw the 
>> difference again in the fifth step:
>> with your patch: buffer usage: 32312 hits, 607 misses, 1566 dirtied
>> master: buffer usage: 32346 hits, 573 misses, 1360 dirtied
>
> Isn't there a chance for the checkpointer to run during this time? That could 
> make the conditions between the two runs slightly different and explain the 
> change in buffer report.
>
> [0] 
> https://github.com/postgres/postgres/blob/8a1b31e6e59631807a08a4e9465134c343bbdf5e/src/backend/access/heap/vacuumlazy.c#L2826-L2831
>
> Looking at the script, you won't trigger the problem.
>
> Thank you for the link I accounted it in my next experiments.
>
> I repeated the test without processing checkpoints with a single index, and 
> the number of pages in the buffer used almost matched:
>
> master branch: buffer usage: 32315 hits, 606 misses, 4486 dirtied
>
> with applied patch v4 version: buffer usage: 32315 hits, 606 misses, 4489 
> dirtied
>
> I think you are right - the problem was interfering with the checkpoint 
> process, by the way I checked the first version patch. To cut a long story 
> short, everything is fine now with one index.
>
> Just in case, I'll explain: I considered this case because your patch could 
> have impact influenced it too.
>
> On 25.04.2024 10:17, Anthonin Bonnefoy wrote:
>
>
> On Wed, Apr 24, 2024 at 4:01 PM Alena Rybakina  
> wrote:
>>
>> I tested the main postgres branch with and without your fix using a script 
>> that was written by me. It consists of five scenarios and I made a 
>> comparison in the logs between the original version of the master branch and 
>> the master branch with your patch:
>
>  Hi! Thanks for the tests.
>
>> I have attached a test file (vacuum_check_logs.sql)
>
> The reporting issue will only happen if there's a parallel index vacuum and 
> it will only happen if there's at least 2 indexes [0]. You will need to 
> create an additional index.
>
> Speaking of the problem, I added another index and repeated the test and 
> found a significant difference:
>
> I found it when I commited the transaction (3):
>
> master: 2964 hits, 0 misses, 0 dirtied
>
> with applied patch v4 version: buffer usage: 33013 hits, 0 misses, 3 dirtied
>
> When I deleted all the data from the table and later started vacuum verbose 
> again (4):
>
> master: buffer usage: 51486 hits, 0 misses, 0 dirtied
>
> with applied patch v4 version:buffer usage: 77924 hits, 0 misses, 0 dirtied
>
> when I inserted 1 million data into the table and updated it (5):
>
> master:buffer usage: 27904 hits, 5021 misses, 1777 dirtied
>
> with applied patch v4 version:buffer usage: 41051 hits, 9973 misses, 2564 
> dirtied
>
> As I see, the number of pages is significantly more than it was in the master 
> branch and ,frankly, I couldn't fully figure out if it was a mistake or not.

I think that the patch fixes the problem correctly.

I've run pgindent and updated the commit message. I realized that
parallel vacuum was introduced in pg13, but buffer usage reporting in
the VACUUM command was implemented in pg15. Therefore, in pg13 and
pg14, VACUUM (PARALLEL) is available but VACUUM (PARALLEL, VERBOSE)
doesn't show the buffer usage report. Autovacuum does show the buffer
usage report, but parallel autovacuum is not supported. Therefore, we
should backpatch it down to 15, not 13.

I'm going to push the patch down to pg15, barring any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v5-0001-Fix-parallel-vacuum-buffer-usage-reporting.patch
Description: Binary data


Re: New committers: Melanie Plageman, Richard Guo

2024-04-27 Thread Masahiko Sawada
On Fri, Apr 26, 2024 at 8:54 PM Jonathan S. Katz  wrote:
>
> The Core Team would like to extend our congratulations to Melanie
> Plageman and Richard Guo, who have accepted invitations to become our
> newest PostgreSQL committers.
>
> Please join us in wishing them much success and few reverts!
>

Congratulations to both!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-25 Thread Masahiko Sawada
On Thu, Apr 25, 2024 at 1:38 PM Masahiko Sawada  wrote:
>
> On Thu, Apr 25, 2024 at 12:17 PM John Naylor  wrote:
> >
> > On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada  
> > wrote:
> > >
> > > > I saw a SIGSEGV there when using tidstore to write a fix for something 
> > > > else.
> > > > Patch attached.
> > >
> > > Great find, thank you for the patch!
> >
> > +1
> >
> > (This occurred to me a few days ago, but I was far from my computer.)
> >
> > With the purge function that  Noah proposed, I believe we can also get
> > rid of the comment at the top of the .sql test file warning of a
> > maintenance hazard:
> > ..."To avoid adding duplicates,
> > -- each call to do_set_block_offsets() should use different block
> > -- numbers."
>
> Good point. Removed.
>
> >
> > > of do_set_block_offsets() and check_set_block_offsets(). If these are
> > > annoying, we can remove the cases of array[1] and array[1,2].
> >
> > Let's keep those -- 32-bit platforms should also exercise this path.
>
> Agreed.
>
> I've attached a new patch. I'll push it tonight, if there is no further 
> comment.
>

Pushed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-24 Thread Masahiko Sawada
On Thu, Apr 25, 2024 at 12:17 PM John Naylor  wrote:
>
> On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada  wrote:
> >
> > > I saw a SIGSEGV there when using tidstore to write a fix for something 
> > > else.
> > > Patch attached.
> >
> > Great find, thank you for the patch!
>
> +1
>
> (This occurred to me a few days ago, but I was far from my computer.)
>
> With the purge function that  Noah proposed, I believe we can also get
> rid of the comment at the top of the .sql test file warning of a
> maintenance hazard:
> ..."To avoid adding duplicates,
> -- each call to do_set_block_offsets() should use different block
> -- numbers."

Good point. Removed.

>
> > of do_set_block_offsets() and check_set_block_offsets(). If these are
> > annoying, we can remove the cases of array[1] and array[1,2].
>
> Let's keep those -- 32-bit platforms should also exercise this path.

Agreed.

I've attached a new patch. I'll push it tonight, if there is no further comment.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v2-0001-radixtree-Fix-SIGSEGV-at-update-of-embeddable-val.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-24 Thread Masahiko Sawada
On Thu, Apr 25, 2024 at 6:03 AM Noah Misch  wrote:
>
> On Mon, Apr 15, 2024 at 04:12:38PM +0700, John Naylor wrote:
> > - Some paths for single-value leaves are not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606
> >
> > However, these paths do get regression test coverage on 32-bit
> > machines. 64-bit builds only have leaves in the TID store, which
> > doesn't (currently) delete entries, and doesn't instantiate the tree
> > with the debug option.
> >
> > - In RT_SET "if (found)" is not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
> >
> > That's because we don't yet have code that replaces an existing value
> > with a value of a different length.
>
> I saw a SIGSEGV there when using tidstore to write a fix for something else.
> Patch attached.

Great find, thank you for the patch!

The fix looks good to me. I think we can improve the regression tests
for better coverage. In TidStore on a 64-bit machine, we can store 3
offsets in the header, and these values are embedded in the leaf page.
With more than 3 offsets, the value size exceeds 16 bytes and it
becomes a single-value leaf. Therefore, if we add a test with
array[1,2,3,4,100], we can cover the case of replacing a single-value
leaf with a new single-value leaf of a different size. We now add 9
pairs of do_set_block_offsets() and check_set_block_offsets(); if
these are annoying, we can remove the cases of array[1] and
array[1,2].

I've attached a new patch. In addition to the new test case I
mentioned, I've added some new comments and removed an unnecessary
line added in test_tidstore.sql.
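
Concretely, the extra case would look something like this (a sketch
only; it assumes the do_set_block_offsets(blkno, offsets int2[])
signature already used in test_tidstore.sql, and the block number is
arbitrary):

    -- more than 3 offsets no longer fit in the embeddable header on
    -- 64-bit builds, so the value becomes a single-value leaf
    SELECT do_set_block_offsets(10000, array[1, 2, 3, 4, 100]::int2[]);
    SELECT check_set_block_offsets();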

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


0001-radixtree-Fix-SIGSEGV-at-update-of-embeddable-value-.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-24 Thread Masahiko Sawada
On Mon, Apr 15, 2024 at 6:12 PM John Naylor  wrote:
>
> I took a look at the coverage report from [1] and it seems pretty
> good, but there are a couple more tests we could do.

Thank you for checking!

>
> - RT_KEY_GET_SHIFT is not covered for key=0:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803
>
> That should be fairly simple to add to the tests.

There are two paths to call RT_KEY_GET_SHIFT():

1. RT_SET() -> RT_KEY_GET_SHIFT()
2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT()

In both cases, it's called when key > tree->ctl->max_val. Since the
minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called
when key=0.

>
> - Some paths for single-value leaves are not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606
>
> However, these paths do get regression test coverage on 32-bit
> machines. 64-bit builds only have leaves in the TID store, which
> doesn't (currently) delete entries, and doesn't instantiate the tree
> with the debug option.

Right.

>
> - In RT_SET "if (found)" is not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
>
> That's because we don't yet have code that replaces an existing value
> with a value of a different length.

Noah reported an issue around that. We should incorporate the patch
and cover this code path.

>
> - RT_FREE_RECURSE isn't well covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
>
> The TID store test is pretty simple as far as distribution of block
> keys, and focuses more on the offset bitmaps. We could try to cover
> all branches here, but it would make the test less readable, and it's
> kind of the wrong place to do that anyway. test_radixtree.c does have
> a commented-out option to use shared memory, but that's for local
> testing and won't be reflected in the coverage report. Maybe it's
> enough.

Agreed.

>
> - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644
>
> That should be easy to add.

Agreed. The patch is attached.

>
> - RT_DUMP_NODE is not covered, and never called by default anyway:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2804
>
> It seems we could just leave it alone since it's debug-only, but it's
> also a lot of lines. One idea is to use elog with DEBUG5 instead of
> commenting out the call sites, but that would cause a lot of noise.

I think we can leave it alone.

>
> - TidStoreCreate* has some memory clamps that are not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234
>
> Maybe we could experiment with using 1MB for shared, and something
> smaller for local.

I've confirmed that local and shared tidstores with small max sizes
such as 4kB and 1MB worked. Currently the max size is hard-coded in
test_tidstore.c, but if we use work_mem as the max size, we can pass
different max sizes for local and shared tidstores in the test script.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


improve_code_coverage_radixtree.patch
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-04-24 Thread Masahiko Sawada
On Mon, Apr 22, 2024 at 5:07 PM Anthonin Bonnefoy
 wrote:
>
> On Sat, Apr 20, 2024 at 2:00 PM Alena Rybakina  
> wrote:
>>
>> Hi, thank you for your work with this subject.
>>
>> While I was reviewing your code, I noticed that your patch conflicts with 
>> another patch [0] that been committed.
>>
>> [0] 
>> https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
>
>
> I've rebased the patch and also split the changes:

Thank you for updating the patch!

> 1: Use pgBufferUsage in Vacuum and Analyze block reporting

I think that if the analyze command doesn't have the same issue, we
don't need to change it. Making vacuum and analyze consistent is a
good point, but I'd like to avoid doing unnecessary changes in back
branches. I think the patch set would contain:

(a) make lazy vacuum use BufferUsage instead of
VacuumPage{Hit,Miss,Dirty} (backpatched down to pg13).
(b) make analyze use BufferUsage and remove the
VacuumPage{Hit,Miss,Dirty} variables for consistency and simplicity
(only for HEAD, if we agree).

BTW, I realized that VACUUM VERBOSE running on a temp table always
shows the number of dirtied buffers as 0, which seems to be a bug.
Patch (a) will resolve it as well.
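
For reference, the temp-table symptom can be reproduced with something
like this (a minimal sketch; the row counts are illustrative):

    CREATE TEMP TABLE tmp_t (i int);
    INSERT INTO tmp_t SELECT generate_series(1, 100000);
    DELETE FROM tmp_t WHERE i % 2 = 0;
    VACUUM (VERBOSE) tmp_t;
    -- the report ends with a line like
    --   buffer usage: ... hits, ... misses, 0 dirtied
    -- and "dirtied" is always 0 for a temporary table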

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-23 Thread Masahiko Sawada
On Tue, Apr 23, 2024 at 12:37 PM Amit Kapila  wrote:
>
> On Mon, Apr 22, 2024 at 7:04 PM Masahiko Sawada  wrote:
> >
> > On Mon, Apr 22, 2024 at 9:02 PM shveta malik  wrote:
> > >
> > > Thanks for the patch, the changes look good Amit. Please find the merged 
> > > patch.
> > >
> >
> > I've reviewed the patch and have some comments:
> >
> > ---
> > /*
> > -* Early initialization.
> > +* Register slotsync_worker_onexit() before we register
> > +* ReplicationSlotShmemExit() in BaseInit(), to ensure that during the
> > +* exit of the slot sync worker, ReplicationSlotShmemExit() is called
> > +* first, followed by slotsync_worker_onexit(). The startup process 
> > during
> > +* promotion invokes ShutDownSlotSync() which waits for slot sync to
> > +* finish and it does that by checking the 'syncing' flag. Thus worker
> > +* must be done with the slots' release and cleanup before it marks 
> > itself
> > +* as finished syncing.
> >  */
> >
> > I'm slightly worried that we register the slotsync_worker_onexit()
> > callback before BaseInit(), because it could be a blocker when we want
> > to add more work in the callback, for example sending the stats.
> >
>
> The other possibility is that we do slot release/clean up in the
> slotsync_worker_onexit() call itself and then we can do it after
> BaseInit().

This approach sounds clearer and safer to me. The current approach
relies on the callback registration order relative to
ReplicationSlotShmemExit(); if that changes in the future, we will
silently have the same problem again. All slot-sync-related work
should be done before allowing someone to touch the synced slots by
clearing the 'syncing' flag.
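
For example, I imagine something like the following (just a sketch of
the alternative, not the actual patch; the function names follow this
thread and the details are assumptions):

    static void
    slotsync_worker_onexit(int code, Datum arg)
    {
        /* release the slot this worker may still be holding */
        if (MyReplicationSlot != NULL)
            ReplicationSlotRelease();

        /* drop the temporary slots owned by this process */
        ReplicationSlotCleanup();

        /* only now let ShutDownSlotSync() see the worker as finished */
        reset_syncing_flag();
    }

With that, the callback could be registered after BaseInit() without
depending on the registration order of ReplicationSlotShmemExit().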

>
> > ---
> > synchronize_slots(wrconn);
> > +
> > +   /* Cleanup the temporary slots */
> > +   ReplicationSlotCleanup();
> > +
> > +   /* We are done with sync, so reset sync flag */
> > +   reset_syncing_flag();
> >
> > I think it ends up removing other temp slots that are created by the
> > same backend process using
> > pg_create_{physical,logical}_replication_slot() function, which could
> > be a large side effect of this function for users.
> >
>
> True, I think here we should either remove only temporary and synced
> marked slots. The other possibility is to create slots as RS_EPHEMERAL
> initially when called from the SQL function but that doesn't sound
> like a neat approach.
>
> >
>  Also, if users want
> > to have a process periodically calling pg_sync_replication_slots()
> > instead of the slotsync worker, it doesn't support a case where we
> > create a temp not-ready slot and turn it into a persistent slot if
> > it's ready for sync.
> >
>
> True, but eventually the API should be able to directly create the
> persistent slots and anyway this can happen only for the first time
> (till the slots are created and marked persistent) and one who wants
> to use this function periodically should be able to see regular syncs.

I agree that we remove the temporary-and-synced slots created via the
API at the end of the API call. We end up creating and dropping slots
in every API call, but since pg_sync_replication_slots() is a kind of
debug-purpose function and it will not be common to call it regularly
instead of using the slot sync worker, we can live with such overhead.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-22 Thread Masahiko Sawada
Hi,

On Thu, Apr 4, 2024 at 9:23 PM Bharath Rupireddy
 wrote:
>
> On Thu, Apr 4, 2024 at 4:35 PM Amit Kapila  wrote:
> >
> > > Thanks for the changes. v34-0001 LGTM.
> >
> > I was doing a final review before pushing 0001 and found that
> > 'inactive_since' could be set twice during startup after promotion,
> > once while restoring slots and then via ShutDownSlotSync(). The reason
> > is that ShutDownSlotSync() will be invoked in normal startup on
> > primary though it won't do anything apart from setting inactive_since
> > if we have synced slots. I think you need to check 'StandbyMode' in
> > update_synced_slots_inactive_since() and return if the same is not
> > set. We can't use 'InRecovery' flag as that will be set even during
> > crash recovery.
> >
> > Can you please test this once unless you don't agree with the above theory?
>
> Nice catch. I've verified that update_synced_slots_inactive_since is
> called even for normal server startups/crash recovery. I've added a
> check to exit if the StandbyMode isn't set.
>
> Please find the attached v35 patch.
>

The documentation about both the 'active' and 'inactive_since'
columns of pg_replication_slots says:

---
active bool
True if this slot is currently actively being used

inactive_since timestamptz
The time since the slot has become inactive. NULL if the slot is
currently being used. Note that for slots on the standby that are
being synced from a primary server (whose synced field is true), the
inactive_since indicates the last synchronization (see Section 47.2.3)
time.
---

When reading the description, I thought that if 'active' is true,
'inactive_since' is NULL, but that doesn't seem to apply to temporary
slots. Since we don't reset the active_pid field of temporary slots
when they are released, 'active' is still true in the view but
'inactive_since' is not NULL. Do you think we need to mention this in
the documentation?
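
For example (a sketch; it assumes a server where the test_decoding
plugin is available):

    SELECT pg_create_logical_replication_slot('tmp_slot', 'test_decoding',
                                              temporary => true);
    SELECT slot_name, active, active_pid, inactive_since
      FROM pg_replication_slots WHERE slot_name = 'tmp_slot';

Here 'active' is true (active_pid still points at the creating
backend) while 'inactive_since' is already non-NULL, because the slot
was released at the end of the creating function.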

As for the timeout-based slot invalidation feature, we could end up
invalidating the temporary slots even if they are shown as active,
which could confuse users. Do we want to somehow deal with it?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: promotion related handling in pg_sync_replication_slots()

2024-04-22 Thread Masahiko Sawada
On Mon, Apr 22, 2024 at 9:02 PM shveta malik  wrote:
>
> On Mon, Apr 22, 2024 at 5:10 PM Amit Kapila  wrote:
> >
> > On Fri, Apr 19, 2024 at 1:52 PM shveta malik  wrote:
> > >
> > > Please find v9 with the above comments addressed.
> > >
> >
> > I have made minor modifications in the comments and a function name.
> > Please see the attached top-up patch. Apart from this, the patch looks
> > good to me.
>
> Thanks for the patch, the changes look good Amit. Please find the merged 
> patch.
>

I've reviewed the patch and have some comments:

---
/*
-* Early initialization.
+* Register slotsync_worker_onexit() before we register
+* ReplicationSlotShmemExit() in BaseInit(), to ensure that during the
+* exit of the slot sync worker, ReplicationSlotShmemExit() is called
+* first, followed by slotsync_worker_onexit(). The startup process during
+* promotion invokes ShutDownSlotSync() which waits for slot sync to
+* finish and it does that by checking the 'syncing' flag. Thus worker
+* must be done with the slots' release and cleanup before it marks itself
+* as finished syncing.
 */

I'm slightly worried that we register the slotsync_worker_onexit()
callback before BaseInit(), because it could be a blocker when we want
to add more work in the callback, for example sending the stats.

---
synchronize_slots(wrconn);
+
+   /* Cleanup the temporary slots */
+   ReplicationSlotCleanup();
+
+   /* We are done with sync, so reset sync flag */
+   reset_syncing_flag();

I think it ends up removing other temp slots that are created by the
same backend process using the
pg_create_{physical,logical}_replication_slot() functions, which could
be a large side effect of this function for users. Also, if users want
to have a process periodically calling pg_sync_replication_slots()
instead of the slotsync worker, it doesn't support the case where we
create a temporary not-ready slot and turn it into a persistent slot
once it's ready for sync.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-04-17 Thread Masahiko Sawada
On Wed, Apr 17, 2024 at 4:28 PM torikoshia  wrote:
>
> On 2024-04-16 13:16, Masahiko Sawada wrote:
> > On Tue, Apr 2, 2024 at 7:34 PM torikoshia 
> > wrote:
> >>
> >> On 2024-04-01 11:31, Masahiko Sawada wrote:
> >> > On Fri, Mar 29, 2024 at 11:54 AM torikoshia
> >> >  wrote:
> >> >>
> >> >> On 2024-03-28 21:54, Masahiko Sawada wrote:
> >> >> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia 
> >> >> > 
> >> >> > wrote:
> >> >> >>
> >> >> >> On 2024-03-28 10:20, Masahiko Sawada wrote:
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
> >> >> >> > 
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >> >> >> >>  wrote:
> >> >> >> >> >
> >> >> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia 
> >> >> >> >> >  wrote:
> >> >> >> >> > > On 2024-01-18 10:10, jian he wrote:
> >> >> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> >> >> >> >> > > > 
> >> >> >> >> > > > wrote:
> >> >> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane 
> >> >> >> >> > > >>  wrote:
> >> >> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might 
> >> >> >> >> > > >> > shorten it to
> >> >> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" 
> >> >> >> >> > > >> > instead of "error")?
> >> >> >> >> > > >> > You will need a separate parameter anyway to specify the 
> >> >> >> >> > > >> > destination
> >> >> >> >> > > >> > of "log", unless "none" became an illegal table name 
> >> >> >> >> > > >> > when I wasn't
> >> >> >> >> > > >> > looking.  I don't buy that one parameter that has some 
> >> >> >> >> > > >> > special values
> >> >> >> >> > > >> > while other values could be names will be a good design. 
> >> >> >> >> > > >> >  Moreover,
> >> >> >> >> > > >> > what if we want to support (say) log-to-file along with 
> >> >> >> >> > > >> > log-to-table?
> >> >> >> >> > > >> > Trying to distinguish a file name from a table name 
> >> >> >> >> > > >> > without any other
> >> >> >> >> > > >> > context seems impossible.
> >> >> >> >> > > >>
> >> >> >> >> > > >> I've been thinking we can add more values to this option 
> >> >> >> >> > > >> to log errors
> >> >> >> >> > > >> not only to the server logs but also to the error table 
> >> >> >> >> > > >> (not sure
> >> >> >> >> > > >> details but I imagined an error table is created for each 
> >> >> >> >> > > >> table on
> >> >> >> >> > > >> error), without an additional option for the destination 
> >> >> >> >> > > >> name. The
> >> >> >> >> > > >> values would be like error_action 
> >> >> >> >> > > >> {error|ignore|save-logs|save-table}.
> >> >> >> >> > > >>
> >> >> >> >> > > >
> >> >> >> >> > > > another idea:
> >> >> >> >> > > > on_error {error|ignore|other_future_option}
> >> >> >> >> > > > if not specified then by default ERROR.
> >> >> >> >> > > > You can also specify ERROR or IGNORE for now.
&

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-04-15 Thread Masahiko Sawada
On Tue, Apr 2, 2024 at 7:34 PM torikoshia  wrote:
>
> On 2024-04-01 11:31, Masahiko Sawada wrote:
> > On Fri, Mar 29, 2024 at 11:54 AM torikoshia
> >  wrote:
> >>
> >> On 2024-03-28 21:54, Masahiko Sawada wrote:
> >> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia 
> >> > wrote:
> >> >>
> >> >> On 2024-03-28 10:20, Masahiko Sawada wrote:
> >> >> > Hi,
> >> >> >
> >> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
> >> >> > 
> >> >> > wrote:
> >> >> >>
> >> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >> >> >>  wrote:
> >> >> >> >
> >> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia 
> >> >> >> >  wrote:
> >> >> >> > > On 2024-01-18 10:10, jian he wrote:
> >> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> >> >> >> > > > 
> >> >> >> > > > wrote:
> >> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  
> >> >> >> > > >> wrote:
> >> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten 
> >> >> >> > > >> > it to
> >> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead 
> >> >> >> > > >> > of "error")?
> >> >> >> > > >> > You will need a separate parameter anyway to specify the 
> >> >> >> > > >> > destination
> >> >> >> > > >> > of "log", unless "none" became an illegal table name when I 
> >> >> >> > > >> > wasn't
> >> >> >> > > >> > looking.  I don't buy that one parameter that has some 
> >> >> >> > > >> > special values
> >> >> >> > > >> > while other values could be names will be a good design.  
> >> >> >> > > >> > Moreover,
> >> >> >> > > >> > what if we want to support (say) log-to-file along with 
> >> >> >> > > >> > log-to-table?
> >> >> >> > > >> > Trying to distinguish a file name from a table name without 
> >> >> >> > > >> > any other
> >> >> >> > > >> > context seems impossible.
> >> >> >> > > >>
> >> >> >> > > >> I've been thinking we can add more values to this option to 
> >> >> >> > > >> log errors
> >> >> >> > > >> not only to the server logs but also to the error table (not 
> >> >> >> > > >> sure
> >> >> >> > > >> details but I imagined an error table is created for each 
> >> >> >> > > >> table on
> >> >> >> > > >> error), without an additional option for the destination 
> >> >> >> > > >> name. The
> >> >> >> > > >> values would be like error_action 
> >> >> >> > > >> {error|ignore|save-logs|save-table}.
> >> >> >> > > >>
> >> >> >> > > >
> >> >> >> > > > another idea:
> >> >> >> > > > on_error {error|ignore|other_future_option}
> >> >> >> > > > if not specified then by default ERROR.
> >> >> >> > > > You can also specify ERROR or IGNORE for now.
> >> >> >> > > >
> >> >> >> > > > I agree, the parameter "error_action" is better than 
> >> >> >> > > > "location".
> >> >> >> > >
> >> >> >> > > I'm not sure whether error_action or on_error is better, but 
> >> >> >> > > either way
> >> >> >> > > "error_action error" and "on_error error" seems a bit odd to me.
> >> >> >> > > I feel "stop" is better for both cases as Tom suggested.
> >> >> >> >
>

Re: Improve eviction algorithm in ReorderBuffer

2024-04-11 Thread Masahiko Sawada
On Thu, Apr 11, 2024 at 10:46 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Heikki,
>
> I also prototyped the idea, which has almost the same shape.
> I attached just in case, but we may not have to see.
>
> Few comments based on the experiment.

Thank you for reviewing the patch. I didn't include the following
suggestions since firstly I wanted to just fix binaryheap part while
keeping other parts. If we need these changes, we can do them in
separate commits as fixes.

>
> ```
> +   /* txn_heap is ordered by transaction size */
> +   buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, 
> NULL);
> ```
>
> I think the pairing heap should be in the same MemoryContext with the buffer.
> Can we add MemoryContextSwithTo()?

pairingheap_allocate() allocates only a tiny amount of memory for the
pairing heap, and its memory usage doesn't grow even when adding more
data. And since it's allocated in the logical decoding context, its
lifetime is also fine. So I'm not sure it's worth allocating it in
rb->context for better memory accountability.
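
(For the record, the suggested change would look roughly like the
following; this is just a sketch of the alternative, assuming
buffer->context is the ReorderBuffer's own memory context, not
something I'm proposing:)

    MemoryContext oldcontext;

    oldcontext = MemoryContextSwitchTo(buffer->context);
    buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare,
                                            NULL);
    MemoryContextSwitchTo(oldcontext);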

>
> ```
> +   /* Update the max-heap */
> +   if (oldsize != 0)
> +   pairingheap_remove(rb->txn_heap, &txn->txn_node);
> +   pairingheap_add(rb->txn_heap, &txn->txn_node);
> ...
> +   /* Update the max-heap */
> +   pairingheap_remove(rb->txn_heap, &txn->txn_node);
> +   if (txn->size != 0)
> +   pairingheap_add(rb->txn_heap, &txn->txn_node);
> ```
>
> Since the number of stored transactions does not affect to the insert 
> operation, we may able
> to add the node while creating ReorederBufferTXN and remove while cleaning up 
> it. This can
> reduce branches in ReorderBufferChangeMemoryUpdate().

I think it also means that we need to remove the entry while cleaning
up even if it doesn't have any changes, which is done in O(log n). I
feel the current approach of not storing transactions with size 0 in
the heap is better, and I'm not sure that reducing these branches
really contributes to the performance improvements.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-11 Thread Masahiko Sawada
On Thu, Apr 11, 2024 at 11:52 AM Masahiko Sawada  wrote:
>
> We can see 2% ~ 3% performance regressions compared to the current
> HEAD, but it's much smaller than I expected. Given that we can make
> the code simple, I think we can go with this direction.

Pushed the patch and reverted binaryheap changes.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Masahiko Sawada
On Thu, Apr 11, 2024 at 10:32 AM Masahiko Sawada  wrote:
>
> Hi,
>
> Sorry for the late reply, I took two days off.
>
> On Thu, Apr 11, 2024 at 6:20 AM Heikki Linnakangas  wrote:
> >
> > On 10/04/2024 08:31, Amit Kapila wrote:
> > > On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas  
> > > wrote:
> > >>
> > >> On 10/04/2024 07:45, Michael Paquier wrote:
> > >>> On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
> > >>>> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
> > >>>>> Wouldn't the best way forward be to revert
> > >>>>> 5bec1d6bc5e3 and revisit the whole in v18?
> > >>>>
> > >>>> Also consider commits b840508644 and bcb14f4abc.
> > >>>
> > >>> Indeed.  These are also linked.
> > >>
> > >> I don't feel the urge to revert this:
> > >>
> > >> - It's not broken as such, we're just discussing better ways to
> > >> implement it. We could also do nothing, and revisit this in v18. The
> > >> only must-fix issue is some compiler warnings IIUC.
> > >>
> > >> - It's a pretty localized change in reorderbuffer.c, so it's not in the
> > >> way of other patches or reverts. Nothing else depends on the binaryheap
> > >> changes yet either.
> > >>
> > >> - It seems straightforward to repeat the performance tests with whatever
> > >> alternative implementations we want to consider.
> > >>
> > >> My #1 choice would be to write a patch to switch the pairing heap,
> > >> performance test that, and revert the binary heap changes.
> > >>
> > >
> > > +1.
> >
> > To move this forward, here's a patch to switch to a pairing heap. In my
> > very quick testing, with the performance test cases posted earlier in
> > this thread [1] [2], I'm seeing no meaningful performance difference
> > between this and what's in master currently.
> >
> > Sawada-san, what do you think of this? To be sure, if you could also
> > repeat the performance tests you performed earlier, that'd be great. If
> > you agree with this direction, and you're happy with this patch, feel
> > free take it from here and commit this, and also revert commits
> > b840508644 and bcb14f4abc.
> >
>
> Thank you for the patch!
>
> I agree with the direction that we replace binaryheap + index with the
> existing pairing heap and revert the changes for binaryheap. Regarding
> the patch, I'm not sure we can remove the MAX_HEAP_TXN_COUNT_THRESHOLD
> logic because otherwise we need to remove and add the txn node (i.e.
> O(log n)) for every memory update. I'm concerned it could cause some
> performance degradation in a case where there are not many
> transactions being decoded.
>
> I'll do performance tests, and share the results soon.
>

Here are some performance test results.

* test case 1 (many subtransactions)

test script:

create or replace function testfn (cnt int) returns void as $$
begin
  for i in 1..cnt loop
    begin
      insert into test values (i);
    exception when division_by_zero then
      raise notice 'caught error';
      return;
    end;
  end loop;
end;
$$
language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(100);
set logical_decoding_work_mem to '4MB';
select from pg_logical_slot_peek_changes('s', null, null);

HEAD:

43128.266 ms
40116.313 ms
38790.666 ms

Patched:

43626.316 ms
44456.234 ms
39899.753 ms


* test case 2 (single big insertion)

test script:

create table test (c int);
select pg_create_logical_replication_slot('s', 'test_decoding');
insert into test select generate_series(1, 1000);
set logical_decoding_work_mem to '10GB'; -- avoid data spill
select from pg_logical_slot_peek_changes('s', null, null);

HEAD:

7996.476 ms
8034.022 ms
8005.583 ms

Patched:

8153.500 ms
8121.588 ms
8121.538 ms


* test case 3 (many small transactions)

test script:

pgbench -i -s 300
psql -c "select pg_create_logical_replication_slot('s', 'test_decoding')";
pgbench -t 10 -c 32
psql -c "set logical_decoding_work_mem to '10GB'; select count(*) from
pg_logical_slot_peek_changes('s', null, null)"

HEAD:

22586.343 ms
22507.905 ms
22504.133 ms

Patched:

23365.142 ms
23110.651 ms
23102.170 ms

We can see 2% ~ 3% performance regressions compared to the current
HEAD, but it's much smaller than I expected. Given that we can make
the code simple, I think we can go with this direction.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Masahiko Sawada
Hi,

Sorry for the late reply, I took two days off.

On Thu, Apr 11, 2024 at 6:20 AM Heikki Linnakangas  wrote:
>
> On 10/04/2024 08:31, Amit Kapila wrote:
> > On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas  wrote:
> >>
> >> On 10/04/2024 07:45, Michael Paquier wrote:
> >>> On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
> >>>> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
> >>>>> Wouldn't the best way forward be to revert
> >>>>> 5bec1d6bc5e3 and revisit the whole in v18?
> >>>>
> >>>> Also consider commits b840508644 and bcb14f4abc.
> >>>
> >>> Indeed.  These are also linked.
> >>
> >> I don't feel the urge to revert this:
> >>
> >> - It's not broken as such, we're just discussing better ways to
> >> implement it. We could also do nothing, and revisit this in v18. The
> >> only must-fix issue is some compiler warnings IIUC.
> >>
> >> - It's a pretty localized change in reorderbuffer.c, so it's not in the
> >> way of other patches or reverts. Nothing else depends on the binaryheap
> >> changes yet either.
> >>
> >> - It seems straightforward to repeat the performance tests with whatever
> >> alternative implementations we want to consider.
> >>
> >> My #1 choice would be to write a patch to switch the pairing heap,
> >> performance test that, and revert the binary heap changes.
> >>
> >
> > +1.
>
> To move this forward, here's a patch to switch to a pairing heap. In my
> very quick testing, with the performance test cases posted earlier in
> this thread [1] [2], I'm seeing no meaningful performance difference
> between this and what's in master currently.
>
> Sawada-san, what do you think of this? To be sure, if you could also
> repeat the performance tests you performed earlier, that'd be great. If
> you agree with this direction, and you're happy with this patch, feel
> free take it from here and commit this, and also revert commits
> b840508644 and bcb14f4abc.
>

Thank you for the patch!

I agree with the direction that we replace binaryheap + index with the
existing pairing heap and revert the changes for binaryheap. Regarding
the patch, I'm not sure we can remove the MAX_HEAP_TXN_COUNT_THRESHOLD
logic because otherwise we need to remove and add the txn node (i.e.
O(log n)) for every memory update. I'm concerned it could cause some
performance degradation in a case where there are not many
transactions being decoded.

I'll do performance tests, and share the results soon.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-04-08 Thread Masahiko Sawada
On Tue, Apr 9, 2024 at 12:30 AM Andres Freund  wrote:
>
> Hi,
>
> On 2024-04-08 09:26:09 -0400, Robert Haas wrote:
> > On Sun, Apr 7, 2024 at 6:50 PM Michael Paquier  wrote:
> > And maybe we need to think of a way to further mitigate this crush of
> > last minute commits. e.g. In the last week, you can't have more
> > feature commits, or more lines of insertions in your commits, than you
> > did in the prior 3 weeks combined. I don't know. I think this mad rush
> > of last-minute commits is bad for the project.
>
> Some will just polish commits until the last minute, until the
> the dot's on the i's really shine, others will continue picking up more CF
> entries until the freeze is reached, others will push half baked stuff.

I agree with this part.

Aside from considering how to institute some rules to mitigate the
last-minute rush, it might also be a good idea to consider how to
improve testing of new commits during beta. FWIW, every year after
feature freeze I personally pick some new features that I didn't get
involved with during development and do intensive reviews in April. It
might be good if more people did things like that; it could help find
half-baked features earlier and improve the quality in general. For
example, we could list features that might require more review (e.g.
because of their volume, complexity, span of influence, etc.) and do
intensive reviews of those items. Each item should be reviewed by
someone other than the author and the committer. We may want to set
aside a specific period for such intensive testing.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-08 Thread Masahiko Sawada
h. For v17, changes for #2 are smaller, but I'm concerned
that the new API that requires a hash function to be able to use
binaryheap_update_{up|down} might not be user friendly. In terms of
APIs, I prefer the #1 idea. Changes for #1 can also make the binaryheap
code simpler, although they require adding a variable to
ReorderBufferTXN instead. But overall, it removes the hash table and
some functions, so it looks better to me.

When it comes to performance overhead, I mentioned that there is some
regression in the current binaryheap even without indexing. Since
function calls contributed to the regression, inlining some of the
functions reduced the overhead. For example, after inlining set_node()
and replace_node(), the same benchmark test I used showed:

postgres(1:88476)=# select * from generate_series(1,3) x(x), lateral
(select * from bench_load(false, 1000 * (1+x-x)));
 x |  cnt | load_ms | xx_load_ms | old_load_ms
---+------+---------+------------+-------------
 1 | 1000 |     502 |        624 |         427
 2 | 1000 |     503 |        622 |         428
 3 | 1000 |     502 |        621 |         427
(3 rows)

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


01_use_update_index_func_in_binaryheap.patch
Description: Binary data


02_use_hashfunc_in_binaryheap.patch
Description: Binary data


Re: Improve tab completion for ALTER DEFAULT PRIVILEGE and ALTER TABLE

2024-04-07 Thread Masahiko Sawada
On Fri, Apr 5, 2024 at 1:18 AM vignesh C  wrote:
>
> On Tue, 2 Apr 2024 at 13:08, Masahiko Sawada  wrote:
> >
> > On Mon, Apr 1, 2024 at 10:41 PM vignesh C  wrote:
> > >
> > > On Thu, 28 Mar 2024 at 13:05, Masahiko Sawada  
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > Thank you for the patch!
> > > >
> > > > On Mon, Jul 3, 2023 at 12:12 AM vignesh C  wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Improved tab completion for "ALTER DEFAULT PRIVILEGE" and "ALTER 
> > > > > TABLE":
> > > > > 1) GRANT, REVOKE and FOR USER keyword was not displayed in tab
> > > > > completion of alter default privileges like the below statement:
> > > > > ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC;
> > > > > ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC;
> > > > > ALTER DEFAULT PRIVILEGES FOR USER vignesh revoke INSERT ON tables 
> > > > > FROM dba1;
> > > >
> > > > +1
> > > >
> > > > >
> > > > > 2) USER was not displayed for "ALTER DEFAULT PRIVILEGES IN SCHEMA
> > > > > public FOR " like in below statement:
> > > > > ALTER DEFAULT PRIVILEGES IN SCHEMA public FOR USER dba1 GRANT INSERT
> > > > > ON TABLES TO PUBLIC;
> > > >
> > > > Since there is no difference FOR USER and FOR ROLE, I'm not sure we
> > > > really want to support both in tab-completion.
> > >
> > > I have removed this change
> > >
> > > > >
> > > > > 3) "FOR GRANT OPTION" was not display for "ALTER DEFAULT PRIVILEGES
> > > > > REVOKE " like in below statement:
> > > > > alter default privileges revoke grant option for select ON tables 
> > > > > FROM dba1;
> > > >
> > > > +1. But the v3 patch doesn't cover the following case:
> > > >
> > > > =# alter default privileges for role masahiko revoke [tab]
> > > > ALL CREATE  DELETE  EXECUTE INSERT  MAINTAIN
> > > >  REFERENCES  SELECT  TRIGGER TRUNCATEUPDATE  USAGE
> > >
> > > Modified in the updated patch
> > >
> > > > And it doesn't cover MAINTAIN neither:
> > > >
> > > > =# alter default privileges revoke [tab]
> > > > ALL   DELETEGRANT OPTION FOR  REFERENCES
> > > >  TRIGGER   UPDATE
> > > > CREATEEXECUTE   INSERTSELECT
> > > >  TRUNCATE  USAGE
> > >
> > > Modified in the updated patch
> > >
> > > > The patch adds the completions for ALTER DEFAULT PRIVILEGES REVOKE,
> > > > but we handle such case in GRANT and REVOKE part:
> > > >
> > > > (around L3958)
> > > > /*
> > > >  * With ALTER DEFAULT PRIVILEGES, restrict completion to 
> > > > grantable
> > > >  * privileges (can't grant roles)
> > > >  */
> > > > if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES"))
> > > > COMPLETE_WITH("SELECT", "INSERT", "UPDATE",
> > > >   "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER",
> > > >   "CREATE", "EXECUTE", "USAGE", "MAINTAIN", 
> > > > "ALL");
> > >
> > > The current patch handles the fix from here now.
> > >
> > > > Also, I think we can support WITH GRANT OPTION too. For example,
> > > >
> > > > =# alter default privileges for role masahiko grant all on tables to
> > > > public [tab]
> > >
> > > I have handled this in the updated patch
> > >
> > > > It's already supported in the GRANT statement.
> > > >
> > > > >
> > > > > 4) "DATA TYPE" was missing in "ALTER TABLE table-name ALTER COLUMN
> > > > > column-name SET" like in:
> > > > > ALTER TABLE t1 ALTER COLUMN c1 SET DATA TYPE text;
> > > > >
> > > >
> > > > +1. The patch looks good to me, so pushed.
> > >
> > > Thanks for committing this.
> > >
> > > The updated patch has the changes for the above comments.
> > >
> >
> > Thank you for updating the patch.
> >
> > I think it doesn't work well as "GRANT OPTION FOR" is complemented
> > twice. For example,
> >
> > =# alter default privileges for user masahiko revoke [tab]
> > ALL   DELETEGRANT OPTION FOR  MAINTAIN
> >  SELECTTRUNCATE  USAGE
> > CREATEEXECUTE   INSERTREFERENCES
> >  TRIGGER   UPDATE
> > =# alter default privileges for user masahiko revoke grant option for [tab]
> > ALL   DELETEGRANT OPTION FOR  MAINTAIN
> >  SELECTTRUNCATE  USAGE
> > CREATEEXECUTE   INSERTREFERENCES
> >  TRIGGER   UPDATE
>
> Thanks for finding this issue, the attached v5 version patch has the
> fix for the same.

Thank you for updating the patch! I've pushed with minor adjustments.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-05 Thread Masahiko Sawada
On Fri, Apr 5, 2024 at 2:55 AM Jeff Davis  wrote:
>
> On Thu, 2024-04-04 at 17:28 +0900, Masahiko Sawada wrote:
> > > Perhaps it's not worth the effort though, if performance is already
> > > good enough?
> >
> > Yeah, it would be better to measure the overhead first. I'll do that.
>
> I have some further comments and I believe changes are required for
> v17.
>
> An indexed binary heap API requires both a comparator and a hash
> function to be specified, and has two different kinds of keys: the heap
> key (mutable) and the hash key (immutable). It provides heap methods
> and hashtable methods, and keep the two internal structures (heap and
> HT) in sync.

IIUC for example in ReorderBuffer, the heap key is transaction size
and the hash key is xid.

>
> The implementation in b840508644 uses the bh_node_type as the hash key,
> which is just a Datum, and it just hashes the bytes. I believe the
> implicit assumption is that the Datum is a pointer -- I'm not sure how
> one would use that API if the Datum were a value. Hashing a pointer
> seems strange to me and, while I see why you did it that way, I think
> it reflects that the API boundaries are not quite right.

I see your point. It assumes that the bh_node_type is a pointer or at
least unique. So it cannot work with Datum being a value.

>
> One consequence of using the pointer as the hash key is that you need
> to find the pointer first: you can't change or remove elements based on
> the transaction ID, you have to get the ReorderBufferTXN pointer by
> finding it in another structure, first. Currently, that's being done by
> searching ReorderBuffer->by_txn. So we actually have two hash tables
> for essentially the same purpose: one with xid as the key, and the
> other with the pointer as the key. That makes no sense -- let's have a
> proper indexed binary heap to look things up by xid (the internal HT)
> or by transaction size (using the internal heap).
>
> I suggest:
>
>   * Make a proper indexed binary heap API that accepts a hash function
> and provides both heap methods and HT methods that operate based on
> values (transaction size and transaction ID, respectively).
>   * Get rid of ReorderBuffer->by_txn and use the indexed binary heap
> instead.
>
> This will be a net simplification in reorderbuffer.c, which is good,
> because that file makes use of a *lot* of data strucutres.
>

It sounds like a data structure that combines the hash table and the
binary heap, and we use it as the main storage (e.g. for
ReorderBufferTXN) instead of using the binary heap as a secondary
data structure. IIUC, with your idea the indexed binary heap has a
hash table that stores the elements, each of which has its index
within the heap node array. I guess it's better to create it as a new
data structure rather than extending the existing binaryheap, since
the APIs could be very different. I might be missing something,
though.
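
To make sure I understand the shape of it, something like the
following (the struct and field names are purely illustrative, not an
existing API):

    typedef struct indexed_binaryheap
    {
        /* array-based heap, ordered by the mutable heap key
         * (e.g. transaction size) */
        bh_node_type   *bh_nodes;
        int             bh_size;

        /*
         * Hash table keyed by the immutable key (e.g. xid); each entry
         * stores the element together with its current index in
         * bh_nodes, so it can be looked up and then sifted in O(log n).
         */
        void           *bh_nodeidx;
    } indexed_binaryheap;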

On Fri, Apr 5, 2024 at 3:55 AM Jeff Davis  wrote:
>
> On Thu, 2024-04-04 at 10:55 -0700, Jeff Davis wrote:
> >   * Make a proper indexed binary heap API that accepts a hash
> > function
> > and provides both heap methods and HT methods that operate based on
> > values (transaction size and transaction ID, respectively).
> >   * Get rid of ReorderBuffer->by_txn and use the indexed binary heap
> > instead.
>
> An alternative idea:
>
> * remove the hash table from binaryheap.c
>
> * supply a new callback to the binary heap with type like:
>
>   typedef void (*binaryheap_update_index)(
> bh_node_type node,
> int new_element_index);
>
> * make the remove, update_up, and update_down methods take the element
> index rather than the pointer
>
> reorderbuffer.c would then do something like:
>
>   void
>   txn_update_heap_index(ReorderBufferTXN *txn, int new_element_index)
>   {
>  txn->heap_element_index = new_element_index;
>   }
>
>   ...
>
>   txn_heap = binaryheap_allocate(..., txn_update_heap_index, ...);
>
> and then binaryheap.c would effectively maintain
> txn->heap_element_index, so reorderbuffer.c can pass that to the APIs that
> require the element index.

Thank you for the idea. I was thinking of the same idea when
considering your previous comment. With this idea, we still use the
binaryheap for ReorderBuffer as the secondary data structure. Since we
can implement this idea with relatively small changes to the current
binaryheap, I've implemented it and measured performance.

I've attached a patch that adds an extension for benchmarking
binaryheap implementations. binaryheap_bench.c is the main test
module. To make the comparison between different binaryheap
implementations, the extension includes two different binaryheap
implementations. Therefore, binaryheap_ben

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-04 Thread Masahiko Sawada
On Thu, Apr 4, 2024 at 5:36 PM Amit Kapila  wrote:
>
> On Thu, Apr 4, 2024 at 1:32 PM Masahiko Sawada  wrote:
> >
> > On Thu, Apr 4, 2024 at 1:34 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Thu, Apr 4, 2024 at 9:42 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > @@ -1368,6 +1416,7 @@ ShutDownSlotSync(void)
> > > > if (SlotSyncCtx->pid == InvalidPid)
> > > > {
> > > > SpinLockRelease(&SlotSyncCtx->mutex);
> > > > +   update_synced_slots_inactive_since();
> > > > return;
> > > > }
> > > > SpinLockRelease(&SlotSyncCtx->mutex);
> > > > @@ -1400,6 +1449,8 @@ ShutDownSlotSync(void)
> > > > }
> > > >
> > > > SpinLockRelease(&SlotSyncCtx->mutex);
> > > > +
> > > > +   update_synced_slots_inactive_since();
> > > >  }
> > > >
> > > > Why do we want to update all synced slots' inactive_since values at
> > > > shutdown in spite of updating the value every time when releasing the
> > > > slot? It seems to contradict the fact that inactive_since is updated
> > > > when releasing or restoring the slot.
> > >
> > > It is to get the inactive_since right for the cases where the standby
> > > is promoted without a restart similar to when a standby is promoted
> > > with restart in which case the inactive_since is set to current time
> > > in RestoreSlotFromDisk.
> > >
> > > Imagine the slot is synced last time at time t1 and then a few hours
> > > passed, the standby is promoted without a restart. If we don't set
> > > inactive_since to current time in this case in ShutDownSlotSync, the
> > > inactive timeout invalidation mechanism can kick in immediately.
> > >
> >
> > Thank you for the explanation! I understood the needs.
> >
>
> Do you want to review the v34_0001* further or shall I proceed with
> the commit of the same?

Thanks for asking. The v34-0001 patch looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-04 Thread Masahiko Sawada
On Thu, Apr 4, 2024 at 1:54 PM Jeff Davis  wrote:
>
> On Thu, 2024-04-04 at 09:31 +0900, Masahiko Sawada wrote:
> > IIUC, with your suggestion, sift_{up|down} needs to update the
> > heap_index field as well. Does it mean that the caller needs to pass
> > the address of heap_index down to sift_{up|down}?
>
> I'm not sure quite how binaryheap should be changed. Bringing the heap
> implementation into reorderbuffer.c would obviously work, but that
> would be more code.

Right.

>  Another option might be to make the API of
> binaryheap look a little more like simplehash, where some #defines
> control optional behavior and can tell the implementation where to find
> fields in the structure.

Interesting idea.

>
> Perhaps it's not worth the effort though, if performance is already
> good enough?

Yeah, it would be better to measure the overhead first. I'll do that.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-04-04 Thread Masahiko Sawada
On Wed, Apr 3, 2024 at 7:06 PM Amit Kapila  wrote:
>
> On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila  wrote:
> >
> > On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy
> >  wrote:
> >
> > > I quickly looked at v8, and have a nit, rest all looks good.
> > >
> > > +if (DecodingContextReady(ctx) && found_consistent_snapshot)
> > > +*found_consistent_snapshot = true;
> > >
> > > Can the found_consistent_snapshot be checked first to help avoid the
> > > function call DecodingContextReady() for pg_replication_slot_advance
> > > callers?
> > >
> >
> > Okay, changed. Additionally, I have updated the comments and commit
> > message. I'll push this patch after some more testing.
> >
>
> Pushed!

While testing this change, I realized that the server log could be
flooded with the following logical decoding messages, which are
written every 200 ms:

2024-04-04 16:15:19.270 JST [3838739] LOG:  starting logical decoding
for slot "test_sub"
2024-04-04 16:15:19.270 JST [3838739] DETAIL:  Streaming transactions
committing after 0/50006F48, reading WAL from 0/50006F10.
2024-04-04 16:15:19.270 JST [3838739] LOG:  logical decoding found
consistent point at 0/50006F10
2024-04-04 16:15:19.270 JST [3838739] DETAIL:  There are no running
transactions.
2024-04-04 16:15:19.477 JST [3838739] LOG:  starting logical decoding
for slot "test_sub"
2024-04-04 16:15:19.477 JST [3838739] DETAIL:  Streaming transactions
committing after 0/50006F48, reading WAL from 0/50006F10.
2024-04-04 16:15:19.477 JST [3838739] LOG:  logical decoding found
consistent point at 0/50006F10
2024-04-04 16:15:19.477 JST [3838739] DETAIL:  There are no running
transactions.

For example, I could reproduce it with the following steps:

1. create the primary and start.
2. run "pgbench -i -s 100" on the primary.
3. run pg_basebackup to create the standby.
4. configure slotsync setup on the standby and start.
5. create a publication for all tables on the primary.
6. create the subscriber and start.
7. run "pgbench -i -Idtpf" on the subscriber.
8. create a subscription on the subscriber (initial data copy will start).

The logical decoding logs were written every 200 ms during the initial
data synchronization.

Looking at the new changes for update_local_synced_slot():

if (remote_slot->confirmed_lsn != slot->data.confirmed_flush ||
remote_slot->restart_lsn != slot->data.restart_lsn ||
remote_slot->catalog_xmin != slot->data.catalog_xmin)
{
/*
 * We can't directly copy the remote slot's LSN or xmin unless there
 * exists a consistent snapshot at that point. Otherwise, after
 * promotion, the slots may not reach a consistent point before the
 * confirmed_flush_lsn which can lead to a data loss. To avoid data
 * loss, we let slot machinery advance the slot which ensures that
 * snapbuilder/slot statuses are updated properly.
 */
if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
{
/*
 * Update the slot info directly if there is a serialized snapshot
 * at the restart_lsn, as the slot can quickly reach consistency
 * at restart_lsn by restoring the snapshot.
 */
SpinLockAcquire(&slot->mutex);
slot->data.restart_lsn = remote_slot->restart_lsn;
slot->data.confirmed_flush = remote_slot->confirmed_lsn;
slot->data.catalog_xmin = remote_slot->catalog_xmin;
slot->effective_catalog_xmin = remote_slot->catalog_xmin;
SpinLockRelease(&slot->mutex);

if (found_consistent_snapshot)
*found_consistent_snapshot = true;
}
else
{
LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
found_consistent_snapshot);
}

ReplicationSlotsComputeRequiredXmin(false);
ReplicationSlotsComputeRequiredLSN();

slot_updated = true;

We call LogicalSlotAdvanceAndCheckSnapState() if any of confirmed_lsn,
restart_lsn, and catalog_xmin differs between the remote slot and the
local slot. In my test case, while the initial sync was being
performed, only catalog_xmin was different and there was no serialized
snapshot at restart_lsn, so the slotsync worker called
LogicalSlotAdvanceAndCheckSnapState(). However, no slot properties were
changed even after the call, and yet it set slot_updated = true. So the
next slot synchronization starts after 200 ms.

It seems to me that we can skip calling
LogicalSlotAdvanceAndCheckSnapState() at least when the remote and
local slots have the same restart_lsn and confirmed_lsn.
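
For example, something like this untested sketch (it only illustrates
the pre-check; whether skipping is actually safe in all cases is part
of the question):

if (remote_slot->restart_lsn != slot->data.restart_lsn ||
    remote_slot->confirmed_lsn != slot->data.confirmed_flush)
    LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
                                        found_consistent_snapshot);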

I'm not sure whether there are other scenarios, but is it worth fixing
this symptom?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-04 Thread Masahiko Sawada
On Thu, Apr 4, 2024 at 1:34 PM Bharath Rupireddy
 wrote:
>
> On Thu, Apr 4, 2024 at 9:42 AM Masahiko Sawada  wrote:
> >
> > @@ -1368,6 +1416,7 @@ ShutDownSlotSync(void)
> > if (SlotSyncCtx->pid == InvalidPid)
> > {
> > SpinLockRelease(&SlotSyncCtx->mutex);
> > +   update_synced_slots_inactive_since();
> > return;
> > }
> > SpinLockRelease(&SlotSyncCtx->mutex);
> > @@ -1400,6 +1449,8 @@ ShutDownSlotSync(void)
> > }
> >
> > SpinLockRelease(&SlotSyncCtx->mutex);
> > +
> > +   update_synced_slots_inactive_since();
> >  }
> >
> > Why do we want to update all synced slots' inactive_since values at
> > shutdown in spite of updating the value every time when releasing the
> > slot? It seems to contradict the fact that inactive_since is updated
> > when releasing or restoring the slot.
>
> It is to get the inactive_since right for the cases where the standby
> is promoted without a restart similar to when a standby is promoted
> with restart in which case the inactive_since is set to current time
> in RestoreSlotFromDisk.
>
> Imagine the slot is synced last time at time t1 and then a few hours
> passed, the standby is promoted without a restart. If we don't set
> inactive_since to current time in this case in ShutDownSlotSync, the
> inactive timeout invalidation mechanism can kick in immediately.
>

Thank you for the explanation! I understood the needs.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Masahiko Sawada
On Wed, Apr 3, 2024 at 11:58 PM Bharath Rupireddy
 wrote:
>
>
> Please find the attached v33 patches.

@@ -1368,6 +1416,7 @@ ShutDownSlotSync(void)
if (SlotSyncCtx->pid == InvalidPid)
{
SpinLockRelease(&SlotSyncCtx->mutex);
+   update_synced_slots_inactive_since();
return;
}
SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1400,6 +1449,8 @@ ShutDownSlotSync(void)
}

SpinLockRelease(&SlotSyncCtx->mutex);
+
+   update_synced_slots_inactive_since();
 }

Why do we want to update all synced slots' inactive_since values at
shutdown in spite of updating the value every time when releasing the
slot? It seems to contradict the fact that inactive_since is updated
when releasing or restoring the slot.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-04-03 Thread Masahiko Sawada
Hi,

On Thu, Apr 4, 2024 at 2:32 AM Jeff Davis  wrote:
>
> On Wed, 2024-04-03 at 01:45 -0700, Jeff Davis wrote:
> > I suggest that you add a "heap_index" field to ReorderBufferTXN that
> > would point to the index into the heap's array (the same as
> > bh_nodeidx_entry.index in your patch). Each time an element moves
> > within the heap array, just follow the pointer to the
> > ReorderBufferTXN
> > object and update the heap_index -- no hash lookup required.
>
> It looks like my email was slightly too late, as the work was already
> committed.

Thank you for the suggestions! I should have informed it earlier.

>
> My suggestion is not required for 17, and so it's fine if this waits
> until the next CF. If it turns out to be a win we can consider
> backporting to 17 just to keep the code consistent, otherwise it can go
> in 18.

IIUC, with your suggestion, sift_{up|down} needs to update the
heap_index field as well. Does it mean that the caller needs to pass
the address of heap_index down to sift_{up|down}?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve tab completion for ALTER DEFAULT PRIVILEGE and ALTER TABLE

2024-04-02 Thread Masahiko Sawada
On Mon, Apr 1, 2024 at 10:41 PM vignesh C  wrote:
>
> On Thu, 28 Mar 2024 at 13:05, Masahiko Sawada  wrote:
> >
> > Hi,
> >
> > Thank you for the patch!
> >
> > On Mon, Jul 3, 2023 at 12:12 AM vignesh C  wrote:
> > >
> > > Hi,
> > >
> > > Improved tab completion for "ALTER DEFAULT PRIVILEGE" and "ALTER TABLE":
> > > 1) GRANT, REVOKE and FOR USER keyword was not displayed in tab
> > > completion of alter default privileges like the below statement:
> > > ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC;
> > > ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC;
> > > ALTER DEFAULT PRIVILEGES FOR USER vignesh revoke INSERT ON tables FROM 
> > > dba1;
> >
> > +1
> >
> > >
> > > 2) USER was not displayed for "ALTER DEFAULT PRIVILEGES IN SCHEMA
> > > public FOR " like in below statement:
> > > ALTER DEFAULT PRIVILEGES IN SCHEMA public FOR USER dba1 GRANT INSERT
> > > ON TABLES TO PUBLIC;
> >
> > Since there is no difference FOR USER and FOR ROLE, I'm not sure we
> > really want to support both in tab-completion.
>
> I have removed this change
>
> > >
> > > 3) "FOR GRANT OPTION" was not display for "ALTER DEFAULT PRIVILEGES
> > > REVOKE " like in below statement:
> > > alter default privileges revoke grant option for select ON tables FROM 
> > > dba1;
> >
> > +1. But the v3 patch doesn't cover the following case:
> >
> > =# alter default privileges for role masahiko revoke [tab]
> > ALL CREATE  DELETE  EXECUTE INSERT  MAINTAIN
> >  REFERENCES  SELECT  TRIGGER TRUNCATEUPDATE  USAGE
>
> Modified in the updated patch
>
> > And it doesn't cover MAINTAIN neither:
> >
> > =# alter default privileges revoke [tab]
> > ALL   DELETEGRANT OPTION FOR  REFERENCES
> >  TRIGGER   UPDATE
> > CREATEEXECUTE   INSERTSELECT
> >  TRUNCATE  USAGE
>
> Modified in the updated patch
>
> > The patch adds the completions for ALTER DEFAULT PRIVILEGES REVOKE,
> > but we handle such case in GRANT and REVOKE part:
> >
> > (around L3958)
> > /*
> >  * With ALTER DEFAULT PRIVILEGES, restrict completion to grantable
> >  * privileges (can't grant roles)
> >  */
> > if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES"))
> > COMPLETE_WITH("SELECT", "INSERT", "UPDATE",
> >   "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER",
> >   "CREATE", "EXECUTE", "USAGE", "MAINTAIN", "ALL");
>
> The current patch handles the fix from here now.
>
> > Also, I think we can support WITH GRANT OPTION too. For example,
> >
> > =# alter default privileges for role masahiko grant all on tables to
> > public [tab]
>
> I have handled this in the updated patch
>
> > It's already supported in the GRANT statement.
> >
> > >
> > > 4) "DATA TYPE" was missing in "ALTER TABLE table-name ALTER COLUMN
> > > column-name SET" like in:
> > > ALTER TABLE t1 ALTER COLUMN c1 SET DATA TYPE text;
> > >
> >
> > +1. The patch looks good to me, so pushed.
>
> Thanks for committing this.
>
> The updated patch has the changes for the above comments.
>

Thank you for updating the patch.

I think it doesn't work well, as "GRANT OPTION FOR" is offered as a
completion twice. For example,

=# alter default privileges for user masahiko revoke [tab]
ALL   DELETEGRANT OPTION FOR  MAINTAIN
 SELECTTRUNCATE  USAGE
CREATEEXECUTE   INSERTREFERENCES
 TRIGGER   UPDATE
=# alter default privileges for user masahiko revoke grant option for [tab]
ALL   DELETEGRANT OPTION FOR  MAINTAIN
 SELECTTRUNCATE  USAGE
CREATEEXECUTE   INSERTREFERENCES
 TRIGGER   UPDATE
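
Roughly, I imagine the completion rules would need to be split along
these lines (a sketch only, following tab-complete.c's macro
conventions, not the actual fix):

else if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES") &&
         TailMatches("REVOKE"))
    COMPLETE_WITH("GRANT OPTION FOR",
                  "SELECT", "INSERT", "UPDATE", "DELETE", "TRUNCATE",
                  "REFERENCES", "TRIGGER", "CREATE", "EXECUTE", "USAGE",
                  "MAINTAIN", "ALL");
else if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES") &&
         TailMatches("GRANT", "OPTION", "FOR"))
    COMPLETE_WITH("SELECT", "INSERT", "UPDATE", "DELETE", "TRUNCATE",
                  "REFERENCES", "TRIGGER", "CREATE", "EXECUTE", "USAGE",
                  "MAINTAIN", "ALL");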

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-01 Thread Masahiko Sawada
ensure whether we need
> > to update the slot or not.

If we use such pre-checks, another problem might happen: they cannot
handle a case where the slot is acquired on the primary but its LSNs
don't move forward. Imagine a logical replication conflict happens on
the subscriber and logical replication enters its retry loop. In this
case, the remote slot's inactive_since gets updated on every retry, but
the slot looks inactive from the standby since its LSNs don't change.
Therefore, only the local slot could be invalidated due to the timeout,
but we probably don't want to regard such a slot as "inactive".

Another idea I came up with is that the slotsync worker updates the
local slot's inactive_since to the local timestamp only when the
remote slot might have become inactive. If the remote slot is acquired
by someone, the local slot's inactive_since is also NULL. If the remote
slot becomes inactive, the slotsync worker sets the local slot's
inactive_since to the local timestamp. Since the remote slot could be
acquired and released before the slotsync worker fetches the remote
slot data again, if the remote slot's inactive_since is greater than
the local slot's inactive_since, the slotsync worker updates the local
one. IOW, we detect whether the remote slot was acquired and released
since the last synchronization by checking the remote slot's
inactive_since. This idea seems to handle the cases I mentioned unless
I'm missing something, but it requires the slotsync worker to update
inactive_since in a different way than the other parameters.
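
A rough sketch of that comparison (illustrative only; in particular,
treating a NULL inactive_since as 0 and the local variable name are
just for illustration):

/* remote slot is currently acquired on the primary */
if (remote_slot->inactive_since == 0)
    local_inactive_since = 0;       /* keep the synced slot "active" */
/* remote slot was acquired and released since the last sync cycle */
else if (remote_slot->inactive_since > local_inactive_since)
    local_inactive_since = GetCurrentTimestamp();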

Or a simple solution is that the slotsync worker updates
inactive_since as it does for non-synced slots, and disables
timeout-based slot invalidation for synced slots.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-04-01 Thread Masahiko Sawada
On Mon, Apr 1, 2024 at 10:03 AM Masahiko Sawada  wrote:
>
> On Sat, Mar 30, 2024 at 11:05 PM Bharath Rupireddy
>  wrote:
> >
> > On Thu, Mar 28, 2024 at 6:31 PM Masahiko Sawada  
> > wrote:
> > >
> > > That is,
> > > since the LOG_VERBOSITY option is an enum parameter, it might make
> > > more sense to require the value, instead of making the value optional.
> > > For example, the following command could not be obvious for users:
> > >
> > > COPY test FROM stdin (ON_ERROR ignore, LOG_VERBOSITY);
> >
> > Agreed. Please see the attached v14 patch.
>
> Thank you for updating the patch!
>
> >  The LOG_VERBOSITY now needs
> > a value to be specified. Note that I've not added any test for this
> > case as there seemed to be no such tests so far generating "ERROR:
> > <> requires a parameter". I don't mind adding one for
> > LOG_VERBOSITY though.
>
> +1
>
> One minor point:
>
>  ENCODING 'encoding_name'
> +LOG_VERBOSITY [ mode ]
>  
>
> '[' and ']' are not necessary because the value is no longer optional.
>
> I've attached the updated patch. I'll push it, barring any objections.
>

Pushed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-31 Thread Masahiko Sawada
On Mon, Apr 1, 2024 at 11:26 AM Masahiko Sawada  wrote:
>
> On Fri, Mar 29, 2024 at 7:37 PM Amit Kapila  wrote:
> >
> > On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
> > > >
> > > > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > >
> > > > > > I've attached new version patches.
> > > > >
> > > > > Since the previous patch conflicts with the current HEAD, I've
> > > > > attached the rebased patches.
> > > >
> > > > Thanks for the updated patch.
> > > > One comment:
> > > > I felt we can mention the improvement where we update memory
> > > > accounting info at transaction level instead of per change level which
> > > > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> > > > ReorderBufferSerializeTXN also in the commit message:
> > >
> > > Agreed.
> > >
> > > I think the patch is in good shape. I'll push the patch with the
> > > suggestion next week, barring any objections.
> > >
> >
> > Few minor comments:
> > 1.
> > @@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
> >   Assert(txn->nentries_mem == 0);
> >   }
> >
> > + ReorderBufferMaybeResetMaxHeap(rb);
> > +
> >
> > Can we write a comment about why this reset is required here?
> > Otherwise, the reason is not apparent.
>
> Yes, added.
>
> >
> > 2.
> > Although using max-heap to select the largest
> > + * transaction is effective when there are many transactions being decoded,
> > + * there is generally no need to use it as long as all transactions being
> > + * decoded are top-level transactions. Therefore, we use MaxConnections as 
> > the
> > + * threshold so we can prevent switching to the state unless we use
> > + * subtransactions.
> > + */
> > +#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections
> >
> > Isn't using max-heap equally effective in finding the largest
> > transaction whether there are top-level or top-level plus
> > subtransactions? This comment indicates it is only effective when
> > there are subtransactions.
>
> You're right. Updated the comment.
>
> I've attached the updated patches.
>

While reviewing the patches, I realized that the comment of
binaryheap_allocate() should also be updated. So I've attached the new
patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v12-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v12-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch
Description: Binary data


v12-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-31 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 4:21 PM John Naylor  wrote:
>
> On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada  
> wrote:
> > I think the patch is in good shape. Do you have other comments or
> > suggestions, John?
>
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -1918,11 +1918,6 @@ include_dir 'conf.d'
>  too high.  It may be useful to control for this by separately
>  setting .
> 
> -   
> -Note that for the collection of dead tuple identifiers,
> -VACUUM is only able to utilize up to a maximum of
> -1GB of memory.
> -   
>
>   
>
> This is mentioned twice for two different GUCs -- need to remove the
> other one, too.

Good catch, removed.

> Other than that, I just have minor nits:
>
> - * The major space usage for vacuuming is storage for the array of dead TIDs
> + * The major space usage for vacuuming is TID store, a storage for dead TIDs
>
> I think I've helped edit this sentence before, but I still don't quite
> like it. I'm thinking now "is storage for the dead tuple IDs".
>
> - * set upper bounds on the number of TIDs we can keep track of at once.
> + * set upper bounds on the maximum memory that can be used for keeping track
> + * of dead TIDs at once.
>
> I think "maximum" is redundant with "upper bounds".

Fixed.

>
> I also feel the commit message needs more "meat" -- we need to clearly
> narrate the features and benefits. I've attached how I would write it,
> but feel free to use what you like to match your taste.

Well, that's much better than mine.

>
> I've marked it Ready for Committer.

Thank you! I've attached the patch that I'm going to push tomorrow.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v82-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
Description: Binary data


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-31 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 11:54 AM torikoshia  wrote:
>
> On 2024-03-28 21:54, Masahiko Sawada wrote:
> > On Thu, Mar 28, 2024 at 9:38 PM torikoshia 
> > wrote:
> >>
> >> On 2024-03-28 10:20, Masahiko Sawada wrote:
> >> > Hi,
> >> >
> >> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
> >> > wrote:
> >> >>
> >> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >> >>  wrote:
> >> >> >
> >> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia 
> >> >> >  wrote:
> >> >> > > On 2024-01-18 10:10, jian he wrote:
> >> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> >> >> > > > 
> >> >> > > > wrote:
> >> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  
> >> >> > > >> wrote:
> >> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it 
> >> >> > > >> > to
> >> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of 
> >> >> > > >> > "error")?
> >> >> > > >> > You will need a separate parameter anyway to specify the 
> >> >> > > >> > destination
> >> >> > > >> > of "log", unless "none" became an illegal table name when I 
> >> >> > > >> > wasn't
> >> >> > > >> > looking.  I don't buy that one parameter that has some special 
> >> >> > > >> > values
> >> >> > > >> > while other values could be names will be a good design.  
> >> >> > > >> > Moreover,
> >> >> > > >> > what if we want to support (say) log-to-file along with 
> >> >> > > >> > log-to-table?
> >> >> > > >> > Trying to distinguish a file name from a table name without 
> >> >> > > >> > any other
> >> >> > > >> > context seems impossible.
> >> >> > > >>
> >> >> > > >> I've been thinking we can add more values to this option to log 
> >> >> > > >> errors
> >> >> > > >> not only to the server logs but also to the error table (not sure
> >> >> > > >> details but I imagined an error table is created for each table 
> >> >> > > >> on
> >> >> > > >> error), without an additional option for the destination name. 
> >> >> > > >> The
> >> >> > > >> values would be like error_action 
> >> >> > > >> {error|ignore|save-logs|save-table}.
> >> >> > > >>
> >> >> > > >
> >> >> > > > another idea:
> >> >> > > > on_error {error|ignore|other_future_option}
> >> >> > > > if not specified then by default ERROR.
> >> >> > > > You can also specify ERROR or IGNORE for now.
> >> >> > > >
> >> >> > > > I agree, the parameter "error_action" is better than "location".
> >> >> > >
> >> >> > > I'm not sure whether error_action or on_error is better, but either 
> >> >> > > way
> >> >> > > "error_action error" and "on_error error" seems a bit odd to me.
> >> >> > > I feel "stop" is better for both cases as Tom suggested.
> >> >> >
> >> >> > OK.  What about this?
> >> >> > on_error {stop|ignore|other_future_option}
> >> >> > where other_future_option might be compound like "file 'copy.log'" or
> >> >> > "table 'copy_log'".
> >> >>
> >> >> +1
> >> >>
> >> >
> >> > I realized that ON_ERROR syntax synoposis in the documentation is not
> >> > correct. The option doesn't require the value to be quoted and the
> >> > value can be omitted. The attached patch fixes it.
> >> >
> >> > Regards,
> >>
> >> Thanks!
> >>
> >> Attached patch fixes the doc, but I'm wondering perhaps it might be
> >> better to modify the codes to prohibit abbreviation of the value.
> >>
> >> When seeing the query which abbreviates ON_ERROR value, I feel it's
> >> not
> >> obvious what happens compared to other options which tolerates
> >> abbreviation of the value such as FREEZE or HEADER.
> >>
> >>COPY t1 FROM stdin WITH (ON_ERROR);
> >>
> >> What do you think?
> >
> > Indeed. Looking at options of other commands such as VACUUM and
> > EXPLAIN, I can see that we can omit a boolean value, but non-boolean
> > parameters require its value. The HEADER option is not a pure boolean
> > parameter but we can omit the value. It seems to be for backward
> > compatibility; it used to be a boolean parameter. I agree that the
> > above example would confuse users.
> >
> > Regards,
>
> Thanks for your comment!
>
> Attached a patch which modifies the code to prohibit omission of its
> value.
>
> I was a little unsure about adding a regression test for this, but I
> have not added it since other COPY option doesn't test the omission of
> its value.

Perhaps we should change the doc as well, since the ON_ERROR value
doesn't necessarily need to be single-quoted?

The rest looks good to me.

Alexander, what do you think about this change, as you're the
committer of this feature?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-31 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 8:48 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Sawada-san,
>
> > Agreed.
> >
> > I think the patch is in good shape. I'll push the patch with the
> > suggestion next week, barring any objections.
>
> Thanks for working on this. Agreed it is committable.
> Few minor comments:

Thank you for the comments!

>
> ```
> + * Either txn or change must be non-NULL at least. We update the memory
> + * counter of txn if it's non-NULL, otherwise change->txn.
> ```
>
> IIUC no one checks the restriction. Should we add Assert() for it, e.g,:
> Assert(txn || change)?

Agreed to add it.

>
> ```
> +/* make sure enough space for a new node */
> ...
> +/* make sure enough space for a new node */
> ```
>
> Should be started with upper case?

I don't think we need to change it. There are other one-line comments
in the same file that start with a lowercase letter.

I've just submitted the updated patches[1].

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoA6%3D%2BtL%3DbtB_s9N%2BcZK7tKz1W%3DPQyNq72nzjUcdyE%2BwZw%40mail.gmail.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve eviction algorithm in ReorderBuffer

2024-03-31 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 7:37 PM Amit Kapila  wrote:
>
> On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada  
> wrote:
> >
> > On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
> > >
> > > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada 
> > > >  wrote:
> > > > >
> > > > >
> > > > > I've attached new version patches.
> > > >
> > > > Since the previous patch conflicts with the current HEAD, I've
> > > > attached the rebased patches.
> > >
> > > Thanks for the updated patch.
> > > One comment:
> > > I felt we can mention the improvement where we update memory
> > > accounting info at transaction level instead of per change level which
> > > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> > > ReorderBufferSerializeTXN also in the commit message:
> >
> > Agreed.
> >
> > I think the patch is in good shape. I'll push the patch with the
> > suggestion next week, barring any objections.
> >
>
> Few minor comments:
> 1.
> @@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
>   Assert(txn->nentries_mem == 0);
>   }
>
> + ReorderBufferMaybeResetMaxHeap(rb);
> +
>
> Can we write a comment about why this reset is required here?
> Otherwise, the reason is not apparent.

Yes, added.

>
> 2.
> Although using max-heap to select the largest
> + * transaction is effective when there are many transactions being decoded,
> + * there is generally no need to use it as long as all transactions being
> + * decoded are top-level transactions. Therefore, we use MaxConnections as 
> the
> + * threshold so we can prevent switching to the state unless we use
> + * subtransactions.
> + */
> +#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections
>
> Isn't using max-heap equally effective in finding the largest
> transaction whether there are top-level or top-level plus
> subtransactions? This comment indicates it is only effective when
> there are subtransactions.

You're right. Updated the comment.

I've attached the updated patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v11-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch
Description: Binary data


v11-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v11-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch
Description: Binary data


Re: Add new error_action COPY ON_ERROR "log"

2024-03-31 Thread Masahiko Sawada
On Sat, Mar 30, 2024 at 11:05 PM Bharath Rupireddy
 wrote:
>
> On Thu, Mar 28, 2024 at 6:31 PM Masahiko Sawada  wrote:
> >
> > That is,
> > since the LOG_VERBOSITY option is an enum parameter, it might make
> > more sense to require the value, instead of making the value optional.
> > For example, the following command could not be obvious for users:
> >
> > COPY test FROM stdin (ON_ERROR ignore, LOG_VERBOSITY);
>
> Agreed. Please see the attached v14 patch.

Thank you for updating the patch!

>  The LOG_VERBOSITY now needs
> a value to be specified. Note that I've not added any test for this
> case as there seemed to be no such tests so far generating "ERROR:
> <> requires a parameter". I don't mind adding one for
> LOG_VERBOSITY though.

+1

One minor point:

 ENCODING 'encoding_name'
+LOG_VERBOSITY [ mode ]
 

'[' and ']' are not necessary because the value is no longer optional.

I've attached the updated patch. I'll push it, barring any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v15-0001-Add-new-COPY-option-LOG_VERBOSITY.patch
Description: Binary data


Re: Improve eviction algorithm in ReorderBuffer

2024-03-29 Thread Masahiko Sawada
On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
>
> On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  wrote:
> >
> > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  
> > wrote:
> > >
> > >
> > > I've attached new version patches.
> >
> > Since the previous patch conflicts with the current HEAD, I've
> > attached the rebased patches.
>
> Thanks for the updated patch.
> One comment:
> I felt we can mention the improvement where we update memory
> accounting info at transaction level instead of per change level which
> is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> ReorderBufferSerializeTXN also in the commit message:

Agreed.

I think the patch is in good shape. I'll push the patch with the
suggestion next week, barring any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-29 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 6:15 PM John Naylor  wrote:
>
> On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada  
> wrote:
> >
> > Pushed the refactoring patch.
> >
> > I've attached the rebased vacuum improvement patch for cfbot. I
> > mentioned in the commit message that this patch eliminates the 1GB
> > limitation.
> >
> > I think the patch is in good shape. Do you have other comments or
> > suggestions, John?
>
> I'll do another pass tomorrow, but first I wanted to get in another
> slightly-challenging in-situ test.

Thanks!

>
> About the "tuples missed" -- I didn't expect contention during this
> test. I believe that's completely unrelated behavior, but wanted to
> mention it anyway, since I found it confusing.

I haven't investigated it enough, but bgwriter might be related to the
contention.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-03-28 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 5:28 PM Bharath Rupireddy
 wrote:
>
> On Thu, Mar 28, 2024 at 1:43 PM torikoshia  wrote:
> >
> > Attached patch fixes the doc,
>
> May I know which patch you are referring to? And, what do you mean by
> "fixes the doc"?
>
> > but I'm wondering perhaps it might be
> > better to modify the codes to prohibit abbreviation of the value.
>
> Please help me understand the meaning here.
>
> > When seeing the query which abbreviates ON_ERROR value, I feel it's not
> > obvious what happens compared to other options which tolerates
> > abbreviation of the value such as FREEZE or HEADER.
> >
> >COPY t1 FROM stdin WITH (ON_ERROR);
> >
> > What do you think?
>
> So, do you mean to prohibit ON_ERROR being specified without any value
> like in COPY t1 FROM stdin WITH (ON_ERROR);? If yes, I think all the
> other options do allow that [1].
>
> Even if we were to do something like this, shall we discuss this separately?
>
> Having said that, what do you think of the v13 patch posted upthread?
>

This topic accidentally came up in this thread, but it made me think
that the same might be true for the LOG_VERBOSITY option. That is,
since the LOG_VERBOSITY option is an enum parameter, it might make
more sense to require the value instead of making it optional. For
example, the following command might not be obvious to users:

COPY test FROM stdin (ON_ERROR ignore, LOG_VERBOSITY);

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-28 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 9:38 PM torikoshia  wrote:
>
> On 2024-03-28 10:20, Masahiko Sawada wrote:
> > Hi,
> >
> > On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada 
> > wrote:
> >>
> >> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov
> >>  wrote:
> >> >
> >> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia  
> >> > wrote:
> >> > > On 2024-01-18 10:10, jian he wrote:
> >> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> >> > > > 
> >> > > > wrote:
> >> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  wrote:
> >> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> >> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of 
> >> > > >> > "error")?
> >> > > >> > You will need a separate parameter anyway to specify the 
> >> > > >> > destination
> >> > > >> > of "log", unless "none" became an illegal table name when I wasn't
> >> > > >> > looking.  I don't buy that one parameter that has some special 
> >> > > >> > values
> >> > > >> > while other values could be names will be a good design.  
> >> > > >> > Moreover,
> >> > > >> > what if we want to support (say) log-to-file along with 
> >> > > >> > log-to-table?
> >> > > >> > Trying to distinguish a file name from a table name without any 
> >> > > >> > other
> >> > > >> > context seems impossible.
> >> > > >>
> >> > > >> I've been thinking we can add more values to this option to log 
> >> > > >> errors
> >> > > >> not only to the server logs but also to the error table (not sure
> >> > > >> details but I imagined an error table is created for each table on
> >> > > >> error), without an additional option for the destination name. The
> >> > > >> values would be like error_action 
> >> > > >> {error|ignore|save-logs|save-table}.
> >> > > >>
> >> > > >
> >> > > > another idea:
> >> > > > on_error {error|ignore|other_future_option}
> >> > > > if not specified then by default ERROR.
> >> > > > You can also specify ERROR or IGNORE for now.
> >> > > >
> >> > > > I agree, the parameter "error_action" is better than "location".
> >> > >
> >> > > I'm not sure whether error_action or on_error is better, but either way
> >> > > "error_action error" and "on_error error" seems a bit odd to me.
> >> > > I feel "stop" is better for both cases as Tom suggested.
> >> >
> >> > OK.  What about this?
> >> > on_error {stop|ignore|other_future_option}
> >> > where other_future_option might be compound like "file 'copy.log'" or
> >> > "table 'copy_log'".
> >>
> >> +1
> >>
> >
> > I realized that ON_ERROR syntax synoposis in the documentation is not
> > correct. The option doesn't require the value to be quoted and the
> > value can be omitted. The attached patch fixes it.
> >
> > Regards,
>
> Thanks!
>
> Attached patch fixes the doc, but I'm wondering perhaps it might be
> better to modify the codes to prohibit abbreviation of the value.
>
> When seeing the query which abbreviates ON_ERROR value, I feel it's not
> obvious what happens compared to other options which tolerates
> abbreviation of the value such as FREEZE or HEADER.
>
>COPY t1 FROM stdin WITH (ON_ERROR);
>
> What do you think?

Indeed. Looking at the options of other commands such as VACUUM and
EXPLAIN, I can see that a boolean value can be omitted, but non-boolean
parameters require a value. The HEADER option is not a pure boolean
parameter, yet its value can be omitted; that seems to be for backward
compatibility, since it used to be a boolean parameter. I agree that
the above example would confuse users.
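
For what it's worth, here is a simplified illustration of the two
behaviors, using hypothetical helper names (the real logic lives in
the generic DefElem handling):

/* boolean-style option: a missing value means "true" */
static bool
example_get_boolean(DefElem *def)
{
    if (def->arg == NULL)
        return true;
    return defGetBoolean(def);
}

/* enum/string-style option: a missing value is an error */
static char *
example_get_string(DefElem *def)
{
    if (def->arg == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("%s requires a parameter", def->defname)));
    return defGetString(def);
}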

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve tab completion for ALTER DEFAULT PRIVILEGE and ALTER TABLE

2024-03-28 Thread Masahiko Sawada
Hi,

Thank you for the patch!

On Mon, Jul 3, 2023 at 12:12 AM vignesh C  wrote:
>
> Hi,
>
> Improved tab completion for "ALTER DEFAULT PRIVILEGE" and "ALTER TABLE":
> 1) GRANT, REVOKE and FOR USER keyword was not displayed in tab
> completion of alter default privileges like the below statement:
> ALTER DEFAULT PRIVILEGES GRANT INSERT ON tables TO PUBLIC;
> ALTER DEFAULT PRIVILEGES REVOKE INSERT ON tables FROM PUBLIC;
> ALTER DEFAULT PRIVILEGES FOR USER vignesh revoke INSERT ON tables FROM dba1;

+1

>
> 2) USER was not displayed for "ALTER DEFAULT PRIVILEGES IN SCHEMA
> public FOR " like in below statement:
> ALTER DEFAULT PRIVILEGES IN SCHEMA public FOR USER dba1 GRANT INSERT
> ON TABLES TO PUBLIC;

Since there is no difference between FOR USER and FOR ROLE, I'm not
sure we really want to support both in tab-completion.

>
> 3) "FOR GRANT OPTION" was not display for "ALTER DEFAULT PRIVILEGES
> REVOKE " like in below statement:
> alter default privileges revoke grant option for select ON tables FROM dba1;

+1. But the v3 patch doesn't cover the following case:

=# alter default privileges for role masahiko revoke [tab]
ALL CREATE  DELETE  EXECUTE INSERT  MAINTAIN
 REFERENCES  SELECT  TRIGGER TRUNCATEUPDATE  USAGE

And it doesn't cover MAINTAIN neither:

=# alter default privileges revoke [tab]
ALL   DELETEGRANT OPTION FOR  REFERENCES
 TRIGGER   UPDATE
CREATEEXECUTE   INSERTSELECT
 TRUNCATE  USAGE

The patch adds the completions for ALTER DEFAULT PRIVILEGES REVOKE,
but we handle such case in GRANT and REVOKE part:

(around L3958)
/*
 * With ALTER DEFAULT PRIVILEGES, restrict completion to grantable
 * privileges (can't grant roles)
 */
if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES"))
COMPLETE_WITH("SELECT", "INSERT", "UPDATE",
  "DELETE", "TRUNCATE", "REFERENCES", "TRIGGER",
  "CREATE", "EXECUTE", "USAGE", "MAINTAIN", "ALL");

Also, I think we can support WITH GRANT OPTION too. For example,

=# alter default privileges for role masahiko grant all on tables to
public [tab]

It's already supported in the GRANT statement.
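
A hypothetical sketch of such a completion rule (not an actual patch;
MatchAny/COMPLETE_WITH usage follows tab-complete.c conventions, and
the condition would likely need to be narrower):

else if (HeadMatches("ALTER", "DEFAULT", "PRIVILEGES") &&
         TailMatches("TO", MatchAny))
    COMPLETE_WITH("WITH GRANT OPTION");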

>
> 4) "DATA TYPE" was missing in "ALTER TABLE table-name ALTER COLUMN
> column-name SET" like in:
> ALTER TABLE t1 ALTER COLUMN c1 SET DATA TYPE text;
>

+1. The patch looks good to me, so pushed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-27 Thread Masahiko Sawada
On Wed, Mar 27, 2024 at 5:43 PM Masahiko Sawada  wrote:
>
> On Wed, Mar 27, 2024 at 9:25 AM John Naylor  wrote:
> >
> > On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor  
> > > wrote:
> > > >
> > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada 
> > > >  wrote:
> >
> > > > - * remaining LP_DEAD line pointers on the page in the dead_items
> > > > - * array. These dead items include those pruned by lazy_scan_prune()
> > > > - * as well we line pointers previously marked LP_DEAD.
> > > > + * remaining LP_DEAD line pointers on the page in the dead_items.
> > > > + * These dead items include those pruned by lazy_scan_prune() as well
> > > > + * we line pointers previously marked LP_DEAD.
> > > >
> > > > Here maybe "into dead_items".
> >
> > - * remaining LP_DEAD line pointers on the page in the dead_items.
> > + * remaining LP_DEAD line pointers on the page into the dead_items.
> >
> > Let me explain. It used to be "in the dead_items array." It is not an
> > array anymore, so it was changed to "in the dead_items". dead_items is
> > a variable name, and names don't take "the". "into dead_items" seems
> > most natural to me, but there are other possible phrasings.
>
> Thanks for the explanation. I was distracted. Fixed in the latest patch.
>
> >
> > > > > > Did you try it with 1MB m_w_m?
> > > > >
> > > > > I've incorporated the above comments and test results look good to me.
> > > >
> > > > Could you be more specific about what the test was?
> > > > Does it work with 1MB m_w_m?
> > >
> > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB.
> > >
> > > FYI other test cases I tested were:
> > >
> > > * m_w_m = 2199023254528 (maximum value)
> > > initial: 1MB
> > > max: 128GB
> > >
> > > * m_w_m = 64MB (default)
> > > initial: 1MB
> > > max: 8MB
> >
> > If the test was a vacuum, how big a table was needed to hit 128GB?
>
> I just checked how TIdStoreCreateLocal() calculated the initial and
> max segment sizes while changing m_w_m, so didn't check how big
> segments are actually allocated in the maximum value test case.
>
> >
> > > > The existing comment slipped past my radar, but max_bytes is not a
> > > > limit, it's a hint. Come to think of it, it never was a limit in the
> > > > normal sense, but in earlier patches it was the criteria for reporting
> > > > "I'm full" when asked.
> > >
> > > Updated the comment.
> >
> > + * max_bytes is not a limit; it's used to choose the memory block sizes of
> > + * a memory context for TID storage in order for the total memory 
> > consumption
> > + * not to be overshot a lot. The caller can use the max_bytes as the 
> > criteria
> > + * for reporting whether it's full or not.
> >
> > This is good information. I suggest this edit:
> >
> > "max_bytes" is not an internally-enforced limit; it is used only as a
> > hint to cap the memory block size of the memory context for TID
> > storage. This reduces space wastage due to over-allocation. If the
> > caller wants to monitor memory usage, it must compare its limit with
> > the value reported by TidStoreMemoryUsage().
> >
> > Other comments:
>
> Thanks for the suggestion!
>
> >
> > v79-0002 looks good to me.
> >
> > v79-0003:
> >
> > "With this commit, when creating a shared TidStore, a dedicated DSA
> > area is created for TID storage instead of using the provided DSA
> > area."
> >
> > This is very subtle, but "the provided..." implies there still is one.
> > -> "a provided..."
> >
> > + * Similar to TidStoreCreateLocal() but create a shared TidStore on a
> > + * DSA area. The TID storage will live in the DSA area, and a memory
> > + * context rt_context will have only meta data of the radix tree.
> >
> > -> "the memory context"
>
> Fixed in the latest patch.
>
> >
> > I think you can go ahead and commit 0002 and 0003/4.
>
> I've pushed the 0002 (dsa init and max segment size) patch, and will
> push the attached 0001 patch next.

Pushed the refactoring patch.

I've attached the rebased vacuum improvement patch for cfbot. I
mentioned in the commit message that this patch eliminates the 1GB
limitation.

I think the patch is in good shape. Do you have other comments or
suggestions, John?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v81-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
Description: Binary data


Re: Fix parallel vacuum buffer usage reporting

2024-03-27 Thread Masahiko Sawada
Hi,

Thank you for the report.

On Fri, Feb 9, 2024 at 6:10 PM Anthonin Bonnefoy
 wrote:
>
> Hi,
>
> With a db setup with pgbench, we add an additional index:
> CREATE INDEX ON pgbench_accounts(abalance)
>
> And trigger several updates and vacuum to reach a stable amount of
> dirtied pages:
> UPDATE pgbench_accounts set abalance = abalance + 1 WHERE aid=1; CHECKPOINT;
> VACUUM (VERBOSE, INDEX_CLEANUP ON) pgbench_accounts
>
> The vacuum will report the following:
> INFO:  vacuuming "postgres.public.pgbench_accounts"
> INFO:  launched 1 parallel vacuum worker for index vacuuming (planned: 1)
> INFO:  finished vacuuming "postgres.public.pgbench_accounts": index scans: 1
> ...
> buffer usage: 122 hits, 165 misses, 4 dirtied
>
> 4 pages were reported dirtied. However, we have 5 dirtied blocks at
> the end of the vacuum when looking at pg_buffercache:
>
> SELECT c.relname, b.relfilenode
>  FROM
> pg_buffercache b LEFT JOIN pg_class c
>  ON b.relfilenode =
> pg_relation_filenode(c.oid)
>WHERE isdirty=true;
> relname| relfilenode
> ---+-
>  pg_class  |1259
>  pgbench_accounts  |   16400
>  pgbench_accounts  |   16400
>  pgbench_accounts_pkey |   16408
>  pgbench_accounts_abalance_idx |   16480
>
> The missing dirty block comes from the parallel worker vacuuming the
> abalance index. Running vacuum with parallel disabled will give the
> correct result.
>
> Vacuum uses dedicated VacuumPage{Hit,Miss,Dirty} globals to track
> buffer usage. However, those values are not collected at the end of
> parallel vacuum workers, leading to incorrect buffer count.

True. I think we should fix it on back-branches as well.

>
> Those vacuum specific globals are redundant with the existing
> pgBufferUsage and only used in the verbose output. This patch removes
> them and replaces them by pgBufferUsage which is already correctly
> collected at the end of parallel workers, fixing the buffer count.

It seems to make sense to remove VacuumPageHit and friends (only on
the master branch) if we can use BufferUsage instead.

As for the proposed patch, the following part should handle the temp tables too:

appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
                 read_rate, write_rate);
appendStringInfo(&buf, _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
-                (long long) AnalyzePageHit,
-                (long long) AnalyzePageMiss,
-                (long long) AnalyzePageDirty);
+                (long long) bufferusage.shared_blks_hit,
+                (long long) bufferusage.shared_blks_read,
+                (long long) bufferusage.shared_blks_dirtied);
appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
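
For instance, the local (temp-table) counters could be folded in
roughly like this (a sketch, not the actual fix):

appendStringInfo(&buf,
                 _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),
                 (long long) (bufferusage.shared_blks_hit +
                              bufferusage.local_blks_hit),
                 (long long) (bufferusage.shared_blks_read +
                              bufferusage.local_blks_read),
                 (long long) (bufferusage.shared_blks_dirtied +
                              bufferusage.local_blks_dirtied));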

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-03-27 Thread Masahiko Sawada
Hi,

On Thu, Jan 18, 2024 at 5:33 PM Masahiko Sawada  wrote:
>
> On Thu, Jan 18, 2024 at 4:59 PM Alexander Korotkov  
> wrote:
> >
> > On Thu, Jan 18, 2024 at 4:16 AM torikoshia  
> > wrote:
> > > On 2024-01-18 10:10, jian he wrote:
> > > > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> > > > wrote:
> > > >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  wrote:
> > > >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> > > >> > error_action {error|ignore|log} (or perhaps "stop" instead of 
> > > >> > "error")?
> > > >> > You will need a separate parameter anyway to specify the destination
> > > >> > of "log", unless "none" became an illegal table name when I wasn't
> > > >> > looking.  I don't buy that one parameter that has some special values
> > > >> > while other values could be names will be a good design.  Moreover,
> > > >> > what if we want to support (say) log-to-file along with log-to-table?
> > > >> > Trying to distinguish a file name from a table name without any other
> > > >> > context seems impossible.
> > > >>
> > > >> I've been thinking we can add more values to this option to log errors
> > > >> not only to the server logs but also to the error table (not sure
> > > >> details but I imagined an error table is created for each table on
> > > >> error), without an additional option for the destination name. The
> > > >> values would be like error_action {error|ignore|save-logs|save-table}.
> > > >>
> > > >
> > > > another idea:
> > > > on_error {error|ignore|other_future_option}
> > > > if not specified then by default ERROR.
> > > > You can also specify ERROR or IGNORE for now.
> > > >
> > > > I agree, the parameter "error_action" is better than "location".
> > >
> > > I'm not sure whether error_action or on_error is better, but either way
> > > "error_action error" and "on_error error" seems a bit odd to me.
> > > I feel "stop" is better for both cases as Tom suggested.
> >
> > OK.  What about this?
> > on_error {stop|ignore|other_future_option}
> > where other_future_option might be compound like "file 'copy.log'" or
> > "table 'copy_log'".
>
> +1
>

I realized that ON_ERROR syntax synoposis in the documentation is not
correct. The option doesn't require the value to be quoted and the
value can be omitted. The attached patch fixes it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


0001-doc-Fix-COPY-ON_ERROR-option-syntax-synopsis.patch
Description: Binary data


Re: Add new error_action COPY ON_ERROR "log"

2024-03-27 Thread Masahiko Sawada
On Thu, Mar 28, 2024 at 2:49 AM Bharath Rupireddy
 wrote:
>
> On Wed, Mar 27, 2024 at 7:42 AM Masahiko Sawada  wrote:
> >
> > I think that there are two options to handle it:
> >
> > 1. change COPY grammar to accept DEFAULT as an option value.
> > 2. change tab-completion to complement 'DEFAULT' instead of DEFAULT,
> > and update the doc too.
> >
> > As for the documentation, we can add single-quotes as follows:
> >
> >  ENCODING 'encoding_name'
> > +LOG_VERBOSITY [ 'mode' ]
> >
> > I thought the option (2) is better but there seems no precedent of
> > complementing a single-quoted string other than file names. So the
> > option (1) could be clearer.
> >
> > What do you think?
>
> There is another option to change log_verbosity to {none, verbose} or
> {none, skip_row_info} (discuseed here
> https://www.postgresql.org/message-id/Zelrqq-pkfkvsjPn%40paquier.xyz
> that we might extend this option to other use-cases in future). I tend
> to agree with you to support log_verbose to be set to default without
> quotes just to be consistent with other commands that allow that [1].
> And, thanks for quickly identifying where to change in the gram.y.
> With that change, now I have changed all the new tests added to use
> log_verbosity default without quotes.
>
> FWIW, a recent commit [2] did the same. Therefore, I don't see a
> problem supporting it that way for COPY log_verbosity.
>
> Please find the attached v13 patch with the change.

Thank you for updating the patch quickly and for sharing the reference.

I think {default, verbose} is a good start and more consistent with
other similar features. We can add other modes later.

Regarding the syntax change, since the copy_generic_opt_arg rule is
used only for COPY option syntax, the change doesn't affect any other
syntax. I've confirmed that tab-completion works fine.

I'll push the patch, barring any objections.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Skip collecting decoded changes of already-aborted transactions

2024-03-27 Thread Masahiko Sawada
On Wed, Mar 27, 2024 at 8:49 PM Ajin Cherian  wrote:
>
>
>
> On Mon, Mar 18, 2024 at 7:50 PM Masahiko Sawada  wrote:
>>
>>
>> In addition to these changes, I've made some changes to the latest
>> patch. Here is the summary:
>>
>> - Use txn_flags field to record the transaction status instead of two
>> 'committed' and 'aborted' flags.
>> - Add regression tests.
>> - Update commit message.
>>
>> Regards,
>>
>
> Hi Sawada-san,
>
> Thanks for the updated patch. Some comments:

Thank you for the review comments!

>
> 1.
> + * already aborted, we discards all changes accumulated so far and ignore
> + * future changes, and return true. Otherwise return false.
>
> we discards/we discard

Will fix it.

>
> 2. In function ReorderBufferCheckTXNAbort(): I haven't tested this but I 
> wonder how prepared transactions would be considered, they are neither 
> committed, nor in progress.

A transaction that is prepared but not yet resolved is considered
in-progress.
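
Roughly speaking, the classification looks like this (a simplified
illustration, not the actual patch logic):

if (TransactionIdIsInProgress(txn->xid))
{
    /* running, or prepared but not yet resolved: treat as in-progress */
}
else if (TransactionIdDidCommit(txn->xid))
{
    /* committed */
}
else
{
    /* aborted (or crashed): accumulated changes can be discarded */
}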

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-27 Thread Masahiko Sawada
On Wed, Mar 27, 2024 at 9:25 AM John Naylor  wrote:
>
> On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada  wrote:
> >
> > On Mon, Mar 25, 2024 at 3:25 PM John Naylor  wrote:
> > >
> > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada  
> > > wrote:
>
> > > - * remaining LP_DEAD line pointers on the page in the dead_items
> > > - * array. These dead items include those pruned by lazy_scan_prune()
> > > - * as well we line pointers previously marked LP_DEAD.
> > > + * remaining LP_DEAD line pointers on the page in the dead_items.
> > > + * These dead items include those pruned by lazy_scan_prune() as well
> > > + * we line pointers previously marked LP_DEAD.
> > >
> > > Here maybe "into dead_items".
>
> - * remaining LP_DEAD line pointers on the page in the dead_items.
> + * remaining LP_DEAD line pointers on the page into the dead_items.
>
> Let me explain. It used to be "in the dead_items array." It is not an
> array anymore, so it was changed to "in the dead_items". dead_items is
> a variable name, and names don't take "the". "into dead_items" seems
> most natural to me, but there are other possible phrasings.

Thanks for the explanation. I was distracted. Fixed in the latest patch.

>
> > > > > Did you try it with 1MB m_w_m?
> > > >
> > > > I've incorporated the above comments and test results look good to me.
> > >
> > > Could you be more specific about what the test was?
> > > Does it work with 1MB m_w_m?
> >
> > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB.
> >
> > FYI other test cases I tested were:
> >
> > * m_w_m = 2199023254528 (maximum value)
> > initial: 1MB
> > max: 128GB
> >
> > * m_w_m = 64MB (default)
> > initial: 1MB
> > max: 8MB
>
> If the test was a vacuum, how big a table was needed to hit 128GB?

I just checked how TidStoreCreateLocal() calculates the initial and
max segment sizes while changing m_w_m, so I didn't check how big the
segments actually allocated are in the maximum-value test case.

>
> > > The existing comment slipped past my radar, but max_bytes is not a
> > > limit, it's a hint. Come to think of it, it never was a limit in the
> > > normal sense, but in earlier patches it was the criteria for reporting
> > > "I'm full" when asked.
> >
> > Updated the comment.
>
> + * max_bytes is not a limit; it's used to choose the memory block sizes of
> + * a memory context for TID storage in order for the total memory consumption
> + * not to be overshot a lot. The caller can use the max_bytes as the criteria
> + * for reporting whether it's full or not.
>
> This is good information. I suggest this edit:
>
> "max_bytes" is not an internally-enforced limit; it is used only as a
> hint to cap the memory block size of the memory context for TID
> storage. This reduces space wastage due to over-allocation. If the
> caller wants to monitor memory usage, it must compare its limit with
> the value reported by TidStoreMemoryUsage().
>
> Other comments:

Thanks for the suggestion!

>
> v79-0002 looks good to me.
>
> v79-0003:
>
> "With this commit, when creating a shared TidStore, a dedicated DSA
> area is created for TID storage instead of using the provided DSA
> area."
>
> This is very subtle, but "the provided..." implies there still is one.
> -> "a provided..."
>
> + * Similar to TidStoreCreateLocal() but create a shared TidStore on a
> + * DSA area. The TID storage will live in the DSA area, and a memory
> + * context rt_context will have only meta data of the radix tree.
>
> -> "the memory context"

Fixed in the latest patch.

>
> I think you can go ahead and commit 0002 and 0003/4.

I've pushed the 0002 (dsa init and max segment size) patch, and will
push the attached 0001 patch next.

>
> v79-0005:
>
> - bypass = (vacrel->lpdead_item_pages < threshold &&
> -   vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
> + bypass = (vacrel->lpdead_item_pages < threshold) &&
> + TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);
>
> The parentheses look strange, and the first line shouldn't change
> without a good reason.

Fixed.
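
For reference, the shape being asked for is roughly the following,
keeping the original first line and dropping the extra parentheses (a
sketch, not the committed hunk):

    bypass = (vacrel->lpdead_item_pages < threshold &&
              TidStoreMemoryUsage(vacrel->dead_items) < 32L * 1024L * 1024L);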

>
> - /* Set dead_items space */
> - dead_items = (VacDeadItems *) shm_toc_lookup(toc,
> - PARALLEL_VACUUM_KEY_DEAD_ITEMS,
> - false);
> + /* Set dead items */
> + dead_items = TidStoreAttach(shared->dead_items_dsa_handle,
> + shared->dead_items_handle);
>
> I feel ambivalent about this comment change. The original is not very
> descriptive to begin with. If we need to change at all, maybe "find
> dead_items in shared memory"?

Agreed.

>
> v79-0005: As I said earlier, Dilip Kumar reviewed an earlier version.
>
> v79-0006:
>
> vac_work_mem should also go back to being an int.

Fixed.

I've attached the latest patches.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v80-0001-Rethink-create-and-attach-APIs-of-shared-TidStor.patch
Description: Binary data


v80-0002-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch
Description: Binary data


Re: Add new error_action COPY ON_ERROR "log"

2024-03-26 Thread Masahiko Sawada
On Tue, Mar 26, 2024 at 6:36 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 1:46 PM Masahiko Sawada  wrote:
> >
> > Thank you for updating the patch! Looks good to me.
> >
> > Please find the attached patch. I've made some changes for the
> > documentation and the commit message. I'll push it, barring any
> > objections.
>
> Thanks. v12 patch LGTM.
>

While testing the new option, I realized that tab-completion completes
the DEFAULT value, but it doesn't work without single quotes:

postgres(1:2179134)=# copy test from '/tmp/dump.data' with
(log_verbosity default );
ERROR:  syntax error at or near "default"
LINE 1: ...py test from '/tmp/dump.data' with (log_verbosity default );
 ^
postgres(1:2179134)=# copy test from '/tmp/dump.data' with
(log_verbosity 'default' );
COPY 0

Whereas VERBOSE works even without single-quotes:

postgres(1:2179134)=# copy test from '/tmp/dump.data' with
(log_verbosity verbose );
COPY 0

postgres(1:2179134)=# copy test from '/tmp/dump.data' with
(log_verbosity 'verbose' );
COPY 0

This could confuse users. It happens because DEFAULT is a reserved
keyword and the COPY option syntax doesn't accept it as an option value.

I think that there are two options to handle it:

1. change COPY grammar to accept DEFAULT as an option value.
2. change tab-completion to complete 'DEFAULT' (quoted) instead of
DEFAULT, and update the doc too.

As for the documentation, we can add single-quotes as follows:

 ENCODING 'encoding_name'
+LOG_VERBOSITY [ 'mode' ]

I thought option (2) was better, but there seems to be no precedent for
completing a single-quoted string other than file names. So option (1)
could be clearer.
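
For concreteness, option (2) might look roughly like this in
tab-complete.c (a hypothetical sketch; the Matches() pattern is an
assumption, I only rely on the existing COMPLETE_WITH machinery):

    /* hypothetical: offer a quoted 'default' so the completed text parses */
    else if (Matches("COPY|\\copy", MatchAny, "FROM|TO", MatchAny,
                     "WITH", "(", "LOG_VERBOSITY"))
        COMPLETE_WITH("'default'", "verbose");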

What do you think?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-03-26 Thread Masahiko Sawada
On Tue, Mar 26, 2024 at 3:04 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 9:56 AM Masahiko Sawada  wrote:
> >
> > > > errmsg("data type incompatibility at line %llu for column %s: \"%s\"",
> >
> > > > I guess it would be better to make the log message clearer to convey
> > > > what we did for the malformed row. For example, how about something
> > > > like "skipping row due to data type incompatibility at line %llu for
> > > > column %s: \"s\""?
> > >
> > > The summary message which gets printed at the end says that "NOTICE:
> > > 6 rows were skipped due to data type incompatibility". Isn't this
> > > enough? If someone is using ON_ERROR 'ignore', it's quite natural that
> > > such rows get skipped softly and the summary message can help them,
> > > no?
> >
> > I think that in the main log message we should mention what happened
> > (or is happening) or what we did (or are doing). If the message "data
> > type incompatibility ..." was in the DETAIL message with the main
> > message saying something like "skipping row at line %llu for column
> > %s: ...", it would make sense to me. But the current message seems not
> > to be clear to me and consistent with other NOTICE messages. Also, the
> > last summary line would not be written if the user cancelled, and
> > someone other than person who used ON_ERROR 'ignore' might check the
> > server logs later.
>
> Agree. I changed the NOTICE message to what you've suggested. Thanks.
>

Thank you for updating the patch! Looks good to me.

Please find the attached patch. I've made some changes to the
documentation and the commit message. I'll push it, barring any
objections.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v12-0001-Add-new-COPY-option-LOG_VERBOSITY.patch
Description: Binary data


Re: Improve eviction algorithm in ReorderBuffer

2024-03-25 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  wrote:
>
>
> I've attached new version patches.

Since the previous patches conflict with the current HEAD, I've
attached the rebased patches.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v10-0001-Make-binaryheap-enlargeable.patch
Description: Binary data


v10-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch
Description: Binary data


v10-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch
Description: Binary data


Re: Add new error_action COPY ON_ERROR "log"

2024-03-25 Thread Masahiko Sawada
On Tue, Mar 26, 2024 at 12:23 PM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 7:16 AM Masahiko Sawada  wrote:
> >
> > > Please see the attached v9 patch set.
> >
> > Thank you for updating the patch! The patch mostly looks good to me.
> > Here are some minor comments:
>
> Thanks for looking into this.
>
> > ---
> >  /* non-export function prototypes */
> > -static char *limit_printout_length(const char *str);
> > -
> > static void ClosePipeFromProgram(CopyFromState cstate);
> >
> > Now that we have only one function we should replace "prototypes" with
> > "prototype".
>
> Well no. We might add a few more (never know). A quick look around the
> GUCs under /* GUCs */ tells me that plural form there is being used
> even just one GUC is defined (xlogprefetcher.c for instance).

Understood.

>
> > ---
> > +ereport(NOTICE,
> > +
> > errmsg("data type incompatibility at line %llu for column %s: \"%s\"",
> > +
> >  (unsigned long long) cstate->cur_lineno,
> > +
> >  cstate->cur_attname,
> > +
> >  attval));
> >
> > I guess it would be better to make the log message clearer to convey
> > what we did for the malformed row. For example, how about something
> > like "skipping row due to data type incompatibility at line %llu for
> > column %s: \"s\""?
>
> The summary message which gets printed at the end says that "NOTICE:
> 6 rows were skipped due to data type incompatibility". Isn't this
> enough? If someone is using ON_ERROR 'ignore', it's quite natural that
> such rows get skipped softly and the summary message can help them,
> no?

I think that the main log message should mention what happened (or is
happening) or what we did (or are doing). If the "data type
incompatibility ..." text were in the DETAIL message, with the main
message saying something like "skipping row at line %llu for column
%s: ...", it would make sense to me. But the current message doesn't
seem clear to me or consistent with other NOTICE messages. Also, the
last summary line would not be written if the user cancelled, and
someone other than the person who used ON_ERROR 'ignore' might check
the server logs later.
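
To be concrete, one possible shape of that split (a sketch, not a
polished patch):

    ereport(NOTICE,
            errmsg("skipping row at line %llu for column %s",
                   (unsigned long long) cstate->cur_lineno,
                   cstate->cur_attname),
            errdetail("Data type incompatibility: \"%s\".", attval));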

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Add new error_action COPY ON_ERROR "log"

2024-03-25 Thread Masahiko Sawada
On Mon, Mar 25, 2024 at 8:21 PM Bharath Rupireddy
 wrote:
>
> On Mon, Mar 25, 2024 at 10:42 AM Masahiko Sawada  
> wrote:
> >
> > The current approach, eliminating the duplicated information in
> > CONTEXT, seems good to me.
>
> Thanks for looking into it.
>
> > One question about the latest (v8) patch:
> >
> > +   else
> > +   ereport(NOTICE,
> > +   errmsg("data type incompatibility at
> > line %llu for column %s: null input",
> > +  (unsigned long long) 
> > cstate->cur_lineno,
> > +  cstate->cur_attname));
> > +
> >
> > How can we reach this path? It seems we don't cover this path by the tests.
>
> Tests don't cover that part, but it can be hit with something like
> [1]. I've added a test for this.
>
> Note the use of domain to provide an indirect way of providing null
> constraint check. Otherwise, COPY FROM fails early in
> CopyFrom->ExecConstraints if the NOT NULL constraint is directly
> provided next to the column in the table [2].
>
> Please see the attached v9 patch set.
>

Thank you for updating the patch! The patch mostly looks good to me.
Here are some minor comments:

---
 /* non-export function prototypes */
-static char *limit_printout_length(const char *str);
-
static void ClosePipeFromProgram(CopyFromState cstate);

Now that we have only one function we should replace "prototypes" with
"prototype".

---
+ereport(NOTICE,
+
errmsg("data type incompatibility at line %llu for column %s: \"%s\"",
+
 (unsigned long long) cstate->cur_lineno,
+
 cstate->cur_attname,
+
 attval));

I guess it would be better to make the log message clearer about what
we did with the malformed row. For example, how about something like
"skipping row due to data type incompatibility at line %llu for column
%s: \"%s\""?

---
 extern void CopyFromErrorCallback(void *arg);
+extern char *limit_printout_length(const char *str);

I don't disagree with exposing the limit_printout_length() function
but I think it's better to rename it for consistency with other
exposed COPY command functions. Only this function is snake-case. How
about CopyLimitPrintoutLength() or similar?
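
That is, something like the following in the header (the name is only a
suggestion):

    extern char *CopyLimitPrintoutLength(const char *str);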

FWIW, I'm going to merge the two patches before pushing.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



