Re: WaitForOlderSnapshots in DETACH PARTITION causes deadlocks

2023-07-25 Thread Michael Paquier
On Mon, Jul 24, 2023 at 07:40:04PM +, Imseih (AWS), Sami wrote:
> WaitForOlderSnapshots is used here to ensure that snapshots older than
> the start of the ALTER TABLE DETACH CONCURRENTLY are completely removed
> to guarantee consistency, however it does seem to cause deadlocks for at
> least RR/SERIALIZABLE transactions.
> 
> There are cases [2] in which certain operations are accepted as not being
> MVCC-safe, and now I am wondering if this is another such case. Deadlocks
> are not a good scenario, and WaitForOlderSnapshots does not appear to do
> anything for READ COMMITTED transactions.
> 
> So do we actually need WaitForOlderSnapshots in the FINALIZE code? And
> could it be acceptable for this operation to be marked as not MVCC-safe
> like the other aforementioned operations?

I guess that there is an argument for lifting that a bit.  Based on
the test case you are providing, a deadlock occurring between the
FINALIZE and a scan of the top-level partitioned table in a
transaction that began before the DETACH CONCURRENTLY does not seem
like the best answer to have from the user perspective.  I have to
wonder whether there is room to make the wait for older snapshots in
the finalize phase more robust, though, and actually make it wait
until the first transaction commits rather than fail because of a
deadlock like that.
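(For readers following along, the failing sequence can be sketched roughly
like this; the session layout and object names are hypothetical, and the
sketch assumes a cancelled DETACH CONCURRENTLY left the partition in a
pending state:)

```sql
-- Session 1: a REPEATABLE READ transaction takes its snapshot first.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM parent;

-- Session 2: complete the earlier, interrupted concurrent detach.
-- FINALIZE calls WaitForOlderSnapshots() and blocks on session 1's
-- snapshot while holding locks on the partition hierarchy.
ALTER TABLE parent DETACH PARTITION parent_p1 FINALIZE;

-- Session 1: now needs a lock that session 2 already holds, while
-- session 2 is waiting on session 1's snapshot -> deadlock.
SELECT count(*) FROM parent;
COMMIT;
```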

> Perhaps I am missing some important point here, so any feedback will be
> appreciated.

Adding Alvaro in CC as the author of 71f4c8c6 for input, FYI.
--
Michael


signature.asc
Description: PGP signature


Re: pg_rewind fails with in-place tablespace

2023-07-25 Thread Michael Paquier
On Wed, Jul 19, 2023 at 09:31:35PM +0800, 赵锐(惜元) wrote:
>  Recently I encountered an issue: pg_rewind fails when dealing with
>  in-place tablespace. The problem seems to be that pg_rewind is
>  treating in-place tablespace as symbolic link, while in fact it
>  should be treated as directory. 
>  Here is the output of pg_rewind:
> pg_rewind: error: file "pg_tblspc/16385" is of different type in
> source and target 
>  To help reproduce the failure, I have attached a tap test. And I am
>  pleased to say that I have also identified a solution for this
>  problem, which I have included in the patch. 
>  Thank you for your attention to this matter.

Issue reproduced here, and agreed that we'd better do something about
that.  I am not sure that your patch is right for the job, though;
I'll try to study it a bit more.
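(For context, an in-place tablespace is created with an empty LOCATION;
a sketch, assuming the allow_in_place_tablespaces developer GUC is
enabled:)

```sql
-- Developer GUC; in-place tablespaces are meant for testing only.
SET allow_in_place_tablespaces = true;

-- An empty LOCATION creates a real directory under pg_tblspc/<oid>
-- instead of a symbolic link, which is what pg_rewind reportedly
-- mishandles in this report.
CREATE TABLESPACE ts_inplace LOCATION '';
```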
--
Michael


signature.asc
Description: PGP signature


Retiring is_pushed_down

2023-07-25 Thread Richard Guo
When forming an outer join's joinrel, we have the is_pushed_down flag in
RestrictInfo nodes to distinguish those quals that are in that join's
JOIN/ON condition from those that were pushed down to the joinrel and
thus act as filter quals.  Since now we have the outer-join-aware-Var
infrastructure, I think we can check to see whether a qual clause's
required_relids reference the outer join(s) being formed, in order to
tell if it's a join or filter clause.  This seems like a more principled
way.  (Interestingly, optimizer/README actually describes this approach
in the section 'Relation Identification and Qual Clause Placement'.)
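As a quick illustration of the join-clause vs. filter-clause distinction
(a simplified example, not taken from the patch; tables a and b are made
up):

```sql
-- A qual written in the ON clause of an outer join is a join clause of
-- that join; it decides which rows pair up, not which rows survive:
SELECT * FROM a LEFT JOIN b ON a.i = b.i AND b.j > 0;

-- The same qual written in WHERE is evaluated above the join, i.e. it
-- is "pushed down" to the joinrel and acts as a filter clause (here,
-- being strict on b, it would also let the planner reduce the outer
-- join to an inner join):
SELECT * FROM a LEFT JOIN b ON a.i = b.i WHERE b.j > 0;
```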

So I gave it a try and attached a patch to retire is_pushed_down.  But
there are several points that may need more thought.

* When we form an outer join, it's possible that more than one outer
join relid is added to the join's relid set, if there are any pushed
down outer joins per identity 3.  And it's also possible that no outer
join relid is added, for an outer join that has been pushed down.  So
instead of checking if a qual clause's required_relids include the outer
join's relid, I think we should check if its required_relids overlap
the outer join relids that are being formed, which means that we should
use bms_overlap(rinfo->required_relids, ojrelids) rather than
bms_is_member(ojrelid, rinfo->required_relids).  And we should do this
check only for outer joins.

* This patch calculates the outer join relids that are being formed
generally in this way:

bms_difference(joinrelids, bms_union(outerrelids, innerrelids))

Of course this can only be used after the outer join relids have been
added by add_outer_joins_to_relids().  This calculation is performed
multiple times during planning, and I'm not sure whether that has
performance implications.  Maybe we could calculate it only once and
store the result somewhere (such as in JoinPath)?

* ANTI joins are treated as outer joins, but sometimes they do not have
an rtindex (such as ANTI joins derived from SEMI).  This is a problem
for the new check.  As an example, consider the query

select * from a where not exists (select 1 from b where a.i = b.i) and
CURRENT_USER = SESSION_USER;

The pseudoconstant clause 'CURRENT_USER = SESSION_USER' is supposed to
be treated as a filter clause but the new check would treat it as a join
clause because the outer join relid set being formed is empty since the
ANTI join here does not have an rtindex.  To solve this problem, this
patch manually adds an RTE for an ANTI join derived from SEMI.

Any thoughts?

Thanks
Richard


v1-0001-Retiring-is_pushed_down.patch
Description: Binary data


Re: WAL Insertion Lock Improvements

2023-07-25 Thread Michael Paquier
On Sat, Jul 22, 2023 at 01:08:49PM +0530, Bharath Rupireddy wrote:
> Yes, it looks safe to me too.

0001 has been now applied.  I have done more tests while looking at
this patch since yesterday and was surprised to see higher TPS numbers
on HEAD with the same tests as previously, and the patch was still
shining with more than 256 clients.

> FWIW, 0001 essentially implements what
> an existing TODO comment introduced by commit 008608b9d5106 says:

We really need to do something in terms of documentation with
something like 0002, so I'll try to look at that next.  Regarding
0003, I don't know.  I think that we'd better look more into cases
where it shows actual benefits for specific workloads (like workloads
with a fixed rate of read and/or write operations?).
--
Michael


signature.asc
Description: PGP signature


Re: doc: improve the restriction description of using indexes on REPLICA IDENTITY FULL table.

2023-07-25 Thread Masahiko Sawada
On Mon, Jul 24, 2023 at 12:05 PM Amit Kapila  wrote:
>
> On Mon, Jul 24, 2023 at 6:39 AM Masahiko Sawada  wrote:
> >
> > On Sat, Jul 22, 2023 at 7:32 PM Amit Kapila  wrote:
> > >
> > >
> > > You have moved most of the comments related to the restriction of
> > > which index can be picked atop IsIndexUsableForReplicaIdentityFull().
> > > Now, the comments related to limitations atop
> > > FindUsableIndexForReplicaIdentityFull() look slightly odd, as they refer
> > > to limitations that were not stated. The comments I am
> > > referring to are: "Note that the limitations of index scans for
> > > replica identity full only  might not be a good idea in some
> > > cases". Shall we move these as well atop
> > > IsIndexUsableForReplicaIdentityFull()?
> >
> > Good point.
> >
> > Looking at neighbor comments, the following comment looks slightly odd to 
> > me:
> >
> >  * XXX: See IsIndexUsableForReplicaIdentityFull() to know the challenges in
> >  * supporting indexes other than btree and hash. For partial indexes, the
> >  * required changes are likely to be larger. If none of the tuples satisfy
> >  * the expression for the index scan, we fall-back to sequential execution,
> >  * which might not be a good idea in some cases.
> >
> > Are the first and second sentences related actually?
> >
>
> Not really.
>
> > I think we can move it as well to
> > IsIndexUsableForReplicaIdentityFull() with some adjustments. I've
> > attached the updated patch that incorporated your comment and included
> > my idea to update the comment.
> >
>
> LGTM.

Pushed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Add loongarch native checksum implementation.

2023-07-25 Thread huchangqi


 Start of forwarded message 
From: huchangqi 
To: YANG Xudong 
Subject: Re: [PATCH] Add loongarch native checksum implementation.
Date: Tue, 25 Jul 2023 15:51:43 +0800

Both cisticola and nuthatch are on the buildfarm now.

cisticola is "old world ABI".
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=cisticola&br=HEAD

nuthatch is "new world ABI".
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=nuthatch&br=HEAD

--
Best regards,
huchangqi

YANG Xudong  writes:

> On 2023/7/6 15:14, huchangqi wrote:
>> Hi, I have a loongarch machine running Loongnix-server (based on
>> redhat/centos, with gcc-8.3 on it). I am trying to run the buildfarm
>> client on it; when I edit build-farm.conf it seems to need an animal
>> name and secret to connect to the buildfarm server.
>> I went to https://buildfarm.postgresql.org/cgi-bin/register-form.pl and
>> registered a buildfarm animal a week ago, but didn't receive any
>> response. So what should I do next?
>> 
>> 
>
> Would it be possible for Loongson to also provide a buildfarm instance
> for the new world ABI of loongarch? It would be really appreciated.
>
> Thanks!
>
>> 
>>> -Original Messages-
>>> From: "YANG Xudong" 
>>> Send time:Wednesday, 07/05/2023 10:15:51
>>> To: "John Naylor" 
>>> Cc: pgsql-hackers@lists.postgresql.org, wengyanq...@ymatrix.cn, 
>>> wang...@ymatrix.cn
>>> Subject: Re: [PATCH] Add loongarch native checksum implementation.
>>>
>>> Is there any other comment?
>>>
>>> If the patch looks OK, I would like to update its status to ready for
>>> committer in the commitfest.
>>>
>>> Thanks!
>>>
>>> On 2023/6/16 09:28, YANG Xudong wrote:
 Updated the patch based on the comments.

 On 2023/6/15 18:30, John Naylor wrote:
>
> On Wed, Jun 14, 2023 at 9:20 AM YANG Xudong  > wrote:
>   >
>   > Attached a new patch with fixes based on the comment below.
>
> Note: It's helpful to pass "-v" to git format-patch, to have different
> versions.
>

 Added v2

>   > > For x86 and Arm, if it fails to link without an -march flag, we
> allow
>   > > for a runtime check. The flags "-march=armv8-a+crc" and
> "-msse4.2" are
>   > > for instructions not found on all platforms. The patch also
> checks both
>   > > ways, and each one results in "Use LoongArch CRC instruction
>   > > unconditionally". The -march flag here is general, not specific. In
>   > > other words, if this only runs inside "+elif host_cpu ==
> 'loongarch64'",
>   > > why do we need both with -march and without?
>   > >
>   >
>   > Removed the elif branch.
>
> Okay, since we've confirmed that no arch flag is necessary, some other
> places can be simplified:
>
> --- a/src/port/Makefile
> +++ b/src/port/Makefile
> @@ -98,6 +98,11 @@ pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
>    pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
>    pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
>
> +# all versions of pg_crc32c_loongarch.o need CFLAGS_CRC
> +pg_crc32c_loongarch.o: CFLAGS+=$(CFLAGS_CRC)
> +pg_crc32c_loongarch_shlib.o: CFLAGS+=$(CFLAGS_CRC)
> +pg_crc32c_loongarch_srv.o: CFLAGS+=$(CFLAGS_CRC)
>
> This was copy-and-pasted from platforms that use a runtime check, so
> should be unnecessary.
>

 Removed these lines.

> +# If the intrinsics are supported, sets
> pgac_loongarch_crc32c_intrinsics,
> +# and CFLAGS_CRC.
>
> +# Check if __builtin_loongarch_crcc_* intrinsics can be used
> +# with the default compiler flags.
> +# CFLAGS_CRC is set if the extra flag is required.
>
> Same here -- it seems we don't need to set CFLAGS_CRC at all. Can you
> confirm?
>

 We don't need to set CFLAGS_CRC as commented. I have updated the
 configure script to make it align with the logic in meson build script.

>   > > Also, I don't have a Loongarch machine for testing. Could you
> show that
>   > > the instructions are found in the binary, maybe using objdump and
> grep?
>   > > Or a performance test?
>   > >
>   >
>   > The output of the objdump command `objdump -dS
>   > ../postgres-build/tmp_install/usr/local/pgsql/bin/postgres  | grep
> -B 30
>   > -A 10 crcc` is attached.
>
> Thanks for confirming.
>
> -- 
> John Naylor
> EDB: http://www.enterprisedb.com 
>>>
>> 
>> 

A failure in 031_recovery_conflict.pl on Cirrus CI Windows Server

2023-07-25 Thread Bharath Rupireddy
Hi,

I've observed the following failure once in one of my Cirrus CI runs
on Windows Server on HEAD:

timed out waiting for match: (?^:User was holding shared buffer pin
for too long) at
C:/cirrus/src/test/recovery/t/031_recovery_conflict.pl line 318.
# Postmaster PID for node "primary" is 696

https://api.cirrus-ci.com/v1/artifact/task/5272062399348736/testrun/build/testrun/recovery/031_recovery_conflict/log/regress_log_031_recovery_conflict

https://github.com/BRupireddy/postgres/runs/15296698158

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [BUG] Crash on pgbench initialization.

2023-07-25 Thread Anton A. Melnikov

On 25.07.2023 06:24, Andres Freund wrote:

Thanks Anton / Victoria for the report and fix. Pushed.


Thanks!
Have a nice day!


--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: PATCH: Add REINDEX tag to event triggers

2023-07-25 Thread jian he
On Fri, Jul 21, 2023 at 12:47 PM Garrett Thornburg  wrote:
>
> Added my v1 patch to add REINDEX to event triggers.
>
> I originally built this against pg15 but rebased to master for the patch to 
> hopefully make it easier for maintainers to merge. The rebase was automatic 
> so it should be easy to include into any recent version. I'd love for this to 
> land in pg16 if possible.
>
> I added regression tests and they are passing. I also formatted the code 
> using the project tools. Docs are included too.
>
> A few notes:
> 1. I tried to not touch the function parameters too much and opted to create 
> a convenience function that makes it easy to attach index or table OIDs to 
> the current event trigger list.
> 2. I made sure both concurrent and regular reindex of index, table, database 
> work as expected.
> 3. The ddl end command will make available all indexes that were modified by 
> the reindex command. This is a large list when you run "reindex database" but 
> works really well. I debated returning records of the table or DB that were 
> reindexed but that doesn't really make sense. Returning each index fits my 
> use case of building an audit record around the index lifecycle (hence the 
> motivation for the patch).
>
> Thanks for your time,
>
> Garrett Thornburg


main/postgres/src/backend/tcop/utility.c
535: /*
536:  * standard_ProcessUtility itself deals only with utility commands for
537:  * which we do not provide event trigger support.  Commands that do have
538:  * such support are passed down to ProcessUtilitySlow, which contains the
539:  * necessary infrastructure for such triggers.
540:  *
541:  * This division is not just for performance: it's critical that the
542:  * event trigger code not be invoked when doing START TRANSACTION for
543:  * example, because we might need to refresh the event trigger cache,
544:  * which requires being in a valid transaction.
545:  */

So T_ReindexStmt should only be handled in ProcessUtilitySlow if you
want to create an event trigger on REINDEX?

The regression tests work fine. I even played with partitions.
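(With the patch applied, usage would presumably look something like the
sketch below; the 'REINDEX' tag is only valid with the patch, and the
function and trigger names are made up:)

```sql
-- Audit function: each index touched by a REINDEX is expected to be
-- reported by pg_event_trigger_ddl_commands() at ddl_command_end.
CREATE FUNCTION log_reindex() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    obj record;
BEGIN
    FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
        RAISE NOTICE 'reindex touched: %', obj.object_identity;
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER audit_reindex ON ddl_command_end
    WHEN TAG IN ('REINDEX')
    EXECUTE FUNCTION log_reindex();
```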




Partition pruning not working on updates

2023-07-25 Thread Mr.Bim
Hello,
Partition pruning is not working on an UPDATE query -- am I missing
something? I am able to get around this by manually pinning the table
partition to the date partition, but that is a manual process.
PostgreSQL 13.8 on AWS.
Partitioned table using pg_partman v4.5.1. Each daily partition contains
about 15 million rows, which need to be processed using an UPDATE on the
column p_state.
Table: public.stat

                              Partitioned table "public.stat"
       Column        |            Type             | Collation | Nullable |         Default
---------------------+-----------------------------+-----------+----------+------------------------
 creationdate        | timestamp without time zone |           | not null |
 ver                 | text                        |           |          |
 id                  | text                        |           |          |
 name                | text                        |           |          |
 ip                  | text                        |           | not null |
 tags                | jsonb                       |           |          |
 list                | jsonb                       |           |          |
 updated_tstamp      | timestamp without time zone |           |          |
 p_state             | character varying           |           |          | 'N'::character varying
 insert_tstamp       | timestamp without time zone |           |          | CURRENT_TIMESTAMP
Partition key: RANGE (creationdate)
Indexes:
    "pk_public.stat" PRIMARY KEY, btree (creationdate, ip)
    "idx_public.p_state" btree (p_state)
Number of partitions: 47 (Use \d+ to list them.)

Query:

explain WITH update_pr as (SELECT * FROM public.stat WHERE p_state = 'N'
AND creationdate > current_timestamp - INTERVAL '5' day ORDER BY creationdate
LIMIT 400) UPDATE public.stat s set p_state = 'PR' FROM update_pr u WHERE
s.p_state = 'N' AND s.ip = u.ip AND s.creationdate = u.creationdate;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Update on public.stat s  (cost=11.45..193994.25 rows=47 width=858)
   Update on public.stat_p2023_06_25 s_1
   Update on public.stat_p2023_06_26 s_2
   ...
   Update on public.stat_p2023_08_09 s_46
   Update on public.stat_default s_47
   ->  Merge Join  (cost=11.45..4128.23 rows=1 width=966)
         Merge Cond: (u.creationdate = s_1.creationdate)
         Join Filter: (s_1.ip = u.ip)
         ->  Subquery Scan on u  (cost=11.30..4036.97 rows=34432 width=322)
               ->  Limit  (cost=11.30..3692.65 rows=34432 width=272)
                     ->  Merge Append  (cost=11.30..3692.65 rows=34432 width=272)
                           Sort Key: public.stat.creationdate
                           Subplans Removed: 29
                           ->  Index Scan using public.stat_p2023_07_24_pkey on public.stat_p2023_07_24 public.stat_1  (cost=0.29..1474.75 rows=23641 width=272)
                                 Index Cond: (creationdate > (CURRENT_TIMESTAMP - '1 day'::interval day))
                                 Filter: ((p_state)::text = 'N'::text)
                           ->  Index Scan using public.stat_p2023_07_25_pkey on public.stat_p2023_07_25 public.stat_2  (cost=0.29..673.21 rows=10746 width=273)
                                 Index Cond: (creationdate > (CURRENT_TIMESTAMP - '1 day'::interval day))
                                 Filter: ((p_state)::text = 'N'::text)
                           ->  Index Scan using public.stat_p2023_07_26_pkey on public.stat_p2023_07_26 public.stat_3  (cost=0.15..11.89 rows=1 width=664)
 ..
                           ->  Index Scan using public.stat_p2023_08_02_pkey on public.stat_p2023_08_02 public.stat_10  (cost=0.15..11.89 rows=1 width=664)
                                 Index Cond: (creationdate > (CURRENT_TIMESTAMP - '1 day'::interval day))
                                 Filter: ((p_state)::text = 'N'::text)
                           ->  Index Scan using public.stat_p2023_08_03_pkey on public.stat_p2023_08_03 public.stat_11  (cost=0.15..11.89 rows=1 width=664)
                                 Index Cond: (creationdate > (CURRENT_TIMESTAMP - '1 day'::interval day))
                                 Filter: ((p_state)::text = 'N'::text)
                           ->  Index Scan using public.stat_p2023_08_04_pkey on public.stat_p2023_08_04 public.stat_12  (cost=0.15..11.89 rows=1 width=664)
                                 Index Cond: (creationdate > (CURRENT_TIMESTAMP - '1 day'::interval day))
                                 Filter: ((p_state)::text = 'N'::text)
                           ->  Index Scan using public.stat_p2023_08_05_pkey on public.stat_p2023_08_05 public.stat_13  (
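One workaround sometimes suggested for this shape of query is to repeat
the date restriction on the UPDATE target itself, so the planner has a
prunable qual on that side too (a hedged sketch, untested against the
poster's schema):

```sql
EXPLAIN
WITH update_pr AS (
    SELECT creationdate, ip
    FROM public.stat
    WHERE p_state = 'N'
      AND creationdate > current_timestamp - INTERVAL '5' day
    ORDER BY creationdate
    LIMIT 400
)
UPDATE public.stat s
SET p_state = 'PR'
FROM update_pr u
WHERE s.p_state = 'N'
  AND s.ip = u.ip
  AND s.creationdate = u.creationdate
  -- Repeating the range qual on the target table gives the planner a
  -- qual it can use to prune partitions on the UPDATE side as well.
  AND s.creationdate > current_timestamp - INTERVAL '5' day;
```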

Logical walsenders don't process XLOG_CHECKPOINT_SHUTDOWN

2023-07-25 Thread Amit Kapila
Currently, we don't perform $SUBJECT at the time of shutdown of the
server. I think this currently has only a minor impact: after a restart,
subscribers will ask to start processing from before the
XLOG_CHECKPOINT_SHUTDOWN record, or after a switchover the old
publisher will have an extra WAL record. However, if we want to
support the upgrade of the publisher node such that the existing slots
are copied/created into a new cluster, we need to ensure that all the
changes generated on the publisher must be sent and applied to the
subscriber. This is a hard requirement because after the upgrade we
reset the WAL and if some of the WAL has not been sent then that will
be lost. Now, even a clean shutdown of the publisher node can't ensure
that all the WAL has been sent because it is quite possible that the
subscriber node is down due to which at shutdown time walsenders won't
be available to send the data. Similarly, there could be some logical
slots created via backend which may not have processed all the data
and we can't copy those slots as it is during the upgrade.

To ensure that all the data has been sent during the upgrade, we can
ensure that each logical slot's confirmed_flush_lsn (position in the
WAL till which subscriber has confirmed that it has applied the WAL)
is the same as current_wal_insert_lsn. Now, because we don't send
XLOG_CHECKPOINT_SHUTDOWN even on clean shutdown, confirmed_flush_lsn
will never be the same as current_wal_insert_lsn. The one idea being
discussed in patch [1] (see 0003) is to ensure that each slot's LSN is
exactly one XLOG_CHECKPOINT_SHUTDOWN record behind, which has some
drawbacks: if we tomorrow add some other WAL record to the shutdown
checkpoint path, or the size of the record changes, we would need to
modify the corresponding code in pg_upgrade.

The other possibility is that we allow logical walsenders to process
XLOG_CHECKPOINT_SHUTDOWN before shutdown after which during the
upgrade confirmed_flush_lsn will be the same as
current_wal_insert_lsn. AFAICU, the primary reason that we don't allow
it is that we want to avoid writing any new WAL after the shutdown
checkpoint (to avoid any sort of PANIC as discussed in the thread [2])
which is possible during decoding due to hint bits but it doesn't seem
decoding of XLOG_CHECKPOINT_SHUTDOWN can lead to any hint bit updates.
It seems we made these changes as part of commit c6c3334364 [3]. Note
that even if we ensure that walsenders send all the WAL before
shutdown and bring the corresponding logical slots up to date so that
there is no pending data, it would still be possible that logical slots
created manually via backends won't have consumed all the WAL before
shutdown. I think those will be the responsibility of the users who
created them.

We can also provide some guidelines to users similar to what we have
on physical standby in pg_upgrade docs [4] (See: 9 Prepare for standby
server upgrades). Something like, before upgrading, verify that the
subscriber is caught up with the publisher by comparing the current
WAL position on the publisher and pg_stat_subscription.received_lsn on
the subscriber.
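Concretely, that verification could look something like this (a sketch;
subscription names are made up):

```sql
-- On the publisher: note the current WAL position.
SELECT pg_current_wal_insert_lsn();

-- On the subscriber: received_lsn should have caught up to (or passed)
-- the value noted above before starting the upgrade.
SELECT subname, received_lsn FROM pg_stat_subscription;
```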

Any better ideas or thoughts on the above?

[1] - 
https://www.postgresql.org/message-id/TYAPR01MB586619721863B7FFDAC4369FF550A%40TYAPR01MB5866.jpnprd01.prod.outlook.com
[2] - 
https://www.postgresql.org/message-id/CAHGQGwEsttg9P9LOOavoc9d6VB1zVmYgfBk%3DLjsk-UL9cEf-eA%40mail.gmail.com
[3] -
commit c6c333436491a292d56044ed6e167e2bdee015a2
Author: Andres Freund 
Date:   Mon Jun 5 18:53:41 2017 -0700

Prevent possibility of panics during shutdown checkpoint.
[4] - https://www.postgresql.org/docs/devel/pgupgrade.html

-- 
With Regards,
Amit Kapila.




Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-25 Thread Aleksander Alekseev
Hi Anthonin,

> I have a working prototype of a pg_tracing extension and wanted some feedback 
> on the design and architecture.

The patch looks very interesting, thanks for working on it and for
sharing. The facts that the patch doesn't change the core except for
two lines in instrument.{c,h} and that it uses a pull-based model:

> - Collect the spans through a new pg_tracing_spans() function output

... IMO were the right design decisions. The patch lacks the
documentation, but this is OK for a PoC.

I added the patch to the nearest CF [1]. Let's see what the rest of
the community thinks.

[1] https://commitfest.postgresql.org/44/4456/

-- 
Best regards,
Aleksander Alekseev




Re: logical decoding and replication of sequences, take 2

2023-07-25 Thread Amit Kapila
On Mon, Jul 24, 2023 at 4:22 PM Tomas Vondra
 wrote:
>
> On 7/24/23 12:40, Amit Kapila wrote:
> > On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> >  wrote:
> >>
> >> 0005, 0006 and 0007 are all related to the initial sequence sync. [3]
> >> resulted in 0007 and I think we need it. That leaves 0005 and 0006 to
> >> be reviewed in this response.
> >>
> >> I followed the discussion starting [1] till [2]. The second one
> >> mentions the interlock mechanism which has been implemented in 0005
> >> and 0006. While I don't have an objection to allowing LOCKing a
> >> sequence using the LOCK command, I am not sure whether it will
> >> actually work or is even needed.
> >>
> >> The problem described in [1] seems to be the same as the problem
> >> described in [2]. In both cases we see the sequence moving backwards
> >> during CATCHUP. At the end of catchup the sequence is in the right
> >> state in both the cases.
> >>
> >
> > I think we could see a backward sequence value even after the catchup
> > phase (after the sync worker has exited and/or the state of the rel is
> > marked as 'ready' in pg_subscription_rel). The point is that there is
> > no guarantee that we will process all the pending WAL before
> > considering the sequence state 'SYNCDONE' and/or 'READY'. For
> > example, after copy_sequence, I see values like:
> >
> > postgres=# select * from s;
> >  last_value | log_cnt | is_called
> > +-+---
> > 165 |   0 | t
> > (1 row)
> > postgres=# select nextval('s');
> >  nextval
> > -
> >  166
> > (1 row)
> > postgres=# select nextval('s');
> >  nextval
> > -
> >  167
> > (1 row)
> > postgres=# select currval('s');
> >  currval
> > -
> >  167
> > (1 row)
> >
> > Then during the catchup phase:
> > postgres=# select * from s;
> >  last_value | log_cnt | is_called
> > +-+---
> >  33 |   0 | t
> > (1 row)
> > postgres=# select * from s;
> >  last_value | log_cnt | is_called
> > +-+---
> >  66 |   0 | t
> > (1 row)
> >
> > postgres=# select * from pg_subscription_rel;
> >  srsubid | srrelid | srsubstate | srsublsn
> > -+-++---
> >16394 |   16390 | r  | 0/16374E8
> >16394 |   16393 | s  | 0/1637700
> > (2 rows)
> >
> > postgres=# select * from pg_subscription_rel;
> >  srsubid | srrelid | srsubstate | srsublsn
> > -+-++---
> >16394 |   16390 | r  | 0/16374E8
> >16394 |   16393 | r  | 0/1637700
> > (2 rows)
> >
> > Here Sequence relid id 16393. You can see sequence state is marked as ready.
> >
> > postgres=# select * from s;
> >  last_value | log_cnt | is_called
> > +-+---
> >  66 |   0 | t
> > (1 row)
> >
> > Even after that, see below the value of the sequence is still not
> > caught up. Later, when the apply worker processes all the WAL, the
> > sequence state will be caught up.
> >
> > postgres=# select * from s;
> >  last_value | log_cnt | is_called
> > +-+---
> > 165 |   0 | t
> > (1 row)
> >
> > So, there will be a window where the sequence won't be caught up for a
> > certain period of time and any usage of it (even after the sync is
> > finished) during that time could result in inconsistent behaviour.
> >
>
> I'm rather confused about which node these queries are executed on.
> Presumably some of it is on publisher, some on subscriber?
>

These are all on the subscriber.

> Can you create a reproducer (TAP test demonstrating this?) I guess it
> might require adding some sleeps to hit the right timing ...
>

I have used the debugger to reproduce this, as it needs quite some
coordination. I just wanted to see whether the sequence can go backward
and not be completely caught up before the sequence state is marked
'ready'. On the publisher side, I created a publication with a table
and a sequence. Then did the following steps:
SELECT nextval('s') FROM generate_series(1,50);
insert into t1 values(1);
SELECT nextval('s') FROM generate_series(51,150);

Then on the subscriber side with some debugging aid, I could find the
values in the sequence shown in the previous email. Sorry, I haven't
recorded each and every step but, if you think it helps, I can again
try to reproduce it and share the steps.

-- 
With Regards,
Amit Kapila.




Re: logicalrep_message_type throws an error

2023-07-25 Thread Amit Kapila
On Mon, Jul 24, 2023 at 6:14 PM Ashutosh Bapat
 wrote:
>
> On Sat, Jul 22, 2023 at 10:18 AM Amit Kapila  wrote:
>
> > Right. I'll push and backpatch this till 15 by Tuesday unless you guys
> > think otherwise.
>
> WFM.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: A failure in 031_recovery_conflict.pl on Cirrus CI Windows Server

2023-07-25 Thread Noah Misch
On Tue, Jul 25, 2023 at 01:56:41PM +0530, Bharath Rupireddy wrote:
> I've observed the following failure once in one of my Cirrus CI runs
> on Windows Server on HEAD:
> 
> timed out waiting for match: (?^:User was holding shared buffer pin
> for too long) at
> C:/cirrus/src/test/recovery/t/031_recovery_conflict.pl line 318.
> # Postmaster PID for node "primary" is 696
> 
> https://api.cirrus-ci.com/v1/artifact/task/5272062399348736/testrun/build/testrun/recovery/031_recovery_conflict/log/regress_log_031_recovery_conflict
> 
> https://github.com/BRupireddy/postgres/runs/15296698158

Known:
https://postgr.es/m/flat/20220409045515.35ypjzddp25v72ou%40alap3.anarazel.de
https://postgr.es/m/flat/CA+hUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO=ney...@mail.gmail.com

A good next step would be patch "Fix recovery conflict SIGUSR1 handling" from
https://postgr.es/m/ca+hukg+hi5p1j_8cveqllwnsvyjh4rntf04fywkenxftrh2...@mail.gmail.com
(near the end of that second thread).




Re: [PATCH] Check more invariants during syscache initialization

2023-07-25 Thread Aleksander Alekseev
Hi Michael,

> -   Assert(cacheinfo[cacheId].reloid != 0);
> +   Assert(cacheinfo[cacheId].reloid != InvalidOid);
> +   Assert(cacheinfo[cacheId].indoid != InvalidOid);
> No objections about checking for the index OID given out to catch
> any failures at an early stage before doing an actual lookup?  I guess
> that you've added an incorrect entry and noticed the problem only when
> triggering a syscache lookup for the new incorrect entry?
> +   Assert(key[i] != InvalidAttrNumber);
>
> Yeah, same here.

Not sure if I understand your question or suggestion. The proposed
patch adds Asserts() to InitCatalogCache() and to InitCatCache() called
by it, before any actual lookups.

> +   Assert((cacheinfo[cacheId].nkeys > 0) && (cacheinfo[cacheId].nkeys <= 
> 4));
>
> Is this addition actually useful?

I believe it is. One can mistakenly use 5+ keys in the cacheinfo[] declaration, e.g.:

```
diff --git a/src/backend/utils/cache/syscache.c
b/src/backend/utils/cache/syscache.c
index 4883f36a67..c7a8a5b8bb 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -159,6 +159,7 @@ static const struct cachedesc cacheinfo[] = {
KEY(Anum_pg_amop_amopfamily,
Anum_pg_amop_amoplefttype,
Anum_pg_amop_amoprighttype,
+   Anum_pg_amop_amoplefttype,
Anum_pg_amop_amopstrategy),
64
},
```

What happens in this case, at least with GCC, is that a warning is shown
but the code compiles fine:

```
[1032/1882] Compiling C object
src/backend/postgres_lib.a.p/utils_cache_syscache.c.o
src/include/catalog/pg_amop_d.h:30:35: warning: excess elements in
array initializer
   30 | #define Anum_pg_amop_amopstrategy 5
  |   ^
../src/backend/utils/cache/syscache.c:127:48: note: in definition of macro ‘KEY’
  127 | #define KEY(...) VA_ARGS_NARGS(__VA_ARGS__), { __VA_ARGS__ }
  |^~~
../src/backend/utils/cache/syscache.c:163:25: note: in expansion of
macro ‘Anum_pg_amop_amopstrategy’
  163 | Anum_pg_amop_amopstrategy),
  | ^
src/include/catalog/pg_amop_d.h:30:35: note: (near initialization for
‘cacheinfo[4].key’)
   30 | #define Anum_pg_amop_amopstrategy 5
  |   ^
../src/backend/utils/cache/syscache.c:127:48: note: in definition of macro ‘KEY’
  127 | #define KEY(...) VA_ARGS_NARGS(__VA_ARGS__), { __VA_ARGS__ }
  |^~~
../src/backend/utils/cache/syscache.c:163:25: note: in expansion of
macro ‘Anum_pg_amop_amopstrategy’
  163 | Anum_pg_amop_amopstrategy),
  | ^
```

All in all I'm not claiming that these proposed new Asserts() make a
huge difference. I merely noticed that InitCatalogCache checks only
cacheinfo[cacheId].reloid. Checking the rest of the values would be
helpful for the developers and will not impact the performance of the
release builds.
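
To make the idea concrete, here is a self-contained sketch of the checks
(plain integers stand in for Oid and AttrNumber, and a boolean predicate
stands in for the Assert() calls the patch adds to InitCatalogCache()):

```c
#include <assert.h>

/*
 * Simplified stand-in for the cachedesc entries in syscache.c.
 * 0 plays the role of InvalidOid / InvalidAttrNumber.
 */
typedef struct cachedesc_sketch
{
	unsigned int reloid;		/* OID of the catalog relation */
	unsigned int indoid;		/* OID of the lookup index */
	int			nkeys;			/* number of lookup keys */
	int			key[4];			/* attribute numbers of the keys */
} cachedesc_sketch;

/* Returns 1 if the entry passes all the proposed sanity checks. */
static int
cachedesc_is_valid(const cachedesc_sketch *desc)
{
	if (desc->reloid == 0 || desc->indoid == 0)
		return 0;
	if (desc->nkeys <= 0 || desc->nkeys > 4)
		return 0;
	for (int i = 0; i < desc->nkeys; i++)
		if (desc->key[i] == 0)
			return 0;
	return 1;
}
```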

-- 
Best regards,
Aleksander Alekseev




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-25 Thread David Rowley
On Tue, 25 Jul 2023 at 17:34, Andres Freund  wrote:
> prep:
> COPY (SELECT generate_series(1, 200) a, (random() * 10 - 5)::int 
> b, 3243423 c) TO '/tmp/lotsaints.copy';
> DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);
>
> benchmark:
> psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY lotsaints 
> FROM '/tmp/lotsaints.copy';") -t 15
>
> I disabled turbo mode, pinned the server to a single core of my Xeon Gold 
> 5215:
>
> HEAD:   812.690
>
> your patch: 821.354
>
> strtoint from 8692f6644e7:  824.543
>
> strtoint from 6b423ec677d^: 806.678

I'm surprised to see the imul version is faster. It's certainly not
what we found when working on 6b423ec67.

I've been fooling around a bit to try to learn what might be going on.
I wrote 2 patches:

1) pg_strtoint_imul.patch:  Effectively reverts 6b423ec67 and puts the
code how it likely would have looked had I not done that work at all.
(I've assumed that the hex, octal, binary parsing would have been
added using the overflow multiplication functions just as the base 10
parsing).

2) pg_strtoint_optimize.patch: I've made another pass over the
functions with the current overflow checks and optimized the digit
checking code so that it can be done in a single check like: if (digit
< 10).  This can be done by assigning the result of *ptr - '0' to an
unsigned char.  Anything less than '0' will wrap around and anything
above '9' will remain so.  I've also removed a few inefficiencies with
the isspace checking. We didn't need to do "while (*ptr &&
isspace(*ptr))".  It's fine just to do "while (isspace(*ptr))" since '\0'
isn't a space char.  I also got rid of the isxdigit call.  The
hexlookup array contains -1 for non-hex chars. We may as well check
the digit value is >= 0.
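
The digit trick in (2) can be illustrated with a self-contained snippet
(illustrative names, not the actual numutils.c code):

```c
#include <assert.h>

/*
 * Single-comparison digit check: subtracting '0' and storing the result
 * in an unsigned char makes any character below '0' wrap around to a
 * large value, so one (digit < 10) test rejects characters on both
 * sides of the '0'..'9' range at once.
 */
static int
decimal_digit_value(char c)
{
	unsigned char digit = (unsigned char) (c - '0');

	return digit < 10 ? (int) digit : -1;
}
```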

Here are the results I get using your test as quoted above:

master @ 62e9af4c63f + fix_COPY_DEFAULT.patch

latency average = 568.102 ms

master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch

latency average = 531.238 ms

master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch

latency average = 559.333 ms (surprisingly also faster on my machine)

The optimized version of the pg_strtoint functions wins over the imul
patch.  Could you test to see if this is also the case in your Xeon
machine?

> (when I say strtoint from, I did not replace the goto labels, so that part is
> unchanged and unrelated)

I didn't quite follow this.

I've not really studied the fix_COPY_DEFAULT.patch patch.  Is there a
reason to delay committing that?  It would be good to eliminate that
as a variable for the current performance regression.

David
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..a6e9d00727 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -104,10 +104,10 @@ static const int8 hexlookup[128] = {
  * pg_strtoint16() will throw ereport() upon bad input format or overflow;
  * while pg_strtoint16_safe() instead returns such complaints in *escontext,
  * if it's an ErrorSaveContext.
-*
- * NB: Accumulate input as an unsigned number, to deal with two's complement
+ *
+ * NB: Accumulate input as a negative number, to deal with two's complement
  * representation of the most negative number, which can't be represented as a
- * signed positive number.
+ * positive number.
  */
 int16
 pg_strtoint16(const char *s)
@@ -120,7 +120,7 @@ pg_strtoint16_safe(const char *s, Node *escontext)
 {
const char *ptr = s;
const char *firstdigit;
-   uint16  tmp = 0;
+   int16   tmp = 0;
boolneg = false;
 
/* skip leading spaces */
@@ -145,10 +145,11 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
if (isxdigit((unsigned char) *ptr))
{
-   if (unlikely(tmp > -(PG_INT16_MIN / 16)))
-   goto out_of_range;
+   int8 digit = hexlookup[(unsigned char) *ptr++];
 
-   tmp = tmp * 16 + hexlookup[(unsigned char) 
*ptr++];
+   if (unlikely(pg_mul_s16_overflow(tmp, 16, 
&tmp)) ||
+   unlikely(pg_sub_s16_overflow(tmp, 
digit, &tmp)))
+   goto out_of_range;
}
else if (*ptr == '_')
{
@@ -169,10 +170,11 @@ pg_strtoint16_safe(const char *s, Node *escontext)
{
if (*ptr >= '0' && *ptr <= '7')
{
-   if (unlikely(tmp > -(PG_INT16_MIN / 8)))
-   goto out_of_range;
+   int8 digit = (*ptr++ - '0');
 
-   tmp = tmp * 8 + (*ptr++ - 

Re: logical decoding and replication of sequences, take 2

2023-07-25 Thread Tomas Vondra
On 7/25/23 08:28, Amit Kapila wrote:
> On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra
>  wrote:
>>
>> On 7/24/23 12:40, Amit Kapila wrote:
>>> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
>>>  wrote:
>>>
>>> Even after that, see below the value of the sequence is still not
>>> caught up. Later, when the apply worker processes all the WAL, the
>>> sequence state will be caught up.
>>>
>>
>> And how is this different from what tablesync does for tables? For that
>> 'r' also does not mean it's fully caught up, IIRC. What matters is
>> whether the sequence since this moment can go back. And I don't think it
>> can, because that would require replaying changes from before we did
>> copy_sequence ...
>>
> 
> For sequences, it is quite possible that we replay WAL from before the
> copy_sequence whereas the same is not true for tables (w.r.t
> copy_table()). This is because for tables we have a kind of interlock
> w.r.t LSN returned via create_slot (say this value of LSN is LSN1),
> basically, the walsender corresponding to tablesync worker in
> publisher won't send any WAL before that LSN whereas the same is not
> true for sequences. Also, even if apply worker can receive WAL before
> copy_table, it won't apply that as that would be behind the LSN1 and
> the same is not true for sequences. So, for tables, we will never go
> back to a state before the copy_table() but for sequences, we can go
> back to a state before copy_sequence().
> 

Right. I think the important detail is that during sync we have three
important LSNs

- LSN1 where the slot is created
- LSN2 where the copy happens
- LSN3 where we consider the sync completed

For tables, LSN1 == LSN2, because the data is completed using the
snapshot from the temporary slot. And (LSN1 <= LSN3).

But for sequences, the copy happens after the slot creation, possibly
with (LSN1 < LSN2). And because LSN3 comes from the main subscription
(which may be a bit behind, for whatever reason), it may happen that

   (LSN1 < LSN3 < LSN2)

The sync ends at LSN3, but that means all sequence changes between
LSN3 and LSN2 will be applied "again", making the sequence go backwards.

IMHO the right fix is to make sure LSN3 >= LSN2 (for sequences).
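
To make the failure mode concrete, here is a toy model (assumed numeric
LSNs, nothing from the actual apply code): the copy reflects the state
as of LSN2, replay starts after LSN3, and with LSN3 < LSN2 the replayed
changes make the sequence visibly go backwards:

```c
#include <assert.h>

/* One logged sequence change: the sequence is *set* to the logged value. */
typedef struct seq_change
{
	int			lsn;
	long		value;
} seq_change;

/*
 * Returns 1 if replaying all changes with lsn > lsn3 on top of the
 * copied value ever moves the sequence backwards.
 */
static int
replay_goes_backwards(long copied_value, const seq_change *changes,
					  int nchanges, int lsn3)
{
	long		current = copied_value;

	for (int i = 0; i < nchanges; i++)
	{
		if (changes[i].lsn > lsn3)
		{
			if (changes[i].value < current)
				return 1;
			current = changes[i].value;
		}
	}
	return 0;
}
```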


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Partition pruning not working on updates

2023-07-25 Thread David Rowley
On Tue, 25 Jul 2023 at 20:45, Mr.Bim  wrote:
> Partition pruning is not working on the updates query, am I missing something?

In PG13, partition pruning for UPDATE and DELETE only works during
query planning. Because you're using CURRENT_TIMESTAMP, which is not an
immutable expression, it cannot be evaluated during query planning.
This means that the only possible way to prune partitions when using
CURRENT_TIMESTAMP is during query execution.  Unfortunately, execution
time pruning does not work for UPDATE/DELETE in PG13.  PG14 is the
first version to have that.

There's a note in the documentation [1] about this, which reads:

"Execution-time partition pruning currently only occurs for the Append
and MergeAppend node types. It is not yet implemented for the
ModifyTable node type, but that is likely to be changed in a future
release of PostgreSQL."

If you were never to embed that query in a view or use PREPAREd
statements, then you could replace CURRENT_TIMESTAMP with
'now'::timestamp.  'now' will be evaluated at parse time and that
means it'll be a known value to the planner.  However, unless you
somehow can be completely certain that this query will never be put
inside a view or used with prepared statements, then I'd highly
recommend not doing this. SQL tends to get copied and pasted, and one
day it might get pasted into the wrong place (e.g. a view), and that
could cause problems.

Example with a view:
postgres=# create view v_test as select 'now'::timestamp;
CREATE VIEW
postgres=# explain verbose select * from v_test;
 QUERY PLAN
-
 Result  (cost=0.00..0.01 rows=1 width=8)
   Output: '2023-07-26 00:17:32.370399'::timestamp without time zone
(2 rows)


postgres=# explain verbose select * from v_test;
 QUERY PLAN
-
 Result  (cost=0.00..0.01 rows=1 width=8)
   Output: '2023-07-26 00:17:32.370399'::timestamp without time zone
(2 rows)

Note that the view always just returns the time at which it was created.

My recommendation, if this is a show-stopper for you, would be to
consider using a newer version of PostgreSQL.

David

[1] https://www.postgresql.org/docs/13/ddl-partitioning.html




Re: Row pattern recognition

2023-07-25 Thread Tatsuo Ishii
Hi,

> diff --git a/src/test/regress/expected/rpr.out 
> b/src/test/regress/expected/rpr.out
> index 6bf8818911..f3fd22de2a 100644
> --- a/src/test/regress/expected/rpr.out
> +++ b/src/test/regress/expected/rpr.out
> @@ -230,6 +230,79 @@ SELECT company, tdate, price, rpr(price) OVER w FROM 
> stock
>   company2 | 07-10-2023 |  1300 | 
>  (20 rows)
>  
> +-- match everything
> +SELECT company, tdate, price, rpr(price) OVER w FROM stock
> + WINDOW w AS (
> + PARTITION BY company
> + ORDER BY tdate
> + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> + AFTER MATCH SKIP TO NEXT ROW

This seems to be the result with AFTER MATCH SKIP PAST LAST ROW.

> + INITIAL
> + PATTERN (A+)
> + DEFINE
> +  A AS TRUE
> +);
> + company  |   tdate| price | rpr 
> +--++---+-
> + company1 | 07-01-2023 |   100 | 100
> + company1 | 07-02-2023 |   200 |
> + company1 | 07-03-2023 |   150 |
> + company1 | 07-04-2023 |   140 |
> + company1 | 07-05-2023 |   150 |
> + company1 | 07-06-2023 |90 |
> + company1 | 07-07-2023 |   110 |
> + company1 | 07-08-2023 |   130 |
> + company1 | 07-09-2023 |   120 |
> + company1 | 07-10-2023 |   130 |
> + company2 | 07-01-2023 |50 |  50
> + company2 | 07-02-2023 |  2000 |
> + company2 | 07-03-2023 |  1500 |
> + company2 | 07-04-2023 |  1400 |
> + company2 | 07-05-2023 |  1500 |
> + company2 | 07-06-2023 |60 |
> + company2 | 07-07-2023 |  1100 |
> + company2 | 07-08-2023 |  1300 |
> + company2 | 07-09-2023 |  1200 |
> + company2 | 07-10-2023 |  1300 |
> +(20 rows)
> +
> +-- backtracking with reclassification of rows
> +SELECT company, tdate, price, rpr(price) OVER w FROM stock
> + WINDOW w AS (
> + PARTITION BY company
> + ORDER BY tdate
> + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> + AFTER MATCH SKIP TO NEXT ROW
> + INITIAL
> + PATTERN (A+ B+)
> + DEFINE
> +  A AS price > 100,
> +  B AS price > 100
> +);
> + company  |   tdate| price | rpr  
> +--++---+--
> + company1 | 07-01-2023 |   100 | 
> + company1 | 07-02-2023 |   200 |  200
> + company1 | 07-03-2023 |   150 | 
> + company1 | 07-04-2023 |   140 | 
> + company1 | 07-05-2023 |   150 | 
> + company1 | 07-06-2023 |90 | 
> + company1 | 07-07-2023 |   110 |  110
> + company1 | 07-08-2023 |   130 | 
> + company1 | 07-09-2023 |   120 | 
> + company1 | 07-10-2023 |   130 | 
> + company2 | 07-01-2023 |50 | 
> + company2 | 07-02-2023 |  2000 | 2000
> + company2 | 07-03-2023 |  1500 | 
> + company2 | 07-04-2023 |  1400 | 
> + company2 | 07-05-2023 |  1500 | 
> + company2 | 07-06-2023 |60 | 
> + company2 | 07-07-2023 |  1100 | 1100
> + company2 | 07-08-2023 |  1300 | 
> + company2 | 07-09-2023 |  1200 | 
> + company2 | 07-10-2023 |  1300 | 
> +(20 rows)
> +
>  --
>  -- Error cases
>  --
> diff --git a/src/test/regress/sql/rpr.sql b/src/test/regress/sql/rpr.sql
> index 951c9abfe9..f1cd0369f4 100644
> --- a/src/test/regress/sql/rpr.sql
> +++ b/src/test/regress/sql/rpr.sql
> @@ -94,6 +94,33 @@ SELECT company, tdate, price, rpr(price) OVER w FROM stock
>UPDOWN AS price > PREV(price) AND price > NEXT(price)
>  );
>  
> +-- match everything
> +SELECT company, tdate, price, rpr(price) OVER w FROM stock
> + WINDOW w AS (
> + PARTITION BY company
> + ORDER BY tdate
> + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> + AFTER MATCH SKIP TO NEXT ROW
> + INITIAL
> + PATTERN (A+)
> + DEFINE
> +  A AS TRUE
> +);
> +
> +-- backtracking with reclassification of rows
> +SELECT company, tdate, price, rpr(price) OVER w FROM stock
> + WINDOW w AS (
> + PARTITION BY company
> + ORDER BY tdate
> + ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> + AFTER MATCH SKIP TO NEXT ROW
> + INITIAL
> + PATTERN (A+ B+)
> + DEFINE
> +  A AS price > 100,
> +  B AS price > 100
> +);
> +
>  --
>  -- Error cases
>  --




Re: logical decoding and replication of sequences, take 2

2023-07-25 Thread Ashutosh Bapat
On Tue, Jul 25, 2023 at 5:29 PM Tomas Vondra
 wrote:
>
> Right. I think the important detail is that during sync we have three
> important LSNs
>
> - LSN1 where the slot is created
> - LSN2 where the copy happens
> - LSN3 where we consider the sync completed
>
> For tables, LSN1 == LSN2, because the data is completed using the
> snapshot from the temporary slot. And (LSN1 <= LSN3).
>
> But for sequences, the copy happens after the slot creation, possibly
> with (LSN1 < LSN2). And because LSN3 comes from the main subscription
> (which may be a bit behind, for whatever reason), it may happen that
>
>(LSN1 < LSN3 < LSN2)
>
> The the sync ends at LSN3, but that means all sequence changes between
> LSN3 and LSN2 will be applied "again" making the sequence go away.
>
> IMHO the right fix is to make sure LSN3 >= LSN2 (for sequences).

Earlier in this thread, an approach was proposed that uses the page LSN
(LSN2 above) to make sure that no change from before LSN2 is applied on
the subscriber. The approach was discussed in the emails around [1] and
later discarded without a clear reason. I think that approach has some merit.

[1] 
https://www.postgresql.org/message-id/flat/21c87ea8-86c9-80d6-bc78-9b95033ca00b%40enterprisedb.com#36bb9c7968b7af577dc080950761290d

-- 
Best Wishes,
Ashutosh Bapat




[PATCH] Small refactoring of inval.c and inval.h

2023-07-25 Thread Aleksander Alekseev
Hi,

The proposed patch is a small refactoring of inval.{c,h}:

"""
The "public functions" separator comment doesn't reflect reality anymore.
We could rearrange the order of the functions. However, this would
complicate
back-porting of the patches, thus removing the comment instead.

InvalidateSystemCachesExtended() is an internal function of inval.c. Make it
static.
"""

-- 
Best regards,
Aleksander Alekseev


v1-0001-Refactor-inval.c-and-inval.h.patch
Description: Binary data


Re: sslinfo extension - add notbefore and notafter timestamps

2023-07-25 Thread Daniel Gustafsson
> On 20 Jul 2023, at 17:24, Daniel Gustafsson  wrote:
> 
>> On 17 Jul 2023, at 20:26, Cary Huang  wrote:
> 
 Perhaps calling "tm2timestamp(&pgtm_time, 0, NULL, &ts)" without checking 
 the return code would be just fine. I see some other usages of 
 tm2timstamp() in other code areas also skip checking the return code.
>>> 
>>> I think we want to know about any failures, btu we can probably make it 
>>> into an
>>> elog() instead, as it should never fail.
>> 
>> Yes, sure. I have corrected the error message to elog(ERROR, "timestamp out 
>> of range") on a rare tm2timestamp() failure.
> 
> I went over this again and ended up pushing it along with a catversion bump.
> Due to a mistake in my testing I didn't however catch that it was using an API
> only present in OpenSSL 1.1.1 and higher, which caused buildfailures when 
> using
> older OpenSSL versions, so I ended up reverting it again (leaving certificate
> changes in place) to keep the buildfarm green.
> 
> Will look closer at an implementation which works across all supported 
> versions
> of OpenSSL when I have more time.

Finally had some time, and have made an updated version of the patch.

OpenSSL 1.0.2 doesn't expose a function for getting the timestamp, so the patch
instead resorts to the older trick of getting the timestamp by inspecting the
diff against the UNIX epoch.  When doing this, OpenSSL internally uses the same
function which was later exported in 1.1.1 for getting the timestamp.

The attached version passes ssl tests for me on 1.0.2 through OpenSSL Git HEAD.

--
Daniel Gustafsson



v7-0001-Add-notBefore-and-notAfter-to-SSL-cert-info-displ.patch
Description: Binary data


Avoid possible memory leak (src/common/rmtree.c)

2023-07-25 Thread Ranier Vilela
Hi,

Per Coverity.

The rmtree function can leak 64 bytes per call
when it can't open a directory.

patch attached.

best regards,
Ranier Vilela


0003-Avoid-possible-memory-leak-64-bytes-per-rmtree-call-.patch
Description: Binary data


Re: [PATCH] Small refactoring of inval.c and inval.h

2023-07-25 Thread Alvaro Herrera
On 2023-Jul-25, Aleksander Alekseev wrote:

> Hi,
> 
> The proposed patch is a small refactoring of inval.{c,h}:
> 
> """
> The "public functions" separator comment doesn't reflect reality anymore.
> We could rearrange the order of the functions. However, this would
> complicate
> back-porting of the patches, thus removing the comment instead.
> 
> InvalidateSystemCachesExtended() is an internal function of inval.c. Make 
> it
> static.
> """

Hmm, this *Extended function looks a bit funny, and I think it's because
it's part of a backpatched bugfix that didn't want to modify ABI.  If
we're modifying this code, maybe we should get rid of the shim, that is,
move the boolean argument to InvalidateSystemCaches() and get rid of the
*Extended() version altogether.

As for the /*--- public functions ---*/ comment, that one was just not
moved by b89e151054a0, which should have done so; but even at that
point, it had already been somewhat broken by the addition of
PrepareInvalidationState() (a static function) in 6cb4afff33ba below it.
If we really want to clean this up, we could move PrepareInvalidationState()
to just below RegisterSnapshotInvalidation() and move the "public
functions" header to just below it (so that it appears just before
InvalidateSystemCaches).

If we just remove the "public functions" comment, then we're still
inside a section that starts with the "private support functions"
comment, which seems even worse to me than the current situation.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Take out the book that your religion considers the right one for finding
the prayer that brings peace to your soul. Then reboot the computer
and see if it works" (Carlos Duclós)




Re: Avoid possible memory leak (src/common/rmtree.c)

2023-07-25 Thread Daniel Gustafsson
> On 25 Jul 2023, at 16:31, Ranier Vilela  wrote:

> rmtree function can leak 64 bytes per call, 
> when it can't open a directory.

Skimming the tree, there don't seem to be any callers which aren't exiting or
ereporting on failure, so the real-world impact seems low.  That being said,
silencing static analyzers could be reason enough to delay allocation.
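
For illustration, the "delay allocation" shape looks roughly like this
(hypothetical function, not the actual rmtree() code):

```c
#include <dirent.h>
#include <stdbool.h>
#include <stdlib.h>
#include <assert.h>

/*
 * Open the directory *before* allocating any per-directory state, so
 * the early-error return path has nothing to free.
 */
static bool
process_dir_delayed_alloc(const char *path)
{
	DIR		   *dir = opendir(path);
	char	  **filenames;

	if (dir == NULL)
	{
		/* nothing has been allocated yet, so nothing can leak here */
		return false;
	}

	/* allocate only once opendir() has succeeded */
	filenames = malloc(64 * sizeof(char *));
	if (filenames == NULL)
	{
		closedir(dir);
		return false;
	}

	/* ... read entries and process them ... */

	free(filenames);
	closedir(dir);
	return true;
}
```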

--
Daniel Gustafsson





Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-25 Thread Melih Mutlu
Hi,

Melih Mutlu , on Fri, 21 Jul 2023 at 12:47, wrote
the following:

> I did not realize the order is the same with .c files. Good to know. I'll
> fix it along with other comments.
>

Addressed the recent reviews and attached the updated patches.

Thanks,
-- 
Melih Mutlu
Microsoft


v22-0001-Refactor-to-split-Apply-and-Tablesync-Workers.patch
Description: Binary data


v22-0002-Reuse-Tablesync-Workers.patch
Description: Binary data


v22-0003-Reuse-connection-when-tablesync-workers-change-t.patch
Description: Binary data


Re: [PATCH] Small refactoring of inval.c and inval.h

2023-07-25 Thread Aleksander Alekseev
Hi Alvaro,

Thanks for your feedback.

> Hmm, this *Extended function looks a bit funny, and I think it's because
> it's part of a backpatched bugfix that didn't want to modify ABI.  If
> we're modifying this code, maybe we should get rid of the shim, that is,
> move the boolean argument to InvalidateSystemCaches() and get rid of the
> *Extended() version altogether.

This would affect quite a few places where InvalidateSystemCaches() is
called, and we will end up with some sort of wrapper anyway. The
reason is that InvalidateSystemCaches() is used as a callback passed
to ReceiveSharedInvalidMessages():

```
ReceiveSharedInvalidMessages(LocalExecuteInvalidationMessage,
 InvalidateSystemCaches);
```

Unless of course we want to change its signature too. I don't think
this is going to be a good API change.

> As for the /*--- public functions ---*/ comment, that one was just not
> moved by b89e151054a0, which should have done so; but even at that
> point, it had already been somewhat broken by the addition of
> PrepareInvalidationState() (a static function) in 6cb4afff33ba below it.
> If we really want to clean this up, we could move PrepareInvalidationState()
> to just below RegisterSnapshotInvalidation() and move the "public
> functions" header to just below it (so that it appears just before
> InvalidateSystemCaches).
>
> If we just remove the "public functions" comment, then we're still
> inside a section that starts with the "private support functions"
> comment, which seems even worse to me than the current situation.

Fixed.

-- 
Best regards,
Aleksander Alekseev


v2-0001-Refactor-inval.c-and-inval.h.patch
Description: Binary data


Re: cataloguing NOT NULL constraints

2023-07-25 Thread Robert Haas
On Tue, Jul 25, 2023 at 8:36 AM Alvaro Herrera  wrote:
> Okay then, I've made these show up in the footer of \d+.  This is in
> patch 0003 here.  Please let me know what do you think of the regression
> changes.

Seems OK.

I'm not really thrilled with the idea of every not-null constraint
having a name, to be honest. Of all the kinds of constraints that we
have in the system, NOT NULL constraints are probably the ones where
naming them is least likely to be interesting, because they don't
really have any interesting properties. A CHECK constraint has an
expression; a foreign key constraint has columns that it applies to on
each side plus the identity of the table and opclass information, but
a NOT NULL constraint seems like it can never have any property other
than which column. So it sort of seems like a waste to name it. But if
we want it catalogued then we don't really have an option, so I
suppose we just have to accept a bit of clutter as the price of doing
business.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 23:37:08 +1200, David Rowley wrote:
> On Tue, 25 Jul 2023 at 17:34, Andres Freund  wrote:
> I've not really studied the fix_COPY_DEFAULT.patch patch.  Is there a
> reason to delay committing that?  It would be good to eliminate that
> as a variable for the current performance regression.

Yea, I don't think there's a reason to hold off on that. Sawada-san, do you
want to commit it? Or Andrew?


> > prep:
> > COPY (SELECT generate_series(1, 200) a, (random() * 10 - 
> > 5)::int b, 3243423 c) TO '/tmp/lotsaints.copy';
> > DROP TABLE lotsaints; CREATE UNLOGGED TABLE lotsaints(a int, b int, c int);
> >
> > benchmark:
> > psql -qX -c 'truncate lotsaints' && pgbench -n -P1 -f <( echo "COPY 
> > lotsaints FROM '/tmp/lotsaints.copy';") -t 15
> >
> > I disabled turbo mode, pinned the server to a single core of my Xeon Gold 
> > 5215:
> >
> > HEAD:   812.690
> >
> > your patch: 821.354
> >
> > strtoint from 8692f6644e7:  824.543
> >
> > strtoint from 6b423ec677d^: 806.678
> 
> I'm surprised to see the imul version is faster. It's certainly not
> what we found when working on 6b423ec67.

What CPUs did you test it on? I'd not be surprised if this were heavily
dependent on the microarchitecture.


> I've been fooling around a bit to try to learn what might be going on.
> I wrote 2 patches:
> 
> 1) pg_strtoint_imul.patch:  Effectively reverts 6b423ec67 and puts the
> code how it likely would have looked had I not done that work at all.
> (I've assumed that the hex, octal, binary parsing would have been
> added using the overflow multiplication functions just as the base 10
> parsing).


> 2) pg_strtoint_optimize.patch: I've made another pass over the
> functions with the current overflow checks and optimized the digit
> checking code so that it can be done in a single check like: if (digit
> < 10).  This can be done by assigning the result of *ptr - '0' to an
> unsigned char.  Anything less than '0' will wrap around and anything
> above '9' will remain so.  I've also removed a few inefficiencies with
> the isspace checking. We didn't need to do "while (*ptr &&
> isspace(*ptr)).  It's fine just to do while (isspace(*ptr)) since '\0'
> isn't a space char.  I also got rid of the isxdigit call.  The
> hexlookup array contains -1 for non-hex chars. We may as well check
> the digit value is >= 0.
> 
> Here are the results I get using your test as quoted above:
> 
> master @ 62e9af4c63f + fix_COPY_DEFAULT.patch
> 
> latency average = 568.102 ms
> 
> master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch
> 
> latency average = 531.238 ms
> 
> master @ 62e9af4c63f + fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch
> 
> latency average = 559.333 ms (surprisingly also faster on my machine)
> 
> The optimized version of the pg_strtoint functions wins over the imul
> patch.  Could you test to see if this is also the case in your Xeon
> machine?

(these are the numbers with turbo disabled; if I enable it they're all in the
6xx ms range, but the variance is higher)


fix_COPY_DEFAULT.patch
774.344

fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch
777.673

fix_COPY_DEFAULT.patch + pg_strtoint_optimize.patch
777.545

fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch
795.298

fix_COPY_DEFAULT.patch + pg_strtoint_imul.patch + likely(isdigit())
773.477

fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + 
pg_strtoint_imul.patch
774.443

fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + 
pg_strtoint_imul.patch + likely(isdigit())
774.513

fix_COPY_DEFAULT.patch + pg_strtoint32_base_10_first.v2.patch + 
pg_strtoint_imul.patch + likely(isdigit()) + unlikely(*ptr == '_')
763.879


One idea I had was to add a fastpath that won't parse all strings, but will
parse the strings that we would generate, and fall back to the more general
variant if it fails. See the attached, rough, prototype:

fix_COPY_DEFAULT.patch + fastpath.patch:
746.971

fix_COPY_DEFAULT.patch + fastpath.patch + isdigit.patch:
715.570

Now, the precise contents of this fastpath are not yet clear (wrt imul or
not), but I think the idea has promise.
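
A minimal sketch of that shape, under assumed constraints (only an
optional minus sign and plain decimal digits on the fast path; the
names are made up, and strtol() stands in for the general parser):

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the general, fully featured parser. */
static int
strtoint32_general(const char *s)
{
	return (int) strtol(s, NULL, 0);
}

/*
 * Fast path for the strings PostgreSQL itself would emit: an optional
 * '-' followed by at most nine plain decimal digits (so the value
 * always fits in int32), with no leading spaces, no '+', no
 * hex/octal/binary prefixes and no '_' separators.  Anything else is
 * punted to the general parser.
 */
static int
strtoint32_fastpath(const char *s)
{
	const char *ptr = s;
	long long	tmp = 0;
	int			neg = 0;
	int			ndigits = 0;

	if (*ptr == '-')
	{
		neg = 1;
		ptr++;
	}

	while (ndigits <= 9)
	{
		unsigned char digit = (unsigned char) (*ptr - '0');

		if (digit >= 10)
			break;
		tmp = tmp * 10 + digit;
		ptr++;
		ndigits++;
	}

	/* anything unusual (or too long): fall back to the general parser */
	if (*ptr != '\0' || ndigits == 0 || ndigits > 9)
		return strtoint32_general(s);

	return neg ? (int) -tmp : (int) tmp;
}
```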



> > (when I say strtoint from, I did not replace the goto labels, so that part 
> > is
> > unchanged and unrelated)
> 
> I didn't quite follow this.

I meant that I didn't undo ereport()->ereturn().

Greetings,

Andres Freund
diff --git c/src/backend/utils/adt/numutils.c i/src/backend/utils/adt/numutils.c
index 471fbb7ee63..52bbff801d0 100644
--- c/src/backend/utils/adt/numutils.c
+++ i/src/backend/utils/adt/numutils.c
@@ -301,6 +301,53 @@ pg_strtoint32_safe(const char *s, Node *escontext)
 	uint32		tmp = 0;
 	bool		neg = false;
 
+	/*
+	 * Fast path recognizing the most common output.
+	 */
+	if (1)
+	{
+		int32		tmp_s = 0;
+
+		/* leading spaces are uncommon => slow path */
+
+		/* handle - sign, + is uncommon => slow path */
+		if (*ptr == '-')
+		{
+			ptr++;
+			neg = true;
+		}
+
+		/* requ

Re: cataloguing NOT NULL constraints

2023-07-25 Thread Isaac Morland
On Tue, 25 Jul 2023 at 11:39, Robert Haas  wrote:

>
> I'm not really thrilled with the idea of every not-null constraint
> having a name, to be honest. Of all the kinds of constraints that we
> have in the system, NOT NULL constraints are probably the ones where
> naming them is least likely to be interesting, because they don't
> really have any interesting properties. A CHECK constraint has an
> expression; a foreign key constraint has columns that it applies to on
> each side plus the identity of the table and opclass information, but
> a NOT NULL constraint seems like it can never have any property other
> than which column. So it sort of seems like a waste to name it. But if
> we want it catalogued then we don't really have an option, so I
> suppose we just have to accept a bit of clutter as the price of doing
> business.
>

I agree. I definitely do *not* want a bunch of NOT NULL constraint names
cluttering up displays. Can we legislate that all NOT NULL implementing
constraints are named by mashing together the table name, column name, and
something to identify it as a NOT NULL constraint? Maybe even something
like pg_not_null_[relname]_[attname] (with some escaping), using the pg_
prefix to make the name reserved similar to schemas and tables? And then
don't show such constraints in \d, not even \d+ - just indicate it in
the Nullable column of the column listing as done now. Show a NOT NULL
constraint if there is something odd about it - for example, if it gets
renamed, or not renamed when the table is renamed.

Sorry for the noise if this has already been decided otherwise.


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 08:50:19 -0700, Andres Freund wrote:
> One idea I had was to add a fastpath that won't parse all strings, but will
> parse the strings that we would generate, and fall back to the more general
> variant if it fails. See the attached, rough, prototype:
> 
> fix_COPY_DEFAULT.patch + fastpath.patch:
> 746.971
> 
> fix_COPY_DEFAULT.patch + fastpath.patch + isdigit.patch:
> 715.570
> 
> Now, the precise contents of this fastpath are not yet clear (wrt imul or
> not), but I think the idea has promise.

Btw, I strongly suspect that fastpath wants to be branchless SSE when it grows
up.

Greetings,

Andres Freund




Re: cataloguing NOT NULL constraints

2023-07-25 Thread Alvaro Herrera
On 2023-Jul-25, Isaac Morland wrote:

> I agree. I definitely do *not* want a bunch of NOT NULL constraint names
> cluttering up displays. Can we legislate that all NOT NULL implementing
> constraints are named by mashing together the table name, column name, and
> something to identify it as a NOT NULL constraint?

All constraints are named like that already, and NOT NULL constraints
just inherited the same idea.  The names are <table>_<column>_not_null
for NOT NULL constraints.  pg_dump goes to great lengths to avoid
printing constraint names when they have this pattern.
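As a toy illustration of the naming pattern described above (this is a simplified stand-in: the real server code goes through makeObjectName()/ChooseConstraintName(), which also truncates the components so the result fits in NAMEDATALEN):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAMEDATALEN 64			/* PostgreSQL identifier limit: 63 bytes + NUL */

/*
 * Build the default NOT NULL constraint name, "<table>_<column>_not_null".
 * This toy version only demonstrates the simple no-truncation case; the
 * real code shortens the table/column parts when the result would exceed
 * NAMEDATALEN.
 */
static void
not_null_constraint_name(const char *table, const char *column,
						 char *buf, size_t buflen)
{
	snprintf(buf, buflen, "%s_%s_not_null", table, column);
}
```

A name matching this formula is what pg_dump (and, per the proposal, psql) can treat as "default" and suppress from output.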

I do not want these constraint names cluttering the output either.
That's why I propose moving them to a new \d++ command, where they will
only bother you if you absolutely need them.  But so far I have only one
vote supporting that idea.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: logical decoding and replication of sequences, take 2

2023-07-25 Thread Tomas Vondra
On 7/24/23 14:57, Ashutosh Bapat wrote:
> ...
> 
>>
>>
>> 2) Currently, the sequences hash table is in reorderbuffer, i.e. global.
>> I was thinking maybe we should have it in the transaction (because we
>> need to do cleanup at the end). It seem a bit inconvenient, because then
>> we'd need to either search htabs in all subxacts, or transfer the
>> entries to the top-level xact (otoh, we already do that with snapshots),
>> and cleanup on abort.
>>
>> What do you think?
> 
> Hash table per transaction seems saner design. Adding it to the top
> level transaction should be fine. The entry will contain an XID
> anyway. If we add it to every subtransaction we will need to search
> hash table in each of the subtransactions when deciding whether a
> sequence change is transactional or not. Top transaction is a
> reasonable trade off.
> 

It's not clear to me what design you're proposing, exactly.

If we track it in top-level transactions, then we'd need copy the data
whenever a transaction is assigned as a child, and perhaps also remove
it when there's a subxact abort.

And we'd need to still search the hashes in all toplevel transactions on
every sequence increment - in principle we can't have an increment for a
sequence created in another in-progress transaction, but maybe it's just
not assigned yet.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improve pg_stat_statements by making jumble handle savepoint names better

2023-07-25 Thread Greg Sabino Mullane
On Mon, Jul 24, 2023 at 6:46 PM Michael Paquier  wrote:

> Shouldn't this new field be marked as query_jumble_location
>

Yes, it should. I had some trouble getting it to work that way in the first
place, but now I realize it was just my unfamiliarity with this part of the
code. So thanks for the hint: v2 of the patch is much simplified by adding
two attributes to the TransactionStmt node. I've also added some tests per
your suggestion.

Unrelated to this patch, I'm struggling with meson testing. Why doesn't
this update the postgres test binary?:

meson test --suite pg_stat_statements

It runs "ninja" as expected, but it does not put a new
build/tmp_install/home/greg/pg/17/bin/postgres in place until I do a "meson
test"

Cheers,
Greg


savepoint_jumble.v2.patch
Description: Binary data


Re: WAL Insertion Lock Improvements

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 16:43:16 +0900, Michael Paquier wrote:
> On Sat, Jul 22, 2023 at 01:08:49PM +0530, Bharath Rupireddy wrote:
> > Yes, it looks safe to me too.
> 
> 0001 has been now applied.  I have done more tests while looking at
> this patch since yesterday and was surprised to see higher TPS numbers
> on HEAD with the same tests as previously, and the patch was still
> shining with more than 256 clients.
> 
> > FWIW, 0001 essentially implements what
> > an existing TODO comment introduced by commit 008608b9d5106 says:
> 
> We really need to do something in terms of documentation with
> something like 0002, so I'll try to look at that next.  Regarding
> 0003, I don't know.  I think that we'd better look more into cases
> where it shows actual benefits for specific workloads (like workloads
> with a fixed rate of read and/or write operations?).

FWIW, I'm working on a patch that replaces WAL insert locks as a whole,
because they don't scale all that well. If there's no very clear improvements,
I'm not sure it's worth putting too much effort into polishing them all that
much.

Greetings,

Andres Freund




Re: Logical walsenders don't process XLOG_CHECKPOINT_SHUTDOWN

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 14:31:00 +0530, Amit Kapila wrote:
> To ensure that all the data has been sent during the upgrade, we can
> ensure that each logical slot's confirmed_flush_lsn (position in the
> WAL till which subscriber has confirmed that it has applied the WAL)
> is the same as current_wal_insert_lsn. Now, because we don't send
> XLOG_CHECKPOINT_SHUTDOWN even on clean shutdown, confirmed_flush_lsn
> will never be the same as current_wal_insert_lsn. The one idea being
> discussed in patch [1] (see 0003) is to ensure that each slot's LSN is
> exactly XLOG_CHECKPOINT_SHUTDOWN ago which probably has some drawbacks
> like what if we tomorrow add some other WAL in the shutdown checkpoint
> path or the size of record changes then we would need to modify the
> corresponding code in upgrade.

Yea, that doesn't seem like a good path. But there is a variant that seems
better: We could just scan the end of the WAL for records that should have
been streamed out?

Greetings,

Andres Freund




Re: cataloguing NOT NULL constraints

2023-07-25 Thread Isaac Morland
On Tue, 25 Jul 2023 at 12:24, Alvaro Herrera 
wrote:

> On 2023-Jul-25, Isaac Morland wrote:
>
> > I agree. I definitely do *not* want a bunch of NOT NULL constraint names
> > cluttering up displays. Can we legislate that all NOT NULL implementing
> > constraints are named by mashing together the table name, column name,
> and
> > something to identify it as a NOT NULL constraint?
>
> All constraints are named like that already, and NOT NULL constraints
> just inherited the same idea.  The names are <table>_<column>_not_null
> for NOT NULL constraints.  pg_dump goes to great lengths to avoid
> printing constraint names when they have this pattern.
>

OK, this is helpful. Can \d do the same thing? I use a lot of NOT NULL
constraints and I very seriously do not want \d (including \d+) to have an
extra line for almost every column. It's just noise, and while my screen is
large, it's still not infinite.

I do not want these constraint names cluttering the output either.
> That's why I propose moving them to a new \d++ command, where they will
> only bother you if you absolutely need them.  But so far I have only one
> vote supporting that idea.


My suggestion is for \d+ to show NOT NULL constraints only if there is
something weird going on (wrong name, duplicate constraints, …). If there
is nothing weird about the constraint then explicitly listing it provides
absolutely no information that is not given by "not null" in the "Nullable"
column. Easier said than done I suppose. I'm just worried about my \d+
displays becoming less useful.


Re: Inefficiency in parallel pg_restore with many tables

2023-07-25 Thread Nathan Bossart
On Mon, Jul 24, 2023 at 12:00:15PM -0700, Nathan Bossart wrote:
> Here is a sketch of this approach.  It required fewer #ifdefs than I was
> expecting.  At the moment, this one seems like the winner to me.

Here is a polished patch set for this approach.  I've also added a 0004
that replaces the open-coded heap in pg_dump_sort.c with a binaryheap.
IMHO these patches are in decent shape.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From def4af18e68cece22008863f664e9e0fd060f917 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Sat, 22 Jul 2023 15:04:45 -0700
Subject: [PATCH v6 1/4] Make binaryheap available to frontend code.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

There are a couple of places in frontend code that could make use
of this simple binary heap implementation.  This commit makes
binaryheap usable in frontend code, much like commit 26aaf97b68 did
for StringInfo.  Like StringInfo, the header file is left in lib/
to reduce the likelihood of unnecessary breakage.

The frontend version of binaryheap exposes a void *-based API since
frontend code does not have access to the Datum definitions.  This
seemed like a better approach than switching all existing uses to
void * or making the Datum definitions available to frontend code.

Reviewed-by: Tom Lane, Álvaro Herrera
Discussion: https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us
---
 src/backend/lib/Makefile |  1 -
 src/backend/lib/meson.build  |  1 -
 src/common/Makefile  |  1 +
 src/{backend/lib => common}/binaryheap.c | 37 +---
 src/common/meson.build   |  1 +
 src/include/lib/binaryheap.h | 26 -
 6 files changed, 47 insertions(+), 20 deletions(-)
 rename src/{backend/lib => common}/binaryheap.c (91%)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..b6cefd9cca 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,7 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
-	binaryheap.o \
 	bipartite_match.o \
 	bloomfilter.o \
 	dshash.o \
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..b4e88f54ae 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -1,7 +1,6 @@
 # Copyright (c) 2022-2023, PostgreSQL Global Development Group
 
 backend_sources += files(
-  'binaryheap.c',
   'bipartite_match.c',
   'bloomfilter.c',
   'dshash.c',
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..cc5c54dcee 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -48,6 +48,7 @@ LIBS += $(PTHREAD_LIBS)
 OBJS_COMMON = \
 	archive.o \
 	base64.o \
+	binaryheap.o \
 	checksum_helper.o \
 	compression.o \
 	config_info.o \
diff --git a/src/backend/lib/binaryheap.c b/src/common/binaryheap.c
similarity index 91%
rename from src/backend/lib/binaryheap.c
rename to src/common/binaryheap.c
index 1737546757..4836cd1c6d 100644
--- a/src/backend/lib/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -6,15 +6,22 @@
  * Portions Copyright (c) 2012-2023, PostgreSQL Global Development Group
  *
  * IDENTIFICATION
- *	  src/backend/lib/binaryheap.c
+ *	  src/common/binaryheap.c
  *
  *-
  */
 
+#ifdef FRONTEND
+#include "postgres_fe.h"
+#else
 #include "postgres.h"
+#endif
 
 #include 
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#endif
 #include "lib/binaryheap.h"
 
 static void sift_down(binaryheap *heap, int node_off);
@@ -34,7 +41,7 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 	int			sz;
 	binaryheap *heap;
 
-	sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
+	sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
 	heap = (binaryheap *) palloc(sz);
 	heap->bh_space = capacity;
 	heap->bh_compare = compare;
@@ -106,10 +113,14 @@ parent_offset(int i)
  * afterwards.
  */
 void
-binaryheap_add_unordered(binaryheap *heap, Datum d)
+binaryheap_add_unordered(binaryheap *heap, bh_node_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_has_heap_property = false;
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
@@ -138,10 +149,14 @@ binaryheap_build(binaryheap *heap)
  * the heap property.
  */
 void
-binaryheap_add(binaryheap *heap, Datum d)
+binaryheap_add(binaryheap *heap, bh_node_type d)
 {
 	if (heap->bh_size >= heap->bh_space)
+#ifdef FRONTEND
+		pg_fatal("out of binary heap slots");
+#else
 		elog(ERROR, "out of binary heap slots");
+#endif
 	heap->bh_nodes[heap->bh_size] = d;
 	heap->bh_size++;
 	sift_up(heap, heap->bh_size - 1);
@@ -154,7 +169,7 @@ binaryheap_add(binaryheap *heap, Datum d)
  * without modifying the heap. The caller must ensure that

Re: cataloguing NOT NULL constraints

2023-07-25 Thread Robert Haas
On Tue, Jul 25, 2023 at 1:33 PM Isaac Morland  wrote:
> My suggestion is for \d+ to show NOT NULL constraints only if there is 
> something weird going on (wrong name, duplicate constraints, …). If there is 
> nothing weird about the constraint then explicitly listing it provides 
> absolutely no information that is not given by "not null" in the "Nullable" 
> column. Easier said than done I suppose. I'm just worried about my \d+ 
> displays becoming less useful.

I mean, the problem is that if you want to ALTER TABLE .. DROP
CONSTRAINT, you need to know what the valid arguments to that command
are, and the names of these constraints will be just as valid as the
names of any other constraints.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: cataloguing NOT NULL constraints

2023-07-25 Thread Isaac Morland
On Tue, 25 Jul 2023 at 14:59, Robert Haas  wrote:

> On Tue, Jul 25, 2023 at 1:33 PM Isaac Morland 
> wrote:
> > My suggestion is for \d+ to show NOT NULL constraints only if there is
> something weird going on (wrong name, duplicate constraints, …). If there
> is nothing weird about the constraint then explicitly listing it provides
> absolutely no information that is not given by "not null" in the "Nullable"
> column. Easier said than done I suppose. I'm just worried about my \d+
> displays becoming less useful.
>
> I mean, the problem is that if you want to ALTER TABLE .. DROP
> CONSTRAINT, you need to know what the valid arguments to that command
> are, and the names of these constraints will be just as valid as the
> names of any other constraints.
>

Can't I just ALTER TABLE … DROP NOT NULL still?

OK, I suppose ALTER CONSTRAINT to change the deferrable status and validity
(that is why we're doing this, right?) needs the constraint name. But the
constraint name is formulaic by default, and my proposal is to suppress it
only when it matches the formula, so you could just construct the
constraint name using the documented formula if it's not explicitly listed.

I really don’t see it as a good use of space to add n lines to the \d+
display just to confirm that the "not null" designations in the "Nullable"
column are implemented by named constraints with the expected names.


Re: WAL Insertion Lock Improvements

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 16:43:16 +0900, Michael Paquier wrote:
> On Sat, Jul 22, 2023 at 01:08:49PM +0530, Bharath Rupireddy wrote:
> > Yes, it looks safe to me too.
> 
> 0001 has been now applied.  I have done more tests while looking at
> this patch since yesterday and was surprised to see higher TPS numbers
> on HEAD with the same tests as previously, and the patch was still
> shining with more than 256 clients.

Just a small heads up:

I just rebased my aio tree over the commit and promptly, on the first run, saw
a hang. I did some debugging on that. Unfortunately repeated runs haven't
repeated that hang, despite quite a bit of trying.

The symptom I was seeing is that all running backends were stuck in
LWLockWaitForVar(), even though the value they're waiting for had
changed. Which obviously "shouldn't be possible".

It's of course possible that this is AIO specific, but I didn't see anything
in stacks to suggest that.


I do wonder if this possibly exposed an undocumented prior dependency on the
value update always happening under the list lock.

Greetings,

Andres Freund




Re: cataloguing NOT NULL constraints

2023-07-25 Thread Robert Haas
On Tue, Jul 25, 2023 at 3:07 PM Isaac Morland  wrote:
> OK, I suppose ALTER CONSTRAINT to change the deferrable status and validity 
> (that is why we're doing this, right?) needs the constraint name. But the 
> constraint name is formulaic by default, and my proposal is to suppress it 
> only when it matches the formula, so you could just construct the constraint 
> name using the documented formula if it's not explicitly listed.
>
> I really don’t see it as a good use of space to add n lines to the \d+ 
> display just to confirm that the "not null" designations in the "Nullable" 
> column are implemented by named constraints with the expected names.

Yeah, I mean, I get that. That was my initial concern, too. But I also
think if there's some complicated rule that determines what gets
displayed and what doesn't, nobody's going to remember it, and then
when you don't see something, you're never going to be sure exactly
what's going on. Displaying everything is going to be clunky
especially if, like me, you tend to be careful to mark columns NOT
NULL when they are, but when something goes wrong, the last thing you
want to do is run a \d command and have it show you incomplete
information.

I can't count the number of times that somebody's shown me the output
of a query against pg_locks or pg_stat_activity that had been filtered
to remove irrelevant information and it turned out that the hidden
information was not so irrelevant as the person who wrote the query
thought. It happens all the time. I don't want to create the same kind
of situation here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: MERGE ... RETURNING

2023-07-25 Thread Gurjeet Singh
On Fri, Jul 21, 2023 at 7:17 PM Dean Rasheed  wrote:
>
> On Mon, 17 Jul 2023 at 20:43, Jeff Davis  wrote:
> >
> > > > Maybe instead of a function it could be a special table reference
> > > > like:
> > > >
> > > >   MERGE ... RETURNING MERGE.action, MERGE.action_number, id, val?
> > > >
> > The benefits are:
> >
> > 1. It is naturally constrained to the right context. It doesn't require
> > global variables and the PG_TRY/PG_FINALLY, and can't be called in the
> > wrong contexts (like SELECT).
> >
> > 2. More likely to be consistent with eventual support for NEW/OLD
> > (actually BEFORE/AFTER for reasons the prior thread discussed).
> >
>
> Thinking about this some more, I think that the point about
> constraining these functions to the right context is a reasonable one,
> and earlier versions of this patch did that better, without needing
> global variables or a PG_TRY/PG_FINALLY block.
>
> Here is an updated patch that goes back to doing it that way. This is
> more like the way that aggregate functions and GROUPING() work, in
> that the parser constrains the location from which the functions can
> be used, and at execution time, the functions rely on the relevant
> context being passed via the FunctionCallInfo context.
>
> It's still possible to use these functions in subqueries in the
> RETURNING list, but attempting to use them anywhere else (like a
> SELECT on its own) will raise an error at parse time. If they do
> somehow get invoked in a non-MERGE context, they will elog an error
> (again, just like aggregate functions), because that's a "shouldn't
> happen" error.
>
> This does nothing to be consistent with eventual support for
> BEFORE/AFTER, but I think that's really an entirely separate thing,

+1

> From a user perspective, writing something like "BEFORE.id" is quite
> natural, because it's clear that "id" is a column, and "BEFORE" is the
> old state of the table. Writing something like "MERGE.action" seems a
> lot more counter-intuitive, because "action" isn't a column of
> anything (and if it was, I think this syntax would potentially cause
> even more confusion).
>
> So really, I think "MERGE.action" is an abuse of the syntax,
> inconsistent with any other SQL syntax, and using functions is much
> more natural, akin to GROUPING(), for example.

There seems to be other use cases which need us to invent a method to
expose a command-specific alias. See Tatsuo Ishii's call for help in
his patch for Row Pattern Recognition [1].


I was not able to find a way to implement expressions like START.price
(START is not a table alias). Any suggestion is greatly appreciated.


It looks like the SQL standard has started using more of such
context-specific keywords, and I'd expect such keywords to only
increase in future, as the SQL committee tries to introduce more
features into the standard.

So if MERGE.action is not to your taste, perhaps you/someone can
suggest an alternative that doesn't cause a confusion, and yet
implementing it would open up the way for more such context-specific
keywords.

I performed the review of the v9 patch by comparing it to the v8 and v7
patches, and then comparing to HEAD.

The v9 patch is more or less a complete revert to v7 patch, plus the
planner support for calling the merge-support functions in subqueries,
parser catching use of merge-support functions outside MERGE command,
and name change for one of the support functions.

But reverting to v7 also means that some of my gripes with that
version also return; e.g. invention of EXPR_KIND_MERGE_RETURNING. And
as noted in v7 review, I don't have a better proposal.

Function name changed from pg_merge_when_clause() to
pg_merge_when_clause_number(). That's better, even though it's a bit
of mouthful.

Doc changes (compared to v7) look good.

The changes made to tests (compared to v7) are for the better.

- * Uplevel PlaceHolderVars and aggregates are replaced, too.
+ * Uplevel PlaceHolderVars, aggregates, GROUPING() expressions and merge
+ * support functions are replaced, too.

Needs Oxford comma: s/GROUPING() expressions and/GROUPING() expressions, and/

+pg_merge_action(PG_FUNCTION_ARGS)
+{
...
+relaction = mtstate->mt_merge_action;
+if (relaction)
+{
..
+}
+
+PG_RETURN_NULL();
+}

Under what circumstances would the relaction be null? Is it okay to
return NULL from this function if relaction is null, or is it better
to throw an error? These questions apply to the
pg_merge_when_clause_number() function as well.
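For context on the guard being questioned above, the execution-time pattern the patch uses (trusting fcinfo->context, and treating any other caller as a "shouldn't happen" case) can be sketched with simplified stand-in types — these are NOT the real PostgreSQL definitions, just an illustration of the shape:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for NodeTag / ModifyTableState / fcinfo. */
typedef enum
{
	T_Invalid,
	T_ModifyTableState
} NodeTag;

typedef struct
{
	NodeTag		type;
	int			merge_action;	/* stand-in for mtstate->mt_merge_action */
} ModifyTableState;

typedef struct
{
	void	   *context;		/* set by the executor, like for aggregates */
} FunctionCallInfoData;

/*
 * Returns 1 and sets *action if called from a MERGE context, 0 otherwise.
 * In the real function the "otherwise" branch is the place where one must
 * choose between returning NULL and raising an error.
 */
static int
merge_action_from_context(FunctionCallInfoData *fcinfo, int *action)
{
	ModifyTableState *mtstate = (ModifyTableState *) fcinfo->context;

	if (mtstate == NULL || mtstate->type != T_ModifyTableState)
		return 0;				/* elog(ERROR, ...) territory in the backend */
	*action = mtstate->merge_action;
	return 1;
}
```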

[1]: Row pattern recognition
https://www.postgresql.org/message-id/flat/20230625.210509.1276733411677577841.t-ishii%40sranhm.sra.co.jp

Best regards,
Gurjeet
http://Gurje.et




Re: WAL Insertion Lock Improvements

2023-07-25 Thread Michael Paquier
On Tue, Jul 25, 2023 at 12:57:37PM -0700, Andres Freund wrote:
> I just rebased my aio tree over the commit and promptly, on the first run, saw
> a hang. I did some debugging on that. Unfortunately repeated runs haven't
> repeated that hang, despite quite a bit of trying.
> 
> The symptom I was seeing is that all running backends were stuck in
> LWLockWaitForVar(), even though the value they're waiting for had
> changed. Which obviously "shouldn't be possible".

Hmm.  I've also spent a few days looking at this past report that made
the LWLock part what it is today, but I don't quite see immediately
how it would be possible to reach a state where all the backends are
waiting for an update that's not happening:
https://www.postgresql.org/message-id/CAMkU=1zlztrowh3b42oxsb04r9zmesk3658qen4_8+b+k3e...@mail.gmail.com

All the assumptions of this code and its dependencies on
xloginsert.c are hard to pin down.

> It's of course possible that this is AIO specific, but I didn't see anything
> in stacks to suggest that.

Or AIO handles the WAL syncs so quickly that it has more chances of
showing a race condition here?

> I do wonder if this possibly exposed an undocumented prior dependency on the
> value update always happening under the list lock.

I would not be surprised by that.
--
Michael


signature.asc
Description: PGP signature


Re: WAL Insertion Lock Improvements

2023-07-25 Thread Michael Paquier
On Tue, Jul 25, 2023 at 09:49:01AM -0700, Andres Freund wrote:
> FWIW, I'm working on a patch that replaces WAL insert locks as a whole,
> because they don't scale all that well.

What were you looking at here?  Just wondering.
--
Michael


signature.asc
Description: PGP signature


Re: WAL Insertion Lock Improvements

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-26 07:40:31 +0900, Michael Paquier wrote:
> On Tue, Jul 25, 2023 at 09:49:01AM -0700, Andres Freund wrote:
> > FWIW, I'm working on a patch that replaces WAL insert locks as a whole,
> > because they don't scale all that well.
>
> What were you looking at here?  Just wondering.

Here's what I had written offlist a few days ago:

> The basic idea is to have a ringbuffer of in-progress insertions. The
> acquisition of a position in the ringbuffer is done at the same time as
> advancing the reserved LSN, using a 64bit xadd. The trick that makes that
> possible is to use the high bits of the atomic for the position in the
> ringbuffer. The point of using the high bits is that they wrap around, without
> affecting the rest of the value.
>
> Of course, when using xadd, we can't keep the "prev" pointer in the
> atomic. That's where the ringbuffer comes into play. Whenever one inserter has
> determined the byte pos of its insertion, it updates the "prev byte pos" in
> the *next* ringbuffer entry.
>
> Of course that means that insertion N+1 needs to wait for N to set the prev
> position - but that happens very quickly. In my very hacky prototype the
> relevant path (which for now just spins) is reached very rarely, even when
> massively oversubscribed. While I've not implemented that, N+1 could actually
> do the first "iteration" in CopyXLogRecordToWAL() before it needs the prev
> position, the COMP_CRC32C() could happen "inside" the buffer.
>
>
> There's a fair bit of trickyness in turning that into something working, of
> course :).  Ensuring that the ring buffer of insertions doesn't wrap around is
> non-trivial. Nor is trivial to ensure that the "reduced" space LSN in the
> atomic can't overflow.
>
> I do wish MAX_BACKENDS were smaller...
>
>
> Until last night I thought all my schemes would continue to need something
> like the existing WAL insertion locks, to implement
> WALInsertLockAcquireExclusive().
>
> But I think I came up with an idea to do away with that (not even prototyped
> yet): Use one bit in the atomic that indicates that no new insertions are
> allowed. Whenever the xadd finds that old value actually was locked, it
> "aborts" the insertion, and instead waits for a condition variable (or
> something similar). Of course that's after modifying the atomic - to deal with
> that the "lock holder" reverts all modifications that have been made to the
> atomic when releasing the "lock", they weren't actually successful and all
> those backends will retry.
>
> Except that this doesn't quite suffice - XLogInsertRecord() needs to be able
> to "roll back", when it finds that we now need to log FPIs. I can't quite see
> how to make that work with what I describe above. The only idea I have so far
> is to just waste the space with a NOOP record - it should be pretty rare. At
> least if we updated RedoRecPtr eagerly (or just stopped this stupid business
> of having an outdated backend-local copy).
>
>
>
> My prototype shows this idea to be promising. It's a tad slower at low
> concurrency, but much better at high concurrency. I think most if not all of
> the low-end overhead isn't inherent, but comes from having both old and new
> infrastructure in place (as well as a lot of debugging cruft).
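The core trick quoted above — one 64-bit xadd that simultaneously advances the reserved byte position and claims a ringbuffer slot, with the slot index living in the high bits so it can wrap harmlessly — can be sketched like this (RING_BITS, the field layout, and all names are invented for illustration; the real prototype is considerably more involved):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_BITS	8
#define POS_BITS	(64 - RING_BITS)
#define POS_MASK	((UINT64_C(1) << POS_BITS) - 1)

typedef struct
{
	/* layout: [ slot : RING_BITS | bytepos : POS_BITS ] */
	_Atomic uint64_t state;
} insert_reserve;

typedef struct
{
	uint64_t	slot;			/* ringbuffer index (mod 2^RING_BITS) */
	uint64_t	start_pos;		/* reserved start byte position */
} reservation;

/*
 * Claim one ringbuffer slot and reserve `size` bytes of WAL position, in
 * a single atomic fetch-add.  The slot counter in the high bits wraps
 * around without disturbing the position in the low bits.
 */
static reservation
reserve_insert(insert_reserve *r, uint64_t size)
{
	uint64_t	delta = (UINT64_C(1) << POS_BITS) | size;
	uint64_t	old = atomic_fetch_add(&r->state, delta);
	reservation res;

	res.slot = old >> POS_BITS;
	res.start_pos = old & POS_MASK;
	return res;
}
```

What this sketch omits is exactly the hard part described above: publishing the "prev" position into the next slot, keeping the ring from wrapping past in-progress insertions, and the lock/abort bit for WALInsertLockAcquireExclusive()-style exclusion.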


Greetings,

Andres Freund




Re: Improve pg_stat_statements by making jumble handle savepoint names better

2023-07-25 Thread Michael Paquier
On Tue, Jul 25, 2023 at 12:37:18PM -0400, Greg Sabino Mullane wrote:
> Yes, it should. I had some trouble getting it to work that way in the first
> place, but now I realize it was just my unfamiliarity with this part of the
> code. So thanks for the hint: v2 of the patch is much simplified by adding
> two attributes to the TransactionStmt node. I've also added some tests per
> your suggestion.

-   char   *savepoint_name; /* for savepoint commands */
+   char   *savepoint_name  pg_node_attr(query_jumble_ignore);
char   *gid;/* for two-phase-commit related commands */
boolchain;  /* AND CHAIN option */
+   int locationpg_node_attr(query_jumble_location);
 } TransactionStmt;

You got that right.  Applying query_jumble_ignore makes sure that we
don't finish with different query IDs for different savepoint names.

> Unrelated to this patch, I'm struggling with meson testing. Why doesn't
> this update the postgres test binary?:
> 
> meson test --suite pg_stat_statements

Yes, it should do so, I assume.  I've seen that myself.  That's
contrary to what make check does, but perhaps things are better
integrated this way with meson.

I think that I'm OK with your proposal as savepoint names are in
defined places in these queries (contrary to some of the craziness
with BEGIN and the node structure of TransactionStmt, for example).

Has somebody an opinion or a comment to share?
--
Michael


signature.asc
Description: PGP signature


Re: Show various offset arrays for heap WAL records

2023-07-25 Thread Peter Geoghegan
On Mon, Jul 10, 2023 at 10:29 PM Peter Geoghegan  wrote:
> > Let's add a src/backend/access/rmgrdesc/README file. We don't currently
> > have any explanation anywhere why the rmgr desc functions are in a
> > separate directory. The README would be a good place to explain that,
> > and to have the formatting guidelines. See attached.
>
> I agree that it's better this way, though.

Did you forget to follow up here?

-- 
Peter Geoghegan




Re: [PATCH] Add function to_oct

2023-07-25 Thread Nathan Bossart
On Tue, Apr 04, 2023 at 08:45:36PM -0400, Kirk Wolak wrote:
> Marked Ready for Committer.

I started taking a look at this and ended up adding to_binary() and a
bigint version of to_oct() for completeness.  While I was at it, I moved
the base-conversion logic out to a separate static function that
to_binary/oct/hex all use.

From the other discussion referenced upthread, it sounds like we might want
to replace to_binary/oct/hex with a more generic base-conversion function.
Maybe we should try to do that instead.
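A minimal sketch of what such a generic base-conversion helper could look like, in the spirit of the patch's convert_to_base() but taking the base as a user-visible argument (names and the 2..16 range are assumptions for illustration, not the patch's API):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Convert `value` to a string in the given base (2..16), writing into the
 * caller-supplied buffer from the end backwards and returning a pointer to
 * the first digit.  `buflen` must be large enough for the widest case
 * (64 binary digits plus NUL).
 */
static char *
to_base(uint64_t value, int base, char *buf, size_t buflen)
{
	const char *digits = "0123456789abcdef";
	char	   *ptr = buf + buflen - 1;

	*ptr = '\0';
	do
	{
		*--ptr = digits[value % base];
		value /= base;
	} while (ptr > buf && value);
	return ptr;
}
```

With something like this, to_binary(x), to_oct(x), and to_hex(x) all become thin wrappers over one function, which is essentially what the patch below does internally.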

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From a8cac3f3de2394a6128515b6efc2ef3d0a92bce5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 25 Jul 2023 16:09:01 -0700
Subject: [PATCH v3 1/1] add to_binary() and to_oct()

---
 doc/src/sgml/func.sgml| 42 +
 src/backend/utils/adt/varlena.c   | 86 +++
 src/include/catalog/pg_proc.dat   | 12 
 src/test/regress/expected/strings.out | 26 +++-
 src/test/regress/sql/strings.sql  |  9 ++-
 5 files changed, 148 insertions(+), 27 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b94827674c..79334ea145 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3737,6 +3737,48 @@ repeat('Pg', 4) PgPgPgPg

   
 
+  
+   
+
+ to_binary
+
+to_binary ( integer )
+text
+   
+   
+to_binary ( bigint )
+text
+   
+   
+Converts the number to its equivalent binary representation.
+   
+   
+to_binary(2147483647)
+1111111111111111111111111111111
+   
+  
+
+  
+   
+
+ to_oct
+
+to_oct ( integer )
+text
+   
+   
+to_oct ( bigint )
+text
+   
+   
+Converts the number to its equivalent octal representation.
+   
+   
+to_oct(2147483647)
+17777777777
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index b1ec5c32ce..b14f995fad 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -165,6 +165,7 @@ static void text_format_string_conversion(StringInfo buf, char conversion,
 		  int flags, int width);
 static void text_format_append_string(StringInfo buf, const char *str,
 	  int flags, int width);
+static char *convert_to_base(uint64 value, int base);
 
 
 /*
@@ -4919,53 +4920,90 @@ array_to_text_internal(FunctionCallInfo fcinfo, ArrayType *v,
 	return result;
 }
 
-#define HEXBASE 16
 /*
- * Convert an int32 to a string containing a base 16 (hex) representation of
- * the number.
+ * Convert an integer to a string containing a base-2 (binary) representation
+ * of the number.
  */
 Datum
-to_hex32(PG_FUNCTION_ARGS)
+to_binary32(PG_FUNCTION_ARGS)
 {
-	uint32		value = (uint32) PG_GETARG_INT32(0);
-	char	   *ptr;
-	const char *digits = "0123456789abcdef";
-	char		buf[32];		/* bigger than needed, but reasonable */
+	uint64		value = (uint64) PG_GETARG_INT32(0);
+	char	   *result = convert_to_base(value, 2);
 
-	ptr = buf + sizeof(buf) - 1;
-	*ptr = '\0';
+	PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+Datum
+to_binary64(PG_FUNCTION_ARGS)
+{
+	uint64		value = (uint64) PG_GETARG_INT64(0);
+	char	   *result = convert_to_base(value, 2);
 
-	do
-	{
-		*--ptr = digits[value % HEXBASE];
-		value /= HEXBASE;
-	} while (ptr > buf && value);
+	PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+/*
+ * Convert an integer to a string containing a base-8 (oct) representation of
+ * the number.
+ */
+Datum
+to_oct32(PG_FUNCTION_ARGS)
+{
+	uint64		value = (uint64) PG_GETARG_INT32(0);
+	char	   *result = convert_to_base(value, 8);
+
+	PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+Datum
+to_oct64(PG_FUNCTION_ARGS)
+{
+	uint64		value = (uint64) PG_GETARG_INT64(0);
+	char	   *result = convert_to_base(value, 8);
 
-	PG_RETURN_TEXT_P(cstring_to_text(ptr));
+	PG_RETURN_TEXT_P(cstring_to_text(result));
 }
 
 /*
- * Convert an int64 to a string containing a base 16 (hex) representation of
+ * Convert an integer to a string containing a base-16 (hex) representation of
  * the number.
  */
 Datum
+to_hex32(PG_FUNCTION_ARGS)
+{
+	uint64		value = (uint64) PG_GETARG_INT64(0);
+	char	   *result = convert_to_base(value, 16);
+
+	PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+Datum
 to_hex64(PG_FUNCTION_ARGS)
 {
 	uint64		value = (uint64) PG_GETARG_INT64(0);
-	char	   *ptr;
+	char	   *result = convert_to_base(value, 16);
+
+	PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+/*
+ * Workhorse for to_binary, to_oct, and to_hex.  Note that base must be either
+ * 2, 8, or 16.
+ */
+static char *
+convert_to_base(uint64 value, int base)
+{
 	const char *digits = "0123456789abcdef";
-	char		buf[32];		/* bigger than needed, but reasonable */
+

Re: [PATCH] Check more invariants during syscache initialization

2023-07-25 Thread Michael Paquier
On Tue, Jul 25, 2023 at 02:35:31PM +0300, Aleksander Alekseev wrote:
>> -   Assert(cacheinfo[cacheId].reloid != 0);
>> +   Assert(cacheinfo[cacheId].reloid != InvalidOid);
>> +   Assert(cacheinfo[cacheId].indoid != InvalidOid);
>> No objections about checking for the index OID given out to catch
>> any failures at an early stage before doing an actual lookup?  I guess
>> that you've added an incorrect entry and noticed the problem only when
>> triggering a syscache lookup for the new incorrect entry?
>> +   Assert(key[i] != InvalidAttrNumber);
>>
>> Yeah, same here.
> 
> Not sure if I understand your question or suggestion. The proposed
> patch adds Asserts() to InitCatalogCache() and InitCatCache() called
> by it, before any actual lookups.

That was more a question.  I was wondering if it was something you've
noticed while working on a different patch because you somewhat
assigned incorrect values in the syscache array, but it looks like you
have noticed that while scanning the code.

>> +   Assert((cacheinfo[cacheId].nkeys > 0) && (cacheinfo[cacheId].nkeys 
>> <= 4));
>>
>> Is this addition actually useful?
> 
> I believe it is. One can mistakenly use 5+ keys in cacheinfo[] declaration, 
> e.g:

Still it's hard to miss at compile time.  I think that I would remove
this one.

> All in all I'm not claiming that these proposed new Asserts() make a
> huge difference. I merely noticed that InitCatalogCache checks only
> cacheinfo[cacheId].reloid. Checking the rest of the values would be
> helpful for the developers and will not impact the performance of the
> release builds.

No counter-arguments on that.  The checks for the index OID and the
keys allow to catch failures in these structures at an earlier stage
when initializing a backend.  Agreed that it can be useful for
developers.
--
Michael




Re: POC, WIP: OR-clause support for indexes

2023-07-25 Thread Peter Geoghegan
On Thu, Jun 29, 2023 at 2:32 AM Alena Rybakina  wrote:
> Hi! I'm sorry I didn't answer you right away, I was too busy with work.

Same for me, this time. I was busy working on my patch, which I
finally posted yesterday.

> To be honest, I hadn't thought about the fact that my optimization can also 
> help eliminate duplicates before reading the data.

I'm not surprised that you didn't specifically think of that, because
it's very subtle.

> I am still only in the process of familiarizing myself with the thread [1] 
> (reference from your letter), but I have already seen that there are problems 
> related, for example, to when "or" expressions refer to the parent element.

I didn't intend to imply that you might have the same problem here. I
just meant that OR optimizations can have problems with duplicate
elimination, in general. I suspect that your patch doesn't have that
problem, because you are performing a transformation of one kind of OR
into another kind of OR.

> I think I would face similar problems if I complicated the current code, for 
> example so that OR expressions standing not only on the same level but also on 
> different ones are rewritten into ANY, without violating the logic of operator 
> precedence.

I can't say that I am particularly experienced in this general area --
I have never tried to formally reason about how two different
statements are equivalent. It just hasn't been something that I've
needed to have a totally rigorous understanding of up until now. But
my recent patch changes that. Now I need to study this area to make
sure that I have a truly rigorous understanding.

Jeff Davis suggested that I read "Logic and Databases", by C.J. Date.
So now I have some homework to do.

> Unfortunately, when I tried to make the transformation at the stage of index 
> formation, I encountered a severely inaccurate selectivity estimate for the 
> relation, which led to incorrect cost and cardinality calculations.

It's not surprising that a weird shift in the plan chosen by the
optimizer is seen with some random test case, as a result of this
added transformation. Even changes that are 100% strictly better (e.g.
changes in a selectivity estimation function that is somehow
guaranteed to be more accurate in all cases) might do that. Here is a
recent example of that with another patch, involving a bitmap OR:

https://postgr.es/m/cah2-wzncdk9n2tz6j_-iln563_epuc3nzp6vsvtl6jhzs6n...@mail.gmail.com

This example was *not* a regression, if you go by conventional
measures. It was merely a less robust plan than the bitmap OR plan,
because it didn't pass down both columns as index quals.

BTW, there are various restrictions on the sort order of SAOPs that
you might want to try to familiarize yourself with. I describe them
(perhaps not very clearly) here:

https://postgr.es/m/CAH2-Wz=ksvn_sjcnd1+bt-wtifra5ok48adynq3pkkhxgmq...@mail.gmail.com

Currently, the optimizer doesn't recognize multi-column indexes with
SAOPs on every column as having a valid sort order, except on the
first column. It seems possible that that has consequences for your
patch. (I'm really only guessing, though; don't trust anything that I
say about the optimizer too much.)

--
Peter Geoghegan




Re: PATCH: Add REINDEX tag to event triggers

2023-07-25 Thread Michael Paquier
On Tue, Jul 25, 2023 at 04:34:47PM +0800, jian he wrote:
> so  T_ReindexStmt should only be in  ProcessUtilitySlow, if you want
> to create an event trigger on reindex?
> 
> regression tests work fine. I even play with partitions.

It would be an idea to have some regression tests for partitions,
actually, so that some patterns around ReindexMultipleInternal() are
checked.  We could have a REINDEX DATABASE in a TAP test with an event
trigger, as well, but I don't feel strongly about the need to do that
much extra work in 090_reindexdb.pl or 091_reindexdb_all.pl if
partitions cover the multi-table case.
--
Michael




Re: Support worker_spi to execute the function dynamically.

2023-07-25 Thread Michael Paquier
On Mon, Jul 24, 2023 at 05:38:45PM +0900, Michael Paquier wrote:
> Which is basically the same thing with TAP except that these are
> grouped now?  The value of a few raw SQL queries with a
> NO_INSTALLCHECK does not strike me as enough on top of having to
> maintain two different sets of tests.  I'd still choose the cheap and
> extensible path here.

I've been sleeping on that a bit more, and I'd still go with the
refactoring where we initialize one cluster and have all the tests
done by TAP, for the sake of being much cheaper without changing the
coverage, while being more extensible when it comes to introduce tests
for the follow-up patch on custom wait events.
--
Michael




Buildfarm animal grassquit is failing

2023-07-25 Thread Tom Lane
grassquit has been failing to run the regression tests for the last
few days, since [1]:

# +++ regress check in src/test/regress +++
# using temp instance on port 6880 with PID 766305
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth" (currently 
2048kB), after ensuring the platform's stack depth limit is adequate.
ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth" (currently 
2048kB), after ensuring the platform's stack depth limit is adequate.
# command failed: "psql" -X -q -c "CREATE DATABASE \\"regression\\" 
TEMPLATE=template0 LOCALE='C'" -c "ALTER DATABASE \\"regression\\" SET 
lc_messages TO 'C';ALTER DATABASE \\"regression\\" SET lc_monetary TO 'C';ALTER 
DATABASE \\"regression\\" SET lc_numeric TO 'C';ALTER DATABASE \\"regression\\" 
SET lc_time TO 'C';ALTER DATABASE \\"regression\\" SET bytea_output TO 
'hex';ALTER DATABASE \\"regression\\" SET timezone_abbreviations TO 'Default';" 
"postgres"

This seems to be happening in all branches, so I doubt it
has anything to do with recent PG commits.  Instead, I notice
that grassquit seems to have been updated to a newer Debian
release: 6.1.0-6-amd64 works, 6.3.0-2-amd64 doesn't.

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2023-07-23%2011%3A05%3A50




Re: Add bump memory context type and use it for tuplesorts

2023-07-25 Thread Nathan Bossart
On Tue, Jun 27, 2023 at 09:19:26PM +1200, David Rowley wrote:
> Because of all of what is mentioned above about the current state of
> tuplesort, there does not really seem to be much need to have chunk
> headers in memory we allocate for tuples at all.  Not having these
> saves us a further 8 bytes per tuple.
> 
> In the attached patch, I've added a bump memory allocator which
> allocates chunks without any chunk header.  This means the chunks
> cannot be pfree'd or realloc'd.  That seems fine for the use case for
> storing tuples in tuplesort. I've coded bump.c in such a way that when
> built with MEMORY_CONTEXT_CHECKING, we *do* have chunk headers.  That
> should allow us to pick up any bugs that are introduced by any code
> which accidentally tries to pfree a bump.c chunk.

This is a neat idea.

> In terms of performance of tuplesort, there's a small (~5-6%)
> performance gain. Not as much as I'd hoped, but I'm also doing a bit
> of other work on tuplesort to make it more efficient in terms of CPU,
> so I suspect the cache efficiency improvements might be more
> pronounced after those.

Nice.

> One thing that might need more thought is that we're running a bit low
> on  MemoryContextMethodIDs. I had to use an empty slot that has a bit
> pattern like glibc malloc'd chunks sized 128kB.  Maybe it's worth
> freeing up a bit from the block offset in MemoryChunk. This is
> currently 30 bits allowing 1GB offset, but these offsets are always
> MAXALIGNED, so we could free up a couple of bits since those 2
> lowest-order bits will always be 0 anyway.

I think it'd be okay to steal those bits.  AFAICT it'd complicate the
macros in memutils_memorychunk.h a bit, but that doesn't seem like such a
terrible price to pay to allow us to keep avoiding the glibc bit patterns.

> + if (base->sortopt & TUPLESORT_ALLOWBOUNDED)
> + tuplen = GetMemoryChunkSpace(tuple);
> + else
> + tuplen = MAXALIGN(tuple->t_len);

nitpick: I see this repeated in a few places, and I wonder if it might
deserve a comment.

I haven't had a chance to try out your benchmark, but I'm hoping to do so
in the near future.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Add function to_oct

2023-07-25 Thread Nathan Bossart
On Tue, Jul 25, 2023 at 04:24:26PM -0700, Nathan Bossart wrote:
> I started taking a look at this and ended up adding to_binary() and a
> bigint version of to_oct() for completeness.  While I was at it, I moved
> the base-conversion logic out to a separate static function that
> to_binary/oct/hex all use.

Bleh, this patch seems to fail horribly on 32-bit builds.  I'll look into
it soon.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Buildfarm animal grassquit is failing

2023-07-25 Thread Andres Freund
Hi,

On 2023-07-25 20:10:06 -0400, Tom Lane wrote:
> grassquit has been failing to run the regression tests for the last
> few days, since [1]:
> 
> # +++ regress check in src/test/regress +++
> # using temp instance on port 6880 with PID 766305
> ERROR:  stack depth limit exceeded
> HINT:  Increase the configuration parameter "max_stack_depth" (currently 
> 2048kB), after ensuring the platform's stack depth limit is adequate.
> ERROR:  stack depth limit exceeded
> HINT:  Increase the configuration parameter "max_stack_depth" (currently 
> 2048kB), after ensuring the platform's stack depth limit is adequate.
> # command failed: "psql" -X -q -c "CREATE DATABASE \\"regression\\" 
> TEMPLATE=template0 LOCALE='C'" -c "ALTER DATABASE \\"regression\\" SET 
> lc_messages TO 'C';ALTER DATABASE \\"regression\\" SET lc_monetary TO 
> 'C';ALTER DATABASE \\"regression\\" SET lc_numeric TO 'C';ALTER DATABASE 
> \\"regression\\" SET lc_time TO 'C';ALTER DATABASE \\"regression\\" SET 
> bytea_output TO 'hex';ALTER DATABASE \\"regression\\" SET 
> timezone_abbreviations TO 'Default';" "postgres"
> 
> This seems to be happening in all branches, so I doubt it
> has anything to do with recent PG commits.  Instead, I notice
> that grassquit seems to have been updated to a newer Debian
> release: 6.1.0-6-amd64 works, 6.3.0-2-amd64 doesn't.

Indeed, the automated updates had been stuck, and I fixed that a few days
ago. I missed grassquit's failures unfortunately.

I think I know what the issue is - I have seen them locally. Newer versions of
gcc and clang (or libasan?) default to enabling "stack use after return"
checks - for some reason that implies using a shadow stack *sometimes*. Which
can confuse our stack depth checking terribly, not so surprisingly.

I've been meaning to look into what we could do to fix that, but I haven't
found the cycles...

For now I added :detect_stack_use_after_return=0 to ASAN_OPTIONS, which I
think should fix the issue.

Greetings,

Andres Freund




Re: Prevent psql \watch from running queries that return no rows

2023-07-25 Thread Michael Paquier
On Wed, Jul 05, 2023 at 10:14:22AM -0400, Greg Sabino Mullane wrote:
> Would like to hear others weigh in, I think it's still only three states
> plus a default, so I'm not convinced it warrants multiple statements yet. :)

I find that hard to parse, so having more lines to get a better idea
of what the states are would be good.

While on it, I can see that the patch has no tests.  Could you add
something in psql's 001_basic.pl?  The option to define the number of
iterations for \watch can make the tests predictible even for the
non-failure cases.  I would do checks with incorrect values, as well,
see the part of the test script about the intervals.
--
Michael




Re: [PATCH] Add loongarch native checksum implementation.

2023-07-25 Thread YANG Xudong
Many thanks to huchangqi. Now we have loongarch64 support for both old 
world ABI and new world ABI on the buildfarm!


 Forwarded Message 
Subject: Re: [PATCH] Add loongarch native checksum implementation.
Date: Tue, 25 Jul 2023 15:51:43 +0800
From: huchangqi 
To: YANG Xudong 

Both cisticola and nuthatch are on the buildfarm now.

cisticola is "old world ABI".
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=cisticola&br=HEAD

nuthatch is "new world ABI".
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=nuthatch&br=HEAD

--
Best regards,
huchangqi


On 2023/7/5 15:11, John Naylor wrote:
>
> Before I look at this again: Are there any objections to another CRC
> implementation for the reason of having no buildfarm member?

It is possible to try this patch on buildfarm now, I guess?




Re: [PATCH] Add loongarch native checksum implementation.

2023-07-25 Thread Michael Paquier
On Wed, Jul 05, 2023 at 02:11:02PM +0700, John Naylor wrote:
> Also, please don't top-post (which means: quoting an entire message, with
> new text at the top) -- it clutters our archives.
> 
> Before I look at this again: Are there any objections to another CRC
> implementation for the reason of having no buildfarm member?

The performance numbers presented upthread for the CRC computations
are kind of nice in this environment, but honestly I have no idea how
much this architecture is used.  Perhaps that's only something in
China?  I am not seeing much activity around that in Japan, for
instance, and that's really close.

Anyway, based on today's state of the buildfarm, we have a buildfarm
member named cisticola that should be able to test this new CRC
implementation, so I see no problem in applying this stuff now if you
think it is in good shape.
--
Michael




Re: [PATCH] Add function to_oct

2023-07-25 Thread Nathan Bossart
On Tue, Jul 25, 2023 at 05:16:56PM -0700, Nathan Bossart wrote:
> On Tue, Jul 25, 2023 at 04:24:26PM -0700, Nathan Bossart wrote:
>> I started taking a look at this and ended up adding to_binary() and a
>> bigint version of to_oct() for completeness.  While I was at it, I moved
>> the base-conversion logic out to a separate static function that
>> to_binary/oct/hex all use.
> 
> Bleh, this patch seems to fail horribly on 32-bit builds.  I'll look into
> it soon.

Here's a new version of the patch with the silly mistakes fixed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 1a729185b7fb77dbfee10f9a715cef6c3cd7e74b Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 25 Jul 2023 16:09:01 -0700
Subject: [PATCH v4 1/1] add to_binary() and to_oct()

---
 doc/src/sgml/func.sgml| 42 +
 src/backend/utils/adt/varlena.c   | 86 +++
 src/include/catalog/pg_proc.dat   | 12 
 src/test/regress/expected/strings.out | 26 +++-
 src/test/regress/sql/strings.sql  |  9 ++-
 5 files changed, 148 insertions(+), 27 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index b94827674c..79334ea145 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3737,6 +3737,48 @@ repeat('Pg', 4) PgPgPgPg

   
 
+  
+   
+
+ to_binary
+
+to_binary ( integer )
+text
+   
+   
+to_binary ( bigint )
+text
+   
+   
+Converts the number to its equivalent binary representation.
+   
+   
+to_binary(2147483647)
+1111111111111111111111111111111
+   
+  
+
+  
+   
+
+ to_oct
+
+to_oct ( integer )
+text
+   
+   
+to_oct ( bigint )
+text
+   
+   
+Converts the number to its equivalent octal representation.
+   
+   
+to_oct(2147483647)
+17777777777
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index b1ec5c32ce..b8903f338f 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -165,6 +165,7 @@ static void text_format_string_conversion(StringInfo buf, char conversion,
 		  int flags, int width);
 static void text_format_append_string(StringInfo buf, const char *str,
 	  int flags, int width);
+static char *convert_to_base(uint64 value, int base);
 
 
 /*
@@ -4919,53 +4920,90 @@ array_to_text_internal(FunctionCallInfo fcinfo, ArrayType *v,
 	return result;
 }
 
-#define HEXBASE 16
 /*
- * Convert an int32 to a string containing a base 16 (hex) representation of
- * the number.
+ * Convert an integer to a string containing a base-2 (binary) representation
+ * of the number.
  */
 Datum
-to_hex32(PG_FUNCTION_ARGS)
+to_binary32(PG_FUNCTION_ARGS)
 {
-	uint32		value = (uint32) PG_GETARG_INT32(0);
-	char	   *ptr;
-	const char *digits = "0123456789abcdef";
-	char		buf[32];		/* bigger than needed, but reasonable */
+	uint64		value = (uint64) PG_GETARG_INT32(0);
+	char	   *result = convert_to_base(value, 2);
 
-	ptr = buf + sizeof(buf) - 1;
-	*ptr = '\0';
+	PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+Datum
+to_binary64(PG_FUNCTION_ARGS)
+{
+	uint64		value = (uint64) PG_GETARG_INT64(0);
+	char	   *result = convert_to_base(value, 2);
 
-	do
-	{
-		*--ptr = digits[value % HEXBASE];
-		value /= HEXBASE;
-	} while (ptr > buf && value);
+	PG_RETURN_TEXT_P(cstring_to_text(result));
+}
 
-	PG_RETURN_TEXT_P(cstring_to_text(ptr));
+/*
+ * Convert an integer to a string containing a base-8 (oct) representation of
+ * the number.
+ */
+Datum
+to_oct32(PG_FUNCTION_ARGS)
+{
+	uint64		value = (uint64) PG_GETARG_INT32(0);
+	char	   *result = convert_to_base(value, 8);
+
+	PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+Datum
+to_oct64(PG_FUNCTION_ARGS)
+{
+	uint64		value = (uint64) PG_GETARG_INT64(0);
+	char	   *result = convert_to_base(value, 8);
+
+	PG_RETURN_TEXT_P(cstring_to_text(result));
 }
 
 /*
- * Convert an int64 to a string containing a base 16 (hex) representation of
+ * Convert an integer to a string containing a base-16 (hex) representation of
  * the number.
  */
 Datum
+to_hex32(PG_FUNCTION_ARGS)
+{
+	uint64		value = (uint64) PG_GETARG_INT32(0);
+	char	   *result = convert_to_base(value, 16);
+
+	PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+Datum
 to_hex64(PG_FUNCTION_ARGS)
 {
 	uint64		value = (uint64) PG_GETARG_INT64(0);
-	char	   *ptr;
+	char	   *result = convert_to_base(value, 16);
+
+	PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+/*
+ * Workhorse for to_binary, to_oct, and to_hex.  Note that base must be either
+ * 2, 8, or 16.
+ */
+static char *
+convert_to_base(uint64 value, int base)
+{
 	const char *digits = "0123456789abcdef";
-	char		buf[32];		/* bigger than needed, but r

Re: Support worker_spi to execute the function dynamically.

2023-07-25 Thread Michael Paquier
On Wed, Jul 26, 2023 at 09:02:54AM +0900, Michael Paquier wrote:
> I've been sleeping on that a bit more, and I'd still go with the
> refactoring where we initialize one cluster and have all the tests
> done by TAP, for the sake of being much cheaper without changing the
> coverage, while being more extensible when it comes to introduce tests
> for the follow-up patch on custom wait events.

For now, please note that I have applied your idea to add "dynamic" to
the names of the bgworkers registered on a worker_spi_launch() as this
is useful on its own.  I have given up on the "static" part, because
that felt inconsistent with the API names, and we don't use this term
in the docs for bgworkers, additionally.
--
Michael




Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-25 Thread Thomas Munro
While chatting to Robert and Andres about all this, a new idea came
up.  Or, rather, one of the first ideas that was initially rejected,
now resurrected to try out a suggestion of Andres’s on how to
de-pessimise it.  Unfortunately, it also suffers from Windows-specific
problems that I originally mentioned at the top of this thread but
had since repressed.  Arrrghgh.

First, the good news:

We could write out a whole new control file, and durable_rename() it
into place.  We don’t want to do that in general, because we don’t
want to slow down UpdateMinRecoveryPoint().  The new concept is to do
that only if a backup is in progress.  That requires a bit of
interlocking with backup start/stop (ie when runningBackups is
changing in shmem, we don’t want to overlap with UpdateControlFile()'s
decision on how to do it).  Here is a patch to try that out.  No more
weasel wording needed for the docs; basebackup and low-level file
system backup should always see an atomic control file (and
occasionally also copy a harmless pg_control.tmp file).  Then we only
need the gross retry-until-stable hack for front-end programs.

And the bad news:

In my catalogue-of-Windows-weirdness project[1], I learned in v3-0003 that:

+ fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+ PG_EXPECT_SYS(fd >= 0, "touch name1.txt");
+ PG_REQUIRE_SYS(fd < 0 || close(fd) == 0);
+
+ fd = open(path2, O_RDWR | PG_BINARY, 0777);
+ PG_EXPECT_SYS(fd >= 0, "open name2.txt");
+ make_path(path2, "name2.txt");
+#ifdef WIN32
+
+ /*
+ * Windows can't rename over an open non-unlinked file, even with
+ * have_posix_unlink_semantics.
+ */
+ pgwin32_dirmod_loops = 2; /* minimize looping to fail fast in testing */
+ PG_EXPECT_SYS(rename(path, path2) == -1,
+   "Windows: can't rename name1.txt -> name2.txt while name2.txt is open");
+ PG_EXPECT_EQ(errno, EACCES);
+ PG_EXPECT_SYS(unlink(path) == 0, "unlink name1.txt");
+#else
+ PG_EXPECT_SYS(rename(path, path2) == 0,
+   "POSIX: can rename name1.txt -> name2.txt while name2.txt is open");
+#endif
+ PG_EXPECT_SYS(close(fd) == 0);

Luckily the code in dirmod.c:pgrename() should retry lots of times if
a concurrent transient opener/reader comes along, so I think that
should be OK in practice (but if backups_r_us.exe holds the file open
for 10 seconds while we're trying to rename it, I assume we'll PANIC);
call that problem #1.  What is slightly more disturbing is the clue in
the "Cygwin cleanup" thread[2] that rename() can fail to be 100%
atomic, so that a concurrent call to open() can fail with ENOENT (cf.
the POSIX requirement "... a link named new shall remain visible to
other processes throughout the renaming operation and refer either to
the file referred to by new or old ...").  Call that problem #2, a
problem that already causes us rare breakage (for example: could not
open file "pg_logical/snapshots/0-14FE6B0.snap").

I know that problem #1 can be fixed by applying v3-0004 from [1] but
that leads to impossible decisions like revoking support for non-NTFS
filesystems as discussed in that thread, and we certainly couldn't
back-patch that anyway.  I assume problem #2 can too.

That makes me want to *also* acquire ControlFileLock, for base
backup's read of pg_control.  Even though it seems redundant with the
rename() trick (the rename() trick should be enough for low-level
*and* basebackup on ext4), it would at least avoid the above
Windowsian pathologies during base backups.

I'm sorry for the patch/idea-churn in this thread.  It's like
Whac-a-Mole.  Blasted non-POSIX-compliant moles.  New patches
attached.  Are they getting better?

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2Be13wK0PBX5Z63CCwWm7MfRQuwBRabM_3aKWSko2AUww%40mail.gmail.com
From caf6f0d49d5b24b8c3d1f41b97d7824dadad8f8b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 26 Jul 2023 08:46:19 +1200
Subject: [PATCH v6 1/4] Update control file atomically during backups.

If there is a backup in progress, switch to a rename-based approach when
writing out the control file to avoid problems with torn reads that can
occur on some systems (at least ext4, ntfs).  We don't want to make
UpdateControlFile() slower in general, because it's called frequently
during recovery, so we only do it while runningBackups > 0.  In order to
activate the new behavior without races, we now also acquire
ControlFileLock when backups begin and end.

Back-patch to all supported releases.

Discussion: https://postgr.es/m/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
---
 src/backend/access/transam/xlog.c  | 33 +++--
 src/bin/pg_checksums/pg_checksums.c|  2 +-
 src/bin/pg_resetwal/pg_resetwal.c  |  2 +-
 src/bin/pg_rewind/pg_rewind.c  |  2 +-
 src/common/controldata_utils.c | 41 --
 src/include/common/controldata_utils.h |  4 ++-
 6 files changed, 74 inser

Re: logical decoding and replication of sequences, take 2

2023-07-25 Thread Amit Kapila
On Tue, Jul 25, 2023 at 5:29 PM Tomas Vondra
 wrote:
>
> On 7/25/23 08:28, Amit Kapila wrote:
> > On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra
> >  wrote:
> >>
> >> On 7/24/23 12:40, Amit Kapila wrote:
> >>> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> >>>  wrote:
> >>>
> >>> Even after that, see below the value of the sequence is still not
> >>> caught up. Later, when the apply worker processes all the WAL, the
> >>> sequence state will be caught up.
> >>>
> >>
> >> And how is this different from what tablesync does for tables? For that
> >> 'r' also does not mean it's fully caught up, IIRC. What matters is
> >> whether the sequence since this moment can go back. And I don't think it
> >> can, because that would require replaying changes from before we did
> >> copy_sequence ...
> >>
> >
> > For sequences, it is quite possible that we replay WAL from before the
> > copy_sequence whereas the same is not true for tables (w.r.t
> > copy_table()). This is because for tables we have a kind of interlock
> > w.r.t LSN returned via create_slot (say this value of LSN is LSN1),
> > basically, the walsender corresponding to tablesync worker in
> > publisher won't send any WAL before that LSN whereas the same is not
> > true for sequences. Also, even if apply worker can receive WAL before
> > copy_table, it won't apply that as that would be behind the LSN1 and
> > the same is not true for sequences. So, for tables, we will never go
> > back to a state before the copy_table() but for sequences, we can go
> > back to a state before copy_sequence().
> >
>
> Right. I think the important detail is that during sync we have three
> important LSNs
>
> - LSN1 where the slot is created
> - LSN2 where the copy happens
> - LSN3 where we consider the sync completed
>
> For tables, LSN1 == LSN2, because the data copy is completed using the
> snapshot from the temporary slot. And (LSN1 <= LSN3).
>
> But for sequences, the copy happens after the slot creation, possibly
> with (LSN1 < LSN2). And because LSN3 comes from the main subscription
> (which may be a bit behind, for whatever reason), it may happen that
>
>(LSN1 < LSN3 < LSN2)
>
> The sync ends at LSN3, but that means all sequence changes between
> LSN3 and LSN2 will be applied "again", making the sequence go backwards.
>

Yeah, the problem is something as you explained but an additional
minor point is that for sequences we also do end up applying the WAL
between LSN1 and LSN3 which makes it go backwards. The ideal way is
that sequences on subscribers never go backward in a way that is
visible to users. I will share my thoughts after studying your
proposal in a later email.

-- 
With Regards,
Amit Kapila.




Re: Logical walsenders don't process XLOG_CHECKPOINT_SHUTDOWN

2023-07-25 Thread Amit Kapila
On Tue, Jul 25, 2023 at 10:33 PM Andres Freund  wrote:
>
> On 2023-07-25 14:31:00 +0530, Amit Kapila wrote:
> > To ensure that all the data has been sent during the upgrade, we can
> > ensure that each logical slot's confirmed_flush_lsn (position in the
> > WAL till which subscriber has confirmed that it has applied the WAL)
> > is the same as current_wal_insert_lsn. Now, because we don't send
> > XLOG_CHECKPOINT_SHUTDOWN even on clean shutdown, confirmed_flush_lsn
> > will never be the same as current_wal_insert_lsn. The one idea being
> > discussed in patch [1] (see 0003) is to ensure that each slot's LSN is
> > exactly XLOG_CHECKPOINT_SHUTDOWN ago which probably has some drawbacks
> > like what if we tomorrow add some other WAL in the shutdown checkpoint
> > path or the size of record changes then we would need to modify the
> > corresponding code in upgrade.
>
> Yea, that doesn't seem like a good path. But there is a variant that seems
> better: We could just scan the end of the WAL for records that should have
> been streamed out?
>

This sounds like a better idea. So, one way to realize this is to
group slots based on confirmed_flush_lsn and then scan based on that.
Once we ensure that the slot group with the highest
confirm_flush_location is up-to-date (doesn't have any pending WAL
except for shutdown_checkpoint), any slot group having a lesser value
of confirm_flush_location would be considered a group with pending
data.

BTW, I think the main downside of not sending
XLOG_CHECKPOINT_SHUTDOWN to logical walsenders is that even if today
there is no risk of any hint bit updates (or any other possibility of
generating WAL) during decoding of XLOG_CHECKPOINT_SHUTDOWN, there is
no guarantee of the same in the future. Is there anything I am missing
here?

-- 
With Regards,
Amit Kapila.




Re: psql: Could we get "-- " prefixing on the **** QUERY **** outputs? (ECHO_HIDDEN)

2023-07-25 Thread Nathan Bossart
I took a look at this patch and changed a couple things:

 * I made a similar adjustment to a few lines that seem to have been
   missed.
 * I removed a couple of asterisks from the adjusted lines in order to
   maintain the existing line lengths.

Barring additional feedback, I think this is ready for commit.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 9346a1bbb2fc772527e4b1df82cfdc2dfbc5afb0 Mon Sep 17 00:00:00 2001
From: Kirk Wolak 
Date: Wed, 17 May 2023 16:59:02 -0400
Subject: [PATCH v3 1/1] Change output of ECHO_HIDDEN comments to be SQL
 comments

Simply add the "/" required before/after the ********* QUERY ********** banner
(ECHO_HIDDEN on) so that the lines are comments and are ignored if you copy
and paste a series of them.

Author: Kirk Wolak 
Reviewed By: Pavel Stehule 
Reviewed By: Laurenz Albe 
Thread: https://postgr.es/m/caclu5mtfjrjytbvmz26txggmxwqo0hkghh2o3hequupmsbg...@mail.gmail.com
---
 src/bin/psql/command.c |  8 ++++----
 src/bin/psql/common.c  | 16 ++++++++--------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 6733f008fd..1300869d79 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -5395,16 +5395,16 @@ echo_hidden_command(const char *query)
 {
 	if (pset.echo_hidden != PSQL_ECHO_HIDDEN_OFF)
 	{
-		printf(_("********* QUERY **********\n"
+		printf(_("/******** QUERY *********/\n"
 				 "%s\n"
-				 "**************************\n\n"), query);
+				 "/************************/\n\n"), query);
 		fflush(stdout);
 		if (pset.logfile)
 		{
 			fprintf(pset.logfile,
-					_("********* QUERY **********\n"
+					_("/******** QUERY *********/\n"
 					  "%s\n"
-					  "**************************\n\n"), query);
+					  "/************************/\n\n"), query);
 			fflush(pset.logfile);
 		}
 
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 5973df2e39..10ad1f2538 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -589,16 +589,16 @@ PSQLexec(const char *query)
 
 	if (pset.echo_hidden != PSQL_ECHO_HIDDEN_OFF)
 	{
-		printf(_("********* QUERY **********\n"
+		printf(_("/******** QUERY *********/\n"
 				 "%s\n"
-				 "**************************\n\n"), query);
+				 "/************************/\n\n"), query);
 		fflush(stdout);
 		if (pset.logfile)
 		{
 			fprintf(pset.logfile,
-					_("********* QUERY **********\n"
+					_("/******** QUERY *********/\n"
 					  "%s\n"
-					  "**************************\n\n"), query);
+					  "/************************/\n\n"), query);
 			fflush(pset.logfile);
 		}
 
@@ -1060,9 +1060,9 @@ SendQuery(const char *query)
 		char		buf[3];
 
 		fflush(stderr);
-		printf(_("***(Single step mode: verify command)*******************************************\n"
+		printf(_("/**(Single step mode: verify command)******************************************/\n"
 				 "%s\n"
-				 "***(press return to proceed or enter x and return to cancel)********************\n"),
+				 "/**(press return to proceed or enter x and return to cancel)*******************/\n"),
 			   query);
 		fflush(stdout);
 		if (fgets(buf, sizeof(buf), stdin) != NULL)
@@ -1080,9 +1080,9 @@ SendQuery(const char *query)
 	if (pset.logfile)
 	{
 		fprintf(pset.logfile,
-				_("********* QUERY **********\n"
+				_("/******** QUERY *********/\n"
 				  "%s\n"
-				  "**************************\n\n"), query);
+				  "/************************/\n\n"), query);
 		fflush(pset.logfile);
 	}
 
-- 
2.25.1



Re: [PATCH] Add loongarch native checksum implementation.

2023-07-25 Thread Nathan Bossart
On Wed, Jul 26, 2023 at 12:16:28PM +0900, Michael Paquier wrote:
> On Wed, Jul 05, 2023 at 02:11:02PM +0700, John Naylor wrote:
>> Before I look at this again: Are there any objections to another CRC
>> implementation for the reason of having no buildfarm member?
>
> [ ... ] 
> 
> Anyway, based on today's state of the buildfarm, we have a buildfarm
> member named cisticola that should be able to test this new CRC
> implementation, so I see no problem in applying this stuff now if you
> think it is in good shape.

IMHO we should strive to maintain buildfarm coverage for all the
intrinsics used within Postgres, if for no other reason than to ensure
future changes do not break those platforms.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-25 Thread Peter Smith
Here are some comments for patch v22-0001.

==
1. General -- naming conventions

There is quite a lot of inconsistency with variable/parameter naming
styles in this patch. I understand in most cases the names are copied
unchanged from the original functions. Still, since this is a big
refactor anyway, it can also be a good opportunity to clean up those
inconsistencies instead of just propagating them to different places.
IIUC, the usual reluctance to rename things because it would cause
backpatch difficulties doesn't apply here (since everything is being
refactored anyway).

E.g. Consider using snake_case names more consistently in the
following places:

~

1a. start_table_sync

+static void
+start_table_sync(XLogRecPtr *origin_startpos, char **myslotname)
+{
+ char*syncslotname = NULL;

origin_startpos -> (no change)
myslotname -> my_slot_name (But, is there a better name for this than
calling it "my" slot name)
syncslotname -> sync_slot_name

~

1b. run_tablesync_worker

+static void
+run_tablesync_worker()
+{
+ char originname[NAMEDATALEN];
+ XLogRecPtr origin_startpos = InvalidXLogRecPtr;
+ char*slotname = NULL;
+ WalRcvStreamOptions options;

originname -> origin_name
origin_startpos -> (no change)
slotname -> slot_name

~

1c. set_stream_options

+void
+set_stream_options(WalRcvStreamOptions *options,
+char *slotname,
+XLogRecPtr *origin_startpos)
+{
+ int server_version;

options -> (no change)
slotname -> slot_name
origin_startpos -> (no change)
server_version -> (no change)

~

1d. run_apply_worker

 static void
-start_apply(XLogRecPtr origin_startpos)
+run_apply_worker()
 {
- PG_TRY();
+ char originname[NAMEDATALEN];
+ XLogRecPtr origin_startpos = InvalidXLogRecPtr;
+ char*slotname = NULL;
+ WalRcvStreamOptions options;
+ RepOriginId originid;
+ TimeLineID startpointTLI;
+ char*err;
+ bool must_use_password;

originname -> origin_name
origin_startpos => (no change)
slotname -> slot_name
originid -> origin_id

==
src/backend/replication/logical/worker.c

2. SetupApplyOrSyncWorker

-ApplyWorkerMain(Datum main_arg)
+SetupApplyOrSyncWorker(int worker_slot)
 {
- int worker_slot = DatumGetInt32(main_arg);
- char originname[NAMEDATALEN];
- XLogRecPtr origin_startpos = InvalidXLogRecPtr;
- char*myslotname = NULL;
- WalRcvStreamOptions options;
- int server_version;
-
- InitializingApplyWorker = true;
-
  /* Attach to slot */
  logicalrep_worker_attach(worker_slot);

+ Assert(am_tablesync_worker() || am_leader_apply_worker());
+

Why is the Assert not the very first statement of this function?

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Synchronizing slots from primary to standby

2023-07-25 Thread shveta malik
On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila  wrote:
>
> On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy
>  wrote:
> >
> > On Fri, Jul 21, 2023 at 5:16 PM shveta malik  wrote:
> > >
> > > Thanks Bharat for letting us know. It is okay to split the patch, it
> > > may definitely help to understand the modules better but shall we take
> > > a step back and try to reevaluate the design first before moving to
> > > other tasks?
> >
> > Agree that design comes first. FWIW, I'm attaching the v9 patch set
> > that I have with me. It can't be a perfect patch set unless the design
> > is finalized.
> >
> > > I analyzed more on the issues stated in [1] for replacing LIST_SLOTS
> > > with SELECT query. On rethinking, it might not be a good idea to
> > > replace this cmd with SELECT in Launcher code-path
> >
> > I think there are open fundamental design aspects, before optimizing
> > LIST_SLOTS, see below. I'm sure we can come back to this later.
> >
> > > Secondly, I was thinking if the design proposed in the patch is the
> > > best one. No doubt, it is the most simplistic design and thus may
> > > .. Any feedback is appreciated.
> >
> > Here are my thoughts about this feature:
> >
> > Current design:
> >
> > 1. On primary, never allow walsenders associated with logical
> > replication slots to go ahead of physical standbys that are candidates
> > for future primary after failover. This enables subscribers to connect
> > to new primary after failover.
> > 2. On all candidate standbys, periodically sync logical slots from
> > primary (creating the slots if necessary) with one slot sync worker
> > per logical slot.
> >
> > Important considerations:
> >
> > 1. Does this design guarantee the row versions required by subscribers
> > aren't removed on candidate standbys as raised here -
> > https://www.postgresql.org/message-id/20220218222319.yozkbhren7vkjbi5%40alap3.anarazel.de?
> >
> > It seems safe with logical decoding on standbys feature. Also, a
> > test-case from upthread is already in patch sets (in v9 too)
> > https://www.postgresql.org/message-id/CAAaqYe9FdKODa1a9n%3Dqj%2Bw3NiB9gkwvhRHhcJNginuYYRCnLrg%40mail.gmail.com.
> > However, we need to verify the use cases extensively.
> >
>
> Agreed.
>
> > 2. All candidate standbys will start one slot sync worker per logical
> > slot which might not be scalable.
> >
>
> Yeah, that doesn't sound like a good idea but IIRC, the proposed patch
> is using one worker per database (for all slots corresponding to a
> database).
>
> > Is having one (or a few more - not
> > necessarily one for each logical slot) worker for all logical slots
> > enough?
> >
>
> I guess for a large number of slots there is a possibility of a large
> gap in syncing the slots which probably means we need to retain
> corresponding WAL for a much longer time on the primary. If we can
> prove that the gap won't be large enough to matter then this would be
> probably worth considering otherwise, I think we should find a way to
> scale the number of workers to avoid the large gap.
>

How about this:

1) On standby, spawn 1 worker per database in the start (as it is
doing currently).

2) Maintain statistics on activity against each primary's database on
standby by any means. Could be by maintaining 'last_synced_time' and
'last_activity_seen_time'. The last_synced_time is updated every time
we sync/recheck slots for that particular database. The
'last_activity_seen_time' changes only if we get any slot on that
database where actually confirmed_flush or say restart_lsn has changed
from what was maintained already.

3) If at any moment, we find that 'last_synced_time' -
'last_activity_seen_time' goes beyond a threshold, that means that DB
is not active currently. Add it to the list of inactive DBs.

4) The launcher, on the other hand, always checks whether it needs to
spawn an extra worker for any new DB. It will additionally check
whether the number of inactive databases (maintained on standby) has
gone beyond some threshold; if so, it brings down the workers for
those and starts a common worker which takes care of all such inactive
databases (i.e. merges them into one), while workers for active
databases remain as they are (one per DB). Each worker maintains the
list of DBs it is responsible for.

5) If, among these inactive databases, we again find an active
database using the above logic, then the launcher will spawn a
separate worker for it.

Pros:
Fewer workers on standby, scaled to the load on the primary.
Less poking of the primary by the standby, i.e. the standby will send
queries to get slot info for all inactive DBs in one run instead of
each worker sending such queries separately.

Cons: We might see more frequent spawning and freeing of workers.

Please let me know your thoughts on this.

thanks
Shveta




Re: [PATCH] Add loongarch native checksum implementation.

2023-07-25 Thread YANG Xudong
On 2023/7/26 11:16, Michael Paquier wrote:
> The performance numbers presented upthread for the CRC computations
> are kind of nice in this environment, but honestly I have no idea how
> much this architecture is used.  Perhaps that's only something in
> China?  I am not seeing much activity around that in Japan, for
> instance, and that's really close.


The architecture is pretty new (to the open source ecosystem). Support
for it in the Linux kernel and GCC was released last year.


Here is a site tracking the status of support for it in major open
source projects. (Chinese only)


https://areweloongyet.com/




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-25 Thread Amit Kapila
On Wed, Jul 26, 2023 at 10:10 AM Peter Smith  wrote:
>
> Here are some comments for patch v22-0001.
>
> ==
> 1. General -- naming conventions
>
> There is quite a lot of inconsistency with variable/parameter naming
> styles in this patch. I understand in most cases the names are copied
> unchanged from the original functions. Still, since this is a big
> refactor anyway, it can also be a good opportunity to clean up those
> inconsistencies instead of just propagating them to different places.
>

I am not against improving the naming consistency of existing
variables, but I feel it would be better to do that as a separate
patch, along with improving the consistency of function names. For new
functions/variables, it would be good to follow a consistent style.

-- 
With Regards,
Amit Kapila.




Remove unused fields in ReorderBufferTupleBuf

2023-07-25 Thread Masahiko Sawada
Hi,

ReorderBufferTupleBuf is defined as follow:

/* an individual tuple, stored in one chunk of memory */
typedef struct ReorderBufferTupleBuf
{
	/* position in preallocated list */
	slist_node	node;

	/* tuple header, the interesting bit for users of logical decoding */
	HeapTupleData tuple;

	/* pre-allocated size of tuple buffer, different from tuple size */
	Size		alloc_tuple_size;

/* actual tuple data follows */
} ReorderBufferTupleBuf;

However, node and alloc_tuple_size are not used at all. It seems an
oversight in commit a4ccc1cef5a, which introduced the generation
context and used it in logical decoding. I think we can remove them
(only on HEAD). I've attached the patch.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v1-0001-Remove-unused-fields-in-ReorderBufferTupleBuf.patch
Description: Binary data


Re: psql: Could we get "-- " prefixing on the **** QUERY **** outputs? (ECHO_HIDDEN)

2023-07-25 Thread Pavel Stehule
On Wed, Jul 26, 2023 at 6:22 AM Nathan Bossart wrote:

> I took a look at this patch and changed a couple things:
>
>  * I made a similar adjustment to a few lines that seem to have been
>    missed.
>  * I removed a couple of asterisks from the adjusted lines in order to
>    maintain the existing line lengths.
>
> Barring additional feedback, I think this is ready for commit.
>
>
+1

Pavel

-- 
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>


Re: [feature]COPY FROM enable FORCE_NULL/FORCE_NOT_NULL on all columns

2023-07-25 Thread Amit Langote
Hello,

On Thu, Jul 20, 2023 at 4:06 PM Zhang Mingli  wrote:
>
> Hi,
>
> On Jul 7, 2023 at 18:00 +0800, Damir Belyalov wrote:
>
>
> The V2 patch still had some errors when applying the changes to
> doc/src/sgml/ref/copy.sgml; I rebased and fixed them in the V3 patch.
> Thanks a lot for the review.
>
> I have updated https://commitfest.postgresql.org/43/3896/ to status Ready
> for Committer, thanks again.

I've looked at this patch and it looks mostly fine, though I do not
intend to commit it myself; perhaps Andrew will.

A few minor things to improve:

+  If * is specified, it will be applied in all columns.
...
+  If * is specified, it will be applied in all columns.

Please write "it will be applied in" as "the option will be applied to".

+	bool		force_notnull_all;	/* FORCE_NOT_NULL * */
...
+	bool		force_null_all;		/* FORCE_NULL * */

Like in the comment for force_quote, please add a "?" after * in the
above comments.

+   if (cstate->opts.force_notnull_all)
+   MemSet(cstate->opts.force_notnull_flags, true, num_phys_attrs
* sizeof(bool));
...
+   if (cstate->opts.force_null_all)
+   MemSet(cstate->opts.force_null_flags, true, num_phys_attrs *
sizeof(bool));

While I am not especially opposed to using this 1-line variant to set
the flags array, it does mean that there are now different styles
being used for similar code, because force_quote_flags uses a for
loop:

if (cstate->opts.force_quote_all)
{
int i;

for (i = 0; i < num_phys_attrs; i++)
cstate->opts.force_quote_flags[i] = true;
}

Perhaps we could fix the inconsistency by changing the force_quote_all
code to use MemSet() too.  I'll defer whether to do that to Andrew's
judgement.


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Question about use_physical_tlist() which is applied on Scan path

2023-07-25 Thread Jian Guo
Hi hackers,

I have a question about `use_physical_tlist()` which is applied in 
`create_scan_plan()`:
```
if (flags == CP_IGNORE_TLIST)
{
  tlist = NULL;
}
else if (use_physical_tlist(root, best_path, flags))
{
  if (best_path->pathtype == T_IndexOnlyScan)
  {
/* For index-only scan, the preferred tlist is the index's */
tlist = copyObject(((IndexPath *) best_path)->indexinfo->indextlist);

/*
 * Transfer sortgroupref data to the replacement tlist, if
 * requested (use_physical_tlist checked that this will work).
 */
if (flags & CP_LABEL_TLIST)
  apply_pathtarget_labeling_to_tlist(tlist, best_path->pathtarget);
  }
  else
  {
tlist = build_physical_tlist(root, rel);
……
```
And the comment above the code block says:

```
/*
* For table scans, rather than using the relation targetlist (which is
* only those Vars actually needed by the query), we prefer to generate a
* tlist containing all Vars in order.  This will allow the executor to
* optimize away projection of the table tuples, if possible.
*
* But if the caller is going to ignore our tlist anyway, then don't
* bother generating one at all.  We use an exact equality test here, so
* that this only applies when CP_IGNORE_TLIST is the only flag set.
*/
```

But for a column-oriented database based on Postgres, projecting only the
needed columns during execution could help a lot. Are there any other
optimization considerations behind this design?

e.g. If we have such table definition and a query:

```
CREATE TABLE partsupp
(PS_PARTKEY INT,
PS_SUPPKEY INT,
PS_AVAILQTY INTEGER,
PS_SUPPLYCOST DECIMAL(15,2),
PS_COMMENT VARCHAR(199),
dummy text);

explain analyze verbose select sum(ps_supplycost * ps_availqty) from partsupp;
```

And the planner would give such plan:

```
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=12.80..12.81 rows=1 width=32) (actual time=0.013..0.015 rows=1 loops=1)
   Output: sum((ps_supplycost * (ps_availqty)::numeric))
   ->  Seq Scan on public.partsupp  (cost=0.00..11.60 rows=160 width=22) (actual time=0.005..0.005 rows=0 loops=1)
         Output: ps_partkey, ps_suppkey, ps_availqty, ps_supplycost, ps_comment, dummy
 Planning Time: 0.408 ms
 Execution Time: 0.058 ms
(6 rows)
It looks like the columns besides `ps_supplycost` and `ps_availqty` are not
necessary, but they are all fetched from the tuples at once. For row-based
storage such as heap this is fine, but for column-based storage it would
result in unnecessary overhead and impact performance. Is there any plan to
optimize here?

Thanks.