Re: pg_upgrade and logical replication

2023-06-30 Thread vignesh C
On Mon, 24 Apr 2023 at 12:52, Julien Rouhaud  wrote:
>
> Hi,
>
> On Tue, Apr 18, 2023 at 01:40:51AM +, Hayato Kuroda (Fujitsu) wrote:
> >
> > I found a cfbot failure on macOS [1]. According to the log,
> > "SELECT count(*) FROM t2" was executed before synchronization was done.
> >
> > ```
> > [09:24:21.018](0.132s) not ok 18 - Table t2 should now have 3 rows on the 
> > new subscriber
> > ```
> >
> > With the patch present, wait_for_catchup() is executed after REFRESH, but
> > it may not be sufficient because it does not check pg_subscription_rel.
> > wait_for_subscription_sync() seems better for the purpose.
>
> Fixed, thanks!
>
> v5 attached with all previously mentioned fixes.

Few comments:
1) Should we document this command:
+		case ALTER_SUBSCRIPTION_ADD_TABLE:
+			{
+				if (!IsBinaryUpgrade)
+					ereport(ERROR,
+							(errcode(ERRCODE_SYNTAX_ERROR)),
+							errmsg("ALTER SUBSCRIPTION ... ADD TABLE is not supported"));
+
+				supported_opts = SUBOPT_RELID | SUBOPT_STATE | SUBOPT_LSN;
+				parse_subscription_options(pstate, stmt->options,
+										   supported_opts, &opts);
+
+				/* relid and state should always be provided. */
+				Assert(IsSet(opts.specified_opts, SUBOPT_RELID));
+				Assert(IsSet(opts.specified_opts, SUBOPT_STATE));
+
+				AddSubscriptionRelState(subid, opts.relid, opts.state,
+										opts.lsn);
+

Should we document something like:
This command is for use by in-place upgrade utilities. Its use for
other purposes is not recommended or supported. The behavior of the
option may change in future releases without notice.

2) Similarly in pg_dump too:
@@ -431,6 +431,7 @@ main(int argc, char **argv)
 		{"table-and-children", required_argument, NULL, 12},
 		{"exclude-table-and-children", required_argument, NULL, 13},
 		{"exclude-table-data-and-children", required_argument, NULL, 14},
+		{"preserve-subscription-state", no_argument, &preserve_subscriptions, 1},


Should we document something like:
This command is for use by in-place upgrade utilities. Its use for
other purposes is not recommended or supported. The behavior of the
option may change in future releases without notice.

3) The same error is possible for a table in ready state but with an
invalid remote_lsn; should we include that case in the error message too:
+   if (is_error)
+   pg_fatal("--preserve-subscription-state is incompatible with "
+"subscription relations in non-ready state");
+
+   check_ok();
+}

Regards,
Vignesh




Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-06-30 Thread Thomas Munro
On Sat, May 20, 2023 at 10:01 AM Jacob Champion  wrote:
> - The client implementation is currently epoll-/Linux-specific. I think
> kqueue shouldn't be too much trouble for the BSDs, but it's even more
> code to maintain.

I guess you also need a fallback that uses plain old POSIX poll()?  I
see you're not just using epoll but also timerfd.  Could that be
converted to plain old timeout bookkeeping?  That should be enough to
get every other Unix and *possibly* also Windows to work with the same
code path.

> - Unless someone is aware of some amazing Winsock magic, I'm pretty sure
> the multiplexed-socket approach is dead in the water on Windows. I think
> the strategy there probably has to be a background thread plus a fake
> "self-pipe" (loopback socket) for polling... which may be controversial?

I am not a Windows user or hacker, but there are certainly several
ways to multiplex sockets.  First there is the WSAEventSelect() +
WaitForMultipleObjects() approach that latch.c uses.  It has the
advantage that it allows socket readiness to be multiplexed with
various other things that use Windows "events".  But if you don't need
that, ie you *only* need readiness-based wakeup for a bunch of sockets
and no other kinds of fd or object, you can use winsock's plain old
select() or its fairly faithful poll() clone called WSAPoll().  It
looks a bit like that'd be true here if you could kill the timerfd?

It's a shame to write modern code using select(), but you can find
lots of shouting all over the internet about WSAPoll()'s defects, most
famously the cURL guys[1] whose blog is widely cited, so people still
do it.  Possibly some good news on that front:  by my reading of the
docs, it looks like that problem was fixed in Windows 10 2004[2] which
itself is by now EOL, so all systems should have the fix?  I suspect
that means that, finally, you could probably just use the same poll()
code path for Unix (when epoll is not available) *and* Windows these
days, making porting a lot easier.  But I've never tried it, so I
don't know what other problems there might be.  Another thing people
complain about is the lack of socketpair() or similar in winsock which
means you unfortunately can't easily make anonymous
select/poll-compatible local sockets, but that doesn't seem to be
needed here.

[1] https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/
[2] 
https://learn.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-wsapoll




Re: RFC: pg_stat_logmsg

2023-06-30 Thread Pavel Stehule
Hi

On Sat, Jul 1, 2023 at 1:57 AM Joe Conway wrote:

> Greetings,
>
> Attached please find a tarball (rather than a patch) for a proposed new
> contrib extension, pg_stat_logmsg.
>
> The basic idea is to mirror how pg_stat_statements works, except the
> logged messages keyed by filename, lineno, and elevel are saved with an
> aggregate count. The format string is displayed (similar to a query
> jumble) for context, along with function name and sqlerrcode.
>
> I threw this together rather quickly over the past couple of days
> between meetings, so not claiming that it is committable (and lacks
> documentation and regression tests as well), but I would love to get
> feedback on:
>
> 1/ the general concept
> 2/ the pg_stat_statement-like implementation
> 3/ contrib vs core vs external project
>
> Some samples and data:
>
> `make installcheck` with the extension loaded:
> 8<--
> # All 215 tests passed.
>
>
> real    2m24.854s
> user    0m0.086s
> sys     0m0.283s
> 8<--
>
> `make installcheck` without the extension loaded:
> 8<--
>
> # All 215 tests passed.
>
> real    2m26.765s
> user    0m0.076s
> sys     0m0.293s
> 8<--
>
> Sample output after running make installcheck a couple times (plus a few
> manually generated ERRORs):
>
> 8<--
> test=# select sum(count) from pg_stat_logmsg where elevel > 20;
>sum
> ---
>   10554
> (1 row)
>
> test=# \x
> Expanded display is on.
> test=# select * from pg_stat_logmsg where elevel > 20 order by count desc;
> -[ RECORD 1 ]---
> filename   | aclchk.c
> lineno | 2811
> elevel | 21
> funcname   | aclcheck_error
> sqlerrcode | 42501
> message| permission denied for schema %s
> count  | 578
> -[ RECORD 2 ]---
> filename   | scan.l
> lineno | 1241
> elevel | 21
> funcname   | scanner_yyerror
> sqlerrcode | 42601
> message| %s at or near "%s"
> count  | 265
> ...
>
> test=# select * from pg_stat_logmsg where elevel > 20 and sqlerrcode =
> 'XX000';
> -[ RECORD 1 ]---
> filename   | tid.c
> lineno | 352
> elevel | 21
> funcname   | currtid_for_view
> sqlerrcode | XX000
> message| ctid isn't of type TID
> count  | 2
> -[ RECORD 2 ]---
> filename   | pg_locale.c
> lineno | 2493
> elevel | 21
> funcname   | pg_ucol_open
> sqlerrcode | XX000
> message| could not open collator for locale "%s": %s
> count  | 2
> ...
>
> 8<--
>
> Part of the thinking is that people with fleets of postgres instances
> can use this to scan for various errors that they care about.
> Additionally it would be useful to look for "should not happen" errors.
>
> I will register this in the July CF and will appreciate feedback.
>

This can be a very interesting feature. I like it.

Regards

Pavel


> Thanks!
>
> --
> Joe Conway
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com


Re: Extensible storage manager API - SMGR hook Redux

2023-06-30 Thread Andres Freund
Hi,

On 2023-06-30 14:26:44 +0200, Matthias van de Meent wrote:
> diff --git a/src/backend/postmaster/postmaster.c 
> b/src/backend/postmaster/postmaster.c
> index 4c49393fc5..8685b9fde6 100644
> --- a/src/backend/postmaster/postmaster.c
> +++ b/src/backend/postmaster/postmaster.c
> @@ -1002,6 +1002,11 @@ PostmasterMain(int argc, char *argv[])
>*/
>   ApplyLauncherRegister();
>  
> + /*
> +  * Register built-in managers that are not part of static arrays
> +  */
> + register_builtin_dynamic_managers();
> +
>   /*
>* process any libraries that should be preloaded at postmaster start
>*/

That doesn't strike me as a good place to initialize this; we'll need it in
multiple places that way.  How about putting it into BaseInit()?


> -static const f_smgr smgrsw[] = {
> +static f_smgr *smgrsw;

This adds another level of indirection. I would rather limit the number of
registerable smgrs than do that.



> +SMgrId
> +smgr_register(const f_smgr *smgr, Size smgrrelation_size)
> +{

> + MemoryContextSwitchTo(old);
> +
> + pg_compiler_barrier();

Huh, what's that about?


> @@ -59,14 +63,8 @@ typedef struct SMgrRelationData
>* Fields below here are intended to be private to smgr.c and its
>* submodules.  Do not touch them from elsewhere.
>*/
> - int smgr_which; /* storage manager 
> selector */
> -
> - /*
> -  * for md.c; per-fork arrays of the number of open segments
> -  * (md_num_open_segs) and the segments themselves (md_seg_fds).
> -  */
> - int md_num_open_segs[MAX_FORKNUM + 1];
> - struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
> + SMgrId  smgr_which; /* storage manager selector */
> + int smgrrelation_size;  /* size of this struct, 
> incl. smgr-specific data */

It looked to me like you determined this globally - why do we need it in every
entry then?

Greetings,

Andres Freund




Re: Synchronizing slots from primary to standby

2023-06-30 Thread Amit Kapila
On Thu, Jun 29, 2023 at 3:52 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> 2. I want to confirm the reason why new replication command is added.
>

Are you referring to the LIST_SLOTS command? If so, I don't think that is
required; instead, we can use a query to fetch the required
information.

> IIUC the launcher connects to primary by using the primary_conninfo
> connection string, but it establishes a physical replication connection,
> so SQL cannot be executed over it. Is that right? Another approach, though
> not smart, would be to specify the target database via a GUC. What do you
> think?
> 3. You chose the per-db worker approach; however, that makes it difficult
> to extend the feature to support physical slots. This may be problematic.
> Was there a reason for that? I suspected ReplicationSlotCreate() or the
> advance functions might not be usable from other databases and that may be
> the reason, but I am not sure. If these operations can be done without
> connecting to a specific database, I think the architecture could be
> changed.
>

I think this point needs some investigation but do we want just one
worker that syncs all the slots? That may lead to lag in keeping the
slots up-to-date. We probably need some tests.

> 4. Currently the launcher establishes the connection every time. Isn't it 
> better
>to reuse the same one instead?
>

I feel it is not the launcher but a separate sync slot worker that
establishes the connection. It is not clear to me what exactly you
have in mind. Can you please explain a bit more?

> The following comments assume this configuration, maybe the most straightforward one:
>
> primary->standby
>|->subscriber
>
> 5. After constructing the system, I dropped the subscription on the 
> subscriber.
>In this case the logical slot on primary was removed, but that was not 
> replicated
>to standby server. Did you support the workload or not?
>

This should work.

>
> 6. The current approach may delay the start of synchronization.
>
> Assume the physical replication system is created first, and the
> subscriber then connects to the publisher node. In this case the launcher
> connects to primary earlier than the apply worker, and reads the slots. At
> that time there are no slots on primary, so the launcher disconnects from
> primary and waits for a period (up to 3 min). Even if the apply worker
> then creates the slot on the publisher, the launcher on the standby cannot
> notice that. The synchronization may start 3 min later.
>

I feel this should be based on some GUC like
'wal_retrieve_retry_interval' which we are already using in the
launcher or probably a new one if that doesn't seem to match.

-- 
With Regards,
Amit Kapila.




Re: cataloguing NOT NULL constraints

2023-06-30 Thread Andres Freund
Hi,

On 2023-06-30 13:44:03 +0200, Alvaro Herrera wrote:
> OK, so here's a new attempt to get this working correctly.

Thanks for continuing to work on this!


> The main novelty in this version of the patch, is that we now emit
> "throwaway" NOT NULL constraints when a column is part of the primary
> key.  Then, after the PK is created, we run a DROP for that constraint.
> That lets us create the PK without having to scan the table during
> pg_upgrade.

Have you considered extending the DDL statement for this purpose? We have
  ALTER TABLE ... ADD CONSTRAINT ... PRIMARY KEY USING INDEX ...;
we could just do something similar for the NOT NULL constraint?  That would
then delete the separate NOT NULL constraint.

Greetings,

Andres Freund




RFC: pg_stat_logmsg

2023-06-30 Thread Joe Conway

Greetings,

Attached please find a tarball (rather than a patch) for a proposed new 
contrib extension, pg_stat_logmsg.


The basic idea is to mirror how pg_stat_statements works, except the
logged messages keyed by filename, lineno, and elevel are saved with an
aggregate count. The format string is displayed (similar to a query
jumble) for context, along with function name and sqlerrcode.


I threw this together rather quickly over the past couple of days 
between meetings, so not claiming that it is committable (and lacks 
documentation and regression tests as well), but I would love to get 
feedback on:


1/ the general concept
2/ the pg_stat_statement-like implementation
3/ contrib vs core vs external project

Some samples and data:

`make installcheck` with the extension loaded:
8<--
# All 215 tests passed.


real    2m24.854s
user    0m0.086s
sys     0m0.283s
8<--

`make installcheck` without the extension loaded:
8<--

# All 215 tests passed.

real    2m26.765s
user    0m0.076s
sys     0m0.293s
8<--

Sample output after running make installcheck a couple times (plus a few 
manually generated ERRORs):


8<--
test=# select sum(count) from pg_stat_logmsg where elevel > 20;
  sum
---
 10554
(1 row)

test=# \x
Expanded display is on.
test=# select * from pg_stat_logmsg where elevel > 20 order by count desc;
-[ RECORD 1 ]---
filename   | aclchk.c
lineno | 2811
elevel | 21
funcname   | aclcheck_error
sqlerrcode | 42501
message| permission denied for schema %s
count  | 578
-[ RECORD 2 ]---
filename   | scan.l
lineno | 1241
elevel | 21
funcname   | scanner_yyerror
sqlerrcode | 42601
message| %s at or near "%s"
count  | 265
...

test=# select * from pg_stat_logmsg where elevel > 20 and sqlerrcode = 
'XX000';

-[ RECORD 1 ]---
filename   | tid.c
lineno | 352
elevel | 21
funcname   | currtid_for_view
sqlerrcode | XX000
message| ctid isn't of type TID
count  | 2
-[ RECORD 2 ]---
filename   | pg_locale.c
lineno | 2493
elevel | 21
funcname   | pg_ucol_open
sqlerrcode | XX000
message| could not open collator for locale "%s": %s
count  | 2
...

8<--

Part of the thinking is that people with fleets of postgres instances 
can use this to scan for various errors that they care about. 
Additionally it would be useful to look for "should not happen" errors.


I will register this in the July CF and will appreciate feedback.

Thanks!

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

pg_stat_logmsg-000.tgz
Description: application/compressed-tar


Re: Initdb-time block size specification

2023-06-30 Thread Andres Freund
Hi,

On 2023-06-30 16:28:59 -0500, David Christensen wrote:
> On Fri, Jun 30, 2023 at 3:29 PM Andres Freund  wrote:
> > And indeed. Comparing e.g. TPC-H, I see *massive* regressions.  Some
> > queries are the same, but others regress by up to 70% (although more
> > commonly around 10-20%).
>
> Hmm, that is definitely not good.
>
> > That's larger than I thought, which makes me suspect that there's some bug 
> > in
> > the new code.
>
> Will do a little profiling here to see if I can figure out the
> regression. Which build optimization settings are you seeing this
> under?

gcc 12 with:

meson setup \
  -Doptimization=3 -Ddebug=true \
  -Dc_args="-ggdb -g3 -march=native -mtune=native -fno-plt -fno-semantic-interposition -Wno-array-bounds" \
  -Dc_link_args="-fuse-ld=mold -Wl,--gdb-index,--Bsymbolic" \
  ...

Relevant postgres settings:
-c huge_pages=on -c shared_buffers=12GB -c max_connections=120
-c work_mem=32MB
-c autovacuum=0 # I always do that for comparative benchmarks, too much variance
-c track_io_timing=on

The later run where I saw the smaller regression was with work_mem=1GB.  I
just had forgotten to adjust that.


I had loaded tpch scale 5 before, which is why I just used that.


FWIW, even just "SELECT count(*) FROM lineitem;" shows a substantial
regression.

I disabled parallelism, prewarmed the data and pinned postgres to a single
core to reduce noise. The result is the best of three (variance was low in all
cases).

 HEAD   patch
index only scan  1896.364   2242.288
seq scan 1586.990   1628.042


A profile shows that 20% of the runtime in the IOS case is in
visibilitymap_get_status():

+   20.50%  postgres.new  postgres.new  [.] visibilitymap_get_status
+   19.54%  postgres.new  postgres.new  [.] ExecInterpExpr
+   14.47%  postgres.new  postgres.new  [.] IndexOnlyNext
+6.47%  postgres.new  postgres.new  [.] index_deform_tuple_internal
+4.67%  postgres.new  postgres.new  [.] ExecScan
+4.12%  postgres.new  postgres.new  [.] btgettuple
+3.97%  postgres.new  postgres.new  [.] ExecAgg
+3.92%  postgres.new  postgres.new  [.] _bt_next
+3.71%  postgres.new  postgres.new  [.] _bt_readpage
+3.43%  postgres.new  postgres.new  [.] fetch_input_tuple
+2.87%  postgres.new  postgres.new  [.] index_getnext_tid
+2.45%  postgres.new  postgres.new  [.] MemoryContextReset
+2.35%  postgres.new  postgres.new  [.] tts_virtual_clear
+1.37%  postgres.new  postgres.new  [.] index_deform_tuple
+1.14%  postgres.new  postgres.new  [.] ExecStoreVirtualTuple
+1.13%  postgres.new  postgres.new  [.] PredicateLockPage
+1.12%  postgres.new  postgres.new  [.] int8inc
+1.04%  postgres.new  postgres.new  [.] ExecIndexOnlyScan
+0.57%  postgres.new  postgres.new  [.] BufferGetBlockNumber

mostly due to

   │  BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
  2.46 │lea-0x60(,%rdx,4),%rcx
   │xor%edx,%edx
 59.79 │div%rcx


You can't have divisions for this kind of thing in the vicinity of a
performance-critical path. With compile-time constants the compiler can turn
this into shifts, but that's not possible as-is after the change.

While not quite as bad as divisions, the paths with multiplications are also
not going to be ok. E.g.
return (Block) (BufferBlocks + ((Size) (buffer - 1)) * 
CLUSTER_BLOCK_SIZE);
is going to be noticeable.


You'd have to turn all of this into shifts (and enforce power of 2 sizes, if
you aren't yet).


I don't think pre-computed tables are a viable answer FWIW. Even just going
through a memory indirection is going to be noticable. This stuff is in a
crapton of hot code paths.

Greetings,

Andres Freund




Re: Initdb-time block size specification

2023-06-30 Thread Bruce Momjian
On Sat, Jul  1, 2023 at 01:18:34AM +0200, Tomas Vondra wrote:
> What does "smartctl -a /dev/nvme0n1" say? There should be something like
> this:
> 
> Supported LBA Sizes (NSID 0x1)
> Id Fmt  Data  Metadt  Rel_Perf
>  0 -    4096      0         0
>  1 +     512      0         0
> 
> which says the drive supports 4k and 512B sectors, and is currently
> configures to use 512B sectors.

It says:

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512      0         2
 1 -    4096      0         0

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Initdb-time block size specification

2023-06-30 Thread Tomas Vondra
On 7/1/23 01:16, Bruce Momjian wrote:
> On Fri, Jun 30, 2023 at 04:04:57PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2023-06-30 18:58:20 -0400, Bruce Momjian wrote:
>>>> [1] On linux I think you need to use stat() to figure out the st_dev for a
>>>> file, then look in /proc/self/mountinfo for the block device, use the name
>>>> of the file to look in /sys/block/$d/queue/physical_block_size.
>>>
>>> I just got a new server:
>>>
>>> https://momjian.us/main/blogs/blog/2023.html#June_28_2023
>>>
>>> so tested this on my new M.2 NVME storage device:
>>>
>>> $ cat /sys/block/nvme0n1/queue/physical_block_size
>>> 262144
>>
>> Ah, I got the relevant filename wrong. I think it's logical_block_size, not
>> physical one (that's the size of addressing).  I didn't realize because the
>> devices I looked at have the same...
> 
> That one reports 512 _bytes_ for me:
> 
>   $ cat /sys/block/nvme0n1/queue/logical_block_size
>   512
> 

What does "smartctl -a /dev/nvme0n1" say? There should be something like
this:

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -    4096      0         0
 1 +     512      0         0

which says the drive supports 4k and 512B sectors, and is currently
configures to use 512B sectors.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Initdb-time block size specification

2023-06-30 Thread Bruce Momjian
On Fri, Jun 30, 2023 at 04:04:57PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2023-06-30 18:58:20 -0400, Bruce Momjian wrote:
> > > [1] On linux I think you need to use stat() to figure out the st_dev for a
> > > file, then look in /proc/self/mountinfo for the block device, use the name
> > > of the file to look in /sys/block/$d/queue/physical_block_size.
> >
> > I just got a new server:
> >
> > https://momjian.us/main/blogs/blog/2023.html#June_28_2023
> >
> > so tested this on my new M.2 NVME storage device:
> >
> > $ cat /sys/block/nvme0n1/queue/physical_block_size
> > 262144
> 
> Ah, I got the relevant filename wrong. I think it's logical_block_size, not
> physical one (that's the size of addressing).  I didn't realize because the
> devices I looked at have the same...

That one reports 512 _bytes_ for me:

$ cat /sys/block/nvme0n1/queue/logical_block_size
512

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Initdb-time block size specification

2023-06-30 Thread Tomas Vondra
On 7/1/23 00:59, Andres Freund wrote:
> On 2023-06-30 18:37:39 -0400, Bruce Momjian wrote:
>> On Sat, Jul  1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote:
>>> On 6/30/23 23:53, Bruce Momjian wrote:
>>>> For a 4kB write, to say it is not partially written would be to require
>>>> the operating system to guarantee that the 4kB write is not split into
>>>> smaller writes which might each be atomic because smaller atomic writes
>>>> would not help us.
>>>
>>> Right, that's the dance we do to protect against torn pages. But Andres
>>> suggested that if you have modern storage and configure it correctly,
>>> writing with 4kB pages would be atomic. So we wouldn't need to do this
>>> FPI stuff, eliminating pretty significant source of write amplification.
>>
>> I agree the hardware is atomic for 4k writes, but do we know the OS
>> always issues 4k writes?
> 
> When using a sector size of 4K you *can't* make smaller writes via normal
> paths. The addressing unit is in sectors. The details obviously differ between
> storage protocol, but you pretty much always just specify a start sector and a
> number of sectors to be operated on.
> 
> Obviously the kernel could read 4k, modify 512 bytes in-memory, and then write
> 4k back, but that shouldn't be a danger here.  There might also be debug
> interfaces to allow reading/writing in different increments, but that'd not be
> something happening during normal operation.

I think it's important to point out that there are physical and logical
sector sizes. The "physical" size is what the drive uses internally; the
"logical" size defines what the OS does.

Some drives have 4k physical sectors but only 512B logical sectors.
AFAIK most "old" SATA SSDs do it that way, for compatibility reasons.

New drives may have 4k physical sectors but typically support both 512B
and 4k logical sectors - my nvme SSDs do this, for example.

My understanding is that for drives with 4k physical+logical sectors,
the OS would only issue "full" 4k writes.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Initdb-time block size specification

2023-06-30 Thread Andres Freund
Hi,

On 2023-06-30 18:58:20 -0400, Bruce Momjian wrote:
> > [1] On linux I think you need to use stat() to figure out the st_dev for a
> > file, then look in /proc/self/mountinfo for the block device, use the name
> > of the file to look in /sys/block/$d/queue/physical_block_size.
>
> I just got a new server:
>
>   https://momjian.us/main/blogs/blog/2023.html#June_28_2023
>
> so tested this on my new M.2 NVME storage device:
>
>   $ cat /sys/block/nvme0n1/queue/physical_block_size
>   262144

Ah, I got the relevant filename wrong. I think it's logical_block_size, not
physical one (that's the size of addressing).  I didn't realize because the
devices I looked at have the same...

Regards,

Andres




Re: Initdb-time block size specification

2023-06-30 Thread Bruce Momjian
On Fri, Jun 30, 2023 at 06:58:20PM -0400, Bruce Momjian wrote:
> I just got a new server:
> 
>   https://momjian.us/main/blogs/blog/2023.html#June_28_2023
> 
> so tested this on my new M.2 NVME storage device:
> 
> $ cat /sys/block/nvme0n1/queue/physical_block_size
>   262144
> 
> that's 256k, not 4k.

I have another approach to this.  My storage device has power
protection, so even though it has a 256k physical block size, it should
be fine with 4k write atomicity.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Initdb-time block size specification

2023-06-30 Thread Andres Freund
On 2023-06-30 18:37:39 -0400, Bruce Momjian wrote:
> On Sat, Jul  1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote:
> > On 6/30/23 23:53, Bruce Momjian wrote:
> > > For a 4kB write, to say it is not partially written would be to require
> > > the operating system to guarantee that the 4kB write is not split into
> > > smaller writes which might each be atomic because smaller atomic writes
> > > would not help us.
> > 
> > Right, that's the dance we do to protect against torn pages. But Andres
> > suggested that if you have modern storage and configure it correctly,
> > writing with 4kB pages would be atomic. So we wouldn't need to do this
> > FPI stuff, eliminating pretty significant source of write amplification.
> 
> I agree the hardware is atomic for 4k writes, but do we know the OS
> always issues 4k writes?

When using a sector size of 4K you *can't* make smaller writes via normal
paths. The addressing unit is in sectors. The details obviously differ between
storage protocol, but you pretty much always just specify a start sector and a
number of sectors to be operated on.

Obviously the kernel could read 4k, modify 512 bytes in-memory, and then write
4k back, but that shouldn't be a danger here.  There might also be debug
interfaces to allow reading/writing in different increments, but that'd not be
something happening during normal operation.




Re: Initdb-time block size specification

2023-06-30 Thread Bruce Momjian
On Fri, Jun 30, 2023 at 03:51:18PM -0700, Andres Freund wrote:
> > For a 4kB write, to say it is not partially written would be to require
> > the operating system to guarantee that the 4kB write is not split into
> > smaller writes which might each be atomic because smaller atomic writes
> > would not help us.
> 
That's why we're talking about drives with 4k sector size - you *can't* split
the writes below that.

Okay, good point.

The problem is that, as far as I know, it's not always obvious what block size
is being used at the actual storage level.  It's not even trivial when
operating on a filesystem stored directly on a single block device ([1]). Once
there are things like LVM or disk encryption involved, it gets pretty hairy
([2]).  Once you know all the block devices, it's not too bad, but ...
> 
> Greetings,
> 
> Andres Freund
> 
> [1] On linux I think you need to use stat() to figure out the st_dev for a
> file, then look in /proc/self/mountinfo for the block device, use the name
> of the file to look in /sys/block/$d/queue/physical_block_size.

I just got a new server:

https://momjian.us/main/blogs/blog/2023.html#June_28_2023

so tested this on my new M.2 NVME storage device:

$ cat /sys/block/nvme0n1/queue/physical_block_size
262144

that's 256k, not 4k.

> [2] The above doesn't work because e.g. a device mapper target might only
> support 4k sectors, even though the sectors on the underlying storage device
> are 512b sectors. E.g. my root filesystem is encrypted, and if you follow the
> above recipe (with the added step of resolving the symlink to know the actual
> device name), you would see a 4k sector size.  Even though the underlying NVMe
> disk only supports 512b sectors.

Good point.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Initdb-time block size specification

2023-06-30 Thread Tomas Vondra
On 7/1/23 00:05, Andres Freund wrote:
> Hi,
> 
> On 2023-06-30 23:27:45 +0200, Tomas Vondra wrote:
>> On 6/30/23 23:11, Andres Freund wrote:
>>> ...
>>>
>>> If we really wanted to do this - but I don't think we do - I'd argue for
>>> working on the buildsystem support to build the postgres binary multiple
>>> times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
>>> exec's the relevant "real" binary based on the pg_control value.  I really
>>> don't see us ever wanting to make BLCKSZ runtime configurable within one
>>> postgres binary. There's just too much intrinsic overhead associated with
>>> that.
>>
>> I don't quite understand why we shouldn't do this (or at least try to).
>>
>> IMO the benefits of using smaller blocks were substantial (especially
>> for 4kB, most likely due matching the internal SSD page size). The other
>> benefits (reducing WAL volume) seem rather interesting too.
> 
> Mostly because I think there are bigger gains to be had elsewhere.
> 

I think that decision is up to whoever chooses to work on it, especially
if performance is not their primary motivation (IIRC this was discussed
as part of the TDE session).

> IME not a whole lot of storage ships by default with externally visible 4k
> sectors; it needs to be manually reformatted [1], which loses all data, so it
> has to be done initially.

I don't see why "you have to configure stuff" would be a reason against
improvements in this area. I don't know how prevalent storage with 4k
sectors is now, but AFAIK it's not hard to get and it's likely to get
yet more common in the future.

FWIW I don't think the benefits of different (lower) page sizes hinge on
4k sectors - it's just that not having to do FPIs would make it even
more interesting.

> Then postgres would also need OS specific trickery
> to figure out that indeed the IO stack is entirely 4k (checking sector size is
> not enough).

I haven't suggested we should be doing that automatically (that would be
nice, but I'd be happy with knowing when it's safe to disable FPW using
the GUC in config). But knowing when that is safe would make it yet more
interesting to be able to use a different block size at initdb.

> And you run into the issue that suddenly the #column and
> index-tuple-size limits are lower, which won't make it easier.
> 

True. This limit is annoying, but no one is proposing to change the
default page size. initdb would just provide a more convenient way to do
that, but the user would have to check. (I rather doubt many people
actually index such large values).

> 
> I think we should change the default of the WAL blocksize to 4k
> though. There's practically no downsides, and it drastically reduces
> postgres-side write amplification in many transactional workloads, by only
> writing out partially filled 4k pages instead of partially filled 8k pages.
> 

+1 (although in my tests the benefits were much smaller than for BLCKSZ)

> 
>> Sure, there are challenges (e.g. the overhead due to making it dynamic).
>> No doubt about that.
> 
> I don't think the runtime-dynamic overhead is avoidable with reasonable effort
> (leaving aside compiling code multiple times and switching between).
> 
> If we were to start building postgres for multiple compile-time settings, I
> think there are uses other than switching between BLCKSZ, potentially more
> interesting. E.g. you can see substantially improved performance by being able
> to use SSE4.2 without CPU dispatch (partially because it allows compiler
> autovectorization, partially because it allows the compiler to use newer
> non-vectorized math instructions (when targeting AVX IIRC), partially because
> the dispatch overhead is not insubstantial).  Another example: ARMv8
> performance is substantially better if you target ARMv8.1-A instead of
> ARMv8.0, due to having atomic instructions instead of LL/SC (it still baffles
> me that they didn't do this earlier, ll/sc is just inherently inefficient).
> 

Maybe, although I think it depends on what parts of the code this would
affect. If it's sufficiently small/isolated, it'd be possible to have
multiple paths, specialized to a particular page size (pretty common
technique for GPU/SIMD, I believe).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Initdb-time block size specification

2023-06-30 Thread Andres Freund
Hi,

On 2023-06-30 17:53:34 -0400, Bruce Momjian wrote:
> On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote:
> > On 6/30/23 23:11, Andres Freund wrote:
> > > I suspect you're going to see more benefits from going to a *lower* 
> > > setting
> > > than a higher one. Some practical issues aside, plenty of storage hardware
> > > these days would allow to get rid of FPIs if you go to 4k blocks 
> > > (although it
> > > often requires explicit sysadmin action to reformat the drive into that 
> > > mode
> > > etc).  But obviously that's problematic from the "postgres limits" POV.
> > >
> >
> > I wonder what are the conditions/options for disabling FPI. I kinda
> > assume it'd apply to new drives with 4k sectors, with properly aligned
> > partitions etc. But I haven't seen any particularly clear confirmation
> > that's correct.
>
> I don't think we have ever had to study this --- we just request the
> write to the operating system, and we either get a successful reply or
> we go into WAL recovery to reread the pre-image.  We never really care
> if the write is atomic, e.g., an 8k write can be done in 2 4kB writes or 4
> 2kB writes --- we don't care --- we only care if they are all done or
> not.

Well, that works because we have FPI. This sub-discussion is motivated by
getting rid of FPIs.


> For a 4kB write, to say it is not partially written would require the
> operating system to guarantee that the 4kB write is not split into
> smaller writes, each of which might be atomic --- smaller atomic writes
> would not help us.

That's why we're talking about drives with 4k sector size - you *can't* split
the writes below that.

The problem is that, as far as I know, it's not always obvious what block size
is being used on the actual storage level.  It's not even trivial when
operating on a filesystem directly stored on a single block device ([1]). Once
there are things like LVM or disk encryption involved, it gets pretty hairy
([2]).  Once you know all the block devices, it's not too bad, but ...

Greetings,

Andres Freund

[1] On linux I think you need to use stat() to figure out the st_dev for a
file, then look in /proc/self/mountinfo for the block device, use the name
of the file to look in /sys/block/$d/queue/physical_block_size.

[2] The above doesn't work because e.g. a device mapper target might only
support 4k sectors, even though the sectors on the underlying storage device
are 512b sectors. E.g. my root filesystem is encrypted, and if you follow the
above recipe (with the added step of resolving the symlink to know the actual
device name), you would see a 4k sector size.  Even though the underlying NVMe
disk only supports 512b sectors.
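The recipe in [1] can be sketched as a small script. This is a rough illustration only: the mountinfo field positions and the /sys paths are assumptions about the usual Linux layout, and per [2] it does not see through device-mapper/LVM/dm-crypt stacks.

```shell
# Hedged sketch of the lookup described in [1] (Linux only).
f=${1:-.}
devno=$(stat -c '%Hd:%Ld' "$f")    # major:minor of the filesystem's st_dev
# In /proc/self/mountinfo, field 3 is "major:minor" and the mount source
# is two fields past the "-" separator.
src=$(awk -v d="$devno" '$3 == d {
        for (i = 1; i <= NF; i++) if ($i == "-") { print $(i + 2); exit }
      }' /proc/self/mountinfo)
dev=$(basename "$src" 2>/dev/null)
# If it names a partition, walk up sysfs to the whole-disk device.
sys="/sys/class/block/$dev"
if [ -e "$sys/partition" ]; then
    dev=$(basename "$(readlink -f "$sys/..")")
fi
size=$(cat "/sys/block/$dev/queue/physical_block_size" 2>/dev/null)
echo "${size:-unknown}"
```

On a plain filesystem on a whole disk this prints the physical sector size; on overlayfs, dm targets, etc. it falls back to "unknown" rather than chasing the stack.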




Re: Initdb-time block size specification

2023-06-30 Thread Bruce Momjian
On Sat, Jul  1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote:
> On 6/30/23 23:53, Bruce Momjian wrote:
> > For a 4kB write, to say it is not partially written would require the
> > operating system to guarantee that the 4kB write is not split into
> > smaller writes, each of which might be atomic --- smaller atomic writes
> > would not help us.
> 
> Right, that's the dance we do to protect against torn pages. But Andres
> suggested that if you have modern storage and configure it correctly,
> writing with 4kB pages would be atomic. So we wouldn't need to do this
FPI stuff, eliminating a pretty significant source of write amplification.

I agree the hardware is atomic for 4k writes, but do we know the OS
always issues 4k writes?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

2023-06-30 Thread Bruce Momjian
On Fri, Jun 30, 2023 at 03:18:03PM -0700, Nikolay Samokhvalov wrote:
> I wonder, if we ensure that standbys are fully caught up before upgrading the
> primary, if we check the latest checkpoint positions, are we good to use 
> "rsync
> --size-only", or are there still some concerns? It seems so to me, but maybe
> I'm missing something.

Yes, I think you are correct.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Should we remove db_user_namespace?

2023-06-30 Thread Nathan Bossart
On Sat, Jul 01, 2023 at 12:13:26AM +0200, Magnus Hagander wrote:
> Strong +1 from here for removing it, assuming you don't find a bunch
> of users on -general who are using it. Having never come across one
> myself, I think it's unlikely, but I agree it's good to ask.

Cool.  I'll let that thread [0] sit for a while.

[0] https://postgr.es/m/20230630215608.GD2941194%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Initdb-time block size specification

2023-06-30 Thread Andres Freund
Hi,

On 2023-06-30 23:42:30 +0200, Tomas Vondra wrote:
> I wonder what are the conditions/options for disabling FPI. I kinda
> assume it'd apply to new drives with 4k sectors, with properly aligned
> partitions etc. But I haven't seen any particularly clear confirmation
> that's correct.

Yea, it's not trivial. And the downsides are also substantial from a
replication / crash recovery performance POV - even with reading blocks ahead
of WAL replay, it's hard to beat just sequentially reading nearly all the data
you're going to need.


> On 6/30/23 23:11, Andres Freund wrote:
> > If we really wanted to do this - but I don't think we do - I'd argue for
> > working on the buildsystem support to build the postgres binary multiple
> > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> > exec's the relevant "real" binary based on the pg_control value.  I really
> > don't see us ever wanting to make BLCKSZ runtime configurable within one
> > postgres binary. There's just too much intrinsic overhead associated with
> > that.
>
> How would that work for extensions which may be built for a particular
> BLCKSZ value (like pageinspect)?

I think we'd need to do something similar for extensions, likely loading them
from a path that includes the "subvariant" the server currently is running. Or
alternatively adding a suffix to the filename indicating the
variant. Something like pageinspect.x86-64-v4-4kB.so.

The x86-64-v* stuff is something gcc and clang added a couple years ago, so
that not every project has to define different "baseline" levels. I think it
was done in collaboration with the System V/Linux AMD64 ABI specification group
([1]).

Greetings,

Andres

[1] 
https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/master/raw/x86-64-ABI/abi.pdf?job=build
section 3.1.1.




Re: Initdb-time block size specification

2023-06-30 Thread Tomas Vondra



On 6/30/23 23:53, Bruce Momjian wrote:
> On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote:
>>
>>
>> On 6/30/23 23:11, Andres Freund wrote:
>>> Hi,
>>>
>>> ...
>>>
>>> I suspect you're going to see more benefits from going to a *lower* setting
>>> than a higher one. Some practical issues aside, plenty of storage hardware
>>> these days would allow to get rid of FPIs if you go to 4k blocks (although 
>>> it
>>> often requires explicit sysadmin action to reformat the drive into that mode
>>> etc).  But obviously that's problematic from the "postgres limits" POV.
>>>
>>
>> I wonder what are the conditions/options for disabling FPI. I kinda
>> assume it'd apply to new drives with 4k sectors, with properly aligned
>> partitions etc. But I haven't seen any particularly clear confirmation
>> that's correct.
> 
> I don't think we have ever had to study this --- we just request the
> write to the operating system, and we either get a successful reply or
> we go into WAL recovery to reread the pre-image.  We never really care
> if the write is atomic, e.g., an 8k write can be done in 2 4kB writes or 4
> 2kB writes --- we don't care --- we only care if they are all done or
> not.
> 
> For a 4kB write, to say it is not partially written would require the
> operating system to guarantee that the 4kB write is not split into
> smaller writes, each of which might be atomic --- smaller atomic writes
> would not help us.
> 

Right, that's the dance we do to protect against torn pages. But Andres
suggested that if you have modern storage and configure it correctly,
writing with 4kB pages would be atomic. So we wouldn't need to do this
FPI stuff, eliminating a pretty significant source of write amplification.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

2023-06-30 Thread Nikolay Samokhvalov
On Fri, Jun 30, 2023 at 14:33 Bruce Momjian  wrote:

> On Fri, Jun 30, 2023 at 04:16:31PM -0400, Robert Haas wrote:
> > I'm not quite clear on how Nikolay got into trouble here. I don't
> > think I understand under exactly what conditions the procedure is
> > reliable and under what conditions it isn't. But there is no way in
> > heck I would ever advise anyone to use this procedure on a database
> > they actually care about. This is a great party trick or something to
> > show off in a lightning talk at PGCon, not something you ought to be
> > doing with valuable data that you actually care about.
>
> Well, it does get used, and if we remove it perhaps we can have it on
> our wiki and point to it from our docs.


In my case, we performed some additional writes on the primary before
running "pg_upgrade -k" and we did it *after* we shut down all the
standbys. So those changes were not replicated and then "rsync --size-only"
ignored them. (By the way, that cluster has wal_log_hints=on to allow
Patroni run pg_rewind when needed.)

But this can happen with anyone who follows the procedure from the docs as
is and doesn't do any additional steps, because in step 9 "Prepare for
standby server upgrades":

1) there is no requirement to follow specific order to shut down the nodes
   - "Streaming replication and log-shipping standby servers can remain
running until a later step" should probably be changed to a
requirement like "keep them running"

2) checking the latest checkpoint position with pg_controldata now looks
like a thing that is good to do, but with an unclear purpose -- it does
not seem to be used to support any decision
  - "There will be a mismatch if old standby servers were shut down before
the old primary or if the old standby servers are still running" should
probably be rephrased to say that if there is a mismatch, it's a big problem

So following the steps as is, if some writes on the primary are not
replicated (due to whatever reason) before execution of pg_upgrade -k +
rsync --size-only, then those writes are going to be silently lost on
standbys.

I wonder, if we ensure that standbys are fully caught up before upgrading
the primary, if we check the latest checkpoint positions, are we good to
use "rsync --size-only", or are there still some concerns? It seems so to
me, but maybe I'm missing something.

> --

Thanks,
Nikolay Samokhvalov
Founder, Postgres.ai


Re: Should we remove db_user_namespace?

2023-06-30 Thread Magnus Hagander
On Fri, Jun 30, 2023 at 11:43 PM Nathan Bossart
 wrote:
>
> On Fri, Jun 30, 2023 at 05:40:18PM -0400, Tom Lane wrote:
> > Might be worth asking on pgsql-general whether anyone knows of
> > active use of this feature.  If not, I'm good with killing it.
>
> Will do.

Strong +1 from here for removing it, assuming you don't find a bunch
of users on -general who are using it. Having never come across one
myself, I think it's unlikely, but I agree it's good to ask.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2023-06-30 Thread Thomas Munro
On Fri, Jun 30, 2023 at 10:47 PM Palak Chaturvedi
 wrote:
> pgbench=# select count(pg_buffercache_invalidate(bufferid)) from
> pg_buffercache where relfilenode =
> pg_relation_filenode('pgbench_accounts'::regclass);

Hi Palak,

Thanks for working on this!  I think this will be very useful for
testing existing workloads but also for testing future work on
prefetching with AIO (and DIO), work on putting SLRUs (or anything
else) into the buffer pool, nearby proposals for caching buffer
mapping information, etc etc.

Palak and I talked about this idea a bit last week (stimulated by a
recent thread[1], but the topic has certainly come up before), and we
discussed some different ways one could specify which pages are
dropped.  For example, perhaps the pg_prewarm extension could have an
'unwarm' option instead.  I personally thought the buffer ID-based
approach was quite good because it's extremely simple, while giving
the user the full power of SQL to say which buffers.   Half a table?
Visibility map?  Everything?  Root page of an index?  I think that's
probably better than something that requires more code and
complication but is less flexible in the end.  It feels like the right
level of rawness for something primarily of interest to hackers and
advanced users.  I don't think it matters that there is a window
between selecting a buffer ID and invalidating it, for the intended
use cases.  That's my vote, anyway, let's see if others have other
ideas...
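For instance, extending the query above, the kinds of selections mentioned might look something like the sketches below. These are illustrative only -- they assume the usual pg_buffercache view columns and the invalidate function from the patch, and note that for a btree index block 0 is the metapage, not the root (the actual root block number would have to be looked up first, e.g. with pageinspect's bt_metap()).

```sql
-- everything cached for one relation's visibility map fork (fork 2)
SELECT count(pg_buffercache_invalidate(bufferid))
  FROM pg_buffercache
 WHERE relfilenode = pg_relation_filenode('pgbench_accounts')
   AND relforknumber = 2;

-- just block 0 of an index (the btree metapage in this sketch)
SELECT count(pg_buffercache_invalidate(bufferid))
  FROM pg_buffercache
 WHERE relfilenode = pg_relation_filenode('pgbench_accounts_pkey')
   AND relblocknumber = 0;
```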

We also talked a bit about how one might control the kernel page cache
in more fine-grained ways for testing purposes, but it seems like the
pgfincore project has that covered with its pgfadvise_willneed() and
pgfadvise_dontneed().  IMHO that project could use more page-oriented
operations (instead of just counts and coarse-grained operations) but
that's something that could be material for patches to send to the
extension maintainers.  This work, in contrast, is more tangled up
with bufmgr.c internals, so it feels like this feature belongs in a
core contrib module.

Some initial thoughts on the patch:

I wonder if we should include a simple exercise in
contrib/pg_buffercache/sql/pg_buffercache.sql.  One problem is that
it's not guaranteed to succeed in general.  It doesn't wait for pins
to go away, and it doesn't retry cleaning dirty buffers after one
attempt, it just returns false, which I think is probably the right
approach, but it makes the behaviour too non-deterministic for simple
tests.  Perhaps it's enough to include an exercise where we call it a
few times to hit a couple of cases, but not verify what effect it has.

It should be restricted by role, but I wonder which role it should be.
Testing for superuser is now out of fashion.

Where the Makefile mentions 1.4--1.5.sql, the meson.build file needs
to do the same.  That's because PostgreSQL is currently in transition
from autoconf/gmake to meson/ninja[2], so for now we have to maintain
both build systems.  That's why it fails to build in some CI tasks[3].
You can enable CI in your own GitHub account if you want to run test
builds on several operating systems, see [4] for info.

[1] 
https://www.postgresql.org/message-id/flat/CAFSGpE3y_oMK1uHhcHxGxBxs%2BKrjMMdGrE%2B6HHOu0vttVET0UQ%40mail.gmail.com
[2] https://wiki.postgresql.org/wiki/Meson
[3] http://cfbot.cputube.org/palak-chaturvedi.html
[4] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob_plain;f=src/tools/ci/README;hb=HEAD




Re: Initdb-time block size specification

2023-06-30 Thread Andres Freund
Hi,

On 2023-06-30 23:27:45 +0200, Tomas Vondra wrote:
> On 6/30/23 23:11, Andres Freund wrote:
> > ...
> > 
> > If we really wanted to do this - but I don't think we do - I'd argue for
> > working on the buildsystem support to build the postgres binary multiple
> > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> > exec's the relevant "real" binary based on the pg_control value.  I really
> > don't see us ever wanting to make BLCKSZ runtime configurable within one
> > postgres binary. There's just too much intrinsic overhead associated with
> > that.
>
> I don't quite understand why we shouldn't do this (or at least try to).
>
> IMO the benefits of using smaller blocks were substantial (especially
> for 4kB, most likely due to matching the internal SSD page size). The other
> benefits (reducing WAL volume) seem rather interesting too.

Mostly because I think there are bigger gains to be had elsewhere.

IME not a whole lot of storage ships by default with externally visible 4k
sectors, but needs to be manually reformatted [1], which loses all data, so it
has to be done initially.  Then postgres would also need OS specific trickery
to figure out that indeed the IO stack is entirely 4k (checking sector size is
not enough).  And you run into the issue that suddenly the #column and
index-tuple-size limits are lower, which won't make it easier.


I think we should change the default of the WAL blocksize to 4k
though. There's practically no downsides, and it drastically reduces
postgres-side write amplification in many transactional workloads, by only
writing out partially filled 4k pages instead of partially filled 8k pages.


> Sure, there are challenges (e.g. the overhead due to making it dynamic).
> No doubt about that.

I don't think the runtime-dynamic overhead is avoidable with reasonable effort
(leaving aside compiling code multiple times and switching between).

If we were to start building postgres for multiple compile-time settings, I
think there are uses other than switching between BLCKSZ, potentially more
interesting. E.g. you can see substantially improved performance by being able
to use SSE4.2 without CPU dispatch (partially because it allows compiler
autovectorization, partially because it allows the compiler to use newer
non-vectorized math instructions (when targeting AVX IIRC), partially because
the dispatch overhead is not insubstantial).  Another example: ARMv8
performance is substantially better if you target ARMv8.1-A instead of
ARMv8.0, due to having atomic instructions instead of LL/SC (it still baffles
me that they didn't do this earlier, ll/sc is just inherently inefficient).

Greetings,

Andres Freund


[1] To see the supported LBA formats:
    for d in /dev/nvme*n1; do echo "$d:"; sudo nvme id-ns -H $d | grep '^LBA Format'; echo; done




Re: Initdb-time block size specification

2023-06-30 Thread Bruce Momjian
On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote:
> 
> 
> On 6/30/23 23:11, Andres Freund wrote:
> > Hi,
> >
> > ...
> > 
> > I suspect you're going to see more benefits from going to a *lower* setting
> > than a higher one. Some practical issues aside, plenty of storage hardware
> > these days would allow to get rid of FPIs if you go to 4k blocks (although 
> > it
> > often requires explicit sysadmin action to reformat the drive into that mode
> > etc).  But obviously that's problematic from the "postgres limits" POV.
> > 
> 
> I wonder what are the conditions/options for disabling FPI. I kinda
> assume it'd apply to new drives with 4k sectors, with properly aligned
> partitions etc. But I haven't seen any particularly clear confirmation
> that's correct.

I don't think we have ever had to study this --- we just request the
write to the operating system, and we either get a successful reply or
we go into WAL recovery to reread the pre-image.  We never really care
if the write is atomic, e.g., an 8k write can be done in 2 4kB writes or 4
2kB writes --- we don't care --- we only care if they are all done or
not.

For a 4kB write, to say it is not partially written would require the
operating system to guarantee that the 4kB write is not split into
smaller writes, each of which might be atomic --- smaller atomic writes
would not help us.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Should we remove db_user_namespace?

2023-06-30 Thread Nathan Bossart
On Fri, Jun 30, 2023 at 05:40:18PM -0400, Tom Lane wrote:
> Might be worth asking on pgsql-general whether anyone knows of
> active use of this feature.  If not, I'm good with killing it.

Will do.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Should we remove db_user_namespace?

2023-06-30 Thread Nathan Bossart
On Fri, Jun 30, 2023 at 05:29:04PM -0400, Bruce Momjian wrote:
> I am the original author, and it was a hack designed to support
> per-database user names.  I am fine with its removal.

Thanks, Bruce.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Initdb-time block size specification

2023-06-30 Thread Tomas Vondra



On 6/30/23 23:11, Andres Freund wrote:
> Hi,
>
> ...
> 
> I suspect you're going to see more benefits from going to a *lower* setting
> than a higher one. Some practical issues aside, plenty of storage hardware
> these days would allow to get rid of FPIs if you go to 4k blocks (although it
> often requires explicit sysadmin action to reformat the drive into that mode
> etc).  But obviously that's problematic from the "postgres limits" POV.
> 

I wonder what are the conditions/options for disabling FPI. I kinda
assume it'd apply to new drives with 4k sectors, with properly aligned
partitions etc. But I haven't seen any particularly clear confirmation
that's correct.

> 
> If we really wanted to do this - but I don't think we do - I'd argue for
> working on the buildsystem support to build the postgres binary multiple
> times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> exec's the relevant "real" binary based on the pg_control value.  I really
> don't see us ever wanting to make BLCKSZ runtime configurable within one
> postgres binary. There's just too much intrinsic overhead associated with
> that.
> 

How would that work for extensions which may be built for a particular
BLCKSZ value (like pageinspect)?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Initdb-time block size specification

2023-06-30 Thread David Christensen
On Fri, Jun 30, 2023 at 4:17 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-06-30 15:05:54 -0500, David Christensen wrote:
> > > I am fairly certain this is going to be causing substantial performance
> > > regressions.  I think we should reject this even if we don't immediately 
> > > find
> > > them, because it's almost guaranteed to cause some.
> >
> > What would be considered substantial? Some overhead would be expected,
> > but I think having an actual patch to evaluate lets us see what
> > potential there is.
>
> Anything beyond 1-2%, although even that imo is a hard sell.

I'd agree that that threshold seems like a reasonable target, and
anything much above that would be regressive.

> > > Besides this, I've not really heard any convincing justification for 
> > > needing
> > > this in the first place.
> >
> > Doing this would open up experiments in larger block sizes, so we
> > would be able to have larger indexable tuples, say, or be able to
> > store data types that are larger than currently supported for tuple
> > row limits without dropping to toast (native vector data types come to
> > mind as a candidate here).
>
> You can do experiments today with the compile time option. Which does not
> require regressing performance for everyone.

Sure, not arguing that this is more performant than the current approach.

> > We've had 8k blocks for a long time while hardware has improved over 20+
> > years, and it would be interesting to see how tuning things would open up
> > additional avenues for performance without requiring packagers to make a
> > single choice on this regardless of use-case.
>
> I suspect you're going to see more benefits from going to a *lower* setting
> than a higher one. Some practical issues aside, plenty of storage hardware
> these days would allow to get rid of FPIs if you go to 4k blocks (although it
> often requires explicit sysadmin action to reformat the drive into that mode
> etc).  But obviously that's problematic from the "postgres limits" POV.
>
>
> If we really wanted to do this - but I don't think we do - I'd argue for
> working on the buildsystem support to build the postgres binary multiple
> times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> exec's the relevant "real" binary based on the pg_control value.  I really
> don't see us ever wanting to make BLCKSZ runtime configurable within one
> postgres binary. There's just too much intrinsic overhead associated with
> that.

You may well be right, but I think if we haven't tried to do that and
measure it, it's hard to say exactly.  There are of course more parts
of the system that depend on BLCKSZ than just the backend, plus you'd need
to build other extensions to support each option, so there is a lot
more that would need to change. (That's neither here nor there, as my
approach also involves changing all those places, so change isn't
inherently bad; just saying it's not a trivial solution to merely
iterate over the block size for binaries.)

David




Re: Should we remove db_user_namespace?

2023-06-30 Thread Tom Lane
Nathan Bossart  writes:
> I'm personally not aware of anyone using this parameter.

Might be worth asking on pgsql-general whether anyone knows of
active use of this feature.  If not, I'm good with killing it.

regards, tom lane




Re: [PATCH] Extend the length of BackgroundWorker.bgw_library_name

2023-06-30 Thread Nathan Bossart
On Wed, Apr 26, 2023 at 03:07:18PM +0300, Aleksander Alekseev wrote:
> The commit message may require a bit of tweaking by the committer but
> other than that the patch seems to be fine. I'm going to mark it as
> RfC in a bit unless anyone objects.

In v4, I've introduced a new BGW_LIBLEN macro and set it to the default
value of MAXPGPATH (1024).  This way, the value can live in bgworker.h like
the other BGW_* macros do.  Plus, this should make the assertion that
checks for backward compatibility unnecessary.  Since bgw_library_name is
essentially a path, I can see the argument that we should just set
BGW_LIBLEN to MAXPGPATH directly.  I'm curious what folks think about this.

I also changed the added sizeofs to use the macro for consistency with the
surrounding code.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 3760f8efd0a20f60f7eda19450f97c5b2e5daf8f Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 30 Jun 2023 14:25:20 -0700
Subject: [PATCH v4 1/1] extend bgw_library_name

---
 doc/src/sgml/bgworker.sgml | 2 +-
 src/backend/postmaster/bgworker.c  | 2 +-
 src/backend/replication/logical/launcher.c | 4 ++--
 src/include/postmaster/bgworker.h  | 3 ++-
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
index 7ba5da27e5..f890216333 100644
--- a/doc/src/sgml/bgworker.sgml
+++ b/doc/src/sgml/bgworker.sgml
@@ -59,7 +59,7 @@ typedef struct BackgroundWorker
 int bgw_flags;
 BgWorkerStartTime bgw_start_time;
 int bgw_restart_time;   /* in seconds, or BGW_NEVER_RESTART */
-charbgw_library_name[BGW_MAXLEN];
+charbgw_library_name[BGW_LIBLEN];
 charbgw_function_name[BGW_MAXLEN];
 Datum   bgw_main_arg;
 charbgw_extra[BGW_EXTRALEN];
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 0dd22b2351..bee2b895e3 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -362,7 +362,7 @@ BackgroundWorkerStateChange(bool allow_new_workers)
 		ascii_safe_strlcpy(rw->rw_worker.bgw_type,
 		   slot->worker.bgw_type, BGW_MAXLEN);
 		ascii_safe_strlcpy(rw->rw_worker.bgw_library_name,
-		   slot->worker.bgw_library_name, BGW_MAXLEN);
+		   slot->worker.bgw_library_name, BGW_LIBLEN);
 		ascii_safe_strlcpy(rw->rw_worker.bgw_function_name,
 		   slot->worker.bgw_function_name, BGW_MAXLEN);
 
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 8395ae7b23..e1ed820b31 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -456,7 +456,7 @@ retry:
 	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
 		BGWORKER_BACKEND_DATABASE_CONNECTION;
 	bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
-	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
+	snprintf(bgw.bgw_library_name, BGW_LIBLEN, "postgres");
 
 	if (is_parallel_apply_worker)
 		snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ParallelApplyWorkerMain");
@@ -910,7 +910,7 @@ ApplyLauncherRegister(void)
 	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
 		BGWORKER_BACKEND_DATABASE_CONNECTION;
 	bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
-	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
+	snprintf(bgw.bgw_library_name, BGW_LIBLEN, "postgres");
 	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ApplyLauncherMain");
 	snprintf(bgw.bgw_name, BGW_MAXLEN,
 			 "logical replication launcher");
diff --git a/src/include/postmaster/bgworker.h b/src/include/postmaster/bgworker.h
index 845d4498e6..f8c401093e 100644
--- a/src/include/postmaster/bgworker.h
+++ b/src/include/postmaster/bgworker.h
@@ -84,6 +84,7 @@ typedef enum
 #define BGW_DEFAULT_RESTART_INTERVAL	60
 #define BGW_NEVER_RESTART-1
 #define BGW_MAXLEN		96
+#define BGW_LIBLEN		1024
 #define BGW_EXTRALEN	128
 
 typedef struct BackgroundWorker
@@ -93,7 +94,7 @@ typedef struct BackgroundWorker
 	int			bgw_flags;
 	BgWorkerStartTime bgw_start_time;
 	int			bgw_restart_time;	/* in seconds, or BGW_NEVER_RESTART */
-	char		bgw_library_name[BGW_MAXLEN];
+	char		bgw_library_name[BGW_LIBLEN];
 	char		bgw_function_name[BGW_MAXLEN];
 	Datum		bgw_main_arg;
 	char		bgw_extra[BGW_EXTRALEN];
-- 
2.25.1



Re: PG 16 draft release notes ready

2023-06-30 Thread Bruce Momjian
On Fri, Jun 30, 2023 at 05:29:17PM +0900, Masahiro Ikeda wrote:
> Hi,
> 
> Thanks for making the release notes. I found the release note of
> PG16 beta2 mentions a reverted following feature.
> 
> ```
> 
> Have initdb use ICU by default if ICU is enabled in the binary (Jeff Davis)
> 
> Option --locale-provider=libc can be used to disable ICU.
> 
> ```
> 
> Unfortunately, the feature was reverted by this commit:
> * 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2535c74b1a6190cc42e13f6b6b55d94bff4b7dd6

Oh, I didn't notice the revert --- item removed.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Initdb-time block size specification

2023-06-30 Thread David Christensen
On Fri, Jun 30, 2023 at 4:27 PM Tomas Vondra
 wrote:
> On 6/30/23 23:11, Andres Freund wrote:
> > ...
> >
> > If we really wanted to do this - but I don't think we do - I'd argue for
> > working on the buildsystem support to build the postgres binary multiple
> > times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> > exec's the relevant "real" binary based on the pg_control value.  I really
> > don't see us ever wanting to make BLCKSZ runtime configurable within one
> > postgres binary. There's just too much intrinsic overhead associated with
> > that.
> >
>
> I don't quite understand why we shouldn't do this (or at least try to).
> IMO the benefits of using smaller blocks were substantial (especially
> for 4kB, most likely due matching the internal SSD page size). The other
> benefits (reducing WAL volume) seem rather interesting too.

If it's dynamic, we could also add detection of the best block size at
initdb time, leading to improvements all around. :-)

> Sure, there are challenges (e.g. the overhead due to making it dynamic).
> No doubt about that.

I definitely agree that it seems worth looking into.  If nothing else,
being able to quantify just what/where the overhead comes from may be
instructive for future efforts.

David




Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

2023-06-30 Thread Bruce Momjian
On Fri, Jun 30, 2023 at 04:16:31PM -0400, Robert Haas wrote:
> On Fri, Jun 30, 2023 at 1:41 PM Bruce Momjian  wrote:
> > I think --size-only was chosen only because it is the minimal comparison
> > option.
> 
> I think it's worse than that. I think that the procedure relies on
> using the --size-only option to intentionally trick rsync into
> thinking that files are identical when they're not.
> 
> Say we have a file like base/23246/78901 on the primary. Unless
> wal_log_hints=on, the standby version is very likely different, but
> only in ways that don't matter to WAL replay. So the procedure aims to
> trick rsync into hard-linking the version of that file that exists on
> the standby in the old cluster into the new cluster on the standby,
> instead of copying the slightly-different version from the master,
> thus making the upgrade very fast. If rsync actually checksummed the
> files, it would realize that they're different and copy the file from
> the original primary, which the person who wrote this procedure does
> not want.

What is the problem with having different hint bits between the two
servers?

> That's kind of a crazy thing for us to be documenting. I think we
> really ought to consider removing from this documentation. If somebody
> wants to write a reliable tool for this to ship as part of PostgreSQL,
> well and good. But this procedure has no real sanity checks and is
> based on very fragile assumptions. That doesn't seem suitable for
> end-user use.
> 
> I'm not quite clear on how Nikolay got into trouble here. I don't
> think I understand under exactly what conditions the procedure is
> reliable and under what conditions it isn't. But there is no way in
> heck I would ever advise anyone to use this procedure on a database
> they actually care about. This is a great party trick or something to
> show off in a lightning talk at PGCon, not something you ought to be
> doing with valuable data that you actually care about.

Well, it does get used, and if we remove it perhaps we can have it on
our wiki and point to it from our docs.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Initdb-time block size specification

2023-06-30 Thread David Christensen
On Fri, Jun 30, 2023 at 4:12 PM Tomas Vondra
 wrote:
>
>
>
> On 6/30/23 22:05, David Christensen wrote:
> > On Fri, Jun 30, 2023 at 2:39 PM Andres Freund  wrote:
> >>
> >> ...
> >>
> >> Besides this, I've not really heard any convincing justification for
> >> needing this in the first place.
> >
> > Doing this would open up experiments in larger block sizes, so we
> > would be able to have larger indexable tuples, say, or be able to
> > store data types that are larger than currently supported for tuple
> > row limits without dropping to toast (native vector data types come to
> > mind as a candidate here).  We've had 8k blocks for a long time while
> > hardware has improved over 20+ years, and it would be interesting to
> > see how tuning things would open up additional avenues for performance
> > without requiring packagers to make a single choice on this regardless
> > of use-case.  (The fact that we allow compiling this at a different
> > value suggests there is thought to be some utility having this be
> > something other than the default value.)
> >
> > I just think it's one of those things that is hard to evaluate without
> > actually having something specific, which is why we have this patch
> > now.
> >
>
> But it's possible to evaluate that - you just need to rebuild with a
> different configuration option. Yes, allowing doing that at initdb is
> simpler and allows testing this on systems where rebuilding is not
> convenient. And having a binary that can deal with any block size would
> be nice too.
>
> In fact, I did exactly that a year ago for a conference, and I spoke
> about it at the 2022 unconference too. Not sure if there's recording
> from pgcon, but there is one from the other conference [1][2].
>
> The short story is that the possible gains are significant (say +50%)
> for data sets that don't fit into RAM. But that was with block size set
> at compile time, the question is what's the impact of making it a
> variable instead of a macro 
>
> [1] https://www.youtube.com/watch?v=mVKpoQxtCXk
> [2] https://blog.pgaddict.com/pdf/block-sizes-postgresvision-2022.pdf

Cool, thanks for the video links; will review.

David




Re: Initdb-time block size specification

2023-06-30 Thread David Christensen
On Fri, Jun 30, 2023 at 3:29 PM Andres Freund  wrote:
> And indeed. Comparing e.g. TPC-H, I see *massive* regressions.  Some queries
> are the same, but others regress by up to 70% (although more commonly around
> 10-20%).

Hmm, that is definitely not good.

> That's larger than I thought, which makes me suspect that there's some bug in
> the new code.

Will do a little profiling here to see if I can figure out the
regression. Which build optimization settings are you seeing this
under?

> Interestingly, repeating the benchmark with a larger work_mem setting, the
> regressions are still quite present, but smaller. I suspect the planner
> chooses smarter plans which move bottlenecks more towards hashjoin code etc,
> which won't be affected by this change.

Interesting.

> IOW, you seriously need to evaluate analytics queries before this is worth
> looking at further.

Makes sense, thanks for reviewing.

David




Re: Should we remove db_user_namespace?

2023-06-30 Thread Bruce Momjian
On Fri, Jun 30, 2023 at 01:05:09PM -0700, Nathan Bossart wrote:
> I think this is the second decennial thread [0] for removing this GUC.
> This topic came up at PGCon, so I thought I'd start the discussion on the
> lists.
> 
> I'm personally not aware of anyone using this parameter.  A couple of my
> colleagues claimed to have used it in the aughts, but they also noted that
> users were confused by the current implementation, and they seemed
> generally in favor of removing it.  AFAICT the strongest reason for keeping
> it is that there is presently no viable replacement.  Does this opinion
> still stand?  If so, perhaps we can look into adding a viable replacement
> for v17.

I am the original author, and it was a hack designed to support
per-database user names.  I am fine with its removal.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Initdb-time block size specification

2023-06-30 Thread Tomas Vondra



On 6/30/23 23:11, Andres Freund wrote:
> ...
> 
> If we really wanted to do this - but I don't think we do - I'd argue for
> working on the buildsystem support to build the postgres binary multiple
> times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
> exec's the relevant "real" binary based on the pg_control value.  I really
> don't see us ever wanting to make BLCKSZ runtime configurable within one
> postgres binary. There's just too much intrinsic overhead associated with
> that.
> 

I don't quite understand why we shouldn't do this (or at least try to).
IMO the benefits of using smaller blocks were substantial (especially
for 4kB, most likely due matching the internal SSD page size). The other
benefits (reducing WAL volume) seem rather interesting too.

Sure, there are challenges (e.g. the overhead due to making it dynamic).
No doubt about that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Initdb-time block size specification

2023-06-30 Thread Andres Freund
Hi,

On 2023-06-30 15:05:54 -0500, David Christensen wrote:
> > I am fairly certain this is going to be causing substantial performance
> > regressions.  I think we should reject this even if we don't immediately
> > find them, because it's almost guaranteed to cause some.
> 
> What would be considered substantial? Some overhead would be expected,
> but I think having an actual patch to evaluate lets us see what
> potential there is.

Anything beyond 1-2%, although even that imo is a hard sell.


> > Besides this, I've not really heard any convincing justification for needing
> > this in the first place.
> 
> Doing this would open up experiments in larger block sizes, so we
> would be able to have larger indexable tuples, say, or be able to
> store data types that are larger than currently supported for tuple
> row limits without dropping to toast (native vector data types come to
> mind as a candidate here).

You can do experiments today with the compile time option. Which does not
require regressing performance for everyone.


> We've had 8k blocks for a long time while hardware has improved over 20+
> years, and it would be interesting to see how tuning things would open up
> additional avenues for performance without requiring packagers to make a
> single choice on this regardless of use-case.

I suspect you're going to see more benefits from going to a *lower* setting
than a higher one. Some practical issues aside, plenty of storage hardware
these days would allow getting rid of FPIs if you go to 4k blocks (although it
often requires explicit sysadmin action to reformat the drive into that mode
etc).  But obviously that's problematic from the "postgres limits" POV.


If we really wanted to do this - but I don't think we do - I'd argue for
working on the buildsystem support to build the postgres binary multiple
times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
exec's the relevant "real" binary based on the pg_control value.  I really
don't see us ever wanting to make BLCKSZ runtime configurable within one
postgres binary. There's just too much intrinsic overhead associated with
that.
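The dispatch step in such a wrapper is trivial; the interesting part (reading the value out of pg_control) is elided here. This is a hypothetical sketch of the idea, with invented binary names, not an existing PostgreSQL facility.

```c
/*
 * Hypothetical sketch of the wrapper dispatch described above: map the
 * block size recorded in pg_control to a per-BLCKSZ build of the server.
 * Reading pg_control itself is elided; the binary names are invented.
 */
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char *
choose_real_binary(unsigned blcksz)
{
	switch (blcksz)
	{
		case 4096:
			return "postgres.4k";
		case 8192:
			return "postgres.8k";
		case 16384:
			return "postgres.16k";
		default:
			return NULL;		/* unsupported block size: refuse to start */
	}
}

/* The wrapper would then execv() the chosen binary with the original argv. */
```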

Greetings,

Andres Freund




Re: Initdb-time block size specification

2023-06-30 Thread Tomas Vondra



On 6/30/23 22:05, David Christensen wrote:
> On Fri, Jun 30, 2023 at 2:39 PM Andres Freund  wrote:
>>
>> ...
>>
>> Besides this, I've not really heard any convincing justification for needing
>> this in the first place.
> 
> Doing this would open up experiments in larger block sizes, so we
> would be able to have larger indexable tuples, say, or be able to
> store data types that are larger than currently supported for tuple
> row limits without dropping to toast (native vector data types come to
> mind as a candidate here).  We've had 8k blocks for a long time while
> hardware has improved over 20+ years, and it would be interesting to
> see how tuning things would open up additional avenues for performance
> without requiring packagers to make a single choice on this regardless
> of use-case.  (The fact that we allow compiling this at a different
> value suggests there is thought to be some utility having this be
> something other than the default value.)
> 
> I just think it's one of those things that is hard to evaluate without
> actually having something specific, which is why we have this patch
> now.
> 

But it's possible to evaluate that - you just need to rebuild with a
different configuration option. Yes, allowing doing that at initdb is
simpler and allows testing this on systems where rebuilding is not
convenient. And having a binary that can deal with any block size would
be nice too.

In fact, I did exactly that a year ago for a conference, and I spoke
about it at the 2022 unconference too. Not sure if there's recording
from pgcon, but there is one from the other conference [1][2].

The short story is that the possible gains are significant (say +50%)
for data sets that don't fit into RAM. But that was with block size set
at compile time, the question is what's the impact of making it a
variable instead of a macro 

[1] https://www.youtube.com/watch?v=mVKpoQxtCXk
[2] https://blog.pgaddict.com/pdf/block-sizes-postgresvision-2022.pdf


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Should we remove db_user_namespace?

2023-06-30 Thread Nathan Bossart
On Fri, Jun 30, 2023 at 01:05:09PM -0700, Nathan Bossart wrote:
> The attached patch simply removes the GUC.

And here's a new version of the patch with docs that actually build.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 3b7fdd41eb429bc9bb03dcecf38126fbc63dafa3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 30 Jun 2023 12:46:08 -0700
Subject: [PATCH v2 1/1] remove db_user_namespace

---
 doc/src/sgml/client-auth.sgml |  5 --
 doc/src/sgml/config.sgml  | 52 ---
 src/backend/libpq/auth.c  |  5 --
 src/backend/libpq/hba.c   | 12 -
 src/backend/postmaster/postmaster.c   | 19 ---
 src/backend/utils/misc/guc_tables.c   |  9 
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/libpq/pqcomm.h|  2 -
 8 files changed, 105 deletions(-)

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index 204d09df67..6c95f0df1e 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -1253,11 +1253,6 @@ omicron bryanh  guest1
attacks.
   
 
-  
-   The md5 method cannot be used with
-   the  feature.
-  
-
   
To ease transition from the md5 method to the newer
SCRAM method, if md5 is specified as a method
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6262cb7bb2..e6cea8ddfc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1188,58 +1188,6 @@ include_dir 'conf.d'

   
  
-
- 
-  db_user_namespace (boolean)
-  
-   db_user_namespace configuration parameter
-  
-  
-  
-   
-This parameter enables per-database user names.  It is off by default.
-This parameter can only be set in the postgresql.conf
-file or on the server command line.
-   
-
-   
-If this is on, you should create users as username@dbname.
-When username is passed by a connecting client,
-@ and the database name are appended to the user
-name and that database-specific user name is looked up by the
-server. Note that when you create users with names containing
-@ within the SQL environment, you will need to
-quote the user name.
-   
-
-   
-With this parameter enabled, you can still create ordinary global
-users.  Simply append @ when specifying the user
-name in the client, e.g., joe@.  The @
-will be stripped off before the user name is looked up by the
-server.
-   
-
-   
-db_user_namespace causes the client's and
-server's user name representation to differ.
-Authentication checks are always done with the server's user name
-so authentication methods must be configured for the
-server's user name, not the client's.  Because
-md5 uses the user name as salt on both the
-client and server, md5 cannot be used with
-db_user_namespace.
-   
-
-   
-
- This feature is intended as a temporary measure until a
- complete solution is found.  At that time, this option will
- be removed.
-
-   
-  
- 
  
  
 
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index a98b934a8e..65d452f099 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -873,11 +873,6 @@ CheckMD5Auth(Port *port, char *shadow_pass, const char **logdetail)
 	char	   *passwd;
 	int			result;
 
-	if (Db_user_namespace)
-		ereport(FATAL,
-(errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
- errmsg("MD5 authentication is not supported when \"db_user_namespace\" is enabled")));
-
 	/* include the salt to use for computing the response */
 	if (!pg_strong_random(md5Salt, 4))
 	{
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index f89f138f3c..5d4ddbb04d 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -1741,19 +1741,7 @@ parse_hba_line(TokenizedAuthLine *tok_line, int elevel)
 	else if (strcmp(token->string, "reject") == 0)
 		parsedline->auth_method = uaReject;
 	else if (strcmp(token->string, "md5") == 0)
-	{
-		if (Db_user_namespace)
-		{
-			ereport(elevel,
-	(errcode(ERRCODE_CONFIG_FILE_ERROR),
-	 errmsg("MD5 authentication is not supported when \"db_user_namespace\" is enabled"),
-	 errcontext("line %d of configuration file \"%s\"",
-line_num, file_name)));
-			*err_msg = "MD5 authentication is not supported when \"db_user_namespace\" is enabled";
-			return NULL;
-		}
 		parsedline->auth_method = uaMD5;
-	}
 	else if (strcmp(token->string, "scram-sha-256") == 0)
 		parsedline->auth_method = uaSCRAM;
 	else if (strcmp(token->string, "pam") == 0)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c

Re: Initdb-time block size specification

2023-06-30 Thread Andres Freund
Hi,

On 2023-06-30 14:09:55 -0500, David Christensen wrote:
> On Fri, Jun 30, 2023 at 1:14 PM Tomas Vondra
> > I wonder how to best evaluate/benchmark this. AFAICS what we want to
> > measure is the extra cost of making the values dynamic (which means the
> > compiler can't just optimize them out). I'd say a "pgbench -S" seems
> > like a good test.
> 
> Yep, I tested 100 runs apiece with pgbench -S at scale factor 100,
> default settings for optimized builds of the same base commit with and
> without the patch; saw a reduction of TPS around 1% in that case, but
> I do think we'd want to look at different workloads; I presume the
> largest impact would be seen when it's all in shared_buffers and no IO
> is required.

I think pgbench -S indeed isn't a good workload - the overhead for it is much
more in context switches and instantiating executor state etc than code that
is affected by the change.

And indeed. Comparing e.g. TPC-H, I see *massive* regressions.  Some queries
are the same, but others regress by up to 70% (although more commonly around
10-20%).

That's larger than I thought, which makes me suspect that there's some bug in
the new code.

Interestingly, repeating the benchmark with a larger work_mem setting, the
regressions are still quite present, but smaller. I suspect the planner
chooses smarter plans which move bottlenecks more towards hashjoin code etc,
which won't be affected by this change.


IOW, you seriously need to evaluate analytics queries before this is worth
looking at further.

Greetings,

Andres Freund




Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

2023-06-30 Thread Robert Haas
On Fri, Jun 30, 2023 at 1:41 PM Bruce Momjian  wrote:
> I think --size-only was chosen only because it is the minimal comparison
> option.

I think it's worse than that. I think that the procedure relies on
using the --size-only option to intentionally trick rsync into
thinking that files are identical when they're not.

Say we have a file like base/23246/78901 on the primary. Unless
wal_log_hints=on, the standby version is very likely different, but
only in ways that don't matter to WAL replay. So the procedure aims to
trick rsync into hard-linking the version of that file that exists on
the standby in the old cluster into the new cluster on the standby,
instead of copying the slightly-different version from the master,
thus making the upgrade very fast. If rsync actually checksummed the
files, it would realize that they're different and copy the file from
the original primary, which the person who wrote this procedure does
not want.
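The comparison at issue can be modeled in a few lines: a size-only check treats two files as identical whenever their lengths match, even if the bytes differ, which is exactly the situation with hint-bit-only differences between primary and standby. Plain C buffers stand in for the files in this sketch.

```c
/*
 * Minimal model of what --size-only does and does not compare.
 * Two "files" of equal length but different contents look identical to a
 * size-only check, while a checksumming comparison would see through it.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool
same_by_size(size_t len_a, size_t len_b)
{
	return len_a == len_b;		/* all that --size-only checks */
}

static bool
same_by_content(const char *a, const char *b, size_t len)
{
	return memcmp(a, b, len) == 0;	/* what a checksumming rsync would see */
}
```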

That's kind of a crazy thing for us to be documenting. I think we
really ought to consider removing from this documentation. If somebody
wants to write a reliable tool for this to ship as part of PostgreSQL,
well and good. But this procedure has no real sanity checks and is
based on very fragile assumptions. That doesn't seem suitable for
end-user use.

I'm not quite clear on how Nikolay got into trouble here. I don't
think I understand under exactly what conditions the procedure is
reliable and under what conditions it isn't. But there is no way in
heck I would ever advise anyone to use this procedure on a database
they actually care about. This is a great party trick or something to
show off in a lightning talk at PGCon, not something you ought to be
doing with valuable data that you actually care about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Initdb-time block size specification

2023-06-30 Thread David Christensen
On Fri, Jun 30, 2023 at 2:39 PM Andres Freund  wrote:
>
> Hi,
>
> On 2023-06-30 12:35:09 -0500, David Christensen wrote:
> > As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
> > variable block sizes; basically instead of BLCKSZ being a compile-time
> > constant, a single set of binaries can support all of the block sizes
> > Pg can support, using the value stored in pg_control as the basis.
> > (Possible future plans would be to make this something even more
> > dynamic, such as configured per tablespace, but this is out of scope;
> > this just sets up the infrastructure for this.)
>
> I am extremely doubtful this is a good idea. For one it causes a lot of churn,
> but more importantly it turns currently cheap code paths into more expensive
> ones.
>
> Changes like
>
> > -#define BufHdrGetBlock(bufHdr)   ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> > +#define BufHdrGetBlock(bufHdr)   ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * CLUSTER_BLOCK_SIZE))
>
> Note that CLUSTER_BLOCK_SIZE, despite looking like a macro that's constant, is
> actually variable.

Correct; that is mainly a notational device which would be easy enough
to change (and presumably would follow along the lines of the commit
Tomas pointed out above).

> I am fairly certain this is going to be causing substantial performance
> regressions.  I think we should reject this even if we don't immediately find
> them, because it's almost guaranteed to cause some.

What would be considered substantial? Some overhead would be expected,
but I think having an actual patch to evaluate lets us see what
potential there is. Seems like this will likely be optimized as an
offset stored in a register, so wouldn't expect huge changes here.
(There may be other approaches I haven't thought of in terms of
getting this.)

> Besides this, I've not really heard any convincing justification for needing
> this in the first place.

Doing this would open up experiments in larger block sizes, so we
would be able to have larger indexable tuples, say, or be able to
store data types that are larger than currently supported for tuple
row limits without dropping to toast (native vector data types come to
mind as a candidate here).  We've had 8k blocks for a long time while
hardware has improved over 20+ years, and it would be interesting to
see how tuning things would open up additional avenues for performance
without requiring packagers to make a single choice on this regardless
of use-case.  (The fact that we allow compiling this at a different
value suggests there is thought to be some utility having this be
something other than the default value.)

I just think it's one of those things that is hard to evaluate without
actually having something specific, which is why we have this patch
now.

Thanks,

David




Should we remove db_user_namespace?

2023-06-30 Thread Nathan Bossart
I think this is the second decennial thread [0] for removing this GUC.
This topic came up at PGCon, so I thought I'd start the discussion on the
lists.

I'm personally not aware of anyone using this parameter.  A couple of my
colleagues claimed to have used it in the aughts, but they also noted that
users were confused by the current implementation, and they seemed
generally in favor of removing it.  AFAICT the strongest reason for keeping
it is that there is presently no viable replacement.  Does this opinion
still stand?  If so, perhaps we can look into adding a viable replacement
for v17.

The attached patch simply removes the GUC.

[0] 
https://postgr.es/m/CAA-aLv6wnwp6Qr5fZo%2B7K%3DVSr51qFMuLsCeYvTkSF3E5qEPvqA%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 6677f4b98fd0b1bd68e07d773b04caf45cf27715 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 30 Jun 2023 12:46:08 -0700
Subject: [PATCH v1 1/1] remove db_user_namespace

---
 doc/src/sgml/config.sgml  | 52 ---
 src/backend/libpq/auth.c  |  5 --
 src/backend/libpq/hba.c   | 12 -
 src/backend/postmaster/postmaster.c   | 19 ---
 src/backend/utils/misc/guc_tables.c   |  9 
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/include/libpq/pqcomm.h|  2 -
 7 files changed, 100 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6262cb7bb2..e6cea8ddfc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1188,58 +1188,6 @@ include_dir 'conf.d'

   
  
-
- 
-  db_user_namespace (boolean)
-  
-   db_user_namespace configuration parameter
-  
-  
-  
-   
-This parameter enables per-database user names.  It is off by default.
-This parameter can only be set in the postgresql.conf
-file or on the server command line.
-   
-
-   
-If this is on, you should create users as username@dbname.
-When username is passed by a connecting client,
-@ and the database name are appended to the user
-name and that database-specific user name is looked up by the
-server. Note that when you create users with names containing
-@ within the SQL environment, you will need to
-quote the user name.
-   
-
-   
-With this parameter enabled, you can still create ordinary global
-users.  Simply append @ when specifying the user
-name in the client, e.g., joe@.  The @
-will be stripped off before the user name is looked up by the
-server.
-   
-
-   
-db_user_namespace causes the client's and
-server's user name representation to differ.
-Authentication checks are always done with the server's user name
-so authentication methods must be configured for the
-server's user name, not the client's.  Because
-md5 uses the user name as salt on both the
-client and server, md5 cannot be used with
-db_user_namespace.
-   
-
-   
-
- This feature is intended as a temporary measure until a
- complete solution is found.  At that time, this option will
- be removed.
-
-   
-  
- 
  
  
 
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index a98b934a8e..65d452f099 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -873,11 +873,6 @@ CheckMD5Auth(Port *port, char *shadow_pass, const char **logdetail)
 	char	   *passwd;
 	int			result;
 
-	if (Db_user_namespace)
-		ereport(FATAL,
-(errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
- errmsg("MD5 authentication is not supported when \"db_user_namespace\" is enabled")));
-
 	/* include the salt to use for computing the response */
 	if (!pg_strong_random(md5Salt, 4))
 	{
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index f89f138f3c..5d4ddbb04d 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -1741,19 +1741,7 @@ parse_hba_line(TokenizedAuthLine *tok_line, int elevel)
 	else if (strcmp(token->string, "reject") == 0)
 		parsedline->auth_method = uaReject;
 	else if (strcmp(token->string, "md5") == 0)
-	{
-		if (Db_user_namespace)
-		{
-			ereport(elevel,
-	(errcode(ERRCODE_CONFIG_FILE_ERROR),
-	 errmsg("MD5 authentication is not supported when \"db_user_namespace\" is enabled"),
-	 errcontext("line %d of configuration file \"%s\"",
-line_num, file_name)));
-			*err_msg = "MD5 authentication is not supported when \"db_user_namespace\" is enabled";
-			return NULL;
-		}
 		parsedline->auth_method = uaMD5;
-	}
 	else if (strcmp(token->string, "scram-sha-256") == 0)
 		parsedline->auth_method = uaSCRAM;
 	else if (strcmp(token->string, "pam") == 0)
diff --git a/src/backend/postmaster/postmaster.c 

Re: Initdb-time block size specification

2023-06-30 Thread Andres Freund
Hi,

On 2023-06-30 12:35:09 -0500, David Christensen wrote:
> As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
> variable block sizes; basically instead of BLCKSZ being a compile-time
> constant, a single set of binaries can support all of the block sizes
> Pg can support, using the value stored in pg_control as the basis.
> (Possible future plans would be to make this something even more
> dynamic, such as configured per tablespace, but this is out of scope;
> this just sets up the infrastructure for this.)

I am extremely doubtful this is a good idea. For one it causes a lot of churn,
but more importantly it turns currently cheap code paths into more expensive
ones.

Changes like

> -#define BufHdrGetBlock(bufHdr)   ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> +#define BufHdrGetBlock(bufHdr)   ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * CLUSTER_BLOCK_SIZE))

Note that CLUSTER_BLOCK_SIZE, despite looking like a macro that's constant, is
actually variable.
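The cost difference can be made concrete with a small standalone sketch (the names below are illustrative stand-ins, not the actual PostgreSQL definitions): with a compile-time constant the multiply in the buffer-address computation strength-reduces to a shift, whereas a runtime value forces a load and a real multiply on every use.

```c
/*
 * Illustrative stand-ins, not the actual PostgreSQL definitions.  With a
 * compile-time BLCKSZ the compiler can turn buf_id * 8192 into a shift and
 * fold the constant through callers; with a runtime value it must load the
 * variable and emit a real multiply in every hot path that touches it.
 */
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192				/* compile-time constant: strength-reduced */

static size_t cluster_block_size = 8192;	/* runtime value: loaded on each use */

static char buffer_pool[16 * BLCKSZ];

static char *
buf_block_const(int buf_id)
{
	return buffer_pool + (size_t) buf_id * BLCKSZ;
}

static char *
buf_block_dynamic(int buf_id)
{
	return buffer_pool + (size_t) buf_id * cluster_block_size;
}
```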

I am fairly certain this is going to be causing substantial performance
regressions.  I think we should reject this even if we don't immediately find
them, because it's almost guaranteed to cause some.


Besides this, I've not really heard any convincing justification for needing
this in the first place.

Greetings,

Andres Freund




Re: Initdb-time block size specification

2023-06-30 Thread David Christensen
On Fri, Jun 30, 2023 at 1:14 PM Tomas Vondra
 wrote:

> Do we really want to prefix the option with CLUSTER_? That seems to just
> add a lot of noise into the patch, and I don't see much value in this
> rename. I'd prefer keeping BLCKSZ and tweak just the couple places that
> need "default" to use BLCKSZ_DEFAULT or something like that.
>
> But more importantly, I'd say we use CAPITALIZED_NAMES for compile-time
> values, so after making this initdb-time parameter we should abandon
> that (just like fc49e24fa69a did for segment sizes). Perhaps something
> like cluster_block_size would work ... (yes, that touches all the places
> too).

Yes, I can see that being an equivalent change; thanks for the pointer
there. A "cluster_block_size" identifier could definitely be an
approach, though since it's currently just a #define for
GetBlockSize(), maybe we should just replace it with the equivalent
call instead. I was mainly trying to make it something that was
conceptually similar and easy to reason about without getting bogged
down in the details locally, but I can see that ALL_CAPS does have a
specific meaning.  Also, eliminating the BLCKSZ symbol made it easier
to catch anything which depended on that value.  If we wanted to keep
BLCKSZ, I'd be okay with that versus the CLUSTER_BLOCK_SIZE approach
at this point; it would help to make the patch smaller.

> > Initial (basic) performance testing shows only minor changes with the
> > pgbench -S benchmark, though this is obviously an area that will need considerable
> > testing/verification across multiple workloads.
> >
>
> I wonder how to best evaluate/benchmark this. AFAICS what we want to
> measure is the extra cost of making the values dynamic (which means the
> compiler can't just optimize them out). I'd say a "pgbench -S" seems
> like a good test.

Yep, I tested 100 runs apiece with pgbench -S at scale factor 100,
default settings for optimized builds of the same base commit with and
without the patch; saw a reduction of TPS around 1% in that case, but
I do think we'd want to look at different workloads; I presume the
largest impact would be seen when it's all in shared_buffers and no IO
is required.

Thanks,

David




Re: Avoid unncessary always true test (src/backend/storage/buffer/bufmgr.c)

2023-06-30 Thread Karina Litskevich
Hi,

Good catch. Indeed, eb.rel shouldn't be NULL there, and the tests should be
unnecessary. However, that doesn't follow from the code of these functions.
From what I can see, eb.rel can be NULL in both of these functions. There is
the following Assert at the beginning of the ExtendBufferedRelTo() function:

Assert((eb.rel != NULL) != (eb.smgr != NULL));

And ExtendBufferedRelShared() is only called from ExtendBufferedRelCommon(),
which can be called from ExtendBufferedRelTo() or ExtendBufferedRelBy(),
which also has the same Assert(). And none of these functions assigns
eb.rel, so it can be NULL from the very beginning and stays that way.


And there is the following call in xlogutils.c, which is exactly the case
where eb.rel is NULL:

buffer = ExtendBufferedRelTo(EB_SMGR(smgr, RELPERSISTENCE_PERMANENT),
 forknum,
 NULL,
 EB_PERFORMING_RECOVERY |
 EB_SKIP_EXTENSION_LOCK,
 blkno + 1,
 mode);


So to me, calling LockRelationForExtension() and UnlockRelationForExtension()
without testing eb.rel first looks more like a bug here. However, they are
never actually called with eb.rel = NULL because of the EB_* flags, so there
is no bug here. I believe we should add an Assert checking that when eb.rel
is NULL, the flags are such that we won't use eb.rel. And yes, we can remove
the unnecessary checks where the flags value guarantees that eb.rel is not
NULL.
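The proposed invariant, that eb.rel may be NULL only when the flags guarantee it will never be dereferenced, can be sketched in a small model. The EB_* bit values below are invented for illustration (the real ones live in src/include/storage/bufmgr.h); only the boolean shape of the check matters.

```python
# Hypothetical stand-ins for the EB_* flag bits; the real values are
# defined in src/include/storage/bufmgr.h.
EB_SKIP_EXTENSION_LOCK = 1 << 0
EB_PERFORMING_RECOVERY = 1 << 1
EB_CREATE_FORK_IF_NEEDED = 1 << 2

def rel_may_be_null(rel, flags):
    """Mirrors the shape of the proposed Assert: eb.rel may be NULL only
    if we skip the extension lock and never need to (re)create the fork,
    since both of those code paths dereference eb.rel."""
    return rel is not None or (
        bool(flags & EB_SKIP_EXTENSION_LOCK)
        and not (flags & EB_CREATE_FORK_IF_NEEDED)
    )

# The xlogutils.c caller passes a NULL rel together with
# EB_PERFORMING_RECOVERY | EB_SKIP_EXTENSION_LOCK, so the invariant holds:
assert rel_may_be_null(None, EB_PERFORMING_RECOVERY | EB_SKIP_EXTENSION_LOCK)

# A NULL rel without EB_SKIP_EXTENSION_LOCK would trip the assertion,
# because LockRelationForExtension(eb.rel, ...) would be reached:
assert not rel_may_be_null(None, 0)
```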


And another minor observation. It seems to me that we don't need a "could
have been closed while waiting for lock" comment in ExtendBufferedRelShared(),
because I believe the comment above it already explains why we update eb.smgr:

 * Note that another backend might have extended the relation by the time
 * we get the lock.


I attached the new version of the patch as I see it.


Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/
From 7425d5b1b9d30f0da75aabd94e862906f9635a60 Mon Sep 17 00:00:00 2001
From: Karina Litskevich 
Date: Fri, 30 Jun 2023 21:26:18 +0300
Subject: [PATCH v4] Remove always true checks

Also add asserts that help understanding these checks are always true
---
 src/backend/storage/buffer/bufmgr.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3c59bbd04e..7b1e2e1284 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -851,6 +851,8 @@ ExtendBufferedRelBy(ExtendBufferedWhat eb,
 {
 	Assert((eb.rel != NULL) != (eb.smgr != NULL));
 	Assert(eb.smgr == NULL || eb.relpersistence != 0);
+	Assert(eb.rel != NULL || (flags & EB_SKIP_EXTENSION_LOCK) &&
+			!(flags & EB_CREATE_FORK_IF_NEEDED));
 	Assert(extend_by > 0);
 
 	if (eb.smgr == NULL)
@@ -887,6 +889,8 @@ ExtendBufferedRelTo(ExtendBufferedWhat eb,
 
 	Assert((eb.rel != NULL) != (eb.smgr != NULL));
 	Assert(eb.smgr == NULL || eb.relpersistence != 0);
+	Assert(eb.rel != NULL || (flags & EB_SKIP_EXTENSION_LOCK) &&
+			!(flags & EB_CREATE_FORK_IF_NEEDED));
 	Assert(extend_to != InvalidBlockNumber && extend_to > 0);
 
 	if (eb.smgr == NULL)
@@ -908,8 +912,7 @@ ExtendBufferedRelTo(ExtendBufferedWhat eb,
 		LockRelationForExtension(eb.rel, ExclusiveLock);
 
 		/* could have been closed while waiting for lock */
-		if (eb.rel)
-			eb.smgr = RelationGetSmgr(eb.rel);
+		eb.smgr = RelationGetSmgr(eb.rel);
 
 		/* recheck, fork might have been created concurrently */
 		if (!smgrexists(eb.smgr, fork))
@@ -1875,8 +1878,7 @@ ExtendBufferedRelShared(ExtendBufferedWhat eb,
 	if (!(flags & EB_SKIP_EXTENSION_LOCK))
 	{
 		LockRelationForExtension(eb.rel, ExclusiveLock);
-		if (eb.rel)
-			eb.smgr = RelationGetSmgr(eb.rel);
+		eb.smgr = RelationGetSmgr(eb.rel);
 	}
 
 	/*
-- 
2.25.1



Re: pgsql: Fix search_path to a safe value during maintenance operations.

2023-06-30 Thread Nathan Bossart
On Thu, Jun 29, 2023 at 10:09:21PM -0700, Nathan Bossart wrote:
> On Thu, Jun 29, 2023 at 08:53:56PM -0400, Tom Lane wrote:
>> I'm leaning to Robert's thought that we need to revert this for now,
>> and think harder about how to make it work cleanly and safely.
> 
> Since it sounds like this is headed towards a revert, here's a patch for
> removing MAINTAIN and pg_maintain.

I will revert this next week unless opinions change before then.  I'm
currently planning to revert on both master and REL_16_STABLE, but another
option would be to keep it on master while we sort out the remaining
issues.  Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Move un-parenthesized syntax docs to "compatibility" for few SQL commands

2023-06-30 Thread Nathan Bossart
On Mon, May 15, 2023 at 12:36:51PM -0700, Nathan Bossart wrote:
> On Tue, Apr 25, 2023 at 03:18:44PM +0900, Michael Paquier wrote:
>> FWIW, I agree to hold on this stuff for v17~ once it opens for
>> business.  There is no urgency here.
> 
> There's still some time before we'll be able to commit any of these, but
> here is an attempt at addressing all the feedback thus far.

I think these patches are in decent shape, so I'd like to commit them soon,
but I will wait at least a couple more weeks in case anyone has additional
feedback.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Initdb-time block size specification

2023-06-30 Thread Tomas Vondra
On 6/30/23 19:35, David Christensen wrote:
> Hi,
> 
> As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
> variable block sizes; basically instead of BLCKSZ being a compile-time
> constant, a single set of binaries can support all of the block sizes
> Pg can support, using the value stored in pg_control as the basis.
> (Possible future plans would be to make this something even more
> dynamic, such as configured per tablespace, but this is out of scope;
> this just sets up the infrastructure for this.)
> 
> Whereas we had traditionally used BLCKSZ to indicate the compile-time selected
> block size, this commit adjusted things so the cluster block size can be
> selected at initdb time.
> 
> In order to code for this, we introduce a few new defines:
> 
> - CLUSTER_BLOCK_SIZE is the blocksize for this cluster itself.  This is not
> valid until BlockSizeInit() has been called in the given backend, which we
> do as early as possible by parsing the ControlFile and using the blcksz field.
> 
> - MIN_BLOCK_SIZE and MAX_BLOCK_SIZE are the limits for the selectable block
> size. It is required that CLUSTER_BLOCK_SIZE is a power of 2 between these two
> constants.
> 
> - DEFAULT_BLOCK_SIZE is the moral equivalent of BLCKSZ; it is the built-in
> default value.  This is used in a few places that just needed a buffer of an
> arbitrary size, but the dynamic value CLUSTER_BLOCK_SIZE should almost
> always be used instead.
> 
> - CLUSTER_RELSEG_SIZE is used instead of RELSEG_SIZE, since internally we are
> storing the segment size in terms of number of blocks.  RELSEG_SIZE is still
> kept, but is used in terms of the number of blocks of DEFAULT_BLOCK_SIZE;
> CLUSTER_RELSEG_SIZE scales appropriately (and is the only thing used
> internally) to keep the same target total segment size regardless of block size.
> 

Do we really want to prefix the option with CLUSTER_? That seems to just
add a lot of noise into the patch, and I don't see much value in this
rename. I'd prefer keeping BLCKSZ and tweak just the couple places that
need "default" to use BLCKSZ_DEFAULT or something like that.

But more importantly, I'd say we use CAPITALIZED_NAMES for compile-time
values, so after making this initdb-time parameter we should abandon
that (just like fc49e24fa69a did for segment sizes). Perhaps something
like cluster_block_size would work ... (yes, that touches all the places
too).

> This patch uses a precalculated table to store the block size itself, as
> well as additional derived values that have traditionally been compile-time
> constants (example: MaxHeapTuplesPerPage).  The traditional macro names are
> kept so code that doesn't care about it should not need to change, however the
> definition of these has changed (see the CalcXXX() routines in blocksize.h for
> details).
> 
> A new function, BlockSizeInit() populates the appropriate values based on the
> target block size.  This should be called as early as possible in any code
> that utilizes block sizes.  This patch adds this in the appropriate place on
> the handful of src/bin/ programs that used BLCKSZ, so this caveat mainly
> impacts new code.
> 
> Code which had previously used BLCKSZ should likely be able to get away with
> changing these instances to CLUSTER_BLOCK_SIZE, unless you're using a
> structure allocated on the stack.  In these cases, the compiler will
> complain about a dynamic structure.  The solution is to utilize an
> expression with MAX_BLOCK_SIZE instead of BLCKSZ, ensuring enough stack
> space is allocated for the maximum size. This also does require using
> CLUSTER_BLOCK_SIZE or an expression based on it when actually using this
> structure, so in practice more stack space may be allocated than used in
> principle; as long as there is plenty of stack this should have no specific
> impacts on code.
> 
> Initial (basic) performance testing shows only minor changes with the
> pgbench -S benchmark, though this is obviously an area that will need considerable
> testing/verification across multiple workloads.
> 

I wonder how to best evaluate/benchmark this. AFAICS what we want to
measure is the extra cost of making the values dynamic (which means the
compiler can't just optimize them out). I'd say a "pgbench -S" seems
like a good test.
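For readers following along, the precalculated-table idea from the patch description can be sketched like this. This is a Python stand-in, not the patch's C code; the header sizes are assumed 64-bit-platform values, and the real table covers many more derived constants.

```python
# Assumed 64-bit page-geometry constants: 24-byte page header, 23-byte heap
# tuple header rounded up to MAXALIGN 8, and a 4-byte line pointer.
SIZE_OF_PAGE_HEADER = 24
SIZE_OF_HEAP_TUPLE_HEADER = 24
SIZE_OF_ITEM_ID = 4

MIN_BLOCK_SIZE, MAX_BLOCK_SIZE = 1024, 32768

def calc_derived(block_size):
    """Recompute values that used to be compile-time constants, e.g.
    MaxHeapTuplesPerPage, for a given block size."""
    assert MIN_BLOCK_SIZE <= block_size <= MAX_BLOCK_SIZE
    assert block_size & (block_size - 1) == 0, "block size must be a power of 2"
    return {
        "block_size": block_size,
        "max_heap_tuples_per_page":
            (block_size - SIZE_OF_PAGE_HEADER)
            // (SIZE_OF_HEAP_TUPLE_HEADER + SIZE_OF_ITEM_ID),
    }

# The moral equivalent of BlockSizeInit(): fill the table once, early in
# backend startup, from the blcksz field read out of pg_control.
cluster = calc_derived(8192)
print(cluster["max_heap_tuples_per_page"])  # 291 with the sizes assumed above
```

The cost being debated is visible even in this toy: every consumer now reads a variable instead of a literal the compiler could fold into the instruction stream, which is exactly what a pgbench -S run should stress.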


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: sslinfo extension - add notbefore and notafter timestamps

2023-06-30 Thread Cary Huang
> This needs to adjust the tests in src/test/ssl which now fails due to
> SELECT * returning a row which doesn't match what the test was coded for.

Thank you so much for pointing that out. I have adjusted the extra ssl test
to account for the extra columns returned. It should not fail now.
 
> The new patchset isn't updating contrib/sslinfo/meson with the 1.3 update
> so it fails to build with Meson.

Thanks again for pointing that out. I have adjusted the meson build file to
include the 1.3 update.

Please see attached patches for the fixes. 
Thank you so much!

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca


v3-0001-sslinfo-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


v3-0002-pg-stat-ssl-add-notbefore-and-notafter-timestamps.patch
Description: Binary data


Re: Adding SHOW CREATE TABLE

2023-06-30 Thread Kirk Wolak
On Wed, Jun 21, 2023 at 8:52 PM Kirk Wolak  wrote:

> On Mon, Jun 5, 2023 at 7:43 AM Jelte Fennema  wrote:
>
>> On Thu, 1 Jun 2023 at 18:57, Kirk Wolak  wrote:
>> > Can this get turned into a Patch?  Were you offering this code up for
>> others (me?) to pull, and work into a patch?
>> > [If I do the patch, I am not sure it gives you the value of reducing
>> what CITUS has to maintain.  But it dawns on
>> > me that you might be pushing a much bigger patch...  But I would take
>> that, as I think there is other value in there]
>>
>

> Yeah, the Citus code only handles things that Citus supports in
>> distributed tables. Which is quite a lot, but indeed not everything
>> yet. Temporary and inherited tables are not supported in this code
>> afaik. Possibly more. See the commented out
>> EnsureRelationKindSupported for what should be supported (normal
>> tables and partitioned tables afaik).
>>
>
>
Okay, apologies for the long delay on this.  I have the code Jelte
submitted working.  And I have (almost) figured out how to add the function
so it shows up in the pg_catalog...  (I edited files I should not have, I
need to know the proper process... Anyone...)

Not sure if it is customary to attach the code when asking about stuff.
For the most part, it is what Jelte gave us, with a pg_get_tabledef()
wrapper to call...

Here is the output it produces for
select pg_get_tabledef('pg_class'::regclass);  (Feedback welcome)

CREATE TABLE pg_class (oid oid NOT NULL, relname name NOT NULL COLLATE "C",
relnamespace oid NOT NULL, reltype oid NOT NULL, reloftype oid NOT NULL,
relowner oid NOT NULL, relam oid NOT NULL, relfilenode oid NOT NULL,
reltablespace oid NOT NULL, relpages integer NOT NULL, reltuples real NOT
NULL, relallvisible integer NOT NULL, reltoastrelid oid NOT NULL,
relhasindex boolean NOT NULL, relisshared boolean NOT NULL, relpersistence
"char" NOT NULL, relkind "char" NOT NULL, relnatts smallint NOT NULL,
relchecks smallint NOT NULL, relhasrules boolean NOT NULL, relhastriggers
boolean NOT NULL, relhassubclass boolean NOT NULL, relrowsecurity boolean
NOT NULL, relforcerowsecurity boolean NOT NULL, relispopulated boolean NOT
NULL, relreplident "char" NOT NULL, relispartition boolean NOT NULL,
relrewrite oid NOT NULL, relfrozenxid xid NOT NULL, relminmxid xid NOT
NULL, relacl aclitem[], reloptions text[] COLLATE "C", relpartbound
pg_node_tree COLLATE "C") USING heap

==
My Comments/Questions:
1) I would prefer Legible output, like below
2) I would prefer to leave off COLLATE "C"  IFF that is the DB Default
3) The USING heap... I want to pull UNLESS the value is NOT the default
(That's a theme in my comments)
4) I *THINK* including the schema would be nice?
5) This version will work with a TEMP table, but NOT EMIT "TEMPORARY"...
Thoughts?  Is emitting [pg_temp.] good enough?
6) This version enumerates sequence values (Drop always, or Drop if they
are the default values?)
7) Should I enable the pg_get_seqdef() code
8) It does NOT handle Inheritance (Yet... Is this important?  Is it okay to
just give the table structure for this table?)
9) I have not tested against Partitions, etc...  I SIMPLY want initial
feedback on Formatting

-- Legible:
CREATE TABLE pg_class (oid oid NOT NULL,
 relname name NOT NULL COLLATE "C",
 relnamespace oid NOT NULL,
 reltype oid NOT NULL,
 ...
 reloptions text[] COLLATE "C",
 relpartbound pg_node_tree COLLATE "C"
)

-- Too verbose with "*DEFAULT*" Sequence Values:
CREATE TABLE t1 (id bigint GENERATED BY DEFAULT AS IDENTITY (INCREMENT BY
1 MINVALUE 1 MAXVALUE 9223372036854775807 START WITH 1 CACHE 1 NO CYCLE)
NOT NULL,
 f1 text
) WITH (autovacuum_vacuum_cost_delay='0', fillfactor='80',
autovacuum_vacuum_insert_threshold='-1',
autovacuum_analyze_threshold='5',
autovacuum_vacuum_threshold='5',
autovacuum_vacuum_scale_factor='1.5')

Thanks,

Kirk...
PS: After I get feedback on Formatting the output, etc.  I will gladly
generate a new .patch file and send it along.  Otherwise Jelte gets 100% of
the credit, and I don't want to look like I am changing that.


Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

2023-06-30 Thread Bruce Momjian
On Thu, Jun 29, 2023 at 02:38:58PM -0400, Robert Haas wrote:
> On Thu, Jun 29, 2023 at 1:50 PM Nikolay Samokhvalov  wrote:
> > Does this make sense or I'm missing something and the current docs describe 
> > a reliable process? (As I said, we have deviated from the process, to 
> > involve logical replication, so I'm not 100% sure I'm right suspecting the 
> > original procedure in having standby corruption risks.)
> 
> I'm very suspicious about this section of the documentation. It
> doesn't explain why --size-only is used or why --no-inc-recursive is
> used.

I think --size-only was chosen only because it is the minimal comparison
option.
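The hazard is easy to demonstrate without rsync at all. A conceptual sketch (by default rsync's quick check compares size and mtime; --size-only reduces it to size alone, and PostgreSQL data pages never change size when modified in place):

```python
import os
import tempfile

def size_only_says_unchanged(a, b):
    # The whole "quick check" under --size-only: equal sizes => skip the file.
    return os.path.getsize(a) == os.path.getsize(b)

with tempfile.TemporaryDirectory() as d:
    primary = os.path.join(d, "page_on_primary")
    standby = os.path.join(d, "page_on_standby")
    with open(primary, "wb") as f:
        f.write(b"A" * 8192)   # page after an in-place update
    with open(standby, "wb") as f:
        f.write(b"B" * 8192)   # stale copy, same size

    skipped = size_only_says_unchanged(primary, standby)
    with open(primary, "rb") as f1, open(standby, "rb") as f2:
        contents_differ = f1.read() != f2.read()

print(skipped, contents_differ)  # True True: the stale page would be skipped
```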
 
> > > 9. Prepare for standby server upgrades
> > > If you are upgrading standby servers using methods outlined in section 
> > > Step 11, verify that the old standby servers are caught up by running 
> > > pg_controldata against the old primary and standby clusters. Verify that 
> > > the “Latest checkpoint location” values match in all clusters. (There 
> > > will be a mismatch if old standby servers were shut down before the old 
> > > primary or if the old standby servers are still running.) Also, make sure 
> > > wal_level is not set to minimal in the postgresql.conf file on the new 
> > > primary cluster.
> >
> > – admitting that there might be mismatch. But if there is mismatch, rsync 
> > --size-only is not going to help synchronize them properly, right?
> 
> I think the idea is that you shouldn't use the procedure in this case.
> But honestly I don't think it's probably a good idea to use this
> procedure at all. It's not clear enough under what circumstances, if
> any, it's safe to use, and there's not really any way to know if
> you've done it correctly. You couldn't pay me enough to recommend this
> procedure to anyone.

I think it would be good to revisit all the steps outlined in that
procedure and check which ones are still valid or need adjusting.  It is
very possible the original steps have bugs or that new Postgres features
added since the steps were created don't work with these steps.  I think
we need to bring Stephen Frost into the discussion, so I have CC'ed him.

Frankly, I didn't think the documented procedure would work either, but
people say it does, so it is in the docs.  I do think it is overdue for
a re-analysis.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: ProcessStartupPacket(): database_name and user_name truncation

2023-06-30 Thread Drouvot, Bertrand

Hi,

On 6/30/23 5:54 PM, Tom Lane wrote:

Nathan Bossart  writes:

After taking another look at this, I wonder if it'd be better to fail as
soon as we see the database or user name is too long instead of lugging
them around when authentication is destined to fail.


If we're agreed that we aren't going to truncate these identifiers,
that seems like a reasonable way to handle it.



Yeah agree, thanks Nathan for the idea.
I'll work on a new patch version proposal.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: SPI isolation changes

2023-06-30 Thread Tom Lane
Seino Yuki  writes:
> Of course, executing SET TRANSACTION ISOLATION LEVEL with SPI_execute 
> will result in an error.
> ---
> SPI_execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE", false, 0);

> (Log Output)
> ERROR:  SET TRANSACTION ISOLATION LEVEL must be called before any query
> CONTEXT:  SQL statement "SET TRANSACTION ISOLATION LEVEL SERIALIZABLE"

Even if you just did SPI_commit?  That *should* fail if you just do
it right off the bat in a SPI-using procedure, because you're already
within the transaction that called the procedure.  But I think it
will work if you do SPI_commit followed by this SPI_execute.

regards, tom lane




Re: SPI isolation changes

2023-06-30 Thread Seino Yuki

On 2023-07-01 00:06, Tom Lane wrote:

Seino Yuki  writes:

I also thought that using SPI_start_transaction would be more readable
than using SPI_commit/SPI_rollback to implicitly start a transaction.
What do you think?


I think you're trying to get us to undo commit 2e517818f, which
is not going to happen.  See the threads that led up to that:

Discussion:
https://postgr.es/m/3375ffd8-d71c-2565-e348-a597d6e73...@enterprisedb.com
Discussion: https://postgr.es/m/17416-ed8fe5d7213d6...@postgresql.org

It looks to me like you can just change the transaction property
settings immediately after SPI_start_transaction if you want to.
Compare this bit in SnapBuildExportSnapshot:

StartTransactionCommand();

/* There doesn't seem to a nice API to set these */
XactIsoLevel = XACT_REPEATABLE_READ;
XactReadOnly = true;

Also look at the implementation of SPI_commit_and_chain,
particularly RestoreTransactionCharacteristics.

regards, tom lane


Thanks for sharing the past threads.
I now understand how SPI_start_transaction became a no-op.

I also understand how to set the transaction properties.
However, it is a little disappointing that the transaction properties
cannot be changed using SPI interfaces alone.

Of course, executing SET TRANSACTION ISOLATION LEVEL with SPI_execute 
will result in an error.

---
SPI_execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE", false, 0);

(Log Output)
ERROR:  SET TRANSACTION ISOLATION LEVEL must be called before any query
CONTEXT:  SQL statement "SET TRANSACTION ISOLATION LEVEL SERIALIZABLE"
---

Thanks for answering.




Re: Avoid unused value (src/fe_utils/print.c)

2023-06-30 Thread Ranier Vilela
Em sex., 30 de jun. de 2023 às 13:00, Alexander Lakhin 
escreveu:

> Hello Karina,
>
> 30.06.2023 17:25, Karina Litskevich wrote:
> > Hi,
> >
> > Alexander wrote:
> >
> >> It also aligns the code with print_unaligned_vertical(), but I can't
> see why
> >> need_recordsep = true;
> >> is a no-op here (scan-build dislikes only need_recordsep = false;).
> >> I suspect that removing that line will change the behaviour in cases
> when
> >> need_recordsep = false after the loop "print cells" and the loop
> >> "for (footers)" is executed.
> > As I understand cont->cells is supposed to have all cont->ncolumns *
> cont->nrows
> > entries filled so the loop "print cells" always assigns need_recordsep =
> true,
> > except when there are no cells at all or cancel_pressed == true.
> > If cancel_pressed == true then footers are not printed. So to have
> > need_recordsep == false before the loop "for (footers)" table should be
> empty,
> > and need_recordsep should be false before the loop "print cells". It can
> only
> > be false there when cont->opt->start_table == true and opt_tuples_only
> == true
> > so that headers are not printed. But when opt_tuples_only == true
> footers are
> > not printed either.
> >
> > So technically removing "need_recordsep = true;" won't change the
> outcome. But
> > it's not obvious, so I also have doubts about removing this line. If
> someday
> > print options are changed, for example to support printing footers and
> not
> > printing headers, or anything else changes in this function, the output
> might
> > be unexpected with this line removed.
>
> Hi Alexander,


> I think that the question that should be answered before moving forward
> here is: what this discussion is expected to result in?
>
I hope to make the function more readable and maintainable.


> If the goal is to avoid an unused value to make Coverity/clang`s scan-build
> a little happier, then maybe just don't touch other line, that is not
> recognized as dead (at least by scan-build;



> I wonder what Coverity says
> about that line).
>
   if (cont->opt->stop_table)
477{
478printTableFooter *footers = footers_with_default(cont);
479
480if (!opt_tuples_only && footers != NULL && !
cancel_pressed)
481{
482printTableFooter *f;
483
484for (f = footers; f; f = f->next)
485{
486if (need_recordsep)
487{
488print_separator(cont->opt->
recordSep, fout);

CID 1512766 (#1 of 1): Unused value (UNUSED_VALUE)assigned_value: Assigning
value false to need_recordsep here, but that stored value is overwritten
before it can be used.
489need_recordsep = false;
490}
491fputs(f->data, fout);

value_overwrite: Overwriting previous write to need_recordsep with value
true.
492need_recordsep = true;
493}
494}
495


> Otherwise, if the goal is to review the code and make it cleaner, then
> why not get rid of "if (need_recordsep)" there?
>
I don't agree with removing this line, as it is essential to print the
separators in the footers.

best regards,
Ranier Vilela


Re: Avoid unused value (src/fe_utils/print.c)

2023-06-30 Thread Ranier Vilela
Em sex., 30 de jun. de 2023 às 11:26, Karina Litskevich <
litskevichkar...@gmail.com> escreveu:

> Hi,
>
> Alexander wrote:
>
> > It also aligns the code with print_unaligned_vertical(), but I can't see
> why
> > need_recordsep = true;
> > is a no-op here (scan-build dislikes only need_recordsep = false;).
> > I suspect that removing that line will change the behaviour in cases when
> > need_recordsep = false after the loop "print cells" and the loop
> > "for (footers)" is executed.
>
> Hi Karina,


> As I understand cont->cells is supposed to have all cont->ncolumns *
> cont->nrows
> entries filled so the loop "print cells" always assigns need_recordsep =
> true,
> except when there are no cells at all or cancel_pressed == true.
> If cancel_pressed == true then footers are not printed. So to have
> need_recordsep == false before the loop "for (footers)" table should be
> empty,
> and need_recordsep should be false before the loop "print cells". It can
> only
> be false there when cont->opt->start_table == true and opt_tuples_only ==
> true
> so that headers are not printed. But when opt_tuples_only == true footers
> are
> not printed either.
>
> So technically removing "need_recordsep = true;" won't change the outcome.

Thanks for the confirmation.


> But
> it's not obvious, so I also have doubts about removing this line. If
> someday
> print options are changed, for example to support printing footers and not
> printing headers, or anything else changes in this function, the output
> might
> be unexpected with this line removed.


That part I didn't understand.
How would removing the complicating part make this function less
readable?

best regards,
Ranier Vilela


Re: Avoid unused value (src/fe_utils/print.c)

2023-06-30 Thread Alexander Lakhin

Hello Karina,

30.06.2023 17:25, Karina Litskevich wrote:

Hi,

Alexander wrote:


It also aligns the code with print_unaligned_vertical(), but I can't see why
need_recordsep = true;
is a no-op here (scan-build dislikes only need_recordsep = false;).
I suspect that removing that line will change the behaviour in cases when
need_recordsep = false after the loop "print cells" and the loop
"for (footers)" is executed.

As I understand cont->cells is supposed to have all cont->ncolumns * cont->nrows
entries filled so the loop "print cells" always assigns need_recordsep = true,
except when there are no cells at all or cancel_pressed == true.
If cancel_pressed == true then footers are not printed. So to have
need_recordsep == false before the loop "for (footers)" table should be empty,
and need_recordsep should be false before the loop "print cells". It can only
be false there when cont->opt->start_table == true and opt_tuples_only == true
so that headers are not printed. But when opt_tuples_only == true footers are
not printed either.

So technically removing "need_recordsep = true;" won't change the outcome. But
it's not obvious, so I also have doubts about removing this line. If someday
print options are changed, for example to support printing footers and not
printing headers, or anything else changes in this function, the output might
be unexpected with this line removed.
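For what it's worth, Karina's conclusion can be checked mechanically with a toy model of the footers loop (a sketch of the control flow only, not the real print.c code):

```python
def footer_loop(footers, need_recordsep, with_dead_store=True):
    """Model of the "for (footers)" loop: emit a record separator before
    each footer except, possibly, the first one."""
    out = []
    for f in footers:
        if need_recordsep:
            out.append("<recordsep>")
            if with_dead_store:
                need_recordsep = False  # the store scan-build flags as unused
        out.append(f)
        need_recordsep = True  # unconditionally overwrites the store above
    return out

# Whatever need_recordsep is on entry, the flagged store never changes the
# output, because the end of every iteration overwrites it with true:
for start in (False, True):
    assert (footer_loop(["f1", "f2"], start, with_dead_store=True)
            == footer_loop(["f1", "f2"], start, with_dead_store=False))
print("identical output with and without the dead store")
```

If the loop body ever gained a path that reads need_recordsep after the if-branch without first setting it, the store would stop being dead, which is exactly the maintenance concern raised above.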


I think that the question that should be answered before moving forward
here is: what this discussion is expected to result in?
If the goal is to avoid an unused value to make Coverity/clang's scan-build
a little happier, then maybe just don't touch the other line, which is not
recognized as dead (at least by scan-build; I wonder what Coverity says
about that line).
Otherwise, if the goal is to review the code and make it cleaner, then
why not get rid of "if (need_recordsep)" there?

Best regards,
Alexander




Re: ProcessStartupPacket(): database_name and user_name truncation

2023-06-30 Thread Tom Lane
Nathan Bossart  writes:
> After taking another look at this, I wonder if it'd be better to fail as
> soon as we see the database or user name is too long instead of lugging
> them around when authentication is destined to fail.

If we're agreed that we aren't going to truncate these identifiers,
that seems like a reasonable way to handle it.

regards, tom lane




Re: ProcessStartupPacket(): database_name and user_name truncation

2023-06-30 Thread Nathan Bossart
After taking another look at this, I wonder if it'd be better to fail as
soon as we see the database or user name is too long instead of lugging
them around when authentication is destined to fail.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: On /*----- comments

2023-06-30 Thread Tom Lane
Heikki Linnakangas  writes:
> Except for the translator comments, I think those others forgot about 
> the end-guards by accident. But they look just as nice to me. It's 
> probably not worth the code churn to remove them from existing comments, 
> but how about we stop requiring them in new code, and update the 
> pgindent README accordingly?

Seems reasonable; the trailing dashes eat a line without adding much.

Should we also provide specific guidance about how many leading dashes
to use for this?  I vaguely recall that pgindent might only need one,
but I think using somewhere around 5 to 10 looks better.

regards, tom lane




On /*----- comments

2023-06-30 Thread Heikki Linnakangas

I spotted this comment in walsender.c:


/*---
 * When reading from a historic timeline, and there is a timeline switch
 * [.. long comment omitted ...]
 * portion of the old segment is copied to the new file.  ---
 */


Note the bogus dashes at the end. This was introduced in commit 
0dc8ead463, which moved and reformatted the comment. It's supposed to 
be an end-guard matching the one at the beginning of the comment, like this:



/*---
 * When reading from a historic timeline, and there is a timeline switch
 * [.. long comment omitted ...]
 * portion of the old segment is copied to the new file.
 *---
 */


But that got me wondering, do we care about those end-guards? pgindent 
doesn't need them. We already have a bunch of comments that don't have 
the end-guard, for example:


analyze.c:

/*--
   translator: %s is a SQL row locking clause such as FOR UPDATE */


gistproc.c:

/*
 * The goal is to form a left and right interval, so that every entry
 * interval is contained by either left or right interval (or both).
 *
 * For example, with the intervals (0,1), (1,3), (2,3), (2,4):
 *
 * 0 1 2 3 4
 * +-+
 *   +---+
 *     +-+
 *     +---+
 *
 * The left and right intervals are of the form (0,a) and (b,4).
 * We first consider splits where b is the lower bound of an entry.
 * We iterate through all entries, and for each b, calculate the
 * smallest possible a. Then we consider splits where a is the
 * upper bound of an entry, and for each a, calculate the greatest
 * possible b.
 *
 * In the above example, the first loop would consider splits:
 * b=0: (0,1)-(0,4)
 * b=1: (0,1)-(1,4)
 * b=2: (0,3)-(2,4)
 *
 * And the second loop:
 * a=1: (0,1)-(1,4)
 * a=3: (0,3)-(2,4)
 * a=4: (0,4)-(2,4)
 */


predicate.c:

/*--
 * The SLRU is no longer needed. Truncate to head before we set head
 * invalid.
 *
 * XXX: It's possible that the SLRU is not needed again until XID
 * wrap-around has happened, so that the segment containing headPage
 * that we leave behind will appear to be new again. In that case it
 * won't be removed until XID horizon advances enough to make it
 * current again.
 *
 * XXX: This should happen in vac_truncate_clog(), not in checkpoints.
 * Consider this scenario, starting from a system with no in-progress
 * transactions and VACUUM FREEZE having maximized oldestXact:
 * - Start a SERIALIZABLE transaction.
 * - Start, finish, and summarize a SERIALIZABLE transaction, creating
 *   one SLRU page.
 * - Consume XIDs to reach xidStopLimit.
 * - Finish all transactions.  Due to the long-running SERIALIZABLE
 *   transaction, earlier checkpoints did not touch headPage.  The
 *   next checkpoint will change it, but that checkpoint happens after
 *   the end of the scenario.
 * - VACUUM to advance XID limits.
 * - Consume ~2M XIDs, crossing the former xidWrapLimit.
 * - Start, finish, and summarize a SERIALIZABLE transaction.
 *   SerialAdd() declines to create the targetPage, because headPage
 *   is not regarded as in the past relative to that targetPage.  The
 *   transaction instigating the summarize fails in
 *   SimpleLruReadPage().
 */

indexcmds.c:

/*-
 * Now we have all the indexes we want to process in indexIds.
 *
 * The phases now are:
 *
 * 1. create new indexes in the catalog
 * 2. build new indexes
 * 3. let new indexes catch up with tuples inserted in the meantime
 * 4. swap index names
 * 5. mark old indexes as dead
 * 6. drop old indexes
 *
 * We process each phase for all indexes before moving to the next 
phase,
 * for efficiency.
 */



Except for the translator comments, I think those others omitted the 
end-guards by accident. But they look just as nice to me. It's 
probably not worth the code churn to remove them from existing comments, 
but perhaps we need not insist on the end-guards in new ones.
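As a side note, here is a rough heuristic (my own sketch, not any existing tooling) for hunting down comment lines where a dash guard got fused onto a sentence, like the one that started this thread:

```python
def suspicious_end_guards(lines):
    """Flag lines ending in '---' that also carry other text, i.e. a
    dash guard fused onto a sentence instead of sitting on its own line."""
    hits = []
    for lineno, line in enumerate(lines, 1):
        s = line.strip()
        if s.endswith('---'):
            body = s.rstrip('-').strip()
            # ' *---' and '/*---' are legitimate guards on their own line
            if body not in ('', '*', '/*'):
                hits.append(lineno)
    return hits

sample = [
    " * portion of the old segment is copied to the new file.  ---",
    " *---",
    "/*---",
    " */",
]
# only the first line is suspicious
```

This is line-based and crude (it doesn't parse comments), but it is enough to grep a source tree for candidates.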

Re: SPI isolation changes

2023-06-30 Thread Tom Lane
Seino Yuki  writes:
> I also thought that using SPI_start_transaction would be more readable
> than using SPI_commit/SPI_rollback to implicitly start a transaction.
> What do you think?

I think you're trying to get us to undo commit 2e517818f, which
is not going to happen.  See the threads that led up to that:

Discussion: 
https://postgr.es/m/3375ffd8-d71c-2565-e348-a597d6e73...@enterprisedb.com
Discussion: https://postgr.es/m/17416-ed8fe5d7213d6...@postgresql.org

It looks to me like you can just change the transaction property
settings immediately after SPI_start_transaction if you want to.
Compare this bit in SnapBuildExportSnapshot:

StartTransactionCommand();

/* There doesn't seem to a nice API to set these */
XactIsoLevel = XACT_REPEATABLE_READ;
XactReadOnly = true;

Also look at the implementation of SPI_commit_and_chain,
particularly RestoreTransactionCharacteristics.
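Putting those hints together, the caller's code could look roughly like this (an illustrative fragment of backend code assuming a background worker with SPI already connected; not a tested drop-in):

```c
/* Illustrative sketch, not a tested drop-in.  After SPI_commit()
 * implicitly starts the next transaction, adjust its characteristics
 * directly, the way SnapBuildExportSnapshot() does. */
#include "postgres.h"
#include "access/xact.h"
#include "executor/spi.h"

static void
next_step_repeatable_read(void)
{
	SPI_commit();				/* ends current xact, starts the next one */

	/* There doesn't seem to be a nice API to set these */
	XactIsoLevel = XACT_REPEATABLE_READ;
	XactReadOnly = true;

	/* ... SPI_execute() queries now run under the new settings ... */
}
```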

regards, tom lane




Re: Assert !bms_overlap(joinrel->relids, required_outer)

2023-06-30 Thread Tom Lane
Richard Guo  writes:
> On Fri, Jun 30, 2023 at 12:20 AM Tom Lane  wrote:
>> Yeah, while looking at this I was wondering why try_mergejoin_path and
>> try_hashjoin_path don't do the same "Paths are parameterized by
>> top-level parents, so run parameterization tests on the parent relids"
>> dance that try_nestloop_path does.  This omission is consistent with
>> that, but it's not obvious why it'd be okay to skip it for
>> non-nestloop joins.  I guess we'd have noticed by now if it wasn't
>> okay, but ...

> I think it just makes these two assertions meaningless to skip it for
> non-nestloop joins if the input paths are for otherrels, because paths
> would never be parameterized by the member relations.  So these two
> assertions would always be true for otherrel paths.  I think this is why
> we have not noticed any problem by now.

After studying this some more, I think that maybe it's the "run
parameterization tests on the parent relids" bit that is misguided.
I believe the way it's really working is that all paths arriving
here are parameterized by top parents, because that's the only thing
we generate to start with.  A path can only become parameterized
by an otherrel when we apply reparameterize_path_by_child to it.
But the only place that happens is in try_nestloop_path itself
(or try_partial_nestloop_path), and then we immediately wrap it in
a nestloop join node, which becomes a child of an append that's
forming a partitionwise join.  The partitionwise join as a
whole won't be parameterized by any child rels.  So I think that
a path that's parameterized by a child rel can't exist "in the wild"
in a way that would allow it to get fed to one of the try_xxx_path
functions.  This explains why the seeming oversights in the merge
and hash cases aren't causing a problem.

If this theory is correct, we could simplify try_nestloop_path a
bit.  I doubt the code savings would matter, but maybe it's
worth changing for clarity's sake.

regards, tom lane




Re: SPI isolation changes

2023-06-30 Thread Seino Yuki

Thanks for the reply!

On 2023-06-30 23:26, Heikki Linnakangas wrote:

On 30/06/2023 17:15, Seino Yuki wrote:

Hi,

When I read the documentation and code of SPI, [1]
I found that SPI_start_transaction does not support
transaction_mode (ISOLATION LEVEL, READ WRITE/READ ONLY) like the BEGIN
command does. [2]
Is there a reason for this?


Per the documentation for SPI_start_transaction that you linked to:

"SPI_start_transaction does nothing, and exists only for code
compatibility with earlier PostgreSQL releases."

I haven't tested it, but perhaps you can do "SET TRANSACTION ISOLATION
LEVEL" in the new transaction after calling SPI_commit() though. Or
"SET DEFAULT TRANSACTION ISOLATION LEVEL" before committing.


I understand that too.
However, I thought SPI_start_transaction was the function equivalent to 
BEGIN (or START TRANSACTION).
Therefore, I did not understand why the same option could not be 
specified.


I also thought that using SPI_start_transaction would be more readable
than using SPI_commit/SPI_rollback to implicitly start a transaction.
What do you think?

Regards,
--
Seino Yuki
NTT DATA CORPORATION




Re: SPI isolation changes

2023-06-30 Thread Heikki Linnakangas

On 30/06/2023 17:15, Seino Yuki wrote:

Hi,

When I read the documentation and code of SPI, [1]
I found that SPI_start_transaction does not support
transaction_mode (ISOLATION LEVEL, READ WRITE/READ ONLY) like the BEGIN
command does. [2]
Is there a reason for this?


Per the documentation for SPI_start_transaction that you linked to:

"SPI_start_transaction does nothing, and exists only for code 
compatibility with earlier PostgreSQL releases."


I haven't tested it, but perhaps you can do "SET TRANSACTION ISOLATION 
LEVEL" in the new transaction after calling SPI_commit() though. Or "SET 
DEFAULT TRANSACTION ISOLATION LEVEL" before committing.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Avoid unused value (src/fe_utils/print.c)

2023-06-30 Thread Karina Litskevich
Hi,

Alexander wrote:

> It also aligns the code with print_unaligned_vertical(), but I can't see why
> need_recordsep = true;
> is a no-op here (scan-build dislikes only need_recordsep = false;).
> I suspect that removing that line will change the behaviour in cases when
> need_recordsep = false after the loop "print cells" and the loop
> "for (footers)" is executed.

As I understand it, cont->cells is supposed to have all cont->ncolumns * cont->nrows
entries filled, so the loop "print cells" always assigns need_recordsep = true,
except when there are no cells at all or cancel_pressed == true.
If cancel_pressed == true then footers are not printed. So to have
need_recordsep == false before the loop "for (footers)", the table should be empty,
and need_recordsep should be false before the loop "print cells". It can only
be false there when cont->opt->start_table == true and opt_tuples_only == true,
so that headers are not printed. But when opt_tuples_only == true, footers are
not printed either.

So technically removing "need_recordsep = true;" won't change the outcome. But
it's not obvious, so I also have doubts about removing this line. If someday
print options are changed, for example to support printing footers and not
printing headers, or anything else changes in this function, the output might
be unexpected with this line removed.

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/




SPI isolation changes

2023-06-30 Thread Seino Yuki

Hi,

When I read the documentation and code of SPI, [1]
I found that SPI_start_transaction does not support
transaction_mode (ISOLATION LEVEL, READ WRITE/READ ONLY) like the BEGIN
command does. [2]

Is there a reason for this?

I would like to be able to set transaction_mode in 
SPI_start_transaction.

What do you think?

[1]
https://www.postgresql.org/docs/devel/spi-spi-start-transaction.html

[2]
https://www.postgresql.org/docs/devel/sql-begin.html

Regards,
--
Seino Yuki
NTT DATA CORPORATION




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-06-30 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

Thank you for giving comments!

> > > Sorry for the delay, I didn't have time to come back to it until this 
> > > afternoon.
> >
> > No issues, everyone is busy:-).
> >
> > > I don't think that your analysis is correct.  Slots are guaranteed to be
> > > stopped after all the normal backends have been stopped, exactly to avoid
> such
> > > extraneous records.
> > >
> > > What is happening here is that the slot's confirmed_flush_lsn is properly
> > > updated in memory and ends up being the same as the current LSN before the
> > > shutdown.  But as it's a logical slot and those records aren't decoded, 
> > > the
> > > slot isn't marked as dirty and therefore isn't saved to disk.  You don't 
> > > see
> > > that behavior when doing a manual checkpoint before (per your script
> comment),
> > > as in that case the checkpoint also tries to save the slot to disk but 
> > > then
> > > finds a slot that was marked as dirty and therefore saves it.
> > >
> 
> Here, why is the behavior different for manual and non-manual checkpoints?

I have analyzed this further and concluded that there is no difference between
a manual and a shutdown checkpoint.

The difference was whether the CHECKPOINT record has been decoded or not.
The overall workflow of this test was:

1. do INSERT
(2. do CHECKPOINT)
(3. decode CHECKPOINT record)
4. receive feedback message from standby
5. do shutdown CHECKPOINT

At step 3, the walsender decoded that WAL and set candidate_xmin_lsn. The 
stack trace was:
standby_decode()->SnapBuildProcessRunningXacts()->LogicalIncreaseXminForSlot().

At step 4, the slot's confirmed_flush was updated, but ReplicationSlotSave()
was executed only when slot->candidate_xmin_lsn held a valid LSN. If steps 2
and 3 are missed, the dirty flag is not set and the change stays in memory only.

Finally, the CHECKPOINT was executed at step 5. If steps 2 and 3 are missed and
the patch from Julien is not applied, the updated value is discarded. This
is what I observed. The patch forces the logical slot to be saved at the shutdown
checkpoint, so the confirmed_lsn is saved to disk at step 5.
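To restate the mechanism, here is a toy model (my own simplification, not the server code) of why an in-memory confirmed_flush advance can be lost at shutdown:

```python
class ToySlot:
    """Toy model: the slot is written to disk only when marked dirty,
    and a plain confirmed_flush advance does not set the dirty flag."""
    def __init__(self):
        self.mem_confirmed_flush = 0
        self.disk_confirmed_flush = 0
        self.dirty = False

    def advance_confirmed_flush(self, lsn):
        self.mem_confirmed_flush = lsn      # in-memory update only

    def mark_dirty(self):
        self.dirty = True                   # e.g. candidate_xmin_lsn changed

    def checkpoint(self, force=False):
        if self.dirty or force:             # force models Julien's patch
            self.disk_confirmed_flush = self.mem_confirmed_flush
            self.dirty = False

slot = ToySlot()
slot.advance_confirmed_flush(100)
slot.checkpoint()              # shutdown checkpoint with steps 2-3 skipped:
                               # disk_confirmed_flush is still 0, advance lost
slot.checkpoint(force=True)    # with the patch: always save at shutdown,
                               # disk_confirmed_flush becomes 100
```

If the CHECKPOINT record had been decoded (mark_dirty() called), even a plain checkpoint would have persisted the slot, which matches the observed behavior.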

> Can you please explain what led to updating the confirmed_flush in
> memory but not in the disk?

The code-level workflow is described above. The slot info is saved only after
decoding a CHECKPOINT record. I'm not sure about the initial motivation, but I
suspect we wanted to reduce the number of writes to disk.

> BTW, have we ensured that the additional records being discarded were
> already sent to the subscriber? If so, why has the confirmed_flush LSN
> not progressed past those records?

In this case, the apply worker requests an LSN greater than confirmed_lsn
via START_REPLICATION. Therefore, according to CreateDecodingContext(), the
walsender starts sending from the appropriate record, doesn't it? I think no
discarding happens on the subscriber side.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Extensible storage manager API - SMGR hook Redux

2023-06-30 Thread Matthias van de Meent
Hi hackers,

At Neon, we've been working on removing the file system dependency
from PostgreSQL and replacing it with a distributed storage layer. For
now, we've seen most success in this by replacing the implementation
of the smgr API, but it did require some core modifications like those
proposed early last year by Anastasia [0].

As mentioned in the previous thread, there are several reasons why you
would want to use a non-default storage manager: storage-level
compression, encryption, and disk limit quotas [0]; offloading of cold
relation data was also mentioned [1].

In the thread on Anastasia's patch, Yura Sokolov mentioned that
instead of a hook-based smgr extension, a registration-based smgr
would be preferred, with integration into namespaces. Please find
attached an as of yet incomplete patch that starts to do that.

The patch is still incomplete (as it isn't derived from Anastasia's
patch), but I would like comments on it regardless, as this is a
fairly fundamental component of PostgreSQL that is being modified, and
it is often better to get comments early in the development cycle. One
significant issue I've seen so far is that the catcache is not
guaranteed to be available in all backends that need to do smgr
operations, and I've not yet found a good solution.

Changes compared to HEAD:
- smgrsw is now dynamically allocated and grows as new storage
managers are loaded (during shared_preload_libraries)
- CREATE TABLESPACE has new optional syntax USING smgrname (option [, ...])
- tablespace storage is (planned) fully managed by smgr through some
new smgr apis
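For illustration, the proposed syntax might be used like this (the smgr name and its option are invented placeholders, and the exact placement of the USING clause is a guess from the description above, not taken from the patch):

```sql
-- Hypothetical example of the proposed CREATE TABLESPACE ... USING syntax;
-- 'remote_smgr' and its 'endpoint' option are placeholders.
CREATE TABLESPACE cold_storage
    LOCATION '/mnt/cold'
    USING remote_smgr (endpoint 'https://storage.example.invalid');
```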

Changes compared to Anastasia's patch:
- extensions do not get to hook and replace the api of the smgr code
directly - they are hidden behind the smgr registry.

Successes:
- 0001 passes tests (make check-world)
- 0002 builds without warnings (make)

TODO:
- fix dependency failures when catcache is unavailable
- tablespace redo is currently broken with 0002
- fix tests for 0002
- ensure that pg_dump etc. works with the new tablespace storage manager options

Looking forward to any comments, suggestions and reviews.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)


[0] 
https://www.postgresql.org/message-id/CAP4vRV6JKXyFfEOf%3Dn%2Bv5RGsZywAQ3CTM8ESWvgq%2BS87Tmgx_g%40mail.gmail.com
[1] 
https://www.postgresql.org/message-id/d365f19f-bc3e-4f96-a91e-8db130497...@yandex-team.ru


v1-0001-Expose-f_smgr-to-extensions-for-manual-implementa.patch
Description: Binary data


v1-0002-Prototype-Allow-tablespaces-to-specify-which-SMGR.patch
Description: Binary data


Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG

2023-06-30 Thread Joe Conway

On 6/29/23 22:13, Tristan Partin wrote:

On Mon Jun 5, 2023 at 11:00 AM CDT, Heikki Linnakangas wrote:
I think the uselocale() call renders ineffective the setlocale() calls 
that we make later. Maybe we should replace our setlocale() calls with 
uselocale(), too.


For what it's worth to everyone else in the thread (especially Joe), I
have a patch locally that fixes the mentioned bug using uselocale(). I
am not sure that it is worth committing for v16 given how _large_ (the
patch is actually quite small, +216 -235) of a change it is. I am going
to spend tomorrow combing over it a bit more and evaluating other
setlocale uses in the codebase.


(moving thread to hackers)

I don't see a patch attached -- how is it different than what I posted a 
week ago and added to the commitfest here?


 https://commitfest.postgresql.org/43/4413/

FWIW, if you are proposing replacing all uses of setlocale() with 
uselocale() as Heikki suggested:


1/ I don't think that is pg16 material, and almost certainly not 
back-patchable to earlier.


2/ It probably does not solve all of the identified issues caused by the 
newer perl libraries by itself, i.e. I believe the patch posted to the 
CF is still needed.


3/ I believe it is probably the right way to go for pg17+, but I would 
love to hear opinions from Jeff Davis, Peter Eisentraut, and/or Thomas 
Munroe (the locale code "usual suspects" ;-)), and others, about that.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



OpenPGP_signature
Description: OpenPGP digital signature


Re: index prefetching

2023-06-30 Thread Tomas Vondra
Hi,

attached is a v4 of the patch, with a fairly major shift in the approach.

Until now the patch very much relied on the AM to provide information
which blocks to prefetch next (based on the current leaf index page).
This seemed like a natural approach when I started working on the PoC,
but over time I ran into various drawbacks:

* a lot of the logic is at the AM level

* can't prefetch across the index page boundary (have to wait until the
  next index leaf page is read by the indexscan)

* doesn't work for distance searches (gist/spgist),

After thinking about this, I decided to ditch this whole idea of
exchanging prefetch information through an API, and make the prefetching
almost entirely in the indexam code.

The new patch maintains a queue of TIDs (read from index_getnext_tid),
with up to effective_io_concurrency entries - calling getnext_slot()
adds a TID at the queue tail, issues a prefetch for the block, and then
returns TID from the queue head.
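In pseudo-code form, the queue behavior described above looks roughly like this (a toy model of mine, not the patch's C code; `prefetch_block` stands in for issuing PrefetchBuffer on the TID's heap block):

```python
from collections import deque

def prefetched_tids(tid_iter, prefetch_block, distance):
    """Yield TIDs in index order, but run `distance` TIDs ahead of the
    consumer, prefetching each TID's heap block as it is enqueued."""
    queue = deque()
    for tid in tid_iter:
        block, offset = tid
        prefetch_block(block)          # stand-in for PrefetchBuffer()
        queue.append(tid)
        if len(queue) > distance:      # queue full: hand out the head
            yield queue.popleft()
    while queue:                       # end of scan: drain the queue
        yield queue.popleft()

tids = [(0, 1), (0, 2), (1, 1), (2, 1)]
prefetched = []
out = list(prefetched_tids(iter(tids), prefetched.append, 2))
# TIDs come back in index order, while block 1's prefetch was already
# issued when the consumer was still on block 0's TIDs.
```

The real patch additionally has to skip duplicate blocks and sequential patterns, but the ordering guarantee is the same: the consumer sees exactly the index_getnext_tid() order.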

Maintaining the queue is up to index_getnext_slot() - it can't be done
in index_getnext_tid(), because then it'd affect IOS (and prefetching
heap would mostly defeat the whole point of IOS). And we can't do that
above index_getnext_slot() because that already fetched the heap page.

I still think prefetching for IOS is doable (and desirable), in mostly
the same way - except that we'd need to maintain the queue from some
other place, as IOS doesn't do index_getnext_slot().

FWIW there's also the "index-only filters without IOS" patch [1] which
switches even regular index scans to index_getnext_tid(), so maybe
relying on index_getnext_slot() is a lost cause anyway.

Anyway, this has the nice consequence that it makes AM code entirely
oblivious of prefetching - there's no need to API, we just get TIDs as
before, and the prefetching magic happens after that. Thus it also works
for searches ordered by distance (gist/spgist). The patch got much
smaller (about 40kB, down from 80kB), which is nice.

I ran the benchmarks [2] with this v4 patch, and the results for the
"point" queries are almost exactly the same as for v3. The SAOP part is
still running - I'll add those results in a day or two, but I expect
similar outcome as for point queries.


regards


[1] https://commitfest.postgresql.org/43/4352/

[2] https://github.com/tvondra/index-prefetch-tests-2/

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Companydiff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index e2c9b5f069c..9045c6eb7aa 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -678,7 +678,6 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
 	scan->xs_hitup = so->pageData[so->curPageData].recontup;
 
 so->curPageData++;
-
 return true;
 			}
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0755be83901..f0412da94ae 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
+#include "utils/spccache.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 	 Relation OldHeap, Relation NewHeap,
@@ -751,6 +752,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 			PROGRESS_CLUSTER_INDEX_RELID
 		};
 		int64		ci_val[2];
+		int			prefetch_target;
+
+		prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
 
 		/* Set phase and OIDOldIndex to columns */
 		ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -759,7 +763,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 		tableScan = NULL;
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
+	prefetch_target, prefetch_target);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 722927aebab..264ebe1d8e5 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
 	scan->xs_hitup = NULL;
 	scan->xs_hitupdesc = NULL;
 
+	/* set in each AM when applicable */
+	scan->xs_prefetch = NULL;
+
 	return scan;
 }
 
@@ -440,8 +443,9 @@ systable_beginscan(Relation heapRelation,
 elog(ERROR, "column is not in index");
 		}
 
+		/* no index prefetch for system catalogs */
 		sysscan->iscan = index_beginscan(heapRelation, irel,
-		 snapshot, nkeys, 0);
+		 snapshot, nkeys, 0, 0, 0);
 		index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 		sysscan->scan = NULL;
 	}
@@ -696,8 +700,9 @@ systable_beginscan_ordered(Relation heapRelation,
 			elog(ERROR, "column is not in index");
 	}
 
+	/* no index prefetch for system catalogs */
 	

Re: Removing unneeded self joins

2023-06-30 Thread Andrey Lepikhov

On 4/4/2023 02:30, Gregory Stark (as CFM) wrote:

On Mon, 6 Mar 2023 at 00:30, Michał Kłeczek  wrote:


Hi All,

I just wanted to ask about the status and plans for this patch.
I can see it being stuck at “Waiting for Author” status in several commit tests.


Sadly it seems to now be badly in need of a rebase. There are large
hunks failing in the guts of analyzejoins.c as well as minor failures
elsewhere and lots of offsets which need to be reviewed.

I think given the lack of activity it's out of time for this release
at this point. I'm moving it ahead to the next CF.

Hi,

Version 41 is heavily remade of the feature:

1. In previous versions, I tried to reuse remove_rel_from_query() for 
both left-join and self-join removal. But now I realize that these are 
somewhat different procedures that handle different operations. In this 
patch, only the common stages of the PlannerInfo fixing process are united 
in one function.
2. Transferring clauses from the removed table to the kept one is more 
transparent now and contains comments.
3. The equivalence-class update procedure was changed according to David's 
commit 3373c71. As I see, Tom has added remove_rel_from_eclass since the 
last version (v40), and it looks pretty similar to the update_eclass 
routine in this patch.


It passes regression tests, but some questions are still open:
1. Should we look for duplicated or redundant clauses (likewise for 
eclasses) during the clause-transfer procedure? On the one hand, we 
reduce the length of restrict lists, which can impact planning or 
execution time, and we improve the accuracy of cardinality 
estimation. On the other hand, it is one more place that can make 
planning time much longer in specific cases. It would have been better 
to avoid calling the equal() function here, but it's the only way to 
detect duplicated inequality expressions.
2. Could we reuse ChangeVarNodes instead of sje_walker(), and merge 
remove_rel_from_restrictinfo with replace_varno?
3. Also, I'm still not finished with split_selfjoin_quals: some 
improvements could be made.


--
regards,
Andrey Lepikhov
Postgres Professional
From 4a342b9789f5be209318c13fb7ec336fcbd2aee5 Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Mon, 15 May 2023 09:04:51 +0500
Subject: [PATCH] Remove self-joins.

The Self Join Elimination (SJE) feature removes an inner join of a plain table
to itself in a query tree if it can be proved that the join can be replaced
with a scan. The proof is based on the innerrel_is_unique machinery.

We can remove a self-join when for each outer row:
1. At most one inner row matches the join clause.
2. If the join target list contains any inner vars, an inner row
must be (physically) the same row as the outer one.

In this patch we use the following approach to identify a self-join:
1. Collect all mergejoinable join quals which look like a.x = b.x.
2. Check innerrel_is_unique() for the qual list from (1). If it
returns true, then an outer row matches only the same row from the inner
relation.
3. If the uniqueness of the outer relation is deduced from a baserestrictinfo
clause like 'x=const' on unique field(s), check that the inner relation has the
same clause with the same constant(s).
4. Compare RowMarks of the inner and outer relations. They must have the same
strength.

Some regression tests changed due to self-join removal logic.
---
 doc/src/sgml/config.sgml  |   16 +
 src/backend/optimizer/path/indxpath.c |   38 +
 src/backend/optimizer/plan/analyzejoins.c | 1077 -
 src/backend/optimizer/plan/planmain.c |5 +
 src/backend/utils/misc/guc_tables.c   |   22 +
 src/include/optimizer/paths.h |3 +
 src/include/optimizer/planmain.h  |7 +
 src/test/regress/expected/equivclass.out  |   32 +
 src/test/regress/expected/join.out|  798 +++
 src/test/regress/expected/sysviews.out|3 +-
 src/test/regress/sql/equivclass.sql   |   16 +
 src/test/regress/sql/join.sql |  351 +++
 src/tools/pgindent/typedefs.list  |2 +
 13 files changed, 2319 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6262cb7bb2..68215e1093 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5419,6 +5419,22 @@ ANY num_sync ( 
+  enable_self_join_removal (boolean)
+  
+   enable_self_join_removal configuration 
parameter
+  
+  
+  
+   
+   Enables or disables the query planner's optimization which analyzes the
+query tree and replaces self-joins with semantically equivalent single
+scans. Only plain tables are taken into consideration.
+The default is on.
+   
+  
+ 
+
  
   enable_seqscan (boolean)
   
diff --git a/src/backend/optimizer/path/indxpath.c 
b/src/backend/optimizer/path/indxpath.c
index 0065c8992b..57bdc6811f 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -3491,6 +3491,21 

Extension Enhancement: Buffer Invalidation in pg_buffercache

2023-06-30 Thread Palak Chaturvedi
I hope this email finds you well. I am excited to share that I have
extended the functionality of the `pg_buffercache` extension by
implementing buffer invalidation capability, as requested by some
PostgreSQL contributors for improved testing scenarios.

This marks my first time submitting a patch to pgsql-hackers, and I am
eager to receive your expert feedback on the changes made. Your
insights are invaluable, and any review or comments you provide will
be greatly appreciated.

The primary objective of this enhancement is to enable explicit buffer
invalidation within the `pg_buffercache` extension. By doing so, we
can simulate scenarios where buffers are invalidated and observe the
resulting behavior in PostgreSQL.

As part of this patch, a new function or mechanism has been introduced
to facilitate buffer invalidation. I would like to hear your thoughts
on whether this approach provides a good user interface for this
functionality. Additionally, I seek your evaluation of the buffer
locking protocol employed in the extension to ensure its correctness
and efficiency.

Please note that I plan to add comprehensive documentation once the
details of this enhancement are agreed upon. This documentation will
serve as a valuable resource for users and contributors alike. I
believe that your expertise will help uncover any potential issues and
opportunities for further improvement.

I have attached the patch file to this email for your convenience.
Your valuable time and consideration in reviewing this extension are
sincerely appreciated.

Thank you for your continued support and guidance. I am looking
forward to your feedback and collaboration in enhancing the PostgreSQL
ecosystem.

The working of the extension:

1. Creating the pg_buffercache extension, then running a SELECT query on
a table and noting the buffers to be cleared.
pgbench=# create extension pg_buffercache;
CREATE EXTENSION
pgbench=# select count(*) from pgbench_accounts;
 count
---
 10
(1 row)

pgbench=# SELECT *
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
 bufferid | relfilenode | reltablespace | reldatabase | relforknumber
| relblocknumber | isdirty | usagecount | pinning_backends
--+-+---+-+---++-++--
  233 |   16397 |  1663 |   16384 | 0
|  0 | f   |  1 |0
  234 |   16397 |  1663 |   16384 | 0
|  1 | f   |  1 |0
  235 |   16397 |  1663 |   16384 | 0
|  2 | f   |  1 |0
  236 |   16397 |  1663 |   16384 | 0
|  3 | f   |  1 |0
  237 |   16397 |  1663 |   16384 | 0
|  4 | f   |  1 |0


2. Clearing a single buffer by entering the bufferid.
pgbench=# SELECT count(*)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
 count
---
  1660
(1 row)

pgbench=# select pg_buffercache_invalidate(233);
 pg_buffercache_invalidate
---
 t
(1 row)

pgbench=# SELECT count(*)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
 count
---
  1659
(1 row)

3. Clearing all buffers for a relation using the function.
pgbench=# SELECT count(*)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
 count
---
  1659
(1 row)

pgbench=# select count(pg_buffercache_invalidate(bufferid)) from
pg_buffercache where relfilenode =
pg_relation_filenode('pgbench_accounts'::regclass);
 count
---
  1659
(1 row)

pgbench=# SELECT count(*)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('pgbench_accounts'::regclass);
 count
---
 0
(1 row)


Best regards,
Palak


v1-0001-Invalidate-Buffer-By-Bufnum.patch
Description: Binary data


Re: [PATCH] Using named captures in Catalog::ParseHeader()

2023-06-30 Thread Dagfinn Ilmari Mannsåker
Michael Paquier  writes:

> On Wed, Jun 14, 2023 at 10:30:23AM +0100, Dagfinn Ilmari Mannsåker wrote:
>> Thanks for the review!
>
> v17 is now open, so applied this one.

Thanks for committing!

- ilmari




Re: Tab completion for CREATE SCHEMAAUTHORIZATION

2023-06-30 Thread Dagfinn Ilmari Mannsåker
Michael Paquier  writes:

> On Tue, May 09, 2023 at 12:26:16PM +0900, Michael Paquier wrote:
>> That looks pretty much OK to me.  One tiny comment I have is that this
>> lacks brackets for the inner blocks, so I have added some in the v4
>> attached.
>
> The indentation was a bit wrong, so fixed it, and applied on HEAD.

Thanks!

- ilmari




Re: [HACKERS] make async slave to wait for lsn to be replayed

2023-06-30 Thread Картышов Иван
All rebased and tested

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..c7460bd9b8 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1752,6 +1753,15 @@ PerformWalRecovery(void)
 break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for,
+			 * set latches in shared memory array to notify the waiter.
+			 */
+			if (XLogRecoveryCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+			{
+WaitSetLatch(XLogRecoveryCtl->lastReplayedEndRecPtr);
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
 		} while (record != NULL);
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 48f7348f91..d8f6965d8c 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -61,6 +61,7 @@ OBJS = \
 	vacuum.o \
 	vacuumparallel.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/meson.build b/src/backend/commands/meson.build
index 42cced9ebe..ec6ab7722a 100644
--- a/src/backend/commands/meson.build
+++ b/src/backend/commands/meson.build
@@ -50,4 +50,5 @@ backend_sources += files(
   'vacuumparallel.c',
   'variable.c',
   'view.c',
+  'wait.c',
 )
diff --git a/src/backend/commands/wait.c b/src/backend/commands/wait.c
new file mode 100644
index 00..2f9be4f9ad
--- /dev/null
+++ b/src/backend/commands/wait.c
@@ -0,0 +1,275 @@
+/*-
+ *
+ * wait.c
+ *	  Implements wait lsn, which allows waiting for events such as
+ *	  LSN having been replayed on replica.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2023, Regents of PostgresPro
+ *
+ * IDENTIFICATION
+ *	  src/backend/commands/wait.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include 
+
+#include "access/xact.h"
+#include "access/xlogrecovery.h"
+#include "access/xlogdefs.h"
+#include "commands/wait.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/backendid.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "storage/sinvaladt.h"
+#include "storage/spin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+#include "utils/timestamp.h"
+
+/* Add to shared memory array */
+static void AddWaitedLSN(XLogRecPtr lsn_to_wait);
+
+/* Shared memory structure */
+typedef struct
+{
+	int			backend_maxid;
+	pg_atomic_uint64 min_lsn; /* XLogRecPtr of minimal waited for LSN */
+	slock_t		mutex;
+	/* LSNs that different backends are waiting */
+	XLogRecPtr	lsn[FLEXIBLE_ARRAY_MEMBER];
+} WaitState;
+
+static WaitState *state;
+
+/*
+ * Add the wait event of the current backend to shared memory array
+ */
+static void
+AddWaitedLSN(XLogRecPtr lsn_to_wait)
+{
+	SpinLockAcquire(&state->mutex);
+	if (state->backend_maxid < MyBackendId)
+		state->backend_maxid = MyBackendId;
+
+	state->lsn[MyBackendId] = lsn_to_wait;
+
+	if (lsn_to_wait < state->min_lsn.value)
+		state->min_lsn.value = lsn_to_wait;
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Delete wait event of the current backend from the shared memory array.
+ */
+void
+DeleteWaitedLSN(void)
+{
+	int			i;
+	XLogRecPtr	lsn_to_delete;
+
+	SpinLockAcquire(&state->mutex);
+
+	lsn_to_delete = state->lsn[MyBackendId];
+	state->lsn[MyBackendId] = InvalidXLogRecPtr;
+
+	/* If we are deleting the minimal LSN, then choose the next min_lsn */
+	if (lsn_to_delete != InvalidXLogRecPtr &&
+		lsn_to_delete == state->min_lsn.value)
+	{
+		state->min_lsn.value = PG_UINT64_MAX;
+		for (i = 2; i <= state->backend_maxid; i++)
+			if (state->lsn[i] != InvalidXLogRecPtr &&
+state->lsn[i] < state->min_lsn.value)
+state->min_lsn.value = state->lsn[i];
+	}
+
+	/* If deleting from the end of the array, shorten the array's used part */
+	if (state->backend_maxid == MyBackendId)
+		for (i = (MyBackendId); i >= 2; i--)
+			if (state->lsn[i] != InvalidXLogRecPtr)
+			{
+state->backend_maxid = i;
+break;
+			}
+
+	SpinLockRelease(&state->mutex);
+}
+
+/*
+ * Report amount of shared memory space needed for WaitState
+ */
+Size
+WaitShmemSize(void)
+{
+	Size		size;
+
+	size = offsetof(WaitState, lsn);
+	size = add_size(size, mul_size(MaxBackends + 1, sizeof(XLogRecPtr)));
+	return size;
+}
+
+/*
+ * Initialize an array of events to wait for in shared memory
+ */
+void
+WaitShmemInit(void)
+{
+	bool		found;
+	uint32		i;
+
+	

Re: PG 16 draft release notes ready

2023-06-30 Thread Masahiro Ikeda

Hi,

Thanks for making the release notes. I found the release note of
PG16 beta2 mentions a reverted following feature.

```
Have initdb use ICU by default if ICU is enabled in the binary (Jeff Davis)

Option --locale-provider=libc can be used to disable ICU.
```

Unfortunately, the feature was reverted by the commit below:
* 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2535c74b1a6190cc42e13f6b6b55d94bff4b7dd6


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: [PATCH] Honor PG_TEST_NOCLEAN for tempdirs

2023-06-30 Thread Daniel Gustafsson
> On 30 Jun 2023, at 09:09, Michael Paquier  wrote:
> 
> On Thu, Jun 29, 2023 at 09:05:59AM -0700, Jacob Champion wrote:
>> Sounds good -- 0002 can be ignored as needed, then. (Or I can resend a
>> v3 for CI purposes, if you'd like.)
> 
> I am assuming that this is 0001 posted here:
> https://www.postgresql.org/message-id/8be3d35a-9608-b1d5-e5e6-7a744ea45...@timescale.com
> 
> And that looks OK to me.  This is something I'd rather backpatch down
> to v11 on usability grounds for developers.  Any comments or objections
> about that?

Agreed, I'd prefer all branches to work the same for this.

Reading the patch, only one thing stood out:

-variable PG_TEST_NOCLEAN is set, data directories will be retained
-regardless of test status.
+variable PG_TEST_NOCLEAN is set, those directories will be retained

I would've written "the data directories" instead of "those directories" here.

--
Daniel Gustafsson





Re: pgsql: Fix search_path to a safe value during maintenance operations.

2023-06-30 Thread Jeff Davis
On Thu, 2023-06-29 at 20:53 -0400, Tom Lane wrote:
> I think that's a seriously awful kluge.  It will mean that things
> behave
> differently for the owner than for MAINTAIN grantees, which pretty
> much
> destroys the use-case for that privilege, as well as being very
> confusing
> and hard to debug.

In version 15, try this:

  CREATE USER foo;
  CREATE SCHEMA foo AUTHORIZATION foo;
  CREATE USER bar;
  CREATE SCHEMA bar AUTHORIZATION bar;
  \c - foo
  CREATE FUNCTION foo.mod10(INT) RETURNS INT IMMUTABLE
LANGUAGE plpgsql AS $$ BEGIN RETURN mod($1,10); END; $$;
  CREATE TABLE t(i INT);
  -- units digit must be unique
  CREATE UNIQUE INDEX t_idx ON t (foo.mod10(i));
  INSERT INTO t VALUES(7); -- success
  INSERT INTO t VALUES(17); -- fails
  GRANT USAGE ON SCHEMA foo TO bar;
  GRANT INSERT ON t TO bar;
  \c - bar
  CREATE FUNCTION bar.mod(INT, INT) RETURNS INT IMMUTABLE
LANGUAGE plpgsql AS $$ BEGIN RETURN $1 + 100; END; $$;
  SET search_path = bar, pg_catalog;
  INSERT INTO foo.t VALUES(7); -- succeeds
  \c - foo
  SELECT * FROM t;
   i 
  ---
   7
   7
  (2 rows)


I'm not sure that everyone in this thread realizes just how broken it
is to depend on search_path in a functional index at all. And doubly so
if it depends on a schema other than pg_catalog in the search_path.

Let's also not forget that logical replication always uses
search_path=pg_catalog, so if you depend on a different search_path for
any function attached to the table (not just functional indexes, also
functions inside expressions or trigger functions), then those are
already broken in version 15. And if a superuser is executing
maintenance commands, there's little reason to think they'll have the
same search path as the user that created the table.

At some point in the very near future (though I realize that point may
come after version 16), we need to lock down the search path in a lot
of cases (not just maintenance commands), and I don't see any way
around that.


Regards,
Jeff Davis





Re: [PATCH] Honor PG_TEST_NOCLEAN for tempdirs

2023-06-30 Thread Michael Paquier
On Thu, Jun 29, 2023 at 09:05:59AM -0700, Jacob Champion wrote:
> Sounds good -- 0002 can be ignored as needed, then. (Or I can resend a
> v3 for CI purposes, if you'd like.)

I am assuming that this is 0001 posted here:
https://www.postgresql.org/message-id/8be3d35a-9608-b1d5-e5e6-7a744ea45...@timescale.com

And that looks OK to me.  This is something I'd rather backpatch down
to v11 on usability grounds for developers.  Any comments or objections
about that?
--
Michael




Re: ExecRTCheckPerms() and many prunable partitions (checkAsUser)

2023-06-30 Thread Amit Langote
On Wed, Jun 28, 2023 at 4:30 PM Amit Langote  wrote:
>
> Hi,
>
> On Tue, Feb 21, 2023 at 4:12 PM Amit Langote  wrote:
> > On Tue, Feb 21, 2023 at 12:40 AM Tom Lane  wrote:
> > > Alvaro Herrera  writes:
> > > > On 2023-Feb-20, Amit Langote wrote:
> > > >> One more thing we could try is come up with a postgres_fdw test case,
> > > >> because it uses the RelOptInfo.userid value for remote-costs-based
> > > >> path size estimation.  But adding a test case to contrib module's
> > > >> suite test a core planner change might seem strange, ;-).
> > >
> > > > Maybe.  Perhaps adding it in a separate file there is okay?
> > >
> > > There is plenty of stuff in contrib module tests that is really
> > > there to test core-code behavior.  (You could indeed argue that
> > > *all* of contrib is there for that purpose.)  If it's not
> > > convenient to test something without an extension, just do it
> > > and don't sweat about that.
> >
> > OK.  Attached adds a test case to postgres_fdw's suite.  You can see
> > that it fails without a316a3bc.
>
> Noticed that there's an RfC entry for this in the next CF.  Here's an
> updated version of the patch where I updated the comments a bit and
> the commit message.
>
> I'm thinking of pushing this on Friday barring objections.

Seeing none, I've pushed this to HEAD and 16.

Marking the CF entry as committed.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: Allow pg_archivecleanup to remove backup history files

2023-06-30 Thread Michael Paquier
On Fri, Jun 23, 2023 at 05:37:09PM +0900, torikoshia wrote:
> On 2023-06-22 16:47, Kyotaro Horiguchi wrote:
> Thanks for your review!

I have begun cleaning up my board, and applied 0001 for the moment.
--
Michael




Re: Todo: Teach planner to evaluate multiple windows in the optimal order

2023-06-30 Thread John Naylor
On Wed, Feb 15, 2023 at 11:23 AM John Naylor wrote:
>
>
> On Tue, Feb 14, 2023 at 5:46 PM David Rowley  wrote:
> >
> > I didn't do a detailed review of the sort patch, but I did wonder
> > about the use of the name "fallback"  in the new functions. The
> > comment in the following snippet from qsort_tuple_unsigned_compare()
> > makes me think "tiebreak" is a better name.
>
> I agree "tiebreak" is better.

Here is v3 with that change. I still need to make sure the tests cover all
cases, so I'll do that as time permits. Also creating CF entry.

--
John Naylor
EDB: http://www.enterprisedb.com
From be88d7cab46e44385762365f1ba23d7684f76d3a Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Mon, 30 Jan 2023 17:10:00 +0700
Subject: [PATCH v3] Split out tiebreaker comparisons from comparetup_*
 functions

Previously, if a specialized comparator found equal datum1 keys,
the "comparetup" function would repeat the comparison on the
datum before proceeding with the unabbreviated first key
and/or additional sort keys.

Move comparing additional sort keys into "tiebreak" functions so
that specialized comparators can call these directly if needed,
avoiding duplicate work.

Reviewed by David Rowley
Discussion: https://www.postgresql.org/message-id/CAFBsxsGaVfUrjTghpf%3DkDBYY%3DjWx1PN-fuusVe7Vw5s0XqGdGw%40mail.gmail.com
---
 src/backend/utils/sort/tuplesort.c |   6 +-
 src/backend/utils/sort/tuplesortvariants.c | 112 -
 src/include/utils/tuplesort.h  |   7 ++
 3 files changed, 95 insertions(+), 30 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e5a4e5b371..3e872d7518 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -513,7 +513,7 @@ qsort_tuple_unsigned_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 	if (state->base.onlyKey != NULL)
 		return 0;
 
-	return state->base.comparetup(a, b, state);
+	return state->base.comparetup_tiebreak(a, b, state);
 }
 
 #if SIZEOF_DATUM >= 8
@@ -537,7 +537,7 @@ qsort_tuple_signed_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 	if (state->base.onlyKey != NULL)
 		return 0;
 
-	return state->base.comparetup(a, b, state);
+	return state->base.comparetup_tiebreak(a, b, state);
 }
 #endif
 
@@ -561,7 +561,7 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 	if (state->base.onlyKey != NULL)
 		return 0;
 
-	return state->base.comparetup(a, b, state);
+	return state->base.comparetup_tiebreak(a, b, state);
 }
 
 /*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb6cfcfd00..dd9d4d3764 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -47,18 +47,24 @@ static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 			   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
 			Tuplesortstate *state);
+static int	comparetup_heap_tiebreak(const SortTuple *a, const SortTuple *b,
+			Tuplesortstate *state);
 static void writetup_heap(Tuplesortstate *state, LogicalTape *tape,
 		  SortTuple *stup);
 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
 		 LogicalTape *tape, unsigned int len);
 static int	comparetup_cluster(const SortTuple *a, const SortTuple *b,
 			   Tuplesortstate *state);
+static int	comparetup_cluster_tiebreak(const SortTuple *a, const SortTuple *b,
+			   Tuplesortstate *state);
 static void writetup_cluster(Tuplesortstate *state, LogicalTape *tape,
 			 SortTuple *stup);
 static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 			LogicalTape *tape, unsigned int tuplen);
 static int	comparetup_index_btree(const SortTuple *a, const SortTuple *b,
    Tuplesortstate *state);
+static int	comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
+   Tuplesortstate *state);
 static int	comparetup_index_hash(const SortTuple *a, const SortTuple *b,
   Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
@@ -67,6 +73,8 @@ static void readtup_index(Tuplesortstate *state, SortTuple *stup,
 		  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 			 Tuplesortstate *state);
+static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
+			 Tuplesortstate *state);
 static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 		   SortTuple *stup);
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
@@ -165,6 +173,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 
 	base->removeabbrev = removeabbrev_heap;
 	base->comparetup = comparetup_heap;
+	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
@@ -242,6 +251,7 @@ 

Re: Synchronizing slots from primary to standby

2023-06-30 Thread Drouvot, Bertrand

Hi Kuroda-san,

On 6/29/23 12:22 PM, Hayato Kuroda (Fujitsu) wrote:

Dear Drouvot,

Hi, I'm also interested in this feature. The following are my high-level comments.
I did not comment on detailed notation because this patch is not at that stage
yet. And I'm very sorry that I could not follow all of the discussion.



Thanks for looking at it and your feedback!

All I've done so far is provide a rebased version (in April) of the existing
patch.

I'll take a closer look at the code, your feedback, and Amit's while
working on the new version, which will:

- take care of slot invalidation
- ensure that a synchronized slot can't be consumed until the standby gets
promoted

as discussed up-thread.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com