Re: Error in calculating length of encoded base64 string

2023-06-08 Thread Tom Lane
Gurjeet Singh  writes:
> On Thu, Jun 8, 2023 at 7:35 AM Tom Lane  wrote:
>> This bug is very ancient, dating to commit 79d78bb26 which
>> added encode.c.  (The other instances were presumably copied
>> from there.)  Still, it doesn't quite seem worth back-patching.

> Is it worth investing time in trying to unify these 3 occurrences of
> base64-length code (and possibly other relevant code) in one place? If
> yes, I can volunteer for it.

I wondered about that too.  It seems really silly that we made
a copy in src/common and did not replace the others with calls
to that.

regards, tom lane




Re: Error in calculating length of encoded base64 string

2023-06-08 Thread Gurjeet Singh
On Thu, Jun 8, 2023 at 7:35 AM Tom Lane  wrote:
>
> o.tselebrovs...@postgrespro.ru writes:
> > While working on an extension I've found an error in how length of
> > encoded base64 string is calculated;
>
> Yeah, I think you're right.  It's not of huge significance, because
> it just overestimates by 1 or 2 bytes, but we might as well get
> it right.  Thanks for the report and patch!

From your commit d98ed080bb

>This bug is very ancient, dating to commit 79d78bb26 which
>added encode.c.  (The other instances were presumably copied
>from there.)  Still, it doesn't quite seem worth back-patching.

Is it worth investing time in trying to unify these 3 occurrences of
base64-length code (and possibly other relevant code) in one place? If
yes, I can volunteer for it.

The common code facility under src/common/ did not exist back when
pgcrypto was added, but since it does now, it may be worth making the
others depend on the implementation in src/common/.

Best regards,
Gurjeet
http://Gurje.et




Re: Introduce WAIT_EVENT_EXTENSION and WAIT_EVENT_BUFFER_PIN

2023-06-08 Thread Michael Paquier
On Fri, Jun 09, 2023 at 06:26:07AM +0200, Drouvot, Bertrand wrote:
> +1, that's something I have in mind to work on once/if this patch gets
> committed.

FWIW, I'm OK with the patch, without the need to worry about the
performance.  Even if that's the case, we could just mark all these as
inline and move on..
--
Michael




Re: Support logical replication of DDLs

2023-06-08 Thread Amit Kapila
On Fri, Jun 9, 2023 at 5:35 AM Masahiko Sawada  wrote:
>
> On Thu, Jun 8, 2023 at 9:12 PM shveta malik  wrote:
> >
> >
> > Please find new set of patches addressing below:
> > a) Issue mentioned by Wang-san in [1],
> > b) Comments from Peter given in [2]
> > c) Comments from Amit given in the last 2 emails.
> >
> > [1]: 
> > https://www.postgresql.org/message-id/OS3PR01MB62750D43D4F7F075B33BD2609E52A%40OS3PR01MB6275.jpnprd01.prod.outlook.com
> > [2]: 
> > https://www.postgresql.org/message-id/CAHut%2BPv9vPbUQc0fzrKmDkKOsS_bj-hup_E%2BsLHNEX%2B6F%2BSY5Q%40mail.gmail.com
> >
> > Thank You Vignesh for handling (a), Ajin for handling (b), Shi-san and
> > Hou-san for contributing in (c).
> >
>
> I have some questions about DDL deparser:
>
> I've tested deparsed and reformed DDL statements with the following
> function and event trigger borrowed from the regression tests:
>
> CREATE OR REPLACE FUNCTION test_ddl_deparse()
>   RETURNS event_trigger LANGUAGE plpgsql AS
> $$
> DECLARE
> r record;
> deparsed_json text;
> BEGIN
> FOR r IN SELECT * FROM pg_event_trigger_ddl_commands()
> -- Some TABLE commands generate sequence-related
> -- commands, also deparse them.
> WHERE command_tag in ('ALTER FOREIGN TABLE', 'ALTER TABLE',
>   'CREATE FOREIGN TABLE', 'CREATE TABLE',
>   'CREATE TABLE AS', 'DROP FOREIGN TABLE',
>   'DROP TABLE', 'ALTER SEQUENCE',
>   'CREATE SEQUENCE', 'DROP SEQUENCE')
> LOOP
> deparsed_json = pg_catalog.ddl_deparse_to_json(r.command);
> RAISE NOTICE 'deparsed json: %', deparsed_json;
> RAISE NOTICE 're-formed command: %',
> pg_catalog.ddl_deparse_expand_command(deparsed_json);
> END LOOP;
> END;
> $$;
>
> CREATE EVENT TRIGGER test_ddl_deparse
> ON ddl_command_end EXECUTE PROCEDURE test_ddl_deparse();
>
> The query "CREATE TABLE TEST AS SELECT 1" is deparsed and reformed by
> DDL deparse to:
>
> CREATE  TABLE  public.test ("?column?" pg_catalog.int4 STORAGE PLAIN  )
>
> I agree that we need to deparse CTAS in such a way for logical
> replication but IIUC DDL deparse feature (e.g., ddl_deparse_to_json()
> and ddl_deparse_expand_command()) is a generic feature in principle so
> I'm concerned that there might be users who want to get the DDL
> statement that is actually executed. If so, we might want to have a
> switch to get the actual DDL statement instead.
>

I agree with an additional switch here but OTOH I think we should
consider excluding CTAS and SELECT INTO in the first version to avoid
further complications to a bigger patch. Let's just do it for
CREATE/ALTER/DROP table.

> Also, the table and column data type are schema qualified, but is
> there any reason why the reformed query doesn't explicitly specify
> tablespace, toast compression and access method? Since their default
> values depend on GUC parameters, the table could be created in a
> different tablespace on the subscriber if the subscriber set a
> different value to default_tablespace GUC parameter. IIUC the reason
> why we use schema-qualified names instead of sending along with
> search_path is to prevent tables from being created in a different
> schema depending on the search_path setting on the subscriber.
>

I think we do send schema name during DML replication as well, so
probably doing the same for DDL replication makes sense from that
angle as well. The other related point is that the apply worker always
sets search_path to an empty string.

> So I
> wondered why we don't do that for other GUC-dependent properties.
>

Yeah, that would probably be a good idea but do we want to do it in
default mode? I think if we want to optimize the default mode such
that we only WAL log the DDL with user-specified options then
appending options for default GUCs would make the string to be WAL
logged unnecessarily long. I am thinking that in default mode we can
allow subscribers to perform operations with their respective default
GUCs.

-- 
With Regards,
Amit Kapila.




Skip collecting decoded changes of already-aborted transactions

2023-06-08 Thread Masahiko Sawada
Hi,

In logical decoding, we don't need to collect decoded changes of
aborted transactions. While streaming changes, we can detect
concurrent abort of the (sub)transaction but there is no mechanism to
skip decoding changes of transactions that are known to already be
aborted. With the attached WIP patch, we check CLOG when decoding the
transaction for the first time. If it's already known to be aborted,
we skip collecting decoded changes of such transactions. That way,
when logical replication is behind or restarts, we don't need to
decode large transactions that have already aborted, which helps
improve decoding performance.

Feedback is very welcome.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


skip_decoding_already_aborted_txn.patch
Description: Binary data


Re: Fix search_path for all maintenance commands

2023-06-08 Thread Nathan Bossart
On Thu, Jun 08, 2023 at 06:08:08PM -0400, Greg Stark wrote:
> I guess that's pretty narrow and a reasonable thing to desupport.
> Users could just mark those functions with search_path or schema
> qualify the object references in them. Perhaps we should also be
> picking up cases like that sooner so users realize they've created a
> footgun for themselves?

I'm inclined to agree that this is reasonable to desupport.  Relying on the
search_path for the cases Greg describes already seems rather fragile, so
I'm skeptical that forcing a safe one for maintenance commands would make
things significantly worse.  At least, it sounds like the right trade-off
based on Jeff's note about privilege escalation risks.

I bet we could skip forcing the search_path for maintenance commands run as
the table owner, but such a discrepancy seems likely to cause far more
confusion than anything else.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Introduce WAIT_EVENT_EXTENSION and WAIT_EVENT_BUFFER_PIN

2023-06-08 Thread Drouvot, Bertrand

Hi,

On 6/9/23 1:15 AM, Michael Paquier wrote:

On Thu, Jun 08, 2023 at 10:57:55AM +0900, Masahiro Ikeda wrote:

(Excuse me for cutting in, and this is not directly related to the thread.)
+1. I'm interested in the feature.

This is just an example, and it would probably be useful for other users. IMO,
at least, it's better to improve the specification whereby the "Extension"
wait event type has only the "Extension" wait event.


I hope that nobody would counter-argue you here.  In my opinion, we
should just introduce an API that allows extensions to retrieve wait
event numbers that are allocated by the backend under
PG_WAIT_EXTENSION, in a fashion similar to GetNamedLWLockTranche().
Say something like:
int GetExtensionWaitEvent(const char *wait_event_name);


+1, that's something I have in mind to work on once/if this patch gets
committed.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Typo in src/backend/access/nbtree/README?

2023-06-08 Thread Nathan Bossart
On Fri, Jun 09, 2023 at 11:29:02AM +0800, Richard Guo wrote:
> On Fri, Jun 9, 2023 at 10:37 AM Gurjeet Singh  wrote:
> 
>> On Thu, Jun 8, 2023 at 7:11 AM Daniel Westermann (DWE)
>>  wrote:
>> >
>> > ... shouldn't there be a "to" before "detect"?
>> >
>> > These two additions make it possible detect a concurrent page split
>>
>> Agreed. Attached is a small patch that fixes this.
> 
> 
> +1.  A little nitpick: the new line seems overly long compared to
> adjacent lines, should we wrap it?

Committed, thanks.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Typo in src/backend/access/nbtree/README?

2023-06-08 Thread Richard Guo
On Fri, Jun 9, 2023 at 10:37 AM Gurjeet Singh  wrote:

> On Thu, Jun 8, 2023 at 7:11 AM Daniel Westermann (DWE)
>  wrote:
> >
> > ... shouldn't there be a "to" before "detect"?
> >
> > These two additions make it possible detect a concurrent page split
>
> Agreed. Attached is a small patch that fixes this.


+1.  A little nitpick: the new line seems overly long compared to
adjacent lines, should we wrap it?

Thanks
Richard


Re: Fix last unitialized memory warning

2023-06-08 Thread Gurjeet Singh
On Wed, Jun 7, 2023 at 7:31 AM Tristan Partin  wrote:
>
> This patch is really not necessary from a functional point of view. It
> is only necessary if we want to silence a compiler warning.
>
> Tested on `gcc (GCC) 13.1.1 20230511 (Red Hat 13.1.1-2)`.

...

> From my perspective, this warning definitely seems like a false
> positive, but I don't know the code well-enough to say that for certain.

It was a bit confusing to see a patch for src/bin/pgbench/pgbench.c,
but that file not mentioned in the warning messages you quoted. I take
it that your patch silences a warning in pgbench.c; it would've been
nice to see the actual warning.

-PgBenchValue vargs[MAX_FARGS];
+PgBenchValue vargs[MAX_FARGS] = { 0 };

If I'm right about what kind of warning this might've caused (use of
possibly uninitialized variable), you're correct that it is benign.
The for loop after declarations initializes all the elements of this
array using evaluateExpr(), and if the initialization fails for some
reason, the loop ends prematurely and returns from the function.

I analyzed a few code-paths that return true from evaluateExpr(), and
I'd like to believe that _all_ code paths that return true also
initialize the array element passed. But because there are so many
branches and function calls beneath evaluateExpr(), I think it's
better to be paranoid and initialize all the array elements to 0.

Also, it is better to initialize/clear an array at the point of
definition, like your patch does. So, +1 to the patch.

Best regards,
Gurjeet
http://Gurje.et




Re: ERROR: no relation entry for relid 6

2023-06-08 Thread Richard Guo
On Fri, Jun 9, 2023 at 5:13 AM Tom Lane  wrote:

> Richard Guo  writes:
> > Hmm, maybe we can additionally check if the PHV needs to be evaluated
> > above the join.  If so it cannot be removed.
> Yeah, that seems to make sense, and it squares with the existing
> comment saying that PHVs used above the join can't be removed.
> Pushed that way.


Thanks for pushing it!  I've closed the open item for it.

Thanks
Richard


Re: Support logical replication of DDLs

2023-06-08 Thread shveta malik
On Thu, Jun 8, 2023 at 5:31 PM shveta malik  wrote:
>
> Please find new set of patches addressing below:
> a) Issue mentioned by Wang-san in [1],
> b) Comments from Peter given in [2]
> c) Comments from Amit given in the last 2 emails.
>
> [1]: 
> https://www.postgresql.org/message-id/OS3PR01MB62750D43D4F7F075B33BD2609E52A%40OS3PR01MB6275.jpnprd01.prod.outlook.com
> [2]: 
> https://www.postgresql.org/message-id/CAHut%2BPv9vPbUQc0fzrKmDkKOsS_bj-hup_E%2BsLHNEX%2B6F%2BSY5Q%40mail.gmail.com
>
> Thank You Vignesh for handling (a), Ajin for handling (b), Shi-san and
> Hou-san for contributing in (c).
>
> The new changes are in patch 0001, 0002, 0005 and 0008.
>

Patch 0005 has an issue in 'doc/src/sgml/logical-replication.sgml'
which makes the documentation fail to compile. It will be fixed in the
next version. Please review the rest of the changes meanwhile. Sorry
for the inconvenience.

thanks
Shveta




Re: Implement generalized sub routine find_in_log for tap test

2023-06-08 Thread Michael Paquier
On Tue, Jun 06, 2023 at 06:43:44PM +0530, vignesh C wrote:
> Please find the attached patches that can be applied on back branches
> too. v5*master.patch can be applied on master, v5*PG15.patch can be
> applied on PG15, v5*PG14.patch can be applied on PG14, v5*PG13.patch
> can be applied on PG13, v5*PG12.patch can be applied on PG12, PG11 and
> PG10.

Thanks.  The minimal number of conflicts across all these branches is
always fun to play with.  I have finally gotten around to applying all
of that, after doing a proper split, applying one part down to 14 and
the second back to 11.
--
Michael




Re: Implement generalized sub routine find_in_log for tap test

2023-06-08 Thread Michael Paquier
On Tue, Jun 06, 2023 at 05:53:40PM +0530, Bharath Rupireddy wrote:
> Yes. A simpler way of doing it would be to move advance_wal() in
> 019_replslot_limit.pl to Cluster.pm, something like the attached. CI
> members don't complain with it
> https://github.com/BRupireddy/postgres/tree/add_a_function_in_Cluster.pm_to_generate_WAL.
> Perhaps, we can name it better instead of advance_wal, say
> generate_wal or some other?

Why not discussing that on a separate thread?  What you are proposing
is independent of what Vignesh has proposed.  Note that the patch
format is octet-stream, causing extra CRs to exist in the patch.
Something happened on your side when you sent your patch, I guess?
--
Michael




Re: Typo in src/backend/access/nbtree/README?

2023-06-08 Thread Gurjeet Singh
On Thu, Jun 8, 2023 at 7:11 AM Daniel Westermann (DWE)
 wrote:
>
> ... shouldn't there be a "to" before "detect"?
>
> These two additions make it possible detect a concurrent page split

Agreed. Attached is a small patch that fixes this.

Thanks for the report!

Best regards,
Gurjeet
http://Gurje.et
From f694a97e72a3ad744073c9a04f3c39615b1a9987 Mon Sep 17 00:00:00 2001
From: Gurjeet Singh 
Date: Thu, 8 Jun 2023 19:30:57 -0700
Subject: [PATCH v1] Fix grammar

As reported by Daniel Westermann.
---
 src/backend/access/nbtree/README | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index dd0f7ad2bd..1174ab9131 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -17,7 +17,7 @@ The basic Lehman & Yao Algorithm
 Compared to a classic B-tree, L&Y adds a right-link pointer to each page,
 to the page's right sibling.  It also adds a "high key" to each page, which
 is an upper bound on the keys that are allowed on that page.  These two
-additions make it possible detect a concurrent page split, which allows the
+additions make it possible to detect a concurrent page split, which allows the
 tree to be searched without holding any read locks (except to keep a single
 page from being modified while reading it).
 
-- 
2.35.1



Re: running logical replication as the subscription owner

2023-06-08 Thread Masahiko Sawada
On Thu, Jun 8, 2023 at 7:29 PM Amit Kapila  wrote:
>
> On Thu, Jun 8, 2023 at 6:32 AM Masahiko Sawada  wrote:
> >
> > On Mon, Jun 5, 2023 at 3:15 AM Amit Kapila  wrote:
> > >
> > > On Fri, May 26, 2023 at 6:18 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Thu, May 25, 2023 at 5:41 PM Amit Kapila  
> > > > wrote:
> > > >
> > > > I've attached the updated patch. Please review it.
> > > >
> > >
> > > Few comments:
> > > 1.
> > > + /* get the owner for ACL and RLS checks */
> > > + run_as_owner = MySubscription->runasowner;
> > > + checkowner = run_as_owner ? MySubscription->owner : 
> > > rel->rd_rel->relowner;
> > > +
> > >   /*
> > >   * Check that our table sync worker has permission to insert into the
> > >   * target table.
> > >   */
> > > - aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(),
> > > + aclresult = pg_class_aclcheck(RelationGetRelid(rel), checkowner,
> > >
> > > One thing that slightly worries me about this change is that we
> > > started to check the permission for relowner before even ensuring that
> > > we can switch to relowner. See checks in SwitchToUntrustedUser(). If
> > > we want to first ensure that we can switch to relowner then I think we
> > > should move this permission-checking code before we try to copy the
> > > table.
> >
> > Agreed. I thought it's better to do ACL and RLS checks before creating
> > the replication slot but it's not important. Rather checking them
> > after switching user would make sense since we do the same in
> > worker.c.
> >
>
> LGTM.

Thanks, pushed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Major pgbench synthetic SELECT workload regression, Ubuntu 23.04+PG15

2023-06-08 Thread Andres Freund
Hi,

On 2023-06-08 20:20:18 -0400, Gregory Smith wrote:
> On Thu, Jun 8, 2023 at 6:18 PM Andres Freund  wrote:
> 
> > Could you get a profile with call graphs? We need to know what leads to all
> > those osq_lock calls.
> > perf record --call-graph dwarf -a sleep 1
> > or such should do the trick, if run while the workload is running.
> >
> 
> I'm doing something wrong because I can't find the slow part in the perf
> data; I'll get back to you on this one.

You might need to add --no-children to the perf report invocation, otherwise
it'll show you the call graph inverted.

- Andres




Re: index prefetching

2023-06-08 Thread Peter Geoghegan
On Thu, Jun 8, 2023 at 4:38 PM Peter Geoghegan  wrote:
> This is conceptually a "mini bitmap index scan", though one that takes
> place "inside" a plain index scan, as it processes one particular leaf
> page. That's the kind of design that "plain index scan vs bitmap index
> scan as a continuum" leads me to (a little like the continuum between
> nested loop joins, block nested loop joins, and merge joins). I bet it
> would be practical to do things this way, and help a lot with some
> kinds of queries. It might even be simpler than avoiding excessive
> prefetching using an LRU cache thing.

I'll now give a simpler (though less realistic) example of a case
where "mini bitmap index scan" would be expected to help index scans
in general, and prefetching during index scans in particular.
Something very simple:

create table bitmap_parity_test(randkey int4, filler text);
create index on bitmap_parity_test (randkey);
insert into bitmap_parity_test select (random()*1000),
repeat('filler',10) from generate_series(1,250) i;

This gives me a table with 4 pages, and an index with 2 pages.

The following query selects about half of the rows from the table:

select * from bitmap_parity_test where randkey < 500;

If I force the query to use a bitmap index scan, I see that the total
number of buffers hit is exactly as expected (according to
EXPLAIN(ANALYZE,BUFFERS), that is): there are 5 buffers/pages hit. We
need to access every single heap page once, and we need to access the
only leaf page in the index once.

I'm sure that you know where I'm going with this already. I'll force
the same query to use a plain index scan, and get a very different
result. Now EXPLAIN(ANALYZE,BUFFERS) shows that there are a total of
89 buffers hit -- 88 of which must just be the same 5 heap pages,
again and again. That's just silly. It's probably not all that much
slower, but it's not helping things. And it's likely that this effect
interferes with the prefetching in your patch.

Obviously you can come up with a variant of this test case where
bitmap index scan does way fewer buffer accesses in a way that really
makes sense -- that's not in question. This is a fairly selective
index scan, since it only touches one index page -- and yet we still
see this difference.

(Anybody pedantic enough to want to dispute whether or not this index
scan counts as "selective" should run "insert into bitmap_parity_test
select i, repeat('actshually',10)  from generate_series(2000,1e5) i"
before running the "randkey < 500" query, which will make the index
much larger without changing any of the details of how the query pins
pages -- non-pedants should just skip that step.)

-- 
Peter Geoghegan




Re: git.postgresql.org seems to be down

2023-06-08 Thread Jaime Casanova
On Thu, Jun 8, 2023 at 7:22 PM Joe Conway  wrote:
>
> On 6/8/23 20:02, Jaime Casanova wrote:
> > Hi,
> >
> > Someone made https://git.postgresql.org/ depressed
> >
> > After not being able to pull, I dropped and tried to clone again:
>
>
> Try now
>

It's up again... thanks!

-- 
Jaime Casanova
Director de Servicios Profesionales
SYSTEMGUARDS - Consultores de PostgreSQL




Re: git.postgresql.org seems to be down

2023-06-08 Thread Joe Conway

On 6/8/23 20:02, Jaime Casanova wrote:

Hi,

Someone made https://git.postgresql.org/ depressed

After not being able to pull, I dropped and tried to clone again:



Try now

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Major pgbench synthetic SELECT workload regression, Ubuntu 23.04+PG15

2023-06-08 Thread Gregory Smith
On Thu, Jun 8, 2023 at 6:18 PM Andres Freund  wrote:

> Could you get a profile with call graphs? We need to know what leads to all
> those osq_lock calls.
> perf record --call-graph dwarf -a sleep 1
> or such should do the trick, if run while the workload is running.
>

I'm doing something wrong because I can't find the slow part in the perf
data; I'll get back to you on this one.


> I think it's unwise to compare builds of such different vintage. The
> compiler
> options and compiler version can have substantial effects.
>
I recommend also using -P1. Particularly when using unix sockets, the
> specifics of how client threads and server threads are scheduled plays a
> huge
> role.


Fair suggestions; those graphs come out of pgbench-tools, where I profile
all the latency, and fast results for me are ruler flat.  It's taken me
several generations of water cooling experiments to reach that point, but
even that only buys me 10 seconds before I can overload a CPU to higher
latency with tougher workloads.  Here are a few seconds of slightly updated
examples, now with matching PGDG sourced 14+15 on the 5950X and with
sched_autogroup_enabled=0 too:

$ pgbench -S -T 10 -c 32 -j 32 -M prepared -p 5434 -P 1 pgbench
pgbench (14.8 (Ubuntu 14.8-1.pgdg23.04+1))
progress: 1.0 s, 1032929.3 tps, lat 0.031 ms stddev 0.004
progress: 2.0 s, 1051239.0 tps, lat 0.030 ms stddev 0.001
progress: 3.0 s, 1047528.9 tps, lat 0.030 ms stddev 0.008...
$ pgbench -S -T 10 -c 32 -j 32 -M prepared -p 5432 -P 1 pgbench
pgbench (15.3 (Ubuntu 15.3-1.pgdg23.04+1))
progress: 1.0 s, 171816.4 tps, lat 0.184 ms stddev 0.029, 0 failed
progress: 2.0 s, 173501.0 tps, lat 0.184 ms stddev 0.024, 0 failed...

On the slow runs it will even do this: watch my 5950X accomplish 0 TPS for
a second!

progress: 38.0 s, 177376.9 tps, lat 0.180 ms stddev 0.039, 0 failed
progress: 39.0 s, 35861.5 tps, lat 0.181 ms stddev 0.032, 0 failed
progress: 40.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 41.0 s, 222.1 tps, lat 304.500 ms stddev 741.413, 0 failed
progress: 42.0 s, 101199.6 tps, lat 0.530 ms stddev 18.862, 0 failed
progress: 43.0 s, 98286.9 tps, lat 0.328 ms stddev 8.156, 0 failed

Gonna have to measure seconds/transaction if this gets any worse.


> I've seen such issues in the past, primarily due to contention internal to
> cgroups, when the memory controller is enabled.  IIRC that could be
> alleviated
> to a substantial degree with cgroup.memory=nokmem.
>

I cannot express on-list how much I dislike everything about the cgroups
code.  Let me dig up the right call graph data first and will know more
then.  The thing that keeps me from chasing kernel tuning too hard is
seeing the PG14 go perfectly every time.   This is a really weird one.  All
the suggestions much appreciated.


Re: Remove WindowClause PARTITION BY items belonging to redundant pathkeys

2023-06-08 Thread David Rowley
Thank you for having a look at this.

On Thu, 8 Jun 2023 at 21:11, Richard Guo  wrote:
>
> On Thu, Jun 8, 2023 at 7:37 AM David Rowley  wrote:
>>
>> What the attached patch does is process each WindowClause and removes
>> any items from the PARTITION BY clause that are columns or expressions
>> relating to redundant PathKeys.

> Also I'm wondering if we can do the same optimization to
> wc->orderClause.  I tested it with the query below and saw performance
> gains.

After looking again at nodeWindowAgg.c, I think it might be possible
to do a bit more work and apply this to ORDER BY items too.  Without
an ORDER BY clause, all rows in the partition are peers of each other,
and if the ORDER BY column is redundant due to belonging to a
redundant pathkey, then those rows must also be peers too since the
redundant pathkey must mean all rows have an equal value in the
redundant column.

However, there is a case where we must be much more careful.  The
comment you highlighted in create_windowagg_plan() does mention this.
It reads "we must *not* remove the ordering column for RANGE OFFSET
cases".

The following query can't work when the WindowClause has no ORDER BY column.

postgres=# select relname,sum(pg_relation_size(oid)) over (range
between 10 preceding and current row) from pg_Class;
ERROR:  RANGE with offset PRECEDING/FOLLOWING requires exactly one
ORDER BY column
LINE 1: select relname,sum(pg_relation_size(oid)) over (range between...

It might be possible to make adjustments in nodeWindowAgg.c to have
the equality checks come out as true when there is no ORDER BY.
update_frameheadpos() is one location that would need to be adjusted.
It would need further study to ensure we don't accidentally break
anything. I've not done that study, so won't be adjusting the patch
for now.

I've attached an updated patch which updates the outdated comment
which you highlighted.  I ended up moving the mention of removing
redundant columns into make_pathkeys_for_window() as it seemed a much
more relevant location to mention this optimisation.

David


remove_redundant_partition_by_items_v2.patch
Description: Binary data


Re: index prefetching

2023-06-08 Thread Andres Freund
Hi,

On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
> At pgcon unconference I presented a PoC patch adding prefetching for
> indexes, along with some benchmark results demonstrating the (pretty
> significant) benefits etc. The feedback was quite positive, so let me
> share the current patch more widely.

I'm really excited about this work.


> 1) pairing-heap in GiST / SP-GiST
> 
> For most AMs, the index state is pretty trivial - matching items from a
> single leaf page. Prefetching that is pretty trivial, even if the
> current API is a bit cumbersome.
> 
> Distance queries on GiST and SP-GiST are a problem, though, because
> those do not just read the pointers into a simple array, as the distance
> ordering requires passing stuff through a pairing-heap :-(
> 
> I don't know how to best deal with that, especially not in the simple
> API. I don't think we can "scan forward" stuff from the pairing heap, so
> the only idea I have is actually having two pairing-heaps. Or maybe
> using the pairing heap for prefetching, but stashing the prefetched
> pointers into an array and then returning stuff from it.
> 
> In the patch I simply prefetch items before we add them to the pairing
> heap, which is good enough for demonstrating the benefits.

I think it'd be perfectly fair to just not tackle distance queries for now.


> 2) prefetching from executor
> 
> Another question is whether the prefetching shouldn't actually happen
> even higher - in the executor. That's what Andres suggested during the
> unconference, and it kinda makes sense. That's where we do prefetching
> for bitmap heap scans, so why should this happen lower, right?

Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.

One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...


> 4) per-leaf prefetching
> 
> The code is restricted to prefetching items from only one leaf page. If the
> index scan needs to scan multiple (many) leaf pages, we have to process
> the first leaf page first before reading / prefetching the next one.
> 
> I think this is acceptable limitation, certainly for v0. Prefetching
> across multiple leaf pages seems way more complex (particularly for the
> cases using pairing heap), so let's leave this for the future.

Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.


> 5) index-only scans
> 
> I'm not sure what to do about index-only scans. On the one hand, the
> point of IOS is not to read stuff from the heap at all, so why prefetch
> it. OTOH if there are many allvisible=false pages, we still have to
> access that. And if that happens, this leads to the bizarre situation
> that IOS is slower than regular index scan. But to address this, we'd
> have to consider the visibility during prefetching.

That should be easy to do, right?



> Benchmark / TPC-H
> -
> 
> I ran the 22 queries on 100GB data set, with parallel query either
> disabled or enabled. And I measured timing (and speedup) for each query.
> The speedup results look like this (see the attached PDF for details):
> 
> query  serial  parallel
> 1  101% 99%
> 2  119%100%
> 3  100% 99%
> 4  101%100%
> 5  101%100%
> 6   12% 99%
> 7  100%100%
> 8   52% 67%
> 10 102%101%
> 11 100% 72%
> 12 101%100%
> 13 100%101%
> 14  13%100%
> 15 101%100%
> 16  99% 99%
> 17  95%101%
> 18 101%106%
> 19  30% 40%
> 20  99%100%
> 21 101%100%
> 22 101%107%
> 
> The percentage is (timing patched / master, so <100% means faster, >100%
> means slower).
> 
> The different queries are affected depending on the query plan - many
> queries are close to 100%, which means "no difference". For the serial
> case, there are about 4 queries that improved a lot (6, 8, 14, 19),
> while for the parallel case the benefits are somewhat less significant.
> 
> My explanation is that either (a) parallel case used a different plan
> with fewer index scans or (b) the parallel query does more concurrent
> I/O simply by using parallel workers. Or maybe both.
> 
> There are a couple regressions too, I believe those are due to doing too
> much prefetching in some cases, and some of the heuristics mentioned
> earlier should eliminate most of this, I think.

I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive gains in the already cached case, e.g. in tpch q06 and q08?
Un

Re: Support logical replication of DDLs

2023-06-08 Thread Masahiko Sawada
Hi,

On Thu, Jun 8, 2023 at 9:12 PM shveta malik  wrote:
>
>
> Please find new set of patches addressing below:
> a) Issue mentioned by Wang-san in [1],
> b) Comments from Peter given in [2]
> c) Comments from Amit given in the last 2 emails.
>
> [1]: 
> https://www.postgresql.org/message-id/OS3PR01MB62750D43D4F7F075B33BD2609E52A%40OS3PR01MB6275.jpnprd01.prod.outlook.com
> [2]: 
> https://www.postgresql.org/message-id/CAHut%2BPv9vPbUQc0fzrKmDkKOsS_bj-hup_E%2BsLHNEX%2B6F%2BSY5Q%40mail.gmail.com
>
> Thank You Vignesh for handling (a), Ajin for handling (b), Shi-san and
> Hou-san for contributing in (c).
>

I have some questions about DDL deparser:

I've tested deparsed and reformed DDL statements with the following
function and event trigger borrowed from the regression tests:

CREATE OR REPLACE FUNCTION test_ddl_deparse()
  RETURNS event_trigger LANGUAGE plpgsql AS
$$
DECLARE
r record;
deparsed_json text;
BEGIN
    FOR r IN SELECT * FROM pg_event_trigger_ddl_commands()
        -- Some TABLE commands generate sequence-related commands,
        -- also deparse them.
        WHERE command_tag in ('ALTER FOREIGN TABLE', 'ALTER TABLE',
                              'CREATE FOREIGN TABLE', 'CREATE TABLE',
                              'CREATE TABLE AS', 'DROP FOREIGN TABLE',
                              'DROP TABLE', 'ALTER SEQUENCE',
                              'CREATE SEQUENCE', 'DROP SEQUENCE')
    LOOP
        deparsed_json = pg_catalog.ddl_deparse_to_json(r.command);
        RAISE NOTICE 'deparsed json: %', deparsed_json;
        RAISE NOTICE 're-formed command: %',
            pg_catalog.ddl_deparse_expand_command(deparsed_json);
    END LOOP;
END;
$$;

CREATE EVENT TRIGGER test_ddl_deparse
ON ddl_command_end EXECUTE PROCEDURE test_ddl_deparse();

The query "CREATE TABLE TEST AS SELECT 1" is deparsed and reformed by
DDL deparse to:

CREATE  TABLE  public.test ("?column?" pg_catalog.int4 STORAGE PLAIN  )

I agree that we need to deparse CTAS in such a way for logical
replication but IIUC DDL deparse feature (e.g., ddl_deparse_to_json()
and ddl_deparse_expand_command()) is a generic feature in principle so
I'm concerned that there might be users who want to get the DDL
statement that is actually executed. If so, we might want to have a
switch to get the actual DDL statement instead.

Also, the table and column data type are schema qualified, but is
there any reason why the reformed query doesn't explicitly specify
tablespace, toast compression and access method? Since their default
values depend on GUC parameters, the table could be created in a
different tablespace on the subscriber if the subscriber set a
different value to the default_tablespace GUC parameter. IIUC the reason
why we use schema-qualified names, instead of sending the statement along
with search_path, is to prevent tables from being created in a different
schema depending on the search_path setting on the subscriber. So I
wondered why we don't do the same for other GUC-dependent properties.

---
I got a SEGV in the following case:

=# create event trigger test_trigger ON ddl_command_start execute
function publication_deparse_ddl_command_end();
CREATE EVENT TRIGGER
=# create table t ();

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




git.postgresql.org seems to be down

2023-06-08 Thread Jaime Casanova
Hi,

Someone made https://git.postgresql.org/ depressed

After not being able to pull, I dropped and tried to clone again:

"""
jcasanov@DangerBox:/opt/pgdg$ git clone
https://git.postgresql.org/git/postgresql.git
Cloning into 'postgresql'...
fatal: unable to access
'https://git.postgresql.org/git/postgresql.git/': Empty reply from
server
"""

--
Jaime Casanova
Director de Servicios Profesionales
SYSTEMGUARDS - Consultores de PostgreSQL




Re: index prefetching

2023-06-08 Thread Peter Geoghegan
On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
 wrote:
> Normal index scans are an even more interesting case but I'm not
> sure how hard it would be to get that information. It may only be
> convenient to get the blocks from the last leaf page we looked at,
> for example.
>
> So this suggests we simply started prefetching for the case where the
> information was readily available, and it'd be harder to do for index
> scans so that's it.

What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.

> Even if SAOP (probably) wasn't the reason, I think you're right it may
> be an issue for prefetching, causing regressions. It didn't occur to me
> before, because I'm not that familiar with the btree code and/or how it
> deals with SAOP (and didn't really intend to study it too deeply).

I'm pretty sure that you understand this already, but just in case:
ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
page" in many important cases. Not really -- not in the sense that
you'd hope and expect. We're senselessly processing the same index
leaf page multiple times and treating it as a different, independent
leaf page. That makes heap prefetching of the kind you're working on
utterly hopeless, since it effectively throws away lots of useful
context. Obviously that's the fault of nbtree ScalarArrayOpExpr
handling, not the fault of your patch.
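To illustrate how much of that repeated work is avoidable, here is a toy
model (hypothetical leaf-page boundaries, not nbtree code) showing that
sorted array keys which land on the same leaf page could share a single
descent:

```python
import bisect

# Toy model: leaf_high_keys are hypothetical upper-bound keys for each
# leaf page, so keys strictly below a page's high key live on that page.
# Sorted, deduplicated SAOP array keys that map to the same leaf page
# could share one _bt_first()-style descent instead of one per element.

def descents_needed(array_keys, leaf_high_keys):
    keys = sorted(set(array_keys))
    leaves = {bisect.bisect_right(leaf_high_keys, k) for k in keys}
    return len(leaves)
```

With an array like (999, 1000, 1001, 1002) spanning two leaves, four
array elements collapse to two descents instead of four.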

> So if you're planning to work on this for PG17, collaborating on it
> would be great.
>
> For now I plan to just ignore SAOP, or maybe just disabling prefetching
> for SAOP index scans if it proves to be prone to regressions. That's not
> great, but at least it won't make matters worse.

Makes sense, but I hope that it won't come to that.

IMV it's actually quite reasonable that you didn't expect to have to
think about ScalarArrayOpExpr at all -- it would make a lot of sense
if that was already true. But the fact is that it works in a way
that's pretty silly and naive right now, which will impact
prefetching. I wasn't really thinking about regressions, though. I was
actually more concerned about missing opportunities to get the most
out of prefetching. ScalarArrayOpExpr really matters here.

> I guess something like this might be a "nice" bad case:
>
> insert into btree_test select mod(i,10), md5(i::text)
>   from generate_series(1, $ROWS) s(i)
>
> select * from btree_test where a in (999, 1000, 1001, 1002)
>
> The values are likely colocated on the same heap page, the bitmap scan
> is going to do a single prefetch. With index scan we'll prefetch them
> repeatedly. I'll give it a try.

This is the sort of thing that I was thinking of. What are the
conditions under which bitmap index scan starts to make sense? Why is
the break-even point whatever it is in each case, roughly? And, is it
actually because of laws-of-physics level trade-off? Might it not be
due to implementation-level issues that are much less fundamental? In
other words, might it actually be that we're just doing something
stoopid in the case of plain index scans? Something that is just
papered-over by bitmap index scans right now?

I see that your patch has logic that avoids repeated prefetching of
the same block -- plus you have comments that wonder about going
further by adding a "small lru array" in your new index_prefetch()
function. I asked you about this during the unconference presentation.
But I think that my understanding of the situation was slightly
different to yours. That's relevant here.

I wonder if you should go further than this, by actually sorting the
items that you need to fetch as part of processing a given leaf page
(I said this at the unconference, you may recall). Why should we
*ever* pin/access the same heap page more than once per leaf page
processed per index scan? Nothing stops us from returning the tuples
to the executor in the original logical/index-wise order, despite
having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin
traffic that isn't truly necessary). We can sort the heap TIDs in
scratch memory, then do our actual prefetching + heap access, and then
restore the original order before returning anything.
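The reorder-then-restore step could be sketched like so (plain Python;
fetch() stands in for the actual heap access, and TIDs are (block, offset)
pairs):

```python
# Sketch of the "mini bitmap index scan" idea: within one leaf page,
# visit heap blocks in sorted (block, offset) order, so each heap page
# is accessed once and prefetching sees a clean batch, then hand the
# tuples back in the original index order.

def fetch_leaf_tids_in_block_order(tids, fetch):
    # Sort a permutation of positions, not the TIDs themselves, so the
    # original ordering can be restored for free.
    order = sorted(range(len(tids)), key=lambda i: tids[i])
    results = [None] * len(tids)
    for i in order:
        results[i] = fetch(tids[i])   # heap accessed in block order
    return results                    # returned in index order
```

Sorting an index permutation rather than the TID array itself keeps the
restore step trivial: results land directly in their original slots.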

This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and help a lot with some
kinds of queries. It might even be simpler than avoiding excessive
prefetching 

Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Stephan Doliov
This is an interesting message thread. I think in regards to the OP's call
to make PG multi-threaded, there should be a clear and identifiable
performance target and use cases for the target. How much performance boost
can be expected, and if so, in which data application context? Will queries
return faster for transactional use cases? analytic use cases? How much
data needs to be stored before one can observe the difference, or better
yet, a difference with a measurable reduction in cloud compute costs as a
% of the total compute bill? I think if you can demonstrate for different
test datasets what those savings amount to, you can find momentum to
pursue it. Beyond that, even with better modern tooling for multi-threaded
development, it's obviously a big lift (may well be worth it!). Some of us
cagey old cats on this list (at least me) still have some work to do to
shed the baggage that previous pain of MT dev has caused us. :-)

Cheers,
Steve

On Thu, Jun 8, 2023 at 1:26 PM Andres Freund  wrote:

> Hi,
>
> On 2023-06-09 07:34:49 +1200, Thomas Munro wrote:
> > I wasn't in Mathew Wilcox's unconference in Ottawa but I found an
> > older article on LWN:
> >
> > https://lwn.net/Articles/895217/
> >
> > For what it's worth, FreeBSD hackers have studied this topic too (and
> > it's been done in Android and no doubt other systems before):
> >
> > https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf
> >
> > I've shared that paper on this list before in the context of
> > super/huge pages and their benefits (to executable code, and to the
> > buffer pool), but a second topic in that paper is the idea of a shared
> > page table: "We find that sharing PTPs across different processes can
> > reduce execution cycles by as much as 6.9%. Moreover, the combined
> > effects of using superpages to map the main executable and sharing
> > PTPs for the small shared libraries can reduce execution cycles up to
> > 18.2%."  And that's just part of it, because those guys are more
> > interested in shared code/libraries and such so that's probably not
> > even getting to the stuff like buffer pool and DSMs that we might tend
> > to think of first.
>
> I've experimented with using huge pages for executable code on linux, and
> the benefits are quite noticeable:
>
> https://www.postgresql.org/message-id/20221104212126.qfh3yzi7luvyy5d6%40awork3.anarazel.de
>
> I'm a bit dubious that sharing the page table for executable code increases
> the benefit that much further in real workloads. I suspect the reason it was
> different for the authors of the paper is:
> different for the authors of the paper is:
>
> > A fixed number of back-to-back
> > transactions are performed on a 5GB database, and we use the
> > -C option of pgbench to toggle between reconnecting after
> > each transaction (reconnect mode) and using one persistent
> > connection per client (persistent connection mode). We use
> > the reconnect mode by default unless stated otherwise.
>
> Using -C explains why you'd see a lot of benefit from sharing page tables
> for
> executable code. But I don't think -C is a particularly interesting
> workload
> to optimize for.
>
>
> > I'm no expert in this stuff, but it seems to be that with shared page
> > table schemes you can avoid wasting huge amounts of RAM on duplicated
> > page table entries (pages * processes), and with huge/super pages you
> > can reduce the number of pages, but AFAIK you still can't escape the
> > TLB shootdown cost, which is all-or-nothing (PCID level at best).
>
> Pretty much that. While you can avoid some TLB shootdowns via PCIDs, that
> only
> avoids flushing the TLB, it doesn't help with the TLB hit rate being much
> lower due to the number of "redundant" mappings with different PCIDs.
>
>
> > The only way to avoid TLB shootdowns on context switches is to have
> *exactly
> > the same memory map*.  Or, as Robert succinctly shouted, "THREADS".
>
> +1
>
> Greetings,
>
> Andres Freund
>
>
>


Re: Introduce WAIT_EVENT_EXTENSION and WAIT_EVENT_BUFFER_PIN

2023-06-08 Thread Michael Paquier
On Thu, Jun 08, 2023 at 10:57:55AM +0900, Masahiro Ikeda wrote:
> (Excuse me for cutting in, and this is not directly related to the thread.)
> +1. I'm interested in the feature.
> 
This is just an example, and it would probably be useful for other users
too. IMO, at least, it's better to improve the current specification,
where the "Extension" wait event type has only the single "Extension"
wait event.

I hope that nobody would argue against you here.  In my opinion, we
should just introduce an API that allows extensions to retrieve wait
event numbers that are allocated by the backend under
PG_WAIT_EXTENSION, in a fashion similar to GetNamedLWLockTranche().
Say something like:
int GetExtensionWaitEvent(const char *wait_event_name);
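As a rough model of how such an allocator could behave (plain Python; the
class-prefix constant and names are illustrative, only the idea that
extension events live under a common PG_WAIT_EXTENSION class is assumed):

```python
# Sketch of a GetNamedLWLockTranche-style registry: the core hands out
# wait-event numbers under one class, and repeated requests for the same
# name return the same number, so two extensions cannot collide even if
# they pick the same event name independently.

PG_WAIT_EXTENSION = 0x07000000   # assumed class bits, illustration only

class WaitEventRegistry:
    def __init__(self):
        self._by_name = {}

    def get_extension_wait_event(self, name):
        # First caller for a name allocates the next number; later
        # callers (or re-registrations after restart) get the same one.
        if name not in self._by_name:
            self._by_name[name] = PG_WAIT_EXTENSION | (len(self._by_name) + 1)
        return self._by_name[name]
```

The point is that numbers are assigned by the backend at registration
time, never statically chosen by extension maintainers.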

I don't quite see a design where extensions could rely on their own
numbers statically assigned by the extension maintainers, as this is
most likely going to cause conflicts.  And I would guess that a lot of
external code would want to get more data pushed to pg_stat_activity,
meaning a lot of conflicts, potentially.
--
Michael




Re: Major pgbench synthetic SELECT workload regression, Ubuntu 23.04+PG15

2023-06-08 Thread Andres Freund
Hi,

On 2023-06-08 15:18:07 -0700, Andres Freund wrote:
> E.g. on my workstation (two sockets, 10 cores/20 threads each), with 32
> clients, performance changes back and forth between ~600k and ~850k. Whereas
> with 42 clients, it's steadily at 1.1M, with little variance.

FWIW, this is with linux 6.2.12, compiled by myself though.

Greetings,

Andres Freund




Re: Major pgbench synthetic SELECT workload regression, Ubuntu 23.04+PG15

2023-06-08 Thread Andres Freund
Hi,

On 2023-06-08 15:08:57 -0400, Gregory Smith wrote:
> Pushing SELECT statements at socket speeds with prepared statements is a
> synthetic benchmark that normally demos big pgbench numbers.  My benchmark
> farm moved to Ubuntu 23.04/kernel 6.2.0-20 last month, and that test is
> badly broken on the system PG15 at larger core counts, with as much as an
> 85% drop from expectations.  Since this is really just a benchmark workload
> the user impact is very narrow, probably zero really, but as the severity
> of the problem is high we should get to the bottom of what's going on.


> First round of profile data suggests the lost throughput is going here:
> Overhead  Shared Object  Symbol
>   74.34%  [kernel]   [k] osq_lock
>2.26%  [kernel]   [k] mutex_spin_on_owner

Could you get a profile with call graphs? We need to know what leads to all
those osq_lock calls.

perf record --call-graph dwarf -a sleep 1

or such should do the trick, if run while the workload is running.


> Quick test to find if you're impacted:  on the server and using sockets,
> run a 10 second SELECT test with/without preparation using 1 or 2
> clients/[core|thread] and see if preparation is the slower result.  Here's
> a PGDG PG14 on port 5434 as a baseline, next to Ubuntu 23.04's regular
> PG15, all using the PG15 pgbench on AMD 5950X:

I think it's unwise to compare builds of such different vintage. The compiler
options and compiler version can have substantial effects.


> $ pgbench -i -s 100 pgbench -p 5434
> $ pgbench -S -T 10 -c 32 -j 32 -M prepared -p 5434 pgbench
> pgbench (14.8 (Ubuntu 14.8-1.pgdg23.04+1))
> tps = 1058195.197298 (without initial connection time)

I recommend also using -P1. Particularly when using unix sockets, the
specifics of how client threads and server threads are scheduled plays a huge
role. How large a role can change significantly between runs and between
fairly minor changes to how things are executed (e.g. between major PG
versions).

E.g. on my workstation (two sockets, 10 cores/20 threads each), with 32
clients, performance changes back and forth between ~600k and ~850k. Whereas
with 42 clients, it's steadily at 1.1M, with little variance.

I also have seen very odd behaviour on larger machines when
/proc/sys/kernel/sched_autogroup_enabled is set to 1.


> There's been plenty of recent chatter on LKML about *osq_lock*, in January
> Intel reported a 20% benchmark regression on UnixBench that might be
> related.  Work is still ongoing this week:

I've seen such issues in the past, primarily due to contention internal to
cgroups, when the memory controller is enabled.  IIRC that could be alleviated
to a substantial degree with cgroup.memory=nokmem.

Greetings,

Andres Freund




Re: index prefetching

2023-06-08 Thread Tomas Vondra
On 6/8/23 20:56, Peter Geoghegan wrote:
> On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
>  wrote:
>> We already do prefetching for bitmap index scans, where the bitmap heap
>> scan prefetches future pages based on effective_io_concurrency. I'm not
>> sure why exactly was prefetching implemented only for bitmap scans, but
>> I suspect the reasoning was that it only helps when there's many
>> matching tuples, and that's what bitmap index scans are for. So it was
>> not worth the implementation effort.
> 
> I have an educated guess as to why prefetching was limited to bitmap
> index scans this whole time: it might have been due to issues with
> ScalarArrayOpExpr quals.
> 
> Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
> "natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
> were supported by both index scans and index-only scans -- not just
> bitmap scans, which could handle ScalarArrayOpExpr quals even without
> nbtree directly understanding them. The commit was in late 2011,
> shortly after the introduction of index-only scans -- which seems to
> have been the real motivation. And so it seems to me that support for
> ScalarArrayOpExpr was built with bitmap scans and index-only scans in
> mind. Plain index scan ScalarArrayOpExpr quals do work, but support
> for them seems kinda perfunctory to me (maybe you can think of a
> specific counter-example where plain index scans really benefit from
> ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
> original motivation).
>
I don't think SAOP is the reason. I did a bit of digging in the list
archives, and found thread [1], which says:

Regardless of what mechanism is used and who is responsible for
doing it someone is going to have to figure out which blocks are
specifically interesting to prefetch. Bitmap index scans happen
to be the easiest since we've already built up a list of blocks
we plan to read. Somehow that information has to be pushed to the
storage manager to be acted upon.

Normal index scans are an even more interesting case but I'm not
sure how hard it would be to get that information. It may only be
convenient to get the blocks from the last leaf page we looked at,
for example.

So this suggests we simply started prefetching for the case where the
information was readily available, and it'd be harder to do for index
scans so that's it.

There's a couple more ~2008 threads mentioning prefetching, bitmap scans
and even regular index scans (like [2]). None of them even mentions SAOP
stuff at all.

[1]
https://www.postgresql.org/message-id/871wa17vxb.fsf%40oxford.xeocode.com

[2]
https://www.postgresql.org/message-id/87wsnnz046.fsf%40oxford.xeocode.com

> ScalarArrayOpExpr for plain index scans don't really make that much
> sense right now because there is no heap prefetching in the index scan
> case, which is almost certainly going to be the major bottleneck
> there. At the same time, adding useful prefetching for
> ScalarArrayOpExpr execution more or less requires that you first
> improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
> in mind that ScalarArrayOpExpr execution (whether for bitmap index
> scans or index scans) is related to skip scan/MDAM techniques -- so
> there are tricky dependencies that need to be considered together.
> 
> Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
> descend the B-Tree for each array constant -- even though in principle
> we could avoid all that work in cases that happen to have locality. In
> other words we'll often descend the tree multiple times and land on
> exactly the same leaf page again and again, without ever noticing that
> we could have gotten away with only descending the tree once (it'd
> also be possible to start the next "descent" one level up, not at the
> root, intelligently reusing some of the work from an initial descent
> -- but you don't need anything so fancy to greatly improve matters
> here).
> 
> This lack of smarts around how many times we call _bt_first() to
> descend the index is merely a silly annoyance when it happens in
> btgetbitmap(). We do at least sort and deduplicate the array up-front
> (inside _bt_sort_array_elements()), so there will be significant
> locality of access each time we needlessly descend the tree.
> Importantly, there is no prefetching "pipeline" to mess up in the
> bitmap index scan case -- since that all happens later on. Not so for
> the superficially similar (though actually rather different) plain
> index scan case -- at least not once you add prefetching. If you're
> uselessly processing the same leaf page multiple times, then there is
> no way that heap prefetching can notice that it should be batching
> things up. The context that would allow prefetching to work well isn't
> really available right now. So the plain index scan case is kinda at a
> gratuitous disadvantage (with prefetching) relative to the bitmap
> in

Re: Fix search_path for all maintenance commands

2023-06-08 Thread Greg Stark
On Fri, 26 May 2023 at 19:22, Jeff Davis  wrote:
>
> Maintenance commands (ANALYZE, CLUSTER, REFRESH MATERIALIZED VIEW,
> REINDEX, and VACUUM) currently run as the table owner, and as a
> SECURITY_RESTRICTED_OPERATION.
>
> I propose that we also fix the search_path to "pg_catalog, pg_temp"
> when running maintenance commands, for two reasons:
>
> 1. Make the behavior of maintenance commands more consistent because
> they'd always have the same search_path.

What exactly would this impact? Offhand... expression indexes where
the functions in the expression (which would already be schema
qualified) themselves reference other objects without schema
qualification?

So this would negatively affect someone who was using such a dangerous
function definition but was careful to always use the same search_path
on it. Perhaps someone who had created an expression index on their
own table in their own schema calling their own functions in their own
schema. As long as nobody else ever calls it, that would work, but this
change would cause superuser to no longer be able to reindex it even if
superuser set the same search_path?

I guess that's pretty narrow and a reasonable thing to desupport.
Users could just mark those functions with search_path or schema
qualify the object references in them. Perhaps we should also be
picking up cases like that sooner so users realize they've created a
footgun for themselves?

-- 
greg




Re: Order changes in PG16 since ICU introduction

2023-06-08 Thread Joe Conway

On 6/8/23 17:15, Jeff Davis wrote:

On Wed, 2023-06-07 at 20:52 -0400, Joe Conway wrote:

If the provider has no such thing, throw an error.


Just to be clear, that implies that users (and buildfarm members) with
LANG=C.UTF-8 in their environment would not be able to run a plain
"initdb -D data"; they'd get an error. It's hard for me to estimate how
many users might be inconvenienced by that, but it sounds like a risk.


Well, but only if their libc provider does not have C.UTF-8, correct?

I see

Linux Mint 21.1: /usr/lib/locale/C.utf8
RHEL 8:          /usr/lib/locale/C.utf8
RHEL 9:          /usr/lib/locale/C.utf8
AL2:             /usr/lib/locale/C.utf8

However I do not see it on RHEL 7 :-(


Perhaps for this specific case, and only in initdb, we change
C.anything and POSIX.anything to the builtin provider?


Might be best, with some kind of warning perhaps?


CREATE DATABASE and CREATE COLLATION could still reject such
locales.


Seems to make sense.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Major pgbench synthetic SELECT workload regression, Ubuntu 23.04+PG15

2023-06-08 Thread Melanie Plageman
On Thu, Jun 8, 2023 at 3:09 PM Gregory Smith  wrote:
> Pushing SELECT statements at socket speeds with prepared statements is a 
> synthetic benchmark that normally demos big pgbench numbers.  My benchmark 
> farm moved to Ubuntu 23.04/kernel 6.2.0-20 last month, and that test is badly 
> broken on the system PG15 at larger core counts, with as much as an 85% drop 
> from expectations.
> Attached are full scaling graphs for all 4 combinations on this AMD 32 thread 
> 5950X, and an Intel i5-13600K with 20 threads and similar impact.  The 
> regular, unprepared sockets peak speeds took a solid hit in PG15 from this 
> issue too.  I could use some confirmation of where this happens from other 
> tester's hardware and Linux kernels.

Since it doesn't look like you included results on pre-23x Ubuntu, I
thought I would reply with my own results using your example. I also
have a 32 thread AMD 5950X but am on Ubuntu 22.10 (kernel 5.19). I did
not see the regression you mention.

HEAD
  pgbench -S -T 10 -c 32 -j 32 -M prepared -p 5432 pgbench
  tps = 837819.220854 (without initial connection time)

  pgbench -S -T 10 -c 32 -j 32 -M simple -p 5432 pgbench
  tps = 576845.930845 (without initial connection time)

REL_15_STABLE
  pgbench -S -T 10 -c 32 -j 32 -M prepared -p 5432 pgbench
  tps = 794380.991666 (without initial connection time)

  pgbench -S -T 10 -c 32 -j 32 -M simple -p 5432 pgbench
  tps = 534358.379838 (without initial connection time)

- Melanie




Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Jan Wieck

On 6/8/23 15:57, Dave Cramer wrote:


Apparently this is coming in pgbouncer: Support of prepared statements by
knizhnik · Pull Request #845 · pgbouncer/pgbouncer (github.com)



I am quite interested in that patch. Considering how pgbouncer works 
internally I am very curious.



Jan





Re: Order changes in PG16 since ICU introduction

2023-06-08 Thread Jeff Davis
On Wed, 2023-06-07 at 20:52 -0400, Joe Conway wrote:
> If the provider has no such thing, throw an error.

Just to be clear, that implies that users (and buildfarm members) with
LANG=C.UTF-8 in their environment would not be able to run a plain
"initdb -D data"; they'd get an error. It's hard for me to estimate how
many users might be inconvenienced by that, but it sounds like a risk.

Perhaps for this specific case, and only in initdb, we change
C.anything and POSIX.anything to the builtin provider? CREATE DATABASE
and CREATE COLLATION could still reject such locales.

Regards,
Jeff Davis





Re: ERROR: no relation entry for relid 6

2023-06-08 Thread Tom Lane
Richard Guo  writes:
> On Tue, May 30, 2023 at 10:28 AM Richard Guo  wrote:
>> I haven't thought through how to fix it, but I suspect that we may need
>> to do more checking before we decide to remove PHVs in
>> remove_rel_from_query.

Oh, I like this example!  It shows a place where we are now smarter
than we used to be, because v15 and earlier fail to recognize that
the join could be removed.  But we do have to clean up the query
properly afterwards.

> Hmm, maybe we can additionally check if the PHV needs to be evaluated
> above the join.  If so it cannot be removed.

Yeah, that seems to make sense, and it squares with the existing
comment saying that PHVs used above the join can't be removed.
Pushed that way.

regards, tom lane




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Andres Freund
Hi,

On 2023-06-09 07:34:49 +1200, Thomas Munro wrote:
> I wasn't in Mathew Wilcox's unconference in Ottawa but I found an
> older article on LWN:
> 
> https://lwn.net/Articles/895217/
> 
> For what it's worth, FreeBSD hackers have studied this topic too (and
> it's been done in Android and no doubt other systems before):
> 
> https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf
> 
> I've shared that paper on this list before in the context of
> super/huge pages and their benefits (to executable code, and to the
> buffer pool), but a second topic in that paper is the idea of a shared
> page table: "We find that sharing PTPs across different processes can
> reduce execution cycles by as much as 6.9%. Moreover, the combined
> effects of using superpages to map the main executable and sharing
> PTPs for the small shared libraries can reduce execution cycles up to
> 18.2%."  And that's just part of it, because those guys are more
> interested in shared code/libraries and such so that's probably not
> even getting to the stuff like buffer pool and DSMs that we might tend
> to think of first.

I've experimented with using huge pages for executable code on linux, and the
benefits are quite noticeable:
https://www.postgresql.org/message-id/20221104212126.qfh3yzi7luvyy5d6%40awork3.anarazel.de

I'm a bit dubious that sharing the page table for executable code increases the
benefit that much further in real workloads. I suspect the reason it was
different for the authors of the paper is:

> A fixed number of back-to-back
> transactions are performed on a 5GB database, and we use the
> -C option of pgbench to toggle between reconnecting after
> each transaction (reconnect mode) and using one persistent
> connection per client (persistent connection mode). We use
> the reconnect mode by default unless stated otherwise.

Using -C explains why you'd see a lot of benefit from sharing page tables for
executable code. But I don't think -C is a particularly interesting workload
to optimize for.


> I'm no expert in this stuff, but it seems to be that with shared page
> table schemes you can avoid wasting huge amounts of RAM on duplicated
> page table entries (pages * processes), and with huge/super pages you
> can reduce the number of pages, but AFAIK you still can't escape the
> TLB shootdown cost, which is all-or-nothing (PCID level at best).

Pretty much that. While you can avoid some TLB shootdowns via PCIDs, that only
avoids flushing the TLB, it doesn't help with the TLB hit rate being much
lower due to the number of "redundant" mappings with different PCIDs.


> The only way to avoid TLB shootdowns on context switches is to have *exactly
> the same memory map*.  Or, as Robert succinctly shouted, "THREADS".

+1

Greetings,

Andres Freund




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Dave Cramer
On Thu, 8 Jun 2023 at 13:08, Hannu Krosing  wrote:

> I discovered this thread from a Twitter post "PostgreSQL will finally
> be rewritten in Rust"  :)
>

By the time we got around to finishing this, there would be a better
language to write it in.

Dave


Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Dave Cramer
On Thu, 8 Jun 2023 at 15:49, Jan Wieck  wrote:

> On 6/8/23 13:31, Dave Cramer wrote:
> >
> > On Thu, 8 Jun 2023 at 11:22, Konstantin Knizhnik wrote:
> >
>
> > So it will be responsibility of client to remember text of prepared
> > query to be able to resend it when the statement doesn't exist at the
> > server?
> > IMHO very strange decision. Why not to handle it in connection
> > pooler (doesn't matter - external or embedded)?
> >
> >
> > I may be myopic but in the JDBC world and I assume others we have a
> > `PreparedStatement` object which has the text of the query.
> > The text is readily available to us.
> >
> > Also again from the JDBC point of view we use un-named statements
> > normally and then name them after 5 uses so we already have embedded
> > logic on how to deal with PreparedStatements
>
> The entire problem only surfaces when using a connection pool of one
> sort or another. Without one the session is persistent to the client.
>
> At some point I created a "functional" proof of concept for a connection
> pool that did a mapping of the client side name to a pool managed server
> side name. It kept track of which query was known by a server. It kept a
> hashtable of poolname+username+query MD5 sums. On each prepare request
> it would look up if that query is known, add a query-client reference in
> another hashtable and so on. On a Bind/Exec message it would check that
> the server has the query prepared and issue a P message if not. What was
> missing was to keep track of no longer needed queries and deallocate them.
>
> As said, it was a POC. Since it was implemented in Tcl it performed
> miserably, but I got it to the point of being able to pause & resume and
> the whole thing did work with prepared statements on the transaction
> level. So it was a fully functioning POC.
>
> What makes this design appealing to me is that it is completely
> transparent to every existing client that uses the extended query
> protocol for server side prepared statements.
>

Apparently this is coming in pgbouncer: "Support of prepared statements" by
knizhnik · Pull Request #845 · pgbouncer/pgbouncer (github.com)


Dave

>
>
> Jan
>
>


Re: Major pgbench synthetic SELECT workload regression, Ubuntu 23.04+PG15

2023-06-08 Thread Tom Lane
Gregory Smith  writes:
> Pushing SELECT statements at socket speeds with prepared statements is a
> synthetic benchmark that normally demos big pgbench numbers.  My benchmark
> farm moved to Ubuntu 23.04/kernel 6.2.0-20 last month, and that test is
> badly broken on the system PG15 at larger core counts, with as much as an
> 85% drop from expectations.
> ... I could use some confirmation of where this happens from
> other tester's hardware and Linux kernels.

FWIW, I can't reproduce any such effect with PG HEAD on RHEL 8.8
(4.18.0 kernel) or Fedora 37 (6.2.14 kernel).  Admittedly this is
with less-beefy hardware than you're using, but your graphs say
this should be obvious with even a dozen clients, and I don't
see that.  I'm getting results like

$ pgbench -S -T 10 -c 16 -j 16 -M prepared pgbench
tps = 472503.628370 (without initial connection time)
$ pgbench -S -T 10 -c 16 -j 16 pgbench
tps = 297844.301319 (without initial connection time)

which is about what I'd expect.

Could it be that the breakage is Ubuntu-specific?  Seems unlikely,
but ...

regards, tom lane




Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Jan Wieck

On 6/8/23 13:31, Dave Cramer wrote:


On Thu, 8 Jun 2023 at 11:22, Konstantin Knizhnik  wrote:





So it will be responsibility of client to remember text of prepared
query to be able to resend it when the statement doesn't exist at the server?
IMHO very strange decision. Why not to handle it in connection
pooler (doesn't matter - external or embedded)?


I may be myopic but in the JDBC world and I assume others we have a 
`PreparedStatement` object which has the text of the query.

The text is readily available to us.

Also again from the JDBC point of view we use un-named statements
normally and then name them after 5 uses so we already have embedded 
logic on how to deal with PreparedStatements


The entire problem only surfaces when using a connection pool of one 
sort or another. Without one the session is persistent to the client.


At some point I created a "functional" proof of concept for a connection 
pool that did a mapping of the client side name to a pool managed server 
side name. It kept track of which query was known by a server. It kept a 
hashtable of poolname+username+query MD5 sums. On each prepare request 
it would look up if that query is known, add a query-client reference in 
another hashtable and so on. On a Bind/Exec message it would check that 
the server has the query prepared and issue a P message if not. What was 
missing was to keep track of no longer needed queries and deallocate them.


As said, it was a POC. Since it was implemented in Tcl it performed 
miserably, but I got it to the point of being able to pause & resume and
the whole thing did work with prepared statements on the transaction 
level. So it was a fully functioning POC.


What makes this design appealing to me is that it is completely 
transparent to every existing client that uses the extended query 
protocol for server side prepared statements.
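
The bookkeeping described above can be sketched roughly as follows (in Python rather than Tcl; all class, method, and naming-scheme details here are illustrative assumptions, not taken from the actual POC or any real pooler):

```python
import hashlib

class PreparedStatementMap:
    """Pool-side bookkeeping sketch: map client-side statement names to
    pool-managed server-side names, keyed by an MD5 of
    pool name + user + query text."""

    def __init__(self, pool_name):
        self.pool_name = pool_name
        self.known_on_server = {}   # server connection -> set of md5 keys prepared
        self.client_refs = {}       # (client_id, client_stmt_name) -> (md5, query)

    def _key(self, user, query):
        return hashlib.md5(f"{self.pool_name}|{user}|{query}".encode()).hexdigest()

    def on_parse(self, client_id, client_name, user, query):
        """Client sent Parse: remember the mapping; actual server-side
        preparation is deferred until Bind/Execute picks a connection."""
        key = self._key(user, query)
        self.client_refs[(client_id, client_name)] = (key, query)
        return "ps_" + key[:8]      # pool-managed server-side name

    def needs_prepare(self, server_conn, client_id, client_name):
        """On Bind/Execute: does the chosen server connection still need a
        P (Parse) message for this statement?  Returns (name, query) to
        prepare, or None if the backend already has it."""
        key, query = self.client_refs[(client_id, client_name)]
        prepared = self.known_on_server.setdefault(server_conn, set())
        if key in prepared:
            return None
        prepared.add(key)
        return ("ps_" + key[:8], query)
```

Deallocation of no-longer-referenced statements (the part the POC left out) would hang off `client_refs` reference counting.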



Jan





Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Dmitry Dolgov
> On Mon, Jun 05, 2023 at 06:43:54PM +0300, Heikki Linnakangas wrote:
> On 05/06/2023 11:28, Tristan Partin wrote:
> > > # Exposed PIDs
> > >
> > > We expose backend process PIDs to users in a few places.
> > > pg_stat_activity.pid and pg_terminate_backend(), for example. They need
> > > to be replaced, or we can assign a fake PID to each connection when
> > > running in multi-threaded mode.
> >
> > Would it be possible to just transparently slot in the thread ID
> > instead?
>
> Perhaps. It might break applications that use the PID directly with e.g.
> 'kill ', though.

I think things are getting more interesting if some external resource
accounting like cgroups is taking place. From what I know cgroup v2 has
only few controllers that allow threaded granularity, and memory or io
controllers are not part of this list. Since Postgres is doing quite a
lot of different things, sometimes it makes sense to put different
limitations on different types of activity, e.g. to give more priority
to a certain critical internal job on the account of slowing down
backends. In the end it might be complicated or not possible to do that
for individual threads. Such cases are probably not very important from
the high level point of view, but could become an argument when deciding
what should be a process and what should be a thread.
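
To make the limitation concrete: only a handful of controllers can be enabled in a cgroup v2 threaded subtree. The set below follows the kernel's cgroup-v2 documentation (cpu, cpuset, perf_event, pids); the helper itself is just an illustrative classifier, not any real tool's API:

```python
# Controllers the cgroup v2 docs list as supporting threaded granularity;
# memory and io are domain-only, which is the limitation discussed above.
THREADED_CONTROLLERS = {"cpu", "cpuset", "perf_event", "pids"}

def thread_limitable(controllers):
    """Split a cgroup.controllers-style list into controllers that can be
    applied per-thread vs. only per-process (domain)."""
    wanted = set(controllers.split())
    return {
        "threaded": sorted(wanted & THREADED_CONTROLLERS),
        "domain_only": sorted(wanted - THREADED_CONTROLLERS),
    }
```

So a memory or io limit on "just the background workers" stops being expressible once those workers become threads of one process.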




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Thomas Munro
On Fri, Jun 9, 2023 at 4:00 AM Andres Freund  wrote:
> On 2023-06-08 12:15:58 +0200, Hannu Krosing wrote:
> > > This part was touched on in the "AMA with a Linux Kernel Hacker"
> > > Unconference session where he mentioned that he had proposed a
> > > 'mshare' syscall for this.
>
> As-is that'd just lead to sharing page table, not the TLB. I don't think you
> currently do sharing of the TLB for parts of your address space on x86
> hardware. It's possible that something like that gets added to future
> hardware, but ...

I wasn't in Mathew Wilcox's unconference in Ottawa but I found an
older article on LWN:

https://lwn.net/Articles/895217/

For what it's worth, FreeBSD hackers have studied this topic too (and
it's been done in Android and no doubt other systems before):

https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf

I've shared that paper on this list before in the context of
super/huge pages and their benefits (to executable code, and to the
buffer pool), but a second topic in that paper is the idea of a shared
page table: "We find that sharing PTPs across different processes can
reduce execution cycles by as much as 6.9%. Moreover, the combined
effects of using superpages to map the main executable and sharing
PTPs for the small shared libraries can reduce execution cycles up to
18.2%."  And that's just part of it, because those guys are more
interested in shared code/libraries and such so that's probably not
even getting to the stuff like buffer pool and DSMs that we might tend
to think of first.

I'm pretty sure PostgreSQL (along with another fork-based RDBMSs
mentioned in this thread) must be one of the worst offenders for page
table bloat, simply because we can have a lot of processes and touch a
lot of memory.

I'm no expert in this stuff, but it seems to be that with shared page
table schemes you can avoid wasting huge amounts of RAM on duplicated
page table entries (pages * processes), and with huge/super pages you
can reduce the number of pages, but AFAIK you still can't escape the
TLB shootdown cost, which is all-or-nothing (PCID level at best).  The
only way to avoid TLB shootdowns on context switches is to have
*exactly the same memory map*.  Or, as Robert succinctly shouted,
"THREADS".




Re: Postgres v15 windows bincheck regression test failures

2023-06-08 Thread Andrew Dunstan


On 2023-06-08 Th 13:41, Russell Foster wrote:

Hi All:

I upgraded to postgres v15, and I am getting intermittent failures for
some of the bin regression tests when building on Windows 10. Example:

perl vcregress.pl bincheck

Installation complete.
t/001_initdb.pl .. ok
All tests successful.
Files=1, Tests=25, 12 wallclock secs ( 0.03 usr +  0.01 sys =  0.05 CPU)
Result: PASS
t/001_basic.pl ... ok
t/002_nonesuch.pl  1/?
#   Failed test 'checking a non-existent database stderr /(?^:FATAL:
database "qqq" does not exist)/'
#   at t/002_nonesuch.pl line 25.
#   'pg_amcheck: error: connection to server at
"127.0.0.1", port 49393 failed: server closed the connection
unexpectedly
#   This probably means the server terminated abnormally
#   before or while processing the request.
# '
# doesn't match '(?^:FATAL:  database "qqq" does not exist)'
t/002_nonesuch.pl  97/? # Looks like you failed 1 test of 100.
t/002_nonesuch.pl  Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/100 subtests
t/003_check.pl ... ok
t/004_verify_heapam.pl ... ok
t/005_opclass_damage.pl .. ok

Test Summary Report
---
t/002_nonesuch.pl  (Wstat: 256 Tests: 100 Failed: 1)
   Failed test:  3
   Non-zero exit status: 1
Files=5, Tests=196, 86 wallclock secs ( 0.11 usr +  0.08 sys =  0.19 CPU)
Result: FAIL
...

I see a similar failure on the build farm at:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-06-03%2020%3A03%3A07

I have also received the same error in the pg_dump test as the build
server above. Are these errors expected? Are they due to the fact that
windows tests use SSPI? It seems to work correctly if I recreate all
of the steps with an HBA that does not use SSPI.



In general you're better off using something like this


set PG_TEST_USE_UNIX_SOCKETS=1
set PG_REGRESS_SOCK_DIR=%LOCALAPPDATA%\Local\temp


That avoids several sorts of issues.


cheers

andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Jose Luis Tallon

On 8/6/23 15:56, Robert Haas wrote:

Yeah, I've had similar thoughts. I'm not exactly sure what the
advantages of such a refactoring might be, but the current structure
feels pretty limiting. It works OK because we don't do anything in the
postmaster other than fork a new backend, but I'm not sure if that's
the best strategy. It means, for example, that if there's a ton of new
connection requests, we're spawning a ton of new processes, which
means that you can put a lot of load on a PostgreSQL instance even if
you can't authenticate. Maybe we'd be better off with a pool of
processes accepting connections; if authentication fails, that
connection goes back into the pool and tries again.


This. It's limited by connection I/O, hence a perfect use for
threads (minimize per-connection overhead).


IMV, "session state" would be best stored/managed here. Would need a way 
to convey it efficiently, though.



If authentication
succeeds, either that process transitions to being a regular backend,
leaving the authentication pool, or perhaps hands off the connection
to a "real backend" at that point and loops around to accept() the
next request.


Nicely done by passing the FD around

But at this point, we'd just get a nice reimplementation of a threaded 
connection pool inside Postgres :\
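
For reference, the hand-off itself is routine SCM_RIGHTS descriptor passing. A minimal sketch using Python's `socket.send_fds`/`recv_fds` (available since 3.9) as a stand-in for the C cmsg machinery, with both "processes" simulated in one purely to show the mechanism:

```python
import os, socket

def demo_fd_handoff():
    """One side accepts a connection (simulated by a pipe) and ships its
    file descriptor to a worker over a Unix socket; the worker then reads
    from the very same descriptor."""
    parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()                       # stands in for an accepted client socket
    os.write(w, b"hello")

    socket.send_fds(parent, [b"x"], [r])   # "postmaster" side: pass the fd
    _msg, fds, _flags, _addr = socket.recv_fds(child, 1, 1)  # "worker" side
    data = os.read(fds[0], 5)              # worker reads from the handed-off fd

    for fd in (r, w, fds[0]):
        os.close(fd)
    parent.close(); child.close()
    return data
```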



Whether that's a good ideal in detail or not, the point remains that
having the postmaster handle this task is quite limiting. It forces us
to hand off the connection to a new process at the earliest possible
stage, so that the postmaster remains free to handle other duties.
Giving the responsibility to another process would let us make
decisions about where to perform the hand-off based on real
architectural thought rather than being forced to do a certain way
because nothing else will work.


At least "tcop" surely feels like belonging in a separate process 


    J.L.





Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Thomas Munro
On Fri, Jun 9, 2023 at 5:02 AM Ilya Anfimov  wrote:
> Wouldn't all the memory operations require nearly the same
> shared memory allocators if someone switches to a threaded
> implementation?

It's true that we'd need concurrency-aware MemoryContext
implementations (details can be debated), but we wouldn't need that
address translation layer, which adds a measurable cost at every
access.




Re: index prefetching

2023-06-08 Thread Peter Geoghegan
On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
 wrote:
> We already do prefetching for bitmap index scans, where the bitmap heap
> scan prefetches future pages based on effective_io_concurrency. I'm not
> sure why exactly was prefetching implemented only for bitmap scans, but
> I suspect the reasoning was that it only helps when there's many
> matching tuples, and that's what bitmap index scans are for. So it was
> not worth the implementation effort.

I have an educated guess as to why prefetching was limited to bitmap
index scans this whole time: it might have been due to issues with
ScalarArrayOpExpr quals.

Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
"natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
were supported by both index scans and index-only scans -- not just
bitmap scans, which could handle ScalarArrayOpExpr quals even without
nbtree directly understanding them. The commit was in late 2011,
shortly after the introduction of index-only scans -- which seems to
have been the real motivation. And so it seems to me that support for
ScalarArrayOpExpr was built with bitmap scans and index-only scans in
mind. Plain index scan ScalarArrayOpExpr quals do work, but support
for them seems kinda perfunctory to me (maybe you can think of a
specific counter-example where plain index scans really benefit from
ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
original motivation).

ScalarArrayOpExpr for plain index scans don't really make that much
sense right now because there is no heap prefetching in the index scan
case, which is almost certainly going to be the major bottleneck
there. At the same time, adding useful prefetching for
ScalarArrayOpExpr execution more or less requires that you first
improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
in mind that ScalarArrayOpExpr execution (whether for bitmap index
scans or index scans) is related to skip scan/MDAM techniques -- so
there are tricky dependencies that need to be considered together.

Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
descend the B-Tree for each array constant -- even though in principle
we could avoid all that work in cases that happen to have locality. In
other words we'll often descend the tree multiple times and land on
exactly the same leaf page again and again, without ever noticing that
we could have gotten away with only descending the tree once (it'd
also be possible to start the next "descent" one level up, not at the
root, intelligently reusing some of the work from an initial descent
-- but you don't need anything so fancy to greatly improve matters
here).
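
The effect is easy to quantify with a toy model: count one `_bt_first()`-style descent per array constant versus one per distinct leaf page the constants actually land on. This is a sketch with made-up leaf-page boundaries, not nbtree code:

```python
import bisect

def count_descents(constants, leaf_boundaries):
    """Compare descents-from-the-root: one per array constant (current
    behaviour) vs. one per distinct leaf page (what locality would allow).
    leaf_boundaries[i] is the highest key on leaf page i."""
    constants = sorted(set(constants))     # nbtree sorts/dedups the array too
    naive = len(constants)
    pages = {bisect.bisect_left(leaf_boundaries, c) for c in constants}
    return naive, len(pages)
```

With constants clustered on a couple of leaf pages, the naive count can exceed the necessary one by an arbitrary factor — and each redundant descent also resets any heap-prefetch context.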

This lack of smarts around how many times we call _bt_first() to
descend the index is merely a silly annoyance when it happens in
btgetbitmap(). We do at least sort and deduplicate the array up-front
(inside _bt_sort_array_elements()), so there will be significant
locality of access each time we needlessly descend the tree.
Importantly, there is no prefetching "pipeline" to mess up in the
bitmap index scan case -- since that all happens later on. Not so for
the superficially similar (though actually rather different) plain
index scan case -- at least not once you add prefetching. If you're
uselessly processing the same leaf page multiple times, then there is
no way that heap prefetching can notice that it should be batching
things up. The context that would allow prefetching to work well isn't
really available right now. So the plain index scan case is kinda at a
gratuitous disadvantage (with prefetching) relative to the bitmap
index scan case.

Queries with (say) quals with many constants appearing in an "IN()"
are both common and particularly likely to benefit from prefetching.
I'm not suggesting that you need to address this to get to a
committable patch. But you should definitely think about it now. I'm
strongly considering working on this problem for 17 anyway, so we may
end up collaborating on these aspects of prefetching. Smarter
ScalarArrayOpExpr execution for index scans is likely to be quite
compelling if it enables heap prefetching.

> But there's three shortcomings in logic:
>
> 1) It's not clear the thresholds for prefetching being beneficial and
> switching to bitmap index scans are the same value. And as I'll
> demonstrate later, the prefetching threshold is indeed much lower
> (perhaps a couple dozen matching tuples) on large tables.

As I mentioned during the pgCon unconference session, I really like
your framing of the problem; it makes a lot of sense to directly
compare an index scan's execution against a very similar bitmap index
scan execution -- there is an imaginary continuum between index scan
and bitmap index scan. If the details of when and how we scan the
index are rather similar in each case, then there is really no reason
why the performance shouldn't be fairly similar. I suspect that it
will be useful to ask the same question for various

Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Andres Freund
Hi,

On 2023-06-08 11:56:13 -0400, Robert Haas wrote:
> On Thu, Jun 8, 2023 at 11:02 AM Hannu Krosing  wrote:
> > No, I meant that this needs to be fixed at OS level, by being able to
> > use the same mapping.
> >
> > We should not shy away from asking the OS people for adding the useful
> > features still missing.
> >
> > It was mentioned in the Unconference Kernel Hacker AMA talk and said
> > kernel hacker works for Oracle, and they also seemed to be needing
> > this :)
> 
> Fair enough, but we aspire to work on a bunch of different operating
> systems. To make use of an OS facility, we need something that works
> on at least Linux, Windows, macOS, and a few different BSD flavors.
> It's not as if when the PostgreSQL project asks for a new operating
> system facility everyone springs into action to provide it
> immediately. And even if they did, and even if they all released an
> implementation of whatever we requested next year, it would still be
> at least five, more realistically ten, years before systems with those
> facilities were ubiquitous.

I'm less concerned about this aspect - most won't have upgraded to a version
of postgres that benefit from threaded postgres in a similar timeframe. And if
the benefits are large enough, people will move.  But:


> And unless we have truly obscene amounts of clout in the OS community, it's
> likely that all of those different operating systems would implement
> different things to meet the stated need, and then we'd have to have a
> complex bunch of platform-dependent code in order to keep working on all of
> those systems.

And even more likely, they just won't do anything, because it's a model that
large parts of the industry have decided isn't going anywhere. It'd be one
thing if we had 5 kernel devs that we could deploy to work on this, but we
don't. So we have to convince kernel devs employed by others that somehow this
is an urgent enough thing that they should work on it. The likely, imo
justified, answer is just going to be: Fix your architecture, then we can
talk.

Greetings,

Andres Freund




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Andres Freund
Hi,

On 2023-06-08 17:55:57 +0200, Matthias van de Meent wrote:
> While I agree that "sharing page tables across processes" is useful,
> it looks like it'd be much more effort to correctly implement for e.g.
> DSM than implementing threading.
> Konstantin's diff is "only" 20.1k lines [0] added and/or modified,
> which is a lot, but it's manageable (13k+ of which are from files that
> were auto-generated and then committed, likely accidentally).

Honestly, I don't think this patch is in a good enough state to allow a
realistic estimation of the overall work. Making global variables TLS is the
*easy* part.  Redesigning postmaster, definining how to deal with extension
libraries, extension compatibility, developing tools to make developing a
threaded postgres feasible, dealing with freeing session lifetime memory
allocations that previously were freed via process exit, making the change
realistically reviewable, portability are all much harder.

Greetings,

Andres Freund




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Andres Freund
Hi,

On 2023-06-08 17:02:08 +0200, Hannu Krosing wrote:
> On Thu, Jun 8, 2023 at 4:56 PM Robert Haas  wrote:
> >
> > On Thu, Jun 8, 2023 at 8:44 AM Hannu Krosing  wrote:
> > > > That sounds like a bad idea, dynamic shared memory is more expensive
> > > > to maintain than our static shared memory systems, not in the least
> > > > because DSM is not guaranteed to share the same addresses in each
> > > > process' address space.
> > >
> > > Then this too needs to be fixed
> >
> > Honestly, I'm struggling to respond to this non-sarcastically. I mean,
> > I was the one who implemented DSM. Do you think it works the way that
> > it works because I considered doing something smart and decided to do
> > something dumb instead?
>
> No, I meant that this needs to be fixed at OS level, by being able to
> use the same mapping.
>
> We should not shy away from asking the OS people for adding the useful
> features still missing.

There's a large part of this that is about hardware, not software. And
honestly, for most of the problems the answer is to just use threads. Adding
complexity to operating systems to make odd architectures like postgres'
better is a pretty dubious proposition.

I don't think we have even remotely enough influence on CPU design to make
e.g. *partial* TLB sharing across processes a thing.


> It was mentioned in the Unconference Kernel Hacker AMA talk and said
> kernel hacker works for Oracle, and they also seemed to be needing
> this :)

The proposals around that don't really help us all that much. Sharing the page
table will be a bit more efficient, but it won't really change anything
dramatically.  From what I understand they are primarily interested in
changing properties of a memory mapping across multiple processes, e.g. making
some memory executable and have that reflected in all processes. I don't think
this will help us much.

Greetings,

Andres Freund




Re: explain analyze rows=%.0f

2023-06-08 Thread Ibrar Ahmed
On Mon, Mar 20, 2023 at 7:56 PM Gregory Stark (as CFM)  wrote:

> On Wed, 4 Jan 2023 at 10:05, Ibrar Ahmed  wrote:
> >
> > Thanks, I have modified everything as suggested, except one point
> >
> > > Don't use variable format strings. They're hard to read and they
> > > probably defeat compile-time checks that the arguments match the
> > > format string. You could use a "*" field width instead, ie
> ...
> > > * Another thought is that the non-text formats tend to prize output
> > > consistency over readability, so maybe we should just always use 2
> > > fractional digits there, rather than trying to minimize visible
> changes.
> >
> > In that, we need to adjust a lot of test case outputs.
>
> > > * We need some doc adjustments, surely, to explain what the heck this
> > > means.
>
> That sounds like three points :) But they seem like pretty good
> arguments to me and straightforward if a little tedious to adjust.
>
> This patch was marked Returned with Feedback and then later Waiting on
> Author. And it hasn't had any updates since January. What is the state
> on this feedback? If it's already done we can set the patch to Ready
> for Committer and if not do you disagree with the proposed changes?
>
If there is a consensus to modify the test cases' output, I am willing to
make the necessary changes and adjust the patch accordingly. However,
if there is a preference to keep the output of certain test cases unchanged,
I can rebase and modify the patch accordingly to accommodate those
preferences.
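
For what it's worth, the "*" field-width suggestion quoted earlier looks like this (shown with Python's printf-style formatting, which mirrors C's `%.*f`; the function name is made up for illustration):

```python
def fmt_rows(value, hide_fraction):
    """Instead of building the format string dynamically, pass the
    precision as a '*' argument so the format stays a literal (and, in C,
    remains checkable at compile time)."""
    digits = 0 if hide_fraction else 2
    return "rows=%.*f" % (digits, value)
```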




> It's actually the kind of code cleanup changes I'm reluctant to bump a
> patch for. It's not like a committer can't make these kinds of changes
> when committing. But there are so many patches they're likely to just
> focus on a different patch when there are adjustments like this
> pending.
>
> --
> Gregory Stark
> As Commitfest Manager
>


-- 
Ibrar Ahmed


Postgres v15 windows bincheck regression test failures

2023-06-08 Thread Russell Foster
Hi All:

I upgraded to postgres v15, and I am getting intermittent failures for
some of the bin regression tests when building on Windows 10. Example:

perl vcregress.pl bincheck

Installation complete.
t/001_initdb.pl .. ok
All tests successful.
Files=1, Tests=25, 12 wallclock secs ( 0.03 usr +  0.01 sys =  0.05 CPU)
Result: PASS
t/001_basic.pl ... ok
t/002_nonesuch.pl  1/?
#   Failed test 'checking a non-existent database stderr /(?^:FATAL:
database "qqq" does not exist)/'
#   at t/002_nonesuch.pl line 25.
#   'pg_amcheck: error: connection to server at
"127.0.0.1", port 49393 failed: server closed the connection
unexpectedly
#   This probably means the server terminated abnormally
#   before or while processing the request.
# '
# doesn't match '(?^:FATAL:  database "qqq" does not exist)'
t/002_nonesuch.pl  97/? # Looks like you failed 1 test of 100.
t/002_nonesuch.pl  Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/100 subtests
t/003_check.pl ... ok
t/004_verify_heapam.pl ... ok
t/005_opclass_damage.pl .. ok

Test Summary Report
---
t/002_nonesuch.pl  (Wstat: 256 Tests: 100 Failed: 1)
  Failed test:  3
  Non-zero exit status: 1
Files=5, Tests=196, 86 wallclock secs ( 0.11 usr +  0.08 sys =  0.19 CPU)
Result: FAIL
...

I see a similar failure on the build farm at:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2023-06-03%2020%3A03%3A07

I have also received the same error in the pg_dump test as the build
server above. Are these errors expected? Are they due to the fact that
windows tests use SSPI? It seems to work correctly if I recreate all
of the steps with an HBA that does not use SSPI.

thanks,
Russell




Re: Use of additional index columns in rows filtering

2023-06-08 Thread Tomas Vondra
Hi,

I took a stab at this and implemented the trick with the VM - during
index scan, we also extract the filters that only need the indexed
attributes (just like in IOS). And then, during the execution we:

  1) scan the index using the scan keys (as before)

  2) if the heap page is all-visible, we check the new filters that can
 be evaluated on the index tuple

  3) fetch the heap tuple and evaluate the filters
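
In pseudo-executor form, the three steps look roughly like this (illustrative data structures, not the real nbtree/executor API):

```python
def index_scan_with_index_filters(entries, all_visible, index_filter, heap_filter):
    """Each entry is (index_tuple, heap_page, heap_tuple); `all_visible`
    models the page's visibility-map bit.  Returns the matching heap
    tuples plus how many heap fetches were actually needed."""
    out, fetched = [], 0
    for itup, page, htup in entries:                    # 1) index scan output
        if all_visible(page) and not index_filter(itup):
            continue                                    # 2) filtered via VM + index tuple
        fetched += 1                                    # 3) heap fetch + filter recheck
        if heap_filter(htup):
            out.append(htup)
    return out, fetched
```

The saving comes entirely from step 2: non-matching rows on all-visible pages never cost a heap access.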

This is pretty much exactly the same thing we do for IOS, so I don't see
why this would be incorrect while IOS is correct.

This also adds "Index Filter" to explain output, to show which filters
are executed on the index tuple (at the moment the filters are a subset
of "Filter"), so if the index tuple matches we'll execute them again on
the heap tuple. I guess that could be fixed by having two "filter"
lists, depending on whether we were able to evaluate the index filters.

Most of the patch is pretty mechanical - particularly the planning part
is about identifying filters that can be evaluated on the index tuple,
and that code was mostly shamelessly copied from index-only scan.

The matching of filters to index is done in check_index_filter(), and
it's simpler than match_clause_to_indexcol() as it does not need to
consider operators etc. (I think). But maybe it should be careful about
other things, not sure.

The actual magic happens in IndexNext (nodeIndexscan.c). As mentioned
earlier, the idea is to check VM and evaluate the filters on the index
tuple if possible, similar to index-only scans. Except that we then have
to fetch the heap tuple. Unfortunately, this means the code can't use
index_getnext_slot() anymore. Perhaps we should invent a new variant
that'd allow evaluating the index filters in between.


With the patch applied, the query plan changes from:

 QUERY PLAN
---
 Limit  (cost=0.42..10929.89 rows=1 width=12)
(actual time=94.649..94.653 rows=0 loops=1)
   Buffers: shared hit=197575 read=661
   ->  Index Scan using t_a_include_b on t
  (cost=0.42..10929.89 rows=1 width=12)
  (actual time=94.646..94.647 rows=0 loops=1)
 Index Cond: (a > 100)
 Filter: (b = 4)
 Rows Removed by Filter: 197780
 Buffers: shared hit=197575 read=661
 Planning Time: 0.091 ms
 Execution Time: 94.674 ms
(9 rows)

to

 QUERY PLAN
---
 Limit  (cost=0.42..3662.15 rows=1 width=12)
(actual time=13.663..13.667 rows=0 loops=1)
   Buffers: shared hit=544
   ->  Index Scan using t_a_include_b on t
   (cost=0.42..3662.15 rows=1 width=12)
   (actual time=13.659..13.660 rows=0 loops=1)
 Index Cond: (a > 100)
 Index Filter: (b = 4)
 Rows Removed by Index Recheck: 197780
 Filter: (b = 4)
 Buffers: shared hit=544
 Planning Time: 0.105 ms
 Execution Time: 13.690 ms
(10 rows)

which is much closer to the "best" case:

 QUERY PLAN
---
 Limit  (cost=0.42..4155.90 rows=1 width=12)
(actual time=10.152..10.156 rows=0 loops=1)
   Buffers: shared read=543
   ->  Index Scan using t_a_b_idx on t
   (cost=0.42..4155.90 rows=1 width=12)
   (actual time=10.148..10.150 rows=0 loops=1)
 Index Cond: ((a > 100) AND (b = 4))
 Buffers: shared read=543
 Planning Time: 0.089 ms
 Execution Time: 10.176 ms
(7 rows)


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From b372bb44b18038fa0c57ea6289a1e629fa029241 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 9 Apr 2023 02:08:45 +0200
Subject: [PATCH] evaluate filters on the index tuple (when possible)

Discussion: https://www.postgresql.org/message-id/flat/N1xaIrU29uk5YxLyW55MGk5fz9s6V2FNtj54JRaVlFbPixD5z8sJ07Ite5CvbWwik8ZvDG07oSTN-usENLVMq2UAcizVTEd5b-o16ZGDIIU%3D%40yamlcoder.me
---
 src/backend/commands/explain.c  |   2 +
 src/backend/executor/nodeIndexscan.c| 190 +++-
 src/backend/optimizer/path/costsize.c   |  23 +++
 src/backend/optimizer/path/indxpath.c   | 187 ---
 src/backend/optimizer/plan/createplan.c | 139 +
 src/backend/optimizer/plan/planner.c|   2 +-
 src/backend/optimizer/plan/setrefs.c|   6 +
 src/backend/optimizer/util/pathnode.c   |   2 +
 src/backend/utils/misc/guc_tables.c |  10 ++
 src/include/nodes/execnodes.h   |   6 +
 src/include/nodes/pathnodes.h   |   1 +
 src/include/nodes/plannodes.h   |   2 +
 src/include/optimizer/cost.h|   1 +
 src/include/optimizer/pathnode.h|   1 +
 14 files changed, 547 ins

Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Dave Cramer
On Thu, 8 Jun 2023 at 11:22, Konstantin Knizhnik  wrote:

>
>
> On 08.06.2023 6:18 PM, Dave Cramer wrote:
>
>
>
> On Thu, 8 Jun 2023 at 11:15, Jan Wieck  wrote:
>
>> On 6/8/23 10:56, Dave Cramer wrote:
>> >
>> >
>> >
>> >
>> > On Thu, 8 Jun 2023 at 10:31, Jan Wieck  wrote:
>> >
>> > On 6/8/23 09:53, Jan Wieck wrote:
>> >  > On 6/8/23 09:21, Dave Cramer wrote:
>> >  > The server doesn't know about all the clients of the pooler, does
>> > it? It
>> >  > has no way of telling if/when a client disconnects from the
>> pooler.
>> >
>> > Another problem that complicates doing it in the server is that the
>> > information required to (re-)prepare a statement in a backend that
>> > currently doesn't have it needs to be kept in shared memory. This
>> > includes the query string itself. Doing that without shared memory
>> in a
>> > pooler that is multi-threaded or based on async-IO is much simpler
>> and
>> > allows for easy ballooning.
>> >
>> >
>> > I don't expect the server to re-prepare the statement. If the server
>> > responds with "statement doesn't exist" the client would send a prepare.
>>
>> Are you proposing a new libpq protocol version?
>>
>
> I believe we would need to add this to the protocol, yes.
>
>
>
> So it will be the responsibility of the client to remember the text of a
> prepared query to be able to resend it when the statement doesn't exist
> at the server?
> IMHO a very strange decision. Why not handle it in the connection pooler
> (doesn't matter - external or embedded)?
>

I may be myopic but in the JDBC world and I assume others we have a
`PreparedStatement` object which has the text of the query.
The text is readily available to us.

Also, again from the JDBC point of view, we use un-named statements
normally and then name them after 5 uses, so we already have embedded
logic on how to deal with PreparedStatements.

Dave


Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Hannu Krosing
I discovered this thread from a Twitter post "PostgreSQL will finally
be rewritten in Rust"  :)

On Mon, Jun 5, 2023 at 5:18 PM Tom Lane  wrote:
>
> Heikki Linnakangas  writes:
> > I spoke with some folks at PGCon about making PostgreSQL multi-threaded,
> > so that the whole server runs in a single process, with multiple
> > threads. It has been discussed many times in the past, last thread on
> > pgsql-hackers was back in 2017 when Konstantin made some experiments [0].
>
> > I feel that there is now pretty strong consensus that it would be a good
> > thing, more so than before. Lots of work to get there, and lots of
> > details to be hashed out, but no objections to the idea at a high level.
>
> > The purpose of this email is to make that silent consensus explicit. If
> > you have objections to switching from the current multi-process
> > architecture to a single-process, multi-threaded architecture, please
> > speak up.
>
> For the record, I think this will be a disaster.  There is far too much
> code that will get broken, largely silently, and much of it is not
> under our control.
>
> regards, tom lane
>
>




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Ilya Anfimov
On Wed, Jun 07, 2023 at 10:26:07AM +1200, Thomas Munro wrote:
> On Tue, Jun 6, 2023 at 6:52 AM Andrew Dunstan wrote:
> > If we were starting out today we would probably choose a threaded 
> > implementation. But moving to threaded now seems to me like a 
> > multi-year-multi-person project with the prospect of years to come chasing 
> > bugs and the prospect of fairly modest advantages. The risk to reward 
> > doesn't look great.
> >
> > That's my initial reaction. I could be convinced otherwise.
> 
> Here is one thing I often think about when contemplating threads.
> Take a look at dsa.c.  It calls itself a shared memory allocator, but
> really it has two jobs, the second being to provide software emulation
> of virtual memory.  That's behind dshash.c and now the stats system,
> and various parts of the parallel executor code.  It's slow and
> complicated, and far from the state of the art.  I wrote that code
> (building on allocator code from Robert) with the expectation that it
> was a transitional solution to unblock a bunch of projects.  I always
> expected that we'd eventually be deleting it.  When I explain that
> subsystem to people who are not steeped in the lore of PostgreSQL, it
> sounds completely absurd.  I mean, ... it is, right?  My point is

 Wouldn't all the memory operations require nearly the same
shared-memory allocators if someone switched to a threaded
implementation?

> that we're doing pretty unreasonable and inefficient contortions to
> develop new features -- we're not just happily chugging along without
> threads at no cost.
> 




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Andres Freund
Hi,

On 2023-06-08 16:47:48 +0300, Konstantin Knizhnik wrote:
> Actually TLS is not more expensive than accessing struct fields (at least
> on the x86 platform); consider the following program:

It really depends on the OS as well, not just the architecture. And even on
x86-64 Linux, the fact that you're using the segment
offset in the address calculation means you can't use the more complicated
addressing modes for other purposes. And plenty of instructions, e.g. most (?) SSE
instructions, won't be able to use that kind of addressing directly.

Even just compiling your example, you can see that with gcc -O2 you get
considerably faster code with the non-TLS version.

As a fairly extreme example, here's the mingw -O3 compiled code:

use_struct:
        movq    xmm1, QWORD PTR .LC0[rip]
        movq    xmm0, QWORD PTR [rcx]
        add     DWORD PTR 8[rcx], 1
        paddd   xmm0, xmm1
        movq    QWORD PTR [rcx], xmm0
        ret
use_tls:
        sub     rsp, 40
        lea     rcx, __emutls_v.a[rip]
        call    __emutls_get_address
        lea     rcx, __emutls_v.b[rip]
        add     DWORD PTR [rax], 1
        call    __emutls_get_address
        lea     rcx, __emutls_v.c[rip]
        add     DWORD PTR [rax], 1
        call    __emutls_get_address
        add     DWORD PTR [rax], 1
        add     rsp, 40
        ret

Greetings,

Andres Freund




Re: Making Vars outer-join aware

2023-06-08 Thread Tom Lane
[ back from PGCon ... ]

"Anton A. Melnikov"  writes:
> On 04.05.2023 15:22, Tom Lane wrote:
>> Under what circumstances would you be trying to inject INDEX_VAR
>> into a nullingrel set?  Only outer-join relids should ever appear there.

> The thing is that i don't try to push INDEX_VAR into a nullingrel set at all,
> i just try to replace the existing rt_index that equals to INDEX_VAR in Var 
> nodes with
> the defined positive indexes by using ChangeVarNodes_walker() function call.

Hmm.  That implies that you're changing plan data structures around after
setrefs.c, which doesn't seem like a great design to me --- IMO that ought
to happen in PlanCustomPath, which will still see the original varnos.
However, it's probably not worth breaking existing code for this, so
now I agree that ChangeVarNodes ought to (continue to) allow negative
rt_index.

> Therefore it also seems better and more logical to me in the case of an index 
> that
> cannot possibly be a member of the Bitmapset, immediately return false.
> Here is a patch like that.

I do not like the blast radius of this patch.  Yes, I know about that
comment in bms_is_member --- I wrote it, if memory serves.  But it's
stood like that for more than two decades, and I believe it's caught
its share of mistakes.  This issue doesn't seem like a sufficient
reason to change a globally-visible behavior.

I think the right thing here is not either of your patches, but
to tweak adjust_relid_set() to not fail on negative oldrelid.
I'll go make it so.

regards, tom lane




Re: Use COPY for populating all pgbench tables

2023-06-08 Thread Tristan Partin
On Thu Jun 8, 2023 at 12:33 AM CDT, David Rowley wrote:
> On Thu, 8 Jun 2023 at 07:16, Tristan Partin  wrote:
> >
> > master:
> >
> > 5000 of 5000 tuples (100%) done (elapsed 260.93 s, remaining 0.00 
> > s))
> > vacuuming...
> > creating primary keys...
> > done in 1414.26 s (drop tables 0.20 s, create tables 0.82 s, client-side 
> > generate 1280.43 s, vacuum 2.55 s, primary keys 130.25 s).
> >
> > patchset:
> >
> > 5000 of 5000 tuples (100%) of pgbench_accounts done (elapsed 243.82 
> > s, remaining 0.00 s))
> > vacuuming...
> > creating primary keys...
> > done in 375.66 s (drop tables 0.14 s, create tables 0.73 s, client-side 
> > generate 246.27 s, vacuum 2.77 s, primary keys 125.75 s).
>
> I've also previously found pgbench -i to be slow.  It was a while ago,
> and IIRC, it was due to the printfPQExpBuffer() being a bottleneck
> inside pgbench.
>
> On seeing your email, it makes me wonder if PG16's hex integer
> literals might help here.  These should be much faster to generate in
> pgbench and also parse on the postgres side.

Do you have a link to some docs? I have not heard of the feature.
Definitely feels like a worthy cause.

> I wrote a quick and dirty patch to try that and I'm not really getting
> the same performance increases as I'd have expected. I also tested
> with your patch too and it does not look that impressive either when
> running pgbench on the same machine as postgres.

I didn't expect my patch to increase performance in all workloads. I was
mainly aiming to fix high-latency connections. Based on your results
that looks like a 4% reduction in performance of client-side data
generation. I had thought maybe it is worth having a flag to keep the
old way too, but I am not sure a 4% hit is really that big of a deal.

> pgbench copy speedup
>
> ** master
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 1 of 1 tuples (100%) done (elapsed 74.15 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 95.71 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 74.45 s, vacuum 0.12 s, primary keys 21.13 s).
>
> ** David's Patched
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 1 of 1 tuples (100%) done (elapsed 69.64 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 90.22 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 69.91 s, vacuum 0.12 s, primary keys 20.18 s).
>
> ** Tristan's patch
> drowley@amd3990x:~$ pgbench -i -s 1000 postgres
> 1 of 1 tuples (100%) of pgbench_accounts done (elapsed
> 77.44 s, remaining 0.00 s)
> vacuuming...
> creating primary keys...
> done in 98.64 s (drop tables 0.00 s, create tables 0.01 s, client-side
> generate 77.47 s, vacuum 0.12 s, primary keys 21.04 s).
>
> I'm interested to see what numbers you get.  You'd need to test on
> PG16 however. I left the old code in place to generate the decimal
> numbers for versions < 16.

I will try to test this soon and follow up on the thread. I definitely
see no problems with your patch as is though. I would be more than happy
to rebase my patches on yours.

-- 
Tristan Partin
Neon (https://neon.tech)




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Greg Sabino Mullane
On Thu, Jun 8, 2023 at 8:44 AM Hannu Krosing  wrote:

> Do we have any statistics for the distribution of our user base ?
>
> My gut feeling says that for performance-critical use the non-Linux is
> in low single digits at best.
>

Stats are probably not possible, but based on years of consulting, as well
as watching places like SO, Slack, IRC, etc. over the years, IMO that's a
very accurate gut feeling. I'd hazard 1% or less for non-Linux systems.

Cheers,
Greg


Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Andres Freund
Hi,

On 2023-06-08 12:15:58 +0200, Hannu Krosing wrote:
> On Thu, Jun 8, 2023 at 11:54 AM Hannu Krosing  wrote:
> >
> > On Wed, Jun 7, 2023 at 11:37 PM Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote:
> > > > 2. While I wouldn't want to necessarily discourage a moonshot effort, I
> > > > would ask if developer time could be better spent on tackling some of 
> > > > the
> > > > other problems around vertical scalability? Per some PGCon discussions,
> > > > there's still room for improvement in how PostgreSQL can best utilize
> > > > resources available very large "commodity" machines (a 448-core / 24TB 
> > > > RAM
> > > > instance comes to mind).
> > >
> > > I think we're starting to hit quite a few limits related to the process 
> > > model,
> > > particularly on bigger machines. The overhead of cross-process context
> > > switches is inherently higher than switching between threads in the same
> > > process - and my suspicion is that that overhead will continue to
> > > increase. Once you have a significant number of connections we end up 
> > > spending
> > > a *lot* of time in TLB misses, and that's inherent to the process model,
> > > because you can't share the TLB across processes.
> >
> >
> > This part was touched on in the "AMA with a Linux Kernel Hacker"
> > Unconference session, where he mentioned that he had proposed a
> > 'mshare' syscall for this.

As-is that'd just lead to sharing the page tables, not the TLB. I don't
think you can currently share the TLB for parts of your address space on x86
hardware. It's possible that something like that gets added to future
hardware, but ...


> Also, the *static* huge pages already let you solve this problem now
> by sharing the page tables

You don't share the page tables with huge pages on linux.


- Andres




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Andres Freund
On 2023-06-08 14:01:16 +0200, Jose Luis Tallon wrote:
> * For "heavyweight" queries, the scalability of "almost independent"
> processes w.r.t. NUMA is just _impossible to achieve_ (locality of
> reference!) with a pure threaded system. When CPU+mem-bound
> (bandwidth-wise), threads add nothing IMO.

I don't think this is true in any sort of way.




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Robert Haas
On Thu, Jun 8, 2023 at 11:02 AM Hannu Krosing  wrote:
> No, I meant that this needs to be fixed at OS level, by being able to
> use the same mapping.
>
> We should not shy away from asking the OS people for adding the useful
> features still missing.
>
> It was mentioned in the Unconference Kernel Hacker AMA talk and said
> kernel hacker works for Oracle, and they also seemed to be needing
> this :)

Fair enough, but we aspire to work on a bunch of different operating
systems. To make use of an OS facility, we need something that works
on at least Linux, Windows, macOS, and a few different BSD flavors.
It's not as if when the PostgreSQL project asks for a new operating
system facility everyone springs into action to provide it
immediately. And even if they did, and even if they all released an
implementation of whatever we requested next year, it would still be
at least five, more realistically ten, years before systems with those
facilities were ubiquitous. And unless we have truly obscene amounts
of clout in the OS community, it's likely that all of those different
operating systems would implement different things to meet the stated
need, and then we'd have to have a complex bunch of platform-dependent
code in order to keep working on all of those systems.

To me, this is a road to nowhere. I have no problem at all with us
expressing our needs to the OS community, but realistically, any
PostgreSQL feature that depends on an OS feature less than twenty
years old is going to have to be optional, which means that if we want
to do anything about sharing address space mappings in the next few
years, it's going to need to be based on threads.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Matthias van de Meent
On Thu, 8 Jun 2023 at 17:02, Hannu Krosing  wrote:
>
> On Thu, Jun 8, 2023 at 4:56 PM Robert Haas  wrote:
> >
> > On Thu, Jun 8, 2023 at 8:44 AM Hannu Krosing  wrote:
> > > > That sounds like a bad idea, dynamic shared memory is more expensive
> > > > to maintain than our static shared memory systems, not in the least
> > > > because DSM is not guaranteed to share the same addresses in each
> > > > process' address space.
> > >
> > > Then this too needs to be fixed
> >
> > Honestly, I'm struggling to respond to this non-sarcastically. I mean,
> > I was the one who implemented DSM. Do you think it works the way that
> > it works because I considered doing something smart and decided to do
> > something dumb instead?
>
> No, I meant that this needs to be fixed at OS level, by being able to
> use the same mapping.
>
> We should not shy away from asking the OS people for adding the useful
> features still missing.

While I agree that "sharing page tables across processes" is useful,
it looks like it'd be much more effort to correctly implement for e.g.
DSM than implementing threading.
Konstantin's diff is "only" 20.1k lines [0] added and/or modified,
which is a lot, but it's manageable (13k+ of which are from files that
were auto-generated and then committed, likely accidentally).

> It was mentioned in the Unconference Kernel Hacker AMA talk and said
> kernel hacker works for Oracle, andf they also seemed to be needing
> this :)

Though these new kernel features allowing for better performance
(mostly in kernel memory usage, probably) would be nice to have, we
wouldn't get performance benefits for older kernels, benefits which we
would get if we were to implement threading.
I'm not on board with a policy of us twiddling thumbs and waiting for
the OS to fix our architectural performance issues. Sure, the kernel
could optimize for our usage pattern, but I think that's not something
we can (or should) rely on for performance ^1.

Kind regards,

Matthias van de Meent

[0] 
https://github.com/postgrespro/postgresql.pthreads/compare/801386af...d5933309?w=1
^1 OT: I think the same about us (ab)using the OS page cache, but
that's a tale for a different time and thread.




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Andres Freund
On 2023-06-08 10:33:26 -0400, Greg Stark wrote:
> On Wed, 7 Jun 2023 at 18:09, Andres Freund  wrote:
> > Having the same memory mapping between threads allows the
> > hardware to share the TLB (on x86 via process context identifiers), which
> > isn't realistically possible with different processes.
> 
> As a matter of historical interest Solaris actually did implement this
> across different processes. It was called by the somewhat unfortunate
> name "Intimate Shared Memory". I don't think Linux ever implemented
> anything like it but I'm not sure.

I don't think it shared the TLB - it did share page tables though.




Re: Cleaning up nbtree after logical decoding on standby work

2023-06-08 Thread Peter Geoghegan
On Thu, Jun 8, 2023 at 7:22 AM Alvaro Herrera  wrote:
> IMO this kind of change definitely does not have place in a post-beta1
> restructuring patch.  We rarely indulge in case-fixing exercises at any
> other time, and I don't see any good argument why post-beta1 is a better
> time for it.

There is a glaring inconsistency. Now about half of the relevant
functions in nbtree.h use "heaprel", while the other half use
"heapRel". Obviously that's not the end of the world, but it's
annoying. It's exactly the kind of case-fixing exercise that does tend
to happen.

I'm not going to argue this point any further, though. I will make
this change at a later date. That will introduce an inconsistency
between branches, of course, but apparently there isn't any
alternative.

> I suggest that you should strive to keep the patch as
> small as possible.

Attached is v4, which goes back to using "heaprel" in new-to-16 code.
As a result, it is slightly smaller than v3.

My new plan is to commit this tomorrow, since the clear consensus is
that we should go ahead with this for 16.

-- 
Peter Geoghegan


v4-0001-nbtree-Allocate-new-pages-in-separate-function.patch
Description: Binary data


Re: Allow pg_archivecleanup to remove backup history files

2023-06-08 Thread Fujii Masao




On 2023/05/31 10:51, torikoshia wrote:

Update the patch according to the advice.


Thanks for updating the patches! I have small comments regarding 0002 patch.

+   
+ Remove backup history files.

Isn't it better to document clearly which backup history files to be removed? For 
example, "In addition to removing WAL files, remove backup history files with 
prefixes logically preceding the oldestkeptwalfile.".


printf(_("  -n, --dry-run   dry run, show the names of the files that would be removed\n"));
+   printf(_("  -b, --clean-backup-history  clean up files including backup history files\n"));

Shouldn't -b option be placed in alphabetical order?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: GTIN14 support for contrib/isn

2023-06-08 Thread Josef Šimánek
On Thu, 8 Jun 2023 at 17:20, Michael Kefeder wrote:
>
>
> Am 15.03.19 um 17:27 schrieb Tom Lane:
> > Michael Kefeder  writes:
> >> For a project of ours we need GTIN14 data type support.
> >
> > Hm, what is that and where would a reviewer find the specification for it?
> >
> specs are from GS1 here https://www.gs1.org/standards/id-keys/gtin
> side-note EAN13 is actually called GTIN-13 now. Wikipedia has a quick
> overview https://en.wikipedia.org/wiki/Global_Trade_Item_Number
>
> >> Looking at the code I saw every format that isn-extension supports is
> >> stored as an EAN13. Theoretically that can be changed to be GTIN14, but
> >> that would mean quite a lot of rewrite I feared, so I chose to code only
> >> GTIN14 I/O separetely to not interfere with any existing conversion
> >> magic. This yields an easier to understand patch and doesn't touch
> >> existing functionality. However it introduces redundancy to a certain
> >> extent.
> >
> > Yeah, you certainly don't get to change the on-disk format of the existing
> > types, unfortunately.  Not sure what the least messy way of dealing with
> > that is.  I guess we do want this to be part of contrib/isn rather than
> > an independent module, if there are sane datatype conversions with the
> > existing isn types.
> >
> the on-disk format does not change (it would support even longer codes;
> it's just an integer where one bit is used for a valid/invalid flag, and I
> did not touch that at all). Putting GTIN14 in isn makes sense I find, and is
> backward/forward compatible.
>
> >> Find my patch attached. Please let me know if there are things that need
> >> changes, I'll do my best to get GTIN support into postgresql.
> >
> > Well, two comments immediately:
> >
> > * where's the documentation changes?
> >
> > * simply editing the .sql file in-place is not acceptable; that breaks
> > the versioning conventions for extensions, and leaves users with no
> > easy upgrade path.  What you need to do is create a version upgrade
> > script that adds the new objects.  For examples look for other recent
> > patches that have added features to contrib modules, eg
> >
> > https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=eb6f29141bed9dc95cb473614c30f470ef980705
> >
> > Also, I'm afraid you've pretty much missed the deadline to get this
> > into PG v12; we've already got more timely-submitted patches than
> > we're likely to be able to finish reviewing.  Please add it to the
> > first v13 commit fest,
> >
> > https://commitfest.postgresql.org/23/
> >
> > so that we don't forget about it when the time does come to look at it.
> >
> >   regards, tom lane
> >
>
> thanks for the feedback! will do mentioned documentation changes and
> create a separate upgrade sql file. Making it into v13 is fine by me.

Hello!

If I understand it correctly, this patch was never finished and submitted
after this discussion. If there is still interest, I can try to polish
the patch, rebase it, and submit. I'm interested in GTIN14 support.

> br
>   mike
>
>
>




Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Konstantin Knizhnik



On 08.06.2023 6:18 PM, Dave Cramer wrote:



On Thu, 8 Jun 2023 at 11:15, Jan Wieck  wrote:

On 6/8/23 10:56, Dave Cramer wrote:
>
>
>
>
> On Thu, 8 Jun 2023 at 10:31, Jan Wieck  > wrote:
>
>     On 6/8/23 09:53, Jan Wieck wrote:
>      > On 6/8/23 09:21, Dave Cramer wrote:
>      > The server doesn't know about all the clients of the
pooler, does
>     it? It
>      > has no way of telling if/when a client disconnects from
the pooler.
>
>     Another problem that complicates doing it in the server is
that the
>     information required to (re-)prepare a statement in a backend
that
>     currently doesn't have it needs to be kept in shared memory.
This
>     includes the query string itself. Doing that without shared
memory in a
>     pooler that is multi-threaded or based on async-IO is much
simpler and
>     allows for easy ballooning.
>
>
> I don't expect the server to re-prepare the statement. If the
server
> responds with "statement doesn't exist" the client would send a
prepare.

Are you proposing a new libpq protocol version?


I believe we would need to add this to the protocol, yes.



So it will be the responsibility of the client to remember the text of a
prepared query to be able to resend it when the statement doesn't exist at
the server?
IMHO a very strange decision. Why not handle it in the connection pooler
(doesn't matter - external or embedded)?


Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Dave Cramer
On Thu, 8 Jun 2023 at 11:15, Jan Wieck  wrote:

> On 6/8/23 10:56, Dave Cramer wrote:
> >
> >
> >
> >
> > On Thu, 8 Jun 2023 at 10:31, Jan Wieck wrote:
> >
> > On 6/8/23 09:53, Jan Wieck wrote:
> >  > On 6/8/23 09:21, Dave Cramer wrote:
> >  > The server doesn't know about all the clients of the pooler, does
> > it? It
> >  > has no way of telling if/when a client disconnects from the
> pooler.
> >
> > Another problem that complicates doing it in the server is that the
> > information required to (re-)prepare a statement in a backend that
> > currently doesn't have it needs to be kept in shared memory. This
> > includes the query string itself. Doing that without shared memory
> in a
> > pooler that is multi-threaded or based on async-IO is much simpler
> and
> > allows for easy ballooning.
> >
> >
> > I don't expect the server to re-prepare the statement. If the server
> > responds with "statement doesn't exist" the client would send a prepare.
>
> Are you proposing a new libpq protocol version?
>

I believe we would need to add this to the protocol, yes.

Dave

>
>
> Jan
>


Re: Seeking Guidance on Using Valgrind in PostgreSQL for Detecting Memory Leaks in Extension Code

2023-06-08 Thread Heikki Linnakangas

On 08/06/2023 14:08, Pradeep Kumar wrote:
I am writing to seek your guidance on using Valgrind in
PostgreSQL for detecting memory leaks in extension-related code.


https://wiki.postgresql.org/wiki/Valgrind is a good place to start. The 
same tricks for using Valgrind on PostgreSQL itself should work for 
extensions too.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Jan Wieck

On 6/8/23 10:56, Dave Cramer wrote:





On Thu, 8 Jun 2023 at 10:31, Jan Wieck wrote:


On 6/8/23 09:53, Jan Wieck wrote:
 > On 6/8/23 09:21, Dave Cramer wrote:
 > The server doesn't know about all the clients of the pooler, does
it? It
 > has no way of telling if/when a client disconnects from the pooler.

Another problem that complicates doing it in the server is that the
information required to (re-)prepare a statement in a backend that
currently doesn't have it needs to be kept in shared memory. This
includes the query string itself. Doing that without shared memory in a
pooler that is multi-threaded or based on async-IO is much simpler and
allows for easy ballooning.


I don't expect the server to re-prepare the statement. If the server 
responds with "statement doesn't exist" the client would send a prepare.


Are you proposing a new libpq protocol version?


Jan




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Matthias van de Meent
On Thu, 8 Jun 2023 at 14:44, Hannu Krosing  wrote:
>
> On Thu, Jun 8, 2023 at 2:15 PM Matthias van de Meent
>  wrote:
> >
> > On Thu, 8 Jun 2023 at 11:54, Hannu Krosing  wrote:
> > >
> > > This part was touched on in the "AMA with a Linux Kernel Hacker"
> > > Unconference session, where he mentioned that he had proposed a
> > > 'mshare' syscall for this.
> > >
> > > So maybe a more fruitful way to fixing the perceived issues with
> > > process model is to push for small changes in Linux to overcome these
> > > avoiding a wholesale rewrite ?
> >
> > We support not just Linux, but also Windows and several (?) BSDs. I'm
> > not against pushing Linux to make things easier for us, but Linux is
> > an open source project, too, where someone need to put in time to get
> > the shiny things that you want. And I'd rather see our time spent in
> > PostgreSQL, as Linux is only used by a part of our user base.
>
> Do we have any statistics for the distribution of our user base ?
>
> My gut feeling says that for performance-critical use the non-Linux is
> in low single digits at best.
>
> My fascination for OpenSource started with realisation that instead of
> workarounds you can actually fix the problem at source. So if the
> specific problem is that TLB is not shared then the proper fix is
> making it shared instead of rewriting everything else to get around
> it. None of us is limited to writing code in PostgreSQL only. If the
> easiest and more generix fix can be done in Linux then so be it.

TLB is a CPU hardware facility, not something that the OS can decide
to share between processes. While sharing (some) OS memory management
facilities across threads might be possible (as you mention, that
mshare syscall would be an example), that doesn't solve the issue of
the hardware not supporting sharing TLB entries across processes. We'd
use less kernel memory for memory management, but the CPU would still
stall on TLB misses every time we switch processes on the CPU (unless
we somehow were able to use non-process-namespaced TLB entries, which
would make our processes not meaningfully different from threads
w.r.t. address space).

> > >
> > > Maybe we can already remove the distinction between static and dynamic
> > > shared memory ?
> >
> > That sounds like a bad idea, dynamic shared memory is more expensive
> > to maintain than our static shared memory systems, not in the least
> > because DSM is not guaranteed to share the same addresses in each
> > process' address space.
>
> Then this too needs to be fixed

That needs kernel facilities in all (most?) supported OSes, and I
think that's much more work than moving to threads:
Allocations from the kernel are arbitrarily random across the
available address space, so a DSM segment that is allocated in one
backend might overlap with unshared allocations of a different
backend, making those backends have conflicting memory address spaces.
The only way to make that work is to have a shared memory addressing
space, but some backends just not having the allocation mapped into
their local address space; which seems only slightly more isolated
than threads and much more effort to maintain.

> > > Though I already heard some complaints at the conference discussions
> > > that having the dynamic version available has made some developers
> > > sloppy in using it resulting in wastefulness.
> >
> > Do you know any examples of this wastefulness?
>
> No. Just somebody mentioned it in a hallway conversation and the rest
> of the developers present mumbled approvingly :)

The only "wastefulness" that I know of in our use of DSM is the queue,
and that's by design: We need to move data from a backend's private
memory to memory that's accessible to other backends; i.e. shared
memory. You can't do that without copying or exposing your private
memory.

> > > Still we should be focusing our attention at solving the issues and
> > > not at "moving to threads" and hoping this will fix the issues by
> > > itself.
> >
> > I suspect that it is much easier to solve some of the issues when
> > working in a shared address space.
>
> Probably. But it would come at the cost of needing to change a lot of
> other parts of PostgreSQL.
>
> I am not against making code cleaner for potential threaded model
> support. I am just a bit sceptical about the actual switch being easy,
> or doable in the next 10-15 years.

PostgreSQL only has a support cycle of 5 years. 5 years after the last
release of un-threaded PostgreSQL we could drop support for "legacy"
extension models that don't support threading.

> > E.g. resizing shared_buffers is difficult right now due to the use of
> > a static allocation of shared memory, but if we had access to a single
> > shared address space, it'd be easier to do any cleanup necessary for
> > dynamically increasing/decreasing its size.
>
> This again could be done with shared memory mapping + dynamic shared memory.

Yes, but as I said, that's much more difficult than lock and/or atom

Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Hannu Krosing
On Thu, Jun 8, 2023 at 4:56 PM Robert Haas  wrote:
>
> On Thu, Jun 8, 2023 at 8:44 AM Hannu Krosing  wrote:
> > > That sounds like a bad idea, dynamic shared memory is more expensive
> > > to maintain than our static shared memory systems, not in the least
> > > because DSM is not guaranteed to share the same addresses in each
> > > process' address space.
> >
> > Then this too needs to be fixed
>
> Honestly, I'm struggling to respond to this non-sarcastically. I mean,
> I was the one who implemented DSM. Do you think it works the way that
> it works because I considered doing something smart and decided to do
> something dumb instead?

No, I meant that this needs to be fixed at OS level, by being able to
use the same mapping.

We should not shy away from asking the OS people for adding the useful
features still missing.

It was mentioned in the Unconference Kernel Hacker AMA talk, and said
kernel hacker works for Oracle, and they also seemed to be needing
this :)




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Robert Haas
On Thu, Jun 8, 2023 at 8:44 AM Hannu Krosing  wrote:
> > That sounds like a bad idea, dynamic shared memory is more expensive
> > to maintain than our static shared memory systems, not in the least
> > because DSM is not guaranteed to share the same addresses in each
> > process' address space.
>
> Then this too needs to be fixed

Honestly, I'm struggling to respond to this non-sarcastically. I mean,
I was the one who implemented DSM. Do you think it works the way that
it works because I considered doing something smart and decided to do
something dumb instead?

Suppose you have two PostgreSQL backends A and B. If we're not running
on Windows, each of these was forked from the postmaster, so things
like the text and data segments and the main shared memory segment are
going to be mapped at the same address in both processes, because they
inherit those mappings from the postmaster. However, additional things
can get mapped into the address space of either process later. This
can happen in a variety of ways. For instance, a shared library can
get loaded into one process and not the other. Or it can get loaded
into both processes but at different addresses - keep in mind that
it's the OS, not PostgreSQL, that decides what address to use when
loading a shared library. Or, if one process allocates a bunch of
memory, then new address space will have to be mapped into that
process to handle those memory allocations and, again, it is the OS
that decides where to put those mappings. So over time the memory
mappings of these two processes can diverge arbitrarily. That means
that if the same DSM has to be mapped into both processes, there is no
guarantee that it can be placed at the same address in both processes.
The address that gets used in one process might not be available in
the other process.

It's worth pointing out here that there are no portable primitives
available for a process to examine what memory segments are mapped
into its address space. I think it's probably possible on every OS,
but it works differently on different ones. Linux exposes such details
through /proc, for example, but macOS doesn't have /proc. So if we're
using standard, portable primitives, we can't even TRY to put the DSM
at the same address in every process that maps it. But even if we used
non-portable primitives to examine what's mapped into the address
space of every process, it wouldn't solve the problem. Suppose 4
processes want to share a DSM, so they all run around and use
non-portable OS-specific interfaces to figure out where there's a free
chunk of address space large enough to accommodate that DSM and they
all map it there. Hooray! But then say a fifth process comes along and
it ALSO wants to map that DSM, but in that fifth process the address
space that was available in the other four processes has already been
used by something else. Well, now we're screwed.

The fact that DSM is expensive and awkward to use isn't a defect in
the implementation of DSM. It's a consequence of the fact that the
address space mappings in one PostgreSQL backend can be almost
arbitrarily different from the address space mappings in another
PostgreSQL backend. If only there were some kind of OS feature
available that would allow us to set things up so that all of the
PostgreSQL backends shared the same address space mappings!

Oh, right, there is: THREADS.

The fact that we don't use threads is the reason why DSM sucks and has
to suck. In fact it's the reason why DSM has to exist at all. Saying
"fix DSM instead of using threads" is roughly in the same category as
saying "if the peasants are revolting because they have no bread, then
let them eat cake." Both statements evince a complete failure to
understand the actual source of the problem.

With apologies for my grumpiness,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Dave Cramer
On Thu, 8 Jun 2023 at 10:31, Jan Wieck  wrote:

> On 6/8/23 09:53, Jan Wieck wrote:
> > On 6/8/23 09:21, Dave Cramer wrote:
> > The server doesn't know about all the clients of the pooler, does it? It
> > has no way of telling if/when a client disconnects from the pooler.
>
> Another problem that complicates doing it in the server is that the
> information required to (re-)prepare a statement in a backend that
> currently doesn't have it needs to be kept in shared memory. This
> includes the query string itself. Doing that without shared memory in a
> pooler that is multi-threaded or based on async-IO is much simpler and
> allows for easy ballooning.
>
>
I don't expect the server to re-prepare the statement. If the server
responds with "statement doesn't exist" the client would send a prepare.

Dave


Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Dave Cramer
On Thu, 8 Jun 2023 at 09:53, Jan Wieck  wrote:

> On 6/8/23 09:21, Dave Cramer wrote:
> >
> >
> > On Thu, Jun 8, 2023 at 8:43 AM Jan Wieck wrote:
> >
> > On 6/8/23 02:15, Konstantin Knizhnik wrote:
> >
> >  > There is a PR with support of prepared statement support to
> > pgbouncer:
> >  > https://github.com/pgbouncer/pgbouncer/pull/845
> > 
> >  > any feedback, reviews and suggestions are welcome.
> >
> > I was about to say that the support would have to come from the
> pooler
> > as it is possible to have multiple applications in different
> languages
> > connecting to the same pool(s).
> >
> >
> > Why from the pooler? If it were done at the server every client could
> > use it?
>
> The server doesn't know about all the clients of the pooler, does it? It
> has no way of telling if/when a client disconnects from the pooler.
>

Why does it have to know if the client disconnects? It just keeps a cache
of prepared statements.
In large apps it is very likely there will be another client wanting to use
the statement.

Dave

>
>


Re: Error in calculating length of encoded base64 string

2023-06-08 Thread Tom Lane
o.tselebrovs...@postgrespro.ru writes:
> While working on an extension I've found an error in how length of 
> encoded base64 string is calculated;

Yeah, I think you're right.  It's not of huge significance, because
it just overestimates by 1 or 2 bytes, but we might as well get
it right.  Thanks for the report and patch!

regards, tom lane




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Greg Stark
On Wed, 7 Jun 2023 at 18:09, Andres Freund  wrote:
> Having the same memory mapping between threads allows the
> hardware to share the TLB (on x86 via process context identifiers), which
> isn't realistically possible with different processes.

As a matter of historical interest Solaris actually did implement this
across different processes. It was called by the somewhat unfortunate
name "Intimate Shared Memory". I don't think Linux ever implemented
anything like it but I'm not sure.

I think this was not so much about cache hit rate but about just sheer
wasted memory in page mappings. So I guess hugepages more or less
target the same issues. But I find it interesting that they were
already running into issues like this 20 years ago -- presumably those
issues have only grown.

-- 
greg




Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Jan Wieck

On 6/8/23 09:53, Jan Wieck wrote:

On 6/8/23 09:21, Dave Cramer wrote:
The server doesn't know about all the clients of the pooler, does it? It
has no way of telling if/when a client disconnects from the pooler.


Another problem that complicates doing it in the server is that the 
information required to (re-)prepare a statement in a backend that
currently doesn't have it needs to be kept in shared memory. This 
includes the query string itself. Doing that without shared memory in a 
pooler that is multi-threaded or based on async-IO is much simpler and 
allows for easy ballooning.



Jan





Re: Cleaning up nbtree after logical decoding on standby work

2023-06-08 Thread Alvaro Herrera
On 2023-Jun-07, Peter Geoghegan wrote:

> On Wed, Jun 7, 2023 at 5:12 PM Andres Freund  wrote:

> > I don't really understand why the patch does s/heaprel/heapRel/.
> 
> That has been the style used within nbtree for many years now.

IMO this kind of change definitely does not have place in a post-beta1
restructuring patch.  We rarely indulge in case-fixing exercises at any
other time, and I don't see any good argument why post-beta1 is a better
time for it.  I suggest that you should strive to keep the patch as
small as possible.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"How we put our fingers in each other's clay. That is friendship: playing
potter and seeing what shapes can be drawn from the other" (C. Halloway in
Something Wicked This Way Comes, R. Bradbury)




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Robert Haas
On Wed, Jun 7, 2023 at 5:45 PM Andres Freund  wrote:
> People have argued that the process model is more robust. But it turns out
> that we have to crash-restart for just about any "bad failure" anyway. It used
> to be (a long time ago) that we didn't, but that was just broken.

How hard have you thought about memory leaks as a failure mode? Or
file descriptor leaks?

Right now, a process needs to release all of its shared resources
before exiting, or trigger a crash-and-restart cycle. But it doesn't
need to release any process-local resources, because the OS will take
care of that. But that wouldn't be true any more, and that seems like
it might require fixing quite a few things.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Robert Haas
On Wed, Jun 7, 2023 at 5:39 PM Peter Eisentraut  wrote:
> On 07.06.23 23:30, Andres Freund wrote:
> > Yea, we definitely need the supervisor function in a separate
> > process. Presumably that means we need to split off some of the postmaster
> > responsibilities - e.g. I don't think it'd make sense to handle connection
> > establishment in the supervisor process. I wonder if this is something that
> > could end up being beneficial even in the process world.
>
> Something to think about perhaps ... how would that be different from
> using an existing external supervisor process like systemd or supervisord.

systemd wouldn't start individual PostgreSQL processes, right? If we
want a checkpointer and a wal writer and a background writer and
whatever we have to have our own supervisor process to spawn all those
and keep them running. We could remove the logic to do a full system
reset without a postmaster exit in favor of letting systemd restart
everything from scratch, if we wanted to do that. But we'd still need
our own supervisor to start up all of the individual threads/processes
that we need.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Typo in src/backend/access/nbtree/README?

2023-06-08 Thread Daniel Westermann (DWE)
Hi,

I am not a native English speaker, but shouldn't there be a "to" before 
"detect"?

These two additions make it possible detect a concurrent page split

Regards
Daniel



Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Robert Haas
On Wed, Jun 7, 2023 at 5:37 PM Andres Freund  wrote:
> I think we're starting to hit quite a few limits related to the process model,
> particularly on bigger machines. The overhead of cross-process context
> switches is inherently higher than switching between threads in the same
> process - and my suspicion is that that overhead will continue to
> increase. Once you have a significant number of connections we end up spending
> a *lot* of time in TLB misses, and that's inherent to the process model,
> because you can't share the TLB across processes.

This is a very good point.

Our default posture on this mailing list is to try to maximize use of
OS facilities rather than reimplementing things - well and good. But
if a user writes a query with FOO JOIN BAR ON FOO.X = BAR.X OR FOO.Y =
BAR.Y and then complains that the resulting query plan sucks, we don't
slink off in embarrassment: we tell the user that there's not really
any fast plan for that query and that if they write queries like that
they have to live with the consequences. But the same thing applies
here. To the extent that context switching between more processes is
more expensive than context switching between threads for
hardware-related reasons, that's not something that the OS can fix for
us. If we choose to do the expensive thing then we pay the overhead.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Robert Haas
On Wed, Jun 7, 2023 at 5:30 PM Andres Freund  wrote:
> On 2023-06-05 17:51:57 +0300, Heikki Linnakangas wrote:
> > If there are no major objections, I'm going to update the developer FAQ,
> > removing the excuses there for why we don't use threads [1].
>
> I think we should do this even if there's no concensus to slowly change to
> threads. There's clearly no concensus on the opposite either.

This is a very fair point.

> One interesting bit around the transition is what tooling we ought to provide
> to detect problems. It could e.g. be reasonably feasible to write something
> checking how many read-write global variables an extension has on linux
> systems.

Yes, this would be great.

> I don't think the control file is the right place - that seems more like
> something that should be signalled via PG_MODULE_MAGIC. We need to check this
> not just during CREATE EXTENSION, but also during loading of libraries - think
> of shared_preload_libraries.

+1.

> Yea, we definitely need the supervisor function in a separate
> process. Presumably that means we need to split off some of the postmaster
> responsibilities - e.g. I don't think it'd make sense to handle connection
> establishment in the supervisor process. I wonder if this is something that
> could end up being beneficial even in the process world.

Yeah, I've had similar thoughts. I'm not exactly sure what the
advantages of such a refactoring might be, but the current structure
feels pretty limiting. It works OK because we don't do anything in the
postmaster other than fork a new backend, but I'm not sure if that's
the best strategy. It means, for example, that if there's a ton of new
connection requests, we're spawning a ton of new processes, which
means that you can put a lot of load on a PostgreSQL instance even if
you can't authenticate. Maybe we'd be better off with a pool of
processes accepting connections; if authentication fails, that
connection goes back into the pool and tries again. If authentication
succeeds, either that process transitions to being a regular backend,
leaving the authentication pool, or perhaps hands off the connection
to a "real backend" at that point and loops around to accept() the
next request.

Whether that's a good idea in detail or not, the point remains that
having the postmaster handle this task is quite limiting. It forces us
to hand off the connection to a new process at the earliest possible
stage, so that the postmaster remains free to handle other duties.
Giving the responsibility to another process would let us make
decisions about where to perform the hand-off based on real
architectural thought rather than being forced to do it a certain way
because nothing else will work.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Jan Wieck

On 6/8/23 09:21, Dave Cramer wrote:
> On Thu, Jun 8, 2023 at 8:43 AM Jan Wieck wrote:
>> On 6/8/23 02:15, Konstantin Knizhnik wrote:
>>> There is a PR with support of prepared statement support to pgbouncer:
>>> https://github.com/pgbouncer/pgbouncer/pull/845
>>> any feedback, reviews and suggestions are welcome.
>>
>> I was about to say that the support would have to come from the pooler
>> as it is possible to have multiple applications in different languages
>> connecting to the same pool(s).
>
> Why from the pooler? If it were done at the server every client could
> use it?

The server doesn't know about all the clients of the pooler, does it? It
has no way of telling if/when a client disconnects from the pooler.



Jan




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Konstantin Knizhnik




On 07.06.2023 3:53 PM, Robert Haas wrote:

> I think I remember a previous conversation with Andres
> where he opined that thread-local variables are "really expensive"
> (and I apologize in advance if I'm mis-remembering this). Now, Andres
> is not a man who accepts a tax on performance of any size without a
> fight, so his "really expensive" might turn out to resemble my "pretty
> cheap." However, if widespread use of TLS is too expensive and we have
> to start rewriting code to not depend on global variables, that's
> going to be more of a problem. If we can get by with doing such
> rewrites only in performance-critical places, it might not still be
> too bad. Personally, I think the degree of dependence that PostgreSQL
> has on global variables is pretty excessive and I don't think that a
> certain amount of refactoring to reduce it would be a bad thing. If it
> turns into an infinite series of hastily-written patches to rejigger
> every source file we have, though, then I'm not really on board with
> that.


Actually TLS is not more expensive than accessing struct fields (at
least on the x86 platform); consider the following program:


typedef struct {
    int a;
    int b;
    int c;
} ABC;

__thread int a;
__thread int b;
__thread int c;


void use_struct(ABC* abc) {
    abc->a += 1;
    abc->b += 1;
    abc->c += 1;
}

void use_tls(ABC* abc) {
    a += 1;
    b += 1;
    c += 1;
}


Now look at the generated assembler:

use_struct:
    addl    $1, (%rdi)
    addl    $1, 4(%rdi)
    addl    $1, 8(%rdi)
    ret


use_tls:
    addl    $1, %fs:a@tpoff
    addl    $1, %fs:b@tpoff
    addl    $1, %fs:c@tpoff
    ret


> Heikki mentions the idea of having a central Session object and just
> passing that around. I have a hard time believing that's going to work
> out nicely. First, it's not extensible. Right now, if you need a bit
> of additional session-local state, you just declare a variable and
> you're all set. That's not a perfect system and does cause some
> problems, but we can't go from there to a system where it's impossible
> to add session-local state without hacking core. Second, we will be
> sad if session.h ends up #including every other header file that
> defines a data structure anywhere in the backend. Or at least I'll be
> sad. I'm not actually against the idea of having some kind of session
> object that we pass around, but I think it either needs to be limited
> to a relatively small set of well-defined things, or else it needs to
> be designed in some kind of extensible way that doesn't require it to
> know the full details of every sort of object that's being used as
> session-local state anywhere in the system. I haven't really seen any
> convincing design ideas around this yet.



There are about 2k static/global variables in Postgres.
It is almost impossible to maintain such a struct.
But a session context may still be needed for other purposes - if we want
to support a built-in connection pool.


If we are using threads, then all variables need to be either
thread-local, or access to them should be synchronized.
But if we want to save the session context, there is no need to
save/restore all of these 2k variables.
We only need to capture those variables whose lifetime exceeds
transaction boundaries.

There are not that many such variables - tens, not hundreds.

The question is how to better handle this "session context".
There are two alternatives:
1. Save/restore this context from/to normal TLS variables.
2. Replace such variables with access through the session context struct.

I prefer 2) because it requires fewer changes in the code.
And the performance overhead of session context store/resume is negligible
when the number of such variables is ~10.







Re: Inconsistent results with libc sorting on Windows

2023-06-08 Thread Juan José Santamaría Flecha
> On 6/7/23 07:58, Daniel Verite wrote:
> >   Thomas Munro wrote:
> >
> >> > > Also, it does not occur at all if parallel scan is disabled.
> >> >
> >> > Could this be a clue that it is failing to be transitive?
> >>
> >> That vaguely rang a bell for me...  and then I remembered this thread:
> >>
> >>
> https://www.postgresql.org/message-id/flat/20191206063401.GB1629883%40rfd.leadboat.com
> >
> > Thanks for the pointer, non-transitive comparisons seem a likely cause
> > indeed.
>

Just to make sure we are all seeing the same problem, does the attached
patch fix your test?

Regards,

Juan José Santamaría Flecha


0001-POC-Inconsistent-results-with-libc-sorting-on-Window.patch
Description: Binary data


Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Robert Haas
On Thu, Jun 8, 2023 at 6:04 AM Hannu Krosing  wrote:
> Here I was hoping to go in the opposite direction and support parallel
> query across replicas.
>
> This looks much more doable based on the process model than the single
> process / multiple threads model.

I don't think this is any more or less difficult to support in one
model vs. the other. The problems seem pretty much unrelated.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Konstantin Knizhnik



On 08.06.2023 3:43 PM, Jan Wieck wrote:
> On 6/8/23 02:15, Konstantin Knizhnik wrote:
>> There is a PR with support of prepared statement support to pgbouncer:
>> https://github.com/pgbouncer/pgbouncer/pull/845
>> any feedback, reviews and suggestions are welcome.
>
> I was about to say that the support would have to come from the pooler
> as it is possible to have multiple applications in different languages
> connecting to the same pool(s).

Ideally, support should be provided by both sides: only the pooler knows
the mapping between clients and Postgres backends, and only the server
knows which queries require session semantics and which do not (in
principle it is possible to make the connection pooler determine it, but
it is very non-trivial).

> I can certainly give this a try, possibly over the weekend. I have a
> TPC-C that can use prepared statements plus pause/resume. That might
> be a good stress for it.




By the way, I have done some small benchmarking of different connection 
poolers for Postgres.
Benchmark was very simple: I just create small pgbench database with 
scale 10 and then

run read-only queries with 100 clients:

pgbench -c 100 -P 10 -T 100 -S -M prepared postgres


The number of connections to the database was limited to 10 in all pooler
configurations. I have tested only transaction mode. If a pooler
supports prepared statements, I have also tested them.
Just for reference I also include results with a direct connection to
Postgres.
All benchmarking was done on my notebook, so it is not quite a
representative scenario.



Direct:
Connections  Prepared  TPS
10           yes       135507
10           no         73218
100          yes        79042
100          no         59245

Pooler (100 client connections, 10 server connections, transaction mode):
Pooler        Prepared  TPS
pgbouncer     no         65029
pgbouncer-ps  no         65570
pgbouncer-ps  yes        65825
odyssey       yes        18351
odyssey       no         21299
pgagroal      no         29673
pgcat         no         23247


Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Dave Cramer
On Thu, Jun 8, 2023 at 8:43 AM Jan Wieck  wrote:

> On 6/8/23 02:15, Konstantin Knizhnik wrote:
>
> > There is a PR with support of prepared statement support to pgbouncer:
> > https://github.com/pgbouncer/pgbouncer/pull/845
> > any feedback, reviews and suggestions are welcome.
>
> I was about to say that the support would have to come from the pooler
> as it is possible to have multiple applications in different languages
> connecting to the same pool(s).


Why from the pooler? If it were done at the server every client could use
it?

>
> Dave

>
> --
Dave Cramer


Parallel CREATE INDEX for BRIN indexes

2023-06-08 Thread Tomas Vondra
Hi,

Here's a WIP patch allowing parallel CREATE INDEX for BRIN indexes. The
infrastructure (starting workers etc.) is "inspired" by the BTREE code
(i.e. copied from that and massaged a bit to call brin stuff).

 _bt_begin_parallel -> _brin_begin_parallel
 _bt_end_parallel -> _brin_end_parallel
 _bt_parallel_estimate_shared -> _brin_parallel_estimate_shared
 _bt_leader_participate_as_worker -> _brin_leader_participate_as_worker
 _bt_parallel_scan_and_sort -> _brin_parallel_scan_and_build

This is mostly mechanical stuff - setting up the parallel workers,
starting the scan etc.

The tricky part is how to divide the work between workers and how we
combine the partial results. For BTREE we simply let each worker to read
a subset of the table (using a parallel scan), sort it and then do a
merge sort on the partial results.

For BRIN it's a bit different, because the index essentially splits
the table into smaller ranges and treats them independently. So the
easiest way is to organize the table scan so that each range gets
processed by exactly one worker. Each worker writes the index tuples
into a temporary file, and then when all workers are done we read and
write them into the index.

The problem is that a parallel scan assigns a mostly random subset of the
table to each worker - there is no guarantee that a BRIN page range will
be processed by a single worker.


0001 does that in a bit silly way - instead of doing a single large scan,
each worker does a sequence of TID range scans (see
_brin_parallel_scan_and_build), and BrinShared has fields used to track
which ranges were already assigned to workers. A bit cumbersome, but it
works pretty well.

0002 replaces the TID range scan sequence with a single parallel scan,
modified to assign "chunks" in multiples of pagesPerRange.


In both cases _brin_end_parallel then reads the summaries from worker
files, and adds them into the index. In 0001 this is fairly simple,
although we could do one more improvement and sort the ranges by range
start to make the index nicer (and possibly a bit more efficient). This
should be simple, because the per-worker results are already sorted like
that (so a merge sort in _brin_end_parallel would be enough).

For 0002 it's a bit more complicated, because with a single parallel
scan brinbuildCallbackParallel can't decide if a range is assigned to a
different worker or is simply empty. And we want to generate summaries for empty
ranges in the index. We could either skip such range during index build,
and then add empty summaries in _brin_end_parallel (if needed), or add
them and then merge them using "union".


I just realized there's a third option to do this - we could just do
regular parallel scan (with no particular regard to pagesPerRange), and
then do "union" when merging results from workers. It doesn't require
the sequence of TID scans, and the union would also handle the empty
ranges. The per-worker results might be much larger, though, because
each worker might produce up to the "full" BRIN index.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 0d37a829e768772ef3e9c080f96333e24cdd43b7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 26 Mar 2023 00:44:01 +0100
Subject: [PATCH 1/2] parallel CREATE INDEX for BRIN

---
 src/backend/access/brin/brin.c| 714 +-
 src/backend/access/transam/parallel.c |   4 +
 src/backend/catalog/index.c   |   3 +-
 src/include/access/brin.h |   3 +
 4 files changed, 719 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3c6a956eaa3..13d94931efc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -31,8 +31,10 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
+#include "storage/buffile.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/datum.h"
@@ -41,6 +43,98 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_BRIN_SHARED		UINT64CONST(0xA001)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xA002)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xA003)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xA004)
+
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct BrinShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist for
+	 * the benefit of worker processes that need to create state corresponding
+	 * to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	BlockNumber	pagesPerRange;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All 

Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Hannu Krosing
On Thu, Jun 8, 2023 at 2:15 PM Matthias van de Meent wrote:
>
> On Thu, 8 Jun 2023 at 11:54, Hannu Krosing  wrote:
> >
> > On Wed, Jun 7, 2023 at 11:37 PM Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote:
> > > > 2. While I wouldn't want to necessarily discourage a moonshot effort, I
> > > > would ask if developer time could be better spent on tackling some of 
> > > > the
> > > > other problems around vertical scalability? Per some PGCon discussions,
> > > > there's still room for improvement in how PostgreSQL can best utilize
> > > > resources available very large "commodity" machines (a 448-core / 24TB 
> > > > RAM
> > > > instance comes to mind).
> > >
> > > I think we're starting to hit quite a few limits related to the process 
> > > model,
> > > particularly on bigger machines. The overhead of cross-process context
> > > switches is inherently higher than switching between threads in the same
> > > process - and my suspicion is that that overhead will continue to
> > > increase. Once you have a significant number of connections we end up 
> > > spending
> > > a *lot* of time in TLB misses, and that's inherent to the process model,
> > > because you can't share the TLB across processes.
> >
> >
> > This part was touched in the "AMA with a Linux Kernel Hacker"
> > Unconference session, where he mentioned that he had proposed a
> > 'mshare' syscall for this.
> >
> > So maybe a more fruitful way to fixing the perceived issues with
> > process model is to push for small changes in Linux to overcome these
> > avoiding a wholesale rewrite ?
>
> We support not just Linux, but also Windows and several (?) BSDs. I'm
> not against pushing Linux to make things easier for us, but Linux is
> an open source project, too, where someone needs to put in time to get
> the shiny things that you want. And I'd rather see our time spent in
> PostgreSQL, as Linux is only used by a part of our user base.

Do we have any statistics for the distribution of our user base?

My gut feeling says that for performance-critical use, the non-Linux
share is in the low single digits at best.

My fascination with open source started with the realisation that
instead of workarounds you can actually fix the problem at its source.
So if the specific problem is that the TLB is not shared, then the
proper fix is making it shared instead of rewriting everything else to
get around it. None of us is limited to writing code in PostgreSQL
only. If the easiest and most generic fix can be done in Linux, then so be it.

It is also possible that Windows and *BSD already have a similar feature.

>
> > > The amount of duplicated code we have to deal with due to the process 
> > > model
> > > is quite substantial. We have local memory, statically allocated shared 
> > > memory
> > > and dynamically allocated shared memory variants for some things. And 
> > > that's
> > > just going to continue.
> >
> > Maybe we can already remove the distinction between static and dynamic
> > shared memory ?
>
> That sounds like a bad idea, dynamic shared memory is more expensive
> to maintain than our static shared memory systems, not least
> because DSM is not guaranteed to share the same addresses in each
> process' address space.

Then this, too, needs to be fixed.

>
> > Though I already heard some complaints at the conference discussions
> > that having the dynamic version available has made some developers
> > sloppy in using it resulting in wastefulness.
>
> Do you know any examples of this wastefulness?

No. Just somebody mentioned it in a hallway conversation and the rest
of the developers present mumbled approvingly :)

> > > > I'm purposely giving a nonanswer on whether it's a worthwhile goal, but
> > > > rather I'd be curious where it could stack up against some other 
> > > > efforts to
> > > > continue to help PostgreSQL improve performance and handle very large
> > > > workloads.
> > >
> > > There's plenty of things we can do before, but in the end I think 
> > > tackling the
> > > issues you mention and moving to threads are quite tightly linked.
> >
> > Still we should be focusing our attention at solving the issues and
> > not at "moving to threads" and hoping this will fix the issues by
> > itself.
>
> I suspect that it is much easier to solve some of the issues when
> working in a shared address space.

Probably. But it would come at the cost of needing to change a lot of
other parts of PostgreSQL.

I am not against making code cleaner for potential threaded model
support. I am just a bit sceptical about the actual switch being easy,
or doable in the next 10-15 years.

> E.g. resizing shared_buffers is difficult right now due to the use of
> a static allocation of shared memory, but if we had access to a single
> shared address space, it'd be easier to do any cleanup necessary for
> dynamically increasing/decreasing its size.

This again could be done with shared memory mapping + dynamic shared memory.

> Same wit

Re: Named Prepared statement problems and possible solutions

2023-06-08 Thread Jan Wieck

On 6/8/23 02:15, Konstantin Knizhnik wrote:


There is a PR with support of prepared statement support to pgbouncer:
https://github.com/pgbouncer/pgbouncer/pull/845
any feedback, reviews and suggestions are welcome.


I was about to say that the support would have to come from the pooler 
as it is possible to have multiple applications in different languages 
connecting to the same pool(s).


I can certainly give this a try, possibly over the weekend. I have a 
TPC-C implementation that can use prepared statements plus pause/resume. 
That might be a good stress test for it.



Best Regards, Jan




Re: Support logical replication of DDLs

2023-06-08 Thread shveta malik
On Tue, Jun 6, 2023 at 11:31 AM Wei Wang (Fujitsu)
 wrote:
>
> On Thur, June 1, 2023 at 23:42 vignesh C  wrote:
> > On Wed, 31 May 2023 at 14:32, Wei Wang (Fujitsu) 
> > wrote:
> > > ~~~
> > >
> > > 2. Deparsed results of the partition table.
> > > When I run the following SQLs:
> > > ```
> > > create table parent (a int primary key) partition by range (a);
> > > create table child partition of parent default;
> > > ```
> > >
> > > I got the following two deparsed results:
> > > ```
> > > CREATE  TABLE  public.parent (a pg_catalog.int4 STORAGE PLAIN  ,
> > CONSTRAINT parent_pkey PRIMARY KEY (a))  PARTITION BY RANGE (a)
> > > CREATE  TABLE  public.child PARTITION OF public.parent (CONSTRAINT
> > child_pkey PRIMARY KEY (a)) DEFAULT
> > > ```
> > >
> > > When I run these two deparsed results on another instance, I got the 
> > > following
> > error:
> > > ```
> > > postgres=# CREATE  TABLE  public.parent (a pg_catalog.int4 STORAGE PLAIN  
> > > ,
> > CONSTRAINT parent_pkey PRIMARY KEY (a))  PARTITION BY RANGE (a);
> > > CREATE TABLE
> > > postgres=# CREATE  TABLE  public.child PARTITION OF public.parent
> > (CONSTRAINT child_pkey PRIMARY KEY (a)) DEFAULT;
> > > ERROR:  multiple primary keys for table "child" are not allowed
> > > ```
> > >
> > > I think that we could skip deparsing the primary key related constraint 
> > > for
> > > partition (child) table in the function obtainConstraints for this case.
> >
> > Not applicable for 0008 patch
>
> I think this issue still exists after applying the 0008 patch. Is this error 
> the
> result we expected?
> If no, I think we could try to address this issue in the function
> deparse_Constraints_ToJsonb in 0008 patch like the attachment. What do you
> think? BTW, we also need to skip the parentheses in the above case if you 
> think
> this approach is OK.
>

Thank you, Wang-san, for the patch. We have included the fix, after a
slight modification, in the latest patch set (*2023_06_08.patch).

thanks
Shveta




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Matthias van de Meent
On Thu, 8 Jun 2023 at 11:54, Hannu Krosing  wrote:
>
> On Wed, Jun 7, 2023 at 11:37 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2023-06-05 13:40:13 -0400, Jonathan S. Katz wrote:
> > > 2. While I wouldn't want to necessarily discourage a moonshot effort, I
> > > would ask if developer time could be better spent on tackling some of the
> > > other problems around vertical scalability? Per some PGCon discussions,
> > > there's still room for improvement in how PostgreSQL can best utilize
> > > resources available very large "commodity" machines (a 448-core / 24TB RAM
> > > instance comes to mind).
> >
> > I think we're starting to hit quite a few limits related to the process 
> > model,
> > particularly on bigger machines. The overhead of cross-process context
> > switches is inherently higher than switching between threads in the same
> > process - and my suspicion is that that overhead will continue to
> > increase. Once you have a significant number of connections we end up 
> > spending
> > a *lot* of time in TLB misses, and that's inherent to the process model,
> > because you can't share the TLB across processes.
>
>
> This part was touched in the "AMA with a Linux Kernel Hacker"
> Unconference session, where he mentioned that he had proposed a
> 'mshare' syscall for this.
>
> So maybe a more fruitful way to fixing the perceived issues with
> process model is to push for small changes in Linux to overcome these
> avoiding a wholesale rewrite ?

We support not just Linux, but also Windows and several (?) BSDs. I'm
not against pushing Linux to make things easier for us, but Linux is
an open source project, too, where someone needs to put in time to get
the shiny things that you want. And I'd rather see our time spent in
PostgreSQL, as Linux is only used by a part of our user base.

> > The amount of duplicated code we have to deal with due to the process 
> > model
> > is quite substantial. We have local memory, statically allocated shared 
> > memory
> > and dynamically allocated shared memory variants for some things. And that's
> > just going to continue.
>
> Maybe we can already remove the distinction between static and dynamic
> shared memory ?

That sounds like a bad idea, dynamic shared memory is more expensive
to maintain than our static shared memory systems, not least
because DSM is not guaranteed to share the same addresses in each
process' address space.

> Though I already heard some complaints at the conference discussions
> that having the dynamic version available has made some developers
> sloppy in using it resulting in wastefulness.

Do you know any examples of this wastefulness?

> > > I'm purposely giving a nonanswer on whether it's a worthwhile goal, but
> > > rather I'd be curious where it could stack up against some other efforts 
> > > to
> > > continue to help PostgreSQL improve performance and handle very large
> > > workloads.
> >
> > There's plenty of things we can do before, but in the end I think tackling 
> > the
> > issues you mention and moving to threads are quite tightly linked.
>
> Still we should be focusing our attention at solving the issues and
> not at "moving to threads" and hoping this will fix the issues by
> itself.

I suspect that it is much easier to solve some of the issues when
working in a shared address space.
E.g. resizing shared_buffers is difficult right now due to the use of
a static allocation of shared memory, but if we had access to a single
shared address space, it'd be easier to do any cleanup necessary for
dynamically increasing/decreasing its size.
Same with parallel workers - if we have a shared address space, the
workers can pass any sized objects around without being required to
move the tuples through DSM and waiting for the leader process to
empty that buffer when it gets full.

Sure, most of that is probably possible with DSM as well, it's just
that I see a lot more issues that you need to take care of when you
don't have a shared address space (such as the pointer translation we
do in dsa_get_address).

Kind regards,

Matthias van de Meent
Neon, Inc.




Re: Support logical replication of DDLs

2023-06-08 Thread shveta malik
On Tue, Jun 6, 2023 at 4:26 PM Amit Kapila  wrote:
>
> On Mon, Jun 5, 2023 at 3:00 PM shveta malik  wrote:
> >
>
> Few assorted comments:

Hi Amit, thanks for the feedback. Addressed these in the recently
posted patch (*2023_06_08.patch).

> ===
> 1. I see the following text in 0005 patch: "It supports most of ALTER
> TABLE command except some commands(DDL related to PARTITIONED TABLE
> ...) that are recently introduced but are not yet supported by the
> current ddl_deparser, we will support that later.". Is this still
> correct?
>

Removed this from the commit message.

> 2. I think the commit message of 0008
> (0008-ObjTree-Removal-for-multiple-commands-2023_05_22) should be
> expanded to explain why ObjTree is not required and or how you have
> accomplished the same with jsonb functions.
>

Done.

> 3. 0005* patch has the following changes in docs:
> +
> +  DDL Replication Support by Command Tag
> +  
> +
> +
> +
> +  
> +   
> +Command Tag
> +For Replication
> +Notes
> +   
> +  
> +  
> +   
> +ALTER AGGREGATE
> +-
> +
> +   
> +   
> +ALTER CAST
> +-
> +
> ...
> ...
>
> If the patch's scope is to only support replication of table DDLs, why
> such other commands are mentioned?
>

Removed the other commands and made some adjustments.

> 4. Can you write some comments about the deparsing format in one of
> the file headers or in README?
>

Added comments atop ddljson.c, as this file takes care of the expansion
based on the fmt string.

> 5.
> +   
> +The table_init_write event occurs just after
> the creation of
> +table while execution of CREATE TABLE AS and
> +SELECT INTO commands.
>
> Patch 0001 has multiple references to table_init_write trigger but it
> is introduced in the 0002 patch, so those changes either belong to
> 0002 or one of the later patches. I think that may reduce most of the
> changes in event-trigger.sgml.
>

Moved it to 0002 patch.

> 6.
> + if (relpersist == RELPERSISTENCE_PERMANENT)
> + {
> + ddl_deparse_context context;
> + char*json_string;
> +
> + context.verbose_mode = false;
> + context.func_volatile = PROVOLATILE_IMMUTABLE;
>
> Can we write some comments as to why PROVOLATILE_IMMUTABLE is chosen here?
>

Added some comments and modified the variable name to make it more
appropriate.

> 7.
> diff --git 
> a/src/test/modules/test_ddl_deparse_regress/t/001_compare_dumped_results.pl.orig
> b/src/test/modules/test_ddl_deparse_regress/t/001_compare_dumped_results.pl.orig
> new file mode 100644
> index 00..58cf7cdf33
> --- /dev/null
> +++ 
> b/src/test/modules/test_ddl_deparse_regress/t/001_compare_dumped_results.pl.orig
>
> Will this file require for the 0008 patch? Or is this just a leftover?
>

Sorry, added by mistake. Removed it now.

thanks
Shveta




Re: Let's make PostgreSQL multi-threaded

2023-06-08 Thread Jose Luis Tallon

On 7/6/23 23:37, Andres Freund wrote:

[snip]
I think we're starting to hit quite a few limits related to the process model,
particularly on bigger machines. The overhead of cross-process context
switches is inherently higher than switching between threads in the same
process - and my suspicion is that that overhead will continue to
increase. Once you have a significant number of connections we end up spending
a *lot* of time in TLB misses, and that's inherent to the process model,
because you can't share the TLB across processes.


IMHO, as a sysadmin who has previously played with Postgres on "quite 
large" machines, I'd propose what most would call a "hybrid model":


* Threads are a very valuable addition for the "frontend" of the server. 
Most would call this a built-in session-aware connection pooler :)


    Heikki's (and others') efforts towards separating connection state 
into discrete structs are clearly a prerequisite for this; 
implementation-wise, just toss the connState into a TLS (thread-local 
storage) variable and many problems just vanish.


    Postgres wouldn't be the first to adopt this approach, either...

* For "heavyweight" queries, the scalability of "almost independent" 
processes w.r.t. NUMA is just _impossible to achieve_ (locality of 
reference!) with a purely threaded system. When CPU- and memory-bound 
(bandwidth-wise), threads add nothing, IMO.


Indeed, a separate postmaster is very much needed in order to control 
the processes / guard overall integrity.



Hence, my humble suggestion is to consider a hybrid architecture which 
benefits from each model's strengths. I am quite convinced that 
transition would be much safer and simpler (I do share most of Tom and 
other's concerns...)


Other projects to draw inspiration from:

 * Postfix -- multi-process, postfix's master guards processes and 
performs privileged operations; unprivileged "subsystems". Interesting 
IPC solutions
 * Apache -- MPMs provide flexibility and support for e.g. non-threaded 
workloads (PHP is the most popular; cf. the "prefork" multi-process MPM)
 * Nginx is actually multi-process (one per CPU) + event-based 
(multiplexing) ...
 * PowerDNS is internally threaded, but has a "guardian" process. Seems 
to be evolving to a more hybrid model.



I would suggest something along the lines of :

* postmaster -- process supervision and (potentially privileged) 
operations; process coordination (i.e., descriptor passing); mostly as-is

* *frontend* -- connection/session handling; possibly even event-driven
* backends -- process heavyweight queries as independently as possible. 
Can span worker threads AND processes when needed
* *dispatcher* -- takes care of cached/lightweight queries (cached 
catalog / full snapshot visibility+processing)
* utility processes can mostly be left as-is, except for making the 
heavily synchronized ones (e.g. vacuum workers, stats workers) multi-threaded


For fixed-size buffers, i.e. pages / chunks, I'd say mmapped (anonymous) 
shared memory isn't that bad... but I haven't read the actual code in years.


For message queues / invalidation messages, I guess that shmem-based 
synchronization is really a nuisance. My understanding is that 
Linux-specific mechanisms (e.g. eventfd) aren't quite considered... or are they?



The amount of duplicated code we have to deal with due to the process model
is quite substantial. We have local memory, statically allocated shared memory
and dynamically allocated shared memory variants for some things. And that's
just going to continue.


Code duplication is indeed a problem... but I wouldn't call "different 
approaches/solutions to very similar problems depending on 
context/requirements" duplication. I might well be wrong / lack detail, 
though... (again: I haven't read PG's code for some years already).



Just my two cents.


Thanks,

    J.L.

--
Parkinson's Law: Work expands to fill the time allotted to it.


  1   2   >