Re: Synchronizing slots from primary to standby

2023-07-26 Thread Amit Kapila
On Wed, Jul 26, 2023 at 10:31 AM shveta malik  wrote:
>
> On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila  wrote:
> >
> > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy
> >  wrote:
> > >
> > > Is having one (or a few more - not
> > > necessarily one for each logical slot) worker for all logical slots
> > > enough?
> > >
> >
> > I guess for a large number of slots there is a possibility of a large
> > gap in syncing the slots, which probably means we need to retain the
> > corresponding WAL for a much longer time on the primary. If we can
> > prove that the gap won't be large enough to matter then this would
> > probably be worth considering; otherwise, I think we should find a way
> > to scale the number of workers to avoid the large gap.
> >
>
> How about this:
>
> 1) On standby, spawn 1 worker per database at the start (as it is
> done currently).
>
> 2) Maintain statistics on activity against each primary's database on
> the standby by some means, e.g. by maintaining 'last_synced_time' and
> 'last_activity_seen_time'.  The 'last_synced_time' is updated every time
> we sync/recheck slots for that particular database. The
> 'last_activity_seen_time' changes only if we get any slot on that
> database where confirmed_flush or restart_lsn has actually changed
> from what was maintained already.
>
> 3) If at any moment we find that 'last_synced_time' -
> 'last_activity_seen_time' goes beyond a threshold, that means that DB is
> not currently active. Add it to the list of inactive DBs.
>

I think we should also increase the next_sync_time if there is no
update in the current sync.

> 4) The launcher, on the other hand, is always checking if it needs to
> spawn any extra worker for any new DB. It will additionally check if the
> number of inactive databases (maintained on standby) has gone above
> some threshold; if so, it brings down the workers for those and
> starts a common worker which takes care of all such inactive databases
> (i.e. merges them all into one), while workers for active databases
> remain as such (i.e. one per db). Each worker maintains the list of DBs
> which it is responsible for.
>
> 5) If in the list of these inactive databases, we again find any
> active database using the above logic, then the launcher will spawn a
> separate worker for that.
>

I wonder if we anyway need some sort of design like this because we
shouldn't allow spawning as many workers as the number of databases.
There has to be some existing or new GUC like max_sync_slot_workers
which decides the number of workers.

Overall, this sounds like a more workload-adaptive approach compared
to the current one.
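
To make the bookkeeping concrete, here is a minimal sketch of the
per-database activity tracking described in steps 2 and 3 above. All
names here (SlotSyncDBState, db_is_inactive, threshold_ms) are invented
for illustration and are not from any posted patch; only
TimestampDifference() is an existing backend utility:

typedef struct SlotSyncDBState
{
	Oid			dboid;			/* database whose slots we sync */
	TimestampTz	last_synced_time;	/* last time its slots were (re)checked */
	TimestampTz	last_activity_seen_time; /* last time any slot LSN advanced */
} SlotSyncDBState;

/* returns true if the DB should be moved to the inactive list */
static bool
db_is_inactive(SlotSyncDBState *state, long threshold_ms)
{
	long		secs;
	int			usecs;

	TimestampDifference(state->last_activity_seen_time,
						state->last_synced_time, &secs, &usecs);
	return (secs * 1000L + usecs / 1000) > threshold_ms;
}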

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-26 Thread Amit Kapila
On Thu, Jul 27, 2023 at 6:46 AM Peter Smith  wrote:
>
> Here are some review comments for v22-0003
>
> ==
>
> 1. ApplicationNameForTablesync
> +/*
> + * Determine the application_name for tablesync workers.
> + *
> + * Previously, the replication slot name was used as application_name. Since
> + * it's possible to reuse tablesync workers now, a tablesync worker can handle
> + * several different replication slots during its lifetime. Therefore, we
> + * cannot use the slot name as application_name anymore. Instead, the slot
> + * number of the tablesync worker is used as a part of the application_name.
> + *
> + * FIXME: if the tablesync worker starts to reuse the replication slot during
> + * synchronization, we should again use the replication slot name as
> + * application_name.
> + */
> +static void
> +ApplicationNameForTablesync(Oid suboid, int worker_slot,
> + char *application_name, Size szapp)
> +{
> + snprintf(application_name, szapp, "pg_%u_sync_%i_" UINT64_FORMAT, suboid,
> + worker_slot, GetSystemIdentifier());
> +}
>
> 1a.
> The intent of the "FIXME" comment was not clear. Is this some existing
> problem that needs addressing, or is this really more like just an
> "XXX" warning/note for the future, in case the tablesync logic
> changes?
>

This seems to be a Note for the future, so better to use XXX notation here.

> ~
>
> 1b.
> Since this is a new function, should it be named according to the
> convention for static functions?
>
> e.g.
> ApplicationNameForTablesync -> app_name_for_tablesync
>

I think for now let's follow the style for similar functions like
ReplicationOriginNameForLogicalRep() and
ReplicationSlotNameForTablesync().

-- 
With Regards,
Amit Kapila.




Re: PATCH: Add REINDEX tag to event triggers

2023-07-26 Thread Garrett Thornburg
I added a v2 patch for adding REINDEX to event triggers. The following has
changed:

1. I fixed the docs to note that ddl_command_start is supported for
REINDEX. Thanks, Michael!
2. I added Jian He's excellent partition table test and updated the
expectations to include that regression test.

What is still up for debate:

Michael pointed out that REINDEX probably isn't DDL because the docs only
specify commands in the standard like CREATE, ALTER, etc. It might be worth
creating a new event to trigger on (similar to what was done for
table_rewrite) to make a REINDEX trigger fit in a little nicer. Let me
know what you think here. Until then, I left the code as `lev =
LOGSTMT_DDL` for `T_ReindexStmt`.

Thanks,
Garrett

On Thu, Jul 20, 2023 at 10:47 PM Garrett Thornburg  wrote:

> Added my v1 patch to add REINDEX to event triggers.
>
> I originally built this against pg15 but rebased to master for the patch
> to hopefully make it easier for maintainers to merge. The rebase was
> automatic so it should be easy to include into any recent version. I'd love
> for this to land in pg16 if possible.
>
> I added regression tests and they are passing. I also formatted the code
> using the project tools. Docs are included too.
>
> A few notes:
> 1. I tried to not touch the function parameters too much and opted to
> create a convenience function that makes it easy to attach index or table
> OIDs to the current event trigger list.
> 2. I made sure both concurrent and regular reindex of index, table,
> database work as expected.
> 3. The ddl end command will make available all indexes that were modified
> by the reindex command. This is a large list when you run "reindex
> database" but works really well. I debated returning records of the table
> or DB that were reindexed but that doesn't really make sense. Returning
> each index fits my use case of building an audit record around the index
> lifecycle (hence the motivation for the patch).
>
> Thanks for your time,
>
> Garrett Thornburg
>


v2-0001-Add-REINDEX-tag-to-event-triggers.patch
Description: Binary data


Re: WAL Insertion Lock Improvements

2023-07-26 Thread Michael Paquier
On Tue, Jul 25, 2023 at 04:43:16PM +0900, Michael Paquier wrote:
> We really need to do something in terms of documentation with
> something like 0002, so I'll try to look at that next.

I have applied a slightly-tweaked version of 0002 as of 66d86d4 to
improve a bit the documentation of the area, and switched the CF entry
as committed.

(I got interested in what Andres has seen on his latest AIO branch, so
I have a few extra benchmarks running in the background on HEAD, but
nothing able to freeze all the backends yet waiting for a variable
update.  These are still running now.)
--
Michael




Re: PATCH: Add REINDEX tag to event triggers

2023-07-26 Thread Garrett Thornburg
Thank you for this, jian he! I will include it in the next patch version.

P.S. Sorry for the double post. Sent from the wrong email address so I'm
resending so the mailing list gets the email. My apologies!

On Wed, Jul 26, 2023 at 4:30 AM jian he  wrote:

> On Wed, Jul 26, 2023 at 7:51 AM Michael Paquier 
> wrote:
> >
> > On Tue, Jul 25, 2023 at 04:34:47PM +0800, jian he wrote:
> > > so  T_ReindexStmt should only be in  ProcessUtilitySlow, if you want
> > > to create an event trigger on reindex?
> > >
> > > regression tests work fine. I even play with partitions.
> >
> > It would be an idea to have some regression tests for partitions,
> > actually, so as some patterns around ReindexMultipleInternal() are
> > checked.  We could have a REINDEX DATABASE in a TAP test with an event
> > trigger, as well, but I don't feel strongly about the need to do that
> > much extra work in 090_reindexdb.pl or 091_reindexdb_all.pl if
> > partitions cover the multi-table case.
> > --
> > Michael
>
> Quite verbose, copied from partition-info.sql. It meets the expectation:
> a partitioned index will do nothing, a partition index will trigger the
> event trigger.
> 
> DROP EVENT TRIGGER  IF EXISTS  end_reindex_command  CASCADE;
> DROP EVENT TRIGGER  IF EXISTS  start_reindex_command  CASCADE;
>
> BEGIN;
> CREATE OR REPLACE FUNCTION reindex_end_command()
> RETURNS event_trigger AS $$
> DECLARE
>   obj record;
> BEGIN
> raise notice 'begin of reindex_end_command';
>
> FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
> LOOP
>   RAISE NOTICE
> 'obj.command_tag:% obj.object_type:% obj.schema_name:%
> obj.object_identity:%'
> ,obj.command_tag,
> obj.object_type,obj.schema_name,obj.object_identity;
>   RAISE NOTICE 'ddl_end_command -- REINDEX: %',
> pg_get_indexdef(obj.objid);
> END LOOP;
> END;
> $$ LANGUAGE plpgsql;
>
> CREATE OR REPLACE FUNCTION start_reindex_command()
> RETURNS event_trigger AS $$
> DECLARE
> obj record;
> BEGIN
> FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
> LOOP
> RAISE NOTICE
> 'obj.command_tag:% obj.object_type:% obj.schema_name:%
> obj.object_identity:%'
> , obj.command_tag,
> obj.object_type,obj.schema_name,obj.object_identity;
> RAISE NOTICE 'ddl_start_command -- REINDEX: %',
> pg_get_indexdef(obj.objid);
> END LOOP;
> raise notice 'end of start_reindex_command';
> END;
> $$ LANGUAGE plpgsql;
>
> BEGIN;
> CREATE EVENT TRIGGER end_reindex_command ON ddl_command_end
> WHEN TAG IN ('REINDEX') EXECUTE PROCEDURE reindex_end_command();
> CREATE EVENT TRIGGER start_reindex_command ON ddl_command_start
> WHEN TAG IN ('REINDEX') EXECUTE PROCEDURE start_reindex_command();
> COMMIT;
>
> -- test Reindex Event Trigger
> BEGIN;
> drop table if EXISTS ptif_test CASCADE;
> CREATE TABLE ptif_test (a int, b int) PARTITION BY range (a);
> CREATE TABLE ptif_test0 PARTITION OF ptif_test
>   FOR VALUES FROM (minvalue) TO (0) PARTITION BY list (b);
> CREATE TABLE ptif_test01 PARTITION OF ptif_test0 FOR VALUES IN (1);
> CREATE TABLE ptif_test1 PARTITION OF ptif_test
>   FOR VALUES FROM (0) TO (100) PARTITION BY list (b);
> CREATE TABLE ptif_test11 PARTITION OF ptif_test1 FOR VALUES IN (1);
>
> CREATE TABLE ptif_test2 PARTITION OF ptif_test
>   FOR VALUES FROM (100) TO (200);
> -- This partitioned table should remain with no partitions.
> CREATE TABLE ptif_test3 PARTITION OF ptif_test
>   FOR VALUES FROM (200) TO (maxvalue) PARTITION BY list (b);
>
> -- Test index partition tree
> CREATE INDEX ptif_test_index ON ONLY ptif_test (a);
>
> CREATE INDEX ptif_test0_index ON ONLY ptif_test0 (a);
> ALTER INDEX ptif_test_index ATTACH PARTITION ptif_test0_index;
>
> CREATE INDEX ptif_test01_index ON ptif_test01 (a);
> ALTER INDEX ptif_test0_index ATTACH PARTITION ptif_test01_index;
>
> CREATE INDEX ptif_test1_index ON ONLY ptif_test1 (a);
> ALTER INDEX ptif_test_index ATTACH PARTITION ptif_test1_index;
>
> CREATE INDEX ptif_test11_index ON ptif_test11 (a);
> ALTER INDEX ptif_test1_index ATTACH PARTITION ptif_test11_index;
>
> CREATE INDEX ptif_test2_index ON ptif_test2 (a);
> ALTER INDEX ptif_test_index ATTACH PARTITION ptif_test2_index;
>
> CREATE INDEX ptif_test3_index ON ptif_test3 (a);
> ALTER INDEX ptif_test_index ATTACH PARTITION ptif_test3_index;
> COMMIT;
>
> --top level partitioned index. will recurse to each partition index.
> REINDEX INDEX CONCURRENTLY public.ptif_test_index;
>
> --ptif_test0 is a partitioned table; this will reindex its partition
> index ptif_test01_index
> -- the event trigger will log ptif_test01_index
> REINDEX INDEX CONCURRENTLY public.ptif_test0_index;
>
> --ptif_test1_index is a partitioned index; this will reindex its partition
> index ptif_test11_index
> -- the event trigger will fire on the partition index ptif_test11_index
> REINDEX INDEX CONCURRENTLY public.ptif_test1_index;
>
> --ptif_test2 is a partition. event trigger will log ptif_test2_index
> REINDEX INDEX CONCURRENTLY 

Re: PATCH: Add REINDEX tag to event triggers

2023-07-26 Thread Garrett Thornburg
Thanks! I added the patch to the commitfest. TIL.

> Hmm. What are the use cases you have in mind where you need to know the
> list of indexes listed by an event trigger like this?

Direct use case is general index health and rebuild. We had a recent
upgrade go poorly (our fault) that left a bunch of indexes in a corrupted
state. We ran a mass rebuild a few times. Would have been great to be able
to observe changes happening in scripts. That left us wishing there was a
way to monitor all changes around indexes. Figured I'd add support since my
first thread seemed somewhat supportive of the idea.

> We have a table_rewrite which is equivalent to check if a relfilenode has
> changed, so why not, I guess..

Yeah. I was actually debating whether we need to add a new reindex event
instead of building on top of the DDL start/end events. You are right
that the documentation does not include non-SQL commands like REINDEX.
Maybe I should consider going in that direction again?

> Assuming that ddl_command_start is not supported is incorrect.

Also great callout. Using `SELECT * FROM pg_event_trigger_ddl_commands()`
in the event does not yield anything. Bad assumption there. I'll add a test
with my next patch version.

Re: LOGSTMT_ALL

Agree here. I think this is more for reference than anything else so I'm
fine to call it LOGSTMT_ALL, but that goes back to the previous question
around ddl events vs a new event similar to table_rewrite.

P.S. Sorry for the double post. Sent from the wrong email so the mailing
list didn't get the message.

On Tue, Jul 25, 2023 at 12:55 AM Michael Paquier 
wrote:

> On Thu, Jul 20, 2023 at 10:47:00PM -0600, Garrett Thornburg wrote:
> > Added my v1 patch to add REINDEX to event triggers.
> >
> > I originally built this against pg15 but rebased to master for the patch
> > to hopefully make it easier for maintainers to merge. The rebase was
> > automatic so it should be easy to include into any recent version. I'd
> love
> > for this to land in pg16 if possible.
>
> The development of Postgres 16 finished last April and this
> version is in a stabilization period.  We are currently running the
> first commit fest of PostgreSQL 17, though.  I would recommend
> registering your patch here so that it is not lost:
> https://commitfest.postgresql.org/44/
>
> > I added regression tests and they are passing. I also formatted the code
> > using the project tools. Docs are included too.
>
> Hmm.  What are the use cases you have in mind where you need to know
> the list of indexes listed by an event trigger like this?  Just
> monitoring to know which indexes have been rebuilt?  We have a
> table_rewrite which is equivalent to check if a relfilenode has
> changed, so why not, I guess..
>
> Assuming that ddl_command_start is not supported is incorrect.  For
> example, with your patch applied, if I do the following setup:
> create function test_event_trigger() returns event_trigger as $$
>   BEGIN
>   RAISE NOTICE 'test_event_trigger: % %', tg_event, tg_tag;
>   END
>   $$ language plpgsql;
> create event trigger regress_event_trigger on ddl_command_start
>   execute procedure test_event_trigger();
>
> Then a REINDEX gives me that:
> reindex table pg_class;
> NOTICE:  0: test_event_trigger: ddl_command_start REINDEX
>
> The first paragraphs of event-trigger.sgml describe the following:
>   The ddl_command_start event occurs just before the
>   execution of a CREATE, ALTER, DROP,
>   SECURITY LABEL, COMMENT, GRANT or REVOKE
>   [...]
>   The ddl_command_end event occurs just after the execution of
>   this same set of commands.  To obtain more details on the DDL
>
> So it would need a refresh to list REINDEX.
>
> case T_ReindexStmt:
> -   lev = LOGSTMT_ALL;  /* should this be DDL? */
> +   lev = LOGSTMT_DDL;
>
> This, unfortunately, may not be as right as it is simple because
> REINDEX is not a DDL as far as I know (it is not in the SQL
> standard).
> --
> Michael
>


Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-07-26 Thread Peter Geoghegan
On Wed, Jul 26, 2023 at 12:07 PM Matthias van de Meent
 wrote:
> We could cache the last accessed leaf page across amrescan operations
> to reduce the number of index traversals needed when the join key of
> the left side is highly (but not necessarily strictly) correlated.

That sounds like block nested loop join. It's possible that that could
reuse some infrastructure from this patch, but I'm not sure.

In general, SAOP execution/MDAM performs "duplicate elimination before
it reads the data" by sorting and deduplicating the arrays up front.
While my patch sometimes elides a primitive index scan, primitive
index scans are already disjuncts that are combined to create what can
be considered one big index scan (that's how the planner and executor
think of them). The patch takes that one step further by recognizing
that it could quite literally be one big index scan in some cases (or
fewer, larger scans, at least). It's a natural incremental
improvement, as opposed to inventing a new kind of index scan. If
anything the patch makes SAOP execution more similar to traditional
index scans, especially when costing them.
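
As a toy illustration of that up-front step, here is the sort-and-dedup
idea in plain standalone C; this is not the actual nbtree code, whose
routines operate on scan keys with datatype-specific comparators:

#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
	int			va = *(const int *) a;
	int			vb = *(const int *) b;

	return (va > vb) - (va < vb);
}

/*
 * Sort the SAOP array keys and squeeze out duplicates once, before the
 * scan begins, so that no key is ever visited twice.  Returns the new
 * number of elements.
 */
static int
sort_and_dedup_array_keys(int *elems, int nelems)
{
	int			nunique = 0;

	qsort(elems, nelems, sizeof(int), cmp_int);
	for (int i = 0; i < nelems; i++)
	{
		if (i == 0 || elems[i] != elems[i - 1])
			elems[nunique++] = elems[i];
	}
	return nunique;
}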

Like InnoDB style loose index scan (for DISTINCT and GROUP BY
optimization), block nested loop join would require inventing a new
type of index scan. Both of these other two optimizations involve the
use of semantic information that spans multiple levels of abstraction.
Loose scan requires duplicate elimination (that's the whole point),
while IIUC block nested loop join needs to "simulate multiple inner
index scans" by deliberately returning duplicates for each would-be
inner index scan. These are specialized things.

To be clear, I think that all of these ideas are reasonable. I just
find it useful to classify these sorts of techniques according to
whether or not the index AM API would have to change or not, and the
general nature of any required changes. MDAM can do a lot of cool
things without requiring any revisions to the index AM API, which
should allow it to play nice with everything else (index path clause
safety issues notwithstanding).

-- 
Peter Geoghegan




Re: Retiring is_pushed_down

2023-07-26 Thread Richard Guo
On Tue, Jul 25, 2023 at 3:39 PM Richard Guo  wrote:

> * This patch calculates the outer join relids that are being formed
> generally in this way:
>
> bms_difference(joinrelids, bms_union(outerrelids, innerrelids))
>
> Of course this can only be used after the outer join relids have been
> added by add_outer_joins_to_relids().  This calculation is performed
> multiple times during planning.  I'm not sure if this has performance
> issues.  Maybe we can calculate it only once and store the result in
> some place (such as in JoinPath)?
>

In the v2 patch, I added a member in JoinPath to store the relid set of
any outer joins that will be calculated at this join, and this would
avoid repeating this calculation when creating nestloop/merge/hash join
plan nodes.  Also fixed a comment in v2.

Thanks
Richard


v2-0001-Retiring-is_pushed_down.patch
Description: Binary data


Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-26 Thread David Rowley
On Wed, 26 Jul 2023 at 03:50, Andres Freund  wrote:
> On 2023-07-25 23:37:08 +1200, David Rowley wrote:
> > On Tue, 25 Jul 2023 at 17:34, Andres Freund  wrote:
> > I've not really studied the fix_COPY_DEFAULT.patch patch.  Is there a
> > reason to delay committing that?  It would be good to eliminate that
> > as a variable for the current performance regression.
>
> Yea, I don't think there's a reason to hold off on that. Sawada-san, do you
> want to commit it? Or Andrew?

Just to keep this moving and to make it easier for people to test the
pg_strtoint patches, I've pushed the fix_COPY_DEFAULT.patch patch.
The only thing I changed was to move the line that was allocating the
array to a location more aligned with the order that the fields are
defined in the struct.

David




Re: pg_usleep for multisecond delays

2023-07-26 Thread Kyotaro Horiguchi
At Wed, 26 Jul 2023 16:41:25 -0700, Nathan Bossart  
wrote in 
> On Mon, Mar 13, 2023 at 02:16:31PM -0700, Nathan Bossart wrote:
> I started on the former approach (work-in-progress patch attached), but I
> figured I'd ask whether folks are alright with this before I spend more
> time on it.  Many of the sleeps in question are relatively short, are
> intended for debugging, or are meant to prevent errors from repeating as
> fast as possible, so I don't know if we should bother adding interrupt
> handling support.

It is responsive to an immediate shutdown.  I think that's fine, as
things might become overly complex if we aim for it to respond to fast
shutdown as well.

The pg_msleep() implemented in the patch accepts a wait event type,
and some event types other than WAIT_EVENT_PG_SLEEP are passed to it.

WAIT_EVENT_CHECKPOINTER_MAIN:

 - At checkpointer.c:309, it is in a long-jump'ed block, which is
   outside the main loop.

 - At checkpointer.c:1005: RequestCheckpoint() is not executed on the
   checkpointer.

WAIT_EVENT_ARCHIVER_MAIN:
 - At pgarch.c:453,485: They are not idle waits at the main-loop level.

WAIT_EVENT_PRE/POST_AUTH_DELAY:

 - These are really useless since they're not visible anywhere. Client
   backends don't show up in pg_stat_activity until this sleep
   finishes. (We could use the ps display instead, but...)

WAIT_EVENT_VACUUM_DELAY:

 - This behaves the same as it did before the change. Actually, we
   don't want to use WAIT_EVENT_PG_SLEEP for it.

So, we have at least one sensible use case for the parameter, which
seems to be sufficient reason.

I'm not sure if others will agree, but I'm leaning towards providing a
dedicated WaitEventSet for the new sleep function.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] Check more invariants during syscache initialization

2023-07-26 Thread Michael Paquier
On Wed, Jul 26, 2023 at 07:01:11PM +0300, Aleksander Alekseev wrote:
> Hi,
> 
> > Shouldn't these be calling `OidIsValid(…)`, not comparing directly to
> > `InvalidOid`?
> 
> They should, thanks. Here is the updated patch. I made sure there are
> no others != InvalidOid checks in syscache.c and catcache.c which are
> affected by my patch. I didn't change any other files since Zhang
> wants to propose the corresponding patch.

While arguing about OidIsValid() for the relation OID part, I found it a
bit surprising that you did not consider AttributeNumberIsValid() for
the new check on the keys.  I've switched the second check to that,
and applied v3.
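
For reference, both macros are thin wrappers over the direct
comparisons; the Assert lines below only illustrate the style in
question and are not the committed code verbatim:

/* from c.h */
#define OidIsValid(objectId)		((bool) ((objectId) != InvalidOid))

/* from access/attnum.h */
#define AttributeNumberIsValid(attributeNumber) \
	((bool) ((attributeNumber) != InvalidAttrNumber))

/* preferred style for the new sanity checks: */
Assert(OidIsValid(cacheinfo[cacheId].reloid));
Assert(AttributeNumberIsValid(cacheinfo[cacheId].key[i]));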
--
Michael




Re: [PATCH] Check more invariants during syscache initialization

2023-07-26 Thread Zhang Mingli
Hi,

> On Jul 27, 2023, at 09:05, Michael Paquier  wrote:
> 
>  One reason is that this increases the odds of
> conflicts when backpatching on a stable branch.

Agree. We could suggest using OidIsValid() for new patches during review.

Zhang Mingli
https://www.hashdata.xyz



Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-26 Thread Peter Smith
Here are some review comments for v22-0003

==

1. ApplicationNameForTablesync
+/*
+ * Determine the application_name for tablesync workers.
+ *
+ * Previously, the replication slot name was used as application_name. Since
+ * it's possible to reuse tablesync workers now, a tablesync worker can handle
+ * several different replication slots during its lifetime. Therefore, we
+ * cannot use the slot name as application_name anymore. Instead, the slot
+ * number of the tablesync worker is used as a part of the application_name.
+ *
+ * FIXME: if the tablesync worker starts to reuse the replication slot during
+ * synchronization, we should again use the replication slot name as
+ * application_name.
+ */
+static void
+ApplicationNameForTablesync(Oid suboid, int worker_slot,
+ char *application_name, Size szapp)
+{
+ snprintf(application_name, szapp, "pg_%u_sync_%i_" UINT64_FORMAT, suboid,
+ worker_slot, GetSystemIdentifier());
+}

1a.
The intent of the "FIXME" comment was not clear. Is this some existing
problem that needs addressing, or is this really more like just an
"XXX" warning/note for the future, in case the tablesync logic
changes?

~

1b.
Since this is a new function, should it be named according to the
convention for static functions?

e.g.
ApplicationNameForTablesync -> app_name_for_tablesync

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2023-07-26 Thread Peter Smith
Here are some review comments for v22-0002

==
1. General - errmsg

AFAIK, the errmsg part does not need to be enclosed by extra parentheses.

e.g.
BEFORE
ereport(LOG,
        (errmsg("logical replication table synchronization worker for subscription \"%s\" has finished",
                MySubscription->name)));
AFTER
ereport(LOG,
        errmsg("logical replication table synchronization worker for subscription \"%s\" has finished",
               MySubscription->name));

~

The patch has multiple cases similar to that example.

==
src/backend/replication/logical/tablesync.c

2.
+ if (reuse_worker)
+ {
+     ereport(LOG,
+             (errmsg("logical replication table synchronization worker for subscription \"%s\" will be reused to sync table \"%s\" with relid %u.",
+                     MySubscription->name,
+                     get_rel_name(MyLogicalRepWorker->relid),
+                     MyLogicalRepWorker->relid)));
+ }
+ else
+ {
+     ereport(LOG,
+             (errmsg("logical replication table synchronization worker for subscription \"%s\" has finished",
+                     MySubscription->name)));
+ }

These brackets { } are not really necessary.

~~~

3. TablesyncWorkerMain
+ for (;!done;)
+ {
+     List       *rstates;
+     ListCell   *lc;
+
+     run_tablesync_worker();
+
+     if (IsTransactionState())
+         CommitTransactionCommand();
+
+     if (MyLogicalRepWorker->relsync_completed)
+     {
+         /*
+          * This tablesync worker is 'done' unless another table that needs
+          * syncing is found.
+          */
+         done = true;

Those variables 'rstates' and 'lc' do not need to be declared at this
scope -- they can be declared further down, closer to where they are
needed.

=
src/backend/replication/logical/worker.c

4. LogicalRepApplyLoop
+
+ if (am_tablesync_worker())
+     /*
+      * If relsync_completed is true, this means that the tablesync
+      * worker is done with synchronization. Streaming has already been
+      * ended by process_syncing_tables_for_sync. We should move to the
+      * next table if needed, or exit.
+      */
+     if (MyLogicalRepWorker->relsync_completed)
+         endofstream = true;

Here I think it is better to use bracketing { } for the outer "if",
instead of only relying on the indentation for readability. YMMV.
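
For example, the same hunk with explicit braces (illustrative only):

+ if (am_tablesync_worker())
+ {
+     /*
+      * If relsync_completed is true, this means that the tablesync
+      * worker is done with synchronization. Streaming has already been
+      * ended by process_syncing_tables_for_sync. We should move to the
+      * next table if needed, or exit.
+      */
+     if (MyLogicalRepWorker->relsync_completed)
+         endofstream = true;
+ }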

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [PATCH] Check more invariants during syscache initialization

2023-07-26 Thread Michael Paquier
On Wed, Jul 26, 2023 at 03:28:55PM +0100, Dagfinn Ilmari Mannsåker wrote:
> Shouldn't these be calling `OidIsValid(…)`, not comparing directly to
> `InvalidOid`?

I think that they should use OidIsValid() on correctness grounds, and
that's the style I prefer.  That said, I don't think these are worth
changing now except when some of the surrounding code is changed for a
different reason.  One reason is that this increases the odds of
conflicts when backpatching on a stable branch.
--
Michael




Re: Improve pg_stat_statements by making jumble handle savepoint names better

2023-07-26 Thread Michael Paquier
On Wed, Jul 26, 2023 at 07:53:02AM +0900, Michael Paquier wrote:
> I think that I'm OK with your proposal as savepoint names are in
> defined places in these queries (contrary to some of the craziness
> with BEGIN and the node structure of TransactionStmt, for example).
> 
> Has somebody an opinion or a comment to share?

Well, on second look this is really simple, so I've applied that after
adding back some comments that ought to be there and simplifying the
tests a bit (fewer queries, same coverage).
--
Michael




Re: Performance degradation on concurrent COPY into a single relation in PG16.

2023-07-26 Thread David Rowley
> On 2023-07-25 23:37:08 +1200, David Rowley wrote:
> > On Tue, 25 Jul 2023 at 17:34, Andres Freund  wrote:
> > > HEAD:   812.690
> > >
> > > your patch: 821.354
> > >
> > > strtoint from 8692f6644e7:  824.543
> > >
> > > strtoint from 6b423ec677d^: 806.678
> >
> > I'm surprised to see the imul version is faster. It's certainly not
> > what we found when working on 6b423ec67.
>
> What CPUs did you test it on? I'd not be surprised if this were heavily
> dependent on the microarchitecture.

This was on AMD 3990x.

> One idea I had was to add a fastpath that won't parse all strings, but will
> parse the strings that we would generate, and fall back to the more general
> variant if it fails. See the attached, rough, prototype:

There were a couple of problems with fastpath.patch.  You need to
reset the position of ptr at the start of the slow path, and you
were also using tmp instead of tmp_s in the if (neg) part of the fast
path section.

I fixed that up and made two versions of the patch: one using the
overflow functions (pg_strtoint_fastpath1.patch) and one testing whether
the number is going to overflow, as current master does
(pg_strtoint_fastpath2.patch).

AMD 3990x:

master + fix_COPY_DEFAULT.patch:
latency average = 525.226 ms

master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath1.patch:
latency average = 488.171 ms

master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath2.patch:
latency average = 481.827 ms


Apple M2 Pro:

master + fix_COPY_DEFAULT.patch:
latency average = 348.433 ms

master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath1.patch:
latency average = 336.778 ms

master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath2.patch:
latency average = 335.992 ms

Zen 4 7945HX CPU:

master + fix_COPY_DEFAULT.patch:
latency average = 296.881 ms

master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath1.patch:
latency average = 287.052 ms

master + fix_COPY_DEFAULT.patch + pg_strtoint_fastpath2.patch:
latency average = 280.742 ms

The M2 chip does not seem to be clearly faster with the fastpath2
method of overflow checking, but both AMD CPUs seem pretty set on
fastpath2 being faster.
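
For readers comparing the two approaches, here is the difference
reduced to a toy accumulator in standalone C. This is not the patch
code: fastpath1 actually uses pg_mul_s32_overflow()/
pg_sub_s32_overflow(), and fastpath2's exact range test may differ.
Both accumulate the value as a negative number, as pg_strtoint32_safe()
does, so that INT_MIN parses without a special case:

#include <limits.h>
#include <stdbool.h>

/* fastpath1 style: checked arithmetic reports overflow after the fact */
static bool
accum_checked(int *acc, int digit)
{
	return __builtin_mul_overflow(*acc, 10, acc) ||
		__builtin_sub_overflow(*acc, digit, acc);
}

/* fastpath2 style: test the range before multiplying */
static bool
accum_tested(int *acc, int digit)
{
	if (*acc < INT_MIN / 10 ||
		(*acc == INT_MIN / 10 && digit > -(INT_MIN % 10)))
		return true;			/* would overflow */
	*acc = *acc * 10 - digit;
	return false;
}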

It would be really good if someone with a newish Intel CPU
could test this too.

David
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..a50c5ccd4a 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -298,9 +298,70 @@ pg_strtoint32_safe(const char *s, Node *escontext)
 {
const char *ptr = s;
const char *firstdigit;
-   uint32  tmp = 0;
+   uint32  tmp;
+   int32   tmp_s = 0;
boolneg = false;
 
+	/*
+	 * The majority of cases are likely to be base-10 digits without any
+	 * underscore separator characters.  We'll first try to parse the string
+	 * with the assumption that's the case and only fallback on a slower
+	 * implementation which handles hex, octal and binary strings and
+	 * underscores.
+	 */
+
+   /* leave it up to the slow path to look for leading spaces */
+
+   if (*ptr == '-')
+   {
+   ptr++;
+   neg = true;
+   }
+
+   /* a leading '+' is uncommon so leave that for the slow path */
+
+   /* process digits */
+   for (;;)
+   {
+   unsigned char   digit = (*ptr - '0');
+
+		/*
+		 * Exploit unsigned arithmetic to save having to check both the upper
+		 * and lower bounds of the digit
+		 */
+		if (digit >= 10)
+			break;
+
+		ptr++;
+
+		if (unlikely(pg_mul_s32_overflow(tmp_s, 10, &tmp_s)) ||
+			unlikely(pg_sub_s32_overflow(tmp_s, digit, &tmp_s)))
+			goto out_of_range;
+   }
+
+   /* we need at least one digit */
+   if (unlikely(ptr == s))
+   goto slow;
+
+	/* when the string does not end in a digit, let the slow path handle it */
+	if (unlikely(*ptr != '\0'))
+   if (unlikely(*ptr != '\0'))
+   goto slow;
+
+   if (!neg)
+   {
+   /* could fail if input is most negative number */
+   if (unlikely(tmp_s == PG_INT32_MIN))
+   goto out_of_range;
+   tmp_s = -tmp_s;
+   }
+
+   return tmp_s;
+
+slow:
+   tmp = 0;
+   ptr = s;
+   /* no need to reset neg */
+
/* skip leading spaces */
while (likely(*ptr) && isspace((unsigned char) *ptr))
ptr++;
diff --git a/src/backend/utils/adt/numutils.c b/src/backend/utils/adt/numutils.c
index 471fbb7ee6..bb48e5157e 100644
--- a/src/backend/utils/adt/numutils.c
+++ b/src/backend/utils/adt/numutils.c
@@ -301,6 +301,70 @@ pg_strtoint32_safe(const char *s, Node *escontext)
uint32  tmp = 0;
boolneg = false;
 
+   /*
+* The 

Re: psql: Could we get "-- " prefixing on the **** QUERY **** outputs? (ECHO_HIDDEN)

2023-07-26 Thread Nathan Bossart
On Wed, Jul 26, 2023 at 02:39:25PM -0700, Nathan Bossart wrote:
> Great.  I spent some time on the commit message in v4.  I plan to commit
> this shortly.

Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: pg_usleep for multisecond delays

2023-07-26 Thread Nathan Bossart
On Mon, Mar 13, 2023 at 02:16:31PM -0700, Nathan Bossart wrote:
> At the moment, I'm thinking about either removing the check_interrupts
> function pointer argument altogether or making it optional for code paths
> where we might not want any interrupt handling to run.  In the former
> approach, we could simply call WaitLatch() without a latch.  While this
> wouldn't do any interrupt handling, we'd still get custom wait event
> support and better responsiveness when the postmaster dies, which is
> strictly better than what's there today.  And many of these sleeps are in
> uncommon or debug paths, so delaying interrupt handling might be okay.  In
> the latter approach, we would have interrupt handling, but I'm worried that
> would be easy to get wrong.  I think it would be nice to have interrupt
> handling if possible, so I'm still (over)thinking about this.

I started on the former approach (work-in-progress patch attached), but I
figured I'd ask whether folks are alright with this before I spend more
time on it.  Many of the sleeps in question are relatively short, are
intended for debugging, or are meant to prevent errors from repeating as
fast as possible, so I don't know if we should bother adding interrupt
handling support.
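
For anyone skimming, the former approach boils down to something like
this (a rough sketch of the shape, not the attached patch verbatim):

/*
 * Sleep for the given number of milliseconds, reporting the specified
 * wait event.  Passing a NULL latch to WaitLatch() skips latch and
 * interrupt handling but still exits promptly on postmaster death.
 */
void
pg_msleep(long milliseconds, uint32 wait_event_info)
{
	(void) WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT,
					 milliseconds, wait_event_info);
}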

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 7ff61d6d1a4829e861a57bae621fb939f0133bc7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 26 Jul 2023 16:24:01 -0700
Subject: [PATCH v3 1/1] WORK IN PROGRESS: pg_msleep

---
 src/backend/access/nbtree/nbtpage.c |  2 +-
 src/backend/access/transam/multixact.c  |  3 ++-
 src/backend/access/transam/xlog.c   |  6 +++---
 src/backend/access/transam/xlogutils.c  |  3 ++-
 src/backend/commands/vacuum.c   |  4 +---
 src/backend/libpq/pqcomm.c  |  3 ++-
 src/backend/port/win32/socket.c |  2 +-
 src/backend/postmaster/autovacuum.c |  8 
 src/backend/postmaster/bgworker.c   |  2 +-
 src/backend/postmaster/bgwriter.c   |  2 +-
 src/backend/postmaster/checkpointer.c   |  4 ++--
 src/backend/postmaster/pgarch.c |  6 --
 src/backend/postmaster/postmaster.c |  2 +-
 src/backend/postmaster/walwriter.c  |  2 +-
 src/backend/replication/walsender.c |  2 +-
 src/backend/storage/file/fd.c   |  4 ++--
 src/backend/storage/ipc/latch.c | 14 ++
 src/backend/storage/ipc/procarray.c |  2 +-
 src/backend/storage/ipc/standby.c   |  4 ++--
 src/backend/storage/lmgr/lmgr.c |  4 ++--
 src/backend/utils/activity/wait_event_names.txt |  2 ++
 src/backend/utils/init/postinit.c   |  4 ++--
 src/include/storage/latch.h |  1 +
 23 files changed, 53 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index d78971bfe8..ab31da93a4 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -3013,7 +3013,7 @@ _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate)
 	 * never be effective without some other backend concurrently consuming an
 	 * XID.
 	 */
-	pg_usleep(5000000L);
+	pg_msleep(5000, WAIT_EVENT_PG_SLEEP);
 #endif
 
 	/*
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index abb022e067..18c4041b19 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -90,6 +90,7 @@
 #include "utils/builtins.h"
 #include "utils/memutils.h"
 #include "utils/snapmgr.h"
+#include "utils/wait_event.h"
 
 
 /*
@@ -1390,7 +1391,7 @@ retry:
 			/* Corner case 2: next multixact is still being filled in */
 			LWLockRelease(MultiXactOffsetSLRULock);
 			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
+			pg_msleep(1, WAIT_EVENT_PG_SLEEP);
 			goto retry;
 		}
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3a..0dce93637b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5130,7 +5130,7 @@ StartupXLOG(void)
 	/* This is just to allow attaching to startup process with a debugger */
 #ifdef XLOG_REPLAY_DELAY
 	if (ControlFile->state != DB_SHUTDOWNED)
-		pg_usleep(60000000L);
+		pg_msleep(60000, WAIT_EVENT_PG_SLEEP);
 #endif
 
 	/*
@@ -6710,7 +6710,7 @@ CreateCheckPoint(int flags)
 	{
 		do
 		{
-			pg_usleep(10000L);	/* wait for 10 msec */
+			pg_msleep(10, WAIT_EVENT_PG_SLEEP);
 		} while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
 			  DELAY_CHKPT_START));
 	}
@@ -6723,7 +6723,7 @@ CreateCheckPoint(int flags)
 	{
 		do
 		{
-			pg_usleep(10000L);	/* wait for 10 msec */
+			pg_msleep(10, WAIT_EVENT_PG_SLEEP);
 		} while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
 			  DELAY_CHKPT_COMPLETE));
 	}
diff --git a/src/backend/access/transam/xlogutils.c 

Re: There should be a way to use the force flag when restoring databases

2023-07-26 Thread Gurjeet Singh
On Sun, Jul 23, 2023 at 6:09 AM Ahmed Ibrahim
 wrote:
>
> Hi everyone,
>
> I have been working on this. This is a proposed patch for it so we have a 
> force option for DROPping the database.
>
> I'd appreciate it if anyone can review.

Hi Ahmed,

Thanks for working on this patch!

+
+int force;

That extra blank line is unnecessary.

Using the bool data type, instead of int, for this option would've
been more natural.

+if (ropt->force){

Postgres coding style is to place the curly braces on a new line,
by themselves.

+char   *dropStmt = pg_strdup(te->dropStmt);

See if you can use pnstrdup(). Using that may obviate the need for
doing the null-placement acrobatics below.

+PQExpBuffer ftStmt = createPQExpBuffer();

What does the 'ft' stand for in this variable's name?

+dropStmt[strlen(dropStmt) - 2] = ' ';
+dropStmt[strlen(dropStmt) - 1] = '\0';

Try to evaluate the strlen() once and reuse it.

+appendPQExpBufferStr(ftStmt, dropStmt);
+appendPQExpBufferStr(ftStmt, "WITH(FORCE);");
+te->dropStmt = ftStmt->data;
+}
+

Remove the extra trailing whitespace on that last blank line.

I think this whole code block needs to be protected by an 'if
(ropt->createDB)' check, like it's done about 20 lines above this
hunk. Otherwise, we may be appending 'WITH (FORCE)' for the DROP
command of a different (not a database) object type.

Also, you may want to check that the target database version is
one that supports the WITH (FORCE) option. This command will fail for
anything before v13.
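
Putting those pieces together, the guarded version could look roughly
like this. This is only a sketch: the buffer name is made up, and it
assumes te->dropStmt ends with ";\n" (two characters to trim), which
the patch should verify rather than hard-code:

if (ropt->createDB && ropt->force)
{
	size_t		len = strlen(te->dropStmt);
	PQExpBuffer buf = createPQExpBuffer();

	/* copy the DROP DATABASE statement minus its trailing ";\n" */
	appendBinaryPQExpBuffer(buf, te->dropStmt, len - 2);
	/* ... then append the FORCE clause (server version >= 13 only) */
	appendPQExpBufferStr(buf, " WITH (FORCE);");
	te->dropStmt = buf->data;
}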

The patch needs doc changes (pg_restore.sgml). And it needs to
mention --force option in the help output, as well (usage() function).

Can you please see if you can add an appropriate test case for this?
The committers may insist on it when reviewing.

Here are a couple of helpful links on how to prepare and submit
patches to the community. You may not need to strictly adhere to
these, but try to pick up a few recommendations that would make the
reviewer's job a bit easier.

[1]: https://wiki.postgresql.org/wiki/Creating_Clean_Patches
[2]: https://wiki.postgresql.org/wiki/Submitting_a_Patch

Best regards,
Gurjeet
http://Gurje.et




Re: POC, WIP: OR-clause support for indexes

2023-07-26 Thread Alena Rybakina

Hi!

On 26.07.2023 02:47, Peter Geoghegan wrote:

On Thu, Jun 29, 2023 at 2:32 AM Alena Rybakina  wrote:

Hi! I'm sorry I didn't answer you right away, I was too busy with work.

Same for me, this time. I was busy working on my patch, which I
finally posted yesterday.
I'm glad to hear it, I've seen your thread ("Optimizing nbtree 
ScalarArrayOp execution, allowing multi-column ordered scans, skip 
scan"), but, unfortunately, I didn't have enough time to read it. I'll 
review it soon!

To be honest, I hadn't thought before about the fact that my optimization
can help eliminate duplicates before reading the data.

I'm not surprised that you didn't specifically think of that, because
it's very subtle.


I am still only in the process of familiarizing myself with the thread [1] (reference 
from your letter), but I have already seen that there are problems related, for example, 
to when "or" expressions refer to the parent element.

I didn't intend to imply that you might have the same problem here. I
just meant that OR optimizations can have problems with duplicate
elimination, in general. I suspect that your patch doesn't have that
problem, because you are performing a transformation of one kind of OR
into another kind of OR.


Yes, you are right, but I studied this topic and two other sources to 
accumulate my knowledge. It was an exciting experience for me)


I was especially inspired after studying the interview with Goetz Graefe
[2]; his life experience is most inspiring, and from this article I
was able to get a broad understanding of the field of databases:
current problems, future development, how it works... Thank you for the
recommendation.


I discovered for myself that the idea described in the article [1] is 
similar to the idea of representing grouped data in OLAP cubes, and 
also, if I saw correctly, an algorithm like depth-first search is used 
there, but for indexes.


I think it really helps to speed up the search with similarly deep
filtering compared to clustered indexes, but do we have cases where we
don't use this algorithm because it takes longer than a usual index?
I thought about the situation of wide indexes (with a lot of
columns) having a lot of filtering predicates on them.
But I'm not sure about this, so it seems to me that this is rather a
problem of improper use of indexes.

I think I would face similar problems if I complicated the current code,
for example, so that not only OR expressions standing at the same level
are rewritten into ANY, but also ones at different levels, without
violating the logic of the priority of executing operators.

I can't say that I am particularly experienced in this general area --
I have never tried to formally reason about how two different
statements are equivalent. It just hasn't been something that I've
needed to have a totally rigorous understanding of up until now. But
my recent patch changes that. Now I need to study this area to make
sure that I have a truly rigorous understanding.

Jeff Davis suggested that I read "Logic and Databases", by C.J. Date.
So now I have some homework to do.
I'll read this book too. Maybe I can finish work with the knowledge I 
got from there. Thank you for sharing!

Unfortunately, when I tried to make the transformation at the stage of index
formation, I encountered a quite incorrect assessment of the selectivity of
the relation, which led to incorrect calculation of the cost and cardinality.

It's not surprising that a weird shift in the plan chosen by the
optimizer is seen with some random test case, as a result of this
added transformation. Even changes that are 100% strictly better (e.g.
changes in a selectivity estimation function that is somehow
guaranteed to be more accurate in all cases) might do that. Here is a
recent example of that with another patch, involving a bitmap OR:

https://postgr.es/m/cah2-wzncdk9n2tz6j_-iln563_epuc3nzp6vsvtl6jhzs6n...@mail.gmail.com
At first, this surprised me very much. It took time to find a suitable 
place to implement the transformation.


I have looked through this thread many times, and I will study it in more
detail.

This example was *not* a regression, if you go by conventional
measures. It was merely a less robust plan than the bitmap OR plan,
because it didn't pass down both columns as index quals.

BTW, there are various restrictions on the sort order of SAOPs that
you might want to try to familiarize yourself with. I describe them
(perhaps not very clearly) here:

https://postgr.es/m/CAH2-Wz=ksvn_sjcnd1+bt-wtifra5ok48adynq3pkkhxgmq...@mail.gmail.com

Thank you! Yes, I'll study it too)

Currently, the optimizer doesn't recognize multi-column indexes with
SAOPs on every column as having a valid sort order, except on the
first column. It seems possible that that has consequences for your
patch. (I'm really only guessing, though; don't trust anything that I
say about the optimizer too much.)

Honestly, I couldn't understand your concerns very 

Re: psql: Could we get "-- " prefixing on the **** QUERY **** outputs? (ECHO_HIDDEN)

2023-07-26 Thread Nathan Bossart
On Wed, Jul 26, 2023 at 08:06:37AM +0200, Pavel Stehule wrote:
> st 26. 7. 2023 v 6:22 odesílatel Nathan Bossart 
> napsal:
>> Barring additional feedback, I think this is ready for commit.
>>
>>
> +1

Great.  I spent some time on the commit message in v4.  I plan to commit
this shortly.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 50dbc196d19f4716bc9cb59bd36661d3e4cd299e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 26 Jul 2023 14:35:07 -0700
Subject: [PATCH v4 1/1] Adjust extra lines generated by psql to be valid SQL
 comments.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

psql's --echo-hidden, --log-file, and --single-step options
generate extra lines to clearly separate queries from other output.
Presently, these extra lines are not valid SQL comments, which
makes them a hazard for anyone trying to copy/paste these decorated
queries into a client.  This commit replaces the starting and
ending asterisks in these extra lines with forward slashes so that
they are valid SQL comments that can be copy/pasted without
incident.

Author: Kirk Wolak
Reviewed-by: Pavel Stehule, Laurenz Albe, Tom Lane, Álvaro Herrera, Andrey Borodin
Discussion: https://postgr.es/m/CACLU5mTFJRJYtbvmZ26txGgmXWQo0hkGhH2o3hEquUPmSbGtBw%40mail.gmail.com
---
 src/bin/psql/command.c |  8 
 src/bin/psql/common.c  | 16 
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 6733f008fd..1300869d79 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -5395,16 +5395,16 @@ echo_hidden_command(const char *query)
 {
 	if (pset.echo_hidden != PSQL_ECHO_HIDDEN_OFF)
 	{
-		printf(_("********* QUERY **********\n"
+		printf(_("/******** QUERY *********/\n"
 				 "%s\n"
-				 "**************************\n\n"), query);
+				 "/************************/\n\n"), query);
 		fflush(stdout);
 		if (pset.logfile)
 		{
 			fprintf(pset.logfile,
-					_("********* QUERY **********\n"
+					_("/******** QUERY *********/\n"
 					  "%s\n"
-					  "**************************\n\n"), query);
+					  "/************************/\n\n"), query);
 			fflush(pset.logfile);
 		}
 
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 5973df2e39..10ad1f2538 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -589,16 +589,16 @@ PSQLexec(const char *query)
 
 	if (pset.echo_hidden != PSQL_ECHO_HIDDEN_OFF)
 	{
-		printf(_("********* QUERY **********\n"
+		printf(_("/******** QUERY *********/\n"
 				 "%s\n"
-				 "**************************\n\n"), query);
+				 "/************************/\n\n"), query);
 		fflush(stdout);
 		if (pset.logfile)
 		{
 			fprintf(pset.logfile,
-					_("********* QUERY **********\n"
+					_("/******** QUERY *********/\n"
 					  "%s\n"
-					  "**************************\n\n"), query);
+					  "/************************/\n\n"), query);
 			fflush(pset.logfile);
 		}
 
@@ -1060,9 +1060,9 @@ SendQuery(const char *query)
 		char		buf[3];
 
 		fflush(stderr);
-		printf(_("***(Single step mode: verify command)*******************************************\n"
+		printf(_("/**(Single step mode: verify command)******************************************/\n"
 				 "%s\n"
-				 "***(press return to proceed or enter x and return to cancel)********************\n"),
+				 "/**(press return to proceed or enter x and return to cancel)*******************/\n"),
 			   query);
 		fflush(stdout);
 		if (fgets(buf, sizeof(buf), stdin) != NULL)
@@ -1080,9 +1080,9 @@ SendQuery(const char *query)
 	if (pset.logfile)
 	{
 		fprintf(pset.logfile,
-				_("********* QUERY **********\n"
+				_("/******** QUERY *********/\n"
 				  "%s\n"
-				  "**************************\n\n"), query);
+				  "/************************/\n\n"), query);
 		fflush(pset.logfile);
 	}
 
-- 
2.25.1



Re: Obsolete reference to pg_relation in comment

2023-07-26 Thread Nathan Bossart
On Wed, Jul 26, 2023 at 05:14:08PM -0400, Tom Lane wrote:
> I think we should reword this to just generically claim that holding
> the Relation reference open for the whole transaction reduces overhead.

WFM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Obsolete reference to pg_relation in comment

2023-07-26 Thread Tom Lane
Nathan Bossart  writes:
> On Wed, Jul 26, 2023 at 06:48:51PM +0100, Dagfinn Ilmari Mannsåker wrote:
>>   * All accesses to pg_largeobject and its index make use of a single 
>> Relation
>> - * reference, so that we only need to open pg_relation once per transaction.
>> + * reference, so that we only need to open pg_class once per transaction.
>>   * To avoid problems when the first such reference occurs inside a
>>   * subtransaction, we execute a slightly klugy maneuver to assign ownership 
>> of
>>   * the Relation reference to TopTransactionResourceOwner.

> Hm.  Are you sure this is actually referring to pg_class?  It seems
> unlikely given pg_relation was renamed 14 years before this comment was
> added, and the code appears to be ensuring that pg_largeobject and its
> index are opened at most once per transaction.

I believe it is just a typo/thinko for pg_class, but there's more not
to like about this comment.  First, once we've made a relcache entry
it would typically stay valid across uses, so it's far from clear that
this coding actually prevents many catalog accesses in typical cases.
Second, when we do have to rebuild the relcache entry, there's a lot
more involved than just a pg_class fetch; we at least need to read
pg_attribute, and I think there may be other catalogs that we'd read
along the way, even for a system catalog that lacks complicated
features.  (pg_index would presumably get looked at, for instance.)

I think we should reword this to just generically claim that holding
the Relation reference open for the whole transaction reduces overhead.

regards, tom lane




Re: incremental-checkopints

2023-07-26 Thread Hannu Krosing
On Wed, Jul 26, 2023 at 9:54 PM Matthias van de Meent
 wrote:
>
> Then you ignore the max_wal_size GUC as PostgreSQL so often already
> does. At least, it doesn't do what I expect it to do at face value -
> limit the size of the WAL directory to the given size.

That would require stopping any new writes once WAL size == max_wal_size
until the checkpoint is completed.
I don't think anybody would want that.

> But more reasonably, you'd keep track of the count of modified pages
> that are yet to be fully WAL-logged, and keep that into account as a
> debt that you have to the current WAL insert pointer when considering
> checkpoint distances and max_wal_size.

I think Peter Geoghegan has worked on a somewhat similar approach to
account for "accumulated work needed until some desired outcome",
though I think it was on the VACUUM side of things.




Re: Obsolete reference to pg_relation in comment

2023-07-26 Thread Nathan Bossart
Okay, now looking at the patch...

On Wed, Jul 26, 2023 at 06:48:51PM +0100, Dagfinn Ilmari Mannsåker wrote:
>   * All accesses to pg_largeobject and its index make use of a single Relation
> - * reference, so that we only need to open pg_relation once per transaction.
> + * reference, so that we only need to open pg_class once per transaction.
>   * To avoid problems when the first such reference occurs inside a
>   * subtransaction, we execute a slightly klugy maneuver to assign ownership 
> of
>   * the Relation reference to TopTransactionResourceOwner.

Hm.  Are you sure this is actually referring to pg_class?  It seems
unlikely given pg_relation was renamed 14 years before this comment was
added, and the code appears to be ensuring that pg_largeobject and its
index are opened at most once per transaction.  I couldn't find the
original thread for this comment, unfortunately, but ISTM we might want to
replace "pg_relation" with "them" instead.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Row pattern recognition

2023-07-26 Thread Tatsuo Ishii
> I am going to add this thread to CommitFest and plan to add both of
> you as reviewers. Thanks in advance.

Done.
https://commitfest.postgresql.org/44/4460/

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: incremental-checkopints

2023-07-26 Thread Matthias van de Meent
On Wed, 26 Jul 2023 at 20:58, Tomas Vondra
 wrote:
>
>
>
> On 7/26/23 15:16, Matthias van de Meent wrote:
> > On Wed, 26 Jul 2023 at 14:41, Alvaro Herrera  
> > wrote:
> >>
> >> Hello
> >>
> >> On 2023-Jul-26, Thomas wen wrote:
> >>
> >>> Hi Hackers:   I found this page :
> >>> https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints,PostgreSQL
> >>> no incremental checkpoints have been implemented so far. When a
> >>> checkpoint is triggered, the performance jitter of PostgreSQL is very
> >>> noticeable. I think incremental checkpoints should be implemented as
> >>> soon as possible
> >>
> >> I think my first question is why do you think that is necessary; there
> >> are probably other tools to achieve better performance.  For example,
> >> you may want to try making checkpoint_completion_target closer to 1, and
> >> the checkpoint interval longer (both checkpoint_timeout and
> >> max_wal_size).  Also, changing shared_buffers may improve things.  You
> >> can try adding more RAM to the machine.
> >
> > Even with all those tuning options, a significant portion of a
> > checkpoint's IO (up to 50%) originates from FPIs in the WAL, which (in
> > general) will most often appear at the start of each checkpoint due to
> > each first update to a page after a checkpoint needing an FPI.
>
> Yeah, FPIs are certainly expensive and can represent a huge part of the
> WAL produced. But how would incremental checkpoints make that step
> unnecessary?
>
> > If instead we WAL-logged only the pages we are about to write to disk
> > (like MySQL's double-write buffer, but in WAL instead of a separate
> > cyclical buffer file), then a checkpoint_completion_target close to 1
> > would probably solve the issue, but with "WAL-logged torn page
> > protection at first update after checkpoint" we'll probably always
> > have higher-than-average FPI load just after a new checkpoint.
> >
>
> So essentially instead of WAL-logging the FPI on the first change, we'd
> only do that later when actually writing-out the page (either during a
> checkpoint or because of memory pressure)? How would you make sure
> there's enough WAL space until the next checkpoint? I mean, FPIs are a
> huge write amplification source ...

You don't make sure that there's enough space for the modifications,
but does it matter from a durability point of view? As long as the
page isn't written to disk before the FPI, we can replay non-FPI (but
fsynced) WAL on top of the old version of the page that you read from
disk, instead of only trusting FPIs from WAL.

> Imagine the system has max_wal_size set to 1GB, and does 1M updates
> before writing 512MB of WAL and thus triggering a checkpoint. Now it
> needs to write FPIs for 1M updates - easily 8GB of WAL, maybe more with
> indexes. What then?

Then you ignore the max_wal_size GUC as PostgreSQL so often already
does. At least, it doesn't do what I expect it to do at face value -
limit the size of the WAL directory to the given size.

But more reasonably, you'd keep track of the count of modified pages
that are yet to be fully WAL-logged, and keep that into account as a
debt that you have to the current WAL insert pointer when considering
checkpoint distances and max_wal_size.

---

The main issue that I see with "WAL-logging the FPI only when you
write the dirty page to disk" is that dirty page flushing also happens
with buffer eviction in ReadBuffer(). This change in behaviour would
add a WAL insertion penalty to this write, and make it a very common
occurrence that we'd have to write WAL + fsync the WAL when we have to
write the dirty page. It would thus add significant latency to the
dirty write mechanism, which is probably an unpopular change.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: incremental-checkopints

2023-07-26 Thread Hannu Krosing
Starting from incremental checkpoints is approaching the problem from
the wrong end.

What you actually want is Atomic Disk Writes, which will allow turning
off full_page_writes.

Without this you really cannot do incremental checkpoints efficiently,
as checkpoints are currently what is used to determine when "the
first write to a page after a checkpoint" happens, and thereby when a
full page write is needed.
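
For example, on storage that does guarantee atomic page-sized writes,
torn-page protection can already be switched off today (only do this if
you are certain about the atomicity guarantee of your storage stack):

ALTER SYSTEM SET full_page_writes = off;
SELECT pg_reload_conf();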








Re: Obsolete reference to pg_relation in comment

2023-07-26 Thread Nathan Bossart
On Wed, Jul 26, 2023 at 11:53:06AM -0700, Nathan Bossart wrote:
> On Wed, Jul 26, 2023 at 06:48:51PM +0100, Dagfinn Ilmari Mannsåker wrote:
>> Triggered by a discussion on IRC, I noticed that there's a stray
>> reference to pg_relation in a comment that was added long after it was
>> renamed to pg_class.  Here's a patch to bring that up to speed.
> 
>> pg_relation was renamed to pg_class in 1991, but this comment (added
>> in 2004) missed the memo
> 
> Huh, interesting!  I dug around the Berkeley archives [0] and found
> comments indicating that pg_relation was renamed to pg_class in February
> 1990.  However, it looks like the file was named pg_relation.h until
> Postgres95 v0.01, which has the following comment in pg_class.h:
> 
>*``pg_relation'' is being replaced by ``pg_class''.  currently
>*we are only changing the name in the catalogs but someday the
>*code will be changed too. -cim 2/26/90
>*[it finally happens.  -ay 11/5/94]

This comment actually lived in Postgres until 9cf80f2 (June 2000), too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: incremental-checkopints

2023-07-26 Thread Tomas Vondra



On 7/26/23 15:16, Matthias van de Meent wrote:
> On Wed, 26 Jul 2023 at 14:41, Alvaro Herrera  wrote:
>>
>> Hello
>>
>> On 2023-Jul-26, Thomas wen wrote:
>>
>>> Hi Hackes:   I found this page :
>>> https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints,PostgreSQL
>>> no incremental checkpoints have been implemented so far. When a
>>> checkpoint is triggered, the performance jitter of PostgreSQL is very
>>> noticeable. I think incremental checkpoints should be implemented as
>>> soon as possible
>>
>> I think my first question is why do you think that is necessary; there
>> are probably other tools to achieve better performance.  For example,
>> you may want to try making checkpoint_completion_target closer to 1, and
>> the checkpoint interval longer (both checkpoint_timeout and
>> max_wal_size).  Also, changing shared_buffers may improve things.  You
>> can try adding more RAM to the machine.
> 
> Even with all those tuning options, a significant portion of a
> checkpoint's IO (up to 50%) originates from FPIs in the WAL, which (in
> general) will most often appear at the start of each checkpoint due to
> each first update to a page after a checkpoint needing an FPI.

Yeah, FPIs are certainly expensive and can represent huge part of the
WAL produced. But how would incremental checkpoints make that step
unnecessary?

> If instead we WAL-logged only the pages we are about to write to disk
> (like MySQL's double-write buffer, but in WAL instead of a separate
> cyclical buffer file), then a checkpoint_completion_target close to 1
> would probably solve the issue, but with "WAL-logged torn page
> protection at first update after checkpoint" we'll probably always
> have higher-than-average FPI load just after a new checkpoint.
> 

So essentially instead of WAL-logging the FPI on the first change, we'd
only do that later when actually writing-out the page (either during a
checkpoint or because of memory pressure)? How would you make sure
there's enough WAL space until the next checkpoint? I mean, FPIs are a
huge write amplification source ...

Imagine the system has max_wal_size set to 1GB, and does 1M updates
before writing 512MB of WAL and thus triggering a checkpoint. Now it
needs to write FPIs for 1M updates - easily 8GB of WAL, maybe more with
indexes. What then?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Obsolete reference to pg_relation in comment

2023-07-26 Thread Nathan Bossart
On Wed, Jul 26, 2023 at 06:48:51PM +0100, Dagfinn Ilmari Mannsåker wrote:
> Triggered by a discussion on IRC, I noticed that there's a stray
> reference to pg_relation in a comment that was added long after it was
> renamed to pg_class.  Here's a patch to bring that up to speed.

> pg_relation was renamed to pg_class in 1991, but this comment (added
> in 2004) missed the memo

Huh, interesting!  I dug around the Berkeley archives [0] and found
comments indicating that pg_relation was renamed to pg_class in February
1990.  However, it looks like the file was named pg_relation.h until
Postgres95 v0.01, which has the following comment in pg_class.h:

 *``pg_relation'' is being replaced by ``pg_class''.  currently
 *we are only changing the name in the catalogs but someday the
 *code will be changed too. -cim 2/26/90
 *[it finally happens.  -ay 11/5/94]

[0] https://dsf.berkeley.edu/postgres.html

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Obsolete reference to pg_relation in comment

2023-07-26 Thread Dagfinn Ilmari Mannsåker
Hi hackers,

Triggered by a discussion on IRC, I noticed that there's a stray
reference to pg_relation in a comment that was added long after it was
renamed to pg_class.  Here's a patch to bring that up to speed.

- ilmari

>From e395f8cb293f674f45eb3847534de07c7124e738 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= 
Date: Wed, 26 Jul 2023 18:31:51 +0100
Subject: [PATCH] Fix obsolete reference to pg_relation in comment

pg_relation was renamed to pg_class in 1991, but this comment (added
in 2004) missed the memo
---
 src/backend/storage/large_object/inv_api.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/large_object/inv_api.c b/src/backend/storage/large_object/inv_api.c
index 84e543e731..a56912700b 100644
--- a/src/backend/storage/large_object/inv_api.c
+++ b/src/backend/storage/large_object/inv_api.c
@@ -59,7 +59,7 @@ bool		lo_compat_privileges;
 
 /*
  * All accesses to pg_largeobject and its index make use of a single Relation
- * reference, so that we only need to open pg_relation once per transaction.
+ * reference, so that we only need to open pg_class once per transaction.
  * To avoid problems when the first such reference occurs inside a
  * subtransaction, we execute a slightly klugy maneuver to assign ownership of
  * the Relation reference to TopTransactionResourceOwner.
-- 
2.39.2



Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-07-26 Thread Jacob Champion
On Tue, Jul 18, 2023 at 4:04 PM Thomas Munro  wrote:
> On Tue, Jul 18, 2023 at 11:55 AM Jacob Champion  
> wrote:
> +1 for EV_RECEIPT ("just tell me about errors, don't drain any
> events").

Sounds good.
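
For reference, a minimal sketch of the EV_RECEIPT semantics being
relied on (kq and sock are assumed to be an existing kqueue descriptor
and socket; requires <sys/event.h>, error handling trimmed):

struct kevent chg;
struct kevent res;

EV_SET(&chg, sock, EVFILT_READ, EV_ADD | EV_RECEIPT, 0, 0, NULL);

/*
 * With EV_RECEIPT, kevent() drains no pending events; it only reports
 * the status of each change.  Every result has EV_ERROR set, and its
 * data field is 0 on success or an errno value on failure.
 */
if (kevent(kq, &chg, 1, &res, 1, NULL) == 1 &&
    (res.flags & EV_ERROR) && res.data != 0)
    fprintf(stderr, "kevent change failed: %s\n", strerror((int) res.data));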

> While comparing the cousin OSs' man pages just now, I noticed that
> it's not only macOS that lacks NOTE_MSECONDS, it's also OpenBSD and
> NetBSD < 10.  Maybe just delete that cruft ^^^ and use literal 0 in
> fflags directly.

So I don't lose track of it, here's a v10 with those two changes.

Thanks!
--Jacob
1:  9c6a340119 = 1:  0278c7ba90 common/jsonapi: support FRONTEND clients
2:  8072d0416e ! 2:  bb3ce4b6a9 libpq: add OAUTHBEARER SASL mechanism
@@ src/interfaces/libpq/fe-auth-oauth-curl.c (new)
 +#include "libpq-int.h"
 +#include "mb/pg_wchar.h"
 +
-+#ifdef HAVE_SYS_EVENT_H
-+/* macOS doesn't define the time unit macros, but uses milliseconds by default. */
-+#ifndef NOTE_MSECONDS
-+#define NOTE_MSECONDS 0
-+#endif
-+#endif
-+
 +/*
 + * Parsed JSON Representations
 + *
@@ src/interfaces/libpq/fe-auth-oauth-curl.c (new)
 +  switch (what)
 +  {
 +  case CURL_POLL_IN:
-+  EV_SET([nev], socket, EVFILT_READ, EV_ADD, 0, 0, 0);
++  EV_SET([nev], socket, EVFILT_READ, EV_ADD | EV_RECEIPT, 0, 0, 0);
 +  nev++;
 +  break;
 +
 +  case CURL_POLL_OUT:
-+  EV_SET([nev], socket, EVFILT_WRITE, EV_ADD, 0, 0, 0);
++  EV_SET([nev], socket, EVFILT_WRITE, EV_ADD | EV_RECEIPT, 0, 0, 0);
 +  nev++;
 +  break;
 +
 +  case CURL_POLL_INOUT:
-+  EV_SET([nev], socket, EVFILT_READ, EV_ADD, 0, 0, 0);
++  EV_SET([nev], socket, EVFILT_READ, EV_ADD | EV_RECEIPT, 0, 0, 0);
 +  nev++;
-+  EV_SET([nev], socket, EVFILT_WRITE, EV_ADD, 0, 0, 0);
++  EV_SET([nev], socket, EVFILT_WRITE, EV_ADD | EV_RECEIPT, 0, 0, 0);
 +  nev++;
 +  break;
 +
@@ src/interfaces/libpq/fe-auth-oauth-curl.c (new)
 +   * both, so we try to remove both.  This means we need to tolerate
 +   * ENOENT below.
 +   */
-+  EV_SET([nev], socket, EVFILT_READ, EV_DELETE, 0, 0, 0);
++  EV_SET([nev], socket, EVFILT_READ, EV_DELETE | EV_RECEIPT, 0, 0, 0);
 +  nev++;
-+  EV_SET([nev], socket, EVFILT_WRITE, EV_DELETE, 0, 0, 0);
++  EV_SET([nev], socket, EVFILT_WRITE, EV_DELETE | EV_RECEIPT, 0, 0, 0);
 +  nev++;
 +  break;
 +
@@ src/interfaces/libpq/fe-auth-oauth-curl.c (new)
 +   */
 +  for (int i = 0; i < res; ++i)
 +  {
-+  if (ev_out[i].flags & EV_ERROR && ev_out[i].data != ENOENT)
++  /*
++   * EV_RECEIPT should guarantee one EV_ERROR result for every change,
++   * whether successful or not. Failed entries contain a non-zero errno in
++   * the `data` field.
++   */
++  Assert(ev_out[i].flags & EV_ERROR);
++
++  errno = ev_out[i].data;
++  if (errno && errno != ENOENT)
 +  {
-+  errno = ev_out[i].data;
 +  switch (what)
 +  {
 +  case CURL_POLL_REMOVE:
@@ src/interfaces/libpq/fe-auth-oauth-curl.c (new)
 +  struct kevent ev;
 +
 +  EV_SET(, 1, EVFILT_TIMER, timeout < 0 ? EV_DELETE : EV_ADD,
-+ NOTE_MSECONDS, timeout, 0);
++ 0, timeout, 0);
 +  if (kevent(actx->mux, , 1, NULL, 0, NULL) < 0 && errno != ENOENT)
 +  {
 +  actx_error(actx, "setting kqueue timer to %ld: %m", timeout);
@@ src/interfaces/libpq/fe-auth-oauth-curl.c (new)
 +#endif
 +#ifdef HAVE_SYS_EVENT_H
 +  // XXX: I guess this wants to be hidden in a routine
-+  EV_SET(, 1, EVFILT_TIMER, EV_ADD, NOTE_MSECONDS,
++  EV_SET(, 1, EVFILT_TIMER, EV_ADD, 0,
 + actx->authz.interval * 1000, 0);
 +  if (kevent(actx->mux, , 1, NULL, 0, NULL) < 0)
 +  {
3:  07be9375aa = 3:  20b758 backend: add OAUTHBEARER SASL mechanism
4:  71cedc6ff5 = 4:  f3cec068f9 Add pytest suite for OAuth
5:  9b02e14829 = 5:  da1933ac1d squash! Add pytest suite for OAuth
6:  7d179f7e53 = 6:  8f36b5c124 XXX work around psycopg2 build failures


v10-0001-common-jsonapi-support-FRONTEND-clients.patch.gz
Description: application/gzip


v10-0003-backend-add-OAUTHBEARER-SASL-mechanism.patch.gz
Description: application/gzip


v10-0002-libpq-add-OAUTHBEARER-SASL-mechanism.patch.gz
Description: 

Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-07-26 Thread Matthias van de Meent
On Wed, 26 Jul 2023 at 15:42, Peter Geoghegan  wrote:
>
> On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent
>  wrote:
> > Considering that it caches/reuses the page across SAOP operations, can
> > (or does) this also improve performance for index scans on the outer
> > side of a join if the order of join columns matches the order of the
> > index?
>
> It doesn't really cache leaf pages at all. What it does is advance the
> array keys locally, while the original buffer lock is still held on
> that same page.

Hmm, then I had a mistaken understanding of what we do in _bt_readpage
with _bt_saveitem.

> > That is, I believe this caches (leaf) pages across scan keys, but can
> > (or does) it also reuse these already-cached leaf pages across
> > restarts of the index scan/across multiple index lookups in the same
> > plan node, so that retrieval of nearby index values does not need to
> > do an index traversal?
>
> I'm not sure what you mean. There is no reason why you need to do more
> than one single descent of an index to scan many leaf pages using many
> distinct sets of array keys. Obviously, this depends on being able to
> observe that we really don't need to redescend the index to advance
> the array keys, again and again. Note in particular that this
> usually works across leaf pages.

In a NestedLoop(inner=seqscan, outer=indexscan), the index gets
repeatedly scanned from the root, right? It seems that right now, we
copy matching index entries into a local cache (that is deleted on
amrescan), then we drop our locks and pins on the buffer, and then
start returning values from our local cache (in _bt_saveitem).
We could cache the last accessed leaf page across amrescan operations
to reduce the number of index traversals needed when the join key of
the left side is highly (but not necessarily strictly) correlated.
The worst case overhead of this would be 2 _bt_compares (to check if
the value is supposed to be fully located on the cached leaf page)
plus one memcpy( , , BLCKSZ) in the previous loop. With some smart
heuristics (e.g. page fill factor, number of distinct values, and
whether we previously hit this same leaf page in the previous scan of
this Node) we can probably also reduce this overhead to a minimum if
the joined keys are not correlated, but accelerate the query
significantly when we find out they are correlated.

Of course, in the cases where we'd expect very few distinct join keys
the planner would likely put a Memoize node above the index scan, but
for mostly unique join keys I think this could save significant
amounts of time, if only on buffer pinning and locking.

I guess I'll try to code something up when I have the time, as it
sounds like it's not quite related to your patch but an interesting
improvement nonetheless.


Kind regards,

Matthias van de Meent




Re: [PATCH] Check more invariants during syscache initialization

2023-07-26 Thread Aleksander Alekseev
Hi,

> Shouldn't these be calling `OidIsValid(…)`, not comparing directly to
> `InvalidOid`?

They should, thanks. Here is the updated patch. I made sure there are
no other != InvalidOid checks in syscache.c and catcache.c which are
affected by my patch. I didn't change any other files since Zhang
wants to propose the corresponding patch.

-- 
Best regards,
Aleksander Alekseev


v3-0001-Check-more-invariants-during-syscache-initializat.patch
Description: Binary data


Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-26 Thread Anthonin Bonnefoy
I initially thought of sending the spans from PostgreSQL since this is
the usual behavior of tracing libraries.

However, this created a lot of potential issues:

- Protocol support and differences between trace collectors. OpenTelemetry
seems to use gRPC, others are using http and those will require additional
libraries (plus gRPC support in C doesn't look good) and any change in
protobuf definition would require updating the extension.

- Do we send the spans within the query hooks? This means that we could
block the process if the trace collector is slow to answer or we can’t
connect. Sending spans from a background process sounded rather complex and
resource heavy.

Moving to a pull model fixed those issues and felt more natural as this is
the way PostgreSQL exposes its metrics.
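
Much like pg_stat_statements, a collector would then periodically pull
and reset the accumulated spans; as a sketch (the view and function
names here are illustrative, not a settled API):

SELECT * FROM pg_tracing_spans;
SELECT pg_tracing_reset();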




Re: [PATCH] Check more invariants during syscache initialization

2023-07-26 Thread Dagfinn Ilmari Mannsåker
Aleksander Alekseev  writes:

> Hi Zhang,
>
>> That reminded me to have a look at other code, and a grep search for `oid != 0` shows
>> there are several files using oid != 0.
>>
>> ```
>> .//src/bin/pg_resetwal/pg_resetwal.c: if (set_oid != 0)
>> .//src/bin/pg_resetwal/pg_resetwal.c: if (set_oid != 0)
>> .//src/bin/pg_dump/pg_backup_tar.c: if (oid != 0)
>> .//src/bin/pg_dump/pg_backup_custom.c: while (oid != 0)
>> .//src/bin/pg_dump/pg_backup_custom.c: while (oid != 0)
>> ```
>> That is another story… I would like to provide a patch if it's worth it.
>
> Good catch. Please do so.

Shouldn't these be calling `OidIsValid(…)`, not comparing directly to
`InvalidOid`?

~/src/postgresql (master $)$ git grep '[Oo]id [!=]= 0' | wc -l
18
~/src/postgresql (master $)$ git grep '[!=]= InvalidOid' | wc -l
296
~/src/postgresql (master $)$ git grep 'OidIsValid' | wc -l
1462
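
For reference, OidIsValid() is itself just a thin wrapper over that
comparison, as defined in src/include/c.h:

#define OidIsValid(objectId)  ((bool) ((objectId) != InvalidOid))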

- ilmari




Re: Question about use_physical_tlist() which is applied on Scan path

2023-07-26 Thread 油屋
I had the same question recently. In addition, I looked at the results of
TPC-H at scale factor 1 run on postgres REL_15_STABLE and observed no
performance improvement from physical tlist. To be specific, I ran two
versions of TPC-H, one with physical tlist enabled and one with physical
tlist disabled. The performance improvement of some queries in the former
was less than 5%, some queries performed worse than the latter, and I think
this is a normal range of performance fluctuation that was not caused by
physical tlist. I have read the relevant commits but, maybe because of my
carelessness, I did not find the test queries corresponding to physical
tlist. Are there any test queries for physical tlist?
Thanks

Alvaro Herrera wrote on Wed, Jul 26, 2023 at 18:16:

> On 2023-Jul-26, Jian Guo wrote:
>
> > It looks the columns besides `ps_supplycost` and `ps_availqty` are not
> > necessary, but fetched from tuples all at once. For the row-based
> > storage such as heap, it looks fine, but for column-based storage, it
> > would result into unnecessary overhead and impact performance. Is
> > there any plan to optimize here?
>
> I suppose that, at some point, it is going to have to be the table AM
> the one that makes the decision.  That is, use_physical_tlist would have
> to involve some new flag in path->parent->amflags to determine whether
> to skip using a physical tlist.  Right now, we don't have any columnar
> stores, so there's no way to verify an implementation.  If you do have a
> columnar store implementation, you're welcome to share it.
>
> --
> Álvaro Herrera PostgreSQL Developer
> "I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
> lack of hesitasion in answering a lost soul's question, I just wished the
> rest
> of the mailing list could be like this."
>  (Fotis)
>(
> http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)
>
>
>


[May be a bug] double free or corruption

2023-07-26 Thread Wen Yi
Hi community,
I use PostgreSQL 17devel on x86_64-pc-linux-gnu, compiled by gcc (GCC)
13.1.1 20230614 (Red Hat 13.1.1-4), 64-bit (Fedora Linux).
And I use gdb to debug postgres, just testing pg_ctl.
As you can see:



---



[Switching to Thread 0x77dce740 (LWP 83554)]
start_postmaster () at pg_ctl.c:455
455  if (pm_pid < 0)
(gdb) ..
462  if (pm_pid > 0)
(gdb) 
476  if (setsid() < 0)
(gdb) ..
489  if (log_file != NULL)
(gdb) 
490   cmd = psprintf("exec \"%s\" %s%s < \"%s\" >> \"%s\" 2>&1",
(gdb) ..
497  (void) execl("/bin/sh", "/bin/sh", "-c", cmd, (char *) NULL);
(gdb) .
process 83554 is executing new program: /usr/bin/bash
Error in re-setting breakpoint 1: No source file named /home/postgres/project/postgres/src/devel/src/bin/pg_ctl/pg_ctl.c.
Error in re-setting breakpoint 2: No source file named /home/postgres/project/postgres/src/devel/src/bin/pg_ctl/pg_ctl.c.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
process 83554 is executing new program: /home/postgres/postgres/bin/bin/postgres
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
BFD: warning: /home/postgres/.cache/debuginfod_client/d25eaf3596d9455fe9725f6e9cd1aa5433f31b92/debuginfo has a section extending past end of file
Error while reading shared library symbols for /lib64/libstdc++.so.6:
`/home/postgres/.cache/debuginfod_client/d25eaf3596d9455fe9725f6e9cd1aa5433f31b92/debuginfo': can't read symbols: file format not recognized.
.[Attaching after Thread 0x77e8d480 (LWP 83554) fork to child process 83559]
 
[New inferior 3 (process 83559)]
[Detaching after fork from parent process 83554]
[Inferior 2 (process 83554) detached]
double free or corruption (out)


Fatal signal: Aborted
- Backtrace -
corrupted double-linked list


Fatal signal: Aborted
- Backtrace -
done
server started
0x5557bf5908b0 ???
0x5557bf6cb4cd ???
0x7f040125fb6f ???
0x7f04012b0844 ???
0x7f040125fabd ???
0x7f040124887e ???
0x7f040124960e ???
0x7f04012ba774 ???
0x7f04012bc93f ???
0x7f04012bf1cd ???
0x5557bf98272a ???
0x5557bf7eb93c ???
0x5557bf643b48 ???
0x5557bf643d11 ???
0x7f04012b3af2 ???
0x5557bfc37d48 ???
0x7f04014e31f2 ???
0x7f04012ae906 ???
0x7f040133486f ???
0x ???
-
A fatal error internal to GDB has been detected, further
debugging is not possible. GDB will now terminate.

This is a bug, please report it. For instructions, see:


Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-26 Thread Aleksander Alekseev
Nikita,

> This patch looks very interesting, I'm working on the same subject too. But 
> I've used
> another approach - I'm using C wrapper for C++ API library from 
> OpenTelemetry, and
> handle span storage and output to this library. There are some nuances 
> though, but it
> is possible. Have you tried to use OpenTelemetry APIs instead of implementing 
> all
> functionality around spans?

I don't think that PostgreSQL accepts such kind of C++ code, not to
mention the fact that the PostgreSQL license is not necessarily
compatible with Apache 2.0 (I'm not a lawyer; this is not a legal
advice). Such a design decision will probably require using separate
compile flags since the user doesn't necessarily have a corresponding
dependency installed. Similarly to how we do with LLVM, OpenSSL, etc.

So -1 to the OpenTelemetry C++ library and +1 to the properly licensed
C implementation without 3rd party dependencies from me. Especially
considering the fact that the implementation seems to be rather
simple.

-- 
Best regards,
Aleksander Alekseev




Re: cataloguing NOT NULL constraints

2023-07-26 Thread Alvaro Herrera
On 2023-Jul-26, Alvaro Herrera wrote:

> On 2023-Jul-26, Dean Rasheed wrote:
> 
> > The new \d+ output certainly makes testing and reviewing easier,
> > though I do understand people's concerns that this may make the output
> > significantly longer in many real-world cases.
> 
> Yeah, at this point I'm inclined to get the \d+ version committed
> immediately after the main patch, and we can tweak the psql UI after the
> fact -- for instance so that they are only shown in \d++, or some other
> idea we may come across.

(For example, maybe we could add \dtc [PATTERN] or some such, that lists
all the constraints of all kinds in tables matching PATTERN.)

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"If you want to have good ideas, you must have many ideas.  Most of them
will be wrong, and what you have to learn is which ones to throw away."
 (Linus Pauling)




Re: [PATCH] Check more invariants during syscache initialization

2023-07-26 Thread Aleksander Alekseev
Hi Zhang,

> That reminded me to have a look at other code, and a grep search for `oid != 0` shows
> there are several files using oid != 0.
>
> ```
> .//src/bin/pg_resetwal/pg_resetwal.c: if (set_oid != 0)
> .//src/bin/pg_resetwal/pg_resetwal.c: if (set_oid != 0)
> .//src/bin/pg_dump/pg_backup_tar.c: if (oid != 0)
> .//src/bin/pg_dump/pg_backup_custom.c: while (oid != 0)
> .//src/bin/pg_dump/pg_backup_custom.c: while (oid != 0)
> ```
> That is another story… I would like to provide a patch if it's worth it.

Good catch. Please do so.

-- 
Best regards,
Aleksander Alekseev




Re: WaitForOlderSnapshots in DETACH PARTITION causes deadlocks

2023-07-26 Thread Alvaro Herrera
On 2023-Jul-25, Michael Paquier wrote:

> On Mon, Jul 24, 2023 at 07:40:04PM +, Imseih (AWS), Sami wrote:
> > WaitForOlderSnapshots is used here to ensure that snapshots older than
> > the start of the ALTER TABLE DETACH CONCURRENTLY are completely removed
> > to guarantee consistency, however it does seem to cause deadlocks for at
> > least RR/SERIALIZABLE transactions.
> > 
> > There are cases [2] in which certain operations are accepted as not being
> > MVCC-safe, and now I am wondering if this is another case? Deadlocks are
> > not a good scenario, and WaitForOlderSnapshot does not appear to do
> > anything for READ COMMITTED transactions.
> > 
> > So do we actually need WaitForOlderSnapshot in the FINALIZE code? and
> > Could be acceptable that this operation is marked as not MVCC-safe like
> > the other aforementioned operations?
> 
> I guess that there is an argument with lifting that a bit.  Based on
> the test case you are providing, a deadlock occurring between the
> FINALIZE and a scan of the top-level partitioned table in a
> transaction that began before the DETACH CONCURRENTLY does not seem
> like the best answer to have from the user perspective.  I got to
> wonder whether there is room to make the wait for older snapshots in
> the finalize phase more robust, though, and actually make it wait
> until the first transaction commits rather than fail because of a
> deadlock like that.

It took a lot of work to get SERIALIZABLE/RR transactions to work with
DETACH CONCURRENTLY, so I'm certainly not in a hurry to give up and
just "document that it doesn't work".  I don't think that would be an
acceptable feature regression.

I spent some time yesterday trying to turn the reproducer into an
isolationtester spec file.  I have not succeeded yet, but I'll continue
with that this afternoon.
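
As a starting point, the skeleton would look something like this (table
and step names invented for illustration):

setup
{
  CREATE TABLE parted (a int) PARTITION BY LIST (a);
  CREATE TABLE part1 PARTITION OF parted FOR VALUES IN (1);
  INSERT INTO parted VALUES (1);
}

teardown
{
  DROP TABLE IF EXISTS parted, part1;
}

session s1
step s1brr    { BEGIN ISOLATION LEVEL REPEATABLE READ; }
step s1scan   { SELECT * FROM parted; }
step s1commit { COMMIT; }

session s2
step s2detach { ALTER TABLE parted DETACH PARTITION part1 CONCURRENTLY; }

permutation s1brr s1scan s2detach s1scan s1commit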

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801




Re: Remove unused fields in ReorderBufferTupleBuf

2023-07-26 Thread Aleksander Alekseev
Hi,

> Here is the corrected patch. I added it to the nearest CF [1].

I played a bit more with the patch. There was an idea to make
ReorderBufferTupleBufData an opaque structure known only within
reorderbuffer.c, but it turned out that replication/logical/decode.c
accesses it directly so I abandoned that idea for now.

> Alternatively we could convert ReorderBufferTupleBufData macro to an
> inlined function. At least it will add some type safety.

Here is v3 that implements it too as a separate patch.

Apologies for the noise.

-- 
Best regards,
Aleksander Alekseev


v3-0001-Remove-unused-fields-in-ReorderBufferTupleBuf.patch
Description: Binary data


v3-0002-Replace-ReorderBufferTupleBufData-macro-with-a-fu.patch
Description: Binary data


Re: cataloguing NOT NULL constraints

2023-07-26 Thread Alvaro Herrera
Thanks for spending so much time with this patch -- really appreciated.

On 2023-Jul-26, Dean Rasheed wrote:

> On Tue, 25 Jul 2023 at 13:36, Alvaro Herrera  wrote:
> >
> > Okay then, I've made these show up in the footer of \d+.  This is in
> > patch 0003 here.  Please let me know what do you think of the regression
> > changes.
> 
> The new \d+ output certainly makes testing and reviewing easier,
> though I do understand people's concerns that this may make the output
> significantly longer in many real-world cases. I don't think it would
> be a good idea to filter the list in any way though, because I think
> that will only lead to confusion. I think it should be all-or-nothing,
> though I'm not necessarily opposed to using something like \d++ to
> enable it, if that turns out to be the least-bad option.

Yeah, at this point I'm inclined to get the \d+ version committed
immediately after the main patch, and we can tweak the psql UI after the
fact -- for instance so that they are only shown in \d++, or some other
idea we may come across.

> Going back to this example:
> 
> drop table if exists p1, p2, foo;
> create table p1(a int not null check (a > 0));
> create table p2(a int not null check (a > 0));
> create table foo () inherits (p1,p2);

> I remain of the opinion that that should create 2 NOT NULL constraints
> on foo, for consistency with CHECK constraints, and the misleading
> name that results if p1_a_not_null is dropped from p1. That way, the
> names of inherited NOT NULL constraints could be kept in sync, as they
> are for other constraint types, making it easier to keep track of
> where they come from, and it wouldn't be necessary to treat them
> differently (e.g., matching by column number, when dropping NOT NULL
> constraints).

I think having two constraints is more problematic, UI-wise.  Previous
versions of this patchset did it that way, and it's not great: for
example ALTER TABLE ALTER COLUMN DROP NOT NULL fails and tells you to
choose which exact constraint you want to drop and use DROP CONSTRAINT
instead.  And when searching for the not-null constraints for a column,
the code had to consider the case of there being multiple ones, which
led to strange contortions.  Allowing a single one is simpler and covers
all important cases well.

Anyway, you still can't drop the doubly-inherited constraint directly,
because it'll complain that it is an inherited constraint.  So you have
to deinherit first and only then can you drop the constraint.
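
With the earlier p1/p2/foo example, that workflow would look something
like this (a sketch of the behavior described above, under the patch):

ALTER TABLE foo ALTER COLUMN a DROP NOT NULL;  -- fails: inherited constraint
ALTER TABLE foo NO INHERIT p1;
ALTER TABLE foo NO INHERIT p2;
ALTER TABLE foo ALTER COLUMN a DROP NOT NULL;  -- now allowed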

Now, one possible improvement here would be to ignore the parent
constraint's name, and have 'foo' recompute its own constraint name from
scratch, inheriting the name only if one of the parents had a
manually-specified constraint name (and we would choose the first one,
if there's more than one).  I think complicating things more than that
is unnecessary -- particularly considering that legacy inheritance is,
well, legacy, and I doubt people are relying too much on it.


> Given the following sequence:
> 
> drop table if exists p,c;
> create table p(a int primary key);
> create table c() inherits (p);
> alter table p drop constraint p_pkey;
> 
> p.a ends up being nullable, where previously it would have been left
> non-nullable. That change makes sense, and is presumably one of the
> benefits of tying the nullability of columns to pg_constraint entries.

Right.

> However, c.a remains non-nullable, with a NOT NULL constraint that
> claims to be inherited:
> 
> \d+ c
>                                   Table "public.c"
>  Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description
> --------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
>  a      | integer |           | not null |         | plain   |             |              |
> Not null constraints:
> "c_a_not_null" NOT NULL "a" (inherited)
> Inherits: p
> Access method: heap
> 
> That's a problem, because now the NOT NULL constraint on c cannot be
> dropped (attempting to drop it on c errors out because it thinks it's
> inherited, but it can't be dropped via p, because p.a is already
> nullable).

Oh, I think the bug here is just that this constraint should not claim
to be inherited, but standalone.  So you can drop it afterwards; but if
you drop it and end up with NULL values in your PK-labelled column in
the parent table, that's on you.

> I wonder if NOT NULL constraints created as a result of inherited PKs
> should have names based on the PK name (e.g.,
> __not_null), to make it more obvious where they
> came from. That would be more consistent with the way NOT NULL
> constraint names are inherited.

Hmm, interesting idea.  I'll play with it.  (It may quickly lead to
constraint names that are too long, though.)

> Given the following sequence:
> 
> drop table if exists p,c;
> create table p(a int);
> create table c() inherits (p);
> alter table p add primary key (a);
> 
> c.a ends up 

Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-07-26 Thread Peter Geoghegan
On Wed, Jul 26, 2023 at 5:29 AM Matthias van de Meent
 wrote:
> Considering that it caches/reuses the page across SAOP operations, can
> (or does) this also improve performance for index scans on the outer
> side of a join if the order of join columns matches the order of the
> index?

It doesn't really cache leaf pages at all. What it does is advance the
array keys locally, while the original buffer lock is still held on
that same page.

> That is, I believe this caches (leaf) pages across scan keys, but can
> (or does) it also reuse these already-cached leaf pages across
> restarts of the index scan/across multiple index lookups in the same
> plan node, so that retrieval of nearby index values does not need to
> do an index traversal?

I'm not sure what you mean. There is no reason why you need to do more
than one single descent of an index to scan many leaf pages using many
distinct sets of array keys. Obviously, this depends on being able to
observe that we really don't need to redescend the index to advance
the array keys, again and again. Note in particular that this
usually works across leaf pages.

> I'm not sure I understand. MDAM seems to work on an index level to
> return full ranges of values, while "skip scan" seems to try to allow
> systems to signal to the index to skip to some other index condition
> based on arbitrary cutoffs. This would usually be those of which the
> information is not stored in the index, such as "SELECT user_id FROM
> orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go
> though the user_id index and skip to the next user_id value when it
> gets enough rows of a matching result (where "enough" is determined
> above the index AM's plan node, or otherwise is impossible to
> determine with only the scan key info in the index AM). I'm not sure
> how this could work without specifically adding skip scan-related
> index AM functionality, and I don't see how it fits in with this
> MDAM/SAOP system.

I think of that as being quite a different thing.

Basically, the patch that added that feature had to revise the index
AM API, in order to support a mode of operation where scans return
groupings rather than tuples. Whereas this patch requires none of
that. It makes affected index scans as similar as possible to
conventional index scans.

> > [...]
> >
> > Thoughts?
>
> MDAM seems to require exponential storage for "scan key operations"
> for conditions on N columns (to be precise, the product of the number
> of distinct conditions on each column); e.g. an index on mytable
> (a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
> AND h IN (1, 2)" would require 2^8 entries.

Note that I haven't actually changed anything about the way that the
state machine generates new sets of single value predicates -- it's
still just cycling through each distinct set of array keys in the
patch.

What you describe is a problem in theory, but I doubt that it's a
problem in practice. You don't actually have to materialize the
predicates up-front, or at all. Plus you can skip over them using the
next index tuple. So skipping works both ways.

-- 
Peter Geoghegan




Re: [PATCH] Check more invariants during syscache initialization

2023-07-26 Thread Zhang Mingli
HI,

> On Jul 26, 2023, at 20:50, Aleksander Alekseev  
> wrote:
> 
> Hi Michael,
> 
>> That was more a question.  I was wondering if it was something you've
>> noticed while working on a different patch because you somewhat
>> assigned incorrect values in the syscache array, but it looks like you
>> have noticed that while scanning the code.
> 
> Oh, got it. That's correct.
> 
>> Still it's hard to miss at compile time.  I think that I would remove
>> this one.
> 
> Fair enough. Here is the updated patch.
> 
> -- 
> Best regards,
> Aleksander Alekseev
> 



LGTM.

```
-   Assert(cacheinfo[cacheId].reloid != 0);
+   Assert(cacheinfo[cacheId].reloid != InvalidOid);
```

That reminded me to have a look at other code, and a grep search for `oid != 0` shows
there are several files using oid != 0.

```
.//src/bin/pg_resetwal/pg_resetwal.c:   if (set_oid != 0)
.//src/bin/pg_resetwal/pg_resetwal.c:   if (set_oid != 0)
.//src/bin/pg_dump/pg_backup_tar.c: if (oid != 0)
.//src/bin/pg_dump/pg_backup_custom.c:  while (oid != 0)
.//src/bin/pg_dump/pg_backup_custom.c:  while (oid != 0)
```
That is another story… I would like to provide a patch if it's worth it.


Zhang Mingli
https://www.hashdata.xyz



Re: Remove unused fields in ReorderBufferTupleBuf

2023-07-26 Thread Aleksander Alekseev
Hi,

> I'm afraid it's a bit incomplete:
>
> ```
> ../src/backend/replication/logical/reorderbuffer.c: In function
> ‘ReorderBufferGetTupleBuf’:
> ../src/backend/replication/logical/reorderbuffer.c:565:14: error:
> ‘ReorderBufferTupleBuf’ has no member named ‘alloc_tuple_size’
>   565 | tuple->alloc_tuple_size = alloc_len;
>   |  ^~
> [829/1882] Compiling C object contrib/xml2/pgxml.so.p/xpath.c.o
> ```

Here is the corrected patch. I added it to the nearest CF [1].

> On top of that IMO it doesn't make much sense to keep a one-field
> wrapper structure. Perhaps we should get rid of it entirely and just
> use HeapTupleData instead.
>
> Thoughts?

Alternatively we could convert ReorderBufferTupleBufData macro to an
inlined function. At least it will add some type safety.

I didn't change anything in this respect in v2. Feedback from the
community would be much appreciated.
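
For example, the current macro in src/include/replication/reorderbuffer.h:

#define ReorderBufferTupleBufData(p) \
	((HeapTupleHeader) MAXALIGN(((char *) p) + sizeof(ReorderBufferTupleBuf)))

could become something like this (a sketch only):

static inline HeapTupleHeader
ReorderBufferTupleBufData(ReorderBufferTupleBuf *tuple)
{
    return (HeapTupleHeader) MAXALIGN(((char *) tuple) +
                                      sizeof(ReorderBufferTupleBuf));
}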

[1]: https://commitfest.postgresql.org/44/4461/

-- 
Best regards,
Aleksander Alekseev


v2-0001-Remove-unused-fields-in-ReorderBufferTupleBuf.patch
Description: Binary data


Re: incremental-checkopints

2023-07-26 Thread Matthias van de Meent
On Wed, 26 Jul 2023 at 14:41, Alvaro Herrera  wrote:
>
> Hello
>
> On 2023-Jul-26, Thomas wen wrote:
>
> > Hi Hackers: I found this page:
> > https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints,PostgreSQL
> > no incremental checkpoints have been implemented so far. When a
> > checkpoint is triggered, the performance jitter of PostgreSQL is very
> > noticeable. I think incremental checkpoints should be implemented as
> > soon as possible
>
> I think my first question is why do you think that is necessary; there
> are probably other tools to achieve better performance.  For example,
> you may want to try making checkpoint_completion_target closer to 1, and
> the checkpoint interval longer (both checkpoint_timeout and
> max_wal_size).  Also, changing shared_buffers may improve things.  You
> can try adding more RAM to the machine.

Even with all those tuning options, a significant portion of a
checkpoint's IO (up to 50%) originates from FPIs in the WAL, which (in
general) will most often appear at the start of each checkpoint due to
each first update to a page after a checkpoint needing an FPI.
If instead we WAL-logged only the pages we are about to write to disk
(like MySQL's double-write buffer, but in WAL instead of a separate
cyclical buffer file), then a checkpoint_completion_target close to 1
would probably solve the issue, but with "WAL-logged torn page
protection at first update after checkpoint" we'll probably always
have higher-than-average FPI load just after a new checkpoint.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)




Re: cataloguing NOT NULL constraints

2023-07-26 Thread Dean Rasheed
On Tue, 25 Jul 2023 at 13:36, Alvaro Herrera  wrote:
>
> Okay then, I've made these show up in the footer of \d+.  This is in
> patch 0003 here.  Please let me know what do you think of the regression
> changes.
>

The new \d+ output certainly makes testing and reviewing easier,
though I do understand people's concerns that this may make the output
significantly longer in many real-world cases. I don't think it would
be a good idea to filter the list in any way though, because I think
that will only lead to confusion. I think it should be all-or-nothing,
though I'm not necessarily opposed to using something like \d++ to
enable it, if that turns out to be the least-bad option.

Going back to this example:

drop table if exists p1, p2, foo;
create table p1(a int not null check (a > 0));
create table p2(a int not null check (a > 0));
create table foo () inherits (p1,p2);
\d+ foo

   Table "public.foo"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
 a      | integer |           | not null |         | plain   |             |              |
Check constraints:
"p1_a_check" CHECK (a > 0)
"p2_a_check" CHECK (a > 0)
Not null constraints:
"p1_a_not_null" NOT NULL "a" (inherited)
Inherits: p1,
  p2
Access method: heap

I remain of the opinion that that should create 2 NOT NULL constraints
on foo, for consistency with CHECK constraints, and the misleading
name that results if p1_a_not_null is dropped from p1. That way, the
names of inherited NOT NULL constraints could be kept in sync, as they
are for other constraint types, making it easier to keep track of
where they come from, and it wouldn't be necessary to treat them
differently (e.g., matching by column number, when dropping NOT NULL
constraints).

Doing a little more testing, I found some other issues.


Given the following sequence:

drop table if exists p,c;
create table p(a int primary key);
create table c() inherits (p);
alter table p drop constraint p_pkey;

p.a ends up being nullable, where previously it would have been left
non-nullable. That change makes sense, and is presumably one of the
benefits of tying the nullability of columns to pg_constraint entries.
However, c.a remains non-nullable, with a NOT NULL constraint that
claims to be inherited:

\d+ c
                                  Table "public.c"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
 a      | integer |           | not null |         | plain   |             |              |
Not null constraints:
"c_a_not_null" NOT NULL "a" (inherited)
Inherits: p
Access method: heap

That's a problem, because now the NOT NULL constraint on c cannot be
dropped (attempting to drop it on c errors out because it thinks it's
inherited, but it can't be dropped via p, because p.a is already
nullable).

I wonder if NOT NULL constraints created as a result of inherited PKs
should have names based on the PK name (e.g.,
__not_null), to make it more obvious where they
came from. That would be more consistent with the way NOT NULL
constraint names are inherited.


Given the following sequence:

drop table if exists p,c;
create table p(a int);
create table c() inherits (p);
alter table p add primary key (a);

c.a ends up non-nullable, but there is no pg_constraint entry
enforcing the constraint:

\d+ c
                                  Table "public.c"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
 a      | integer |           | not null |         | plain   |             |              |
Inherits: p
Access method: heap


Given a database containing these 2 tables:

create table p(a int primary key);
create table c() inherits (p);

doing a pg_dump and restore fails to restore the NOT NULL constraint
on c, because all constraints created by the dump are local to p.


That's it for now. I'll try to do more testing later.

Regards,
Dean




Re: Remove unused fields in ReorderBufferTupleBuf

2023-07-26 Thread Aleksander Alekseev
Hi,

Thanks for the patch.

> However, node and alloc_tuple_size are not used at all. It seems an
> oversight in commit a4ccc1cef5a, which introduced the generation
> context and used it in logical decoding. I think we can remove them
> (only on HEAD). I've attached the patch.

I'm afraid it's a bit incomplete:

```
../src/backend/replication/logical/reorderbuffer.c: In function
‘ReorderBufferGetTupleBuf’:
../src/backend/replication/logical/reorderbuffer.c:565:14: error:
‘ReorderBufferTupleBuf’ has no member named ‘alloc_tuple_size’
  565 | tuple->alloc_tuple_size = alloc_len;
  |  ^~
[829/1882] Compiling C object contrib/xml2/pgxml.so.p/xpath.c.o
```

On top of that IMO it doesn't make much sense to keep a one-field
wrapper structure. Perhaps we should get rid of it entirely and just
use HeapTupleData instead.

Thoughts?

-- 
Best regards,
Aleksander Alekseev




Re: [PATCH] Check more invariants during syscache initialization

2023-07-26 Thread Aleksander Alekseev
Hi Michael,

> That was more a question.  I was wondering if it was something you've
> noticed while working on a different patch because you somewhat
> assigned incorrect values in the syscache array, but it looks like you
> have noticed that while scanning the code.

Oh, got it. That's correct.

> Still it's hard to miss at compile time.  I think that I would remove
> this one.

Fair enough. Here is the updated patch.

-- 
Best regards,
Aleksander Alekseev


v2-0001-Check-more-invariants-during-syscache-initializat.patch
Description: Binary data


Re: incremental-checkopints

2023-07-26 Thread Alvaro Herrera
Hello

On 2023-Jul-26, Thomas wen wrote:

> Hi Hackers: I found this page:
> https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints,PostgreSQL
> no incremental checkpoints have been implemented so far. When a
> checkpoint is triggered, the performance jitter of PostgreSQL is very
> noticeable. I think incremental checkpoints should be implemented as
> soon as possible

I think my first question is why do you think that is necessary; there
are probably other tools to achieve better performance.  For example,
you may want to try making checkpoint_completion_target closer to 1, and
the checkpoint interval longer (both checkpoint_timeout and
max_wal_size).  Also, changing shared_buffers may improve things.  You
can try adding more RAM to the machine.
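
In postgresql.conf terms, that could look like the following (the
values are illustrative starting points, not recommendations):

checkpoint_timeout = 30min            # default is 5min
max_wal_size = 16GB                   # default is 1GB
checkpoint_completion_target = 0.9    # spread writes across the interval
shared_buffers = 8GB                  # size to the working set and RAM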

Tuning the overall performance of a Postgres server is still black magic
to some extent, but there are a few well-known things to play with,
without having to write any patches.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."  (Brian Kernighan)




Re: incremental-checkopints

2023-07-26 Thread Tomas Vondra
On 7/26/23 09:21, Thomas wen wrote:
> Hi Hackers: I found this page:
> https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints,PostgreSQL
>  
> 
>  no incremental checkpoints have been implemented so far. When a checkpoint 
> is triggered, the performance jitter of PostgreSQL is very noticeable. I 
> think incremental checkpoints should be implemented as soon as possible
> 

Well, that thread is 12 years old, and no one followed up on that proposal.
So it seems people have different priorities, working on other stuff
that they consider is more valuable ...

You can either work on this yourself and write a patch, or try to
convince others it's worth working on. But you didn't provide any
information that'd demonstrate the jitter and that incremental
checkpoints would improve that.
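
For example, per-interval numbers from pgbench are usually enough to
show (or rule out) checkpoint-correlated dips; the scale and duration
below are arbitrary:

pgbench -i -s 100 bench
pgbench -c 16 -j 4 -T 600 -P 5 bench

The per-5-second tps lines can then be compared against checkpoint
start times in the server log (with log_checkpoints = on).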


For the record, the thread in our archives is:

https://www.postgresql.org/message-id/8a867f1ffea72091bf3cd6a49ba68a97.squirrel%40mail.go-link.net


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2023-07-26 Thread Matthias van de Meent
On Tue, 25 Jul 2023 at 03:34, Peter Geoghegan  wrote:
>
> I've been working on a variety of improvements to nbtree's native
> ScalarArrayOpExpr execution. This builds on Tom's work in commit
> 9e8da0f7.

Cool.

> Attached patch is still at the prototype stage. I'm posting it as v1 a
> little earlier than I usually would because there has been much back
> and forth about it on a couple of other threads involving Tomas Vondra
> and Jeff Davis -- seems like it would be easier to discuss with
> working code available.
>
> The patch adds two closely related enhancements to ScalarArrayOp
> execution by nbtree:
>
> 1. Execution of quals with ScalarArrayOpExpr clauses during nbtree
> index scans (for equality-strategy SK_SEARCHARRAY scan keys) can now
> "advance the scan's array keys locally", which sometimes avoids
> significant amounts of unneeded pinning/locking of the same set of
> index pages.
>
> SAOP index scans become capable of eliding primitive index scans for
> the next set of array keys in line in cases where it isn't truly
> necessary to descend the B-Tree again. Index scans are now capable of
> "sticking with the existing leaf page for now" when it is determined
> that the end of the current set of array keys is physically close to
> the start of the next set of array keys (the next set in line to be
> materialized by the _bt_advance_array_keys state machine). This is
> often possible.
>
> Naturally, we still prefer to advance the array keys in the
> traditional way ("globally") much of the time. That means we'll
> perform another _bt_first/_bt_search descent of the index, starting a
> new primitive index scan. Whether we try to skip pages on the leaf
> level or stick with the current primitive index scan (by advancing
> array keys locally) is likely to vary a great deal. Even during the
> same index scan. Everything is decided dynamically, which is the only
> approach that really makes sense.
>
> This optimization can significantly lower the number of buffers pinned
> and locked in cases with significant locality, and/or with many array
> keys with no matches. The savings (when measured in buffers
> pinned/locked) can be as high as 10x, 100x, or even more. Benchmarking
> has shown that transaction throughput for variants of "pgbench -S"
> designed to stress the implementation (hundreds of array constants)
> under concurrent load can have up to 5.5x higher transaction
> throughput with the patch. Less extreme cases (10 array constants,
> spaced apart) see about a 20% improvement in throughput. There are
> similar improvements to latency for the patch, in each case.

Considering that it caches/reuses the page across SAOP operations, can
(or does) this also improve performance for index scans on the outer
side of a join if the order of join columns matches the order of the
index?
That is, I believe this caches (leaf) pages across scan keys, but can
(or does) it also reuse these already-cached leaf pages across
restarts of the index scan/across multiple index lookups in the same
plan node, so that retrieval of nearby index values does not need to
do an index traversal?

> [...]
> Skip Scan
> =
>
> MDAM encompasses something that people tend to call "skip scan" --
> terminology with a great deal of baggage. These days I prefer to call
> it "filling in missing key predicates", per the paper. That's much
> more descriptive, and makes it less likely that people will conflate
> the techniques with InnoDB style "loose Index scans" -- the latter is
> a much more specialized/targeted optimization. (I now believe that
> these are very different things, though I was thrown off by the
> superficial similarities for a long time. It's pretty confusing.)

I'm not sure I understand. MDAM seems to work on an index level to
return full ranges of values, while "skip scan" seems to try to allow
systems to signal to the index to skip to some other index condition
based on arbitrary cutoffs. This would usually be those of which the
information is not stored in the index, such as "SELECT user_id FROM
orders GROUP BY user_id HAVING COUNT(*) > 10", where the scan would go
though the user_id index and skip to the next user_id value when it
gets enough rows of a matching result (where "enough" is determined
above the index AM's plan node, or otherwise is impossible to
determine with only the scan key info in the index AM). I'm not sure
how this could work without specifically adding skip scan-related
index AM functionality, and I don't see how it fits in with this
MDAM/SAOP system.

> [...]
>
> Thoughts?

MDAM seems to require exponential storage for "scan key operations"
for conditions on N columns (to be precise, the product of the number
of distinct conditions on each column); e.g. an index on mytable
(a,b,c,d,e,f,g,h) with conditions "a IN (1, 2) AND b IN (1, 2) AND ...
AND h IN (1, 2)" would require 2^8 entries. If 4 conditions were used
for each column, that'd be 4^8, etc...
With an index column limit of 32, that's 

Re: Synchronizing slots from primary to standby

2023-07-26 Thread shveta malik
On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy
 wrote:
>
> On Fri, Jul 21, 2023 at 5:16 PM shveta malik  wrote:
> >
> > Thanks Bharat for letting us know. It is okay to split the patch, it
> > may definitely help to understand the modules better but shall we take
> > a step back and try to reevaluate the design first before moving to
> > other tasks?
>
> Agree that design comes first. FWIW, I'm attaching the v9 patch set
> that I have with me. It can't be a perfect patch set unless the design
> is finalized.
>

Thanks for the patch and summarizing all the issues here. I was going
through the patch and found that now we need to maintain
'synchronize_slot_names' on both primary and standby, unlike the old
way where it was maintained only on the standby. I am aware of the problem
in the earlier implementation where each logical walsender/slot needed to
wait for all standbys to catch up before sending changes to logical
subscribers, even though that particular slot did not even need to be
synced by any of the standbys. Now it is more restrictive. But is
this 'synchronize_slot_names' per standby? If there are multiple
standbys, each having different 'synchronize_slot_names' requirements,
then how is the primary going to keep track of that?
Please let me know if that scenario can never arise where standbys can
have different 'synchronize_slot_names'.
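
To illustrate the question with the GUC from the patch (a hypothetical
sketch; the slot names are made up):

-- On standby A:
ALTER SYSTEM SET synchronize_slot_names = 'slot1';
SELECT pg_reload_conf();

-- On standby B:
ALTER SYSTEM SET synchronize_slot_names = 'slot2, slot3';
SELECT pg_reload_conf();

-- What should the primary's own synchronize_slot_names be then: the
-- union of all standbys' lists, and who keeps that union up to date?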

thanks
Shveta




Re: Row pattern recognition

2023-07-26 Thread Tatsuo Ishii
Attached is the v3 patch. In this patch the following changes are made.

(1) I completely changed the pattern matching engine so that it
performs backtracking. Now the engine evaluates all pattern elements
defined in PATTERN against each row, saving the matched pattern
variables in a string per row. For example, if pattern elements A and
B evaluate to true, the string "AB" is created for the current row.

This continues until pattern matching fails or the end of the full
window frame/partition is reached. After that, the pattern matching
engine creates all possible "pattern strings" and applies regular
expression matching to each. For example, if we have row 0 = "AB" and
row 1 = "C", the possible pattern strings are "AC" and "BC".

If it matches, the length of the matching substring is saved. After
all possible trials are done, the longest matching substring is chosen
and becomes the width (number of rows) of the reduced window frame.

See row_is_in_reduced_frame, search_str_set and search_str_set_recurse
in nodeWindowAgg.c for more details. For now I use a naive depth-first
search and there is probably a lot of room for optimization (for
example, rewriting it without recursion). Suggestions/patches are
welcome.
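
To illustrate just the enumeration step (a toy SQL sketch of the idea,
not the C implementation; the VALUES rows stand in for the per-row
matched-variable strings from the example above):

WITH RECURSIVE matched(rn, vars) AS (
    VALUES (0, 'AB'),   -- row 0: pattern variables A and B matched
           (1, 'C')     -- row 1: pattern variable C matched
),
strings(rn, s) AS (
    SELECT rn, substr(vars, g, 1)
    FROM matched, generate_series(1, length(vars)) AS g
    WHERE rn = 0
    UNION ALL
    SELECT m.rn, p.s || substr(m.vars, g, 1)
    FROM strings p
    JOIN matched m ON m.rn = p.rn + 1,
         generate_series(1, length(m.vars)) AS g
)
SELECT s FROM strings
WHERE rn = (SELECT max(rn) FROM matched);  -- yields "AC" and "BC"

The engine then runs the regular expression derived from PATTERN
against each candidate string and keeps the longest match.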

Jacob Champion wrote:
> It looks like the + qualifier has trouble when it meets the end of the
> frame. For instance, try removing the last row of the 'stock' table in
> rpr.sql; some of the final matches will disappear unexpectedly. Or try a
> pattern like
> 
> PATTERN ( a+ )
>  DEFINE a AS TRUE
> 
> which doesn't seem to match anything in my testing.
> 
> There's also the issue of backtracking in the face of reclassification,
> as I think Vik was alluding to upthread. The pattern
> 
> PATTERN ( a+ b+ )
>  DEFINE a AS col = 2,
> b AS col = 2

With the new engine, the cases above do not fail anymore. See the new
regression test cases. Thanks for providing valuable test cases!

(2) Make window functions RPR aware. Now first_value, last_value, and
nth_value recognize RPR (maybe first_value does not need any change?)

Vik Fearing wrote:
> I strongly disagree with this.  Window function do not need to know
> how the frame is defined, and indeed they should not.
> WinGetFuncArgInFrame should answer yes or no and the window function
> just works on that. Otherwise we will get extension (and possibly even
> core) functions that don't handle the frame properly.

So I moved row_is_in_reduced_frame into WinGetFuncArgInFrame so that
those window functions do not need to be changed.

(3) Window function rpr was removed. We can use first_value instead.

(4) Remaining tasks/issues.

- For now I disable WinSetMarkPosition because RPR often needs to
  access a row before the mark is set. We need to fix this in the
  future.

- I am working on making window aggregates RPR aware now. The
  implementation is in progress and far from complete. An example
  is below. I think rows 2, 3, 4, and 5 of the "count" column should
  be NULL instead of 3, 2, 0, 0. The same can be said of the other
  rows. Probably this is an effect of moving aggregates, but I am
  still studying the window aggregation code.

SELECT company, tdate, first_value(price) OVER W, count(*) OVER w FROM stock
 WINDOW w AS (
 PARTITION BY company
 ORDER BY tdate
 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
 AFTER MATCH SKIP PAST LAST ROW
 INITIAL
 PATTERN (START UP+ DOWN+)
 DEFINE
  START AS TRUE,
  UP AS price > PREV(price),
  DOWN AS price < PREV(price)
);
 company  |   tdate    | first_value | count 
----------+------------+-------------+-------
 company1 | 2023-07-01 |         100 |     4
 company1 | 2023-07-02 |             |     3
 company1 | 2023-07-03 |             |     2
 company1 | 2023-07-04 |             |     0
 company1 | 2023-07-05 |             |     0
 company1 | 2023-07-06 |          90 |     4
 company1 | 2023-07-07 |             |     3
 company1 | 2023-07-08 |             |     2
 company1 | 2023-07-09 |             |     0
 company1 | 2023-07-10 |             |     0
 company2 | 2023-07-01 |          50 |     4
 company2 | 2023-07-02 |             |     3
 company2 | 2023-07-03 |             |     2
 company2 | 2023-07-04 |             |     0
 company2 | 2023-07-05 |             |     0
 company2 | 2023-07-06 |          60 |     4
 company2 | 2023-07-07 |             |     3
 company2 | 2023-07-08 |             |     2
 company2 | 2023-07-09 |             |     0
 company2 | 2023-07-10 |             |     0

- If attributes appearing in DEFINE are not used in the target list, the
query fails.

SELECT company, tdate, count(*) OVER w FROM stock
 WINDOW w AS (
 PARTITION BY company
 ORDER BY tdate
 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
 AFTER MATCH SKIP PAST LAST ROW
 INITIAL
 PATTERN (START UP+ DOWN+)
 DEFINE
  START AS TRUE,
  UP AS price > PREV(price),
  DOWN AS price < PREV(price)
);
ERROR:  attribute number 3 exceeds number of columns 2

This is because attributes appearing in DEFINE are not added to the
target list.
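
A workaround consistent with this explanation is to reference the
DEFINE column in the target list, as the first example above does:

SELECT company, tdate, price, count(*) OVER w FROM stock
 WINDOW w AS (
 PARTITION BY company
 ORDER BY tdate
 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
 AFTER MATCH SKIP PAST LAST ROW
 INITIAL
 PATTERN (START UP+ DOWN+)
 DEFINE
  START AS TRUE,
  UP AS price > PREV(price),
  DOWN AS price < PREV(price)
);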

Re: POC: Extension for adding distributed tracing - pg_tracing

2023-07-26 Thread Nikita Malakhov
Hi!

This patch looks very interesting; I'm working on the same subject
too. But I've used another approach: a C wrapper for the C++ API
library from OpenTelemetry, handing span storage and output to that
library. There are some nuances, but it is possible. Have you tried
using the OpenTelemetry APIs instead of implementing all the
functionality around spans yourself?
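
For reference, a minimal sketch of the pull-based model mentioned in
the quoted mail below (the function name comes from the patch; its
exact signature and output columns are an assumption on my part):

-- Spans accumulate inside the extension and are pulled out by the
-- collector with a plain query:
SELECT * FROM pg_tracing_spans();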

On Tue, Jul 25, 2023 at 1:16 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi Anthonin,
>
> > I have a working prototype of a pg_tracing extension and wanted some
> feedback on the design and architecture.
>
> The patch looks very interesting, thanks for working on it and for
> sharing. The facts that the patch doesn't change the core except for
> two lines in instrument.{c,h} and that it uses a pull-based model:
>
> > - Collect the spans through a new pg_tracing_spans() function output
>
> ... IMO were the right design decisions. The patch lacks
> documentation, but this is OK for a PoC.
>
> I added the patch to the nearest CF [1]. Let's see what the rest of
> the community thinks.
>
> [1] https://commitfest.postgresql.org/44/4456/
>
> --
> Best regards,
> Aleksander Alekseev
>
>
>

-- 
Regards,

--
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: In Postgres 16 BETA, should the ParseNamespaceItem have the same index as its RangeTableEntry?

2023-07-26 Thread Amit Langote
Hello,

On Fri, Jul 21, 2023 at 5:05 AM Farias de Oliveira
 wrote:
>
> Hello,
>
> Thank you for the help, guys, and I'm so sorry for my late response. Indeed,
> the error lies in the ResultRelInfo. In the GetResultRTEPermissionInfo()
> function, there is a check on the relinfo->ri_RootResultRelInfo variable. I
> believe that it should go inside this scope:
>
>
> if (relinfo->ri_RootResultRelInfo)
> {
> /*
> * For inheritance child result relations (a partition routing target
> * of an INSERT or a child UPDATE target), this returns the root
> * parent's RTE to fetch the RTEPermissionInfo because that's the only
> * one that has one assigned.
> */
> rti = relinfo->ri_RootResultRelInfo->ri_RangeTableIndex;
> }
>
> The relinfo contains:
>
> {type = T_ResultRelInfo, ri_RangeTableIndex = 5, ri_RelationDesc = 
> 0x7f44e3308cc8, ri_NumIndices = 0, ri_IndexRelationDescs = 0x0, 
> ri_IndexRelationInfo = 0x0, ri_RowIdAttNo = 0,
>   ri_extraUpdatedCols = 0x0, ri_projectNew = 0x0, ri_newTupleSlot = 0x0, 
> ri_oldTupleSlot = 0x0, ri_projectNewInfoValid = false, ri_TrigDesc = 0x0, 
> ri_TrigFunctions = 0x0,
>   ri_TrigWhenExprs = 0x0, ri_TrigInstrument = 0x0, ri_ReturningSlot = 0x0, 
> ri_TrigOldSlot = 0x0, ri_TrigNewSlot = 0x0, ri_FdwRoutine = 0x0, ri_FdwState 
> = 0x0,
>   ri_usesFdwDirectModify = false, ri_NumSlots = 0, ri_NumSlotsInitialized = 
> 0, ri_BatchSize = 0, ri_Slots = 0x0, ri_PlanSlots = 0x0, ri_WithCheckOptions 
> = 0x0,
>   ri_WithCheckOptionExprs = 0x0, ri_ConstraintExprs = 0x0, ri_GeneratedExprsI 
> = 0x0, ri_GeneratedExprsU = 0x0, ri_NumGeneratedNeededI = 0, 
> ri_NumGeneratedNeededU = 0,
>   ri_returningList = 0x0, ri_projectReturning = 0x0, 
> ri_onConflictArbiterIndexes = 0x0, ri_onConflict = 0x0, ri_matchedMergeAction 
> = 0x0, ri_notMatchedMergeAction = 0x0,
>   ri_PartitionCheckExpr = 0x0, ri_ChildToRootMap = 0x0, 
> ri_ChildToRootMapValid = false, ri_RootToChildMap = 0x0, 
> ri_RootToChildMapValid = false, ri_RootResultRelInfo = 0x0,
>   ri_PartitionTupleSlot = 0x0, ri_CopyMultiInsertBuffer = 0x0, 
> ri_ancestorResultRels = 0x0}
>
> Since relinfo->ri_RootResultRelInfo = 0x0, rti ends up with no value, and
> Postgres interprets this as meaning the ResultRelInfo must have been created
> only for filtering triggers and that the relation is not being inserted into.
> The relinfo variable is created with the create_entity_result_rel_info() 
> function:
>
> ResultRelInfo *create_entity_result_rel_info(EState *estate, char *graph_name,
>  char *label_name)
> {
> RangeVar *rv;
> Relation label_relation;
> ResultRelInfo *resultRelInfo;
>
> ParseState *pstate = make_parsestate(NULL);
>
> resultRelInfo = palloc(sizeof(ResultRelInfo));
>
> if (strlen(label_name) == 0)
> {
> rv = makeRangeVar(graph_name, AG_DEFAULT_LABEL_VERTEX, -1);
> }
> else
> {
> rv = makeRangeVar(graph_name, label_name, -1);
> }
>
> label_relation = parserOpenTable(pstate, rv, RowExclusiveLock);
>
> // initialize the resultRelInfo
> InitResultRelInfo(resultRelInfo, label_relation,
>   list_length(estate->es_range_table), NULL,
>   estate->es_instrument);
>
> // open the indices on the result relation
> ExecOpenIndices(resultRelInfo, false);
>
> free_parsestate(pstate);
>
> return resultRelInfo;
> }
>
> In this case, how can we get the relinfo->ri_RootResultRelInfo to store the 
> appropriate data?

Your function doesn't seem to have access to the ModifyTableState
node, so setting ri_RootResultRelInfo to the correct ResultRelInfo
node does not seem doable.

As I suggested in my previous reply, please check if passing 0 (not
list_length(estate->es_range_table)) as the 3rd argument of
InitResultRelInfo() fixes the problem and gives the correct result.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




incremental-checkpoints

2023-07-26 Thread Thomas wen
Hi Hackers:   I found this page:
https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints
PostgreSQL has not implemented incremental checkpoints so far. When a
checkpoint is triggered, the performance jitter of PostgreSQL is very
noticeable. I think incremental checkpoints should be implemented as soon
as possible.

Best wishes

Thomas wen


Re: PATCH: Add REINDEX tag to event triggers

2023-07-26 Thread jian he
On Wed, Jul 26, 2023 at 7:51 AM Michael Paquier  wrote:
>
> On Tue, Jul 25, 2023 at 04:34:47PM +0800, jian he wrote:
> > so  T_ReindexStmt should only be in  ProcessUtilitySlow, if you want
> > to create an event trigger on reindex?
> >
> > regression tests work fine. I even play with partitions.
>
> It would be an idea to have some regression tests for partitions,
> actually, so as some patterns around ReindexMultipleInternal() are
> checked.  We could have a REINDEX DATABASE in a TAP test with an event
> trigger, as well, but I don't feel strongly about the need to do that
> much extra work in 090_reindexdb.pl or 091_reindexdb_all.pl if
> partitions cover the multi-table case.
> --
> Michael

This is quite verbose (copied from partition-info.sql), but it meets the
expectation: a partitioned index does nothing, while a partition index
fires the event trigger.

DROP EVENT TRIGGER IF EXISTS end_reindex_command CASCADE;
DROP EVENT TRIGGER IF EXISTS start_reindex_command CASCADE;

CREATE OR REPLACE FUNCTION reindex_end_command()
RETURNS event_trigger AS $$
DECLARE
  obj record;
BEGIN
raise notice 'begin of reindex_end_command';

FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
LOOP
  RAISE NOTICE
'obj.command_tag:% obj.object_type:% obj.schema_name:%
obj.object_identity:%'
,obj.command_tag, obj.object_type,obj.schema_name,obj.object_identity;
  RAISE NOTICE 'ddl_end_command -- REINDEX: %', pg_get_indexdef(obj.objid);
END LOOP;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION start_reindex_command()
RETURNS event_trigger AS $$
DECLARE
obj record;
BEGIN
FOR obj IN SELECT * FROM pg_event_trigger_ddl_commands()
LOOP
RAISE NOTICE
'obj.command_tag:% obj.object_type:% obj.schema_name:%
obj.object_identity:%'
, obj.command_tag, obj.object_type,obj.schema_name,obj.object_identity;
RAISE NOTICE 'ddl_start_command -- REINDEX: %',
pg_get_indexdef(obj.objid);
END LOOP;
raise notice 'end of start_reindex_command';
END;
$$ LANGUAGE plpgsql;

BEGIN;
CREATE EVENT TRIGGER end_reindex_command ON ddl_command_end
WHEN TAG IN ('REINDEX') EXECUTE PROCEDURE reindex_end_command();
CREATE EVENT TRIGGER start_reindex_command ON ddl_command_start
WHEN TAG IN ('REINDEX') EXECUTE PROCEDURE start_reindex_command();
COMMIT;

-- test Reindex Event Trigger
BEGIN;
drop table if EXISTS ptif_test CASCADE;
CREATE TABLE ptif_test (a int, b int) PARTITION BY range (a);
CREATE TABLE ptif_test0 PARTITION OF ptif_test
  FOR VALUES FROM (minvalue) TO (0) PARTITION BY list (b);
CREATE TABLE ptif_test01 PARTITION OF ptif_test0 FOR VALUES IN (1);
CREATE TABLE ptif_test1 PARTITION OF ptif_test
  FOR VALUES FROM (0) TO (100) PARTITION BY list (b);
CREATE TABLE ptif_test11 PARTITION OF ptif_test1 FOR VALUES IN (1);

CREATE TABLE ptif_test2 PARTITION OF ptif_test
  FOR VALUES FROM (100) TO (200);
-- This partitioned table should remain with no partitions.
CREATE TABLE ptif_test3 PARTITION OF ptif_test
  FOR VALUES FROM (200) TO (maxvalue) PARTITION BY list (b);

-- Test index partition tree
CREATE INDEX ptif_test_index ON ONLY ptif_test (a);

CREATE INDEX ptif_test0_index ON ONLY ptif_test0 (a);
ALTER INDEX ptif_test_index ATTACH PARTITION ptif_test0_index;

CREATE INDEX ptif_test01_index ON ptif_test01 (a);
ALTER INDEX ptif_test0_index ATTACH PARTITION ptif_test01_index;

CREATE INDEX ptif_test1_index ON ONLY ptif_test1 (a);
ALTER INDEX ptif_test_index ATTACH PARTITION ptif_test1_index;

CREATE INDEX ptif_test11_index ON ptif_test11 (a);
ALTER INDEX ptif_test1_index ATTACH PARTITION ptif_test11_index;

CREATE INDEX ptif_test2_index ON ptif_test2 (a);
ALTER INDEX ptif_test_index ATTACH PARTITION ptif_test2_index;

CREATE INDEX ptif_test3_index ON ptif_test3 (a);
ALTER INDEX ptif_test_index ATTACH PARTITION ptif_test3_index;
COMMIT;

-- top-level partitioned index: will recurse to each partition index.
REINDEX INDEX CONCURRENTLY public.ptif_test_index;

-- ptif_test0 is a partitioned table; this reindexes its partition index
-- ptif_test01_index, which the event trigger will log.
REINDEX INDEX CONCURRENTLY public.ptif_test0_index;

-- ptif_test1_index is a partitioned index; this reindexes the partition
-- index ptif_test11_index, which the event trigger will log.
REINDEX INDEX CONCURRENTLY public.ptif_test1_index;

-- ptif_test2 is a leaf partition; the event trigger will log ptif_test2_index.
REINDEX INDEX CONCURRENTLY public.ptif_test2_index;

-- no partitions, so the event trigger won't do anything.
REINDEX INDEX CONCURRENTLY public.ptif_test3_index;

reindex table ptif_test; -- top level; will recurse to each partition index
reindex table ptif_test0; -- will recurse to ptif_test01
reindex table ptif_test01; -- will reindex its associated index
reindex table ptif_test11; -- will reindex its associated index
reindex table ptif_test2;  -- will reindex its associated index
reindex table ptif_test3;  -- no partitions

Re: Question about use_physical_tlist() which is applied on Scan path

2023-07-26 Thread Alvaro Herrera
On 2023-Jul-26, Jian Guo wrote:

> It looks the columns besides `ps_supplycost` and `ps_availqty` are not
> necessary, but fetched from tuples all at once. For the row-based
> storage such as heap, it looks fine, but for column-based storage, it
> would result into unnecessary overhead and impact performance. Is
> there any plan to optimize here?

I suppose that, at some point, it is going to have to be the table AM
that makes the decision.  That is, use_physical_tlist would have to
consult some new flag in path->parent->amflags to determine whether to
skip using a physical tlist.  Right now, we don't have any columnar
stores, so there's no way to verify an implementation.  If you do have
a columnar store implementation, you're welcome to share it.

-- 
Álvaro Herrera PostgreSQL Developer
"I am amazed at [the pgsql-sql] mailing list for the wonderful support, and
lack of hesitasion in answering a lost soul's question, I just wished the rest
of the mailing list could be like this."   (Fotis)
   (http://archives.postgresql.org/pgsql-sql/2006-06/msg00265.php)




Re: remaining sql/json patches

2023-07-26 Thread Amit Langote
On Fri, Jul 21, 2023 at 7:33 PM Amit Langote  wrote:
> On Fri, Jul 21, 2023 at 1:02 AM Alvaro Herrera  
> wrote:
> > On 2023-Jul-21, Amit Langote wrote:
> >
> > > I’m thinking of pushing 0001 and 0002 tomorrow barring objections.
> >
> > 0001 looks reasonable to me.  I think you asked whether to squash that
> > one with the other bugfix commit for the same code that you already
> > pushed to master; I think there's no point in committing as separate
> > patches, because the first one won't show up in the git_changelog output
> > as a single entity with the one in 16, so it'll just be additional
> > noise.
>
> OK, pushed 0001 to HEAD and b6e1157e7d + 0001 to 16.
>
> > I've looked at 0002 at various points in time and I think it looks
> > generally reasonable.  I think your removal of a couple of newlines
> > (where originally two appear in sequence) is unwarranted; that the name
> > to_json[b]_worker is ugly for exported functions (maybe "datum_to_json"
> > would be better, or you may have better ideas);
>
> Went with datum_to_json[b].  Created a separate refactoring patch for
> this, attached as 0001.
>
> Created another refactoring patch for the hunks related to renaming of
> a nonterminal in gram.y, attached as 0002.
>
> > and that the omission of
> > the stock comment in the new stanzas in FigureColnameInternal() is
> > strange.
>
> Yes, fixed.
>
> >  But I don't have anything serious.  Do add some ecpg tests ...
>
> Added.
>
> > Also, remember to pgindent and bump catversion, if you haven't already.
>
> Will do.  Wasn't sure myself whether the catversion should be bumped,
> but I suppose it must be because ruleutils.c has changed.
>
> Attaching latest patches.  Will push 0001, 0002, and 0003 on Monday to
> avoid worrying about the buildfarm on a Friday evening.

And pushed.

Will post the remaining patches after addressing jian he's comments.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: [PATCH] Add loongarch native checksum implementation.

2023-07-26 Thread John Naylor
On Wed, Jul 26, 2023 at 8:25 AM YANG Xudong  wrote:
>
> Many thanks to huchangqi. Now we have loongarch64 support for both old
> world ABI and new world ABI on the buildfarm!

Glad to hear it!

On Wed, Jul 26, 2023 at 10:16 AM Michael Paquier 
wrote:
>
> The performance numbers presented upthread for the CRC computations
> are kind of nice in this environment, but honestly I have no idea how
> much this architecture is used.  Perhaps that's only something in
> China?  I am not seeing much activity around that in Japan, for
> instance, and that's really close.

That was my impression as well. My thinking was, we can give the same
treatment that we gave Arm a number of years ago (which is now quite
mainstream).

> Anyway, based on today's state of the buildfarm, we have a buildfarm
> member named cisticola that should be able to test this new CRC
> implementation, so I see no problem in applying this stuff now if you
> think it is in good shape.

I believe there was just a comment that needed updating, so I'll do that
and push within a few days.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: logical decoding and replication of sequences, take 2

2023-07-26 Thread Amit Kapila
On Wed, Jul 26, 2023 at 9:37 AM Amit Kapila  wrote:
>
> On Tue, Jul 25, 2023 at 5:29 PM Tomas Vondra
>  wrote:
> >
> > On 7/25/23 08:28, Amit Kapila wrote:
> > > On Mon, Jul 24, 2023 at 9:32 PM Tomas Vondra
> > >  wrote:
> > >>
> > >> On 7/24/23 12:40, Amit Kapila wrote:
> > >>> On Wed, Jul 5, 2023 at 8:21 PM Ashutosh Bapat
> > >>>  wrote:
> > >>>
> > >>> Even after that, see below the value of the sequence is still not
> > >>> caught up. Later, when the apply worker processes all the WAL, the
> > >>> sequence state will be caught up.
> > >>>
> > >>
> > >> And how is this different from what tablesync does for tables? For that
> > >> 'r' also does not mean it's fully caught up, IIRC. What matters is
> > >> whether the sequence since this moment can go back. And I don't think it
> > >> can, because that would require replaying changes from before we did
> > >> copy_sequence ...
> > >>
> > >
> > > For sequences, it is quite possible that we replay WAL from before the
> > > copy_sequence whereas the same is not true for tables (w.r.t
> > > copy_table()). This is because for tables we have a kind of interlock
> > > w.r.t LSN returned via create_slot (say this value of LSN is LSN1),
> > > basically, the walsender corresponding to tablesync worker in
> > > publisher won't send any WAL before that LSN whereas the same is not
> > > true for sequences. Also, even if apply worker can receive WAL before
> > > copy_table, it won't apply that as that would be behind the LSN1 and
> > > the same is not true for sequences. So, for tables, we will never go
> > > back to a state before the copy_table() but for sequences, we can go
> > > back to a state before copy_sequence().
> > >
> >
> > Right. I think the important detail is that during sync we have three
> > important LSNs
> >
> > - LSN1 where the slot is created
> > - LSN2 where the copy happens
> > - LSN3 where we consider the sync completed
> >
> > For tables, LSN1 == LSN2, because the copy is done using the
> > snapshot from the temporary slot. And (LSN1 <= LSN3).
> >
> > But for sequences, the copy happens after the slot creation, possibly
> > with (LSN1 < LSN2). And because LSN3 comes from the main subscription
> > (which may be a bit behind, for whatever reason), it may happen that
> >
> >(LSN1 < LSN3 < LSN2)
> >
> > The sync ends at LSN3, but that means all sequence changes between
> > LSN3 and LSN2 will be applied "again", making the sequence go backwards.
> >
>
> Yeah, the problem is as you explained, but an additional minor point
> is that for sequences we also end up applying the WAL between LSN1 and
> LSN3, which makes the sequence go backwards.
>

I was reading this email thread and found the email by Andres [1]
which seems to me to say the same thing: "I assume that part of the
initial sync would have to be a new sequence synchronization step that
reads all the sequence states on the publisher and ensures that the
subscriber sequences are at the same point. There's a bit of
trickiness there, but it seems entirely doable. The logical
replication replay support for sequences will have to be a bit careful
about not decreasing the subscriber's sequence values - the standby
initially will be ahead of the
increments we'll see in the WAL.". Now, IIUC this means that even
before the sequence is marked as SYNCDONE, it shouldn't go backward.

[1] 
https://www.postgresql.org/message-id/20221117024357.ljjme6v75mny2j6u%40awork3.anarazel.de

With Regards,
Amit Kapila.




Re: [feature]COPY FROM enable FORCE_NULL/FORCE_NOT_NULL on all columns

2023-07-26 Thread Zhang Mingli
Hi,

> I've looked at this patch and it looks mostly fine, though I do not
> intend to commit it myself; perhaps Andrew will.

Hi Amit, thanks for the review.

> 
> A few minor things to improve:
> 
> +  If * is specified, it will be applied in all 
> columns.
> ...
> +  If * is specified, it will be applied in all 
> columns.
> 
> Please write "it will be applied in" as "the option will be applied to".

+1

> 
> +   boolforce_notnull_all;  /* FORCE_NOT_NULL * */
> ...
> +   boolforce_null_all; /* FORCE_NULL * */
> 
> Like in the comment for force_quote, please add a "?" after * in the
> above comments.

+1

> 
> +   if (cstate->opts.force_notnull_all)
> +   MemSet(cstate->opts.force_notnull_flags, true, num_phys_attrs
> * sizeof(bool));
> ...
> +   if (cstate->opts.force_null_all)
> +   MemSet(cstate->opts.force_null_flags, true, num_phys_attrs *
> sizeof(bool));
> 
> While I am not especially opposed to using this 1-line variant to set
> the flags array, it does mean that there are now different styles
> being used for similar code, because force_quote_flags uses a for
> loop:
> 
>if (cstate->opts.force_quote_all)
>{
>int i;
> 
>for (i = 0; i < num_phys_attrs; i++)
>cstate->opts.force_quote_flags[i] = true;
>}
> 
> Perhaps we could fix the inconsistency by changing the force_quote_all
> code to use MemSet() too.  I'll defer whether to do that to Andrew's
> judgement.

Sure, let’s wait for Andrew, and then I will put everything in one pot.

Zhang Mingli
https://www.hashdata.xyz



Question about use_physical_tlist() which is applied on Scan path

2023-07-26 Thread Jian Guo
Hi hackers,

I have a question about `use_physical_tlist()` which is applied in 
`create_scan_plan()`:
```
if (flags == CP_IGNORE_TLIST)
{
  tlist = NULL;
}
else if (use_physical_tlist(root, best_path, flags))
{
  if (best_path->pathtype == T_IndexOnlyScan)
  {
/* For index-only scan, the preferred tlist is the index's */
tlist = copyObject(((IndexPath *) best_path)->indexinfo->indextlist);

/*
 * Transfer sortgroupref data to the replacement tlist, if
 * requested (use_physical_tlist checked that this will work).
 */
if (flags & CP_LABEL_TLIST)
  apply_pathtarget_labeling_to_tlist(tlist, best_path->pathtarget);
  }
  else
  {
tlist = build_physical_tlist(root, rel);
……
```
And the comment above the code block says:

```
/*
* For table scans, rather than using the relation targetlist (which is
* only those Vars actually needed by the query), we prefer to generate a
* tlist containing all Vars in order.  This will allow the executor to
* optimize away projection of the table tuples, if possible.
*
* But if the caller is going to ignore our tlist anyway, then don't
* bother generating one at all.  We use an exact equality test here, so
* that this only applies when CP_IGNORE_TLIST is the only flag set.
*/
```

But for a column-oriented database based on Postgres, could skipping the 
physical tlist help a lot by avoiding projection of full table tuples during 
execution? And are there any other optimization considerations behind this 
design?

e.g. If we have such table definition and a query:

```
CREATE TABLE partsupp
(PS_PARTKEY INT,
PS_SUPPKEY INT,
PS_AVAILQTY INTEGER,
PS_SUPPLYCOST DECIMAL(15,2),
PS_COMMENT VARCHAR(199),
dummy text);

explain analyze verbose select sum(ps_supplycost * ps_availqty) from partsupp;
```

And the planner would give such plan:

```
QUERY PLAN
-------------------------------------------------------------------------------
 Aggregate  (cost=12.80..12.81 rows=1 width=32) (actual time=0.013..0.015 rows=1 loops=1)
   Output: sum((ps_supplycost * (ps_availqty)::numeric))
   ->  Seq Scan on public.partsupp  (cost=0.00..11.60 rows=160 width=22) (actual time=0.005..0.005 rows=0 loops=1)
         Output: ps_partkey, ps_suppkey, ps_availqty, ps_supplycost, ps_comment, dummy
 Planning Time: 0.408 ms
 Execution Time: 0.058 ms
(6 rows)
```
It looks like the columns other than `ps_supplycost` and `ps_availqty` are 
not necessary, but they are all fetched from the tuples at once. For 
row-based storage such as heap this looks fine, but for column-based storage 
it results in unnecessary overhead and hurts performance. Is there any plan 
to optimize this?

Thanks.


Re: [feature]COPY FROM enable FORCE_NULL/FORCE_NOT_NULL on all columns

2023-07-26 Thread Amit Langote
Hello,

On Thu, Jul 20, 2023 at 4:06 PM Zhang Mingli  wrote:
>
> Hi,
>
> On Jul 7, 2023 at 18:00 +0800, Damir Belyalov , wrote:
>
>
> V2 patch still have some errors when apply file doc/src/sgml/ref/copy.sgml, 
> rebased and fixed it in V3 path.
> Thanks a lot for review.
>
> I have updated https://commitfest.postgresql.org/43/3896/ to staus Ready for 
> Committer, thanks again.

I've looked at this patch and it looks mostly fine, though I do not
intend to commit it myself; perhaps Andrew will.

A few minor things to improve:

+  If * is specified, it will be applied in all columns.
...
+  If * is specified, it will be applied in all columns.

Please write "it will be applied in" as "the option will be applied to".

+   boolforce_notnull_all;  /* FORCE_NOT_NULL * */
...
+   boolforce_null_all; /* FORCE_NULL * */

Like in the comment for force_quote, please add a "?" after * in the
above comments.

+   if (cstate->opts.force_notnull_all)
+   MemSet(cstate->opts.force_notnull_flags, true, num_phys_attrs
* sizeof(bool));
...
+   if (cstate->opts.force_null_all)
+   MemSet(cstate->opts.force_null_flags, true, num_phys_attrs *
sizeof(bool));

While I am not especially opposed to using this 1-line variant to set
the flags array, it does mean that there are now different styles
being used for similar code, because force_quote_flags uses a for
loop:

if (cstate->opts.force_quote_all)
{
int i;

for (i = 0; i < num_phys_attrs; i++)
cstate->opts.force_quote_flags[i] = true;
}

Perhaps we could fix the inconsistency by changing the force_quote_all
code to use MemSet() too.  I'll defer whether to do that to Andrew's
judgement.
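
For anyone following along, a hedged usage sketch of what the patch
enables (table and file names are illustrative; FORCE_QUOTE * is the
pre-existing analogue on the COPY TO side):

-- Match quoted null strings back to NULL in every column:
COPY mytable FROM '/tmp/data.csv' WITH (FORMAT csv, FORCE_NULL *);

-- Or: never match the null string in any column:
COPY mytable FROM '/tmp/data.csv' WITH (FORMAT csv, FORCE_NOT_NULL *);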


--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: psql: Could we get "-- " prefixing on the **** QUERY **** outputs? (ECHO_HIDDEN)

2023-07-26 Thread Pavel Stehule
On Wed, Jul 26, 2023 at 6:22 AM Nathan Bossart 
wrote:

> I took a look at this patch and changed a couple things:
>
>  * I made a similar adjustment to a few lines that seem to have been
>missed.
>  * I removed a couple of asterisks from the adjusted lines in order to
>maintain the existing line lengths.
>
> Barring additional feedback, I think this is ready for commit.
>
>
+1

Pavel

> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
>