autoprewarm main function not tested background worker not listed in pg_stat_activity

2023-12-28 Thread jian he
Hi.
https://coverage.postgresql.org/contrib/pg_prewarm/autoprewarm.c.gcov.html
The function autoprewarm_start_worker never gets tested, but
autoprewarm_start_worker is listed in our docs
(https://www.postgresql.org/docs/16/pgprewarm.html).
Maybe we should test it.

Also, this part of the function autoprewarm_main is not tested:
 247     :     /* Compute the next dump time. */
 248   0 :     next_dump_time =
 249   0 :         TimestampTzPlusMilliseconds(last_dump_time,
 250     :                                     autoprewarm_interval * 1000);
 251     :     delay_in_ms =
 252   0 :         TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
 253     :                                         next_dump_time);
 254     :
 255     :     /* Perform a dump if it's time. */
 256   0 :     if (delay_in_ms <= 0)
 257     :     {
 258   0 :         last_dump_time = GetCurrentTimestamp();
 259   0 :         apw_dump_now(true, false);
 260   0 :         continue;
 261     :     }
 262     :
 263     :     /* Sleep until the next dump time. */
 264   0 :     (void) WaitLatch(MyLatch,
 265     :                      WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
 266     :                      delay_in_ms,
 267     :                      WAIT_EVENT_EXTENSION);

I don't fully understand this module, but I roughly understand how the
worker_spi module works.

I didn't configure shared_preload_libraries.
I changed from
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
to
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                       BGWORKER_BACKEND_DATABASE_CONNECTION;

But still, the following query will not list the autoprewarm background worker:

SELECT datname, pid, state, backend_type, wait_event_type, wait_event
FROM pg_stat_activity;

I thought pg_stat_activity would list all related PIDs for this cluster,
but it seems not to in this case.
So, would it be a good idea to make the autoprewarm background worker
show up in pg_stat_activity?
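
For reference, here is a minimal sketch of how such a worker registration
looks (this is not the actual autoprewarm code; the library and function
names below are made up):

#include "postgres.h"

#include "postmaster/bgworker.h"

void
my_register_worker(void)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));

	/*
	 * BGWORKER_BACKEND_DATABASE_CONNECTION lets the worker later call
	 * BackgroundWorkerInitializeConnection() to attach to a database.
	 */
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_ConsistentState;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_worker_main");
	snprintf(worker.bgw_name, BGW_MAXLEN, "my background worker");
	snprintf(worker.bgw_type, BGW_MAXLEN, "my background worker");

	RegisterBackgroundWorker(&worker);
}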




Re: Synchronizing slots from primary to standby

2023-12-28 Thread Amit Kapila
On Fri, Dec 29, 2023 at 6:59 AM Masahiko Sawada  wrote:
>
> On Wed, Dec 27, 2023 at 7:43 PM Amit Kapila  wrote:
> >
> > >
> > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC
> > > parameter, say slot_sync_dbname, to specify the database to connect.
> > > The slot_sync_dbname overwrites the dbname if primary_conninfo also
> > > specifies it. If both don't have a dbname, raise an error.
> > >
> >
> > Would the users prefer to provide a value for a separate GUC instead
> > of changing primary_conninfo? It is possible that we can have some
> > users prefer to use one GUC and others prefer a separate GUC but we
> > should add a new GUC if we are sure that is what users would prefer.
> > Also, even if we have to consider this option, I think we can easily
> > later add a new GUC to provide a dbname in addition to having the
> > provision of giving it in primary_conninfo.
>
> I think having two separate GUCs is more flexible for example when
> users want to change the dbname to connect. It makes sense that the
> slotsync worker wants to use the same connection string as the
> walreceiver uses. But I guess today most primary_conninfo settings
> that are set manually or are generated by tools such as pg_basebackup
> don't have dbname. If we require a dbname in primary_conninfo, many
> tools will need to be changed. Once the connection string is
> generated, it would be tricky to change the dbname in it, as Shveta
> mentioned. The users will have to carefully select the database to
> connect when taking a base backup.
>

I see your point and agree that users need to be careful. I was trying
to compare it with other places like the conninfo used with a
subscription where no separate dbname needs to be provided. Now, here
the situation is not the same because the same conninfo is used for
different purposes (walreceiver doesn't require dbname (dbname is
ignored even if present) whereas slotsyncworker requires dbname). I
was just trying to see if we can avoid having a new GUC for this
purpose. Does anyone else have an opinion on this matter?

-- 
With Regards,
Amit Kapila.




Re: pg_stat_statements: more test coverage

2023-12-28 Thread Julien Rouhaud
On Wed, Dec 27, 2023 at 8:53 PM Peter Eisentraut  wrote:
>
> On 27.12.23 09:08, Julien Rouhaud wrote:
> >
> > I was a bit surprised by that so I checked locally.  It does work as
> > expected provided that you set pg_stat_statements.track to all:
>
> Ok, here is an updated patch set that does it that way.

It looks good to me.  One minor complaint, I'm a bit dubious about
those queries:

SELECT count(*) <= 100 AND count(*) > 0 FROM pg_stat_statements;

Is it to actually test that pg_stat_statements won't store more than
pg_stat_statements.max records?  Note also that this query can't
return 0 rows, as the currently executed query will have an entry
added during post_parse_analyze.  Maybe a comment saying what this is
actually testing would help.

It would also be good to test that pg_stat_statements_info.dealloc is
more than 0 once enough statements have been issued.
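
Something like this, perhaps (just a sketch, to be adjusted as needed):

SELECT dealloc > 0 FROM pg_stat_statements_info;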

> I have committed the patches 0002 and 0003 from the previous patch set
> already.
>
> I have also enhanced the TAP test a bit to check the actual content of
> the output across restarts.

Nothing much to say about this one, it all looks good.

> I'm not too hung up on the 0001 patch if others don't like that approach.

I agree with Michael on this one; the only times I've seen this pattern
were to comply with some company-internal policy on minimal coverage
numbers.




Re: Pdadmin open on Macbook issue

2023-12-28 Thread Maciek Sakrejda
This is not the right mailing list for your question. Try the
pgadmin-support [1] mailing list. You may also want to include more details
in your question, because it's not really possible to tell what's going
wrong from your description.

[1]: https://www.postgresql.org/list/pgadmin-support/


Re: Removing unneeded self joins

2023-12-28 Thread Alexander Lakhin

Hi Alexander,

23.10.2023 14:29, Alexander Korotkov wrote:

Fixed all of the above. Thank you for catching this!


I've discovered that starting from d3d55ce57 the following query:
CREATE TABLE t(a int PRIMARY KEY);

WITH tt AS (SELECT * FROM t)
UPDATE t SET a = tt.a + 1 FROM tt
WHERE tt.a = t.a RETURNING t.a;

triggers an error "variable not found in subplan target lists".
(Commits 8a8ed916f and b5fb6736e don't fix this, unfortunately.)

Best regards,
Alexander




Re: Synchronizing slots from primary to standby

2023-12-28 Thread Amit Kapila
On Fri, Dec 29, 2023 at 7:18 AM Masahiko Sawada  wrote:
>
> On Wed, Dec 27, 2023 at 7:13 PM shveta malik  wrote:
> >
> > On Wed, Dec 27, 2023 at 11:36 AM Masahiko Sawada  
> > wrote:
>
> >  I was not aware if there is any way to connect if we
> > want to run SQL queries. I initially tried using 'PQconnectdbParams'
> > but couldn't make it work. Perhaps it is to be used only by front-end
> > and extensions as the header files indicate as well:
> >  * libpq-fe.h :  This file contains definitions for structures and
> >  externs for functions used by frontend postgres applications.
> >  * libpq-be-fe-helpers.h:   Helper functions for using libpq in
> > extensions . Code built directly into the backend is not allowed to
> > link to libpq directly.
>
> Oh I didn't know that. Thank you for pointing it out.
>
> But I'm still concerned it could confuse users that
> pg_stat_replication keeps showing one entry that remains as "startup"
> state. It has the same application_name as the walreceiver uses. For
> example, when users want to check the particular replication
> connection, it's common to filter the entries by the application name.
> But it will end up having duplicate entries having different states.
>

Valid point. The main reason for using cluster_name is that if
multiple standbys connect to the same primary, all will have the same
application_name 'slotsyncworker'. The other alternative could be
to use {cluster_name}_slotsyncworker, which will probably address your
concern and give us the provision to differentiate among
slotsync workers from different standbys.

-- 
With Regards,
Amit Kapila.




Re: Track in pg_replication_slots the reason why slots conflict?

2023-12-28 Thread Amit Kapila
On Thu, Dec 28, 2023 at 2:58 PM shveta malik  wrote:
>
> PFA the v2-patch with all your comments addressed.
>

Does anyone have a preference for a column name? The options on the
table are conflict_cause, conflicting_cause, conflict_reason. Any
others? I was checking docs for similar usage and found
"pg_event_trigger_table_rewrite_reason" function, so based on that we
can even go with conflict_reason.

-- 
With Regards,
Amit Kapila.




Re: POC: GROUP BY optimization

2023-12-28 Thread Andrei Lepikhov

On 28/12/2023 18:29, Alexander Korotkov wrote:

On Thu, Dec 28, 2023 at 10:22 AM Andrei Lepikhov
 wrote:

But arrangement with an ORDER BY clause doesn't work:

DROP INDEX abc;
explain SELECT x,w,z FROM t GROUP BY (w,x,z) ORDER BY (x,z,w);

I think the reason is that the sort_pathkeys and group_pathkeys are
physically different structures, and we can't just compare pointers here.


I haven't yet looked into the code.  But this looks strange to me.
Somehow, optimizer currently matches index pathkeys to ORDER BY
pathkeys.  If GROUP BY pathkeys could be matched to index pathkeys,
then it should be possible to match them to ORDER BY pathkeys too.
Oh, I found the mistake: I got used to using GROUP BY and ORDER BY on 
many columns with round brackets. In the case of the grouping list, it 
doesn't change anything. But ordering treats it as a WholeRowVar and 
breaks group-by arrangement. Look:

explain (COSTS OFF) SELECT relname,reltuples FROM pg_class
GROUP BY relname,reltuples ORDER BY reltuples,relname;

 Group
   Group Key: reltuples, relname
   ->  Sort
 Sort Key: reltuples, relname
 ->  Seq Scan on pg_class
But:
explain (COSTS OFF) SELECT relname,reltuples FROM pg_class
GROUP BY relname,reltuples ORDER BY (reltuples,relname);

 Sort
   Sort Key: (ROW(reltuples, relname))
   ->  Group
 Group Key: relname, reltuples
 ->  Sort
   Sort Key: relname, reltuples
   ->  Seq Scan on pg_class

So, let's continue to work.

--
regards,
Andrei Lepikhov
Postgres Professional





Re: Synchronizing slots from primary to standby

2023-12-28 Thread Masahiko Sawada
On Wed, Dec 27, 2023 at 7:13 PM shveta malik  wrote:
>
> On Wed, Dec 27, 2023 at 11:36 AM Masahiko Sawada  
> wrote:
> >
> > Hi,
> >
> > Thank you for working on this.
> >
> > On Tue, Dec 26, 2023 at 9:27 PM shveta malik  wrote:
> > >
> > > On Tue, Dec 26, 2023 at 4:41 PM Zhijie Hou (Fujitsu)
> > >  wrote:
> > > >
> > > > On Wednesday, December 20, 2023 7:37 PM Amit Kapila 
> > > >  wrote:
> > > > >
> > > > > On Wed, Dec 20, 2023 at 3:29 PM shveta malik 
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Dec 20, 2023 at 9:12 AM Amit Kapila 
> > > > > > 
> > > > > wrote:
> > > > > > >
> > > > > > > On Tue, Dec 19, 2023 at 5:30 PM shveta malik 
> > > > > > > 
> > > > > wrote:
> > > > > > > >
> > > > > > > > Thanks for reviewing. I have addressed these in v50.
> > > > > > > >
> > > > > > >
> > > > > > > I was looking at this patch to see if something smaller could be
> > > > > > > independently committable. I think we can extract
> > > > > > > pg_get_slot_invalidation_cause() and commit it as that function
> > > > > > > could be independently useful as well. What do you think?
> > > > > > >
> > > > > >
> > > > > > Sure, forked another thread [1]
> > > > > > [1]:
> > > > > >
> > > > > https://www.postgresql.org/message-id/CAJpy0uBpr0ym12%2B0mXpjcRFA6
> > > > > N%3D
> > > > > > anX%2BYk9aGU4EJhHNu%3DfWykQ%40mail.gmail.com
> > > > > >
> > > > >
> > > > > Thanks, thinking more, we can split the patch into the following 
> > > > > three patches
> > > > > which can be committed separately (a) Allowing the failover property 
> > > > > to be set
> > > > > for a slot via SQL API and subscription commands
> > > > > (b) sync slot worker infrastructure (c) GUC standby_slot_names and 
> > > > > the
> > > > > corresponding wait logic on the server side.
> > > > >
> > > > > Thoughts?
> > > >
> > > > I agree. Here is the V54 patch set which was split based on the 
> > > > suggestion.
> > > > The commit message in each patch is also improved.
> > > >
> > >
> > > I would like to revisit the current dependency of slotsync worker on
> > > dbname used in 002 patch. Currently we accept dbname in
> > > primary_conninfo and thus the user has to make sure to provide one (by
> > > manually altering it) even in case of a conf file auto-generated by
> > > "pg_basebackup -R".
> > > Thus I would like to discuss if there are better ways to do it.
> > > Complete background is as follow:
> > >
> > > We need dbname for 2 purposes:
> > >
> > > 1) to connect to remote db in order to run SELECT queries to fetch the
> > > info needed by slotsync worker.
> > > 2) to make connection in slot-sync worker itself in order to be able
> > > to use libpq APIs for 1)
> > >
> > > We run 3 kind of select queries in slot-sync worker currently:
> > >
> > > a) To fetch all failover slots (logical slots) info at once in
> > > synchronize_slots().
> > > b) To fetch a particular slot info during
> > > wait_for_primary_slot_catchup() logic (logical slot).
> > > c) To validate primary slot (physical one) and also to distinguish
> > > between standby and cascading standby by running pg_is_in_recovery().
> > >
> > >  1) One approach to avoid dependency on dbname is using commands
> > > instead of SELECT.  This will need implementing LIST_SLOTS command for
> > > a), and for b) we can use LIST_SLOTS and fetch everything (even though
> > > it is not needed) or have LIST_SLOTS with a filter on slot-name or
> > > extend READ_REPLICATION_SLOT,  and for c) we can have some other
> > > command to get pg_is_in_recovery() info. But, I feel by relying on
> > > commands we will be making the extension of the slot-sync feature
> > > difficult. In future, if there is some more requirement to fetch any
> > > other info,
> > > then there too we have to implement a command. I am not sure if it is
> > > a good and extensible approach.
> > >
> > > 2) Another way to avoid asking for a dbname in primary_conninfo is to
> > > use the default dbname internally. This brings us to two questions:
> > > 'How' and 'Which default db'?
> > >
> > > 2.1) To answer 'How':
> > > Using default dbname is simpler for the purpose of slot-sync worker
> > > having its own db-connection, but is a little tricky for the purpose
> > > of connection to remote_db. This is because we have to inject this
> > > dbname internally in our connection-info.
> > >
> > > 2.1.1) Say we use primary_conninfo (i.e. original one w/o dbname),
> > > then currently it could have 2 formats:
> > >
> > > a) The simple "=" format for key-value pairs, example:
> > > 'user=replication host=127.0.0.1 port=5433 dbname=postgres'.
> > > b) URI format, example:
> > > postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp
> > >
> > > We can distinguish between the 2 formats using 'uri_prefix_length' but
> > > injecting the dbname part will be messy, especially for URI format.  If
> > > we want to do it w/o injecting and only by changing libpq interfaces
> > > to accept dbname separately apart from conninfo, then there is no
>

Re: Removing const-false IS NULL quals and redundant IS NOT NULL quals

2023-12-28 Thread Andy Fan


Richard Guo  writes:
>
> The detection of self-inconsistent restrictions already exists in
> planner.
>
> # set constraint_exclusion to on;
> SET
> # explain (costs off) select * from a where a > 3 and a is null;
> QUERY PLAN
> --
>  Result
>One-Time Filter: false
> (2 rows)

It has a different scope and cost from what I suggested.  I'd suggest
detecting only the NOT NULL constraint, which has a lower cost and can be
used in other cases; constraint_exclusion covers more user cases but is
more expensive and is off by default.


Apart from the above topic, I'm wondering whether we should also consider
a case like this:

create table t1(a int);
create table t2(a int);

explain (costs off) select * from t1 join t2 using(a) where a is NULL;
QUERY PLAN 
---
 Hash Join
   Hash Cond: (t2.a = t1.a)
   ->  Seq Scan on t2
   ->  Hash
 ->  Seq Scan on t1
   Filter: (a IS NULL)

Here a is nullable on the base relation side, but we know that the query
would not return anything in the end.  IIUC, there is no good place to
handle this in our current infrastructure, but I still raise it in case
I missed anything.


-- 
Best Regards
Andy Fan





Re: Synchronizing slots from primary to standby

2023-12-28 Thread Masahiko Sawada
On Wed, Dec 27, 2023 at 7:43 PM Amit Kapila  wrote:
>
> On Wed, Dec 27, 2023 at 11:36 AM Masahiko Sawada  
> wrote:
> >
> > On Tue, Dec 26, 2023 at 9:27 PM shveta malik  wrote:
> > >
> > > I would like to revisit the current dependency of slotsync worker on
> > > dbname used in 002 patch. Currently we accept dbname in
> > > primary_conninfo and thus the user has to make sure to provide one (by
> > > manually altering it) even in case of a conf file auto-generated by
> > > "pg_basebackup -R".
> > > Thus I would like to discuss if there are better ways to do it.
> > > Complete background is as follow:
> > >
> > > We need dbname for 2 purposes:
> > >
> > > 1) to connect to remote db in order to run SELECT queries to fetch the
> > > info needed by slotsync worker.
> > > 2) to make connection in slot-sync worker itself in order to be able
> > > to use libpq APIs for 1)
> > >
> > > We run 3 kind of select queries in slot-sync worker currently:
> > >
> > > a) To fetch all failover slots (logical slots) info at once in
> > > synchronize_slots().
> > > b) To fetch a particular slot info during
> > > wait_for_primary_slot_catchup() logic (logical slot).
> > > c) To validate primary slot (physical one) and also to distinguish
> > > between standby and cascading standby by running pg_is_in_recovery().
> > >
> > >  1) One approach to avoid dependency on dbname is using commands
> > > instead of SELECT.  This will need implementing LIST_SLOTS command for
> > > a), and for b) we can use LIST_SLOTS and fetch everything (even though
> > > it is not needed) or have LIST_SLOTS with a filter on slot-name or
> > > extend READ_REPLICATION_SLOT,  and for c) we can have some other
> > > command to get pg_is_in_recovery() info. But, I feel by relying on
> > > commands we will be making the extension of the slot-sync feature
> > > difficult. In future, if there is some more requirement to fetch any
> > > other info,
> > > then there too we have to implement a command. I am not sure if it is
> > > a good and extensible approach.
> > >
> > > 2) Another way to avoid asking for a dbname in primary_conninfo is to
> > > use the default dbname internally. This brings us to two questions:
> > > 'How' and 'Which default db'?
> > >
> > > 2.1) To answer 'How':
> > > Using default dbname is simpler for the purpose of slot-sync worker
> > > having its own db-connection, but is a little tricky for the purpose
> > > of connection to remote_db. This is because we have to inject this
> > > dbname internally in our connection-info.
> > >
> > > 2.1.1) Say we use primary_conninfo (i.e. original one w/o dbname),
> > > then currently it could have 2 formats:
> > >
> > > a) The simple "=" format for key-value pairs, example:
> > > 'user=replication host=127.0.0.1 port=5433 dbname=postgres'.
> > > b) URI format, example:
> > > postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp
> > >
> > > We can distinguish between the 2 formats using 'uri_prefix_length' but
> > > injecting the dbname part will be messy, especially for URI format.  If
> > > we want to do it w/o injecting and only by changing libpq interfaces
> > > to accept dbname separately apart from conninfo, then there is no
> > > current simpler way available. It will need a good amount of changes
> > > in libpq.
> > >
> > > 2.1.2) Another way is to not rely on primary_conninfo directly but
> > > rely on 'WalRcv->conninfo' in order to connect to remote_db. This is
> > > because the latter is never URI format, it is some parsed format and
> > > appending may work. As an example, primary_conninfo =
> > > 'postgresql://replication@localhost:5433', WalRcv->conninfo loaded
> > > internally is:
> > > "user=replication passfile=/home/shveta/.pgpass channel_binding=prefer
> > > dbname=replication host=localhost port=5433
> > > fallback_application_name=walreceiver sslmode=prefer sslcompression=0
> > > sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2
> > > gssencmode=disable krbsrvname=postgres gssdelegation=0
> > > target_session_attrs=any load_balance_hosts=disable", '\000'
> > >
> > > So we can try appending our default dbname to this. But all the
> > > defaults loaded in WalRcv->conninfo need some careful analysis to
> > > figure out if they work for slot-sync worker case.
> > >
> > > 2.2) Now coming to 'Which default db':
> > >
> > > 2.2.1) If we use 'template1' as default db, it may block 'create db'
> > > operations on primary for the time when the slot-sync worker is
> > > connected to remote using this dbname. Example:
> > >
> > > postgres=# create database newdb1;
> > > ERROR:  source database "template1" is being accessed by other users
> > > DETAIL:  There is 1 other session using the database.
> > >
> > > 2.2.2) If we use 'postgres' as default db, there are chances that it
> > > can be dropped as unlike 'template1', it is allowed to be dropped by
> > > user, and if slotsync worker is connected to it, user may see:
> > > newdb1=# drop database postgres;
> > > ERR

Re: Fix Brin Private Spool Initialization (src/backend/access/brin/brin.c)

2023-12-28 Thread Tomas Vondra


On 12/27/23 12:37, Ranier Vilela wrote:
> On Tue, Dec 26, 2023 at 19:07, Tomas Vondra  wrote:
> 
> 
> 
> On 12/26/23 19:10, Ranier Vilela wrote:
> > Hi,
> >
> > The commit b437571 I think has an oversight.
> > When allocate memory and initialize private spool in function:
> > _brin_leader_participate_as_worker
> >
> > The behavior is the bs_spool (heap and index fields)
> > are left empty.
> >
> > The code affected is:
> >   buildstate->bs_spool = (BrinSpool *) palloc0(sizeof(BrinSpool));
> > - buildstate->bs_spool->heap = buildstate->bs_spool->heap;
> > - buildstate->bs_spool->index = buildstate->bs_spool->index;
> > + buildstate->bs_spool->heap = heap;
> > + buildstate->bs_spool->index = index;
> >
> > Is the fix correct?
> >
> 
> Thanks for noticing this.
> 
> You're welcome.
>  
> 
> Yes, I believe this is a bug - the assignments
> are certainly wrong, it leaves the fields set to NULL.
> 
> I wonder how come this didn't fail during testing. Surely, if the leader
> participates as a worker, the tuplesort_begin_index_brin shall be called
> with heap/index being NULL, leading to some failure during the sort. But
> maybe this means we don't actually need the heap/index fields, it's just
> a copy of TuplesortIndexArg, but BRIN does not need that because we sort
> the tuples by blkno, and we don't need the descriptors for that.
> 
> Unfortunately I can't test on Windows, since I can't build with meson on
> Windows.
> 
> 
> In any case, the _brin_parallel_scan_and_build does not actually need
> the separate heap/index arguments, those are already in the spool.
> 
> Yeah, for sure.
> 
> 
> I'll try to figure out if we want to simplify the tuplesort or remove
> the arguments from _brin_parallel_scan_and_build.
> 

Here is a patch simplifying the BRIN parallel create code a little bit.
As I suspected, we don't need the heap/index in the spool at all, and we
don't need to pass it to tuplesort_begin_index_brin either - we only
need blkno, and we have that in the datum1 field. This also means we
don't need TuplesortIndexBrinArg.

I'll push this tomorrow, probably.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 23f081389b2..a58f662f2cf 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -56,8 +56,6 @@
 typedef struct BrinSpool
 {
 	Tuplesortstate *sortstate;	/* state data for tuplesort.c */
-	Relation	heap;
-	Relation	index;
 } BrinSpool;
 
 /*
@@ -1144,8 +1142,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	   RelationGetNumberOfBlocks(heap));
 
 	state->bs_spool = (BrinSpool *) palloc0(sizeof(BrinSpool));
-	state->bs_spool->heap = heap;
-	state->bs_spool->index = index;
 
 	/*
 	 * Attempt to launch parallel worker scan when required
@@ -1200,8 +1196,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 * factor.
 		 */
 		state->bs_spool->sortstate =
-			tuplesort_begin_index_brin(heap, index,
-	   maintenance_work_mem, coordinate,
+			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 	   TUPLESORT_NONE);
 
 		/*
@@ -2706,8 +2701,6 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 
 	/* Allocate memory and initialize private spool */
 	buildstate->bs_spool = (BrinSpool *) palloc0(sizeof(BrinSpool));
-	buildstate->bs_spool->heap = buildstate->bs_spool->heap;
-	buildstate->bs_spool->index = buildstate->bs_spool->index;
 
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
@@ -2736,8 +2729,8 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 static void
 _brin_parallel_scan_and_build(BrinBuildState *state, BrinSpool *brinspool,
 			  BrinShared *brinshared, Sharedsort *sharedsort,
-			  Relation heap, Relation index, int sortmem,
-			  bool progress)
+			  Relation heap, Relation index,
+			  int sortmem, bool progress)
 {
 	SortCoordinate coordinate;
 	TableScanDesc scan;
@@ -2751,9 +2744,7 @@ _brin_parallel_scan_and_build(BrinBuildState *state, BrinSpool *brinspool,
 	coordinate->sharedsort = sharedsort;
 
 	/* Begin "partial" tuplesort */
-	brinspool->sortstate = tuplesort_begin_index_brin(brinspool->heap,
-	  brinspool->index,
-	  sortmem, coordinate,
+	brinspool->sortstate = tuplesort_begin_index_brin(sortmem, coordinate,
 	  TUPLESORT_NONE);
 
 	/* Join parallel scan */
@@ -2763,8 +2754,8 @@ _brin_parallel_scan_and_build(BrinBuildState *state, BrinSpool *brinspool,
 	scan = table_beginscan_parallel(heap,
 	Parall

Re: Revise the Asserts added to bimapset manipulation functions

2023-12-28 Thread David Rowley
On Thu, 28 Dec 2023 at 23:21, David Rowley  wrote:
> then instead of having to do:
>
> #ifdef REALLOCATE_BITMAPSETS
> a = (Bitmapset *) palloc(BITMAPSET_SIZE(tmp->nwords));
> memcpy(a, tmp, BITMAPSET_SIZE(tmp->nwords));
> pfree(tmp);
> #endif
>
> all over the place.  Couldn't we do:
>
> #ifdef REALLOCATE_BITMAPSETS
>  return bms_clone_and_free(a);
> #else
>  return a;
> #endif

I looked into this a bit more and I just can't see why the current
code feels like it must do the reallocation in functions such as
bms_del_members().  We don't repalloc the set there, ever, so why do
we need to do it when building with REALLOCATE_BITMAPSETS?  It seems
to me the point of REALLOCATE_BITMAPSETS is to ensure that *if* we
possibly could end up reallocating the set that we *do* reallocate.

There's also a few cases where you could argue that the
REALLOCATE_BITMAPSETS code has introduced bugs.  For example,
bms_del_members(), bms_join() and bms_int_members() are meant to
guarantee that their left input (both inputs in the case of
bms_join()) are recycled. Compiling with REALLOCATE_BITMAPSETS
invalidates that, so it seems about as likely that building with
REALLOCATE_BITMAPSETS could cause bugs rather than find bugs.

I've hacked a bit on this to fix these problems and also added some
documentation to try to offer future people modifying bitmapset.c some
guidance on if they need to care about REALLOCATE_BITMAPSETS.

I also don't think "hangling" is a word. So I've adjusted the comment
in pg_config_manual.h to fix that.

David
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index e13ecaa155..e0eeefa8bd 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -16,6 +16,15 @@
  * containing only a single word (likely the majority of them) this halves the
  * number of loop condition checks.
  *
+ * Because manipulation of a bitmapset can often be done without having to
+ * reallocate the memory for the set, we support REALLOCATE_BITMAPSETS as a
+ * debug option to have the code *always* reallocate the bitmapset when the
+ * set *could* be reallocated as a result of the modification.  This increases
+ * the likelihood of finding bugs in cases where there's more than one
+ * reference to a set and the additional ones are left pointing to freed
+ * memory.  Such bugs are unlikely to be otherwise found by our regression
+ * tests as the vast majority of bitmapsets will only ever contain a single
+ * bitmapword.
  *
  * Copyright (c) 2003-2023, PostgreSQL Global Development Group
  *
@@ -72,6 +81,45 @@
 #error "invalid BITS_PER_BITMAPWORD"
 #endif
 
+#ifdef USE_ASSERT_CHECKING
+/*
+ * bms_is_valid_set - for cassert builds to check for valid sets
+ */
+static bool
+bms_is_valid_set(const Bitmapset *a)
+{
+   /* NULL is the correct representation of an empty set */
+   if (a == NULL)
+   return true;
+
+   /* check the node tag is set correctly.  Freed pointer, maybe? */
+   if (!IsA(a, Bitmapset))
+   return false;
+
+   /* trailing zero words are not allowed */
+   if (a->words[a->nwords - 1] == 0)
+   return false;
+
+   return true;
+}
+#endif
+
+#ifdef REALLOCATE_BITMAPSETS
+/*
+ * bms_copy_and_free
+ * Only required in REALLOCATE_BITMAPSETS builds.  Provide a simple way to
+ * return a freshly allocated set similar to what would happen if the set
+ * had to be enlarged to add additional words.
+ */
+static Bitmapset *
+bms_copy_and_free(Bitmapset *a)
+{
+   Bitmapset  *c = bms_copy(a);
+
+   bms_free(a);
+   return c;
+}
+#endif
 
 /*
  * bms_copy - make a palloc'd copy of a bitmapset
@@ -82,9 +130,11 @@ bms_copy(const Bitmapset *a)
Bitmapset  *result;
size_t  size;
 
+   Assert(bms_is_valid_set(a));
+
if (a == NULL)
return NULL;
-   Assert(IsA(a, Bitmapset));
+
size = BITMAPSET_SIZE(a->nwords);
result = (Bitmapset *) palloc(size);
memcpy(result, a, size);
@@ -99,8 +149,8 @@ bms_equal(const Bitmapset *a, const Bitmapset *b)
 {
int i;
 
-   Assert(a == NULL || (IsA(a, Bitmapset) && a->words[a->nwords - 1] != 0));
-   Assert(b == NULL || (IsA(b, Bitmapset) && b->words[b->nwords - 1] != 0));
+   Assert(bms_is_valid_set(a));
+   Assert(bms_is_valid_set(b));
 
/* Handle cases where either input is NULL */
if (a == NULL)
@@ -140,8 +190,8 @@ bms_compare(const Bitmapset *a, const Bitmapset *b)
 {
int i;
 
-   Assert(a == NULL || (IsA(a, Bitmapset) && a->words[a->nwords - 1] != 0));
-   Assert(b == NULL || (IsA(b, Bitmapset) && b->words[b->nwords - 1] != 0));
+   Assert(bms_is_valid_set(a));
+   Assert(bms_is_valid_set(b));
 
/* Handle cases where either input is NULL */
if (a == NULL)
@@ -216,8 +266,8 @@ bms_union(const Bitmapset *a, const Bitmapset *b)
in

Re: Statistics Import and Export

2023-12-28 Thread Tomas Vondra
On 12/13/23 11:26, Corey Huinker wrote:
> Yeah, that was the simplest output function possible, it didn't seem
>
> worth it to implement something more advanced. pg_mcv_list_items() is
> more convenient for most needs, but it's quite far from the on-disk
> representation.
>
>
> I was able to make it work.
>
>
>
> That's actually a good question - how closely should the exported data
> be to the on-disk format? I'd say we should keep it abstract, not tied
> to the details of the on-disk format (which might easily change
> between versions).
>
>
> For the most part, I chose the exported data json types and formats in a
> way that was the most accommodating to cstring input functions. So,
> while so many of the statistic values are obviously only ever
> integers/floats, those get stored as a numeric data type which lacks
> direct numeric->int/float4/float8 functions (though we could certainly
> create them, and I'm not against that), casting them to text lets us
> leverage pg_strtoint16, etc.
>
>
>
> I'm a bit confused about the JSON schema used in pg_statistic_export
> view, though. It simply serializes stakinds, stavalues, stanumbers into
> arrays ... which works, but why not use the JSON nesting? I mean,
> there could be a nested document for histogram, MCV, ... with just the
> correct fields.
>
>   {
> ...
> histogram : { stavalues: [...] },
> mcv : { stavalues: [...], stanumbers: [...] },
> ...
>   }
>
>
> That's a very good question. I went with this format because it was
> fairly straightforward to code in SQL using existing JSON/JSONB
> functions, and that's what we will need if we want to export statistics
> on any server currently in existence. I'm certainly not locked in with
> the current format, and if it can be shown how to transform the data
> into a superior format, I'd happily do so.
>
> and so on. Also, what does TRIVIAL stand for?
>
>
> It's currently serving double-duty for "there are no stats in this slot"
> and the situations where the stats computation could draw no conclusions
> about the data.
>
> Attached is v3 of this patch. Key features are:
>
> * Handles regular pg_statistic stats for any relation type.
> * Handles extended statistics.
> * Export views pg_statistic_export and pg_statistic_ext_export to allow
> inspection of existing stats and saving those values for later use.
> * Import functions pg_import_rel_stats() and pg_import_ext_stats() which
> take Oids as input. This is intentional to allow stats from one object
> to be imported into another object.
> * User scripts pg_export_stats and pg_import stats, which offer a
> primitive way to serialize all the statistics of one database and import
> them into another.
> * Has regression test coverage for both with a variety of data types.
> * Passes my own manual test of extracting all of the stats from a v15
> version of the popular "dvdrental" example database, as well as some
> additional extended statistics objects, and importing them into a
> development database.
> * Import operations never touch the heap of any relation outside of
> pg_catalog. As such, this should be significantly faster than even the
> most cursory analyze operation, and therefore should be useful in
> upgrade situations, allowing the database to work with "good enough"
> stats more quickly, while still allowing for regular autovacuum to
> recalculate the stats "for real" at some later point.
>
> The relation statistics code was adapted from similar features in
> analyze.c, but is now done in a query context. As before, the
> rowcount/pagecount values are updated on pg_class in a non-transactional
> fashion to avoid table bloat, while the updates to pg_statistic and
> pg_statistic_ext_data are done transactionally.
>
> The existing statistics _store() functions were leveraged wherever
> practical, so much so that the extended statistics import is mostly just
> adapting the existing _build() functions into _import() functions which
> pull their values from JSON rather than computing the statistics.
>
> Current concerns are:
>
> 1. I had to code a special-case exception for MCELEM stats on array data
> types, so that the array_in() call uses the element type rather than the
> array type. I had assumed that the existing examine_attribute()
> functions would have properly derived the typoid for that column, but it
> appears to not be the case, and I'm clearly missing how the existing
> code gets it right.
Hmm, after looking at this, I'm not sure it's such an ugly hack ...

The way this works for ANALYZE is that examine_attribute() eventually
calls the typanalyze function:

  if (OidIsValid(stats->attrtype->typanalyze))
ok = DatumGetBool(OidFunctionCall1(stats->attrtype->typanalyze,
   PointerGetDatum(stats)));

which for arrays is array_typanalyze, and this sets stats->extra_data to
ArrayAnalyzeExtraData with all the int

Re: Avoid computing ORDER BY junk columns unnecessarily

2023-12-28 Thread Tom Lane
I wrote:
> Yeah, fair point.  I'll try to take a look at your patchset after
> the holidays.

I spent some time looking at this patch, and I'm not very pleased
with it.  My basic complaint is that this is a band-aid that only
touches things at a surface level, whereas I think we need a much
deeper set of changes in order to have a plausible solution.
Specifically, you're only munging the final top-level targetlist
not anything for lower plan levels.  That looks like it works in
simple cases, but it leaves everything on the table as soon as
there's more than one level of plan involved.  I didn't stop to
trace the details but I'm pretty sure this is why you're getting the
bogus extra projections shown in the 0005 patch.  Moreover, this
isn't doing anything for cost estimates, which means the planner
doesn't really understand that more-desirable plans are more
desirable, and it may not pick an available plan that would exploit
what we want to have happen here.

As an example, say we have an index on f(x) and the query requires
sorting by f(x) but not final output of f(x).  If we devise a plan
that uses the index to sort and doesn't emit f(x), we need to not
charge the evaluation cost of f() for that plan --- this might
make the difference between picking that plan and not.  Right now
I believe we'll charge all plans for f(), so that some other plan
might look cheaper even when f() is very expensive.
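
(As a made-up concrete example of that shape:

	CREATE INDEX ON t (f(x));
	SELECT x FROM t ORDER BY f(x);

where the ORDER BY expression becomes a resjunk column that an indexscan
could in principle satisfy without computing f() at all.)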

Another example: we might be using an indexscan but not relying on
its sort order, for example because its output is going into a hash
join and then we'll sort afterwards.  For expensive f() it would
still be useful if the index could emit f(x) so that we don't have
to calculate f() at the sort step.  Right now I don't think we can
even emit that plan shape, never mind recognize why it's cheaper.

I have only vague ideas about how to do this better.  It might work
to set up multiple PathTargets for a relation that has such an
index, one for the base case where the scan only emits x and f() is
computed above, one for the case where we don't need either x or
f(x) because we're relying on the index order, and one that emits
f(x) with the expectation that a sort will happen above.  Then we'd
potentially generate multiple Paths representing the very same
indexscan but with different PathTargets, and differing targets
would have to become something that add_path treats as a reason to
keep multiple Paths for the same relation.  I'm a little frightened
about the possible growth in number of paths considered, but how
else would we keep track of the differing costs of these cases?

So your proposed patch is far from a final solution in this area,
and I'm not even sure that it represents a plausible incremental
step.  I've got mixed feelings about the "make resjunk an enum"
idea --- it seems quite invasive and I'm not sure it's buying
enough to risk breaking non-planner code for.  It might be
better to leave the targetlist representation alone and deal
with this strictly inside the planner.  We could, for example,
add an array to PlannerInfo that's indexed by query-tlist resno
and indicates the reason for a particular tlist item to exist.

Also, as you mentioned, this categorization of junk columns
seems a little unprincipled.  It might help to think about that
in terms similar to the "attr_needed" data we keep for Vars,
that is: how far up in the plan tree is this tlist item needed?
I think the possibilities are:

* needed for final output (ie, not resjunk)

* needed for final grouping/sorting steps (I think all
  resjunk items produced by the parser are this case;
  but do we need to distinguish GROUP BY from ORDER BY?)
  
* needed at the ModifyTable node (for rowmark columns)

* needed at the SetOp node (for flag columns added in
  prepunion.c)

It looks to me like these cases are mutually exclusive, so
that we just need an enum value not a bitmask.  Only
"needed for final grouping/sorting" is something we can
potentially omit from the plan.

BTW, the hack you invented JUNK_PLANNER_ONLY for, namely
that create_indexscan_plan feeds canreturn flags forward to
set_indexonlyscan_references, could be done another way:
we don't really have to overload resjunk for that.  At worst,
set_indexonlyscan_references could re-look-up the IndexOptInfo
using the index OID from the plan node, and get the canreturn
flags directly from that.  If we keep resjunk as a bool I'd be
inclined to leave this alone; but if we change resjunk, we
don't need the replacement design to account for this bit.

regards, tom lane




Re: The segmentation fault of Postgresql 9.6.24

2023-12-28 Thread Bruce Momjian
On Thu, Dec 28, 2023 at 11:20:18PM +0100, Tomas Vondra wrote:
> This could be pretty much anything, and without seeing where exactly it
> fails it's impossible to say. I see you apparently hit the issue
> repeatedly, and all the information is *exactly* the same - addresses,
> code, etc. Try decoding the addresses with addr2line, or even better get
> a proper backtrace - either from a core file, or using gdb.

Also, Postgres 9.6 went out of support on November 11, 2021:

https://www.postgresql.org/support/versioning/

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: The segmentation fault of Postgresql 9.6.24

2023-12-28 Thread Tomas Vondra


On 12/28/23 21:09, Kevin Wang wrote:
> Hello hackers,
> 
> Our prod databases are still PG 9.6.24.  We have one primary plus 3
> stream replications that are all working well for a long time. 

Everything is working well until the day it breaks ...

> However, when I promoted one standby database to the primary role,
> we got the below error message from the PG log:
> ===
> 2023-12-01 06:57:35.541 UTC,,,1553,,6569738f.611,639,,2023-12-01
> 05:47:59 UTC,,0,LOG,0,"server process (PID 31839) was terminated by
> signal 11: Segmentation fault","Failed process was running: UPDATE 
> SET employee_id = (9489910) WHERE id = (1162120221)"""
> 
> 
> 
> Here is the message from dmesg:
> ===
> [ 3676.406247] postgres[27789]: segfault at 0 ip 5618bf79bfe4 sp
> 7ffcd9a75dc8 error 4 in postgres[5618bf3db000+3f7000]
> [ 3676.406265] Code: ff ff 48 83 c2 40 ff d0 e8 19 9c ff ff e8 44 0f c4
> ff 0f 1f 40 00 f3 0f 1e fa e9 27 be cc ff 0f 1f 80 00 00 00 00 f3 0f 1e
> fa <0f> b6 17 89 d1
>  83 e1 03 80 f9 02 74 0f 80 fa 01 74 0a 48 89 f8 c3
> [ 3715.937850] postgres[27928]: segfault at 0 ip 5618bf79bfe4 sp
> 7ffcd9a75dc8 error 4 in postgres[5618bf3db000+3f7000]
> [ 3715.937858] Code: ff ff 48 83 c2 40 ff d0 e8 19 9c ff ff e8 44 0f c4
> ff 0f 1f 40 00 f3 0f 1e fa e9 27 be cc ff 0f 1f 80 00 00 00 00 f3 0f 1e
> fa <0f> b6 17 89 d1
>  83 e1 03 80 f9 02 74 0f 80 fa 01 74 0a 48 89 f8 c3
> [ 3732.278367] postgres[28212]: segfault at 0 ip 5618bf79bfe4 sp
> 7ffcd9a75dc8 error 4 in postgres[5618bf3db000+3f7000]
> [ 3732.278384] Code: ff ff 48 83 c2 40 ff d0 e8 19 9c ff ff e8 44 0f c4
> ff 0f 1f 40 00 f3 0f 1e fa e9 27 be cc ff 0f 1f 80 00 00 00 00 f3 0f 1e
> fa <0f> b6 17 89 d1
>  83 e1 03 80 f9 02 74 0f 80 fa 01 74 0a 48 89 f8 c3
> 
> Error 4 is the error related to unmapping memory. But the database worked
> well for a long time as the standby database. After it was promoted to the
> primary role, no memory parameters changed at all.
> 

Why do you think "4" means unmapping memory? 4 is the error code for
"user-mode access" (i.e. not an invalid memory access from the kernel).

> Could you give us some hint where to fix this issue?
> 

This could be pretty much anything, and without seeing where exactly it
fails it's impossible to say. I see you apparently hit the issue
repeatedly, and all the information is *exactly* the same - addresses,
code, etc. Try decoding the addresses with addr2line, or even better get
a proper backtrace - either from a core file, or using gdb.
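
For example (the binary path here is just a guess, adjust to your install;
the offset is the faulting ip minus the mapping base from the dmesg line,
i.e. 0x5618bf79bfe4 - 0x5618bf3db000 = 0x3c0fe4):

	addr2line -f -e /usr/lib/postgresql/9.6/bin/postgres 0x3c0fe4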


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Pdadmin open on Macbook issue

2023-12-28 Thread Bruce Momjian
On Thu, Dec 28, 2023 at 04:12:06PM +0800, Anita wrote:
> Dear Sir/Madam,
> 
> I have used Pdadmin on my Macbook for a few days, then suddenly it can not be
> opened anymore (after I tried to force-quit it when turning off the laptop).
> I have tried many solutions to fix this issue but none of them worked.
> Therefore, can you please help me solve it?

For pgAdmin support, see:

https://www.pgadmin.org/support/list/
https://github.com/pgadmin-org/pgadmin4/issues

or email

pgadmin-supp...@postgresql.org

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




The segmentation fault of Postgresql 9.6.24

2023-12-28 Thread Kevin Wang
Hello hackers,

Our prod databases are still PG 9.6.24.  We have one primary plus 3 stream
replications that are all working well for a long time. However, when I
promoted one standby database to the primary role, we got the below error
message from the PG log:
===
2023-12-01 06:57:35.541 UTC,,,1553,,6569738f.611,639,,2023-12-01 05:47:59
UTC,,0,LOG,0,"server process (PID 31839) was terminated by signal 11:
Segmentation fault","Failed process was running: UPDATE  SET
employee_id = (9489910) WHERE id = (1162120221)"""



Here is the message from dmesg:
===
[ 3676.406247] postgres[27789]: segfault at 0 ip 5618bf79bfe4 sp
7ffcd9a75dc8 error 4 in postgres[5618bf3db000+3f7000]
[ 3676.406265] Code: ff ff 48 83 c2 40 ff d0 e8 19 9c ff ff e8 44 0f c4 ff
0f 1f 40 00 f3 0f 1e fa e9 27 be cc ff 0f 1f 80 00 00 00 00 f3 0f 1e fa
<0f> b6 17 89 d1
 83 e1 03 80 f9 02 74 0f 80 fa 01 74 0a 48 89 f8 c3
[ 3715.937850] postgres[27928]: segfault at 0 ip 5618bf79bfe4 sp
7ffcd9a75dc8 error 4 in postgres[5618bf3db000+3f7000]
[ 3715.937858] Code: ff ff 48 83 c2 40 ff d0 e8 19 9c ff ff e8 44 0f c4 ff
0f 1f 40 00 f3 0f 1e fa e9 27 be cc ff 0f 1f 80 00 00 00 00 f3 0f 1e fa
<0f> b6 17 89 d1
 83 e1 03 80 f9 02 74 0f 80 fa 01 74 0a 48 89 f8 c3
[ 3732.278367] postgres[28212]: segfault at 0 ip 5618bf79bfe4 sp
7ffcd9a75dc8 error 4 in postgres[5618bf3db000+3f7000]
[ 3732.278384] Code: ff ff 48 83 c2 40 ff d0 e8 19 9c ff ff e8 44 0f c4 ff
0f 1f 40 00 f3 0f 1e fa e9 27 be cc ff 0f 1f 80 00 00 00 00 f3 0f 1e fa
<0f> b6 17 89 d1
 83 e1 03 80 f9 02 74 0f 80 fa 01 74 0a 48 89 f8 c3

Error 4 is the error related to unmapping memory. But the database worked
well for a long time as the standby database. After it was promoted to the
primary role, no memory parameters changed at all.

Could you give us some hint where to fix this issue?

Regards,

Kevin


Pdadmin open on Macbook issue

2023-12-28 Thread Anita

Dear Sir/Madam,

I have used Pdadmin on my Macbook for a few days, then suddenly it can not be
opened anymore (after I tried to force-quit it when turning off the laptop).
I have tried many solutions to fix this issue but none of them worked.
Therefore, can you please help me solve it?

thanks,

Anita

-- 
Girl can run the world
  


PostgreSQL 16.1 dict_snowball.so: undefined symbol: CurrentMemoryContext

2023-12-28 Thread Vitalijus Jefišovas
Hello,
I am having issues compiling PostgreSQL 16.1 targeting the ARM32 architecture.
I am using buildroot, which currently has version 15.5, and that compiles and runs well.
Currently using GCC 11 with binutils 2.39.

During initdb, it gives error message:
[356] FATAL:  could not load library "/pgsql16/lib/dict_snowball.so": 
/pgsql16/lib/dict_snowball.so: undefined symbol: CurrentMemoryContext
STATEMENT:  CREATE FUNCTION dsnowball_init(INTERNAL)
RETURNS INTERNAL AS '$libdir/dict_snowball', 'dsnowball_init'
LANGUAGE C STRICT;

I've traced the build log, and this is how it builds for version 13:
/home/build/panelpi4/host/bin/arm-buildroot-linux-gnueabihf-gcc -Wall 
-Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla 
-Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard 
-Wno-format-truncation -Wno-stringop-truncation -D_LARGEFILE_SOURCE 
-D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64  -O1 -g0  -fPIC 
-I../../../src/include/snowball -I../../../src/include/snowball/libstemmer 
-I../../../src/include  -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE 
-D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE 
-I/home/build/panelpi4/host/arm-buildroot-linux-gnueabihf/sysroot/usr/bin/../../usr/include/libxml2
   -c -o dict_snowball.o dict_snowball.c

And this is version 16.1:
/home/build/panelpi4/host/bin/arm-buildroot-linux-gnueabihf-gcc -Wall 
-Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla 
-Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 
-Wcast-function-type -Wshadow=compatible-local -Wformat-security 
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation 
-Wno-stringop-truncation -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE 
-D_FILE_OFFSET_BITS=64  -O1 -g0  -fPIC -fvisibility=hidden 
-I../../../src/include/snowball -I../../../src/include/snowball/libstemmer 
-I../../../src/include  -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE 
-D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE 
-I/home/build/panelpi4/host/arm-buildroot-linux-gnueabihf/sysroot/usr/bin/../../usr/include/libxml2
   -c -o dict_snowball.o dict_snowball.c

So the new thing is "-fvisibility=hidden", and there is a discussion about it from a
year ago:
https://www.postgresql.org/message-id/20220718220938.5yx7if2mru3rd6jc%40awork3.anarazel.de
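
(I suppose one quick check would be whether the backend binary that initdb
runs actually exports the symbol, e.g.

nm -D /pgsql16/bin/postgres | grep CurrentMemoryContext

-- the path is only a guess based on the error message above.)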

Any tips to work around this issue?
Thanks!





Re: broken master regress tests

2023-12-28 Thread Alexander Lakhin

28.12.2023 20:36, Jeff Davis wrote:

We do want that test to run though, right?


Yes, I think so.


I suspect that test line never worked reliably. The skip_test check at
the top guarantees that the collation named "en_US" exists, but that
doesn't mean that the OS understands the locale 'en_US'.

Perhaps we can change that line to use a similar trick as what's used
elsewhere in the file:

   do $$
   BEGIN
 EXECUTE 'CREATE COLLATION ctest_det (locale = ' ||
 quote_literal((SELECT collcollate FROM pg_collation WHERE
collname = ''en_US'')) || ', deterministic = true);';
   END
   $$;

The above may need some adjustment, but perhaps you can try it out?


Yes, this trick resolves the issue, it gives locale 'en-US' on that OS,
which works there. Please see the attached patch.

But looking at the result, with the comment above that "do" block, I wonder
whether this successful CREATE COLLATION command is important enough to
perform in that tricky way, when all we want is to demonstrate that
nondeterministic collations are not supported.
So in case you decide just to remove this command, please see the second
patch.

Best regards,
Alexander

diff --git a/src/test/regress/expected/collate.windows.win1252.out b/src/test/regress/expected/collate.windows.win1252.out
index b7b93959de..bb6297a113 100644
--- a/src/test/regress/expected/collate.windows.win1252.out
+++ b/src/test/regress/expected/collate.windows.win1252.out
@@ -992,7 +992,6 @@ drop type textrange_c;
 drop type textrange_en_us;
 -- nondeterministic collations
 -- (not supported with libc provider)
-CREATE COLLATION ctest_det (locale = 'en_US', deterministic = true);
 CREATE COLLATION ctest_nondet (locale = 'en_US', deterministic = false);
 ERROR:  nondeterministic collations not supported with this provider
 -- cleanup
diff --git a/src/test/regress/sql/collate.windows.win1252.sql b/src/test/regress/sql/collate.windows.win1252.sql
index 353d769a5b..6246d89634 100644
--- a/src/test/regress/sql/collate.windows.win1252.sql
+++ b/src/test/regress/sql/collate.windows.win1252.sql
@@ -400,8 +400,6 @@ drop type textrange_en_us;
 
 -- nondeterministic collations
 -- (not supported with libc provider)
-
-CREATE COLLATION ctest_det (locale = 'en_US', deterministic = true);
 CREATE COLLATION ctest_nondet (locale = 'en_US', deterministic = false);
 
 
diff --git a/src/test/regress/expected/collate.windows.win1252.out b/src/test/regress/expected/collate.windows.win1252.out
index b7b93959de..97cbfc86de 100644
--- a/src/test/regress/expected/collate.windows.win1252.out
+++ b/src/test/regress/expected/collate.windows.win1252.out
@@ -992,7 +992,13 @@ drop type textrange_c;
 drop type textrange_en_us;
 -- nondeterministic collations
 -- (not supported with libc provider)
-CREATE COLLATION ctest_det (locale = 'en_US', deterministic = true);
+do $$
+BEGIN
+  EXECUTE 'CREATE COLLATION ctest_det (locale = ' ||
+  quote_literal((SELECT collcollate FROM pg_collation WHERE
+  collname = 'en_US')) || ', deterministic = true);';
+  END
+$$;
 CREATE COLLATION ctest_nondet (locale = 'en_US', deterministic = false);
 ERROR:  nondeterministic collations not supported with this provider
 -- cleanup
diff --git a/src/test/regress/sql/collate.windows.win1252.sql b/src/test/regress/sql/collate.windows.win1252.sql
index 353d769a5b..727d136f2f 100644
--- a/src/test/regress/sql/collate.windows.win1252.sql
+++ b/src/test/regress/sql/collate.windows.win1252.sql
@@ -400,8 +400,13 @@ drop type textrange_en_us;
 
 -- nondeterministic collations
 -- (not supported with libc provider)
-
-CREATE COLLATION ctest_det (locale = 'en_US', deterministic = true);
+do $$
+BEGIN
+  EXECUTE 'CREATE COLLATION ctest_det (locale = ' ||
+  quote_literal((SELECT collcollate FROM pg_collation WHERE
+  collname = 'en_US')) || ', deterministic = true);';
+  END
+$$;
 CREATE COLLATION ctest_nondet (locale = 'en_US', deterministic = false);
 
 


Windows sockets (select missing events?)

2023-12-28 Thread Ranier Vilela
Hi,

The type of the field fd_count is defined in winsock.h:

typedef unsigned int u_int;

So in the struct fd_set, the field fd_count is unsigned int.

The pgwin32_select function has loops that use *int* as indices.

Question: on Windows, couldn't the socket select function be missing some events?

Even if not, it would be good preventive practice to use the correct type,
right?
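
The idea being, roughly (for illustration only -- the attached patch is the
actual proposal; "set" stands for whichever fd_set the loop walks):

	u_int		i;			/* instead of int, matching fd_count's type */

	for (i = 0; i < set->fd_count; i++)
		...

so the comparison is done between values of the same unsigned type.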

Patch attached.

best regards,
Ranier Vilela


001-fix-socket-select-events.patch
Description: Binary data


Re: Statistics Import and Export

2023-12-28 Thread Bruce Momjian
On Thu, Dec 28, 2023 at 12:28:06PM -0500, Corey Huinker wrote:
> What I am proposing is *import* functions.  I didn't say anything about
> how pg_dump obtains the data it prints; however, I would advocate that
> we keep that part as simple as possible.  You cannot expect export
> functionality to know the requirements of future server versions,
> so I don't think it's useful to put much intelligence there.
> 
> True, but presumably you'd be using the pg_dump/pg_upgrade of that future
> version to do the exporting, so the export format would always be tailored to
> the importer's needs.

I think the question is whether we will have the export functionality in
the old cluster, or if it will be queries run by pg_dump and therefore
also run by pg_upgrade calling pg_dump.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: broken master regress tests

2023-12-28 Thread Jeff Davis
On Thu, 2023-12-28 at 18:00 +0300, Alexander Lakhin wrote:
> AFAICS, before that commit SELECT getdatabaseencoding() in the test
> returned SQL_ASCII, hence the test was essentially skipped, but now
> it
> returns WIN1252, so problematic CREATE COLLATION(locale = 'en_US',
> ...)
> is reached.

We do want that test to run though, right?

I suspect that test line never worked reliably. The skip_test check at
the top guarantees that the collation named "en_US" exists, but that
doesn't mean that the OS understands the locale 'en_US'.

Perhaps we can change that line to use a similar trick as what's used
elsewhere in the file:

  do $$
  BEGIN
EXECUTE 'CREATE COLLATION ctest_det (locale = ' ||
quote_literal((SELECT collcollate FROM pg_collation WHERE
collname = ''en_US'')) || ', deterministic = true);';
  END
  $$;

The above may need some adjustment, but perhaps you can try it out?
Another option might be to use \gset to assign it to a variable, which
might be more readable, but I think it's better to just follow what the
rest of the file is doing.
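
(Roughly, the \gset variant would look like this -- untested:

SELECT collcollate AS ctest_locale
  FROM pg_collation WHERE collname = 'en_US' \gset
CREATE COLLATION ctest_det (locale = :'ctest_locale', deterministic = true);

but as said, following the rest of the file seems better.)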

Regards,
Jeff Davis





Re: Statistics Import and Export

2023-12-28 Thread Corey Huinker
On Wed, Dec 27, 2023 at 10:10 PM Tom Lane  wrote:

> Corey Huinker  writes:
> > Export functions was my original plan, for simplicity, maintenance, etc,
> > but it seemed like I'd be adding quite a few functions, so the one view
> > made more sense for an initial version. Also, I knew that pg_dump or some
> > other stats exporter would have to inline the guts of those functions
> into
> > queries for older versions, and adapting a view definition seemed more
> > straightforward for the reader than function definitions.
>
> Hmm, I'm not sure we are talking about the same thing at all.
>

Right, I was conflating two things.


>
> What I am proposing is *import* functions.  I didn't say anything about
> how pg_dump obtains the data it prints; however, I would advocate that
> we keep that part as simple as possible.  You cannot expect export
> functionality to know the requirements of future server versions,
> so I don't think it's useful to put much intelligence there.
>

True, but presumably you'd be using the pg_dump/pg_upgrade of that future
version to do the exporting, so the export format would always be tailored
to the importer's needs.


>
> So I think pg_dump should produce a pretty literal representation of
> what it finds in the source server's catalog, and then rely on the
> import functions in the destination server to make sense of that
> and do whatever slicing-n-dicing is required.
>

Obviously it can't be purely literal, as we have to replace the oid values
with whatever text representation we feel helps us carry forward. In
addition, we're setting the number of tuples and number of pages directly
in pg_class, and doing so non-transactionally just like ANALYZE does. We
could separate that out into its own import function, but then we're
locking every relation twice, once for the tuples/pages and once again for
the pg_statistic import.

My current line of thinking was that the stats import call, if enabled,
would immediately follow the CREATE statement of the object itself, but
that requires everything we need to know for the import to be passed
into the import function, so we'd need a way to serialize _that_. If
you're thinking that we have one big bulk stats import, that might work,
but it also means that we're less tolerant of failures in the import step.


Re: Transaction timeout

2023-12-28 Thread Junwang Zhao
Hey Andrey,

On Sun, Dec 24, 2023 at 1:14 AM Andrey M. Borodin  wrote:
>
>
>
> > On 22 Dec 2023, at 10:39, Japin Li  wrote:
> >
> >
> > I try to split the test for transaction timeout, and all passed on my CI 
> > [1].
>
>
> I like the refactoring you did in timeout.spec. I thought it is impossible, 
> because permutations would try to reinitialize FATALed sessions. But, 
> obviously, tests work the way you refactored it.
> However I don't think ignoring test failures on Windows without understanding 
> root cause is a good idea.
> Let's get back to v13 version of tests, understand why it failed, apply your 
> test refactorings afterwards. BTW are you sure that v14 refactorings are 
> functional equivalent of v13 tests?
>
> To go with this plan I attach slightly modified version of v13 tests in v16 
> patchset. The only change is timing in "sleep_there" step. I suspect that 
> failure was induced by more coarse timer granularity on Windows. Tests were 
> giving only 9 milliseconds for a timeout to entirely wipe away backend from 
> pg_stat_activity. This saves testing time, but might induce false positive 
> test flaps. So I've raised wait times to 100ms. This seems too much, but I do 
> not have other ideas how to ensure tests stability. Maybe 50ms would be 
> enough, I do not know. Isolation runs ~50 seconds now. I'm tempted to say 
> that 200ms for timeouts worth it.
>
>
> As to 2nd step "Try to enable transaction_timeout during transaction", I 
> think this makes sense. But if we are doing so, shouldn't we also allow to 
> enable idle_in_transaction timeout in a same manner? Currently we only allow 
> to disable other timeouts... Also, if we are already in transaction, 
> shouldn't we also subtract current transaction span from timeout?
> I think making this functionality as another step of the patchset was a good 
> idea.
>
> Thanks!
Seems V5~V17 doesn't work as expected for Nikolay's case:

postgres=# set transaction_timeout to '2s';
SET
postgres=# begin; select pg_sleep(1); select pg_sleep(1); select
pg_sleep(1); select pg_sleep(1); select pg_sleep(1); commit;
BEGIN

The reason for this seems to be that the timer gets refreshed for each
command; xact_started alone cannot tell whether this is a new
transaction or not, whereas the TransactionState does contain that
information.

So I propose the following change, what do you think?

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index a2611cf8e6..cffd2c44d0 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2746,7 +2746,7 @@ start_xact_command(void)
 		StartTransactionCommand();
 
 		/* Schedule or reschedule transaction timeout */
-		if (TransactionTimeout > 0)
+		if (TransactionTimeout > 0 && !get_timeout_active(TRANSACTION_TIMEOUT))
 			enable_timeout_after(TRANSACTION_TIMEOUT, TransactionTimeout);
 
 		xact_started = true;

>
>
> Best regards, Andrey Borodin.



-- 
Regards
Junwang Zhao




Re: introduce dynamic shared memory registry

2023-12-28 Thread Nathan Bossart
On Wed, Dec 27, 2023 at 01:53:27PM -0600, Nathan Bossart wrote:
> Here is a new version of the patch.  In addition to various small changes,
> I've rewritten the test suite to use an integer and a lock, added a
> dsm_registry_find() function, and adjusted autoprewarm to use the registry.

Here's a v3 that fixes a silly mistake in the test.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From b9b1638c9eb55eba51356690a2b3da1fe0b2b14e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 11 Oct 2023 22:07:26 -0500
Subject: [PATCH v3 1/2] add dsm registry

---
 src/backend/storage/ipc/Makefile  |   1 +
 src/backend/storage/ipc/dsm_registry.c| 209 ++
 src/backend/storage/ipc/ipci.c|   3 +
 src/backend/storage/ipc/meson.build   |   1 +
 src/backend/storage/lmgr/lwlock.c |   4 +
 src/backend/storage/lmgr/lwlocknames.txt  |   1 +
 .../utils/activity/wait_event_names.txt   |   3 +
 src/include/storage/dsm_registry.h|  24 ++
 src/include/storage/lwlock.h  |   2 +
 src/test/modules/Makefile |   1 +
 src/test/modules/meson.build  |   1 +
 src/test/modules/test_dsm_registry/.gitignore |   4 +
 src/test/modules/test_dsm_registry/Makefile   |  23 ++
 .../expected/test_dsm_registry.out|  14 ++
 .../modules/test_dsm_registry/meson.build |  33 +++
 .../sql/test_dsm_registry.sql |   4 +
 .../test_dsm_registry--1.0.sql|  10 +
 .../test_dsm_registry/test_dsm_registry.c |  75 +++
 .../test_dsm_registry.control |   4 +
 src/tools/pgindent/typedefs.list  |   3 +
 20 files changed, 420 insertions(+)
 create mode 100644 src/backend/storage/ipc/dsm_registry.c
 create mode 100644 src/include/storage/dsm_registry.h
 create mode 100644 src/test/modules/test_dsm_registry/.gitignore
 create mode 100644 src/test/modules/test_dsm_registry/Makefile
 create mode 100644 src/test/modules/test_dsm_registry/expected/test_dsm_registry.out
 create mode 100644 src/test/modules/test_dsm_registry/meson.build
 create mode 100644 src/test/modules/test_dsm_registry/sql/test_dsm_registry.sql
 create mode 100644 src/test/modules/test_dsm_registry/test_dsm_registry--1.0.sql
 create mode 100644 src/test/modules/test_dsm_registry/test_dsm_registry.c
 create mode 100644 src/test/modules/test_dsm_registry/test_dsm_registry.control

diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index 6d5b921038..d8a1653eb6 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -12,6 +12,7 @@ OBJS = \
 	barrier.o \
 	dsm.o \
 	dsm_impl.o \
+	dsm_registry.o \
 	ipc.o \
 	ipci.o \
 	latch.o \
diff --git a/src/backend/storage/ipc/dsm_registry.c b/src/backend/storage/ipc/dsm_registry.c
new file mode 100644
index 00..5fc970001e
--- /dev/null
+++ b/src/backend/storage/ipc/dsm_registry.c
@@ -0,0 +1,209 @@
+/*-
+ *
+ * dsm_registry.c
+ *
+ * Functions for interfacing with the dynamic shared memory registry.  This
+ * provides a way for libraries to use shared memory without needing to
+ * request it at startup time via a shmem_request_hook.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/ipc/dsm_registry.c
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include "lib/dshash.h"
+#include "storage/dsm_registry.h"
+#include "storage/lwlock.h"
+#include "storage/shmem.h"
+#include "utils/memutils.h"
+
+typedef struct DSMRegistryCtxStruct
+{
+	dsa_handle	dsah;
+	dshash_table_handle dshh;
+} DSMRegistryCtxStruct;
+
+static DSMRegistryCtxStruct *DSMRegistryCtx;
+
+typedef struct DSMRegistryEntry
+{
+	char		key[256];
+	dsm_handle	handle;
+} DSMRegistryEntry;
+
+static const dshash_parameters dsh_params = {
+	offsetof(DSMRegistryEntry, handle),
+	sizeof(DSMRegistryEntry),
+	dshash_memcmp,
+	dshash_memhash,
+	LWTRANCHE_DSM_REGISTRY_HASH
+};
+
+static dsa_area *dsm_registry_dsa;
+static dshash_table *dsm_registry_table;
+
+static void init_dsm_registry(void);
+
+Size
+DSMRegistryShmemSize(void)
+{
+	return MAXALIGN(sizeof(DSMRegistryCtxStruct));
+}
+
+void
+DSMRegistryShmemInit(void)
+{
+	bool		found;
+
+	DSMRegistryCtx = (DSMRegistryCtxStruct *)
+		ShmemInitStruct("DSM Registry Data",
+		DSMRegistryShmemSize(),
+		&found);
+
+	if (!found)
+	{
+		DSMRegistryCtx->dsah = DSA_HANDLE_INVALID;
+		DSMRegistryCtx->dshh = DSHASH_HANDLE_INVALID;
+	}
+}
+
+/*
+ * Initialize or attach to the dynamic shared hash table that stores the DSM
+ * registry entries, if not already done.  This must be called before accessing
+ * the table.
+ */
+static void
+init_dsm_registry(void)
+{
+	/* Quick exit if we

Re: broken master regress tests

2023-12-28 Thread Alexander Lakhin

Hello Jeff,

22.12.2023 02:17, Jeff Davis wrote:

On Wed, 2023-12-20 at 17:48 -0800, Jeff Davis wrote:

Attached.

It appears to increase the coverage. I committed it and I'll see how
the buildfarm reacts.


Starting from the commit 8793c6005, I observe a failure of test
collate.windows.win1252 on Windows Server 2016:
meson test regress/regress
1/1 postgresql:regress / regress/regress    ERROR 24.47s   exit status 1

regression.diffs contains:
@@ -993,6 +993,8 @@
 -- nondeterministic collations
 -- (not supported with libc provider)
 CREATE COLLATION ctest_det (locale = 'en_US', deterministic = true);
+ERROR:  could not create locale "en_US": No such file or directory
+DETAIL:  The operating system could not find any locale data for the locale name "en_US".
 CREATE COLLATION ctest_nondet (locale = 'en_US', deterministic = false);
 ERROR:  nondeterministic collations not supported with this provider
 -- cleanup

Though
CREATE COLLATION ctest_det (locale = 'English_United States', deterministic = true);
executed successfully on this OS.

AFAICS, before that commit SELECT getdatabaseencoding() in the test
returned SQL_ASCII, hence the test was essentially skipped, but now it
returns WIN1252, so problematic CREATE COLLATION(locale = 'en_US', ...)
is reached.

Best regards,
Alexander




Re: Multidimensional Histograms

2023-12-28 Thread Tomas Vondra
On 12/27/23 22:19, Tomas Vondra wrote:
> Hello Alexander,
> 
> We actually had histograms in the early patches adding multivariate
> statistics [1]. We however ended up removing histograms and only kept
> the simpler types, for a couple reasons.
> 
> It might be worth going through the discussion - I'm sure one of the
> reasons was we needed to limit the scope of the patch, and histograms
> were much more complex and possibly less useful compared to the other
> statistics types.
>
> ...

FWIW I did not intend to reject the idea of adding multi-dimensional
histograms, but rather to provide some insight into the history of the
past attempt, and also point some weaknesses of the algorithm described
in the 1988 paper. If you choose to work on this, I can do a review etc.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Tidy fill hstv array (src/backend/access/heap/pruneheap.c)

2023-12-28 Thread Ranier Vilela
Hi,

The function heap_page_prune (src/backend/access/heap/pruneheap.c)
has a comment:

"/*
 * presult->htsv is not initialized here because all ntuple spots in the
 * array will be set either to a valid HTSV_Result value or -1.
 */

IMO, this is a little bogus and does not make sense.

I think it would be more productive to initialize the entire array with -1,
and avoid flagging, one by one, the items that should be -1.
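
For illustration, a small self-contained sketch of the idea; it assumes the
htsv elements are a signed 8-bit type (int8 in PruneResult), so filling the
array with memset(-1) is well-defined:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int
main(void)
{
	int8_t		htsv[8];		/* stand-in for presult->htsv */

	/*
	 * Fill the whole array with -1 in one call, so later code only has to
	 * store valid HTSV_Result values instead of writing -1 item by item.
	 */
	memset(htsv, -1, sizeof(htsv));

	for (size_t i = 0; i < sizeof(htsv); i++)
		printf("%d\n", htsv[i]);	/* prints -1 eight times */
	return 0;
}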

Patch attached.

best regards,
Ranier Vilela


001-tidy-fill-htsv-array.patch
Description: Binary data


Re: doc patch: note AttributeRelationId passed to FDW validator function

2023-12-28 Thread Ian Lawrence Barwick
On Thu, 28 Dec 2023 at 15:37, Michael Paquier  wrote:
>
> On Thu, Dec 28, 2023 at 01:55:27PM +0900, Ian Lawrence Barwick wrote:
> > 2023年6月7日(水) 9:08 Ian Lawrence Barwick :
> >> The alternative v2 patch adds this to this list of OIDs, and also
> >> formats it as an
> >> SGML list, which IMHO is easier to read.
> >>
> >> Looks like this has been missing since 9.3.
> >
> > Forgot to add this to a CF; done:  
> > https://commitfest.postgresql.org/46/4730/
>
> Agreed that a list is cleaner.  Looking around I can see that the
> catalogs going through the validator functions are limited to the five
> you are listing in your patch.  Will apply in a bit, thanks!

Thanks for taking care of that!

Regards

Ian Barwick




Re: Switching XLog source from archive to streaming when primary available

2023-12-28 Thread Bharath Rupireddy
On Sat, Oct 21, 2023 at 11:59 PM Bharath Rupireddy
 wrote:
>
> On Fri, Jul 21, 2023 at 12:38 PM Bharath Rupireddy
>  wrote:
> >
> > Needed a rebase. I'm attaching the v13 patch for further consideration.
>
> Needed a rebase. I'm attaching the v14 patch. It also has the following 
> changes:
>
> - Ran pgindent on the new source code.
> - Ran pgperltidy on the new TAP test.
> - Improved the newly added TAP test a bit. Used the new wait_for_log
> core TAP function in place of custom find_in_log.
>
> Thoughts?

I took a closer look at v14 and came up with the following changes:

1. Used advance_wal introduced by commit c161ab74f7.
2. Simplified the core logic and new TAP tests.
3. Reworded the comments and docs.
4. Simplified new DEBUG messages.

I've attached the v15 patch for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 8bdd3b999343c02ae01a7d15ef7fdd0be25623bd Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 28 Dec 2023 11:14:05 +
Subject: [PATCH v15] Allow standby to switch WAL source from archive to
 streaming

A standby typically switches to streaming replication (getting WAL
from the primary) only when receiving WAL from the archive finishes
(no more WAL left there) or fails for any reason. Reading WAL from
the archive may not always be as efficient and fast as reading from
the primary. This can be due to differences in disk types, IO costs,
network latencies etc. All of these can impact the recovery
performance on the standby and increase the replication lag on the
primary. In addition, the primary keeps accumulating WAL needed for
the standby while the standby reads WAL from the archive, because
the standby's replication slot stays inactive. To avoid these
problems, one can use this parameter to make the standby switch to
stream mode sooner.

This feature adds a new GUC that specifies the amount of time after
which the standby attempts to switch the WAL source from the WAL
archive to streaming replication (getting WAL from the primary).
However, the standby exhausts all the WAL present in pg_wal before
switching. If the standby fails to switch to stream mode, it falls
back to archive mode.

Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  47 +++
 doc/src/sgml/high-availability.sgml   |  15 ++-
 src/backend/access/transam/xlogrecovery.c | 115 --
 src/backend/utils/misc/guc_tables.c   |  12 ++
 src/backend/utils/misc/postgresql.conf.sample |   4 +
 src/include/access/xlogrecovery.h |   1 +
 src/test/recovery/meson.build |   1 +
 src/test/recovery/t/040_wal_source_switch.pl  |  93 ++
 8 files changed, 269 insertions(+), 19 deletions(-)
 create mode 100644 src/test/recovery/t/040_wal_source_switch.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b5624ca884..04aa2fa8d2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4866,6 +4866,53 @@ ANY num_sync ( 
+  streaming_replication_retry_interval (integer)
+  
+   streaming_replication_retry_interval configuration parameter
+  
+  
+  
+   
+Specifies amount of time after which standby attempts to switch WAL
+source from archive to streaming replication (i.e., getting WAL from
+primary). However, the standby exhausts all the WAL present in pg_wal
+before switching. If the standby fails to switch to stream mode, it
+falls back to archive mode. If this parameter value is specified
+without units, it is taken as milliseconds. Default is
+5min. With a lower value for this parameter, the
+standby makes frequent WAL source switch attempts. To avoid this, it is
+recommended to set a reasonable value. A setting of 0
+disables the feature. When disabled, the standby typically switches to
+stream mode only after receiving WAL from archive finishes (i.e., no
+more WAL left there) or fails for any reason. This parameter can only
+be set in the postgresql.conf file or on the
+server command line.
+   
+   
+
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ streaming_replication_retry_interval intervals. For
+ example, if the parameter is set to 1min and
+ fetching WAL file from archive takes about 2min,
+ then the source switch attempt happens for the next WAL file after
+ current WAL file fetched from archive is fully applied.
+
+   
+   
+Reading WAL from archive may not always be as efficient and fast as
+reading from primary. This can be due to the differences 

Re: pg_upgrade failing for 200+ million Large Objects

2023-12-28 Thread Robins Tharakan
On Thu, 28 Dec 2023 at 01:48, Tom Lane  wrote:

> Robins Tharakan  writes:
> > Applying all 4 patches, I also see good performance improvement.
> > With more Large Objects, although pg_dump improved significantly,
> > pg_restore is now comfortably an order of magnitude faster.
>
> Yeah.  The key thing here is that pg_dump can only parallelize
> the data transfer, while (with 0004) pg_restore can parallelize
> large object creation and owner-setting as well as data transfer.
> I don't see any simple way to improve that on the dump side,
> but I'm not sure we need to.  Zillions of empty objects is not
> really the use case to worry about.  I suspect that a more realistic
> case with moderate amounts of data in the blobs would make pg_dump
> look better.
>


Thanks for elaborating, and yes pg_dump times do reflect that
expectation.

The first test involved a fixed number (32k) of
Large Objects (LOs) with varying sizes - I chose that number
intentionally since this was being tested on a 32vCPU instance
and the patch employs 1k batches.


We again see that pg_restore is an order of magnitude faster.

 LO Size (bytes)   restore-HEAD   restore-patched   improvement (Nx)
               1         24.182               1.4                17x
              10         24.741               1.5                17x
             100         24.574               1.6                15x
           1,000         25.314               1.7                15x
          10,000         25.644               1.7                15x
         100,000         50.046               4.3                12x
       1,000,000        281.549              30.0                 9x


pg_dump also sees improvements. Really small sized LOs
see a decent ~20% improvement which grows considerably as LOs
get bigger (beyond ~10-100kb).


 LO Size (bytes)   dump-HEAD   dump-patched   improvement (%)
               1        12.9           10.7               18%
              10        12.9           10.4               19%
             100        12.8           10.3               20%
           1,000        13.0           10.3               21%
          10,000        14.2           10.3               27%
         100,000        32.8           11.5               65%
       1,000,000       211.8           23.6               89%


To test pg_restore scaling, 1 million LOs (100kb each)
were created and pg_restore times were measured for increasing
concurrency (on a 192vCPU instance). We see a major speedup
up to -j64, and the best time was at -j96, after which
performance decreases slowly - see attached image.

Concurrency    pg_restore-patched
384  75.87
352  75.63
320  72.11
288  70.05
256  70.98
224  66.98
192  63.04
160  61.37
128  58.82
 96  58.55
 64  60.46
 32  77.29
 16 115.51
  8 203.48
  4 366.33



Test details:
- Command used to generate SQL - create 1k LOs of 1kb each
  - echo "SELECT lo_from_bytea(0, '\x`  printf 'ff%.0s' {1..1000}`') FROM
generate_series(1,1000);" > /tmp/tempdel
- Verify the LO size: select pg_column_size(lo_get(oid));
- Only GUC changed: max_connections=1000 (for the last test)

-
Robins Tharakan
Amazon Web Services


Re: POC: GROUP BY optimization

2023-12-28 Thread Alexander Korotkov
On Thu, Dec 28, 2023 at 10:22 AM Andrei Lepikhov
 wrote:
> But arrangement with an ORDER BY clause doesn't work:
>
> DROP INDEX abc;
> explain SELECT x,w,z FROM t GROUP BY (w,x,z) ORDER BY (x,z,w);
>
> I think the reason is that the sort_pathkeys and group_pathkeys are
> physically different structures, and we can't just compare pointers here.

I haven't yet looked into the code, but this looks strange to me.
Somehow, the optimizer currently matches index pathkeys to ORDER BY
pathkeys.  If GROUP BY pathkeys can be matched to index pathkeys,
then it should be possible to match them to ORDER BY pathkeys too.


--
Regards,
Alexander Korotkov




Re: Set log_lock_waits=on by default

2023-12-28 Thread Shinya Kato

On 2023-12-22 20:00, Laurenz Albe wrote:


My point is that in the vast majority of cases, long lock waits
indicate a problem that you would like to know about, so the parameter
should default to "on".


+1.
I always set log_lock_waits=on, so I agree with this.



Just a random idea but what if we separated log_lock_waits from
deadlock_timeout? Say, it becomes time-valued rather than
Boolean-valued, but it has to be >= deadlock_timeout? Because I'd
probably be more interested in hearing about a lock wait that was more
than say 10 seconds, but I don't necessarily want to wait 10 seconds
for the deadlock detector to trigger.


That is an appealing thought, but as far as I know, "log_lock_waits"
is implemented by the deadlock detector, which is why it is tied to
"deadlock_timeout".  So if we want that, we'd need a separate "live
lock detector".  I don't know if we want to go there.


Personally, I thought it was a good idea to separate log_lock_waits and 
deadlock_timeout, but I have not checked how that is implemented.


--
Regards,
Shinya Kato
NTT DATA GROUP CORPORATION




Re: Some revises in adding sorting path

2023-12-28 Thread Shubham Khanna
On Thu, Dec 28, 2023 at 4:01 PM Shubham Khanna
 wrote:
>
> On Thu, Dec 28, 2023 at 4:00 PM Richard Guo  wrote:
> >
> >
> > On Wed, Mar 29, 2023 at 4:00 AM David Rowley  wrote:
> >>
> >> If you write some tests for this code, it will be useful to prove that
> >> it actually does something, and also that it does not break again in
> >> the future. I don't really want to just blindly copy the pattern used
> >> in 3c6fc5820 for creating incremental sort paths if it's not useful
> >> here. It would be good to see tests that make an Incremental Sort path
> >> using the code you're changing.
> >
> >
> > Thanks for the suggestion.  I've managed to come up with a query that
> > gets the codes being changed here to perform an incremental sort.
> >
> > set min_parallel_index_scan_size to 0;
> > set enable_seqscan to off;
> >
> > Without making those parallel paths:
> >
> > explain (costs off)
> > select * from tenk1 where four = 2 order by four, hundred, 
> > parallel_safe_volatile(thousand);
> >   QUERY PLAN
> > --
> >  Incremental Sort
> >Sort Key: hundred, (parallel_safe_volatile(thousand))
> >Presorted Key: hundred
> >->  Gather Merge
> >  Workers Planned: 3
> >  ->  Parallel Index Scan using tenk1_hundred on tenk1
> >Filter: (four = 2)
> > (7 rows)
> >
> > and with those parallel paths:
> >
> > explain (costs off)
> > select * from tenk1 where four = 2 order by four, hundred, 
> > parallel_safe_volatile(thousand);
> >   QUERY PLAN
> > ---
> >  Gather Merge
> >Workers Planned: 3
> >->  Incremental Sort
> >  Sort Key: hundred, (parallel_safe_volatile(thousand))
> >  Presorted Key: hundred
> >  ->  Parallel Index Scan using tenk1_hundred on tenk1
> >Filter: (four = 2)
> > (7 rows)
> >
> > I've added two tests for the code changes in create_ordered_paths in the
> > new patch.
> >
> >>
> >> Same for the 0003 patch.
> >
> >
> > For the code changes in gather_grouping_paths, I've managed to come up
> > with a query that makes an explicit Sort atop cheapest partial path.
> >
> > explain (costs off)
> > select count(*) from tenk1 group by twenty, parallel_safe_volatile(two);
> >  QUERY PLAN
> > 
> >  Finalize GroupAggregate
> >Group Key: twenty, (parallel_safe_volatile(two))
> >->  Gather Merge
> >  Workers Planned: 4
> >  ->  Sort
> >Sort Key: twenty, (parallel_safe_volatile(two))
> >->  Partial HashAggregate
> >  Group Key: twenty, parallel_safe_volatile(two)
> >  ->  Parallel Seq Scan on tenk1
> > (9 rows)
> >
> > Without this logic the plan would look like:
> >
> > explain (costs off)
> > select count(*) from tenk1 group by twenty, parallel_safe_volatile(two);
> >  QUERY PLAN
> > 
> >  Finalize GroupAggregate
> >Group Key: twenty, (parallel_safe_volatile(two))
> >->  Sort
> >  Sort Key: twenty, (parallel_safe_volatile(two))
> >  ->  Gather
> >Workers Planned: 4
> >->  Partial HashAggregate
> >  Group Key: twenty, parallel_safe_volatile(two)
> >  ->  Parallel Seq Scan on tenk1
> > (9 rows)
> >
> > This test is also added in the new patch.
> >
> > But I did not find a query that makes an incremental sort in this case.
> > After trying for a while it seems to me that we do not need to consider
> > incremental sort in this case, because for a partial path of a grouped
> > or partially grouped relation, it is either unordered (HashAggregate or
> > Append), or it has been ordered by the group_pathkeys (GroupAggregate).
> > It seems there is no case that we'd have a partial path that is
> > partially sorted.
> >
Just for clarity: I am not familiar with this code. For the review,
I ran 'make check' and 'make check-world', and all the test cases
passed successfully.

Thanks and Regards,
Shubham Khanna.




Re: Some revises in adding sorting path

2023-12-28 Thread Shubham Khanna
On Thu, Dec 28, 2023 at 4:00 PM Richard Guo  wrote:
>
>
> On Wed, Mar 29, 2023 at 4:00 AM David Rowley  wrote:
>>
>> If you write some tests for this code, it will be useful to prove that
>> it actually does something, and also that it does not break again in
>> the future. I don't really want to just blindly copy the pattern used
>> in 3c6fc5820 for creating incremental sort paths if it's not useful
>> here. It would be good to see tests that make an Incremental Sort path
>> using the code you're changing.
>
>
> Thanks for the suggestion.  I've managed to come up with a query that
> gets the codes being changed here to perform an incremental sort.
>
> set min_parallel_index_scan_size to 0;
> set enable_seqscan to off;
>
> Without making those parallel paths:
>
> explain (costs off)
> select * from tenk1 where four = 2 order by four, hundred, 
> parallel_safe_volatile(thousand);
>   QUERY PLAN
> --
>  Incremental Sort
>Sort Key: hundred, (parallel_safe_volatile(thousand))
>Presorted Key: hundred
>->  Gather Merge
>  Workers Planned: 3
>  ->  Parallel Index Scan using tenk1_hundred on tenk1
>Filter: (four = 2)
> (7 rows)
>
> and with those parallel paths:
>
> explain (costs off)
> select * from tenk1 where four = 2 order by four, hundred, 
> parallel_safe_volatile(thousand);
>   QUERY PLAN
> ---
>  Gather Merge
>Workers Planned: 3
>->  Incremental Sort
>  Sort Key: hundred, (parallel_safe_volatile(thousand))
>  Presorted Key: hundred
>  ->  Parallel Index Scan using tenk1_hundred on tenk1
>Filter: (four = 2)
> (7 rows)
>
> I've added two tests for the code changes in create_ordered_paths in the
> new patch.
>
>>
>> Same for the 0003 patch.
>
>
> For the code changes in gather_grouping_paths, I've managed to come up
> with a query that makes an explicit Sort atop cheapest partial path.
>
> explain (costs off)
> select count(*) from tenk1 group by twenty, parallel_safe_volatile(two);
>  QUERY PLAN
> 
>  Finalize GroupAggregate
>Group Key: twenty, (parallel_safe_volatile(two))
>->  Gather Merge
>  Workers Planned: 4
>  ->  Sort
>Sort Key: twenty, (parallel_safe_volatile(two))
>->  Partial HashAggregate
>  Group Key: twenty, parallel_safe_volatile(two)
>  ->  Parallel Seq Scan on tenk1
> (9 rows)
>
> Without this logic the plan would look like:
>
> explain (costs off)
> select count(*) from tenk1 group by twenty, parallel_safe_volatile(two);
>  QUERY PLAN
> 
>  Finalize GroupAggregate
>Group Key: twenty, (parallel_safe_volatile(two))
>->  Sort
>  Sort Key: twenty, (parallel_safe_volatile(two))
>  ->  Gather
>Workers Planned: 4
>->  Partial HashAggregate
>  Group Key: twenty, parallel_safe_volatile(two)
>  ->  Parallel Seq Scan on tenk1
> (9 rows)
>
> This test is also added in the new patch.
>
> But I did not find a query that makes an incremental sort in this case.
> After trying for a while it seems to me that we do not need to consider
> incremental sort in this case, because for a partial path of a grouped
> or partially grouped relation, it is either unordered (HashAggregate or
> Append), or it has been ordered by the group_pathkeys (GroupAggregate).
> It seems there is no case that we'd have a partial path that is
> partially sorted.
>
I reviewed the Patch and it looks fine to me.

Thanks and Regards,
Shubham Khanna.




Re: pg_upgrade and logical replication

2023-12-28 Thread Amit Kapila
On Wed, Dec 13, 2023 at 12:09 PM vignesh C  wrote:
>
> Thanks for the comments, the attached v25 version patch has the
> changes for the same.
>

I have looked at it again and made some cosmetic changes like changing
some comments and a minor change in one of the error messages. See, if
the changes look okay to you.

-- 
With Regards,
Amit Kapila.


v26-0001-Allow-upgrades-to-preserve-the-full-subscription.patch
Description: Binary data


Re: Revise the Asserts added to bimapset manipulation functions

2023-12-28 Thread David Rowley
On Wed, 27 Dec 2023 at 22:30, Richard Guo  wrote:
> The Asserts added to bitmapset.c by commits 71a3e8c43b and 7d58f2342b
> contain some duplicates, such as in bms_difference, bms_is_subset,
> bms_subset_compare, bms_int_members and bms_join.  For instance,

I'm just learning of these changes now.   I wonder why that wasn't
done more like:

#ifdef REALLOCATE_BITMAPSETS
static Bitmapset *
bms_clone_and_free(Bitmapset *a)
{
Bitmapset *c = bms_copy(a);

bms_free(a);
return c;
}
#endif

then instead of having to do:

#ifdef REALLOCATE_BITMAPSETS
a = (Bitmapset *) palloc(BITMAPSET_SIZE(tmp->nwords));
memcpy(a, tmp, BITMAPSET_SIZE(tmp->nwords));
pfree(tmp);
#endif

all over the place.  Couldn't we do:

#ifdef REALLOCATE_BITMAPSETS
 return bms_clone_and_free(a);
#else
 return a;
#endif

I think it's best to leave at least that and not hide the behaviour
inside a macro.

It would also be good if REALLOCATE_BITMAPSETS was documented in
bitmapset.c to offer some guidance to people modifying the code so
they know under what circumstances they need to return a copy. There
are no comments that offer any indication of what the intentions are
with this :-(  What's written in pg_config_manual.h isn't going to
help anyone that's modifying bitmapset.c

> While removing those duplicates, I think we can add checks in the new
> Asserts to ensure that Bitmapsets should not contain trailing zero
> words, as the old Asserts did.  That makes the Asserts in the form of:
>
> Assert(a == NULL || (IsA(a, Bitmapset) && a->words[a->nwords - 1] != 0));
>
> I think we can define a new macro for this form and use it to check that
> a Bitmapset is valid.

I think that's an improvement.  I did have a function for this in [1],
but per [2], Tom wasn't a fan.  I likely shouldn't have bothered with
the loop there. It seems fine just to ensure the final word isn't 0.
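
For concreteness, such a macro could simply wrap the Assert form quoted
above under an illustrative name (the actual patch may spell it differently):

#define Assert_bms_is_valid(a) \
	Assert((a) == NULL || \
		   (IsA((a), Bitmapset) && (a)->words[(a)->nwords - 1] != 0))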

David

[1] 
https://postgr.es/m/CAApHDvr5O41MuUjw0DQKqmAnv7QqvmLqXReEd5o4nXTzWp8-%2Bw%40mail.gmail.com
[2] https://postgr.es/m/2686153.1677881312%40sss.pgh.pa.us




Re: Track in pg_replication_slots the reason why slots conflict?

2023-12-28 Thread shveta malik
On Thu, Dec 28, 2023 at 10:16 AM shveta malik  wrote:
>
> On Wed, Dec 27, 2023 at 4:16 PM Amit Kapila  wrote:
> >
> > On Wed, Dec 27, 2023 at 3:08 PM shveta malik  wrote:
> > >
> > > PFA the patch which attempts to implement this.
> > >
> > > This patch changes the existing 'conflicting' field to
> > > 'conflicting_cause' in pg_replication_slots.
> > >
> >
> > This name sounds a bit odd to me, would it be better to name it as
> > conflict_cause?
> >
> > A few other minor comments:
> > =

Thanks for the feedback Amit.

> > *
> > +   conflicting_cause text
> > +  
> > +  
> > +   Cause if this logical slot conflicted with recovery (and so is now
> > +   invalidated). It is always NULL for physical slots, as well as for
> > +   those logical slots which are not invalidated. Possible values are:
> >
> > Would it better to use description as follows:" Cause of logical
> > slot's conflict with recovery. It is always NULL for physical slots,
> > as well as for logical slots which are not invalidated. The non-NULL
> > values indicate that the slot is marked as invalidated. Possible
> > values are:
> > .."
> >
> > *
> >   $res = $node_standby->safe_psql(
> >   'postgres', qq(
> > - select bool_and(conflicting) from pg_replication_slots;));
> > + select bool_and(conflicting) from
> > + (select conflicting_cause is not NULL as conflicting from
> > pg_replication_slots);));
> >
> > Won't the query "select conflicting_cause is not NULL as conflicting
> > from pg_replication_slots" can return false even for physical slots
> > and then impact the result of the main query whereas the original
> > query would seem to be simply ignoring physical slots? If this
> > observation is correct then you might want to add a 'slot_type'
> > condition in the new query.
> >
> > * After introducing check_slots_conflicting_cause(), do we need to
> > have check_slots_conflicting_status()? Aren't both checking the same
> > thing?
>
> I think it is needed for the case where we want to check that there is
> no conflict.
>
> # Verify slots are reported as non conflicting in pg_replication_slots
> check_slots_conflicting_status(0);
>
> For the cases where there is conflict, I think
> check_slots_conflicting_cause() can replace
> check_slots_conflicting_status().

I have removed check_slots_conflicting_status() and where it was
needed to check non-conflicting, I have added a simple query.
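
For illustration, a non-conflicting check that also applies the slot_type
filter suggested above could be as simple as this (whether the v2 patch uses
exactly this form is a guess):

SELECT bool_and(conflicting_cause IS NULL)
  FROM pg_replication_slots
 WHERE slot_type = 'logical';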

PFA the v2-patch with all your comments addressed.

thanks
Shveta


v2-0001-Track-conflicting_cause-in-pg_replication_slots.patch
Description: Binary data


Re: POC: GROUP BY optimization

2023-12-28 Thread Andrei Lepikhov

On 27/12/2023 12:07, Tom Lane wrote:

Andrei Lepikhov  writes:

To be clear. In [1], I mentioned we can perform micro-benchmarks and
structure costs of operators. At least for fixed-length operators, it is
relatively easy.


I repeat what I said: this is a fool's errand.  You will not get
trustworthy results even for the cases you measured, let alone
all the rest.  I'd go as far as to say I would not believe your
microbenchmarks, because they would only apply for one platform,
compiler, backend build, phase of the moon, etc.


Thanks for the explanation.
I removed all cost-related code. It still needs to be finished; I will
smooth the code further and rewrite the regression tests - many of them
look silly without cost-dependent reorderings. Also, Alexander's remarks
still need to be implemented.

But even at this stage it works well. Look:

Preliminaries:
CREATE TABLE t(x int, y int, z text, w int);
INSERT INTO t SELECT gs%100,gs%100, 'abc' || gs%10, gs
  FROM generate_series(1,1) AS gs;
CREATE INDEX abc ON t(x,y);
ANALYZE t;
SET enable_hashagg = 'off';

This patch eliminates an unneeded Sort operation:
explain SELECT x,y FROM t GROUP BY (x,y);
explain SELECT x,y FROM t GROUP BY (y,x);

Engages incremental sort:
explain SELECT x,y FROM t GROUP BY (x,y,z,w);
explain SELECT x,y FROM t GROUP BY (z,y,w,x);
explain SELECT x,y FROM t GROUP BY (w,z,x,y);
explain SELECT x,y FROM t GROUP BY (w,x,z,y);

Works with subqueries:
explain SELECT x,y
FROM (SELECT * FROM t ORDER BY x,y,w,z) AS q1
GROUP BY (w,x,z,y);
explain SELECT x,y
FROM (SELECT * FROM t ORDER BY x,y,w,z LIMIT 100) AS q1
GROUP BY (w,x,z,y);

But arrangement with an ORDER BY clause doesn't work:

DROP INDEX abc;
explain SELECT x,w,z FROM t GROUP BY (w,x,z) ORDER BY (x,z,w);

I think the reason is that the sort_pathkeys and group_pathkeys are 
physically different structures, and we can't just compare pointers here.
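
One possible direction - only a sketch, and where such a check would live in
the patch is not settled here - is to compare the two orderings logically
with compare_pathkeys() instead of relying on pointer identity:

	if (compare_pathkeys(root->group_pathkeys, root->sort_pathkeys) ==
		PATHKEYS_EQUAL)
	{
		/* the chosen GROUP BY ordering already satisfies the ORDER BY */
	}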


--
regards,
Andrei Lepikhov
Postgres Professional
From 6183555c2c56d7cbbc44f213f6c5bc4cbcaef1ec Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Wed, 13 Sep 2023 11:20:03 +0700
Subject: [PATCH] Explore alternative orderings of group-by pathkeys during
 optimization.

When evaluating a query with a multi-column GROUP BY clause, we can minimize
sort operations or avoid them if we synchronize the order of GROUP BY clauses
with the ORDER BY sort clause or sort order, which comes from the underlying
query tree. Grouping does not imply any ordering, so we can compare
the keys in arbitrary order, and a Hash Agg leverages this. But for Group Agg,
we simply compared keys in the order specified in the query. This commit
explores alternative ordering of the keys, trying to find a cheaper one.

The ordering of group keys may interact with other parts of the query, some of
which may not be known while planning the grouping. For example, there may be
an explicit ORDER BY clause or some other ordering-dependent operation higher up
in the query, and using the same ordering may allow using either incremental
sort or even eliminating the sort entirely.

The patch always keeps the ordering specified in the query, assuming the user
might have additional insights.

This introduces a new GUC enable_group_by_reordering so that the optimization
may be disabled if needed.
---
 src/backend/optimizer/path/equivclass.c   |  13 +-
 src/backend/optimizer/path/pathkeys.c | 224 +
 src/backend/optimizer/plan/planner.c  | 464 ++
 src/backend/utils/adt/selfuncs.c  |  38 +-
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/nodes/pathnodes.h |  10 +
 src/include/optimizer/paths.h |   3 +
 src/test/regress/expected/aggregates.out  | 235 +
 .../regress/expected/incremental_sort.out |  20 +-
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/sql/aggregates.sql   |  99 
 src/test/regress/sql/incremental_sort.sql |   2 +-
 13 files changed, 912 insertions(+), 210 deletions(-)

diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 7fa502d6e2..07edd4f38e 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -652,7 +652,18 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 
 			if (opcintype == cur_em->em_datatype &&
 				equal(expr, cur_em->em_expr))
-				return cur_ec;	/* Match! */
+			{
+				/*
+				 * Match!
+				 *
+				 * Copy the sortref if it wasn't set yet. That may happen if
+				 * the ec was constructed from WHERE clause, i.e. it doesn't
+				 * have a target reference at all.
+