Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-06 Thread Julien Rouhaud
On Mon, Aug 07, 2023 at 09:24:02AM +0530, Amit Kapila wrote:
>
> I think autovacuum is not enabled during the upgrade. See the comment "Use
> -b to disable autovacuum." in start_postmaster(). However, I am not sure
> that there can't be any additional WAL from the checkpointer or bgwriter.
> The checkpointer has code that skips the checkpoint if there is no
> important WAL activity, and similarly the bgwriter doesn't LOG
> xl_running_xacts unless there is important activity. I feel that if there
> is a chance of any WAL activity during the upgrade, we need to either
> change the check so that such WAL records are treated as expected, or
> document this in some way.

Unless I'm missing something, I don't see what prevents something from
connecting using the replication protocol and issuing any query, or even
creating new replication slots?
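
For illustration, this is the sort of thing I'm wondering about -- over a
replication connection (e.g. psql with "replication=database" in the
connection string), with the slot name being arbitrary:

CREATE_REPLICATION_SLOT "surprise_slot" LOGICAL pgoutput;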

Note also that, as was complained about a few years ago, nothing prevents a
bgworker from spawning during pg_upgrade and possibly corrupting the upgraded
cluster if multixact IDs are assigned.  If publications are preserved,
wouldn't it mean that such bgworkers could also lead to data loss?




Re: "duplicated" wait events

2023-08-06 Thread Kyotaro Horiguchi
At Fri, 4 Aug 2023 17:07:51 +0200, "Drouvot, Bertrand" 
 wrote in 
> SyncRep currently appears in "IPC" and "LWLock" (see [2])
> WALWrite currently appears in "IO" and "LWLock" (see [2])
> 
> I think that can lead to confusion and it would be better to avoid
> duplicate wait event
> name across Wait Class (and so fix those 2 ones above), what do you
> think?

I think it would be handy to be able to summarize wait events by event
name instead of by class - in other words, grouping wait events across
all classes according to their purpose. From that perspective, I'm
rather a fan of consolidating event names (that is, the opposite of
your proposal).
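
To illustrate the kind of summary I mean, a rough sketch against
pg_stat_activity (duplicated names across classes would simply merge here):

SELECT wait_event,
       array_agg(DISTINCT wait_event_type) AS classes,
       count(*)
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event
ORDER BY count(*) DESC;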

However, it seems the events weren't named with this in mind, so I'm on
the fence about this change.


By the way, I couldn't figure out how to tag event names without messing
up the source tree, which causes meson to refuse to build.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Cleaning up array_in()

2023-08-06 Thread jian he
Hi.

Based on Heikki's v3, I made some changes:

In array_in, all MAXDIM entries of dim[] are initialized to -1 and all
entries of lBound[] to 1. If ReadArrayDimensions is called, the
corresponding lBound entries replace the default value of 1.

In ReadArrayStr, since array_in already initializes the dim array whether
dimensions_specified is true or false, I don't need to initialize it again,
so I deleted that part.

To handle corner cases like '{{1,},{1},}'::text[]: in the main switch of
ReadArrayStr, as for the other ArrayTokens, expect_delim is first checked
and only then assigned. In ATOK_LEVEL_END, for a non-empty array, a closing
bracket must be preceded by either an element or another closing bracket;
in both cases the previous expect_delim should be true.

Under this comment:
 * FIXME: Is this still required? I believe all the checks it performs are
 * redundant with other checks in ReadArrayDimension() and ReadArrayStr()
 */
I deleted
-   nitems_according_to_dims = ArrayGetNItemsSafe(ndim, dim, escontext);
-   if (nitems_according_to_dims < 0)
-   PG_RETURN_NULL();
-   if (nitems != nitems_according_to_dims)
-   elog(ERROR, "mismatch nitems, %d vs %d", nitems,
nitems_according_to_dims);
but I am not sure whether the following is still necessary:
  if (!ArrayCheckBoundsSafe(ndim, dim, lBound, escontext))
PG_RETURN_NULL();

I added some corner-case tests like select '{{1,},{1},}'::text[];

One existing error message changes:
select '{{1},{}}'::text[];
-DETAIL:  Multidimensional arrays must have sub-arrays with matching dimensions.
+DETAIL:  Unexpected "," character.
This is because I added error checks in ATOK_LEVEL_END; the expect_delim
check now raises its error first, so the dimension-mismatch error path is
never reached.
From e59ccd4fa6dda94c50d5b0a2f9388dd8d859d501 Mon Sep 17 00:00:00 2001
From: pgaddict 
Date: Mon, 7 Aug 2023 12:49:57 +0800
Subject: [PATCH v4] based on Heikki verion3

---
 src/backend/utils/adt/arrayfuncs.c   | 45 ++--
 src/test/regress/expected/arrays.out | 22 +-
 src/test/regress/sql/arrays.sql  |  4 +++
 3 files changed, 55 insertions(+), 16 deletions(-)

diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index 19d7af6d..03b6b56d 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -233,6 +233,11 @@ array_in(PG_FUNCTION_ARGS)
 	typdelim = my_extra->typdelim;
 	typioparam = my_extra->typioparam;
 
+	for (int i = 0; i < MAXDIM; i++)
+	{
+		dim[i] = -1;
+		lBound[i] = 1;
+	}
 	/*
 	 * Start processing the input string.
 	 *
@@ -245,14 +250,12 @@ array_in(PG_FUNCTION_ARGS)
 
 	if (ndim == 0)
 	{
-		/* No array dimensions, so intuit dimensions from brace structure */
+		/* No array dimensions, so init dimensions from brace structure */
 		if (*p != '{')
 			ereturn(escontext, (Datum) 0,
 	(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 	 errmsg("malformed array literal: \"%s\"", string),
 	 errdetail("Array value must start with \"{\" or dimension information.")));
-		for (int i = 0; i < MAXDIM; i++)
-			lBound[i] = 1;
 	}
 	else
 	{
@@ -304,8 +307,11 @@ array_in(PG_FUNCTION_ARGS)
 		initStringInfo(&buf);
 
 		appendStringInfo(&buf, "array_in- ndim %d (", ndim);
-		for (int i = 0; i < ndim; i++)
+		for (int i = 0; i < MAXDIM; i++)
 			appendStringInfo(&buf, " %d", dim[i]);
+		appendStringInfo(&buf, "lBound info");
+		for (int i = 0; i < MAXDIM; i++)
+			appendStringInfo(&buf, " %d", lBound[i]);
 		appendStringInfo(&buf, ") for %s\n", string);
 		elog(NOTICE, "%s", buf.data);
 		pfree(buf.data);
@@ -318,11 +324,6 @@ array_in(PG_FUNCTION_ARGS)
 	 * FIXME: Is this still required? I believe all the checks it performs are
 	 * redundant with other checks in ReadArrayDimension() and ReadArrayStr()
 	 */
-	nitems_according_to_dims = ArrayGetNItemsSafe(ndim, dim, escontext);
-	if (nitems_according_to_dims < 0)
-		PG_RETURN_NULL();
-	if (nitems != nitems_according_to_dims)
-		elog(ERROR, "mismatch nitems, %d vs %d", nitems, nitems_according_to_dims);
 	if (!ArrayCheckBoundsSafe(ndim, dim, lBound, escontext))
 		PG_RETURN_NULL();
 
@@ -603,12 +604,6 @@ ReadArrayStr(char **srcptr,
 	/* The caller already checked this */
 	Assert(**srcptr == '{');
 
-	if (!dimensions_specified)
-	{
-		/* Initialize dim[] entries to -1 meaning "unknown" */
-		for (int i = 0; i < MAXDIM; ++i)
-			dim[i] = -1;
-	}
 	ndim_frozen = dimensions_specified;
 
 	maxitems = 16;
@@ -675,10 +670,30 @@ ReadArrayStr(char **srcptr,
 {
 	/* Nested sub-arrays count as elements of outer level */
 	nelems[nest_level - 1]++;
+
+	if (nitems > 0 && expect_delim == false)
+	{
+		ereturn(escontext, false,
+(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+ errmsg("malformed array literal: \"%s\"", origStr),
+ errdetail("Unexpected \"%c\" character.",
+		typdelim)));
+	}
 	expect_delim = true;
 }
 else
+{
+	/* rightmost should precede with element or bracket */
+	if 

Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-06 Thread Amit Kapila
On Wed, Aug 2, 2023 at 7:46 AM Jonathan S. Katz  wrote:
>
> Can I take this a step further on the user interface and ask why the
> flag would be "--include-logical-replication-slots" vs. being enabled by
> default?
>
> Are there reasons why we wouldn't enable this feature by default on
> pg_upgrade, and instead (if need be) have a flag that would be
> "--exclude-logical-replication-slots"? Right now, not having the ability
> to run pg_upgrade with logical replication slots enabled on the
> publisher is a a very big pain point for users, so I would strongly
> recommend against adding friction unless there is a very large challenge
> with such an implementation.
>

Thanks for acknowledging the need/importance of this feature. I also
don't see a need to have such a flag for pg_upgrade. The only reason
why one might want to exclude slots is that they are not up to date
w.r.t WAL being consumed. For example, one has not consumed all the
WAL from manually created slots or say some subscription has been
disabled before shutdown. I guess in those cases we should give an
error to the user and ask to remove such slots before the upgrade
because anyway, those won't be usable after the upgrade.
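
For example, something along these lines (just a sketch using existing
catalog functions) would show which logical slots still have unconsumed WAL:

SELECT slot_name,
       confirmed_flush_lsn,
       pg_current_wal_insert_lsn() AS insert_lsn,
       pg_wal_lsn_diff(pg_current_wal_insert_lsn(),
                       confirmed_flush_lsn) AS unconsumed_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical' AND NOT temporary;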

Having said that, I think we need a flag for pg_dump to dump the slots.

-- 
With Regards,
Amit Kapila.




Re: CDC/ETL system on top of logical replication with pgoutput, custom client

2023-08-06 Thread Amit Kapila
On Sun, Aug 6, 2023 at 7:54 PM José Neves  wrote:
>
> A follow-up on this. Indeed, a new commit-based approach solved my missing 
> data issues.
> But, getting back to the previous examples, how are server times expected to 
> be logged for the xlogs containing these records?
>

I think it should be based on commit_time because as far as I see we
can only get that on the client.

-- 
With Regards,
Amit Kapila.




Re: Postmaster self-deadlock due to PLT linkage resolution

2023-08-06 Thread Thomas Munro
After commit 7389aad6, I think commit 8acd8f86's linker changes (+
meson.build's equivalent) must now be redundant?




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-06 Thread Amit Kapila
On Sun, Aug 6, 2023 at 6:02 PM Masahiko Sawada  wrote:
>
> On Wed, Aug 2, 2023 at 5:13 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > > 4.
> > > + /*
> > > + * Check that all logical replication slots have reached the current WAL
> > > + * position.
> > > + */
> > > + res = executeQueryOrDie(conn,
> > > + "SELECT slot_name FROM pg_catalog.pg_replication_slots "
> > > + "WHERE (SELECT count(record_type) "
> > > + " FROM pg_catalog.pg_get_wal_records_content(confirmed_flush_lsn,
> > > pg_catalog.pg_current_wal_insert_lsn()) "
> > > + " WHERE record_type != 'CHECKPOINT_SHUTDOWN') <> 0 "
> > > + "AND temporary = false AND wal_status IN ('reserved', 'extended');");
> > >
> > > I think this can unnecessarily lead to reading a lot of WAL data if
> > > the confirmed_flush_lsn for a slot is too much behind. Can we think of
> > > improving this by passing the number of records to read which in this
> > > case should be 1?
> >
> > I checked and pg_wal_record_info() seemed to be used for the purpose. I 
> > tried to
> > move the functionality to core.
>
> IIUC the above query checks if the WAL record written at the slot's
> confirmed_flush_lsn is a CHECKPOINT_SHUTDOWN, but there is no check if
> this WAL record is the latest record.
>

Yeah, I also think there should be some way to ensure this. How about
passing the number of records to read to this API? Actually, that will
address my other concern as well where the current API can lead to
reading an unbounded number of records if the confirmed_flush_lsn
location is far behind the CHECKPOINT_SHUTDOWN. Do you have any better
ideas to address it?
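
To sketch what I mean (assuming pg_walinspect, or the equivalent
functionality moved into core as in the proposed patch; the LSN literal is
just a placeholder for a slot's confirmed_flush_lsn, and whether "ends at
the current insert position" is the right notion of the latest record is
exactly the open question):

SELECT r.record_type = 'CHECKPOINT_SHUTDOWN'
       AND r.end_lsn = pg_current_wal_insert_lsn() AS slot_is_caught_up
FROM pg_get_wal_record_info('0/1549F20') AS r;  -- placeholder LSN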

> Therefore, I think it's quite
> possible that slot's confirmed_flush_lsn points to previous
> CHECKPOINT_SHUTDOWN, for example, in cases where the subscription was
> disabled after the publisher shut down and then some changes are made
> on the publisher. We might want to add that check too but it would not
> work. Because some WAL records could be written (e.g., by autovacuums)
> during pg_upgrade before checking the slot's confirmed_flush_lsn.
>

I think autovacuum is not enabled during the upgrade. See the comment "Use
-b to disable autovacuum." in start_postmaster(). However, I am not sure
that there can't be any additional WAL from the checkpointer or bgwriter.
The checkpointer has code that skips the checkpoint if there is no
important WAL activity, and similarly the bgwriter doesn't LOG
xl_running_xacts unless there is important activity. I feel that if there
is a chance of any WAL activity during the upgrade, we need to either
change the check so that such WAL records are treated as expected, or
document this in some way.

-- 
With Regards,
Amit Kapila.




Re: Extract numeric filed in JSONB more effectively

2023-08-06 Thread Andy Fan
Hi:


> For all the people who are interested in this topic, I will post a
> planner support function soon,  you can check that then.
>
>
The updated patch doesn't require users to change their code and gets
better performance. Thanks for all the feedback, which made things better.

To verify that nothing unexpected is happening, here is a performance
comparison between master and patched.

create table tb(a jsonb);
insert into tb select '{"a": true, "b": 23.}' from generate_series(1, 10) i;

Master vs. patched (times in ms; each query is of the form
"select 1 from tb where <expression>"):

  expression                      master    patched
  (a->'b')::numeric = 1           31.020    15.203
  not (a->'a')::boolean           25.888    12.894
  (a->'b')::int2 = 1              30.138    16.847
  (a->'b')::int4 = 1              32.384    17.105
  (a->'b')::int8 = 1              29.922    16.720
  (a->'b')::float4 = 1            54.139    33.409
  (a->'b')::float8 = 1            66.933    34.660

-- 
Best Regards
Andy Fan


v3-0001-Optimize-extracting-a-given-data-type-from-jsonb.patch
Description: Binary data


Re: pgbench: allow to exit immediately when any client is aborted

2023-08-06 Thread Yugo NAGATA
On Mon, 7 Aug 2023 11:02:48 +0900
Yugo NAGATA  wrote:

> On Sat, 05 Aug 2023 12:16:11 +0900 (JST)
> Tatsuo Ishii  wrote:
> 
> > > Hi,
> > > 
> > > I would like to propose to add an option to pgbench so that benchmark
> > > can quit immediately when any client is aborted. Currently, when a
> > > client is aborted due to some error, for example, network trouble, 
> > > other clients continue their run until a certain number of transactions
> > > specified -t is reached or the time specified by -T is expired. At the
> > > end, the results are printed, but they are not useful, as the message
> > > "Run was aborted; the above results are incomplete" shows.
> > 
> > Sounds like a good idea. It's a waste of resources waiting for
> > unusable benchmark results until t/T expired. If we graze on the
> > screen, then it's easy to cancel the pgbench run. But I frequently let
> > pgbench run without sitting in front of the screen especially when t/T
> > is large (it's recommended that running pgbench with large enough t/T
> > to get usable results).
> 
> Thank you for your agreement.
> 
> > > For precise benchmark purpose, we would not want to wait to get such
> > > incomplete results, rather we would like to know some trouble happened
> > > to allow a quick retry. Therefore, it would be nice to add an option to
> > > make pgbench exit instead of continuing run in other clients when any
> > > client is aborted. I think adding the optional is better than  whole
> > > behavioural change because some users that use pgbench just in order
> > > to stress on backends for testing purpose rather than benchmark might
> > > not want to stop pgbench even a client is aborted. 
> > > 
> > > Attached is the patch to add the option --exit-on-abort.
> > > If this option is specified, when any client is aborted, pgbench
> > > immediately quit by calling exit(2).

I attached the v2 patch, which includes documentation and some comments
in the code.

Regards,
Yugo Nagata

> > > 
> > > What do you think about it?
> > 
> > I think aborting pgbench by calling exit(2) is enough. It's not worth
> > the trouble to add more codes for this purpose.
> 
> In order to stop pgbench more gracefully, it might be better to make
> each thread exit at a more appropriate time after some cleanup such as
> closing connections. However, the pgbench code doesn't provide such
> facilities for inter-thread communication. If we tried to do this, both
> pthread and Windows versions would be needed. I don't think such effort
> is necessary for the --exit-on-abort option, as you said.
> 
> Regards,
> Yugo Nagata
> 
> > 
> > Best reagards,
> > --
> > Tatsuo Ishii
> > SRA OSS LLC
> > English: http://www.sraoss.co.jp/index_en/
> > Japanese:http://www.sraoss.co.jp
> 
> 
> -- 
> Yugo NAGATA 
> 
> 


-- 
Yugo NAGATA 
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 6c5c8afa6d..83fe01c33f 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -787,6 +787,19 @@ pgbench  options  d
   
  
 
+ 
+  --exit-on-abort
+  
+   
+Exit immediately when any client is aborted due to some error. Without
+this option, even when a client is aborted, other clients could continue
+their run as specified by the -t or -T option,
+and pgbench will print incomplete results
+in this case.
+   
+  
+ 
+
  
   --log-prefix=prefix
   
@@ -985,7 +998,8 @@ pgbench  options  d
benchmark such as initial connection failures also exit with status 1.
Errors during the run such as database errors or problems in the script
will result in exit status 2. In the latter case,
-   pgbench will print partial results.
+   pgbench will print partial results if the
+   --exit-on-abort option is not specified.
   
  
 
@@ -2801,14 +2815,17 @@ statement latencies in milliseconds, failures and retries:
  start a connection to the database server / the socket for connecting
  the client to the database server has become invalid). In such cases
  all clients of this thread stop while other threads continue to work.
+ However, if --exit-on-abort is specified, all of the
+ threads stop immediately in this case.

  
  

  Direct client errors. They lead to immediate exit from
  pgbench with the corresponding error message
- only in the case of an internal pgbench
- error (which are supposed to never occur...). Otherwise in the worst
+ in the case of an internal pgbench
+ error (which are supposed to never occur...) or when
+ --exit-on-abort is specified. Otherwise in the worst
  case they only lead to the abortion of the failed client while other
  clients continue their run (but some client errors are handled without
  an abortion of the client and reported separately, see below). Later in
diff --git a/src/bin/pgbench/pgbench.c 

Re: pgbench: allow to exit immediately when any client is aborted

2023-08-06 Thread Yugo NAGATA
On Sat, 05 Aug 2023 12:16:11 +0900 (JST)
Tatsuo Ishii  wrote:

> > Hi,
> > 
> > I would like to propose to add an option to pgbench so that benchmark
> > can quit immediately when any client is aborted. Currently, when a
> > client is aborted due to some error, for example, network trouble, 
> > other clients continue their run until a certain number of transactions
> > specified -t is reached or the time specified by -T is expired. At the
> > end, the results are printed, but they are not useful, as the message
> > "Run was aborted; the above results are incomplete" shows.
> 
> Sounds like a good idea. It's a waste of resources waiting for
> unusable benchmark results until t/T expired. If we graze on the
> screen, then it's easy to cancel the pgbench run. But I frequently let
> pgbench run without sitting in front of the screen especially when t/T
> is large (it's recommended that running pgbench with large enough t/T
> to get usable results).

Thank you for your agreement.

> > For precise benchmark purpose, we would not want to wait to get such
> > incomplete results, rather we would like to know some trouble happened
> > to allow a quick retry. Therefore, it would be nice to add an option to
> > make pgbench exit instead of continuing run in other clients when any
> > client is aborted. I think adding the option is better than a wholesale
> > behavioural change because some users that use pgbench just in order
> > to stress backends for testing purposes rather than benchmarking might
> > not want to stop pgbench even if a client is aborted. 
> > 
> > Attached is the patch to add the option --exit-on-abort.
> > If this option is specified, when any client is aborted, pgbench
> > immediately quit by calling exit(2).
> > 
> > What do you think about it?
> 
> I think aborting pgbench by calling exit(2) is enough. It's not worth
> the trouble to add more codes for this purpose.

In order to stop pgbench more gracefully, it might be better to make
each thread exit at a more appropriate time after some cleanup such as
closing connections. However, the pgbench code doesn't provide such
facilities for inter-thread communication. If we tried to do this, both
pthread and Windows versions would be needed. I don't think such effort
is necessary for the --exit-on-abort option, as you said.

Regards,
Yugo Nagata

> 
> Best reagards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp


-- 
Yugo NAGATA 




Re: Sync scan & regression tests

2023-08-06 Thread Tom Lane
Thomas Munro  writes:
> On Mon, Aug 7, 2023 at 7:21 AM Konstantin Knizhnik  wrote:
>> Two tests are failed because of sync scan - this tests cluster.sql and
>> portals.sql perform seqscan without explicit order by and expect that
>> data will be returned in particular order. But because of sync scan it
>> doesn't happen. Small shared buffers are needed to satisfy seqscan
>> criteria in heapam.c: `scan->rs_nblocks > NBuffers / 4` for tenk1 table.

> I wondered the same thing while working on the tests in commit
> 8ab0ebb9a84, which explicitly care about physical order, so they *say
> so* with ORDER BY ctid.  But the problem seems quite widespread, so I
> didn't volunteer to try to do something like that everywhere, when Tom
> committed cbf4177f for 027_stream_regress.pl.

Our customary theory about that is explained in regress.sgml:

  You might wonder why we don't order all the regression test queries explicitly
  to get rid of this issue once and for all.  The reason is that that would
  make the regression tests less useful, not more, since they'd tend
  to exercise query plan types that produce ordered results to the
  exclusion of those that don't.

Having said that ... I just noticed that chipmunk, which I'd been
ignoring because it had been having configuration-related failures
ever since it came back to life about three months ago, has gotten
past those problems and is now failing with what looks suspiciously
like syncscan problems:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-08-06%2008%3A20%3A21

This is possibly explained by the fact that it uses (per its
extra_config)
  'shared_buffers = 10MB',
although it's done that for a long time and portals.out hasn't changed
since before chipmunk's latest success.  Perhaps some change in an
earlier test script affected this?

I think it'd be worth running to ground exactly what is causing these
failures and when they started.  But assuming that they are triggered
by syncscan, my first thought about dealing with it is to disable
syncscan within the affected tests.  However ... do we exercise the
syncscan logic at all within the regression tests, normally?  Is there
a coverage issue to be dealt with?
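
(For the record, the knob involved -- disabling it locally in an affected
test would be something like

  SET synchronize_seqscans = off;
  -- ... order-sensitive queries ...
  RESET synchronize_seqscans;

though that of course doesn't answer the coverage question.)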

regards, tom lane




Re: Use of additional index columns in rows filtering

2023-08-06 Thread Peter Geoghegan
On Sun, Aug 6, 2023 at 3:28 PM Peter Geoghegan  wrote:
> I decided to verify my understanding by checking what would happen
> when I ran the OR-heavy tenk1 regression test query against a
> combination of your patch, and v7 of the OR-to-SAOP transformation
> patch. (To be clear, this is without my patch.)

I also spotted what looks like it might be a problem with your patch
when looking at this query (hard to be sure if it's truly a bug,
though).

I manually SAOP-ify the OR-heavy tenk1 regression test query like so:

select
  *
from
  tenk1
where
  thousand = 42
  and tenthous in (1, 3, 42);

Sure enough, I continue to get 7 buffer hits with this query. Just
like with the BitmapOr plan (and exactly like the original query with
the OR-to-SAOP transformation patch in place).

As I continue to add SAOP constants to the original "tenthous" IN(),
eventually the planner switches over to not using index quals on the
"tenthous" low order index column (they're only used on the high order
"thousand" index column). Here's where the switch to only using the
leading column from the index happens for me:

select
  *
from
  tenk1
where
  thousand = 42
  and
  tenthous in (1, 3, 42, 43, 44, 45, 46, 47, 48, 49, 50);

This plan switchover isn't surprising in itself -- it's one of the most
important issues addressed by my SAOP patch. However, it *is* a little
surprising that your patch doesn't even manage to use "Index Filter"
quals. It appears that it is only capable of using table filter quals.
Obviously, the index has all the information that expression
evaluation needs, and yet I see "Filter: (tenk1.tenthous = ANY
('{1,3,42,43,44,45,46,47,48,49,50}'::integer[]))". So no improvement
over master here.

Interestingly enough, your patch only has this problem with SAOPs, at
least that I know of -- the spelling/style matters. If I add many
additional "tenthous" constants to the original version of the query
from the regression tests in the same way, but using the "longform"
(tenthous = 1 or tenthous = 3 ...) spelling, then your patch does
indeed use index filters/expression evaluation. Just like the original
"risky" plan (it's just a much bigger expression, with many more ORs).

--
Peter Geoghegan




Re: Sync scan & regression tests

2023-08-06 Thread Thomas Munro
On Mon, Aug 7, 2023 at 7:21 AM Konstantin Knizhnik  wrote:
> Two tests are failed because of sync scan - this tests cluster.sql and
> portals.sql perform seqscan without explicit order by and expect that
> data will be returned in particular order. But because of sync scan it
> doesn't happen. Small shared buffers are needed to satisfy seqscan
> criteria in heapam.c: `scan->rs_nblocks > NBuffers / 4` for tenk1 table.

I wondered the same thing while working on the tests in commit
8ab0ebb9a84, which explicitly care about physical order, so they *say
so* with ORDER BY ctid.  But the problem seems quite widespread, so I
didn't volunteer to try to do something like that everywhere, when Tom
committed cbf4177f for 027_stream_regress.pl.

FWIW here's another discussion of that cluster test, in which I was
still figuring out some surprising ways this feature can introduce
non-determinism even without concurrent access to the same table.

https://www.postgresql.org/message-id/flat/CA%2BhUKGLTK6ZuEkpeJ05-MEmvmgZveCh%2B_w013m7%2ByKWFSmRcDA%40mail.gmail.com




Re: Sync scan & regression tests

2023-08-06 Thread Tom Lane
Konstantin Knizhnik  writes:
> Is it ok that regression tests do not pass with a small value of
> shared buffers (for example 1MB)?

There are quite a few GUC settings with which you can break the
regression tests.  I'm not especially bothered by this one.

> A more general question - is it really good that, when there are actually
> no concurrent queries, a seqscan starts from somewhere other than the
> first page?

You're going to have race-condition issues if you try to make that
(not) happen, I should think.

regards, tom lane




Re: Use of additional index columns in rows filtering

2023-08-06 Thread Peter Geoghegan
On Sun, Aug 6, 2023 at 1:13 PM Peter Geoghegan  wrote:
> Since you're not relying on the nbtree work at all here, really (just
> on the transformation process itself), the strategic risk that this
> adds to your project isn't too great. It's not like this ties the
> success of your patch to the success of my own patch. At most it ties
> the success of your patch to something like Alena Rybakina's
> OR-to-SAOP transformation patch, which seems manageable to me. (To be
> clear, I'm not relying on that work in the same way myself -- for my
> patch the transformation process is just a nice-to-have.)

I decided to verify my understanding by checking what would happen
when I ran the OR-heavy tenk1 regression test query against a
combination of your patch, and v7 of the OR-to-SAOP transformation
patch. (To be clear, this is without my patch.)

I found that the problem that I saw with the OR-heavy tenk1 regression
test goes away (though only when I "set or_transform_limit=0"). That
is, we'll get an index scan plan that uses a SAOP. This index scan
plan is comparable to the master branch's BitmapOr scan. In
particular, both plans get 7 buffer hits. More importantly, the new
plan is (like the master branch plan) not risky in the way I've been
going on about.

This does mean that your patch gets a *slightly* slower plan, due to
the issue of added index page accesses. Improving that should be a job
for my patch -- it's not your problem, since there is no regression.

I'm not sure if it's somehow still possible that SAOP expression
evaluation is able to "do the risky thing" in the same way that your
patch's "Index Filter: ((tenk1.tenthous = 1) OR (tenk1.tenthous = 3)
OR (tenk1.tenthous = 42))" plan. But it certainly looks like it can't.
Increasingly, the problem here appears to me to be a problem of
lacking useful CNF transformations/normalization -- nothing more.
Structuring things so that we reliably use "the native representation
of ORs" via normalization seems likely to be all you really need.

We may currently be over relying on a similar process that happens
indirectly, via BitmapOr paths. I suspect that you're right to be
concerned about how this might already be affecting index-only scans.
Once we have CNF normalization/transformation in place, we will of
course continue to use some BitmapOr plans that may look a little like
the ones I'm focussed on, and some plans that use "index filters" (due
to your patch) that are also a little like that. But there's nothing
objectionable about those cases IMV (quite the opposite), since there
is no question of displacing/out-competing very similar plans that can
use index quals. (You might also find a way to avoid ever requiring
heap access/visibility checks for a subset of the "index filter" cases
where it is determined to be safe up front, but that's just a bonus.)

--
Peter Geoghegan




Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-06 Thread Thomas Munro
On Mon, Aug 7, 2023 at 8:40 AM Christoph Berg  wrote:
> 2023-08-06 17:21:24.078 UTC [127] 031_recovery_conflict.pl FATAL:  
> unrecognized conflict mode: 7

Thanks for testing!  Would you mind trying v8 from that thread?  V7
had a silly bug (I accidentally deleted a 'case' label while cleaning
some stuff up, resulting in the above error...)




Re: Adding a pg_servername() function

2023-08-06 Thread Christoph Moench-Tegeder
Hi,

## Laetitia Avrot (laetitia.av...@gmail.com):

> I understand your point and sure enough, my customer could set and use the
> cluster_name for that purpose. I totally disagree with using
> inet_server_addr() for that purpose as there are so many different network
> settings with VIPs and so on that it won't help.

That depends a bit on what the exact question is: "where did I connect
to" vs "where is the thing I conencted to running".

> Also, a socket connection
> will give a NULL value, not that it's *that* bad because if it's a socket,
> you're running locally and could still use `\! ifconfig`, but it does not
> work on some configurations (docker for example).

But with a local socket, you already know where you are.

> Also, most humans will
> find it easier to memorize a name than a number and that's one of the
> reasons why we remember websites' URLs instead of their IP.

That's why we have DNS, and it works both ways (especially when
working programmatically).
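
For reference, what a client can already ask for today, without any new
function (and as noted above, inet_server_addr() is NULL over a
Unix-domain socket):

SHOW cluster_name;
SELECT inet_server_addr(), inet_server_port();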

> I still disagree with the monitoring part. A good monitoring tool will have
> to reconcile data from the OS with data from the Postgres cluster. So that,
> we kind of need a way for the monitoring tool to know on which host this
> particular cluster is running and I think it's smarter to get this
> information from the Postgres cluster.

But that's again at least two questions in here: "what is the
instance on this host doing" and "what is the primary/read standby/...
of service X doing", and depending on that ask the base host's
primary address or the service's address. Yes, that's easy to say
when you can define the whole environment, and many setups discover
this only later in the life cycle.

Regards,
Christoph

-- 
Spare Space




Re: POC, WIP: OR-clause support for indexes

2023-08-06 Thread Peter Geoghegan
On Sat, Aug 5, 2023 at 7:01 PM Peter Geoghegan  wrote:
> Of course this immediately makes me wonder: shouldn't your patch be
> able to perform an additional transformation here? You know, by
> transforming "a.x = 42 OR a.x = 44" into "a IN (42, 44)"? Although I
> haven't checked for myself, I assume that this doesn't happen right
> now, since your patch currently performs all of its transformations
> during parsing.

Many interesting cases won't get SAOP transformation from the patch,
simply because of the or_transform_limit GUC's default of 500. I don't
think that that design makes too much sense. It made more sense back
when the focus was on expression evaluation overhead. But that's only
one of the benefits that we now expect from the patch, right? So it
seems like something that should be revisited soon.

I'm not suggesting that there is no need for some kind of limit. But
it seems like a set of heuristics might be a better approach. Although
I would like to get a better sense of the costs of the transformation
to be able to say too much more.

-- 
Peter Geoghegan




Re: A failure in 031_recovery_conflict.pl on Debian/s390x

2023-08-06 Thread Christoph Berg
Re: Thomas Munro
> It's great that you can reproduce this semi-reliably!  I've rebased
> the patch, hoping you can try it out.

Unfortunately very semi, today I didn't get to the same point where it
exited after test 7, but got some other timeouts. Not even sure they
are related to this (?) problem. Anyway:

> https://www.postgresql.org/message-id/CA%2BhUKGJDcyW8Hbq3UyG-5-5Y7WqqOTjrXbFTMxxmhiofFraE-Q%40mail.gmail.com

This patch makes the testsuite hang (and later exit) after this:

ok 17 - 5 recovery conflicts shown in pg_stat_database
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 255 just after 17.

I haven't seen any other problems with the patch attached, but since
that test was hanging and hence very slow, I couldn't do many runs.

Perhaps that's progress, I don't know. :) Logs attached.

Christoph
2023-08-06 17:20:34.436 UTC [1274721] LOG:  starting PostgreSQL 17devel (Ubuntu 
17~~devel-1) on s390x-ibm-linux-gnu, compiled by gcc (Ubuntu 
11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit
2023-08-06 17:20:34.437 UTC [1274721] LOG:  listening on Unix socket 
"/tmp/N8KG4BCXAf/.s.PGSQL.53738"
2023-08-06 17:20:34.505 UTC [1274726] LOG:  database system was shut down at 
2023-08-06 17:20:33 UTC
2023-08-06 17:20:34.513 UTC [1274721] LOG:  database system is ready to accept 
connections
2023-08-06 17:20:35.499 UTC [1274748] 031_recovery_conflict.pl LOG:  statement: 
CREATE TABLESPACE test_recovery_conflict_tblspc LOCATION ''
2023-08-06 17:20:35.664 UTC [1274757] 031_recovery_conflict.pl LOG:  received 
replication command: SHOW data_directory_mode
2023-08-06 17:20:35.664 UTC [1274757] 031_recovery_conflict.pl STATEMENT:  SHOW 
data_directory_mode
2023-08-06 17:20:35.684 UTC [1274757] 031_recovery_conflict.pl LOG:  received 
replication command: SHOW wal_segment_size
2023-08-06 17:20:35.684 UTC [1274757] 031_recovery_conflict.pl STATEMENT:  SHOW 
wal_segment_size
2023-08-06 17:20:35.689 UTC [1274757] 031_recovery_conflict.pl LOG:  received 
replication command: IDENTIFY_SYSTEM
2023-08-06 17:20:35.689 UTC [1274757] 031_recovery_conflict.pl STATEMENT:  
IDENTIFY_SYSTEM
2023-08-06 17:20:35.694 UTC [1274757] 031_recovery_conflict.pl LOG:  received 
replication command: BASE_BACKUP ( LABEL 'pg_basebackup base backup',  
PROGRESS,  CHECKPOINT 'fast',  WAIT 0,  MANIFEST 'yes',  TARGET 'client')
2023-08-06 17:20:35.694 UTC [1274757] 031_recovery_conflict.pl STATEMENT:  
BASE_BACKUP ( LABEL 'pg_basebackup base backup',  PROGRESS,  CHECKPOINT 'fast', 
 WAIT 0,  MANIFEST 'yes',  TARGET 'client')
2023-08-06 17:20:35.735 UTC [1274724] LOG:  checkpoint starting: immediate 
force wait
2023-08-06 17:20:35.752 UTC [1274724] LOG:  checkpoint complete: wrote 7 
buffers (5.5%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.001 s, 
sync=0.001 s, total=0.018 s; sync files=0, longest=0.000 s, average=0.000 s; 
distance=11360 kB, estimate=11360 kB; lsn=0/260, redo lsn=0/228
2023-08-06 17:20:35.877 UTC [1274760] 031_recovery_conflict.pl LOG:  received 
replication command: SHOW data_directory_mode
2023-08-06 17:20:35.877 UTC [1274760] 031_recovery_conflict.pl STATEMENT:  SHOW 
data_directory_mode
2023-08-06 17:20:35.877 UTC [1274760] 031_recovery_conflict.pl LOG:  received 
replication command: CREATE_REPLICATION_SLOT "pg_basebackup_1274760" TEMPORARY 
PHYSICAL ( RESERVE_WAL)
2023-08-06 17:20:35.877 UTC [1274760] 031_recovery_conflict.pl STATEMENT:  
CREATE_REPLICATION_SLOT "pg_basebackup_1274760" TEMPORARY PHYSICAL ( 
RESERVE_WAL)
2023-08-06 17:20:36.012 UTC [1274760] 031_recovery_conflict.pl LOG:  received 
replication command: IDENTIFY_SYSTEM
2023-08-06 17:20:36.012 UTC [1274760] 031_recovery_conflict.pl STATEMENT:  
IDENTIFY_SYSTEM
2023-08-06 17:20:36.017 UTC [1274760] 031_recovery_conflict.pl LOG:  received 
replication command: START_REPLICATION SLOT "pg_basebackup_1274760" 0/200 
TIMELINE 1
2023-08-06 17:20:36.017 UTC [1274760] 031_recovery_conflict.pl STATEMENT:  
START_REPLICATION SLOT "pg_basebackup_1274760" 0/200 TIMELINE 1
2023-08-06 17:20:37.707 UTC [1274757] 031_recovery_conflict.pl LOG:  temporary 
file: path "base/pgsql_tmp/pgsql_tmp1274757.0", size 137324
2023-08-06 17:20:37.707 UTC [1274757] 031_recovery_conflict.pl STATEMENT:  
BASE_BACKUP ( LABEL 'pg_basebackup base backup',  PROGRESS,  CHECKPOINT 'fast', 
 WAIT 0,  MANIFEST 'yes',  TARGET 'client')
2023-08-06 17:20:52.532 UTC [1274953] standby LOG:  received replication 
command: IDENTIFY_SYSTEM
2023-08-06 17:20:52.532 UTC [1274953] standby STATEMENT:  IDENTIFY_SYSTEM
2023-08-06 17:20:52.558 UTC [1274953] standby LOG:  received replication 
command: START_REPLICATION 0/300 TIMELINE 1
2023-08-06 17:20:52.558 UTC [1274953] standby STATEMENT:  START_REPLICATION 
0/300 TIMELINE 1
2023-08-06 17:20:53.333 UTC [1274960] 031_recovery_conflict.pl LOG:  statement: 
CREATE DATABASE test_db
2023-08-06 17:21:05.491 UTC [1275162] 031_recovery_conflict.pl LOG:  statement: 
CREATE TABLE 

Re: Use of additional index columns in rows filtering

2023-08-06 Thread Peter Geoghegan
On Sun, Aug 6, 2023 at 10:23 AM Tomas Vondra
 wrote:
> > The index AM is entitled to make certain assumptions of opclass
> > members -- assumptions that cannot be made during expression
> > evaluation.

> Thanks for reminding me, I keep forgetting about this.

I was almost certain that you already knew that, actually. It's easy
to forget such details in a discussion like this one, where the focus
zooms out and then zooms back in, again and again.

> Would it be possible to inspect the expression and determine if it's
> "safe" to be treated almost like an index qual? Say, given a complex
> expression, we might check if it consists only of expressions that could
> be treated as an index qual. So a bit like leak-proof, which may
> actually be relevant here too I guess (e.g. int4div is not leak-proof,
> for example, so e.g. the division-by-zero would not allow index-qual
> treatment).

Clearly you're talking about a distinct set of guarantees to the ones
that B-Tree opclasses make about not throwing errors when scanning
maybe-not-visible index tuples. The B-Tree opclass guarantees might
not even be written down anywhere -- they seem like common sense,
almost.

What you've described definitely seems like it could be very useful,
but I don't think that it solves the fundamental problem with cases
like the BitmapOr plan. Even if you do things in a way that precludes
the possibility of extra heap page accesses (when the VM bit isn't
set), you still have the problem of "access predicates vs index filter
predicates". Which is a big problem, in and of itself.

> Anyway, I think there are always going to be clauses that would be safe
> to evaluate on the index, but the index AM does not know to handle them
> for some reason. For example it might require extending the AM to handle
> generic expressions, which doesn't seem quite desirable.

Actually, I mostly don't think we'd need to teach nbtree or other
index AMs anything about simplifying expressions. Structurally, we
should try to make things like ScalarArrayOpExpr into "just another
indexable operator", which has little to no difference with any other
indexable operator at runtime.

There probably are several good reasons why "normalizing to CNF" in
the planner is a good idea (to some extent we do this already). Alena
Rybakina's OR-to-SAOP transformation patch was written well before
anybody knew about the MDAM/SAOP patch I'm working on. The original
motivation was to lower expression evaluation overhead.

You could probably find a third or even a fourth reason to do that
specific transformation, if you thought about it for a while. Top-down
planner designs such as Cascades really have to spend a lot of time on
this kind of normalization process. For very general reasons -- many
of which are bound to apply in our own bottom-up planner design.

> So I think I see three points where we could evaluate expressions:
>
> 1) in the AM, as access preditates / index quals (doing this more often
>is kinda what the SAOP patches aim to do)
>
> 2) in the index scan, before checking VM / visibility (if the qual is
>safe to be evaluated early)
>
> 3) in the index scan, after checking VM / visibility (if the expression
>does unsafe things)

Agreed.

Presumably there would also be a class of expressions that the patch
should make into index filters rather than table filters, despite
being unable to determine whether they're safe to evaluate early. Even
if there is only a small chance of it helping at runtime, there is no
reason (or an infinitesimally small reason) not to just prefer
index filters where possible -- so it's desirable to always prefer
index filters, regardless of opclass/type restrictions on early
evaluation. Right?

Assuming the answer is yes, then I think that you still need all of
the index-only scan stuff that can "fallback to heap access", just to
be able to cover this other class of expression. I don't think that
this class that I've described will be rarely encountered, or
anything.

> I agree. That seems like a discussion relevant to the general topic of
> "upgrading" clauses. If anything, the patch may need to worry about
> moving table filters to index filters, that's the only thing it does.

Obviously that will have indirect consequences due to the changes in
the costing.

> > It is always better to make what could be an "index filter" into an
> > index qual. Of course, the current practical problem for you is
> > figuring out how to deal with that in cases like the BitmapOr case.
> > Since it's not as if the planner is wrong, exactly...it really is the
> > cheapest plan, so the planner is at least right on its own terms. I am
> > interested in finding a solution to that problem.
> >
>
> Well, if the planner is not wrong, what solution are we looking for? ;-)

I imagine that you really don't want to have to rely on some
wishy-washy philosophical argument about the planner's expectation
being the only reasonable basis for choosing a plan. 

Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-06 Thread Andres Freund
Hi,

On 2023-08-02 16:51:29 -0700, Matt Smiley wrote:
> I thought it might be helpful to share some more details from one of the
> case studies behind Nik's suggestion.
> 
> Bursty contention on lock_manager lwlocks recently became a recurring cause
> of query throughput drops for GitLab.com, and we got to study the behavior
> via USDT and uprobe instrumentation along with more conventional
> observations (see
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301).  This
> turned up some interesting finds, and I thought sharing some of that
> research might be helpful.

Hm, I'm curious whether you have a way to trigger the issue outside of your
prod environment. Mainly because I'm wondering if you're potentially hitting
the issue fixed in a4adc31f690 - we ended up not backpatching that fix, so
you'd not see the benefit unless you reproduced the load in 16+.

I'm also wondering if it's possible that the throughput drops
are correlated with heavyweight lock contention or higher-frequency access
to the pg_locks view. Deadlock checking and the locks view acquire locks on
all lock manager partitions... So if there's a bout of real lock contention
(for longer than deadlock_timeout)...


Given that most of your lock manager traffic comes from query planning - have
you evaluated using prepared statements more heavily?
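
(A minimal sketch of what that would look like -- the table and column
names here are placeholders; the point is that repeated executions can
reuse a cached, eventually generic, plan instead of re-planning on every
query:)

PREPARE get_thing(bigint) AS
    SELECT * FROM some_hot_table WHERE id = $1;
EXECUTE get_thing(42);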

Greetings,

Andres Freund




Re: There should be a way to use the force flag when restoring databases

2023-08-06 Thread Ahmed Ibrahim
Hi all,

I have addressed the PostgreSQL version compatibility of the FORCE option
for DROP DATABASE. Here is the latest version of the patch.
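
For context, this is roughly the statement shape the patch makes pg_restore
emit when both --create and the new --force option are given ("mydb" is
just a placeholder); DROP DATABASE ... WITH (FORCE) only exists on v13 and
later, hence the version check:

DROP DATABASE "mydb" WITH (FORCE);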

On Tue, Aug 1, 2023 at 6:19 PM Ahmed Ibrahim 
wrote:

> Hi Gurjeet,
>
> I have addressed all your comments except for the tests.
>
> I have tried adding test cases but wasn't able to do it the way I have in
> mind. I am not able to do things like holding open connections to the
> database and then trying the forced restore, so that it completes
> successfully with the option and shows errors otherwise.
>
> In the meantime I will continue trying to do the test cases. If anyone can
> help on that, I will appreciate it.
>
> Thanks
>
> On Thu, Jul 27, 2023 at 1:36 AM Gurjeet Singh  wrote:
>
>> On Sun, Jul 23, 2023 at 6:09 AM Ahmed Ibrahim
>>  wrote:
>> >
>> > Hi everyone,
>> >
>> > I have been working on this. This is a proposed patch for it so we have
>> a force option for DROPping the database.
>> >
>> > I'd appreciate it if anyone can review.
>>
>> Hi Ahmed,
>>
>> Thanks for working on this patch!
>>
>> +
>> +int force;
>>
>> That extra blank line is unnecessary.
>>
>> Using the bool data type, instead of int, for this option would've
>> been more natural.
>>
>> +if (ropt->force){
>>
>> Postgres coding style is to place the curly braces on a new line,
>> by themselves.
>>
>> +char   *dropStmt = pg_strdup(te->dropStmt);
>>
>> See if you can use pnstrdup(). Using that may obviate the need for
>> doing the null-placement acrobatics below.
>>
>> +PQExpBuffer ftStmt = createPQExpBuffer();
>>
>> What does the 'ft' stand for in this variable's name?
>>
>> +dropStmt[strlen(dropStmt) - 2] = ' ';
>> +dropStmt[strlen(dropStmt) - 1] = '\0';
>>
>> Try to evaluate the strlen() once and reuse it.
>>
>> +appendPQExpBufferStr(ftStmt, dropStmt);
>> +appendPQExpBufferStr(ftStmt, "WITH(FORCE);");
>> +te->dropStmt = ftStmt->data;
>> +}
>> +
>>
>> Remove the extra trailing whitespace on that last blank line.
>>
>> I think this whole code block needs to be protected by an 'if
>> (ropt->createDB)' check, like it's done about 20 lines above this
>> hunk. Otherwise, we may be appending 'WITH (FORCE)' for the DROP
>> command of a different (not a database) object type.
>>
>> Also, you may want to check that the target database version is
>> one that supports WITH force option. This command will fail for
>> anything before v13.
>>
>> The patch needs doc changes (pg_restore.sgml). And it needs to
>> mention --force option in the help output, as well (usage() function).
>>
>> Can you please see if you can add appropriate test case for this.
>> The committers may insist on it, when reviewing.
>>
>> Here are a couple of helpful links on how to prepare and submit
>> patches to the community. You may not need to strictly adhere to
>> these, but try to pick up a few recommendations that would make the
>> reviewer's job a bit easier.
>>
>> [1]: https://wiki.postgresql.org/wiki/Creating_Clean_Patches
>> [2]: https://wiki.postgresql.org/wiki/Submitting_a_Patch
>>
>> Best regards,
>> Gurjeet
>> http://Gurje.et
>>
>
diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 47bd7dbda0..f24e06fdf7 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -726,6 +726,15 @@ PostgreSQL documentation

  
 
+ 
+  --force
+  
+   
+ Force the database to be dropped during restore even if there are active connections.
+   
+  
+ 
+
 

 
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index aba780ef4b..2d167b58bf 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -154,6 +154,7 @@ typedef struct _restoreOptions
 	int			enable_row_security;
 	int			sequence_data;	/* dump sequence data even in schema-only mode */
 	int			binary_upgrade;
+	bool		force;
 } RestoreOptions;
 
 typedef struct _dumpOptions
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 39ebcfec32..218d440e35 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -532,6 +532,15 @@ RestoreArchive(Archive *AHX)
  */
 if (*te->dropStmt != '\0')
 {
+	if (ropt->createDB && ropt->force)
+	{
+		int queryLen = strlen(te->dropStmt);
+		char	   *dropStmt = pnstrdup(te->dropStmt, queryLen - 2);
+		PQExpBuffer newStmt = createPQExpBuffer();
+		appendPQExpBufferStr(newStmt, dropStmt);
+		appendPQExpBufferStr(newStmt, " WITH(FORCE);");
+		te->dropStmt = newStmt->data;
+	}
 	if (!ropt->if_exists ||
 		strncmp(te->dropStmt, "--", 2) == 0)
 	{
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 5dab1ba9ea..5041092ef9 100644
--- 

Sync scan & regression tests

2023-08-06 Thread Konstantin Knizhnik

Hi hackers,

Is it ok that regression tests do not pass with a small value of
shared buffers (for example 1MB)?


Two tests are failed because of sync scan - this tests cluster.sql and 
portals.sql perform seqscan without explicit order by and expect that 
data will be returned in particular order. But because of sync scan it 
doesn't happen. Small shared buffers are needed to satisfy seqscan 
criteria in heapam.c: `scan->rs_nblocks > NBuffers / 4` for tenk1 table.


A more general question - is it really good that, when there are actually
no concurrent queries, a seqscan starts from somewhere other than the
first page?







Re: Use of additional index columns in rows filtering

2023-08-06 Thread Tomas Vondra
On 8/5/23 02:53, Peter Geoghegan wrote:
> ...
> 
>> Right. This however begs a question - why would we actually need to
>> check the visibility map before evaluating the index filter, when the
>> index tuple alone is clearly good enough for the bitmapOr plan?
>>
>> Because if we didn't need to do that VM check, this issue with extra
>> heap accesses would disappear.
> 
> The index AM is entitled to make certain assumptions of opclass
> members -- assumptions that cannot be made during expression
> evaluation. The classic example is division-by-zero during evaluation
> of a qual, for a tuple that wasn't visible anyway. Our assumption is
> that stuff like that just cannot happen with index quals -- users
> shouldn't ever encounter sanity check errors caused by
> invisible-to-their-MVCC-snapshot tuples.
> 

Thanks for reminding me, I keep forgetting about this.

> I think that that's the main difficulty, as far as avoiding heap
> access for index filters is concerned. Of course, you could just limit
> yourself to those cases where the index AM assumptions were safe. But
> at that point, why not simply make sure to generate true index quals,
> and be done with it?
> 

Yeah, if it's possible to generate true index quals, it'd be stupid not
to do that. But clearly there are cases where that's not possible (we
may not have the code doing that, or maybe it's just not possible in
principle).

Would it be possible to inspect the expression and determine if it's
"safe" to be treated almost like an index qual? Say, given a complex
expression, we might check if it consists only of expressions that could
be treated as an index qual. So a bit like leak-proof, which may
actually be relevant here too I guess (e.g. int4div is not leak-proof,
for example, so e.g. the division-by-zero would not allow index-qual
treatment).

I now recall there probably was a past discussion about how leak-proof
relates to this, but IIRC the conclusion was it's not quite the same
thing. But maybe I just remember things wrong.

Anyway, I think there are always going to be clauses that would be safe
to evaluate on the index, but the index AM does not know how to handle them
for some reason. For example it might require extending the AM to handle
generic expressions, which doesn't seem quite desirable.

So I think I see three points where we could evaluate expressions:

1) in the AM, as access preditates / index quals (doing this more often
   is kinda what the SAOP patches aim to do)

2) in the index scan, before checking VM / visibility (if the qual is
   safe to be evaluated early)

3) in the index scan, after checking VM / visibility (if the expression
   does unsafe things)



>> I copied this from the IOS somewhat blindly, but now I'm starting to
>> think it was misguided. I thought it's a protection against processing
>> "invalid" tuples - not tuples broken after a crash (as that would be
>> fixed by crash recovery), but e.g. tuples with schema changes from an
>> aborted transaction.
> 
> I agree that schema changes for indexes shouldn't be an issue, though.
> 
>> I'm not quite sure what are the differences between "index qual" vs.
>> "access predicate index qual" vs. "index filter predicate index quals",
>> or what "dispacing" would mean exactly.
> 
> For the purposes of this point about "a hierarchy of quals", it
> doesn't really matter -- that was the point I was trying to make.
> 
> In other words, "index quals" are strictly better than equivalent
> "index filters", which are themselves strictly better than equivalent
> "table filters". While it is also true that you can meaningfully
> classify "index quals" into their own hierarchy (namely access
> predicates vs index filter predicates), that doesn't necessarily need
> to be discussed when discussing the hierarchy from a planner point of
> view, since it is (at least for now) internal to the nbtree index AM.
> 
> On second thought, I tend to doubt that your patch needs to worry
> about each type of index qual directly. It probably needs to worry
> about index quals in general.
> 

I agree. That seems like a discussion relevant to the general topic of
"upgrading" clauses. If anything, the patch may need to worry about
moving table filters to index filters, that's the only thing it does.

> It is always better to make what could be an "index filter" into an
> index qual. Of course, the current practical problem for you is
> figuring out how to deal with that in cases like the BitmapOr case.
> Since it's not as if the planner is wrong, exactly...it really is the
> cheapest plan, so the planner is at least right on its own terms. I am
> interested in finding a solution to that problem.
> 

Well, if the planner is not wrong, what solution are we looking for? ;-)

FWIW, the problem may be that the patch makes the new plan look cheaper
than some "actually better" plan (e.g. the BitmapOr one). In that case, we
could just keep the old costing (kinda assuming the worst case, but as
you said, the 

Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-06 Thread Tomas Vondra



On 8/3/23 22:39, Tomas Vondra wrote:
> On 8/3/23 01:51, Matt Smiley wrote:
>> I thought it might be helpful to share some more details from one of the
>> case studies behind Nik's suggestion.
>>
>> Bursty contention on lock_manager lwlocks recently became a recurring
>> cause of query throughput drops for GitLab.com, and we got to study the
>> behavior via USDT and uprobe instrumentation along with more
>> conventional observations (see
>> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301
>> ). 
>> This turned up some interesting finds, and I thought sharing some of
>> that research might be helpful.
>>
> 
> The analysis in the linked gitlab issue is pretty amazing. I wasn't
> planning to argue against the findings anyway, but plenty of data
> supporting the conclusions is good.
> 
> I'm not an expert on locking, so some of the stuff I say may be
> trivially obvious - it's just me thinking about ...
> 
> I wonder what's the rough configuration of those systems, though. Both
> the hardware and PostgreSQL side. How many cores / connections, etc.?
> 
> 
>> Results so far suggest that increasing FP_LOCK_SLOTS_PER_BACKEND would
>> have a much larger positive impact than any other mitigation strategy we
>> have evaluated.  Rather than reducing hold duration or collision rate,
>> adding fastpath slots reduces the frequency of even having to acquire
>> those lock_manager lwlocks.  I suspect this would be helpful for many
>> other workloads, particularly those having high-frequency queries whose
>> tables and indexes collectively add up to more than about 16 relations.
>>
> 
> Yes, I agree with that. Partitioning makes this issue worse, I guess.
> Schemas with indexes on every column are disturbingly common these days
> too, and they run into the same issue ...
> 
>> Lowering the lock_manager lwlock acquisition rate means lowering its
>> contention rate (and probably also its contention duration, since
>> exclusive mode forces concurrent lockers to queue).
>>
>> I'm confident this would help our workload, and I strongly suspect it
>> would be generally helpful by letting queries use fastpath locking more
>> often.
>>
> 
> OK
> 
>>> However, the lmgr/README says this is meant to alleviate contention on
>>> the lmgr partition locks. Wouldn't it be better to increase the number
>>> of those locks, without touching the PGPROC stuff?
>>
>> That was my first thought too, but growing the lock_manager lwlock
>> tranche isn't nearly as helpful.
>>
>> On the slowpath, each relation's lock tag deterministically hashes onto
>> a specific lock_manager lwlock, so growing the number of lock_manager
>> lwlocks just makes it less likely for two or more frequently locked
>> relations to hash onto the same lock_manager.
>>
> 
> Hmmm, so if we have a query that joins 16 tables, or a couple of tables
> with many indexes, all backends running it will acquire exactly the same
> partition locks. And we're likely acquiring them in exactly the same
> order (we lock objects in a consistent order to avoid deadlocks), making
> the issue worse.
> 
>> In contrast, growing the number of fastpath slots completely avoids
>> calls to the slowpath (i.e. no need to acquire a lock_manager lwlock).
>>
>> The saturation condition we'd like to solve is heavy contention on one
>> or more of the lock_manager lwlocks.  Since that is driven by the
>> slowpath acquisition rate of heavyweight locks, avoiding the slowpath is
>> better than just moderately reducing the contention on the slowpath.
>>
>> To be fair, increasing the number of lock_manager locks definitely can
>> help to a certain extent, but it doesn't cover an important general
>> case.  As a thought experiment, suppose we increase the lock_manager
>> tranche to some arbitrarily large size, larger than the number of
>> relations in the db.  This unrealistically large size means we have the
>> best case for avoiding collisions -- each relation maps uniquely onto
>> its own lock_manager lwlock.  That helps a lot in the case where the
>> workload is spread among many non-overlapping sets of relations.  But it
>> doesn't help a workload where any one table is accessed frequently via
>> slowpath locking.
>>
> 
> Understood.
> 
>> Continuing the thought experiment, if that frequently queried table has
>> 16 or more indexes, or if it is joined to other tables that collectively
>> add up to over 16 relations, then each of those queries is guaranteed to
>> have to use the slowpath and acquire the deterministically associated
>> lock_manager lwlocks.
>>
>> So growing the tranche of lock_manager lwlocks would help for some
>> workloads, while other workloads would not be helped much at all.  (As a
>> concrete example, a workload at GitLab has several frequently queried
>> tables with over 16 indexes that consequently always use at least some
>> slowpath locks.)
>>
>> For additional context:
>>
>> 
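
To see this effect directly on a running system, one illustrative check
(the pgbench_accounts table below is just a stand-in for any frequently
queried table with many indexes) is to hold a query's locks open in a
transaction and then look at pg_locks, whose fastpath column shows which
relation locks fit into the per-backend fastpath slots and which ones had
to go through the shared lock table and its lock_manager lwlocks:

  BEGIN;
  SELECT count(*) FROM pgbench_accounts;

  -- relation locks taken by this backend, split by fastpath usage
  SELECT fastpath, count(*)
  FROM pg_locks
  WHERE pid = pg_backend_pid() AND locktype = 'relation'
  GROUP BY fastpath;

  COMMIT;

Once a query touches more relations (tables plus indexes) than there are
fastpath slots, rows with fastpath = false start to appear, and those are
exactly the acquisitions that contend on the lock_manager tranche discussed
above.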

RE: CDC/ETL system on top of logical replication with pgoutput, custom client

2023-08-06 Thread José Neves
Hi there, hope to find you all well.

A follow-up on this. Indeed, a new commit-based approach solved my missing data 
issues.
But, getting back to the previous examples, what server timestamps should I
expect to see for the WAL records containing these changes?

With these 2 transactions:
T-1
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT  LSN4-4000

T-2
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT  LSN4-5500

Arriving this way:
BEGIN
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT  LSN4-4000

BEGIN
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT  LSN4-5500

Are the server timestamps for them expected to be:

BEGIN
INSERT LSN1-1000 - 2
UPDATE LSN2-2000 - 4
UPDATE LSN3-3000 - 6
COMMIT  LSN4-4000 - 7

BEGIN
INSERT LSN1-500 - 1
UPDATE LSN2-1500 - 3
UPDATE LSN3-2500 - 5
COMMIT  LSN4-5500 - 8

Or:

BEGIN
INSERT LSN1-1000 - 1
UPDATE LSN2-2000 - 2
UPDATE LSN3-3000 - 3
COMMIT  LSN4-4000 - 4

BEGIN
INSERT LSN1-500 - 5
UPDATE LSN2-1500 - 6
UPDATE LSN3-2500 - 7
COMMIT  LSN4-5500 - 8

I'm asking because, although I'm no longer missing data, I have a second async
process that can fail (publishing data to an event messaging service), and
therefore there is a possibility of data duplication. Worse, I have to split
large transactions because message sizes are limited. It would be nice if I
could rely on server timestamps to discard duplicated data...

Thanks.
Regards,
José Neves
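
One way to see the ordering guarantee in question, without relying on
timestamps, is a quick check with the test_decoding plugin (not pgoutput;
it requires wal_level = logical, and the slot and table names below are
throwaway). Each transaction is delivered whole and in commit order, and
with the include-timestamp option a timestamp is attached only to the
COMMIT line, which is why de-duplicating on the commit LSN plus the
change's position within its transaction is more reliable than per-change
server times:

  SELECT pg_create_logical_replication_slot('ordering_demo', 'test_decoding');

  CREATE TABLE demo (id int);
  BEGIN; INSERT INTO demo VALUES (1); INSERT INTO demo VALUES (2); COMMIT;  -- "T-1"
  BEGIN; INSERT INTO demo VALUES (3); INSERT INTO demo VALUES (4); COMMIT;  -- "T-2"

  -- all of T-1, BEGIN through COMMIT, is returned before any of T-2
  SELECT lsn, xid, data
  FROM pg_logical_slot_peek_changes('ordering_demo', NULL, NULL,
                                    'include-timestamp', 'on');

  SELECT pg_drop_replication_slot('ordering_demo');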

From: José Neves 
Sent: 1 August 2023 10:13
To: Andres Freund 
Cc: Amit Kapila ; pgsql-hack...@postgresql.org 

Subject: RE: CDC/ETL system on top of logical replication with pgoutput, custom 
client

Hi Andres.

Oh, I see the error of my ways... :(

By ignoring commits and acknowledging individual operation LSNs instead, I was
effectively rewinding the subscription. In the previous example, if I
acknowledged the LSN of the first insert of the second transaction (LSN1-500),
I was basically telling Postgres to send everything again, including the
already processed T1.

> what you mean with the "different from the numeric order"
I'm probably lacking terminology. I mean that LSN4-5500 > LSN4-4000 > LSN3-3000 
> LSN3-2500...

But, if I'm understanding correctly, I can only rely on LSNs increasing
monotonically for the commit events, not across the individual changes of
different transactions. Which explains my pain. The world makes sense again.

Thank you very much. I will try to implement this new logic, and hopefully I
won't bug you again with this issue.
Regards,
José Neves

From: Andres Freund 
Sent: 1 August 2023 00:21
To: José Neves 
Cc: Amit Kapila ; pgsql-hack...@postgresql.org 

Subject: Re: CDC/ETL system on top of logical replication with pgoutput, custom 
client

Hi,

On 2023-07-31 21:25:06 +, José Neves wrote:
> Ok, if I understood you correctly, I start to see where my logic is faulty. 
> Just to make sure that I got it right, taking the following example again:
>
> T-1
> INSERT LSN1-1000
> UPDATE LSN2-2000
> UPDATE LSN3-3000
> COMMIT  LSN4-4000
>
> T-2
> INSERT LSN1-500
> UPDATE LSN2-1500
> UPDATE LSN3-2500
> COMMIT  LSN4-5500
>
> Where data will arrive in this order:
>
> INSERT LSN1-500
> INSERT LSN1-1000
> UPDATE LSN2-1500
> UPDATE LSN2-2000
> UPDATE LSN3-2500
> UPDATE LSN3-3000
> COMMIT  LSN4-4000
> COMMIT  LSN4-5500

No, they won't arrive in that order. They will arrive as

BEGIN
INSERT LSN1-1000
UPDATE LSN2-2000
UPDATE LSN3-3000
COMMIT  LSN4-4000
BEGIN
INSERT LSN1-500
UPDATE LSN2-1500
UPDATE LSN3-2500
COMMIT  LSN4-5500

Because T1 committed before T2. Changes are only streamed out at commit /
prepare transaction (*). Within a transaction, they however *will* be ordered
by LSN.

(*) Unless you use streaming mode, in which case it'll all be more
complicated, as you'll also receive changes for transactions that might still
abort.


> You are saying that the LSN3-3000 will never be missing: either the entire
> connection will fail at that point, or everything should be received in the
> expected order (which is different from the "numeric order" of LSNs).

I'm not quite sure what you mean with the "different from the numeric order"
bit...


> If the connection is down, upon restart, I will receive the entire T-1
> transaction again (well, all example data again).

Yes, unless you already acknowledged receipt up to LSN4-4000 and/or are only
asking for newer transactions when reconnecting.


> In addition to that, if I commit LSN4-4000, even though that LSN has a "bigger
> numeric value" than the ones representing INSERT and UPDATE events on T-2, I
> will be receiving the entire T-2 transaction again, as the LSN4-5500 is
> still uncommitted.

I don't quite know what you mean with "commit LSN4-4000" here.


> This makes sense to me, but just to be extra clear, I will never receive a
> transaction commit before receiving all other events for that transaction.

Correct.

Greetings,

Andres Freund


Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-08-06 Thread Masahiko Sawada
On Wed, Aug 2, 2023 at 5:13 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > 4.
> > + /*
> > + * Check that all logical replication slots have reached the current WAL
> > + * position.
> > + */
> > + res = executeQueryOrDie(conn,
> > + "SELECT slot_name FROM pg_catalog.pg_replication_slots "
> > + "WHERE (SELECT count(record_type) "
> > + " FROM pg_catalog.pg_get_wal_records_content(confirmed_flush_lsn,
> > pg_catalog.pg_current_wal_insert_lsn()) "
> > + " WHERE record_type != 'CHECKPOINT_SHUTDOWN') <> 0 "
> > + "AND temporary = false AND wal_status IN ('reserved', 'extended');");
> >
> > I think this can unnecessarily lead to reading a lot of WAL data if
> > the confirmed_flush_lsn for a slot is too much behind. Can we think of
> > improving this by passing the number of records to read which in this
> > case should be 1?
>
> I checked and pg_wal_record_info() seemed to be usable for this purpose. I
> tried to move the functionality to core.

IIUC the above query checks whether the WAL record written at the slot's
confirmed_flush_lsn is a CHECKPOINT_SHUTDOWN record, but there is no check
that this WAL record is the latest one. Therefore, I think it's quite
possible that a slot's confirmed_flush_lsn points to a previous
CHECKPOINT_SHUTDOWN, for example in cases where the subscription was
disabled after the publisher shut down and some changes were then made
on the publisher. We might want to add that check too, but it would not
work, because some WAL records could be written (e.g., by autovacuum)
during pg_upgrade before the slot's confirmed_flush_lsn is checked.
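
As a reference point, a minimal query like the one below (illustrative only;
it measures only the distance, and as noted above a nonzero lag can consist
entirely of harmless records such as a shutdown checkpoint, or hide real
changes written during the upgrade) shows how far each persistent logical
slot's confirmed_flush_lsn lags behind the current insert position:

  SELECT slot_name,
         confirmed_flush_lsn,
         pg_current_wal_insert_lsn() AS current_insert_lsn,
         pg_wal_lsn_diff(pg_current_wal_insert_lsn(),
                         confirmed_flush_lsn) AS lag_bytes
  FROM pg_catalog.pg_replication_slots
  WHERE slot_type = 'logical' AND NOT temporary;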

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve join_search_one_level readibilty (one line change)

2023-08-06 Thread David Rowley
On Fri, 4 Aug 2023 at 16:05, Richard Guo  wrote:
>
>
> On Fri, Aug 4, 2023 at 10:36 AM David Rowley  wrote:
>>
>> The whole lnext() stuff all feels a bit old now that Lists are arrays.
>> I think we'd be better off adjusting the code to pass the List index where
>> we start from rather than the ListCell to start from.  That way we can
>> use for_each_from() to iterate rather than for_each_cell().  What's
>> there today feels a bit crufty and there's some element of danger that
>> the given ListCell does not even belong to the given List.
>
>
> I think we can go even further and do the same for the 'bushy plans' case,
> like the attached.

Seems like a good idea to me. I've pushed that patch.

Alex, many thanks for highlighting this and posting a patch to fix it.
Congratulations on your first patch being committed.

David




Re: How to add a new operator for parser?

2023-08-06 Thread Julien Rouhaud
On Sun, Aug 06, 2023 at 01:37:42PM +0800, jacktby jacktby wrote:
>
> I need to build a new datatype. It can contain different datatypes, like
> '(1,'a',2.0)'; it's an (integer, string, float) tuple type

Is there any reason why you can't simply rely on the record datatype?

> and then I need to
> add operators for it. How should I do that?

If you use the record datatype, you would only need to add a new operator.
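
As a rough sketch of what that could look like (the type, function, and
operator names here are made up, and the "compare only the integer field"
semantics is just an example), a named composite type already gives you the
tuple behavior, and CREATE OPERATOR adds custom operators on top of it; the
built-in record comparison operators such as = work on composite values out
of the box:

  CREATE TYPE triple AS (i integer, s text, f float8);

  -- example semantics: two triples are "equal" if their integer parts match
  CREATE FUNCTION triple_int_eq(a triple, b triple) RETURNS boolean
      IMMUTABLE STRICT LANGUAGE sql
      AS $$ SELECT (a).i = (b).i $$;

  CREATE OPERATOR === (
      LEFTARG = triple,
      RIGHTARG = triple,
      FUNCTION = triple_int_eq
  );

  SELECT ROW(1, 'a', 2.0)::triple === ROW(1, 'b', 3.5)::triple;   -- true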