Re: Change GUC hashtable to use simplehash?

2024-01-05 Thread jian he
On Sat, Jan 6, 2024 at 9:04 AM John Naylor  wrote:
>
> On Fri, Jan 5, 2024 at 6:58 PM jian he  wrote:
> > -Dcassert=true \
>
> > -Dbuildtype=debug \
>
> These probably don't matter much for this test, but these should be
> off for any performance testing.
>
> > -DWRITE_READ_PARSE_PLAN_TREES
> > -DCOPY_PARSE_PLAN_TREES
> > -DREALLOCATE_BITMAPSETS
> > -DRAW_EXPRESSION_COVERAGE_TEST
>
> I'd guess it was one of these, which should likewise be off as well.

Thanks for pointing it out.
meson setup ${BUILD} \
-Dprefix=${PG_PREFIX} \
-Dpgport=5459 \
-Dplperl=enabled \
-Dplpython=enabled \
-Dssl=openssl \
-Dldap=enabled \
-Dlibxml=enabled \
-Dlibxslt=enabled \
-Duuid=e2fs \
-Dzstd=enabled \
-Dlz4=enabled \
-Dsystemd=enabled \
-Dicu=enabled \
-Dbuildtype=release \
-Ddocs_pdf=disabled \
-Dllvm=disabled

now the results:

jian@jian:~/Desktop/pg_src/src4/postgres$ bash
/home/jian/Desktop/pg_src/src4/postgres/runbench.sh
select * from bench_string_hash(10);
latency average = 145.021 ms
select * from bench_cstring_hash_unaligned(10);
latency average = 100.829 ms
select * from bench_cstring_hash_aligned(10);
latency average = 100.606 ms
select * from bench_pgstat_hash(10);
latency average = 96.140 ms
select * from bench_pgstat_hash_fh(10);
latency average = 62.784 ms
jian@jian:~/Desktop/pg_src/src4/postgres$ bash
/home/jian/Desktop/pg_src/src4/postgres/runbench.sh
select * from bench_string_hash(10);
latency average = 147.782 ms
select * from bench_cstring_hash_unaligned(10);
latency average = 101.179 ms
select * from bench_cstring_hash_aligned(10);
latency average = 101.219 ms
select * from bench_pgstat_hash(10);
latency average = 96.357 ms
select * from bench_pgstat_hash_fh(10);
latency average = 62.902 ms




Re: Change GUC hashtable to use simplehash?

2024-01-05 Thread John Naylor
On Fri, Jan 5, 2024 at 6:58 PM jian he  wrote:
> -Dcassert=true \

> -Dbuildtype=debug \

These probably don't matter much for this test, but these should be
off for any performance testing.

> -DWRITE_READ_PARSE_PLAN_TREES
> -DCOPY_PARSE_PLAN_TREES
> -DREALLOCATE_BITMAPSETS
> -DRAW_EXPRESSION_COVERAGE_TEST

I'd guess it was one of these, which should likewise be off as well.




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-05 Thread jian he
On Fri, Jan 5, 2024 at 4:37 PM jian he  wrote:
>
> > > > be reused for a different user.
> > > >
> > >
> > > You are right.
> > > So I changed it; now the schema owner will be the error table owner.
> > > For every error table tuple insert,
> > > I switch to the schema owner, do the insert, then switch back to the
> > > COPY FROM operation user.
> > > Now everyone (except the superuser) will need an explicit grant to
> > > access the error table.
> >
> > There are some compilation issues reported at [1] for the patch:
> > [04:04:26.288] copyfromparse.c: In function ‘NextCopyFrom’:
> > [04:04:26.288] copyfromparse.c:1126:25: error: ‘copy_errors_tupDesc’
> > may be used uninitialized in this function
> > [-Werror=maybe-uninitialized]
> > [04:04:26.288] 1126 | copy_errors_tup = heap_form_tuple(copy_errors_tupDesc,
> > [04:04:26.288] | ^~~~
> > [04:04:26.288] 1127 | t_values,
> > [04:04:26.288] | ~
> > [04:04:26.288] 1128 | t_isnull);
> > [04:04:26.288] | ~
> > [04:04:26.288] copyfromparse.c:1160:4: error: ‘copy_errorsrel’ may be
> > used uninitialized in this function [-Werror=maybe-uninitialized]
> > [04:04:26.288] 1160 | table_close(copy_errorsrel, RowExclusiveLock);
> > [04:04:26.288] | ^
> >
> > [1] - https://cirrus-ci.com/task/4785221183209472
> >
>
> I fixed this issue, and also improved the doc.
> Other implementations have not changed.

Sorry to bother you again.
This time, I used the CI to test it again.
Now there should be no warnings.
From f033ef4025dbe2012007434dacd4821718443571 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Sat, 6 Jan 2024 02:34:22 +0800
Subject: [PATCH v14 1/1] Make COPY FROM more error tolerant

At present, when processing the source file, COPY FROM may encounter three types of errors:
* extra data after last expected column
* missing data for column \"%s\"
* data type conversion errors

Instead of throwing an error while copying, the save_error (boolean) specifier
saves the errors to the table copy_errors for all COPY FROM operations that happen in the same schema.

 We check for the existence of the table copy_errors,
 and we also check the definition of copy_errors by comparing column names and data types.
 If copy_errors already exists and meets the criteria, the error metadata will be saved to it.
 If copy_errors does not exist, it is created.
 If copy_errors exists but cannot be used for saving errors, an error is raised.

 The table copy_errors is per-schema; it is owned by the owner of the
 COPY FROM operation's destination schema.
 The table owner has full privileges on copy_errors;
 other non-superusers must be granted privileges to access it.
---
 doc/src/sgml/ref/copy.sgml   | 120 -
 src/backend/commands/copy.c  |  12 ++
 src/backend/commands/copyfrom.c  | 133 +-
 src/backend/commands/copyfromparse.c | 217 +--
 src/backend/parser/gram.y|   8 +-
 src/bin/psql/tab-complete.c  |   3 +-
 src/include/commands/copy.h  |   1 +
 src/include/commands/copyfrom_internal.h |   6 +
 src/include/parser/kwlist.h  |   1 +
 src/test/regress/expected/copy2.out  | 137 ++
 src/test/regress/sql/copy2.sql   | 123 +
 11 files changed, 745 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69c..f6cdf0cf 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,6 +44,7 @@ COPY { table_name [ ( column_name [, ...] ) | * }
 FORCE_NULL { ( column_name [, ...] ) | * }
 ENCODING 'encoding_name'
+SAVE_ERROR [ boolean ]
 
  
 
@@ -411,6 +412,18 @@ WHERE condition
 

 
+   
+SAVE_ERROR
+
+ 
+  Specifies that any data conversion errors while copying will automatically be saved in the table COPY_ERRORS, and the COPY FROM operation will not be interrupted by conversion errors.
+  This option is not allowed when using binary format. This option
+  is only supported for COPY FROM syntax.
+  If this option is omitted, any data type conversion errors will be raised immediately.
+ 
+
+   
+
   
  
 
@@ -564,6 +577,7 @@ COPY count
 amount to a considerable amount of wasted disk space if the failure
 happened well into a large copy operation. You might wish to invoke
 VACUUM to recover the wasted space.
+To continue copying while skipping conversion errors in a COPY FROM, you might wish to specify SAVE_ERROR.

 

@@ -572,6 +586,18 @@ COPY count
 null strings to null values and unquoted null strings to empty strings.

 
+   
+If the SAVE_ERROR option is specified and conversion errors occur while copying,
+PostgreSQL will first check for the existence of the table COPY_ERRORS, then save the conversion error information to it.
+If it does exist but its definition cannot be used to save the errors, an error is raised, COPY FROM 

Re: remaining sql/json patches

2024-01-05 Thread jian he
Some tests after applying V33 and my small changes.
setup:
create table test_scalar1(js jsonb);
insert into test_scalar1 select jsonb '{"a":"[12,13]"}' FROM
generate_series(1,1e5) g;
create table test_scalar2(js jsonb);
insert into test_scalar2 select jsonb '{"a":12}' FROM generate_series(1,1e5) g;
create table test_array1(js jsonb);
insert into test_array1 select jsonb '{"a":[1,2,3,4,5]}' FROM
generate_series(1,1e5) g;
create table test_array2(js jsonb);
insert into test_array2 select jsonb '{"a": "{1,2,3,4,5}"}' FROM
generate_series(1,1e5) g;

tests:
return a scalar int4range
explain(costs off,analyze) SELECT item FROM test_scalar1,
JSON_TABLE(js, '$.a' COLUMNS (item int4range PATH '$' omit quotes))
\watch count=5
237.753 ms

explain(costs off,analyze) select json_query(js, '$.a' returning
int4range omit quotes) from test_scalar1  \watch count=5
462.379 ms

explain(costs off,analyze) select json_value(js,'$.a' returning
int4range) from test_scalar1 \watch count=5
362.148 ms

explain(costs off,analyze) select (js->>'a')::int4range from
test_scalar1 \watch count=5
301.089 ms

explain(costs off,analyze) select trim(both '"' from
jsonb_path_query_first(js,'$.a')::text)::int4range from test_scalar1
\watch count=5
643.337 ms

return a numeric array from jsonb array.
explain(costs off,analyze) SELECT item FROM test_array1,
JSON_TABLE(js, '$.a' COLUMNS (item numeric[] PATH '$')) \watch count=5
727.807 ms

explain(costs off,analyze) SELECT json_query(js, '$.a' returning
numeric[]) from test_array1 \watch count=5
2995.909 ms

explain(costs off,analyze) SELECT
replace(replace(js->>'a','[','{'),']','}')::numeric[] from test_array1
\watch count=5
2990.114 ms

return a numeric array from jsonb string
explain(costs off,analyze) SELECT item FROM test_array2,
JSON_TABLE(js, '$.a' COLUMNS (item numeric[] PATH '$' omit quotes))
\watch count=5
237.863 ms

explain(costs off,analyze) SELECT json_query(js,'$.a' returning
numeric[] omit quotes) from test_array2 \watch count=5
893.888 ms

explain(costs off,analyze) SELECT trim(both '"'
from(jsonb_path_query(js,'$.a')::text))::numeric[] from test_array2
\watch count=5
1329.713 ms

explain(costs off,analyze) SELECT (js->>'a')::numeric[] from
test_array2 \watch count=5
740.645 ms

explain(costs off,analyze) SELECT trim(both '"' from
(json_query(js,'$.a' returning text)))::numeric[]  from test_array2
\watch count=5
1085.230 ms

return a scalar numeric
explain(costs off,analyze) SELECT item FROM test_scalar2,
JSON_TABLE(js, '$.a' COLUMNS (item numeric PATH '$' omit quotes)) \watch count=5
238.036 ms

explain(costs off,analyze) select json_query(js,'$.a' returning
numeric) from test_scalar2 \watch count=5
300.862 ms

explain(costs off,analyze) select json_value(js,'$.a' returning
numeric) from test_scalar2 \watch count=5
160.035 ms

explain(costs off,analyze) select
jsonb_path_query_first(js,'$.a')::numeric from test_scalar2 \watch
count=5
294.666 ms

explain(costs off,analyze) select jsonb_path_query(js,'$.a')::numeric
from test_scalar2 \watch count=5
547.130 ms

explain(costs off,analyze) select (js->>'a')::numeric from
test_scalar2 \watch count=5
243.652 ms

explain(costs off,analyze) select (js->>'a')::numeric,
(js->>'a')::numeric from test_scalar2 \watch count=5
403.183 ms

explain(costs off,analyze) select json_value(js,'$.a' returning numeric),
json_value(js,'$.a' returning numeric) from test_scalar2 \watch count=5
246.405 ms

explain(costs off,analyze) select json_query(js,'$.a' returning numeric),
json_query(js,'$.a' returning numeric) from test_scalar2 \watch count=5
520.754 ms

explain(costs off,analyze) SELECT item, item1 FROM test_scalar2,
JSON_TABLE(js, '$.a' COLUMNS (item numeric PATH '$' omit quotes,
item1 numeric PATH '$' omit quotes)) \watch count=5
242.586 ms
-
Overall, json_value is faster than json_query, but json_value cannot
deal with arrays in some cases.
And as you can see, in some cases json_value and json_query are not
as fast as our current implementation.

Here I only tested simple nesting levels. If you extract multiple values
from jsonb to SQL types, then json_table is faster.
In almost all cases, json_table is faster.

Under the hood, json_table actually calls json_value_op and json_query_op.
Without the json_value and json_query related code, json_table could not
be implemented.




Re: Fix bogus Asserts in calc_non_nestloop_required_outer

2024-01-05 Thread Robert Haas
On Fri, Jan 5, 2024 at 4:36 PM Tom Lane  wrote:
> I don't think I believe this code change, let alone any of the
> explanations for it.  The point of these Asserts is to be sure that
> we don't form an alleged parameterization set that includes any rels
> that are included in the new join, because that would be obviously
> wrong.  They do check that as they stand.  I'm not sure what invariant
> the proposed revision would be enforcing.

Well, this explanation made sense to me:

https://www.postgresql.org/message-id/CAMbWs4-%2BGs0HJ9ouBUb%3DqwHsGCXxG%2B92eJzLOpCkedvgtOWQ%3DQ%40mail.gmail.com

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Multidimensional Histograms

2024-01-05 Thread Alexander Cheshev
Hi Tomas,

> Another reason was that the algorithm described in the two papers you
> reference (1988 paper by DeWitt and the evaluation by Carlson and
> Sutherland from ~2010) is simple but has pretty obvious weaknesses. It
> processes the columns one by one - first build bucket on column "a",
> then splits each bucket into buckets on "b". So it's not symmetrical,
> and it results in lower accuracy compared to an "ideal" histogram with
> the same number of buckets (especially for the dimensions split early).

As stated in [1], the Sum of Squared Errors (SSE) is one of the most
natural error metrics. The Equi-Depth Histogram is not ideal because it
doesn't minimize SSE. On the other hand, the V-Optimal Histogram does
minimize SSE and from this point of view can be considered an ideal
solution.

> This does indeed produce an equi-depth histogram, which seems nice, but
> the mesh is regular in such a way that any value of the first dimension
> intersects all buckets on the second dimension. So for example with a
> histogram of 100x100 buckets on [a,b], any value "a=X" intersects with
> 100 buckets on "b", each representing 1/10000 of the tuples. But we don't
> know how the tuples are distributed within the bucket - so we end up using
> 0.5 of the bucket as unbiased estimate, but that can easily add-up in
> the wrong dimension.

Suppose that we have a query box a=X and we know how data is
distributed in the buckets. Let's consider only the buckets which are
intersected by the query box a=X. As we know how data is distributed
in the buckets, we can exclude the buckets which have no tuples
intersecting the query box a=X.

As we store buckets with no information about data distribution, we
have to make reasonable assumptions. If we keep buckets relatively
small then we can assume that buckets have a uniform distribution.

What I am trying to say is that the problem which you pointed out
comes from the fact that we store buckets with no information about
data distribution. Even in the one-dimensional case we have to assume
how data is distributed in buckets. By the way, Postgres assumes that
data has a uniform distribution within buckets.

> However, this is not the only way to build an equi-depth histogram,
> there are ways to make it more symmetrical. Also, it's not clear
> equi-depth histograms are ideal with multiple columns. There are papers
> proposing various other types of histograms (using different criteria to
> build buckets optimizing a different metric). The most interesting one
> seems to be V-Optimal histograms - see for example papers [1], [2], [3],
> [4], [5] and [6]. I'm sure there are more. The drawback of course is
> that it's more expensive to build such histograms.

Tomas, thank you for sharing your ideas regarding the V-Optimal
Histogram. I read through the papers which you gave me and came to
the following conclusion.

The problem can be formulated like this. We have N tuples in
M-dimensional space. We need to partition the space into buckets
iteratively until SSE is less than E or we reach the limit of B
buckets.

In the case of M-dimensional space it seems to me like an NP-hard
problem. A greedy heuristic, MHIST-2, is proposed in [2]. Preliminarily,
we sort the N tuples in ascending order. Then we iteratively select the
bucket whose split leads to the largest SSE reduction and split it into
two parts. We repeat the process until SSE is less than E or we reach
the limit of B buckets.
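
To make that concrete, here is a minimal one-dimensional sketch of the
greedy split selection (my own illustration, not code from [2]; the real
MHIST-2 also chooses which of the M dimensions to split):

#include <stdio.h>

typedef struct { int lo, hi; } Bucket;	/* half-open range over sorted values */

/* sum of squared errors of v[lo..hi) around its mean */
static double
sse(const double *v, int lo, int hi)
{
	double		mean = 0.0, s = 0.0;

	for (int i = lo; i < hi; i++)
		mean += v[i];
	mean /= hi - lo;
	for (int i = lo; i < hi; i++)
		s += (v[i] - mean) * (v[i] - mean);
	return s;
}

/* best split point of one bucket and the SSE reduction it achieves */
static double
best_split(const double *v, Bucket b, int *split)
{
	double		before = sse(v, b.lo, b.hi), best = 0.0;

	*split = -1;
	for (int s = b.lo + 1; s < b.hi; s++)
	{
		double		red = before - sse(v, b.lo, s) - sse(v, s, b.hi);

		if (red > best)
		{
			best = red;
			*split = s;
		}
	}
	return best;
}

int
main(void)
{
	double		v[] = {1, 1, 2, 2, 2, 10, 11, 11, 50, 51};	/* already sorted */
	int			n = 10, B = 4, nbuckets = 1;
	Bucket		buckets[8] = {{0, n}};

	/* repeatedly split the bucket offering the largest SSE reduction */
	while (nbuckets < B)
	{
		int			victim = -1, split = -1;
		double		best = 0.0;

		for (int i = 0; i < nbuckets; i++)
		{
			int			s;
			double		red = best_split(v, buckets[i], &s);

			if (s >= 0 && red > best)
			{
				best = red;
				victim = i;
				split = s;
			}
		}
		if (victim < 0)
			break;				/* nothing left worth splitting */
		buckets[nbuckets++] = (Bucket) {split, buckets[victim].hi};
		buckets[victim].hi = split;
	}

	for (int i = 0; i < nbuckets; i++)
		printf("bucket [%d, %d)\n", buckets[i].lo, buckets[i].hi);
	return 0;
}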

If we assume that B is significantly less than N, then the time
complexity of MHIST-2 can be estimated as O(M*N*B). Suppose that M
equals 3, B equals 1000 and N equals 300*B; then it will take slightly
over 0.9*10^9 iterations to build a V-Optimal Histogram.
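
Spelled out, that estimate is:

\[ M \cdot N \cdot B = 3 \times (300 \times 1000) \times 1000 = 9 \times 10^{8} = 0.9 \times 10^{9} \]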

You can see that we have to keep B as low as possible in order to
build a V-Optimal Histogram in feasible time. And here is a paradox:
on one side we use a V-Optimal Histogram in order to minimize SSE, but
on the other side we have to keep B as low as possible, which
eventually leads to an increase in SSE.

On the other hand, the time complexity required to build an Equi-Depth
Histogram doesn't depend on B and can be estimated as O(M*N*logN). SSE
can be arbitrarily reduced by increasing B, which in turn is only
limited by the storage limit. Experimental results show a low error
metric [3].

In an Equi-Depth Histogram a bucket is represented by two vectors. The
first vector points to the bottom-left corner of the bucket and the
other one points to the top-right corner of the bucket. Thus the space
complexity of an Equi-Depth Histogram can be estimated as
2*integer_size*M*B. Assume that M equals 3, B equals 1000 and
integer_size equals 4 bytes; then the Equi-Depth Histogram will occupy
24000 bytes.
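
For the record, that is:

\[ 2 \times \mathrm{integer\_size} \times M \times B = 2 \times 4 \times 3 \times 1000 = 24000 \ \mathrm{bytes} \]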

If a bucket is partially intersected by a query box then we assume
that data has a uniform distribution inside the bucket. It is a
reasonable assumption if B is relatively large.
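
To illustrate the uniformity assumption (my own sketch, not Postgres
code): the contribution of one partially intersected bucket is its tuple
count, roughly N/B in an Equi-Depth Histogram, scaled by the fraction of
the bucket's volume that the query box covers:

#include <stdio.h>

#define M 3						/* number of dimensions */

typedef struct { double lo[M], hi[M]; } Box;

/* fraction of the bucket's volume covered by the query box */
static double
overlap_fraction(const Box *bucket, const Box *query)
{
	double		frac = 1.0;

	for (int d = 0; d < M; d++)
	{
		double		lo = bucket->lo[d] > query->lo[d] ? bucket->lo[d] : query->lo[d];
		double		hi = bucket->hi[d] < query->hi[d] ? bucket->hi[d] : query->hi[d];

		if (hi <= lo)
			return 0.0;			/* bucket and query box are disjoint */
		frac *= (hi - lo) / (bucket->hi[d] - bucket->lo[d]);
	}
	return frac;
}

int
main(void)
{
	Box			bucket = {{0, 0, 0}, {10, 10, 10}};
	Box			query = {{5, 0, 0}, {20, 10, 10}};
	double		tuples_per_bucket = 100.0;	/* ~N/B */

	/* uniform-fill estimate: half the bucket is covered, so ~50 tuples */
	printf("estimated tuples: %.1f\n",
		   tuples_per_bucket * overlap_fraction(&bucket, &query));
	return 0;
}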

In order to efficiently find the buckets which intersect a query box,
we can store the Equi-Depth Histogram in an R-tree as proposed in [3].
On average it takes O(logB) iterations 

Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-05 Thread Melanie Plageman
On Fri, Jan 5, 2024 at 2:47 PM Andres Freund  wrote:
> On 2024-01-04 17:37:27 -0500, Melanie Plageman wrote:
> > On Thu, Jan 4, 2024 at 3:03 PM Andres Freund  wrote:
> > >
> > > On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote:
> > > > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel)
> > > >   continue;
> > > >   }
> > > >
> > > > - /* Collect LP_DEAD items in dead_items array, 
> > > > count tuples */
> > > > - if (lazy_scan_noprune(vacrel, buf, blkno, page, 
> > > > ,
> > > > + /*
> > > > +  * Collect LP_DEAD items in dead_items array, 
> > > > count tuples,
> > > > +  * determine if rel truncation is safe
> > > > +  */
> > > > + if (lazy_scan_noprune(vacrel, buf, blkno, page,
> > > > 
> > > > ))
> > > >   {
> > > >   Sizefreespace = 0;
> > > >
> > > >   /*
> > > >* Processed page successfully (without 
> > > > cleanup lock) -- just
> > > > -  * need to perform rel truncation and FSM 
> > > > steps, much like the
> > > > -  * lazy_scan_prune case.  Don't bother 
> > > > trying to match its
> > > > -  * visibility map setting steps, though.
> > > > +  * need to update the FSM, much like the 
> > > > lazy_scan_prune case.
> > > > +  * Don't bother trying to match its 
> > > > visibility map setting
> > > > +  * steps, though.
> > > >*/
> > > > - if (hastup)
> > > > - vacrel->nonempty_pages = blkno + 
> > > > 1;
> > > >   if (recordfreespace)
> > > >   freespace = 
> > > > PageGetHeapFreeSpace(page);
> > > >   UnlockReleaseBuffer(buf);
> > >
> > > The comment continues to say that we "determine if rel truncation is 
> > > safe" -
> > > but I don't see that?  Oh, I see, it's done inside lazy_scan_noprune(). 
> > > This
> > > doesn't seem like a clear improvement to me. Particularly because it's 
> > > only
> > > set if lazy_scan_noprune() actually does something.
> >
> > I don't get what the last sentence means ("Particularly because...").
>
> Took me a second to understand myself again too, oops. What I think I meant is
> that it seems error-prone that it's only set in some paths inside
> lazy_scan_noprune(). Previously it was at least a bit clearer in
> lazy_scan_heap() that it would be set for the different possible paths.

I see what you are saying. But if lazy_scan_noprune() doesn't do
anything, then it calls lazy_scan_prune(), which does set hastup and
update vacrel->nonempty_pages if needed.

Using hastup in lazy_scan_[no]prune() also means that they are
directly updating LVRelState after determining how to update it;
lazy_scan_heap() isn't responsible for that anymore. I don't see a reason
to be passing information back to lazy_scan_heap() to update
LVRelState when lazy_scan_[no]prune() has access to the LVRelState.

Importantly, once I combine the prune and freeze records, hastup is
set in heap_page_prune() instead of lazy_scan_prune() (that whole loop
in lazy_scan_prune() is eliminated). And I don't like having to pass
hastup back through lazy_scan_prune() and then to lazy_scan_heap() so
that lazy_scan_heap() can use it (and use it to update a data
structure available in lazy_scan_prune()).

- Melanie




Re: Improve the log message output of basic_archive when basic_archive.archive_directory parameter is not set

2024-01-05 Thread Nathan Bossart
On Fri, Oct 13, 2023 at 11:02:39AM +0200, Daniel Gustafsson wrote:
> That's a more compelling reason IMO.  I'm not sure if I prefer the
> GUC_check_errdetail-like approach better, I would for sure not be opposed to
> reviewing a version of the patch doing it that way.
> 
> Tung Nguyen: are you interested in updating the patch along these lines
> suggested by Nathan?

I gave it a try.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 7183005bb6786b63a5fd96ba5101849eb6f203e5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 5 Jan 2024 17:01:10 -0600
Subject: [PATCH v4 1/1] add support for emitting errdetail from archive module
 check-configured callback

---
 contrib/basic_archive/basic_archive.c | 8 +++-
 src/backend/postmaster/pgarch.c   | 8 +++-
 src/include/archive/archive_module.h  | 8 
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index 804567e919..2c8721ebf6 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -161,7 +161,13 @@ check_archive_directory(char **newval, void **extra, GucSource source)
 static bool
 basic_archive_configured(ArchiveModuleState *state)
 {
-	return archive_directory != NULL && archive_directory[0] != '\0';
+	if (archive_directory == NULL || archive_directory[0] == '\0')
+	{
+		arch_module_check_errdetail("basic_archive.archive_directory is not set.");
+		return false;
+	}
+
+	return true;
 }
 
 /*
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 67693b0580..f663965d89 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -91,6 +91,7 @@ typedef struct PgArchData
 } PgArchData;
 
 char	   *XLogArchiveLibrary = "";
+char	   *arch_module_check_errdetail_string;
 
 
 /* --
@@ -408,12 +409,17 @@ pgarch_ArchiverCopyLoop(void)
 			 */
 			HandlePgArchInterrupts();
 
+			/* Reset variables that might be set by the callback */
+			arch_module_check_errdetail_string = NULL;
+
 			/* can't do anything if not configured ... */
 			if (ArchiveCallbacks->check_configured_cb != NULL &&
 !ArchiveCallbacks->check_configured_cb(archive_module_state))
 			{
 ereport(WARNING,
-		(errmsg("archive_mode enabled, yet archiving is not configured")));
+		(errmsg("archive_mode enabled, yet archiving is not configured"),
+		 arch_module_check_errdetail_string ?
+		 errdetail_internal("%s", arch_module_check_errdetail_string) : 0));
 return;
 			}
 
diff --git a/src/include/archive/archive_module.h b/src/include/archive/archive_module.h
index fd59b9faf4..763af76e54 100644
--- a/src/include/archive/archive_module.h
+++ b/src/include/archive/archive_module.h
@@ -56,4 +56,12 @@ typedef const ArchiveModuleCallbacks *(*ArchiveModuleInit) (void);
 
 extern PGDLLEXPORT const ArchiveModuleCallbacks *_PG_archive_module_init(void);
 
+/* Support for messages reported from archive module callbacks. */
+
+extern PGDLLIMPORT char *arch_module_check_errdetail_string;
+
+#define arch_module_check_errdetail \
+	pre_format_elog_string(errno, TEXTDOMAIN), \
+	arch_module_check_errdetail_string = format_elog_string
+
 #endif			/* _ARCHIVE_MODULE_H */
-- 
2.25.1



Re: Build versionless .so for Android

2024-01-05 Thread Andres Freund
Hi,

On 2024-01-05 15:57:23 +0100, Peter Eisentraut wrote:
> On 05.01.24 01:00, Matthias Kuhn wrote:
> > Attached a patch with a (hopefully) better wording of the comment.
> > 
> > I have unsuccessfully tried to find an official source for this policy.
> > So for reference some discussions about the topic:
> > 
> > - 
> > https://stackoverflow.com/questions/11491065/linking-with-versioned-shared-library-in-android-ndk
> >  
> > 
> > - 
> > https://stackoverflow.com/questions/18681401/how-can-i-remove-all-versioning-information-from-shared-object-files
> >  
> > 
> What I would like to see is a specific thing that you are trying to do that
> doesn't work.  Probably, you are writing a program that is meant to run on
> Android, and you are linking it (provide command line), and then what
> happens?  The linking fails?  It fails to run?  What is the error? Can you
> provide a minimal example?  And so on.

I looked around for a bit, and couldn't really find good documentation around
this. The best I found is
https://developer.android.com/ndk/guides/abis#native-code-in-app-packages
which does strongly imply that the library names aren't versioned.

Greetings,

Andres Freund




Re: Fix bogus Asserts in calc_non_nestloop_required_outer

2024-01-05 Thread Tom Lane
Robert Haas  writes:
> On Fri, Nov 3, 2023 at 2:47 AM Richard Guo  wrote:
>> Comment is now added above the Asserts.  Thanks for taking an interest
>> in this.

> I'd like to suggest rewording this comment a little more. Here's my proposal:

> Both of the paths passed to this function are still parameterized by
> the topmost parent, because reparameterize_path_by_child() hasn't been
> called yet. So, these sanity checks use the parent relids.

> What do you think?

I don't think I believe this code change, let alone any of the
explanations for it.  The point of these Asserts is to be sure that
we don't form an alleged parameterization set that includes any rels
that are included in the new join, because that would be obviously
wrong.  They do check that as they stand.  I'm not sure what invariant
the proposed revision would be enforcing.

There might be an argument for leaving the existing Asserts in
place and *also* checking

Assert(!bms_overlap(outer_paramrels,
inner_path->parent->top_parent_relids));
Assert(!bms_overlap(inner_paramrels,
outer_path->parent->top_parent_relids));

but I'm not quite convinced what the point of that is.

regards, tom lane




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-05 Thread Andres Freund
Hi,

On 2024-01-05 15:23:12 -0500, Robert Haas wrote:
> On Fri, Jan 5, 2024 at 3:05 PM Andres Freund  wrote:
> > An aside:
> >
> > As I think we chatted about before, I eventually would like the option to
> > remove index entries for a tuple during on-access pruning, for OLTP
> > workloads. I.e. before removing the tuple, construct the corresponding index
> > tuple, use it to look up index entries pointing to the tuple. If all the
> > index entries were found (they might not be, if they already were marked
> > dead during a lookup, or if an expression wasn't actually immutable), we
> > can prune without the full index scan.  Obviously this would only be
> > suitable for some workloads, but it could be quite beneficial when you have
> > huge indexes.  The reason I mention this is that then we'd have another
> > source of marking items unused during pruning.
>
> I will be astonished if you can make this work well enough to avoid
> huge regressions in plausible cases. There are plenty of cases where
> we do a very thorough job opportunistically removing index tuples.

These days the AM is often involved with that, via
table_index_delete_tuples()/heap_index_delete_tuples(). That IIRC has to
happen before physically removing the already-marked-killed index entries. We
can't rely on being able to actually prune the heap page at that point, there
might be other backends pinning it, but often we will be able to. If we were
to prune below heap_index_delete_tuples(), we wouldn't need to recheck that
index again during "individual tuple pruning", if the to-be-marked-unused heap
tuple is one of the tuples passed to heap_index_delete_tuples(). Which
presumably will be very commonly the case.

At least for nbtree, we are much more aggressive about marking index entries
as killed, than about actually removing the index entries. "individual tuple
pruning" would have to look for killed-but-still-present index entries, not
just for "live" entries.


Greetings,

Andres Freund




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-05 Thread Melanie Plageman
On Fri, Jan 5, 2024 at 1:47 PM Robert Haas  wrote:
>
> On Fri, Jan 5, 2024 at 12:59 PM Melanie Plageman
>  wrote:
> MP> I am planning to add a VM update into the freeze record, at which point
> MP> I will move the VM update code into lazy_scan_prune(). This will then
> MP> allow us to consolidate the freespace map update code for the prune and
> MP> noprune cases and make lazy_scan_heap() short and sweet.
>
> Can we see what that looks like on top of this change?

Yes, attached is a patch set which does this. My previous patchset
already reduced the number of places we unlock the buffer and update
the freespace map in lazy_scan_heap(). This patchset combines the
lazy_scan_prune() and lazy_scan_noprune() FSM update cases. I also
have a version which moves the freespace map updates into
lazy_scan_prune() and lazy_scan_noprune() -- eliminating all of these
from lazy_scan_heap(). This is arguably more clear. But Andres
mentioned he wanted fewer places unlocking the buffer and updating the
FSM.

- Melanie
From 5f8d7544fbe53645b9d46b0eeb74c39cbfdc21f7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Fri, 5 Jan 2024 11:47:31 -0500
Subject: [PATCH v3 4/6] Inline LVPagePruneState members in lazy_scan_prune()

After moving the code to update the visibility map from lazy_scan_heap
into lazy_scan_prune, most of the members of LVPagePruneState are no
longer required. Make them local variables in lazy_scan_prune().
---
 src/backend/access/heap/vacuumlazy.c | 104 ++-
 1 file changed, 55 insertions(+), 49 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1bc7c56396f..c0266eabb44 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -217,18 +217,8 @@ typedef struct LVRelState
  */
 typedef struct LVPagePruneState
 {
-	bool		has_lpdead_items;	/* includes existing LP_DEAD items */
 	/* whether or not caller should do FSM processing for block */
 	bool		recordfreespace;
-
-	/*
-	 * State describes the proper VM bit states to set for the page following
-	 * pruning and freezing.  all_visible implies !has_lpdead_items, but don't
-	 * trust all_frozen result unless all_visible is also set to true.
-	 */
-	bool		all_visible;	/* Every item visible to all? */
-	bool		all_frozen;		/* provided all_visible is also true */
-	TransactionId visibility_cutoff_xid;	/* For recovery conflicts */
 } LVPagePruneState;
 
 /* Struct for saving and restoring vacuum error information. */
@@ -1397,6 +1387,10 @@ lazy_scan_prune(LVRelState *vacrel,
 	HeapPageFreeze pagefrz;
 	bool		no_indexes;
 	bool		hastup = false;
+	bool		all_visible,
+all_frozen;
+	TransactionId visibility_cutoff_xid;
+	bool		has_lpdead_items;	/* includes existing LP_DEAD items */
 	int64		fpi_before = pgWalUsage.wal_fpi;
 	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
 	HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
@@ -1436,14 +1430,24 @@ lazy_scan_prune(LVRelState *vacrel,
 
 	/*
 	 * Now scan the page to collect LP_DEAD items and check for tuples
-	 * requiring freezing among remaining tuples with storage
+	 * requiring freezing among remaining tuples with storage.
+	 * has_lpdead_items includes existing LP_DEAD items.
 	 */
-	prunestate->has_lpdead_items = false;
-	prunestate->all_visible = true;
-	prunestate->all_frozen = true;
-	prunestate->visibility_cutoff_xid = InvalidTransactionId;
+	has_lpdead_items = false;
+	/* for recovery conflicts */
+	visibility_cutoff_xid = InvalidTransactionId;
 	prunestate->recordfreespace = false;
 
+	/*
+	 * We will update the VM after collecting LP_DEAD items and freezing
+	 * tuples. Keep track of whether or not the page is all_visible and
+	 * all_frozen and use this information to update the VM. all_visible
+	 * implies !has_lpdead_items, but don't trust all_frozen result unless
+	 * all_visible is also set to true.
+	 */
+	all_visible = true;
+	all_frozen = true;
+
 	for (offnum = FirstOffsetNumber;
 		 offnum <= maxoff;
 		 offnum = OffsetNumberNext(offnum))
@@ -1530,13 +1534,13 @@ lazy_scan_prune(LVRelState *vacrel,
  * asynchronously. See SetHintBits for more info. Check that
  * the tuple is hinted xmin-committed because of that.
  */
-if (prunestate->all_visible)
+if (all_visible)
 {
 	TransactionId xmin;
 
 	if (!HeapTupleHeaderXminCommitted(htup))
 	{
-		prunestate->all_visible = false;
+		all_visible = false;
 		break;
 	}
 
@@ -1548,14 +1552,14 @@ lazy_scan_prune(LVRelState *vacrel,
 	if (!TransactionIdPrecedes(xmin,
 			   vacrel->cutoffs.OldestXmin))
 	{
-		prunestate->all_visible = false;
+		all_visible = false;
 		break;
 	}
 
 	/* Track newest xmin on page. */
-	if (TransactionIdFollows(xmin, prunestate->visibility_cutoff_xid) &&
+	if (TransactionIdFollows(xmin, visibility_cutoff_xid) &&
 		TransactionIdIsNormal(xmin))
-		prunestate->visibility_cutoff_xid = xmin;
+		

Re: Build versionless .so for Android

2024-01-05 Thread Andres Freund
Hi,

On 2024-01-05 16:00:01 +0100, Peter Eisentraut wrote:
> On 01.01.24 06:25, Matthias Kuhn wrote:
> > It looks like meson does not currently support building for android, the
> > following output is what I get (but I have actually no experience with
> > meson):
> > 
> >      meson.build:320:2: ERROR: Problem encountered: unknown host system:
> > android
> 
> FWIW, the meson source code contains numerous mentions of an 'android'
> platform, so it seems like this is expected to work.

This error is from our code, not meson's - the simplest fix would be to just
map android to linux, similar to how dragonflybsd is mapped to netbsd.  Blind
attempt attached.

It looks like that might be all that's required, as it looks like meson knows
that android doesn't properly deal with soversion.

Greetings,

Andres Freund
diff --git i/meson.build w/meson.build
index 57f9735feb0..ad97da0daa3 100644
--- i/meson.build
+++ w/meson.build
@@ -190,6 +190,10 @@ sema_kind = 'sysv'
 if host_system == 'dragonfly'
   # apparently the most similar
   host_system = 'netbsd'
+elif host_system == 'android'
+  # while android isn't quite a normal linux, it seems close enough for our
+  # purposes so far
+  host_system = 'linux'
 endif
 
 if host_system == 'aix'


Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-05 Thread Robert Haas
On Fri, Jan 5, 2024 at 3:05 PM Andres Freund  wrote:
> OTOH, the pruning logic, including its WAL record, already supports marking
> items unused, all we need to do is to tell it to do so in a few more cases. If
> we didn't already need to have support for this, I'd have a much harder time
> arguing for doing this.
>
> One important part of the larger project is to combine the WAL records for
> pruning, freezing and setting the all-visible/all-frozen bit into one WAL
> record. We can't set all-frozen before we have removed the dead items. So
> either we need to combine pruning and setting items unused for no-index tables
> or we end up considerably less efficient in the no-indexes case.

Those are fair arguments.

> An aside:
>
> As I think we chatted about before, I eventually would like the option to
> remove index entries for a tuple during on-access pruning, for OLTP
> workloads. I.e. before removing the tuple, construct the corresponding index
> tuple, use it to look up index entries pointing to the tuple. If all the index
> entries were found (they might not be, if they already were marked dead during
> a lookup, or if an expression wasn't actually immutable), we can prune without
> the full index scan.  Obviously this would only be suitable for some
> workloads, but it could be quite beneficial when you have huge indexes.  The
> reason I mention this is that then we'd have another source of marking items
> unused during pruning.

I will be astonished if you can make this work well enough to avoid
huge regressions in plausible cases. There are plenty of cases where
we do a very thorough job opportunistically removing index tuples.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Stack overflow issue

2024-01-05 Thread Robert Haas
On Fri, Jan 5, 2024 at 3:16 PM Andres Freund  wrote:
> On 2024-01-05 12:23:25 -0500, Robert Haas wrote:
> > I agree that in the memory-context case it might be worth expending
> > some more code to be more clever. But I probably wouldn't do that for
> > MemoryContextStats(); check_stack_depth() seems fine for that one.
>
> We run MemoryContextStats() when we fail to allocate memory, including during
> abort processing after a previous error. So I think it qualifies for being
> somewhat special.

OK.

> Thus I suspect check_stack_depth() wouldn't be a good idea -
> but we could make the stack_is_too_deep() path simpler and just return in the
> existing MemoryContextStatsInternal() when that's the case.

Since this kind of code will be exercised so rarely, it's highly
vulnerable to bugs, so I favor keeping it as simple as we can.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Error "initial slot snapshot too large" in create replication slot

2024-01-05 Thread Robert Haas
On Thu, Mar 23, 2023 at 11:02 PM Kyotaro Horiguchi
 wrote:
> [ new patch ]

Well, I guess nobody is too excited about fixing this, because it's
been another 10 months with no discussion. Andres doesn't even seem to
think this is as much a bug as it is a limitation, for all that it's
filed in the CF application under bug fixes. I kind of wonder if we
should just close this entry in the CF, but I'll hold off on that for
now.

  /*
  * For normal MVCC snapshot this contains the all xact IDs that are in
  * progress, unless the snapshot was taken during recovery in which case
- * it's empty. For historic MVCC snapshots, the meaning is inverted, i.e.
- * it contains *committed* transactions between xmin and xmax.
+ * it's empty. In the case of !suboverflowed, there's a situation where
+ * this contains both top and sub-transaction IDs in a mixed manner.  For
+ * historic MVCC snapshots, the meaning is inverted, i.e.  it contains
+ * *committed* transactions between xmin and xmax.
  *
  * note: all ids in xip[] satisfy xmin <= xip[i] < xmax
  */

I have to say that I don't like this at all. It's bad enough that we
already use the xip/subxip arrays in two different ways depending on
the situation. Increasing that to three different ways seems painful.
How is anyone supposed to keep track of how the array is being used at
which point in the code?

If we are going to do that, I suspect it needs comment updates in more
places than what the patch does currently. For instance:

+ if (newxcnt < xiplen)
+ newxip[newxcnt++] = xid;
+ else
+ newsubxip[newsubxcnt++] = xid;

Just imagine coming across this code in 5 or 10 years and finding that
it had no comment explaining anything. Yikes!

Aside from the details of the patch, and perhaps more seriously, I'm
not really clear that we have consensus on an approach. A few
different proposals seem to have been floated, and it doesn't seem
like anybody hates anybody else's idea completely, but it doesn't
really seem like everyone agrees on what to do, either.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Stack overflow issue

2024-01-05 Thread Andres Freund
Hi,

On 2024-01-05 12:23:25 -0500, Robert Haas wrote:
> I agree that in the memory-context case it might be worth expending
> some more code to be more clever. But I probably wouldn't do that for
> MemoryContextStats(); check_stack_depth() seems fine for that one.

We run MemoryContextStats() when we fail to allocate memory, including during
abort processing after a previous error. So I think it qualifies for being
somewhat special. Thus I suspect check_stack_depth() wouldn't be a good idea -
but we could make the stack_is_too_deep() path simpler and just return in the
existing MemoryContextStatsInternal() when that's the case.

Greetings,

Andres Freund




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-05 Thread Andres Freund
Hi,

On 2024-01-05 08:59:41 -0500, Robert Haas wrote:
> On Thu, Jan 4, 2024 at 6:03 PM Melanie Plageman
>  wrote:
> > When a single page is being processed, page pruning happens in
> > heap_page_prune(). Freezing, dead items recording, and visibility
> > checks happen in lazy_scan_prune(). Visibility map updates and
> > freespace map updates happen back in lazy_scan_heap(). Except, if the
> > table has no indexes, in which case, lazy_scan_heap() also invokes
> > lazy_vacuum_heap_page() to set dead line pointers unused and do
> > another separate visibility check and VM update. I maintain that all
> > page-level processing should be done in the page-level processing
> > functions (like lazy_scan_prune()). And lazy_scan_heap() shouldn't be
> > directly responsible for special case page-level processing.
> 
> But you can just as easily turn this argument on its head, can't you?
> In general, except for HOT tuples, line pointers are marked dead by
> pruning and unused by vacuum. Here you want to turn it on its head and
> make pruning do what would normally be vacuum's responsibility.

OTOH, the pruning logic, including its WAL record, already supports marking
items unused, all we need to do is to tell it to do so in a few more cases. If
we didn't already need to have support for this, I'd have a much harder time
arguing for doing this.

One important part of the larger project is to combine the WAL records for
pruning, freezing and setting the all-visible/all-frozen bit into one WAL
record. We can't set all-frozen before we have removed the dead items. So
either we need to combine pruning and setting items unused for no-index tables
or we end up considerably less efficient in the no-indexes case.


An aside:

As I think we chatted about before, I eventually would like the option to
remove index entries for a tuple during on-access pruning, for OLTP
workloads. I.e. before removing the tuple, construct the corresponding index
tuple, use it to look up index entries pointing to the tuple. If all the index
entries were found (they might not be, if they already were marked dead during
a lookup, or if an expression wasn't actually immutable), we can prune without
the full index scan.  Obviously this would only be suitable for some
workloads, but it could be quite beneficial when you have huge indexes.  The
reason I mention this is that then we'd have another source of marking items
unused during pruning.

Greetings,

Andres Freund




Re: pg_upgrade failing for 200+ million Large Objects

2024-01-05 Thread Tom Lane
I wrote:
> "Kumar, Sachin"  writes:
>> I was not able to find email thread which details why we are not using
>> parallel pg_restore for pg_upgrade.

> Well, it's pretty obvious isn't it?  The parallelism is being applied
> at the per-database level instead.

On further reflection, there is a very good reason why it's done like
that.  Because pg_upgrade is doing schema-only dump and restore,
there's next to no opportunity for parallelism within either pg_dump
or pg_restore.  There's no data-loading steps, and there's no
index-building either, so the time-consuming stuff that could be
parallelized just isn't happening in pg_upgrade's usage.

Now it's true that my 0003 patch moves the needle a little bit:
since it makes BLOB creation (as opposed to loading) parallelizable,
there'd be some hope for parallel pg_restore doing something useful in
a database with very many blobs.  But it makes no sense to remove the
existing cross-database parallelism in pursuit of that; you'd make
many more people unhappy than happy.

Conceivably something could be salvaged of your idea by having
pg_upgrade handle databases with many blobs differently from
those without, applying parallelism within pg_restore for the
first kind and then using cross-database parallelism for the
rest.  But that seems like a lot of complexity compared to the
possible win.

In any case I'd stay far away from using --section in pg_upgrade.
Too many moving parts there.

regards, tom lane




Re: Add a perl function in Cluster.pm to generate WAL

2024-01-05 Thread Alexander Lakhin

05.01.2024 02:48, Michael Paquier wrote:
> On Thu, Jan 04, 2024 at 04:00:01PM +0300, Alexander Lakhin wrote:
>> Reproduced here.
>
> Did you just make the run slow enough to show the failure with
> valgrind?


Yes, I just ran several test instances (with no extra modifications) under
Valgrind in parallel as follows:

for i in `seq 20`; do cp -r src/test/recovery/ src/test/recovery_$i/; \
  sed "s|src/test/recovery|src/test/recovery_$i|" -i src/test/recovery_$i/Makefile; done

for i in `seq 20`; do echo "iteration $i"; parallel --halt now,fail=1 -j20 --linebuffer --tag \
  PROVE_TESTS="t/035*" NO_TEMP_INSTALL=1 make check -s -C src/test/recovery_{} \
  PROVE_FLAGS="--timer" ::: `seq 20` || break; done


The test run fails for me on iterations 9, 8, 14 like so:
...
iteration 9
...
5
5   #   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_authid'
5   #   at t/035_standby_logical_decoding.pl line 224.
5
5   #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
5   #   at t/035_standby_logical_decoding.pl line 229.
13  [07:35:53] t/035_standby_logical_decoding.pl .. ok   432930 ms ( 0.02 usr  0.00 sys + 292.08 cusr 13.05 csys = 305.15 CPU)
13  [07:35:53]
13  All tests successful.
...

I've also reproduced it without Valgrind in a VM with the CPU slowed down to
5% (on iterations 2, 5, 4), where the average test duration is about 800 sec:

4
4   #   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_authid'
4   #   at t/035_standby_logical_decoding.pl line 222.
4
4   #   Failed test 'activeslot slot invalidation is logged with vacuum on pg_authid'
4   #   at t/035_standby_logical_decoding.pl line 227.
6   [15:16:53] t/035_standby_logical_decoding.pl .. ok   763168 ms ( 0.68 usr  0.10 sys + 19.55 cusr 102.59 csys = 122.92 CPU)





>> As I can see in the failure logs you referenced, the first problem is:
>> #   Failed test 'inactiveslot slot invalidation is logged with vacuum on pg_authid'
>> #   at t/035_standby_logical_decoding.pl line 224.
>> It reminded me of:
>> https://www.postgresql.org/message-id/b2a1f7d0-7629-72c0-2da7-e9c4e336b010%40gmail.com
>>
>> It seems that it's not something new, and maybe that my analysis is still
>> valid. If so, VACUUM FREEZE/autovacuum = off should fix the issue.
>
> Not sure about that yet.



Your suspicion was proved right. After
git show c161ab74f src/test/recovery/t/035_standby_logical_decoding.pl | git apply -R
20 iterations with 20 tests in parallel performed successfully for me
(twice).

So it looks like c161ab74f really made the things worse.

Best regards,
Alexander




Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2024-01-05 Thread Robert Haas
On Tue, Oct 24, 2023 at 4:28 AM Kyotaro Horiguchi
 wrote:
> In the attached, fixed the existing two messages, and adjusted one
> message to display an error code, all in the consistent format.

Hi,

I'm not a Windows expert, but my guess is that 0001 is a very good
idea. I hope someone who is a Windows expert will comment on that.

0002 seems problematic to me. One potential issue is that it would
break if someone renamed postgres.exe to something else -- although
that's probably not really a serious problem. A bigger issue is that
it seems like it would break if someone used pg_ctl to start several
instances in different data directories on the same machine. If I'm
understanding correctly, that currently mostly works, and this would
break it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-05 Thread Andres Freund
Hi,

On 2024-01-04 17:37:27 -0500, Melanie Plageman wrote:
> On Thu, Jan 4, 2024 at 3:03 PM Andres Freund  wrote:
> >
> > On 2023-11-17 18:12:08 -0500, Melanie Plageman wrote:
> > >   Assert(ItemIdIsNormal(lp));
> > >   htup = (HeapTupleHeader) PageGetItem(dp, lp);
> > > @@ -715,7 +733,17 @@ heap_prune_chain(Buffer buffer, OffsetNumber 
> > > rootoffnum,
> > >* redirect the root to the correct chain member.
> > >*/
> > >   if (i >= nchain)
> > > - heap_prune_record_dead(prstate, rootoffnum);
> > > + {
> > > + /*
> > > +  * If the relation has no indexes, we can remove 
> > > dead tuples
> > > +  * during pruning instead of marking their line 
> > > pointers dead. Set
> > > +  * this tuple's line pointer LP_UNUSED.
> > > +  */
> > > + if (prstate->pronto_reap)
> > > + heap_prune_record_unused(prstate, 
> > > rootoffnum);
> > > + else
> > > + heap_prune_record_dead(prstate, rootoffnum);
> > > + }
> > >   else
> > >   heap_prune_record_redirect(prstate, rootoffnum, 
> > > chainitems[i]);
> > >   }
> > > @@ -726,9 +754,12 @@ heap_prune_chain(Buffer buffer, OffsetNumber 
> > > rootoffnum,
> > >* item.  This can happen if the loop in heap_page_prune 
> > > caused us to
> > >* visit the dead successor of a redirect item before 
> > > visiting the
> > >* redirect item.  We can clean up by setting the redirect 
> > > item to
> > > -  * DEAD state.
> > > +  * DEAD state. If pronto_reap is true, we can set it 
> > > LP_UNUSED now.
> > >*/
> > > - heap_prune_record_dead(prstate, rootoffnum);
> > > + if (prstate->pronto_reap)
> > > + heap_prune_record_unused(prstate, rootoffnum);
> > > + else
> > > + heap_prune_record_dead(prstate, rootoffnum);
> > >   }
> > >
> > >   return ndeleted;
> >
> > There's three new calls to heap_prune_record_unused() and the logic got more
> > nested. Is there a way to get to a nicer end result?
> 
> So, I could do another loop through the line pointers in
> heap_page_prune() (after the loop calling heap_prune_chain()) and, if
> pronto_reap is true, set dead line pointers LP_UNUSED. Then, when
> constructing the WAL record, I would just not add the prstate.nowdead
> that I saved from heap_prune_chain() to the prune WAL record.

Hm, that seems a bit sad as well. I am wondering if we could move the
pronto_reap handling into heap_prune_record_dead() or a wrapper of it.  I am
more concerned about the human reader than the CPU here...



> > > @@ -972,20 +970,21 @@ lazy_scan_heap(LVRelState *vacrel)
> > >   continue;
> > >   }
> > >
> > > - /* Collect LP_DEAD items in dead_items array, count 
> > > tuples */
> > > - if (lazy_scan_noprune(vacrel, buf, blkno, page, 
> > > ,
> > > + /*
> > > +  * Collect LP_DEAD items in dead_items array, count 
> > > tuples,
> > > +  * determine if rel truncation is safe
> > > +  */
> > > + if (lazy_scan_noprune(vacrel, buf, blkno, page,
> > > 
> > > ))
> > >   {
> > >   Sizefreespace = 0;
> > >
> > >   /*
> > >* Processed page successfully (without 
> > > cleanup lock) -- just
> > > -  * need to perform rel truncation and FSM 
> > > steps, much like the
> > > -  * lazy_scan_prune case.  Don't bother 
> > > trying to match its
> > > -  * visibility map setting steps, though.
> > > +  * need to update the FSM, much like the 
> > > lazy_scan_prune case.
> > > +  * Don't bother trying to match its 
> > > visibility map setting
> > > +  * steps, though.
> > >*/
> > > - if (hastup)
> > > - vacrel->nonempty_pages = blkno + 1;
> > >   if (recordfreespace)
> > >   freespace = 
> > > PageGetHeapFreeSpace(page);
> > >   UnlockReleaseBuffer(buf);
> >
> > The comment continues to say that we "determine if rel truncation is safe" -
> > but I don't see that?  Oh, I see, it's done inside lazy_scan_noprune(). This
> > doesn't seem like a clear improvement to 

Re: Unlogged relation copy is not fsync'd

2024-01-05 Thread Robert Haas
On Fri, Sep 15, 2023 at 7:47 AM Heikki Linnakangas  wrote:
> Thinking about this some more, I think this is still not 100% correct,
> even with the patch I posted earlier:

This is marked as needing review, but that doesn't appear to be
correct, because there's this comment, indicating that the patch
requires re-work, and there are also two emails from Noah on the thread
providing further feedback. So it seems this has been waiting for
Heikki or someone else to have time to work on it for the last 3 months.

Hence, marking RwF for now; if someone gets back to it, please reopen.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Should the archiver process always make sure that the timeline history files exist in the archive?

2024-01-05 Thread Robert Haas
On Mon, Aug 28, 2023 at 8:59 PM Jimmy Yih  wrote:
> Thanks for the insightful response! I have attached an updated patch
> that moves the proposed logic to the end of StartupXLOG where it seems
> more correct to do this. It also helps with backporting (if it's
> needed) since the archiver process only has access to shared memory
> starting from Postgres 14.

Hmm. Do I understand correctly that the two patches you attached are
alternatives to each other, i.e. we need one or the other to fix the
issue, but not both?

It seems to me that trying to fetch a timeline history file and then
ignoring any error has got to be wrong. Either the file is needed or
it isn't. If it's needed, then failing to fetch it is a problem. If
it's not needed, there's no reason to try fetching it in the first
place. So I feel like we could either try to archive the file at the
end of recovery, as you propose in
v2-0001-Archive-current-timeline-history-file-after-recovery.patch.
Alternatively, we could try to find a way not to request the file in
the first place, if it's not required. But
v1-0001-Allow-recovery-to-proceed-when-initial-timeline-hist.patch
doesn't seem good to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: the s_lock_stuck on perform_spin_delay

2024-01-05 Thread Andres Freund
Hi,

On 2024-01-05 14:19:23 -0500, Robert Haas wrote:
> On Fri, Jan 5, 2024 at 2:11 PM Andres Freund  wrote:
> > I see it fairly regularly. Including finding several related bugs that lead 
> > to
> > stuck systems last year (signal handlers are a menace).
> 
> In that case, I think this proposal is dead. I can't personally
> testify to this code being a force for good, but it sounds like you
> can. So be it!

I think the proposal to make it a WARNING shouldn't go anywhere, but I think
there are several improvements that could come out of this discussion:

- assertion checks against doing dangerous stuff
- compile time help for detecting bad stuff without hitting it at runtime
- make the stuck lock message more informative, e.g. by including the length
  of time the lock was stuck for
- make sure that interrupts can't trigger the stuck lock much quicker, which
  afaict can happen today

Greetings,

Andres Freund




Re: the s_lock_stuck on perform_spin_delay

2024-01-05 Thread Andres Freund
Hi,

On 2024-01-05 10:20:39 +0800, Andy Fan wrote:
> Andres Freund  writes:
> > On 2024-01-04 14:59:06 +0800, Andy Fan wrote:
> >> My question is if someone doesn't obey the rule by mistake (everyone
> >> can make mistake), shall we PANIC on a production environment? IMO I
> >> think it can be a WARNING on a production environment and be a stuck
> >> when 'ifdef USE_ASSERT_CHECKING'.
> >>
> >> [...]
> >>
> >> I notice this issue actually because of the patch "Cache relation
> >> sizes?" from Thomas Munro [1], where the latest patch[2] still have the
> >> following code.
> >> +  sr = smgr_alloc_sr();  <-- HERE a spin lock is held
> >> +
> >> +  /* Upgrade to exclusive lock so we can create a mapping. */
> >> +  LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex
> >>   operation is needed. it may take a long time.
> >>
> >> Our internal testing system found more misuses on our own PG version.
> >
> >> I think a experienced engineer like Thomas can make this mistake and the
> >> patch was reviewed by 3 peoples, the bug is still there. It is not easy
> >> to say just don't do it.
> >
> > I don't follow this argument - just ignoring the problem,
>
> I agree with you but I'm feeling you ignored my post at [1], where I
> said for the known issue, it should be fixed at the first chance.

With "ignoring the problem" I was referencing emitting a WARNING instead of
crash-restart.

IME stuck spinlocks are caused by issues like not releasing a spinlock,
possibly due to returning early due to an error or such, having lock-nesting
issues leading to deadlocks, acquiring spinlocks or lwlocks in signal
handlers, blocking in signal handlers while holding a spinlock outside of the
signal handlers, and many variations of those. The common theme of these causes
is that they don't resolve after some time. The only way out of the situation
is to crash-restart, either "automatically" or by operator intervention.


> > which emitting a
> > WARNING basically is, doesn't reduce the impact of the bug, it *increases* 
> > the
> > impact, because now the system will not recover from the bug without 
> > explicit
> > operator intervention.  During that time the system might be essentially
> > unresponsive, because all backends end up contending for some spinlock, 
> > which
> > makes investigating such issues very hard.
>
> Actually they are doing pg_usleep most of the time.

Sure - but that changes nothing about the problem. The concern isn't CPU
usage, the concern is that there's often no possible forward progress. To take
a recent-ish production issue I looked at, a buggy signal handler led to a
backend sleeping while holding a spinlock. Soon after, the entire system got
stuck, because other backends also tried to acquire the spinlock. The person
initially investigating the issue contacted me because they couldn't even log
into the system, since connection establishment also acquired the spinlock
(I'm not sure anymore which spinlock it was, possibly xlog.c's info_lck?).


> > but not PANICing when this happens in production seems completely
> > non-viable.
> >
>
> Not sure what does *this* exactly means. If it means the bug in Thomas's
> patch, I absoluately agree with you(since it is a known bug and it
> should be fixed). If it means the general *unknown* case, it's something
> we talked above.

I mean that I think that not PANICing anymore would be a seriously bad idea
and cause way more problems than the PANIC.

Greetings,

Andres Freund




Re: the s_lock_stuck on perform_spin_delay

2024-01-05 Thread Robert Haas
On Fri, Jan 5, 2024 at 2:11 PM Andres Freund  wrote:
> I see it fairly regularly. Including finding several related bugs that lead to
> stuck systems last year (signal handlers are a menace).

In that case, I think this proposal is dead. I can't personally
testify to this code being a force for good, but it sounds like you
can. So be it!

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Fix bogus Asserts in calc_non_nestloop_required_outer

2024-01-05 Thread Robert Haas
On Fri, Nov 3, 2023 at 2:47 AM Richard Guo  wrote:
> Comment is now added above the Asserts.  Thanks for taking an interest
> in this.

I'd like to suggest rewording this comment a little more. Here's my proposal:

Both of the paths passed to this function are still parameterized by
the topmost parent, because reparameterize_path_by_child() hasn't been
called yet. So, these sanity checks use the parent relids.

What do you think?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: the s_lock_stuck on perform_spin_delay

2024-01-05 Thread Andres Freund
Hi,

On 2024-01-05 08:51:53 -0500, Robert Haas wrote:
> On Thu, Jan 4, 2024 at 6:06 PM Andres Freund  wrote:
> > I think we should add infrastructure to detect bugs like this during
> > development, but not PANICing when this happens in production seems 
> > completely
> > non-viable.
> 
> I mean +1 for the infrastructure, but "completely non-viable"? Why?
> 
> I've only very rarely seen this PANIC occur, and in the few cases
> where I've seen it, it was entirely unclear that the problem was due
> to a bug where somebody failed to release a spinlock.

I see it fairly regularly. Including finding several related bugs that lead to
stuck systems last year (signal handlers are a menace).


> It seemed more likely that the machine was just not really functioning, and
> the PANIC was a symptom of processes not getting scheduled rather than a PG
> bug.

If processes don't get scheduled for that long, a crash-restart doesn't seem
that bad anymore :)


> And every time I tell a user that they might need to use a debugger to, say,
> set VacuumCostActive = false, or to get a backtrace, or any other reason, I
> have to tell them to make sure to detach the debugger in under 60 seconds,
> because in the unlikely event that they attach while the process is holding
> a spinlock, failure to detach in under 60 seconds will take their production
> system down for no reason.

Hm - isn't the stuck lock timeout more like 900s (MAX_DELAY_USEC * NUM_DELAYS
= 1000s, but we start at a lower delay)?  One issue with the code as-is is
that interrupted sleeps count toward the timeout, despite possibly
sleeping much shorter. We should probably fix that, and also report the time
the lock was stuck for in s_lock_stuck().
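
A minimal sketch of checking wall-clock time instead (the struct and limit
names are invented for illustration; the real SpinDelayStatus differs):

    #include "portability/instr_time.h"

    #define MAX_STUCK_MSEC  (1000.0 * 1000.0)   /* ~NUM_DELAYS * MAX_DELAY_USEC */

    typedef struct SpinDelayClock
    {
        bool        started;
        instr_time  start_time;
    } SpinDelayClock;

    static void
    check_stuck_by_time(SpinDelayClock *clock, const char *file, int line)
    {
        instr_time  now;

        if (!clock->started)
        {
            INSTR_TIME_SET_CURRENT(clock->start_time);
            clock->started = true;
            return;
        }
        INSTR_TIME_SET_CURRENT(now);
        INSTR_TIME_SUBTRACT(now, clock->start_time);
        if (INSTR_TIME_GET_MILLISEC(now) > MAX_STUCK_MSEC)
            elog(PANIC, "stuck spinlock detected at %s:%d, stuck for %.0f ms",
                 file, line, INSTR_TIME_GET_MILLISEC(now));
    }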


> Now, if you're about to say that people shouldn't need to use a debugger on
> their production instance, I entirely agree ... but in the world I inhabit,
> that's often the only way to solve a customer problem, and it probably will
> continue to be until we have much better ways of getting backtraces without
> using a debugger than is currently the case.
> 
> Have you seen real cases where this PANIC prevents a hangup? If yes,
> that PANIC traced back to a bug in PostgreSQL? And why didn't the user
> just keep hitting the same bug over and PANICing in an endless loop?

Many, as hinted above. Some bugs in postgres, more bugs in extensions. IME
these bugs aren't hit commonly, so a crash-restart at least allows to hobble
along. The big issue with not crash-restarting is that often the system ends
up inaccessible, which makes it very hard to investigate the issue.


> I feel like this is one of those things that has just been this way
> forever and we don't question it because it's become an article of
> faith that it's something we have to have. But I have a very hard time
> explaining why it's even a net positive, let alone the unquestionable
> good that you seem to think.

I don't think it's an unquestionable good, I just think the alternative of
just endlessly spinning is way worse.

Greetings,

Andres Freund




Re: Oversight in reparameterize_path_by_child leading to executor crash

2024-01-05 Thread Tom Lane
Robert Haas  writes:
> OK, so a few people like the current form of this patch but we haven't
> heard from Tom since August. Tom, any thoughts on the current
> incarnation?

Not yet, but it is on my to-look-at list.  In the meantime I concur
with your comments here.

regards, tom lane




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-05 Thread Robert Haas
On Fri, Jan 5, 2024 at 12:59 PM Melanie Plageman
 wrote:
> I actually think we are going to want to stop referring to these steps
> as pruning and vacuuming. It is confusing because vacuuming refers to
> the whole process of doing garbage collection on the table and also to
> the specific step of setting dead line pointers unused. If we called
> these steps say, pruning and reaping, that may be more clear.

I agree that there's some terminological confusion here, but if we
leave all of the current naming unchanged and start using new naming
conventions in new code, I don't think that's going to clear things
up. Getting rid of the weird naming with pruning, "scanning" (that is
part of vacuum), and "vacuuming" (that is another part of vacuum) is
gonna take some work.

> Vacuuming consists of three phases -- the first pass, index vacuuming,
> and the second pass. I don't think we should dictate what happens in
> each pass. That is, we shouldn't expect only pruning to happen in the
> first pass and only reaping to happen in the second pass. For example,
> I think Andres has previously proposed doing another round of pruning
> after index vacuuming. The second pass/third phase is distinguished
> primarily by being after index vacuuming.
>
> If we think about it this way, that frees us up to set dead line
> pointers unused in the first pass when the table has no indexes. For
> clarity, I could add a block comment explaining that doing this is an
> optimization and not a logical requirement. One way to make this even
> more clear would be to set the dead line pointers unused in a separate
> loop after heap_prune_chain() as I proposed upthread.

I don't really disagree with any of this, but I think the question is
whether the code is cleaner with mark-LP-as-unused pushed down into
pruning or whether it's better the way it is. Andres seems confident
that the change won't suck for performance, which is good, but I'm not
quite convinced that it's the right direction to go with the code, and
he doesn't seem to be either. Perhaps this all turns on this point:

MP> I am planning to add a VM update into the freeze record, at which point
MP> I will move the VM update code into lazy_scan_prune(). This will then
MP> allow us to consolidate the freespace map update code for the prune and
MP> noprune cases and make lazy_scan_heap() short and sweet.

Can we see what that looks like on top of this change?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: weird GROUPING SETS and ORDER BY behaviour

2024-01-05 Thread Zhang Mingli
Hi,


Zhang Mingli
www.hashdata.xyz
On Jan 6, 2024 at 01:38 +0800, Geoff Winkless , wrote:
>
> Am I missing some reason why the first set isn't sorted as I'd hoped?

Wow, it’s a complex ORDER BY; let me try to understand your example.
I think the order is right. What order did you expect?

```
ORDER BY
 CASE WHEN GROUPING(test1.n)=0 THEN test1.n ELSE NULL END NULLS FIRST,
 CASE WHEN GROUPING(concat(test1.n, (SELECT x FROM test2 WHERE
seq=test1.seq)))=0 THEN concat(test1.n, (SELECT x FROM test2 WHERE
seq=test1.seq)) ELSE NULL END NULLS FIRST;
```
You want to ORDER BY a, b where a is: CASE WHEN GROUPING(test1.n)=0 THEN
test1.n ELSE NULL END NULLS FIRST.
GROUPING(test1.n)=0 means that you are within grouping set (test1.n) and the
value is test1.n, so for the rows of the other grouping set b this expression
is NULL, and you specified NULLS FIRST.

So you will first get the results of grouping set b, where of course column
gp_n, i.e. GROUPING(test1.n), is 1.
The result is correct.

 gp_n | gp_conc |  n   | concat
------+---------+------+--------
    1 |       0 | NULL | n5x1
    1 |       0 | NULL | n4x2
    1 |       0 | NULL | n3x3
    1 |       0 | NULL | n2x4
    1 |       0 | NULL | n1x5
    0 |       1 | n1   | NULL
    0 |       1 | n2   | NULL
    0 |       1 | n3   | NULL
    0 |       1 | n4   | NULL
    0 |       1 | n5   | NULL
(10 rows)

NB: the GROUPING bit is set to 1 when the column is not included in the current grouping set.

https://www.postgresql.org/docs/current/functions-aggregate.html
GROUPING ( group_by_expression(s) ) → integer
Returns a bit mask indicating which GROUP BY expressions are not included in 
the current grouping set. Bits are assigned with the rightmost argument 
corresponding to the least-significant bit; each bit is 0 if the corresponding 
expression is included in the grouping criteria of the grouping set generating 
the current result row, and 1 if it is not included.

I guess you misunderstood it?


Also, your GROUPING target entry seems misleading; I modified it to:

SELECT GROUPING(test1.n, (concat(test1.n, (SELECT x FROM test2 WHERE
seq=test1.seq))))::bit(2),
test1.n, CONCAT(test1.n, (SELECT x FROM test2 WHERE seq=test1.seq))
FROM test1
…skip


To show the grouping condition:

 grouping |  n   | concat
----------+------+--------
 10       | NULL | n5x1
 10       | NULL | n4x2
 10       | NULL | n3x3
 10       | NULL | n2x4
 10       | NULL | n1x5
 01       | n1   | NULL
 01       | n2   | NULL
 01       | n3   | NULL
 01       | n4   | NULL
 01       | n5   | NULL
(10 rows)










Re: Oversight in reparameterize_path_by_child leading to executor crash

2024-01-05 Thread Robert Haas
On Tue, Dec 12, 2023 at 9:55 PM Andrei Lepikhov
 wrote:
> By and large, this patch is in a good state and may be committed.

OK, so a few people like the current form of this patch but we haven't
heard from Tom since August. Tom, any thoughts on the current
incarnation?

Richard, I think it could be useful to put a better commit message
into the patch file, describing both what problem is being fixed and
what the design of the fix is. I gather that the problem is that we
crash if the query contains a partitionwise join and also $SOMETHING,
and the solution is to move reparameterization to happen at
createplan() time but with a precheck that runs during path
generation. Presumably, that means this is more than a minimal bug
fix, because the bug could be fixed without splitting
can-it-be-reparameterized to reparameterize-it in this way. Probably
that's a performance optimization, so maybe it's worth clarifying
whether that's just an independently good idea or whether it's a part
of making the bug fix not regress performance.

I think the macro names in path_is_reparameterizable_by_child could be
better chosen. CHILD_PATH_IS_REPARAMETERIZABLE doesn't convey that the
macro will return from the calling function if not -- it looks like it
just returns a Boolean. Maybe REJECT_IF_PATH_NOT_REPARAMETERIZABLE and
REJECT_IF_PATH_LIST_NOT_REPARAMETERIZABLE or some such.
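
For reference, a sketch of the macro shape that suggestion implies (the
helper function's exact signature is an assumption here):

    #define REJECT_IF_PATH_NOT_REPARAMETERIZABLE(path) \
        do { \
            if (!path_is_reparameterizable_by_child(path)) \
                return false; \
        } while (0)

    #define REJECT_IF_PATH_LIST_NOT_REPARAMETERIZABLE(pathlist) \
        do { \
            ListCell   *lc; \
            foreach(lc, (pathlist)) \
                REJECT_IF_PATH_NOT_REPARAMETERIZABLE((Path *) lfirst(lc)); \
        } while (0)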

Another question here is whether we really want to back-patch all of
this or whether it might be better to, as Tom proposed previously,
back-patch a more minimal fix and leave the more invasive stuff for
master.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: psql not responding to SIGINT upon db reconnection

2024-01-05 Thread Robert Haas
On Tue, Dec 5, 2023 at 1:35 PM Tristan Partin  wrote:
> On Wed Nov 29, 2023 at 11:48 AM CST, Tristan Partin wrote:
> > I am not completely in love with the code I have written. Lots of
> > conditional compilation which makes it hard to read. Looking forward to
> > another round of review to see what y'all think.
>
> Ok. Here is a patch which just uses select(2) with a timeout of 1s or
> pselect(2) if it is available. I also moved the state machine processing
> into its own function.

Hmm, this adds a function called pqSocketPoll to psql/command.c. But
there already is such a function in libpq/fe-misc.c. It's not quite
the same, though. Having the same function in two different modules
with subtly different definitions seems like it's probably not the
right idea.

Also, this seems awfully complicated for something that's supposed to
(checks notes) wait for a file descriptor to become ready for I/O for
up to 1 second. It's 160 lines of code in pqSocketPoll and another 50
in the caller. If this is really the simplest way to do this, we
really need to rethink our libpq APIs or, uh, something. I wonder if
we could make this simpler by, say:

- always use select
- always use a 1-second timeout (as in the sketch below)
- if it returns faster because the race condition doesn't happen, cool
- if it doesn't, well then you get to wait for a second, oh well
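
A minimal sketch of that simpler loop, assuming psql's existing cancel flag
(the names here are illustrative, not the proposed patch):

    #include <errno.h>
    #include <sys/select.h>

    static int
    wait_readable_1s(int sock)
    {
        for (;;)
        {
            fd_set      input_mask;
            struct timeval timeout = {1, 0};    /* always 1 second */
            int         rc;

            FD_ZERO(&input_mask);
            FD_SET(sock, &input_mask);
            rc = select(sock + 1, &input_mask, NULL, NULL, &timeout);
            if (rc > 0)
                return 0;       /* socket is readable */
            if (rc < 0 && errno != EINTR)
                return -1;      /* hard failure */
            if (cancel_pressed) /* SIGINT arrived; stop waiting */
                return 1;
            /* timeout or EINTR: just try again */
        }
    }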

I don't feel strongly that that's the right way to go and if Heikki or
some other committer wants to go with this more complex conditional
approach, that's fine. But to me it seems like a lot.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG

2024-01-05 Thread Joe Conway

On 1/5/24 12:56, Robert Haas wrote:

On Sun, Aug 27, 2023 at 4:25 PM Heikki Linnakangas  wrote:

I think you got that backwards. 'perl_locale_obj' is set to the perl
interpreter's locale, whenever we are *outside* the interpreter.


This thread has had no update for more than 4 months, so I'm marking
the CF entry RwF for now.

It can always be reopened, if Joe or Tristan or Heikki or someone else
picks it up again.



It is definitely a bug, so I do plan to get back to it at some point, 
hopefully sooner rather than later...


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-05 Thread Melanie Plageman
On Fri, Jan 5, 2024 at 8:59 AM Robert Haas  wrote:
>
> On Thu, Jan 4, 2024 at 6:03 PM Melanie Plageman
>  wrote:
> > When a single page is being processed, page pruning happens in
> > heap_page_prune(). Freezing, dead items recording, and visibility
> > checks happen in lazy_scan_prune(). Visibility map updates and
> > freespace map updates happen back in lazy_scan_heap(). Except, if the
> > table has no indexes, in which case, lazy_scan_heap() also invokes
> > lazy_vacuum_heap_page() to set dead line pointers unused and do
> > another separate visibility check and VM update. I maintain that all
> > page-level processing should be done in the page-level processing
> > functions (like lazy_scan_prune()). And lazy_scan_heap() shouldn't be
> > directly responsible for special case page-level processing.
>
> But you can just as easily turn this argument on its head, can't you?
> In general, except for HOT tuples, line pointers are marked dead by
> pruning and unused by vacuum. Here you want to turn it on its head and
> make pruning do what would normally be vacuum's responsibility.

I actually think we are going to want to stop referring to these steps
as pruning and vacuuming. It is confusing because vacuuming refers to
the whole process of doing garbage collection on the table and also to
the specific step of setting dead line pointers unused. If we called
these steps say, pruning and reaping, that may be more clear.

Vacuuming consists of three phases -- the first pass, index vacuuming,
and the second pass. I don't think we should dictate what happens in
each pass. That is, we shouldn't expect only pruning to happen in the
first pass and only reaping to happen in the second pass. For example,
I think Andres has previously proposed doing another round of pruning
after index vacuuming. The second pass/third phase is distinguished
primarily by being after index vacuuming.

If we think about it this way, that frees us up to set dead line
pointers unused in the first pass when the table has no indexes. For
clarity, I could add a block comment explaining that doing this is an
optimization and not a logical requirement. One way to make this even
more clear would be to set the dead line pointers unused in a separate
loop after heap_prune_chain() as I proposed upthread.
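
A sketch of what that separate loop could look like in the no-index case
(the function and array names are assumptions for illustration):

    static void
    mark_dead_items_unused(Page page, OffsetNumber *deadoffsets, int ndead)
    {
        for (int i = 0; i < ndead; i++)
        {
            ItemId      itemid = PageGetItemId(page, deadoffsets[i]);

            /* pruning left these LP_DEAD with no storage */
            Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
            ItemIdSetUnused(itemid);
        }

        /* with the line pointers gone, page space can be reclaimed */
        PageRepairFragmentation(page);
    }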

- Melanie




Re: BUG #17946: LC_MONETARY & DO LANGUAGE plperl - BUG

2024-01-05 Thread Robert Haas
On Sun, Aug 27, 2023 at 4:25 PM Heikki Linnakangas  wrote:
> I think you got that backwards. 'perl_locale_obj' is set to the perl
> interpreter's locale, whenever we are *outside* the interpreter.

This thread has had no update for more than 4 months, so I'm marking
the CF entry RwF for now.

It can always be reopened, if Joe or Tristan or Heikki or someone else
picks it up again.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_get_indexdef() modification to use TxnSnapshot

2024-01-05 Thread Robert Haas
On Thu, Oct 5, 2023 at 11:15 PM Tom Lane  wrote:
> kuroda.keis...@nttcom.co.jp writes:
> > On 2023-06-14 15:31, vignesh C wrote:
> >> I have attempted to convert pg_get_indexdef() to use
> >> systable_beginscan() based on transaction-snapshot rather than using
> >> SearchSysCache().
>
> Has anybody thought about the fact that ALTER TABLE ALTER TYPE
> (specifically RememberIndexForRebuilding) absolutely requires seeing
> the latest version of the index's definition?
>
> >>> it is going to be a large refactoring and would potentially make
> >>> future implementations such as pg_get_tabledef() etc. hard. Have you
> >>> considered changes to the SearchSysCache() family so that they
> >>> internally use a transaction snapshot that is registered in advance.
>
> A very significant fraction of other SearchSysCache callers likewise
> cannot afford to see stale data.  We might be able to fix things so
> that the SQL-accessible ruleutils functionality works differently, but
> we can't just up and change the behavior of cache lookups everywhere.

This patch was registered in the CommitFest as a bug fix, but I think
it's a much more significant change than that label implies, and it
seems like we have no consensus on what the right design is.

Since there's been no response to these (entirely valid) comments from
Tom in the past 3 months, I've marked this CF entry RwF for now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: SET ROLE x NO RESET

2024-01-05 Thread Eric Hanson
On Sat, Dec 30, 2023 at 11:50 AM Joe Conway  wrote:

> In the meantime, in case it helps, see
>
>https://github.com/pgaudit/set_user
>
> Specifically set_session_auth(text):
> -
> When set_session_auth(text) is called, the effective session and current
> user is switched to the rolename supplied, irrevocably. Unlike
> set_user() or set_user_u(), it does not affect logging nor allowed
> statements. If set_user.exit_on_error is "on" (the default), and any
> error occurs during execution, a FATAL error is thrown and the backend
> session exits.
>

This helps, but has the downside (of course) of being a compiled extension
which limits its use on hosted services and such unless they decide to
support it.

It would be really great if pooling could co-exist with per-user roles
somehow. I'm not the best person to weigh in on how, but this limitation is
bottlenecking the whole approach of using per-user roles, and AFAICT that
pattern would otherwise be totally feasible and awesome, with all the
progress that's been made in this space.

Eric


Re: verify predefined LWLocks have entries in wait_event_names.txt

2024-01-05 Thread Nathan Bossart
On Fri, Jan 05, 2024 at 10:42:03AM -0600, Nathan Bossart wrote:
> On Fri, Jan 05, 2024 at 07:39:39AM +, Bertrand Drouvot wrote:
>> +   die "lists of predefined LWLocks in lwlocknames.txt and 
>> wait_event_names.txt do not match"
>> + unless $wait_event_lwlocks[$i] eq $lockname;
>> 
>> What about printing $wait_event_lwlocks[$i] and $lockname in the error 
>> message?
>> Something like?
>> 
>> "
>> die "lists of predefined LWLocks in lwlocknames.txt and 
>> wait_event_names.txt do not match (comparing $lockname and 
>> $wait_event_lwlocks[$i])"
>>   unless $wait_event_lwlocks[$i] eq $lockname;
>> "
>> 
>> I think that would give more clues for debugging purpose.
> 
> Sure, I'll add something like that.  I think this particular scenario is
> less likely, but that's not a reason to make the error message hard to
> decipher.

Here is a new version of the patch with this change.

I also tried to make the verification logic less fragile.  Instead of
depending on the exact location of empty lines in wait_event_names.txt, v3
adds a marker comment below the list that clearly indicates it should not
be changed.  This simplifies the verification code a bit, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From e1367cb5c9bbad6b963a5f3596a6133df471654c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 5 Jan 2024 00:07:40 -0600
Subject: [PATCH v3 1/1] verify lists of predefined LWLocks in lwlocknames.txt
 and wait_event_names.txt match

---
 src/backend/Makefile  |  2 +-
 src/backend/storage/lmgr/Makefile |  4 +-
 .../storage/lmgr/generate-lwlocknames.pl  | 48 +++
 .../utils/activity/wait_event_names.txt   | 12 +
 src/include/storage/meson.build   |  2 +-
 5 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 816461aa17..7d2ea7d54a 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -130,7 +130,7 @@ $(top_builddir)/src/port/libpgport_srv.a: | submake-libpgport
 parser/gram.h: parser/gram.y
 	$(MAKE) -C parser gram.h
 
-storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lwlocknames.txt
+storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lwlocknames.txt utils/activity/wait_event_names.txt
 	$(MAKE) -C storage/lmgr lwlocknames.h lwlocknames.c
 
 utils/activity/wait_event_types.h: utils/activity/generate-wait_event_types.pl utils/activity/wait_event_names.txt
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index c48ba943c4..504480e847 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -39,8 +39,8 @@ s_lock_test: s_lock.c $(top_builddir)/src/common/libpgcommon.a $(top_builddir)/s
 lwlocknames.c: lwlocknames.h
 	touch $@
 
-lwlocknames.h: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt generate-lwlocknames.pl
-	$(PERL) $(srcdir)/generate-lwlocknames.pl $<
+lwlocknames.h: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt $(top_srcdir)/src/backend/utils/activity/wait_event_names.txt generate-lwlocknames.pl
+	$(PERL) $(srcdir)/generate-lwlocknames.pl $^
 
 check: s_lock_test
 	./s_lock_test
diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/generate-lwlocknames.pl
index bef5e16e11..237c3b9678 100644
--- a/src/backend/storage/lmgr/generate-lwlocknames.pl
+++ b/src/backend/storage/lmgr/generate-lwlocknames.pl
@@ -15,6 +15,7 @@ my $continue = "\n";
 GetOptions('outdir:s' => \$output_path);
 
 open my $lwlocknames, '<', $ARGV[0] or die;
+open my $wait_event_names, '<', $ARGV[1] or die;
 
 # Include PID in suffix in case parallel make runs this multiple times.
 my $htmp = "$output_path/lwlocknames.h.tmp$$";
@@ -30,6 +31,42 @@ print $c $autogen, "\n";
 
 print $c "const char *const IndividualLWLockNames[] = {";
 
+#
+# First, record the predefined LWLocks listed in wait_event_names.txt.  We'll
+# cross-check those with the ones in lwlocknames.txt.
+#
+my @wait_event_lwlocks;
+my $record_lwlocks = 0;
+
+while (<$wait_event_names>)
+{
+	chomp;
+
+	# Check for end marker.
+	last if /^# END OF PREDEFINED LWLOCKS/;
+
+	# Skip comments and empty lines.
+	next if /^#/;
+	next if /^\s*$/;
+
+	# Start recording LWLocks when we find the WaitEventLWLock section.
+	if (/^Section: ClassName(.*)/)
+	{
+		my $section_name = $_;
+		$section_name =~ s/^.*- //;
+		$record_lwlocks = 1 if $section_name eq "WaitEventLWLock";
+		next;
+	}
+
+	# Go to the next line if we are not yet recording LWLocks.
+	next if not $record_lwlocks;
+
+	# Record the LWLock.
+	(my $waiteventname, my $waitevendocsentence) = split(/\t/, $_);
+	push(@wait_event_lwlocks, $waiteventname . "Lock");
+}
+
+my $i = 0;
 while (<$lwlocknames>)
 {
 	chomp;
@@ -50,6 +87,14 @@ while (<$lwlocknames>)
 	die "lwlocknames.txt not in order" if $lockidx < $lastlockidx;
 	die "lwlocknames.txt has duplicates" if $lockidx == 

Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-05 Thread Robert Haas
On Fri, Jan 5, 2024 at 12:35 PM Tom Lane  wrote:
> I think it'd be quite simple.  As I said, it's just a small variation
> on how some GUCs already work.  The only thing that's really
> transactional is SQL-driven updates, which'd be disallowed for this
> class of variables.

Well, I know better than to tell you something is hard if you think
it's easy. :-)

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Issue in postgres_fdw causing unnecessary wait for cancel request reply

2024-01-05 Thread Robert Haas
On Thu, Apr 13, 2023 at 2:04 PM Fujii Masao  wrote:
> >> To clarify, are you suggesting that PQgetCancel() should
> >> only parse the parameters for TCP connections
> >> if cancel->raddr.addr.ss_family != AF_UNIX?
> >> If so, I think that's a good idea.
> >
> > You're right. I used connip in the diff because I thought it provided
> > the same condition, but in a simpler way.
>
> I made a modification to the 0001 patch. It will now allow PQgetCancel() to 
> parse and interpret TCP connection parameters only when the connection is not 
> made through a Unix-domain socket.

I don't really like this change. It seems to me that what this does is
decide that it's not an error to set tcp_user_timeout='a' when making
a cancel request if the connection doesn't actually use TCP. I agree
that we shouldn't try to *use* the values if they don't apply, but I'm
not sure it's a good idea to skip *sanity-checking* them when they
don't apply. For instance you can't set work_mem=ssdgjsjdg in
postgresql.conf even if you never run a query that needs work_mem.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Implement missing join selectivity estimation for range types

2024-01-05 Thread Schoemans Maxime
On 05/01/2024 11:37, vignesh C wrote:
 > One of the tests was aborted at [1], kindly post an updated patch for
 > the same:

Thank you for notifying us.
I believe I fixed the issue but it is hard to be certain as the issue 
did not arise when running the regression tests locally.

Regards,
Maxime

diff --git a/src/backend/utils/adt/multirangetypes_selfuncs.c b/src/backend/utils/adt/multirangetypes_selfuncs.c
index 981c1fd298..6abc43f149 100644
--- a/src/backend/utils/adt/multirangetypes_selfuncs.c
+++ b/src/backend/utils/adt/multirangetypes_selfuncs.c
@@ -1335,3 +1335,542 @@ calc_hist_selectivity_contains(TypeCacheEntry *typcache,
 
 	return sum_frac;
 }
+
+/*
+ * This is a utility function used to estimate the join selectivity of
+ * range attributes using rangebound histogram statistics as described
+ * in this paper:
+ *
+ * Diogo Repas, Zhicheng Luo, Maxime Schoemans and Mahmoud Sakr, 2022
+ * Selectivity Estimation of Inequality Joins In Databases
+ * https://doi.org/10.48550/arXiv.2206.07396
+ *
+ * The attributes being joined will be treated as random variables
+ * that follow a distribution modeled by a Probability Density Function (PDF).
+ * Let the two attributes be denoted X, Y.
+ * This function finds the probability P(X < Y).
+ * Note that the PDFs of the two variables can easily be obtained
+ * from their bounds histogram, respectively hist1 and hist2 .
+ *
+ * Let the PDF of X, Y be denoted as f_X, f_Y.
+ * The probability P(X < Y) can be formalized as follows:
+ * P(X < Y)= integral_-inf^inf( integral_-inf^y ( f_X(x) * f_Y(y) dx dy ) )
+ *= integral_-inf^inf( F_X(y) * f_Y(y) dy )
+ * where F_X(y) denote the Cumulative Distribution Function of X at y.
+ * Note that F_X is the selectivity estimation (non-join),
+ * which is implemented using the function calc_hist_selectivity_scalar.
+ *
+ * Now given the histograms of the two attributes X, Y, we note the following:
+ * - The PDF of Y is a step function
+ * (constant piece-wise, where each piece is defined in a bin of Y's histogram)
+ * - The CDF of X is linear piece-wise
+ *   (each piece is defined in a bin of X's histogram)
+ * This leads to the conclusion that their product
+ * (used to calculate the equation above) is also linear piece-wise.
+ * A new piece starts whenever either the bin of X or the bin of Y changes.
+ * By parallel scanning the two rangebound histograms of X and Y,
+ * we evaluate one piece of the result between every two consecutive rangebounds
+ * in the union of the two histograms.
+ *
+ * Given that the product F_X * f_y is linear in the interval
+ * between every two consecutive rangebounds, let them be denoted prev, cur,
+ * it can be shown that the above formula can be discretized into the following:
+ * P(X < Y) =
+ *   0.5 * sum_0^{n+m-1} ( ( F_X(prev) + F_X(cur) ) * ( F_Y(cur) - F_Y(prev) ) )
+ * where n, m are the lengths of the two histograms.
+ *
+ * As such, it is possible to fully compute the join selectivity
+ * as a summation of CDFs, iterating over the bounds of the two histograms.
+ * This maximizes the code reuse, since the CDF is computed using
+ * the calc_hist_selectivity_scalar function, which is the function used
+ * for selectivity estimation (non-joins).
+ *
+ */
+static double
+calc_hist_join_selectivity(TypeCacheEntry *typcache,
+		   const RangeBound *hist1, int nhist1,
+		   const RangeBound *hist2, int nhist2)
+{
+	int			i,
+				j;
+	double		selectivity = 0.0,	/* initialization */
+				prev_sel1 = -1.0,	/* to skip the first iteration */
+				prev_sel2 = 0.0;	/* initialization */
+
+	/*
+	 * Histograms will never be empty. In fact, a histogram will never have
+	 * less than 2 values (1 bin)
+	 */
+	Assert(nhist1 > 1);
+	Assert(nhist2 > 1);
+
+	/* Fast-forwards i and j to start of iteration */
+	for (i = 0; range_cmp_bound_values(typcache, &hist1[i], &hist2[0]) < 0; i++);
+	for (j = 0; range_cmp_bound_values(typcache, &hist2[j], &hist1[0]) < 0; j++);
+
+	/* Do the estimation on overlapping regions */
+	while (i < nhist1 && j < nhist2)
+	{
+		double		cur_sel1,
+					cur_sel2;
+		RangeBound	cur_sync;
+
+		if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) < 0)
+			cur_sync = hist1[i++];
+		else if (range_cmp_bound_values(typcache, &hist1[i], &hist2[j]) > 0)
+			cur_sync = hist2[j++];
+		else
+		{
+			/* If equal, skip one */
+			cur_sync = hist1[i];
+			i++;
+			j++;
+		}
+		cur_sel1 = calc_hist_selectivity_scalar(typcache, &cur_sync,
+												hist1, nhist1, false);
+		cur_sel2 = calc_hist_selectivity_scalar(typcache, &cur_sync,
+												hist2, nhist2, false);
+
+		/* Skip the first iteration */
+		if (prev_sel1 >= 0)
+			selectivity += (prev_sel1 + cur_sel1) * (cur_sel2 - prev_sel2);
+
+		/* Prepare for the next iteration */
+		prev_sel1 = cur_sel1;
+		prev_sel2 = cur_sel2;
+	}
+
+	/* P(X < Y) = 0.5 * Sum(...) */
+	selectivity /= 2;
+
+	/* Include remainder of hist2 if any */
+	if (j < nhist2)
+		selectivity += 1 - prev_sel2;
+
+	return selectivity;
+}
+
+/*
+ * multirangejoinsel -- join cardinality for 

weird GROUPING SETS and ORDER BY behaviour

2024-01-05 Thread Geoff Winkless
We have some (generated) SQL that uses grouping sets to give us the
same data grouped in multiple ways (with sets of groups configurable
by the user), with the ordering of the rows the same as the grouping
set. This generally works fine, except for when one of the grouping
sets contains part of another grouping set joined against a subquery
(at least, I think that's the trigger).

Minimal example here:

SELECT seq, CONCAT('n', seq) AS n INTO TEMP TABLE test1 FROM
generate_series(1,5) AS seq;
SELECT seq, CONCAT('x', 6-seq) AS x INTO TEMP TABLE test2 FROM
generate_series(1,5) AS seq;

SELECT
  GROUPING(test1.n) AS gp_n,
  GROUPING(concat(test1.n, (SELECT x FROM test2 WHERE seq=test1.seq)))
AS gp_conc,
  test1.n,
  CONCAT(test1.n, (SELECT x FROM test2 WHERE seq=test1.seq)) FROM test1
GROUP BY
GROUPING SETS(
  (test1.n),
  (concat(test1.n, (SELECT x FROM test2 WHERE seq=test1.seq)))
)
ORDER BY
  CASE WHEN GROUPING(test1.n)=0 THEN test1.n ELSE NULL END NULLS FIRST,
  CASE WHEN GROUPING(concat(test1.n, (SELECT x FROM test2 WHERE
seq=test1.seq)))=0 THEN concat(test1.n, (SELECT x FROM test2 WHERE
seq=test1.seq)) ELSE NULL END NULLS FIRST;
 gp_n | gp_conc |  n   | concat
------+---------+------+--------
    1 |       0 |      | n5x1
    1 |       0 |      | n4x2
    1 |       0 |      | n3x3
    1 |       0 |      | n2x4
    1 |       0 |      | n1x5
    0 |       1 | n1   |
    0 |       1 | n2   |
    0 |       1 | n3   |
    0 |       1 | n4   |
    0 |       1 | n5   |


Am I missing some reason why the first set isn't sorted as I'd hoped?
Is the subquery value in the ORDER BY not the same as the value in the
main query? That seems... frustrating. I'd like to be able to say
"order by column (n)" but I don't think I can?

On Centos7, with the latest pg12 from the pg repo:
PostgreSQL 12.16 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-44), 64-bit

Thanks

Geoff




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-05 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jan 5, 2024 at 11:53 AM Tom Lane  wrote:
>> So my thought was that this should be implemented as an (unchangeable)
>> flag bit for a GUC variable, GUC_PROTOCOL_ONLY or something like that,
>> and then we would refuse SQL-based set attempts on that.  The behavior
>> would end up being very much like PGC_BACKEND variables, in that we
>> could allow all the existing setting methods to work to establish
>> a session's initial value; but after that, it can only change within
>> that session via a protocol message from the client.  With that
>> rule, it's okay for the protocol message to be nontransactional since
>> there's no interaction with transactions.

> Maybe, but it seems like it might be complicated to make that work
> with the existing GUC code. GUCs are fundamentally transactional, I
> think.

I think it'd be quite simple.  As I said, it's just a small variation
on how some GUCs already work.  The only thing that's really
transactional is SQL-driven updates, which'd be disallowed for this
class of variables.

(After consuming a little more caffeine, I wonder if the class ought
to be defined by a new PGC_XXX context value, rather than a flag bit.)

regards, tom lane




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-05 Thread Jelte Fennema-Nio
On Fri, 5 Jan 2024 at 17:28, Robert Haas  wrote:
> First, I don't see a reason to bump the protocol version. The whole
> reason for adding support for protocol options (_pq_.whatever) is so
> that we wouldn't have to change the protocol version to add new
> message types. At some point we may want to make a change that's large
> enough that we have to do that, or a large enough number of small
> changes that it seems worthwhile, but as long as we can add new
> features without bumping the protocol version, it seems advantageous
> to avoid doing so. It seems easier to reason about and less likely to
> break older clients.

While I agree that it's not strictly necessary, I also feel that you
think the a minor protocol version bump a much bigger deal than it is
(afaict). The protocol is designed in such a way that bumping the
minor version can be done without any problems. There is no
possibility of breaking older clients, because the server will
silently downgrade to the version that the client asks for.

I would be okay not doing the actual bump, but I think at least having
the infrastructure in libpq to support a future bump would be quite
useful (i.e. patch 0002 and 0003).

> > - How do you get the value of a protocol parameter status? Do we
> > expect the client to keep track of what it has sent? Do we always send
> > a ParameterStatus message whenever it is changed? Or should we add a
> > ParameterGet protocol message too?
>
> I would expect that the client would have full control over these
> values and so the configured value would always be the default (which
> should be non-configurable to avoid ambiguity) unless the client set
> it to something else (in which case the client should know the value).
> So I would think that we'd only need a message to set parameter
> values, not one to get them.

Based on my short experience writing these patches, even for
testing/debugging it's quite useful to be able to do SHOW
_pq_.some_protocol_parameter.

I think that's a major benefit of re-using the GUC system.




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-05 Thread Robert Haas
On Fri, Jan 5, 2024 at 12:20 PM Jelte Fennema-Nio  wrote:
> They are not fundamentally transactional afaict based on the changes
> that were needed so far. It makes sense too, because e.g. SIGHUP
> should change the GUC value if the config changed no matter if the
> current transaction aborts or succeeds.

Well, AtEOXact_GUC either reverts or puts back changes to GUC values
that have happened in that (sub)transaction, depending on whether the
(sub)transaction committed or aborted. To make that work, there's a
"stack" of GUC values for any given setting. For a non-transactional
value, we wouldn't have all that infrastructure...
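
A compressed sketch of that mechanism (fields simplified from guc.h's real
GucStack for illustration):

    typedef struct GucStackSketch
    {
        struct GucStackSketch *prev;    /* entry for an outer nest level */
        int         nest_level;         /* (sub)transaction nesting depth */
        char       *masked_value;       /* value before this xact's SET */
    } GucStackSketch;

    /*
     * At (sub)transaction end, AtEOXact_GUC pops entries at this nest
     * level: on commit the masked value is discarded (the new value
     * stands), on abort it is restored.
     */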

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Stack overflow issue

2024-01-05 Thread Robert Haas
On Fri, Nov 24, 2023 at 10:47 AM Heikki Linnakangas  wrote:
> What do you think?

At least for 0001 and 0002, I think we should just add the stack depth checks.

With regard to 0001, CommitTransactionCommand() and friends are hard
enough to understand as it is; they need "goto" like I need an extra
hole in my head.

With regard to 0002, this function isn't sufficiently important to
justify adding special-case code for an extremely rare event. We
should just handle it the way we do in general.

I agree that in the memory-context case it might be worth expending
some more code to be more clever. But I probably wouldn't do that for
MemoryContextStats(); check_stack_depth() seems fine for that one.
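
The generic pattern is just a guard at the top of the recursion; a sketch
(the real MemoryContextStatsInternal() signature differs):

    static void
    stats_recurse(MemoryContext context, int level)
    {
        check_stack_depth();    /* raises ERROR before the stack overflows */

        /* ... report stats for this context ... */

        for (MemoryContext child = context->firstchild;
             child != NULL;
             child = child->nextchild)
            stats_recurse(child, level + 1);
    }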

In general, I think we should try to keep the number of places that
handle stack overflow in "special" ways as small as possible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-05 Thread Jelte Fennema-Nio
On Fri, 5 Jan 2024 at 18:08, Robert Haas  wrote:
> On Fri, Jan 5, 2024 at 11:53 AM Tom Lane  wrote:
> > There is a lot of infrastructure we'll have to re-invent if
> > we make this completely independent of GUCs, notably:
> > * a way to establish the initial/default value
> > * a way to display the active value
> >
> > So my thought was that this should be implemented as an (unchangeable)
> > flag bit for a GUC variable, GUC_PROTOCOL_ONLY or something like that,
> > and then we would refuse SQL-based set attempts on that.  The behavior
> > would end up being very much like PGC_BACKEND variables, in that we
> > could allow all the existing setting methods to work to establish
> > a session's initial value; but after that, it can only change within
> > that session via a protocol message from the client.  With that
> > rule, it's okay for the protocol message to be nontransactional since
> > there's no interaction with transactions.
>
> Maybe, but it seems like it might be complicated to make that work
> with the existing GUC code. GUCs are fundamentally transactional, I
> think.

They are not fundamentally transactional afaict based on the changes
that were needed so far. It makes sense too, because e.g. SIGHUP
should change the GUC value if the config changed no matter if the
current transaction aborts or succeeds.

Based on my experience when writing the current set of patches I think
the GUC infrastructure fits quite well with protocol extension
parameters. When you add a new flag bit it feels very natural
(whatever you may call this flag: GUC_PROTOCOL_ONLY,
GUC_PROTOCOL_EXTENSION, or something else).




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-05 Thread Robert Haas
On Fri, Jan 5, 2024 at 11:53 AM Tom Lane  wrote:
> There is a lot of infrastructure we'll have to re-invent if
> we make this completely independent of GUCs, notably:
> * a way to establish the initial/default value
> * a way to display the active value
>
> So my thought was that this should be implemented as an (unchangeable)
> flag bit for a GUC variable, GUC_PROTOCOL_ONLY or something like that,
> and then we would refuse SQL-based set attempts on that.  The behavior
> would end up being very much like PGC_BACKEND variables, in that we
> could allow all the existing setting methods to work to establish
> a session's initial value; but after that, it can only change within
> that session via a protocol message from the client.  With that
> rule, it's okay for the protocol message to be nontransactional since
> there's no interaction with transactions.

Maybe, but it seems like it might be complicated to make that work
with the existing GUC code. GUCs are fundamentally transactional, I
think.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

2024-01-05 Thread Robert Haas
On Thu, Nov 9, 2023 at 10:32 PM Thomas Munro  wrote:
> Here is a new attempt to fix this mess.  Disclaimer: this is based
> entirely on reading the manual and vicariously hacking a computer I
> don't have via CI.

I'd first like to congratulate this thread on reaching its second
birthday. The CommitFest entry hasn't quite made it to the two year
mark yet - expect that in another month or so - but thread itself is
over the line.

Regarding 0001, I don't know if we really need SH_RAW_FREE. You can
just define your own SH_FREE implementation in userspace. That doesn't
work for SH_RAW_ALLOCATOR because there's code in simplehash.h that
knows about memory contexts apart from the actual definition of
SH_ALLOCATE - e.g. we include a MemoryContext pointer in SH_TYPE, and
in the signature of SH_CREATE. But SH_FREE doesn't seem to have any
similar issues. Maybe it's still worth doing for convenience -- I
haven't thought about that very hard -- but it doesn't seem to be
required in the same way that SH_RAW_ALLOCATOR was.

I wonder whether we really want 0002. It seems like a pretty
significant behavior change -- now everybody using simplehash has to
worry about whether failure cases are possible. And maybe there's some
performance overhead. And most of the changes are restricted to the
SH_RAW_ALLOCATOR case, but the changes to SH_GROW are not. And making
this contingent on SH_RAW_ALLOCATOR doesn't seem principled.

I kind of wonder whether trying to handle OOM here is the wrong
direction to go. What if we just bail out hard if we can't insert into
the hash table? I think that we don't expect the hash table to ever be
very large (right?) and we don't install these kinds of defenses
everywhere that OOM on a small memory allocation is a possibility (or
at least I don't think we do). I'm actually sort of unclear about why
you're trying to force this to use raw malloc/free instead of
palloc/pfree. Do we need to use this on the frontend side? Do we need
it on the backend side prior to the memory context infrastructure
being up?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: add AVX2 support to simd.h

2024-01-05 Thread Nathan Bossart
On Fri, Jan 05, 2024 at 09:03:39AM +0700, John Naylor wrote:
> On Wed, Jan 3, 2024 at 10:29 PM Nathan Bossart  
> wrote:
>> If the requirement is that normal builds use AVX2, then I fear we will be
>> waiting a long time.  IIUC the current proposals (building multiple
>> binaries or adding a configuration option that maps to compiler flags)
>> would still be opt-in,
> 
> If and when we get one of those, I would consider that a "normal"
> build. Since there are no concrete proposals yet, I'm still waiting
> for you to justify imposing an immediate maintenance cost for zero
> benefit.

I've been thinking about the configuration option approach.  ISTM that
would be the most feasible strategy, at least for v17.  A couple things
come to mind:

* This option would simply map to existing compiler flags.  We already have
  ways to provide those (-Dc_args in meson, CFLAGS in autoconf).  Perhaps
  we'd want to provide our own shorthand for certain platforms (e.g., ARM),
  but that will still just be shorthand for compiler flags.

* Such an option would itself generate some maintenance cost.  That could
  be worth it because it formalizes the Postgres support for those options,
  but it's still one more thing to track.

Another related option could be to simply document that we have support for
some newer instructions that can be enabled by setting the aforementioned
compiler flags.  That's perhaps a little less user-friendly, but it'd avoid
the duplication and possibly reduce the maintenance cost.  I also wonder if
it'd help prevent confusion when CFLAGS and this extra option conflict.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-05 Thread Tom Lane
Robert Haas  writes:
> Second, I don't really like the idea of selectively turning GUCs into
> protocol-managed parameters. I think there are a relatively small
> number of things that we'd ever want to manage on a protocol level,
> but this design would force us to make it work for any GUC whatsoever.

I'd not been following along for the last few days, but I agree that
we don't want to make it apply to any GUC at all.

> I think we should start by picking one or two protocol-managed
> parameters that we want to add, and then adding those in a way that is
> distinct from the GUC system. I don't think we should add an abstract
> system divorced from any particular application.

There is a lot of infrastructure we'll have to re-invent if
we make this completely independent of GUCs, notably:
* a way to establish the initial/default value
* a way to display the active value

So my thought was that this should be implemented as an (unchangeable)
flag bit for a GUC variable, GUC_PROTOCOL_ONLY or something like that,
and then we would refuse SQL-based set attempts on that.  The behavior
would end up being very much like PGC_BACKEND variables, in that we
could allow all the existing setting methods to work to establish
a session's initial value; but after that, it can only change within
that session via a protocol message from the client.  With that
rule, it's okay for the protocol message to be nontransactional since
there's no interaction with transactions.
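
A sketch of how that refusal might look in set_config_option() (the flag is
the strawman name above; the protocol-origin test is an assumption):

    if ((record->flags & GUC_PROTOCOL_ONLY) != 0 &&
        source == PGC_S_SESSION &&
        !set_via_protocol_message)  /* hypothetical origin flag */
        ereport(ERROR,
                (errcode(ERRCODE_CANT_CHANGE_RUNTIME_PARAM),
                 errmsg("parameter \"%s\" can only be changed via a protocol message",
                        name)));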

regards, tom lane




Re: verify predefined LWLocks have entries in wait_event_names.txt

2024-01-05 Thread Nathan Bossart
Thanks for reviewing.

On Fri, Jan 05, 2024 at 07:39:39AM +, Bertrand Drouvot wrote:
> Another option could be to create a sub-section for predefined LWLocks that
> are part of lwlocknames.txt and then sort both lists (the one in the
> sub-section and the one in lwlocknames.txt). That would avoid the "must be
> listed in the same order" constraint. That said, I think the way it is done
> in the patch is fine because if one does not follow the constraint then the
> build would fail.

IMHO the ordering constraint makes it easier for humans to verify the lists
match.

> +   die "lists of predefined LWLocks in lwlocknames.txt and 
> wait_event_names.txt do not match"
> + unless $wait_event_lwlocks[$i] eq $lockname;
> 
> What about printing $wait_event_lwlocks[$i] and $lockname in the error 
> message?
> Something like?
> 
> "
> die "lists of predefined LWLocks in lwlocknames.txt and 
> wait_event_names.txt do not match (comparing $lockname and 
> $wait_event_lwlocks[$i])"
>   unless $wait_event_lwlocks[$i] eq $lockname;
> "
> 
> I think that would give more clues for debugging purpose.

Sure, I'll add something like that.  I think this particular scenario is
less likely, but that's not a reason to make the error message hard to
decipher.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-05 Thread Nathan Bossart
On Fri, Jan 05, 2024 at 04:18:49PM +0900, Michael Paquier wrote:
> Putting that in contrib/ has a lot of extra cost.  One is
> documentation and more complexity regarding versioning when it comes
> to upgrading it to a new version.  I don't think that it is a good
> idea to deal with this extra load of work for something that I'd aim
> to be used for having improved *test* coverage, and the build switch
> should stay.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-05 Thread Robert Haas
On Fri, Jan 5, 2024 at 10:14 AM Jelte Fennema-Nio  wrote:
> New patchset attached, where I split up the patches into smaller logical units.
> Note that the first 4 patches in this series are not making any
> protocol changes. All they do is set up infrastructure in the code
> that allows us to make protocol changes in the future.
>
> I hope that those 4 should all be fairly uncontroversial, especially
> patch 1 seems a no-brainer to me. Note that this infrastructure would
> be needed for any patchset that introduces protocol changes.
>
> The 5th only bumps the version
>
> The 6th introduces the _pq_.protocol_managed_params protocol parameter
>
> The 7th adds the new ParameterSet protocol message

Apologies in advance if this sounds harsh but ... I don't like this
design. I have two main complaints.

First, I don't see a reason to bump the protocol version. The whole
reason for adding support for protocol options (_pq_.whatever) is so
that we wouldn't have to change the protocol version to add new
message types. At some point we may want to make a change that's large
enough that we have to do that, or a large enough number of small
changes that it seems worthwhile, but as long as we can add new
features without bumping the protocol version, it seems advantageous
to avoid doing so. It seems easier to reason about and less likely to
break older clients.

Second, I don't really like the idea of selectively turning GUCs into
protocol-managed parameters. I think there are a relatively small
number of things that we'd ever want to manage on a protocol level,
but this design would force us to make it work for any GUC whatsoever.
That seems like it adds a lot of complexity for not much benefit. If
somebody makes a random GUC into a protocol-managed parameter and then
somebody updates postgresql.conf and then the config file is reloaded,
I guess we need to refuse to adopt the new value of that parameter?
That doesn't seem like a lot of fun to implement. What about the fact
that GUCs are transactional and protocol-managed parameters maybe
shouldn't be? We can dodge a lot of complexity here, I think, if we
just put the things into this new mechanism that have a clear reason
to be there.

To answer a few questions from upthread (MHO):

> - Since we currently don't have any protocol parameters. How do I test
> it? Should I add a debug protocol parameter specifically for this
> purpose? Or should my tests  just always error at the moment?

I think we should start by picking one or two protocol-managed
parameters that we want to add, and then adding those in a way that is
distinct from the GUC system. I don't think we should add an abstract
system divorced from any particular application.

> - How do you get the value of a protocol parameter status? Do we
> expect the client to keep track of what it has sent? Do we always send
> a ParameterStatus message whenever it is changed? Or should we add a
> ParameterGet protocol message too?

I would expect that the client would have full control over these
values and so the configured value would always be the default (which
should be non-configurable to avoid ambiguity) unless the client set
it to something else (in which case the client should know the value).
So I would think that we'd only need a message to set parameter
values, not one to get them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-05 Thread Nathan Bossart
On Fri, Jan 05, 2024 at 08:38:22AM +0900, Michael Paquier wrote:
> On Thu, Jan 04, 2024 at 04:31:02PM -0600, Nathan Bossart wrote:
>> Rather than defining a module somewhere that tests would need to load,
>> should we just put the common callbacks in the core server?  Unless there's
>> a strong reason to define them elsewhere, that could be a nice way to save
>> a step in the tests.
> 
> Nah, having some pre-existing callbacks existing in the backend is
> against the original minimalistic design spirit.  These would also
> require an SQL interface, and the interface design also depends on the
> functions registering them when pushing down custom conditions.
> Pushing that down to extensions to do what they want will lead to less
> noise, particularly if you consider that we will most likely want to
> tweak the callback interfaces for backpatched bugs.  That's also why I
> think contrib/ is not a good idea, src/test/modules/ serving the
> actual testing purpose here.

Ah, so IIUC we'd have to put some functions in pg_proc.dat even though they
would only be used for a handful of tests in special builds.  I'd agree
that's not desirable.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Assertion failure in SnapBuildInitialSnapshot()

2024-01-05 Thread Robert Haas
This thread has gone for about a year here without making any
progress, which isn't great.

On Tue, Feb 7, 2023 at 2:49 PM Andres Freund  wrote:
> Hm. It's worrisome to now hold ProcArrayLock exclusively while iterating over
> the slots. ReplicationSlotsComputeRequiredXmin() can be called at a
> non-negligible frequency.  Callers like CreateInitDecodingContext(), that pass
> already_locked=true worry me a lot less, because obviously that's not a very
> frequent operation.

Maybe, but it would be good to have some data indicating whether this
is really an issue.

> I wonder if we could instead invert the locks, and hold
> ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), and
> acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin().  That'd mean
> that already_locked = true callers have to do a bit more work (we have to be
> sure the locks are always acquired in the same order, or we end up in
> unresolved deadlock land), but I think we can live with that.

This seems like it could be made to work, but there's apparently a
shortage of people willing to write the patch.

As another thought, Masahiko-san writes in his proposed commit message:

"As a result, the replication_slot_xmin could be overwritten with an
old value and retreated."

But what about just surgically preventing that?
ProcArraySetReplicationSlotXmin() could refuse to retreat the values,
perhaps? If it computes an older value than what's there, it just does
nothing?
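
A sketch of that guard (procArray and the locking mirror procarray.c, but the
retreat test itself is illustrative):

    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
    if (!TransactionIdIsValid(procArray->replication_slot_xmin) ||
        TransactionIdFollowsOrEquals(xmin, procArray->replication_slot_xmin))
        procArray->replication_slot_xmin = xmin;    /* advance or initialize */
    /* else: keep the newer published value rather than retreating */
    LWLockRelease(ProcArrayLock);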

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: POC: Extension for adding distributed tracing - pg_tracing

2024-01-05 Thread Nikita Malakhov
Hi!

I meant exactly the thing you mentioned:

>
> > By queries you mean particular queries, not transactions? And since
> > they have an assigned ID it means that the query is already executing
> > and we want to enable the tracking in another session, right?
>
> I think that was the idea. The query identifier generated for a
> specific statement is stable so we could have a GUC variable with a
> list of query id and only queries matching the provided query ids
> would be sampled. This would make it easier to generate traces for a
> specific query within a session.
>
This one. When maintaining production systems with high load rates,
we often encountered the need to trace a number of queries with
known IDs (that was the case for Oracle environments, but Postgres
is not very different from this point of view), because on a system
with a throughput of hundreds or even thousands of tps, enabling
tracing for the whole session would immediately result in thousands
of trace spans when we need, say, to trace a single problematic query.
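
A sketch of what that could look like; the GUC name and the query ID
values are purely illustrative, not part of the current patch:

-- sample only the two problematic statements, identified by the stable
-- query IDs produced by compute_query_id:
SET pg_tracing.sample_query_ids = '7185710333043191336, 2587675670322132373';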

Thank you!

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-05 Thread Jelte Fennema-Nio
New patchset attached, where I split up the patches in smaller logical units.
Note that the first 4 patches in this series are not making any
protocol changes. All they do is set up infrastructure in the code
that allows us to make protocol changes in the future.

I hope that those 4 are all fairly uncontroversial; patch 1 especially
seems like a no-brainer to me. Note that this infrastructure would
be needed for any patchset that introduces protocol changes.

The 5th only bumps the protocol version.

The 6th introduces the _pq_.protocol_managed_params protocol parameter.

The 7th adds the new ParameterSet protocol message.


v4-0005-Bump-protocol-version-to-3.1.patch
Description: Binary data


v4-0002-libpq-Handle-NegotiateProtocolVersion-message-mor.patch
Description: Binary data


v4-0004-Prepare-server-code-for-addition-of-protocol-exte.patch
Description: Binary data


v4-0003-libpq-Include-minor-version-number-in-PQprotocolV.patch
Description: Binary data


v4-0001-libpq-Remove-instance-of-hardcoded-protocol-versi.patch
Description: Binary data


v4-0006-Add-_pq_.protocol_managed_params-protocol-extensi.patch
Description: Binary data


v4-0007-Add-protcol-message-to-change-protocol-extension-.patch
Description: Binary data


Re: Build versionless .so for Android

2024-01-05 Thread Peter Eisentraut

On 01.01.24 06:25, Matthias Kuhn wrote:
It looks like meson does not currently support building for android, the 
following output is what I get (but I have actually no experience with 
meson):


     meson.build:320:2: ERROR: Problem encountered: unknown host system: 
android


FWIW, the meson source code contains numerous mentions of an 'android' 
platform, so it seems like this is expected to work.
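
For what it's worth, meson cross builds are normally driven by a cross
file rather than autodetection. A minimal sketch for Android could look
like the following, passed as "meson setup build --cross-file
android.ini"; the toolchain name follows the usual NDK naming pattern and
all values here are illustrative, untested assumptions:

[binaries]
c = 'aarch64-linux-android34-clang'
cpp = 'aarch64-linux-android34-clang++'

[host_machine]
system = 'android'
cpu_family = 'aarch64'
cpu = 'aarch64'
endian = 'little'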






Re: Build versionless .so for Android

2024-01-05 Thread Peter Eisentraut

On 05.01.24 01:00, Matthias Kuhn wrote:

Attached a patch with a (hopefully) better wording of the comment.

I have unsuccessfully tried to find an official source for this policy.
So for reference some discussions about the topic:

- https://stackoverflow.com/questions/11491065/linking-with-versioned-shared-library-in-android-ndk
- https://stackoverflow.com/questions/18681401/how-can-i-remove-all-versioning-information-from-shared-object-files

What I would like to see is a specific thing that you are trying to do
that doesn't work.  Probably, you are writing a program that is meant to 
run on Android, and you are linking it (provide command line), and then 
what happens?  The linking fails?  It fails to run?  What is the error? 
Can you provide a minimal example?  And so on.






Re: Build versionless .so for Android

2024-01-05 Thread Robert Haas
On Thu, Jan 4, 2024 at 7:00 PM Matthias Kuhn  wrote:
> Attached a patch with a (hopefully) better wording of the comment.
>
> I have unsuccessfully tried to find an official source for this policy.
> So for reference some discussions about the topic:
>
> - 
> https://stackoverflow.com/questions/11491065/linking-with-versioned-shared-library-in-android-ndk
> - 
> https://stackoverflow.com/questions/18681401/how-can-i-remove-all-versioning-information-from-shared-object-files

This is interesting, but these links are also very old.

I like the new wording of the comment better, but I think we need more
confirmation that this is correct before committing it. Anyone else
here familiar with building on Android?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




cleanup patches for incremental backup

2024-01-05 Thread Robert Haas
Hi,

I discovered that my patch to add WAL summarization added two new
SQL-callable functions but failed to document them. 0001 fixes that.

An outstanding item from the original thread was to write a better
test for the not-yet-committed pg_walsummary utility. But I discovered
that I couldn't do that because there were some race conditions that
couldn't easily be cured. So 0002 therefore adds a new function
pg_get_wal_summarizer_state() which returns various items of in-memory
state related to WAL summarization. We had some brief discussion of
this being desirable for other reasons; it's nice for users to be able
to look at this information in case of trouble (say, if the summarizer
is not keeping up).
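
For instance, once 0002 is in, checking on the summarizer becomes a
simple query (the output columns are whatever 0002 defines; this is shown
only as the shape of the call):

SELECT * FROM pg_get_wal_summarizer_state();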

0003 then adds the previously-proposed pg_walsummary utility, with
tests that depend on 0002.

0004 attempts to fix some problems detected by Coverity and subsequent
code inspection.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0002-Add-new-function-pg_get_wal_summarizer_state.patch
Description: Binary data


v1-0001-Document-WAL-summarization-information-functions.patch
Description: Binary data


v1-0004-Repair-various-defects-in-dc212340058b4e7ecfc5a7a.patch
Description: Binary data


v1-0003-Add-new-pg_walsummary-tool.patch
Description: Binary data


Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-05 Thread Robert Haas
On Thu, Jan 4, 2024 at 6:03 PM Melanie Plageman
 wrote:
> When a single page is being processed, page pruning happens in
> heap_page_prune(). Freezing, dead items recording, and visibility
> checks happen in lazy_scan_prune(). Visibility map updates and
> freespace map updates happen back in lazy_scan_heap(). Except, if the
> table has no indexes, in which case, lazy_scan_heap() also invokes
> lazy_vacuum_heap_page() to set dead line pointers unused and do
> another separate visibility check and VM update. I maintain that all
> page-level processing should be done in the page-level processing
> functions (like lazy_scan_prune()). And lazy_scan_heap() shouldn't be
> directly responsible for special case page-level processing.

But you can just as easily turn this argument on its head, can't you?
In general, except for HOT tuples, line pointers are marked dead by
pruning and unused by vacuum. Here you want to turn it on its head and
make pruning do what would normally be vacuum's responsibility.
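
For orientation, here is that division of labor expressed with the
existing itemid.h macros; this is context for the discussion, not part of
the patch:

#include "storage/itemid.h"

/* pruning: free the tuple's storage, but keep the (now dead) line
 * pointer, since index entries may still reference it */
static inline void
prune_marks_dead(ItemId itemid)
{
	ItemIdSetDead(itemid);
}

/* vacuum: once no index references remain, reclaim the line pointer */
static inline void
vacuum_marks_unused(ItemId itemid)
{
	ItemIdSetUnused(itemid);
}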

I mean, that's not to say that your argument is "wrong" ... but what I
just said really is how I think about it, too.

> > Also, I find "pronto_reap" to be a poor choice of name. "pronto" is an
> > informal word that seems to have no advantage over something like
> > "immediate" or "now," and I don't think "reap" has a precise,
> > universally-understood meaning. You could call this "mark_unused_now"
> > or "immediately_mark_unused" or something and it would be far more
> > self-documenting, IMHO.
>
> Yes, I see how pronto is unnecessarily informal. If there are no cases
> other than when the table has no indexes that we would consider
> immediately marking LPs unused, then perhaps it is better to call it
> "no_indexes" (per andres' suggestion)?

wfm.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: the s_lock_stuck on perform_spin_delay

2024-01-05 Thread Robert Haas
On Thu, Jan 4, 2024 at 6:06 PM Andres Freund  wrote:
> I think we should add infrastructure to detect bugs like this during
> development, but not PANICing when this happens in production seems completely
> non-viable.

I mean +1 for the infrastructure, but "completely non-viable"? Why?

I've only very rarely seen this PANIC occur, and in the few cases
where I've seen it, it was entirely unclear that the problem was due
to a bug where somebody failed to release a spinlock. It seemed more
likely that the machine was just not really functioning, and the PANIC
was a symptom of processes not getting scheduled rather than a PG bug.
And every time I tell a user that they might need to use a debugger
to, say, set VacuumCostActive = false, or to get a backtrace, or any
other reason, I have to tell them to make sure to detach the debugger
in under 60 seconds, because in the unlikely event that they attach
while the process is holding a spinlock, failure to detach in under 60
seconds will take their production system down for no reason. Now, if
you're about to say that people shouldn't need to use a debugger on
their production instance, I entirely agree ... but in the world I
inhabit, that's often the only way to solve a customer problem, and it
probably will continue to be until we have much better ways of getting
backtraces without using a debugger than is currently the case.

Have you seen real cases where this PANIC got a system out of a hang? If
yes, did that PANIC trace back to a bug in PostgreSQL? And why didn't the
user just keep hitting the same bug over and over, PANICing in an endless loop?

I feel like this is one of those things that has just been this way
forever and we don't question it because it's become an article of
faith that it's something we have to have. But I have a very hard time
explaining why it's even a net positive, let alone the unquestionable
good that you seem to think.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2024-01-05 Thread Amit Kapila
On Fri, Jan 5, 2024 at 4:25 PM Dilip Kumar  wrote:
>
> On Fri, Jan 5, 2024 at 8:59 AM shveta malik  wrote:
> >
> I was going through the patch set again, and I have a question.  The below
> comments say that we keep the failover option as PENDING until we have
> done the initial table sync which seems fine.  But what happens if we
> add a new table to the publication and refresh the subscription? In
> such a case does this go back to the PENDING state or something else?
>

At this stage, such an operation is prohibited. Users need to disable
the failover option first, then perform the above operation, and after
that the failover option can be re-enabled.
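
In SQL terms, the workflow would be roughly the following, assuming the
patch's failover suboption and example object names:

-- on the subscriber
ALTER SUBSCRIPTION sub1 SET (failover = false);
-- on the publisher
ALTER PUBLICATION pub1 ADD TABLE t2;
-- on the subscriber again
ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION;
ALTER SUBSCRIPTION sub1 SET (failover = true);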

-- 
With Regards,
Amit Kapila.




Re: Change GUC hashtable to use simplehash?

2024-01-05 Thread jian he
On Fri, Jan 5, 2024 at 6:54 PM John Naylor  wrote:
>
> On Thu, Jan 4, 2024 at 10:01 AM jian he  wrote:
> >
> > I still cannot git apply your patch cleanly. in
>
> I don't know why you're using that -- the git apply man page even says
>
> "Use git-am(1) to create commits from patches generated by
> git-format-patch(1) and/or received by email."
>
> Or, if that fails, use "patch".
>
> > http://cfbot.cputube.org/ i cannot find your patch.
> > ( so, it might be that I test based on incomplete information).
> > but only hashfn_unstable.h influences bench_hash/bench_hash.c.
> >
> > so I attached the whole patch that I had git applied, that is the
> > changes i applied for the following tests.
>
> Well, aside from the added text-editor detritus, it looks like this
> has everything except v11-0008, without which I still get improvement
> for the pgstat hash.
>
> >   Model name:Intel(R) Core(TM) i5-14600K
>
> > The following was tested with another machine, with the machine spec
> > listed below.
> > I tested 3 times; the results are very similar, as follows:
> > select * from bench_cstring_hash_aligned(10);4705.686 ms
> > select * from bench_cstring_hash_unaligned(10);6835.753 ms
> > select * from bench_pgstat_hash(10);   2678.978 ms
> > select * from bench_pgstat_hash_fh(10);  6199.017 ms
> > select * from bench_string_hash(10);847.699 ms
>
> I was fully prepared to believe something like 32-bit Arm would have
> difficulty with 64-bit shifts/multiplies etc., but this makes no sense
> at all. In this test, on my machine, HEAD's pgstat_hash is 3x faster
> than HEAD's "strlen + hash_bytes", but for you it's 3x slower. To
> improve reproducibility, I've added the .sql files and a bench script
> to v13. I invite you to run bench_hash.sh and see if that changes
> anything.

git apply has a verbose option.
Also, personally I rely on the vscode editor and its color highlighting
to view the changes.

jian@jian:~/Desktop/pg_src/src4/postgres$ git apply
$PATCHES/v13-0006-Add-benchmarks-for-hashing.patch
/home/jian/Downloads/patches/v13-0006-Add-benchmarks-for-hashing.patch:81:
indent with spaces.
if (/^PG_KEYWORD\("(\w+)"/)
/home/jian/Downloads/patches/v13-0006-Add-benchmarks-for-hashing.patch:82:
indent with spaces.
{
/home/jian/Downloads/patches/v13-0006-Add-benchmarks-for-hashing.patch:87:
indent with spaces.
}
/home/jian/Downloads/patches/v13-0006-Add-benchmarks-for-hashing.patch:89:
trailing whitespace.

/home/jian/Downloads/patches/v13-0006-Add-benchmarks-for-hashing.patch:92:
trailing whitespace.

warning: squelched 11 whitespace errors
warning: 16 lines add whitespace errors.


jian@jian:~/Desktop/pg_src/src4/postgres$ bash runbench.sh
select * from bench_string_hash(10);

latency average = 875.482 ms
select * from bench_cstring_hash_unaligned(10);
latency average = 6539.231 ms
select * from bench_cstring_hash_aligned(10);
latency average = 4401.278 ms
select * from bench_pgstat_hash(10);
latency average = 2679.732 ms
select * from bench_pgstat_hash_fh(10);

latency average = 5711.012 ms
jian@jian:~/Desktop/pg_src/src4/postgres$ bash runbench.sh
select * from bench_string_hash(10);

latency average = 874.261 ms
select * from bench_cstring_hash_unaligned(10);
latency average = 6538.874 ms
select * from bench_cstring_hash_aligned(10);
latency average = 4400.546 ms
select * from bench_pgstat_hash(10);
latency average = 2682.013 ms
select * from bench_pgstat_hash_fh(10);

latency average = 5709.815 ms

meson:

meson setup ${BUILD} \
-Dprefix=${PG_PREFIX} \
-Dpgport=5459 \
-Dplperl=enabled \
-Dplpython=enabled \
-Dssl=openssl \
-Dldap=enabled \
-Dlibxml=enabled \
-Dlibxslt=enabled \
-Duuid=e2fs \
-Dzstd=enabled \
-Dlz4=enabled \
-Dsystemd=enabled \
-Dcassert=true \
-Db_coverage=true \
-Dicu=enabled \
-Dbuildtype=debug \
-Dwerror=true \
-Dc_args='-Wunused-variable
-Wuninitialized
-Werror=maybe-uninitialized
-Wreturn-type
-DWRITE_READ_PARSE_PLAN_TREES
-DCOPY_PARSE_PLAN_TREES
-DREALLOCATE_BITMAPSETS
-DRAW_EXPRESSION_COVERAGE_TEST -fno-omit-frame-pointer' \
-Ddocs_pdf=disabled \
-Dllvm=disabled \
-Ddocs_pdf=disabled




Re: Assorted typo fixes

2024-01-05 Thread Michael Paquier
On Wed, Jan 03, 2024 at 12:56:58AM -0500, Tom Lane wrote:
> Yeah.  A quick grep shows that we have 16 uses of "vertices" and
> only this one of "vertexes".  It's not really wrong, but +1 for
> making it match the others.

Applied this one as 793ecff7df80 on HEAD.

> I'd leave this alone, it's not wrong either.  If you want to propose
> a complete rewording, do so; but that's not "misspelling" territory.

I don't have a better idea in my mind now, so I'm OK to leave that be.
--
Michael


signature.asc
Description: PGP signature


Re: Synchronizing slots from primary to standby

2024-01-05 Thread Dilip Kumar
On Fri, Jan 5, 2024 at 8:59 AM shveta malik  wrote:
>
I was going through the patch set again, and I have a question.  The below
comments say that we keep the failover option as PENDING until we have
done the initial table sync which seems fine.  But what happens if we
add a new table to the publication and refresh the subscription? In
such a case does this go back to the PENDING state or something else?

+ * As a result, we enable the failover option for the main slot only after the
+ * initial sync is complete. The failover option is implemented as a tri-state
+ * with values DISABLED, PENDING, and ENABLED. The state transition process
+ * between these values is the same as the two_phase option (see TWO_PHASE
+ * TRANSACTIONS for details).

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Change GUC hashtable to use simplehash?

2024-01-05 Thread John Naylor
On Thu, Jan 4, 2024 at 10:01 AM jian he  wrote:
>
> I still cannot git apply your patch cleanly. in

I don't know why you're using that -- the git apply man page even says

"Use git-am(1) to create commits from patches generated by
git-format-patch(1) and/or received by email."

Or, if that fails, use "patch".

> http://cfbot.cputube.org/ i cannot find your patch.
> ( so, it might be that I test based on incomplete information).
> but only hashfn_unstable.h influences bench_hash/bench_hash.c.
>
> so I attached the whole patch that I had git applied, that is the
> changes i applied for the following tests.

Well, aside from the added text-editor detritus, it looks like this
has everything except v11-0008, without which I still get improvement
for the pgstat hash.

>   Model name:Intel(R) Core(TM) i5-14600K

> The following was tested with another machine, with the machine spec listed below.
> I tested 3 times; the results are very similar, as follows:
> select * from bench_cstring_hash_aligned(10);4705.686 ms
> select * from bench_cstring_hash_unaligned(10);6835.753 ms
> select * from bench_pgstat_hash(10);   2678.978 ms
> select * from bench_pgstat_hash_fh(10);  6199.017 ms
> select * from bench_string_hash(10);847.699 ms

I was fully prepared to believe something like 32-bit Arm would have
difficulty with 64-bit shifts/multiplies etc., but this makes no sense
at all. In this test, on my machine, HEAD's pgstat_hash is 3x faster
than HEAD's "strlen + hash_bytes", but for you it's 3x slower. To
improve reproducibility, I've added the .sql files and a bench script
to v13. I invite you to run bench_hash.sh and see if that changes
anything.

v13 also
- adds an assert that aligned and unaligned C string calculations give
the same result
- properly mixes roleid in the namespace hash, since it's now
convenient to do so (0005 is an alternate method)
- removes the broken makefile from the benchmark (not for commit anyway)
From cf64f9a0603837dd89efdf1aa455395906e75ded Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Fri, 5 Jan 2024 17:21:53 +0700
Subject: [PATCH v13 5/6] WIP: a safer way to accumulate a single struct member
 into the hash state

---
 src/backend/catalog/namespace.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index d1eae2a2d4..83fd57906c 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -258,7 +258,9 @@ spcachekey_hash(SearchPathCacheKey key)
 
 	fasthash_init(&hs, FH_UNKNOWN_LENGTH, 0);
 
-	fasthash_accum(&hs, (const char*) &key.roleid, sizeof(Oid));
+	hs.accum = key.roleid;
+	fasthash_combine(&hs);
+
 	sp_len = fasthash_accum_cstring(&hs, key.searchPath);
 
 	/* pass the length to tweak the final mix */
-- 
2.43.0

From 3ff66ebbe9f27639984c726dbd4005002b2615b9 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Sun, 24 Dec 2023 09:46:44 +0700
Subject: [PATCH v13 6/6] Add benchmarks for hashing

---
 bench_cstr_aligned.sql |   1 +
 bench_cstr_unaligned.sql   |   1 +
 bench_pgstat_fh.sql|   2 +
 bench_pgstat_orig.sql  |   1 +
 bench_string_hash.sql  |   2 +
 contrib/bench_hash/aligned_keywords.h  | 991 +
 contrib/bench_hash/bench_hash--1.0.sql |  30 +
 contrib/bench_hash/bench_hash.c| 169 +
 contrib/bench_hash/bench_hash.control  |   5 +
 contrib/bench_hash/meson.build |  19 +
 contrib/meson.build|   1 +
 runbench.sh|  16 +
 12 files changed, 1238 insertions(+)
 create mode 100644 bench_cstr_aligned.sql
 create mode 100644 bench_cstr_unaligned.sql
 create mode 100644 bench_pgstat_fh.sql
 create mode 100644 bench_pgstat_orig.sql
 create mode 100644 bench_string_hash.sql
 create mode 100644 contrib/bench_hash/aligned_keywords.h
 create mode 100644 contrib/bench_hash/bench_hash--1.0.sql
 create mode 100644 contrib/bench_hash/bench_hash.c
 create mode 100644 contrib/bench_hash/bench_hash.control
 create mode 100644 contrib/bench_hash/meson.build
 create mode 100755 runbench.sh

diff --git a/bench_cstr_aligned.sql b/bench_cstr_aligned.sql
new file mode 100644
index 00..9ce6074fe2
--- /dev/null
+++ b/bench_cstr_aligned.sql
@@ -0,0 +1 @@
+select * from bench_cstring_hash_aligned(10);
diff --git a/bench_cstr_unaligned.sql b/bench_cstr_unaligned.sql
new file mode 100644
index 00..d654be3c07
--- /dev/null
+++ b/bench_cstr_unaligned.sql
@@ -0,0 +1 @@
+select * from bench_cstring_hash_unaligned(10);
diff --git a/bench_pgstat_fh.sql b/bench_pgstat_fh.sql
new file mode 100644
index 00..1130361c43
--- /dev/null
+++ b/bench_pgstat_fh.sql
@@ -0,0 +1,2 @@
+select * from bench_pgstat_hash_fh(10);
+
diff --git a/bench_pgstat_orig.sql b/bench_pgstat_orig.sql
new file mode 100644
index 00..bd6d084fc2
--- 

Re: Confine vacuum skip logic to lazy_scan_skip

2024-01-05 Thread Nazir Bilal Yavuz
Hi,

On Fri, 5 Jan 2024 at 02:25, Jim Nasby  wrote:
>
> On 1/4/24 2:23 PM, Andres Freund wrote:
>
> On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote:
>
> Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages
> Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var
>  nskippable_blocks
>
> I think these may lead to worse code - the compiler has to reload
> vacrel->rel_pages/next_unskippable_block for every loop iteration, because it
> can't guarantee that they're not changed within one of the external functions
> called in the loop body.
>
> Admittedly I'm not up to speed on recent vacuum changes, but I have to wonder 
> if the concept of skipping should go away in the context of vector IO? 
> Instead of thinking about "we can skip this range of blocks", why not 
> maintain a list of "here's the next X number of blocks that we need to 
> vacuum"?

Sorry if I misunderstood. AFAIU, with the help of vectored IO,
"the next X number of blocks that need to be vacuumed" will be
prefetched by calculating the unskippable blocks (using the
lazy_scan_skip() function), and the X will be determined by Postgres
itself. Do you have something different in mind?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: generic plans and "initial" pruning

2024-01-05 Thread vignesh C
On Mon, 20 Nov 2023 at 10:00, Amit Langote  wrote:
>
> On Thu, Sep 28, 2023 at 5:26 PM Amit Langote  wrote:
> > On Tue, Sep 26, 2023 at 10:06 PM Amit Langote  
> > wrote:
> > > After sleeping on this, I think we do need the checks after all the
> > > ExecInitNode() calls too, because we have many instances of the code
> > > like the following one:
> > >
> > > outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
> > > tupDesc = ExecGetResultType(outerPlanState(gatherstate));
> > > 
> > >
> > > If outerNode is a SeqScan and ExecInitSeqScan() returned early because
> > > ExecOpenScanRelation() detected that plan was invalidated, then
> > > tupDesc would be NULL in this case, causing the code to crash.
> > >
> > > Now one might say that perhaps we should only add the
> > > is-CachedPlan-valid test in the instances where there is an actual
> > > risk of such misbehavior, but that could lead to confusion, now or
> > > later.  It seems better to add them after every ExecInitNode() call
> > > while we're inventing the notion, because doing so relieves the
> > > authors of future enhancements of the ExecInit*() routines from
> > > worrying about any of this.
> > >
> > > Attached 0003 should show how that turned out.
> > >
> > > Updated 0002 as mentioned in the previous reply -- setting pointers to
> > > NULL after freeing them more consistently across various ExecEnd*()
> > > routines and using the `if (pointer != NULL)` style over the `if
> > > (pointer)` more consistently.
> > >
> > > Updated 0001's commit message to remove the mention of its relation to
> > > any future commits.  I intend to push it tomorrow.
> >
> > Pushed that one.  Here are the rebased patches.
> >
> > 0001 seems ready to me, but I'll wait a couple more days for others to
> > weigh in.  Just to highlight a kind of change that others may have
> > differing opinions on, consider this hunk from the patch:
> >
> > -   MemoryContextDelete(node->aggcontext);
> > +   if (node->aggcontext != NULL)
> > +   {
> > +   MemoryContextDelete(node->aggcontext);
> > +   node->aggcontext = NULL;
> > +   }
> > ...
> > +   ExecEndNode(outerPlanState(node));
> > +   outerPlanState(node) = NULL;
> >
> > So the patch wants to enhance the consistency of setting the pointer
> > to NULL after freeing part.  Robert mentioned his preference for doing
> > it in the patch, which I agree with.
>
> Rebased.

There is a leak reported at [1]; details for the same are available at [2]:
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/select_views.out
/tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/select_views.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/select_views.out
2023-12-19 23:00:04.677385000 +
+++ 
/tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/select_views.out
2023-12-19 23:06:26.870259000 +
@@ -1288,6 +1288,7 @@
   (102, '2011-10-12', 120),
   (102, '2011-10-28', 200),
   (103, '2011-10-15', 480);
+WARNING:  resource was not closed: relation "customer_pkey"
 CREATE VIEW my_property_normal AS
SELECT * FROM customer WHERE name = current_user;
 CREATE VIEW my_property_secure WITH (security_barrier) A

[1] - https://cirrus-ci.com/task/6494009196019712
[2] - 
https://api.cirrus-ci.com/v1/artifact/task/6494009196019712/testrun/build/testrun/regress-running/regress/regression.diffs

Regards,
Vignesh




Re: [17] CREATE SUBSCRIPTION ... SERVER

2024-01-05 Thread Ashutosh Bapat
On Fri, Jan 5, 2024 at 1:34 PM Jeff Davis  wrote:
>
> On Fri, 2024-01-05 at 12:49 +0530, Ashutosh Bapat wrote:
> > Can you please provide an example using postgres_fdw to create a
> > subscription using this patch. I think we should document it in
> > postgres_fdw and add a test for the same.
>
> There's a basic test for postgres_fdw in patch 0003, just testing the
> syntax and validation.
>
> A manual end-to-end test is pretty straightforward:
>
>   -- on publisher
>   create table foo(i int primary key);
>   create publication pub1 for table foo;
>   insert into foo values(42);
>
>   -- on subscriber
>   create extension postgres_fdw;
>   create table foo(i int primary key);
>   create server server1
> foreign data wrapper postgres_fdw
> options (host '/tmp', port '5432', dbname 'postgres');
>   create user mapping for u1 server server1
> options (user 'u1');
>   select pg_conninfo_from_server('server1','u1',true);
>   create subscription sub1 server server1 publication pub1;
>
> I don't think we need to add an end-to-end test for each FDW, because
> it's just using the assembled connection string. To see if it's
> assembling the connection string properly, we can unit test with
> pg_conninfo_from_server().

Thanks for the steps.

I don't think we need to add a test for every FDW. E.g. adding a test
in file_fdw would be pointless. But postgres_fdw is special. The test
could further create a foreign table ftab_foo on subscriber
referencing foo on publisher and then compare the data from foo and
ftab_foo to make sure that the replication is happening, as sketched
below. This will serve as a good starting point for setting up
replicated tables in a sharded cluster.
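
A sketch of that extra check, reusing the names from Jeff's example above
(table_name is the standard postgres_fdw option):

  -- on subscriber
  create foreign table ftab_foo (i int)
    server server1 options (table_name 'foo');
  -- once the subscription has caught up, both of these should be empty:
  select * from ftab_foo except select * from foo;
  select * from foo except select * from ftab_foo;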

-- 
Best Wishes,
Ashutosh Bapat




Re: Implement missing join selectivity estimation for range types

2024-01-05 Thread vignesh C
On Tue, 21 Nov 2023 at 01:47, Schoemans Maxime  wrote:
>
> On 14/11/2023 20:46, Tom Lane wrote:
> > I took a brief look through this very interesting work.  I concur
> > with Tomas that it feels a little odd that range join selectivity
> > would become smarter than scalar inequality join selectivity, and
> > that we really ought to prioritize applying these methods to that
> > case.  Still, that's a poor reason to not take the patch.
>
> Indeed, we started with ranges as this was the simpler case (no MCV) and
> was the topic of a course project.
> The idea is to later write a second patch that applies these ideas to
> scalar inequality while also handling MCV's correctly.
>
> > I also agree with the upthread criticism that having two identical
> > functions in different source files will be a maintenance nightmare.
> > Don't do it.  When and if there's a reason for the behavior to
> > diverge between the range and multirange cases, it'd likely be
> > better to handle that by passing in a flag to say what to do.
>
> The duplication is indeed not ideal. However, there are already 8 other
> duplicate functions between the two files.
> I would thus suggest to leave the duplication in this patch and create a
> second one that removes all duplication from the two files, instead of
> just removing the duplication for our new function.
> What are your thoughts on this? If we do this, should the function
> definitions go in rangetypes.h or should we create a new
> rangetypes_selfuncs.h header?
>
> > But my real unhappiness with the patch as-submitted is the test cases,
> > which require rowcount estimates to be reproduced exactly.
>
> > We need a more forgiving test method. Usually the
> > approach is to set up a test case where the improved accuracy of
> > the estimate changes the planner's choice of plan compared to what
> > you got before, since that will normally not be too prone to change
> > from variations of a percent or two in the estimates.
>
> I have changed the test method to produce query plans for a 3-way range
> join.
> The plans for the different operators differ due to the computed
> selectivity estimation, which was not the case before this patch.

One of the tests was aborted at [1], kindly post an updated patch for the same:
[04:55:42.797] src/tools/ci/cores_backtrace.sh linux /tmp/cores
[04:56:03.640] dumping /tmp/cores/postgres-6-24094.core for
/tmp/cirrus-ci-build/tmp_install/usr/local/pgsql/bin/postgres

[04:57:24.199] Core was generated by `postgres: old_node: postgres
regression [local] EXPLAIN '.
[04:57:24.199] Program terminated with signal SIGABRT, Aborted.
[04:57:24.199] #0 __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:50
[04:57:24.199] Download failed: Invalid argument. Continuing without
source file ./signal/../sysdeps/unix/sysv/linux/raise.c.
[04:57:26.803]
[04:57:26.803] Thread 1 (Thread 0x7f121d9ec380 (LWP 24094)):
[04:57:26.803] #0 __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:50
[04:57:26.803] set = {__val = {4194304, 0, 4636737291354636288,
4636737291354636288, 0, 0, 64, 64, 128, 128, 192, 192, 256, 256, 0,
0}}
[04:57:26.803] pid = <optimized out>
[04:57:26.803] tid = <optimized out>
[04:57:26.803] ret = <optimized out>
[04:57:26.803] #1 0x7f122003d537 in __GI_abort () at abort.c:79
...
...
[04:57:38.774] #6 0x7f357ad95788 in __asan::__asan_report_load1
(addr=addr@entry=107477261711120) at
../../../../src/libsanitizer/asan/asan_rtl.cpp:117
[04:57:38.774] bp = 140731433585840
[04:57:38.774] pc = <optimized out>
[04:57:38.774] local_stack = 139867680821632
[04:57:38.774] sp = 140731433585832
[04:57:38.774] #7 0x55d5b155c37c in range_cmp_bound_values
(typcache=typcache@entry=0x62930b60, b1=b1@entry=0x61c17708,
b2=b2@entry=0x61c188b8) at rangetypes.c:2090
[04:57:38.774] No locals.
[04:57:38.774] #8 0x55d5b1567bb2 in calc_hist_join_selectivity
(typcache=typcache@entry=0x62930b60,
hist1=hist1@entry=0x61c188b8, nhist1=nhist1@entry=101,
hist2=hist2@entry=0x61c170b8, nhist2=nhist2@entry=101) at
rangetypes_selfuncs.c:1298
[04:57:38.774] i = 0
[04:57:38.774] j = 101
[04:57:38.774] selectivity = <optimized out>
[04:57:38.774] cur_sel1 = <optimized out>
[04:57:38.774] cur_sel2 = <optimized out>
[04:57:38.774] prev_sel1 = <optimized out>
[04:57:38.774] prev_sel2 = <optimized out>
[04:57:38.774] cur_sync = {val = <optimized out>, infinite =
<optimized out>, inclusive = <optimized out>, lower = <optimized out>}
[04:57:38.774] #9 0x55d5b1569190 in rangejoinsel
(fcinfo=) at rangetypes_selfuncs.c:1495

[1] - https://cirrus-ci.com/task/5507789477380096

Regards,
Vignesh




Re: Make mesage at end-of-recovery less scary.

2024-01-05 Thread vignesh C
On Wed, 22 Nov 2023 at 13:01, Kyotaro Horiguchi  wrote:
>
> Anyway, this requires rebasing, and done.

A few tests are failing at [1]; kindly post an updated patch:
/tmp/cirrus-ci-build/src/test/recovery --testgroup recovery --testname
039_end_of_wal -- /usr/local/bin/perl -I
/tmp/cirrus-ci-build/src/test/perl -I
/tmp/cirrus-ci-build/src/test/recovery
/tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl
[23:53:10.370] ― ✀
―
[23:53:10.370] stderr:
[23:53:10.370] # Failed test 'xl_tot_len zero'
[23:53:10.370] # at
/tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl line 267.
[23:53:10.370] # Failed test 'xlp_magic zero'
[23:53:10.370] # at
/tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl line 340.
[23:53:10.370] # Failed test 'xlp_magic zero (split record header)'
[23:53:10.370] # at
/tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl line 445.
[23:53:10.370] # Looks like you failed 3 tests of 14.
[23:53:10.370]
[23:53:10.370] (test program exited with status code 3)
[23:53:10.370] 
――

[1] - https://cirrus-ci.com/task/5859293157654528

Regards,
Vignesh


RE: speed up a logical replica setup

2024-01-05 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

I love your proposal, so I want to join the review. Here are my first comments.

01.
Should we restrict `--subscriber-conninfo` so that it must not have a hostname or IP?
We want users to execute pg_subscriber on the target, right?

02.
When the application was executed, a lot of output filled my screen. Some of it
was from pg_subscriber, and some was server log. Can we record them in a
separate file? I am imagining something like what pg_upgrade does.

03.
A replication command is used when replication slots are created. Is there a
reason to use it? I think we do not have to use the logical replication
walsender mode; we can use an SQL function instead.
pg_create_logical_replication_slot() also outputs the LSN; isn't that
sufficient?
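
For reference, the SQL function already returns both pieces of
information; the output values below are illustrative:

SELECT * FROM pg_create_logical_replication_slot('sub1_slot', 'pgoutput');
 slot_name |    lsn
-----------+-----------
 sub1_slot | 0/1570278
(1 row)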

04.
As you know, there are several options for publications, subscriptions, and
replication slots. Do you have a good way to specify them in mind?

05.
I found that the connection string for each subscription has a setting
"fallback_application_name=pg_subscriber". Can we remove it?

```
postgres=# SELECT subconninfo FROM pg_subscription;
   subconninfo  
 
-
 user=postgres port=5431 fallback_application_name=pg_subscriber dbname=postgres
(1 row)
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: doing also VM cache snapshot and restore with pg_prewarm, having more information of the VM inside PostgreSQL

2024-01-05 Thread Cédric Villemain

On 04/01/2024 at 23:41, Jim Nasby wrote:

On 1/3/24 5:57 PM, Cedric Villemain wrote:


For 15 years pgfincore has been sitting quietly, being used in
large setups to help with HA and resource management.
It can perfectly well stay as is; to be honest, I was expecting to one day
add Windows support and propose that to PostgreSQL, but it appears that
support on Linux and BSD is more than enough today.


So I wonder if there is interest in having virtual memory snapshot
and restore operations with, for example, pg_prewarm/autowarm?


IMHO, to improve the user experience here we'd need something that 
combined the abilities of all these extensions into a cohesive interface 
that allowed users to simply say "please get this data into cache". 
Simply moving pgfincore into core Postgres wouldn't satisfy that need.


This is exactly why I proposed those additions to pg_prewarm and autowarm.

So I think the real question is whether the community feels support for 
better cache (buffercache + filesystem) management is a worthwhile 
feature to add to Postgres.


Agreed, though more probably to be added in an extension, and only the
"filesystem" part, as the buffercache side is already done.


Micromanaging cache contents for periodic jobs seems almost like a 
mis-feature. While it's a useful tool to have in the toolbox, it's also 
a non-trivial layer of complexity. IMHO not something we'd want to add. 
Though, there might be smaller items that would make creating tools to 
do that easier, such as some ability to see what blocks a backend is 
accessing (perhaps via a hook).


From my point of view it's not so complex, but this is subjective and I 
won't argue in this area.


I confirm that having some way to get feedback on current position/block
activity (and distance/context: heap scan, index build, analyze, ...)
would be super useful to allow external management. Maybe the "progress"
facilities can be used for that. That may be worth looking at in another
proposal than the current one.


To be clear, I am not proposing that PostgreSQL handle those tasks
transparently or by itself, but that it offer options to users via
extensions, like we do with pg_prewarm and pg_buffercache.

It's just the same for virtual memory/filesystem.

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D





Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-05 Thread Dilip Kumar
On Tue, Dec 12, 2023 at 4:15 PM Michael Paquier  wrote:
>
> On Tue, Dec 12, 2023 at 10:27:09AM +0530, Dilip Kumar wrote:
> > Oops, I only included the code changes where I am adding injection
> > points and some comments to verify that, but missed the actual test
> > file. Attaching it here.
>
> I see.  Interesting that this requires persistent connections to work.
> That's something I've found clunky to rely on when the scenarios a
> test needs to deal with are rather complex.  That's an area that could
be made easier to use outside of this patch.  Something was proposed
> by Andrew Dunstan to make the libpq routines usable through a perl
> module, for example.
>
> > Note:  I think the latest patches are conflicting with the head, can you 
> > rebase?
>
> Indeed, as per the recent manipulations in ipci.c for the shmem
> initialization areas.  Here goes a v6.

Some comments on 0001, mostly cosmetic

1.
+/* utilities to handle the local array cache */
+static void
+injection_point_cache_add(const char *name,
+   InjectionPointCallback callback)

I think the comment for this function should be more specific about
adding an entry to the local injection point cache.  And add
comments for the other functions as well, e.g. injection_point_cache_get.


2.
+typedef struct InjectionPointEntry
+{
+ char name[INJ_NAME_MAXLEN]; /* hash key */
+ char library[INJ_LIB_MAXLEN]; /* library */
+ char function[INJ_FUNC_MAXLEN]; /* function */
+} InjectionPointEntry;

Some comments would be good for the structure

3.

+static bool
+file_exists(const char *name)
+{
+ struct stat st;
+
+ Assert(name != NULL);
+ if (stat(name, &st) == 0)
+ return !S_ISDIR(st.st_mode);
+ else if (!(errno == ENOENT || errno == ENOTDIR))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not access file \"%s\": %m", name)));
+ return false;
+}

dfmgr.c has a similar function so can't we reuse it by making that
function external?

4.
+ if (found)
+ {
+ LWLockRelease(InjectionPointLock);
+ elog(ERROR, "injection point \"%s\" already defined", name);
+ }
+
...
+#else
+ elog(ERROR, "Injection points are not supported by this build");

Better to use similar formatting for the error output, "Injection" vs.
"injection" (better not to capitalize the first letter, from a
consistency POV).

5.
+ * Check first the shared hash table, and adapt the local cache
+ * depending on that as it could be possible that an entry to run
+ * has been removed.
+ */

What if the entry is removed after we have released the
InjectionPointLock? Or would this not cause any harm?


0004:

I think
test_injection_points_wake() and test_injection_wait() can be moved as
part of 0002



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-05 Thread jian he
On Fri, Jan 5, 2024 at 12:05 AM vignesh C  wrote:
>
> On Thu, 28 Dec 2023 at 09:27, jian he  wrote:
> >
> > On Wed, Dec 20, 2023 at 8:27 PM Masahiko Sawada  
> > wrote:
> > >
> > >
> > > Why do we need to use SPI? I think we can form heap tuples and insert
> > > them to the error table. Creating the error table also doesn't need to
> > > use SPI.
> > >
> > Thanks for pointing it out. I figured out how to form heap tuples and
> > insert them into the error table.
> > But I don't know how to create the error table without using SPI.
> > Please point it out.
> >
> > > >
> > > > copy_errors one per schema.
> > > > foo.copy_errors will be owned by the schema: foo owner.
> > >
> > > It seems that the error table is created when the SAVE_ERROR is used
> > > for the first time. It probably blocks concurrent COPY FROM commands
> > > with SAVE_ERROR option to different tables if the error table is not
> > > created yet.
> > >
> > I don't know how to solve this problem. Maybe we can document this,
> > but it will block the COPY FROM immediately.
> >
> > > >
> > > > if you can insert to a table in that specific schema let's say foo,
> > > > then you will get privilege to INSERT/DELETE/SELECT
> > > > to foo.copy_errors.
> > > > If you are not a superuser, you are only allowed to do
> > > > INSERT/DELETE/SELECT on foo.copy_errors rows where USERID =
> > > > current_user::regrole::oid.
> > > > This is done via row level security.
> > >
> > > I don't think it works. If the user is dropped, the user's oid could
> > > be reused for a different user.
> > >
> >
> > You are right.
> > so I changed, now the schema owner will be the error table owner.
> > every error table tuple inserts,
> > I switch to schema owner, do the insert, then switch back to the
> > COPY_FROM operation user.
> > now everyone (except superuser) will need explicit grant to access the
> > error table.
>
> There are some compilation issues reported at [1] for the patch:
> [04:04:26.288] copyfromparse.c: In function ‘NextCopyFrom’:
> [04:04:26.288] copyfromparse.c:1126:25: error: ‘copy_errors_tupDesc’
> may be used uninitialized in this function
> [-Werror=maybe-uninitialized]
> [04:04:26.288] 1126 | copy_errors_tup = heap_form_tuple(copy_errors_tupDesc,
> [04:04:26.288] | ^~~~
> [04:04:26.288] 1127 | t_values,
> [04:04:26.288] | ~
> [04:04:26.288] 1128 | t_isnull);
> [04:04:26.288] | ~
> [04:04:26.288] copyfromparse.c:1160:4: error: ‘copy_errorsrel’ may be
> used uninitialized in this function [-Werror=maybe-uninitialized]
> [04:04:26.288] 1160 | table_close(copy_errorsrel, RowExclusiveLock);
> [04:04:26.288] | ^
>
> [1] - https://cirrus-ci.com/task/4785221183209472
>

I fixed this issue and also improved the docs.
The rest of the implementation has not changed.
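
For readers following along, a usage sketch under this patch version; the
option spelling follows this patch's gram.y additions and may still change:

COPY t FROM '/tmp/t.csv' WITH (format csv, save_error true);
-- rows that failed to parse are recorded instead of aborting the COPY:
SELECT * FROM copy_errors;
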
From 99ebe03caa9d50b2cd3cdcd05becccd4b61684e1 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Fri, 5 Jan 2024 16:29:38 +0800
Subject: [PATCH v14 1/1] Make COPY FROM more error tolerant

At present, when processing the source file, COPY FROM may encounter three types of errors:
* extra data after the last expected column
* missing data for column \"%s\"
* data type conversion error

Instead of throwing errors while copying, the save_error (boolean) option will
save errors to the table copy_errors for all COPY FROM operations that happen in the same schema.

 We check the existence of the table copy_errors,
 and we also check its data definition by comparing column names and data types.
 If copy_errors already exists and meets the criteria, the error metadata will be saved to it.
 If copy_errors does not exist, it is created.
 If copy_errors exists but cannot be used for saving errors, an error is raised.

 The table copy_errors is per-schema; it is owned by the owner of the
 COPY FROM operation's destination schema.
 The table owner has full privileges on copy_errors;
 other non-superusers need to be granted privileges to access it.
---
 doc/src/sgml/ref/copy.sgml   | 120 -
 src/backend/commands/copy.c  |  12 ++
 src/backend/commands/copyfrom.c  | 133 +-
 src/backend/commands/copyfromparse.c | 217 +--
 src/backend/parser/gram.y|   8 +-
 src/bin/psql/tab-complete.c  |   3 +-
 src/include/commands/copy.h  |   1 +
 src/include/commands/copyfrom_internal.h |   6 +
 src/include/parser/kwlist.h  |   1 +
 src/test/regress/expected/copy2.out  | 137 ++
 src/test/regress/sql/copy2.sql   | 123 +
 11 files changed, 745 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 18ecc69c..f6cdf0cf 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -44,6 +44,7 @@ COPY { table_name [ ( column_name [, ...] ) | * }
 FORCE_NULL { ( column_name [, ...] ) | * }
 ENCODING 'encoding_name'
+

Re: Synchronizing slots from primary to standby

2024-01-05 Thread Bertrand Drouvot
Hi,

On Fri, Jan 05, 2024 at 10:00:53AM +0530, Amit Kapila wrote:
> On Fri, Jan 5, 2024 at 8:59 AM shveta malik  wrote:
> >
> > On Thu, Jan 4, 2024 at 7:24 PM Bertrand Drouvot
> >  wrote:
> > >
> > > 4 ===
> > >
> > > Looking closer, the only place where walrcv_connect() is called with 
> > > replication
> > > set to false and logical set to false is in ReplSlotSyncWorkerMain().
> > >
> > > That does make sense, but what do you think about creating dedicated 
> > > libpqslotsyncwrkr_connect
> > > and slotsyncwrkr_connect (instead of using the libpqrcv_connect / 
> > > walrcv_connect ones)?
> > >
> > > That way we could make use of slotsyncwrkr_connect() in 
> > > ReplSlotSyncWorkerMain()
> > > as I think it's confusing to use "rcv" functions while the process using 
> > > them is
> > > not of backend type walreceiver.
> > >
> > > I'm not sure that worth the extra complexity though, what do you think?
> >
> > I gave it a thought earlier, but then I was not sure even if I create
> > a new function w/o "rcv" in it then where should it be placed as the
> > existing file name itself is libpq'walreceiver'.c. Shall we be
> > creating a new file then? But it does not seem good to create a new
> > setup (new file, function pointers & other stuff) around 1 function.

Yeah...

> > And thus reusing the same function with 'replication' (new arg) felt
> > like a better choice than other options. If in future, there is any
> > other module trying to do the same, then it can use current
> > walrcv_connect() with rep=false. If I make it specific to slot-sync
> > worker, then it will not be reusable by other modules (if needed).

Yeah good point, it would need to be more generic.

> I agree that the benefit of creating a new API is not very clear.

Yeah, that would be more for cosmetic purposes (and to avoid using a WalReceiverConn
while a PGconn could/should suffice).

> How
> about adjusting the description in the file header of
> libpqwalreceiver.c. 

Agreed, that seems to be a better option (not sure that building the new API is
worth the extra work).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [17] CREATE SUBSCRIPTION ... SERVER

2024-01-05 Thread Jeff Davis
On Fri, 2024-01-05 at 12:49 +0530, Ashutosh Bapat wrote:
> Can you please provide an example using postgres_fdw to create a
> subscription using this patch. I think we should document it in
> postgres_fdw and add a test for the same.

There's a basic test for postgres_fdw in patch 0003, just testing the
syntax and validation.

A manual end-to-end test is pretty straightforward:

  -- on publisher
  create table foo(i int primary key);
  create publication pub1 for table foo;
  insert into foo values(42);

  -- on subscriber
  create extension postgres_fdw;
  create table foo(i int primary key);
  create server server1
foreign data wrapper postgres_fdw
options (host '/tmp', port '5432', dbname 'postgres');
  create user mapping for u1 server server1
options (user 'u1');
  select pg_conninfo_from_server('server1','u1',true);
  create subscription sub1 server server1 publication pub1;

I don't think we need to add an end-to-end test for each FDW, because
it's just using the assembled connection string. To see if it's
assembling the connection string properly, we can unit test with
pg_conninfo_from_server().

Regards,
Jeff Davis