Re: how does postgresql handle LOB/CLOB/BLOB column data that dies before the query ends

2023-02-24 Thread Tom Lane
Noel Grandin  writes:
> On Fri, 24 Feb 2023 at 17:39, Tom Lane  wrote:
>> Postgres doesn't really do LOB in the same sense that some other DBs
>> have, so you'd need to specify what you have in mind in Postgres
>> terms to get a useful answer.

> So, specifically, the primary problem we have is this:

> (1) A typical small query returns all of its data in a stream to the client
> (2) which means that, from the server's perspective, the transaction is
> closed the moment the last record in the stream is pushed to the client.
> (3) which means that, in the face of concurrent updates, the underlying
> MVCC data in the query might be long-gone from the server by the time the
> client has finished reading the result set.
> (4) However, with LOBs, the client doesn't get the LOB in the result set
> data stream, it gets a special identifier (a hash), which it uses to fetch
> LOB data from the server in chunks
> (5) Which means that the lifetime of an individual LOB is just horrible
> At the moment the implementation I have satisfies the needs of clients in
> terms of correctness (crosses fingers), but is horrible in terms of
> performance because of how long it has to keep LOB data around.

Yeah, Postgres has an analogous kind of problem.  Our standard way to
use "large objects" is to store their identifying OIDs in tables,
fetch the desired OID with a regular SQL query, and then open and read
(or write) the large object using its OID.  So you have a hazard of
time skew between what you saw in the table and what you see in the
large object.  We pretty much lay that problem off on the clients: if
they want consistency of those views they need to make sure that the
same snapshot is used for both the SQL query and the large-object
read.  That's not hard to do, but it isn't the default behavior,
and in particular they can *not* close the transaction that read the
OID if they'd like to read a matching state of the large object.
So far there's not been a lot of complaints about that ...
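
For illustration, one way a libpq client can do that is to keep a single
REPEATABLE READ transaction open across both the SQL lookup and the
large-object read.  This is only a sketch; the "docs" table, its columns,
and the empty conninfo string are made-up names for the example:

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>
#include <libpq/libpq-fs.h>		/* for INV_READ */

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");	/* connection info from environment */
	PGresult   *res;
	Oid			lobj;
	int			fd;
	char		buf[8192];
	int			nread;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/* one snapshot for both the row lookup and the large-object read */
	PQclear(PQexec(conn, "BEGIN ISOLATION LEVEL REPEATABLE READ"));

	res = PQexec(conn, "SELECT doc_oid FROM docs WHERE id = 1");
	if (PQresultStatus(res) != PGRES_TUPLES_OK || PQntuples(res) != 1)
	{
		fprintf(stderr, "lookup failed: %s", PQerrorMessage(conn));
		return 1;
	}
	lobj = (Oid) strtoul(PQgetvalue(res, 0, 0), NULL, 10);
	PQclear(res);

	/* still inside the same transaction, so we read a matching LO state */
	fd = lo_open(conn, lobj, INV_READ);
	if (fd < 0)
	{
		fprintf(stderr, "lo_open failed: %s", PQerrorMessage(conn));
		return 1;
	}
	while ((nread = lo_read(conn, fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, nread, stdout);
	lo_close(conn, fd);

	PQclear(PQexec(conn, "COMMIT"));
	PQfinish(conn);
	return 0;
}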

regards, tom lane




Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations

2023-02-24 Thread John Naylor
On Tue, Feb 21, 2023 at 2:46 AM Andres Freund  wrote:

> On 2023-02-21 08:33:22 +1300, David Rowley wrote:
> > I am interested in a bump allocator for tuplesort.c. There it would be
> > used in isolation and all the code which would touch pointers
> > allocated by the bump allocator would be self-contained to the
> > tuplesorting code.
> >
> > What use case do you have in mind?
>
> E.g. the whole executor state tree (and likely also the plan tree) should be
> allocated that way. They're never individually freed. But we also allocate
> other things in the same context, and those do need to be individually
> freeable. We could use a separate memory context, but that'd increase memory
> usage in many cases, because there'd be two different blocks being allocated
> from at the same time.

That reminds me of this thread I recently stumbled across about memory
management of prepared statements:

https://www.postgresql.org/message-id/20190726004124.prcb55bp43537vyw%40alap3.anarazel.de

I recently heard of a technique for relative pointers that could enable
tree structures within a single allocation.

If "a" needs to store the location of "b" relative to "a", it would be
calculated like

a = (char *) &b - (char *) &a

...then to find b again, do

typeof_b *b_ptr;
b_ptr = (typeof_b *) ((char *) &a + a);

One issue with this naive sketch is that zero would point to one's self,
and it would be better if zero still meant "invalid pointer" so that
memset(0) does the right thing.

Using signed byte-sized offsets as an example, the range is -128 to 127, so
we can call -128 the invalid pointer, or in binary 0b1000_0000.

To interpret a raw zero as invalid, we need an encoding, and here we can
just XOR it:

#define Encode(a) ((a) ^ 0b1000_0000)
#define Decode(a) ((a) ^ 0b1000_0000)

Then, encode(-128) == 0 and decode(0) == -128, and memset(0) will do the
right thing and that value will be decoded as invalid.

Conversely, this preserves the ability to point to self, if needed:

encode(0) == -128 and decode(-128) == 0

...so we can store any relative offset in the range -127..127, as well as
"invalid offset". This extends to larger signed integer types in the
obvious way.

Putting the above two calculations together, the math ends up like this,
which can be put into macros:

absolute to relative:
a = Encode((int32) ((char *) &b - (char *) &a));

relative to absolute:
typeof_b *b_ptr;
b_ptr = (typeof_b *) ((char *) &a + Decode(a));
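
To make that concrete, here is a minimal, self-contained sketch (the RelPtr*
names, the Node struct, and the int32 offset width are mine, purely for
illustration):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* XOR with the sign bit so a memset(0) field decodes to the invalid offset */
#define RelPtrEncode(d)	((int32_t) ((uint32_t) (d) ^ UINT32_C(0x80000000)))
#define RelPtrDecode(d)	((int32_t) ((uint32_t) (d) ^ UINT32_C(0x80000000)))

/* store target's address relative to the field holding the offset */
#define RelPtrSet(field, target) \
	((field) = RelPtrEncode((int32_t) ((char *) (target) - (char *) &(field))))

/* recover an absolute pointer, or NULL if the field is still all-zeroes */
#define RelPtrGet(type, field) \
	((field) == 0 ? (type *) NULL : \
	 (type *) ((char *) &(field) + RelPtrDecode(field)))

typedef struct Node
{
	int32_t		child;		/* encoded relative pointer; 0 means "not set" */
	int			value;
} Node;

int
main(void)
{
	/* two nodes living inside one allocation */
	char	   *block = malloc(2 * sizeof(Node));
	Node	   *a = (Node *) block;
	Node	   *b = (Node *) (block + sizeof(Node));

	memset(block, 0, 2 * sizeof(Node));	/* all relative pointers start invalid */
	a->value = 1;
	b->value = 2;

	printf("unset child: %p\n", (void *) RelPtrGet(Node, a->child));	/* NULL */

	RelPtrSet(a->child, b);
	printf("child value: %d\n", RelPtrGet(Node, a->child)->value);	/* prints 2 */

	free(block);
	return 0;
}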

I'm not yet familiar enough with parse/plan/execute trees to know if this
would work or not, but that might be a good thing to look into next cycle.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: how does postgresql handle LOB/CLOB/BLOB column data that dies before the query ends

2023-02-24 Thread Noel Grandin
Thanks for the answers.

So, H2, like PostgreSQL, also internally has (a) an MVCC engine and (b)
LOBs existing as an on-the-side extra thing.

On Fri, 24 Feb 2023 at 17:39, Tom Lane  wrote:

> Postgres doesn't really do LOB in the same sense that some other DBs
> have, so you'd need to specify what you have in mind in Postgres
> terms to get a useful answer.
>

So, specifically, the primary problem we have is this:

(1) A typical small query returns all of its data in a stream to the client
(2) which means that, from the server's perspective, the transaction is
closed the moment the last record in the stream is pushed to the client.
(3) which means that, in the face of concurrent updates, the underlying
MVCC data in the query might be long-gone from the server by the time the
client has finished reading the result set.
(4) However, with LOBs, the client doesn't get the LOB in the result set
data stream, it gets a special identifier (a hash), which it uses to fetch
LOB data from the server in chunks
(5) Which means that the lifetime of an individual LOB is just horrible
At the moment the implementation I have satisfies the needs of clients in
terms of correctness (crosses fingers), but is horrible in terms of
performance because of how long it has to keep LOB data around.


Re: Doc update for pg_stat_statements normalization

2023-02-24 Thread Michael Paquier
On Fri, Feb 24, 2023 at 08:54:00PM +, Imseih (AWS), Sami wrote:
> I think the only thing to do here is to call this out in docs with a
> suggestion to increase pg_stat_statements.max to reduce the
> likelihood. I also attached the suggested  doc enhancement as well.

Improving the docs about that sounds like a good idea.  This would be
less surprising for users, if we had some details about that.

> Any thoughts?

The risk of deallocation of an entry between the post-analyze hook and
the planner/utility hook represented with two calls of pgss_store()
means that this will never be able to work correctly as long as we
don't know the shape of the normalized query in the second code path
(planner, utility execution) updating the entries with the call
information, etc.  And we only want to pay the cost of normalization
once, after the post-analyze where we do the query jumbling.

Could things be done in a more stable way?  For example, imagine that
we have an extra Query field called void *private_data that extensions
can use to store custom data associated to a query ID, then we could
do something like that:
- In the post-analyze hook, check if an entry with the query ID
calculated exists.
-- If the entry exists, grab a copy of the existing query string,
which may be normalized or not, and save it into Query->private_data.
-- If the entry does not exist, normalize the query, store it in
Query->private_data but do not yet create an entry in the hash table.
- In the planner/utility hook, fetch the normalized query from
private_data, then use it if an entry needs to be created in the hash
table.  The entry may have been deallocated since the post-analyze
hook, in which case it is re-created with the normalized copy saved in
the first phase.
--
Michael




Re: pg_upgrade and logical replication

2023-02-24 Thread Amit Kapila
On Wed, Feb 22, 2023 at 12:13 PM Julien Rouhaud  wrote:
>
> On Mon, Feb 20, 2023 at 03:07:37PM +0800, Julien Rouhaud wrote:
> > On Mon, Feb 20, 2023 at 11:07:42AM +0530, Amit Kapila wrote:
> > >
> > > I think the current mechanism tries to provide more flexibility to the
> > > users. OTOH, in some of the cases where users don't want to change
> > > anything in the logical replication (both upstream and downstream
> > > function as it is) after the upgrade then they need to do more work. I
> > > think ideally there should be some option in pg_dump that allows us to
> > > dump the contents of pg_subscription_rel as well, so that is easier
> > > for users to continue replication after the upgrade. We can then use
> > > it for binary-upgrade mode as well.
> >
> > Is there really a use case for dumping the content of pg_subscription_rel
> > outside of pg_upgrade?

I think the users who want to take a dump and restore the entire
cluster may need it there for the same reason as pg_upgrade needs it.
TBH, I have not seen such a request but this is what I imagine one
would expect if we provide this functionality via pg_upgrade.

> >  I'm not particularly worried about the publisher going
> > away or changing while pg_upgrade is running , but for a normal pg_dump /
> > pg_restore I don't really see how anyone would actually want to resume 
> > logical
> > replication from a pg_dump, especially since it's almost guaranteed that the
> > node will already have consumed data from the publication that won't be in 
> > the
> > dump in the first place.
> >
> > Are you ok with the suggested syntax above (probably with extra parens to 
> > avoid
> > adding new keywords), or do you have some better suggestion?  I'm a bit 
> > worried
> > about adding some O(n) commands, as it can add some noticeable slow-down for
> > pg_upgrade-ing logical replica, but I don't really see how to avoid that.  
> > Note
> > that if we make this option available to end-users, we will have to use the
> > relation name rather than its oid, which will make this option even more
> > expensive when restoring due to the extra lookups.
> >
> > For the pg_upgrade use-case, do you see any reason to not restore the
> > pg_subscription_rel by default?

As I said earlier, one can very well say that giving more flexibility
(in terms of where the publications will be after the restore) is a
better idea. Also, we have been doing it this way till now without any
major complaints, so it makes sense to keep the current behavior as the
default.

> >  Maybe having an option to not restore it would
> > make sense if it indeed add noticeable overhead when publications have a 
> > lot of
> > tables?

Yeah, that could be another reason to not do it by default.

>
> Since I didn't hear any objection I worked on a POC patch with this approach.
>
> For now when pg_dump is invoked with --binary, it will always emit extra
> commands to restore the relation list.  This command is only allowed when the
> server is started in binary upgrade mode.
>
> The new command is of the form
>
> ALTER SUBSCRIPTION name ADD TABLE (relid = X, state = 'Y', lsn = 'Z/Z')
>
> with the lsn part being optional.
>

BTW, do we restore the origin and its LSN after the upgrade? Because
without that this won't be sufficient as that is required for apply
worker to ensure that it is in sync with table sync workers.

-- 
With Regards,
Amit Kapila.




Re: Documentation for building with meson

2023-02-24 Thread samay sharma
Hi,

On Thu, Dec 1, 2022 at 9:21 AM Andres Freund  wrote:

> Hi,
>
> On 2022-12-01 15:58:39 +0100, Peter Eisentraut wrote:
> > On 23.11.22 22:24, samay sharma wrote:
> > > Thank you. Attaching v7 addressing most of the points below.
> >
> > I have committed this, after some editing and making some structural
> > changes.
>
> Thanks. I was working on that too, but somehow felt a bit stuck...
>
> I'll try if I can adapt my pending changes.
>

I got back to working on the meson docs. I'm attaching a new patch set
proposing some improvements to the current installation docs. I've tried to
make some corrections and improvements suggested in this thread while
trying to maintain consistency across make and meson docs as per Peter's
ask. There are 5 patches in the patch-set:

v8-0001: Makes minor corrections, adds instructions to build docs and adds
a few links to meson docs.
v8-0002, v8-0003 and v8-0004 make changes to restructure the configure
options based on reasoning discussed below. To maintain consistency, I've
made those changes on both the make and meson side.
v8-0005 Reproposes the Short Version I had proposed in v7 as I feel we
should discuss that proposal. I think it improves things in terms of
installation docs. More details below.


>
> > I moved the "Requirements" section back to the top level.  It did
> > not look appealing to have to maintain two copies of this that have almost
> > no substantial difference (but for some reason were written with separate
> > structure and wording).
>

There are a few reasons why I had done this. Some reasons Andres has
described in his previous email and I'll add a few specific examples on why
having the same section for both might not be a good idea.

* Having readline and zlib as required for building PostgreSQL is now wrong
because they are not required for meson builds. Also, the name of the
configs are different for make and meson and the current section only lists
the make ones.
* There are many references to configure in that section which don't apply
to meson.
* Last I checked Flex and Bison were always required to build via meson but
not for make and the current section doesn't explain those differences.

I spent a good amount of time thinking if we could have a single section,
clarify these differences to make it correct and not confuse the users. I
couldn't find a way to do all three. Therefore, I think we should move to a
different requirements section for both. I'm happy to re-propose the
previous version which separates them but wanted to see if anybody has
better ideas.


>
> I don't think this is good. The whole "The following software packages are
> required for building PostgreSQL" section is wrong now.  "They are not
> required in the default configuration, but they are needed when certain
> build
> options are enabled, as explained below:" section is misleading as well.
>
> By the time we fix all of those we'll end up with a different section
> again.
>
>
> > Also, I rearranged the Building with Meson section to use the same
> internal
> > structure as the Building with Autoconf and Make section.  This will
> make it
> > easier to maintain going forward.  For example if someone adds a new
> option,
> > it will be easier to find the corresponding places in the lists where to
> add
> > them.


> I don't know. The existing list order makes very little sense to me. The
> E.g. --enable-nls is before the rest in configure, presumably because it
> sorts
> there alphabetically. But that's not the case for meson.
>
> Copying "Anti-Features" as a structure element to the meson docs seems
> bogus
> (also the name is bogus, but that's a pre-existing issue). There's no
> difference in -Dreadline= to the other options meson-options-features list.


> Nor does -Dspinlocks= -Datomics= make sense in the "anti features"
> section. It
> made some sense for autoconf because of the --without- prefix, but that's
> not
> at play in meson.  Their placement in the "Developer Options" made a whole
> lot
> more sense.
>

I agree "Anti-Features" doesn't make sense in the meson context. One of my
patches removes that section and moves some options into the "Postgres
Features" section and others into the "Developer Options" section. I've
proposed to make those changes on both sides to make it easier to maintain.


>
> I don't like "Miscellaneous" bit containing minor stuff like krb_srvnam and
> data layout changing options like blocksize,segsize,wal_blocksize. But it
> makes sense to change that for both at the same time.
>

I've proposed a patch to add a new "Data Layout Options" section which
includes: blocksize, segsize and wal_blocksize. I've created that section
on both sides.

>
>
> > We will likely keep iterating on the contents for the next little while,
> but
> > I'm glad we now have a structure in place that we should be able to live
> > with.
>

I feel that there are a few shortcomings of the current "Short Version". I
tried to address them in my previous 

Re: Add LZ4 compression in pg_dump

2023-02-24 Thread Justin Pryzby
I have some fixes (attached) and questions while polishing the patch for
zstd compression.  The fixes are small and could be integrated with the
patch for zstd, but could be applied independently.

- I'm unclear about get_error_func().  That's called in three places
  from pg_backup_directory.c, after failures from write_func(), to
  supply a compression-specific error message to pg_fatal().  But it's
  not being used outside of directory format, nor for errors for other
  function pointers, or even for all errors in write_func().  Is there
  some reason why each compression method's write_func() shouldn't call
  pg_fatal() directly, with its compression-specific message ?

- I still think supports_compression() should be renamed, or made into a
  static function in the necessary file.  The main reason is that it's
  more clear what it indicates - whether compression is "implemented by
  pgdump" and not whether compression is "supported by this postgres
  build".  It also seems possible that we'd want to add a function
  called something like supports_compression(), indicating whether the
  algorithm is supported by the current build.  It'd be better if pgdump
  didn't subjugate that name.

- Finally, the "Nothing to do in the default case" comment comes from
  Michael's commit 5e73a6048:

+   /*
+* Custom and directory formats are compressed by default with gzip when
+* available, not the others.
+*/
+   if ((archiveFormat == archCustom || archiveFormat == archDirectory) &&
+   !user_compression_defined)
{
 #ifdef HAVE_LIBZ
-   if (archiveFormat == archCustom || archiveFormat == 
archDirectory)
-   compressLevel = Z_DEFAULT_COMPRESSION;
-   else
+   parse_compress_specification(PG_COMPRESSION_GZIP, NULL,
+&compression_spec);
+#else
+   /* Nothing to do in the default case */
 #endif
-   compressLevel = 0;
}


As the comment says: for -Fc and -Fd, the compression is set to zlib, if
enabled, and when not otherwise specified by the user.

Before 5e73a6048, this set compressLevel=0 for -Fp and -Ft, *and* when
zlib was unavailable.

But I'm not sure why there's now an empty "#else".  I also don't know
what "the default case" refers to.

Maybe the best thing here is to move the preprocessor #if, since it's no
longer in the middle of a runtime conditional:

 #ifdef HAVE_LIBZ
+   if ((archiveFormat == archCustom || archiveFormat == archDirectory) &&
+   !user_compression_defined)
+   parse_compress_specification(PG_COMPRESSION_GZIP, NULL,
+&compression_spec);
 #endif

...but that elicits a warning about "variable set but not used"...

-- 
Justin
>From e31901414a8509317297972d1033c2e08629d903 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 24 Feb 2023 14:07:09 -0600
Subject: [PATCH] f!fixes for LZ4

---
 doc/src/sgml/ref/pg_dump.sgml| 4 ++--
 src/bin/pg_dump/compress_io.c| 2 +-
 src/bin/pg_dump/compress_io.h| 4 ++--
 src/bin/pg_dump/compress_lz4.c   | 4 ++--
 src/bin/pg_dump/pg_backup_archiver.c | 4 ++--
 src/bin/pg_dump/pg_dump.c| 2 --
 src/bin/pg_dump/t/002_pg_dump.pl | 6 +++---
 7 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 49d218905fb..6fbe49f7ede 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -331,7 +331,7 @@ PostgreSQL documentation
can read. A directory format archive can be manipulated with
standard Unix tools; for example, files in an uncompressed archive
can be compressed with the gzip or
-   lz4tool.
+   lz4 tools.
This format is compressed by default using gzip
and also supports parallel dumps.
   
@@ -655,7 +655,7 @@ PostgreSQL documentation

 Specify the compression method and/or the compression level to use.
 The compression method can be set to gzip or
-lz4 or none for no compression.
+lz4, or none for no compression.
 A compression detail string can optionally be specified.  If the
 detail string is an integer, it specifies the compression level.
 Otherwise, it should be a comma-separated list of items, each of the
diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c
index ce06f1eac9c..9239dbb2755 100644
--- a/src/bin/pg_dump/compress_io.c
+++ b/src/bin/pg_dump/compress_io.c
@@ -83,7 +83,7 @@
  * used by the caller in an error message.
  */
 char *
-supports_compression(const pg_compress_specification compression_spec)
+pgdump_supports_compression(const pg_compress_specification compression_spec)
 {
 	const pg_compress_algorithm	algorithm = compression_spec.algorithm;
 	bool		supported = 

Re: zstd compression for pg_dump

2023-02-24 Thread Justin Pryzby
On Sat, Feb 25, 2023 at 01:44:36PM +0900, Michael Paquier wrote:
> On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> > This is a draft patch - review is welcome and would help to get this
> > ready to be considererd for v16, if desired.
> > 
> > I'm going to add this thread to the old CF entry.
> > https://commitfest.postgresql.org/31/2888/
> 
> Patch 0003 adds support for the --long option of zstd, meaning that it
> "enables long distance matching with #windowLog".  What's the benefit
> of that when it is applied to dumps and base backup contents?

It (can) make it smaller.

+   The long keyword enables long-distance matching
+   mode, for improved compression ratio, at the expense of higher memory
+   use.  Long-distance mode is supported only for 

+With zstd compression, long mode may allow dumps
+to be significantly smaller, but it might not reduce the size of
+custom or directory format dumps, whose fields are separately compressed.

Note that I included that here as 003, but I also have a pre-existing
patch for adding that just to basebackup.

-- 
Justin




Re: zstd compression for pg_dump

2023-02-24 Thread Michael Paquier
On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> This is a draft patch - review is welcome and would help to get this
> ready to be considererd for v16, if desired.
> 
> I'm going to add this thread to the old CF entry.
> https://commitfest.postgresql.org/31/2888/

Patch 0003 adds support for the --long option of zstd, meaning that it
"enables long distance matching with #windowLog".  What's the benefit
of that when it is applied to dumps and base backup contents?
--
Michael




Re: verbose mode for pg_input_error_message?

2023-02-24 Thread Michael Paquier
On Fri, Feb 24, 2023 at 05:36:42PM -0500, Corey Huinker wrote:
> Looks good to me, passes make check-world. Thanks for slogging through this.

FWIW, I agree that switching pg_input_error_message() to return a row
would be nicer in the long-run than just getting an error message
because it has the merit to be extensible at will with all the data
we'd like to attach to it (I suspect that getting more fields is not
much likely, but who knows..).

pg_input_error_message() does not strike me as a good function name,
though, because it now returns much more than an error message.
Hence, couldn't something like pg_input_error() be better, because
more generic?
--
Michael




Re: Proposal: :SQL_EXEC_TIME (like :ROW_COUNT) Variable (psql)

2023-02-24 Thread Kirk Wolak
On Fri, Feb 24, 2023 at 7:09 AM Jim Jones  wrote:

> On 23.02.23 20:55, Kirk Wolak wrote:
> > Everyone,
> ... SQL_EXEC_TIME
> >   I think like ROW_COUNT, it should not change because of internal
> > commands.
> > So, you guys +1 this thing, give additional comments.  When the
> > feedback settles, I commit to making it happen.
> >
> > Thanks, Kirk
> >
> I can see it being pretty handy to check if a certain task involving two
> different terminal windows was done in the right order. Basically to see
> what went wrong, e.g. "did I really stop the master database before
> promoting the replica?"
>
> +1 !
>
> Best, Jim
>

Jim, thanks, here is that patch for the %T option, but I think you did a +1
for the new psql variable :SQL_EXEC_TIME.
I realized my communication style needs to be cleaner; I caused that
confusion with the lead-in.

I created this proposal because I felt it was an excellent suggestion, and
I think it will be trivial to implement, although
it will involve a lot more testing...  MANY times, I have run a query that
took a touch too long, and I was wondering how long EXACTLY did that take.
This makes it easy  \echo :SQL_EXEC_TIME  (Well, I think it will be
SQL_EXEC_ELAPSED)

regards, kirk
From eaf6d05028052a3ccaa8d980953ac4fd75255250 Mon Sep 17 00:00:00 2001
From: Kirk Wolak 
Date: Thu, 23 Feb 2023 17:52:09 +
Subject: [PATCH] Time option added to psql prompt

This adds a useful time option to the prompt: %T. Which does not
require a wasteful backquoted shell command which is also not
compatible between operating systems.
The format is simply HH24:MI:SS no other options available by design!

Author: Kirk Wolak 
Reviewed-By: Andrey Borodin 
Reviewed-By: Nikolay Samokhvalov 
Thread: 
https://postgr.es/m/CACLU5mSRwHr_8z%3DenMj-nXF1tmC7%2BJn5heZQNiKuLyxYUtL2fg%40mail.gmail.com
---
 doc/src/sgml/ref/psql-ref.sgml |  9 +
 src/bin/psql/prompt.c  | 10 +-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index dc6528dc11d..04ab9eeb8c0 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -4575,6 +4575,15 @@ testdb= INSERT INTO my_table VALUES 
(:'content');
 
   
 
+  
+%T
+
+ 
+  The current time on the client in HH24:MI:SS format.
+ 
+
+  
+
   
 %x
 
diff --git a/src/bin/psql/prompt.c b/src/bin/psql/prompt.c
index 969cd9908e5..a590c27389b 100644
--- a/src/bin/psql/prompt.c
+++ b/src/bin/psql/prompt.c
@@ -41,6 +41,7 @@
  * or a ! if session is not connected to a database;
  * in prompt2 -, *, ', or ";
  * in prompt3 nothing
+ * %T - time in HH24:MI:SS format
  * %x - transaction status: empty, *, !, ? (unknown or no connection)
  * %l - The line number inside the current statement, starting from 1.
  * %? - the error code of the last query (not yet implemented)
@@ -223,7 +224,14 @@ get_prompt(promptStatus_t status, ConditionalStack cstack)
break;
}
break;
-
+   /* output HH24:MI:SS */
+   case 'T':
+   {
+   time_t current_time = time(NULL);
+   struct tm *tm_info = localtime(&current_time);
+   sprintf(buf, "%02d:%02d:%02d", tm_info->tm_hour, tm_info->tm_min, tm_info->tm_sec);
+   }
+   break;
case 'x':
if (!pset.db)
buf[0] = '?';
-- 
GitLab
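
As an aside, a strftime()-based variant of that hunk could keep the
formatting in a single call; this is only a sketch, assuming buf is the
local prompt buffer array in get_prompt():

			case 'T':
				{
					time_t		now = time(NULL);

					/* HH24:MI:SS, same output as the sprintf() above */
					strftime(buf, sizeof(buf), "%H:%M:%S", localtime(&now));
				}
				break;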

Re: Proposal: :SQL_EXEC_TIME (like :ROW_COUNT) Variable (psql)

2023-02-24 Thread Kirk Wolak
On Fri, Feb 24, 2023 at 2:11 AM Gurjeet Singh  wrote:

> On Thu, Feb 23, 2023 at 8:42 PM Kirk Wolak  wrote:
> >   I love that my proposal for %T in the prompt, triggered some great
> conversations.
> >
> >   This is not instead of that.  That lets me run a query and come back
> HOURS later, and know it finished before 7PM like it was supposed to!
>
> Neat! I have this info embedded in my Bash prompt [1], but many a
> times this is not sufficient to reconstruct the time it took to run
> the shell command.
> ...
> >   I think like ROW_COUNT, it should not change because of internal
> commands.
>
> +1
>
> > So, you guys +1 this thing, give additional comments.  When the feedback
> settles, I commit to making it happen.
>
> This is definitely a useful feature. I agree with everything in the
> proposed UI (reporting in milliseconds, don't track internal commands'
> timing).
>
> I think 'duration' or 'elapsed' would be a better words in this
> context. So perhaps the name could be one of :sql_exec_duration (sql
> prefix feels superfluous), :exec_duration, :command_duration, or
> :elapsed_time.
>

I chose that prefix because it sorts near ROW_COUNT (LOL) when you do \SET

I agree that the name wasn't perfect...
I like SQL_EXEC_ELAPSED
keeping the result closer to ROW_COUNT, and it literally ONLY applies to SQL


> By using \timing, the user is explicitly opting into any overhead
> caused by time-keeping. With this feature, the timing info will be
> collected all the time. So do consider evaluating the performance
> impact this can cause on people's workloads. They may not care for the
> impact in interactive mode, but in automated scripts, even a moderate
> performance overhead would be a deal-breaker.
>
Excellent point.  I run lots of long scripts, but I usually set \timing on,
just because I turn off everything else.
I tested 2,000+ lines of select 1; (Fast sql shouldn't matter, it's the
most impacted)
Honestly, it was imperceptible, maybe approximately 0.01 seconds.
With timing on:  ~0.28 seconds
With timing off: ~0.27 seconds

The \timing incurs no realistic penalty at this point.  The ONLY penalty we
could face is the time to
write it to the variable, and that cannot be tested until implemented.  But
I will do that.  And I will
report the results of the impact.  But I do not expect a big impact.  We
update SQL_COUNT without an issue.
And that might be much more expensive to get.

Thanks!

>
> [1]:
> https://github.com/gurjeet/home/blob/08f1051fb854f4fc8fbc4f1326f393ed507a55ce/.bashrc#L278
> [2]:
> https://github.com/gurjeet/home/blob/08f1051fb854f4fc8fbc4f1326f393ed507a55ce/.bashrc#L262
>
> Best regards,
> Gurjeet
> http://Gurje.et
>


Re: Disable vacuuming to provide data history

2023-02-24 Thread Vik Fearing

On 2/24/23 22:06, Corey Huinker wrote:

On Thu, Feb 23, 2023 at 6:04 AM  wrote:

  [1] some implementations don't use null, they use an end-timestamp set to
a date implausibly far in the future ( 3999-12-31 for example ),


The specification is, "At any point in time, all rows that have their 
system-time period end column set to the highest value supported by the 
data type of that column are known as current system rows; all other 
rows are known as historical system rows."


I would like to see us use 'infinity' for this.

The main design blocker for me is how to handle dump/restore.  The 
standard does not bother thinking about that.

--
Vik Fearing





Re: Move defaults toward ICU in 16?

2023-02-24 Thread Jeff Davis
On Fri, 2023-02-17 at 15:07 -0800, Jeff Davis wrote:
> 2. Update the pg_database entry for template0. This has less
> potential
> for surprise in case people are actually using template0 for a
> template.

New patches attached.

  0001: default autoconf to build with ICU (meson already uses 'auto')
  0002: update template0 in new cluster (as described above)
  0003: default initdb to use ICU

Updating template0, as in 0002, seems straightforward and unsurprising,
since only template0 is preserved and it was only initialized for the
purposes of upgrading. Also, template0 is not sensitive to locale
settings, and doesn't even have the datcollversion set. The patch
updates encoding, datlocprovider, datcollate, datctype, and
daticulocale on the new cluster. No doc update, because there are some
initdb settings (like checksums) which still need to be compatible
between the old and the new cluster.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From 34fcb5ebfd0463a659170ea0d7599bd37d6c3e76 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 10 Feb 2023 12:08:11 -0800
Subject: [PATCH v4 1/3] Build ICU support by default.

Discussion: https://postgr.es/m/510d284759f6e943ce15096167760b2edcb2e700.ca...@j-davis.com
---
 .cirrus.yml|  1 +
 configure  | 36 ++--
 configure.ac   |  8 +++-
 doc/src/sgml/installation.sgml | 76 +++---
 4 files changed, 61 insertions(+), 60 deletions(-)

diff --git a/.cirrus.yml b/.cirrus.yml
index f212978752..34450a9c7b 100644
--- a/.cirrus.yml
+++ b/.cirrus.yml
@@ -751,6 +751,7 @@ task:
   time ./configure \
 --host=x86_64-w64-mingw32 \
 --enable-cassert \
+--without-icu \
 CC="ccache x86_64-w64-mingw32-gcc" \
 CXX="ccache x86_64-w64-mingw32-g++"
   make -s -j${BUILD_JOBS} clean
diff --git a/configure b/configure
index e35769ea73..436c2446e8 100755
--- a/configure
+++ b/configure
@@ -1558,7 +1558,7 @@ Optional Packages:
   set WAL block size in kB [8]
   --with-CC=CMD   set compiler (deprecated)
   --with-llvm build with LLVM based JIT support
-  --with-icu  build with ICU support
+  --without-icu   build without ICU support
   --with-tcl  build Tcl modules (PL/Tcl)
   --with-tclconfig=DIRtclConfig.sh is in DIR
   --with-perl build Perl modules (PL/Perl)
@@ -8401,7 +8401,9 @@ $as_echo "#define USE_ICU 1" >>confdefs.h
   esac
 
 else
-  with_icu=no
+  with_icu=yes
+
+$as_echo "#define USE_ICU 1" >>confdefs.h
 
 fi
 
@@ -8470,31 +8472,17 @@ fi
 	# Put the nasty error message in config.log where it belongs
 	echo "$ICU_PKG_ERRORS" >&5
 
-	as_fn_error $? "Package requirements (icu-uc icu-i18n) were not met:
-
-$ICU_PKG_ERRORS
-
-Consider adjusting the PKG_CONFIG_PATH environment variable if you
-installed software in a non-standard prefix.
-
-Alternatively, you may set the environment variables ICU_CFLAGS
-and ICU_LIBS to avoid the need to call pkg-config.
-See the pkg-config man page for more details." "$LINENO" 5
+	as_fn_error $? "ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support." "$LINENO" 5
 elif test $pkg_failed = untried; then
 { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
 $as_echo "no" >&6; }
-	{ { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
-$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
-as_fn_error $? "The pkg-config script could not be found or is too old.  Make sure it
-is in your PATH or set the PKG_CONFIG environment variable to the full
-path to pkg-config.
-
-Alternatively, you may set the environment variables ICU_CFLAGS
-and ICU_LIBS to avoid the need to call pkg-config.
-See the pkg-config man page for more details.
-
-To get pkg-config, see .
-See \`config.log' for more details" "$LINENO" 5; }
+	as_fn_error $? "ICU library not found
+If you have ICU already installed, see config.log for details on the
+failure.  It is possible the compiler isn't looking in the proper directory.
+Use --without-icu to disable ICU support." "$LINENO" 5
 else
 	ICU_CFLAGS=$pkg_cv_ICU_CFLAGS
 	ICU_LIBS=$pkg_cv_ICU_LIBS
diff --git a/configure.ac b/configure.ac
index af23c15cb2..215e32120f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -853,13 +853,17 @@ AC_SUBST(enable_thread_safety)
 # ICU
 #
 AC_MSG_CHECKING([whether to build with ICU support])
-PGAC_ARG_BOOL(with, icu, no, [build with ICU support],
+PGAC_ARG_BOOL(with, icu, yes, [build without ICU support],
   [AC_DEFINE([USE_ICU], 1, [Define to build with ICU support. (--with-icu)])])
 AC_MSG_RESULT([$with_icu])
 AC_SUBST(with_icu)
 
 if test "$with_icu" = yes; then
-  PKG_CHECK_MODULES(ICU, icu-uc icu-i18n)
+  PKG_CHECK_MODULES(ICU, icu-uc icu-i18n, [],

Re: Disable rdns for Kerberos tests

2023-02-24 Thread Stephen Frost
Greetings,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> On 21/02/2023 01:35, Stephen Frost wrote:
> > The name canonicalization support for Kerberos is doing us more harm
> > than good in the regression tests, so I propose we disable it.  Patch
> > attached.
> > 
> > Thoughts?
> 
> Makes sense. A brief comment in 001_auth.pl itself to mention why we disable
> rdns would be nice.

Thanks for reviewing!  Comments added and updated the commit message.

Unless there's anything else, I'll push this early next week.

Thanks again!

Stephen
From e0df250f30cdde1ee6d57ff24efbca3ecd923801 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Mon, 20 Feb 2023 17:53:48 -0500
Subject: [PATCH] For Kerberos testing, disable reverse DNS lookup

In our Kerberos test suite, there isn't much need to worry about the
normal canonicalization that Kerberos provides by looking up the reverse
DNS for the IP address connected to, and in some cases it can actively
cause problems (eg: captive portal wifi where the normally not
resolvable localhost address used ends up being resolved anyway, and
never to the domain we are using for testing, causing the entire
regression test to fail with errors about not being able to get a TGT
for the remote realm for cross-realm trust).

Therefore, disable it by adding rdns = false into the krb5.conf that's
generated for the test.

Reviewed-By: Heikki Linnakangas
Discussion: https://postgr.es/m/Y/qd2zdkdyqa1...@tamriel.snowman.net
---
 src/test/kerberos/t/001_auth.pl | 12 
 1 file changed, 12 insertions(+)

diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index d610ce63ab..16b3e8ed23 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -100,6 +100,17 @@ $stdout =~ m/Kerberos 5 release ([0-9]+\.[0-9]+)/
   or BAIL_OUT("could not get Kerberos version");
 $krb5_version = $1;
 
+# Build the krb5.conf to use.
+#
+# Explicitly specify the default (test) realm and the KDC for
+# that realm to avoid the Kerberos library trying to look up
+# that information in DNS, and also because we're using a
+# non-standard KDC port.
+#
+# Reverse DNS is explicitly disabled to avoid any issue with
+# captive portals or other cases where the reverse DNS succeeds
+# and the Kerberos library uses that as the canonical name of
+# the host and tries to do a cross-realm lookup.
 append_to_file(
 	$krb5_conf,
 	qq![logging]
@@ -108,6 +119,7 @@ kdc = FILE:$kdc_log
 
 [libdefaults]
 default_realm = $realm
+rdns = false
 
 [realms]
 $realm = {
-- 
2.34.1





Re: [Proposal] Allow pg_dump to include all child tables with the root table

2023-02-24 Thread Cary Huang
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Hi

the patch applies fine on current master branch and it works as described. 
However, I would suggest changing the new option name from "--with-childs" to 
"--with-partitions" for several reasons. 

"childs" is grammatically incorrect and in the PG community, the term 
"partitioned table" is normally used to denote a parent table, and the term 
"partition" is used to denote the child table under the parent table. We should 
use these terms to stay consistent with the community.

Also, I would rephrase the documentation as:

Used in conjunction with -t/--table or 
-T/--exclude-table options to include or 
exclude partitions of the specified tables if any.

thank you

Cary Huang

HighGo Software Canada
www.highgo.ca

Re: verbose mode for pg_input_error_message?

2023-02-24 Thread Corey Huinker
On Thu, Feb 23, 2023 at 4:47 PM Nathan Bossart 
wrote:

> On Thu, Feb 23, 2023 at 11:30:38AM -0800, Nathan Bossart wrote:
> > Will post a new version shortly.
>
> As promised...
>
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com


Looks good to me, passes make check-world. Thanks for slogging through this.


Re: PG_FREE_IF_COPY extraneous in numeric_cmp?

2023-02-24 Thread Tom Lane
CK Tan  writes:
> Isn't it true that pfree() will never be called by PG_FREE_IF_COPY?

No.  You're forgetting the possibility that PG_GETARG_NUMERIC will
have to de-toast a toasted input.  Granted, numerics are seldom
going to be long enough to get compressed or pushed out-of-line;
but that's possible, and what's very possible is that they'll have
a short header.
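
(For reference, PG_FREE_IF_COPY in fmgr.h boils down to roughly the
following, so the pfree() fires exactly when PG_GETARG_NUMERIC had to hand
back a detoasted, palloc'd copy rather than the raw argument:)

#define PG_FREE_IF_COPY(ptr,n) \
	do { \
		if ((Pointer) (ptr) != PG_GETARG_POINTER(n)) \
			pfree(ptr); \
	} while (0)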

regards, tom lane




Re: BRIN indexes vs. SK_SEARCHARRAY (and preprocessing scan keys)

2023-02-24 Thread Heikki Linnakangas

I had a quick look at just the preliminary cleanup patches:


0001-BRIN-bloom-cleanup-20230218.patch


Looks good to me


0002-BRIN-minmax-multi-cleanup-20230218.patch


Looks good, although it would feel more natural to me to do it the other 
way round, and define 'matches' as 'bool matches', and use DatumGetBool.


Not new with this patch, but I find the 'matches' and 'matching' 
variables a bit strange. Wouldn't it be simpler to have just one variable?



0003-Introduce-bloom_filter_size-20230218.patch


Looks good


0004-Add-minmax-multi-inequality-tests-20230218.patch


Looks good


+SELECT i/5 + mod(911 * i + 483, 25),
+   i/10 + mod(751 * i + 221, 41)


Peculiar formulas. Was there a particular reason for these values?

- Heikki





Re: Disable vacuuming to provide data history

2023-02-24 Thread Corey Huinker
On Thu, Feb 23, 2023 at 6:04 AM  wrote:

> Hey,
>
> It depnends on scenario, but there is many use cases that hack data
> change from somebody with admin privileges could be disaster.
> That is the place where data history could come with help.  Some basic
> solution would be trigger which writes previous version of record
> to some other table. Trigger however can be disabled or removed (crazy
> solution would be to provide pernament
> triggers and tables which  can only be pernamently inserted).
> Then we have also possibility to modify tablespace directly on disk.
>
> But Postgres has ability to not override records when two concurrent
> transaction modify data to provide MVCC.
>
> So what about pernamently not vacuumable tables. Adding some xid log
> tables with hash of record on hash on previous hash.
> I think that would be serious additional advantage for best open source
> relational databes.
>
> Best regards,
>Marek Mosiewicz
>

What you are describing sounds like the "system versioning" flavor of
"temporal" tables. It's a part of the SQL Standard, but PostgreSQL has yet
to implement it in core. Basically, every row has a start_timestamp and
end_timestamp field. Updating a row sets the end_timestamp of the old
version and inserts a new one with a start_timestamp matching the
end-timestamp of the previous row. Once a record has a non-null [1]
end_timestamp, it is not possible to update that row via SQL. Regular SQL
statements effectively have a "AND end_timestamp IS NULL" filter on them,
so the old rows are not visible without specifically invoking temporal
features to get point-in-time queries. At the implementation level, this
probably means a table with 2 partitions, one for live rows all having null
end_timestamps, and one for archived rows which is effectively append-only.

This strategy is common practice for chain of custody and auditing
purposes, either as a feature of the RDBMS or home-rolled. I have also seen
it used for developing forecasting models (ex "what would this model have
told us to do if we had run it a year ago?").

A few years ago, I personally thought about implementing a hash-chain
feature, but my research at the time concluded that:

* Few customers were interested in going beyond what was required for
regulatory compliance
* Once compliant, any divergence from established procedures, even if it
was an unambiguous improvement, only invited re-examination of it and
adjacent procedures, and they would avoid that
* They could get the same validation by comparing against a secured backup
and out-of-band audit "logs" (most would call them "reports")
* They were of the opinion that if a bad actor got admin access, it was
"game over" anyway

The world may have changed since then, but even if there is now interest, I
wonder if that isn't better implemented at the OS level rather than the
RDBMS level.

 [1] some implementations don't use null, they use an end-timestamp set to
a date implausibly far in the future ( 3999-12-31 for example ), but the
concept remains that once the column is set to a real timestamp, the row
isn't visible to update statements.


Doc update for pg_stat_statements normalization

2023-02-24 Thread Imseih (AWS), Sami
Replacing constants in pg_stat_statements is on a best effort basis.
It is not unlikely that on a busy workload with heavy entry deallocation,
the user may observe the query with the constants in pg_stat_statements.

From what I can see, this is because the only time an entry is normalized is
during post_parse_analyze, and the entry may be deallocated by the time query
execution ends. At that point, the original form ( with constants ) of the query
is used.

It is not clear how prevalent this is in real-world workloads, but it's easily
reproducible on a workload with high entry deallocation. Attached are the repro
steps on the latest branch.

I think the only thing to do here is to call this out in the docs with a
suggestion to increase pg_stat_statements.max to reduce the likelihood. I have
also attached the suggested doc enhancement.

Any thoughts?

Regards,

--
Sami Imseih
Amazon Web Services


### pg_stat_statements.max is min allowed
postgres=# show pg_stat_statements.max ;
 pg_stat_statements.max

 100
(1 row)

### create 200 tables
do $$
begin
for i in 1 .. 200
loop
execute 'drop table if exists foo'||i;
execute 'create table foo'||i||'(id int, c2 text)';
end loop;
end; $$ ;

### create a pgbench script to insert into the tables.
for (( i=1; i<=200; i++ ))
do 
   echo "INSERT INTO foo"$i" (id, c2) values (1, 'somedata');" >> 
/tmp/pgbench.sql
done

### run pgbench
pgbench -c 50 -f /tmp/pgbench.sql -T 1200


### observe pg_stat_statements
postgres=# select query from pg_stat_statements where query like '%foo%' and 
query not like '%$1%';
query
-
 INSERT INTO foo31 (id, c2) values (1, 'somedata')
 INSERT INTO foo20 (id, c2) values (1, 'somedata')
 INSERT INTO foo191 (id, c2) values (1, 'somedata')
 INSERT INTO foo170 (id, c2) values (1, 'somedata')
 INSERT INTO foo167 (id, c2) values (1, 'somedata')
 INSERT INTO foo32 (id, c2) values (1, 'somedata')
 INSERT INTO foo36 (id, c2) values (1, 'somedata')
 INSERT INTO foo43 (id, c2) values (1, 'somedata')
 INSERT INTO foo181 (id, c2) values (1, 'somedata')
 INSERT INTO foo88 (id, c2) values (1, 'somedata')
(10 rows)

postgres=# select query from pg_stat_statements where query like '%foo%' and 
query  like '%$1%' limit 5;
   query

 INSERT INTO foo33 (id, c2) values ($1, $2)
 INSERT INTO foo59 (id, c2) values ($1, $2)
 INSERT INTO foo50 (id, c2) values ($1, $2)
 INSERT INTO foo42 (id, c2) values ($1, $2)
 INSERT INTO foo91 (id, c2) values ($1, $2)

### observe the # of deallocations
postgres=# select * from pg_stat_statements_info ;
 dealloc |  stats_reset
-+---
  113371 | 2023-02-24 06:24:21.912275-06
(1 row)


0001-doc-update-regarding-pg_stat_statements-normalizatio.patch
Description: 0001-doc-update-regarding-pg_stat_statements-normalizatio.patch


Re: Marking options deprecated in help output

2023-02-24 Thread Heikki Linnakangas

On 05/12/2022 11:42, Daniel Gustafsson wrote:

In the pg_dump blob terminology thread it was briefly discussed [0] to mark
parameters as deprecated in the --help output.  The attached is a quick diff to
show what that would look like.  Personally I think it makes sense, not
everyone will read the docs.


Makes sense. One minor suggestion; instead of this:


  -h, -H, --host=HOSTNAMEdatabase server host or socket directory
 (-H is deprecated)


How about putting the deprecated option on a separate line like this:


  -h, --host=HOSTNAMEdatabase server host or socket directory
  -H (same as -h, deprecated)


And same for --blobs and -no-blobs

- Heikki





Re: Inconsistency in ACL error message

2023-02-24 Thread Joseph Koshakow
On Fri, Feb 24, 2023 at 1:31 PM Nathan Bossart 
wrote:

> You might be interested in
>
>https://commitfest.postgresql.org/42/4145/

Ah, perfect. In that case ignore my patch!

- Joe Koshakow


zstd compression for pg_dump

2023-02-24 Thread Justin Pryzby
This is a draft patch - review is welcome and would help to get this
ready to be considererd for v16, if desired.

I'm going to add this thread to the old CF entry.
https://commitfest.postgresql.org/31/2888/

-- 
Justin
>From 2486417b7c3586e150e806a1fbc3b873c2a4a0f9 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 7 Jan 2023 15:45:06 -0600
Subject: [PATCH 1/3] WIP: pg_dump: zstd compression

Previously proposed at: 20201221194924.gi30...@telsasoft.com

note-to-self: see also: private commit 36ab001fb
---
 src/bin/pg_dump/Makefile  |   1 +
 src/bin/pg_dump/compress_io.c |  54 +--
 src/bin/pg_dump/compress_zstd.c   | 520 ++
 src/bin/pg_dump/compress_zstd.h   |   9 +
 src/bin/pg_dump/meson.build   |   1 +
 src/bin/pg_dump/pg_backup_archiver.c  |   9 +-
 src/bin/pg_dump/pg_backup_directory.c |   2 +
 src/bin/pg_dump/pg_dump.c |  13 -
 src/bin/pg_dump/t/002_pg_dump.pl  |  71 
 9 files changed, 640 insertions(+), 40 deletions(-)
 create mode 100644 src/bin/pg_dump/compress_zstd.c
 create mode 100644 src/bin/pg_dump/compress_zstd.h

diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile
index eb8f59459a1..76574298faf 100644
--- a/src/bin/pg_dump/Makefile
+++ b/src/bin/pg_dump/Makefile
@@ -29,6 +29,7 @@ OBJS = \
 	compress_io.o \
 	compress_lz4.o \
 	compress_none.o \
+	compress_zstd.o \
 	dumputils.o \
 	parallel.o \
 	pg_backup_archiver.o \
diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c
index ce06f1eac9c..061e3d9ce1c 100644
--- a/src/bin/pg_dump/compress_io.c
+++ b/src/bin/pg_dump/compress_io.c
@@ -52,8 +52,8 @@
  *
  *	InitDiscoverCompressFileHandle tries to infer the compression by the
  *	filename suffix. If the suffix is not yet known then it tries to simply
- *	open the file and if it fails, it tries to open the same file with the .gz
- *	suffix, and then again with the .lz4 suffix.
+ *	open the file and if it fails, it tries to open the same file with
+ *	compressed suffixes.
  *
  * IDENTIFICATION
  *	   src/bin/pg_dump/compress_io.c
@@ -69,6 +69,7 @@
 #include "compress_io.h"
 #include "compress_lz4.h"
 #include "compress_none.h"
+#include "compress_zstd.h"
 #include "pg_backup_utils.h"
 
 /*--
@@ -98,6 +99,10 @@ supports_compression(const pg_compress_specification compression_spec)
 	if (algorithm == PG_COMPRESSION_LZ4)
 		supported = true;
 #endif
+#ifdef USE_ZSTD
+	if (algorithm == PG_COMPRESSION_ZSTD)
+		supported = true;
+#endif
 
 	if (!supported)
 		return psprintf("this build does not support compression with %s",
@@ -130,6 +135,8 @@ AllocateCompressor(const pg_compress_specification compression_spec,
 		InitCompressorGzip(cs, compression_spec);
 	else if (compression_spec.algorithm == PG_COMPRESSION_LZ4)
 		InitCompressorLZ4(cs, compression_spec);
+	else if (compression_spec.algorithm == PG_COMPRESSION_ZSTD)
+		InitCompressorZstd(cs, compression_spec);
 
 	return cs;
 }
@@ -196,25 +203,36 @@ InitCompressFileHandle(const pg_compress_specification compression_spec)
 		InitCompressFileHandleGzip(CFH, compression_spec);
 	else if (compression_spec.algorithm == PG_COMPRESSION_LZ4)
 		InitCompressFileHandleLZ4(CFH, compression_spec);
+	else if (compression_spec.algorithm == PG_COMPRESSION_ZSTD)
+		InitCompressFileHandleZstd(CFH, compression_spec);
 
 	return CFH;
 }
 
+static bool
+check_compressed_file(const char *path, char **fname, char *ext)
+{
+	free_keep_errno(*fname);
+	*fname = psprintf("%s.%s", path, ext);
+	return (access(*fname, F_OK) == 0);
+}
+
 /*
  * Open a file for reading. 'path' is the file to open, and 'mode' should
  * be either "r" or "rb".
  *
  * If the file at 'path' contains the suffix of a supported compression method,
- * currently this includes ".gz" and ".lz4", then this compression will be used
+ * currently this includes ".gz", ".lz4" and ".zst", then this compression will be used
  * throughout. Otherwise the compression will be inferred by iteratively trying
  * to open the file at 'path', first as is, then by appending known compression
  * suffixes. So if you pass "foo" as 'path', this will open either "foo" or
- * "foo.gz" or "foo.lz4", trying in that order.
+ * "foo.{gz,lz4,zst}", trying in that order.
  *
  * On failure, return NULL with an error code in errno.
  */
 CompressFileHandle *
 InitDiscoverCompressFileHandle(const char *path, const char *mode)
+// pg_compress_algorithm alg
 {
 	CompressFileHandle *CFH = NULL;
 	struct stat st;
@@ -237,28 +255,12 @@ InitDiscoverCompressFileHandle(const char *path, const char *mode)
 		/* avoid unused warning if it is not built with compression */
 		if (exists)
 			compression_spec.algorithm = PG_COMPRESSION_NONE;
-#ifdef HAVE_LIBZ
-		if (!exists)
-		{
-			free_keep_errno(fname);
-			fname = psprintf("%s.gz", path);
-			exists = (stat(fname, &st) == 0);
-
-			if (exists)
-compression_spec.algorithm = PG_COMPRESSION_GZIP;
-		}
-#endif
-#ifdef USE_LZ4
-		if (!exists)
-		{
-			

Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Tomas Vondra



On 2/24/23 19:03, Matthias van de Meent wrote:
> On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
>  wrote:
>>
>> On 2/23/23 17:44, Matthias van de Meent wrote:
>>> On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
>>>  wrote:

 On 2/23/23 15:19, Matthias van de Meent wrote:
> Comments on 0001, mostly comments and patch design:
>>>
>>> One more comment:
>>>
>> + range->min_value = bval->bv_values[0];
>> + range->max_value = bval->bv_values[1];
>>>
>>> This produces dangling pointers for minmax indexes on by-ref types
>>> such as text and bytea, due to the memory context of the decoding
>>> tuple being reset every iteration. :-(
>>>
>>
>> Yeah, that sounds like a bug. Also a sign the tests should have some
>> by-ref data types (presumably there are none, as that would certainly
>> trip some asserts etc.).
> 
> I'm not sure we currently trip asserts, as the data we store in the
> memory context is very limited, making it unlikely we actually release
> the memory region back to the OS.
> I did get assertion failures by adding the attached patch, but I don't
> think that's something we can do in the final release.
> 

But we should randomize the memory if we ever do pfree(), and it's
strange valgrind didn't complain when I ran tests with it. But yeah,
I'll take a look and see if we can add some tests covering this.

>>> With the proposed patch, we do O(ncols_statsenabled) scans over the
>>> BRIN index. Each scan reads all ncol columns of all block ranges from
>>> disk, so in effect the data scan does on the order of
>>> O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
>>> all columns have statistics enabled.
>>>
>>
>> I don't think that's the number of I/O operations we'll do, because we
>> always read the whole BRIN tuple at once. So I believe it should rather
>> be something like
>>
>>   O(ncols_statsenabled * nranges)
>>
>> assuming nranges is the number of page ranges. But even that's likely a
>> significant overestimate because most of the tuples will be served from
>> shared buffers.
> 
> We store some data per index attribute, which makes the IO required
> for a single indexed range proportional to the number of index
> attributes.
> If we then scan the index a number of times proportional to the number
> of attributes, the cumulative IO load of statistics gathering for that
> index is quadratic on the number of attributes, not linear, right?
> 

Ah, OK. I was focusing on number of I/O operations while you're talking
about amount of I/O performed. You're right the amount of I/O is
quadratic, but I think what's more interesting is the comparison of the
alternative ANALYZE approaches. The current simple approach does a
multiple of the I/O amount, linear to the number of attributes.

Which is not great, ofc.


>> Considering how tiny BRIN indexes are, this is likely orders of
>> magnitude less I/O than we expend on sampling rows from the table. I
>> mean, with the default statistics target we read ~3 pages (~240MB)
>> or more just to sample the rows. Randomly, while the BRIN index is
>> likely scanned mostly sequentially.
> 
> Mostly agreed; except I think it's not too difficult to imagine a BRIN
> index that is on that scale; with as an example the bloom filters that
> easily take up several hundreds of bytes.
> 
> With the default configuration of 128 pages_per_range,
> n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
> bloom filter size is 4.36 KiB - each indexed item on its own page. It
> is still only 1% of the original table's size, but there are enough
> tables that are larger than 24GB that this could be a significant
> issue.
> 

Right, it's certainly true BRIN indexes may be made fairly large (either
by using something like bloom or by making the ranges much smaller). But
those are exactly the indexes where building statistics for all columns
at once is going to cause issues with memory usage etc.

Note: Obviously, that depends on how much data per range we need to keep
in memory. For bloom I doubt we'd actually want to keep all the filters,
we'd probably calculate just some statistics (e.g. number of bits set),
so maybe the memory consumption is not that bad.

In fact, for such indexes the memory consumption may already be an issue
even when analyzing the index column by column. My feeling is we'll need
to do something about that, e.g. by reading only a sample of the ranges,
or something like that. That might help with the I/O too, I guess.

> Note that most of my concerns are related to our current
> documentation's statement that there are no demerits to multi-column
> BRIN indexes:
> 
> """
> 11.3. Multicolumn Indexes
> 
> [...] The only reason to have multiple BRIN indexes instead of one
> multicolumn BRIN index on a single table is to have a different
> pages_per_range storage parameter.
> """
> 

True, we may need to clarify this in the documentation.

> Wide BRIN indexes add IO overhead for single-attribute scans when
> compared to 

PG_FREE_IF_COPY extraneous in numeric_cmp?

2023-02-24 Thread CK Tan
Hi hackers,

I have a question on the code below:

Datum
numeric_cmp(PG_FUNCTION_ARGS)
{
  Numeric num1 = PG_GETARG_NUMERIC(0);
  Numeric num2 = PG_GETARG_NUMERIC(1);
  int result;

  result = cmp_numerics(num1, num2);

  PG_FREE_IF_COPY(num1, 0);
  PG_FREE_IF_COPY(num2, 1);

  PG_RETURN_INT32(result);
}

It seems to me that num1 is a copy of fcinfo->arg[0]. It is passed to
the function cmp_numerics(), and its value remains the same after the
call. Also, cmp_numerics() does not have a handle on fcinfo, so it
can't modify fcinfo->arg[0].

Isn't it true that pfree() will never be called by PG_FREE_IF_COPY?
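For reference, fmgr.h defines the macro roughly as follows (a from-memory
sketch, not a verbatim quote):

   /* Sketch of PG_FREE_IF_COPY: pfree() runs only when the argument had to
    * be detoasted into a freshly palloc'd copy, i.e. when the pointer no
    * longer matches the raw argument datum. */
   #define PG_FREE_IF_COPY(ptr,n) \
       do { \
           if ((Pointer) (ptr) != PG_GETARG_POINTER(n)) \
               pfree(ptr); \
       } while (0)

PG_GETARG_NUMERIC() detoasts its argument, so num1 can indeed differ from
the raw argument when the stored value was toasted or compressed.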

Cheers,
-cktan




Re: Inconsistency in ACL error message

2023-02-24 Thread Nathan Bossart
On Fri, Feb 24, 2023 at 12:23:27PM -0500, Joseph Koshakow wrote:
> I noticed a very minor inconsistency in some ACL error messages. When
> you try to alter a role, it just says "permission denied":

You might be interested in

https://commitfest.postgresql.org/42/4145/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Matthias van de Meent
On Thu, 23 Feb 2023 at 19:48, Tomas Vondra
 wrote:
>
> On 2/23/23 17:44, Matthias van de Meent wrote:
> > On Thu, 23 Feb 2023 at 16:22, Tomas Vondra
> >  wrote:
> >>
> >> On 2/23/23 15:19, Matthias van de Meent wrote:
> >>> Comments on 0001, mostly comments and patch design:
> >
> > One more comment:
> >
>  + range->min_value = bval->bv_values[0];
>  + range->max_value = bval->bv_values[1];
> >
> > This produces dangling pointers for minmax indexes on by-ref types
> > such as text and bytea, due to the memory context of the decoding
> > tuple being reset every iteration. :-(
> >
>
> Yeah, that sounds like a bug. Also a sign the tests should have some
> by-ref data types (presumably there are none, as that would certainly
> trip some asserts etc.).

I'm not sure we currently trip asserts, as the data we store in the
memory context is very limited, making it unlikely we actually release
the memory region back to the OS.
I did get assertion failures by adding the attached patch, but I don't
think that's something we can do in the final release.

> > With the proposed patch, we do O(ncols_statsenabled) scans over the
> > BRIN index. Each scan reads all ncol columns of all block ranges from
> > disk, so in effect the data scan does on the order of
> > O(ncols_statsenabled * ncols * nranges) IOs, or O(n^2) on cols when
> > all columns have statistics enabled.
> >
>
> I don't think that's the number of I/O operations we'll do, because we
> always read the whole BRIN tuple at once. So I believe it should rather
> be something like
>
>   O(ncols_statsenabled * nranges)
>
> assuming nranges is the number of page ranges. But even that's likely a
> significant overestimate because most of the tuples will be served from
> shared buffers.

We store some data per index attribute, which makes the IO required
for a single indexed range proportional to the number of index
attributes.
If we then scan the index a number of times proportional to the number
of attributes, the cumulative IO load of statistics gathering for that
index is quadratic on the number of attributes, not linear, right?

> Considering how tiny BRIN indexes are, this is likely orders of
> magnitude less I/O than we expend on sampling rows from the table. I
> mean, with the default statistics target we read ~30000 pages (~240MB)
> or more just to sample the rows. Randomly, while the BRIN index is
> likely scanned mostly sequentially.

Mostly agreed; except I think it's not too difficult to imagine a BRIN
index that is on that scale; with as an example the bloom filters that
easily take up several hundreds of bytes.

With the default configuration of 128 pages_per_range,
n_distinct_per_range of -0.1, and false_positive_rate of 0.01, the
bloom filter size is 4.36 KiB - each indexed item on its own page. It
is still only 1% of the original table's size, but there are enough
tables that are larger than 24GB that this could be a significant
issue.

Note that most of my concerns are related to our current
documentation's statement that there are no demerits to multi-column
BRIN indexes:

"""
11.3. Multicolumn Indexes

[...] The only reason to have multiple BRIN indexes instead of one
multicolumn BRIN index on a single table is to have a different
pages_per_range storage parameter.
"""

Wide BRIN indexes add IO overhead for single-attribute scans when
compared to single-attribute indexes, so having N single-attribute
index scans to build the statistics on an N-attribute index
is not great.
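As a purely hypothetical illustration of that trade-off (table and column
names made up here):

   CREATE INDEX ON t USING brin (a, b); -- one wide index: every range summary
                                        -- carries both attributes, so even a
                                        -- scan that only needs "a" reads both
   CREATE INDEX ON t USING brin (a);    -- versus two narrow indexes: each scan
   CREATE INDEX ON t USING brin (b);    -- reads only the attribute it needs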

> Maybe there are cases where this would be an issue, but I haven't seen
> one when working on this patch (and I did a lot of experiments). I'd
> like to see one before we start optimizing it ...

I'm not only worried about optimizing it, I'm also worried that we're
putting this abstraction at the wrong level in a way that is difficult
to modify.

> This also reminds me that the issues I actually saw (e.g. memory
> consumption) would be made worse by processing all columns at once,
> because then you need to keep more columns in memory.

Yes, I think that can be a valid concern, but don't we already do similar
things in the current table statistics gathering?

Kind regards,

Matthias van de Meent




Inconsistency in ACL error message

2023-02-24 Thread Joseph Koshakow
Hi all,

I noticed a very minor inconsistency in some ACL error messages. When
you try to alter a role, it just says "permission denied":

  postgres=> ALTER ROLE bar NOCREATEDB;
  ERROR:  permission denied
  postgres=> ALTER ROLE bar SET search_path TO 'foo';
  ERROR:  permission denied

For almost all other ACL errors, we include what the action was. For
example:

  postgres=> CREATE ROLE r;
  ERROR:  permission denied to create role
  postgres=> DROP ROLE postgres;
  ERROR:  permission denied to drop role
  postgres=> CREATE DATABASE foo;
  ERROR:  permission denied to create database


It's not a huge deal, but it's easy enough to fix that I thought I'd
generate a patch (attached). Let me know if people think that it's
worth merging.

- Joe Koshakow
From 3ab31bc755043973ce56ee620ad99b5789d12111 Mon Sep 17 00:00:00 2001
From: Joseph Koshakow 
Date: Fri, 24 Feb 2023 12:05:19 -0500
Subject: [PATCH] Add details to ALTER ROLE permission errors

---
 src/backend/commands/user.c   | 4 ++--
 src/test/regress/expected/create_role.out | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
index 3a92e930c0..2c7a4204a6 100644
--- a/src/backend/commands/user.c
+++ b/src/backend/commands/user.c
@@ -761,7 +761,7 @@ AlterRole(ParseState *pstate, AlterRoleStmt *stmt)
 			dvalidUntil || disreplication || dbypassRLS)
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 errmsg("permission denied")));
+	 errmsg("permission denied to alter role")));
 
 		/* an unprivileged user can change their own password */
 		if (dpassword && roleid != currentUserId)
@@ -1008,7 +1008,7 @@ AlterRoleSet(AlterRoleSetStmt *stmt)
 && roleid != GetUserId())
 ereport(ERROR,
 		(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-		 errmsg("permission denied")));
+		 errmsg("permission denied to alter role")));
 		}
 
 		ReleaseSysCache(roletuple);
diff --git a/src/test/regress/expected/create_role.out b/src/test/regress/expected/create_role.out
index 9f431bd4f5..691cff86d2 100644
--- a/src/test/regress/expected/create_role.out
+++ b/src/test/regress/expected/create_role.out
@@ -98,7 +98,7 @@ ERROR:  must have admin option on role "regress_role_normal"
 ALTER ROLE regress_role_normal RENAME TO regress_role_abnormal;
 ERROR:  permission denied to rename role
 ALTER ROLE regress_role_normal NOINHERIT NOLOGIN CONNECTION LIMIT 7;
-ERROR:  permission denied
+ERROR:  permission denied to alter role
 -- ok, regress_tenant can create objects within the database
 SET SESSION AUTHORIZATION regress_tenant;
 CREATE TABLE tenant_table (i integer);
-- 
2.34.1



Re: Timeline ID hexadecimal format

2023-02-24 Thread Sébastien Lardière

On 31/01/2023 20:16, Greg Stark wrote:

A hint or something just in
that case might be enough?


It seems to be a -1;

Let's try to improve the documentation with the attached patch.
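To illustrate the convention the patch describes (the file names below are
made up): a timeline ID of 17 appears as 0x11 in WAL segment and history
file names, while pg_waldump expects the decimal form:

   # timeline 17 (decimal) = 0x11 (hexadecimal)
   #   WAL segment:  00000011000000A10000004F
   #   history file: 00000011.history
   pg_waldump --timeline=17 00000011000000A10000004F
   # recovery_target_timeline takes decimal, or hexadecimal prefixed with 0x:
   #   recovery_target_timeline = '0x11'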

best regards,

--
Sébastien
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index be05a33205..7e26b51031 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1332,7 +1332,8 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
 you like, add comments to a history file to record your own notes about
 how and why this particular timeline was created.  Such comments will be
 especially valuable when you have a thicket of different timelines as
-a result of experimentation.
+a result of experimentation. In both WAL segment file names and history files,
+the timeline ID number is expressed in hexadecimal.

 

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e5c41cc6c6..3b5d041d92 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4110,7 +4110,9 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
 current when the base backup was taken.  The
 value latest recovers
 to the latest timeline found in the archive, which is useful in
-a standby server.  latest is the default.
+a standby server. A numerical value expressed in hexadecimal must be
+prefixed with 0x, for example 0x11.
+latest is the default. 

 

diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml
index 343f0482a9..4ae8f2ebdd 100644
--- a/doc/src/sgml/ref/pg_waldump.sgml
+++ b/doc/src/sgml/ref/pg_waldump.sgml
@@ -215,7 +215,8 @@ PostgreSQL documentation

 Timeline from which to read WAL records. The default is to use the
 value in startseg, if that is specified; otherwise, the
-default is 1.
+default is 1. The value must be expressed in decimal, contrary to the hexadecimal 
+value given in WAL segment file names and history files.

   
  


Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Matthias van de Meent
On Fri, 24 Feb 2023 at 17:04, Tomas Vondra
 wrote:
>
> On 2/24/23 16:14, Alvaro Herrera wrote:
> > ... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
> > this will neglect the last range in the table.
> >
>
> Why would it? Let's say BlockNumber is uint8, i.e. 255 max. And there
> are 10 pages per range. That's 25 "full" ranges, and the last range
> being just 5 pages. So we get into
>
>prevHeapBlk = 240
>heapBlk = 250
>
> and we read the last 5 pages. And then we update
>
>prevHeapBlk = 250
>heapBlk = (250 + 10) % 255 = 5
>
> and we don't do that loop. Or did I get this wrong, somehow?

The result is off-by-one due to (u)int8 overflows being mod-256, but
apart from that your result is accurate.

The condition only stops the loop when we wrap around or when we go
past the last block, but no earlier than that.


Kind regards,

Matthias van de Meent




Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Tomas Vondra



On 2/24/23 16:14, Alvaro Herrera wrote:
> On 2023-Feb-24, Tomas Vondra wrote:
> 
>> I guess the easiest fix would be to do the arithmetic in 64 bits. That'd
>> eliminate the overflow.
> 
> Yeah, that might be easy to set up.  We then don't have to worry about
> it until BlockNumber is enlarged to 64 bits ... but by that time surely
> we can just grow it again to a 128 bit loop variable.
> 
>> Alternatively, we could do something like
>>
>>   prevHeapBlk = 0;
>>   for (heapBlk = 0; (heapBlk < nblocks) && (prevHeapBlk <= heapBlk);
>>heapBlk += pagesPerRange)
>>   {
>> ...
>> prevHeapBlk = heapBlk;
>>   }
> 
> I think a formulation of this kind has the benefit that it works after
> BlockNumber is enlarged to 64 bits, and doesn't have to be changed ever
> again (assuming it is correct).
> 

Did anyone even propose doing that? I suspect this is unlikely to be the
only place that might be broken by that.

> ... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
> this will neglect the last range in the table.
> 

Why would it? Let's say BlockNumber is uint8, i.e. 255 max. And there
are 10 pages per range. That's 25 "full" ranges, and the last range
being just 5 pages. So we get into

   prevHeapBlk = 240
   heapBlk = 250

and we read the last 5 pages. And then we update

   prevHeapBlk = 250
   heapBlk = (250 + 10) % 255 = 5

and we don't do that loop. Or did I get this wrong, somehow?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Missing update of all_hasnulls in BRIN opclasses

2023-02-24 Thread Tomas Vondra


On 1/9/23 00:34, Tomas Vondra wrote:
> 
> I've been working on this over the past couple days, trying to polish
> and commit it over the weekend - both into master and backbranches.
> Sadly, the backpatching part turned out to be a bit more complicated
> than I expected, because of the BRIN reworks in PG14 (done by me, as
> foundation for the new opclasses, so ... well).
> 
> Anyway, I got it done, but it's a bit uglier than I hoped for and I
> don't feel like pushing this on Sunday midnight. I think it's correct,
> but maybe another pass to polish it a bit more is better.
> 
> So here are two patches - one for 11-13, the other for 14-master.
> 

I spent a bit more time on this fix. I realized there are two more
places that need fixes.

Firstly, the placeholder tuple needs to be marked as "empty" too, so
that it can be correctly updated by other backends etc.

Secondly, union_tuples had a couple bugs in handling empty ranges (this
is related to the placeholder tuple changes). I wonder what's the best
way to test this in an automated way - it's very dependent on timing of
the concurrent updates. For example we need to do something like this:

T1: run pg_summarize_range() until it inserts the placeholder tuple
T2: do an insert into the page range (updates placeholder)
T1: continue pg_summarize_range() to merge into the placeholder

But there are no convenient ways to do this, I think. I had to check the
various cases using breakpoints in gdb etc.

I'm not very happy with the union_tuples() changes - it's quite verbose,
perhaps a bit too verbose. We have to check for empty ranges first, and
then various combinations of allnulls/hasnulls flags for both BRIN
tuples. There are 9 combinations, and the current code just checks them
one by one - I was getting repeatedly confused by the original code, but
maybe it's too much.

As for the backpatch, I tried to keep it as close to the 14+ fixes as
possible, but it effectively backports some of the 14+ BRIN changes. In
particular, 14+ moved most of the NULL-handling logic from opclasses to
brin.c, and I think it's reasonable to do that for the backbranches too.

The alternative is to apply the same fix to every BRIN_PROCNUM_UNION
opclass procedure out there. I guess doing that for minmax+inclusion is
not a huge deal, but what about external opclasses? And without the fix
the indexes are effectively broken. Fixing this outside in brin.c (in
the union procedure) fixes this for every opclass procedure, without any
actual limitation of functionality (14+ does that anyway).

But maybe someone thinks this is a bad idea and we should do something
else in the backbranches?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From 0abf47f311bfb0b03e5349b12c8e67ad3d5c0842 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Sun, 8 Jan 2023 16:43:06 +0100
Subject: [PATCH] Fix handling of NULLs in BRIN indexes

BRIN indexes did not properly distinguish between summaries for empty
(no rows) and all-NULL ranges. All summaries were initialized with
allnulls=true, and the opclasses simply reset allnulls to false when
processing the first non-NULL value. This however fails if the range
starts with a NULL value (or a sequence of NULL values), in which case
we forget the range contains NULL values.

This happens because the allnulls flag is used for two separate
purposes - to mark empty ranges (not representing any rows yet) and
ranges containing only NULL values.

Opclasses don't know which of these cases it is, and so don't know
whether to set hasnulls=true. Setting hasnulls=true in both cases would
make it correct, but it would also make BRIN indexes useless for queries
with IS NULL clauses - all ranges start empty (and thus allnulls=true),
so all ranges would end up with either allnulls=true or hasnulls=true.

The severity of the issue is somewhat reduced by the fact that it only
happens when adding values to an existing summary with allnulls=true,
not when the summarization is processing values in bulk (e.g. during
CREATE INDEX or automatic summarization). In this case the flags were
updated in a slightly different way, not forgetting the NULL values.

The best solution would be to introduce a new flag marking index tuples
representing ranges with no rows, but that would break on-disk format
and/or ABI, depending on where we put the flag. Considering we need to
backpatch this, that's not acceptable.

So instead we use an "impossible" combination of both flags (allnulls
and hasnulls) set to true, to mark "empty" ranges with no rows. In
principle "empty" is a feature of the whole index tuple, which may
contain multiple summaries in a multi-column index, but this is where
the flags are, unfortunately.

We could also skip storing index tuples for empty summaries, but then
we'd have to always process such ranges - even if there are no rows in
large parts of the table (e.g. after a bulk DELETE), it would still
require reading 

Re: how does postgresql handle LOB/CLOB/BLOB column data that dies before the query ends

2023-02-24 Thread Tom Lane
Noel Grandin  writes:
> Hacker from another open-source DB here (h2database.com).

> How does postgresql handle the following situation?

> (1) a table containing a LOB column

Postgres doesn't really do LOB in the same sense that some other DBs
have, so you'd need to specify what you have in mind in Postgres
terms to get a useful answer.

We do have a concept of "large objects" named by OIDs, but they're
much more of a manually-managed, nontransparent feature than typical
LOB implementations.  I don't think our JDBC driver implements the
sort of syntax you sketch (I could be wrong though, not much of a
JDBC guy).

Having said that ...

> In the face of concurrent updates that might overwrite the existing LOB
> data, how does PostgresQL handle this?

... reading from a large object follows the same MVCC rules we use
for all other data.  We allow multiple versions of a tuple to exist
on-disk, and we don't clean out old versions until no live transaction
can "see" them anymore.  So data consistency is just a matter of using
the same "snapshot" (which selects appropriate tuple versions) across
however many queries you want consistent results from.  If somebody
writes new data meanwhile, it doesn't matter because that tuple version
is invisible to your snapshot.

> Or does it impose some extra constraint on the client side? e.g..
> explicitly opening and closing a transaction, and only wipe the "old" LOB
> data when the transaction is closed?

From a client's perspective, the two options are "snapshots last for
one query" and "snapshots last for one transaction".  You signify which
one you want by selecting a transaction isolation mode when you begin
the transaction.
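A minimal client-side sketch of that, with both reads done under one
snapshot (table and column names made up; lo_get() is the server-side
large-object read function):

   BEGIN ISOLATION LEVEL REPEATABLE READ;
   SELECT lob_oid FROM documents WHERE id = 42;          -- snapshot established here
   SELECT lo_get(lob_oid) FROM documents WHERE id = 42;  -- same snapshot, matching LO state
   COMMIT;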

regards, tom lane




Re: Stale references to guc.c in comments/tests

2023-02-24 Thread Tom Lane
Daniel Gustafsson  writes:
> I happened to notice that there were a few references to guc.c regarding
> variables, which with the recent refactoring in 0a20ff54f have become stale.
> Attached is a trivial patch to instead point to guc_tables.c.

Hmm, I think you may have done an overenthusiastic replacement here.
I agree with changes like this:

-extern char *role_string;  /* in guc.c */
+extern char *role_string;  /* in guc_tables.c */

because clearly that variable is now declared in guc_tables.c
and guc.c knows nothing of it explicitly.  However, a lot of
these places are really talking about the behavior of the GUC
mechanisms as a whole, and so a pointer to guc_tables.c doesn't
seem very on-point to me --- I find it hard to attribute behavior
to a static table.  Here for instance:

@@ -3041,7 +3041,7 @@ pg_get_functiondef(PG_FUNCTION_ARGS)
  *
  * Variables that are not so marked should just be emitted as
  * simple string literals.  If the variable is not known to
- * guc.c, we'll do that; this makes it unsafe to use
+ * guc_tables.c, we'll do that; this makes it unsafe to use
  * GUC_LIST_QUOTE for extension variables.
  */
 if (GetConfigOptionFlags(configitem, true) & GUC_LIST_QUOTE)

An extension's GUC is by definition not known in guc_tables.c, so I think
this change is losing the point of the text.  What it's really describing
is a variable that hasn't been entered into the dynamic tables maintained
by guc.c.

Perhaps you could use "the GUC mechanisms" in these places, but it's a bit
longer than "guc.c".  Leaving such references alone seems OK too.

regards, tom lane




Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Alvaro Herrera
On 2023-Feb-24, Tomas Vondra wrote:

> I guess the easiest fix would be to do the arithmetic in 64 bits. That'd
> eliminate the overflow.

Yeah, that might be easy to set up.  We then don't have to worry about
it until BlockNumber is enlarged to 64 bits ... but by that time surely
we can just grow it again to a 128 bit loop variable.

> Alternatively, we could do something like
> 
>   prevHeapBlk = 0;
>   for (heapBlk = 0; (heapBlk < nblocks) && (prevHeapBlk <= heapBlk);
>heapBlk += pagesPerRange)
>   {
> ...
> prevHeapBlk = heapBlk;
>   }

I think a formulation of this kind has the benefit that it works after
BlockNumber is enlarged to 64 bits, and doesn't have to be changed ever
again (assuming it is correct).

... if pagesPerRange is not a whole divisor of MaxBlockNumber, I think
this will neglect the last range in the table.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: pgindent vs. git whitespace check

2023-02-24 Thread Peter Eisentraut

On 22.02.23 15:49, Alvaro Herrera wrote:

On 2023-Feb-22, Peter Eisentraut wrote:


In the meantime, I suggest we work around this, perhaps by

 conn = libpqsrv_connect_params(keywords, values, /* expand_dbname = */ false,
                                PG_WAIT_EXTENSION);


I suggest

  conn = libpqsrv_connect_params(keywords, values,
                                 false, /* expand_dbname */
                                 PG_WAIT_EXTENSION);

which is what we typically do elsewhere and doesn't go overlength.


Fixed this way.





how does postgresql handle LOB/CLOB/BLOB column data that dies before the query ends

2023-02-24 Thread Noel Grandin
Hi

Hacker from another open-source DB here (h2database.com).

How does postgresql handle the following situation?

(1) a table containing a LOB column
(2) a query that does
    ResultSet rs = query("select lob_column from table_foo");
    while (rs.next())
    {
        retrieve_lob_data(rs.getLob(1));
        // ... very long running stuff here ...
    }

In the face of concurrent updates that might overwrite the existing LOB
data, how does PostgresQL handle this?

Does it keep the LOB data around until the ResultSet/Connection is closed?
Or does it impose some extra constraint on the client side? e.g..
explicitly opening and closing a transaction, and only wipe the "old" LOB
data when the transaction is closed?

I ask because I have implemented two of the four LOB implementations that
H2 has used, and we are still having trouble :-(

Regards, Noel.


Re: buildfarm + meson

2023-02-24 Thread Andrew Dunstan


On 2023-02-23 Th 16:12, Andrew Dunstan wrote:



On 2023-02-23 Th 10:58, Andres Freund wrote:



On a Windows instance, fairly similar to what's running drongo, I can get a
successful build with meson+VS2019, but I'm getting an error in the
regression tests, which don't like setting lc_time to 'de_DE'. Not sure
what's going on there.

Huh, that's odd.

See my reply to Michael for details

I suspect the issue might be related to this:

+   local %ENV = (PATH => $ENV{PATH}, PGUSER => $ENV{PGUSER});
+   @makeout=run_log("meson test --logbase checklog --print-errorlogs --no-rebuild -C $pgsql --suite setup --suite regress");



I commented out the 'local %ENV' line and still got the error. I also 
got the same error running by hand.







On drongo, this test isn't failing, and I think the reason is that it 
runs "make NO_LOCALE=1 check" so it never gets a database with win1252 
encoding.


I'm going to try adding a win1252 test to drongo's locales.


cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Stale references to guc.c in comments/tests

2023-02-24 Thread Daniel Gustafsson
I happened to notice that there were a few references to guc.c regarding
variables, which with the recent refactoring in 0a20ff54f have become stale.
Attached is a trivial patch to instead point to guc_tables.c.

--
Daniel Gustafsson



0001-Fix-outdated-references-to-guc.c.patch
Description: Binary data


Re: meson: Non-feature feature options

2023-02-24 Thread Nazir Bilal Yavuz
Hi,

On Wed, 22 Feb 2023 at 12:14, Peter Eisentraut
 wrote:
>
> On 21.02.23 17:32, Nazir Bilal Yavuz wrote:
> >>> I like the second approach, with a 'uuid' feature option.  As you wrote
> >>> earlier, adding an 'auto' choice to a combo option doesn't work fully 
> >>> like a
> >>> real feature option.
> >> But we can make it behave exactly like one, by checking the auto_features
> >> option.
> > Yes, we can set it like `uuidopt = get_option('auto_features')`.
> > However, if someone wants to set 'auto_features' to 'disabled' but
> > 'uuid' to 'enabled'(to find at least one working uuid library); this
> > won't be possible. We can add 'enabled', 'disabled and 'auto' choices
> > to 'uuid' combo option to make all behaviours possible but adding
> > 'uuid' feature option and 'uuid_library' combo option seems better to
> > me.
>
> I think the uuid side of this is making this way too complicated.  I'm
> content leaving this as a manual option for now.
>
> There is much more value in making the ssl option work automatically.
> So I would welcome a patch that just makes -Dssl=auto work smoothly,
> perhaps using the "trick" that Andres described.
>

Thanks for the feedback. I updated the ssl patch, and if you like the
changes, I can apply the same logic to uuid.

Regards,
Nazir Bilal Yavuz
Microsoft


v2-0001-meson-Refactor-SSL-option.patch
Description: Binary data


Re: Proposal: :SQL_EXEC_TIME (like :ROW_COUNT) Variable (psql)

2023-02-24 Thread Jim Jones

On 23.02.23 20:55, Kirk Wolak wrote:

Everyone,
  I love that my proposal for %T in the prompt, triggered some great 
conversations.


  This is not instead of that.  That lets me run a query and come back 
HOURS later, and know it finished before 7PM like it was supposed to!


  This feature is simple.  We forget to set \timing on...

I've been there many times!

We run a query, and we WONDER... how long did  that take.

  This, too, should be a trivial problem (the code will tell).

  I am proposing this to get feedback (I don't have a final design in 
mind, but I will start by reviewing when and how ROW_COUNT gets set, 
and what \timing reports).


  Next up, as I learn (and make mistakes), this toughens me up...

  I am not sure the name is right, but I would like to report it in 
the same (ms) units as \timing, since there is an implicit 
relationship in what they are doing.


  I think like ROW_COUNT, it should not change because of internal 
commands.
So, you guys +1 this thing, give additional comments.  When the 
feedback settles, I commit to making it happen.


Thanks, Kirk

I can see it being pretty handy to check if a certain task involving two 
different terminal windows was done in the right order. Basically to see 
what went wrong, e.g. "did I really stop the master database before 
promoting the replica?"


+1 !
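If this lands, usage might look roughly like the following (SQL_EXEC_TIME is
only the proposed name; nothing of this exists yet, and big_table is made up):

   -- we forgot \timing, but still want to know how long the last query took
   SELECT count(*) FROM big_table;
   \echo :ROW_COUNT rows in :SQL_EXEC_TIME ms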

Best, Jim





Re: dynamic result sets support in extended query protocol

2023-02-24 Thread Peter Eisentraut

On 20.02.23 13:58, Peter Eisentraut wrote:
The attached patches are the same as before, rebased over master and 
split up as described.  I haven't done any significant work on the 
contents, but I will try to get the 0001 patch into a more polished 
state soon.


I've done a bit of work on this patch, mainly cleaned up and expanded 
the tests, and also added DO support, which is something that had been 
requested (meaning you can return result sets from DO with this 
facility).  Here is a new version.
From ada315925d02883833cc5f4bc95477b0217d9d66 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 24 Feb 2023 12:21:40 +0100
Subject: [PATCH v8 1/2] Dynamic result sets from procedures

Declaring a cursor WITH RETURN in a procedure makes the cursor's data
be returned as a result of the CALL (or DO) invocation.  The procedure
needs to be declared with the DYNAMIC RESULT SETS attribute.

Discussion: 
https://www.postgresql.org/message-id/flat/6e747f98-835f-2e05-cde5-86ee444a7...@2ndquadrant.com
---
 doc/src/sgml/catalogs.sgml|  10 ++
 doc/src/sgml/information_schema.sgml  |   3 +-
 doc/src/sgml/plpgsql.sgml |  27 +++-
 doc/src/sgml/ref/alter_procedure.sgml |  12 ++
 doc/src/sgml/ref/create_procedure.sgml|  14 ++
 doc/src/sgml/ref/declare.sgml |  35 -
 src/backend/catalog/information_schema.sql|   2 +-
 src/backend/catalog/pg_aggregate.c|   3 +-
 src/backend/catalog/pg_proc.c |   4 +-
 src/backend/catalog/sql_features.txt  |   2 +-
 src/backend/commands/functioncmds.c   |  94 +++--
 src/backend/commands/portalcmds.c |  23 
 src/backend/commands/typecmds.c   |  12 +-
 src/backend/parser/gram.y |  18 ++-
 src/backend/tcop/postgres.c   |  37 -
 src/backend/utils/errcodes.txt|   1 +
 src/backend/utils/mmgr/portalmem.c|  48 +++
 src/bin/pg_dump/pg_dump.c |  16 ++-
 src/include/catalog/pg_proc.h |   6 +-
 src/include/commands/defrem.h |   1 +
 src/include/nodes/parsenodes.h|   1 +
 src/include/parser/kwlist.h   |   2 +
 src/include/utils/portal.h|  12 ++
 src/pl/plpgsql/src/Makefile   |   2 +-
 .../src/expected/plpgsql_with_return.out  | 105 ++
 src/pl/plpgsql/src/meson.build|   1 +
 src/pl/plpgsql/src/pl_exec.c  |   6 +
 src/pl/plpgsql/src/pl_gram.y  |  58 +++-
 src/pl/plpgsql/src/pl_unreserved_kwlist.h |   2 +
 .../plpgsql/src/sql/plpgsql_with_return.sql   |  64 +
 .../regress/expected/dynamic_result_sets.out  | 129 ++
 src/test/regress/parallel_schedule|   2 +-
 src/test/regress/sql/dynamic_result_sets.sql  |  90 
 33 files changed, 803 insertions(+), 39 deletions(-)
 create mode 100644 src/pl/plpgsql/src/expected/plpgsql_with_return.out
 create mode 100644 src/pl/plpgsql/src/sql/plpgsql_with_return.sql
 create mode 100644 src/test/regress/expected/dynamic_result_sets.out
 create mode 100644 src/test/regress/sql/dynamic_result_sets.sql

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index c1e4048054..5baec4dc3a 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -6041,6 +6041,16 @@ pg_proc Columns
   
  
 
+ 
+  
+   prodynres int4
+  
+  
+   For procedures, this records the maximum number of dynamic result sets
+   the procedure may create.  Otherwise zero.
+  
+ 
+
  
   
pronargs int2
diff --git a/doc/src/sgml/information_schema.sgml 
b/doc/src/sgml/information_schema.sgml
index 350c75bc31..5fc9dc22ae 100644
--- a/doc/src/sgml/information_schema.sgml
+++ b/doc/src/sgml/information_schema.sgml
@@ -5885,7 +5885,8 @@ routines Columns
max_dynamic_result_sets 
cardinal_number
   
   
-   Applies to a feature not available in 
PostgreSQL
+   For a procedure, the maximum number of dynamic result sets.  Otherwise
+   zero.
   
  
 
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index 8897a5450a..0c0d77b0e6 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -3128,7 +3128,7 @@ Declaring Cursor Variables
  Another way is to use the cursor declaration syntax,
  which in general is:
 
-name   NO  SCROLL 
 CURSOR  ( arguments ) 
 FOR query;
+name   NO  SCROLL 
 CURSOR   WITH RETURN  ( 
arguments )  FOR 
query;
 
  (FOR can be replaced by IS for
  Oracle compatibility.)
@@ -3136,6 +3136,10 @@ Declaring Cursor Variables
  scrolling backward; if NO SCROLL is specified, backward
  fetches will be rejected; if neither specification appears, it is
  query-dependent whether backward fetches will be allowed.
+ If WITH RETURN is specified, 

Re: allow meson to find ICU in non-standard localtion

2023-02-24 Thread Nazir Bilal Yavuz
Hi,

Thanks for the patch.

On Wed, 22 Feb 2023 at 21:26, Jeff Davis  wrote:
>
> I'm not sure it's the right thing to do though. One downside is that it
> doesn't output the version that it finds, it only outputs "YES".

-  icu = dependency('icu-uc', required: icuopt.enabled())
-  icu_i18n = dependency('icu-i18n', required: icuopt.enabled())

I think you can do the dependency checks with 'required: false' first,
and if the libraries aren't found that way, fall back to
cc.find_library() checks. This also solves the problem of only
outputting "YES" when they are found by the dependency checks.

Regards,
Nazir Bilal Yavuz
Microsoft




Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Tomas Vondra



On 2/24/23 09:39, Alvaro Herrera wrote:
> On 2023-Feb-23, Matthias van de Meent wrote:
> 
>>> + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
>>
>> I am not familiar with the frequency of max-sized relations, but this
>> would go into an infinite loop for pagesPerRange values >1 for
>> max-sized relations due to BlockNumber wraparound. I think there
>> should be some additional overflow checks here.
> 
> They are definitely not very common -- BlockNumber wraps around at 32 TB
> IIUC.  But yeah, I guess it is a possibility, and perhaps we should find
> a way to write these loops in a more robust manner.
> 

I guess the easiest fix would be to do the arithmetic in 64 bits. That'd
eliminate the overflow.

Alternatively, we could do something like

  prevHeapBlk = 0;
  for (heapBlk = 0; (heapBlk < nblocks) && (prevHeapBlk <= heapBlk);
   heapBlk += pagesPerRange)
  {
...
prevHeapBlk = heapBlk;
  }
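And to spell out the first option, a rough sketch of the 64-bit arithmetic
variant (same variable names as above):

  uint64  blkno;

  /* sketch only: with 64-bit arithmetic the increment cannot wrap around,
   * so the loop terminates even for relations near the BlockNumber limit */
  for (blkno = 0; blkno < (uint64) nblocks; blkno += pagesPerRange)
  {
    BlockNumber heapBlk = (BlockNumber) blkno;

    /* ... process the page range starting at heapBlk ... */
  }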


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-02-24 Thread Anton A. Melnikov

Hi, Thomas!

On 17.02.2023 06:21, Thomas Munro wrote:

There are two kinds of atomicity that we rely on for the control file today:

* atomicity on power loss (= device property, in case of overwrite filesystems)
* atomicity of concurrent reads and writes (= VFS or kernel buffer
pool interlocking policy)

I assume that documentation is talking about the first thing 


I think this is true, as documentation says about write operations only,
but not about the read ones.


(BTW, I
suspect that documentation is also now wrong in one special case: NTFS
filesystems mounted on DAX devices don't have sectors or sector-based
atomicity unless you turn on BTT and slow them down[1]; that might
eventually be something for PostgreSQL to think about, and it applies
to other OSes too).


Very interesting find! For instance, the volume of Intel® Optane™ Persistent
Memory already reaches 512GB and can potentially be used for cluster data. As a
first step it would be good to understand what Microsoft means by large memory
pages and what sizes they are talking about, that is, where the reasonable
boundary for using BTT lies; I suppose this will help choose whether to use
BTT or whether we have to write our own DAX-aware code.


With this patch we would stop relying on the second thing.  Supposedly
POSIX requires read/write atomicity, and many file systems offer it in
a strong form (internal range locking) or maybe a weak accidental form
(page-level locking).  Since some extremely popular implementations
just don't care, and Windows isn't even a POSIX, we just have to do it
ourselves.


Yes, indeed. Unless it's an atomic or transactional filesystem, in
which case there is almost nothing to do. Another thing is that
such systems do not seem to exist in reality, although the idea has been
discussed many times. I've googled some information on this topic, e.g.
[1], [2], [3], but all of the projects were abandoned or deprecated, as
was Microsoft's own development.


BTW there are at least two other places where PostgreSQL already knows
that concurrent reads and writes are possibly non-atomic (and we also
don't even try to get the alignment right, making the question moot):
pg_basebackup, which enables full_page_writes implicitly if you didn't
have the GUC on already, and pg_rewind, which refuses to run if you
haven't enabled full_page_writes explicitly (as recently discussed on
another thread recently; that's an annoying difference, and also an
annoying behaviour if you know your storage doesn't really need it!)


It seems like a good topic for a separate thread and patch. Would you
please provide a link to the thread you mentioned?
 

Therefore, we need a solution for Windows too.  I tried to write the
equivalent code, in the attached.  I learned a few things: Windows
locks are mandatory (that is, if you lock a file, reads/writes can
fail, unlike Unix).  Combined with async release, that means we
probably also need to lock the file in xlog.c, when reading it in
xlog.c:ReadControlFile() (see comments).  Before I added that, on a CI
run, I found that the read in there would sometimes fail, and adding
the locking fixed that.  I am a bit confused, though, because I
expected that to be necessary only if someone managed to crash while
holding the write lock, which the CI tests shouldn't do.  Do you have
any ideas?


Unfortunately, no ideas so far. Was it a pg_control CRC error or an I/O error?
Maybe the logs of such a failure were saved somewhere? I would like to see
them if possible.


While contemplating what else a mandatory file lock might break, I
remembered that basebackup.c also reads the control file.  Hrmph.  Not
addressed yet; I guess it might need to acquire/release around
sendFile(sink, XLOG_CONTROL_FILE, ...)?


Here, possibly pass a boolean flag into sendFile()? When it is true,
then take a lock after OpenTransientFile() and release it before
CloseTransientFile() if under Windows.

There are also two other places where reads or writes of pg_control
occur: the functions WriteControlFile() in xlog.c and
read_controlfile() in pg_resetwal.c.
For the second case, locking is definitely not necessary, as
the server is stopped. The same seems true for the first case, as
BootStrapXLOG(), from which WriteControlFile() is called, is executed
only once, at system install time.

Since I've smoothly moved on to code review here, a suggestion at your
discretion: add error messages to get_controlfile()
and update_controlfile() if unlock_controlfile() fails.


I think we should just make fdatasync the default on all
systems.


Agreed. And maybe choose UPDATE_CONTROLFILE_FDATASYNC as the default
case in UpdateControlFile(), since fdatasync is now the default on all
systems and the file's metadata is of no interest?


On 22.02.2023 03:10, Thomas Munro wrote:

If we go this way, I suppose, in theory at least, someone with
external pg_backup_start()-based tools might also want to hold the
lock while copying pg_control.  Otherwise they might fail to open it
on Windows (where that patch uses a 

Re: TAP output format in pg_regress

2023-02-24 Thread Daniel Gustafsson
Another rebase on top of 337903a16f.  Unless there are conflicting reviews, I
consider this patch to be done and ready for going in during the next CF.

--
Daniel Gustafsson



v17-0001-Emit-TAP-compliant-output-from-pg_regress.patch
Description: Binary data


Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size

2023-02-24 Thread Corey Huinker
On Thu, Feb 23, 2023 at 2:39 PM Andres Freund  wrote:

> Hi,
>
> On 2023-02-23 13:56:56 -0500, Tom Lane wrote:
> > Corey Huinker  writes:
> > > My not-ready-for-16 work on CAST( ... ON DEFAULT ... ) involved making
> > > FuncExpr/IoCoerceExpr/ArrayCoerceExpr have a safe_mode flag, and that
> > > necessitates adding a reserror boolean to ExprEvalStep for subsequent
> steps
> > > to test if the error happened.
> >
> > Why do you want it in ExprEvalStep ... couldn't it be in ExprState?
> > I can't see why you'd need more than one at a time during evaluation.
>
> I don't know exactly what CAST( ... ON DEFAULT ... ) is aiming for - I
> guess
> it wants to assign a different value when the cast fails?  Is the default
> expression a constant, or does it need to be runtime evaluated?  If a
> const,
> then the cast steps just could assign the new value. If runtime evaluation
> is
> needed I'd expect the various coerce steps to jump to the value
> implementing
> the default expression in case of a failure.
>

The default expression is itself a cast expression. So CAST (expr1 AS
some_type DEFAULT expr2 ON ERROR) would basically be a safe-mode cast of
expr1 to some_type, and only upon failure would the non-safe cast of expr2
to some_type be executed. Granted, the most common use case would be for
expr2 to be a constant or something that folds into a constant, but the
proposed spec allows for it.
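For concreteness, the proposed syntax (not accepted by any released
PostgreSQL; table and column made up) would look roughly like this:

  SELECT CAST('123abc' AS integer DEFAULT -1 ON ERROR);  -- safe cast fails, yields -1
  SELECT CAST('42'     AS integer DEFAULT -1 ON ERROR);  -- safe cast succeeds, yields 42
  SELECT CAST(x AS numeric DEFAULT length(x) ON ERROR)   -- the default may itself be
  FROM t;                                                -- a runtime expression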

My implementation involved adding a setting to CoalesceExpr that tested for
error flags rather than null flags, hence putting it in ExprEvalStep and
ExprState (perhaps mistakenly). Copying and adapting EEOP_JUMP_IF_NOT_NULL
led me to this:

  EEO_CASE(EEOP_JUMP_IF_NOT_ERROR)
  {
  /* Transfer control if current result is non-error */
  if (!*op->reserror)
  {
  *op->reserror = false;
  EEO_JUMP(op->d.jump.jumpdone);
  }

  /* reset error flag */
  *op->reserror = false;

  EEO_NEXT();
  }


Re: Doc updates for MERGE

2023-02-24 Thread Dean Rasheed
On Fri, 24 Feb 2023 at 08:56, Alvaro Herrera  wrote:
>
> Agreed.  Your patch looks good to me.
>
> I was confused for a bit about arch-dev.sgml talking about ModifyTable
> when perform.sgml talks about Insert/Update et al; I thought at first
> that one or the other was in error, so I checked.  It turns out that
> they are both correct, because arch-dev is talking about code-level
> executor nodes while perform.sgml is talking about how it looks under
> EXPLAIN.  So, it all looks good.
>

Cool. Thanks for checking.

> I assume you're proposing to back-patch this.
>

Yes. Will do.

Regards,
Dean




Re: Add LZ4 compression in pg_dump

2023-02-24 Thread gkokolatos






--- Original Message ---
On Friday, February 24th, 2023 at 5:35 AM, Michael Paquier 
 wrote:


> 
> 
> On Thu, Feb 23, 2023 at 07:51:16PM -0600, Justin Pryzby wrote:
> 
> > On Thu, Feb 23, 2023 at 09:24:46PM +0100, Tomas Vondra wrote:
> > 
> > > I've now pushed 0002 and 0003, after minor tweaks (a couple typos etc.),
> > > and marked the CF entry as committed. Thanks for the patch!
> > 
> > A big thanks from me to everyone involved.
> 
> 
> Wow, nice! The APIs are clear to follow.

I am out of words, thank you all so very much. I learned a lot. 

> 
> > I'll send a patch soon. I first submitted patches for that 2 years ago
> > (before PGDG was ready to add zstd).
> > https://commitfest.postgresql.org/31/2888/
> 
> 
> Thanks. It should be straight-forward to see that in 16, I guess.
> --
> Michael




Re: Doc updates for MERGE

2023-02-24 Thread Alvaro Herrera
On 2023-Feb-24, Dean Rasheed wrote:

> Attached is a patch fixing a few doc omissions for MERGE.
> 
> I don't think that it's necessary to update every place that could
> possibly apply to MERGE, but there are a few places where we give a
> list of commands that may be used in a particular context, and I think
> those should mention MERGE, if it applies.

Agreed.  Your patch looks good to me.

I was confused for a bit about arch-dev.sgml talking about ModifyTable
when perform.sgml talks about Insert/Update et al; I thought at first
that one or the other was in error, so I checked.  It turns out that
they are both correct, because arch-dev is talking about code-level
executor nodes while perform.sgml is talking about how it looks under
EXPLAIN.  So, it all looks good.

I assume you're proposing to back-patch this.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La verdad no siempre es bonita, pero el hambre de ella sí"




Re: Should vacuum process config file reload more often

2023-02-24 Thread Pavel Borisov
Hi, Melanie!

On Fri, 24 Feb 2023 at 02:08, Melanie Plageman
 wrote:
>
> Hi,
>
> Users may wish to speed up long-running vacuum of a large table by
> decreasing autovacuum_vacuum_cost_delay/vacuum_cost_delay, however the
> config file is only reloaded between tables (for autovacuum) or after
> the statement (for explicit vacuum). This has been brought up for
> autovacuum in [1].
>
> Andres suggested that it might be possible to check ConfigReloadPending
> in vacuum_delay_point(), so I thought I would draft a rough patch and
> start a discussion.
>
> Since vacuum_delay_point() is also called by analyze and we do not want
> to reload the configuration file if we are in a user transaction, I
> widened the scope of the in_outer_xact variable in vacuum() and allowed
> analyze in a user transaction to default to the current configuration
> file reload cadence in PostgresMain().
>
> I don't think I can set and leave vac_in_outer_xact the way I am doing
> it in this patch, since I use vac_in_outer_xact in vacuum_delay_point(),
> which I believe is reachable from codepaths that would not have called
> vacuum(). It seems that if a backend sets it, the outer transaction
> commits, and then the backend ends up calling vacuum_delay_point() in a
> different way later, it wouldn't be quite right.
>
> Apart from this, one higher level question I have is if there are other
> gucs whose modification would make reloading the configuration file
> during vacuum/analyze unsafe.

I have a couple of small questions:
Can this patch also pick up the current GUC value if it's modified by the
SET command, without editing the config file?
What happens if we modify the config file with mistakes? (When we try to
start the cluster with an erroneous config file it will fail to start;
I'm not sure about a re-read of the config.)

Overall the proposal seems legit and useful.
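For concreteness, a rough sketch of the idea as described above, i.e.
checking ConfigReloadPending from vacuum_delay_point(), with
vac_in_outer_xact being the variable Melanie mentions:

  /* sketch only: pick up config file changes between cost-delay sleeps,
   * but not while we are inside a user transaction */
  if (ConfigReloadPending && !vac_in_outer_xact)
  {
    ConfigReloadPending = false;
    ProcessConfigFile(PGC_SIGHUP);
    /* cost-based delay GUCs now reflect the reloaded values */
  }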

Kind regards,
Pavel Borisov




Re: Support logical replication of DDLs

2023-02-24 Thread Masahiko Sawada
On Tue, Feb 21, 2023 at 11:09 AM Zheng Li  wrote:
>
> On Mon, Feb 20, 2023 at 3:23 AM Masahiko Sawada  wrote:
> >
> > On Fri, Feb 17, 2023 at 1:13 PM Zheng Li  wrote:
> > >
> > > > > I've implemented a prototype to allow replicated objects to have the
> > > > > same owner from the publisher in
> > > > > v69-0008-Allow-replicated-objects-to-have-the-same-owner-from.patch.
> > > > >
> > > >
> > > > I also think it would be a helpful addition for users.A few points
> > > Thanks for supporting this addition.
> > >
> > > > that come to my mind are: (a) Shouldn't the role have the same
> > > > privileges (for ex. rolbypassrls or rolsuper) on both sides before we
> > > > allow this? (b) Isn't it better to first have a replication of roles?
> > >
> > > > I think if we have (b) then it would be probably a bit easier because
> > > > if the subscription has allowed replicating roles and we can confirm
> > > > that the role is replicated then we don't need to worry about the
> > > > differences.
> > >
> > > Yes, having role replication will help further reduce the manual
> > > effort. But even if we don't end up doing role replication soon, I
> > > think we can still provide this subscription option (match_ddl_owner,
> > > off by default) and document that the same roles need to be on both
> > > sides for it to work.
> >
> > From the user perspective, I expect that the replicated objects are
> > created on the subscriber by the same owner as the publisher, by
> > default.
>
> OK, I agree. I think the use cases for matching the owner are likely
> more than the other way around. I can make the subscription option
> "match_ddl_owner" on by default in the next version.
>
> > I think that the same name users must exist on both sides (by
> > role replication or manually if not supported yet) but the privileges
> > of the role doesn't necessarily need to match. IOW, it's sufficient
> > that the role on the subscriber has enough privileges to create the
> > object.
>
> This is also my understanding.
>
> > > > Now, coming to implementation, won't it be better if we avoid sending
> > > > the owner to the subscriber unless it is changed for the replicated
> > > > command? Consider the current case of tables where we send schema only
> > > > if it is changed. This is not a direct mapping but it would be better
> > > > to avoid sending additional information and then process it on the
> > > > subscriber for each command.
> > >
> > > Right, we can do some optimization here: only send the owner for
> > > commands that create objects (CREATE TABLE/FUNCTION/INDEX etc.) Note
> > > that ALTER TABLE/OBJECT OWNER TO is replicated so we don't need to
> > > worry about owner change.
> >
> > What role will be used for executing ALTER and DROP commands on the
> > subscriber? the subscription owner?
>
> Yes, I think DROP and ALTER commands (and other non-CREATE commands)
> can be executed by the subscription owner (superuser).

I think the subscription owner might not be a superuser in the future
as we are discussing on this thread[1].

Regards,

[1] 
https://www.postgresql.org/message-id/flat/9DFC88D3-1300-4DE8-ACBC-4CEF84399A53%40enterprisedb.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-02-24 Thread Masahiko Sawada
On Thu, Feb 23, 2023 at 6:41 PM John Naylor
 wrote:
>
> I ran a couple "in situ" tests on server hardware using UUID columns, since 
> they are common in the real world and have bad correlation to heap order, so 
> are a challenge for index vacuum.

Thank you for the test!

>
> === test 1, delete everything from a small table, with very small 
> maintenance_work_mem:
>
> alter system set shared_buffers ='4GB';
> alter system set max_wal_size ='10GB';
> alter system set checkpoint_timeout ='30 min';
> alter system set autovacuum =off;
>
> -- unrealistically low
> alter system set maintenance_work_mem = '32MB';
>
> create table if not exists test (x uuid);
> truncate table test;
> insert into test (x) select gen_random_uuid() from 
> generate_series(1,50*1000*1000);
> create index on test (x);
>
> delete from test;
> vacuum (verbose, truncate off) test;
> --
>
> master:
> INFO:  finished vacuuming "john.naylor.public.test": index scans: 9
> system usage: CPU: user: 70.04 s, system: 19.85 s, elapsed: 802.06 s
>
> v29 patch:
> INFO:  finished vacuuming "john.naylor.public.test": index scans: 1
> system usage: CPU: user: 9.80 s, system: 2.62 s, elapsed: 36.68 s
>
> This is a bit artificial, but it's easy to construct cases where the array 
> leads to multiple index scans but the new tid store can fit everything without 
> breaking a sweat. I didn't save the progress reporting, but v29 was using 
> about 11MB for tid storage.

Cool.

>
>
> === test 2: try to stress tid lookup with production maintenance_work_mem:
> 1. use unlogged table to reduce noise
> 2. vacuum freeze first to reduce heap scan time
> 3. delete some records at the beginning and end of heap to defeat binary 
> search's pre-check
>
> alter system set shared_buffers ='4GB';
> alter system set max_wal_size ='10GB';
> alter system set checkpoint_timeout ='30 min';
> alter system set autovacuum =off;
>
> alter system set maintenance_work_mem = '1GB';
>
> create unlogged table if not exists test (x uuid);
> truncate table test;
> insert into test (x) select gen_random_uuid() from 
> generate_series(1,1000*1000*1000);
> vacuum freeze test;
>
> select pg_size_pretty(pg_table_size('test'));
>  pg_size_pretty
> ----------------
>  41 GB
>
> create index on test (x);
>
> select pg_size_pretty(pg_total_relation_size('test'));
>  pg_size_pretty
> ----------------
>  71 GB
>
> select max(ctid) from test;
>  max
> --------------
>  (5405405,75)
>
> delete from test where ctid <  '(10,0)'::tid;
> delete from test where ctid > '(530,0)'::tid;
>
> vacuum (verbose, truncate off) test;
>
> both:
> INFO:  vacuuming "john.naylor.public.test"
> INFO:  finished vacuuming "john.naylor.public.test": index scans: 1
> index scan needed: 205406 pages from table (3.80% of total) had 3800 dead 
> item identifiers removed
>
> --
> master:
> system usage: CPU: user: 134.32 s, system: 19.24 s, elapsed: 286.14 s
>
> v29 patch:
> system usage: CPU: user:  97.71 s, system: 45.78 s, elapsed: 573.94 s

In v29 vacuum took twice as long (286 s vs. 573 s)?

>
> The entire vacuum took 25% less wall clock time. Reminder that this is 
> without wal logging, and also unscientific because only one run.
>
> --
> I took 10 seconds of perf data while index vacuuming was going on (showing 
> calls > 2%):
>
> master:
>   40.59%  postgres  postgres[.] vac_cmp_itemptr
>   24.97%  postgres  libc-2.17.so[.] bsearch
>6.67%  postgres  postgres[.] btvacuumpage
>4.61%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
>3.48%  postgres  postgres[.] PageIndexMultiDelete
>2.67%  postgres  postgres[.] vac_tid_reaped
>2.03%  postgres  postgres[.] compactify_tuples
>2.01%  postgres  libc-2.17.so[.] __memcpy_ssse3_back
>
> v29 patch:
>
>   29.22%  postgres  postgres[.] TidStoreIsMember
>9.30%  postgres  postgres[.] btvacuumpage
>7.76%  postgres  postgres[.] PageIndexMultiDelete
>6.31%  postgres  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
>5.60%  postgres  postgres[.] compactify_tuples
>4.26%  postgres  libc-2.17.so[.] __memcpy_ssse3_back
>4.12%  postgres  postgres[.] hash_search_with_hash_value
>
> --
> master:
> psql -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuples, 
> num_dead_tuples from pg_stat_progress_vacuum"
>        phase        | heap_blks_total | heap_blks_scanned | max_dead_tuples | num_dead_tuples
> --------------------+-----------------+-------------------+-----------------+-----------------
>  vacuuming indexes  |         5405406 |           5405406 |       178956969 |            3800
>
> v29 patch:
> psql  -c "select phase, heap_blks_total, heap_blks_scanned, 
> max_dead_tuple_bytes, dead_tuple_bytes from pg_stat_progress_vacuum"
>    phase   | heap_blks_total | heap_blks_scanned | max_dead_tuple_bytes | dead_tuple_bytes
> 

Re: PATCH: Using BRIN indexes for sorted output

2023-02-24 Thread Alvaro Herrera
On 2023-Feb-23, Matthias van de Meent wrote:

> > + for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
> 
> I am not familiar with the frequency of max-sized relations, but this
> would go into an infinite loop for pagesPerRange values >1 for
> max-sized relations due to BlockNumber wraparound. I think there
> should be some additional overflow checks here.

They are definitely not very common -- BlockNumber wraps around at 32 TB
IIUC.  But yeah, I guess it is a possibility, and perhaps we should find
a way to write these loops in a more robust manner.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"World domination is proceeding according to plan"(Andrew Morton)