Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Thu, Jan 18, 2024 at 4:16 AM torikoshia wrote:
> On 2024-01-18 10:10, jian he wrote:
> > On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada wrote:
> >> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane wrote:
> >> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
> >> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
> >> > You will need a separate parameter anyway to specify the destination
> >> > of "log", unless "none" became an illegal table name when I wasn't
> >> > looking.  I don't buy that one parameter that has some special values
> >> > while other values could be names will be a good design.  Moreover,
> >> > what if we want to support (say) log-to-file along with log-to-table?
> >> > Trying to distinguish a file name from a table name without any other
> >> > context seems impossible.
> >>
> >> I've been thinking we can add more values to this option to log errors
> >> not only to the server logs but also to the error table (not sure
> >> details but I imagined an error table is created for each table on
> >> error), without an additional option for the destination name. The
> >> values would be like error_action {error|ignore|save-logs|save-table}.
> >>
> >
> > another idea:
> > on_error {error|ignore|other_future_option}
> > if not specified then by default ERROR.
> > You can also specify ERROR or IGNORE for now.
> >
> > I agree, the parameter "error_action" is better than "location".
>
> I'm not sure whether error_action or on_error is better, but either way
> "error_action error" and "on_error error" seems a bit odd to me.
> I feel "stop" is better for both cases as Tom suggested.

OK. What about this?

    on_error {stop|ignore|other_future_option}

where other_future_option might be compound like "file 'copy.log'" or "table 'copy_log'".

--
Regards,
Alexander Korotkov
Re: Make all Perl warnings fatal
On 16.01.24 12:08, Bharath Rupireddy wrote:

On Fri, Jan 12, 2024 at 9:21 PM Bharath Rupireddy wrote:

On Fri, Jan 12, 2024 at 9:03 PM Peter Eisentraut wrote: I would put this code

    my $core = $ret & 128 ? " (core dumped)" : "";
    die "psql exited with signal "
      . ($ret & 127)
      . "$core: '$$stderr' while running '@psql_params'"
      if $ret & 127;
    $ret = $ret >> 8;

inside a if (defined $ret) block. Then the behavior would be that the whole function returns undef on timeout, which is usefully different from returning 0 (and matches previous behavior).

WFM. I've attached a patch for the above change.

Committed, thanks.
Re: [17] CREATE SUBSCRIPTION ... SERVER
On Tue, 2024-01-16 at 09:23 +0530, Bharath Rupireddy wrote: > 1. > May be a more descriptive note is > worth here instead of just saying "Load the library providing us > libpq calls."? OK, will be in the next patch set. > 2. Why not typedef keyword before the ConnectionOption structure? Agreed. An earlier unpublished iteration had the struct more localized, but here it makes more sense to be typedef'd. > 3. > +static const struct ConnectionOption * > +libpqrcv_conninfo_options(void) > > Why is libpqrcv_conninfo_options returning the const > ConnectionOption? I did that so I could save the result, and each subsequent call would be free (just returning the same pointer). That also means that the caller doesn't need to free the result, which would require another entry point in the API. > Is it that we don't expect callers to modify the result? I think it's > not needed given the fact that PQconndefaults doesn't constify the > return value. The result of PQconndefaults() can change from call to call when the defaults change. libpqrcv_conninfo_options() only depends on the available option names (and dispchar), which should be a static list. > 4. > + /* skip options that must be overridden */ > + if (strcmp(option, "client_encoding") == 0) > + return false; > + > > Options that must be overriden or disallow specifiing > "client_encoding" in the SERVER/USER MAPPING definition (just like > the > dblink)? I'm not quite sure of your question, but I'll try to improve the comment. > 5. > "By using the correct libpq options, it no longer needs to be > deprecated, and can be used by the upcoming pg_connection_fdw." > > Use of postgresql_fdw_validator for pg_connection_fdw seems a bit odd > to me. I don't mind pg_connection_fdw having its own validator > pg_connection_fdw_validator even if it duplicates the code. 
To avoid > code duplication we can move the guts to an internal function in > foreign.c so that both postgresql_fdw_validator and > pg_connection_fdw_validator can use it. This way the code is cleaner > and we can just leave postgresql_fdw_validator as deprecated. Will do so in the next patch set. Thank you for taking a look. Regards, Jeff Davis
Re: Catalog domain not-null constraints
On 17.01.24 13:15, vignesh C wrote:
> One of the test has failed in CFBot at [1] with:
>
> diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/domain.out /tmp/cirrus-ci-build/src/test/regress/results/domain.out
> --- /tmp/cirrus-ci-build/src/test/regress/expected/domain.out 2024-01-14 15:40:01.793434601 +
> +++ /tmp/cirrus-ci-build/src/test/regress/results/domain.out 2024-01-14 15:42:23.013332625 +
> @@ -1271,11 +1271,4 @@
>   FROM information_schema.domain_constraints
>   WHERE domain_name IN ('con', 'dom', 'pos_int', 'things'))
>   ORDER BY constraint_name;
> - constraint_catalog | constraint_schema | constraint_name  | check_clause
> ---------------------+-------------------+------------------+-------------------
> - regression         | public            | con_check        | (VALUE > 0)
> - regression         | public            | meow             | (VALUE < 11)
> - regression         | public            | pos_int_check    | (VALUE > 0)
> - regression         | public            | pos_int_not_null | VALUE IS NOT NULL
> -(4 rows)
> -
> +ERROR:  could not open relation with OID 36379
>
> [1] - https://cirrus-ci.com/task/4536440638406656
> [2] - https://api.cirrus-ci.com/v1/artifact/task/4536440638406656/log/src/test/regress/regression.diffs

Interesting. I couldn't reproduce this locally, even across different operating systems. The cfbot failures appear to be sporadic, but they are happening across multiple systems, so it's clearly not just a local environment failure. Can anyone else perhaps reproduce this locally?
Re: Reduce useless changes before reassembly during logical replication
On Wed, Jan 17, 2024 at 11:45 AM li jie wrote: > > Hi hackers, > > During logical replication, if there is a large write transaction, some > spill files will be written to disk, depending on the setting of > logical_decoding_work_mem. > > This behavior can effectively avoid OOM, but if the transaction > generates a lot of change before commit, a large number of files may > fill the disk. For example, you can update a TB-level table. > > However, I found an inelegant phenomenon. If the modified large table is not > published, its changes will also be written with a large number of spill > files. > Look at an example below: Thanks. I agree that decoding and queuing the changes of unpublished tables' data into reorder buffer is an unnecessary task for walsender. It takes processing efforts (CPU overhead), consumes disk space and uses memory configured via logical_decoding_work_mem for a replication connection inefficiently. > Later you will see a large number of spill files in the > > We can see that table tbl_t1 is not published in mypub. It also won't be sent > downstream because it's not subscribed. > After the transaction is reorganized, the pgoutput decoding plugin filters out > changes to these unpublished relationships when sending logical changes. > See function pgoutput_change. Right. Here's my testing [1]. > Most importantly, if we filter out unpublished relationship-related > changes after constructing the changes but before queuing the changes > into a transaction, will it reduce the workload of logical decoding > and avoid disk > or memory growth as much as possible? Right. It can. > The patch in the attachment is a prototype, which can effectively reduce the > memory and disk space usage during logical replication. > > Design: > 1. Added a callback LogicalDecodeFilterByRelCB for the output plugin. > > 2. Added this callback function pgoutput_table_filter for the pgoutput plugin. 
> Its main implementation is based on the table filter in the > pgoutput_change function. > Its main function is to determine whether the change needs to be published > based > on the parameters of the publication, and if not, filter it. > > 3. After constructing a change and before Queue a change into a transaction, > use RelidByRelfilenumber to obtain the relation associated with the change, > just like obtaining the relation in the ReorderBufferProcessTXN function. > > 4. Relation may be a toast, and there is no good way to get its real > table relation based on toast relation. Here, I get the real table oid > through toast relname, and then get the real table relation. > > 5. This filtering takes into account INSERT/UPDATE/INSERT. Other > changes have not been considered yet and can be expanded in the future. Design of this patch is based on the principle of logical decoding filtering things out early on and looks very similar to filter_prepare_cb_wrapper/pg_decode_filter_prepare and filter_by_origin_cb/pgoutput_origin_filter. Per my understanding this design looks okay unless I'm missing anything. > Test: > 1. Added a test case 034_table_filter.pl > 2. Like the case above, create two tables, the published table tbl_pub and > the non-published table tbl_t1 > 3. Insert 10,000 rows of toast data into tbl_t1 on the publisher, and use > pg_ls_replslotdir to record the total size of the slot directory > every second. > 4. Compare the size of the slot directory at the beginning of the > transaction(size1), > the size at the end of the transaction (size2), and the average > size of the entire process(size3). > 5. Assert(size1==size2==size3) I bet that the above test with 10K rows is going to take a noticeable time on some buildfarm members (it took 6 seconds on my dev system which is an AWS EC2 instance). And, the above test can get flaky. 
Therefore, IMO, the concrete way of testing this feature is by looking at the server logs for the following message using PostgreSQL::Test::Cluster log_contains(). +filter_done: + +if (result && RelationIsValid(relation)) +elog(DEBUG1, "logical filter change by table %s", RelationGetRelationName(relation)); + Here are some comments on the v1 patch: 1. @@ -1415,9 +1419,6 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, TupleTableSlot *old_slot = NULL; TupleTableSlot *new_slot = NULL; -if (!is_publishable_relation(relation)) -return; - Instead of removing is_publishable_relation from pgoutput_change, I think it can just be turned into an assertion Assert(is_publishable_relation(relation));, no? 2. +switch (change->action) +{ +/* intentionally fall through */ Perhaps, it must use /* FALLTHROUGH */ just like elsewhere in the code, otherwise a warning is thrown. 3. From commit message: Most of the code in the FilterByTable function is transplanted from the ReorderBufferProcessTXN function, which can be called before the ReorderBufferQueueChange function.It is I think the above note can just be
Re: A performance issue with Memoize
Hi Richard,

I can tell this is a real-world problem. I have seen this multiple times in production. The fix seems surprisingly simple.

I hope my questions here aren't completely off. I still struggle to think about the implications. I wonder if there is anything we would break by bluntly forgetting about the subplan params. Maybe some table-valued function scan within a subquery scan? Or something about casts on a join condition that could be performed differently? I wasn't able to construct a problem case. I might just be missing context here, but I am not yet fully convinced this is safe to do in all cases.

Regards
Arne
Re: heavily contended lwlocks with long wait queues scale badly
On Tue, Jan 16, 2024 at 03:11:48PM +0900, Michael Paquier wrote: > I'd like to apply that, just let me know if you have any comments > and/or objections. And done on 12~15. While on it, I have also looked at source code references on github and debian that involve lwWaiting, and all of them rely on lwWaiting when not waiting, making LW_WS_NOT_WAITING an equivalent. -- Michael signature.asc Description: PGP signature
Re: subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo
On Thu, Jan 18, 2024 at 12:55 PM Masahiko Sawada wrote: > ... > > Although we can improve it to handle this case too, I'm not sure it's > a bug. The doc says[1]: > > Specifies whether the subscription should be automatically disabled if > any errors are detected by subscription workers during data > replication from the publisher. > > When an apply worker is trying to establish a connection, it's not > replicating data from the publisher. > > Regards, > > [1] > https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR > > -- > Masahiko Sawada > Amazon Web Services: https://aws.amazon.com Yeah, I had also seen that wording of those docs. And I agree it leaves open some room for doubts because strictly from that wording it can be interpreted that establishing the connection is not actually "data replication from the publisher" in which case maybe there is no bug. OTOH, it was not clear to me if that precise wording was the intention or not. It could have been written as "Specifies whether the subscription should be automatically disabled if any errors are detected by subscription workers." which would mean the same thing 99% of the time except that would mean that the current behaviour is a bug. I tried looking at the original thread where this feature was born [1] but it is still unclear to me if 'disable_on_error' was meant for every kind of error or only data replication errors. Indeed. even the commit message [2] seems to have an each-way bet: * It talks about errors applying changes --- "Logical replication apply workers for a subscription can easily get stuck in an infinite loop of attempting to apply a change..." * But, it also says it covers any errors --- "When true, both the tablesync worker and apply worker catch any errors thrown..." ~ Maybe we should be asking ourselves how a user intuitively expects this option to behave. 
IMO the answer is right there in the option name - the subscription says 'disable_on_error' and I got an error, so I expected the subscription to be disabled. To wriggle out of it by saying "Ah, but we did not mean _those_ kinds of errors" doesn't quite seem genuine to me. == [1] https://www.postgresql.org/message-id/flat/CAA4eK1KsaVgkO%3DRbjj0bcXZTpeV1QVm0TGkdxZiH73MHfxf6oQ%40mail.gmail.com#d4a0db154fbeca356a494c50ac877ff1 [2] https://github.com/postgres/postgres/commit/705e20f8550c0e8e47c0b6b20b5f5ffd6ffd9e33 Kind Regards, Peter Smith. Fujitsu Australia
Re: Strange Bitmapset manipulation in DiscreteKnapsack()
Hi, David Rowley writes: > On Thu, 18 Jan 2024 at 14:58, Andy Fan wrote: >> I want to know if "user just want to zero out the flags in bitmapset >> but keeping the memory allocation" is a valid requirement. Commit >> 00b41463c makes it is hard IIUC. > > Looking at your patch, I see: > > +/* > + * does this break commit 00b41463c21615f9bf3927f207e37f9e215d32e6? > + * but I just found alloc memory and free the memory is too bad > + * for this current feature. So let see ...; > + */ > +void > +bms_zero(Bitmapset *a) > > I understand the requirement here, but, to answer the question in the > comment -- Yes, that does violate the requirements for how an empty > set is represented and as of c7e5e994b and possibly earlier, any > subsequent Bitmapset operations will cause an Assert failure due to > the trailing zero word(s) left by bms_zero(). Thanks for the understanding and confirmation. >> Looks like this is another user case of "user just wants to zero out the >> flags in bitmapset but keeping the memory allocation". > > I don't really see a way to have it work the way you want without > violating the new representation of an empty Bitmapset. Per [1], > there's quite a performance advantage to 00b41463c plus the additional > don't allow trailing empty words rule done in a8c09daa8. So I don't > wish the rules were more lax. I agree with this. > > Maybe Bitmapsets aren't the best fit for your need. Maybe it would be > better to have a more simple fixed size bitset that you could allocate > in the same allocation as the TupleTableSlot's tts_null and tts_values > arrays. The slot's TupleDesc should know how many bits are needed. I think this is the direction we have to go. There is no bitset struct yet, so I prefer to design it as below, The APIs are pretty similar with Bitmapset. 
typedef char Bitset;

Bitset *bitset_init(int size);
void bitset_add_member(Bitset *bitset, int x);
void bitset_del_member(Bitset *bitset, int x);
Bitset *bitset_add_members(Bitset *bitset1, Bitset *bitset2);
bool bitset_is_member(Bitset *bitset, int x);
int bitset_next_member(Bitset *bitset, int i);
void bitset_clear(Bitset *bitset);

After this, we may use it for DiscreteKnapsack as well, since num_items is fixed there too and that use case wants the memory allocation done only once.

--
Best Regards
Andy Fan
Re: Adding facility for injection points (or probe points?) for more advanced tests
Hi Michael,

There is some overlap between DTrace functionality and this functionality, but I see differences too. E.g. injection points offer deeper integration, whereas DTrace provides more information to the probe, like the call stack and argument values. We need to assess whether the two can co-exist and whether we need both of them. If the answer to both of those questions is yes, it will be good to add documentation explaining the differences and similarities, and also some guidance on when to use what.

On Fri, Jan 12, 2024 at 9:56 AM Michael Paquier wrote:
>
> On Fri, Jan 12, 2024 at 08:35:42AM +0900, Michael Paquier wrote:
> > It also seems like you've missed this message, where this has been
> > mentioned (spoiler: first version of the patch used an alternate
> > output):
> > https://www.postgresql.org/message-id/zunuzpimkzcov...@paquier.xyz
>
> The refactoring of 0001 has now been applied as of e72a37528dda, and
> the buildfarm looks stable (at least for now).
>
> Here is a rebased patch set of the rest.

+
+#ifdef USE_INJECTION_POINTS
+static bool
+file_exists(const char *name)
+{
+	struct stat st;
+
+	Assert(name != NULL);
+	if (stat(name, &st) == 0)
+		return !S_ISDIR(st.st_mode);
+	else if (!(errno == ENOENT || errno == ENOTDIR))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not access file \"%s\": %m", name)));
+	return false;
+}

Shouldn't this be removed now? The code should use the one from fd.c.

Other code changes look good. I think the documentation and comments need some changes, especially considering the user's point of view. I have attached two patches (0003 and 0004) with those changes, to be applied on top of 0001 and 0002 respectively. Please review them; they might need some wordsmithing and language correction. Attaching the whole patch set to keep cfbot happy.

This is a review of 0001 and 0002 only. Once we take care of these comments, I think those patches will be ready for commit, except one point of contention mentioned in [1].
We haven't heard any third opinion yet. [1] https://www.postgresql.org/message-id/CAExHW5sc_ar7=w9xccc9twyxzf71ghc6poq_+u4hxtxmnb7...@mail.gmail.com -- Best Wishes, Ashutosh Bapat From 7a26f1c345e8a8dcf895dcd5d1d45012f46dd826 Mon Sep 17 00:00:00 2001 From: Ashutosh Bapat Date: Wed, 17 Jan 2024 16:59:03 +0530 Subject: [PATCH 2/6] Comment and documentation suggestions. ... reflecting a user's point of view. Ashutosh Bapat --- doc/src/sgml/installation.sgml | 24 doc/src/sgml/xfunc.sgml | 24 +++- src/backend/utils/misc/injection_point.c | 11 +-- 3 files changed, 32 insertions(+), 27 deletions(-) diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml index 51830fc078..f913b6b78f 100644 --- a/doc/src/sgml/installation.sgml +++ b/doc/src/sgml/installation.sgml @@ -1661,12 +1661,12 @@ build-postgresql: Compiles PostgreSQL with support for -injection points in the server. This is valuable to inject -user-defined code to force specific conditions to happen on the -server in pre-defined code paths. This option is disabled by default. -See for more details. -This option is only for developers to test specific concurrency -scenarios. +injection points in the server. Injection points allow to run +user-defined code from within the server in pre-defined code paths. +This helps in testing and investigation of concurrency scenarios in a +controlled fashion. This option is disabled by default. See for more details. This option +is inteded to be used only by developers for testing. @@ -3180,12 +3180,12 @@ ninja install Compiles PostgreSQL with support for -injection points in the server. This is valuable to inject -user-defined code to force specific conditions to happen on the -server in pre-defined code paths. This option is disabled by default. -See for more details. -This option is only for developers to test specific concurrency -scenarios. +injection points in the server. 
Injection points allow to run +user-defined code from within the server in pre-defined code paths. +This helps in testing and investigation of concurrency scenarios in a +controlled fashion. This option is disabled by default. See for more details. This option +is inteded to be used only by developers for testing. diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml index 9dd9eafd22..59dbe73c67 100644 --- a/doc/src/sgml/xfunc.sgml +++ b/doc/src/sgml/xfunc.sgml @@ -3514,27 +3514,24 @@ uint32 WaitEventExtensionNew(const char *wait_event_name) Injection Points - Add-ins can define injection points, that would run user-defined - code when going through a specific code
Re: Fix search_path for all maintenance commands
On Thu, 2024-01-18 at 09:24 +0530, Shubham Khanna wrote: > I tried to apply the patch but it is failing at the Head. It is > giving > the following error: I am withdrawing this patch from the CF until it's more clear that this is what we really want to do. Thank you for looking into it. Regards, Jeff Davis
Re: Synchronizing slots from primary to standby
I have one question about the new code in v63-0002.

==
src/backend/replication/logical/slotsync.c

1. ReplSlotSyncWorkerMain

+	Assert(SlotSyncWorker->pid == InvalidPid);
+
+	/*
+	 * Startup process signaled the slot sync worker to stop, so if meanwhile
+	 * postmaster ended up starting the worker again, exit.
+	 */
+	if (SlotSyncWorker->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncWorker->mutex);
+		proc_exit(0);
+	}

Can we be sure a worker crash can't occur (in ShutDownSlotSync?) in such a way that SlotSyncWorker->stopSignaled was already assigned true, but SlotSyncWorker->pid was not yet reset to InvalidPid? I.e. is the Assert above still OK?

==
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: initdb's -c option behaves wrong way?
Thank you for picking this up.

At Wed, 17 Jan 2024 23:47:41 +0100, Daniel Gustafsson wrote in
> > On 17 Jan 2024, at 21:33, Tom Lane wrote:
> >
> > I wrote:
> >> However ... I don't like the patch much.  It seems to have left
> >> the code in a rather random state.  Why, for example, didn't you
> >> keep all the code that constructs the "newline" value together?
> >
> > After thinking about it a bit more, I don't see why you didn't just
> > s/strncmp/strncasecmp/ and call it good.  The messiness seems to be
> > a result of your choice to replace the GUC's case as shown in the
> > file with the case used on the command line, which is not better
> > IMO.  We don't change our mind about the canonical spelling of a
> > GUC because somebody varied the case in a SET command.
>
> The original patch was preserving the case of the file, throwing away the
> case from the command line.  The attached is a slimmed-down version which
> only moves the assembly of newline to the different cases (replace vs.
> new), keeping the rest of the code intact.  Keeping it in one place gets
> sort of messy too, since it needs to use different values for a
> replacement and a new variable.  Not sure if there is a cleaner approach?

Just to clarify, the current code breaks out after processing the first matching line. I haven't changed that behavior. The reason I moved the rewrite processing code out of the loop was that I wanted to avoid adding more lines that are executed only once into the loop. However, that is in exchange for additional complexity to remember the original spelling of the parameter name. Personally, I believe separating the search and rewrite processing is better, but I'm not particularly attached to the approach, so either way is fine with me.

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada wrote: > It seems we cannot make this work nicely. IIUC VacDeadItems is > allocated in DSM and TidStore is embedded there. However, > dead_items->items.area is a local pointer to dsa_area. So we cannot > include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we > would need to pass dsa_area to each interface by callers. Thanks again for exploring this line of thinking! Okay, it seems even if there's a way to make this work, it would be too invasive to justify when compared with the advantage I was hoping for. > > If we can't make this work nicely, I'd be okay with keeping the tid > > store control object. My biggest concern is unnecessary > > double-locking. > > If we don't do any locking stuff in radix tree APIs and it's the > user's responsibility at all, probably we don't need a lock for > tidstore? That is, we expose lock functions as you mentioned and the > user (like tidstore) acquires/releases the lock before/after accessing > the radix tree and num_items. I'm not quite sure what the point of "num_items" is anymore, because it was really tied to the array in VacDeadItems. dead_items->num_items is essential to reading/writing the array correctly. If this number is wrong, the array is corrupt. There is no such requirement for the radix tree. We don't need to know the number of tids to add to it or do a lookup, or anything. There are a number of places where we assert "the running count of the dead items" is the same as "the length of the dead items array", like here: @@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel) BlockNumber threshold; Assert(vacrel->num_index_scans == 0); - Assert(vacrel->lpdead_items == vacrel->dead_items->num_items); + Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items)); As such, in HEAD I'm guessing it's arbitrary which one is used for control flow. Correct me if I'm mistaken. 
If I am wrong for some part of the code, it'd be good to understand when that invariant can't be maintained. @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel) * Do index vacuuming (call each index's ambulkdelete routine), then do * related heap vacuuming */ - if (dead_items->num_items > 0) + if (TidStoreNumTids(dead_items) > 0) lazy_vacuum(vacrel); Like here. In HEAD, could this have used vacrel->dead_items? @@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) * We set all LP_DEAD items from the first heap pass to LP_UNUSED during * the second heap pass. No more, no less. */ - Assert(index > 0); Assert(vacrel->num_index_scans > 1 || -(index == vacrel->lpdead_items && +(TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items && vacuumed_pages == vacrel->lpdead_item_pages)); ereport(DEBUG2, - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", - vacrel->relname, (long long) index, vacuumed_pages))); + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages", + vacrel->relname, TidStoreNumTids(vacrel->dead_items), + vacuumed_pages))); We assert that vacrel->lpdead_items has the expected value, and then the ereport repeats the function call (with a lock) to read the value we just consulted to pass the assert. If we *really* want to compare counts, maybe we could invent a debugging-only function that iterates over the tree and popcounts the bitmaps. That seems too expensive for regular assert builds, though. On the subject of debugging builds, I think it no longer makes sense to have the array for debug checking in tid store, even during development. A few months ago, we had an encoding scheme that looked simple on paper, but its code was fiendishly difficult to follow (at least for me). That's gone. In addition to the debugging count above, we could also put a copy of the key in the BlockTableEntry's header, in debug builds. 
We don't yet need to care about the key size, since we don't (yet) have runtime-embeddable values. > Currently (as of v52 patch) RT_FIND is > doing so, [meaning, there is no internal "automatic" locking here since after we switched to variable-length types, an outstanding TODO] Maybe it's okay to expose global locking for v17. I have one possible alternative: This week I tried an idea to use a callback there so that after internal unlocking, the caller received the value (or whatever else needs to happen, such as lookup an offset in the tid bitmap). I've attached a draft for that that passes radix tree tests. It's a bit awkward, but I'm guessing this would more closely match future internal atomic locking. Let me know what you think of the concept, and then do whichever way you think is best. (using v53 as the basis) I believe this is the only open question remaining. The rest is just polish and testing. > During trying this idea, I realized that there is a visibility problem > in the radix tree template If it's broken even without the embedding I'll look into this (I don't know if this
Re: Strange Bitmapset manipulation in DiscreteKnapsack()
On Thu, 18 Jan 2024 at 14:58, Andy Fan wrote: > I want to know if "user just want to zero out the flags in bitmapset > but keeping the memory allocation" is a valid requirement. Commit > 00b41463c makes it is hard IIUC. Looking at your patch, I see: +/* + * does this break commit 00b41463c21615f9bf3927f207e37f9e215d32e6? + * but I just found alloc memory and free the memory is too bad + * for this current feature. So let see ...; + */ +void +bms_zero(Bitmapset *a) I understand the requirement here, but, to answer the question in the comment -- Yes, that does violate the requirements for how an empty set is represented and as of c7e5e994b and possibly earlier, any subsequent Bitmapset operations will cause an Assert failure due to the trailing zero word(s) left by bms_zero(). > Looks like this is another user case of "user just wants to zero out the > flags in bitmapset but keeping the memory allocation". I don't really see a way to have it work the way you want without violating the new representation of an empty Bitmapset. Per [1], there's quite a performance advantage to 00b41463c plus the additional don't allow trailing empty words rule done in a8c09daa8. So I don't wish the rules were more lax. Maybe Bitmapsets aren't the best fit for your need. Maybe it would be better to have a more simple fixed size bitset that you could allocate in the same allocation as the TupleTableSlot's tts_null and tts_values arrays. The slot's TupleDesc should know how many bits are needed. David [1] https://postgr.es/m/CAJ2pMkYcKHFBD_OMUSVyhYSQU0-j9T6NZ0pL6pwbZsUCohWc7Q%40mail.gmail.com
Re: Fix search_path for all maintenance commands
On Thu, Jan 18, 2024 at 9:19 AM Jeff Davis wrote: > > On Mon, 2023-07-17 at 12:16 -0700, Jeff Davis wrote: > > Based on feedback, I plan to commit soon. > > Attached is a new version. > > Changes: > > * Also switch the search_path during CREATE MATERIALIZED VIEW, so that > it's consistent with REFRESH. As a part of this change, I slightly > reordered things in ExecCreateTableAs() so that the skipData path > returns early without entering the SECURITY_RESTRICTED_OPERATION. I > don't think that's a problem because (a) that is one place where > SECURITY_RESTRICTED_OPERATION is not used for security, but rather for > consistency; and (b) that path doesn't go through rewriter, planner, or > executor anyway so I don't see why it would matter. > > * Use GUC_ACTION_SAVE rather than GUC_ACTION_SET. That was a problem > with the previous patch for index functions executed in parallel > workers, which can happen calling SQL functions from pg_amcheck. > > * I used a wrapper function RestrictSearchPath() rather than calling > set_config_option() directly. That provides a nice place in case we > need to add a compatibility GUC to disable it. > > Question: > > Why do we switch to the table owner and use > SECURITY_RESTRICTED_OPERATION in DefineIndex(), when we will switch in > index_build (etc.) anyway? Similarly, why do we switch in vacuum_rel(), > when it doesn't matter for lazy vacuum and we will switch in > cluster_rel() and do_analyze_rel() anyway? I tried to apply the patch but it is failing at the Head. It is giving the following error: Hunk #7 succeeded at 3772 (offset -12 lines). patching file src/backend/commands/matview.c patching file src/backend/commands/vacuum.c Hunk #2 succeeded at 2169 (offset -19 lines). patching file src/backend/utils/init/usercontext.c patching file src/bin/scripts/t/100_vacuumdb.pl Hunk #1 FAILED at 109. 
1 out of 1 hunk FAILED -- saving rejects to file src/bin/scripts/t/100_vacuumdb.pl.rej patching file src/include/utils/usercontext.h patching file src/test/modules/test_oat_hooks/expected/test_oat_hooks.out patching file src/test/regress/expected/matview.out patching file src/test/regress/expected/privileges.out patching file src/test/regress/expected/vacuum.out patching file src/test/regress/sql/matview.sql patching file src/test/regress/sql/privileges.sql patching file src/test/regress/sql/vacuum.sql Please send a rebased version of the patch. Thanks and Regards, Shubham Khanna.
Re: Strange Bitmapset manipulation in DiscreteKnapsack()
Thanks for having a look at this again. On Thu, 18 Jan 2024 at 15:22, Richard Guo wrote: > Do you think we can use 'memcpy(a, b, BITMAPSET_SIZE(b->nwords))' > directly in the new bms_replace_members() instead of copying the > bitmapwords one by one, like: > > - i = 0; > - do > - { > - a->words[i] = b->words[i]; > - } while (++i < b->nwords); > - > - a->nwords = b->nwords; > + memcpy(a, b, BITMAPSET_SIZE(b->nwords)); > > But I'm not sure if this is an improvement or not. I considered this earlier but felt it was going against the method used in other places in the file. However, on relooking I do see bms_copy() does a memcpy(). I'm still in favour of keeping it the way the v2 patch does it for 2 reasons: 1) Ignoring bms_copy(), we use do/while in all other functions where we operate on all words in the set. 2) memcpy isn't that fast for small numbers of bytes when that number of bytes isn't known at compile-time. The do/while method can take advantage of knowing that the Bitmapset will have at least 1 word allowing a single loop check when the set only has a single word, which I expect most Bitmapsets do. Of course, memcpy() is fewer lines of code and this likely isn't that performance critical, so there's certainly arguments for memcpy(). However, it isn't quite as few lines as the patch you pasted. We still need to overwrite a->nwords to ensure we grow the set or shrink it to trim off any trailing zero words (which I didn't feel any need to actually set to 0). David
Re: Strange Bitmapset manipulation in DiscreteKnapsack()
On Thu, Jan 18, 2024 at 8:35 AM David Rowley wrote: > The functions's header comment mentions "The bitmapsets are all > pre-initialized with an unused high bit so that memory allocation is > done only once.". Ah, I neglected to notice this when reviewing the v1 patch. I guess it's implemented this way due to performance considerations, right? > I've attached a patch which adds bms_replace_members(). It's basically > like bms_copy() but attempts to reuse the member from another set. I > considered if the new function should be called bms_copy_inplace(), > but left it as bms_replace_members() for now. Do you think we can use 'memcpy(a, b, BITMAPSET_SIZE(b->nwords))' directly in the new bms_replace_members() instead of copying the bitmapwords one by one, like: - i = 0; - do - { - a->words[i] = b->words[i]; - } while (++i < b->nwords); - - a->nwords = b->nwords; + memcpy(a, b, BITMAPSET_SIZE(b->nwords)); But I'm not sure if this is an improvement or not. Thanks Richard
Re: Add system identifier to backup manifest
On Wed, Jan 17, 2024 at 08:46:09AM -0500, Robert Haas wrote: > On Wed, Jan 17, 2024 at 6:45 AM Alvaro Herrera > wrote: >> Hmm, okay, but what if I take a full backup from a primary server and >> later I want an incremental from a standby, or the other way around? >> Will this prevent me from using such a combination? > > The system identifier had BETTER match in such cases. If it doesn't, > somebody's run pg_resetwal on your standby since it was created... and > in that case, no incremental backup for you! There is an even stronger check than that at replay as we also store the system identifier in XLogLongPageHeaderData and cross-check it with the contents of the control file. Having a field in the backup manifest makes for a much faster detection, even if that's not the same as replaying things, it can avoid a lot of problems when combining backup pieces. I'm +1 for Amul's patch concept. -- Michael signature.asc Description: PGP signature
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On 2024-01-18 10:10, jian he wrote: On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada wrote: On Thu, Jan 18, 2024 at 6:38 AM Tom Lane wrote: > > Alexander Korotkov writes: > > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi > > wrote: > >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which > >> indicate "immediately error out" and 'just ignore the failure' > >> respectively, but these options hardly seem to denote a 'location', > >> and appear more like an 'action'. I somewhat suspect that this > >> parameter name intially conceived with the assupmtion that it would > >> take file names or similar parameters. I'm not sure if others will > >> agree, but I think the parameter name might not be the best > >> choice. For instance, considering the addition of the third value > >> 'log', something like on_error_action (error, ignore, log) would be > >> more intuitively understandable. What do you think? > > > Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes > > the next word will be location, not action. With some stretch we can > > assume 'error' to be location. I think it would be even more stretchy > > to think that SAVE_ERROR_TO is followed by action. > > The other problem with this terminology is that with 'none', what it > is doing is the exact opposite of "saving" the errors. I agree we > need a better name. Agreed. > > Kyotaro-san's suggestion isn't bad, though I might shorten it to > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > You will need a separate parameter anyway to specify the destination > of "log", unless "none" became an illegal table name when I wasn't > looking. I don't buy that one parameter that has some special values > while other values could be names will be a good design. Moreover, > what if we want to support (say) log-to-file along with log-to-table? > Trying to distinguish a file name from a table name without any other > context seems impossible. 
I've been thinking we can add more values to this option to log errors not only to the server logs but also to the error table (not sure details but I imagined an error table is created for each table on error), without an additional option for the destination name. The values would be like error_action {error|ignore|save-logs|save-table}. another idea: on_error {error|ignore|other_future_option} if not specified then by default ERROR. You can also specify ERROR or IGNORE for now. I agree, the parameter "error_action" is better than "location". I'm not sure whether error_action or on_error is better, but either way "error_action error" and "on_error error" seems a bit odd to me. I feel "stop" is better for both cases as Tom suggested. -- Regards, -- Atsushi Torikoshi NTT DATA Group Corporation
Re: Add last_commit_lsn to pg_stat_database
On Sun, Jan 14, 2024 at 6:01 AM vignesh C wrote: > > On Sat, 10 Jun 2023 at 07:57, James Coleman wrote: > > > > I've previously noted in "Add last commit LSN to > > pg_last_committed_xact()" [1] that it's not possible to monitor how > > many bytes of WAL a logical replication slot is behind (computing such > > is obviously trivial for physical slots) because the slot doesn't need > > to replicate beyond the last commit. In some cases it's possible for > > the current WAL location to be far beyond the last commit. A few > > examples: > > > > - An idle server with checkpoint_timeout at a lower value (so lots of > > empty WAL advance). > > - Very large transactions: particularly insidious because committing a > > 1 GB transaction after a small transaction may show almost zero time > > lag even though quite a bit of data needs to be processed and sent > > over the wire (so time to replay is significantly different from > > current lag). > > - A cluster with multiple databases complicates matters further, > > because while physical replication is cluster-wide, the LSNs that > > matter for logical replication are database specific. > > > > Since we don't expose the most recent commit's LSN there's no way to > > say "the WAL is currently 1250, the last commit happened at 1000, the > > slot has flushed up to 800, therefore there are at most 200 bytes > > replication needs to read through to catch up." > > > > In the aforementioned thread [1] I'd proposed a patch that added a SQL > > function pg_last_commit_lsn() to expose the most recent commit's LSN. > > Robert Haas didn't think the initial version's modifications to > > commit_ts made sense, and a subsequent approach adding the value to > > PGPROC didn't have strong objections, from what I can see, but it also > > didn't generate any enthusiasm. > > > > As I was thinking about how to improve things, I realized that this > > information (since it's for monitoring anyway) fits more naturally > > into the stats system. 
I'd originally thought of exposing it in > > pg_stat_wal, but that's per-cluster rather than per-database (indeed, > > this is a flaw I hadn't considered in the original patch), so I think > > pg_stat_database is the correct location. > > > > I've attached a patch to track the latest commit's LSN in pg_stat_database. > > I have changed the status of commitfest entry to "Returned with > Feedback" as Aleksander's comments have not yet been resolved. Please > feel free to post an updated version of the patch and update the > commitfest entry accordingly. Thanks for reminding me; I'd lost track of this patch. Regards, James Coleman
Re: Add last_commit_lsn to pg_stat_database
Hello, Thanks for reviewing! On Tue, Sep 19, 2023 at 8:16 AM Aleksander Alekseev wrote: > > Hi, > > > [...] > > As I was thinking about how to improve things, I realized that this > > information (since it's for monitoring anyway) fits more naturally > > into the stats system. I'd originally thought of exposing it in > > pg_stat_wal, but that's per-cluster rather than per-database (indeed, > > this is a flaw I hadn't considered in the original patch), so I think > > pg_stat_database is the correct location. > > > > I've attached a patch to track the latest commit's LSN in pg_stat_database. > > Thanks for the patch. It was marked as "Needs Review" so I decided to > take a brief look. > > All in all the code seems to be fine but I have a couple of nitpicks: > > - If you are modifying pg_stat_database the corresponding changes to > the documentation are needed. Updated. > - pg_stat_get_db_last_commit_lsn() has the same description as > pg_stat_get_db_xact_commit() which is confusing. I've fixed this. > - pg_stat_get_db_last_commit_lsn() is marked as STABLE which is > probably correct. But I would appreciate a second opinion on this. Sounds good. > - Wouldn't it be appropriate to add a test or two? Added. > - `if (!XLogRecPtrIsInvalid(commit_lsn))` - I suggest adding > XLogRecPtrIsValid() macro for better readability We have 36 usages of !XLogRecPtrIsInvalid(...) already, so I think we should avoid making this change in this patch. v2 is attached. Regards, James Coleman v2-0001-Add-last-commit-s-LSN-to-pg_stat_database.patch Description: Binary data
Re: Strange Bitmapset manipulation in DiscreteKnapsack()
Hi, David Rowley writes: > On Tue, 16 Jan 2024 at 16:32, David Rowley wrote: >> >> >> Now that 00b41463c changed Bitmapset to have NULL be the only valid >> representation of an empty set, this code no longer makes sense. We >> may as well just bms_free() the original set and bms_copy() in the new >> set as the bms_del_members() call will always pfree the set anyway. I want to know if "user just want to zero out the flags in bitmapset but keeping the memory allocation" is a valid requirement. Commit 00b41463c makes it hard, IIUC. The use case I have is that I want to keep the detoasted datum in slot->tts_values[1] so that any further access doesn't need to detoast it again; I used a 'Bitmapset' in TupleTableSlot which records which attributes are detoasted. All of the detoasted values should be pfree'd in ExecClearTuple. However, bms_free'ing the bitmapset every time in ExecClearTuple and allocating the memory again later causes a noticeable performance regression (a 5% difference in my workload). That is still an open item for that patch. > ... > The function's header comment mentions "The bitmapsets are all > pre-initialized with an unused high bit so that memory allocation is > done only once.". > NOTICE: DiscreteKnapsack: frees = 110, max_weight = 60, extra = 183.33% > NOTICE: DiscreteKnapsack: frees = 110, max_weight = 60, extra = 183.33% > > and by the looks of the code, the worst case is much worse. > Looks like this is another use case of "user just wants to zero out the flags in bitmapset but keeping the memory allocation". [1] https://www.postgresql.org/message-id/flat/87il4jrk1l@163.com -- Best Regards Andy Fan
Re: subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo
On Thu, Jan 18, 2024 at 8:16 AM Peter Smith wrote: > > Hi, > > I had reported a possible subscription 'disable_on_error' bug found > while reviewing another patch. > > I am including my initial report and Nisha's analysis again here so > that this topic has its own thread. > > == > INITIAL REPORT [1] > == > > ... > I see now that any ALTER of the subscription's connection, even to > some value that fails, will restart a new worker (like ALTER of any > other subscription parameters). For a bad connection, it will continue > to relaunch-worker/ERROR over and over. e.g. > > -- > test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG: logical > replication apply worker for subscription "sub4" has started > 2024-01-17 09:34:28.666 AEDT [11274] ERROR: could not connect to the > publisher: invalid port number: "-1" > 2024-01-17 09:34:28.667 AEDT [928] LOG: background worker "logical > replication apply worker" (PID 11274) exited with exit code 1 > 2024-01-17 09:34:33.669 AEDT [11391] LOG: logical replication apply > worker for subscription "sub4" has started > 2024-01-17 09:34:33.669 AEDT [11391] ERROR: could not connect to the > publisher: invalid port number: "-1" > 2024-01-17 09:34:33.670 AEDT [928] LOG: background worker "logical > replication apply worker" (PID 11391) exited with exit code 1 > etc... > -- > > While experimenting with the bad connection ALTER I also tried setting > 'disable_on_error' like below: > > ALTER SUBSCRIPTION sub4 SET (disable_on_error); > ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1'; > > ...but here the subscription did not become DISABLED as I expected it > would do on the next connection error iteration. It remains enabled > and just continues to loop relaunch/ERROR indefinitely same as before. > > That looks like it may be a bug. Thoughts? Although we can improve it to handle this case too, I'm not sure it's a bug. 
The doc says[1]: Specifies whether the subscription should be automatically disabled if any errors are detected by subscription workers during data replication from the publisher. When an apply worker is trying to establish a connection, it's not replicating data from the publisher. Regards, [1] https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
linux cachestat in file Readv and Prefetch
Hi, I was testing the index prefetch and streamIO patches and I added the cachestat() syscall to get a better view of the prefetching. It's a new Linux syscall; it requires kernel 6.5 and provides a lot of interesting information from the VM for the range of pages examined. It's way faster than the old mincore() and provides much more valuable information: uint64 nr_cache; /* Number of cached pages */ uint64 nr_dirty; /* Number of dirty pages */ uint64 nr_writeback; /* Number of pages marked for writeback. */ uint64 nr_evicted; /* Number of pages evicted from the cache. */ /* * Number of recently evicted pages. A page is recently evicted if its * last eviction was recent enough that its reentry to the cache would * indicate that it is actively being used by the system, and that there * is memory pressure on the system. */ uint64 nr_recently_evicted; While here I also added some quick tweaks to suspend prefetching on memory pressure. It's working, but I have absolutely not checked the performance impact of my additions. Sharing here for others to test, and adding it to the CF in case there is interest in going further in this direction. --- Cédric Villemain +33 (0)6 20 30 22 52 https://Data-Bene.io PostgreSQL Expertise, Support, Training, R&D From 4cdc5e952e0cd729bde472bdc3d7ff7bbc3009ea Mon Sep 17 00:00:00 2001 From: Cédric Villemain Date: Thu, 18 Jan 2024 01:00:37 +0100 Subject: [PATCH] Add cachestat, requires linux 6.5 USE_CACHESTAT is defined and #ifdef'd because the feature is only for testing and discussion at the moment. cachestat provides a struct with the following fields set for the range of pages examined. The syscall is supposed to be fast enough, but maybe not for each PageRead/Prefetch. The very basic logic to suspend prefetching is effective enough for some scenarios (ANALYZE being a good candidate). 
--- src/backend/storage/file/fd.c | 85 +++ 1 file changed, 85 insertions(+) diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 43d2b2a156e..ffdbd542373 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -181,6 +181,57 @@ int io_direct_flags; ((void) 0) #endif +/* cachestat - use structs as defined in linux */ +typedef struct +{ + uint64 nr_cache; /* Number of cached pages */ + uint64 nr_dirty; /* Number of dirty pages */ + uint64 nr_writeback; /* Number of pages marked for writeback. */ + uint64 nr_evicted; /* Number of pages evicted from the cache. */ + /* + * Number of recently evicted pages. A page is recently evicted if its + * last eviction was recent enough that its reentry to the cache would + * indicate that it is actively being used by the system, and that there + * is memory pressure on the system. + */ + uint64 nr_recently_evicted; +} cachestat; + +typedef struct +{ + uint64 off; + uint64 len; +} cachestat_range; + +/* + * simple wrapper of linux cachestat syscall + * *cstat point to a member in Vfd. + * XXX define USE_CACHESTAT ? 
+ */ +static inline void vm_cachestat(int fd, cachestat_range csrange, +cachestat *cstat) { + +#define USE_CACHESTAT 1 + +#if defined(USE_CACHESTAT) +#define __NR_cachestat 451 + + long rc; + rc = syscall(__NR_cachestat, fd, &csrange, cstat, 0); + if (rc == -1) + { + if (errno == ENOSYS) + ereport(DEBUG1, (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("sys_cachestat is not available: %m"), + errhint("linux 6.5 minimum is required!"))); + + ereport(ERROR, (errcode(ERRCODE_INTERNAL_ERROR), + errmsg("sys_cachestat returns: %m"))); + } +#endif +} + #define VFD_CLOSED (-1) #define FileIsValid(file) \ @@ -206,6 +257,9 @@ typedef struct vfd /* NB: fileName is malloc'd, and must be free'd when closing the VFD */ int fileFlags; /* open(2) flags for (re)opening the file */ mode_t fileMode; /* mode to pass to open(2) */ +#if defined(USE_CACHESTAT) + cachestat cstat; /* linux cachestat() */ +#endif } Vfd; /* @@ -2090,6 +2144,33 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info) if (returnCode < 0) return returnCode; +#if defined(USE_CACHESTAT) + /* + * last time we visited this file (somewhere), nr_recently_evicted pages of + * the range were just removed from the vm cache; it's a sign of memory pressure, + * so do not prefetch further. + * it is hard to guess if it is always the right choice in absence of more + * information like: + * - prefetching distance expected overall + * - access pattern/backend maybe + */ + if (VfdCache[file].cstat.nr_recently_evicted > 0) + { + DO_DB(elog(LOG, "FilePrefetch: nr_recently_evicted="INT64_FORMAT"\t\t\t\tnoway", + VfdCache[file].cstat.nr_recently_evicted)); + return 0; + } + vm_cachestat(VfdCache[file].fd, (cachestat_range) {offset, amount}, + &VfdCache[file].cstat); + if (VfdCache[file].cstat.nr_cache > 0) + { + DO_DB(elog(LOG, "FilePrefetch:
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jan 17, 2024 at 11:37 AM John Naylor wrote: > > On Wed, Jan 17, 2024 at 8:39 AM Masahiko Sawada wrote: > > > > On Wed, Jan 17, 2024 at 9:20 AM John Naylor wrote: > > > > > > On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada > > > wrote: > > > > Just changing "items" to be the local tidstore struct could make the > > > > code tricky a bit, since max_bytes and num_items are on the shared > > > > memory while "items" is a local pointer to the shared tidstore. > > > > > > Thanks for trying it this way! I like the overall simplification but > > > this aspect is not great. > > > Hmm, I wonder if that's a side-effect of the "create" functions doing > > > their own allocations and returning a pointer. Would it be less tricky > > > if the structs were declared where we need them and passed to "init" > > > functions? > > > > Seems worth trying. The current RT_CREATE() API is also convenient as > > other data structure such as simplehash.h and dshash.c supports a > > similar > > I don't happen to know if these paths had to solve similar trickiness > with some values being local, and some shared. > > > > That may be a good idea for other reasons. It's awkward that the > > > create function is declared like this: > > > > > > #ifdef RT_SHMEM > > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes, > > > dsa_area *dsa, > > > int tranche_id); > > > #else > > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes); > > > #endif > > > > > > An init function wouldn't need these parameters: it could look at the > > > passed struct to know what to do. > > > > But the init function would initialize leaf_ctx etc,no? Initializing > > leaf_ctx needs max_bytes that is not stored in RT_RADIX_TREE. > > I was more referring to the parameters that were different above > depending on shared memory. My first thought was that the tricky part > is because of the allocation in local memory, but it's certainly > possible I've misunderstood the problem. 
> > > The same > > is true for dsa. I imagined that an init function would allocate a DSA > > memory for the control object. > > Yes: > > ... > // embedded in VacDeadItems > TidStore items; > }; > > // NULL DSA in local case, etc > dead_items->items.area = dead_items_dsa; > dead_items->items.tranche_id = FOO_ID; > > TidStoreInit(_items->items, vac_work_mem); > > That's how I imagined it would work (leaving out some details). I > haven't tried it, so not sure how much it helps. Maybe it has other > problems, but I'm hoping it's just a matter of programming. It seems we cannot make this work nicely. IIUC VacDeadItems is allocated in DSM and TidStore is embedded there. However, dead_items->items.area is a local pointer to dsa_area. So we cannot include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we would need to pass dsa_area to each interface by callers. > > If we can't make this work nicely, I'd be okay with keeping the tid > store control object. My biggest concern is unnecessary > double-locking. If we don't do any locking stuff in radix tree APIs and it's the user's responsibility at all, probably we don't need a lock for tidstore? That is, we expose lock functions as you mentioned and the user (like tidstore) acquires/releases the lock before/after accessing the radix tree and num_items. Currently (as of v52 patch) RT_FIND is doing so, but we would need to change RT_SET() and iteration functions as well. During trying this idea, I realized that there is a visibility problem in the radix tree template especially if we want to embed the radix tree in a struct. Considering a use case where we want to use a radix tree in an exposed struct, we would declare only interfaces in a .h file and define actual implementation in a .c file (FYI TupleHashTableData does a similar thing with simplehash.h). 
The .c file and .h file would be like: in .h file: #define RT_PREFIX local_rt #define RT_SCOPE extern #define RT_DECLARE #define RT_VALUE_TYPE BlocktableEntry #define RT_VARLEN_VALUE #include "lib/radixtree.h" typedef struct TidStore { : local_rt_radix_tree tree; /* embedded */ : } TidStore; in .c file: #define RT_PREFIX local_rt #define RT_SCOPE extern #define RT_DEFINE #define RT_VALUE_TYPE BlocktableEntry #define RT_VARLEN_VALUE #include "lib/radixtree.h" But it doesn't work as the compiler doesn't know the actual definition of local_rt_radix_tree. If the 'tree' is *local_rt_radix_tree, it works. The reason is that with RT_DECLARE but without RT_DEFINE, the radix tree template generates only forward declarations: #ifdef RT_DECLARE typedef struct RT_RADIX_TREE RT_RADIX_TREE; typedef struct RT_ITER RT_ITER; In order to make it work, we need to move the definitions required to expose RT_RADIX_TREE struct to RT_DECLARE part, which actually requires to move RT_NODE, RT_HANDLE, RT_NODE_PTR, RT_SIZE_CLASS_COUNT, and RT_RADIX_TREE_CONTROL etc. However RT_SIZE_CLASS_COUNT, used in RT_RADIX_TREE, could be bothersome. Since it refers to
Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bindx
On Wed, Jan 17, 2024 at 10:05:33AM +0100, Anthonin Bonnefoy wrote: > > I do realize the same is true for plain \bind, but it seems > > like a bug there too. > > The unscanned bind's parameters are discarded later in the > HandleSlashCmds functions. So adding the ignore_slash_options() for > inactive branches scans and discards them earlier. I will add it to > match what's done in the other commands but I don't think it's > testable as the behaviour is the same unless I miss something. Hmm. So it does not lead to any user-visible changes, right? I can get your argument about being consistent in the code across the board for all the backslash commands, though. > I did add the \bind, \bindx, \close and \parse to the inactive branch > tests to complete the list. Could you split the bits for \bind into a separate patch, please? This requires a separate evaluation, especially if this had better be backpatched. -- Michael
Re: Show WAL write and fsync stats in pg_stat_io
On Wed, Jan 17, 2024 at 03:20:39PM +0300, Nazir Bilal Yavuz wrote: > I agree with your points. While the other I/O related work is > happening we can discuss what we should do in the variable op_byte > cases. Also, this is happening only for WAL right now but if we try to > extend pg_stat_io in the future, that problem possibly will rise > again. So, it could be good to come up with a general solution, not > only for WAL. Okay, I've marked the patch as RwF for this CF. -- Michael
Re: CI and test improvements
On Wed, Jan 17, 2024 at 05:34:00PM +0530, vignesh C wrote: > Are we planning to do anything more in this thread, the thread has > been idle for more than 7 months. If nothing more is planned for this, > I'm planning to close this commitfest entry in this commitfest. Oops, this fell through the cracks. security was still missing in seg's meson.build, so I've just applied a patch to take care of it. I am not spotting any other holes. -- Michael
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada wrote: > > On Thu, Jan 18, 2024 at 6:38 AM Tom Lane wrote: > > > > Alexander Korotkov writes: > > > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi > > > wrote: > > >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which > > >> indicate "immediately error out" and 'just ignore the failure' > > >> respectively, but these options hardly seem to denote a 'location', > > >> and appear more like an 'action'. I somewhat suspect that this > > >> parameter name intially conceived with the assupmtion that it would > > >> take file names or similar parameters. I'm not sure if others will > > >> agree, but I think the parameter name might not be the best > > >> choice. For instance, considering the addition of the third value > > >> 'log', something like on_error_action (error, ignore, log) would be > > >> more intuitively understandable. What do you think? > > > > > Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes > > > the next word will be location, not action. With some stretch we can > > > assume 'error' to be location. I think it would be even more stretchy > > > to think that SAVE_ERROR_TO is followed by action. > > > > The other problem with this terminology is that with 'none', what it > > is doing is the exact opposite of "saving" the errors. I agree we > > need a better name. > > Agreed. > > > > > Kyotaro-san's suggestion isn't bad, though I might shorten it to > > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > > You will need a separate parameter anyway to specify the destination > > of "log", unless "none" became an illegal table name when I wasn't > > looking. I don't buy that one parameter that has some special values > > while other values could be names will be a good design. Moreover, > > what if we want to support (say) log-to-file along with log-to-table? 
> > Trying to distinguish a file name from a table name without any other > > context seems impossible. > > I've been thinking we can add more values to this option to log errors > not only to the server logs but also to the error table (not sure > details but I imagined an error table is created for each table on > error), without an additional option for the destination name. The > values would be like error_action {error|ignore|save-logs|save-table}. > another idea: on_error {error|ignore|other_future_option} if not specified then by default ERROR. You can also specify ERROR or IGNORE for now. I agree, the parameter "error_action" is better than "location".
Re: Where can I find the doxyfile?
On Tue, 7 Nov 2023 at 12:23, John Morris wrote: > > Another update, this time with an abbreviated Doxyfile. Rather than including > the full Doxyfile, this updated version only includes modified settings. It > is more compact and more likely to survive across future doxygen versions. Meson build is failing in CFbot at [1] and [2] with: Message: Install doxygen if you wish to create Doxygen documentation Program dot found: NO Configuring Doxyfile using configuration doc/doxygen/meson.build:52:0: ERROR: Tried to use not-found external program in "command" [1] - https://cirrus-ci.com/task/5823573617541120 [2] - https://api.cirrus-ci.com/v1/artifact/task/5823573617541120/meson_log/build/meson-logs/meson-log.txt Regards, Vignesh
Re: Add system identifier to backup manifest
I have also done a review of the patch and some testing. The patch looks good, and I agree with Robert's comments. On Wed, Jan 17, 2024 at 8:40 PM Robert Haas wrote: > > On Wed, Jan 17, 2024 at 6:31 AM Amul Sul wrote: > > With the attached patch, the backup manifest will have a new key item as > > "System-Identifier" 64-bit integer whose value is derived from pg_control while > > generating it, and the manifest version bumps to 2. > > > > This helps to identify the correct database server and/or backup for the > > subsequent backup operations. pg_verifybackup validates the manifest system > > identifier against the backup control file and fails if they don’t match. > > Similarly, pg_basebackup increment backup will fail if the manifest system > > identifier does not match with the server system identifier. The > > pg_combinebackup is already a bit smarter -- checks the system identifier from > > the pg_control of all the backups, with this patch the manifest system > > identifier also validated. > > Thanks for working on this. Without this, I think what happens is that > you can potentially take an incremental backup from the "wrong" > server, if the states of the systems are such that all of the other > sanity checks pass. When you run pg_combinebackup, it'll discover the > problem and tell you, but you ideally want to discover such errors at > backup time rather than at restore time. This addresses that. And, > overall, I think it's a pretty good patch. But I nonetheless have a > bunch of comments. > > - The associated value is always the integer 1. > + The associated value is the integer, either 1 or 2. > > is an integer. Beginning in PostgreSQL 17, > it is 2; in older versions, it is 1. > > + context.identity_cb = manifest_process_identity; > > I'm not really on board with calling the system identifier "identity" > throughout the patch. I think it should just say system_identifier. 
If > we were going to abbreviate, I'd prefer something like "sysident" that > looks like it's derived from "system identifier" rather than > "identity" which is a different word altogether. But I don't think we > should abbreviate unless doing so creates *ridiculously* long > identifier names. > > +static void > +manifest_process_identity(JsonManifestParseContext *context, > + int manifest_version, > + uint64 manifest_system_identifier) > +{ > + uint64 system_identifier; > + > + /* Manifest system identifier available in version 2 or later */ > + if (manifest_version == 1) > + return; > > I think you've got the wrong idea here. I think this function would > only get called if System-Identifier is present in the manifest, so if > it's a v1 manifest, this would never get called, so this if-statement > would not ever do anything useful. I think what you should do is (1) > if the client supplies a v1 manifest, reject it, because surely that's > from an older server version that doesn't support incremental backup; > but do that when the version is parsed rather than here; and (2) also > detect and reject the case when it's supposedly a v2 manifest but this > is absent. > > (1) should really be done when the version number is parsed, so I > suspect you may need to add manifest_version_cb. > > +static void > +combinebackup_identity_cb(JsonManifestParseContext *context, > + int manifest_version, > + uint64 manifest_system_identifier) > +{ > + parser_context *private_context = context->private_data; > + uint64 system_identifier = private_context->system_identifier; > + > + /* Manifest system identifier available in version 2 or later */ > + if (manifest_version == 1) > + return; > > Very similar to the above case. Just reject a version 1 manifest as > soon as we see the version number. In this function, set a flag > indicating we saw the system identifier; if at the end of parsing that > flag is not set, kaboom. 
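The flag-based detection being suggested — reject a v1 manifest as soon as the version is parsed, record the system identifier when its callback fires, and complain at end of parse if a v2 manifest never supplied the field — can be sketched with a toy parse state. All names here are hypothetical stand-ins, not the real JsonManifestParseContext API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified parse state -- not the real JsonManifestParseContext */
typedef struct
{
	int			manifest_version;
	bool		saw_system_identifier;
	uint64_t	system_identifier;
} ManifestParseState;

/* Version callback: a v1 manifest cannot carry a system identifier */
static bool
version_cb(ManifestParseState *state, int version)
{
	if (version < 2)
		return false;			/* reject as soon as the version is parsed */
	state->manifest_version = version;
	return true;
}

/* System-Identifier callback: record the value and remember that we saw it */
static void
system_identifier_cb(ManifestParseState *state, uint64_t sysident)
{
	state->system_identifier = sysident;
	state->saw_system_identifier = true;
}

/* End-of-parse check: a v2 manifest that never supplied the field is an error */
static bool
end_of_manifest_cb(const ManifestParseState *state)
{
	return state->saw_system_identifier;
}
```

The point of the shape is that both failure modes (old manifest version, missing field in a new manifest) are caught during parsing rather than at some later semantic-validation step.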
> > - parse_manifest_file(manifest_path, , _wal_range); > + parse_manifest_file(manifest_path, , _wal_range, > + context.backup_directory); > > Don't do this! parse_manifest_file() should just record everything > found in the manifest in the context object. Syntax validation should > happen while parsing the manifest (e.g. "CAT/DOG" is not a valid LSN > and we should reject that at this stage) but semantic validation > should happen later (e.g. "0/0" can't be the correct backup end LSN > but we don't figure that out while parsing, but rather later). I think > you should actually move validation of the system identifier to the > point where the directory walk encounters the control file (and update > the docs and tests to match that decision). Imagine if you wanted to > validate a tar-format backup; then you wouldn't have random access to > the directory. You'd see the manifest file first, and then all the > files in a random order, with one chance to look at each one. > > (This is, in fact, a feature I think we should implement.) > > - if (strcmp(token, "1") != 0) > + parse->manifest_version =
Re: pgsql: Clean up role created in new subscription test.
On Fri, 1 Dec 2023 at 18:23, Daniel Gustafsson wrote: > > > On 1 Dec 2023, at 13:19, Alvaro Herrera wrote: > > > > Isn't it simpler to use DROP OWNED BY in 0001? > > I suppose it is, I kind of like the explicit drops but we do use OWNED BY > quite > a lot in the regression tests so changed to that in the attached v5. There are a lot of failures in CFBot at [1] with: # test failed --- stderr --- # left over global object detected: regress_test_bypassrls # 1 of 2 tests failed. More details of the same are available at [2]. Do we need to clean up the objects leftover for the reported issues in the test? [1] - https://cirrus-ci.com/task/6222185975513088 [2] - https://api.cirrus-ci.com/v1/artifact/task/6222185975513088/meson_log/build/meson-logs/testlog-running.txt Regards, Vignesh
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Thu, Jan 18, 2024 at 6:38 AM Tom Lane wrote: > > Alexander Korotkov writes: > > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi > > wrote: > >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which > >> indicate "immediately error out" and 'just ignore the failure' > >> respectively, but these options hardly seem to denote a 'location', > >> and appear more like an 'action'. I somewhat suspect that this > >> parameter name was initially conceived with the assumption that it would > >> take file names or similar parameters. I'm not sure if others will > >> agree, but I think the parameter name might not be the best > >> choice. For instance, considering the addition of the third value > >> 'log', something like on_error_action (error, ignore, log) would be > >> more intuitively understandable. What do you think? > > > Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes > > the next word will be location, not action. With some stretch we can > > assume 'error' to be location. I think it would be even more stretchy > > to think that SAVE_ERROR_TO is followed by action. > > The other problem with this terminology is that with 'none', what it > is doing is the exact opposite of "saving" the errors. I agree we > need a better name. Agreed. > > Kyotaro-san's suggestion isn't bad, though I might shorten it to > error_action {error|ignore|log} (or perhaps "stop" instead of "error")? > You will need a separate parameter anyway to specify the destination > of "log", unless "none" became an illegal table name when I wasn't > looking. I don't buy that one parameter that has some special values > while other values could be names will be a good design. Moreover, > what if we want to support (say) log-to-file along with log-to-table? > Trying to distinguish a file name from a table name without any other > context seems impossible. 
I've been thinking we can add more values to this option to log errors not only to the server logs but also to the error table (not sure details but I imagined an error table is created for each table on error), without an additional option for the destination name. The values would be like error_action {error|ignore|save-logs|save-table}. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Teach predtest about IS [NOT] proofs
On Fri, Dec 22, 2023 at 2:48 PM Tom Lane wrote: > > James Coleman writes: > > I've not yet applied all of your feedback, but I wanted to get an > > initial read on your thoughts on how using switch statements ends up > > looking. Attached is a single (pure refactor) patch that converts the > > various if/else levels that check things like node tag and > > boolean/null test type into switch statements. I've retained 'default' > > keyword usages for now for simplicity (my intuition is that we > > generally prefer to list out all options for compiler safety benefits, > > though I'm not 100% sure that's useful in the outer node tag check > > since it's unlikely that someone adding a new node would modify > > this...). > > > My big question is: are you comfortable with the indentation explosion > > this creates? IMO it's a lot wordier, but it is also more obvious what > > the structural goal is. I'm not sure how we want to make the right > > trade-off though. > > Yeah, I see what you mean. Also, I'd wanted to shove most of > the text in the function header in-line and get rid of the short > restatements of those paras. I carried that through just for > predicate_implied_by_simple_clause, as attached. The structure is > definitely clearer, but we end up with an awful lot of indentation, > which makes the comments less readable than I'd like. (I did some > minor rewording to make them flow better.) > > On balance I think this is probably better than what we have, but > maybe we'd be best off to avoid doubly nested switches? I think > there's a good argument for the outer switch on nodeTag, but > maybe we're getting diminishing returns from an inner switch. > > regards, tom lane > Apologies for the long delay. Attached is a new patch series. 0001 does the initial pure refactor. 0002 makes a lot of modifications to what we can prove about implication and refutation. 
Finally, 0003 isn't intended to be committed, but attempts to validate more holistically that none of the changes creates any invalid proofs beyond the mostly happy-path tests added in 0004. I ended up not tackling changes to how the test_predtest tests run for now. That's plausibly still useful, and I'd be happy to add it if you generally agree with the direction of the patch and with that abstraction being useful. I added some additional verifications to the test_predtest module to prevent other obvious flaws.

Regards, James Coleman

v4-0003-Add-temporary-all-permutations-test.patch Description: Binary data
v4-0001-Use-switch-statements-in-predicate_-implied-refut.patch Description: Binary data
v4-0002-Teach-predtest.c-about-BooleanTest.patch Description: Binary data
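For readers skimming the thread, the doubly nested switch shape under discussion looks roughly like this. This is a toy model, not the actual predtest.c code: an outer switch on a (mini) node tag and an inner switch on the boolean-test type, deciding whether one IS [NOT] test on an argument proves another test on the same argument:

```c
#include <stdbool.h>

/* Toy stand-ins for the real node tags and BoolTestType -- hypothetical */
typedef enum { T_BooleanTest, T_NullTest } MiniNodeTag;
typedef enum { IS_TRUE, IS_NOT_TRUE, IS_FALSE, IS_NOT_FALSE,
			   IS_UNKNOWN, IS_NOT_UNKNOWN } MiniBoolTestType;

typedef struct
{
	MiniNodeTag tag;
	MiniBoolTestType booltesttype;
} MiniClause;

/* Does "x <premise>" prove "x <conclusion>" for the same argument x? */
static bool
mini_implies(const MiniClause *premise, const MiniClause *conclusion)
{
	if (premise->tag != T_BooleanTest || conclusion->tag != T_BooleanTest)
		return false;			/* toy model covers BooleanTest only */

	switch (premise->booltesttype)
	{
		case IS_TRUE:
			switch (conclusion->booltesttype)
			{
				case IS_TRUE:
				case IS_NOT_FALSE:
				case IS_NOT_UNKNOWN:
					return true;
				default:
					return false;
			}
		case IS_FALSE:
			switch (conclusion->booltesttype)
			{
				case IS_FALSE:
				case IS_NOT_TRUE:
				case IS_NOT_UNKNOWN:
					return true;
				default:
					return false;
			}
		case IS_UNKNOWN:
			switch (conclusion->booltesttype)
			{
				case IS_UNKNOWN:
				case IS_NOT_TRUE:
				case IS_NOT_FALSE:
					return true;
				default:
					return false;
			}
		default:
			/* an IS_NOT_* premise only trivially proves itself here */
			return premise->booltesttype == conclusion->booltesttype;
	}
	return false;				/* not reached */
}
```

The indentation cost of the second switch level is visible even at this scale, which is the trade-off Tom raises above: the outer switch on the node tag buys clarity, while the inner one mostly buys exhaustiveness checking.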
Re: Strange Bitmapset manipulation in DiscreteKnapsack()
On Tue, 16 Jan 2024 at 16:32, David Rowley wrote: > > While working on [1], I noticed some strange code in > DiscreteKnapsack() which seems to be aiming to copy the Bitmapset. > > It's not that obvious at a casual glance, but: > > sets[j] = bms_del_members(sets[j], sets[j]); > > this is aiming to zero all the words in the set by passing the same > set in both parameters. > > Now that 00b41463c changed Bitmapset to have NULL be the only valid > representation of an empty set, this code no longer makes sense. We > may as well just bms_free() the original set and bms_copy() in the new > set as the bms_del_members() call will always pfree the set anyway. > > I've done that in the attached. > > I did consider if we might want bms_merge_members() or > bms_exchange_members() or some other suitably named function > to perform a del/add operation, but given the lack of complaints about > any performance regressions here, I think it's not worthwhile. After looking at this again and reading more code, I see that DiscreteKnapsack() goes to some efforts to minimise memory allocations. The function's header comment mentions "The bitmapsets are all pre-initialized with an unused high bit so that memory allocation is done only once.". I tried adding some debugging output to track how many additional allocations we're now causing as a result of 00b41463c. Previously there'd have just been max_weight allocations, but now there's those plus the number that's mentioned for "frees" below. 
NOTICE: DiscreteKnapsack: frees = 373, max_weight = 140, extra = 266.43%
NOTICE: DiscreteKnapsack: frees = 373, max_weight = 140, extra = 266.43%
NOTICE: DiscreteKnapsack: frees = 267, max_weight = 100, extra = 267.00%
NOTICE: DiscreteKnapsack: frees = 267, max_weight = 100, extra = 267.00%
NOTICE: DiscreteKnapsack: frees = 200, max_weight = 140, extra = 142.86%
NOTICE: DiscreteKnapsack: frees = 200, max_weight = 140, extra = 142.86%
NOTICE: DiscreteKnapsack: frees = 30, max_weight = 40, extra = 75.00%
NOTICE: DiscreteKnapsack: frees = 110, max_weight = 60, extra = 183.33%
NOTICE: DiscreteKnapsack: frees = 110, max_weight = 60, extra = 183.33%
NOTICE: DiscreteKnapsack: frees = 110, max_weight = 60, extra = 183.33%
NOTICE: DiscreteKnapsack: frees = 110, max_weight = 60, extra = 183.33%

and by the looks of the code, the worst case is much worse. Given that the original code was written in a very deliberate way to avoid reallocations, I now think that we should maintain that. I've attached a patch which adds bms_replace_members(). It's basically like bms_copy() but attempts to reuse the memory of the existing set. I considered if the new function should be called bms_copy_inplace(), but left it as bms_replace_members() for now. Now I wonder if this should be backpatched to PG16. 
David

diff --git a/src/backend/lib/knapsack.c b/src/backend/lib/knapsack.c
index 13d800718f..439da1ad70 100644
--- a/src/backend/lib/knapsack.c
+++ b/src/backend/lib/knapsack.c
@@ -89,10 +89,7 @@ DiscreteKnapsack(int max_weight, int num_items,
 		{
 			/* copy sets[ow] to sets[j] without realloc */
 			if (j != ow)
-			{
-				sets[j] = bms_del_members(sets[j], sets[j]);
-				sets[j] = bms_add_members(sets[j], sets[ow]);
-			}
+				sets[j] = bms_replace_members(sets[j], sets[ow]);
 
 			sets[j] = bms_add_member(sets[j], i);
 
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b0f908f978..8be61cd081 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -976,6 +976,50 @@ bms_add_members(Bitmapset *a, const Bitmapset *b)
 	return result;
 }
 
+/*
+ * bms_replace_members
+ *		Remove all existing members from 'a' and repopulate the set with members
+ *		from 'b', recycling 'a', when possible.
+ */
+Bitmapset *
+bms_replace_members(Bitmapset *a, const Bitmapset *b)
+{
+	int			i;
+
+	Assert(bms_is_valid_set(a));
+	Assert(bms_is_valid_set(b));
+
+	if (a == NULL)
+		return bms_copy(b);
+	if (b == NULL)
+	{
+		pfree(a);
+		return NULL;
+	}
+
+	if (a->nwords < b->nwords)
+		a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(b->nwords));
+
+	i = 0;
+	do
+	{
+		a->words[i] = b->words[i];
+	} while (++i < b->nwords);
+
+	a->nwords = b->nwords;
+
+#ifdef REALLOCATE_BITMAPSETS
+
+	/*
+	 * There's no guarantee that the repalloc returned a new pointer, so copy
+	 * and free unconditionally here.
+	 */
+	a = bms_copy_and_free(a);
+#endif
+
+	return a;
+}
+
 /*
  * bms_add_range
  *		Add members in the range of 'lower' to
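The recycling idea is easy to demonstrate outside the backend with a toy flat-array set. The names and structure below are hypothetical simplifications of the real patch (which also carries Assert checks and REALLOCATE_BITMAPSETS handling): reuse the existing allocation when it is big enough, and reallocate only when it is not:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A toy stand-in for Bitmapset: a length-prefixed array of 64-bit words */
typedef struct
{
	int			nwords;
	uint64_t	words[];		/* flexible array member */
} ToySet;

static ToySet *
toy_make(int nwords)
{
	ToySet	   *s = malloc(sizeof(ToySet) + nwords * sizeof(uint64_t));

	s->nwords = nwords;
	memset(s->words, 0, nwords * sizeof(uint64_t));
	return s;
}

/* Copy b into a, reusing a's allocation whenever it is large enough */
static ToySet *
toy_replace_members(ToySet *a, const ToySet *b)
{
	if (b == NULL)				/* replacing with the empty set frees a */
	{
		free(a);
		return NULL;
	}
	if (a == NULL)
		a = toy_make(b->nwords);
	else if (a->nwords < b->nwords)
		a = realloc(a, sizeof(ToySet) + b->nwords * sizeof(uint64_t));
	memcpy(a->words, b->words, b->nwords * sizeof(uint64_t));
	a->nwords = b->nwords;
	return a;
}
```

In the common DiscreteKnapsack() case where the destination already has enough words, no allocator call happens at all, which is exactly the property the original pre-00b41463c code was engineered for.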
linux cachestat in file Readv and Prefetch
Hi,

I was testing the index prefetch and streamIO patches and I added the cachestat() syscall to get a better view of the prefetching. It's a new linux syscall, it requires 6.5, and it provides a lot of interesting information from the VM for the range of pages examined. It's way faster than the old mincore() and provides much more valuable information:

 uint64 nr_cache;            /* Number of cached pages */
 uint64 nr_dirty;            /* Number of dirty pages */
 uint64 nr_writeback;        /* Number of pages marked for writeback. */
 uint64 nr_evicted;          /* Number of pages evicted from the cache. */
 /*
  * Number of recently evicted pages. A page is recently evicted if its
  * last eviction was recent enough that its reentry to the cache would
  * indicate that it is actively being used by the system, and that there
  * is memory pressure on the system.
  */
 uint64 nr_recently_evicted;

While here I also added some quick tweaks to suspend prefetching on memory pressure. It's working but I have absolutely not checked the performance impact of my additions. Sharing here for others to test, and adding it to the CF in case there is interest to go further in this direction.

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

From 4cdc5e952e0cd729bde472bdc3d7ff7bbc3009ea Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?C=C3=A9dric=20Villemain?=
Date: Thu, 18 Jan 2024 01:00:37 +0100
Subject: [PATCH] Add cachestat, requires linux 6.5

there is USE_CACHESTAT defined and ifdef as the feature is only for
testing and discussion at the moment.

cachestat provides a struct with those following fields set for the
range of pages examined. The syscall is supposed to be fast enough but
maybe not for each PageRead/Prefetch.

The very basic logic to suspend prefetching is effective enough for
some scenarios (ANALYZE being a good candidate). 
---
 src/backend/storage/file/fd.c | 85 +++
 1 file changed, 85 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 43d2b2a156e..ffdbd542373 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -181,6 +181,57 @@ int io_direct_flags;
 	((void) 0)
 #endif
 
+/* cachestat - use structs as defined in linux */
+typedef struct
+{
+	uint64		nr_cache;		/* Number of cached pages */
+	uint64		nr_dirty;		/* Number of dirty pages */
+	uint64		nr_writeback;	/* Number of pages marked for writeback. */
+	uint64		nr_evicted;		/* Number of pages evicted from the cache. */
+
+	/*
+	 * Number of recently evicted pages. A page is recently evicted if its
+	 * last eviction was recent enough that its reentry to the cache would
+	 * indicate that it is actively being used by the system, and that there
+	 * is memory pressure on the system.
+	 */
+	uint64		nr_recently_evicted;
+} cachestat;
+
+typedef struct
+{
+	uint64		off;
+	uint64		len;
+} cachestat_range;
+
+/*
+ * simple wrapper of linux cachestat syscall
+ * *cstat point to a member in Vfd.
+ * XXX define USE_CACHESTAT ?
+ */
+static inline void
+vm_cachestat(int fd, cachestat_range csrange, cachestat *cstat)
+{
+#define USE_CACHESTAT 1
+
+#if defined(USE_CACHESTAT)
+#define __NR_cachestat 451
+
+	long		rc;
+
+	rc = syscall(__NR_cachestat, fd, &csrange, cstat, 0);
+	if (rc == -1)
+	{
+		if (errno == ENOSYS)
+			ereport(DEBUG1,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg("sys_cachestat is not available: %m"),
+					 errhint("linux 6.5 minimum is required!")));
+
+		ereport(ERROR,
+				(errcode(ERRCODE_INTERNAL_ERROR),
+				 errmsg("sys_cachestat returns: %m")));
+	}
+#endif
+}
+
 #define VFD_CLOSED (-1)
 
 #define FileIsValid(file) \
@@ -206,6 +257,9 @@ typedef struct vfd
 	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
 	int			fileFlags;		/* open(2) flags for (re)opening the file */
 	mode_t		fileMode;		/* mode to pass to open(2) */
+#if defined(USE_CACHESTAT)
+	cachestat	cstat;			/* linux cachestat() */
+#endif
 } Vfd;
 
 /*
@@ -2090,6 +2144,33 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
 	if (returnCode < 0)
 		return returnCode;
 
+#if defined(USE_CACHESTAT)
+	/*
+	 * last time we visit this file (somewhere), nr_recently_evicted pages of
+	 * the range were just removed from vm cache, it's a sign of memory
+	 * pressure. so do not prefetch further.
+	 * it is hard to guess if it is always the right choice in absence of
+	 * more information like:
+	 * - prefetching distance expected overall
+	 * - access pattern/backend maybe
+	 */
+	if (VfdCache[file].cstat.nr_recently_evicted > 0)
+	{
+		DO_DB(elog(LOG, "FilePrefetch: nr_recently_evicted=" INT64_FORMAT "\t\t\t\tnoway",
+				   VfdCache[file].cstat.nr_recently_evicted));
+		return 0;
+	}
+	vm_cachestat(VfdCache[file].fd, (cachestat_range) {offset, amount},
+
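For anyone wanting to poke at the syscall outside the backend, here is a minimal standalone probe. The struct layouts are copied from the patch; the fallback syscall number 451 and the helper names are my own assumptions, and kernels older than 6.5 will simply fail with ENOSYS:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Mirrors of the kernel structs, as declared in the patch */
typedef struct
{
	uint64_t	nr_cache;
	uint64_t	nr_dirty;
	uint64_t	nr_writeback;
	uint64_t	nr_evicted;
	uint64_t	nr_recently_evicted;
} my_cachestat;

typedef struct
{
	uint64_t	off;
	uint64_t	len;
} my_cachestat_range;

#ifndef __NR_cachestat
#define __NR_cachestat 451		/* assumption: unified syscall-table number */
#endif

/* Returns 0 on success, -1 with errno set (ENOSYS on kernels before 6.5) */
static int
probe_cachestat(int fd, uint64_t off, uint64_t len, my_cachestat *cs)
{
	my_cachestat_range range = {off, len};

	return (int) syscall(__NR_cachestat, fd, &range, cs, 0);
}
```

On a 6.5+ kernel the call fills in the page counts for the requested range; on older kernels it fails cleanly, which is the case the patch downgrades to a DEBUG1 message.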
Re: Improve the connection failure error messages
On Wed, Jan 17, 2024 at 7:15 PM Nisha Moond wrote: > > > > > ~~ > > > > BTW, while experimenting with the bad connection ALTER I also tried > > setting 'disable_on_error' like below: > > > > ALTER SUBSCRIPTION sub4 SET (disable_on_error); > > ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1'; > > > > ...but here the subscription did not become DISABLED as I expected it > > would do on the next connection error iteration. It remains enabled > > and just continues to loop relaunch/ERROR indefinitely same as before. > > > > That looks like it may be a bug. Thoughts? > > > Ideally, if the already running apply worker in > "LogicalRepApplyLoop()" has any exception/error it will be handled and > the subscription will be disabled if 'disable_on_error' is set - > > start_apply(XLogRecPtr origin_startpos) > { > PG_TRY(); > { > LogicalRepApplyLoop(origin_startpos); > } > PG_CATCH(); > { > if (MySubscription->disableonerr) > DisableSubscriptionAndExit(); > ... > > What is happening in this case is that the control reaches the function - > run_apply_worker() -> start_apply() -> LogicalRepApplyLoop -> > maybe_reread_subscription() > ... > /* > * Exit if any parameter that affects the remote connection was changed. > * The launcher will start a new worker but note that the parallel apply > * worker won't restart if the streaming option's value is changed from > * 'parallel' to any other value or the server decides not to stream the > * in-progress transaction. > */ > if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 || > ... > > and it sees a change in the parameter and calls apply_worker_exit(). > This will exit the current process, without throwing an exception to > the caller and the postmaster will try to restart the apply worker. > The new apply worker, before reaching the start_apply() [where we > handle exception], will hit the code to establish the connection to > the publisher - > > ApplyWorkerMain() -> run_apply_worker() - > ... 
> LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo, > true /* replication */ , > true, > must_use_password, > MySubscription->name, ); > > if (LogRepWorkerWalRcvConn == NULL) > ereport(ERROR, > (errcode(ERRCODE_CONNECTION_FAILURE), >errmsg("could not connect to the publisher: %s", err))); > ... > and due to the bad connection string in the subscription, it will error out. > [28680] ERROR: could not connect to the publisher: invalid port number: "-1" > [3196] LOG: background worker "logical replication apply worker" (PID > 28680) exited with exit code 1 > > Now, the postmaster keeps trying to restart the apply worker and it > will keep failing until the connection string is corrected or the > subscription is disabled manually. > > I think this is a bug that needs to be handled in run_apply_worker() > when disable_on_error is set. > IMO, this bug-fix discussion deserves a separate thread. Thoughts? Hi Nisha, Thanks for your analysis -- it is the same as my understanding. As suggested, I have created a new thread for any further discussion related to this 'disable_on_error' topic [1]. == [1] https://www.postgresql.org/message-id/flat/CAHut%2BPuEsekA3e7ThwzWr%2BUs4x%3DLzkF7DSrED1UsZTUqNrhCUQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo
Hi, I had reported a possible subscription 'disable_on_error' bug found while reviewing another patch. I am including my initial report and Nisha's analysis again here so that this topic has its own thread.

== INITIAL REPORT [1] ==

... I see now that any ALTER of the subscription's connection, even to some value that fails, will restart a new worker (like ALTER of any other subscription parameters). For a bad connection, it will continue to relaunch-worker/ERROR over and over. e.g.

--
test_sub=#
2024-01-17 09:34:28.665 AEDT [11274] LOG: logical replication apply worker for subscription "sub4" has started
2024-01-17 09:34:28.666 AEDT [11274] ERROR: could not connect to the publisher: invalid port number: "-1"
2024-01-17 09:34:28.667 AEDT [928] LOG: background worker "logical replication apply worker" (PID 11274) exited with exit code 1
2024-01-17 09:34:33.669 AEDT [11391] LOG: logical replication apply worker for subscription "sub4" has started
2024-01-17 09:34:33.669 AEDT [11391] ERROR: could not connect to the publisher: invalid port number: "-1"
2024-01-17 09:34:33.670 AEDT [928] LOG: background worker "logical replication apply worker" (PID 11391) exited with exit code 1
etc...
--

While experimenting with the bad connection ALTER I also tried setting 'disable_on_error' like below:

ALTER SUBSCRIPTION sub4 SET (disable_on_error);
ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';

...but here the subscription did not become DISABLED as I expected it would do on the next connection error iteration. It remains enabled and just continues to loop relaunch/ERROR indefinitely same as before.

That looks like it may be a bug. Thoughts? 
= ANALYSIS BY NISHA [2] = Ideally, if the already running apply worker in "LogicalRepApplyLoop()" has any exception/error it will be handled and the subscription will be disabled if 'disable_on_error' is set - start_apply(XLogRecPtr origin_startpos) { PG_TRY(); { LogicalRepApplyLoop(origin_startpos); } PG_CATCH(); { if (MySubscription->disableonerr) DisableSubscriptionAndExit(); ... What is happening in this case is that the control reaches the function - run_apply_worker() -> start_apply() -> LogicalRepApplyLoop -> maybe_reread_subscription() ... /* * Exit if any parameter that affects the remote connection was changed. * The launcher will start a new worker but note that the parallel apply * worker won't restart if the streaming option's value is changed from * 'parallel' to any other value or the server decides not to stream the * in-progress transaction. */ if (strcmp(newsub->conninfo, MySubscription->conninfo) != 0 || ... and it sees a change in the parameter and calls apply_worker_exit(). This will exit the current process, without throwing an exception to the caller and the postmaster will try to restart the apply worker. The new apply worker, before reaching the start_apply() [where we handle exception], will hit the code to establish the connection to the publisher - ApplyWorkerMain() -> run_apply_worker() - ... LogRepWorkerWalRcvConn = walrcv_connect(MySubscription->conninfo, true /* replication */ , true, must_use_password, MySubscription->name, ); if (LogRepWorkerWalRcvConn == NULL) ereport(ERROR, (errcode(ERRCODE_CONNECTION_FAILURE), errmsg("could not connect to the publisher: %s", err))); ... and due to the bad connection string in the subscription, it will error out. 
[28680] ERROR: could not connect to the publisher: invalid port number: "-1" [3196] LOG: background worker "logical replication apply worker" (PID 28680) exited with exit code 1 Now, the postmaster keeps trying to restart the apply worker and it will keep failing until the connection string is corrected or the subscription is disabled manually. I think this is a bug that needs to be handled in run_apply_worker() when disable_on_error is set. == [1] https://www.postgresql.org/message-id/CAHut%2BPvcf7P2dHbCeWPM4jQ%3DyHqf4WFS_C6eVb8V%3DbcZPMMp7A%40mail.gmail.com [2] https://www.postgresql.org/message-id/CABdArM6ORu%2BKpS_kXd-jwwPBqYPo1YqZjwwGnqmVanWgdHCggA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
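The restart behaviour described above can be modelled with a small simulation. This is purely illustrative pseudologic with hypothetical names, not backend code: today a connect failure just exits the worker and the postmaster relaunches it indefinitely, while the proposed fix would disable the subscription on the first failed launch when disable_on_error is set:

```c
#include <stdbool.h>

typedef enum { SUB_ENABLED, SUB_DISABLED, SUB_CONNECTED } SubState;

/*
 * Toy model of the launcher/apply-worker cycle. Each launch tries to
 * connect; on failure, the proposed fix disables the subscription when
 * disable_on_error is set, whereas today's code just exits with ERROR and
 * lets the postmaster relaunch the worker.
 * Returns the number of worker launches; *final reports the end state.
 */
static int
simulate_launches(bool disable_on_error, bool conn_ever_succeeds,
				  int max_launches, SubState *final)
{
	for (int launches = 1; launches <= max_launches; launches++)
	{
		if (conn_ever_succeeds)
		{
			*final = SUB_CONNECTED;
			return launches;
		}
		if (disable_on_error)
		{
			*final = SUB_DISABLED;	/* proposed: handle connect errors too */
			return launches;
		}
		/* today: worker exits with ERROR, postmaster relaunches */
	}
	*final = SUB_ENABLED;		/* still looping when we stopped watching */
	return max_launches;
}
```

The contrast is the point of the bug report: with the fix, a bad conninfo plus disable_on_error converges after one launch instead of looping until someone intervenes.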
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 5:47 PM Melanie Plageman wrote: > > On Wed, Jan 17, 2024 at 4:31 PM Peter Geoghegan wrote: > > > > On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan wrote: > > > I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the > > > wrong idea. If it's such a good idea then why not apply it all the > > > time? That is, why not apply it independently of whether nindexes==0 > > > in the current VACUUM operation? (You know, just like with > > > FAILSAFE_EVERY_PAGES.) > > > > Actually, I suppose that we couldn't apply it independently of > > nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our > > second pass over the heap takes place for those LP_DEAD-containing > > heap pages scanned since the last round of index/heap vacuuming took > > place (or since VACUUM began). We need to make sure that the FSM has > > the most recent possible information known to VACUUM, which would > > break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. > > > > Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. > > I now see I misunderstood and my earlier email was wrong. I didn't > notice that we only use VACUUM_FSM_EVERY_PAGES if nindexes ==0. > So, in master, we call FreeSpaceMapVacuumRange() always after a round > of index vacuuming and periodically if there are no indexes. The "nindexes == 0" if() that comes just after our call to lazy_scan_prune() is "the one-pass equivalent of a call to lazy_vacuum()". Though this includes the call to FreeSpaceMapVacuumRange() that immediately follows the two-pass case calling lazy_vacuum(), too. > It seems like you are asking whether or not we should vacuum the FSM at a > different cadence for the no indexes case (and potentially count > blocks actually vacuumed instead of blocks considered). > > And it seems like Robert is asking whether or not we should > FreeSpaceMapVacuumRange() more frequently than after index vacuuming > in the nindexes > 0 case. 
There is no particular reason for the nindexes==0 case to care about how often we'd call FreeSpaceMapVacuumRange() in the counterfactual world where the same VACUUM ran on the same table, except that it was nindexes>1 instead. At least I don't see any. > Other than the overhead of the actual vacuuming of the FSM, what are > the potential downsides of knowing about freespace sooner? It could > change what pages are inserted to. What are the possible undesirable > side effects? The whole VACUUM_FSM_EVERY_PAGES thing comes from commit 851a26e266. The commit message of that work seems to suppose that calling FreeSpaceMapVacuumRange() more frequently is pretty much strictly better than calling it less frequently, at least up to the point where certain more-or-less fixed costs paid once per FreeSpaceMapVacuumRange() start to become a problem. I think that that's probably about right. The commit message also says that we "arbitrarily update upper FSM pages after each 8GB of heap" (in the nindexes==0 case). So VACUUM_FSM_EVERY_PAGES is only very approximately analogous to what we do in the nindexes>1 case. That seems reasonable because these two cases really aren't so comparable in terms of the FSM vacuuming requirements -- the nindexes==0 case legitimately doesn't have the same dependency on heap vacuuming (and index vacuuming) that we have to consider when nindexes>1. -- Peter Geoghegan
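The cadence difference being discussed can be sketched with a toy counter. TOY_FSM_EVERY_PAGES is a tiny stand-in for the real constant (8GB worth of pages), and the two modes correspond to counting all processed pages including VM-skipped ones (master's behaviour) versus scanned pages only (the change Peter suggests above):

```c
#include <stdbool.h>

/* Tiny stand-in for the real VACUUM_FSM_EVERY_PAGES (8GB worth of pages) */
#define TOY_FSM_EVERY_PAGES 4

/*
 * Count how many times FreeSpaceMapVacuumRange() would fire over a scan of
 * rel_pages blocks, where skipped[] marks pages passed over via the
 * visibility map. "scanned_only" selects the suggested accounting.
 */
static int
fsm_vacuum_count(int rel_pages, const bool *skipped, bool scanned_only)
{
	int			triggers = 0;
	int			since_last = 0;

	for (int blk = 0; blk < rel_pages; blk++)
	{
		if (scanned_only && skipped[blk])
			continue;			/* VM-skipped pages involve no physical work */
		if (++since_last >= TOY_FSM_EVERY_PAGES)
		{
			triggers++;
			since_last = 0;
		}
	}
	return triggers;
}
```

With many VM-skipped pages, the processed-pages accounting triggers FSM vacuuming more often even though no extra physical work (or time) has elapsed, which is the loose relationship Peter points out.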
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 4:31 PM Peter Geoghegan wrote: > > On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan wrote: > > I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the > > wrong idea. If it's such a good idea then why not apply it all the > > time? That is, why not apply it independently of whether nindexes==0 > > in the current VACUUM operation? (You know, just like with > > FAILSAFE_EVERY_PAGES.) > > Actually, I suppose that we couldn't apply it independently of > nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our > second pass over the heap takes place for those LP_DEAD-containing > heap pages scanned since the last round of index/heap vacuuming took > place (or since VACUUM began). We need to make sure that the FSM has > the most recent possible information known to VACUUM, which would > break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. > > Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. I now see I misunderstood and my earlier email was wrong. I didn't notice that we only use VACUUM_FSM_EVERY_PAGES if nindexes == 0. So, in master, we call FreeSpaceMapVacuumRange() always after a round of index vacuuming and periodically if there are no indexes. It seems like you are asking whether or not we should vacuum the FSM at a different cadence for the no indexes case (and potentially count blocks actually vacuumed instead of blocks considered). And it seems like Robert is asking whether or not we should FreeSpaceMapVacuumRange() more frequently than after index vacuuming in the nindexes > 0 case. Other than the overhead of the actual vacuuming of the FSM, what are the potential downsides of knowing about freespace sooner? It could change what pages are inserted to. What are the possible undesirable side effects? - Melanie
Re: initdb's -c option behaves wrong way?
> On 17 Jan 2024, at 21:33, Tom Lane wrote: > > I wrote: >> However ... I don't like the patch much. It seems to have left >> the code in a rather random state. Why, for example, didn't you >> keep all the code that constructs the "newline" value together? > > After thinking about it a bit more, I don't see why you didn't just > s/strncmp/strncasecmp/ and call it good. The messiness seems to be > a result of your choice to replace the GUC's case as shown in the > file with the case used on the command line, which is not better > IMO. We don't change our mind about the canonical spelling of a > GUC because somebody varied the case in a SET command. The original patch preserved the case used in the file, throwing away the case from the command line. The attached is a slimmed-down version which only moves the assembly of "newline" to the different cases (replace vs. new), keeping the rest of the code intact. Keeping it in one place gets sort of messy too since it needs to use different values for a replacement and a new variable. Not sure if there is a cleaner approach? -- Daniel Gustafsson v3-0001-Make-initdb-c-option-case-insensitive.patch Description: Binary data
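The s/strncmp/strncasecmp/ idea amounts to matching the GUC name case-insensitively while leaving the file's canonical spelling untouched. A minimal sketch of the matching side (a hypothetical helper, not the actual initdb code):

```c
#include <stdbool.h>
#include <string.h>
#include <strings.h>			/* strncasecmp (POSIX) */

/*
 * Does this postgresql.conf line set (or comment out) the given GUC?
 * Comparison is case-insensitive, so e.g. "initdb -c dateSTYLE=..." matches
 * the file's "#datestyle = ..." line; a caller rewriting the value would
 * keep the file's own spelling of the name.
 */
static bool
guc_line_matches(const char *line, const char *guc_name)
{
	size_t		namelen = strlen(guc_name);

	if (*line == '#')			/* commented-out default */
		line++;
	if (strncasecmp(line, guc_name, namelen) != 0)
		return false;
	line += namelen;
	while (*line == ' ' || *line == '\t')
		line++;
	return *line == '=';		/* rejects e.g. a longer name sharing a prefix */
}
```

Note the trailing check: requiring whitespace-then-'=' after the name keeps a command-line "datestyle" from matching an unrelated longer variable name that merely starts with the same characters.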
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan wrote: > > On Wed, Jan 17, 2024 at 3:58 PM Robert Haas wrote: > > > Ah, I realize I was not clear. I am now talking about inconsistencies > > > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > > > the freespace map during the course of vacuuming the heap relation. > > > > Fair enough, but I'm still not quite sure exactly what the question > > is. It looks to me like the current code, when there are indexes, > > vacuums the FSM after each round of index vacuuming. When there are no > > indexes, doing it after each round of index vacuuming would mean never > > doing it, so instead we vacuum the FSM every ~8GB. I assume what > > happened here is that somebody decided doing it after each round of > > index vacuuming was the "right thing," and then realized that was not > > going to work if no index vacuuming was happening, and so inserted the > > 8GB threshold to cover that case. > > Note that VACUUM_FSM_EVERY_PAGES is applied against the number of > rel_pages "processed" so far -- *including* any pages that were > skipped using the visibility map. It would make a bit more sense if it > was applied against scanned_pages instead (just like > FAILSAFE_EVERY_PAGES has been since commit 07eef53955). In other > words, VACUUM_FSM_EVERY_PAGES is applied against a thing that has only > a very loose relationship with physical work performed/time elapsed. This is a good point. Seems like a very reasonable change to make, as I would think that was the original intent. - Melanie
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 3:58 PM Robert Haas wrote: > > On Wed, Jan 17, 2024 at 3:12 PM Melanie Plageman > wrote: > > > Yes, I also spent some time thinking about this. In master, we do > > always call lazy_scan_new_or_empty() before calling > > lazy_scan_noprune(). The code is aiming to ensure we call > > lazy_scan_new_or_empty() once before calling either of > > lazy_scan_noprune() or lazy_scan_prune(). I think it didn't call > > lazy_scan_new_or_empty() unconditionally first because of the > > different lock types expected. But, your structure has solved that. > > I've used a version of your example code above in attached v9. It is > > much nicer. > > Oh, OK, I see it now. I missed that lazy_scan_new_or_empty() was > called either way. Glad that my proposed restructuring managed to be > helpful despite that confusion, though. :-) > > At a quick glance, I also like the way this looks. I'll review it more > thoroughly later. Does this patch require 0002 and 0003 or could it > equally well go first? I confess that I don't entirely understand why > we want 0002 and 0003. Well, 0002 and 0003 move the updates to the visibility map into lazy_scan_prune(). We only want to update the VM if we called lazy_scan_prune() (i.e. not if lazy_scan_noprune() returned true). We also need the lock on the heap page when updating the visibility map but we want to have released the lock before updating the FSM, so we need to first update the VM then the FSM. The VM update code, in my opinion, belongs in lazy_scan_prune() -- since it is only done when lazy_scan_prune() is called. To keep the VM update code in lazy_scan_heap() and still consolidate the FSM update code, we would have to surround all of the VM update code in a test (if got_cleanup_lock, I suppose). I don't see any advantage in doing that. > > Ah, I realize I was not clear. I am now talking about inconsistencies > > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). 
Not updating > > the freespace map during the course of vacuuming the heap relation. > > Fair enough, but I'm still not quite sure exactly what the question > is. It looks to me like the current code, when there are indexes, > vacuums the FSM after each round of index vacuuming. When there are no > indexes, doing it after each round of index vacuuming would mean never > doing it, so instead we vacuum the FSM every ~8GB. I assume what > happened here is that somebody decided doing it after each round of > index vacuuming was the "right thing," and then realized that was not > going to work if no index vacuuming was happening, and so inserted the > 8GB threshold to cover that case. Ah, I see. I understood that we want to update the FSM every 8GB, but I didn't understand that we wanted to check if we were at that 8GB only after a round of index vacuuming. That would explain why we also had to do it in the no indexes case -- because, as you say, there wouldn't be a round of index vacuuming. This does mean that something is not quite right with 0001 as well as 0004. We'd end up checking if we are at 8GB much more often. I should probably find a way to replicate the cadence on master. > I don't really know what to make of > all of this. On a happy PostgreSQL system, doing anything after each > round of index vacuuming means doing it once, because multiple rounds > of indexing vacuum are extremely expensive and we hope that it won't > ever occur. From that point of view, the 8GB threshold is better, > because it means that when we vacuum a large relation, space should > become visible to the rest of the system incrementally without needing > to wait for the entire vacuum to finish. On the other hand, we also > have this idea that we want to record free space in the FSM once, > after the last time we touch the page. 
Given that behavior, vacuuming > the FSM every 8GB when we haven't yet done index vacuuming wouldn't > accomplish much of anything, because we haven't updated it for the > pages we just touched. On the third hand, the current behavior seems > slightly ridiculous, because pruning the page is where we're mostly > going to free up space, so we might be better off just updating the > FSM then instead of waiting. That free space could be mighty useful > during the long wait between pruning and marking line pointers unused. > On the fourth hand, that seems like a significant behavior change that > we might not want to undertake without a bunch of research that we > might not want to do right now -- and if we did do it, should we then > update the FSM a second time after marking line pointers unused? I suspect we'd need to do some testing of various scenarios to justify such a change. - Melanie
modify first-word capitalisation of some messages
Hi. PSA a small patch to adjust the first-word capitalisation of some errmsg/errdetail/errhint so they comply with the guidelines. == Kind Regards, Peter Smith. Fujitsu Australia v1-0001-Error-message-capitalisation.patch Description: Binary data
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
Alexander Korotkov writes: > On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi > wrote: >> On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which >> indicate "immediately error out" and 'just ignore the failure' >> respectively, but these options hardly seem to denote a 'location', >> and appear more like an 'action'. I somewhat suspect that this >> parameter name was initially conceived with the assumption that it would >> take file names or similar parameters. I'm not sure if others will >> agree, but I think the parameter name might not be the best >> choice. For instance, considering the addition of the third value >> 'log', something like on_error_action (error, ignore, log) would be >> more intuitively understandable. What do you think? > Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes > the next word will be location, not action. With some stretch we can > assume 'error' to be location. I think it would be even more stretchy > to think that SAVE_ERROR_TO is followed by action. The other problem with this terminology is that with 'none', what it is doing is the exact opposite of "saving" the errors. I agree we need a better name. Kyotaro-san's suggestion isn't bad, though I might shorten it to error_action {error|ignore|log} (or perhaps "stop" instead of "error")? You will need a separate parameter anyway to specify the destination of "log", unless "none" became an illegal table name when I wasn't looking. I don't buy that one parameter that has some special values while other values could be names will be a good design. Moreover, what if we want to support (say) log-to-file along with log-to-table? Trying to distinguish a file name from a table name without any other context seems impossible. regards, tom lane
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 4:25 PM Peter Geoghegan wrote: > I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the > wrong idea. If it's such a good idea then why not apply it all the > time? That is, why not apply it independently of whether nindexes==0 > in the current VACUUM operation? (You know, just like with > FAILSAFE_EVERY_PAGES.) Actually, I suppose that we couldn't apply it independently of nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our second pass over the heap takes place for those LP_DEAD-containing heap pages scanned since the last round of index/heap vacuuming took place (or since VACUUM began). We need to make sure that the FSM has the most recent possible information known to VACUUM, which would break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0. Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me. -- Peter Geoghegan
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 3:58 PM Robert Haas wrote: > > Ah, I realize I was not clear. I am now talking about inconsistencies > > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > > the freespace map during the course of vacuuming the heap relation. > > Fair enough, but I'm still not quite sure exactly what the question > is. It looks to me like the current code, when there are indexes, > vacuums the FSM after each round of index vacuuming. When there are no > indexes, doing it after each round of index vacuuming would mean never > doing it, so instead we vacuum the FSM every ~8GB. I assume what > happened here is that somebody decided doing it after each round of > index vacuuming was the "right thing," and then realized that was not > going to work if no index vacuuming was happening, and so inserted the > 8GB threshold to cover that case. Note that VACUUM_FSM_EVERY_PAGES is applied against the number of rel_pages "processed" so far -- *including* any pages that were skipped using the visibility map. It would make a bit more sense if it was applied against scanned_pages instead (just like FAILSAFE_EVERY_PAGES has been since commit 07eef53955). In other words, VACUUM_FSM_EVERY_PAGES is applied against a thing that has only a very loose relationship with physical work performed/time elapsed. I tend to suspect that VACUUM_FSM_EVERY_PAGES is fundamentally the wrong idea. If it's such a good idea then why not apply it all the time? That is, why not apply it independently of whether nindexes==0 in the current VACUUM operation? (You know, just like with FAILSAFE_EVERY_PAGES.) -- Peter Geoghegan
Re: On login trigger: take three
On Mon, Jan 15, 2024 at 11:29 AM Daniel Gustafsson wrote: > > On 13 Jan 2024, at 17:00, Alexander Lakhin wrote: > > > I suspected that this failure was caused by autovacuum, and I've managed to > > reproduce it without Valgrind or slowing down a machine. > > This might be due to the fact that the cleanup codepath to remove the flag > when > there is no login event trigger doesn't block on locking pg_database to avoid > stalling connections. There are no guarantees when the flag is cleared, so a > test like this will always have potential to be flaky it seems. +1 Thank you, Daniel. I think you described everything absolutely correctly. As I wrote upthread, it doesn't seem much could be done with this, at least within a regression test. So, I just removed this query from the test. -- Regards, Alexander Korotkov
Re: Assertion failure with epoch when replaying standby records for 2PC
Hi, Michael! On Wed, Jan 17, 2024 at 7:47 AM Michael Paquier wrote: > rorqual has failed today with a very interesting failure: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual=2024-01-17%2005%3A06%3A31 > > This has caused an assertion failure for a 2PC transaction when > replaying one of the tests from the main regression suite: > 2024-01-17 05:08:23.143 UTC [3242608] DETAIL: Last completed transaction was > at log time 2024-01-17 05:08:22.920244+00. > TRAP: failed Assert("epoch > 0"), File: > "../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610 > postgres: standby_1: startup recovering > 0001000C(ExceptionalCondition+0x83)[0x55746c7838c1] > postgres: standby_1: startup recovering > 0001000C(+0x194f0e)[0x55746c371f0e] > postgres: standby_1: startup recovering > 0001000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120] > postgres: standby_1: startup recovering > 0001000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357] > postgres: standby_1: startup recovering > 0001000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5] > postgres: standby_1: startup recovering > 0001000C(standby_redo+0xe1)[0x55746c621490] > postgres: standby_1: startup recovering > 0001000C(PerformWalRecovery+0xa5e)[0x55746c392404] > postgres: standby_1: startup recovering > 0001000C(StartupXLOG+0x3ac)[0x55746c3862b8] > postgres: standby_1: startup recovering > 0001000C(StartupProcessMain+0xd9)[0x55746c5a60f6] > postgres: standby_1: startup recovering > 0001000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd] > postgres: standby_1: startup recovering > 0001000C(+0x3c4235)[0x55746c5a1235] > postgres: standby_1: startup recovering > 0001000C(PostmasterMain+0x1401)[0x55746c5a5a10] > postgres: standby_1: startup recovering > 0001000C(main+0x835)[0x55746c4e90ce] > /lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785] > postgres: standby_1: startup recovering > 
0001000C(_start+0x21)[0x55746c2b61d1] > > This refers to the following in twophase.c with > AdjustToFullTransactionId(): > nextXid = XidFromFullTransactionId(nextFullXid); > epoch = EpochFromFullTransactionId(nextFullXid); > > if (unlikely(xid > nextXid)) > { > /* Wraparound occurred, must be from a prev epoch. */ > Assert(epoch > 0); > epoch--; > } > > This would mean that we've found a way to get a negative epoch, which > should not be possible. > > Alexander, you have added this code in 5a1dfde8334b when switching the > 2PC file names to use FullTransactionIds. Could you check please? Thank you for reporting! I'm going to look at this in the next couple of days. -- Regards, Alexander Korotkov
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Wed, Jan 17, 2024 at 9:49 AM Kyotaro Horiguchi wrote: > At Wed, 17 Jan 2024 14:38:54 +0900, torikoshia > wrote in > > Hi, > > > > Thanks for applying! > > > > > + errmsg_plural("%zd row were skipped due to data type > > > incompatibility", > > > > Sorry, I just noticed it, but 'were' should be 'was' here? > > > > >> BTW I'm thinking we should add a column to pg_stat_progress_copy that > > >> counts soft errors. I'll suggest this in another thread. > > > Please do! > > > > I've started it here: > > > > https://www.postgresql.org/message-id/d12fd8c99adcae2744212cb23feff...@oss.nttdata.com > > Switching topics, this commit (9e2d870119) adds the following help message: > > > > "COPY { %s [ ( %s [, ...] ) ] | ( %s ) }\n" > > "TO { '%s' | PROGRAM '%s' | STDOUT }\n" > > ... > > "SAVE_ERROR_TO '%s'\n" > > ... > > _("location"), > > On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which > indicate "immediately error out" and 'just ignore the failure' > respectively, but these options hardly seem to denote a 'location', > and appear more like an 'action'. I somewhat suspect that this > parameter name was initially conceived with the assumption that it would > take file names or similar parameters. I'm not sure if others will > agree, but I think the parameter name might not be the best > choice. For instance, considering the addition of the third value > 'log', something like on_error_action (error, ignore, log) would be > more intuitively understandable. What do you think? Probably, but I'm not sure about that. The name SAVE_ERROR_TO assumes the next word will be location, not action. With some stretch we can assume 'error' to be location. I think it would be even more stretchy to think that SAVE_ERROR_TO is followed by action. Probably, we can replace SAVE_ERROR_TO with another name which could be naturally followed by action, but I don't have something appropriate in mind. However, I'm not a native English speaker and could certainly be missing something. 
-- Regards, Alexander Korotkov
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
On Wed, Jan 17, 2024 at 7:38 AM torikoshia wrote: > > Hi, > > Thanks for applying! > > > + errmsg_plural("%zd row were skipped due > > to data type incompatibility", > > Sorry, I just noticed it, but 'were' should be 'was' here? Sure, the fix is pushed. -- Regards, Alexander Korotkov
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 3:12 PM Melanie Plageman wrote: > > Reviewing 0001, consider the case where a table has no indexes. > > Pre-patch, PageTruncateLinePointerArray() will happen when > > lazy_vacuum_heap_page() is called; post-patch, it will not occur. > > That's a loss. Should we care? On the plus side, visibility map > > repair, if required, can now take place. That's a gain. > > I thought that this wasn't an issue because heap_page_prune_execute() > calls PageRepairFragmentation() which similarly modifies pd_lower and > sets the hint bit about free line pointers. Ah, OK, I didn't understand that PageRepairFragmentation() does what is also done by PageTruncateLinePointerArray(). > Yes, I also spent some time thinking about this. In master, we do > always call lazy_scan_new_or_empty() before calling > lazy_scan_noprune(). The code is aiming to ensure we call > lazy_scan_new_or_empty() once before calling either of > lazy_scan_noprune() or lazy_scan_prune(). I think it didn't call > lazy_scan_new_or_empty() unconditionally first because of the > different lock types expected. But, your structure has solved that. > I've used a version of your example code above in attached v9. It is > much nicer. Oh, OK, I see it now. I missed that lazy_scan_new_or_empty() was called either way. Glad that my proposed restructuring managed to be helpful despite that confusion, though. :-) At a quick glance, I also like the way this looks. I'll review it more thoroughly later. Does this patch require 0002 and 0003 or could it equally well go first? I confess that I don't entirely understand why we want 0002 and 0003. > Ah, I realize I was not clear. I am now talking about inconsistencies > in vacuuming the FSM itself. FreeSpaceMapVacuumRange(). Not updating > the freespace map during the course of vacuuming the heap relation. Fair enough, but I'm still not quite sure exactly what the question is. 
It looks to me like the current code, when there are indexes, vacuums the FSM after each round of index vacuuming. When there are no indexes, doing it after each round of index vacuuming would mean never doing it, so instead we vacuum the FSM every ~8GB. I assume what happened here is that somebody decided doing it after each round of index vacuuming was the "right thing," and then realized that was not going to work if no index vacuuming was happening, and so inserted the 8GB threshold to cover that case. I don't really know what to make of all of this. On a happy PostgreSQL system, doing anything after each round of index vacuuming means doing it once, because multiple rounds of indexing vacuum are extremely expensive and we hope that it won't ever occur. From that point of view, the 8GB threshold is better, because it means that when we vacuum a large relation, space should become visible to the rest of the system incrementally without needing to wait for the entire vacuum to finish. On the other hand, we also have this idea that we want to record free space in the FSM once, after the last time we touch the page. Given that behavior, vacuuming the FSM every 8GB when we haven't yet done index vacuuming wouldn't accomplish much of anything, because we haven't updated it for the pages we just touched. On the third hand, the current behavior seems slightly ridiculous, because pruning the page is where we're mostly going to free up space, so we might be better off just updating the FSM then instead of waiting. That free space could be mighty useful during the long wait between pruning and marking line pointers unused. On the fourth hand, that seems like a significant behavior change that we might not want to undertake without a bunch of research that we might not want to do right now -- and if we did do it, should we then update the FSM a second time after marking line pointers unused? I'm not sure if any of this is answering your actual question, though. 
-- Robert Haas EDB: http://www.enterprisedb.com
Re: initdb's -c option behaves wrong way?
I wrote: > However ... I don't like the patch much. It seems to have left > the code in a rather random state. Why, for example, didn't you > keep all the code that constructs the "newline" value together? After thinking about it a bit more, I don't see why you didn't just s/strncmp/strncasecmp/ and call it good. The messiness seems to be a result of your choice to replace the GUC's case as shown in the file with the case used on the command line, which is not better IMO. We don't change our mind about the canonical spelling of a GUC because somebody varied the case in a SET command. regards, tom lane
Re: initdb's -c option behaves wrong way?
Robert Haas writes: > On Wed, Jan 17, 2024 at 2:31 PM Daniel Gustafsson wrote: >> Agreed, I think the patch as it stands now where it replaces case >> insensitive, >> while keeping the original casing, is the best path forward. The issue exist >> in 16 as well so I propose a backpatch to there. > I like that approach, too. I could go either way on back-patching. It > doesn't seem important, but it probably won't break anything, either. We just added this switch in 16, so I think backpatching to keep all the branches consistent is a good idea. However ... I don't like the patch much. It seems to have left the code in a rather random state. Why, for example, didn't you keep all the code that constructs the "newline" value together? I'm also unconvinced by the removal of the "assume there's only one match" break --- if we need to support multiple matches I doubt that's sufficient. regards, tom lane
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Wed, Jan 17, 2024 at 12:17 PM Robert Haas wrote: > > On Tue, Jan 16, 2024 at 6:07 PM Melanie Plageman > wrote: > > Attached v8 patch set is rebased over this. > > Reviewing 0001, consider the case where a table has no indexes. > Pre-patch, PageTruncateLinePointerArray() will happen when > lazy_vacuum_heap_page() is called; post-patch, it will not occur. > That's a loss. Should we care? On the plus side, visibility map > repair, if required, can now take place. That's a gain. I thought that this wasn't an issue because heap_page_prune_execute() calls PageRepairFragmentation() which similarly modifies pd_lower and sets the hint bit about free line pointers. > I'm otherwise satisfied with this patch now, except for some extremely > minor nitpicking: > > + * For now, pass mark_unused_now == false regardless of whether > or > > Personally, I would write "pass mark_unused_now as false" here, > because we're not testing equality. Or else "pass mark_unused_now = > false". This is not an equality test. > > + * During pruning, the caller may have passed mark_unused_now == > true, > > Again here, but also, this is referring to the name of a parameter to > a function whose name is not given. I think this should either > talk fully in terms of code ("When heap_page_prune was called, > mark_unused_now may have been passed as true, which allows would-be > LP_DEAD items to be made LP_USED instead.") or fully in English > ("During pruning, we may have decided to mark would-be dead items as > unused."). Fixed both of the above issues as suggested in attached v9 > > In 0004, I've taken the approach you seem to favor and combined the FSM > > updates from the prune and no prune cases in lazy_scan_heap() instead > > of pushing the FSM updates into lazy_scan_prune() and > > lazy_scan_noprune(). > > I do like that approach. > > I think do_prune can be declared one scope inward, in the per-block > for loop. 
I would probably initialize it to true so I could drop the > stubby else block: > > + /* If we do get a cleanup lock, we will definitely prune */ > + else > + do_prune = true; > > And then I'd probably write the call as if (!lazy_scan_noprune()) > doprune = true. > > If I wanted to stick with not initializing do_prune, I'd put the > comment inside as /* We got the cleanup lock, so we will definitely > prune */ and add braces since that makes it a two-line block. If we don't unconditionally set do_prune using the result of lazy_scan_noprune(), then we cannot leave do_prune uninitialized. I preferred having it uninitialized, as it didn't imply that doing pruning was the default. Also, it made it simpler to have that comment about always pruning when we get the cleanup lock. However, with the changes you mentioned below (got_cleanup_lock), this discussion is moot. > > I did not guard against calling lazy_scan_new_or_empty() a second time > > in the case that lazy_scan_noprune() was called. I can do this. I > > mentioned upthread I found it confusing for lazy_scan_new_or_empty() > > to be guarded by if (do_prune). The overhead of calling it wouldn't be > > terribly high. I can change that based on your opinion of what is > > better. > > To me, the relevant question here isn't reader confusion, because that > can be cleared up with a comment explaining why we do or do not test > do_prune. Indeed, I'd say it's a textbook example of when you should > comment a test: when it might look wrong to the reader but you have a > good reason for doing it. > > But that brings me to what I think the real question is here: do we, > uh, have a good reason for doing it? At first blush the structure > looks a bit odd here. lazy_scan_new_or_empty() is intended to handle > PageIsNew() and PageIsEmpty() cases, lazy_scan_noprune() the cases > where we process the page without a cleanup lock, and > lazy_scan_prune() the regular case. 
So you might think that > lazy_scan_new_or_empty() would always be applied *first*, that we > would then conditionally apply lazy_scan_noprune(), and finally > conditionally apply lazy_scan_prune(). Like this: > > bool got_cleanup_lock = ConditionalLockBufferForCleanup(buf); > if (lazy_scan_new_or_empty()) > continue; > if (!got_cleanup_lock && !lazy_scan_noprune()) > { > LockBuffer(buf, BUFFER_LOCK_UNLOCK); > LockBufferForCleanup(buf); > got_cleanup_lock = true; > } > if (got_cleanup_lock) > lazy_scan_prune(); > > The current organization of the code seems to imply that we don't need > to worry about the PageIsNew() and PageIsEmpty() cases before calling > lazy_scan_noprune(), and at the moment I'm not understanding why that > should be the case. I wonder if this is a bug, or if I'm just > confused. Yes, I also spent some time thinking about this. In master, we do always call lazy_scan_new_or_empty() before calling lazy_scan_noprune(). The code is aiming to ensure we call lazy_scan_new_or_empty() once before calling either of lazy_scan_noprune() or lazy_scan_prune(). I think it didn't call lazy_scan_new_or_empty() unconditionally first because of the different lock types expected. But, your structure has solved that. I've used a version of your example code above in attached v9. It is much nicer. - Melanie
Re: cleanup patches for incremental backup
On Wed, Jan 17, 2024 at 1:42 PM Matthias van de Meent wrote: > Sure, added in attached. I think this mostly looks good now but I just realized that I think this needs rephrasing: + To restore incremental backups the tool + is used, which combines incremental backups with a base backup and + WAL to restore a + database cluster to + a consistent state. The way this is worded, at least to me, it makes it sound like pg_combinebackup is going to do the WAL recovery for you, which it isn't. Maybe: To restore incremental backups the tool is used, which combines incremental backups with a base backup. Afterwards, recovery can use WAL to bring the database cluster to a consistent state. -- Robert Haas EDB: http://www.enterprisedb.com
Re: initdb's -c option behaves wrong way?
On Wed, Jan 17, 2024 at 2:31 PM Daniel Gustafsson wrote: > Agreed, I think the patch as it stands now where it replaces case insensitive, > while keeping the original casing, is the best path forward. The issue exist > in 16 as well so I propose a backpatch to there. I like that approach, too. I could go either way on back-patching. It doesn't seem important, but it probably won't break anything, either. -- Robert Haas EDB: http://www.enterprisedb.com
Re: psql JSON output format
On Wed, Jan 17, 2024 at 4:30 AM Laurenz Albe wrote: > As mentioned in my other mail, I was talking about the psql output > format "csv" rather than about COPY. Oh. Well, I think it's sad that the psql format csv has that property. Why doesn't it adopt COPY's handling? > I agree that it is desirable to lose as little information as possible. > But if we want to format query output as JSON, we have a couple of > requirements that cannot all be satisfied: > > 1. lose no information ("round-trip safe") > > 2. don't double quote numbers, booleans and other JSON values > > 3. don't skip any table column in the output > > Christoph's original patch didn't satisfy #2, and his current version > doesn't satisfy #1. Do you think that skipping NULL columns would be > the best solution? We don't do that in the to_json() function, which > also renders SQL NULL as JSON null. Let me start by clarifying that I'm OK with sacrificing round-trippability here as long as we do it thoughtfully. "Round-trippability is important but X is more important and we cannot have both for Y reasons" seems like a potentially fine argument to me; I'm only objecting to an argument of the form "round-trippability doesn't even matter." My previous comment was a bit of a drive-by remark on that specifically rather than a strong opinion about what exactly we ought to do here. I guess the specific issue here is around a json(b) column that is null at the SQL level vs one that contains a JSON null. How do we distinguish those cases? I think entirely omitting null columns could be a way forward, but I don't know if that would cause other problems for users. I'm not quite sure that addresses all the issues, though. For instance, consider that 1.00::numeric and 1.0::numeric are equal but distinguishable. If those get rendered into the JSON unquoted as 1.00 and 1.0, respectively, is that going to round-trip properly? What about float8 values where extra_float_digits=3 is needed to properly round trip? 
If we take PostgreSQL's array data types and turn them into JSON arrays, what happens with non-default bounds? I know how we're going to turn '{1,2}'::int[] into a JSON array, or at least I assume I do, but what in the world are we going to do about '[-3:-2]={1,2}'? As much as I think round-trippability is good, getting it to 100% here is probably a good bit of work. And maybe that work isn't worth doing or involves too much collateral damage. But I do think it has positive value. If we produce output that could be ingested back into PG later with the right tool, that leaves the door open for someone to build the tool later even if we don't have it today. If we produce output that loses information, no tool built later can make up for the loss. -- Robert Haas EDB: http://www.enterprisedb.com
Re: More new SQL/JSON item methods
On 17.01.24 10:03, Jeevan Chalke wrote: I added unary '+' and '-' support as well and thus thought of having separate rules altogether rather than folding those in. Per SQL standard, the precision and scale arguments are unsigned integers, so unary plus and minus signs are not supported. So my patch removes that support, but I didn't adjust the regression tests for that. However, PostgreSQL numeric casting does support a negative scale. Here is an example:

# select '12345'::numeric(4,-2);
 numeric
---------
   12300
(1 row)

And thus thought of supporting those. Do we want this JSON item method to behave differently here? Ok, it would make sense to support this in SQL/JSON as well. I will merge them all into one and will try to keep them in the order specified in sql_features.txt. However, for documentation, it makes more sense to keep them in logical order rather than alphabetical order. What are your views on this? The documentation can be in a different order.
Re: initdb's -c option behaves wrong way?
> On 17 Jan 2024, at 18:05, Tom Lane wrote:
>
> Alvaro Herrera writes:
>> Hmm, how about raising an error if multiple options are given targeting
>> the same GUC?
>
> I don't see any reason to do that. The underlying configuration
> files don't complain about duplicate entries, they just take the
> last setting.

Agreed. I think the patch as it stands now, where it replaces case-insensitively while keeping the original casing, is the best path forward. The issue exists in 16 as well, so I propose backpatching to there.

-- Daniel Gustafsson
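[Editorial illustration; a hypothetical standalone sketch, not initdb's actual code.] The behavior agreed on here — match GUC names case-insensitively, let the last setting win, but preserve the first-seen spelling of the name — can be sketched like this:

```c
#include <assert.h>
#include <string.h>
#include <strings.h>			/* strcasecmp() (POSIX) */

struct guc_opt
{
	const char *name;
	const char *value;
};

/* Record one -c option.  If the GUC was already seen under any casing,
 * overwrite its value (the last setting wins) but keep the spelling of
 * the name from the first occurrence; otherwise append a new entry. */
static void
set_guc(struct guc_opt *opts, int *n, const char *name, const char *value)
{
	for (int i = 0; i < *n; i++)
	{
		if (strcasecmp(opts[i].name, name) == 0)
		{
			opts[i].value = value;
			return;
		}
	}
	opts[*n].name = name;
	opts[*n].value = value;
	(*n)++;
}
```

With Kyotaro's example, `-cwork_mem=100 -cWORK_MEM=1000 -cWork_mem=2000` would leave a single entry spelled "work_mem" with value "2000".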
Re: cleanup patches for incremental backup
On Tue, 16 Jan 2024 at 21:49, Robert Haas wrote: > > On Tue, Jan 16, 2024 at 3:22 PM Matthias van de Meent > wrote: > + A special base > backup > + that for some WAL-logged relations only contains the pages that were > + modified since a previous backup, as opposed to the full relation data > of > + normal base backups. Like base backups, it is generated by the tool > + . > > Could we say "that for some files may contain only those pages that > were modified since a previous backup, as opposed to the full contents > of every file"? Sure, added in attached. > + To restore incremental backups the tool > + is used, which combines the incremental backups with a base backup and > + [...] > I wondered if this needed to be clearer that the chain of backups > could have length > 2. But on further reflection, I think it's fine, > unless you feel otherwise. I removed "the" from the phrasing "the incremental backups", which makes it a bit less restricted. > The rest LGTM. In the latest patch I also fixed the casing of "Incremental Backup" to "... backup", to be in line with most other multi-word items. Thanks! Kind regards, Matthias van de Meent Neon (https://neon.tech) v3-0001-incremental-backups-Add-new-items-to-glossary-mon.patch Description: Binary data
Re: New Table Access Methods for Multi and Single Inserts
On Tue, Aug 1, 2023 at 10:32 PM Jacob Champion wrote: > > On Tue, Aug 1, 2023 at 9:31 AM Bharath Rupireddy > wrote: > > Thanks. Finally, I started to spend time on this. Just curious - may > > I know the discussion in/for which this patch is referenced? What was > > the motive? Is it captured somewhere? > > It may not have been the only place, but we at least touched on it > during the unconference: > > > https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference#Table_AMs > > We discussed two related-but-separate ideas: > 1) bulk/batch operations and > 2) maintenance of TAM state across multiple related operations. Thank you. I'm attaching the v8 patch set here, which includes use of the new insert TAMs for COPY FROM. With this, postgres will not only have the new TAM for inserts, but can also make the following commands faster - CREATE TABLE AS, SELECT INTO, CREATE MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW and COPY FROM. I'll perform some testing in the coming days and post the results here; until then, I appreciate any feedback on the patches. I've also added this proposal to CF - https://commitfest.postgresql.org/47/4777/.
-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From cbdf2935be360017c0d62479e879630d4fec8766 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 17 Jan 2024 16:44:19 + Subject: [PATCH v8] New TAMs for inserts --- src/backend/access/heap/heapam.c | 224 +++ src/backend/access/heap/heapam_handler.c | 9 + src/include/access/heapam.h | 49 + src/include/access/tableam.h | 143 +++ 4 files changed, 425 insertions(+) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 707460a536..7df305380e 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -68,6 +68,7 @@ #include "utils/datum.h" #include "utils/inval.h" #include "utils/lsyscache.h" +#include "utils/memutils.h" #include "utils/relcache.h" #include "utils/snapmgr.h" #include "utils/spccache.h" @@ -2446,6 +2447,229 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples, pgstat_count_heap_insert(relation, ntuples); } +/* + * Initialize state required for an insert a single tuple or multiple tuples + * into a heap. 
+ */ +TableInsertState * +heap_insert_begin(Relation rel, CommandId cid, int am_flags, int insert_flags) +{ + TableInsertState *tistate; + + tistate = palloc0(sizeof(TableInsertState)); + tistate->rel = rel; + tistate->cid = cid; + tistate->am_flags = am_flags; + tistate->insert_flags = insert_flags; + + if ((am_flags & TABLEAM_MULTI_INSERTS) != 0 || + (am_flags & TABLEAM_BULKWRITE_BUFFER_ACCESS_STRATEGY)) + tistate->am_data = palloc0(sizeof(HeapInsertState)); + + if ((am_flags & TABLEAM_MULTI_INSERTS) != 0) + { + HeapMultiInsertState *mistate; + + mistate = palloc0(sizeof(HeapMultiInsertState)); + mistate->slots = palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS); + + mistate->context = AllocSetContextCreate(CurrentMemoryContext, + "heap_multi_insert_v2 memory context", + ALLOCSET_DEFAULT_SIZES); + + ((HeapInsertState *) tistate->am_data)->mistate = mistate; + } + + if ((am_flags & TABLEAM_BULKWRITE_BUFFER_ACCESS_STRATEGY) != 0) + ((HeapInsertState *) tistate->am_data)->bistate = GetBulkInsertState(); + + return tistate; +} + +/* + * Insert a single tuple into a heap. + */ +void +heap_insert_v2(TableInsertState * state, TupleTableSlot *slot) +{ + bool shouldFree = true; + HeapTuple tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree); + BulkInsertState bistate = NULL; + + Assert(state->am_data != NULL && + ((HeapInsertState *) state->am_data)->mistate == NULL); + + /* Update tuple with table oid */ + slot->tts_tableOid = RelationGetRelid(state->rel); + tuple->t_tableOid = slot->tts_tableOid; + + if (state->am_data != NULL && + ((HeapInsertState *) state->am_data)->bistate != NULL) + bistate = ((HeapInsertState *) state->am_data)->bistate; + + /* Perform insertion, and copy the resulting ItemPointer */ + heap_insert(state->rel, tuple, state->cid, state->insert_flags, +bistate); + ItemPointerCopy(&tuple->t_self, &slot->tts_tid); + + if (shouldFree) + pfree(tuple); +} + +/* + * Create/return next free slot from multi-insert buffered slots array.
+ */ +TupleTableSlot * +heap_multi_insert_next_free_slot(TableInsertState * state) +{ + TupleTableSlot *slot; + HeapMultiInsertState *mistate; + + Assert(state->am_data != NULL && + ((HeapInsertState *) state->am_data)->mistate != NULL); + + mistate = ((HeapInsertState *) state->am_data)->mistate; + slot = mistate->slots[mistate->cur_slots]; + + if (slot == NULL) + { + slot = table_slot_create(state->rel, NULL); + mistate->slots[mistate->cur_slots] = slot; + } + else + ExecClearTuple(slot); + + return slot; +} + +/* + * Store passed-in tuple into in-memory buffered slots.
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On Tue, Jan 16, 2024 at 6:07 PM Melanie Plageman wrote: > Attached v8 patch set is rebased over this. Reviewing 0001, consider the case where a table has no indexes. Pre-patch, PageTruncateLinePointerArray() will happen when lazy_vacuum_heap_page() is called; post-patch, it will not occur. That's a loss. Should we care? On the plus side, visibility map repair, if required, can now take place. That's a gain. I'm otherwise satisfied with this patch now, except for some extremely minor nitpicking: + * For now, pass mark_unused_now == false regardless of whether or Personally, I would write "pass mark_unused_now as false" here, because we're not testing equality. Or else "pass mark_unused_now = false". This is not an equality test. + * During pruning, the caller may have passed mark_unused_now == true, Again here, but also, this is referring to the name of a parameter to a function whose name is not given. I think this should either talk fully in terms of code ("When heap_page_prune was called, mark_unused_now may have been passed as true, which allows would-be LP_DEAD items to be made LP_USED instead.") or fully in English ("During pruning, we may have decided to mark would-be dead items as unused."). > In 0004, I've taken the approach you seem to favor and combined the FSM > updates from the prune and no prune cases in lazy_scan_heap() instead > of pushing the FSM updates into lazy_scan_prune() and > lazy_scan_noprune(). I do like that approach. I think do_prune can be declared one scope inward, in the per-block for loop. I would probably initialize it to true so I could drop the stubby else block: + /* If we do get a cleanup lock, we will definitely prune */ + else + do_prune = true; And then I'd probably write the call as if (!lazy_scan_noprune()) do_prune = true. If I wanted to stick with not initializing do_prune, I'd put the comment inside as /* We got the cleanup lock, so we will definitely prune */ and add braces since that makes it a two-line block.
> I did not guard against calling lazy_scan_new_or_empty() a second time > in the case that lazy_scan_noprune() was called. I can do this. I > mentioned upthread I found it confusing for lazy_scan_new_or_empty() > to be guarded by if (do_prune). The overhead of calling it wouldn't be > terribly high. I can change that based on your opinion of what is > better. To me, the relevant question here isn't reader confusion, because that can be cleared up with a comment explaining why we do or do not test do_prune. Indeed, I'd say it's a textbook example of when you should comment a test: when it might look wrong to the reader but you have a good reason for doing it. But that brings me to what I think the real question is here: do we, uh, have a good reason for doing it? At first blush the structure looks a bit odd here. lazy_scan_new_or_empty() is intended to handle PageIsNew() and PageIsEmpty() cases, lazy_scan_noprune() the cases where we process the page without a cleanup lock, and lazy_scan_prune() the regular case. So you might think that lazy_scan_new_or_empty() would always be applied *first*, that we would then conditionally apply lazy_scan_noprune(), and finally conditionally apply lazy_scan_prune(). Like this:

bool got_cleanup_lock = ConditionalLockBufferForCleanup(buf);
if (lazy_scan_new_or_empty())
    continue;
if (!got_cleanup_lock && !lazy_scan_noprune())
{
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    LockBufferForCleanup(buf);
    got_cleanup_lock = true;
}
if (got_cleanup_lock)
    lazy_scan_prune();

The current organization of the code seems to imply that we don't need to worry about the PageIsNew() and PageIsEmpty() cases before calling lazy_scan_noprune(), and at the moment I'm not understanding why that should be the case. I wonder if this is a bug, or if I'm just confused. > The big open question/functional change is when we consider vacuuming > the FSM. Previously, we only considered vacuuming the FSM in the no > indexes, has dead items case.
After combining the FSM updates from > lazy_scan_prune()'s no indexes/has lpdead items case, > lazy_scan_prune()'s regular case, and lazy_scan_noprune(), all of them > consider vacuuming the FSM. I could guard against this, but I wasn't > sure why we would want to only vacuum the FSM in the no indexes/has > dead items case. I don't get it. Conceptually, I agree that we don't want to be inconsistent here without some good reason. One of the big advantages of unifying different code paths is that you avoid being accidentally inconsistent. If different things are different it shows up as a test in the code instead of just having different code paths in different places that may or may not match. But I thought the whole point of 45d395cd75ffc5b4c824467140127a5d11696d4c was to iron out the existing inconsistencies so that we could unify this code without having to change any more behavior. In particular, I
Re: initdb's -c option behaves wrong way?
Alvaro Herrera writes: > Hmm, how about raising an error if multiple options are given targeting > the same GUC? I don't see any reason to do that. The underlying configuration files don't complain about duplicate entries, they just take the last setting. regards, tom lane
Re: More new SQL/JSON item methods
On 2024-01-17 We 04:03, Jeevan Chalke wrote: On Mon, Jan 15, 2024 at 7:41 PM Peter Eisentraut wrote: Overall, I think it would be better if you combined all three of these patches into one. Right now, you have arranged these as incremental features, and as a result of that, the additions to the JsonPathItemType enum and the grammar keywords etc. are ordered in the way you worked on these features, I guess. It would be good to maintain a bit of sanity to put all of this together and order all the enums and everything else for example in the order they are in the sql_features.txt file (which is alphabetical, I suppose). At this point I suspect we'll end up committing this whole feature set together anyway, so we might as well organize it that way. OK. I will merge them all into one and will try to keep them in the order specified in sql_features.txt. However, for documentation, it makes more sense to keep them in logical order than the alphabetical one. What are your views on this? I agree that we should order the documentation logically. Users don't care how we organize the code etc, but they do care about docs have sensible structure. cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
Re: Build the docs if there are changes in docs and don't run other tasks if the changes are only in docs
On 2023-10-06 Fr 10:07, Tom Lane wrote: Peter Eisentraut writes: I don't have a good sense of what you are trying to optimize for. If it's the mainline build-on-every-commit type, then I wonder how many commits would really be affected by this. Like, how many commits touch only a README file. If it's for things like the cfbot, then I think the time-triggered builds would be more frequent than new patch versions, so I don't know if these kinds of optimizations would affect anything. As a quick cross-check, I searched our commit log to see how many README-only commits there were so far this year. I found 11 since January. (Several were triggered by the latest round of pgindent code and process changes, so maybe this is more than typical.) Not sure what that tells us about the value of changing the CI logic, but it does seem like it could be worth the one-liner change needed to teach buildfarm animals to ignore READMEs. - trigger_exclude => qr[^doc/|\.po$], + trigger_exclude => qr[^doc/|/README$|\.po$], I've put that in the sample config file for the next release. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: psql JSON output format
On 2024-01-17 We 03:52, Laurenz Albe wrote: On Tue, 2024-01-16 at 11:49 -0500, Andrew Dunstan wrote: On 2024-01-16 Tu 11:07, Laurenz Albe wrote: On Tue, 2024-01-09 at 16:51 +, Dean Rasheed wrote: On Tue, 9 Jan 2024 at 14:35, Christoph Berg wrote: Getting it print numeric/boolean without quotes was actually easy, as well as json(b). Implemented as the attached v2 patch. But: not quoting json means that NULL and 'null'::json will both be rendered as 'null'. That strikes me as a pretty undesirable conflict. Does the COPY patch also do that? Yes. Perhaps what needs to happen is for a NULL column to be omitted entirely from the output. I think the COPY TO json patch would have to do that if COPY FROM json were to be added later, to make it round-trip safe. I think the behavior is fine as it is. I'd expect both NULL and JSON "null" to be rendered as "null". I think the main use case for a feature like this is people who need the result in JSON for further processing somewhere else. "Round-trip safety" is not so important. If you want to move data from PostgreSQL to PostgreSQL, you use the plain or the binary format. The CSV format by default renders NULL and empty strings identical, and I don't think anybody objects to that. This is absolutely not true. CSV format with default settings is and has been from the beginning designed to be round trippable. Sorry for being unclear. I wasn't talking about COPY, but about the psql output format:

CREATE TABLE xy (a integer, b text);
INSERT INTO xy VALUES (1, 'one'), (2, NULL), (3, '');
\pset format csv
Output format is csv.
TABLE xy;
a,b
1,one
2,
3,

I think the reason nobody's complained about it is quite possibly that very few people have used it. That's certainly the case with me - if I'd noticed it I would have complained. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
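[Editorial illustration; a simplified sketch of the convention, with escaping of ordinary values elided.] The reason COPY's CSV output is round-trippable while the psql output above is not comes down to one encoding choice: a round-trip-safe encoder writes NULL as a completely empty field and the empty string as a quoted "", so the two stay distinguishable:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode one CSV field so that NULL and '' stay distinguishable:
 * NULL -> empty field, '' -> "", anything else verbatim (escaping of
 * embedded quotes/commas is elided in this sketch). */
static void
csv_field(const char *val, char *out, size_t outlen)
{
	if (val == NULL)
		out[0] = '\0';					/* NULL: nothing at all */
	else if (val[0] == '\0')
		snprintf(out, outlen, "\"\"");	/* empty string: quoted */
	else
		snprintf(out, outlen, "%s", val);
}
```

An encoder that emits an empty field for both cases, as the psql csv format above does, cannot be decoded losslessly.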
[pg16]Question about building snapshot in logical decoding
If txn1 begins after SNAPBUILD_BUILDING_SNAPSHOT and commits before SNAPBUILD_FULL_SNAPSHOT (so txn1 will not be processed by DecodeCommit), and txn2 begins after SNAPBUILD_FULL_SNAPSHOT and commits after SNAPBUILD_CONSISTENT (so txn2 will be replayed), how is it ensured that txn2 can see the changes made by txn1? Thanks
Re: initdb's -c option behaves wrong way?
On 2024-Jan-16, Daniel Gustafsson wrote: > > On 28 Sep 2023, at 09:49, Kyotaro Horiguchi wrote: > > > I noticed that -c option of initdb behaves in an unexpected > > manner. Identical variable names with variations in letter casing are > > treated as distinct variables. > > > > $ initdb -cwork_mem=100 -cWORK_MEM=1000 -cWork_mem=2000 > > > The original intention was apparently to overwrite the existing > > line. Furthermore, I surmise that preserving the original letter > > casing is preferable. > > Circling back to an old thread, I agree that this seems odd and the original > thread [0] makes no mention of it being intentional. Hmm, how about raising an error if multiple options are given targeting the same GUC? -- Álvaro Herrera — Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: Add pgindent test to check if codebase is correctly formatted
On Wed Jan 17, 2024 at 3:50 AM CST, Alvaro Herrera wrote: Hmm, should this also install typedefs.list and pgindent.man? What about the tooling to reformat Perl code? Good point about pgindent.man. It would definitely be good to install alongside pgindent and pg_bsd_indent. I don't know if we need to install the typedefs.list file. I think it would just be good enough to also install the find_typedefs script. But it needs some fixing up first[0]. Extension authors can then just generate their own typedefs.list that will include the typedefs of the extension and the typedefs of the postgres types they use. At least, that is what we have found works at Neon. I cannot vouch for extension authors writing Perl but I think it could make sense to install the src/test/perl tree, so extension authors could more easily write tests for their extensions in Perl. But we could install the perltidy file and whatever else too. I keep my Perl writing to a minimum, so I am not the best person to vouch for these use cases. [0]: https://www.postgresql.org/message-id/aaa59ef5-dce8-7369-5cae-487727664127%40dunslane.net -- Tristan Partin Neon (https://neon.tech)
Re: Fix a possible socket leak at Windows (src/backend/port/win32/socket.c)
On Wed, Jan 17, 2024 at 09:54, Daniel Gustafsson wrote: > > On 17 Jan 2024, at 07:26, Michael Paquier wrote: > > On Tue, Jan 16, 2024 at 05:25:39PM -0300, Ranier Vilela wrote: > > >> Do you have plans or should I register for a commitfest? > > > > Daniel has stated that he would take care of it, so why not letting > > him a few days? I don't think that a CF entry is necessary. > > It isn't, I've now committed it backpatched down to 12. > Thanks for the commit, Daniel. Best regards, Ranier Vilela
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Wed, 17 Jan 2024 at 14:39, Robert Haas wrote: > I think it's hard to say for sure what API is going to work well here, > because we just don't have much experience with this. Agreed, but I strongly believe PQunsupportedProtocolExtensions() is useful regardless of the API choice. > I also think that the reason why > the API you're proposing here looks good in this case is because libpq > itself doesn't really need to do anything differently for these > parameters. It doesn't actually really change anything about the > protocol; it only nails down the server behavior in a way that can't > be changed. Another current request is to have a way to have certain > data types always be sent in binary format, specified by OID. Do we > want that to be written as PQsetParameter("always_binary_format", > "123,456,789,101112") or do we want it to maybe look more like > PQalwaysBinaryFormat(int count, Oid *stuff)? Or, another example, say > we want to set client_to_server_compression_method=lz4. I think from libpq's perspective there are two categories of protocol extension parameters: 1. parameters that change behaviour in a way that does not matter to libpq 2. parameters that change in such a way that libpq needs to change its behaviour too (by parsing or sending messages differently than it normally would). _pq_.protocol_roles, _pq_.report_parameters, and (afaict) even _pq_.always_binary_format would fall into category 1. But _pq_.client_to_server_compression_method would fall into category 2, because libpq should start to compress the messages that it is sending. I think if you look at it like that, then using PQsetParameter for parameters in category 1 makes sense. And indeed you'd likely want a dedicated API for each parameter in category 2, and probably have PQsetParameter error for these parameters. In any case it seems like something that can be decided on a case by case basis. 
However, to make this future proof, I think it might be best if PQsetParameter would error for protocol extension parameters that it does not know about. > Also, I never intended for _pq_ to become a user-visible namespace > that people would have to care about I agree that forcing Postgres users to learn about this prefix is probably unwanted. But requiring client authors to learn about the prefix seems acceptable to me.
Re: Add system identifier to backup manifest
On Wed, Jan 17, 2024 at 6:31 AM Amul Sul wrote: > With the attached patch, the backup manifest will have a new key item as > "System-Identifier" 64-bit integer whose value is derived from pg_control > while > generating it, and the manifest version bumps to 2. > > This helps to identify the correct database server and/or backup for the > subsequent backup operations. pg_verifybackup validates the manifest system > identifier against the backup control file and fails if they don’t match. > Similarly, pg_basebackup increment backup will fail if the manifest system > identifier does not match with the server system identifier. The > pg_combinebackup is already a bit smarter -- checks the system identifier from > the pg_control of all the backups, with this patch the manifest system > identifier also validated. Thanks for working on this. Without this, I think what happens is that you can potentially take an incremental backup from the "wrong" server, if the states of the systems are such that all of the other sanity checks pass. When you run pg_combinebackup, it'll discover the problem and tell you, but you ideally want to discover such errors at backup time rather than at restore time. This addresses that. And, overall, I think it's a pretty good patch. But I nonetheless have a bunch of comments. - The associated value is always the integer 1. + The associated value is the integer, either 1 or 2. is an integer. Beginning in PostgreSQL 17, it is 2; in older versions, it is 1. + context.identity_cb = manifest_process_identity; I'm not really on board with calling the system identifier "identity" throughout the patch. I think it should just say system_identifier. If we were going to abbreviate, I'd prefer something like "sysident" that looks like it's derived from "system identifier" rather than "identity" which is a different word altogether. But I don't think we should abbreviate unless doing so creates *ridiculously* long identifier names. 
+static void +manifest_process_identity(JsonManifestParseContext *context, + int manifest_version, + uint64 manifest_system_identifier) +{ + uint64 system_identifier; + + /* Manifest system identifier available in version 2 or later */ + if (manifest_version == 1) + return; I think you've got the wrong idea here. I think this function would only get called if System-Identifier is present in the manifest, so if it's a v1 manifest, this would never get called, so this if-statement would not ever do anything useful. I think what you should do is (1) if the client supplies a v1 manifest, reject it, because surely that's from an older server version that doesn't support incremental backup; but do that when the version is parsed rather than here; and (2) also detect and reject the case when it's supposedly a v2 manifest but this is absent. (1) should really be done when the version number is parsed, so I suspect you may need to add manifest_version_cb. +static void +combinebackup_identity_cb(JsonManifestParseContext *context, + int manifest_version, + uint64 manifest_system_identifier) +{ + parser_context *private_context = context->private_data; + uint64 system_identifier = private_context->system_identifier; + + /* Manifest system identifier available in version 2 or later */ + if (manifest_version == 1) + return; Very similar to the above case. Just reject a version 1 manifest as soon as we see the version number. In this function, set a flag indicating we saw the system identifier; if at the end of parsing that flag is not set, kaboom. - parse_manifest_file(manifest_path, , _wal_range); + parse_manifest_file(manifest_path, , _wal_range, + context.backup_directory); Don't do this! parse_manifest_file() should just record everything found in the manifest in the context object. Syntax validation should happen while parsing the manifest (e.g. "CAT/DOG" is not a valid LSN and we should reject that at this stage) but semantic validation should happen later (e.g. 
"0/0" can't be the correct backup end LSN but we don't figure that out while parsing, but rather later). I think you should actually move validation of the system identifier to the point where the directory walk encounters the control file (and update the docs and tests to match that decision). Imagine if you wanted to validate a tar-format backup; then you wouldn't have random access to the directory. You'd see the manifest file first, and then all the files in a random order, with one chance to look at each one. (This is, in fact, a feature I think we should implement.) - if (strcmp(token, "1") != 0) + parse->manifest_version = atoi(token); + if (parse->manifest_version != 1 && parse->manifest_version != 2) json_manifest_parse_failure(parse->context, "unexpected manifest version"); Please either (a) don't do a string-to-integer conversion and just strcmp() twice or (b) use strtol so that you can check that it succeeded. I don't want to accept manifest version 1a as 1. +/* + *
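[Editorial illustration; a hypothetical helper, not the patch's actual code.] The strict parsing Robert asks for can be sketched with strtol(), which — unlike atoi() — lets the caller detect trailing garbage such as "1a", empty input, and overflow:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a manifest version token strictly.  Returns 1 on success and
 * stores the value; returns 0 if the token is empty, has trailing
 * characters (e.g. "1a"), or is out of range. */
static int
parse_manifest_version_strict(const char *token, int *version)
{
	char	   *end;
	long		v;

	errno = 0;
	v = strtol(token, &end, 10);
	if (errno != 0 || end == token || *end != '\0' || v < 1 || v > INT_MAX)
		return 0;
	*version = (int) v;
	return 1;
}
```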
Re: Change GUC hashtable to use simplehash?
On 17/01/2024 09:15, John Naylor wrote: /* * hashfn_unstable.h * * Building blocks for creating fast inlineable hash functions. The * unstable designation is in contrast to hashfn.h, which cannot break * compatibility because hashes can be written to disk and so must produce * the same hashes between versions. * * The functions in this file are not guaranteed to be stable between * versions, and may differ by hardware platform. These paragraphs sound a bit awkward. It kind of buries the lede, the "these functions are not guaranteed to be stable" part, to the bottom. Maybe something like: " Building blocks for creating fast inlineable hash functions. The functions in this file are not guaranteed to be stable between versions, and may differ by hardware platform. Hence they must not be used in indexes or other on-disk structures. See hashfn.h if you need stability. " typo: licencse Other than that, LGTM. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Build the docs if there are changes in docs and don't run other tasks if the changes are only in docs
> On 14 Dec 2023, at 14:40, Nazir Bilal Yavuz wrote: > On Fri, 6 Oct 2023 at 17:07, Tom Lane wrote: >> Not sure what that tells us about the value of changing the CI >> logic, but it does seem like it could be worth the one-liner >> change needed to teach buildfarm animals to ignore READMEs. > > I agree that it could be worth implementing this logic on buildfarm animals. > > In case we want to implement the same logic on the CI, I added a new > version of the patch; it skips CI completely if the changes are only > in the README files. I think it makes sense to avoid wasting CI cycles on commits only changing README files since we know it will be a no-op. A README documentation patch going through N revisions will incur at least N full CI runs which are resources we can spend on other things. So +1 for doing this both in CI and in the buildfarm. -- Daniel Gustafsson
Re: [PATCH] Compression dictionaries for JSONB
Hi Aleksander, there was a quite straightforward answer regarding Pluggable TOAST in another thread - the Pluggable TOAST feature is not desired by the community, and advanced TOAST mechanics would be accepted as part of the extended functionality of problematic datatypes, on a par with in and out functions. So what I am actually doing now is re-writing the JSONb TOAST improvements to be callable as separate functions without the Pluggable TOAST API. This takes some time because there is a large and complex code base left by Nikita Glukhov, who has lost interest in this work for various reasons. I, personally, think that these two features could benefit from each other, but they could be adapted to each other after I introduce the JSONb Toaster in the v17 master branch. If you don't mind, please check the thread on extending the TOAST pointer - it is important for improving TOAST mechanics. On Wed, Jan 17, 2024 at 5:27 PM Aleksander Alekseev < aleksan...@timescale.com> wrote: > Hi again, > > > Yes it does for a while now. Until we reach any agreement regarding > > questions (1) and (2) personally I don't see the point in submitting > > rebased patches. We can continue the discussion but mark the CF entry > > as RwF for now it will be helpful. > > Sorry, what I actually meant were the following questions: > > """ > Several things occured to me: > > - Does anyone believe that va_tag should be part of the utf8-like > bitmask in order to save a byte or two? > > - The described approach means that compression dictionaries are not > going to be used when data is compressed in-place (i.e. within a > tuple), since no TOAST pointer is involved in this case. Also we will > be unable to add additional compression algorithms here. Does anyone > have problems with this? Should we use the reserved compression > algorithm id instead as a marker of an extended TOAST? > > - It would be nice to decompose the feature in several independent > patches, e.g.
modify TOAST first, then add compression dictionaries > without automatic update of the dictionaries, then add the automatic > update. I find it difficult to imagine however how to modify TOAST > pointers and test the code properly without a dependency on a larger > feature. Could anyone think of a trivial test case for extendable > TOAST? Maybe something we could add to src/test/modules similarly to > how we test SLRU, background workers, etc. > """ > > Since there was not much activity since then (for 3 months) I don't > really see how to process further. > > -- > Best regards, > Aleksander Alekseev > -- Regards, Nikita Malakhov Postgres Professional The Russian Postgres Company https://postgrespro.ru/
Re: MERGE ... RETURNING
On Sat, Nov 18, 2023 at 8:55 PM Dean Rasheed wrote:
> >
> > The v13 patch still applies on top of this, so I won't re-post it.
>

Hi. Minor issues based on v13.

+
+MERGING ( property )
+
+ The following are valid property values specifying what to return:
+
+
+
+ ACTION
+
+
+ The merge action command executed for the current row
+ ('INSERT', 'UPDATE', or
+ 'DELETE').
+
+
+

do we change to property? Maybe the main para should be two sentences, like: "The merge action command executed for the current row. Possible values are: 'INSERT', 'UPDATE', 'DELETE'."

 static Node *
+transformMergingFunc(ParseState *pstate, MergingFunc *f)
+{
+	/*
+	 * Check that we're in the RETURNING list of a MERGE command.
+	 */
+	if (pstate->p_expr_kind != EXPR_KIND_MERGE_RETURNING)
+	{
+		ParseState *parent_pstate = pstate->parentParseState;
+
+		while (parent_pstate &&
+			   parent_pstate->p_expr_kind != EXPR_KIND_MERGE_RETURNING)
+			parent_pstate = parent_pstate->parentParseState;
+
+		if (!parent_pstate ||
+			parent_pstate->p_expr_kind != EXPR_KIND_MERGE_RETURNING)
+			ereport(ERROR,
+					errcode(ERRCODE_WRONG_OBJECT_TYPE),
+					errmsg("MERGING() can only be used in the RETURNING list of a MERGE command"),
+					parser_errposition(pstate, f->location));
+	}
+
+	return (Node *) f;
+}
+

The object is correct, but not in the right place. Maybe we should change errcode(ERRCODE_WRONG_OBJECT_TYPE) to errcode(ERRCODE_INVALID_OBJECT_DEFINITION).

Also, do we need to add some comments explaining why we return it as is when it's EXPR_KIND_MERGE_RETURNING? (My understanding is that if the keywords don't match, it will fail at gram.y with a syntax error; otherwise MERGING will, based on the keywords, make a MergingFunc node and assign mfop, mftype and location to it.)

in src/backend/executor/functions.c:

	/*
	 * Break from loop if we didn't shut down (implying we got a
	 * lazily-evaluated row). Otherwise we'll press on till the whole
	 * function is done, relying on the tuplestore to keep hold of the
	 * data to eventually be returned. This is necessary since an
	 * INSERT/UPDATE/DELETE RETURNING that sets the result might be
	 * followed by additional rule-inserted commands, and we want to
	 * finish doing all those commands before we return anything.
	 */

Does the above comment need to change to INSERT/UPDATE/DELETE/MERGE?

in src/backend/nodes/nodeFuncs.c:

		case T_UpdateStmt:
			{
				UpdateStmt *stmt = (UpdateStmt *) node;

				if (WALK(stmt->relation))
					return true;
				if (WALK(stmt->targetList))
					return true;
				if (WALK(stmt->whereClause))
					return true;
				if (WALK(stmt->fromClause))
					return true;
				if (WALK(stmt->returningList))
					return true;
				if (WALK(stmt->withClause))
					return true;
			}
			break;
		case T_MergeStmt:
			{
				MergeStmt  *stmt = (MergeStmt *) node;

				if (WALK(stmt->relation))
					return true;
				if (WALK(stmt->sourceRelation))
					return true;
				if (WALK(stmt->joinCondition))
					return true;
				if (WALK(stmt->mergeWhenClauses))
					return true;
				if (WALK(stmt->withClause))
					return true;
			}
			break;

You add "returningList" to MergeStmt. Do you need to do the following, similar to UpdateStmt? Even though it's so abstract, I have no idea what's going on.

`
if (WALK(stmt->returningList))
    return true;
`
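For context, a hypothetical invocation of the proposed syntax could look as follows. This is a sketch based on the v13 patch's documentation fragment above, not a committed PostgreSQL feature; the table and column names are invented:

```sql
-- Hypothetical example of the proposed MERGE ... RETURNING syntax (v13 patch).
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET val = s.val
WHEN NOT MATCHED THEN
    INSERT (id, val) VALUES (s.id, s.val)
RETURNING MERGING(ACTION), t.*;
-- MERGING(ACTION) would return 'UPDATE' for matched rows and
-- 'INSERT' for newly inserted ones.
```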
Re: [PATCH] Compression dictionaries for JSONB
Hi again,

> Yes it does for a while now. Until we reach any agreement regarding
> questions (1) and (2) personally I don't see the point in submitting
> rebased patches. We can continue the discussion, but marking the CF
> entry as RwF for now would be helpful.

Sorry, what I actually meant were the following questions:

"""
Several things occurred to me:

- Does anyone believe that va_tag should be part of the utf8-like
bitmask in order to save a byte or two?

- The described approach means that compression dictionaries are not
going to be used when data is compressed in-place (i.e. within a
tuple), since no TOAST pointer is involved in this case. Also we will
be unable to add additional compression algorithms here. Does anyone
have problems with this? Should we use the reserved compression
algorithm id instead as a marker of an extended TOAST?

- It would be nice to decompose the feature into several independent
patches, e.g. modify TOAST first, then add compression dictionaries
without automatic update of the dictionaries, then add the automatic
update. I find it difficult to imagine, however, how to modify TOAST
pointers and test the code properly without a dependency on a larger
feature. Could anyone think of a trivial test case for extendable
TOAST? Maybe something we could add to src/test/modules similarly to
how we test SLRU, background workers, etc.
"""

Since there has not been much activity since then (for 3 months), I don't
really see how to proceed further.

--
Best regards,
Aleksander Alekseev
Re: [PATCH] Compression dictionaries for JSONB
Hi Shubham,

> > > 8272749e added a few more arguments to CastCreate(). Here is the rebased
> > > patch.
> >
> > After merging afbfc029 [1] the patch needed a rebase. PFA v10.
> >
> > The patch is still in a PoC state and this is exactly why comments and
> > suggestions from the community are most welcome! Particularly I would
> > like to know:
> >
> > 1. Would you call it a wanted feature considering the existence of
> > the Pluggable TOASTer patchset which (besides other things) tries to
> > introduce type-aware TOASTers for EXTERNAL attributes? I know what
> > Simon's [2] and Nikita's latest answers were, and I know my personal
> > opinion on this [3][4], but I would like to hear from the rest of the
> > community.
> >
> > 2. How should we make sure a dictionary will not consume all the
> > available memory? Limiting the amount of dictionary entries to pow(2,
> > 16) and having dictionary versions seems to work OK for ZSON. However
> > it was pointed out that this may be an unwanted limitation for the
> > in-core implementation.
> >
> > [1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c727f511;hp=afbfc02983f86c4d71825efa6befd547fe81a926
> > [2]: https://www.postgresql.org/message-id/CANbhV-HpCF852WcZuU0wyh1jMU4p6XLbV6rCRkZpnpeKQ9OenQ%40mail.gmail.com
> > [3]: https://www.postgresql.org/message-id/CAJ7c6TN-N3%3DPSykmOjmW1EAf9YyyHFDHEznX-5VORsWUvVN-5w%40mail.gmail.com
> > [4]: https://www.postgresql.org/message-id/CAJ7c6TO2XTTk3cu5w6ePHfhYQkoNpw7u1jeqHf%3DGwn%2BoWci8eA%40mail.gmail.com
>
> I tried to apply the patch but it is failing at the Head. It is giving
> the following error:

Yes, it has been failing for a while now. Until we reach any agreement
regarding questions (1) and (2), I personally don't see the point in
submitting rebased patches. We can continue the discussion, but marking
the CF entry as RwF for now would be helpful.

--
Best regards,
Aleksander Alekseev
Re: Add system identifier to backup manifest
On Wed, Jan 17, 2024 at 6:45 AM Alvaro Herrera wrote: > Hmm, okay, but what if I take a full backup from a primary server and > later I want an incremental from a standby, or the other way around? > Will this prevent me from using such a combination? The system identifier had BETTER match in such cases. If it doesn't, somebody's run pg_resetwal on your standby since it was created... and in that case, no incremental backup for you! -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Tue, Jan 16, 2024 at 9:21 PM Jelte Fennema-Nio wrote:
> It's understandable you're worried about breaking clients, but afaict
> **you are actually misunderstanding the situation**. I totally agree
> that we cannot bump the protocol version without also merging 0002
> (that's why the version bump is in patch 0005 not patch 0001). But
> 0002 does **not** need to be backported to all supported branches. The
> only libpq versions that can ever receive a NegotiateVersionProtocol
> are ones that request 3.1, and since none of the pre-17 libpq versions
> ever request 3.1 there's no need for them to be able to handle
> NegotiateVersionProtocol.

OK, yeah, fuzzy thinking on my part.

> I think we have a very different idea of what is a desirable API for
> client authors that use libpq to build their clients. libpq's API is
> pretty low level, so I think it makes total sense for client authors
> to know what protocol extension parameters exist. It seems totally
> acceptable to me to have them call PQsetParameter directly:
>
> PQsetParameter("_pq_.protocol_roles", "true")
> PQsetParameter("_pq_.report_parameters", "role,search_path")
>
> Otherwise we need to introduce **two** new functions for every
> protocol extension that is going to be introduced, a blocking and a
> non-blocking one (e.g. PQsetWireProtocolRole() and
> PQsendSetWireProtocolRole()). And that seems like unnecessary API
> bloat to me.

I think it's hard to say for sure what API is going to work well here,
because we just don't have much experience with this. I do agree that we
want to avoid API bloat. However, I also think that the reason why the
API you're proposing here looks good in this case is that libpq itself
doesn't really need to do anything differently for these parameters. It
doesn't actually change anything about the protocol; it only nails down
the server behavior in a way that can't be changed.

Another current request is to have a way to have certain data types
always be sent in binary format, specified by OID. Do we want that to be
written as PQsetParameter("always_binary_format", "123,456,789,101112"),
or do we want it to maybe look more like PQalwaysBinaryFormat(int count,
Oid *stuff)?

Or, another example, say we want to set
client_to_server_compression_method=lz4. It's very difficult for me to
believe that libpq should be doing strcmp() against the proposed
protocol parameter settings and thus realizing that it needs to start
compressing ... especially since there might be compression parameters
(like level or degree of parallelism) that the client needs and the
server doesn't.

Also, I never intended for _pq_ to become a user-visible namespace that
people would have to care about; that was just a convention that I
adopted to differentiate setting a protocol parameter from setting a
GUC. I think it's a mistake to make that string something users have to
worry about directly. It wouldn't begin and end with an underscore if it
were intended to be user-visible.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Fix a possible socket leak at Windows (src/backend/port/win32/socket.c)
> On 17 Jan 2024, at 07:26, Michael Paquier wrote: > On Tue, Jan 16, 2024 at 05:25:39PM -0300, Ranier Vilela wrote: >> Do you have plans or should I register for a commitfest? > > Daniel has stated that he would take care of it, so why not letting > him a few days? I don't think that a CF entry is necessary. It isn't, I've now committed it backpatched down to 12. -- Daniel Gustafsson
Re: generate syscache info automatically
On 10.01.24 09:00, John Naylor wrote:
>> I suppose a sensible alternative could be to leave the
>> DECLARE_..._INDEX... alone and make a separate statement, like
>>
>> MAKE_SYSCACHE(pg_type_oid_index, TYPEOID, 64);
>>
>> That's at least visually easier, because some of those DECLARE_...
>> lines are getting pretty long.
>
> Probably a good idea, and below I mention a third possible macro.

I updated the patch to use this style (but I swapped the first two
arguments from my example, so that the thing being created is named
first). I also changed the names of the output files a bit to make them
less confusing. (I initially had some files named .c.h, which was weird,
but apparently necessary to avoid confusing the build system. But it's
all clearer now.)

Other than bugs and perhaps style opinions, I think the first patch is
pretty good now.

I haven't changed much in the second patch, other than updating it for
the code changes made in the first patch. It's still very much
WIP/preview at the moment.

From f77c381b9df30824e469f72f4e7754d03cb97d78 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Wed, 17 Jan 2024 12:33:42 +0100
Subject: [PATCH v5 1/2] Generate syscache info from catalog files

Add a new genbki macro MAKE_SYSCACHE that specifies the syscache ID
macro, the underlying index, and the number of buckets. From that, we
can generate the existing tables in syscache.h and syscache.c via
genbki.pl.
Discussion: https://www.postgresql.org/message-id/flat/75ae5875-3abc-dafc-8aec-73247ed41...@eisentraut.org
---
 src/backend/catalog/.gitignore                |   2 +
 src/backend/catalog/Catalog.pm                |  10 +
 src/backend/catalog/Makefile                  |   2 +-
 src/backend/catalog/genbki.pl                 |  72 ++
 src/backend/utils/cache/syscache.c            | 632 +-
 src/include/catalog/.gitignore                |   2 +
 src/include/catalog/genbki.h                  |  10 +-
 src/include/catalog/meson.build               |  18 +-
 src/include/catalog/pg_aggregate.h            |   2 +
 src/include/catalog/pg_am.h                   |   3 +
 src/include/catalog/pg_amop.h                 |   3 +
 src/include/catalog/pg_amproc.h               |   2 +
 src/include/catalog/pg_attribute.h            |   3 +
 src/include/catalog/pg_auth_members.h         |   3 +
 src/include/catalog/pg_authid.h               |   3 +
 src/include/catalog/pg_cast.h                 |   2 +
 src/include/catalog/pg_class.h                |   3 +
 src/include/catalog/pg_collation.h            |   3 +
 src/include/catalog/pg_constraint.h           |   2 +
 src/include/catalog/pg_conversion.h           |   4 +
 src/include/catalog/pg_database.h             |   2 +
 src/include/catalog/pg_default_acl.h          |   2 +
 src/include/catalog/pg_enum.h                 |   3 +
 src/include/catalog/pg_event_trigger.h        |   3 +
 src/include/catalog/pg_foreign_data_wrapper.h |   3 +
 src/include/catalog/pg_foreign_server.h       |   3 +
 src/include/catalog/pg_foreign_table.h        |   2 +
 src/include/catalog/pg_index.h                |   2 +
 src/include/catalog/pg_language.h             |   3 +
 src/include/catalog/pg_namespace.h            |   3 +
 src/include/catalog/pg_opclass.h              |   3 +
 src/include/catalog/pg_operator.h             |   2 +
 src/include/catalog/pg_opfamily.h             |   3 +
 src/include/catalog/pg_parameter_acl.h        |   2 +
 src/include/catalog/pg_partitioned_table.h    |   2 +
 src/include/catalog/pg_proc.h                 |   3 +
 src/include/catalog/pg_publication.h          |   3 +
 .../catalog/pg_publication_namespace.h        |   3 +
 src/include/catalog/pg_publication_rel.h      |   3 +
 src/include/catalog/pg_range.h                |   3 +
 src/include/catalog/pg_replication_origin.h   |   3 +
 src/include/catalog/pg_rewrite.h              |   2 +
 src/include/catalog/pg_sequence.h             |   2 +
 src/include/catalog/pg_statistic.h            |   2 +
 src/include/catalog/pg_statistic_ext.h        |   3 +
 src/include/catalog/pg_statistic_ext_data.h   |   1 +
 src/include/catalog/pg_subscription.h         |   3 +
 src/include/catalog/pg_subscription_rel.h     |   2 +
 src/include/catalog/pg_tablespace.h           |   2 +
 src/include/catalog/pg_transform.h            |   3 +
 src/include/catalog/pg_ts_config.h            |   3 +
 src/include/catalog/pg_ts_config_map.h        |   2 +
 src/include/catalog/pg_ts_dict.h              |   3 +
 src/include/catalog/pg_ts_parser.h            |   3 +
 src/include/catalog/pg_ts_template.h          |   3 +
 src/include/catalog/pg_type.h                 |   3 +
 src/include/catalog/pg_user_mapping.h         |   3 +
 src/include/utils/syscache.h                  |  98 +--
 src/tools/pginclude/cpluspluscheck            |   5 +
 src/tools/pginclude/headerscheck              |   5 +
 60 files changed, 266 insertions(+), 719 deletions(-)

diff --git a/src/backend/catalog/.gitignore b/src/backend/catalog/.gitignore
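For illustration, the swapped-argument form described in the message above would read roughly as follows. This is a sketch of the idea as discussed, not necessarily the committed spelling:

```c
/*
 * Sketch only: the syscache ID (the thing being created) comes first,
 * then the underlying index, then the number of hash buckets.
 */
MAKE_SYSCACHE(TYPEOID, pg_type_oid_index, 64);
```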
Re: Show WAL write and fsync stats in pg_stat_io
Hi,

On Mon, 15 Jan 2024 at 09:27, Michael Paquier wrote:
>
> On Fri, Jan 12, 2024 at 04:23:26PM +0300, Nazir Bilal Yavuz wrote:
> > On Thu, 11 Jan 2024 at 17:28, Melanie Plageman
> > wrote:
> >> Even if we made a separate view for WAL I/O stats, we would still have
> >> this issue of variable sized I/O vs block sized I/O and would probably
> >> end up solving it with separate columns for the number of bytes and
> >> number of operations.
> >
> > Yes, I think it is more about flexibility and not changing the already
> > published pg_stat_io view.
>
> I don't know. Adding more columns or changing op_bytes with an extra
> mode that reflects on what the other columns mean is kind of the same
> thing to me: we want pg_stat_io to report more modes so as all I/O can
> be evaluated from a single view, but the complication is now that
> everything is tied to BLCKSZ.
>
> IMHO, perhaps we'd better put this patch aside until we are absolutely
> *sure* of what we want to achieve when it comes to WAL, and I am
> afraid that this cannot happen until we're happy with the way we
> handle WAL reads *and* writes, including WAL receiver or anything that
> has the idea of pulling its own page callback with
> XLogReaderAllocate() in the backend. Well, writes should be
> relatively "easy" as things happen with XLOG_BLCKSZ, mainly, but
> reads are the unknown part.
>
> That also seems furiously related to the work happening with async I/O
> or the fact that we may want to have in the view a separate meaning
> for cached pages or pages read directly from disk. The worst thing
> that we would do is rush something into the tree and then have to deal
> with the aftermath of what we'd need to deal with in terms of
> compatibility depending on the state of the other I/O related work
> when the new view is released. That would not be fun for the users
> and any hackers who would have to deal with that (aka mainly me if I
> were to commit something), because pg_stat_io could mean something in
> version N, still mean something entirely different in version N+1.

I agree with your points. While the other I/O-related work is happening,
we can discuss what we should do in the variable op_bytes cases. Also,
this is happening only for WAL right now, but if we try to extend
pg_stat_io in the future, that problem will possibly arise again. So it
could be good to come up with a general solution, not only for WAL.

--
Regards,
Nazir Bilal Yavuz
Microsoft
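As a toy illustration of the accounting problem under discussion (this is not pg_stat_io's actual implementation; all names here are invented): with a fixed op_bytes, the byte total is derivable as op_bytes × op_count, which breaks down once operations have variable sizes, at which point a separate byte counter has to be accumulated per operation.

```python
# Toy sketch of the two accounting schemes; not PostgreSQL code.

BLCKSZ = 8192  # fixed block size: bytes are derivable from the op count


class FixedBlockStats:
    """Works for block-oriented I/O: bytes == ops * op_bytes."""

    def __init__(self, op_bytes=BLCKSZ):
        self.op_bytes = op_bytes
        self.ops = 0

    def record(self, nblocks=1):
        self.ops += nblocks

    @property
    def bytes(self):
        # Bytes are implied by the op count; no separate counter needed.
        return self.ops * self.op_bytes


class VariableSizeStats:
    """Needed for WAL-style variable-sized writes: track bytes explicitly."""

    def __init__(self):
        self.ops = 0
        self.bytes = 0

    def record(self, nbytes):
        self.ops += 1
        self.bytes += nbytes  # cannot be derived from ops alone


fixed = FixedBlockStats()
fixed.record(3)                # three 8 kB block reads
assert fixed.bytes == 3 * BLCKSZ

wal = VariableSizeStats()
for n in (200, 8192, 50):      # WAL writes are not all block-sized
    wal.record(n)
assert wal.ops == 3 and wal.bytes == 8442
```

If pg_stat_io were to report WAL this way, the variable-size case would need extra columns (or a changed meaning for op_bytes), which is exactly the compatibility concern raised above.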
Re: Catalog domain not-null constraints
On Wed, 29 Nov 2023 at 01:14, Peter Eisentraut wrote:
>
> On 23.11.23 14:13, Aleksander Alekseev wrote:
> > =# create domain connotnull1 integer;
> > =# create domain connotnull2 integer;
> > =# alter domain connotnull1 add not null value;
> > =# alter domain connotnull2 set not null;
> > =# \dD
> > ERROR: unexpected null value in cached tuple for catalog
> > pg_constraint column conkey
>
> Yeah, for domain not-null constraints pg_constraint.conkey is indeed
> null. Should we put something in there?
>
> Attached is an updated patch that avoids the error by taking a separate
> code path for domain constraints in ruleutils.c. But maybe there is
> another way to arrange this.

One of the tests has failed in CFBot at [1] with:

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/domain.out /tmp/cirrus-ci-build/src/test/regress/results/domain.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/domain.out 2024-01-14 15:40:01.793434601 +
+++ /tmp/cirrus-ci-build/src/test/regress/results/domain.out 2024-01-14 15:42:23.013332625 +
@@ -1271,11 +1271,4 @@
   FROM information_schema.domain_constraints
   WHERE domain_name IN ('con', 'dom', 'pos_int', 'things'))
 ORDER BY constraint_name;
- constraint_catalog | constraint_schema | constraint_name | check_clause
-+---+--+---
- regression | public | con_check | (VALUE > 0)
- regression | public | meow | (VALUE < 11)
- regression | public | pos_int_check | (VALUE > 0)
- regression | public | pos_int_not_null | VALUE IS NOT NULL
-(4 rows)
-
+ERROR: could not open relation with OID 36379

[1] - https://cirrus-ci.com/task/4536440638406656
[2] - https://api.cirrus-ci.com/v1/artifact/task/4536440638406656/log/src/test/regress/regression.diffs

Regards,
Vignesh
Re: Add system identifier to backup manifest
On Wed, Jan 17, 2024 at 5:15 PM Alvaro Herrera wrote:
> On 2024-Jan-17, Amul Sul wrote:
>
> > This helps to identify the correct database server and/or backup for the
> > subsequent backup operations. pg_verifybackup validates the manifest
> > system identifier against the backup control file and fails if they
> > don’t match. Similarly, pg_basebackup incremental backup will fail if
> > the manifest system identifier does not match with the server system
> > identifier. The pg_combinebackup is already a bit smarter -- it checks
> > the system identifier from the pg_control of all the backups; with this
> > patch the manifest system identifier is also validated.
>
> Hmm, okay, but what if I take a full backup from a primary server and
> later I want an incremental from a standby, or the other way around?
> Will this prevent me from using such a combination?

Yes, that worked for me where the system identifier was the same on the
master as well as the standby.

Regards,
Amul
Re: CI and test improvements
On Wed, 12 Jul 2023 at 11:38, Michael Paquier wrote:
>
> On Wed, Jul 12, 2023 at 12:56:17AM -0500, Justin Pryzby wrote:
> > And, since 681d9e462:
> >
> > security is missing from contrib/seg/meson.build
>
> Ugh. Good catch!

Are we planning to do anything more in this thread? It has been idle for
more than 7 months. If nothing more is planned for it, I'm planning to
close the entry in this commitfest.

Regards,
Vignesh
Re: Add system identifier to backup manifest
On 2024-Jan-17, Amul Sul wrote:

> This helps to identify the correct database server and/or backup for the
> subsequent backup operations. pg_verifybackup validates the manifest
> system identifier against the backup control file and fails if they
> don’t match. Similarly, pg_basebackup incremental backup will fail if
> the manifest system identifier does not match with the server system
> identifier. The pg_combinebackup is already a bit smarter -- it checks
> the system identifier from the pg_control of all the backups; with this
> patch the manifest system identifier is also validated.

Hmm, okay, but what if I take a full backup from a primary server and
later I want an incremental from a standby, or the other way around?
Will this prevent me from using such a combination?

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"You're _really_ hosed if the person doing the hiring doesn't understand
relational systems: you end up with a whole raft of programmers, none of
whom has had a Date with the clue stick." (Andrew Sullivan)
https://postgr.es/m/20050809113420.gd2...@phlogiston.dyndns.org
Add system identifier to backup manifest
Hi All,

With the attached patch, the backup manifest will have a new key item,
"System-Identifier", a 64-bit integer whose value is derived from
pg_control while generating it, and the manifest version bumps to 2.

This helps to identify the correct database server and/or backup for the
subsequent backup operations. pg_verifybackup validates the manifest
system identifier against the backup control file and fails if they
don’t match. Similarly, pg_basebackup incremental backup will fail if
the manifest system identifier does not match with the server system
identifier. The pg_combinebackup is already a bit smarter -- it checks
the system identifier from the pg_control of all the backups; with this
patch the manifest system identifier is also validated.

For backward compatibility, the manifest system identifier validation
will be skipped for version 1.

--
Regards,
Amul Sul
EDB: http://www.enterprisedb.com

From 03d21efd7b03d17a55fc1dd1159e0838777f548a Mon Sep 17 00:00:00 2001
From: Amul Sul
Date: Wed, 17 Jan 2024 16:12:19 +0530
Subject: [PATCH v1] Add system identifier to backup manifest

The backup manifest will have a new item, "System-Identifier", whose
value comes from "ControlFileData.system_identifier" when the manifest
is generated. This helps identify the correct database server and/or
backup while taking subsequent backups.
---
 doc/src/sgml/backup-manifest.sgml             | 13 +++-
 doc/src/sgml/ref/pg_verifybackup.sgml         |  3 +-
 src/backend/backup/backup_manifest.c          |  9 ++-
 src/backend/backup/basebackup.c               |  3 +-
 src/backend/backup/basebackup_incremental.c   | 29
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  | 20 +
 src/bin/pg_combinebackup/load_manifest.c      | 73 ---
 src/bin/pg_combinebackup/load_manifest.h      |  6 +-
 src/bin/pg_combinebackup/pg_combinebackup.c   | 16 ++--
 src/bin/pg_combinebackup/t/005_integrity.pl   | 14
 src/bin/pg_combinebackup/write_manifest.c     |  8 +-
 src/bin/pg_combinebackup/write_manifest.h     |  3 +-
 src/bin/pg_verifybackup/pg_verifybackup.c     | 59 ++-
 src/bin/pg_verifybackup/t/003_corruption.pl   | 16 +++-
 src/bin/pg_verifybackup/t/004_options.pl      |  2 +-
 src/bin/pg_verifybackup/t/005_bad_manifest.pl |  7 +-
 src/common/parse_manifest.c                   | 36 -
 src/include/backup/backup_manifest.h          |  3 +-
 src/include/common/parse_manifest.h           |  4 +
 19 files changed, 286 insertions(+), 38 deletions(-)

diff --git a/doc/src/sgml/backup-manifest.sgml b/doc/src/sgml/backup-manifest.sgml
index 771be1310a..d0fc20ed0f 100644
--- a/doc/src/sgml/backup-manifest.sgml
+++ b/doc/src/sgml/backup-manifest.sgml
@@ -37,7 +37,18 @@
 PostgreSQL-Backup-Manifest-Version

-      The associated value is always the integer 1.
+      The associated value is an integer, either 1 or 2.
+
+
+
+
+System-Identifier
+
+
+      The associated integer value is a unique system identifier to ensure
+      files match up with the installation that produced them. Available only
+      when PostgreSQL-Backup-Manifest-Version is 2 and later.

diff --git a/doc/src/sgml/ref/pg_verifybackup.sgml b/doc/src/sgml/ref/pg_verifybackup.sgml
index 36335e0a18..3051517d92 100644
--- a/doc/src/sgml/ref/pg_verifybackup.sgml
+++ b/doc/src/sgml/ref/pg_verifybackup.sgml
@@ -53,7 +53,8 @@
   Backup verification proceeds in four stages. First,
   pg_verifybackup reads the
   backup_manifest file. If that file
-  does not exist, cannot be read, is malformed, or fails verification
+  does not exist, cannot be read, is malformed, fails to match the system
+  identifier with pg_control of the backup directory or fails verification
   against its own internal checksum,
   pg_verifybackup will terminate with a fatal error.

diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index 2c34e59752..612ff3add2 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -56,7 +56,8 @@ IsManifestEnabled(backup_manifest_info *manifest)
 void
 InitializeBackupManifest(backup_manifest_info *manifest,
 						 backup_manifest_option want_manifest,
-						 pg_checksum_type manifest_checksum_type)
+						 pg_checksum_type manifest_checksum_type,
+						 uint64 system_identifier)
 {
 	memset(manifest, 0, sizeof(backup_manifest_info));
 	manifest->checksum_type = manifest_checksum_type;
@@ -79,8 +80,10 @@ InitializeBackupManifest(backup_manifest_info *manifest,
 	if (want_manifest != MANIFEST_OPTION_NO)
 		AppendToManifest(manifest,
-						 "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
-						 "\"Files\": [");
+						 "{ \"PostgreSQL-Backup-Manifest-Version\": 2,\n"
+						 "\"System-Identifier\": " UINT64_FORMAT ",\n"
+						 "\"Files\": [",
+						 system_identifier);
 }

 /*
diff --git a/src/backend/backup/basebackup.c
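To make the new manifest field concrete, here is a small Python sketch of the validation the patch describes (this is not PostgreSQL code, and the function name is invented): read the manifest, and if it is version 2, compare its System-Identifier against the cluster's identifier from pg_control, skipping the check for version-1 manifests for backward compatibility.

```python
import json


def check_manifest_system_id(manifest_text: str, control_system_id: int) -> None:
    """Raise ValueError on a mismatch; version-1 manifests skip the check."""
    manifest = json.loads(manifest_text)
    version = manifest["PostgreSQL-Backup-Manifest-Version"]
    if version == 1:
        return  # no System-Identifier field before version 2
    manifest_id = manifest["System-Identifier"]
    if manifest_id != control_system_id:
        raise ValueError(
            f"manifest system identifier is {manifest_id}, "
            f"but control file has {control_system_id}")


v2 = ('{ "PostgreSQL-Backup-Manifest-Version": 2,'
      ' "System-Identifier": 7324728800061958284, "Files": [] }')

check_manifest_system_id(v2, 7324728800061958284)  # matching identifier: ok
check_manifest_system_id(
    '{ "PostgreSQL-Backup-Manifest-Version": 1, "Files": [] }', 0)  # v1: skipped

try:
    check_manifest_system_id(v2, 42)  # wrong cluster: rejected
except ValueError:
    pass
else:
    raise AssertionError("mismatch should have been detected")
```

The 64-bit identifier value above is made up for illustration; in a real cluster it would come from pg_control.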
Re: [PATCH] Compression dictionaries for JSONB
On Wed, Jan 17, 2024 at 4:16 PM Aleksander Alekseev wrote: > > Hi hackers, > > > 8272749e added a few more arguments to CastCreate(). Here is the rebased > > patch. > > After merging afbfc029 [1] the patch needed a rebase. PFA v10. > > The patch is still in a PoC state and this is exactly why comments and > suggestions from the community are most welcome! Particularly I would > like to know: > > 1. Would you call it a wanted feature considering the existence of > Pluggable TOASTer patchset which (besides other things) tries to > introduce type-aware TOASTers for EXTERNAL attributes? I know what > Simon's [2] and Nikita's latest answers were, and I know my personal > opinion on this [3][4], but I would like to hear from the rest of the > community. > > 2. How should we make sure a dictionary will not consume all the > available memory? Limiting the amount of dictionary entries to pow(2, > 16) and having dictionary versions seems to work OK for ZSON. However > it was pointed out that this may be an unwanted limitation for the > in-core implementation. > > [1]: > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c727f511;hp=afbfc02983f86c4d71825efa6befd547fe81a926 > [2]: > https://www.postgresql.org/message-id/CANbhV-HpCF852WcZuU0wyh1jMU4p6XLbV6rCRkZpnpeKQ9OenQ%40mail.gmail.com > [3]: > https://www.postgresql.org/message-id/CAJ7c6TN-N3%3DPSykmOjmW1EAf9YyyHFDHEznX-5VORsWUvVN-5w%40mail.gmail.com > [4]: > https://www.postgresql.org/message-id/CAJ7c6TO2XTTk3cu5w6ePHfhYQkoNpw7u1jeqHf%3DGwn%2BoWci8eA%40mail.gmail.com I tried to apply the patch but it is failing at the Head. It is giving the following error: patching file src/test/regress/expected/dict.out patching file src/test/regress/expected/oidjoins.out patching file src/test/regress/expected/opr_sanity.out patching file src/test/regress/parallel_schedule Hunk #1 FAILED at 111. 
1 out of 1 hunk FAILED -- saving rejects to file src/test/regress/parallel_schedule.rej patching file src/test/regress/sql/dict.sql Please send the Re-base version of the patch. Thanks and Regards, Shubham Khanna.
Re: Implement missing join selectivity estimation for range types
On Fri, 5 Jan 2024 at 23:09, Schoemans Maxime wrote:
>
> On 05/01/2024 11:37, vignesh C wrote:
> > One of the tests was aborted at [1], kindly post an updated patch for
> > the same:
>
> Thank you for notifying us.
> I believe I fixed the issue but it is hard to be certain as the issue
> did not arise when running the regression tests locally.

I'm noticing this issue is not yet resolved; the CFBot is still failing
at [1] with:

#7 0x55cddc25cd93 in range_cmp_bound_values (typcache=typcache@entry=0x62930b60, b1=b1@entry=0x61c16f08, b2=b2@entry=0x61c180b8) at rangetypes.c:2090
[19:55:02.591] No locals.
[19:55:02.591] #8 0x55cddc2685c1 in calc_hist_join_selectivity (typcache=typcache@entry=0x62930b60, hist1=hist1@entry=0x61c180b8, nhist1=nhist1@entry=101, hist2=hist2@entry=0x61c168b8, nhist2=nhist2@entry=101) at rangetypes_selfuncs.c:1295
[19:55:02.591] i = 0
[19:55:02.591] j = 101
[19:55:02.591] selectivity = 0
[19:55:02.591] prev_sel1 = -1
[19:55:02.591] prev_sel2 = 0
[19:55:02.591] #9 0x55cddc269aaa in rangejoinsel (fcinfo=) at rangetypes_selfuncs.c:1479
[19:55:02.591] root =
[19:55:02.591] operator =
[19:55:02.591] args =
[19:55:02.591] sjinfo =
[19:55:02.591] vardata1 = {var = , rel = , statsTuple = , freefunc = , vartype = , atttype = , atttypmod = , isunique = , acl_ok = }
[19:55:02.591] vardata2 = {var = , rel = , statsTuple = , freefunc = , vartype = , atttype = , atttypmod = , isunique = , acl_ok = }
[19:55:02.591] hist1 = {staop = , stacoll = , valuetype = , values = , nvalues = , numbers = , nnumbers = , values_arr = , numbers_arr = }
[19:55:02.591] hist2 = {staop = , stacoll = , valuetype = , values = , nvalues = , numbers = , nnumbers = , values_arr = , numbers_arr = }
[19:55:02.591] sslot = {staop = , stacoll = , valuetype = , values = , nvalues = , numbers = , nnumbers = , values_arr = , numbers_arr = }
[19:55:02.591] reversed =
[19:55:02.591] selec = 0.00170937500013
[19:55:02.591] typcache = 0x62930b60
[19:55:02.591] stats1 =
[19:55:02.591] stats2 =
[19:55:02.591] empty_frac1 = 0
[19:55:02.591] empty_frac2 = 0
[19:55:02.591] null_frac1 = 0
[19:55:02.591] null_frac2 = 0
[19:55:02.591] nhist1 = 101
[19:55:02.591] nhist2 = 101
[19:55:02.591] hist1_lower = 0x61c168b8
[19:55:02.591] hist1_upper = 0x61c170b8
[19:55:02.591] hist2_lower = 0x61c178b8
[19:55:02.591] hist2_upper = 0x61c180b8
[19:55:02.591] empty =
[19:55:02.591] i =
[19:55:02.591] __func__ = "rangejoinsel"
[19:55:02.591] #10 0x55cddc3b761f in FunctionCall5Coll (flinfo=flinfo@entry=0x7ffc1628d710, collation=collation@entry=0, arg1=arg1@entry=107545982648856, arg2=arg2@entry=3888, arg3=arg3@entry=107820862916056, arg4=arg4@entry=0, arg5=) at fmgr.c:1242
[19:55:02.591] fcinfodata = {fcinfo = {flinfo = , context = , resultinfo = , fncollation = , isnull = , nargs = , args = 0x0}, fcinfo_data = { }}
[19:55:02.591] fcinfo = 0x7ffc1628d5e0
[19:55:02.591] result =
[19:55:02.591] __func__ = "FunctionCall5Coll"
[19:55:02.591] #11 0x55cddc3b92ee in OidFunctionCall5Coll (functionId=8355, collation=collation@entry=0, arg1=arg1@entry=107545982648856, arg2=arg2@entry=3888, arg3=arg3@entry=107820862916056, arg4=arg4@entry=0, arg5=) at fmgr.c:1460
[19:55:02.591] flinfo = {fn_addr = , fn_oid = , fn_nargs = , fn_strict = , fn_retset = , fn_stats = , fn_extra = , fn_mcxt = , fn_expr = }
[19:55:02.591] #12 0x55cddbe834ae in join_selectivity (root=root@entry=0x61d00017c218, operatorid=operatorid@entry=3888, args=0x6210003bc5d8, inputcollid=0, jointype=jointype@entry=JOIN_INNER, sjinfo=sjinfo@entry=0x7ffc1628db30) at ../../../../src/include/postgres.h:324
[19:55:02.591] oprjoin =
[19:55:02.591] result =
[19:55:02.591] __func__ = "join_selectivity"
[19:55:02.591] #13 0x55cddbd8c18c in clause_selectivity_ext (root=root@entry=0x61d00017c218, clause=0x6210003bc678, varRelid=varRelid@entry=0, jointype=jointype@entry=JOIN_INNER, sjinfo=sjinfo@entry=0x7ffc1628db30, use_extended_stats=use_extended_stats@entry=true) at clausesel.c:841

I have changed the status to "Waiting on Author",
feel free to post an updated version, check CFBot and update the Commitfest entry accordingly. [1] - https://cirrus-ci.com/task/5698117824151552 Regards, Vignesh
Re: Synchronizing slots from primary to standby
On Wed, Jan 17, 2024 at 3:08 PM Bertrand Drouvot wrote:
>
> Hi,
>
> On Tue, Jan 16, 2024 at 05:27:05PM +0530, shveta malik wrote:
> > PFA v62. Details:
>
> Thanks!
>
> > v62-003:
> > It is a new patch which attempts to implement slot-sync worker as a
> > special process which is neither a bgworker nor an Auxiliary process.
> > Here we get the benefit of converting enable_syncslot to a PGC_SIGHUP
> > Guc rather than PGC_POSTMASTER. We launch the slot-sync worker only if
> > it is hot-standby and 'enable_syncslot' is ON.
>
> The implementation looks reasonable to me (from what I can see, some parts
> are copy/paste from an already existing "special" process and some parts
> are "sync slot" specific), which makes full sense.

Thanks for the feedback. I have addressed the comments in v63, except the 5th one.

> A few remarks:
>
> 1 ===
>
> + * Was it the slot sycn worker?
>
> Typo: sycn
>
> 2 ===
>
> + * ones), and no walwriter, autovac launcher or bgwriter or slot sync
>
> Instead? "* ones), and no walwriter, autovac launcher, bgwriter or slot sync"
>
> 3 ===
>
> + * restarting slot slyc worker. If stopSignaled is set, the worker will
>
> Typo: slyc
>
> 4 ===
>
> +/* Flag to tell if we are in an slot sync worker process */
>
> s/an/a/ ?
>
> 5 === (coming from v62-0002)
>
> + Assert(tuplestore_tuple_count(res->tuplestore) == 1);
>
> Is it even possible for the related query to not return only one row? (I
> think the "count" ensures it).

I think you are right. This assertion was added some time back on the basis of
feedback on hackers. Let me review that again. I can consider this comment in
the next version.

> 6 ===
>
>         if (conninfo_changed ||
>             primary_slotname_changed ||
> +           old_enable_syncslot != enable_syncslot ||
>             (old_hot_standby_feedback != hot_standby_feedback))
>         {
>             ereport(LOG,
>                     errmsg("slot sync worker will restart because of"
>                            " a parameter change"));
>
> I don't think "slot sync worker will restart" is true if one changes
> enable_syncslot from on to off.

Yes, right. I have changed the log message in this specific case.

> IMHO, v62-003 is in good shape and could be merged into v62-002 (that would
> ease the review). But let's wait to see if others think differently.
>
> Regards,
>
> --
> Bertrand Drouvot
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com
Re: Add pgindent test to check if codebase is correctly formatted
Hmm, should this also install typedefs.list and pgindent.man? What about the tooling to reformat Perl code? -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Linux transformó mi computadora, de una `máquina para hacer cosas', en un aparato realmente entretenido, sobre el cual cada día aprendo algo nuevo" (Jaime Salinas)
Re: Oversight in reparameterize_path_by_child leading to executor crash
On Wed, Jan 17, 2024 at 5:01 PM Richard Guo wrote:
>
> On Tue, Jan 16, 2024 at 2:30 AM Robert Haas wrote:
>
>> On Mon, Jan 8, 2024 at 3:32 AM Richard Guo wrote:
>> > Fair point. I think we can back-patch a more minimal fix, as Tom
>> > proposed in [1], which disallows the reparameterization if the path
>> > contains sampling info that references the outer rel. But I think we
>> > need also to disallow the reparameterization of a path if it contains
>> > restriction clauses that reference the outer rel, as such paths have
>> > been found to cause incorrect results, or planning errors as in [2].
>>
>> Do you want to produce a patch for that, to go along with this patch for
>> master?
>
> Sure, here it is:
>
> v10-0001-Avoid-reparameterizing-Paths-when-it-s-not-suitable.patch

I forgot to mention that this patch applies on v16, not master.

Thanks
Richard
Re: Synchronizing slots from primary to standby
Hi,

On Tue, Jan 16, 2024 at 05:27:05PM +0530, shveta malik wrote:
> PFA v62. Details:

Thanks!

> v62-003:
> It is a new patch which attempts to implement slot-sync worker as a
> special process which is neither a bgworker nor an Auxiliary process.
> Here we get the benefit of converting enable_syncslot to a PGC_SIGHUP
> Guc rather than PGC_POSTMASTER. We launch the slot-sync worker only if
> it is hot-standby and 'enable_syncslot' is ON.

The implementation looks reasonable to me (from what I can see, some parts
are copy/paste from an already existing "special" process and some parts
are "sync slot" specific), which makes full sense.

A few remarks:

1 ===

+ * Was it the slot sycn worker?

Typo: sycn

2 ===

+ * ones), and no walwriter, autovac launcher or bgwriter or slot sync

Instead? "* ones), and no walwriter, autovac launcher, bgwriter or slot sync"

3 ===

+ * restarting slot slyc worker. If stopSignaled is set, the worker will

Typo: slyc

4 ===

+/* Flag to tell if we are in an slot sync worker process */

s/an/a/ ?

5 === (coming from v62-0002)

+ Assert(tuplestore_tuple_count(res->tuplestore) == 1);

Is it even possible for the related query to not return only one row? (I
think the "count" ensures it).

6 ===

        if (conninfo_changed ||
            primary_slotname_changed ||
+           old_enable_syncslot != enable_syncslot ||
            (old_hot_standby_feedback != hot_standby_feedback))
        {
            ereport(LOG,
                    errmsg("slot sync worker will restart because of"
                           " a parameter change"));

I don't think "slot sync worker will restart" is true if one changes
enable_syncslot from on to off.

IMHO, v62-003 is in good shape and could be merged into v62-002 (that would
ease the review). But let's wait to see if others think differently.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: make pg_ctl more friendly
Hi Alvaro,

On Wed, Jan 17, 2024 at 4:54 PM Alvaro Herrera wrote:
>
> I think this needs more comments. First, in the WaitPMResult enum, we
> currently have three values -- READY, STILL_STARTING, FAILED. These are
> all pretty self-explanatory. But this patch adds SHUTDOWN_IN_RECOVERY,
> and that's not at all self-explanatory. Did postmaster start or not?
> The enum value's name doesn't make that clear. So I'd do something like
>
> typedef enum
> {
>     POSTMASTER_READY,
>     POSTMASTER_STILL_STARTING,
>     /*
>      * postmaster no longer running, because it stopped after recovery
>      * completed.
>      */
>     POSTMASTER_SHUTDOWN_IN_RECOVERY,
>     POSTMASTER_FAILED,
> } WaitPMResult;
>
> Maybe put the comments in wait_for_postmaster_start instead.

I put the comments in WaitPMResult, since we would need to add two of them if
they went in wait_for_postmaster_start.

> Secondly, the patch proposes to add new text to be returned by
> do_start() when this happens, which would look like this:
>
> waiting for server to start... shut down while in recovery
> update recovery target settings for startup again if needed
>
> Is this crystal clear? I'm not sure. How about something like this?
>
> waiting for server to start... done, but not running
> server shut down because of recovery target settings

Agreed.

> variations on the first phrase:
>
> "done, no longer running"
> "done, but no longer running"
> "done, automatically shut down"
> "done, automatically shut down after recovery"

I chose the last one because it gives more information to users.

See v5, thanks for the wording ;)

> --
> Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
> "Now I have my system running, not a byte was off the shelf;
> It rarely breaks and when it does I fix the code myself.
> It's stable, clean and elegant, and lightning fast as well,
> And it doesn't cost a nickel, so Bill Gates can go to hell."

--
Regards
Junwang Zhao

v5-0001-PITR-shutdown-should-not-report-error-by-pg_ctl.patch
Description: Binary data
Re: psql JSON output format
On Tue, 2024-01-16 at 14:12 -0500, Robert Haas wrote:
> On Tue, Jan 16, 2024 at 11:07 AM Laurenz Albe wrote:
> > "Round-trip safety" is not so important. If you want to move data from
> > PostgreSQL to PostgreSQL, you use the plain or the binary format.
> > The CSV format by default renders NULL and empty strings identical, and
> > I don't think anybody objects to that.
>
> As Andrew says, the part about the CSV format is not correct, but I
> also don't think I agree with the larger point, either. I believe that
> round-trip safety is a really desirable property. Is it absolutely
> necessary in every case? Maybe not. But, it shouldn't be lacking
> without a good reason, either, at least IMHO. If you postulate that
> people are moving data from A to B, it is reasonable to think that
> eventually someone is going to want to move some data from B back to
> A. If that turns out to be hard, they'll be sad. We shouldn't make
> people sad without a good reason.

As mentioned in my other mail, I was talking about the psql output format
"csv" rather than about COPY.

I agree that it is desirable to lose as little information as possible.
But if we want to format query output as JSON, we have a couple of
requirements that cannot all be satisfied:

1. lose no information ("round-trip safe")

2. don't double quote numbers, booleans and other JSON values

3. don't skip any table column in the output

Christoph's original patch didn't satisfy #2, and his current version
doesn't satisfy #1.

Do you think that skipping NULL columns would be the best solution?
We don't do that in the to_json() function, which also renders SQL NULL
as JSON null.

I think the argument for round-trip safety of psql output is tenuous.
There is no way for psql to ingest JSON as input format, and the patch to
add JSON as COPY format only supports COPY TO.

And unless you can name the exact way that the data written by psql will be
loaded into PostgreSQL again, all that remains is an (understandable) unease
about losing the distinction between SQL NULL and JSON null.

We have jsonb_populate_record() to convert JSON back to a table row, but
that function will convert both missing columns and a JSON null to SQL NULL:

  CREATE TABLE xy (id integer, j jsonb);

  \pset null '∅'

  SELECT * FROM jsonb_populate_record(NULL::xy, '{"id":1,"j":null}');

   id │ j
  ════╪═══
    1 │ ∅
  (1 row)

  SELECT * FROM jsonb_populate_record(NULL::xy, '{"id":1}');

   id │ j
  ════╪═══
    1 │ ∅
  (1 row)

Indeed, there doesn't seem to be a way to generate JSON null with that
function. So I wouldn't worry about round-trip safety too much, and my
preference is how the current patch does it. I am not dead set against a
solution that omits NULL columns in the output, though.

Yours,
Laurenz Albe
Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bindx
> I do realize the same is true for plain \bind, but it seems
> like a bug there too.

The unscanned bind's parameters are discarded later in the HandleSlashCmds
function, so adding ignore_slash_options() for inactive branches scans and
discards them earlier. I will add it to match what's done in the other
commands, but I don't think it is testable, as the behaviour is the same
unless I'm missing something. I did add \bind, \bindx, \close and \parse to
the inactive-branch tests to complete the list.

> One more usability thing. I think \parse and \close should not require
> a \g to send the message. You can do that by returning PSQL_CMD_SEND
> instead of PSQL_CMD_SKIP_LINE.

Changed.

> I think the examples for \bindx and \close
> should use \parse instead of PREPARE.

Done. I had to rely on manual PREPARE for my first tests and it leaked into
the docs.

v3-0001-psql-Add-support-for-prepared-stmt-with-extended-.patch
Description: Binary data
Re: More new SQL/JSON item methods
On Mon, Jan 15, 2024 at 7:41 PM Peter Eisentraut wrote:
> Attached are two small fixup patches for your patch set.

Thanks, Peter.

> In the first one, I simplified the grammar for the .decimal() method.
> It seemed a bit overkill to build a whole list structure when all we
> need are 0, 1, or 2 arguments.

Agree. I added unary '+' and '-' support as well, and thus thought of having
separate rules altogether rather than folding those in.

> Per SQL standard, the precision and scale arguments are unsigned
> integers, so unary plus and minus signs are not supported. So my patch
> removes that support, but I didn't adjust the regression tests for that.

However, PostgreSQL numeric casting does support a negative scale. Here is
an example:

  # select '12345'::numeric(4,-2);
   numeric
  ─────────
     12300
  (1 row)

And thus I thought of supporting those. Do we want this JSON item method to
behave differently here?

> Also note that in your 0002 patch, the datetime precision is similarly
> unsigned, so that's consistent.
>
> By the way, in your 0002 patch, I don't see the need for the separate
> datetime_method grammar rule. You can fold that into accessor_op.

Sure.

> Overall, I think it would be better if you combined all three of these
> patches into one. Right now, you have arranged these as incremental
> features, and as a result of that, the additions to the JsonPathItemType
> enum and the grammar keywords etc. are ordered in the way you worked on
> these features, I guess. It would be good to maintain a bit of sanity
> to put all of this together and order all the enums and everything else,
> for example, in the order they are in the sql_features.txt file (which is
> alphabetical, I suppose). At this point I suspect we'll end up
> committing this whole feature set together anyway, so we might as well
> organize it that way.

OK. I will merge them all into one and will try to keep them in the order
specified in sql_features.txt. However, for the documentation, it makes more
sense to keep them in logical order rather than alphabetical order. What are
your views on this?

Thanks

--
Jeevan Chalke
Principal, Manager, Product Development
edbpostgres.com