Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-18 Thread Amit Kapila
On Mon, Mar 18, 2024 at 8:19 PM Bertrand Drouvot
 wrote:
>
> On Thu, Mar 14, 2024 at 12:27:26PM +0530, Amit Kapila wrote:
> > On Wed, Mar 13, 2024 at 10:16 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Wed, Mar 13, 2024 at 11:13 AM shveta malik  
> > > wrote:
> > > >
> > > > > Thanks. v8-0001 is how it looks. Please see the v8 patch set with 
> > > > > this change.
> > > >
> > > > JFYI, the patch does not apply to the head. There is a conflict in
> > > > multiple files.
> > >
> > > Thanks for looking into this. I noticed that the v8 patches needed
> > > rebase. Before I go do anything with the patches, I'm trying to gain
> > > consensus on the design. Following is the summary of design choices
> > > we've discussed so far:
> > > 1) conflict_reason vs invalidation_reason.
> > > 2) When to compute the XID age?
> > >
> >
> > I feel we should focus on two things (a) one is to introduce a new
> > column invalidation_reason, and (b) let's try to first complete
> > invalidation due to timeout. We can look into XID stuff if time
> > permits, remember, we don't have ample time left.
>
> Agree. While it makes sense to invalidate slots for wal removal in
> CreateCheckPoint() (because this is the place where wal is removed), I'm not
> sure this is the right place for the 2 new cases.
>
> Let's focus on the timeout one as proposed above (probably the simplest one):
> as this one is purely related to time and activity, what about invalidating
> them at these two points:
>
> - when their usage resumes
> - in pg_get_replication_slots()
>
> The idea is to invalidate the slot when one resumes activity on it or wants to
> get information about it (and among other things wants to know if the slot is
> valid or not).
>

Trying to invalidate at those two places makes sense to me but we
still need to cover the cases where it takes very long to resume the
slot activity and the dangling slot cases where the activity is never
resumed. How about apart from the above two places, trying to
invalidate in CheckPointReplicationSlots() where we are traversing all
the slots? This could prevent invalid slots from being marked as
dirty.

BTW, how will the user use 'inactive_count' to know whether a
replication slot is becoming inactive too frequently? The patch just
keeps incrementing this counter; one will never know how many times
the slot became inactive in the last 'n' minutes unless there is some
monitoring tool that keeps capturing this counter from time to time
and calculates the frequency in some way. Even if this is useful, it
is not clear to me whether we need to store 'inactive_count' in the
slot's persistent data. I understand it could be a metric required by
the user, but wouldn't it be better to track this via
pg_stat_replication_slots so that we don't need to store it in the
slot's persistent data? If this understanding is correct, I would say
let's remove 'inactive_count' as well from the main patch and discuss
it separately.
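
Just to illustrate the point, the kind of external sampling a monitoring
tool would need looks roughly like the sketch below (this assumes the
patch's proposed inactive_count column in pg_replication_slots; the table
and column names here are made up):

create table slot_activity_samples (
    sample_time     timestamptz default now(),
    slot_name       name,
    inactive_count  bigint
);

-- run periodically, e.g. from cron
insert into slot_activity_samples (slot_name, inactive_count)
select slot_name, inactive_count from pg_replication_slots;

-- how many times did each slot become inactive in the last 10 minutes?
select slot_name,
       max(inactive_count) - min(inactive_count) as became_inactive
from slot_activity_samples
where sample_time >= now() - interval '10 minutes'
group by slot_name;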

-- 
With Regards,
Amit Kapila.




Re: POC, WIP: OR-clause support for indexes

2024-03-18 Thread Andrei Lepikhov

On 14/3/2024 16:31, Alexander Korotkov wrote:
> On Wed, Mar 13, 2024 at 2:16 PM Andrei Lepikhov 
> As you can see this case is not related to partial indexes.  Just no
> index selective for the whole query.  However, splitting scan by the OR
> qual lets use a combination of two selective indexes.
I have rewritten the 0002-* patch according to your concern. A candidate 
and some thoughts are attached.
As I see it, we have a problem here: expanding each array and trying to 
apply an element to each index can result in a lengthy planning stage.
Also, an index scan with the SAOP may potentially be more effective than 
one with the list of OR clauses.
Originally, the transformation's purpose was to reduce a query's 
complexity and the number of optimization paths, to speed up planning and 
(sometimes) execution. Here, we reduce planning complexity only in the 
case of an array size larger than MAX_SAOP_ARRAY_SIZE.
Maybe we can fall back to the previous version of the second patch, 
keeping in mind that someone who wants to get maximum profit from the 
BitmapOr scan of multiple indexes can just disable this optimization, 
enabling a deeper search for the optimal scan?
As a compromise solution, I propose adding one more option to the 
previous version: if an element doesn't fit any partial index, try to 
cover it with a plain index.
In this case, we still do not guarantee the optimal fit of elements to 
the set of indexes, but we speed up planning. Does that make sense?
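
As an illustration of the kind of case in question (a toy example of my
own, not taken from the patch's regression tests; whether the planner
actually chooses a BitmapOr here depends on the patch and on costing):

create table t (a int, payload text);
create index t_small on t (a) where a < 100;
create index t_large on t (a) where a >= 100000;
insert into t select g, repeat('x', 10) from generate_series(1, 200000) g;
analyze t;

-- Neither partial index covers the whole array, but after splitting it
-- into {1, 2} and {100001, 100002} each part fits one partial index.
explain select * from t where a in (1, 2, 100001, 100002);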


--
regards,
Andrei Lepikhov
Postgres Professional
From d2d8944fc83ccd090653c1b15703a2c3ba096fa9 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Wed, 13 Mar 2024 12:26:02 +0700
Subject: [PATCH 2/2] Teach generate_bitmap_or_paths to build BitmapOr paths
 over SAOP clauses.

Like OR clauses, discover SAOP arrays and try to split their elements
into smaller arrays to fit a set of partial indexes.
---
 doc/src/sgml/config.sgml   |   3 +
 src/backend/optimizer/path/indxpath.c  |  74 +-
 src/backend/optimizer/util/predtest.c  |  37 +++
 src/backend/optimizer/util/restrictinfo.c  |  13 +
 src/include/optimizer/optimizer.h  |   3 +
 src/include/optimizer/restrictinfo.h   |   1 +
 src/test/regress/expected/create_index.out |  24 +-
 src/test/regress/expected/select.out   | 280 +
 src/test/regress/sql/select.sql|  82 ++
 9 files changed, 500 insertions(+), 17 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de6ae301a..0df56f44e3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5485,6 +5485,9 @@ ANY num_sync 
( orclause)->args;
+
+   if (!enable_or_transformation)
+   return orlist;
+
+   if (restriction_is_saop_clause(rinfo))
+   {
+   result = transform_saop_to_ors(root, rinfo);
+   return (result == NIL) ? list_make1(rinfo) : result;
+   }
+
+   foreach(lc, orlist)
+   {
+   Expr *expr = (Expr *) lfirst(lc);
+
+   if (IsA(expr, RestrictInfo) && 
restriction_is_saop_clause((RestrictInfo *) expr))
+   {
+   List *sublist;
+
+   sublist = extract_saop_ors(root, (RestrictInfo *) 
lfirst(lc));
+   if (sublist != NIL)
+   {
+   result = list_concat(result, sublist);
+   continue;
+   }
+
+   /* Need to return expr to the result list */
+   }
+
+   result = lappend(result, expr);
+   }
+
+   return result;
+}
+
 /*
  * generate_bitmap_or_paths
- * Look through the list of clauses to find OR clauses, and generate
- * a BitmapOrPath for each one we can handle that way.  Return a list
- * of the generated BitmapOrPaths.
+ * Look through the list of clauses to find OR and SAOP clauses, and
+ * generate a BitmapOrPath for each one we can handle that way.
+ * Each SAOP clause is split to be covered by partial indexes.
+ * Return a list of the generated BitmapOrPaths.
  *
  * other_clauses is a list of additional clauses that can be assumed true
  * for the purpose of generating indexquals, but are not to be searched for
@@ -1247,20 +1303,24 @@ generate_bitmap_or_paths(PlannerInfo *root, RelOptInfo 
*rel,
foreach(lc, clauses)
{
RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);
-   List   *pathlist;
+   List   *pathlist = NIL;
Path   *bitmapqual;
ListCell   *j;
+   List   *orlist = NIL;
 
-   /* Ignore RestrictInfos that aren't ORs */
-   if (!restriction_is_or_clause(rinfo))
+   orlist = extract_saop_ors(root, rinfo);
+   if (orlist == NIL)
+ 

Re: New Table Access Methods for Multi and Single Inserts

2024-03-18 Thread Masahiko Sawada
Hi,

On Fri, Mar 8, 2024 at 7:37 PM Bharath Rupireddy
 wrote:
>
> On Sat, Mar 2, 2024 at 12:02 PM Bharath Rupireddy
>  wrote:
> >
> > On Mon, Jan 29, 2024 at 5:16 PM Bharath Rupireddy
> >  wrote:
> > >
> > > > Please find the attached v9 patch set.
> >
> > I've had to rebase the patches due to commit 874d817, please find the
> > attached v11 patch set.
>
> Rebase needed. Please see the v12 patch set.
>

I've not reviewed the patches in depth yet, but I ran performance tests
for CREATE MATERIALIZED VIEW. The test scenario is:

-- setup
create unlogged table test (c int);
insert into test select generate_series(1, 1000);

-- run
create materialized view test_mv as select * from test;

Here are the results:

* HEAD
3775.221 ms
3744.039 ms
3723.228 ms

* v12 patch
6289.972 ms
5880.674 ms
7663.509 ms

I can see performance regressions, and the perf report says that the CPU
spent most of its time extending the ResourceOwner's array while copying
the buffer-heap tuple:

- 52.26% 0.18% postgres postgres [.] intorel_receive
52.08% intorel_receive
table_multi_insert_v2 (inlined)
- heap_multi_insert_v2
- 51.53% ExecCopySlot (inlined)
tts_buffer_heap_copyslot
tts_buffer_heap_store_tuple (inlined)
 - IncrBufferRefCount
 - ResourceOwnerEnlarge
 ResourceOwnerAddToHash (inlined)

Is there any reason why we copy a buffer-heap tuple to another
buffer-heap tuple? That results in incrementing the buffer refcount
and registering it with the ResourceOwner for every tuple. I guess
that the destination tuple slot doesn't necessarily have to be a
buffer-heap slot, and we could use a VirtualTupleTableSlot instead.
It would in turn require copying a heap tuple. I might be missing
something, but it improved the performance at least in my environment.
The change I made was:

-   dstslot = table_slot_create(state->rel, NULL);
+   //dstslot = table_slot_create(state->rel, NULL);
+   dstslot = MakeTupleTableSlot(RelationGetDescr(state->rel),
+                                &TTSOpsVirtual);
+

And the execution times are:
1588.984 ms
1591.618 ms
1582.519 ms

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Be strict when request to flush past end of WAL in WaitXLogInsertionsToFinish

2024-03-18 Thread Jeff Davis
On Fri, 2024-03-15 at 13:12 +0530, Bharath Rupireddy wrote:
> Hi,
> 
> While working on [1], it was identified that
> WaitXLogInsertionsToFinish emits a LOG message, and adjusts the upto
> ptr to proceed further when caller requests to flush past the end of
> generated WAL. There's a comment explaining no caller should ever do
> that intentionally except in cases with bogus LSNs. For a similar
> situation, XLogWrite emits a PANIC "xlog write request %X/%X is past
> end of log %X/%X". Although there's no problem if
> WaitXLogInsertionsToFinish emits LOG, but why can't it be a bit more
> harsh and emit PANIC something like the attached to detect the corner
> case?
> 
> Thoughts?

I'm not clear on why the callers of WaitXLogInsertionsToFinish() are
handling errors the way they are. XLogWrite PANICs, XLogFlush ERRORs
(which is likely to be escalated to a PANIC anyway), and the other
callers ignore the return value and leave it up to XLogWrite() to
PANIC.

As far as I can tell, once WaitXLogInsertionsToFinish() detects this
bogus LSN, a PANIC is a likely outcome, so your proposed change makes
sense. But then why are the callers also checking?

I haven't looked in a lot of detail.

Regards,
Jeff Davis





RE: speed up a logical replica setup

2024-03-18 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thanks for the comments. I want to reply to some of them.

> > 17.
> >
> > "specifies promote"
> >
> > We can do double-quote for the word promote.
> 
> The v30 patch has promote, which I think is adequate.

Oops. Actually I was looking at the v29 patch when I first reviewed. Sorry for the noise.

> 
> > 20.
> >
> > New options must be also documented as well. This helps not only users but
> also
> > reviewers.
> > (Sometimes we cannot identify that the implementation is intentinal or not.)
> 
> Which ones are missing?

In v29, the newly added options (publication/subscription/replication-slot)
were not documented. Since they have been documented, please ignore.

> > 21.
> >
> > Also, not sure the specification is good. I preferred to specify them by 
> > format
> > string. Because it can reduce the number of arguments and I cannot find use
> cases
> > which user want to control the name of objects.
> >
> > However, your approach has a benefit which users can easily identify the
> generated
> > objects by pg_createsubscriber. How do other think?
> 
> I think listing them explicitly is better for the first version.  It's
> simpler to implement and more flexible.

OK.

> > 22.
> >
> > ```
> > #define BASE_OUTPUT_DIR
>   "pg_createsubscriber_output.d"
> > ```
> >
> > No one refers the define.
> 
> This is gone in v30.

I wrote that comment for the same reason as above (I was looking at v29). Please ignore...

> 
> > 31.
> >
> > ```
> > /* Create replication slot on publisher */
> > if (lsn)
> > pg_free(lsn);
> > ```
> >
> > I think allocating/freeing memory is not so efficient.
> > Can we add a flag to create_logical_replication_slot() for controlling the
> > returning value (NULL or duplicated string)? We can use the condition (i ==
> num_dbs-1)
> > as flag.
> 
> Nothing is even using the return value of
> create_logical_replication_slot().  I think this can be removed altogether.

> > 37.
> >
> > ```
> > /* Register a function to clean up objects in case of failure */
> > atexit(cleanup_objects_atexit);
> > ```
> >
> > Sorry if we have already discussed. I think the registration can be moved 
> > just
> > before the boot of the standby. Before that, the callback will be no-op.
> 
> But it can also stay where it is.  What is the advantage of moving it later?

I thought we could reduce the risk of bugs. Previously some bugs were reported
because the registration was done too early. However, this is not a strong opinion.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: remaining sql/json patches

2024-03-18 Thread Himanshu Upadhyaya
On Mon, Mar 18, 2024 at 3:33 PM Amit Langote 
wrote:

> Himanshu,
>
> On Mon, Mar 18, 2024 at 4:57 PM Himanshu Upadhyaya
>  wrote:
> > I have tested a nested case, but why is a negative number allowed in the
> > subscript (NESTED '$.phones[-1]' COLUMNS)? It should error out if the
> > number is negative.
> >
> > SELECT * FROM JSON_TABLE(jsonb '{
> >  "id" : "0.234567897890",
> >  "name" : { "first":"John", "last":"Doe" },
> >  "phones" : [{"type":"home", "number":"555-3762"},
> >  {"type":"work", "number":"555-7252", "test":123}]}',
> > '$'
> > COLUMNS(
> >  id numeric(2,2) PATH 'lax $.id',
> >  last_name varCHAR(10) PATH 'lax $.name.last',
> >  first_name VARCHAR(10) PATH 'lax $.name.first',
> >  NESTED '$.phones[-1]' COLUMNS (
> > "type" VARCHAR(10),
> > "number" VARCHAR(10)
> >  )
> >  )
> > ) as t;
> >   id  | last_name | first_name | type | number
> > --+---++--+
> >  0.23 | Doe   | Johnnn |  |
> > (1 row)
>
> You're not getting an error because the default mode of handling
> structural errors in SQL/JSON path expressions is "lax".  If you say
> "strict" in the path string, you will get an error:
>
>
ok, got it, thanks.


> SELECT * FROM JSON_TABLE(jsonb '{
>  "id" : "0.234567897890",
>  "name" : {
> "first":"John",
> "last":"Doe" },
>  "phones" : [{"type":"home", "number":"555-3762"},
>  {"type":"work", "number":"555-7252", "test":123}]}',
> '$'
> COLUMNS(
>  id numeric(2,2) PATH 'lax $.id',
>  last_name varCHAR(10) PATH 'lax $.name.last',
> first_name VARCHAR(10) PATH 'lax $.name.first',
>   NESTED 'strict $.phones[-1]'COLUMNS (
> "type" VARCHAR(10),
> "number" VARCHAR(10)
>  )
>   ) error on error
>) as t;
> ERROR:  jsonpath array subscript is out of bounds
>
> --
> Thanks, Amit Langote
>


-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com


Re: Built-in CTYPE provider

2024-03-18 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Mar 19, 2024 at 11:55 AM Tom Lane  wrote:
>>>> This is causing all CI jobs to fail the "compiler warnings" check.

>>> I did run CI before checkin, and it passed:

> Maybe I misunderstood this exchange but ...

> Currently Windows warnings don't make any CI tasks fail ie turn red,
> which is why Jeff's run is all green in his personal github repo.
> ...
> But I did teach cfbot to do some extra digging through the logs,

Ah.  What I should have said was "it's causing cfbot to complain
about every patch".

Seems like the divergence in the pass criterion is not such a
great idea.

regards, tom lane




Re: Built-in CTYPE provider

2024-03-18 Thread Thomas Munro
On Tue, Mar 19, 2024 at 11:55 AM Tom Lane  wrote:
> Jeff Davis  writes:
> > On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
> >> This is causing all CI jobs to fail the "compiler warnings" check.
>
> > I did run CI before checkin, and it passed:
> > https://cirrus-ci.com/build/5382423490330624
>
> Weird, why did it not report with the same level of urgency?
> But anyway, thanks for fixing.

Maybe I misunderstood this exchange but ...

Currently Windows warnings don't make any CI tasks fail, i.e. turn red,
which is why Jeff's run is all green in his personal github repo.
That is unlike gcc, clang, and MinGW cross-build warnings, which cause
the special "CompilerWarnings" CI task to fail (red).  That task is
running on a Linux system so it can't use MSVC.  The idea of keeping
it separate from the "main" Linux, FreeBSD, macOS tasks (which use
gcc, clang, clang respectively) was that it's nicer to try to run the
actual tests even if there is a pesky warning, so having it in a
separate task gets you that info without blocking other progress, and
it also tries with and without assertions (a category of warning
hazard, eg unused variables when assertions are off).

But I did teach cfbot to do some extra digging through the logs,
looking for various interesting patterns[1], including non-error
warnings, and if it finds anything interesting it shows a little
clickable ⚠ symbol on the front page.

If there is something like -Werror on MSVC we could turn that on for
the main Windows test, but that might also be a bit annoying.  Perhaps
there is another way: we could have it compile and test everything,
allowing warnings, but also then grep the build log afterwards in a
new step that fails if any warnings were there?  Then Jeff would have
got a failure in his personal CI run.  Or something like that.

[1] https://github.com/macdice/cfbot/blob/master/cfbot_work_queue.py




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-18 Thread Masahiko Sawada
On Tue, Mar 19, 2024 at 8:35 AM John Naylor  wrote:
>
> On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada  
> wrote:
> >
> > On Sun, Mar 17, 2024 at 11:46 AM John Naylor  
> > wrote:
>
> > > Random offsets is what I was thinking of (if made distinct and
> > > ordered), but even there the code is fairy trivial, so I don't have a
> > > strong feeling about it.
> >
> > Agreed.
>
> Looks good.
>
> A related thing I should mention is that the tests which look up all
> possible offsets are really expensive with the number of blocks we're
> using now (assert build):
>
> v70 0.33s
> v72 1.15s
> v73 1.32
>
> To trim that back, I think we should give up on using shared memory
> for the is-full test: We can cause aset to malloc a new block with a
> lot fewer entries. In the attached, this brings it back down to 0.43s.

Looks good. Agreed with this change.

> It might also be worth reducing the number of blocks in the random
> test -- multiple runs will have different offsets anyway.

Yes. If we reduce the number of blocks from 1000 to 100, the
regression test times on my environment are:

1000 blocks : 516 ms
100 blocks   : 228 ms

>
> > > I think we can stop including the debug-tid-store patch for CI now.
> > > That would allow getting rid of some unnecessary variables.
> >
> > Agreed.
>
> Okay, all that remains here is to get rid of those variables (might be
> just one).

Removed some unnecessary variables in 0002 patch.

>
> > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has 
> > > TIDs
> > > + * in one block. We return the block numbers in ascending order and the 
> > > offset
> > > + * numbers in each result is also sorted in ascending order.
> > > + */
> > > +TidStoreIterResult *
> > > +TidStoreIterateNext(TidStoreIter *iter)
> > >
> > > The wording is a bit awkward.
> >
> > Fixed.
>
> - * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> - * in one block. We return the block numbers in ascending order and the 
> offset
> - * numbers in each result is also sorted in ascending order.
> + * Scan the TidStore and return the TIDs of the next block. The returned 
> block
> + * numbers is sorted in ascending order, and the offset numbers in each 
> result
> + * is also sorted in ascending order.
>
> Better, but it's still not very clear. Maybe "The offsets in each
> iteration result are ordered, as are the block numbers over all
> iterations."

Thanks, fixed.

>
> > > +/* Extract TIDs from the given key-value pair */
> > > +static void
> > > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
> > > BlocktableEntry *page)
> > >
> > > This is a leftover from the old encoding scheme. This should really
> > > take a "BlockNumber blockno" not a "key", and the only call site
> > > should probably cast the uint64 to BlockNumber.
> >
> > Fixed.
>
> This part looks good. I didn't notice earlier, but this comment has a
> similar issue
>
> @@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter)
>   return NULL;
>
>   /* Collect TIDs extracted from the key-value pair */
> - tidstore_iter_extract_tids(iter, key, page);
> + tidstore_iter_extract_tids(iter, (BlockNumber) key, page);
>
> ..."extracted" was once a separate operation. I think just removing
> that one word is enough to update it.

Fixed.

>
> Some other review on code comments:
>
> v73-0001:
>
> + /* Enlarge the TID array if necessary */
>
> It's "arrays" now.
>
> v73-0005:
>
> +-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has
> +-- different randon 100 offset numbers each other.
>
> The numbers are obvious from the query. Maybe just mention that the
> offsets are randomized and must be unique and ordered.
>
> + * The caller is responsible for release any locks.
>
> "releasing"

Fixed.

>
> > > +typedef struct BlocktableEntry
> > > +{
> > > + uint16 nwords;
> > > + bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> > > +} BlocktableEntry;
> > >
> > > In my WIP for runtime-embeddable offsets, nwords needs to be one byte.
>
> I should be more clear here: nwords fitting into one byte allows 3
> embedded offsets (1 on 32-bit platforms, which is good for testing at
> least). With uint16 nwords that reduces to 2 (none on 32-bit
> platforms). Further, after the current patch series is fully
> committed, I plan to split the embedded-offset patch into two parts:
> The first would store the offsets in the header, but would still need
> a (smaller) allocation. The second would embed them in the child
> pointer. Only the second patch will care about the size of nwords
> because it needs to reserve a byte for the pointer tag.

Thank you for the clarification.

>
> > > That doesn't have any real-world affect on the largest offset
> > > encountered, and only in 32-bit builds with 32kB block size would the
> > > theoretical max change at all. To be precise, we could use in the
> > > MaxBlocktableEntrySize calculation:
> > >
> > > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1);
> >
> > I don't get 

Re: add AVX2 support to simd.h

2024-03-18 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 10:03:36AM +0700, John Naylor wrote:
> I took a brief look, and 0001 isn't quite what I had in mind. I can't
> quite tell what it's doing with the additional branches and "goto
> retry", but I meant something pretty simple:

Do you mean 0002?  0001 just adds a 2-register loop for remaining elements
once we've exhausted what can be processed with the 4-register loop.

> - if short, do one element at a time and return

0002 does this.

> - if long, do one block unconditionally, then round the start pointer
> up so that "end - start" is an exact multiple of blocks, and loop over
> them

0002 does the opposite of this.  That is, after we've completed as many
blocks as possible, we move the iterator variable back to "end -
block_size" and do one final iteration to cover all the remaining elements.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: add AVX2 support to simd.h

2024-03-18 Thread John Naylor
On Tue, Mar 19, 2024 at 9:03 AM Nathan Bossart  wrote:
>
> On Sun, Mar 17, 2024 at 09:47:33AM +0700, John Naylor wrote:
> > I haven't looked at the patches, but the graphs look good.
>
> I spent some more time on these patches.  Specifically, I reordered them to
> demonstrate the effects on systems without AVX2 support.  I've also added a
> shortcut to jump to the one-by-one approach when there aren't many
> elements, as the overhead becomes quite noticeable otherwise.  Finally, I
> ran the same benchmarks again on x86 and Arm out to 128 elements.
>
> Overall, I think 0001 and 0002 are in decent shape, although I'm wondering
> if it's possible to improve the style a bit.

I took a brief look, and 0001 isn't quite what I had in mind. I can't
quite tell what it's doing with the additional branches and "goto
retry", but I meant something pretty simple:

- if short, do one element at a time and return
- if long, do one block unconditionally, then round the start pointer
up so that "end - start" is an exact multiple of blocks, and loop over
them




Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding

2024-03-18 Thread Ajin Cherian
>
> Of course you can, but this will only convert disk space into memory space.
>  For details, please see the case in Email [1].
>
> [1]
> https://www.postgresql.org/message-id/CAGfChW51P944nM5h0HTV9HistvVfwBxNaMt_s-OZ9t%3DuXz%2BZbg%40mail.gmail.com
>
> Regards, lijie
>
>
Hi lijie,

Overall, I think the patch is a good improvement. Some comments from first
run through of patch:
1. The patch no longer applies cleanly, please rebase.

2. While testing the patch, I saw something strange. If I try to truncate a
table that is published, I still see the message:
2024-03-18 22:25:51.243 EDT [29385] LOG:  logical filter change by table
pg_class

This gives the impression that the truncate operation on the published
table has been filtered but it hasn't. Also the log message needs to be
reworded. Maybe, "Logical filtering change by non-published table
"

3. Below code:
@@ -1201,11 +1343,14 @@ DecodeMultiInsert(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
+
+ if (FilterByTable(ctx, change))
+ continue;;

extra semi-colon after continue.

4. I am not sure if this is possible, but is there a way to avoid the
overhead in the patch if the publication publishes "ALL TABLES"?

5. In function: pgoutput_table_filter() - this code appears to be filtering
out not just unpublished tables but also applying row based filters on
published tables as well. Is this really within the scope of the feature?

regards,
Ajin Cherian
Fujitsu Australia


Re: A problem about partitionwise join

2024-03-18 Thread Richard Guo
(Sorry it takes me some time to get back to this thread.)

On Thu, Mar 7, 2024 at 7:13 PM Ashutosh Bapat 
wrote:

> I did a deeper review of the patch. Here are some comments
>

Thank you for the review!


> Approach
> 
> The equijoin condition between partition keys doesn't appear in the join's
> restrictlist because of the 'best_score' strategy, as you explained well in
> [2]. What if we add an extra score for clauses between partition keys and
> give preference to equijoin between partition keys? Have you given it a
> thought? I feel that having an equijoin clause involving partition keys has
> more usages compared to a clause with any random column. E.g. nestloop may
> be able to prune partitions from the inner relation if the clause contains a
> partition key.
>

Hmm, I think this approach won't work in cases where a certain pair of
partition keys has formed an EC that contains pseudoconstants.  In such
cases, the EC machinery will generate restriction clauses like 'pk =
const' rather than any join clauses.
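
For instance (a toy example of the situation, not taken from the patch):

create table p1 (a int) partition by range (a);
create table p1_1 partition of p1 for values from (0) to (100);
create table p2 (a int) partition by range (a);
create table p2_1 partition of p2 for values from (0) to (100);

set enable_partitionwise_join = on;
-- The EC {p1.a, p2.a, 1} contains a pseudoconstant, so the planner derives
-- the restrictions p1.a = 1 and p2.a = 1 and generates no join clause
-- between the partition keys.
explain select * from p1 join p2 on p1.a = p2.a where p1.a = 1;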

Besides, it seems to me that it's not a cheap operation to check whether
a join clause is between partition keys when we generate join clauses
from ECs in generate_join_implied_equalities().


> Documentation
> -
> The patch does not modify any documentation. The only documentation I
> could find about partitionwise join is the one for GUC
> 'enable_partitionwise_join'. It says
> --- quote
> "Partitionwise join currently applies only when the join conditions
> include all the partition keys, which must be of the same data type and
> have one-to-one matching sets of child partitions.".
> --- unquote
> This sentence is general and IMO covers the case this patch considers. But
> in general I feel that partitionwise join and aggregation deserve separate
> sections next to "partition pruning" in [1]; It should mention advanced
> partition matching algorithm as well. Would you be willing to write one and
> then expand it for the case in the patch?
>

I don't think it should be part of this patch to add a new section in
the docs to explain partitionwise join and aggregation.  Maybe that
deserves a separate patch?


> Tests
> -
> The patch adds a testcase for single column partitioning. I think we need
> to do better like
> 1. Test for partitioning on expression, multilevel partitioning, advanced
> partition matching. Those all might just work. Having tests helps us to
> notice any future breakage.
> 2. Some negative test case e.g. equijoin clauses with disjunction, with
> inequality operator, equality operators with operators from different
> families etc.
>

Thanks for the suggestions.  We can do that.


> 3. The testcase added looks artificial. it outputs data that has same
> value for two columns which is equal to the primary key of the other table
> - when would somebody do that?. Is there some real life example where this
> change will be useful?
>

Hmm, I think the test case is good as long as it reveals the issue that
this patch fixes.  It follows the same format as the existing test case
just above it.  I'm not sure if there are real life examples, but I
think it may not always be necessary to derive test cases from them.


> Code
> 
> Minor comment for now. It will be better to increment num_equal_pks
> immediately after setting pk_known_equal[ipk] = true. Otherwise the code
> gets confusing around line 2269. I will spend more time reviewing the code
> next week.
>

Hmm, the increment of num_equal_pks on line 2272 is parallel to the one
in the first loop (around line 2200).  Maybe it's better to keep them
consistent as the current patch does?

Thanks
Richard


Re: add AVX2 support to simd.h

2024-03-18 Thread Nathan Bossart
On Sun, Mar 17, 2024 at 09:47:33AM +0700, John Naylor wrote:
> I haven't looked at the patches, but the graphs look good.

I spent some more time on these patches.  Specifically, I reordered them to
demonstrate the effects on systems without AVX2 support.  I've also added a
shortcut to jump to the one-by-one approach when there aren't many
elements, as the overhead becomes quite noticeable otherwise.  Finally, I
ran the same benchmarks again on x86 and Arm out to 128 elements.

Overall, I think 0001 and 0002 are in decent shape, although I'm wondering
if it's possible to improve the style a bit.  0003 at least needs a big
comment in simd.h, and it might need a note in the documentation, too.  If
the approach in this patch set seems reasonable, I'll spend some time on
that.

BTW I did try to add some other optimizations, such as processing remaining
elements with only one vector and trying to use the overlapping strategy
with more registers if we know there are relatively many remaining
elements.  These other approaches all added a lot of complexity and began
hurting performance, and I've probably already spent way too much time
optimizing a linear search, so this is where I've decided to stop.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 2f4a7747025cd3288453fdabd520638e37e3633c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 18 Mar 2024 10:44:08 -0500
Subject: [PATCH v4 1/3] pg_lfind32(): Optimize processing remaining elements.

Discussion: https://postgr.es/m/20231129171526.GA857928%40nathanxps13
---
 src/include/port/pg_lfind.h | 42 +
 1 file changed, 38 insertions(+), 4 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index b8dfa66eef..bef0e2d5be 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -95,15 +95,16 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 
 	/*
 	 * For better instruction-level parallelism, each loop iteration operates
-	 * on a block of four registers.  Testing for SSE2 has showed this is ~40%
-	 * faster than using a block of two registers.
+	 * on a block of registers.  We first do as much processing as possible
+	 * with a block of 4 registers, then we try to process what remains with a
+	 * block of 2 registers.
 	 */
 	const Vector32 keys = vector32_broadcast(key);	/* load copies of key */
 	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
-	const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+	uint32		nelem_per_iteration = 4 * nelem_per_vector;
 
 	/* round down to multiple of elements per iteration */
-	const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
+	uint32		tail_idx = nelem & ~(nelem_per_iteration - 1);
 
 #if defined(USE_ASSERT_CHECKING)
 	bool		assert_result = false;
@@ -157,6 +158,39 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 			return true;
 		}
 	}
+
+	/*
+	 * Try processing the remaining elements using 2 registers instead of 4.
+	 */
+	nelem_per_iteration = 2 * nelem_per_vector;
+	tail_idx = nelem & ~(nelem_per_iteration - 1);
+
+	for (; i < tail_idx; i += nelem_per_iteration)
+	{
+		Vector32	vals1,
+	vals2,
+	result1,
+	result2,
+	result;
+
+		/* load the next block into 2 registers */
+		vector32_load(&vals1, &base[i]);
+		vector32_load(&vals2, &base[i + nelem_per_vector]);
+
+		/* compare each value to the key */
+		result1 = vector32_eq(keys, vals1);
+		result2 = vector32_eq(keys, vals2);
+
+		/* combine the results into a single variable */
+		result = vector32_or(result1, result2);
+
+		/* see if there was a match */
+		if (vector32_is_highbit_set(result))
+		{
+			Assert(assert_result);
+			return true;
+		}
+	}
 #endif			/* ! USE_NO_SIMD */
 
 	/* Process the remaining elements one at a time. */
-- 
2.25.1

From 68ee8bf34c80a0a3df02c2aae8357f664895b4de Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 18 Mar 2024 10:55:50 -0500
Subject: [PATCH v4 2/3] pg_lfind32(): Further optimize processing remaining
 elements.

Discussion: https://postgr.es/m/20231129171526.GA857928%40nathanxps13
---
 src/include/port/pg_lfind.h | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index bef0e2d5be..83fb8f50d2 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -96,8 +96,8 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	/*
 	 * For better instruction-level parallelism, each loop iteration operates
 	 * on a block of registers.  We first do as much processing as possible
-	 * with a block of 4 registers, then we try to process what remains with a
-	 * block of 2 registers.
+	 * with a block of 4 registers, then we process what remains with a block
+	 * of 2 registers.
 	 */
 	const Vector32 keys = vector32_broadcast(key);	/* load copies of key */
 	const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
@@ -120,6 +120,15 @@ 

Re: Sequence Access Methods, round two

2024-03-18 Thread Michael Paquier
On Thu, Mar 14, 2024 at 09:40:29AM +0900, Michael Paquier wrote:
> In the context of this thread, this removes the dependency of sequence
> value lookup to heap.

I am not sure where this is leading in combination with the sequence
stuff for logical decoding, so for now I am moving this patch to the
next commit fest to discuss things in 18~.
--
Michael




Re: doc issues in event-trigger-matrix.html

2024-03-18 Thread Michael Paquier
On Tue, Mar 19, 2024 at 08:00:00AM +0800, jian he wrote:
> I think the "X" and "-" mean in this matrix [0] is not very intuitive.
> mainly because "X" tends to mean negative things in most cases.
> we can write a sentence saying "X"  means this, "-" means that.
> 
> or maybe Check mark [1] and Cross mark [2]  are more universal.
> and we can use these marks.
> 
> "Only for local objects"
> is there any reference explaining "local objects"?
> I think local object means objects that only affect one single database?

It is true that in Japan the cross mark refers to a negation, and
that's the opposite in France: I would put a cross on a table in the
case where something is supported.  I've never seen anybody complain
about the format of these tables, FWIW, but if these were to be
changed, the update should happen across the board for all the tables
and not only one.

Using ASCII characters also has the advantage of making the maintenance
slightly easier, in my opinion, because there is no translation effort
between these special characters and their XML equivalent, like "&lt;"
<-> "<".
--
Michael




Re: Support json_errdetail in FRONTEND builds

2024-03-18 Thread Michael Paquier
On Mon, Mar 18, 2024 at 06:17:18AM -0700, Jacob Champion wrote:
> Great! Thanks everyone.

Cool.  Thanks for the commit.
--
Michael




Re: Autogenerate some wait events code and documentation

2024-03-18 Thread Michael Paquier
On Mon, Mar 18, 2024 at 05:57:02PM +, Bertrand Drouvot wrote:
> Thanks for looking at it!
> Oh right, the comment is wrong, re-worded in v2 attached.

I've added a couple of fake events to my txt file, and this results in
the wait events being ordered in the docs, while the backpatched wait
events are added at the end of the enums, based on their order in the
txt file.

 # When adding a new wait event, make sure it is placed in the appropriate
-# ClassName section.
+# ClassName section. If the wait event is backpatched from master to a version
+# >= 17 then put it under a "Backpatch:" delimiter at the end of the related
+# ClassName section (on the non master branches) or at its natural position on
+# the master branch.
+# Ensure that the backpatch regions are always empty on the master branch.

I'd recommend not mentioning a version number at all, as this would
need a manual refresh each time a new stable branch is forked.

Your solution is simpler than what I had in mind when looking at
the code yesterday, which involved adding a second array that's pushed
to the end of the "sorted" lines ordered by the second column.
That does the job.

(Note that I'll go silent for some time; I'll handle this thread when
I get back as this is not urgent.)
--
Michael




Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-18 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 11:46 PM vignesh C  wrote:
>
> On Thu, 14 Mar 2024 at 15:49, Amit Kapila  wrote:
> >
> > On Thu, Mar 14, 2024 at 1:45 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Mar 14, 2024 at 2:27 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 5:57 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > This fact makes me think that the slotsync worker might be able to
> > > > > accept the primary_conninfo value even if there is no dbname in the
> > > > > value. That is, if there is no dbname in the primary_conninfo, it uses
> > > > > the username in accordance with the specs of the connection string.
> > > > > Currently, the slotsync worker connects to the local database first
> > > > > and then establishes the connection to the primary server. But if we
> > > > > can reverse the two steps, it can get the dbname that has actually
> > > > > been used to establish the remote connection and use it for the local
> > > > > connection too. That way, the primary_conninfo generated by
> > > > > pg_basebackup could work even without the patch. For example, if the
> > > > > OS user executing pg_basebackup is 'postgres', the slotsync worker
> > > > > would connect to the postgres database. Given the 'postgres' database
> > > > > is created by default and 'postgres' OS user is used in common, I
> > > > > guess it could cover many cases in practice actually.
> > > > >
> > > >
> > > > I think this is worth investigating but I suspect that in most cases
> > > > users will end up using a replication connection without specifying
> > > > the user name and we may not be able to give a meaningful error
> > > > message when slotsync worker won't be able to connect. The same will
> > > > be true even when the dbname same as the username would be used.
> > >
> > > What do you mean by not being able to give a meaningful error message?
> > >
> > > If the slotsync worker uses the user name as the dbname, and such a
> > > database doesn't exist, the error message the user will get is
> > > "database "test_user" does not exist". ISTM the same is true when the
> > > user specifies the wrong database in the primary_conninfo.
> > >
> >
> > Right, the exact error message as mentioned by Shveta will be:
> > ERROR:  could not connect to the primary server: connection to server
> > at "127.0.0.1", port 5433 failed: FATAL: database "bckp_user" does not
> > exist
> >
> > Now, without this idea, the ERROR message will be:
> >  ERROR:  slot synchronization requires dbname to be specified in
> > primary_conninfo
> >
> > I am not sure how much this matters but the second message sounds more 
> > useful.
> >
> > > >
> > > > > Having said that, even with (or without) the above change, we might
> > > > > want to change the pg_basebackup so that it writes the dbname to the
> > > > > primary_conninfo if -R option is specified. Since the database where
> > > > > the slotsync worker connects cannot be dropped while the slotsync
> > > > > worker is running, the user might want to change the database to
> > > > > connect, and it would be useful if they can do that using
> > > > > pg_basebackup instead of modifying the configuration file manually.
> > > > >
> > > > > While the current approach makes sense to me, I'm a bit concerned that
> > > > > we might end up having the pg_basebackup search the actual database
> > > > > name (e.g. 'dbname=template1') from the .pgpass file instead of
> > > > > 'dbname=replication'. As far as I tested on my environment, suppose
> > > > > that I execute:
> > > > >
> > > > > pg_basebackup -D tmp -d "dbname=testdb" -R
> > > > >
> > > > > The pg_basebackup established a replication connection but looked for
> > > > > the password of the 'testdb' database. This could be another
> > > > > inconvenience for the existing users who want to use the slot
> > > > > synchronization.
> > > > >
> > > >
> > > > This is true because it is internally using logical replication
> > > > connection (as we will set set replication=database).
> > >
> > > Did you mean the pg_basebackup is using a logical replication
> > > connection in this case? As far as I tested, even if we specify dbname
> > > to the -d option of pg_basebackup, it uses a physical replication
> > > connection. For example, it can take a backup even if I specify a
> > > non-existing database name.
> > >
> >
> > You are right. I misunderstood some part of the code in GetConnection.
> > However, I think my point is still valid that if the user has provided
> > dbname in the connection string it means that she wants that database
> > entry to be looked upon not "replication" entry.
> >
> > >
> > > >
> > > > > But it's still just an idea and I might be missing something. And
> > > > > given we're getting closer to the feature freeze, it would be a PG18
> > > > > item.
> > > > >
> > > >
> > > > +1. At this stage, it is important to discuss whether we should allow
> > > > pg_basebackup to write dbname (either a specified one or a default one)

Re: Java : Postgres double precession issue with different data format text and binary

2024-03-18 Thread Chapman Flack
Hi Rahul,

On 03/18/24 15:52, Rahul Uniyal wrote:
> Since the column format is text and not binary it converts the value
> to BigDecimal and gives back the value as 40.
> ...
> Now since the format is Binary ... it returns DOUBLE from there,
> resulting in 40.0
> 
> Now I am not sure, for the same table and same column, why we have two
> different formats and why this issue is intermittent.

I don't see, in this message or your earlier one, which public
ResultSet API method your Java client code is calling. It sounds as if
you are simply calling getObject, the flavor without a second parameter
narrowing the type, and you are finding that the object returned is
sometimes of class Double and sometimes of class BigDecimal. Is that
accurate? That would seem to be the nub of the issue.

You seem to have found that the class of the returned object is
influenced by whether text or binary format was used on the wire.
I will guess that would be worth reporting to the PGJDBC devs, using
the pgsql-jdbc list.

The question of why the driver might sometimes use one wire format
and sometimes the other seems secondary. There may be some technical
explanation, but it would not be very interesting except as an
implementation detail, if it did not have this visible effect of
changing the returned object's class.

For the time being, I assume that if your Java code calls a more
specific method, such as getObject(..., BigDecimal.class) or
getObject(..., Double.class), or simply getDouble, you will get
results of the desired class whatever wire format is used.
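
Something like this sketch (untested; the connection details are just
placeholders):

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WireFormatDemo
{
    public static void main(String[] args) throws Exception
    {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/postgres", "postgres", "");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT 40::double precision AS v"))
        {
            rs.next();
            // class of this result may vary with the wire format
            Object o = rs.getObject("v");
            // these should not depend on the wire format
            BigDecimal bd = rs.getObject("v", BigDecimal.class);
            double d = rs.getDouble("v");
            System.out.println(o.getClass() + " " + bd + " " + d);
        }
    }
}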

The issue of the wire format influencing what class of object
getObject returns (when a specific class hasn't been requested)
is probably worth raising on pgsql-jdbc.

Regards,
-Chap




Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-18 Thread Tom Lane
Dean Rasheed  writes:
> On Mon, 18 Mar 2024 at 23:19, Tom Lane  wrote:
>> After thinking a bit more, I understood what was bothering me about
>> that notation: it looks too much like a call of a user-defined
>> function named "rescan()".  I think we'd be better off with the
>> all-caps "RESCAN()".

> Or perhaps move the parentheses, and write "(rescan SubPlan N)" or
> "(reset SubPlan N)". Dunno.

Oh, I like that!  It seems rather parallel to the existing "hashed"
annotation.  If I had it to do over, I'd likely do the "hashed"
bit differently --- but as the proposal currently stands, we are
not changing "hashed", so we might as well double down on that.

I won't update the patch right now, but "(rescan SubPlan N)"
seems like a winner to me.

regards, tom lane




doc issues in event-trigger-matrix.html

2024-03-18 Thread jian he
hi.
I think the "X" and "-" mean in this matrix [0] is not very intuitive.
mainly because "X" tends to mean negative things in most cases.
we can write a sentence saying "X"  means this, "-" means that.

or maybe Check mark [1] and Cross mark [2]  are more universal.
and we can use these marks.

"Only for local objects"
is there any reference explaining "local objects"?
I think local object means objects that only affect one single database?



[0] https://www.postgresql.org/docs/current/event-trigger-matrix.html
[1] https://en.wikipedia.org/wiki/Check_mark
[2] https://en.wikipedia.org/wiki/X_mark




Re: Autogenerate some wait events code and documentation

2024-03-18 Thread Michael Paquier
On Mon, Mar 18, 2024 at 10:24:00AM +0100, Alvaro Herrera wrote:
> FTR I had a rather unpleasant time last week upon finding a wait event
> named PgSleep.  If you grep for that, there are no matches at all; and I
> spent ten minutes (for real) trying to figure out where that was coming
> from, until I remembered this thread.
> 
> Now you have to guess that not only random lowercasing is happening, but
> also underscore removal.  This is not a good developer experience and I
> think we should rethink this choice.  It would be infinitely more
> usable, and not one bit more difficult, to make these lines be
> 
> WAIT_EVENT_FOO_BAR  FooBar  "Waiting on Foo Bar."
> 
> then there is no guessing involved.

This has already gone through a couple of adjustments in 59cbf60c0f2b
and 183a60a628fe.  The latter has led to the elimination of one column
in the txt file, as a reply to the same kind of comments about the
format of this file:
https://www.postgresql.org/message-id/20230705215939.ulnfbr4zavb2x7ri%40awork3.anarazel.de

FWIW, I still like better what we have currently, where it is possible
to grep the enum values in the source code.
--
Michael




Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-18 Thread Dean Rasheed
On Mon, 18 Mar 2024 at 23:19, Tom Lane  wrote:
>
> > Hm.  I used "rescan(SubPlan)" in the attached, but it still looks
> > a bit odd to my eye.
>
> After thinking a bit more, I understood what was bothering me about
> that notation: it looks too much like a call of a user-defined
> function named "rescan()".  I think we'd be better off with the
> all-caps "RESCAN()".
>

Or perhaps move the parentheses, and write "(rescan SubPlan N)" or
"(reset SubPlan N)". Dunno.

Regards,
Dean




Re: Autogenerate some wait events code and documentation

2024-03-18 Thread Michael Paquier
On Mon, Mar 18, 2024 at 08:49:34AM -0700, Noah Misch wrote:
> On Mon, Mar 18, 2024 at 09:04:44AM +, Bertrand Drouvot wrote:
>> +# Ensure that the wait events under the "Backpatch" delimiter are placed in 
>> the
>> +# same order as in the pre 17 wait_event_types.h (see VERSION_FILE_SYNC as 
>> an
>> +# example).
> 
> I expect the normal practice will be to put the entry in its natural position
> in git master, then put it in the backpatch section for any other branch.  In
> other words, the backpatch regions are always empty in git master, and the
> non-backpatch regions change in master only.

Yes, I'd expect the same experience.  And it is very important to
document that properly in the txt file.  I don't see a need to specify
any version numbers as well, that's less burden when bumping the major
version number on the master branch every year.
--
Michael




Re: Possibility to disable `ALTER SYSTEM`

2024-03-18 Thread Greg Sabino Mullane
Going to agree with Robert Treat here about an extension being a great
solution. I resisted posting earlier as I wanted to see how this all pans
out, but I wrote a quick little POC extension some months ago that does
the disabling and works well (and cannot be easily worked around).

On Mon, Mar 18, 2024 at 4:59 PM Robert Haas  wrote:

> I think that all of this is true except for (c). I think we'd need a
> new hook to make it work.
>

Seems we can just use ProcessUtility and:
if (IsA(parsetree, AlterSystemStmt)) { ereport(ERROR, ...
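
A minimal sketch of such an extension (assuming the ProcessUtility_hook
signature of recent branches; the names and error wording are just
placeholders):

#include "postgres.h"

#include "fmgr.h"
#include "nodes/parsenodes.h"
#include "tcop/utility.h"

PG_MODULE_MAGIC;

static ProcessUtility_hook_type prev_ProcessUtility = NULL;

static void
no_alter_system_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
                               bool readOnlyTree, ProcessUtilityContext context,
                               ParamListInfo params, QueryEnvironment *queryEnv,
                               DestReceiver *dest, QueryCompletion *qc)
{
    /* reject ALTER SYSTEM before it does anything */
    if (IsA(pstmt->utilityStmt, AlterSystemStmt))
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("ALTER SYSTEM is disabled on this server")));

    /* hand everything else to the previous hook or the standard routine */
    if (prev_ProcessUtility)
        prev_ProcessUtility(pstmt, queryString, readOnlyTree,
                            context, params, queryEnv, dest, qc);
    else
        standard_ProcessUtility(pstmt, queryString, readOnlyTree,
                                context, params, queryEnv, dest, qc);
}

void
_PG_init(void)
{
    prev_ProcessUtility = ProcessUtility_hook;
    ProcessUtility_hook = no_alter_system_ProcessUtility;
}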

> When we know that a feature is
> widely-needed, it's better to have one good implementation of it in
> core than several perhaps not-so-good implementations out of core.
>

Meh, maybe. This one seems pretty dirt simple. Granted, I have expanded my
original POC to allow *some* things to be changed by ALTER SYSTEM, but the
original use case warrants a very small extension.

> That allows us to focus all of our efforts on that one implementation
> instead of splitting them across several -- which is the whole selling
> point of open source, really -- and it makes it easier for users who
> want the feature to get access to it.
>

Well, yeah, but they have to wait until version 18 at best, while an
extension can run on any current version and probably be pretty
future-proof as well.

Cheers,
Greg


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-18 Thread John Naylor
On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada  wrote:
>
> On Sun, Mar 17, 2024 at 11:46 AM John Naylor  wrote:

> > Random offsets is what I was thinking of (if made distinct and
> > ordered), but even there the code is fairy trivial, so I don't have a
> > strong feeling about it.
>
> Agreed.

Looks good.

A related thing I should mention is that the tests which look up all
possible offsets are really expensive with the number of blocks we're
using now (assert build):

v70 0.33s
v72 1.15s
v73 1.32

To trim that back, I think we should give up on using shared memory
for the is-full test: We can cause aset to malloc a new block with a
lot fewer entries. In the attached, this brings it back down to 0.43s.
It might also be worth reducing the number of blocks in the random
test -- multiple runs will have different offsets anyway.

> > I think we can stop including the debug-tid-store patch for CI now.
> > That would allow getting rid of some unnecessary variables.
>
> Agreed.

Okay, all that remains here is to get rid of those variables (might be
just one).

> > + * Scan the TidStore and return a pointer to TidStoreIterResult that has 
> > TIDs
> > + * in one block. We return the block numbers in ascending order and the 
> > offset
> > + * numbers in each result is also sorted in ascending order.
> > + */
> > +TidStoreIterResult *
> > +TidStoreIterateNext(TidStoreIter *iter)
> >
> > The wording is a bit awkward.
>
> Fixed.

- * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
- * in one block. We return the block numbers in ascending order and the offset
- * numbers in each result is also sorted in ascending order.
+ * Scan the TidStore and return the TIDs of the next block. The returned block
+ * numbers is sorted in ascending order, and the offset numbers in each result
+ * is also sorted in ascending order.

Better, but it's still not very clear. Maybe "The offsets in each
iteration result are ordered, as are the block numbers over all
iterations."

> > +/* Extract TIDs from the given key-value pair */
> > +static void
> > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
> > BlocktableEntry *page)
> >
> > This is a leftover from the old encoding scheme. This should really
> > take a "BlockNumber blockno" not a "key", and the only call site
> > should probably cast the uint64 to BlockNumber.
>
> Fixed.

This part looks good. I didn't notice earlier, but this comment has a
similar issue

@@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter)
  return NULL;

  /* Collect TIDs extracted from the key-value pair */
- tidstore_iter_extract_tids(iter, key, page);
+ tidstore_iter_extract_tids(iter, (BlockNumber) key, page);

..."extracted" was once a separate operation. I think just removing
that one word is enough to update it.

Some other review on code comments:

v73-0001:

+ /* Enlarge the TID array if necessary */

It's "arrays" now.

v73-0005:

+-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has
+-- different randon 100 offset numbers each other.

The numbers are obvious from the query. Maybe just mention that the
offsets are randomized and must be unique and ordered.

+ * The caller is responsible for release any locks.

"releasing"

> > +typedef struct BlocktableEntry
> > +{
> > + uint16 nwords;
> > + bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> > +} BlocktableEntry;
> >
> > In my WIP for runtime-embeddable offsets, nwords needs to be one byte.

I should be more clear here: nwords fitting into one byte allows 3
embedded offsets (1 on 32-bit platforms, which is good for testing at
least). With uint16 nwords that reduces to 2 (none on 32-bit
platforms). Further, after the current patch series is fully
committed, I plan to split the embedded-offset patch into two parts:
The first would store the offsets in the header, but would still need
a (smaller) allocation. The second would embed them in the child
pointer. Only the second patch will care about the size of nwords
because it needs to reserve a byte for the pointer tag.

> > That doesn't have any real-world affect on the largest offset
> > encountered, and only in 32-bit builds with 32kB block size would the
> > theoretical max change at all. To be precise, we could use in the
> > MaxBlocktableEntrySize calculation:
> >
> > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1);
>
> I don't get this expression. Making the nwords one byte works well?
> With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256
> bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit
> OS, respectively. One byte nwrods variable seems not to be sufficient

I believe there is confusion between bitmap words and bytes:
2048 / 64 = 32 words = 256 bytes

It used to be max tuples per (heap) page, but we wanted a simple way
to make this independent of heap. I believe we won't need to ever
store the actual MaxOffsetNumber, although we technically still could
with a one-byte type and 32kB pages, at least on 

Re: documentation structure

2024-03-18 Thread Robert Haas
On Mon, Mar 18, 2024 at 6:51 PM Tom Lane  wrote:
> This might be a silly suggestion, but: could we just render the
> "most important" chapter titles in a larger font?

It's not the silliest suggestion ever -- you could have proposed
! -- but I also suspect it's not the right answer. Of course,
varying the font size can be a great way of emphasizing some things
more than others, but it doesn't usually work out well to just take a
document that was designed to be displayed in a uniform font size and
enlarge bits of text here and there. You usually want to have some
kind of overall plan of which font size is a single component.

For example, on a corporate home page, it's quite common to have two
nav bars, the larger of which has entries that correspond to the
company's product offerings and/or marketing materials, and the
smaller of which has "utility functions" like "login", "contact us",
and "search". Font size can be an effective tool for emphasizing the
relative importance of one nav bar versus the other, but you don't
start by deciding which things are going to get displayed in a larger
font. You start with an overall idea of the layout and then the font
size flows out of that.

Just riffing a bit, you could imagine adding a nav bar to our
documentation, either across the top or along the side, that is always
there on every page of the documentation and contains those links that
we want to make sure are always visible. Necessarily, these must be
limited in number. Then on the home page you could have the whole
table of contents as we do today, and you use that to navigate to
everything that isn't one of the quick links.

Or you can imagine that the home page of our documentation isn't just
a tree view like it is today; it might instead be written in paragraph
form. "Welcome to the PostgreSQL documentation! If you're new here,
check out our tutorial! Otherwise, you might be
interested in our SQL reference, our configuration
reference, or our banana plantation. If none of
those sound like what you want, check out the documentation
index." Obviously in order to actually work, something like
this would need to be expanded into enough paragraphs to actually
cover all of the important sections of the documentation, and probably
not mention banana plantations. Or maybe it wouldn't be just
paragraphs, but a two-column table, with each row of the table having
a main title and link in the narrower lefthand column and a blurb with
more links in the wider righthand column.

I'm sure there are a lot of other ways to do this, too. Our main
documentation page is very old-school, and there are probably a bunch
of ways to do better.

But I'm not sure how easy it would be to get agreement on something
specific, and I don't know how well our toolchain can support anything
other than what we've already got. I've also learned from painful
experience that you can't fix bad content with good markup. I think it
is worth spending some effort on trying to beat the existing format
into submission, promoting things that seem to deserve it and demoting
those that seem to deserve that. At some point, we'll probably reach a
point of diminishing returns, either because we all agree we've done
as well as we can, or because we can't agree on what else to do, and
maybe at that point the only way to improve further is with better web
design and/or a different documentation toolchain. But I think it's
fairly clear that we're not at that point now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Popcount optimization using AVX512

2024-03-18 Thread David Rowley
On Tue, 19 Mar 2024 at 11:08, Nathan Bossart  wrote:
>
> On Mon, Mar 18, 2024 at 04:29:19PM -0500, Nathan Bossart wrote:
> > Agreed.  Will send an updated patch shortly.
>
> As promised...

Looks good.

David




Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-18 Thread Tom Lane
I wrote:
> Hm.  I used "rescan(SubPlan)" in the attached, but it still looks
> a bit odd to my eye.

After thinking a bit more, I understood what was bothering me about
that notation: it looks too much like a call of a user-defined
function named "rescan()".  I think we'd be better off with the
all-caps "RESCAN()".

regards, tom lane




Re: Built-in CTYPE provider

2024-03-18 Thread Tom Lane
Jeff Davis  writes:
> On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
>> This is causing all CI jobs to fail the "compiler warnings" check.

> I did run CI before checkin, and it passed:
> https://cirrus-ci.com/build/5382423490330624

Weird, why did it not report with the same level of urgency?
But anyway, thanks for fixing.

regards, tom lane




Re: documentation structure

2024-03-18 Thread Tom Lane
Laurenz Albe  writes:
> On Mon, 2024-03-18 at 10:11 -0400, Robert Haas wrote:
>> I don't know what to do about "I. SQL commands". It's obviously
>> impractical to promote that to a top-level section, because it's got a
>> zillion sub-pages which I don't think we want in the top-level
>> documentation index. But having it as one of several unnumbered
>> chapters interposed between 51 and 52 doesn't seem great either.

> I think that both the GUCs and the SQL reference could be top-level
> sections.  For the GUCs there is an obvious split in sub-chapters,
> and the SQL reference could be a top-level section without any chapters
> under it.

I'd be in favor of promoting all three of the "Reference" things to
the top level, except that as Robert says, it seems likely that that
would end in having a hundred individual command reference pages
visible in the topmost table of contents.  Also, if we manage to
suppress that, did we really make it any more prominent?  Not sure.

Making "SQL commands" top-level with half a dozen subsections would
solve the visibility problem, but I'm not real eager to go there,
because I foresee endless arguments about which subsection a given
command goes in.  Robert's point about wanting a single alphabetized
list is valid too (although you could imagine that being a list in an
introductory section, similar to what we have for system catalogs).

This might be a silly suggestion, but: could we just render the
"most important" chapter titles in a larger font?

regards, tom lane




Re: Built-in CTYPE provider

2024-03-18 Thread Jeff Davis
On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
> This is causing all CI jobs to fail the "compiler warnings" check.

I did run CI before checkin, and it passed:

https://cirrus-ci.com/build/5382423490330624

If I open up the windows build, I see the warning:

https://cirrus-ci.com/task/5199979044667392

but I didn't happen to check this time.

> Probably the best fix is the traditional
> 
> return ;    /* keep compiler quiet */
> 
> but I'm not sure what the best default result is in this function.

I inverted the check so that I didn't have to choose a default.
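
(For anyone unfamiliar with MSVC's warning C4715, "not all control paths
return a value", here is a toy sketch of the two fixes being contrasted.
The function below is a made-up stand-in, not the actual
builtin_locale_encoding():)

#include <string.h>

#define TOY_SQL_ASCII   0
#define TOY_UTF8        6

/* Fix 1: the traditional fallback return, purely to silence the warning. */
int
toy_encoding_default(const char *locale)
{
    if (strcmp(locale, "C.UTF-8") == 0)
        return TOY_UTF8;
    if (strcmp(locale, "C") == 0)
        return TOY_SQL_ASCII;
    return TOY_SQL_ASCII;       /* keep compiler quiet */
}

/* Fix 2: restructure so the final statement is an unconditional return
 * and no arbitrary default value has to be picked. */
int
toy_encoding_inverted(const char *locale)
{
    if (strcmp(locale, "C.UTF-8") == 0)
        return TOY_UTF8;
    return TOY_SQL_ASCII;
}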

Regards,
Jeff Davis





Re: sslinfo extension - add notbefore and notafter timestamps

2024-03-18 Thread Jacob Champion
On Mon, Mar 18, 2024 at 1:48 PM Cary Huang  wrote:
> Attached is the v10 patch with the above changes. Thanks again for the review.

Awesome, looks good.

On my final review pass I saw one last thing that bothered me (sorry
for not seeing it before). The backend version of
ASN1_TIME_to_timestamptz returns the following:

> +   return ((int64)days * 24 * 60 * 60) + (int64)seconds;

...but I think Timestamp[Tz]s are stored as microseconds, so we're off
by a factor of a million. This still works because later we cast to
double and pass it back through float8_timestamptz, which converts it:

> +   if (beentry->st_sslstatus->ssl_not_before != 0)
> +   values[25] = DirectFunctionCall1(float8_timestamptz,
> +Float8GetDatum((double) 
> beentry->st_sslstatus->ssl_not_before));

But anyone who ends up inspecting the value of
st_sslstatus->ssl_not_before directly is going to find an incorrect
timestamp. I think it'd be clearer to store microseconds to begin
with, and then just use TimestampTzGetDatum rather than the
DirectFunctionCall1 we have now. (I looked for an existing
implementation to reuse and didn't find one. Maybe we should use the
overflow-aware multiplication/addition routines -- i.e.
pg_mul_s64_overflow et al -- to multiply `days` and `seconds` by
USECS_PER_DAY/USECS_PER_SEC and combine them.)
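
For illustration, a rough sketch of that overflow-aware conversion (the
function name and error handling are placeholders rather than proposed
code; it assumes the days/seconds pair produced by ASN1_TIME_diff() and
the existing backend headers):

#include "postgres.h"

#include "common/int.h"
#include "datatype/timestamp.h"

/* Combine an ASN1_TIME_diff()-style (days, seconds) pair into microseconds. */
static bool
asn1_diff_to_usecs(int days, int seconds, int64 *usecs)
{
    int64   day_usecs;
    int64   sec_usecs;

    if (pg_mul_s64_overflow((int64) days, USECS_PER_DAY, &day_usecs))
        return false;
    if (pg_mul_s64_overflow((int64) seconds, USECS_PER_SEC, &sec_usecs))
        return false;
    if (pg_add_s64_overflow(day_usecs, sec_usecs, usecs))
        return false;

    return true;                /* *usecs is now a TimestampTz-style value */
}

The caller could then hand the result to TimestampTzGetDatum() directly,
as suggested above.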

And I think sslinfo can remain as-is, because that way overflow is
caught by float8_timestamptz. WDYT?

--Jacob




Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 04:29:19PM -0500, Nathan Bossart wrote:
> Agreed.  Will send an updated patch shortly.

As promised...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From b673663b1d1344549cbd0912220f96ba1712afc6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 18 Mar 2024 12:18:15 -0500
Subject: [PATCH v4 1/1] inline function calls in pg_popcount() when possible

---
 src/include/port/pg_bitutils.h |   5 +-
 src/port/pg_bitutils.c | 155 +
 2 files changed, 121 insertions(+), 39 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 46bf4f0103..53e5239717 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,17 +302,16 @@ pg_ceil_log2_64(uint64 num)
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
+extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
 
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
+extern uint64 pg_popcount(const char *buf, int bytes);
 
 #endif			/* TRY_POPCNT_FAST */
 
-/* Count the number of one-bits in a byte array */
-extern uint64 pg_popcount(const char *buf, int bytes);
-
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..1197696e97 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,18 +103,22 @@ const uint8 pg_number_of_ones[256] = {
 	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
 };
 
-static int	pg_popcount32_slow(uint32 word);
-static int	pg_popcount64_slow(uint64 word);
+static inline int pg_popcount32_slow(uint32 word);
+static inline int pg_popcount64_slow(uint64 word);
+static uint64 pg_popcount_slow(const char *buf, int bytes);
 
 #ifdef TRY_POPCNT_FAST
 static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
-static int	pg_popcount32_fast(uint32 word);
-static int	pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
+static inline int pg_popcount32_fast(uint32 word);
+static inline int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount_fast(const char *buf, int bytes);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64		(*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
 #endif			/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -151,11 +155,13 @@ pg_popcount32_choose(uint32 word)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
 	}
 
 	return pg_popcount32(word);
@@ -168,21 +174,42 @@ pg_popcount64_choose(uint64 word)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
 	}
 
 	return pg_popcount64(word);
 }
 
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+	if (pg_popcount_available())
+	{
+		pg_popcount32 = pg_popcount32_fast;
+		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
+	}
+	else
+	{
+		pg_popcount32 = pg_popcount32_slow;
+		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
+	}
+
+	return pg_popcount(buf, bytes);
+}
+
 /*
  * pg_popcount32_fast
  *		Return the number of 1 bits set in word
  */
-static int
+static inline int
 pg_popcount32_fast(uint32 word)
 {
 #ifdef _MSC_VER
@@ -199,7 +226,7 @@ __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
  * pg_popcount64_fast
  *		Return the number of 1 bits set in word
  */
-static int
+static inline int
 pg_popcount64_fast(uint64 word)
 {
 #ifdef _MSC_VER
@@ -212,6 +239,52 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
 #endif
 }
 
+/*
+ * pg_popcount_fast
+ *		Returns the number of 1-bits in buf
+ */
+static uint64
+pg_popcount_fast(const char *buf, int bytes)
+{
+	uint64		popcnt = 0;
+
+#if SIZEOF_VOID_P >= 8
+	/* Process in 64-bit chunks if the buffer is aligned. */
+	if (buf == (const char *) TYPEALIGN(8, buf))
+	{
+		const uint64 *words = (const uint64 *) buf;
+
+		while (bytes >= 8)
+		{
+			popcnt += pg_popcount64_fast(*words++);
+			bytes -= 8;
+		}
+
+		buf = (const char *) words;
+	}
+#else
+	/* Process in 32-bit chunks if the buffer is aligned. */
+	if (buf == (const char *) TYPEALIGN(4, buf))
+	{
+		const uint32 *words = (const uint32 *) buf;
+
+		while 

RE: Psql meta-command conninfo+

2024-03-18 Thread Maiquel Grassi
On Thu, Feb 29, 2024 at 10:02:21PM +, Maiquel Grassi wrote:
> Sorry for the delay. I will make the adjustments as requested soon.

Looking forward to it.

//

Hi Nathan!

Sorry for the delay, I was busy with other professional as well as personal 
activities.

I made all the changes you suggested. I removed the variables and started
using the views "pg_stat_ssl" and "pg_stat_gssapi", and I handled the
PostgreSQL version differences for the views used.

Here's a brief demonstration of the result:

[postgres@localhost ~]$ /home/pgsql-17devel/bin/psql -E -x -p 5433

psql (17devel)
Type "help" for help.

postgres=# \conninfo+
/******** QUERY *********/
SELECT
  pg_catalog.current_database() AS "Database",
  'postgres' AS "Authenticated User",
  pg_catalog.system_user() AS "System User",
  pg_catalog.current_user() AS "Current User",
  pg_catalog.session_user() AS "Session User",
  pg_catalog.pg_backend_pid() AS "Backend PID",
  pg_catalog.inet_server_addr() AS "Server Address",
  pg_catalog.current_setting('port') AS "Server Port",
  pg_catalog.inet_client_addr() AS "Client Address",
  pg_catalog.inet_client_port() AS "Client Port",
  '/tmp' AS "Socket Directory",
  CASE
WHEN
  pg_catalog.inet_server_addr() IS NULL
  AND pg_catalog.inet_client_addr() IS NULL
THEN NULL
ELSE '/tmp'
  END AS "Host",
  (SELECT gss_authenticated AS "GSSAPI"
  FROM pg_catalog.pg_stat_gssapi
  WHERE pid = pg_catalog.pg_backend_pid()),
  ssl.ssl AS "SSL Connection",
  ssl.version AS "SSL Protocol",
  ssl.cipher AS "SSL Cipher",
  NULL AS "SSL Compression"
FROM
  pg_catalog.pg_stat_ssl ssl
WHERE
  pid = pg_catalog.pg_backend_pid()
;
/****************/

Current Connection Information
-[ RECORD 1 ]--+-
Database   | postgres
Authenticated User | postgres
System User|
Current User   | postgres
Session User   | postgres
Backend PID| 29007
Server Address |
Server Port| 5433
Client Address |
Client Port|
Socket Directory   | /tmp
Host   |
GSSAPI | f
SSL Connection | f
SSL Protocol   |
SSL Cipher |
SSL Compression|

Regards,
Maiquel Grassi.


v20-0001-psql-meta-command-conninfo-plus.patch
Description: v20-0001-psql-meta-command-conninfo-plus.patch


Re: Built-in CTYPE provider

2024-03-18 Thread Tom Lane
Jeff Davis  writes:
> It may be moot soon, but I committed a fix now.

Thanks, but it looks like 846311051 introduced a fresh issue.
MSVC is complaining about

[21:37:15.349] c:\cirrus\src\backend\utils\adt\pg_locale.c(2515) : warning 
C4715: 'builtin_locale_encoding': not all control paths return a value

This is causing all CI jobs to fail the "compiler warnings" check.

Probably the best fix is the traditional

return ;/* keep compiler quiet */

but I'm not sure what the best default result is in this function.

regards, tom lane




Re: documentation structure

2024-03-18 Thread Laurenz Albe
On Mon, 2024-03-18 at 10:11 -0400, Robert Haas wrote:
> The two sections of the documentation that seem really
> under-emphasized to me are the GUC documentation and the SQL
> reference. The GUC documentation is all buried under "20. Server
> Configuration" and the SQL command reference is under "I. SQL
> commands". For reasons that I don't understand, all chapters except
> for those in "VI. Reference" are numbered, but the chapters in that
> section have Roman numerals instead.

That last fact is very odd indeed and could be easily fixed.

> I don't know what other people's experience is, but for me, wanting to
> know what a command does or what a setting does is extremely common.
> Therefore, I think these chapters are disproportionately important and
> should be emphasized more. In the case of the GUC reference, one idea
> I have is to split up "III. Server Administration". My proposal is
> that we divide it into three sections. The first would be called "III.
> Server Installation" and would cover chapters 16 (installation from
> binaries) through 19 (server setup and operation). The second would be
> called "IV. Server Configuration" -- so every section that's currently
> a subsection of "server configuration" would become a top-level
> chapter. The third division would be "V. Server Administration," and
> would cover the current chapters 21-33. This is probably far from
> perfect, but it seems like a relatively simple change and better than
> what we have now.

I'm fine with splitting up "Server Administration" into three sections
like you propose.

> I don't know what to do about "I. SQL commands". It's obviously
> impractical to promote that to a top-level section, because it's got a
> zillion sub-pages which I don't think we want in the top-level
> documentation index. But having it as one of several unnumbered
> chapters interposed between 51 and 52 doesn't seem great either.

I think that both the GUCs and the SQL reference could be top-level
sections.  For the GUCs there is an obvious split in sub-chapters,
and the SQL reference could be a top-level section without any chapters
under it.

> The stuff that I think is over-emphasized is as follows: (a) chapters
> 1-3, the tutorial; (b) chapters 4-6, which are essentially a
> continuation of the tutorial, and not at all similar to chapters 8-11
> which are chock-full of detailed technical information; (c) chapters
> 43-46, one per procedural language; perhaps these could just be
> demoted to sub-sections of chapter 42 on procedural languages; (d)
> chapters 47 (server programming interface), 50 (replication progress
> tracking), and 51 (archive modules), all of which are important to
> document but none of which seem important enough to put them in the
> top-level documentation index; and (e) large parts of section "VII.
> Internals," which again contain tons of stuff of very marginal
> interest. The first ~4 chapters of the internals section seem like
> they might be mainstream enough to justify the level of prominence
> that we give them, but the rest has got to be of interest to a tiny
> minority of readers.

I disagree that the tutorial is over-emphasized.

I also disagree that chapters 4 to 6 are a continuation of the tutorial.
Or at least, they shouldn't be.
When I am looking for a documentation reference on something like
security considerations of SECURITY DEFINER functions, my first
impulse is to look in chapter 5 (Data Definition) or in chapter 38
(Extending SQL), and I am surprised to find it discussed in the
SQL reference of CREATE FUNCTION.

Another case in point is the "Notes" section for CREATE VIEW.  Why is
that not somewhere under "Data Definition"?

For me, the reference should be terse and focused on the syntax.

Changing that is probably a lost cause by now, but I feel that we need
not encourage that development any more by playing down the earlier
chapters.

> I think it might be possible to consolidate the internals section by
> grouping a bunch of existing entries together by category. Basically,
> after the first few chapters, you've got stuff that is of interest to
> C programmers writing core or extension code; and you've got
> explainers on things like GEQO and index op-classes and support
> functions which might be of interest even to non-programmers. I think
> for example that we don't need separate top-level chapters on writing
> procedural language handlers, FDWs, tablesample methods, custom scan
> providers, table access methods, index access methods, and WAL
> resource managers. Some or all of those could be grouped under a
> single chapter, perhaps, e.g. Using PostgreSQL Extensibility
> Interfaces.

I have no strong feelings about that.

Yours,
Laurenz Albe




Re: REVOKE FROM warning on grantor

2024-03-18 Thread Tom Lane
Robert Haas  writes:
> On Sat, Mar 16, 2024 at 8:17 PM Tom Lane  wrote:
>> Even taking the position that this is an unspecified point that we
>> could implement how we like, I don't think there's a sufficient
>> argument for changing behavior that's stood for a couple of decades.

> I got curious about the behavior of other database systems.

Yeah, I was mildly curious about that too; it'd be unlikely to sway
my bottom-line opinion, but it would be interesting to check.

> https://dev.mysql.com/doc/refman/8.0/en/revoke.html documents an "IF
> EXISTS" option whose documentation reads, in relevant part,
> "Otherwise, REVOKE executes normally; if the user does not exist, the
> statement raises an error."

Hmm, I don't think that's quite what's at stake here.  We do throw
error if either named role doesn't exist:

regression=# revoke foo from joe;
ERROR:  role "joe" does not exist
regression=# create user joe;
CREATE ROLE
regression=# revoke foo from joe;
ERROR:  role "foo" does not exist
regression=# create role foo;
CREATE ROLE
regression=# revoke foo from joe;
WARNING:  role "joe" has not been granted membership in role "foo" by role 
"postgres"
REVOKE ROLE

What the OP is on about is that that last case issues WARNING not
ERROR.

Reading further down in the mysql page you cite, it looks like their
IF EXISTS conflates "role doesn't exist" with "role wasn't granted",
and suppresses errors for both those cases.  I'm not in favor of
changing things here, but if we did, I sure wouldn't want to adopt
those exact semantics.

regards, tom lane




Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 10:27:58AM +1300, David Rowley wrote:
> On Tue, 19 Mar 2024 at 10:08, Nathan Bossart  wrote:
>> On Tue, Mar 19, 2024 at 10:02:18AM +1300, David Rowley wrote:
>> > The only thing I'd question in the patch is in pg_popcount_fast(). It
>> > looks like you've opted to not do the 32-bit processing on 32-bit
>> > machines. I think that's likely still worth coding in a similar way to
>> > how pg_popcount_slow() works. i.e. use "#if SIZEOF_VOID_P >= 8".
>> > Probably one day we'll remove that code, but it seems strange to have
>> > pg_popcount_slow() do it and not pg_popcount_fast().
>>
>> The only reason I left it out was because I couldn't convince myself that
>> it wasn't dead code, given we assume that popcntq is available in
>> pg_popcount64_fast() today.  But I don't see any harm in adding that just
>> in case.
> 
> It's probably more of a case of using native instructions rather than
> ones that might be implemented only via microcode.  For the record, I
> don't know if that would be the case for popcntq on x86 32-bit and I
> don't have the hardware to test it. It just seems less risky just to
> do it.

Agreed.  Will send an updated patch shortly.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-18 Thread David Rowley
On Tue, 19 Mar 2024 at 10:08, Nathan Bossart  wrote:
>
> On Tue, Mar 19, 2024 at 10:02:18AM +1300, David Rowley wrote:
> > The only thing I'd question in the patch is in pg_popcount_fast(). It
> > looks like you've opted to not do the 32-bit processing on 32-bit
> > machines. I think that's likely still worth coding in a similar way to
> > how pg_popcount_slow() works. i.e. use "#if SIZEOF_VOID_P >= 8".
> > Probably one day we'll remove that code, but it seems strange to have
> > pg_popcount_slow() do it and not pg_popcount_fast().
>
> The only reason I left it out was because I couldn't convince myself that
> it wasn't dead code, given we assume that popcntq is available in
> pg_popcount64_fast() today.  But I don't see any harm in adding that just
> in case.

It's probably more of a case of using native instructions rather than
ones that might be implemented only via microcode.  For the record, I
don't know if that would be the case for popcntq on x86 32-bit and I
don't have the hardware to test it. It just seems less risky just to
do it.

David




Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 09:22:43PM +, Amonson, Paul D wrote:
>> The only reason I left it out was because I couldn't convince myself that it
>> wasn't dead code, given we assume that popcntq is available in
>> pg_popcount64_fast() today.  But I don't see any harm in adding that just in
>> case.
> 
> I am not sure how to read this. Does this mean that for popcount32_fast
> and popcount64_fast I can assume that the x86(_64) instructions exist
> and stop doing the runtime checks for instruction availability?

I think my question boils down to "if pg_popcount_available() returns true,
can I safely assume I'm on a 64-bit machine?"

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




RE: Popcount optimization using AVX512

2024-03-18 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Monday, March 18, 2024 2:08 PM
> To: David Rowley 
> Cc: Amonson, Paul D ; Andres Freund
>...
> 
> The only reason I left it out was because I couldn't convince myself that it
> wasn't dead code, given we assume that popcntq is available in
> pg_popcount64_fast() today.  But I don't see any harm in adding that just in
> case.

I am not sure how to read this. Does this mean that for popcount32_fast and 
popcount64_fast I can assume that the x86(_64) instructions exist and stop 
doing the runtime checks for instruction availability?

Thanks,
Paul





Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-18 Thread Tom Lane
Dean Rasheed  writes:
> The get_rule_expr() code could perhaps be simplified a bit, getting
> rid of the show_subplan_name variable and moving the recursive calls
> to get_rule_expr() to after the switch statement -- if testexpr is
> non-NULL, print it, else print the subplan name probably works for all
> subplan types.

Oooh, good idea.  The symmetry wasn't apparent when we started, but
it's there now, and the code does look nicer this way.

> The "colN" notation has grown on me, especially when you look at
> examples like those in partition_prune.out with a mix of Param types.

OK, I've left it like that in the attached v5, but I'm still open
to other opinions.

>> The undecorated reference to (SubPlan 1) is fairly confusing, since
>> it doesn't correspond to anything that will actually get output.
>> I suggest that perhaps instead this should read
>> Output: (SubPlan 1).col1, (SubPlan 1).col2, IGNORE(SubPlan 1), i.tableoid, 
>> i.ctid
>> or
>> Output: (SubPlan 1).col1, (SubPlan 1).col2, RESET(SubPlan 1), i.tableoid, 
>> i.ctid

> I think "RESET()" or "RESCAN()" or something like that is better than
> "INGORE()", because it indicates that it is actually doing something.
> I don't really have a better idea. Perhaps not all uppercase though,
> since that seems to go against the rest of the EXPLAIN output.

Hm.  I used "rescan(SubPlan)" in the attached, but it still looks
a bit odd to my eye.

I did some more work on the documentation too, to show the difference
between hashed and not-hashed subplans.  I feel like we're pretty
close here, with the possible exception of how to show MULTIEXPR.

regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 58a603ac56..8d75ca56c9 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3062,10 +3062,10 @@ select exists(select 1 from pg_enum), sum(c1) from ft1;
 QUERY PLAN
 --
  Foreign Scan
-   Output: $0, (sum(ft1.c1))
+   Output: (InitPlan 1).col1, (sum(ft1.c1))
Relations: Aggregate on (public.ft1)
Remote SQL: SELECT sum("C 1") FROM "S 1"."T 1"
-   InitPlan 1 (returns $0)
+   InitPlan 1
  ->  Seq Scan on pg_catalog.pg_enum
 (6 rows)
 
@@ -3080,8 +3080,8 @@ select exists(select 1 from pg_enum), sum(c1) from ft1 group by 1;
 QUERY PLAN 
 ---
  GroupAggregate
-   Output: $0, sum(ft1.c1)
-   InitPlan 1 (returns $0)
+   Output: (InitPlan 1).col1, sum(ft1.c1)
+   InitPlan 1
  ->  Seq Scan on pg_catalog.pg_enum
->  Foreign Scan on public.ft1
  Output: ft1.c1
@@ -3305,10 +3305,10 @@ select sum(c1) filter (where (c1 / c1) * random() <= 1) from ft1 group by c2 ord
 
 explain (verbose, costs off)
 select sum(c2) filter (where c2 in (select c2 from ft1 where c2 < 5)) from ft1;
-QUERY PLAN 

+  QUERY PLAN   
+---
  Aggregate
-   Output: sum(ft1.c2) FILTER (WHERE (hashed SubPlan 1))
+   Output: sum(ft1.c2) FILTER (WHERE (ANY (ft1.c2 = (hashed SubPlan 1).col1)))
->  Foreign Scan on public.ft1
  Output: ft1.c2
  Remote SQL: SELECT c2 FROM "S 1"."T 1"
@@ -6171,9 +6171,9 @@ UPDATE ft2 AS target SET (c2, c7) = (
  Update on public.ft2 target
Remote SQL: UPDATE "S 1"."T 1" SET c2 = $2, c7 = $3 WHERE ctid = $1
->  Foreign Scan on public.ft2 target
- Output: $1, $2, (SubPlan 1 (returns $1,$2)), target.ctid, target.*
+ Output: (SubPlan 1).col1, (SubPlan 1).col2, rescan(SubPlan 1), target.ctid, target.*
  Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8, ctid FROM "S 1"."T 1" WHERE (("C 1" > 1100)) FOR UPDATE
- SubPlan 1 (returns $1,$2)
+ SubPlan 1
->  Foreign Scan on public.ft2 src
  Output: (src.c2 * 10), src.c7
  Remote SQL: SELECT c2, c7 FROM "S 1"."T 1" WHERE (($1::integer = "C 1"))
@@ -11685,9 +11685,9 @@ SELECT * FROM local_tbl t1 LEFT JOIN (SELECT *, (SELECT count(*) FROM async_pt W
QUERY PLAN   
 
  Nested Loop Left Join
-   Output: t1.a, t1.b, t1.c, async_pt.a, async_pt.b, async_pt.c, ($0)
+   Output: t1.a, t1.b, t1.c, async_pt.a, async_pt.b, async_pt.c, ((InitPlan 1).col1)
Join Filter: (t1.a = async_pt.a)
-   InitPlan 1 (returns $0)
+   InitPlan 1
  ->  Aggregate
Output: count(*)
->  Append
@@ -11699,10 +11699,10 @@ SELECT * FROM local_tbl t1 LEFT JOIN 

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Tue, Mar 19, 2024 at 10:02:18AM +1300, David Rowley wrote:
> I looked at your latest patch and tried out the performance on a Zen4
> running windows and a Zen2 running on Linux. As follows:

Thanks for taking a look.

> The only thing I'd question in the patch is in pg_popcount_fast(). It
> looks like you've opted to not do the 32-bit processing on 32-bit
> machines. I think that's likely still worth coding in a similar way to
> how pg_popcount_slow() works. i.e. use "#if SIZEOF_VOID_P >= 8".
> Probably one day we'll remove that code, but it seems strange to have
> pg_popcount_slow() do it and not pg_popcount_fast().

The only reason I left it out was because I couldn't convince myself that
it wasn't dead code, given we assume that popcntq is available in
pg_popcount64_fast() today.  But I don't see any harm in adding that just
in case.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-18 Thread David Rowley
On Tue, 19 Mar 2024 at 06:30, Nathan Bossart  wrote:
> Here is a more fleshed-out version of what I believe David is proposing.
> On my machine, the gains aren't quite as impressive (~8.8s to ~5.2s for the
> test_popcount benchmark).  I assume this is because this patch turns
> pg_popcount() into a function pointer, which is what the AVX512 patches do,
> too.  I left out the 32-bit section from pg_popcount_fast(), but I'll admit
> that I'm not yet 100% sure that we can assume we're on a 64-bit system
> there.

I looked at your latest patch and tried out the performance on a Zen4
running windows and a Zen2 running on Linux. As follows:

AMD 3990x:

master:
postgres=# select drive_popcount(1000, 1024);
Time: 11904.078 ms (00:11.904)
Time: 11907.176 ms (00:11.907)
Time: 11927.983 ms (00:11.928)

patched:
postgres=# select drive_popcount(1000, 1024);
Time: 3641.271 ms (00:03.641)
Time: 3610.934 ms (00:03.611)
Time: 3663.423 ms (00:03.663)


AMD 7945HX Windows

master:
postgres=# select drive_popcount(1000, 1024);
Time: 9832.845 ms (00:09.833)
Time: 9844.460 ms (00:09.844)
Time: 9858.608 ms (00:09.859)

patched:
postgres=# select drive_popcount(1000, 1024);
Time: 3427.942 ms (00:03.428)
Time: 3364.262 ms (00:03.364)
Time: 3413.407 ms (00:03.413)

The only thing I'd question in the patch is in pg_popcount_fast(). It
looks like you've opted to not do the 32-bit processing on 32-bit
machines. I think that's likely still worth coding in a similar way to
how pg_popcount_slow() works. i.e. use "#if SIZEOF_VOID_P >= 8".
Probably one day we'll remove that code, but it seems strange to have
pg_popcount_slow() do it and not pg_popcount_fast().

> IMHO this work is arguably a prerequisite for the AVX512 work, as turning
> pg_popcount() into a function pointer will likely regress performance for
> folks on systems without AVX512 otherwise.

I think so too.

David




Re: Possibility to disable `ALTER SYSTEM`

2024-03-18 Thread Robert Haas
On Mon, Mar 18, 2024 at 4:07 PM Robert Treat  wrote:
> You know it's funny, you say #4 has no advantage and should be
> rejected outright, but AFAICT
>
> a) no one has actually laid out why it wouldn't work for them,
> b) and it's the one solution that can be implemented now
> c) and that implementation would be backwards compatible with some set
> of existing releases
> d) and certainly anyone running k8s or config management system would
> have the ability to install
> e) and it could be custom tailored to individual deployments as needed
> (including other potential commands that some systems might care
> about)
> f) and it seems like the least likely option to be mistaken for a
> security feature
> g) and also seems pretty safe wrt not breaking existing tooling (like
> 5/6 might do)
>
> Looking at it, you could make the argument that #4 is actually the
> best of the solutions proposed, except it has the one drawback that it
> requires folks to double down on the fiction that we think extensions
> are a good way to build solutions when really everyone just wants to
> have everything in core.

I think that all of this is true except for (c). I think we'd need a
new hook to make it work.

That said, I think that extensions are a good way of implementing some
functionality, but not this functionality. Extensions are a good
approach when there's a bunch of stuff core can't know but an
extension author can. For instance, the FDW interface caters to
situations where the extension author knows how to access some data
that PostgreSQL doesn't know how to access; and the operator class
stuff is useful when the extension author knows how some user-defined
data type should behave and we don't. But there's not really a
substantial policy question here. All we do by pushing a feature like
this out of core is wash our hands of it. Your (f) argues that might
be a good thing, but I don't think so. When we know that a feature is
widely-needed, it's better to have one good implementation of it in
core than several perhaps not-so-good implementations out of core.
That allows us to focus all of our efforts on that one implementation
instead of splitting them across several -- which is the whole selling
point of open source, really -- and it makes it easier for users who
want the feature to get access to it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: sslinfo extension - add notbefore and notafter timestamps

2024-03-18 Thread Cary Huang
Hi Jacob

> Hi Cary, did you have any thoughts on the timestamptz notes from my last mail?
> 
> > It might also be nice to rename
> > ASN1_TIME_to_timestamp().
> >
> > Squinting further at the server backend implementation, should that
> > also be using TimestampTz throughout, instead of Timestamp? It all
> > goes through float8_timestamptz at the end, so I guess it shouldn't
> > have a material impact, but it's a bit confusing.

Sorry I kind of missed this review comment from your last email. Thanks for 
bringing it up again though. I think it is right to change the backend
references from "Timestamp" to "TimestampTz" for consistency reasons, and I
have gone ahead and made the changes.

I have also reviewed the wording of the documentation and removed "UTC" from
the descriptions. Since the sslinfo extension and pg_stat_ssl both return
timestamptz values in whatever timezone PostgreSQL is running in, they do not
always return UTC timestamps.

Attached is the v10 patch with the above changes. Thanks again for the review.

Best regards

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca


v10-0001-Add-notBefore-and-notAfter-to-SSL-cert-info-displ.patch
Description: Binary data


Re: speed up a logical replica setup

2024-03-18 Thread Alvaro Herrera
On 2024-Mar-18, Peter Eisentraut wrote:

> * src/bin/pg_basebackup/pg_createsubscriber.c
> 
> I think the connection string handling is not robust against funny
> characters, like spaces, in database names etc.

Maybe it would be easier to deal with this by passing around a struct
with keyword/value pairs that can be given to PQconnectdbParams (and
keeping dbname as a param that's passed separately, so that it can be
added when one is needed), instead of messing with the string conninfos;
then you don't have to worry about quoting.
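
A minimal sketch of that approach, just to illustrate (standalone libpq
code, not pg_createsubscriber itself; the keyword list and values are made
up for the example):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* keyword/value arrays; libpq handles any quoting/escaping itself */
    const char *keywords[] = {"host", "port", "dbname", "fallback_application_name", NULL};
    const char *values[] = {"localhost", "5432", "my funny database", "pg_createsubscriber", NULL};

    /* expand_dbname = 0: dbname is taken literally, spaces and all */
    PGconn     *conn = PQconnectdbParams(keywords, values, 0);

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
    PQfinish(conn);
    return 0;
}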

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it."   (New York Times, about Microsoft PowerPoint)




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-18 Thread Alvaro Herrera
I enabled the test again and also pushed the changes to dblink,
isolationtester and fe_utils (which AFAICS is used by pg_dump,
pg_amcheck, reindexdb and vacuumdb).  I chickened out of committing the
postgres_fdw changes though, so here they are again.  Not sure I'll find
courage to get these done by tomorrow, or whether I should just let them
for Fujita-san or Noah, who have been the last committers to touch this.


-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No renuncies a nada. No te aferres a nada."
>From 737578fcdc6ed0de64838d2ad905054b18eb9ec1 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Mon, 18 Mar 2024 19:37:40 +0100
Subject: [PATCH v39] postgres_fdw: Start using new libpq cancel APIs

Commit 61461a300c1c introduced new functions to libpq for cancelling
queries.  This replaces the usage of the old ones in postgres_fdw.

Author: Jelte Fennema-Nio 
Discussion: https://postgr.es/m/cageczqt_vgowwenuqvuv9xqmbacyxjtrrayo8w07oqashk_...@mail.gmail.com
---
 contrib/postgres_fdw/connection.c | 108 +++---
 .../postgres_fdw/expected/postgres_fdw.out|  15 +++
 contrib/postgres_fdw/sql/postgres_fdw.sql |   7 ++
 3 files changed, 113 insertions(+), 17 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 4931ebf591..0c66eaa001 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -133,7 +133,7 @@ static void pgfdw_inval_callback(Datum arg, int cacheid, uint32 hashvalue);
 static void pgfdw_reject_incomplete_xact_state_change(ConnCacheEntry *entry);
 static void pgfdw_reset_xact_state(ConnCacheEntry *entry, bool toplevel);
 static bool pgfdw_cancel_query(PGconn *conn);
-static bool pgfdw_cancel_query_begin(PGconn *conn);
+static bool pgfdw_cancel_query_begin(PGconn *conn, TimestampTz endtime);
 static bool pgfdw_cancel_query_end(PGconn *conn, TimestampTz endtime,
    bool consume_input);
 static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
@@ -1315,36 +1315,106 @@ pgfdw_cancel_query(PGconn *conn)
 	endtime = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
 		  CONNECTION_CLEANUP_TIMEOUT);
 
-	if (!pgfdw_cancel_query_begin(conn))
+	if (!pgfdw_cancel_query_begin(conn, endtime))
 		return false;
 	return pgfdw_cancel_query_end(conn, endtime, false);
 }
 
+/*
+ * Submit a cancel request to the given connection, waiting only until
+ * the given time.
+ *
+ * We sleep interruptibly until we receive confirmation that the cancel
+ * request has been accepted, and if it is, return true; if the timeout
+ * lapses without that, or the request fails for whatever reason, return
+ * false.
+ */
 static bool
-pgfdw_cancel_query_begin(PGconn *conn)
+pgfdw_cancel_query_begin(PGconn *conn, TimestampTz endtime)
 {
-	PGcancel   *cancel;
-	char		errbuf[256];
+	bool		timed_out = false;
+	bool		failed = false;
+	PGcancelConn *cancel_conn = PQcancelCreate(conn);
 
-	/*
-	 * Issue cancel request.  Unfortunately, there's no good way to limit the
-	 * amount of time that we might block inside PQgetCancel().
-	 */
-	if ((cancel = PQgetCancel(conn)))
+	if (!PQcancelStart(cancel_conn))
 	{
-		if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
+		PG_TRY();
 		{
 			ereport(WARNING,
 	(errcode(ERRCODE_CONNECTION_FAILURE),
 	 errmsg("could not send cancel request: %s",
-			errbuf)));
-			PQfreeCancel(cancel);
-			return false;
+			pchomp(PQcancelErrorMessage(cancel_conn)))));
 		}
-		PQfreeCancel(cancel);
+		PG_FINALLY();
+		{
+			PQcancelFinish(cancel_conn);
+		}
+		PG_END_TRY();
+		return false;
 	}
 
-	return true;
+	/* In what follows, do not leak any PGcancelConn on an error. */
+	PG_TRY();
+	{
+		while (true)
+		{
+			PostgresPollingStatusType pollres = PQcancelPoll(cancel_conn);
+			TimestampTz now = GetCurrentTimestamp();
+			long		cur_timeout;
+			int			waitEvents = WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH;
+
+			if (pollres == PGRES_POLLING_OK)
+break;
+
+			/* If timeout has expired, give up, else get sleep time. */
+			cur_timeout = TimestampDifferenceMilliseconds(now, endtime);
+			if (cur_timeout <= 0)
+			{
+timed_out = true;
+failed = true;
+goto exit;
+			}
+
+			switch (pollres)
+			{
+case PGRES_POLLING_READING:
+	waitEvents |= WL_SOCKET_READABLE;
+	break;
+case PGRES_POLLING_WRITING:
+	waitEvents |= WL_SOCKET_WRITEABLE;
+	break;
+default:
+	failed = true;
+	goto exit;
+			}
+
+			/* Sleep until there's something to do */
+			WaitLatchOrSocket(MyLatch, waitEvents, PQcancelSocket(cancel_conn),
+			  cur_timeout, PG_WAIT_EXTENSION);
+			ResetLatch(MyLatch);
+
+			CHECK_FOR_INTERRUPTS();
+		}
+exit:
+		if (failed)
+		{
+			if (timed_out)
+ereport(WARNING,
+		(errmsg("could not cancel request due to timeout")));
+			else
+ereport(WARNING,
+		(errcode(ERRCODE_CONNECTION_FAILURE),
+		 errmsg("could not send cancel request: %s",
+

Re: Building with meson on NixOS/nixpkgs

2024-03-18 Thread Alvaro Herrera
On 2024-Mar-16, Wolfgang Walther wrote:

> The upstream name for the ossp-uuid package / pkg-config file is "uuid". Many
> distributions change this to be "ossp-uuid" to not conflict with e2fsprogs.

I can confirm that this is true for Debian, at least; the packaging
rules have this in override_dh_install:

install -D -m 644 debian/tmp/usr/lib/$(DEB_HOST_MULTIARCH)/pkgconfig/uuid.pc \
debian/libossp-uuid-dev/usr/lib/pkgconfig/ossp-uuid.pc

which matches the fact that Engelschall's official repository has the
file named simply uuid.pc:
https://github.com/rse/uuid/blob/master/uuid.pc.in

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Java : Postgres double precession issue with different data format text and binary

2024-03-18 Thread Rahul Uniyal

Hello Chapman,

Thanks for the reply and suggestion.

Below are my observations from debugging the PostgreSQL JDBC driver code
for the double precision data type.

1- When the value in the DB is 40 and the fetched value is also 40
 A - In the QueryExecutorImpl class method receiveFields(), we create the
field metadata:

 private Field[] receiveFields() throws IOException {
pgStream.receiveInteger4(); // MESSAGE SIZE
int size = pgStream.receiveInteger2();
Field[] fields = new Field[size];

if (LOGGER.isLoggable(Level.FINEST)) {
  LOGGER.log(Level.FINEST, " <=BE RowDescription({0})", size);
}

for (int i = 0; i < fields.length; i++) {
  String columnLabel = pgStream.receiveCanonicalString();
  int tableOid = pgStream.receiveInteger4();
  short positionInTable = (short) pgStream.receiveInteger2();
  int typeOid = pgStream.receiveInteger4();
  int typeLength = pgStream.receiveInteger2();
  int typeModifier = pgStream.receiveInteger4();
  int formatType = pgStream.receiveInteger2();
  fields[i] = new Field(columnLabel,
  typeOid, typeLength, typeModifier, tableOid, positionInTable);
  fields[i].setFormat(formatType);

  LOGGER.log(Level.FINEST, "{0}", fields[i]);
}

return fields;
  }

Output of this method is - [Field(id,FLOAT8,8,T), Field(client_id,FLOAT8,8,T), 
Field(create_ts,TIMESTAMP,8,T), Field(force_generation_flag,VARCHAR,65535,T), 
Field(instance_id,FLOAT8,8,T), Field(is_jmx_call,VARCHAR,65535,T), 
Field(ocode,VARCHAR,65535,T), Field(payload_type,VARCHAR,65535,T), 
Field(repository,VARCHAR,65535,T), Field(sub_repository,VARCHAR,65535,T)]

 

 
 B - Then, in the PgResultSet class, it calls the method
  public java.math.@Nullable BigDecimal getBigDecimal(@Positive int 
columnIndex) throws SQLException {
   return getBigDecimal(columnIndex, -1);
}
  and then it calls the method 
   @Pure
  private @Nullable Number getNumeric(
  int columnIndex, int scale, boolean allowNaN) throws SQLException {
byte[] value = getRawValue(columnIndex);
if (value == null) {
  return null;
}

if (isBinary(columnIndex)) {
  int sqlType = getSQLType(columnIndex);
  if (sqlType != Types.NUMERIC && sqlType != Types.DECIMAL) {
Object obj = internalGetObject(columnIndex, fields[columnIndex - 1]);
if (obj == null) {
  return null;
}
if (obj instanceof Long || obj instanceof Integer || obj instanceof 
Byte) {
  BigDecimal res = BigDecimal.valueOf(((Number) obj).longValue());
  res = scaleBigDecimal(res, scale);
  return res;
}
return toBigDecimal(trimMoney(String.valueOf(obj)), scale);
  } else {
Number num = ByteConverter.numeric(value);
if (allowNaN && Double.isNaN(num.doubleValue())) {
  return Double.NaN;
}

return num;
  }
}
Since the column format is text and not binary, it converts the value to
BigDecimal and gives back the value as 40.

2- When the value in the DB is 40 and the fetched value is 40.0 (a trailing zero)
   In this case the field metadata is:

   [Field(id,FLOAT8,8,B), Field(client_id,FLOAT8,8,B), 
Field(ocode,VARCHAR,65535,T), Field(payload_type,VARCHAR,65535,T), 
Field(repository,VARCHAR,65535,T), Field(sub_repository,VARCHAR,65535,T), 
Field(force_generation_flag,VARCHAR,65535,T), 
Field(is_jmx_call,VARCHAR,65535,T), Field(instance_id,FLOAT8,8,B), 
Field(create_ts,TIMESTAMP,8,B)] 

Now, since the format is binary, the isBinary(columnIndex) condition in the
PgResultSet class's getNumeric() method is true, and it returns a Double
from there, resulting in 40.0.

Now I am not sure why, for the same table and the same column, we get two
different formats, or why this issue is intermittent.

Thanks,

Rahul 

> On 19-Mar-2024, at 1:02 AM, Rahul Uniyal  wrote:
> 


Re: REVOKE FROM warning on grantor

2024-03-18 Thread Robert Haas
On Sat, Mar 16, 2024 at 8:17 PM Tom Lane  wrote:
> Even taking the position that this is an unspecified point that we
> could implement how we like, I don't think there's a sufficient
> argument for changing behavior that's stood for a couple of decades.
> (The spelling of the message has changed over the years, but giving a
> warning not an error appears to go all the way back to 99b8f8451
> where we implemented user groups.)  It is certain that there are
> applications out there that rely on this behavior and would break.

I got curious about the behavior of other database systems.

https://dev.mysql.com/doc/refman/8.0/en/revoke.html documents an "IF
EXISTS" option whose documentation reads, in relevant part,
"Otherwise, REVOKE executes normally; if the user does not exist, the
statement raises an error."

https://community.snowflake.com/s/article/Access-Control-Error-Message-When-Revoking-a-Non-existent-Role-Grant-From-a-Role-or-User
is kind of interesting. It says that such commands used to fail with
an error but that's been changed; now they don't.

I couldn't find a clear reference for Oracle or DB-2 or SQL server,
but it doesn't look like any of them have an IF EXISTS option, and the
examples they show don't mention this being an issue AFAICS, so I'm
guessing that all of them accept commands of this type without error.

On the whole, it seems like we might be taking the majority position
here, but I can't help but feel some sympathy with people who don't
like it. Maybe we need a REVOKE OR ELSE command. :-)

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Possibility to disable `ALTER SYSTEM`

2024-03-18 Thread Robert Treat
On Thu, Mar 14, 2024 at 12:37 PM Robert Haas  wrote:
>
> On Tue, Feb 13, 2024 at 2:05 AM Joel Jacobson  wrote:
> > > Wouldn't having system wide EVTs be a generic solution which could be the
> > > infrastructure for this requested change as well as others in the same 
> > > area?
> >
> > +1
> >
> > I like the wider vision of providing the necessary infrastructure to 
> > provide a solution for the general case.
>
> We don't seem to be making much progress here.
>
> As far as I can see from reading the thread, most people agree that
> it's reasonable to have some way to disable ALTER SYSTEM, but there
> are at least six competing ideas about how to do that:
>
> 1. command-line option
> 2. GUC
> 3. event trigger
> 4. extension
> 5. sentinel file
> 6. remove permissions on postgresql.auto.conf
>
> As I see it, (5) or (6) are most convenient for the system
> administrator, since they let that person make changes without needing
> to log into the database or, really, worry very much about the
> database's usual configuration mechanisms at all, and (5) seems like
> less work to implement than (6), because (6) probably breaks a bunch
> of client tools in weird ways that might not be easy for us to
> discover during patch review. (1) doesn't allow changing things at
> runtime, and might require the system administrator to fiddle with the
> startup scripts, which seems like it could be inconvenient. (2) and
> (3) seem like they put the superuser in a position to easily reverse a
> policy about what the superuser ought to do, but in the case of (2),
> that can be mitigated if the GUC can only be set in postgresql.conf
> and not elsewhere. (4) has no real advantages except for allowing core
> to maintain the fiction that we don't support this while actually
> supporting it; I think we should reject that approach outright.
>

You know it's funny, you say #4 has no advantage and should be
rejected outright, but AFAICT

a) no one has actually laid out why it wouldn't work for them,
b) and it's the one solution that can be implemented now
c) and that implementation would be backwards compatible with some set
of existing releases
d) and certainly anyone running k8s or config management system would
have the ability to install
e) and it could be custom tailored to individual deployments as needed
(including other potential commands that some systems might care
about)
f) and it seems like the least likely option to be mistaken for a
security feature
g) and also seems pretty safe wrt not breaking existing tooling (like
5/6 might do)

Looking at it, you could make the argument that #4 is actually the
best of the solutions proposed, except it has the one drawback that it
requires folks to double down on the fiction that we think extensions
are a good way to build solutions when really everyone just wants to
have everything in core.

Robert Treat
https://xzilla.net




Re: WIP Incremental JSON Parser

2024-03-18 Thread Jacob Champion
On Mon, Mar 18, 2024 at 3:32 AM Andrew Dunstan  wrote:
> Not very easily. But I think and hope I've fixed the issue you've identified 
> above about returning before lex->token_start is properly set.
>
>  Attached is a new set of patches that does that and is updated for the 
> json_errdetail() changes.

Thanks!

>++   * Normally token_start would be ptok->data, but it could be 
> later,
>++   * see json_lex_string's handling of invalid escapes.
> +   */
>-+  lex->token_start = ptok->data;
>++  lex->token_start = dummy_lex.token_start;
> +  lex->token_terminator = ptok->data + ptok->len;

By the same token (ha), the lex->token_terminator needs to be updated
from dummy_lex for some error paths. (IIUC, on success, the
token_terminator should always point to the end of the buffer. If it's
not possible to combine the two code paths, maybe it'd be good to
check that and assert/error out if we've incorrectly pulled additional
data into the partial token.)

With the incremental parser, I think prev_token_terminator is not
likely to be safe to use except in very specific circumstances, since
it could be pointing into a stale chunk. Some documentation around how
to use that safely in a semantic action would be good.

It looks like some of the newly added error handling paths cannot be
hit, because the production stack makes it logically impossible to get
there. (For example, if it takes a successfully lexed comma to
transition into JSON_PROD_MORE_ARRAY_ELEMENTS to begin with, then when
we pull that production's JSON_TOKEN_COMMA off the stack, we can't
somehow fail to match that same comma.) Assuming I haven't missed a
different way to get into that situation, could the "impossible" cases
have assert calls added?

I've attached two diffs. One is the group of tests I've been using
locally (called 002_inline.pl; I replaced the existing inline tests
with it), and the other is a set of potential fixes to get those tests
green.

Thanks,
--Jacob
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 8273b35b01..f7608f84b9 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -928,11 +928,12 @@ pg_parse_json_incremental(JsonLexContext *lex,
case JSON_TOKEN_END:
ctx = JSON_PARSE_END;
break;
-
-   /*
-* mostly we don't need to worry about non-terminals here,
-* but there are a few cases where we do.
-*/
+   case JSON_NT_MORE_ARRAY_ELEMENTS:
+   ctx = JSON_PARSE_ARRAY_NEXT;
+   break;
+   case JSON_NT_ARRAY_ELEMENTS:
+   ctx = JSON_PARSE_ARRAY_START;
+   break;
case JSON_NT_MORE_KEY_PAIRS:
ctx = JSON_PARSE_OBJECT_NEXT;
break;
@@ -1412,7 +1413,7 @@ json_lex(JsonLexContext *lex)
 * see json_lex_string's handling of invalid escapes.
 */
lex->token_start = dummy_lex.token_start;
-   lex->token_terminator = ptok->data + ptok->len;
+   lex->token_terminator = dummy_lex.token_terminator;
if (partial_result == JSON_SUCCESS)
lex->inc_state->partial_completed = true;
return partial_result;
commit 3d615593d6d79178a8b8a208172414806d40cc03
Author: Jacob Champion 
Date:   Mon Mar 11 06:13:47 2024 -0700

WIP: test many different inline cases

diff --git a/src/test/modules/test_json_parser/meson.build b/src/test/modules/test_json_parser/meson.build
index a5bedce94e..2563075c34 100644
--- a/src/test/modules/test_json_parser/meson.build
+++ b/src/test/modules/test_json_parser/meson.build
@@ -45,6 +45,7 @@ tests += {
   'tap': {
 'tests': [
   't/001_test_json_parser_incremental.pl',
+  't/002_inline.pl',
 ],
   },
 }
diff --git a/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl b/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
index 477198b843..24f665c021 100644
--- a/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
+++ b/src/test/modules/test_json_parser/t/001_test_json_parser_incremental.pl
@@ -12,8 +12,6 @@ my $test_file = "$FindBin::RealBin/../tiny.json";
 
 my $exe = "test_json_parser_incremental";
 
-my @inlinetests = ("false", "12345", '"a"', "[1,2]");
-
 for (my $size = 64; $size > 0; $size--)
 {
my ($stdout, $stderr) = run_command( [$exe, "-c", $size, $test_file] );
@@ -22,21 +20,4 @@ for (my $size = 64; $size > 0; $size--)
is($stderr, "", 

Re: documentation structure

2024-03-18 Thread Roberto Mello
On Mon, Mar 18, 2024 at 10:12 AM Robert Haas  wrote:

> I was looking at the documentation index this morning[1], and I can't
> help feeling like there are some parts of it that are over-emphasized
> and some parts that are under-emphasized. I'm not sure what we can do
> about this exactly, but I thought it worth writing an email and seeing
> what other people think.
>

I agree, and my usage patterns of the docs are similar.

As the project progresses and more features are added and tacked on to
existing docs, things can get
murky or buried. I imagine that web access and search logs could paint a
picture of documentation usage.

I don't know what other people's experience is, but for me, wanting to
> know what a command does or what a setting does is extremely common.
> Therefore, I think these chapters are disproportionately important and
> should be emphasized more. In the case of the GUC reference, one idea
>

+1

I have is to split up "III. Server Administration". My proposal is
> that we divide it into three sections. The first would be called "III.
> Server Installation" and would cover chapters 16 (installation from
> binaries) through 19 (server setup and operation). The second would be
> called "IV. Server Configuration" -- so every section that's currently
> a subsection of "server configuration" would become a top-level
>
chapter. The third division would be "V. Server Administration," and
> would cover the current chapters 21-33. This is probably far from


I like all of those.


> I don't know what to do about "I. SQL commands". It's obviously
> impractical to promote that to a top-level section, because it's got a
> zillion sub-pages which I don't think we want in the top-level
> documentation index. But having it as one of several unnumbered
> chapters interposed between 51 and 52 doesn't seem great either.
>

I think it'd be easier to read if current "VI. Reference" came right after
"Server Administration",
ahead of "Client Interfaces" and "Server Programming", which are of
interest to a much smaller
subset of users.

It would also help if the subchapters were numbered like the rest of them. I
don't think the roman numerals are particularly helpful.

The stuff that I think is over-emphasized is as follows: (a) chapters
> 1-3, the tutorial; (b) chapters 4-6, which are essentially a

...

Also +1

Thoughts? I realize that this topic is HIGHLY prone to ENDLESS
> bikeshedding, and it's inevitable that not everybody is going to
> agree. But I hope we can agree that it's completely silly that it's
> vastly easier to find the documentation about the backup manifest
> format than it is to find the documentation on CREATE TABLE or
> shared_buffers, and if we can agree on that, then perhaps we can agree
> on some way to make things better.
>

Impossible to please everyone, but I'm sure we can improve things.

I've contributed to different parts of the docs over the years, and would
be happy
to help with this work.

Roberto


Re: Statistics Import and Export

2024-03-18 Thread Corey Huinker
>
>
>
>
> From testrun/pg_dump/002_pg_dump/log/regress_log_002_pg_dump, search
> for the "not ok" and then look at what it tried to do right before
> that. I see:
>
> pg_dump: error: prepared statement failed: ERROR:  syntax error at or
> near "%"
> LINE 1: ..._histogram => %L::real[]) coalesce($2, format('%I.%I',
> a.nsp...
>

Thanks. Unfamiliar turf for me.


>
> > All those changes are available in the patches attached.
>
> How about if you provided "get" versions of the functions that return a
> set of rows that match what the "set" versions expect? That would make
> 0001 essentially a complete feature itself.
>

That's tricky. At the base level, those functions would just be an
encapsulation of "SELECT * FROM pg_stats WHERE schemaname = $1 AND
tablename = $2" which isn't all that much of a savings. Perhaps we can make
the documentation more explicit about the source and nature of the
parameters going into the pg_set_ functions.

Per conversation, it would be trivial to add helper functions that
replace the parameters after the initial oid with a pg_class rowtype, and
that would dissect the values needed and call the more complex function:

pg_set_relation_stats( oid, pg_class)
pg_set_attribute_stats( oid, pg_stats)


>
> I think it would also make the changes in pg_dump simpler, and the
> tests in 0001 a lot simpler.
>

I agree. The tests are currently showing that a fidelity copy can be made
from one table to another, but to do so we have to conceal the actual stats
values because those are 1. not deterministic/known and 2. subject to
change from version to version.

I can add some sets to arbitrary values like was done for
pg_set_relation_stats().


Re: Autogenerate some wait events code and documentation

2024-03-18 Thread Bertrand Drouvot
Hi,

On Mon, Mar 18, 2024 at 08:49:34AM -0700, Noah Misch wrote:
> On Mon, Mar 18, 2024 at 09:04:44AM +, Bertrand Drouvot wrote:
> > --- a/src/backend/utils/activity/wait_event_names.txt
> > +++ b/src/backend/utils/activity/wait_event_names.txt
> > @@ -24,7 +24,12 @@
> >  #  SGML tables of wait events for inclusion in the documentation.
> >  #
> >  # When adding a new wait event, make sure it is placed in the appropriate
> > -# ClassName section.
> > +# ClassName section. If the wait event is backpatched to a version < 17 
> > then
> > +# put it under a "Backpatch" delimiter at the end of the related ClassName
> > +# section.
> 
> Back-patch from v17 to pre-v17 won't use this, because v16 has hand-maintained
> enums.  It's back-patch v18->v17 or v22->v17 where this will come up.

Thanks for looking at it!
Oh right, the comment is wrong, re-worded in v2 attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 868d97a853c45639685198c9090cedd77dc4b9af Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Mon, 18 Mar 2024 08:34:19 +
Subject: [PATCH v2] Add "Backpatch" regions in wait_event_names.txt

When backpatching, adding an event will renumber others, which can make an
extension report the wrong event until recompiled. This is due to the fact that
generate-wait_event_types.pl sorts events to position them. The "Backpatch"
region added here ensures no ordering is done for the wait events found after
this delimiter.
---
 .../activity/generate-wait_event_types.pl | 26 ++-
 .../utils/activity/wait_event_names.txt   | 19 +-
 2 files changed, 43 insertions(+), 2 deletions(-)
 100.0% src/backend/utils/activity/

diff --git a/src/backend/utils/activity/generate-wait_event_types.pl b/src/backend/utils/activity/generate-wait_event_types.pl
index f1adf0e8e7..d129d94889 100644
--- a/src/backend/utils/activity/generate-wait_event_types.pl
+++ b/src/backend/utils/activity/generate-wait_event_types.pl
@@ -38,7 +38,9 @@ die "Not possible to specify --docs and --code simultaneously"
 
 open my $wait_event_names, '<', $ARGV[0] or die;
 
+my @backpatch_lines;
 my @lines;
+my $backpatch = 0;
 my $section_name;
 my $note;
 my $note_name;
@@ -59,10 +61,26 @@ while (<$wait_event_names>)
 	{
 		$section_name = $_;
 		$section_name =~ s/^.*- //;
+		$backpatch = 0;
 		next;
 	}
 
-	push(@lines, $section_name . "\t" . $_);
+	# Are we dealing with backpatch wait events?
+	if (/^Backpatch:$/)
+	{
+		$backpatch = 1;
+		next;
+	}
+
+	# Backpatch wait events won't be sorted during code generation
+	if ($gen_code && $backpatch)
+	{
+		push(@backpatch_lines, $section_name . "\t" . $_);
+	}
+	else
+	{
+		push(@lines, $section_name . "\t" . $_);
+	}
 }
 
 # Sort the lines based on the second column.
@@ -70,6 +88,12 @@ while (<$wait_event_names>)
 my @lines_sorted =
   sort { uc((split(/\t/, $a))[1]) cmp uc((split(/\t/, $b))[1]) } @lines;
 
+# If we are generating code then concat @lines_sorted and @backpatch_lines.
+if ($gen_code)
+{
+	push(@lines_sorted, @backpatch_lines);
+}
+
 # Read the sorted lines and populate the hash table
 foreach my $line (@lines_sorted)
 {
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c08e00d1d6..b6303a0fdb 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -24,7 +24,11 @@
 #  SGML tables of wait events for inclusion in the documentation.
 #
 # When adding a new wait event, make sure it is placed in the appropriate
-# ClassName section.
+# ClassName section. If the wait event is backpatched from master to a version
+# >= 17 then put it under a "Backpatch:" delimiter at the end of the related
+# ClassName section (on the non master branches) or at its natural position on
+# the master branch.
+# Ensure that the backpatch regions are always empty on the master branch.
 #
 # WaitEventLWLock and WaitEventLock have their own C code for their wait event
 # enums and function names.  Hence, for these, only the SGML documentation is
@@ -61,6 +65,7 @@ WAL_SENDER_MAIN	"Waiting in main loop of WAL sender process."
 WAL_SUMMARIZER_WAL	"Waiting in WAL summarizer for more WAL to be generated."
 WAL_WRITER_MAIN	"Waiting in main loop of WAL writer process."
 
+Backpatch:
 
 #
 # Wait Events - Client
@@ -82,6 +87,7 @@ WAIT_FOR_STANDBY_CONFIRMATION	"Waiting for WAL to be received and flushed by the
 WAL_SENDER_WAIT_FOR_WAL	"Waiting for WAL to be flushed in WAL sender process."
 WAL_SENDER_WRITE_DATA	"Waiting for any activity when processing replies from WAL receiver in WAL sender process."
 
+Backpatch:
 
 #
 # Wait Events - IPC
@@ -149,6 +155,7 @@ WAL_RECEIVER_WAIT_START	"Waiting for startup process to send initial data for st
 WAL_SUMMARY_READY	"Waiting for a new WAL summary to be generated."
 XACT_GROUP_UPDATE	"Waiting for the group leader to update 

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 12:30:04PM -0500, Nathan Bossart wrote:
> Here is a more fleshed-out version of what I believe David is proposing.
> On my machine, the gains aren't quite as impressive (~8.8s to ~5.2s for the
> test_popcount benchmark).  I assume this is because this patch turns
> pg_popcount() into a function pointer, which is what the AVX512 patches do,
> too.  I left out the 32-bit section from pg_popcount_fast(), but I'll admit
> that I'm not yet 100% sure that we can assume we're on a 64-bit system
> there.
> 
> IMHO this work is arguably a prerequisite for the AVX512 work, as turning
> pg_popcount() into a function pointer will likely regress performance for
> folks on systems without AVX512 otherwise.

Apologies for the noise.  I noticed that we could (and probably should)
inline the pg_popcount32/64 calls in the "slow" version, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 3047674f0950435b7fa30746be7f8e5cc7249e6d Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 18 Mar 2024 12:18:15 -0500
Subject: [PATCH 1/1] inline function calls in pg_popcount() when possible

---
 src/include/port/pg_bitutils.h |   5 +-
 src/port/pg_bitutils.c | 135 -
 2 files changed, 103 insertions(+), 37 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 46bf4f0103..53e5239717 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,17 +302,16 @@ pg_ceil_log2_64(uint64 num)
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
+extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
 
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
+extern uint64 pg_popcount(const char *buf, int bytes);
 
 #endif			/* TRY_POPCNT_FAST */
 
-/* Count the number of one-bits in a byte array */
-extern uint64 pg_popcount(const char *buf, int bytes);
-
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..d0c93dafcb 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -103,18 +103,22 @@ const uint8 pg_number_of_ones[256] = {
 	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
 };
 
-static int	pg_popcount32_slow(uint32 word);
-static int	pg_popcount64_slow(uint64 word);
+static inline int pg_popcount32_slow(uint32 word);
+static inline int pg_popcount64_slow(uint64 word);
+static uint64 pg_popcount_slow(const char *buf, int bytes);
 
 #ifdef TRY_POPCNT_FAST
 static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
 static int	pg_popcount32_fast(uint32 word);
-static int	pg_popcount64_fast(uint64 word);
+static inline int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount_fast(const char *buf, int bytes);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64		(*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
 #endif			/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -151,11 +155,13 @@ pg_popcount32_choose(uint32 word)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
 	}
 
 	return pg_popcount32(word);
@@ -168,16 +174,37 @@ pg_popcount64_choose(uint64 word)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
 	}
 
 	return pg_popcount64(word);
 }
 
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+	if (pg_popcount_available())
+	{
+		pg_popcount32 = pg_popcount32_fast;
+		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
+	}
+	else
+	{
+		pg_popcount32 = pg_popcount32_slow;
+		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
+	}
+
+	return pg_popcount(buf, bytes);
+}
+
 /*
  * pg_popcount32_fast
  *		Return the number of 1 bits set in word
@@ -199,7 +226,7 @@ __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
  * pg_popcount64_fast
  *		Return the number of 1 bits set in word
  */
-static int
+static inline int
 pg_popcount64_fast(uint64 word)
 {
 #ifdef _MSC_VER
@@ -212,6 +239,36 @@ __asm__ __volatile__(" popcntq %1,%0\n":"=q"(res):"rm"(word):"cc");
 #endif
 }
 
+/*
+ * pg_popcount_fast
+ *		Returns the number of 

Re: Add LSN <-> time conversion functionality

2024-03-18 Thread Tomas Vondra


On 3/18/24 15:02, Daniel Gustafsson wrote:
>> On 22 Feb 2024, at 03:45, Melanie Plageman  wrote:
>> On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
>>  wrote:
> 
>>> - Not sure why we need 0001. Just so that the "estimate" functions in
>>> 0002 have a convenient "start" point? Surely we could look at the
>>> current LSNTimeline data and use the oldest value, or (if there's no
>>> data) use the current timestamp/LSN?
>>
>> When there are 0 or 1 entries in the timeline you'll get an answer
>> that could be very off if you just return the current timestamp or
>> LSN. I guess that is okay?
> 
> I don't think that's a huge problem at such a young "lsn-age", but I might be
> missing something.
> 
>>> - I wonder what happens if we lose the data - we know that if people
>>> reset statistics for whatever reason (or just lose them because of a
>>> crash, or because they're on a replica), bad things happen to
>>> autovacuum. What's the (expected) impact on pruning?
>>
>> This is an important question. Because stats aren't crashsafe, we
>> could return very inaccurate numbers for some time/LSN values if we
>> crash. I don't actually know what we could do about that. When I use
>> the LSNTimeline for the freeze heuristic it is less of an issue
>> because the freeze heuristic has a fallback strategy when there aren't
>> enough stats to do its calculations. But other users of this
>> LSNTimeline will simply get highly inaccurate results (I think?). Is
>> there anything we could do about this? It seems bad.
> 

Do we have something to calculate a sufficiently good "average" to use
as a default, if we don't have a better value? For example, we know the
timestamp of the last checkpoint, and we know the LSN, right? Maybe if
we're sufficiently far from the checkpoint, we could use that.

Or maybe checkpoint_timeout / max_wal_size would be enough to calculate
some default value?

I wonder how long it takes until LSNTimeline gives us sufficiently good
data for all LSNs we need. That is, if we lose this, how long it takes
until we get enough data to do good decisions?

Why don't we simply WAL-log this in some trivial way? It's pretty small,
so if we WAL-log this once in a while (after a merge happens), that
should not be a problem.

Or a different idea - if we lost the data, but commit_ts is enabled,
can't we "simply" walk commit_ts and feed LSN/timestamp into the
timeline? I guess we don't want to walk 2B transactions, but even just
sampling some recent transactions might be enough, no?

> A complication with this over stats is that we can't recompute this in case of
> a crash/corruption issue.  The simple solution is to consider this unlogged
> data and start fresh at every unclean shutdown, but I have a feeling that 
> won't
> be good enough for basing heuristics on?
> 
>> Andres had brought up something at some point about, what if the
>> database is simply turned off for awhile and then turned back on. Even
>> if you cleanly shut down, will there be "gaps" in the timeline? I
>> think that could be okay, but it is something to think about.
> 
> The gaps would represent reality, so there is nothing wrong per se with gaps,
> but if they inflate the interval of a bucket which in turns impact the
> precision of the results then that can be a problem.
> 

Well, I think the gaps are a problem in the sense that they disappear
once we start merging the buckets. But maybe that's fine, if we're only
interested in approximate data.

>> Just a note, all of my comments could use a lot of work, but I want to
>> get consensus on the algorithm before I make sure and write about it
>> in a perfect way.
> 
> I'm not sure "a lot of work" is accurate, they seem pretty much there to me,
> but I think that an illustration of running through the algorithm in an
> ascii-art array would be helpful.
> 

+1

> 
> Reading through this I think such a function has merits, not only for your
> usecase but other heuristic based work and quite possibly systems debugging.
> 
> While the bucketing algorithm is a clever algorithm for degrading precision 
> for
> older entries without discarding them, I do fear that we'll risk ending up 
> with
> answers like "somewhere between in the past and even further in the past".
> I've been playing around with various compression algorithms for packing the
> buckets such that we can retain precision for longer.  Since you were aiming 
> to
> work on other patches leading up to the freeze, let's pick this up again once
> things calm down.
> 

I guess this ambiguity is pretty inherent to a structure that does not
keep all the data, and gradually reduces the resolution for old stuff.
But my understanding was that's sufficient for the freezing patch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Possibility to disable `ALTER SYSTEM`

2024-03-18 Thread Maciek Sakrejda
On Mon, Mar 18, 2024 at 10:27 AM Robert Haas  wrote:
> Right, we're adding this because of environments like Kubernetes,
> which isn't a version, but it is a platform, or at least a deployment
> mode, which is why I thought of that section. I think for now we
> should just file this under "Other platforms and clients," which only
> has one existing setting. If the number of settings of this type
> grows, we can split it out.

Fair enough, +1.

> Using enable_* as code for "this is a planner GUC" is a pretty stupid
> pattern, honestly, but I agree with you that it's long-established and
> we probably shouldn't deviate from it lightly. Perhaps just rename to
> allow_alter_system?

+1




Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 05:28:32PM +, Amonson, Paul D wrote:
> Question: I applied the patch for the drive_popcount* functions and
> rebuilt.  The resultant server complains that the function is missing.
> What is the trick to make this work?

You probably need to install the test_popcount extension and run "CREATE
EXTENSION test_popcount;".

> Another Question: Is there a reason "time psql" is used over the Postgres
> "\timing" command?

I don't think there's any strong reason.  I've used both.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




libpq behavior with hostname with multiple addresses and target_session_attrs=primary

2024-03-18 Thread Sameer M. Deshpande
Hello,
I would like to make a feature request regarding the existing implementation of 
the connection object. The request is as follows.

I have configured a primary and a hot standby PostgreSQL server. The replication 
works fine with repmanager. All I need is for the client connection string to use 
a common hostname (without specifying multiple hosts) and to remain unchanged 
across role changes. To simplify my need, the details are as follows.

Primary host: host1.home.network, IP: 192.168.0.2
Standby host: host2.home.network, IP: 192.168.0.3 
Common host: myapphost.home.network: this resolves to IP: 192.168.0.2 and IP: 
192.168.0.3

The connection string I want my applications to use is:

user@localnode:~> psql "host=myapphost.home.network dbname=appdb 
sslmode=require target_session_attrs=primary"

But when I test the connection multiple times, I observe strange behaviour and 
I am not sure whether it is expected. The client application uses 
myapphost.home.network. By using "target_session_attrs=primary", my expectation 
is to always connect to the primary node, no matter which host currently holds 
that role after a manual role change, either host1 or host2.home.network.

The reason for having two IPs behind the common hostname is that we sometimes 
perform a manual role switch using repmanager. Because of this, I do not want 
any changes to the client connection, and I expect the target_session_attrs 
attribute to always route the client connection to the primary node. But after 
testing a few times, I get the error:

connection to server at "myapphost.home.network" (192.168.0.3), port 5432 
failed: server is in hot standby mode

My expectation here is that, even though myapphost.home.network resolves to two 
different servers (primary and standby), using target_session_attrs=primary 
should direct my session to the primary host only. And of course 
load_balance_hosts is left at its default value of disable. Furthermore, this 
setup also gives me the flexibility to add or remove hosts in the DNS entry 
myapphost.home.network.

Based on the above requirement, I would like to ask whether it is possible to 
enhance the connection logic so that, when target_session_attrs=primary is 
used, the connection always latches onto the primary server instead of throwing 
an error saying that the next server in the list is in hot standby mode, 
read-only, etc.
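
For reference, the check that the psql command above performs can be written as 
a tiny libpq program (an illustrative sketch only, reusing the example host and 
database names from above):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	const char *conninfo =
		"host=myapphost.home.network dbname=appdb sslmode=require "
		"target_session_attrs=primary";
	PGconn	   *conn = PQconnectdb(conninfo);

	if (PQstatus(conn) != CONNECTION_OK)
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
	else
		printf("connected to %s (%s)\n", PQhost(conn), PQhostaddr(conn));

	PQfinish(conn);
	return 0;
}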

Many thanks in advance,

Regards
Sameer Deshpande

Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 11:20:18AM -0500, Nathan Bossart wrote:
> I don't think David was suggesting that we need to remove the runtime
> checks for AVX512.  IIUC he was pointing out that most of the performance
> gain is from removing the function call overhead, which your v8-0002 patch
> already does for the proposed AVX512 code.  We can apply a similar
> optimization for systems without AVX512 by inlining the code for
> pg_popcount64() and pg_popcount32().

Here is a more fleshed-out version of what I believe David is proposing.
On my machine, the gains aren't quite as impressive (~8.8s to ~5.2s for the
test_popcount benchmark).  I assume this is because this patch turns
pg_popcount() into a function pointer, which is what the AVX512 patches do,
too.  I left out the 32-bit section from pg_popcount_fast(), but I'll admit
that I'm not yet 100% sure that we can assume we're on a 64-bit system
there.

IMHO this work is arguably a prerequisite for the AVX512 work, as turning
pg_popcount() into a function pointer will likely regress performance for
folks on systems without AVX512 otherwise.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 1d33c803feb7428f798b13fd643a16c73628f8a9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 18 Mar 2024 12:18:15 -0500
Subject: [PATCH 1/1] inline function calls in pg_popcount() when possible

---
 src/include/port/pg_bitutils.h |   5 +-
 src/port/pg_bitutils.c | 123 +
 2 files changed, 97 insertions(+), 31 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 46bf4f0103..53e5239717 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -302,17 +302,16 @@ pg_ceil_log2_64(uint64 num)
 /* Attempt to use the POPCNT instruction, but perform a runtime check first */
 extern PGDLLIMPORT int (*pg_popcount32) (uint32 word);
 extern PGDLLIMPORT int (*pg_popcount64) (uint64 word);
+extern PGDLLIMPORT uint64 (*pg_popcount) (const char *buf, int bytes);
 
 #else
 /* Use a portable implementation -- no need for a function pointer. */
 extern int	pg_popcount32(uint32 word);
 extern int	pg_popcount64(uint64 word);
+extern uint64 pg_popcount(const char *buf, int bytes);
 
 #endif			/* TRY_POPCNT_FAST */
 
-/* Count the number of one-bits in a byte array */
-extern uint64 pg_popcount(const char *buf, int bytes);
-
 /*
  * Rotate the bits of "word" to the right/left by n bits.
  */
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..e374e753d7 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -105,16 +105,20 @@ const uint8 pg_number_of_ones[256] = {
 
 static int	pg_popcount32_slow(uint32 word);
 static int	pg_popcount64_slow(uint64 word);
+static uint64 pg_popcount_slow(const char *buf, int bytes);
 
 #ifdef TRY_POPCNT_FAST
 static bool pg_popcount_available(void);
 static int	pg_popcount32_choose(uint32 word);
 static int	pg_popcount64_choose(uint64 word);
+static uint64 pg_popcount_choose(const char *buf, int bytes);
 static int	pg_popcount32_fast(uint32 word);
-static int	pg_popcount64_fast(uint64 word);
+static inline int pg_popcount64_fast(uint64 word);
+static uint64 pg_popcount_fast(const char *buf, int bytes);
 
 int			(*pg_popcount32) (uint32 word) = pg_popcount32_choose;
 int			(*pg_popcount64) (uint64 word) = pg_popcount64_choose;
+uint64		(*pg_popcount) (const char *buf, int bytes) = pg_popcount_choose;
 #endif			/* TRY_POPCNT_FAST */
 
 #ifdef TRY_POPCNT_FAST
@@ -151,11 +155,13 @@ pg_popcount32_choose(uint32 word)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
 	}
 
 	return pg_popcount32(word);
@@ -168,16 +174,37 @@ pg_popcount64_choose(uint64 word)
 	{
 		pg_popcount32 = pg_popcount32_fast;
 		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
 	}
 	else
 	{
 		pg_popcount32 = pg_popcount32_slow;
 		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
 	}
 
 	return pg_popcount64(word);
 }
 
+static uint64
+pg_popcount_choose(const char *buf, int bytes)
+{
+	if (pg_popcount_available())
+	{
+		pg_popcount32 = pg_popcount32_fast;
+		pg_popcount64 = pg_popcount64_fast;
+		pg_popcount = pg_popcount_fast;
+	}
+	else
+	{
+		pg_popcount32 = pg_popcount32_slow;
+		pg_popcount64 = pg_popcount64_slow;
+		pg_popcount = pg_popcount_slow;
+	}
+
+	return pg_popcount(buf, bytes);
+}
+
 /*
  * pg_popcount32_fast
  *		Return the number of 1 bits set in word
@@ -199,7 +226,7 @@ __asm__ __volatile__(" popcntl %1,%0\n":"=q"(res):"rm"(word):"cc");
  * pg_popcount64_fast
  *		Return the number of 1 bits set in word
  */
-static int
+static inline int
 pg_popcount64_fast(uint64 word)
 {
 #ifdef _MSC_VER
@@ -212,6 +239,36 @@ __asm__ __volatile__(" popcntq 

Re: Add LSN <-> time conversion functionality

2024-03-18 Thread Tomas Vondra



On 2/22/24 03:45, Melanie Plageman wrote:
> Thanks so much for reviewing!
> 
> On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
>  wrote:
>>
>> When I first read this, I immediately started wondering if this might
>> use the commit timestamp stuff we already have. Because for each commit
>> we already store the LSN and commit timestamp, right? But I'm not sure
>> that would be a good match - the commit_ts serves a very special purpose
>> of mapping XID => (LSN, timestamp), I don't see how to make it work for
>> (LSN=>timestmap) and (timestamp=>LSN) very easily.
> 
> I took a look at the code in commit_ts.c, and I really don't see a way
> of reusing any of this commit<->timestamp infrastructure for
> timestamp<->LSN mappings.
> 
>> As for the inner workings of the patch, my understanding is this:
>>
>> - "LSNTimeline" consists of "LSNTime" entries representing (LSN,ts)
>> points, but those points are really "buckets" that grow larger and
>> larger for older periods of time.
> 
> Yes, they are buckets in the sense that they represent multiple values
> but each contains a single LSNTime value which is the minimum of all
> the LSNTimes we "merged" into that single array element. In order to
> represent a range of time, you need to use two array elements. The
> linear interpolation from time <-> LSN is all done with two elements.
> 
>> - AFAIK each entry represent an interval of time, and the next (older)
>> interval is twice as long, right? So the first interval is 1 second,
>> then 2 seconds, 4 seconds, 8 seconds, ...
>>
>> - But I don't understand how the LSNTimeline entries are "aging" and get
>> less accurate, while the "current" bucket is short. lsntime_insert()
>> seems to simply move to the next entry, but doesn't that mean we insert
>> the entries into larger and larger buckets?
> 
> Because the earlier array elements can represent fewer logical members
> than later ones and because elements are merged into the next element
> when space runs out, later array elements will contain older data and
> more of it, so those "ranges" will be larger. But, after thinking
> about it and also reading your feedback, I realized my algorithm was
> silly because it starts merging logical members before it has even
> used the whole array.
> 
> The attached v3 has a new algorithm. Now, LSNTimes are added from the
> end of the array forward until all array elements have at least one
> logical member (array length == volume). Once array length == volume,
> new LSNTimes will result in merging logical members in existing
> elements. We want to merge older members because those can be less
> precise. So, the number of logical members per array element will
> always monotonically increase starting from the beginning of the array
> (which contains the most recent data) and going to the end. We want to
> use all the available space in the array. That means that each LSNTime
> insertion will always result in a single merge. We want the timeline
> to be inclusive of the oldest data, so merging means taking the
> smaller value of two LSNTime values. I had to pick a rule for choosing
> which elements to merge. So, I choose the merge target as the oldest
> element whose logical membership is < 2x its predecessor. I merge the
> merge target's predecessor into the merge target. Then I move all of
> the intervening elements down 1. Then I insert the new LSNTime at
> index 0.
> 
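
To make the scheme above concrete, here is a rough sketch of the merge-on-insert
logic as I read it (illustrative only, not the actual patch code; among other
simplifications, the real patch fills the array from the end rather than
shifting entries):

#include "postgres.h"
#include "access/xlogdefs.h"		/* XLogRecPtr */
#include "datatype/timestamp.h"		/* TimestampTz */

#define LSNTIMELINE_VOLUME 64		/* array size; value is made up */

typedef struct LSNTime
{
	TimestampTz time;
	XLogRecPtr	lsn;
	uint64		members;	/* observations merged into this entry */
} LSNTime;

typedef struct LSNTimeline
{
	int			length;		/* entries currently in use */
	LSNTime		data[LSNTIMELINE_VOLUME];	/* index 0 is the newest */
} LSNTimeline;

static void
lsntimeline_insert(LSNTimeline *tl, TimestampTz time, XLogRecPtr lsn)
{
	if (tl->length < LSNTIMELINE_VOLUME)
	{
		/* Still filling the array: just make room at the front. */
		memmove(&tl->data[1], &tl->data[0], tl->length * sizeof(LSNTime));
		tl->length++;
	}
	else
	{
		/*
		 * Array is full: find the oldest entry whose membership is less
		 * than twice that of its newer neighbour and absorb that neighbour
		 * into it.  The older entry keeps its own (smaller) LSN/time pair,
		 * so the timeline stays inclusive of the oldest data.
		 */
		int			target = 1;	/* fallback: merge the two newest */

		for (int i = tl->length - 1; i > 0; i--)
		{
			if (tl->data[i].members < 2 * tl->data[i - 1].members)
			{
				target = i;
				break;
			}
		}

		tl->data[target].members += tl->data[target - 1].members;
		/* shift the intervening (newer) entries down to fill the hole */
		memmove(&tl->data[1], &tl->data[0], (target - 1) * sizeof(LSNTime));
	}

	/* the new observation always lands at index 0 */
	tl->data[0].time = time;
	tl->data[0].lsn = lsn;
	tl->data[0].members = 1;
}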

I can't help but think about t-digest [1], which also merges data into
variable-sized buckets (called centroids, which is a pair of values,
just like LSNTime). But the merging is driven by something called "scale
function" which I found like a pretty nice approach to this, and it
yields some guarantees regarding accuracy. I wonder if we could do
something similar here ...

The t-digest is a way to approximate quantiles, and the default scale
function is optimized for best accuracy on the extremes (close to 0.0
and 1.0), but it's possible to use scale functions that optimize only
for accuracy close to 1.0.

This may be misguided, but I see similarity between quantiles and what
LSNTimeline does - timestamps are ordered, and quantiles close to 0.0
are "old timestamps" while quantiles close to 1.0 are "now".

And t-digest also defines a pretty efficient algorithm to merge data in
a way that gradually combines older "buckets" into larger ones.

>> - The comments never really spell what amount of time the entries cover
>> / how granular it is. My understanding is it's simply measured in number
>> of entries added, which is assumed to be constant and drive by
>> bgwriter_delay, right? Which is 200ms by default. Which seems fine, but
>> isn't the hibernation (HIBERNATE_FACTOR) going to mess with it?
>>
>> Is there some case where bgwriter would just loop without sleeping,
>> filling the timeline much faster? (I can't think of any, but ...)
> 
> bgwriter will wake up when there are buffers to flush, which is likely
> correlated with there being new LSNs. So, actually it 

RE: Popcount optimization using AVX512

2024-03-18 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Monday, March 18, 2024 9:20 AM
> ...
> I don't think David was suggesting that we need to remove the runtime checks
> for AVX512.  IIUC he was pointing out that most of the performance gain is
> from removing the function call overhead, which your v8-0002 patch already
> does for the proposed AVX512 code.  We can apply a similar optimization for
> systems without AVX512 by inlining the code for
> pg_popcount64() and pg_popcount32().

Ok, got you.

Question: I applied the patch for the drive_popcount* functions and rebuilt.  
The resultant server complains that the function is missing. What is the trick 
to make this work?

Another Question: Is there a reason "time psql" is used over the Postgres 
"\timing" command?

Thanks,
Paul





Re: Possibility to disable `ALTER SYSTEM`

2024-03-18 Thread Robert Haas
On Mon, Mar 18, 2024 at 12:19 PM Maciek Sakrejda  wrote:
> +1 on Version and Platform Compatibility. Maybe it just needs a new
> subsection there? This is for compatibility with a "deployment
> platform". The "Platform and Client Compatibility" subsection has just
> one entry, so a new subsection with also just one entry seems
> defensible, maybe just "Deployment Compatibility"? I think it's also
> plausible that there will be other similar settings for managed
> deployments in the future.

Right, we're adding this because of environments like Kubernetes,
which isn't a version, but it is a platform, or at least a deployment
mode, which is why I thought of that section. I think for now we
should just file this under "Other platforms and clients," which only
has one existing setting. If the number of settings of this type
grows, we can split it out.

> Do we really want to break that pattern?

Using enable_* as code for "this is a planner GUC" is a pretty stupid
pattern, honestly, but I agree with you that it's long-established and
we probably shouldn't deviate from it lightly. Perhaps just rename to
allow_alter_system?
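
For what it's worth, such a setting would be wired into guc_tables.c roughly as
sketched below; the name, context, category, and description shown are
placeholders for exactly the things still being debated here.

/* sketch only -- not a proposal for the final name, context, or category */
bool		AllowAlterSystem = true;

/* entry in ConfigureNamesBool[], filed under "Other Platforms and Clients" */
{
	{"allow_alter_system", PGC_SIGHUP, COMPAT_OPTIONS_CLIENT,
		gettext_noop("Allows ALTER SYSTEM to modify postgresql.auto.conf."),
		NULL
	},
	&AllowAlterSystem,
	true,
	NULL, NULL, NULL
},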

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Possibility to disable `ALTER SYSTEM`

2024-03-18 Thread Robert Haas
On Mon, Mar 18, 2024 at 11:46 AM Magnus Hagander  wrote:
> > Wouldn't that break pgBackrest which IIRC write to .auto.conf directly
> > without using ALTER SYSTEM?
>
> Ugh of course. And not only that, it would also break pg_basebackup
> which does the same.
>
> So I guess that's not a good idea. I guess nobody anticipated this
> when that was done:)

I'm also +1 for the idea that the feature should only disable ALTER
SYSTEM, not postgresql.auto.conf. I can't really see any reason why it
needs to do both, and it might be more convenient if it didn't. If
you're managing PostgreSQL's configuration externally, you might find
it convenient to write the configuration you're managing into
postgresql.auto.conf. Or you might want to write it to
postgresql.conf. Or you might want to do something more complicated
with include directives or whatever. But there's no reason why you
*couldn't* want to use postgresql.auto.conf, and on the other hand I
don't see how anyone benefits from that file not being read. That just
seems confusing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Java : Postgres double precision issue with different data format text and binary

2024-03-18 Thread Chapman Flack
Hi,

On 03/16/24 11:10, Rahul Uniyal wrote:
> We are encountering an issue where the double precision data type
> in PostgreSQL is giving some intermittent results when fetching data.
> For example, in the database the value is 40, but sometimes this value
> is fetched as 40.0. Similarly, for a value of 0.0005, it is being
> fetched as 0.00050, resulting in extra trailing zeros.

As a first observation, the column names in your schema suggest that
these columns are being used as IDs of some kind, for which a float type
would be an unusual choice. Unless something in your situation requires it,
you might consider changing to integer types for IDs.

That said, you may have found something interesting in how JDBC handles
the float8 type in text vs. binary format, but comparing the results
of conversion to decimal string is not the most direct way to
investigate it.

It would be clearer to compare the raw bits of the values.

For example, with SELECT float8send(ID) FROM SUBMISSION_QUEUE,
you should see \x4044000000000000 if ID is 40, and you should see
\x3f40624dd2f1a9fc if ID is 0.0005.

Likewise, on the Java side,
Long.toHexString(Double.doubleToLongBits(id)) should also show you
4044000000000000 for the value 40, and 3f40624dd2f1a9fc for the
value 0.0005.
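
A quick way to cross-check those bit patterns outside Java is a small
standalone C program (an illustrative sketch only, assuming IEEE-754 doubles;
it is not related to the driver):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Print the raw IEEE-754 bits of a double, like Double.doubleToLongBits(). */
static void
print_bits(double d)
{
	uint64_t	bits;

	memcpy(&bits, &d, sizeof(bits));	/* reinterpret without aliasing issues */
	printf("%.17g -> %016llx\n", d, (unsigned long long) bits);
}

int
main(void)
{
	print_bits(40.0);			/* expect 4044000000000000 */
	print_bits(0.0005);			/* expect 3f40624dd2f1a9fc */
	return 0;
}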

If you end up finding that the text/binary transmission format
sometimes causes the Java value not to have the same bits as the
PostgreSQL value, that information could be of interest on the
pgsql-jdbc list.

Regards,
Chapman Flack




Re: DOCS: add helpful partitioning links

2024-03-18 Thread Robert Treat
On Thu, Mar 14, 2024 at 12:15 PM Ashutosh Bapat
 wrote:
>
> Hi Robert,
>
> On Thu, Mar 7, 2024 at 10:49 PM Robert Treat  wrote:
>>
>> This patch adds a link to the "attach partition" command section
>> (similar to the detach partition link above it) as well as a link to
>> "create table like" as both commands contain additional information
>> that users should review beyond what is laid out in this section.
>> There's also a couple of wordsmiths in nearby areas to improve
>> readability.
>
>
> Thanks.
>
> The patch gives error when building html
>
> ddl.sgml:4300: element link: validity error : No declaration for attribute 
> linked of element link
>  CREATE TABLE ... 
> LIKE   ^
> ddl.sgml:4300: element link: validity error : Element link does not carry 
> attribute linkend
> nked="sql-createtable-parms-like">CREATE TABLE ... 
> LIKE   
>  ^
> make[1]: *** [Makefile:72: postgres-full.xml] Error 4
> make[1]: *** Deleting file 'postgres-full.xml'
> make[1]: Leaving directory 
> '/home/ashutosh/work/units/pg_review/coderoot/pg/doc/src/sgml'
> make: *** [Makefile:8: html] Error 2
>
> I have fixed in the attached.
>

Doh! Thanks!

>
> - As an alternative, it is sometimes more convenient to create the
> - new table outside the partition structure, and attach it as a
> + As an alternative, it is sometimes more convenient to create a
> + new table outside of the partition structure, and attach it as a
>
> it uses article "the" for "new table" since it's referring to the partition 
> mentioned in the earlier example. I don't think using "a" is correct.
>

I think this section has a general problem of mingling the terms table
and partition in a way that can be confusing.

In this case, you have to infer that the term "the new table" is
referring not to the only table mentioned in the previous paragraph
(the partitioned table), but actually to the partition mentioned in
the previous paragraph. For long term postgres folks the idea that
partitions are tables isn't a big deal, but that isn't how folks
coming from other databases see things. So I lean towards "a new
table" because we are specifically talking about an alternative to the
above paragraph... ie. don't make a new partition, make a new table.
And tbh I think that wording (create a new table and attach it as a
partition) is still better than the wording in your patch, because the
"new partition" you are creating isn't a partition until it is
attached; it is just a new table.

> "outside" seems better than "outside of". See 
> https://english.stackexchange.com/questions/9700/outside-or-outside-of. But I 
> think the meaning of the sentence will be more clear if we rephrase it as in 
> the attached patch.
>

This didn't really clarify anything for me, as the discussion in that
link seems to be around the usage of the term wrt physical location,
and it is much less clear about the context of a logical construct.
Granted, your patch removes that, though I think now I'd lean toward
using the phrase "separate from".

> - convenient, as not only will the existing partitions become indexed, but
> - also any partitions that are created in the future will.  One 
> limitation is
> + convenient as not only will the existing partitions become indexed, but
> + any partitions created in the future will as well.  One limitation is
>
> I am finding the current construct hard to read. The comma is misplaced as 
> you have pointed out. The pair of commas break the "not only" ... "but also" 
> construct. I have tried to simplify the sentence in the attached. Please 
> review.
>
> - the partitioned table; such an index is marked invalid, and the 
> partitions
> - do not get the index applied automatically.  The indexes on partitions 
> can
> - be created individually using CONCURRENTLY, and then
> + the partitioned table; such an index is marked invalid and the 
> partitions
> + do not get the index applied automatically.  The partition indexes can
>
> "indexes on partition" is clearer than "partition index". Fixed in the 
> attached patch.
>
> Please review.

The language around all this is certainly tricky (like, what is a
partitioned index vs parent index?), and one thing I'd certainly try
to avoid is using any words like "inherited" which is also overloaded
in this context. In any case, I took in all the above and had a stab
at a v3

Robert Treat
https://xzilla.net


improve-partition-links_v3.patch
Description: Binary data


Re: documentation structure

2024-03-18 Thread Robert Haas
On Mon, Mar 18, 2024 at 10:55 AM Matthias van de Meent
 wrote:
> > I don't know what to do about "I. SQL commands". It's obviously
> > impractical to promote that to a top-level section, because it's got a
> > zillion sub-pages which I don't think we want in the top-level
> > documentation index. But having it as one of several unnumbered
> > chapters interposed between 51 and 52 doesn't seem great either.
>
> Could "SQL Commands" be a top-level construct, with subsections for
> SQL/DML, SQL/DDL, SQL/Transaction management, and PG's
> extensions/administrative/misc features? I sometimes find myself
> trying to mentally organize what SQL commands users can use vs those
> accessible to database owners and administrators, which is not
> currently organized as such in the SQL Commands section.

Yeah, I wondered about that, too. Or for example you could group all
CREATE commands together, all ALTER commands together, all DROP
commands together, etc. But I can't really see a future in such
schemes, because having a single page that links to the reference
documentation for every single command we have in alphabetical order
is incredibly handy, or at least I have found it so. So my feeling -
at least at present - is that it's more fruitful to look into cutting
down the amount of clutter that appears in the top-level documentation
index, and maybe finding ways to make important sections like the SQL
reference more prominent.

Given how much documentation we have, it's just not going to be
possible to make everything that matters conveniently visible at the
top level. I think if people have to click down a level for the SQL
reference, that's fine, as long as the link they need to click on is
reasonably visible. What annoys me about the present structure is that
it isn't. You don't get any visual clue that the "SQL Commands" page
with ~100 subpages is more important than "51. Archive Modules" or
"33. Regression Tests" or "58. Writing a Procedural Language Handler,"
but it totally is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Built-in CTYPE provider

2024-03-18 Thread Jeff Davis
On Sun, 2024-03-17 at 17:46 -0400, Tom Lane wrote:
> Jeff Davis  writes:
> > New series attached.
> 
> Coverity thinks there's something wrong with builtin_validate_locale,
> and as far as I can tell it's right: the last ereport is unreachable,
> because required_encoding is never changed from its initial -1 value.
> It looks like there's a chunk of logic missing there, or else that
> the code could be simplified further.

Thank you, it was a bit of over-generalization in anticipation of
future patches.

It may be moot soon, but I committed a fix now.

Regards,
Jeff Davis





Re: Statistics Import and Export

2024-03-18 Thread Jeff Davis
On Sun, 2024-03-17 at 23:33 -0400, Corey Huinker wrote:
> 
> A few TAP tests are still failing and I haven't been able to diagnose
> why, though the failures in parallel dump seem to be that it tries to
> import stats on indexes that haven't been created yet, which is odd
> because I sent the dependency.

From testrun/pg_dump/002_pg_dump/log/regress_log_002_pg_dump, search
for the "not ok" and then look at what it tried to do right before
that. I see:

pg_dump: error: prepared statement failed: ERROR:  syntax error at or
near "%"
LINE 1: ..._histogram => %L::real[]) coalesce($2, format('%I.%I',
a.nsp...

> All those changes are available in the patches attached.

How about if you provided "get" versions of the functions that return a
set of rows that match what the "set" versions expect? That would make
0001 essentially a complete feature itself.

I think it would also make the changes in pg_dump simpler, and the
tests in 0001 a lot simpler.

Regards,
Jeff Davis





Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 04:07:40PM +, Amonson, Paul D wrote:
> Won't I still need the runtime checks? If I compile with a compiler
> supporting the HW "feature" but run on HW without that feature,  I will
> want to avoid faults due to illegal operations. Won't that also affect
> performance?

I don't think David was suggesting that we need to remove the runtime
checks for AVX512.  IIUC he was pointing out that most of the performance
gain is from removing the function call overhead, which your v8-0002 patch
already does for the proposed AVX512 code.  We can apply a similar
optimization for systems without AVX512 by inlining the code for
pg_popcount64() and pg_popcount32().
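
To make the shape of that concrete, the byte-array variant behind the new
function pointer ends up looking roughly like the existing pg_popcount() loop,
just calling the inline word popcount (a sketch, not the exact patch text):

static uint64
pg_popcount_fast(const char *buf, int bytes)
{
	uint64		popcnt = 0;

#if SIZEOF_VOID_P >= 8
	/* Process in 64-bit chunks if the buffer is aligned. */
	if (buf == (const char *) TYPEALIGN(8, buf))
	{
		const uint64 *words = (const uint64 *) buf;

		while (bytes >= 8)
		{
			popcnt += pg_popcount64_fast(*words++);
			bytes -= 8;
		}

		buf = (const char *) words;
	}
#endif

	/* Process any remaining bytes */
	while (bytes--)
		popcnt += pg_number_of_ones[(unsigned char) *buf++];

	return popcnt;
}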

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Possibility to disable `ALTER SYSTEM`

2024-03-18 Thread Maciek Sakrejda
On Mon, Mar 18, 2024 at 7:12 AM Jelte Fennema-Nio  wrote:
>
> On Mon, 18 Mar 2024 at 13:57, Robert Haas  wrote:
> > I would have been somewhat inclined to find an existing section
> > of postgresql.auto.conf for this parameter, perhaps "platform and
> > version compatibility".
>
> I tried to find an existing section, but I couldn't find any that this
> new GUC would fit into naturally. "Version and Platform Compatibility
> / Previous PostgreSQL Versions" (the one you suggested) seems wrong
> too. The GUCs there are to get back to Postgres behaviour from
> previous versions. So that section would only make sense if we'd turn
> enable_alter_system off by default (which obviously no-one in this
> thread suggests/wants).
>
> If you have another suggestion for an existing category that we should
> use, feel free to share. But imho, none of the existing ones are a
> good fit.

+1 on Version and Platform Compatibility. Maybe it just needs a new
subsection there? This is for compatibility with a "deployment
platform". The "Platform and Client Compatibility" subsection has just
one entry, so a new subsection with also just one entry seems
defensible, maybe just "Deployment Compatibility"? I think it's also
plausible that there will be other similar settings for managed
deployments in the future.

> > Even if that is what we're going to do, do we want to call them "guard
> > rails"? I'm not sure I'd find that name terribly clear, as a user.
>
> If anyone has a better suggestion, I'm happy to change it.

No better suggestion at the moment, but while I used the term to
explain the feature, I also don't think that's a great official name.
For one thing, the section could easily be misinterpreted as guard
rails for end-users who are new to Postgres. Also, I think it's more
colloquial in tone than Postgres docs conventions.

Further, I think we may want to change the GUC name itself. All the
other GUCs that start with enable_ control planner behavior:

maciek=# select name from pg_settings where name like 'enable_%';
  name

 enable_async_append
 enable_bitmapscan
 enable_gathermerge
 enable_hashagg
 enable_hashjoin
 enable_incremental_sort
 enable_indexonlyscan
 enable_indexscan
 enable_material
 enable_memoize
 enable_mergejoin
 enable_nestloop
 enable_parallel_append
 enable_parallel_hash
 enable_partition_pruning
 enable_partitionwise_aggregate
 enable_partitionwise_join
 enable_presorted_aggregate
 enable_seqscan
 enable_sort
 enable_tidscan
(21 rows)

Do we really want to break that pattern?




RE: Popcount optimization using AVX512

2024-03-18 Thread Amonson, Paul D
Won't I still need the runtime checks? If I compile with a compiler supporting 
the HW "feature" but run on HW without that feature,  I will want to avoid 
faults due to illegal operations. Won't that also affect performance?

Paul

> -Original Message-
> From: Nathan Bossart 
> Sent: Monday, March 18, 2024 8:29 AM
> To: David Rowley 
> Cc: Amonson, Paul D ; Andres Freund
> ; Alvaro Herrera ; Shankaran,
> Akash ; Noah Misch ;
> Tom Lane ; Matthias van de Meent
> ; pgsql-hackers@lists.postgresql.org
> Subject: Re: Popcount optimization using AVX512
> 
> On Mon, Mar 18, 2024 at 09:56:32AM +1300, David Rowley wrote:
> > Maybe it's worth exploring something along the lines of the attached
> > before doing the AVX512 stuff.  It seems like a pretty good speed-up
> > and will apply for CPUs without AVX512 support.
> 
> +1
> 
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-18 Thread Tomas Vondra



On 3/18/24 15:47, Melanie Plageman wrote:
> On Sun, Mar 17, 2024 at 3:21 PM Tomas Vondra
>  wrote:
>>
>> On 3/14/24 22:39, Melanie Plageman wrote:
>>> On Thu, Mar 14, 2024 at 5:26 PM Tomas Vondra
>>>  wrote:

 On 3/14/24 19:16, Melanie Plageman wrote:
> On Thu, Mar 14, 2024 at 03:32:04PM +0200, Heikki Linnakangas wrote:
>> ...
>>
>> Ok, committed that for now. Thanks for looking!
>
> Attached v6 is rebased over your new commit. It also has the "fix" in
> 0010 which moves BitmapAdjustPrefetchIterator() back above
> table_scan_bitmap_next_block(). I've also updated the Streaming Read API
> commit (0013) to Thomas' v7 version from [1]. This has the update that
> we theorize should address some of the regressions in the bitmapheapscan
> streaming read user in 0014.
>

 Should I rerun the benchmarks with these new patches, to see if it
 really helps with the regressions?
>>>
>>> That would be awesome!
>>>
>>
>> OK, here's a couple charts comparing the effect of v6 patches to master.
>> These are from 1M and 10M data sets, same as the runs presented earlier
>> in this thread (the 10M is still running, but should be good enough for
>> this kind of visual comparison).
> 
> Thanks for doing this!
> 
>> What is even more obvious is that 0014 behaves *VERY* differently. I'm
>> not sure if this is a good thing or a problem is debatable/unclear. I'm
>> sure we don't want to cause regressions, but perhaps those are due to
>> the prefetch issue discussed elsewhere in this thread (identified by
>> Andres and Melanie). There are also many cases that got much faster, but
>> the question is whether this is due to better efficiency or maybe the
>> new code being more aggressive in some way (not sure).
> 
> Are these with the default effective_io_concurrency (1)? If so, the
> "effective" prefetch distance in many cases will be higher with the
> streaming read code applied. With effective_io_concurrency 1,
> "max_ios" will always be 1, but the number of blocks prefetched may
> exceed this (up to MAX_BUFFERS_PER_TRANSFER) because the streaming
> read code is always trying to build bigger IOs. And, if prefetching,
> it will prefetch IOs not yet in shared buffers before reading them.
> 

No, it's a mix of runs with random combinations of these parameters:

dataset: uniform uniform_pages linear linear_fuzz cyclic cyclic_fuzz
workers: 0 4
work_mem: 128kB 4MB 64MB
eic: 0 1 8 16 32
selectivity: 0-100%

I can either share the data (~70MB of CSV) or generate charts for
results with some filter.

> It's hard to tell without going into a specific repro why this would
> cause some queries to be much slower. In the forced bitmapheapscan, it
> would make sense that more prefetching is worse -- which is why a
> bitmapheapscan plan wouldn't have been chosen. But in the optimal
> cases, it is unclear why it would be worse.
> 

Yes, not sure about the optimal cases. I'll wait for the 10M runs to
complete, and then we can look for some patterns.

> I don't think there is any way it could be the issue Andres
> identified, because there is only one iterator. Nothing to get out of
> sync. It could be that the fadvises are being issued too close to the
> reads and aren't effective enough at covering up read latency on
> slower, older hardware. But that doesn't explain why master would
> sometimes be faster.
> 

Ah, right, thanks for the clarification. I forgot the streaming read API
does not use the two-iterator approach.

> Probably the only thing we can do is get into a repro. It would, of
> course, be easiest to do this with a serial query. I can dig into the
> scripts you shared earlier and try to find a good repro. Because the
> regressions may have shifted with Thomas' new version, it would help
> if you shared a category (cyclic/uniform/etc, parallel or serial, eic
> value, work mem, etc) where you now see the most regressions.
> 

OK, I've restarted the tests for only 0012 and 0014 patches, and I'll
wait for these to complete - I don't want to be looking for patterns
until we have enough data to smooth this out.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Autogenerate some wait events code and documentation

2024-03-18 Thread Noah Misch
On Mon, Mar 18, 2024 at 09:04:44AM +, Bertrand Drouvot wrote:
> --- a/src/backend/utils/activity/wait_event_names.txt
> +++ b/src/backend/utils/activity/wait_event_names.txt
> @@ -24,7 +24,12 @@
>  #  SGML tables of wait events for inclusion in the documentation.
>  #
>  # When adding a new wait event, make sure it is placed in the appropriate
> -# ClassName section.
> +# ClassName section. If the wait event is backpatched to a version < 17 then
> +# put it under a "Backpatch" delimiter at the end of the related ClassName
> +# section.

Back-patch from v17 to pre-v17 won't use this, because v16 has hand-maintained
enums.  It's back-patch v18->v17 or v22->v17 where this will come up.

> +# Ensure that the wait events under the "Backpatch" delimiter are placed in 
> the
> +# same order as in the pre 17 wait_event_types.h (see VERSION_FILE_SYNC as an
> +# example).

I expect the normal practice will be to put the entry in its natural position
in git master, then put it in the backpatch section for any other branch.  In
other words, the backpatch regions are always empty in git master, and the
non-backpatch regions change in master only.




Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2024-03-18 Thread Ashutosh Bapat
On Mon, Mar 18, 2024 at 5:40 PM Amit Langote 
wrote:

>
> >>
> >> Sorry, I should’ve mentioned that I was interested in seeing cpu times
> to compare the two approaches. Specifically, to see if the palloc / frees
> add noticeable overhead.
> >
> > No problem. Here you go
> >
> >  tables |  master  | patched  | perc_change
> > +--+--+-
> >   2 |   477.87 |   492.32 |   -3.02
> >   3 |  1970.83 |  1989.52 |   -0.95
> >   4 |  6305.01 |  6268.81 |0.57
> >   5 | 19261.56 | 18684.86 |2.99
> >
> > +ve change indicates reduced planning time. It seems that the planning
> time improves as the number of tables increases. But all the numbers are
> well within noise levels and thus may not show any improvement or
> deterioration. If you want, I can repeat the experiment.
>
> Thanks.  I also wanted to see cpu time difference between the two
> approaches of SpecialJoinInfo allocation -- on stack and on heap.  So
> comparing times with the previous version of the patch using stack
> allocation and the new one with palloc.  I'm hoping that the new
> approach is only worse in the noise range.
>

Ah, sorry, I didn't read it carefully. Alvaro made me realise that I had
been gathering numbers using assert-enabled builds, so they are not that
reliable. Here they are with non-assert builds.

planning time (in ms) reported by EXPLAIN.
 tables |  master  | stack_alloc | perc_change_1 |  palloc  | perc_change_2 | total_perc_change
--------+----------+-------------+---------------+----------+---------------+-------------------
      2 |    338.1 |      333.92 |          1.24 |   332.16 |          0.53 |              1.76
      3 |  1507.93 |     1475.76 |          2.13 |  1472.79 |          0.20 |              2.33
      4 |  5099.45 |     4980.35 |          2.34 |   4947.3 |          0.66 |              2.98
      5 | 15442.64 |    15531.94 |         -0.58 | 15393.41 |          0.89 |              0.32

stack_alloc = timings with SpecialJoinInfo on stack
perc_change_1 = percentage change of above wrt master
palloc = timings with palloc'ed SpecialJoinInfo
perc_change_2 = percentage change of above wrt timings with stack_alloc
total_perc_change = percentage change between master and all patches

total change is within noise. Timing with palloc is better than that with
SpecialJoinInfo on stack but only marginally. I don't think that means
palloc based allocation is better or worse than stack based.

You requested CPU time difference between stack based SpecialJoinInfo and
palloc'ed SpecialJoinInfo. Here it is
 tables | stack_alloc |  palloc  | perc_change
--------+-------------+----------+-------------
      2 |    0.438204 | 0.438986 |       -0.18
      3 |    1.224672 | 1.238781 |       -1.15
      4 |    3.511317 | 3.663334 |       -4.33
      5 |    9.283856 | 9.340516 |       -0.61

Yes, there's a consistent degradation albeit within noise levels.

The memory measurements
 tables | master | with_patch
--------+--------+------------
      2 | 26 MB  | 26 MB
      3 | 93 MB  | 84 MB
      4 | 277 MB | 235 MB
      5 | 954 MB | 724 MB

The first row shows the same value because of rounding. The actual values
there are 27740432 and 26854704, respectively.

Please let me know if you need anything.

-- 
Best Wishes,
Ashutosh Bapat


Re: Possibility to disable `ALTER SYSTEM`

2024-03-18 Thread Magnus Hagander
On Mon, Mar 18, 2024 at 4:44 PM Daniel Gustafsson  wrote:
>
> > On 18 Mar 2024, at 16:34, Magnus Hagander  wrote:
> >
> > On Mon, Mar 18, 2024 at 2:09 PM Daniel Gustafsson  wrote:
> >>
> >>> On 18 Mar 2024, at 13:57, Robert Haas  wrote:
> >>
> >>> my proposal is something like this, taking a
> >>> bunch of text from Jelte's patch and some inspiration from Magnus's
> >>> earlier remarks:
> >>
> >> I still think any wording should clearly mention that settings in the file 
> >> are
> >> still applied.  The proposed wording says to implicitly but to avoid 
> >> confusion
> >> I think it should be explicit.
> >
> > I haven't kept up with the thread, but in general I'd prefer it to
> > actually turn off parsing the file as well. I think just turning off
> > the ability to change it -- including the ability to *revert* changes
> > that were made to it before -- is going to be confusing.
>
> Wouldn't that break pgBackrest, which IIRC writes to .auto.conf directly
> without using ALTER SYSTEM?

Ugh of course. And not only that, it would also break pg_basebackup
which does the same.

So I guess that's not a good idea. I guess nobody anticipated this
when that was done. :)


> > But, if we have decided it shouldn't do that, then IMHO we should
> > consider naming it maybe enable_alter_system_command instead -- since
> > we're only disabling the alter system command, not the actual feature
> > in total.
>
> Good point.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Possibility to disable `ALTER SYSTEM`

2024-03-18 Thread Daniel Gustafsson
> On 18 Mar 2024, at 16:34, Magnus Hagander  wrote:
> 
> On Mon, Mar 18, 2024 at 2:09 PM Daniel Gustafsson  wrote:
>> 
>>> On 18 Mar 2024, at 13:57, Robert Haas  wrote:
>> 
>>> my proposal is something like this, taking a
>>> bunch of text from Jelte's patch and some inspiration from Magnus's
>>> earlier remarks:
>> 
>> I still think any wording should clearly mention that settings in the file 
>> are
>> still applied.  The proposed wording says to implicitly but to avoid 
>> confusion
>> I think it should be explicit.
> 
> I haven't kept up with the thread, but in general I'd prefer it to
> actually turn off parsing the file as well. I think just turning off
> the ability to change it -- including the ability to *revert* changes
> that were made to it before -- is going to be confusing.

Wouldn't that break pgBackrest, which IIRC writes to .auto.conf directly
without using ALTER SYSTEM?

> But, if we have decided it shouldn't do that, then IMHO we should
> consider naming it maybe enable_alter_system_command instead -- since
> we're only disabling the alter system command, not the actual feature
> in total.

Good point.

--
Daniel Gustafsson





Re: Add pg_basetype() function to obtain a DOMAIN base type

2024-03-18 Thread Tom Lane
Alexander Korotkov  writes:
> On Mon, Mar 18, 2024 at 2:01 AM jian he  wrote:
>> `
>> Datum
>> pg_basetype(PG_FUNCTION_ARGS)
>> {
>>  Oid oid;
>> 
>>  oid =  get_fn_expr_argtype(fcinfo->flinfo, 0);
>>  PG_RETURN_OID(getBaseType(oid));
>> }
>> `

> Looks good to me.  But should it be named pg_basetypeof()?

I still don't like this approach.  It forces the function to be
used in a particular way that's highly redundant with pg_typeof.
I think we'd be better off with

Datum
pg_basetype(PG_FUNCTION_ARGS)
{
    Oid         typid = PG_GETARG_OID(0);

    PG_RETURN_OID(getBaseType(typid));
}

The use-case that the other definition handles would be implemented
like

pg_basetype(pg_typeof(expression))

but there are other use-cases.  For example, if you want to know
the base types of the columns of a table, you could do something
like

select attname, pg_basetype(atttypid) from pg_attribute
  where attrelid = 'foo'::regclass order by attnum;

but that functionality is simply not available with the other
definition.

Perhaps there's an argument for providing both things, but that
feels like overkill to me.  I doubt that pg_basetype(pg_typeof())
is going to be so common as to need a shorthand.
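
For what it's worth, here is a quick sketch of how both usages would look
with the regtype-taking definition above (domain name invented purely for
illustration):

    -- assumes the proposed pg_basetype(regtype) definition
    CREATE DOMAIN us_postal_code AS text CHECK (VALUE ~ '^[0-9]{5}$');

    SELECT pg_basetype('us_postal_code'::regtype);            -- text
    SELECT pg_basetype(pg_typeof('90210'::us_postal_code));   -- text, via pg_typeof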

regards, tom lane




Re: Possibility to disable `ALTER SYSTEM`

2024-03-18 Thread Magnus Hagander
On Mon, Mar 18, 2024 at 2:09 PM Daniel Gustafsson  wrote:
>
> > On 18 Mar 2024, at 13:57, Robert Haas  wrote:
>
> > my proposal is something like this, taking a
> > bunch of text from Jelte's patch and some inspiration from Magnus's
> > earlier remarks:
>
> I still think any wording should clearly mention that settings in the file are
> still applied.  The proposed wording says to implicitly but to avoid confusion
> I think it should be explicit.

I haven't kept up with the thread, but in general I'd prefer it to
actually turn off parsing the file as well. I think just turning off
the ability to change it -- including the ability to *revert* changes
that were made to it before -- is going to be confusing.

But, if we have decided it shouldn't do that, then IMHO we should
consider naming it maybe enable_alter_system_command instead -- since
we're only disabling the alter system command, not the actual feature
in total.
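
For reference, the behaviour being discussed is roughly the following (GUC
name as in the current patch; the error text below is illustrative only, not
the patch's actual wording):

    -- with enable_alter_system = off in postgresql.conf
    ALTER SYSTEM SET work_mem = '64MB';
    ERROR:  ALTER SYSTEM is not allowed

    -- ...while anything already present in postgresql.auto.conf keeps
    -- being read and applied at startup/reload.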


> > Perhaps figuring out how to
> > document this is best left to a separate thread, and there's also the
> > question of whether a new section that talks about this also ought to
> > talk about anything else. But I feel like we're way overdue to do
> > something about this.
>
> Seconded, both that it needs to be addressed and that it should be done on a
> separate thread from this one.

+1.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: Streaming I/O, vectored I/O (WIP)

2024-03-18 Thread Nazir Bilal Yavuz
Hi,

On Sat, 16 Mar 2024 at 02:53, Thomas Munro  wrote:
>
> I am planning to push the bufmgr.c patch soon.  At that point the new
> API won't have any direct callers yet, but the traditional
> ReadBuffer() family of functions will internally reach
> StartReadBuffers(nblocks=1) followed by WaitReadBuffers(),
> ZeroBuffer() or nothing as appropriate.  Any more thoughts or
> objections?  Naming, semantics, correctness of buffer protocol,
> sufficiency of comments, something else?

+if (StartReadBuffers(bmr,
+ &buffer,
+ forkNum,
+ blockNum,
+ &nblocks,
+ strategy,
+ flags,
+ &operation))
+WaitReadBuffers(&operation);

I think we need to call WaitReadBuffers when 'mode !=
RBM_ZERO_AND_CLEANUP_LOCK && mode != RBM_ZERO_AND_LOCK' or am I
missing something?

Couple of nitpicks:

It would be nice to explain what the PrepareReadBuffer function does
with a comment.

+if (nblocks == 0)
+return;/* nothing to do */
It is guaranteed that nblocks will be bigger than 0. Can't we just use
Assert(operation->io_buffers_len > 0);?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Popcount optimization using AVX512

2024-03-18 Thread Nathan Bossart
On Mon, Mar 18, 2024 at 09:56:32AM +1300, David Rowley wrote:
> Maybe it's worth exploring something along the lines of the attached
> before doing the AVX512 stuff.  It seems like a pretty good speed-up
> and will apply for CPUs without AVX512 support.

+1

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: small_cleanups around login event triggers

2024-03-18 Thread Robert Treat
On Thu, Mar 14, 2024 at 7:23 PM Daniel Gustafsson  wrote:
>
> > On 14 Mar 2024, at 14:21, Robert Treat  wrote:
> > On Thu, Mar 14, 2024 at 8:21 AM Daniel Gustafsson  wrote:
>
> >> - canceling connection in psql wouldn't 
> >> cancel
> >> + canceling a connection in psql will not 
> >> cancel
> >> Nitpickery (perhaps motivated by english not being my first language), but
> >> since psql only deals with one connection I would expect this to read "the
> >> connection".
> >>
> >
> > My interpretation of this is that "a connection" is more correct
> > because it could be your connection or someone else's connection (ie,
> > you are canceling one of many possible connections). Definitely
> > nitpickery either way.
>
> Fair point.
>
> >> - * Returns true iff the lock was acquired.
> >> + * Returns true if the lock was acquired.
> >> Using "iff" here is being consistent with the rest of the file (and 
> >> technically
> >> correct):
>
> > Ah, yeah, I was pretty focused on the event trigger stuff and didn't
> > notice it being used elsewhere; thought it was a typo, but I guess
> > it's meant as shorthand for "if and only if", I wonder how many people
> > are familiar with that.
>
> I would like to think it's fairly widely understood among programmers, but I
> might be dating myself in saying so.
>
> I went ahead and applied this with the fixes mentioned here with one more tiny
> change to the last hunk of the patch to make it say "login event trigger"
> rather than just "login trigger". Thanks for the submission!
>

LGTM, thanks!

Robert Treat
https://xzilla.net




Re: documentation structure

2024-03-18 Thread Matthias van de Meent
On Mon, 18 Mar 2024 at 15:12, Robert Haas  wrote:

I'm not going into detail about the other docs comments, I don't have
much of an opinion either way on the mentioned sections. You make good
arguments; yet I don't usually use those sections of the docs but
rather do code searches.

> I don't know what to do about "I. SQL commands". It's obviously
> impractical to promote that to a top-level section, because it's got a
> zillion sub-pages which I don't think we want in the top-level
> documentation index. But having it as one of several unnumbered
> chapters interposed between 51 and 52 doesn't seem great either.

Could "SQL Commands" be a top-level construct, with subsections for
SQL/DML, SQL/DDL, SQL/Transaction management, and PG's
extensions/administrative/misc features? I sometimes find myself
trying to mentally organize what SQL commands users can use vs those
accessible to database owners and administrators, which is not
currently organized as such in the SQL Commands section.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-18 Thread Bertrand Drouvot
Hi,

On Thu, Mar 14, 2024 at 12:27:26PM +0530, Amit Kapila wrote:
> On Wed, Mar 13, 2024 at 10:16 PM Bharath Rupireddy
>  wrote:
> >
> > On Wed, Mar 13, 2024 at 11:13 AM shveta malik  
> > wrote:
> > >
> > > > Thanks. v8-0001 is how it looks. Please see the v8 patch set with this 
> > > > change.
> > >
> > > JFYI, the patch does not apply to the head. There is a conflict in
> > > multiple files.
> >
> > Thanks for looking into this. I noticed that the v8 patches needed
> > rebase. Before I go do anything with the patches, I'm trying to gain
> > consensus on the design. Following is the summary of design choices
> > we've discussed so far:
> > 1) conflict_reason vs invalidation_reason.
> > 2) When to compute the XID age?
> >
> 
> I feel we should focus on two things (a) one is to introduce a new
> column invalidation_reason, and (b) let's try to first complete
> invalidation due to timeout. We can look into XID stuff if time
> permits, remember, we don't have ample time left.

Agree. While it makes sense to invalidate slots for wal removal in
CreateCheckPoint() (because this is the place where wal is removed), I'm not
sure this is the right place for the 2 new cases.

Let's focus on the timeout one as proposed above (as probably the simplest one):
as this one is purely related to time and activity what about to invalidate them
when?:

- their usage resume
- in pg_get_replication_slots()

The idea is to invalidate the slot when one resumes activity on it or wants to
get information about it (and among other things wants to know if the slot is
valid or not).
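
As a concrete illustration, this is the kind of information a user would then
get back (the invalidation_reason column here is the one proposed in this
thread, not an existing one):

    -- hypothetical: relies on the proposed invalidation_reason column
    SELECT slot_name, slot_type, active, invalidation_reason
    FROM pg_replication_slots
    WHERE invalidation_reason IS NOT NULL;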

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-18 Thread Melanie Plageman
On Sun, Mar 17, 2024 at 3:21 PM Tomas Vondra
 wrote:
>
> On 3/14/24 22:39, Melanie Plageman wrote:
> > On Thu, Mar 14, 2024 at 5:26 PM Tomas Vondra
> >  wrote:
> >>
> >> On 3/14/24 19:16, Melanie Plageman wrote:
> >>> On Thu, Mar 14, 2024 at 03:32:04PM +0200, Heikki Linnakangas wrote:
>  ...
> 
>  Ok, committed that for now. Thanks for looking!
> >>>
> >>> Attached v6 is rebased over your new commit. It also has the "fix" in
> >>> 0010 which moves BitmapAdjustPrefetchIterator() back above
> >>> table_scan_bitmap_next_block(). I've also updated the Streaming Read API
> >>> commit (0013) to Thomas' v7 version from [1]. This has the update that
> >>> we theorize should address some of the regressions in the bitmapheapscan
> >>> streaming read user in 0014.
> >>>
> >>
> >> Should I rerun the benchmarks with these new patches, to see if it
> >> really helps with the regressions?
> >
> > That would be awesome!
> >
>
> OK, here's a couple charts comparing the effect of v6 patches to master.
> These are from 1M and 10M data sets, same as the runs presented earlier
> in this thread (the 10M is still running, but should be good enough for
> this kind of visual comparison).

Thanks for doing this!

> What is even more obvious is that 0014 behaves *VERY* differently. I'm
> not sure if this is a good thing or a problem is debatable/unclear. I'm
> sure we don't want to cause regressions, but perhaps those are due to
> the prefetch issue discussed elsewhere in this thread (identified by
> Andres and Melanie). There are also many cases that got much faster, but
> the question is whether this is due to better efficiency or maybe the
> new code being more aggressive in some way (not sure).

Are these with the default effective_io_concurrency (1)? If so, the
"effective" prefetch distance in many cases will be higher with the
streaming read code applied. With effective_io_concurrency 1,
"max_ios" will always be 1, but the number of blocks prefetched may
exceed this (up to MAX_BUFFERS_PER_TRANSFER) because the streaming
read code is always trying to build bigger IOs. And, if prefetching,
it will prefetch IOs not yet in shared buffers before reading them.

It's hard to tell without going into a specific repro why this would
cause some queries to be much slower. In the forced bitmapheapscan, it
would make sense that more prefetching is worse -- which is why a
bitmapheapscan plan wouldn't have been chosen. But in the optimal
cases, it is unclear why it would be worse.

I don't think there is any way it could be the issue Andres
identified, because there is only one iterator. Nothing to get out of
sync. It could be that the fadvises are being issued too close to the
reads and aren't effective enough at covering up read latency on
slower, older hardware. But that doesn't explain why master would
sometimes be faster.

Probably the only thing we can do is get into a repro. It would, of
course, be easiest to do this with a serial query. I can dig into the
scripts you shared earlier and try to find a good repro. Because the
regressions may have shifted with Thomas' new version, it would help
if you shared a category (cyclic/uniform/etc, parallel or serial, eic
value, work mem, etc) where you now see the most regressions.
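
Something along these lines is the kind of serial repro I have in mind (table
name and constants are placeholders, to be filled in from whichever category
shows the regression):

    SET max_parallel_workers_per_gather = 0;
    SET effective_io_concurrency = 1;
    SET work_mem = '4MB';
    SET enable_seqscan = off;
    SET enable_indexscan = off;    -- leave only the bitmap heap scan plan
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM bitmap_test WHERE a BETWEEN 1000 AND 5000;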

- Melanie




Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-18 Thread Dean Rasheed
On Sun, 17 Mar 2024 at 19:39, Tom Lane  wrote:
>
> Here's a cleaned-up version that seems to successfully resolve
> PARAM_EXEC references in all cases.  I haven't changed the
> "(plan_name).colN" notation, but that's an easy fix once we have
> consensus on the spelling.

I took a more detailed look at this and the code and doc changes all
look good to me.

There's a comment at the end of find_param_generator() that should
probably say "No generator found", rather than "No referent found".

The get_rule_expr() code could perhaps be simplified a bit, getting
rid of the show_subplan_name variable and moving the recursive calls
to get_rule_expr() to after the switch statement -- if testexpr is
non-NULL, print it, else print the subplan name probably works for all
subplan types.

The "colN" notation has grown on me, especially when you look at
examples like those in partition_prune.out with a mix of Param types.
Not using "$n" for 2 different purposes is good, and I much prefer
this to the original "FROM [HASHED] SubPlan N (returns ...)" notation.

> There are two other loose ends bothering me:
>
> 1.  I see that Gather nodes sometimes print things like
>
>->  Gather (actual rows=N loops=N)
>  Workers Planned: 2
>  Params Evaluated: $0, $1
>  Workers Launched: N
>
> I propose we nuke the "Params Evaluated" output altogether.

+1

> 2.  MULTIEXPR_SUBLINK subplans now result in EXPLAIN output like
>
>->  Result
>  Output: (SubPlan 1).col1, (SubPlan 1).col2, (SubPlan 1), i.tableoid, 
> i.ctid
>
> The undecorated reference to (SubPlan 1) is fairly confusing, since
> it doesn't correspond to anything that will actually get output.
> I suggest that perhaps instead this should read
>
>  Output: (SubPlan 1).col1, (SubPlan 1).col2, IGNORE(SubPlan 1), 
> i.tableoid, i.ctid
>
> or
>
>  Output: (SubPlan 1).col1, (SubPlan 1).col2, RESET(SubPlan 1), 
> i.tableoid, i.ctid

I think "RESET()" or "RESCAN()" or something like that is better than
"INGORE()", because it indicates that it is actually doing something.
I don't really have a better idea. Perhaps not all uppercase though,
since that seems to go against the rest of the EXPLAIN output.

Regards,
Dean




Re: DRAFT: Pass sk_attno to consistent function

2024-03-18 Thread Michał Kłeczek
Wrong file in the previous message - sorry for the noise. Attached is a fixed patch.

Thanks,
Michal

0001-Pass-key-sk_attno-to-consistent-function.patch
Description: Binary data
On 18 Mar 2024, at 15:14, Michał Kłeczek  wrote:

I realised it might be enough to pass sk_attno to the consistent function, as that should be enough to look up metadata if needed.

Attached is a draft patch that does this.

—Michal
<0001-Pass-key-sk_attno-to-consistent-function.patch>

On 18 Mar 2024, at 14:31, Michał Kłeczek  wrote:

Hi All,

On 11 Mar 2024, at 18:58, Michał Kłeczek  wrote:

Hi All,

During my journey of designing a Pg based solution at my work I was severely hit by several shortcomings in GiST. The most severe one is the lack of support for SAOP filters, as it makes it difficult to have partition pruning and index (only) scans working together.

To overcome the difficulties I implemented a simple extension:

mkleczek/btree_gist_extra: Extra operators support for PostgreSQL btree_gist
github.com

Since it provides a separate operator class from btree_gist it requires re-indexing the data. So I thought maybe it would be a good idea to incorporate it into btree_gist.

While working on supporting (sort of) SAOP support for GiST I was stuck with the interplay between the GiST consistent function and partition pruning. Not sure how it applies to SAOP handling in general though.

I've implemented an operator class/family that supports a GiST index scan for the following query:

SELECT * FROM tbl WHERE col ||= ARRAY['a', 'b', 'c'];

It all works well except cases where tbl is partitioned by the "col" column. In this case the index scan unnecessarily scans pages for values that are not in the partition. I am not sure if it works as expected (ie. no unnecessary scans) in the case of ANY = (ARRAY[]) and nbtree.

In the case of GiST the only place where pre-processing of queries can be performed is the consistent function. But right now there is no possibility to access any scan-related information as it is not passed to the consistent function. The only thing available is GISTENTRY and it does not contain any metadata.

As a workaround I've added options to the op family that allow (redundantly) specifying MODULUS/REMAINDER for the index:

CREATE INDEX idx ON tbl_partition_01 USING gist ( col gist_text_extra_ops (modulus=4, remainder=2) )

and use the values to filter out the query array passed to the consistent function.

This is of course not ideal:
- the information is redundant
- changing the partitioning scheme requires re-indexing

I am wondering if it is possible to extend the consistent function API so that either ScanKey or even the whole IndexScanDesc is passed as an additional parameter.

—Michal

DRAFT: Pass sk_attno to consistent function

2024-03-18 Thread Michał Kłeczek
I realised it might be enough to pass sk_attno to the consistent function, as that should be enough to look up metadata if needed.

Attached is a draft patch that does this.

—Michal

0001-Pass-key-sk_attno-to-consistent-function.patch
Description: Binary data
On 18 Mar 2024, at 14:31, Michał Kłeczek  wrote:

Hi All,

On 11 Mar 2024, at 18:58, Michał Kłeczek  wrote:

Hi All,

During my journey of designing a Pg based solution at my work I was severely hit by several shortcomings in GiST. The most severe one is the lack of support for SAOP filters, as it makes it difficult to have partition pruning and index (only) scans working together.

To overcome the difficulties I implemented a simple extension:

mkleczek/btree_gist_extra: Extra operators support for PostgreSQL btree_gist
github.com

Since it provides a separate operator class from btree_gist it requires re-indexing the data. So I thought maybe it would be a good idea to incorporate it into btree_gist.

While working on supporting (sort of) SAOP support for GiST I was stuck with the interplay between the GiST consistent function and partition pruning. Not sure how it applies to SAOP handling in general though.

I've implemented an operator class/family that supports a GiST index scan for the following query:

SELECT * FROM tbl WHERE col ||= ARRAY['a', 'b', 'c'];

It all works well except cases where tbl is partitioned by the "col" column. In this case the index scan unnecessarily scans pages for values that are not in the partition. I am not sure if it works as expected (ie. no unnecessary scans) in the case of ANY = (ARRAY[]) and nbtree.

In the case of GiST the only place where pre-processing of queries can be performed is the consistent function. But right now there is no possibility to access any scan-related information as it is not passed to the consistent function. The only thing available is GISTENTRY and it does not contain any metadata.

As a workaround I've added options to the op family that allow (redundantly) specifying MODULUS/REMAINDER for the index:

CREATE INDEX idx ON tbl_partition_01 USING gist ( col gist_text_extra_ops (modulus=4, remainder=2) )

and use the values to filter out the query array passed to the consistent function.

This is of course not ideal:
- the information is redundant
- changing the partitioning scheme requires re-indexing

I am wondering if it is possible to extend the consistent function API so that either ScanKey or even the whole IndexScanDesc is passed as an additional parameter.

—Michal
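
For reference, a sketch of the setup described above (table and partition
names are invented; the ||= operator and the opclass options come from the
btree_gist_extra extension mentioned earlier):

    CREATE TABLE tbl (col text NOT NULL) PARTITION BY HASH (col);
    CREATE TABLE tbl_partition_01 PARTITION OF tbl
        FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE INDEX idx ON tbl_partition_01
        USING gist (col gist_text_extra_ops (modulus=4, remainder=2));

    -- partition pruning picks tbl_partition_01, but without the redundant
    -- (modulus, remainder) options the consistent function would still be
    -- asked about all three values, even ones that cannot be in this partition
    SELECT * FROM tbl WHERE col ||= ARRAY['a', 'b', 'c'];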

Re: Possibility to disable `ALTER SYSTEM`

2024-03-18 Thread Jelte Fennema-Nio
On Mon, 18 Mar 2024 at 13:57, Robert Haas  wrote:
> I would have been somewhat inclined to find an existing section
> of postgresql.auto.conf for this parameter, perhaps "platform and
> version compatibility".

I tried to find an existing section, but I couldn't find any that this
new GUC would fit into naturally. "Version and Platform Compatibility
/ Previous PostgreSQL Versions" (the one you suggested) seems wrong
too. The GUCs there are to get back to Postgres behaviour from
previous versions. So that section would only make sense if we'd turn
enable_alter_system off by default (which obviously no-one in this
thread suggests/wants).

If you have another suggestion for an existing category that we should
use, feel free to share. But imho, none of the existing ones are a
good fit.

> Even if that is what we're going to do, do we want to call them "guard
> rails"? I'm not sure I'd find that name terribly clear, as a user.

If anyone has a better suggestion, I'm happy to change it.


On Mon, 18 Mar 2024 at 14:09, Daniel Gustafsson  wrote:
>
> > On 18 Mar 2024, at 13:57, Robert Haas  wrote:
>
> > my proposal is something like this, taking a
> > bunch of text from Jelte's patch and some inspiration from Magnus's
> > earlier remarks:
>
> I still think any wording should clearly mention that settings in the file are
> still applied.  The proposed wording says to implicitly but to avoid confusion
> I think it should be explicit.

I updated the first two paragraphs with Robert's wording (and did not
remove the third one, as that addresses the point made by Daniel).


v5-0001-Add-enable_alter_system-GUC.patch
Description: Binary data


documentation structure

2024-03-18 Thread Robert Haas
I was looking at the documentation index this morning[1], and I can't
help feeling like there are some parts of it that are over-emphasized
and some parts that are under-emphasized. I'm not sure what we can do
about this exactly, but I thought it worth writing an email and seeing
what other people think.

The two sections of the documentation that seem really
under-emphasized to me are the GUC documentation and the SQL
reference. The GUC documentation is all buried under "20. Server
Configuration" and the SQL command reference is under "I. SQL
commands". For reasons that I don't understand, all chapters except
for those in "VI. Reference" are numbered, but the chapters in that
section have Roman numerals instead.

I don't know what other people's experience is, but for me, wanting to
know what a command does or what a setting does is extremely common.
Therefore, I think these chapters are disproportionately important and
should be emphasized more. In the case of the GUC reference, one idea
I have is to split up "III. Server Administration". My proposal is
that we divide it into three sections. The first would be called "III.
Server Installation" and would cover chapters 16 (installation from
binaries) through 19 (server setup and operation). The second would be
called "IV. Server Configuration" -- so every section that's currently
a subsection of "server configuration" would become a top-level
chapter. The third division would be "V. Server Administration," and
would cover the current chapters 21-33. This is probably far from
perfect, but it seems like a relatively simple change and better than
what we have now.

I don't know what to do about "I. SQL commands". It's obviously
impractical to promote that to a top-level section, because it's got a
zillion sub-pages which I don't think we want in the top-level
documentation index. But having it as one of several unnumbered
chapters interposed between 51 and 52 doesn't seem great either.

The stuff that I think is over-emphasized is as follows: (a) chapters
1-3, the tutorial; (b) chapters 4-6, which are essentially a
continuation of the tutorial, and not at all similar to chapters 8-11,
which are chock-full of detailed technical information; (c) chapters
43-46, one per procedural language; perhaps these could just be
demoted to sub-sections of chapter 42 on procedural languages; (d)
chapters 47 (server programming interface), 50 (replication progress
tracking), and 51 (archive modules), all of which are important to
document but none of which seem important enough to put them in the
top-level documentation index; and (e) large parts of section "VII.
Internals," which again contain tons of stuff of very marginal
interest. The first ~4 chapters of the internals section seem like
they might be mainstream enough to justify the level of prominence
that we give them, but the rest has got to be of interest to a tiny
minority of readers.

I think it might be possible to consolidate the internals section by
grouping a bunch of existing entries together by category. Basically,
after the first few chapters, you've got stuff that is of interest to
C programmers writing core or extension code; and you've got
explainers on things like GEQO and index op-classes and support
functions which might be of interest even to non-programmers. I think
for example that we don't need separate top-level chapters on writing
procedural language handlers, FDWs, tablesample methods, custom scan
providers, table access methods, index access methods, and WAL
resource managers. Some or all of those could be grouped under a
single chapter, perhaps, e.g. Using PostgreSQL Extensibility
Interfaces.

Thoughts? I realize that this topic is HIGHLY prone to ENDLESS
bikeshedding, and it's inevitable that not everybody is going to
agree. But I hope we can agree that it's completely silly that it's
vastly easier to find the documentation about the backup manifest
format than it is to find the documentation on CREATE TABLE or
shared_buffers, and if we can agree on that, then perhaps we can agree
on some way to make things better.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/docs/16/index.html




Re: Add LSN <-> time conversion functionality

2024-03-18 Thread Daniel Gustafsson
> On 22 Feb 2024, at 03:45, Melanie Plageman  wrote:
> On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
>  wrote:

>> - Not sure why we need 0001. Just so that the "estimate" functions in
>> 0002 have a convenient "start" point? Surely we could look at the
>> current LSNTimeline data and use the oldest value, or (if there's no
>> data) use the current timestamp/LSN?
> 
> When there are 0 or 1 entries in the timeline you'll get an answer
> that could be very off if you just return the current timestamp or
> LSN. I guess that is okay?

I don't think that's a huge problem at such a young "lsn-age", but I might be
missing something.

>> - I wonder what happens if we lose the data - we know that if people
>> reset statistics for whatever reason (or just lose them because of a
>> crash, or because they're on a replica), bad things happen to
>> autovacuum. What's the (expected) impact on pruning?
> 
> This is an important question. Because stats aren't crashsafe, we
> could return very inaccurate numbers for some time/LSN values if we
> crash. I don't actually know what we could do about that. When I use
> the LSNTimeline for the freeze heuristic it is less of an issue
> because the freeze heuristic has a fallback strategy when there aren't
> enough stats to do its calculations. But other users of this
> LSNTimeline will simply get highly inaccurate results (I think?). Is
> there anything we could do about this? It seems bad.

A complication with this over stats is that we can't recompute this in case of
a crash/corruption issue.  The simple solution is to consider this unlogged
data and start fresh at every unclean shutdown, but I have a feeling that won't
be good enough for basing heuristics on?

> Andres had brought up something at some point about, what if the
> database is simply turned off for awhile and then turned back on. Even
> if you cleanly shut down, will there be "gaps" in the timeline? I
> think that could be okay, but it is something to think about.

The gaps would represent reality, so there is nothing wrong per se with gaps,
but if they inflate the interval of a bucket which in turns impact the
precision of the results then that can be a problem.

> Just a note, all of my comments could use a lot of work, but I want to
> get consensus on the algorithm before I make sure and write about it
> in a perfect way.

I'm not sure "a lot of work" is accurate, they seem pretty much there to me,
but I think that an illustration of running through the algorithm in an
ascii-art array would be helpful.


Reading through this I think such a function has merits, not only for your
usecase but other heuristic based work and quite possibly systems debugging.

While the bucketing algorithm is a clever algorithm for degrading precision for
older entries without discarding them, I do fear that we'll risk ending up with
answers like "somewhere between in the past and even further in the past".
I've been playing around with various compression algorithms for packing the
buckets such that we can retain precision for longer.  Since you were aiming to
work on other patches leading up to the freeze, let's pick this up again once
things calm down.

When compiling I got this warning for lsntime_merge_target:

pgstat_wal.c:242:1: warning: non-void function does not return a value in all 
control paths [-Wreturn-type]
}
^
1 warning generated.

The issue seems to be that the can't-really-happen path is protected with an
Assertion, which is a no-op for production builds.  I think we should handle
the error rather than rely on testing catching it (since if it does happen even
though it can't, it's going to be when it's for sure not tested).  Returning an
invalid array subscript like -1 and testing for it in lsntime_insert, with an
elog(WARNING..), seems enough.


-last_snapshot_lsn <= GetLastImportantRecPtr())
+last_snapshot_lsn <= current_lsn)
I think we should delay extracting the LSN with GetLastImportantRecPtr until we
know that we need it, to avoid taking locks in this codepath unless needed.

I've attached a diff with the above suggestions which applies on top of your
patchset.

--
Daniel Gustafsson

commit 908f3bb511450f05980ba01d42c909cc9ef8007a
Author: Daniel Gustafsson 
Date:   Thu Mar 14 14:32:15 2024 +0100

Code review hackery

diff --git a/src/backend/postmaster/bgwriter.c 
b/src/backend/postmaster/bgwriter.c
index 7d9ad9046f..115204f708 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -267,7 +267,7 @@ BackgroundWriterMain(void)
{
TimestampTz timeout = 0;
TimestampTz now = GetCurrentTimestamp();
-   XLogRecPtr  current_lsn = GetLastImportantRecPtr();
+   XLogRecPtr  current_lsn;
 
timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,

  

Re: speed up a logical replica setup

2024-03-18 Thread Peter Eisentraut

On 16.03.24 16:42, Euler Taveira wrote:

I'm attaching a new version (v30) that adds:


I have some review comments and attached a patch with some smaller 
fixups (mainly message wording and avoiding fixed-size string buffers).


* doc/src/sgml/ref/pg_createsubscriber.sgml

I would remove the "How It Works" section.  This is not relevant to
users, and it is very detailed and will require updating whenever the
implementation changes.  It could be a source code comment instead.

* src/bin/pg_basebackup/pg_createsubscriber.c

I think the connection string handling is not robust against funny
characters, like spaces, in database names etc.

Most SQL commands need to be amended for proper identifier or string
literal quoting and/or escaping.
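
For instance, the generated statements have to survive names like these
(illustrative only, not taken from the patch):

    -- identifiers and connection strings built from user-supplied names
    -- need quoting/escaping, e.g. for a database literally named "my db"
    CREATE SUBSCRIPTION "my sub"
        CONNECTION 'dbname=''my db'' host=localhost port=5432'
        PUBLICATION "my pub"
        WITH (create_slot = false, enabled = false);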

In check_subscriber(): All these permissions checks seem problematic
to me.  We shouldn't reimplement our own copy of the server's
permission checks.  The server can check the permissions.  And if the
permission checking in the server ever changes, then we have
inconsistencies to take care of.  Also, the error messages "permission
denied" are inappropriate, because we are not doing the actual thing.
Maybe we want to do a dry-run for the benefit of the user, but then we
should do the actual thing, like try to create a replication slot, or
whatever.  But I would rather just remove all this, it seems too
problematic.

In main(): The first check if the standby is running is problematic.
I think it would be better to require that the standby is initially
shut down.  Consider, the standby might be running under systemd.
This tool will try to stop it, systemd will try to restart it.  Let's
avoid these kinds of battles.  It's also safer if we don't try to
touch running servers.

The -p option (--subscriber-port) doesn't seem to do anything.  In my
testing, it always uses the compiled-in default port.

Printing all the server log lines to the terminal doesn't seem very
user-friendly.  Not sure what to do about that, short of keeping a 
pg_upgrade-style directory of log files.  But it's ugly.


From ec8e6ed6c3325a6f9fde2d1632346e212ade9c9f Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 18 Mar 2024 14:48:55 +0100
Subject: [PATCH] Various improvements

---
 src/bin/pg_basebackup/Makefile  |  2 +-
 src/bin/pg_basebackup/nls.mk|  1 +
 src/bin/pg_basebackup/pg_createsubscriber.c | 49 +
 3 files changed, 23 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index e9a920dbcda..26c53e473f5 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -49,7 +49,7 @@ all: pg_basebackup pg_createsubscriber pg_receivewal 
pg_recvlogical
 pg_basebackup: $(BBOBJS) $(OBJS) | submake-libpq submake-libpgport 
submake-libpgfeutils
$(CC) $(CFLAGS) $(BBOBJS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o 
$@$(X)
 
-pg_createsubscriber: $(WIN32RES) pg_createsubscriber.o | submake-libpq 
submake-libpgport submake-libpgfeutils
+pg_createsubscriber: pg_createsubscriber.o $(WIN32RES) | submake-libpq 
submake-libpgport submake-libpgfeutils
$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
 
 pg_receivewal: pg_receivewal.o $(OBJS) | submake-libpq submake-libpgport 
submake-libpgfeutils
diff --git a/src/bin/pg_basebackup/nls.mk b/src/bin/pg_basebackup/nls.mk
index fc475003e8e..7870cea71ce 100644
--- a/src/bin/pg_basebackup/nls.mk
+++ b/src/bin/pg_basebackup/nls.mk
@@ -8,6 +8,7 @@ GETTEXT_FILES= $(FRONTEND_COMMON_GETTEXT_FILES) \
bbstreamer_tar.c \
bbstreamer_zstd.c \
pg_basebackup.c \
+   pg_createsubscriber.c \
pg_receivewal.c \
pg_recvlogical.c \
receivelog.c \
diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c 
b/src/bin/pg_basebackup/pg_createsubscriber.c
index e24ed7ef506..91c3a2f0036 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -43,7 +43,7 @@ struct CreateSubscriberOptions
SimpleStringList sub_names; /* list of subscription names */
SimpleStringList replslot_names;/* list of replication slot 
names */
int recovery_timeout;   /* stop recovery after 
this time */
-}  CreateSubscriberOptions;
+};
 
 struct LogicalRepInfo
 {
@@ -57,7 +57,7 @@ struct LogicalRepInfo
 
boolmade_replslot;  /* replication slot was created */
boolmade_publication;   /* publication was created */
-}  LogicalRepInfo;
+};
 
 static void cleanup_objects_atexit(void);
 static void usage();
@@ -155,9 +155,9 @@ cleanup_objects_atexit(void)
 */
if (recovery_ended)
{
-   pg_log_warning("pg_createsubscriber failed after the end of 
recovery");
-   pg_log_warning_hint("The target server 

Re: Is there still password max length restrictions in PG?

2024-03-18 Thread Sean
Thanks Daniel.
That's a big help to me!


------ Original ------
From: "Daniel Gustafsson"



Re: Is there still password max length restrictions in PG?

2024-03-18 Thread Daniel Gustafsson
> On 18 Mar 2024, at 14:43, Sean  wrote:
> 
> Hi, 
> Thanks for your information.  Even using SCRAM,  when specified the content 
> of "password",  still there is a basic request about the length of it.  From 
> the source code, seems there is no restriction, right? 

SCRAM stores a hashed fixed-size representation of the password, so there is no
restriction in terms of length on the user supplied secret.
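
An easy way to see this is that the stored verifier has the same shape
regardless of how long the supplied password is, e.g.:

    SET password_encryption = 'scram-sha-256';
    CREATE ROLE short_pw LOGIN PASSWORD 'x';
    CREATE ROLE long_pw  LOGIN PASSWORD 'a-much-much-much-much-much-longer-password';
    -- both verifiers have identical structure and length
    SELECT rolname, length(rolpassword) AS verifier_length
    FROM pg_authid WHERE rolname IN ('short_pw', 'long_pw');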

--
Daniel Gustafsson





Java : Postgres double precession issue with different data format text and binary

2024-03-18 Thread Rahul Uniyal

Hello Team,

Hope everyone is doing well here.

I am writing this email to understand an issue I'm facing when fetching data in 
our Java application. We are using PostgreSQL JDBC Driver version 42.6.0.

Issue:

We are encountering an issue where double precision values in PostgreSQL 
are fetched with inconsistent formatting. For example, in the 
database the value is 40, but sometimes this value is fetched as 40.0. 
Similarly, a value of 0.0005 is sometimes fetched as 0.00050, resulting in 
extra trailing zeros.

While debugging, it seems like this issue is caused by the different data 
formats, such as Text and Binary. There is some logic in the PgResultSet class 
that converts values based on this data format.

Example:

Below is an example where we are getting different data formats for the same 
table:

Text Format: [Field(id,FLOAT8,8,T), Field(client_id,FLOAT8,8,T), 
Field(create_ts,TIMESTAMP,8,T), ...]

Binary Format: [Field(id,FLOAT8,8,B), Field(client_id,FLOAT8,8,B), ...] (notice 
some format changes)

We are not sure why different formats are coming for the same table.
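
For what it's worth, the stored value is identical in both cases; a quick
server-side check confirms that only the client-side rendering differs:

    SELECT 40::double precision = 40.0::double precision        AS same_value_1,
           0.0005::double precision = 0.00050::double precision AS same_value_2;
    -- both columns return true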

Schema:

Below is the schema for the table used:

SQL:

 CREATE TABLE IF NOT EXISTS SUBMISSION_QUEUE (
   ID                      DOUBLE PRECISION,
   CLIENT_ID               DOUBLE PRECISION,
   OCODE                   VARCHAR(20) NOT NULL,
   PAYLOAD_TYPE            VARCHAR(20),
   REPOSITORY              VARCHAR(16),
   SUB_REPOSITORY          VARCHAR(20),
   FORCE_GENERATION_FLAG   BOOLEAN,
   IS_JMX_CALL             BOOLEAN,
   INSTANCE_ID             DOUBLE PRECISION,
   CREATE_TS               TIMESTAMP(6) NOT NULL
 );

Request:

Team, would it be possible to give some insight on this issue? Any help would 
be greatly appreciated.

Thanks,

Re: Is there still password max length restrictions in PG?

2024-03-18 Thread Sean
Hi,
Thanks for your information. Even using SCRAM, when specified the 
content of "password", still there is a basic request about the length of 
it. From the source code, seems there is no restriction, right?

Is it reasonable?

BR,
Sean He


--Original--
From:   
 "Daniel Gustafsson"



  1   2   >