Re: XLog size reductions: Reduced XLog record header size for PG17

2024-04-06 Thread Andrey M. Borodin



> On 3 Jan 2024, at 15:15, Pavel Borisov  wrote:
> 
> Hi and Happy New Year!
> 
> I've looked through the patches and the change seems quite small and 
> justified. But on a second look, some doubt arises as to whether this long 
> patchset indeed introduces enough performance gain. I may be wrong, but it 
> saves only a few bytes, and the performance gain would appear only in some 
> specific artificial workloads.  Did you do some measurements? Do we gain 
> several percent performance-wise?
> 
> Kind regards,
> Pavel Borisov

Hi Matthias!

This is a gentle reminder that the thread is waiting for your reply. Are you 
interested in CF entry [0]?

Thanks!


Best regards, Andrey Borodin.
[0] https://commitfest.postgresql.org/47/4386/



Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

2024-04-06 Thread Andrey M. Borodin



> On 15 Mar 2024, at 17:12, Nazir Bilal Yavuz  wrote:
> 
> I did not have the time to check other things you mentioned but I
> tested the read performance. The table size is 5.5GB, I did 20 runs in
> total.

Hi Nazir!

Do you plan to review anything else? Or do you think it is worth a look by 
someone else? Or is the patch Ready for Committer? If so, please set CF entry 
[0] to the appropriate status; currently it's "Waiting on Author".


Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/47/4763/



Re: Improve heapgetpage() performance, overhead from serializable

2024-04-06 Thread John Naylor
On Sun, Apr 7, 2024 at 11:49 AM Andres Freund  wrote:
>
> Hi,
>
> On 2024-01-22 13:01:31 +0700, John Naylor wrote:
> > On Mon, Jul 17, 2023 at 9:58 PM Andres Freund  wrote:
> > > And 397->320ms
> > > for something as core as this, is imo worth considering on its own.
> >
> > Hi Andres, this interesting work seems to have fallen off the radar --
> > are you still planning to move forward with this for v17?
>
> I had completely forgotten about this patch, but some discussion around
> streaming read reminded me of it.  Here's a rebased version, with conflicts
> resolved and very light comment polish and a commit message. Given that
> there have been no changes otherwise in the last months, I'm inclined to push in
> a few hours.

Just in time ;-) The commit message should also have "reviewed by
Zhang Mingli" and "tested by Quan Zongliang", who shared results in a
thread for a withdrawn CF entry with a similar idea but covering fewer
cases:

https://www.postgresql.org/message-id/2ef7ff1b-3d18-2283-61b1-bbd25fc6c7ce%40yeah.net
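For readers following along, the optimization pattern being committed here — an always-inline loop body called with constant boolean arguments, so the compiler emits a specialized loop per flag combination — can be sketched in miniature (hypothetical names, not the PostgreSQL code):

```c
#include <assert.h>
#include <stdbool.h>

/* Always-inline helper: when called with constant flags, the compiler can
 * constant-fold the per-iteration branches away in each specialization. */
static inline int
collect(const int *vals, int n, bool only_positive, bool only_even)
{
	int			count = 0;

	for (int i = 0; i < n; i++)
	{
		bool		ok = only_positive ? (vals[i] > 0) : true;

		if (only_even && ok)
			ok = (vals[i] % 2 == 0);
		if (ok)
			count++;
	}
	return count;
}

int
collect_dispatch(const int *vals, int n, bool only_positive, bool only_even)
{
	/* Two per-call checks select one of four specialized loops. */
	if (only_positive)
		return only_even ? collect(vals, n, true, true)
			: collect(vals, n, true, false);
	else
		return only_even ? collect(vals, n, false, true)
			: collect(vals, n, false, false);
}
```

The per-page checks in heap_prepare_pagescan() play the role of collect_dispatch() here; the price is a couple of extra branches per page rather than per tuple.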




Re: Improve heapgetpage() performance, overhead from serializable

2024-04-06 Thread Andres Freund
Hi,

On 2024-01-22 13:01:31 +0700, John Naylor wrote:
> On Mon, Jul 17, 2023 at 9:58 PM Andres Freund  wrote:
> > And 397->320ms
> > for something as core as this, is imo worth considering on its own.
> 
> Hi Andres, this interesting work seems to have fallen off the radar --
> are you still planning to move forward with this for v17?

I had completely forgotten about this patch, but some discussion around
streaming read reminded me of it.  Here's a rebased version, with conflicts
resolved and very light comment polish and a commit message. Given that
there have been no changes otherwise in the last months, I'm inclined to push in
a few hours.

Greetings,

Andres Freund
>From f2158455ac1a10eb393e25f7ddf87433a98e2ab0 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Sat, 6 Apr 2024 20:51:07 -0700
Subject: [PATCH v2] Reduce branches in heapgetpage()'s per-tuple loop

Until now, heapgetpage()'s loop over all tuples performed some conditional
checks for each tuple, even though the conditions did not change across the loop.

This commit fixes that by moving the loop into an inline function. By calling
it with different constant arguments, the compiler can generate an optimized
loop for the different conditions, at the price of two per-page checks.

For cases of all-visible tables and an isolation level other than
serializable, speedups of up to 25% have been measured.

Reviewed-by: John Naylor 
Discussion: https://postgr.es/m/20230716015656.xjvemfbp5fysj...@awork3.anarazel.de
---
 src/backend/access/heap/heapam.c | 102 ++-
 1 file changed, 74 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dada2ecd1e3..cbbc87dec9a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -364,6 +364,56 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 	scan->rs_numblocks = numBlks;
 }
 
+/*
+ * Per-tuple loop for heapgetpage() in pagemode. Pulled out so it can be
+ * called multiple times, with constant arguments for all_visible,
+ * check_serializable.
+ */
+pg_attribute_always_inline
+static int
+heapgetpage_collect(HeapScanDesc scan, Snapshot snapshot,
+	Page page, Buffer buffer,
+	BlockNumber block, int lines,
+	bool all_visible, bool check_serializable)
+{
+	int			ntup = 0;
+	OffsetNumber lineoff;
+
+	for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
+	{
+		ItemId		lpp = PageGetItemId(page, lineoff);
+		bool		valid;
+		HeapTupleData loctup;
+
+		if (!ItemIdIsNormal(lpp))
+			continue;
+
+		loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+		loctup.t_len = ItemIdGetLength(lpp);
+		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+		ItemPointerSet(&(loctup.t_self), block, lineoff);
+
+		if (all_visible)
+			valid = true;
+		else
+			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
+
+		if (check_serializable)
+			HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd,
+&loctup, buffer, snapshot);
+
+		if (valid)
+		{
+			scan->rs_vistuples[ntup] = lineoff;
+			ntup++;
+		}
+	}
+
+	Assert(ntup <= MaxHeapTuplesPerPage);
+
+	return ntup;
+}
+
 /*
  * heap_prepare_pagescan - Prepare current scan page to be scanned in pagemode
  *
@@ -379,9 +429,8 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
-	int			ntup;
-	OffsetNumber lineoff;
 	bool		all_visible;
+	bool		check_serializable;
 
 	Assert(BufferGetBlockNumber(buffer) == block);
 
@@ -403,7 +452,6 @@ heap_prepare_pagescan(TableScanDesc sscan)
 
 	page = BufferGetPage(buffer);
 	lines = PageGetMaxOffsetNumber(page);
-	ntup = 0;
 
 	/*
 	 * If the all-visible flag indicates that all tuples on the page are
@@ -426,37 +474,35 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	 * tuple for visibility the hard way.
 	 */
 	all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery;
+	check_serializable =
+		CheckForSerializableConflictOutNeeded(scan->rs_base.rs_rd, snapshot);
 
-	for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
+	/*
+	 * We call heapgetpage_collect() with constant arguments, for faster code,
+	 * without having to duplicate too much logic. The separate calls with
+	 * constant arguments are needed on several compilers to actually perform
+	 * constant folding.
+	 */
+	if (likely(all_visible))
 	{
-		ItemId		lpp = PageGetItemId(page, lineoff);
-		HeapTupleData loctup;
-		bool		valid;
-
-		if (!ItemIdIsNormal(lpp))
-			continue;
-
-		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
-		loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp);
-		loctup.t_len = ItemIdGetLength(lpp);
-		ItemPointerSet(&(loctup.t_self), block, lineoff);
-
-		if (all_visible)
-			valid = true;
+		if (likely(!check_serializable))
+			scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer,
+   block, lines, true, false);
 		else
-			valid = HeapTupleSatisf

Re: CASE control block broken by a single line comment

2024-04-06 Thread Tom Lane
Erik Wienhold  writes:
> On 2024-04-06 20:14 +0200, Michal Bartak wrote:
>> The issue described below exists in postgresql ver 16.2 (found in some
>> previous major versions)

> Can confirm also on master.

I'm sure it's been there a while :-(

> I'm surprised that the comment is not skipped by the scanner at this
> point.  Maybe because the parser just reads the raw expression between
> WHEN and THEN with plpgsql_append_source_text via read_sql_construct.

> How about the attached patch?  It's a workaround by simply adding a line
> feed character between the raw expression and the closing parenthesis.

I don't have time to look into this on this deadline weekend, but
what's bothering me about this report is the worry that we've made
the same mistake elsewhere, or will do so in future.  I suspect
it'd be much more robust if we could remove the comment from the
expr->query string.  No idea how hard that is.

regards, tom lane




Re: remaining sql/json patches

2024-04-06 Thread jian he
hi.
about v50.
+/*
+ * JsonTableSiblingJoin -
+ * Plan to union-join rows of nested paths of the same level
+ */
+typedef struct JsonTableSiblingJoin
+{
+ JsonTablePlan plan;
+
+ JsonTablePlan *lplan;
+ JsonTablePlan *rplan;
+} JsonTableSiblingJoin;

"Plan to union-join rows of nested paths of the same level"
Isn't "same level" misleading?
I think it means
"Plan to union-join rows when the top-level columns clause contains a nested path"

+ if (IsA(planstate->plan, JsonTableSiblingJoin))
+ {
+ /* Fetch new from left sibling. */
+ if (!JsonTablePlanNextRow(planstate->left))
+ {
+ /*
+ * Left sibling ran out of rows, fetch new from right sibling.
+ */
+ if (!JsonTablePlanNextRow(planstate->right))
+ {
+ /* Right sibling and thus the plan has now more rows. */
+ return false;
+ }
+ }
+ }
/* Right sibling and thus the plan has now more rows. */
I think you mean:
/* Right sibling ran out of rows and thus the plan has no more rows. */


in  section,
+  | NESTED PATH json_path_specification
 AS path_name 
+COLUMNS ( json_table_column
, ... )

maybe make it into one line.

  | NESTED PATH json_path_specification
 AS path_name  COLUMNS
( json_table_column ,
... )

since in the surrounding pattern, a next line beginning with "[" means
that the line is optional.


+ at arbitrary nesting levels.
maybe
+ at arbitrary nested level.

in src/tools/pgindent/typedefs.list, "JsonPathSpec" is unnecessary.

other than that, it looks good to me.




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-06 Thread Melanie Plageman
On Sun, Apr 07, 2024 at 02:27:43AM +0200, Tomas Vondra wrote:
> On 4/6/24 23:34, Melanie Plageman wrote:
> > ...
> >>
> >> I realized it makes more sense to add a FIXME (I used XXX; I'm not sure
> >> when to use which) with a link to the message where Andres describes why he
> >> thinks it is a bug. If we plan on fixing it, it is good to have a record
> >> of that. And it makes it easier to put a clear and accurate comment.
> >> Done in 0009.
> >>
> >>> OK, thanks. I think 0001-0008 are ready to go, with some minor tweaks
> >>> per above (tuple vs. tuples etc.), and the question about the recheck
> >>> flag. If you can do these tweaks, I'll get that committed today and we
> >>> can try to get a couple more patches in tomorrow.
> > 
> > Attached v19 rebases the rest of the commits from v17 over the first
> > nine patches from v18. All patches 0001-0009 are unchanged from v18. I
> > have made updates and done cleanup on 0010-0021.
> > 
> 
> I've pushed 0001-0005, I'll get back to this tomorrow and see how much
> more we can get in for v17.

Thanks! I thought about it a bit more, and I got worried about the

Assert(scan->rs_empty_tuples_pending == 0);

in heap_rescan() and heap_endscan().

I was worried that if we don't complete the scan, it could end up tripping
incorrectly.

I tried to come up with a query which didn't end up emitting all of the
tuples on the page (using a LIMIT clause), but I struggled to come up
with an example that qualified for the skip fetch optimization and also
returned before completing the scan.

I could work a bit harder tomorrow to try and come up with something.
However, I think it might be safer to just change these to:

scan->rs_empty_tuples_pending = 0
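As a standalone sketch (a toy struct, not the actual heapam code), the defensive version would simply reset the counter rather than assert on it:

```c
#include <assert.h>

/* Toy stand-in for the heap scan descriptor's skip-fetch counter. */
typedef struct ScanState
{
	int			rs_empty_tuples_pending;
} ScanState;

/*
 * Ending or rescanning a scan that stopped early (e.g. under a LIMIT) may
 * leave skipped-fetch tuples pending, so reset the counter instead of
 * asserting that it is zero.
 */
static void
scan_end(ScanState *scan)
{
	scan->rs_empty_tuples_pending = 0;
}
```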

> What bothers me about 0006-0008 is that the justification in the commit
> messages is "future commit will do something". I think it's fine to have
> separate "preparatory" patches (I really like how you structured the
> patches this way), but it'd be good to have them right before that
> "future" commit - I'd like not to have one in v17 and then the "future
> commit" in v18, because that's an annoying complication for backpatching
> (and probably also when implementing the AM?), etc.

Yes, I was thinking about this also.

> AFAICS for v19, the "future commit" for all three patches (0006-0008) is
> 0012, which introduces the unified iterator. Is that correct?

Actually, those patches (v19 0006-0008) were required for v19 0009,
which is why I put them directly before it. 0009 eliminates use of the
TBMIterateResult for control flow in BitmapHeapNext().

I've rephrased the commit messages to not mention future commits and
instead focus on what the changes in the commit are enabling.

v19-0006 actually squashed very easily with v19-0009 and is probably
better that way. It is still easy to understand, IMO.

In v20, I've attached just the functionality from v19 0006-0009 but in
three patches instead of four.

> Also, for 0008 I'm not sure we could even split it between v17 and v18,
> because even if heapam did not use the iterator, what if some other AM
> uses it? Without 0012 it'd be a problem for the AM, no?

The iterators in the TableScanDescData were introduced in v19-0009. It
is okay for other AMs to use them. In fact, they will want to use them.
They are still initialized and set up in BitmapHeapNext(). The AMs would
just need to call tbm_iterate()/tbm_shared_iterate() on them.

As for how table AMs will cope without the TBMIterateResult passed to
table_scan_bitmap_next_tuple() (which is what v19 0008 did): they can
save the location of the tuples to be scanned somewhere in their scan
descriptor. Heap AM already did this and actually didn't use the
TBMIterateResult at all.

> Would it make sense to move 0009 before these three patches? That seems
> like a meaningful change on its own, right?

Since v19 0009 requires these patches, I don't think we could do that.
I think up to and including 0009 would be an improvement in clarity and
function.

As for all the patches after 0009, I've dropped them from this version.
We are out of time, and they need more thought.

After we decided not to pursue streaming bitmapheapscan for 17, I wanted
to make sure we removed the prefetch code table AM violation -- since we
weren't deleting that code. So what started out as me looking for a way
to clean up one commit ended up becoming a much larger project. Sorry
about that last minute code explosion! I do think there is a way to do
it right and make it nice. Also that violation would be gone if we
figure out how to get streaming bitmapheapscan behaving correctly.

So, there's just more motivation to make streaming bitmapheapscan
awesome for 18!

Given all that, I've only included the three patches I think we are
considering (former v19 0006-0008). They are largely the same as you saw
them last except for squashing the two commits I mentioned above and
updating all of the commit messages.

> FWIW I don't think it's very likely I'll commit the Un

Re: Fixing backslash dot for COPY FROM...CSV

2024-04-06 Thread Tom Lane
Bruce Momjian  writes:
> On Sun, Apr  7, 2024 at 12:00:25AM +0200, Daniel Verite wrote:
>> Agreed we don't want to document that, but also why doesn't \. in the
>> contents represent just a dot (as opposed to being an error),
>> just like \a is "a"?

> I looked into this and started to realize that \. is the only copy
> backslash sequence whose behavior we define only when it is alone at the
> beginning of a line, and not in other circumstances.  The \a example
> above suggests \. should be a period in all other cases, as suggested, but
> I don't know the ramifications of that.

Here's the problem: if some client-side code thinks it's okay to
quote "." as "\.", what is likely to happen when it's presented
a data value consisting only of "."?  It could very easily fall
into the trap of producing an end-of-data marker.

If we get rid of the whole concept of end-of-data markers, then
it'd be totally reasonable to accept "\." as ".".  But as long
as we still have end-of-data markers, I think it's unwise to allow
"\." to appear as anything but an end-of-data marker.  It'd just
add camouflage to the booby trap.
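To make the trap concrete, here is a toy client-side quoting routine (purely hypothetical — shown only to illustrate the failure mode) that backslash-escapes '.' and thereby turns a lone "." into the end-of-data marker:

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical client-side quoting that backslash-escapes '.'.  A data
 * value consisting only of "." becomes "\.", which COPY text format
 * would read as the end-of-data marker, not as a period.
 */
static void
quote_value(const char *val, char *out, size_t outsz)
{
	size_t		j = 0;

	for (size_t i = 0; val[i] != '\0' && j + 2 < outsz; i++)
	{
		if (val[i] == '.')
			out[j++] = '\\';
		out[j++] = val[i];
	}
	out[j] = '\0';
}
```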

regards, tom lane




Re: Popcount optimization using AVX512

2024-04-06 Thread Nathan Bossart
On Sat, Apr 06, 2024 at 02:41:01PM -0500, Nathan Bossart wrote:
> Here is what I have staged for commit, which I intend to do shortly.

Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Table AM Interface Enhancements

2024-04-06 Thread Alexander Korotkov
Hi, Pavel!

On Fri, Apr 5, 2024 at 6:58 PM Pavel Borisov  wrote:
> On Tue, 2 Apr 2024 at 19:17, Jeff Davis  wrote:
>>
>> On Tue, 2024-04-02 at 11:49 +0300, Alexander Korotkov wrote:
>> > I don't like the idea that every custom table AM's reloptions should
>> > begin with StdRdOptions.  I would rather introduce a new data
>> > structure with the table options that need to be accessed outside of
>> > the table AM.  Then reloptions will be a black box only directly used in
>> > the table AM, while the table AM has freedom in what to store in reloptions
>> > and how to calculate the externally-visible options.  What do you think?
>>
>> Hi Alexander!
>>
>> I agree with all of that. It will take some refactoring to get there,
>> though.
>>
>> One idea is to store StdRdOptions like normal, but if an unrecognized
>> option is found, ask the table AM if it understands the option. In that
>> case I think we'd just use a different field in pg_class so that it can
>> use whatever format it wants to represent its options.
>>
>> Regards,
>> Jeff Davis
>
> I tried to rework the table AM patch according to the input from 
> Alexander and Jeff.
>
> It splits table reloptions into two categories:
> - common for all tables (stored in a fixed size structure and could be 
> accessed from outside)
> - table-am specific (variable size, parsed and accessed by access method only)

Thank you for your work.  Please check the revised patch.

It makes CommonRdOptions a separate data structure, not directly
involved in parsing the reloptions.  Instead, the table AM can fill it
on the basis of its reloptions or calculate it another way.  The patch
comes with a test module, which contains a heap-based table AM.  This
table AM has an "enable_parallel" reloption, which is used as the basis
for setting the value of CommonRdOptions.parallel_workers.

--
Regards,
Alexander Korotkov


v10-0001-Custom-reloptions-for-table-AM.patch
Description: Binary data


Re: Fixing backslash dot for COPY FROM...CSV

2024-04-06 Thread Bruce Momjian
On Sun, Apr  7, 2024 at 12:00:25AM +0200, Daniel Verite wrote:
>   Tom Lane wrote:
> 
> > This is sufficiently weird that I'm starting to come around to
> > Daniel's original proposal that we just drop the server's recognition
> > of \. altogether (which would allow removal of some dozens of lines of
> > complicated and now known-buggy code)
> 
> FWIW my plan was to not change anything in the TEXT mode,
> but I wasn't aware it had this issue that you found when
> \. is not in a line by itself.
> 
> >  Alternatively, we could fix it so that \. at the end of a line draws
> > "end-of-copy marker corrupt"
> > which would at least make things consistent, but I'm not sure that has
> > any great advantage.  I surely don't want to document the current
> > behavioral details as being the right thing that we're going to keep
> > doing.
> 
> Agreed we don't want to document that, but also why doesn't \. in the
> contents represent just a dot (as opposed to being an error),
> just like \a is "a"?

I looked into this and started to realize that \. is the only copy
backslash sequence whose behavior we define only when it is alone at the
beginning of a line, and not in other circumstances.  The \a example
above suggests \. should be a period in all other cases, as suggested, but
I don't know the ramifications of that.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-04-06 Thread John Naylor
On Mon, Apr 1, 2024 at 9:54 AM Masahiko Sawada  wrote:
>
> Thank you! I've attached the patch that I'm going to push tomorrow.

Excellent!

I've attached a mostly-polished update on runtime-embeddable values,
storing up to 3 offsets in the child pointer (1 on 32-bit platforms).
As discussed, this includes a macro to cap the maximum possible offset
that can be stored in the bitmap, which I believe only reduces the valid
offset range for 32kB pages on 32-bit platforms. Even there, it allows
for more line pointers than can possibly be useful. It is also split
into two parts for readability. It would be committed in two pieces as
well, since they are independently useful.
From 24bd672deb4a6fa14abfc8583b500d1cbc332032 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Fri, 5 Apr 2024 17:04:12 +0700
Subject: [PATCH v83 1/3] store offsets in the header

---
 src/backend/access/common/tidstore.c  | 52 +++
 .../test_tidstore/expected/test_tidstore.out  | 14 +
 .../test_tidstore/sql/test_tidstore.sql   |  5 ++
 3 files changed, 71 insertions(+)

diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index e1a7e82469..4eb5d46951 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -34,6 +34,9 @@
 /* number of active words for a page: */
 #define WORDS_PER_PAGE(n) ((n) / BITS_PER_BITMAPWORD + 1)
 
+/* number of offsets we can store in the header of a BlocktableEntry */
+#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint16)) / sizeof(OffsetNumber))
+
 /*
  * This is named similarly to PagetableEntry in tidbitmap.c
  * because the two have a similar function.
@@ -41,6 +44,10 @@
 typedef struct BlocktableEntry
 {
 	uint16		nwords;
+
+	/* We can store a small number of offsets here to avoid wasting space with a sparse bitmap. */
+	OffsetNumber full_offsets[NUM_FULL_OFFSETS];
+
 	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];
 } BlocktableEntry;
 #define MaxBlocktableEntrySize \
@@ -316,6 +323,25 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
 	for (int i = 1; i < num_offsets; i++)
 		Assert(offsets[i] > offsets[i - 1]);
 
+	memset(page, 0, offsetof(BlocktableEntry, words));
+
+	if (num_offsets <= NUM_FULL_OFFSETS)
+	{
+		for (int i = 0; i < num_offsets; i++)
+		{
+			OffsetNumber off = offsets[i];
+
+			/* safety check to ensure we don't overrun bit array bounds */
+			if (!OffsetNumberIsValid(off))
+elog(ERROR, "tuple offset out of range: %u", off);
+
+			page->full_offsets[i] = off;
+		}
+
+		page->nwords = 0;
+	}
+	else
+	{
 	for (wordnum = 0, next_word_threshold = BITS_PER_BITMAPWORD;
 		 wordnum <= WORDNUM(offsets[num_offsets - 1]);
 		 wordnum++, next_word_threshold += BITS_PER_BITMAPWORD)
@@ -343,6 +369,7 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
 
 	page->nwords = wordnum;
 	Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1]));
+}
 
 	if (TidStoreIsShared(ts))
 		shared_ts_set(ts->tree.shared, blkno, page);
@@ -369,6 +396,18 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
 	if (page == NULL)
 		return false;
 
+	if (page->nwords == 0)
+	{
+		/* we have offsets in the header */
+		for (int i = 0; i < NUM_FULL_OFFSETS; i++)
+		{
+			if (page->full_offsets[i] == off)
+return true;
+		}
+		return false;
+	}
+	else
+	{
 	wordnum = WORDNUM(off);
 	bitnum = BITNUM(off);
 
@@ -378,6 +417,7 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
 
 	return (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;
 }
+}
 
 /*
  * Prepare to iterate through a TidStore.
@@ -496,6 +536,17 @@ tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno,
 	result->num_offsets = 0;
 	result->blkno = blkno;
 
+	if (page->nwords == 0)
+	{
+		/* we have offsets in the header */
+		for (int i = 0; i < NUM_FULL_OFFSETS; i++)
+		{
+			if (page->full_offsets[i] != InvalidOffsetNumber)
+result->offsets[result->num_offsets++] = page->full_offsets[i];
+		}
+	}
+	else
+	{
 	for (wordnum = 0; wordnum < page->nwords; wordnum++)
 	{
 		bitmapword	w = page->words[wordnum];
@@ -518,3 +569,4 @@ tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno,
 		}
 	}
 }
+}
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
index 0ae2f970da..c40779859b 100644
--- a/src/test/modules/test_tidstore/expected/test_tidstore.out
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -36,6 +36,20 @@ SELECT do_set_block_offsets(blk, array_agg(off)::int2[])
 (VALUES (0), (1), (:maxblkno / 2), (:maxblkno - 1), (:maxblkno)) AS blocks(blk),
 (VALUES (1), (2), (:maxoffset / 2), (:maxoffset - 1), (:maxoffset)) AS offsets(off)
   GROUP BY blk;
+-- Test offsets embedded in the bitmap header.
+SELECT do_set_block_offsets(501, array[greatest((random() * :maxoffset)::int, 1)]::int2[]);
+ do_set_block_offsets 
+--
+  501
+(1 row)
+
+S

Re: Change GUC hashtable to use simplehash?

2024-04-06 Thread John Naylor
On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma  wrote:
>
> But given that we know the data length and we have it in a register
> already, it's easy enough to just mask out data past the end with a
> shift. See patch 1. Performance benefit is about 1.5x Measured on a
> small test harness that just hashes and finalizes an array of strings,
> with a data dependency between consecutive hashes (next address
> depends on the previous hash output).

I pushed this with a couple of cosmetic adjustments, after fixing the
endianness issue. I'm not sure why valgrind is fine with this formulation,
while the other ways I tried forming the (little-endian) mask raised errors.
In addition to "zero_byte_low | (zero_byte_low - 1)", I tried
"~zero_byte_low & (zero_byte_low - 1)" and "zero_byte_low ^
(zero_byte_low - 1)", to no avail.
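For context, the masking idea from the committed patch — zeroing the bytes past the known length with a shift so a partial final word can be loaded and hashed safely — looks roughly like this little-endian sketch (an illustration only; the real code lives in the hash implementation):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Load an 8-byte chunk and zero everything past 'len' bytes with a shift.
 * Assumes a little-endian machine and that 8 bytes are readable at 'buf'.
 */
static uint64_t
load_masked(const char *buf, int len)
{
	uint64_t	chunk = 0;

	memcpy(&chunk, buf, 8);
	if (len < 8)
		chunk &= ~UINT64_C(0) >> ((8 - len) * 8);
	return chunk;
}
```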

On Thu, Mar 28, 2024 at 12:37 PM Jeff Davis  wrote:
> 0001 looks good to me, thank you.
>
> > v21-0003 adds a new file hashfn_unstable.c for convenience functions
> > and converts all the duplicate frontend uses of hash_string_pointer.
>
> Why not make hash_string() inline, too? I'm fine with it either way,
> I'm just curious why you went to the trouble to create a new .c file so
> it didn't have to be inlined.

Thanks for looking! I pushed these, with hash_string() inlined.

I've attached (not reindented for clarity) an update of something
mentioned a few times already -- removing strlen calls for dynahash
and dshash string keys. I'm not quite sure how the comments should be
updated about string_hash being deprecated to call directly. This
patch goes further and semi-deprecates calling it at all, so these
comments seem a bit awkward now.
From 2e41e683b2fe2bc76b808e58ca2fea9067bff4e1 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Fri, 5 Apr 2024 13:59:13 +0700
Subject: [PATCH v21] Use fasthash for string keys in dynahash and dshash

This avoids strlen calls. string_hash is kept around in case
extensions are using it.
---
 src/backend/lib/dshash.c |  5 +++--
 src/backend/utils/hash/dynahash.c| 25 -
 src/common/hashfn.c  |  4 +++-
 src/include/common/hashfn_unstable.h | 33 
 src/include/lib/dshash.h |  2 +-
 5 files changed, 59 insertions(+), 10 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 93a9e21ddd..8bebf92cb8 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -32,6 +32,7 @@
 #include "postgres.h"
 
 #include "common/hashfn.h"
+#include "common/hashfn_unstable.h"
 #include "lib/dshash.h"
 #include "storage/lwlock.h"
 #include "utils/dsa.h"
@@ -605,14 +606,14 @@ dshash_strcmp(const void *a, const void *b, size_t size, void *arg)
 }
 
 /*
- * A hash function that forwards to string_hash.
+ * A hash function that forwards to hash_string_with_len.
  */
 dshash_hash
 dshash_strhash(const void *v, size_t size, void *arg)
 {
 	Assert(strlen((const char *) v) < size);
 
-	return string_hash((const char *) v, size);
+	return hash_string_with_len((const char *) v, size - 1);
 }
 
 /*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 145e058fe6..9b85af2743 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -98,6 +98,7 @@
 
 #include "access/xact.h"
 #include "common/hashfn.h"
+#include "common/hashfn_unstable.h"
 #include "port/pg_bitutils.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
@@ -309,6 +310,18 @@ string_compare(const char *key1, const char *key2, Size keysize)
 	return strncmp(key1, key2, keysize - 1);
 }
 
+/*
+ * hash function used when HASH_STRINGS is set
+ *
+ * If the string exceeds keysize-1 bytes, we want to hash only that many,
+ * because when it is copied into the hash table it will be truncated at
+ * that length.
+ */
+static uint32
+default_string_hash(const void *key, Size keysize)
+{
+	return hash_string_with_len((const char *) key, keysize - 1);
+}
 
 /** CREATE ROUTINES **/
 
@@ -420,8 +433,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	else
 	{
 		/*
-		 * string_hash used to be considered the default hash method, and in a
-		 * non-assert build it effectively still is.  But we now consider it
+		 * string_hash used to be considered the default hash method, and
+		 * it effectively still was until version 17.  Since version 14 we consider it
 		 * an assertion error to not say HASH_STRINGS explicitly.  To help
 		 * catch mistaken usage of HASH_STRINGS, we also insist on a
 		 * reasonably long string length: if the keysize is only 4 or 8 bytes,
@@ -430,12 +443,12 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		Assert(flags & HASH_STRINGS);
 		Assert(info->keysize > 8);
 
-		hashp->hash = string_hash;
+		hashp->hash = default_string_hash;
 	}
 
 	/*
 	 * If you don't specify a match function, it defaults to string_compare if
-	 * you used str

Re: Flushing large data immediately in pqcomm

2024-04-06 Thread Andres Freund
Hi,

On 2024-04-07 00:45:31 +0200, Jelte Fennema-Nio wrote:
> On Sat, 6 Apr 2024 at 22:21, Andres Freund  wrote:
> > The small regression for small results is still kinda visible, I haven't yet
> > tested the patch downthread.
> 
> Thanks a lot for the faster test script, I'm also impatient. I still
> saw the small regression with David's patch. Here's a v6 where I
> think it is now gone. I added inline to internal_put_bytes too. I
> think that helped especially because for two calls to
> internal_put_bytes, len is a constant (1 and 4) that is smaller than
> PqSendBufferSize. So for those calls the compiler can now statically
> eliminate the new codepath because "len >= PqSendBufferSize" is known
> to be false at compile time.

Nice.
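The compile-time elimination described above can be sketched like this (hypothetical buffer code, not the pqcomm internals):

```c
#include <assert.h>
#include <string.h>

#define SEND_BUF_SIZE 8192

static char send_buf[SEND_BUF_SIZE];
static size_t send_len;

/*
 * Because this is inline, a call site with a constant len smaller than
 * SEND_BUF_SIZE lets the compiler delete the direct-send path entirely,
 * since "len >= SEND_BUF_SIZE" is known false at compile time.
 */
static inline void
put_bytes(const char *s, size_t len)
{
	if (len >= SEND_BUF_SIZE)
	{
		/* flush buffer and send 's' directly (elided in this sketch) */
		send_len = 0;
		return;
	}
	if (send_len + len > SEND_BUF_SIZE)
		send_len = 0;			/* flush (elided) */
	memcpy(send_buf + send_len, s, len);
	send_len += len;
}
```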


> Also, I incorporated all of Ranier's comments.

Changing the global vars to size_t seems mildly bogus to me. All it's
achieving is to use slightly more memory. It also just seems unrelated to the
change.

Greetings,

Andres Freund




Re: Streaming read-ready sequential scan code

2024-04-06 Thread Melanie Plageman
On Sat, Apr 6, 2024 at 9:25 PM Thomas Munro  wrote:
>
> I found a bug in read_stream.c that could be hit with Melanie's
> streaming seq scan patch with parallelism enabled and certain buffer
> pool conditions.  Short version: there is an edge case where an "if"
> needed to be a "while", or we could lose a few blocks.  Here's the fix
> for that, longer explanation in commit message.

Attached v13 0001 is your fix and 0002 is a new version of the
sequential scan streaming read user. Off-list, Andres mentioned that I
really ought to separate the parallel and serial sequential scan users
into two different callbacks. I've done that in the attached. It
actually makes the code used by the callbacks nicer and more readable
anyway (putting aside performance). I was able to measure a small
performance difference as well.

I've also added a few comments and improved existing comments.

- Melanie
From eded321df22bf472f147bd8f94b596d465355c70 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 5 Apr 2024 13:32:14 +1300
Subject: [PATCH v13 2/2] Use streaming IO in heapam sequential and TID range
 scans

Instead of calling ReadBuffer() for each block, heap sequential scans
and TID range scans now use the streaming read API introduced in
b5a9b18cd0.

Author: Melanie Plageman 
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com
---
 src/backend/access/heap/heapam.c | 234 +--
 src/include/access/heapam.h  |  15 ++
 2 files changed, 176 insertions(+), 73 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 01bb2f4cc16..9d10d42b69b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -223,6 +223,66 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * 
  */
 
+/*
+ * Streaming read API callback for parallel sequential scans. Returns the next
+ * block the caller wants from the read stream or InvalidBlockNumber when done.
+ */
+static BlockNumber
+heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data,
+	void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	Assert(ScanDirectionIsForward(scan->rs_dir));
+	Assert(scan->rs_base.rs_parallel);
+
+	if (unlikely(!scan->rs_inited))
+	{
+		/* parallel scan */
+		table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
+ scan->rs_parallelworkerdata,
+ (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+
+		/* may return InvalidBlockNumber if there are no more blocks */
+		scan->rs_prefetch_block = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
+	scan->rs_parallelworkerdata,
+	(ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+		scan->rs_inited = true;
+	}
+	else
+	{
+		scan->rs_prefetch_block = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
+	scan->rs_parallelworkerdata, (ParallelBlockTableScanDesc)
+	scan->rs_base.rs_parallel);
+	}
+
+	return scan->rs_prefetch_block;
+}
+
+/*
+ * Streaming read API callback for serial sequential and TID range scans.
+ * Returns the next block the caller wants from the read stream or
+ * InvalidBlockNumber when done.
+ */
+static BlockNumber
+heap_scan_stream_read_next_serial(ReadStream *pgsr, void *private_data,
+  void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+		   scan->rs_prefetch_block,
+		   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* 
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * 
@@ -325,6 +385,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and
+	 * because heap scans go forward before going backward (e.g. CURSORs).
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -462,12 +529,14 @@ heap_prepare_pagescan(TableScanDesc sscan)
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
- * Read the next block of the scan relation into a buffer and pin that buffer
- * before saving it in the scan descriptor.
+ * Read the next block of the scan relation from the read stream and pin that
+ * buffer before saving it in the scan descriptor.
  */
 static inline void
 heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_read_

Re: Streaming read-ready sequential scan code

2024-04-06 Thread Thomas Munro
I found a bug in read_stream.c that could be hit with Melanie's
streaming seq scan patch with parallelism enabled and certain buffer
pool conditions.  Short version: there is an edge case where an "if"
needed to be a "while", or we could lose a few blocks.  Here's the fix
for that, longer explanation in commit message.
From 43cef2d58141ba048e9349b0027afff148be5553 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 7 Apr 2024 12:36:44 +1200
Subject: [PATCH] Fix bug in read_stream.c.

When we determine that a wanted block can't be combined with the current
pending read, it's time to start that pending read to get it out of the
way.  An "if" in that code path should have been a "while", because it
might take more than one go to get that job done.  Otherwise the
remaining part of a partially started read could be clobbered and we
could lose some blocks.  This was only broken for smaller ranges, as the
more common case of io_combine_limit-sized ranges is handled earlier in
the code and knows how to loop.

Discovered while testing parallel sequential scans of partially cached
tables.  They have a ramp-down phase with ever smaller ranges of
contiguous blocks, to be fair to parallel workers as the work runs out.

Defect in commit b5a9b18c.
---
 src/backend/storage/aio/read_stream.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 9a70a81f7ae..f54dacdd914 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -363,7 +363,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		}
 
 		/* We have to start the pending read before we can build another. */
-		if (stream->pending_read_nblocks > 0)
+		while (stream->pending_read_nblocks > 0)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
-- 
2.44.0



Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs

2024-04-06 Thread Tom Lane
So what was really bothering me about this patchset was that I
didn't think marginal performance gains were a sufficient reason
to put a whole different operating mode into libpq.  However,
I've reconsidered after realizing that implementing FETCH_COUNT
atop traditional single-row mode would require either merging
single-row results into a bigger PGresult or persuading psql's
results-printing code to accept an array of PGresults not just
one.  Either of those would be expensive and ugly, not to mention
needing chunks of code we don't have today.

Also, it doesn't really need to be a whole different operating mode.
There's no reason that single-row mode shouldn't be exactly equivalent
to chunk mode with chunk size 1, except for the result status code.
(We've got to keep PGRES_SINGLE_TUPLE for the old behavior, but
using that for a chunked result would be too confusing.)

So I whacked the patch around till I liked it better, and pushed it.
I hope my haste will not come back to bite me, but we are getting
pretty hard up against the feature-freeze deadline.

regards, tom lane




Re: Cluster::restart dumping logs when stop fails

2024-04-06 Thread Andres Freund
Hi,

On 2024-04-07 00:19:35 +0200, Daniel Gustafsson wrote:
> > On 6 Apr 2024, at 23:44, Andres Freund  wrote:
> 
> > It might be useful to print a few lines, but the whole log file can be
> > several megabytes worth of output.
> 
> The non-context aware fix would be to just print the last 1024 (or something)
> bytes from the logfile:

That'd be better, yes. I'd mainly log the path to the logfile though, that's
probably at least as helpful for actually investigating the issue?

> diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm 
> b/src/test/perl/PostgreSQL/Test/Cluster.pm
> index 54e1008ae5..53d4751ffc 100644
> --- a/src/test/perl/PostgreSQL/Test/Cluster.pm
> +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
> @@ -951,8 +951,8 @@ sub start
>  
>   if ($ret != 0)
>   {
> - print "# pg_ctl start failed; logfile:\n";
> - print PostgreSQL::Test::Utils::slurp_file($self->logfile);
> + print "# pg_ctl start failed; logfile excerpt:\n";
> + print substr 
> PostgreSQL::Test::Utils::slurp_file($self->logfile), -1024;
>  
>   # pg_ctl could have timed out, so check to see if there's a pid 
> file;
>   # otherwise our END block will fail to shut down the new 
> postmaster.

That's probably an unnecessary optimization, but it seems a tad silly to read
an entire, potentially sizable, file just to use the last 1k. Not sure if the
way slurp_file() uses seek supports negative offsets; the docs read to me as
though that may only be supported with SEEK_END.

Greetings,

Andres Freund




Re: Add bump memory context type and use it for tuplesorts

2024-04-06 Thread Matthias van de Meent
On Sun, 7 Apr 2024, 01:59 David Rowley,  wrote:

> On Sun, 7 Apr 2024 at 05:45, Matthias van de Meent
>  wrote:
> > Malloc's docs specify the minimum chunk size at 4*sizeof(void*) and
> > itself uses , so using powers of 2 for chunks would indeed fail to detect
> > 1s in the 4th bit. I suspect you'll get different results when you check
> > the allocation patterns of multiples of 8 bytes, starting from 40,
> > especially on 32-bit arm (where MALLOC_ALIGNMENT is 8 bytes, rather than
> > the 16 bytes on i386 and 64-bit architectures, assuming [0] is accurate)


> I'd hazard a guess that
> there are more instances of Postgres running on Windows today than on
> 32-bit CPUs and we don't seem too worried about the bit-patterns used
> for Windows.
>

Yeah, that is something I had some thoughts about too, but I didn't check
whether there was historical context around it. I don't think it's worth
bothering with right now though.

> >> Another reason not to make it 5 bits is that I believe that would make
> >> the mcxt_methods[] array 2304 bytes rather than 576 bytes.  4 bits
> >> makes it 1152 bytes, if I'm counting correctly.
> >
> >
> > I don't think I understand why this would be relevant when only 5 of the
> > contexts are actually in use (thus in caches). Is that size concern about
> > TLB entries then?
>
> It's a static const array. I don't want to bloat the binary with
> something we'll likely never need.  If we one day need it, we can
> reserve another bit using the same overlapping method.
>

Fair points.

> >> I revised the patch to simplify hdrmask logic.  This started with me
> >> having trouble finding the best set of words to document that the
> >> offset is "half the bytes between the chunk and block".  So, instead
> >> of doing that, I've just made it so these two fields effectively
> >> overlap. The lowest bit of the block offset is the same bit as the
> >> high bit of what MemoryChunkGetValue returns.
> >
> >
> > Works for me, I suppose.
>
> hmm. I don't detect much enthusiasm for it.
>

I had a tiring day that left me short on enthusiasm, after which I realised
there were some things in this patch that would need fixing.

I could've worded this better, but nothing against this code.

-Matthias


Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-06 Thread Tomas Vondra
On 4/6/24 23:34, Melanie Plageman wrote:
> ...
>>
>> I realized it makes more sense to add a FIXME (I used XXX; I'm not sure
>> when to use which) with a link to the message where Andres describes why he
>> thinks it is a bug. If we plan on fixing it, it is good to have a record
>> of that. And it makes it easier to put a clear and accurate comment.
>> Done in 0009.
>>
>>> OK, thanks. If think 0001-0008 are ready to go, with some minor tweaks
>>> per above (tuple vs. tuples etc.), and the question about the recheck
>>> flag. If you can do these tweaks, I'll get that committed today and we
>>> can try to get a couple more patches in tomorrow.
> 
> Attached v19 rebases the rest of the commits from v17 over the first
> nine patches from v18. All patches 0001-0009 are unchanged from v18. I
> have made updates and done cleanup on 0010-0021.
> 

I've pushed 0001-0005, I'll get back to this tomorrow and see how much
more we can get in for v17.

What bothers me about 0006-0008 is that the justification in the commit
messages is "future commit will do something". I think it's fine to have
separate "preparatory" patches (I really like how you structured the
patches this way), but it'd be good to have them right before that
"future" commit - I'd like not to have one in v17 and then the "future
commit" in v18, because that's an annoying complication for backpatching
(and probably also when implementing the AM?), etc.

AFAICS for v19, the "future commit" for all three patches (0006-0008) is
0012, which introduces the unified iterator. Is that correct?

Also, for 0008 I'm not sure we could even split it between v17 and v18,
because even if heapam did not use the iterator, what if some other AM
uses it? Without 0012 it'd be a problem for the AM, no?

Would it make sense to move 0009 before these three patches? That seems
like a meaningful change on its own, right?

FWIW I don't think it's very likely I'll commit the UnifiedTBMIterator
stuff. I do agree with the idea in general, but I think I'd need more
time to think about the details. Sorry about that ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add bump memory context type and use it for tuplesorts

2024-04-06 Thread David Rowley
On Sun, 7 Apr 2024 at 05:45, Matthias van de Meent
 wrote:
> Malloc's docs specify the minimum chunk size at 4*sizeof(void*) and itself 
> uses , so using powers of 2 for chunks would indeed fail to detect 1s in the 
> 4th bit. I suspect you'll get different results when you check the allocation 
> patterns of multiples of 8 bytes, starting from 40, especially on 32-bit arm 
> (where MALLOC_ALIGNMENT is 8 bytes, rather than the 16 bytes on i386 and 
> 64-bit architectures, assuming  [0] is accurate)

I'm prepared to be overruled, but I just don't have strong feelings
that 32-bit is worth making these reservations for. Especially so
given the rate we're filling these slots.  The only system that I see
the 4th bit change is Cygwin. It doesn't look like a very easy system
to protect against pfreeing of malloc'd chunks as the prior 8-bytes
seem to vary depending on the malloc'd size and I see all bit patterns
there, including the ones we use for our memory contexts.

Since we can't protect against every possible bit pattern there, we
need to draw the line somewhere.  I don't think 32-bit systems are
worth reserving these precious slots for. I'd hazard a guess that
there are more instances of Postgres running on Windows today than on
32-bit CPUs and we don't seem too worried about the bit-patterns used
for Windows.

> In your updated 0001, you don't seem to fill the RESERVED_GLIBC memctx array 
> entries with BOGUS_MCTX().

Oops. Thanks

>> Another reason not to make it 5 bits is that I believe that would make
>> the mcxt_methods[] array 2304 bytes rather than 576 bytes.  4 bits
>> makes it 1152 bytes, if I'm counting correctly.
>
>
> I don't think I understand why this would be relevant when only 5 of the 
> contexts are actually in use (thus in caches). Is that size concern about TLB 
> entries then?

It's a static const array. I don't want to bloat the binary with
something we'll likely never need.  If we one day need it, we can
reserve another bit using the same overlapping method.

>> I revised the patch to simplify hdrmask logic.  This started with me
>> having trouble finding the best set of words to document that the
>> offset is "half the bytes between the chunk and block".  So, instead
>> of doing that, I've just made it so these two fields effectively
>> overlap. The lowest bit of the block offset is the same bit as the
>> high bit of what MemoryChunkGetValue returns.
>
>
> Works for me, I suppose.

hmm. I don't detect much enthusiasm for it.

Personally, I quite like the method as it adds no extra instructions
when encoding the MemoryChunk and only a simple bitwise-AND when
decoding it.  Your method added extra instructions in the encode and
decode.  I went to great lengths to make this code as fast as
possible, so I know which method that I prefer.  We often palloc and
never do anything that requires the chunk header to be decoded, so not
adding extra instructions on the encoding stage is a big win.

The only method I see to avoid adding instructions in encoding and
decoding is to reduce the bit-space for the MemoryChunkGetValue field
to 29 bits. Effectively, that means non-external chunks can only be
512MB rather than 1GB.  As far as I know, that just limits slab.c to
only being able to do 512MB pallocs as generation.c and aset.c use
external chunks well below that threshold. Restricting slab to 512MB
is probably far from the end of the world. Anything close to that
would be a terrible use case for slab.  I was just less keen on using
a bit from there as that's a field we allow the context implementation
to do what they like with. Having bitspace for 2^30 possible values in
there just seems nice given that it can store any possible value from
zero up to MaxAllocSize.

David




Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-04-06 Thread Thomas Munro
On second thoughts, I think the original "invalidate" terminology was
fine, no need to invent a new term.

I thought of a better name for the bufmgr.c function though:
InvalidateUnpinnedBuffer().  That name seemed better to me after I
festooned it with warnings about why exactly it's inherently racy and
only for testing use.

I suppose someone could propose an additional function
pg_buffercache_invalidate(db, tbspc, rel, fork, blocknum) that would
be slightly better in the sense that it couldn't accidentally evict
some innocent block that happened to replace the real target just
before it runs, but I don't think it matters much for this purpose and
it would still be racy on return (vacuum decides to load your block
back in) so I don't think it's worth bothering with.

So this is the version I plan to commit.
From 6a13349b788c8539b2d349f9553706d7c563f8f8 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 7 Apr 2024 09:13:17 +1200
Subject: [PATCH v6] Add pg_buffercache_invalidate() function for testing.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

When testing buffer pool logic, it is useful to be able to evict
arbitrary blocks.  This function can be used in SQL queries over the
pg_buffercache view to set up a wide range of buffer pool states.  Of
course, buffer mappings might change concurrently so you might evict a
block other than the one you had in mind, and another session might
bring it back in at any time.  That's OK for the intended purpose of
setting up developer testing scenarios, and more complicated interlocking
schemes to give stronger guarantees about that would likely be less
flexible for actual testing work anyway.  Superuser-only.

Author: Palak Chaturvedi 
Author: Thomas Munro  (docs, small tweaks)
Reviewed-by: Nitin Jadhav 
Reviewed-by: Andres Freund 
Reviewed-by: Cary Huang 
Reviewed-by: Cédric Villemain 
Reviewed-by: Jim Nasby 
Reviewed-by: Maxim Orlov 
Reviewed-by: Thomas Munro 
Discussion: https://postgr.es/m/calfch19pw48zwwzuorspsav9hqt0upyabpc4boz4w+c7ff5...@mail.gmail.com
---
 contrib/pg_buffercache/Makefile   |  2 +-
 contrib/pg_buffercache/meson.build|  1 +
 .../pg_buffercache--1.4--1.5.sql  |  6 ++
 contrib/pg_buffercache/pg_buffercache.control |  2 +-
 contrib/pg_buffercache/pg_buffercache_pages.c | 20 ++
 doc/src/sgml/pgbuffercache.sgml   | 37 +--
 src/backend/storage/buffer/bufmgr.c   | 63 +++
 src/include/storage/bufmgr.h  |  2 +
 8 files changed, 125 insertions(+), 8 deletions(-)
 create mode 100644 contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql

diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index d6b58d4da9..eae65ead9e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,7 @@ OBJS = \
 EXTENSION = pg_buffercache
 DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
 	pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
-	pg_buffercache--1.3--1.4.sql
+	pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
 PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
 
 REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index c86e33cc95..1ca3452918 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -22,6 +22,7 @@ install_data(
   'pg_buffercache--1.2--1.3.sql',
   'pg_buffercache--1.2.sql',
   'pg_buffercache--1.3--1.4.sql',
+  'pg_buffercache--1.4--1.5.sql',
   'pg_buffercache.control',
   kwargs: contrib_data_args,
 )
diff --git a/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql b/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql
new file mode 100644
index 00..92c530bc19
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql
@@ -0,0 +1,6 @@
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.5'" to load this file. \quit
+
+CREATE FUNCTION pg_buffercache_invalidate(IN int)
+RETURNS bool
+AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate'
+LANGUAGE C PARALLEL SAFE VOLATILE STRICT;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index a82ae5f9bb..5ee875f77d 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
 # pg_buffercache extension
 comment = 'examine the shared buffer cache'
-default_version = '1.4'
+default_version = '1.5'
 module_pathname = '$libdir/pg_buffercache'
 relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3316732365..9617bfa47b 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -63,6 +63,7 @@ typedef struct
 PG_FUNCTION_INFO_V1(pg_buffercache_pages);
 PG_FUNCTION_INFO_V1(pg_buffercache_summa

Re: Flushing large data immediately in pqcomm

2024-04-06 Thread Jelte Fennema-Nio
On Sat, 6 Apr 2024 at 22:21, Andres Freund  wrote:
> The small regression for small results is still kinda visible, I haven't yet
> tested the patch downthread.

Thanks a lot for the faster test script, I'm also impatient. I still
saw the small regression with David's patch. Here's a v6 where I
think it is now gone. I added inline to internal_put_bytes too. I
think that helped especially because for two calls to
internal_put_bytes len is a constant (1 and 4) that is smaller than
PqSendBufferSize. So for those calls the compiler can now statically
eliminate the new codepath because "len >= PqSendBufferSize" is known
to be false at compile time.

Also I incorporated all of Ranier's comments.


v6-0001-Faster-internal_putbytes.patch
Description: Binary data


Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-06 Thread Dmitry Koval

Hi, Alexander!


> I didn't push 0003 for the following reasons.


Thanks for clarifying. You are right, these are serious reasons.

--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com




Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-04-06 Thread Alexander Korotkov
Hi, Dmitry!

On Fri, Apr 5, 2024 at 4:00 PM Dmitry Koval  wrote:
> > I've revised the patchset.
>
> Thanks for the corrections (especially ddl.sgml).
> Could you also look at a small optimization for the MERGE PARTITIONS
> command (in a separate file
> v31-0003-Additional-patch-for-ALTER-TABLE-.-MERGE-PARTITI.patch, I wrote
> about it in an email 2024-03-31 00:56:50)?
>
> Files v31-0001-*.patch, v31-0002-*.patch are the same as
> v30-0001-*.patch, v30-0002-*.patch (after rebasing because patch stopped
> applying due to changes in upstream).

I've pushed 0001 and 0002.  I didn't push 0003 for the following reasons.
1) This doesn't keep functionality equivalent to 0001.  With 0003, the
merged partition will inherit indexes, constraints, and so on from the
one of merging partitions.
2) This is not necessarily an optimization. Without 0003 indexes on
the merged partition are created after moving the rows in
attachPartitionTable(). With 0003 we merge data into the existing
partition which saves its indexes.  That might cause a significant
performance loss because mass inserts into indexes may be much slower
than building indexes from scratch.
I think both aspects need to be carefully considered.  Even if we
accept them, this needs to be documented.  I think now it's too late
for both of these.  So, this should wait for v18.

--
Regards,
Alexander Korotkov




Re: Cluster::restart dumping logs when stop fails

2024-04-06 Thread Daniel Gustafsson
> On 6 Apr 2024, at 23:44, Andres Freund  wrote:

> It might be useful to print a few lines, but the whole log file can be
> several megabytes worth of output.

The non-context aware fix would be to just print the last 1024 (or something)
bytes from the logfile:

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm 
b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 54e1008ae5..53d4751ffc 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -951,8 +951,8 @@ sub start
 
if ($ret != 0)
{
-   print "# pg_ctl start failed; logfile:\n";
-   print PostgreSQL::Test::Utils::slurp_file($self->logfile);
+   print "# pg_ctl start failed; logfile excerpt:\n";
+   print substr 
PostgreSQL::Test::Utils::slurp_file($self->logfile), -1024;
 
# pg_ctl could have timed out, so check to see if there's a pid 
file;
# otherwise our END block will fail to shut down the new 
postmaster.


Would that be a reasonable fix?

--
Daniel Gustafsson





Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-04-06 Thread Jelte Fennema-Nio
On Fri, 5 Apr 2024 at 18:48, Robert Haas  wrote:
> Maybe we'd be better off adding a libpq connection
> option that forces the use of a specific minor protocol version, but
> then we'll need backward-compatibility code in libpq basically
> forever. But maybe we need that anyway to avoid older and newer
> servers being unable to communicate.

I think this would be good because it makes testing easy and, like you
said, I think we'll need this backward-compatibility code in libpq
anyway to be able to connect to old servers. To have even better and
more realistic test coverage though, I think we might also want to
actually test new libpq against old postgres servers and vice versa in
a build farm animal.

> Plus, you've got all of the consequences for non-core drivers, which
> have to both add support for the new wire protocol - if they don't
> want to seem outdated and eventually obsolete - and also test that
> they're still compatible with all supported server versions.

I think for clients/drivers, the work would generally be pretty
minimal. For almost all proposed changes, clients can "support" the
protocol version update by simply not using the new features, e.g. a
client can "support" the ParameterSet feature, by simply never sending
the ParameterSet message. So binding it to a protocol version bump
doesn't make it any harder for that client to support that protocol
version. I'm not saying that is the case for all protocol changes, but
based on what's being proposed so far that's definitely a very common
theme. Overall, I think this is something to discuss for each protocol
change in isolation: i.e. how to make supporting the new feature as
painless as possible for clients/drivers.

> Connection poolers have the same set of problems.

For connection poolers this is indeed a bigger hassle, because they at
least need to be able to handle all the new message types that a
client can send and maybe do something special for them. But I think
if we're careful to keep connection poolers in mind when designing the
features themselves then I think this isn't necessarily a problem. And
probably indeed for the features that we think are hard for connection
poolers to implement, we should be using protocol extension parameter
feature flags. But I think a lot of protocol changes would be fairly trivial
for a connection pooler to support.

> The whole thing is
> almost a hole with no bottom. Keeping up with core changes in this
> area could become a massive undertaking for lots and lots of people,
> some of whom may be the sole maintainer of some important driver that
> now needs a whole bunch of work.

I agree with Dave here, if you want to benefit from new features
there's some expectation to keep up with the changes. But to be clear,
we'd still support old protocol versions too. So we wouldn't break
connecting using those clients, they simply wouldn't benefit from some
of the new features. I think that's acceptable.

> I'm not sure how much it improves things if we imagine adding feature
> flags to the existing protocol versions, rather than whole new
> protocol versions, but at least it cuts down on the assumption that
> adopting new features is mandatory, and that such features are
> cumulative. If a driver wants to support TDE but not protocol
> parameters or protocol parameters but not TDE, who are we to say no?
> If in supporting those things we bump the protocol version to 3.2, and
> then 3.3 fixes a huge performance problem, are drivers going to be
> required to add support for features they don't care about to get the
> performance fixes?

I think there's an important trade-off here. On one side we don't want
to make maintainers of clients/poolers do lots of work to support
features they don't care about. And on the other side it seems quite
useful to limit the number of feature combinations that are used in
the wild (both for users and for us), e.g. the combinations of
backwards compatibility testing you were talking about would explode
if every protocol change was a feature flag. I think this trade-off is
something we should be deciding on based on the specific protocol
change. But if work needed to "support" the feature is "minimal"
(to-be-defined exactly what we consider minimal), I think making it
part of a protocol version bump is reasonable.

> I see some benefit in bumping the protocol version
> for major changes, or for changes that we have an important reason to
> make mandatory, or to make previously-optional features for which
> support has become in practical terms universal part of the base
> feature set. But I'm very skeptical of the idea that we should just
> handle as many things as possible via a protocol version bump. We've
> been avoiding protocol version bumps like the plague since forever,
> and swinging all the way to the other extreme doesn't sound like the
> right idea to me.

I think there are two parts to a protocol version bump:
1. The changes that cause us to consider a protocol bump

Re: Fixing backslash dot for COPY FROM...CSV

2024-04-06 Thread Daniel Verite
Tom Lane wrote:

> This is sufficiently weird that I'm starting to come around to
> Daniel's original proposal that we just drop the server's recognition
> of \. altogether (which would allow removal of some dozens of lines of
> complicated and now known-buggy code)

FWIW my plan was not to change anything in TEXT mode,
but I wasn't aware it had the issue you found when
\. is not on a line by itself.

>  Alternatively, we could fix it so that \. at the end of a line draws
> "end-of-copy marker corrupt"
> which would at least make things consistent, but I'm not sure that has
> any great advantage.  I surely don't want to document the current
> behavioral details as being the right thing that we're going to keep
> doing.

Agreed we don't want to document that, but also: why doesn't \. in the
contents represent just a dot (as opposed to being an error),
just like \a represents a?

I mean if eofdata contains

  foobar\a
  foobaz\aother

then we get after import:
      f1
--------------
 foobara
 foobazaother
(2 rows)

Reading the current doc on the text format, I can't see why
importing:

  foobar\.
  foobaz\.other

is not supposed to produce
      f1
--------------
 foobar.
 foobaz.other
(2 rows)


I see these rules in [1] about backslash:

#1. 
  "End of data can be represented by a single line containing just
   backslash-period (\.)."

foobar\. and foobar\.other do not match that so #1 does not describe
how they're interpreted.

#2.
  "Backslash characters (\) can be used in the COPY data to quote data
  characters that might otherwise be taken as row or column
  delimiters."

Dot is not a column delimiter (it's forbidden anyway), so #2 does
not apply.

#3.
  "In particular, the following characters must be preceded by a
  backslash if they appear as part of a column value: backslash itself,
  newline, carriage return, and the current delimiter character"

Dot is not in that list so #3 does not apply.

#4.
  "The following special backslash sequences are recognized by COPY
  FROM:" (followed by the table with \b \f, ...,)

Dot is not mentioned.

#5.
  "Any other backslashed character that is not mentioned in the above
  table will be taken to represent itself"

Here we say that backslash dot represents a dot (unless other
rules apply)

  foobar\. => foobar. 
  foobar\.other => foobar.other

#6.
  "However, beware of adding backslashes unnecessarily, since that
   might accidentally produce a string matching the end-of-data marker
   (\.) or the null string (\N by default)."

So we *recommend* not to use \. but as I understand it, the warning
with the EOD marker is about accidentally creating a line that matches #1,
that is, \. alone on a line.

#7
  "These strings will be recognized before any other backslash
  processing is done."

TBH I don't understand what #7 implies. The order in backslash
processing looks like an implementation detail that should not
matter in understanding the format?


Considering this, it seems to me that #5 says that
backslash-dot represents a dot unless #1 applies, and the
other #2 #3 #4 #6 #7 rules do not state anything that would
contradict that.
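Read together, the rules reduce to a small decoder. A hedged Python sketch of the per-field backslash handling (function and table names are mine; the octal/hex escapes of rule #4 and the delimiter handling of rule #3 are deliberately omitted):

```python
# Rule #4 table (abridged): known backslash sequences become control chars.
ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v"}

def decode_copy_text_field(raw: str) -> str:
    """Decode one COPY text-format field, per rules #4 and #5 above."""
    out, i = [], 0
    while i < len(raw):
        c = raw[i]
        if c == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Rule #5: any other backslashed character represents itself,
            # so "\." decodes to "." exactly as "\a" decodes to "a".
            out.append(ESCAPES.get(nxt, nxt))
            i += 2
        else:
            out.append(c)
            i += 1
    return "".join(out)
```

Under this reading, decoding foobar\. yields foobar., matching the \a
behaviour shown earlier, and rule #1 only fires when \. is the whole line.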


[1] https://www.postgresql.org/docs/current/sql-copy.html


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: UUID v7

2024-04-06 Thread Sergey Prokhorenko
For every complex problem there is an answer that is clear, simple, and wrong. 
Since the RFC allows microsecond timestamp granularity, the first thing that 
comes to everyone's mind is to insert microsecond granularity into UUIDv7. And 
if the RFC allowed nanosecond timestamp granularity, then they would try to 
insert nanosecond granularity into UUIDv7.
But I am categorically against abandoning the counter under pressure from the 
unfounded proposal to replace the counter with microsecond granularity.
1) The RFC specifies millisecond timestamp granularity by default.
2) All advanced UUIDv7 implementations include a counter:
• for JavaScript https://www.npmjs.com/package/uuidv7
• for Rust https://crates.io/crates/uuid7
• for Go (Golang) https://pkg.go.dev/github.com/gofrs/uuid#NewV7
• for Python https://github.com/oittaa/uuid6-python
3) The theoretical performance of generating UUIDv7 without loss of 
monotonicity for microsecond granularity is only 1000 UUIDv7 per millisecond. 
This is very low and insufficient generation performance! But the actual 
generation performance is even worse, since the generation demand is unevenly 
distributed within a millisecond. Therefore, a UUIDv7 will not be generated 
every microsecond.
For a counter 18 bits long, with the most significant bit initialized to zero 
and the remaining bits initialized to a random number, the actual performance 
of generating a UUIDv7 without loss of monotonicity is between 2 to the power 
of 17 = 131072 UUIDv7 per millisecond (if the random number happens to be all 
ones) to 2 to the power of 18 = 262144 UUIDv7 per millisecond (if the random 
number happens to be all zeros). This is more than enough.
4) Microsecond timestamp fraction subtracts 10 bits from random data, which 
increases the risk of collision. In the counter, almost all bits are 
initialized with a random number, which reduces the risk of collision.
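A minimal sketch of the counter scheme argued for above (illustrative Python written for this message, not any of the linked implementations; placing the 18 counter bits across the rand_a field and the top of rand_b is one possible layout):

```python
import os
import time

_last_ms = -1
_counter = 0

def uuidv7_with_counter() -> bytes:
    """Generate a 16-byte UUIDv7 with an 18-bit monotonic counter."""
    global _last_ms, _counter
    ms = time.time_ns() // 1_000_000
    if ms != _last_ms:
        _last_ms = ms
        # Counter: the most significant of 18 bits starts at zero, the lower
        # 17 bits are random, giving 2^17..2^18 monotonic ids per millisecond.
        _counter = int.from_bytes(os.urandom(3), "big") & 0x1FFFF
    else:
        _counter += 1  # monotonic within the same millisecond
    u = bytearray(16)
    u[0:6] = ms.to_bytes(6, "big")             # 48-bit unix_ts_ms
    u[6] = 0x70 | ((_counter >> 14) & 0x0F)    # version 7 + counter bits 17..14
    u[7] = (_counter >> 6) & 0xFF              # counter bits 13..6
    u[8] = 0x80 | (_counter & 0x3F)            # variant 10 + counter bits 5..0
    u[9:16] = os.urandom(7)                    # remaining randomness
    return bytes(u)
```

Consecutive calls sort correctly byte-wise whether or not they land in the
same millisecond, which is the property a per-table generator would rely on.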


The only reasonable use of microsecond granularity is when writing to a 
database table in parallel. However, monotonicity in this case can be ensured 
in another way, namely a single UUIDv7 generator per database table, similar to 
SERIAL 
(https://postgrespro.com/docs/postgresql/16/datatype-numeric#DATATYPE-SERIAL) 
in PostgreSQL.
Best regards,
Sergey Prokhorenko
sergeyprokhore...@yahoo.com.au

On Thursday, 4 April 2024 at 09:12:17 pm GMT+3, Andrey M. Borodin 
 wrote:  
 
 
...

At this point we can skip the counter/microseconds entirely, just fill 
everything after unix_ts_ms with randomness. It's still a valid UUIDv7, 
exhibiting much more data locality than UUIDv4. We can adjust these sortability 
measures later.


Best regards, Andrey Borodin.  

Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-04-06 Thread Alexander Korotkov
Hi, Alvaro!

Thank you for your care on this matter.

On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera  wrote:
> BTW I noticed that
> https://coverage.postgresql.org/src/backend/commands/waitlsn.c.gcov.html
> says that lsn_cmp is not covered by the tests.  This probably indicates
> that the tests are a little too light, but I'm not sure how much extra
> effort we want to spend.

I'm aware of this.  Ivan promised to send a patch to improve the test.
If he doesn't, I'll take care of it.

> I'm still concerned that WaitLSNCleanup is only called in ProcKill.
> Does this mean that if a process throws an error while waiting, it'll
> not get cleaned up until it exits?  Maybe this is not a big deal, but it
> seems odd.

I've added WaitLSNCleanup() to the AbortTransaction().  Just pushed
that together with the improvements upthread.

--
Regards,
Alexander Korotkov




Cluster::restart dumping logs when stop fails

2024-04-06 Thread Andres Freund
Hi,

I recently started to be bothered by regress_* logs after some kinds of test
failures containing the whole log of a test failure. E.g. in
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-04-06%2016%3A28%3A38

...
### Restarting node "standby"
# Running: pg_ctl -w -D 
/home/bf/bf-build/serinus/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata
 -l 
/home/bf/bf-build/serinus/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log
 restart
waiting for server to shut down... failed
pg_ctl: server does not shut down
# pg_ctl restart failed; logfile:
2024-04-06 16:33:37.496 UTC [2628363][postmaster][:0] LOG:  starting PostgreSQL 
17devel on x86_64-linux, compiled by gcc-14.0.1, 64-bit
2024-04-06 16:33:37.503 UTC [2628363][postmaster][:0] LOG:  listening on Unix 
socket "/tmp/55kikMaTyW/.s.PGSQL.63274"



Looks like the printing of the entire log was added in:

commit 33774978c78175095da9e6c276e8bcdb177725f8
Author: Daniel Gustafsson 
Date:   2023-09-22 13:35:37 +0200

Avoid using internal test methods in SSL tests


It might be useful to print a few lines, but the whole log file can be
several megabytes' worth of output.  In the buildfarm that leads to the same
information being collected multiple times, and locally it makes it hard to
see where the "normal" contents of regress_log* continue.

Greetings,

Andres Freund




Re: Synchronizing slots from primary to standby

2024-04-06 Thread Andres Freund
Hi,

On 2024-04-06 10:58:32 +0530, Amit Kapila wrote:
> On Sat, Apr 6, 2024 at 10:13 AM Amit Kapila  wrote:
> >
> 
> There are still a few pending issues to be fixed in this feature but
> otherwise, we have committed all the main patches, so I marked the CF
> entry corresponding to this work as committed.

There are a fair number of failures of 040_standby_failover_slots_sync in
the buildfarm.  It'd be nice to get those fixed soon-ish.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-04-06%2020%3A58%3A50
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-04-06%2015%3A18%3A08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-04-06%2010%3A13%3A58
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-04-05%2016%3A04%3A10
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-04-05%2014%3A59%3A40
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-05%2014%3A59%3A07
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-04-05%2014%3A18%3A07

The symptoms are similar, but not entirely identical across all of them, I 
think.

I've also seen a bunch of failures of this test locally.

Greetings,

Andres Freund




Re: Flushing large data immediately in pqcomm

2024-04-06 Thread Ranier Vilela
Hi,

On Fri, 5 Apr 2024 at 03:28, Melih Mutlu
 wrote:
>Right. It was a mistake, forgot to remove that. Fixed it in v5.

If you don't mind, I have some suggestions for patch v5.

1.  Shouldn't PqSendBufferSize be of type size_t?
There are several comparisons with other size_t variables.
static size_t PqSendBufferSize; /* Size send buffer */

I think this would prevent possible overflows.

2. If PqSendBufferSize is changed to size_t, in the function
socket_putmessage_noblock, the variable named *required* should
be changed to size_t as well.

static void
socket_putmessage_noblock(char msgtype, const char *s, size_t len)
{
int res PG_USED_FOR_ASSERTS_ONLY;
size_t required;

3. In the internal_putbytes function, the *amount* variable could
have its scope safely reduced.

else
{
size_t amount;

amount = PqSendBufferSize - PqSendPointer;

4. In the function internal_flush_buffer, the variables named
*bufptr* and *bufend* could be const char * type, like:

static int
internal_flush_buffer(const char *s, size_t *start, size_t *end)
{
static int last_reported_send_errno = 0;

const char *bufptr = s + *start;
const char *bufend = s + *end;

best regards,
Ranier Vilela


Re: Statistics Import and Export

2024-04-06 Thread Corey Huinker
> I think it'll be a serious, serious error for this not to be
> SECTION_DATA.  Maybe POST_DATA is OK, but even that seems like
> an implementation compromise not "the way it ought to be".
>

We'd have to split them on account of when the underlying object is
created. Index statistics would be SECTION_POST_DATA, and everything else
would be SECTION_DATA. Looking ahead, statistics data for extended
statistics objects would also be POST. That's not a big change, but my
first attempt at that resulted in a bunch of unrelated grants dumping in
the wrong section.


Re: CASE control block broken by a single line comment

2024-04-06 Thread Erik Wienhold
On 2024-04-06 20:14 +0200, Michal Bartak wrote:
> The issue described below exists in PostgreSQL 16.2 (found in some
> previous major versions)

Can confirm also on master.

> The documentation defines a comment as:
> 
> > A comment is a sequence of characters beginning with double dashes and
> > extending to the end of the line
> 
> 
> When using such a comment within CASE control block, it ends up with an
> error:
> 
> DO LANGUAGE plpgsql $$
> DECLARE
> t TEXT = 'a';
> BEGIN
> CASE t
> WHEN 'a'  -- my comment
> THEN RAISE NOTICE 'a';
> WHEN 'b'
> THEN RAISE NOTICE 'b';
> ELSE NULL;
> END CASE;
> END;$$;
> 
> ERROR:  syntax error at end of input
> LINE 1: "__Case__Variable_2__" IN ('a'  -- my comment)
>   ^
> QUERY:  "__Case__Variable_2__" IN ('a'  -- my comment)
> CONTEXT:  PL/pgSQL function inline_code_block line 5 at CASE

I'm surprised that the comment is not skipped by the scanner at this
point.  Maybe because the parser just reads the raw expression between
WHEN and THEN with plpgsql_append_source_text via read_sql_construct.

How about the attached patch?  It's a workaround by simply adding a line
feed character between the raw expression and the closing parenthesis.
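The failure mode is easy to model outside the server (toy Python, not the plpgsql grammar; the rewritten-query shape is taken from the error message above):

```python
def rewrite_case_when(expr: str, fixed: bool = False) -> str:
    # make_case() splices the raw text scanned between WHEN and THEN into
    # "<casevar>" IN (<expr>).  With fixed=True, a newline is appended first,
    # mirroring the patch, so a trailing "-- comment" cannot swallow the ")".
    tail = "\n)" if fixed else ")"
    return '"__Case__Variable_2__" IN (' + expr + tail

def visible_sql(q: str) -> str:
    # Crude stand-in for the SQL lexer's handling of -- line comments.
    return "\n".join(line.split("--", 1)[0] for line in q.split("\n"))
```

With expr = "'a'  -- my comment", the unfixed rewrite loses its closing
parenthesis to the comment, reproducing the "syntax error at end of input";
the fixed form keeps it.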

-- 
Erik
>From 85456a22f41a8a51703650f2853fb6d8c9711fc7 Mon Sep 17 00:00:00 2001
From: Erik Wienhold 
Date: Sat, 6 Apr 2024 22:36:54 +0200
Subject: [PATCH v1] plpgsql: create valid IN expression for CASE WHEN clause

The expression in CASE x WHEN <expr> THEN may end with a line comment.
This results in a syntax error when the WHEN clause is rewritten to
x IN (<expr>).  Cope with that by appending \n after <expr> to terminate
the line comment.
---
 src/pl/plpgsql/src/expected/plpgsql_control.out | 17 +
 src/pl/plpgsql/src/pl_gram.y|  6 +-
 src/pl/plpgsql/src/sql/plpgsql_control.sql  | 14 ++
 3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/src/pl/plpgsql/src/expected/plpgsql_control.out 
b/src/pl/plpgsql/src/expected/plpgsql_control.out
index 328bd48586..ccd4f54704 100644
--- a/src/pl/plpgsql/src/expected/plpgsql_control.out
+++ b/src/pl/plpgsql/src/expected/plpgsql_control.out
@@ -681,3 +681,20 @@ select case_test(13);
  other
 (1 row)
 
+-- test line comment between WHEN and THEN
+create or replace function case_comment(int) returns text as $$
+begin
+  case $1
+when 1 -- comment before THEN
+  then return 'one';
+else
+  return 'other';
+  end case;
+end;
+$$ language plpgsql immutable;
+select case_comment(1);
+ case_comment 
+--
+ one
+(1 row)
+
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index bef33d58a2..98339589ba 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -4161,7 +4161,11 @@ make_case(int location, PLpgSQL_expr *t_expr,
/* Do the string hacking */
initStringInfo(&ds);
 
-   appendStringInfo(&ds, "\"%s\" IN (%s)",
+   /*
+* The read expression may end in a line comment, so
+* append \n after it to create a valid expression.
+*/
+   appendStringInfo(&ds, "\"%s\" IN (%s\n)",
 varname, expr->query);
 
pfree(expr->query);
diff --git a/src/pl/plpgsql/src/sql/plpgsql_control.sql 
b/src/pl/plpgsql/src/sql/plpgsql_control.sql
index ed7231134f..8e007c51dc 100644
--- a/src/pl/plpgsql/src/sql/plpgsql_control.sql
+++ b/src/pl/plpgsql/src/sql/plpgsql_control.sql
@@ -486,3 +486,17 @@ select case_test(1);
 select case_test(2);
 select case_test(12);
 select case_test(13);
+
+-- test line comment between WHEN and THEN
+create or replace function case_comment(int) returns text as $$
+begin
+  case $1
+when 1 -- comment before THEN
+  then return 'one';
+else
+  return 'other';
+  end case;
+end;
+$$ language plpgsql immutable;
+
+select case_comment(1);
-- 
2.44.0



Re: Flushing large data immediately in pqcomm

2024-04-06 Thread Andres Freund
Hi,

On 2024-04-06 14:34:17 +1300, David Rowley wrote:
> I don't see any issues with v5, so based on the performance numbers
> shown on this thread for the latest patch, it would make sense to push
> it.  The problem is, I just can't recreate the performance numbers.
>
> I've tried both on my AMD 3990x machine and an Apple M2 with a script
> similar to the test.sh from above.  I mostly just stripped out the
> buffer size stuff and adjusted the timing code to something that would
> work with mac.

I think there are a few issues with the test script leading to not seeing a
gain:

1) I think using the textual protocol, with the text datatype, will make it
   harder to spot differences. That's a lot of overhead.

2) Afaict the test is connecting over the unix socket, I think we expect
   bigger wins for tcp

3) Particularly the larger string is bottlenecked due to pglz compression in
   toast.


Where I had noticed the overhead of the current approach badly, was streaming
out basebackups. Which is all binary, of course.


I added WITH BINARY, SET STORAGE EXTERNAL and tested both unix socket and
localhost. I also reduced row counts and iteration counts, because I am
impatient, and I don't think it matters much here. Attached the modified
version.


On a dual xeon Gold 5215, turbo boost disabled, server pinned to one core,
script pinned to another:


unix:

master:
Run 100 100 100: 0.058482377
Run 1024 10240 10: 0.120909810
Run 1024 1048576 2000: 0.153027916
Run 1048576 1048576 1000: 0.154953512

v5:
Run 100 100 100: 0.058760126
Run 1024 10240 10: 0.118831396
Run 1024 1048576 2000: 0.124282503
Run 1048576 1048576 1000: 0.123894962


localhost:

master:
Run 100 100 100: 0.067088000
Run 1024 10240 10: 0.170894273
Run 1024 1048576 2000: 0.230346632
Run 1048576 1048576 1000: 0.230336078

v5:
Run 100 100 100: 0.067144036
Run 1024 10240 10: 0.167950948
Run 1024 1048576 2000: 0.135167027
Run 1048576 1048576 1000: 0.135347867


The perf difference for 1MB via TCP is really impressive.

The small regression for small results is still kinda visible; I haven't yet
tested the patch downthread.

Greetings,

Andres Freund
#!/bin/bash

set -e

dbname=postgres
port=5440
host=/tmp
host=localhost

test_cases=(
"100 100 100"   # only 100 bytes
"1024 10240 10"# 1Kb and 10Kb
"1024 1048576 2000" # 1Kb and 1Mb
"1048576 1048576 1000"  # all 1Mb
)

insert_rows(){
psql -d $dbname -p $port -h $host  -c "
DO \$\$
DECLARE
counter INT;
BEGIN
FOR counter IN 1..$3 LOOP
IF counter % 2 = 1 THEN
INSERT INTO test_table VALUES (repeat('a', $1)::text);
ELSE
INSERT INTO test_table VALUES (repeat('b', $2)::text);
END IF;
END LOOP;
END \$\$;
" > /dev/null
}


psql -d $dbname -p $port -c "CREATE EXTENSION IF NOT EXISTS pg_prewarm;" > /dev/null

for case in "${test_cases[@]}"
do
    psql -d $dbname -p $port -h $host -c "DROP TABLE IF EXISTS test_table;" > /dev/null
    psql -d $dbname -p $port -h $host -c "CREATE UNLOGGED TABLE test_table(data text not null);" > /dev/null
    psql -d $dbname -p $port -h $host -c "ALTER TABLE test_table ALTER data SET STORAGE EXTERNAL;" > /dev/null

    insert_rows $case

    psql -d $dbname -p $port -h $host -c "select pg_prewarm('test_table');" > /dev/null

    echo -n "Run $case: "

    elapsed_time=0
    for a in {1..5}
    do
        start_time=$(perl -MTime::HiRes=time -e 'printf "%.9f\n", time')
        psql -d $dbname -p $port -h $host -c "COPY test_table TO STDOUT WITH BINARY;" > /dev/null
        end_time=$(perl -MTime::HiRes=time -e 'printf "%.9f\n", time')
        elapsed_time=$(perl -e "printf('%.9f', ($end_time - $start_time) + $elapsed_time)")
    done

    avg_elapsed_time_in_ms=$(perl -e "printf('%.9f', ($elapsed_time / 30))")
    echo $avg_elapsed_time_in_ms
done


Re: Popcount optimization using AVX512

2024-04-06 Thread Nathan Bossart
On Sat, Apr 06, 2024 at 02:51:39PM +1300, David Rowley wrote:
> On Sat, 6 Apr 2024 at 14:17, Nathan Bossart  wrote:
>> On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote:
>> > Won't Valgrind complain about this?
>> >
>> > +pg_popcount_avx512(const char *buf, int bytes)
>> >
>> > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf);
>> >
>> > + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf);
>>
>> I haven't been able to generate any complaints, at least with some simple
>> tests.  But I see your point.  If this did cause such complaints, ISTM we'd
>> just want to add it to the suppression file.  Otherwise, I think we'd have
>> to go back to the non-maskz approach (which I really wanted to avoid
>> because of the weird function overhead juggling) or find another way to do
>> a partial load into an __m512i.
> 
> [1] seems to think it's ok.  If this is true then the following
> shouldn't segfault:
> 
> The following seems to run without any issue and if I change the mask
> to 1 it crashes, as you'd expect.

Cool.

Here is what I have staged for commit, which I intend to do shortly.  At
some point, I'd like to revisit converting TRY_POPCNT_FAST to a
configure-time check and maybe even moving the "fast" and "slow"
implementations to their own files, but since that's mostly for code
neatness and we are rapidly approaching the v17 deadline, I'm content to
leave that for v18.
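For readers without the intrinsics handy, the shape of the masked-tail approach can be sketched in plain Python (my illustration only; the committed code uses AVX-512 registers with _mm512_maskz_loadu_epi8 rather than byte loops):

```python
def popcount_chunked(buf: bytes, chunk: int = 64) -> int:
    """Count set bits, processing full 'registers' then a masked tail."""
    total = 0
    full = len(buf) - len(buf) % chunk
    for i in range(0, full, chunk):          # whole 64-byte chunks
        total += sum(bin(b).count("1") for b in buf[i:i + chunk])
    tail = len(buf) - full
    if tail:
        # A maskz load zero-fills the lanes beyond the buffer, so the
        # popcount of the padded register equals that of the tail alone.
        padded = buf[full:] + bytes(chunk - tail)
        total += sum(bin(b).count("1") for b in padded)
    return total
```

The zero-filled padding is what makes the tail handling branch-light: no
scalar loop over the last 1..63 bytes is needed.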

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 9eea49555cbd14c7871085e159c9b0b78e92 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 27 Mar 2024 16:39:24 -0500
Subject: [PATCH v28 1/2] Optimize pg_popcount() with AVX-512 instructions.

Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible.  Newer hardware that supports AVX-512 instructions
can perform these tasks in 512-bit chunks, which can provide a nice
speedup, especially for larger buffers.

This commit introduces the infrastructure required to detect both
compiler and CPU support for the required AVX-512 intrinsic
functions, and it makes use of that infrastructure in a new
pg_popcount() implementation.  If CPU support for this optimized
implementation is detected at runtime, a function pointer is
updated so that it is used for subsequent calls to pg_popcount().

Most of the existing in-tree calls to pg_popcount() should benefit
nicely from these instructions, and calls for smaller buffers
should not regress when compared to v16.  The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that work is left for a follow-up
commit.

Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
---
 config/c-compiler.m4 |  58 ++
 configure| 252 +++
 configure.ac |  51 ++
 meson.build  |  87 +
 src/Makefile.global.in   |   5 +
 src/include/pg_config.h.in   |  12 ++
 src/include/port/pg_bitutils.h   |  11 ++
 src/makefiles/meson.build|   4 +-
 src/port/Makefile|  11 ++
 src/port/meson.build |   6 +-
 src/port/pg_bitutils.c   |   5 +
 src/port/pg_popcount_avx512.c|  82 +
 src/port/pg_popcount_avx512_choose.c |  87 +
 src/test/regress/expected/bit.out|  24 +++
 src/test/regress/sql/bit.sql |   4 +
 15 files changed, 696 insertions(+), 3 deletions(-)
 create mode 100644 src/port/pg_popcount_avx512.c
 create mode 100644 src/port/pg_popcount_avx512_choose.c

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb..cfff48c1bc 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then
 fi
 undefine([Ac_cachevar])dnl
 ])# PGAC_LOONGARCH_CRC32C_INTRINSICS
+
+# PGAC_XSAVE_INTRINSICS
+# -
+# Check if the compiler supports the XSAVE instructions using the _xgetbv
+# intrinsic function.
+#
+# An optional compiler flag can be passed as argument (e.g., -mxsave).  If the
+# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE.
+AC_DEFUN([PGAC_XSAVE_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ],
+  [return _xgetbv(0) & 0xe0;])],
+  [Ac_cachevar=yes],
+  [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+  CFLAGS_XSAVE="$1"
+  pgac_xsave_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_XSAVE_INTRINSICS
+
+# PGAC_AVX512_POPCNT_INTRINSICS
+# --

CASE control block broken by a single line comment

2024-04-06 Thread Michal Bartak
Hello all

The issue described below exists in PostgreSQL 16.2 (found in some
previous major versions)

The documentation defines a comment as:

> A comment is a sequence of characters beginning with double dashes and
> extending to the end of the line


When using such a comment within CASE control block, it ends up with an
error:

DO LANGUAGE plpgsql $$
DECLARE
t TEXT = 'a';
BEGIN
CASE t
WHEN 'a'  -- my comment
THEN RAISE NOTICE 'a';
WHEN 'b'
THEN RAISE NOTICE 'b';
ELSE NULL;
END CASE;
END;$$;

ERROR:  syntax error at end of input
LINE 1: "__Case__Variable_2__" IN ('a'  -- my comment)
  ^
QUERY:  "__Case__Variable_2__" IN ('a'  -- my comment)
CONTEXT:  PL/pgSQL function inline_code_block line 5 at CASE

With Regards
Michal Bartak


Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-06 Thread Daniel Gustafsson
> On 6 Apr 2024, at 16:04, Tom Lane  wrote:
> Daniel Gustafsson  writes:
>>> On 6 Apr 2024, at 08:02, Peter Eisentraut  wrote:

>>> Why do we need to check for the versions at all?  We should just check for 
>>> the functions we need.  At least that's always been the normal approach in 
>>> configure.
> 
>> We could, but finding a stable set of functions which identifies the version of
>> OpenSSL *and* LibreSSL that we want, and their successors, while not matching
>> any older versions seemed more opaque than testing two numeric values.
> 
> I don't think you responded to Peter's point at all.  The way autoconf
> is designed to work is explicitly NOT to try to identify the exact
> version of $whatever.  Rather, the idea is to probe for the API
> features that you want to rely on: functions, macros, struct fields,
> or the like.  If you can't point to an important API difference
> between 1.0.2 and 1.1.1, why drop support for 1.0.2?

My apologies, I thought I did but clearly failed.  My point was that this is a
special/corner case where we try to find one of two different libraries (with
different ideas about backwards compatibility etc) for supporting a single
thing.  So instead I tested for the explicit versions like how we already test
for the exact Perl version in config/perl.m4 (albeit that a library and a
program are two different things of course).

In bumping we want to move to 1.1.1 since that's the first version with the
rewritten RNG which is fork-safe by design, something PostgreSQL clearly
benefits from.  There is no new API for this to gate on though.  For LibreSSL
we want 3.3.2 to a) ensure we have coverage in the BF and b) since it's the
first version where the tests pass due to error message alignment with OpenSSL.
The combination of these gets rid of lots of special-cased #ifdef soup.  I
wasn't, however, able to find a specific API call which is unique to the two
versions that we rely on.

Testing for the presence of an API provided and introduced by both libraries in
the versions we're interested in, but which we don't use, is the alternative, but
I thought that would be more frowned upon.  EVP_PKEY_new_CMAC_key() was
introduced in 1.1.1 and LibreSSL 3.3.2, so an AC_CHECK_FUNCS for that, as in
the attached, achieves the version check but pollutes pg_config.h with a define
which will never be used which seemed a bit ugly.

--
Daniel Gustafsson



v6-0001-Remove-support-for-OpenSSL-1.0.2-and-1.1.0.patch
Description: Binary data


Re: LogwrtResult contended spinlock

2024-04-06 Thread Jeff Davis
On Sat, 2024-04-06 at 18:13 +0200, Alvaro Herrera wrote:
> my understanding of pg_atomic_compare_exchange_u64 is that
> *expected is set to the value that's stored in *ptr prior to the
> exchange.

Sorry, my mistake. Your version looks good.
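For anyone skimming the thread, the semantics in question can be modeled like this (illustrative Python; the real pg_atomic_compare_exchange_u64 is a hardware CAS on a uint64, with the pointer argument modeled here as a one-element list):

```python
class AtomicU64:
    def __init__(self, value=0):
        self.value = value

    def compare_exchange(self, expected, newval):
        # Success: *ptr == *expected, so store newval and return True.
        if self.value == expected[0]:
            self.value = newval
            return True
        # Failure: *expected is overwritten with the value found in *ptr,
        # so a retry loop can recompute newval without a separate read.
        expected[0] = self.value
        return False

def advance_to(atomic, target):
    """Monotonically advance a position, retrying on CAS failure."""
    cur = [atomic.value]
    while cur[0] < target:
        if atomic.compare_exchange(cur, target):
            break
```

The advance_to() loop is the usage pattern under discussion: on failure the
refreshed *expected lets the caller decide whether the position has already
moved past the target.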

Regards,
Jeff Davis





Re: Add bump memory context type and use it for tuplesorts

2024-04-06 Thread Matthias van de Meent
On Sat, 6 Apr 2024, 14:36 David Rowley,  wrote:

> On Sat, 6 Apr 2024 at 02:30, Matthias van de Meent
>  wrote:
> >
> > On Thu, 4 Apr 2024 at 22:43, Tom Lane  wrote:
> > >
> > > Matthias van de Meent  writes:
> > > > It extends memory context IDs to 5 bits (32 values), of which
> > > > - 8 have glibc's malloc pattern of 001/010;
> > > > - 1 is unused memory's 0
> > > > - 1 is wipe_mem's 1
> > > > - 4 are used by existing contexts
> (Aset/Generation/Slab/AlignedRedirect)
> > > > - 18 are newly available.
> > >
> > > This seems like it would solve the problem for a good long time
> > > to come; and if we ever need more IDs, we could steal one more bit
> > > by requiring the offset to the block header to be a multiple of 8.
> > > (Really, we could just about do that today at little or no cost ...
> > > machines with MAXALIGN less than 8 are very thin on the ground.)
> >
> > Hmm, it seems like a decent idea, but I didn't want to deal with the
> > repercussions of that this late in the cycle when these 2 bits were
> > still relatively easy to get hold of.
>
> Thanks for writing the patch.
>
> I think 5 bits is 1 too many. 4 seems fine. I also think you've
> reserved too many slots in your patch as I disagree that we need to
> reserve the glibc malloc pattern anywhere but in the 1 and 2 slots of
> the mcxt_methods[] array.  I looked again at the 8 bytes prior to a
> glibc malloc'd chunk and I see the lowest 4 bits of the headers
> consistently set to 0001 for all powers of 2 starting at 8 up to
> 65536.


Malloc's docs specify the minimum chunk size at 4*sizeof(void*) and itself
uses , so using powers of 2 for chunks would indeed fail to detect 1s in
the 4th bit. I suspect you'll get different results when you check the
allocation patterns of multiples of 8 bytes, starting from 40, especially
on 32-bit arm (where MALLOC_ALIGNMENT is 8 bytes, rather than the 16 bytes
on i386 and 64-bit architectures, assuming [0] is accurate)
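To make the reservation logic concrete, here is a toy model of why those glibc header patterns matter (hypothetical Python written for this message; the mask width follows the 4-bit proposal discussed here, and 0x41 stands in for a typical glibc size-plus-PREV_INUSE header of a small chunk):

```python
MCTX_ID_MASK = 0xF                        # low bits used as the method ID
GLIBC_LOWBIT_PATTERNS = {0b0001, 0b0010}  # low bits seen in glibc headers

def header_mcxt_id(hdr: int) -> int:
    # pfree() dispatches on these bits of the 8 bytes before the chunk.
    return hdr & MCTX_ID_MASK

def looks_like_malloc_chunk(hdr: int) -> bool:
    # Slots matching these patterns stay reserved (BOGUS_MCTX) so that a
    # pfree() of a malloc'd pointer fails loudly instead of dispatching
    # into a real context's methods.
    return header_mcxt_id(hdr) in GLIBC_LOWBIT_PATTERNS
```

Whether the 4th bit stays zero for all malloc'd sizes on all platforms is
exactly the point being debated above.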

> 131072 seems to vary and beyond that, they seem to be set to
> 0010.

In your updated 0001, you don't seem to fill the RESERVED_GLIBC memctx
array entries with BOGUS_MCTX().

> With that, there's no increase in the number of reserved slots from
> what we have reserved today. Still 4.  So having 4 bits instead of 3
> bits gives us a total of 12 slots rather than 4 slots.  Having 3x
> slots seems enough.  We might need an extra bit for something else
> sometime. I think keeping it up our sleeve is a good idea.
>
> Another reason not to make it 5 bits is that I believe that would make
> the mcxt_methods[] array 2304 bytes rather than 576 bytes.  4 bits
> makes it 1152 bytes, if I'm counting correctly.
>

I don't think I understand why this would be relevant when only 5 of the
contexts are actually in use (thus in caches). Is that size concern about
TLB entries then?


> I revised the patch to simplify hdrmask logic.  This started with me
> having trouble finding the best set of words to document that the
> offset is "half the bytes between the chunk and block".  So, instead
> of doing that, I've just made it so these two fields effectively
> overlap. The lowest bit of the block offset is the same bit as the
> high bit of what MemoryChunkGetValue returns.


Works for me, I suppose.

> I also updated src/backend/utils/mmgr/README to explain this and
> adjust the mentions of 3-bits and 61-bits to 4-bits and 60-bits.  I
> also explained the overlapping part.
>

Thanks!

[0]
https://sourceware.org/glibc/wiki/MallocInternals#Platform-specific_Thresholds_and_Constants



Re: Fixing backslash dot for COPY FROM...CSV

2024-04-06 Thread Tom Lane
I wrote:
> So the current behavior is that \. that is on the end of a line,
> but is not the whole line, is silently discarded and we keep going.

> All versions throw "end-of-copy marker corrupt" if there is
> something after \. on the same line.

> This is sufficiently weird that I'm starting to come around to
> Daniel's original proposal that we just drop the server's recognition
> of \. altogether (which would allow removal of some dozens of lines of
> complicated and now known-buggy code).

I experimented with that and soon ran into a nasty roadblock: it
breaks dump/restore, because pg_dump includes a "\." line after
COPY data whether or not it really needs one.  Worse, that's
implemented by including the "\." line into the archive format,
so that existing dump files contain it.  Getting rid of it would
require an archive format version bump, plus some hackery to allow
removal of the line when reading old dump files.

While that's surely doable with enough effort, it's not the kind
of thing to be undertaking with less than 2 days to feature freeze.
Not to mention that I'm not sure we have consensus to do it at all.

More fun stuff: PQgetline actually invents a "\." line when it
sees server end-of-copy, and we tell users of that function to
check for that not an out-of-band return value to detect EOF.
It looks like we have no callers of that in the core distro,
but do we want to deprecate it completely?

So I feel like we need to put this patch on the shelf for the moment
and come back to it early in v18.  Although it seems reasonably clear
what to do on the CSV side of things, it's very much less clear what
to do about text-format handling of EOD markers, and I don't want to
change one of those things in v17 and the other in v18.  Also it
seems like there are more dependencies on "\." than we realized.

There could be an argument for applying just the psql change now,
to remove its unnecessary sending of "\.".  That won't break
anything and it would give us at least one year's leg up on
compatibility issues.

regards, tom lane




Re: Synchronizing slots from primary to standby

2024-04-06 Thread Bertrand Drouvot
Hi,

On Sat, Apr 06, 2024 at 10:13:00AM +0530, Amit Kapila wrote:
> On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot
>  wrote:
> 
> I think the new LSN can be visible only when the corresponding WAL is
> written by XLogWrite(). I don't know what in XLogSetAsyncXactLSN() can
> make it visible. In your experiment below, isn't it possible that in
> the meantime WAL writer has written that WAL due to which you are
> seeing an updated location?

What I did is:

session 1:  select pg_current_wal_lsn();\watch 1
session 2:  select pg_backend_pid();

terminal 1: tail -f logfile | grep -i snap 
terminal 2 : gdb -p ) at standby.c:1346
1346{
(gdb) n
1350
 
Then next, next until the DEBUG message is emitted (confirmed in terminal 1).

At this stage the DEBUG message shows the new LSN while session 1 still displays
the previous LSN.

Then once XLogSetAsyncXactLSN() is done in the gdb session (terminal 2) then
session 1 displays the new LSN.

This is reproducible as desired.

With more debugging I can see that when the spinlock is released in 
XLogSetAsyncXactLSN()
then XLogWrite() is doing its job and then session 1 does see the new value 
(that
happens in this order, and as you said that's expected).

My point is that while the DEBUG message is emitted, session 1 still
sees the old LSN (until the new LSN is visible). I think that we should emit the
DEBUG message once session 1 can see the new value (if not, I think the
timestamp of the DEBUG message can be misleading for debugging purposes).

> I think I am missing how exactly moving DEBUG2 can confirm the above theory.

I meant to say that instead of seeing:

2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG:  snapshot 
of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest complete 
739 next xid 740)
2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG:  statement: 
SELECT '0/360' <= replay_lsn AND state = 'streaming'

We would probably see something like:

2024-04-05 04:37:05. UTC [3866475][client backend][2/4:0] LOG:  
statement: SELECT '0/360' <= replay_lsn AND state = 'streaming'
2024-04-05 04:37:05.+xx UTC [3854278][background writer][:0] DEBUG:  
snapshot of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest 
complete 739 next xid 740)

And then it would be clear that the query has run before the new LSN is visible.

> > If the theory is proven then I'm not sure we need the extra complexity of
> > injection point here, maybe just relying on the slots created via SQL API 
> > could
> > be enough.
> >
> 
> Yeah, that could be the first step. We can probably add an injection
> point to control the bgwrite behavior and then add tests involving
> walsender performing the decoding. But I think it is important to have
> sufficient tests in this area as I see they are quite helpful in
> uncovering the issues.
>

Yeah agree. 

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: LogwrtResult contended spinlock

2024-04-06 Thread Alvaro Herrera
On 2024-Apr-05, Jeff Davis wrote:

> Minor comments:

> * Also, I assume that the Max() call in
> pg_atomic_monotonic_advance_u64() is just for clarity? Surely the
> currval cannot be less than _target when it returns. I'd probably just
> do Assert(currval >= _target) and then return currval.

Uhh ... my understanding of pg_atomic_compare_exchange_u64 is that
*expected is set to the value that's stored in *ptr prior to the
exchange.  Its comment

 * Atomically compare the current value of ptr with *expected and store newval
 * iff ptr and *expected have the same value. The current value of *ptr will
 * always be stored in *expected.

is actually not very clear, because what does "current" mean in this
context?  Is the original "current" value, or is it the "current" value
after the exchange?  Anyway, looking at the spinlock-emulated code for
pg_atomic_compare_exchange_u32_impl in atomics.c,

/* perform compare/exchange logic */
ret = ptr->value == *expected;
*expected = ptr->value;
if (ret)
ptr->value = newval;

it's clear that *expected receives the original value, not the new one.

Because of this behavior, not doing the Max() would return the obsolete
value rather than the one we just installed.  (It would only work
correctly without Max() when our cmpxchg operation "fails", that is,
somebody else incremented the value past the one we want to install.)


Another reason is that when I used pg_atomic_monotonic_advance_u64 with
the Write and Flush pointers and did not have the Max(), the assertion
failed pretty quickly :-)

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
In the beginning there was UNIX, and UNIX spoke and said: "Hello world\n".
It did not say "Hello New Jersey\n", nor "Hello USA\n".




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-06 Thread Melanie Plageman
On Sat, Apr 06, 2024 at 04:57:51PM +0200, Tomas Vondra wrote:
> On 4/6/24 15:40, Melanie Plageman wrote:
> > On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote:
> >>
> >>
> >> On 4/6/24 01:53, Melanie Plageman wrote:
> >>> On Fri, Apr 05, 2024 at 04:06:34AM -0400, Melanie Plageman wrote:
>  On Thu, Apr 04, 2024 at 04:35:45PM +0200, Tomas Vondra wrote:
> > On 4/4/24 00:57, Melanie Plageman wrote:
> >> On Sun, Mar 31, 2024 at 11:45:51AM -0400, Melanie Plageman wrote:
> > I'd focus on the first ~8-9 commits or so for now, we can commit more if
> > things go reasonably well.
> 
>  Sounds good. I will spend cleanup time on 0010-0013 tomorrow but would
>  love to know if you agree with the direction before I spend more time.
> >>>
> >>> In attached v16, I've split out 0010-0013 into 0011-0017. I think it is
> >>> much easier to understand.
> >>>
> >>
> >> Anyway, I've attached it as .tgz in order to not confuse cfbot. All the
> >> review comments are marked with XXX, so grep for that in the patches.
> >> There's two separate patches - the first one suggests a code change, so
> >> it was better to not merge that with your code. The second has just a
> >> couple XXX comments, I'm not sure why I kept it separate.
> >>
> >> A couple review comments:
> >>
> >> * I think 0001-0009 are 99% ready to go. I reworded some of the commit
> >> messages a bit - I realize it's a bit bold, considering you're native
> >> speaker and I'm not. If you could check I didn't make it worse, that
> >> would be great.
> > 
> > Attached v17 has *only* patches 0001-0009 with these changes. I will
> > work on applying the remaining patches, addressing feedback, and adding
> > comments next.
> > 
> > I have reviewed and incorporated all of your feedback on these patches.
> > Attached v17 is your exact patches with 1 or 2 *very* slight tweaks to
> > commit messages (comma splice removal and single word adjustments) as
> > well as the changes listed below:
> > 
> > I have open questions on the following:
> > 
> > - 0003: should it be SO_NEED_TUPLES and need_tuples (instead of
> > SO_NEED_TUPLE and need_tuple)?
> > 
> 
> I think SO_NEED_TUPLES is more accurate, as we need all tuples from the
> block. But either would work.

Attached v18 changes it to TUPLES/tuples

> 
> > - 0009 (your 0010)
> > - Should I mention in the commit message that we added blockno and
> > pfblockno in the BitmapHeapScanState only for validation or is that
> > too specific?
> > 
> 
> For the commit message I'd say it's too specific, I'd put it in the
> comment before the struct.

It is in the comment for the struct

> > - I'm worried this comment is vague and/or potentially not totally
> > correct. Should we remove it? I don't think we have conclusive proof
> > that this is true.
> >   /*
> >    * Adjusting the prefetch iterator before invoking
> >    * table_scan_bitmap_next_block() keeps prefetch distance higher across
> >    * the parallel workers.
> >    */
> > 
> 
> TBH it's not clear to me what "higher across parallel workers" means.
> But it sure shouldn't claim things that we think may not be correct. I
> don't have a good idea how to reword it, though.

I realized it makes more sense to add a FIXME (I used XXX; I'm not sure
when to use which) with a link to the message where Andres describes why he
thinks it is a bug. If we plan on fixing it, it is good to have a record
of that. And it makes it easier to put a clear and accurate comment.
Done in 0009.

> OK, thanks. I think 0001-0008 are ready to go, with some minor tweaks
> per above (tuple vs. tuples etc.), and the question about the recheck
> flag. If you can do these tweaks, I'll get that committed today and we
> can try to get a couple more patches in tomorrow.

Sounds good.

- Melanie
>From 3e80ba8914dd5876ab921ca39540be3b639236f7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 12 Feb 2024 18:50:29 -0500
Subject: [PATCH v18 01/10] BitmapHeapScan: begin scan after bitmap creation

It makes more sense for BitmapHeapScan to scan the index, build the
bitmap, and only then begin the scan on the underlying table.

This is mostly a cosmetic change for now, but later commits will need
to pass parameters to table_beginscan_bm() that are unavailable in
ExecInitBitmapHeapScan().

Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Andres Freund, Heikki Linnakangas
Discussion: https://postgr.es/m/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com
---
 src/backend/executor/nodeBitmapHeapscan.c | 27 +--
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index cee7f45aabe..c8c466e3c5c 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -178,6 +178,21 @@ BitmapHeapNext(BitmapHeapScanState *node)
 			}
 #endif			/* US

Re: remaining sql/json patches

2024-04-06 Thread Amit Langote
Hi,

On Sat, Apr 6, 2024 at 3:55 PM jian he  wrote:
> On Sat, Apr 6, 2024 at 2:03 PM Amit Langote  wrote:
> >
> > >
> > > * problem with type "char". the view def  output is not the same as
> > > the select * from v1.
> > >
> > > create or replace view v1 as
> > > SELECT col FROM s,
> > > JSON_TABLE(jsonb '{"d": ["hello", "hello1"]}', '$' as c1
> > > COLUMNS(col "char" path '$.d' without wrapper keep quotes))sub;
> > >
> > > \sv v1
> > > CREATE OR REPLACE VIEW public.v1 AS
> > >  SELECT sub.col
> > >FROM s,
> > > JSON_TABLE(
> > > '{"d": ["hello", "hello1"]}'::jsonb, '$' AS c1
> > > COLUMNS (
> > > col "char" PATH '$."d"'
> > > )
> > > ) sub
> > > one under the hood called JSON_QUERY_OP, another called JSON_VALUE_OP.
> >
> > Hmm, I don't see a problem as long as both are equivalent or produce
> > the same result.  Though, perhaps we could make
> > get_json_expr_options() also deparse JSW_NONE explicitly into "WITHOUT
> > WRAPPER" instead of a blank.  But that's existing code, so will take
> > care of it as part of the above open item.
> >
> > > I will do extensive checking for other types later, so far, other than
> > > these two issues,
> > > get_json_table_columns is pretty solid, I've tried nested columns with
> > > nested columns, it just works.
> >
> > Thanks for checking.
> >
> After applying v50, this type also has some issues.
> CREATE OR REPLACE VIEW t1 as
> SELECT sub.* FROM JSON_TABLE(jsonb '{"d": ["hello", "hello1"]}',
> '$' AS c1 COLUMNS (
> "tsvector0" tsvector path '$.d' without wrapper omit quotes,
> "tsvector1" tsvector path '$.d' without wrapper keep quotes))sub;
> table t1;
>
> return
> tsvector0|tsvector1
> -+-
>  '"hello1"]' '["hello",' | '"hello1"]' '["hello",'
> (1 row)
>
> src5=# \sv t1
> CREATE OR REPLACE VIEW public.t1 AS
>  SELECT tsvector0,
> tsvector1
>FROM JSON_TABLE(
> '{"d": ["hello", "hello1"]}'::jsonb, '$' AS c1
> COLUMNS (
> tsvector0 tsvector PATH '$."d"' OMIT QUOTES,
> tsvector1 tsvector PATH '$."d"'
> )
> ) sub
>
> but
>
>  SELECT tsvector0,
> tsvector1
>FROM JSON_TABLE(
> '{"d": ["hello", "hello1"]}'::jsonb, '$' AS c1
> COLUMNS (
> tsvector0 tsvector PATH '$."d"' OMIT QUOTES,
> tsvector1 tsvector PATH '$."d"'
> )
> ) sub
>
> only return
> tsvector0| tsvector1
> -+---
>  '"hello1"]' '["hello",' |

Yep, we *should* fix get_json_expr_options() to emit KEEP QUOTES and
WITHOUT WRAPPER options so that transformJsonTableColumns() does the
correct thing when you execute the \sv output.  Like this:

diff --git a/src/backend/utils/adt/ruleutils.c
b/src/backend/utils/adt/ruleutils.c
index 283ca53cb5..5a6aabe100 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -8853,9 +8853,13 @@ get_json_expr_options(JsonExpr *jsexpr,
deparse_context *context,
 appendStringInfo(context->buf, " WITH CONDITIONAL WRAPPER");
 else if (jsexpr->wrapper == JSW_UNCONDITIONAL)
 appendStringInfo(context->buf, " WITH UNCONDITIONAL WRAPPER");
+else if (jsexpr->wrapper == JSW_NONE)
+appendStringInfo(context->buf, " WITHOUT WRAPPER");

 if (jsexpr->omit_quotes)
 appendStringInfo(context->buf, " OMIT QUOTES");
+else
+appendStringInfo(context->buf, " KEEP QUOTES");
 }

Will get that pushed tomorrow.  Thanks for the test case.

-- 
Thanks, Amit Langote




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-06 Thread Tom Lane
Melanie Plageman  writes:
> On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote:
>> * The one question I'm somewhat unsure about is why Tom chose to use the
>> "wrong" recheck flag in the 2017 commit, when the correct recheck flag
>> is readily available. Surely that had a reason, right? But I can't think
>> of one ...

> See above.

Hi, I hadn't been paying attention to this thread, but Melanie pinged
me off-list about this question.  I think it's just a flat-out
oversight in 7c70996eb.  Looking at the mailing list thread
(particularly [1][2]), it seems that Alexander hadn't really addressed
the question of when to prefetch at all, but just skipped prefetch if
the current page was skippable:

+   /*
+* If we did not need to fetch the current page,
+* we probably will not need to fetch the next.
+*/
+   return;

It looks like I noticed that we could check the appropriate VM bits,
but failed to notice that we could easily check the appropriate
recheck flag as well.  Feel free to change it.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/a6434d5c-ed8d-b09c-a7c3-b2d1677e35b3%40postgrespro.ru
[2] https://www.postgresql.org/message-id/5974.1509573988%40sss.pgh.pa.us




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-06 Thread Tomas Vondra
On 4/6/24 15:40, Melanie Plageman wrote:
> On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote:
>>
>>
>> On 4/6/24 01:53, Melanie Plageman wrote:
>>> On Fri, Apr 05, 2024 at 04:06:34AM -0400, Melanie Plageman wrote:
 On Thu, Apr 04, 2024 at 04:35:45PM +0200, Tomas Vondra wrote:
> On 4/4/24 00:57, Melanie Plageman wrote:
>> On Sun, Mar 31, 2024 at 11:45:51AM -0400, Melanie Plageman wrote:
> I'd focus on the first ~8-9 commits or so for now, we can commit more if
> things go reasonably well.

 Sounds good. I will spend cleanup time on 0010-0013 tomorrow but would
 love to know if you agree with the direction before I spend more time.
>>>
>>> In attached v16, I've split out 0010-0013 into 0011-0017. I think it is
>>> much easier to understand.
>>>
>>
>> Anyway, I've attached it as .tgz in order to not confuse cfbot. All the
>> review comments are marked with XXX, so grep for that in the patches.
>> There's two separate patches - the first one suggests a code change, so
>> it was better to not merge that with your code. The second has just a
>> couple XXX comments, I'm not sure why I kept it separate.
>>
>> A couple review comments:
>>
> > * I think 0001-0009 are 99% ready to go. I reworded some of the commit
>> messages a bit - I realize it's a bit bold, considering you're native
>> speaker and I'm not. If you could check I didn't make it worse, that
>> would be great.
> 
> Attached v17 has *only* patches 0001-0009 with these changes. I will
> work on applying the remaining patches, addressing feedback, and adding
> comments next.
> 
> I have reviewed and incorporated all of your feedback on these patches.
> Attached v17 is your exact patches with 1 or 2 *very* slight tweaks to
> commit messages (comma splice removal and single word adjustments) as
> well as the changes listed below:
> 
> I have changed the following:
> 
> - 0003 added an assert that rs_empty_tuples_pending is 0 on rescan and
>   endscan
> 

OK

> - 0004 (your 0005)-- I followed up with Tom, but for now I have just
>   removed the XXX and also reworded the message a bit
> 

After the exercise I described a couple minutes ago, I think I'm
convinced the assumption is unnecessary and we should use the correct
recheck. Not that it'd make any difference in practice, considering none
of the opclasses ever changes the recheck.

Maybe the most prudent thing would be to skip this commit and maybe
leave this for later, but I'm not forcing you to do that if it would
mean a lot of disruption for the following patches.

> - 0006 (your 0007) fixed up the variable name (you changed valid ->
>   valid_block but it had gotten changed back)
> 

OK

> I have open questions on the following:
> 
> - 0003: should it be SO_NEED_TUPLES and need_tuples (instead of
>   SO_NEED_TUPLE and need_tuple)?
> 

I think SO_NEED_TUPLES is more accurate, as we need all tuples from the
block. But either would work.

> - 0009 (your 0010)
>   - Should I mention in the commit message that we added blockno and
>   pfblockno in the BitmapHeapScanState only for validation or is that
>   too specific?
> 

For the commit message I'd say it's too specific, I'd put it in the
comment before the struct.

>   - Should I mention that a future (imminent) commit will remove the
>   iterators from TableScanDescData and put them in HeapScanDescData? I
>   imagine folks don't want those there, but it is easier for the
>   progression of commits to put them there first and then move them
> 

I'd try not to mention future commits as justification too often, if we
don't know that the future commit lands shortly after.

>   - I'm worried this comment is vague and/or potentially not totally
>   correct. Should we remove it? I don't think we have conclusive proof
>   that this is true.
>   /*
>    * Adjusting the prefetch iterator before invoking
>    * table_scan_bitmap_next_block() keeps prefetch distance higher across
>    * the parallel workers.
>    */
> 

TBH it's not clear to me what "higher across parallel workers" means.
But it sure shouldn't claim things that we think may not be correct. I
don't have a good idea how to reword it, though.

> 
>> * I'm not sure extra_flags is the right way to pass the flag in 0003.
>> The "extra_" name is a bit weird, and no other table AM functions do it
>> this way and pass explicit bool flags instead. So my first "review"
>> commit does it like that. Do you agree it's better that way?
> 
> Yes.
> 

Cool

>> * The one question I'm somewhat unsure about is why Tom chose to use the
>> "wrong" recheck flag in the 2017 commit, when the correct recheck flag
>> is readily available. Surely that had a reason, right? But I can't think
>> of one ...
> 
> See above.
> 
>>> While I was doing that, I realized that I should remove the call to
>>> table_rescan() from ExecReScanBitmapHeapScan() and just rely on the ne

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-06 Thread Tomas Vondra
On 4/6/24 02:51, Tomas Vondra wrote:
> 
> * The one question I'm somewhat unsure about is why Tom chose to use the
> "wrong" recheck flag in the 2017 commit, when the correct recheck flag
> is readily available. Surely that had a reason, right? But I can't think
> of one ...
> 

I've been wondering about this a bit more, so I decided to experiment
and try to construct a case for which the current code prefetches the
wrong blocks, and the patch fixes that. But I haven't been very
successful so far :-(

My understanding was that the current code should do the wrong thing if
I alternate all-visible and not-all-visible pages. This understanding is
not correct, as I learned, because the thing that needs to change is the
recheck flag, not visibility :-( I'm still posting what I tried, perhaps
you will have an idea how to alter it to demonstrate the incorrect
behavior with current master.

The test was very simple:

  create table t (a int, b int) with (fillfactor=10);
  insert into t select mod((i/22),2), (i/22)
from generate_series(0,1000) s(i);
  create index on t (a);

which creates a table with 46 pages, 22 rows per page, column "a"
alternates between 0/1 on pages, column "b" increments on each page (so
"b" identifies page).

and then

  delete from t where mod(b,8) = 0;

which deletes tuples on pages 0, 8, 16, 24, 32, 40, so these pages will
need to be prefetched as not-all-visible by this query

  explain analyze select count(1) from t where a = 0

when forced to do bitmap heap scan. The other even-numbered pages remain
all-visible. I added a bit of logging into BitmapPrefetch(), but even
with master I get this:

LOG:  prefetching block 8 0 current block 6 0
LOG:  prefetching block 16 0 current block 14 0
LOG:  prefetching block 24 0 current block 22 0
LOG:  prefetching block 32 0 current block 30 0
LOG:  prefetching block 40 0 current block 38 0

So it prefetches the correct pages (the other value is the recheck flag
for that block from the iterator result).

Turns out (and I realize the comment about the assumption actually
states that, I just failed to understand it) the thing that would have
to differ for the blocks is the recheck flag.

But that can't actually happen because that's set by the AM/opclass and
the built-in ones do essentially this:

.../hash.c:  scan->xs_recheck = true;
.../nbtree.c:  scan->xs_recheck = false;


gist opclasses (e.g. btree_gist):

/* All cases served by this function are exact */
*recheck = false;

spgist opclasses (e.g. geo_spgist):

/* All tests are exact. */
out->recheck = false;

If there's an opclass that alters the recheck flag, it's well hidden and
I missed it.

Anyway, after this exercise and learning more about the recheck flag, I
think I agree the assumption is unnecessary. It's pretty harmless
because none of the built-in opclasses alters the recheck flag, but the
correct recheck flag is readily available. I'm still a bit puzzled why
the 2017 commit even relied on this assumption, though.

regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: remaining sql/json patches

2024-04-06 Thread jian he
On Fri, Apr 5, 2024 at 8:35 PM Amit Langote  wrote:
>
> On Thu, Apr 4, 2024 at 9:02 PM Amit Langote  wrote:
> > I'll post the rebased 0002 tomorrow after addressing your comments.
>
> Here's one.  Main changes:
>
> * Fixed a bug in get_table_json_columns() which caused nested columns
> to be deparsed incorrectly, something Jian reported upthread.
> * Simplified the algorithm in JsonTablePlanNextRow()
>
> I'll post another revision or two maybe tomorrow, but posting what I
> have now in case Jian wants to do more testing.
>

+ else
+ {
+ /*
+ * Parent and thus the plan has no more rows.
+ */
+ return false;
+ }
in JsonTablePlanNextRow, the above comment seems strange to me.

+ /*
+ * Re-evaluate a nested plan's row pattern using the new parent row
+ * pattern, if present.
+ */
+ Assert(parent != NULL);
+ if (!parent->current.isnull)
+ JsonTableResetRowPattern(planstate, parent->current.value);
Is this assertion useful?
If parent is null, then dereferencing parent->current.isnull will cause a segmentation fault.

I tested with 3 NESTED PATH, it works! (I didn't fully understand
JsonTablePlanNextRow though).
The doc needs some polish work.




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-06 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 6 Apr 2024, at 08:02, Peter Eisentraut  wrote:
>> Why do we need to check for the versions at all?  We should just check for 
>> the functions we need.  At least that's always been the normal approach in 
>> configure.

> We could, but finding a stable set of functions which identifies the version of
> OpenSSL *and* LibreSSL that we want, and their successors, while not matching
> any older versions seemed more opaque than testing two numeric values.

I don't think you responded to Peter's point at all.  The way autoconf
is designed to work is explicitly NOT to try to identify the exact
version of $whatever.  Rather, the idea is to probe for the API
features that you want to rely on: functions, macros, struct fields,
or the like.  If you can't point to an important API difference
between 1.0.2 and 1.1.1, why drop support for 1.0.2?

regards, tom lane




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-04-06 Thread Melanie Plageman
On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote:
> 
> 
> On 4/6/24 01:53, Melanie Plageman wrote:
> > On Fri, Apr 05, 2024 at 04:06:34AM -0400, Melanie Plageman wrote:
> >> On Thu, Apr 04, 2024 at 04:35:45PM +0200, Tomas Vondra wrote:
> >>> On 4/4/24 00:57, Melanie Plageman wrote:
>  On Sun, Mar 31, 2024 at 11:45:51AM -0400, Melanie Plageman wrote:
> >>> I'd focus on the first ~8-9 commits or so for now, we can commit more if
> >>> things go reasonably well.
> >>
> >> Sounds good. I will spend cleanup time on 0010-0013 tomorrow but would
> >> love to know if you agree with the direction before I spend more time.
> > 
> > In attached v16, I've split out 0010-0013 into 0011-0017. I think it is
> > much easier to understand.
> > 
> 
> Anyway, I've attached it as .tgz in order to not confuse cfbot. All the
> review comments are marked with XXX, so grep for that in the patches.
> There's two separate patches - the first one suggests a code change, so
> it was better to not merge that with your code. The second has just a
> couple XXX comments, I'm not sure why I kept it separate.
> 
> A couple review comments:
> 
> * I think 0001-0009 are 99% ready to go. I reworded some of the commit
> messages a bit - I realize it's a bit bold, considering you're native
> speaker and I'm not. If you could check I didn't make it worse, that
> would be great.

Attached v17 has *only* patches 0001-0009 with these changes. I will
work on applying the remaining patches, addressing feedback, and adding
comments next.

I have reviewed and incorporated all of your feedback on these patches.
Attached v17 is your exact patches with 1 or 2 *very* slight tweaks to
commit messages (comma splice removal and single word adjustments) as
well as the changes listed below:

I have changed the following:

- 0003 added an assert that rs_empty_tuples_pending is 0 on rescan and
endscan

- 0004 (your 0005)-- I followed up with Tom, but for now I have just
removed the XXX and also reworded the message a bit

- 0006 (your 0007) fixed up the variable name (you changed valid ->
valid_block but it had gotten changed back)

I have open questions on the following:

- 0003: should it be SO_NEED_TUPLES and need_tuples (instead of
SO_NEED_TUPLE and need_tuple)?

- 0009 (your 0010)
- Should I mention in the commit message that we added blockno and
pfblockno in the BitmapHeapScanState only for validation or is that
too specific?

- Should I mention that a future (imminent) commit will remove the
iterators from TableScanDescData and put them in HeapScanDescData? I
imagine folks don't want those there, but it is easier for the
progression of commits to put them there first and then move them

- I'm worried this comment is vague and/or potentially not totally
correct. Should we remove it? I don't think we have conclusive proof
that this is true.
  /*
   * Adjusting the prefetch iterator before invoking
   * table_scan_bitmap_next_block() keeps prefetch distance higher across
   * the parallel workers.
   */


> * I'm not sure extra_flags is the right way to pass the flag in 0003.
> The "extra_" name is a bit weird, and no other table AM functions do it
> this way and pass explicit bool flags instead. So my first "review"
> commit does it like that. Do you agree it's better that way?

Yes.

> * The one question I'm somewhat unsure about is why Tom chose to use the
> "wrong" recheck flag in the 2017 commit, when the correct recheck flag
> is readily available. Surely that had a reason, right? But I can't think
> of one ...

See above.

> > While I was doing that, I realized that I should remove the call to
> > table_rescan() from ExecReScanBitmapHeapScan() and just rely on the new
> > table_rescan_bm() invoked from BitmapHeapNext(). That is done in the
> > attached.
> > 
> > 0010-0018 still need comments updated but I focused on getting the split
> > out, reviewable version of them ready. I'll add comments (especially to
> > 0011 table AM functions) tomorrow. I also have to double-check if I
> > should add any asserts for table AMs about having implemented all of the
> > new begin/re/endscan() functions.
> > 
> 
> I added a couple more comments for those patches (10-12). Chances are
> the split in v16 clarifies some of my questions, but it'll have to wait
> till the morning ...

Will address this in next mail.

- Melanie
>From e65a9baa52c99c93a9a9a281aebacd38b8299721 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 12 Feb 2024 18:50:29 -0500
Subject: [PATCH v17 1/9] BitmapHeapScan: begin scan after bitmap creation

It makes more sense for BitmapHeapScan to scan the index, build the
bitmap, and only then begin the scan on the underlying table.

This is mostly a cosmetic change for now, but later commits will need
to pass parameters to table_beginscan_bm() that are unavailab

Re: Flushing large data immediately in pqcomm

2024-04-06 Thread David Rowley
On Sat, 6 Apr 2024 at 23:17, Jelte Fennema-Nio  wrote:
> Weird that on your machines you don't see a difference. Are you sure
> you didn't make a silly mistake, like not restarting postgres or
> something?

I'm sure. I spent quite a long time between the AMD and an Apple m2 trying.

I did see the same regression as you on the smaller numbers.  I
experimented with the attached which macro'ifies internal_flush() and
pg_noinlines internal_flush_buffer.

Can you try that to see if it gets rid of the regression on the first two tests?

David
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 6497100a1a..824b2f11a3 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -120,8 +120,8 @@ static List *sock_paths = NIL;
 
 static char *PqSendBuffer;
 static int PqSendBufferSize;   /* Size send buffer */
-static int PqSendPointer;  /* Next index to store a byte in 
PqSendBuffer */
-static int PqSendStart;/* Next index to send a byte in 
PqSendBuffer */
+static size_t PqSendPointer;   /* Next index to store a byte in 
PqSendBuffer */
+static size_t PqSendStart; /* Next index to send a byte in 
PqSendBuffer */
 
 static char PqRecvBuffer[PQ_RECV_BUFFER_SIZE];
 static int PqRecvPointer;  /* Next index to read a byte from 
PqRecvBuffer */
@@ -133,6 +133,7 @@ static int  PqRecvLength;   /* End of data 
available in PqRecvBuffer */
 static bool PqCommBusy;/* busy sending data to the 
client */
 static bool PqCommReadingMsg;  /* in the middle of reading a message */
 
+#define internal_flush()   internal_flush_buffer(PqSendBuffer, &PqSendStart, &PqSendPointer)
 
 /* Internal functions */
 static void socket_comm_reset(void);
@@ -144,7 +145,8 @@ static bool socket_is_send_pending(void);
 static int socket_putmessage(char msgtype, const char *s, size_t len);
 static void socket_putmessage_noblock(char msgtype, const char *s, size_t len);
 static int internal_putbytes(const char *s, size_t len);
-static int internal_flush(void);
+static pg_noinline int internal_flush_buffer(const char *s, size_t *start,
+											 size_t *end);
 
static int Lock_AF_UNIX(const char *unixSocketDir, const char *unixSocketPath);
 static int Setup_AF_UNIX(const char *sock_path);
@@ -1282,14 +1284,32 @@ internal_putbytes(const char *s, size_t len)
if (internal_flush())
return EOF;
}
-   amount = PqSendBufferSize - PqSendPointer;
-   if (amount > len)
-   amount = len;
-   memcpy(PqSendBuffer + PqSendPointer, s, amount);
-   PqSendPointer += amount;
-   s += amount;
-   len -= amount;
+
+   /*
+* If the buffer is empty and data length is larger than the buffer
+* size, send it without buffering. Otherwise, put as much data as
+* possible into the buffer.
+*/
+   if (len >= PqSendBufferSize && PqSendStart == PqSendPointer)
+   {
+   size_t start = 0;
+
+   socket_set_nonblocking(false);
+   if (internal_flush_buffer(s, &start, &len))
+   return EOF;
+   }
+   else
+   {
+   amount = PqSendBufferSize - PqSendPointer;
+   if (amount > len)
+   amount = len;
+   memcpy(PqSendBuffer + PqSendPointer, s, amount);
+   PqSendPointer += amount;
+   s += amount;
+   len -= amount;
+   }
}
+
return 0;
 }
 
@@ -1315,19 +1335,19 @@ socket_flush(void)
 }
 
 /* 
- * internal_flush - flush pending output
+ * internal_flush_buffer - flush the given buffer content
  *
  * Returns 0 if OK (meaning everything was sent, or operation would block
  * and the socket is in non-blocking mode), or EOF if trouble.
  * 
  */
-static int
-internal_flush(void)
+static pg_noinline int
+internal_flush_buffer(const char *s, size_t *start, size_t *end)
 {
static int  last_reported_send_errno = 0;
 
-   char   *bufptr = PqSendBuffer + PqSendStart;
-   char   *bufend = PqSendBuffer + PqSendPointer;
+   char   *bufptr = (char*) s + *start;
+   char   *bufend = (char*) s + *end;
 
while (bufptr < bufend)
{
@@ -1373,7 +1393,7 @@ internal_flush(void)
 * flag that'll cause the next CHECK_FOR_INTERRUPTS to terminate
 * the connection.
 */
-   PqSendStart =

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-06 Thread Bharath Rupireddy
On Sat, Apr 6, 2024 at 12:18 PM Amit Kapila  wrote:
>
> Why the handling w.r.t active_pid in InvalidatePossiblyInactiveSlot()
> is not similar to InvalidatePossiblyObsoleteSlot(). Won't we need to
> ensure that there is no other active slot user? Is it sufficient to
> check inactive_since for the same? If so, we need some comments to
> explain the same.

I removed the separate functions and with minimal changes, I've now
placed the RS_INVAL_INACTIVE_TIMEOUT logic into
InvalidatePossiblyObsoleteSlot and use that even in
CheckPointReplicationSlots.

> Can we avoid introducing the new functions like
> SaveGivenReplicationSlot() and MarkGivenReplicationSlotDirty(), if we
> do the required work in the caller?

Hm. Removed them now.

Please see the attached v38 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v38-0001-Add-inactive_timeout-based-replication-slot-inva.patch
Description: Binary data


Re: MultiXact\SLRU buffers configuration

2024-04-06 Thread Andrey M. Borodin



> On 29 Feb 2024, at 06:59, Kyotaro Horiguchi  wrote:
> 
> At Sat, 3 Feb 2024 22:32:45 +0500, "Andrey M. Borodin"  
> wrote in 
>> Here's the test draft. This test reliably reproduces sleep on CV when 
>> waiting next multixact to be filled into "members" SLRU.
> 
> By the way, I raised a question about using multiple CVs
> simultaneously [1]. That is, I suspect that the current CV
> implementation doesn't allow us to use multiple condition variables at
> the same time, because all CVs use the same PGPROC member cvWaitLink
> to accommodate different waiter sets. If this assumption is correct,
> we should resolve the issue before spreading more uses of CVs.

Alvaro, Kyotaro, what's our plan for this?
It seems too late to deal with this pg_usleep(1000L) for PG17.
I propose the following course of action:
1. Close this long-standing CF item
2. Start new thread with CV-sleep patch aimed at PG18
3. Create new entry in July CF

What do you think?


Best regards, Andrey Borodin.



Re: Flushing large data immediately in pqcomm

2024-04-06 Thread Jelte Fennema-Nio
On Sat, 6 Apr 2024 at 03:34, David Rowley  wrote:
> Does anyone else want to try the attached script on the v5 patch to
> see if their numbers are better?

On my machine (i9-10900X, in Ubuntu 22.04 on WSL on Windows) v5
consistently beats master by ~0.25 seconds:

master:
Run 100 100 500: 1.948975205
Run 1024 10240 20: 3.039986587
Run 1024 1048576 2000: 2.444176276
Run 1048576 1048576 1000: 2.475328596

v5:
Run 100 100 500: 1.997170909
Run 1024 10240 20: 3.057802598
Run 1024 1048576 2000: 2.199449857
Run 1048576 1048576 1000: 2.210328762

The first two runs are pretty much equal; I ran your script a few
more times and this seems like just random variance (sometimes v5 wins
those, sometimes master does, always quite close to each other). But
in the last two runs v5 consistently wins.

Weird that on your machines you don't see a difference. Are you sure
you didn't make a silly mistake, like not restarting postgres or
something?




Re: DROP OWNED BY fails to clean out pg_init_privs grants

2024-04-06 Thread Daniel Gustafsson
> On 6 Apr 2024, at 01:10, Tom Lane  wrote:

> (So now I'm wondering why *only* copperhead has shown this so far.
> Are our other cross-version-upgrade testing animals AWOL?)

Clicking around searching for Xversion animals I didn't spot any, but without
access to the database it's nontrivial to know which animal does what.

> I doubt this is something we'll have fixed by Monday, so I will
> go add an open item for it.

+1. Having opened the can of worms I'll have a look at it next week.

--
Daniel Gustafsson





Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-04-06 Thread Daniel Gustafsson
> On 6 Apr 2024, at 08:02, Peter Eisentraut  wrote:
> 
> On 05.04.24 23:48, Daniel Gustafsson wrote:
>> The reason to expand the check is to ensure that we have the version we want
>> for both OpenSSL and LibreSSL, and deprecating OpenSSL versions isn't all 
>> that
>> commonly done so having to change the version in the check didn't seem that
>> invasive to me.
> 
> Why do we need to check for the versions at all?  We should just check for 
> the functions we need.  At least that's always been the normal approach in 
> configure.

We could, but finding a stable set of functions which identifies the version of
OpenSSL *and* LibreSSL that we want, and their successors, while not matching
any older versions seemed more opaque than testing two numeric values.  The
suggested check is modelled on the LDAP check which tests for an explicit
version in a header file (albeit not for erroring out).

--
Daniel Gustafsson