Re: XLog size reductions: Reduced XLog record header size for PG17
> On 3 Jan 2024, at 15:15, Pavel Borisov wrote: > > Hi and Happy New Year! > > I've looked through the patches and the change seems quite small and > justified. But at the second round, some doubt arises on whether this long > patchset indeed introduces enough performance gain? I may be wrong, but it > saves only several bytes and the performance gain would be only in some > specific artificial workload. Did you do some measurements? Do we have > several percent performance-wise? > > Kind regards, > Pavel Borisov Hi Matthias! This is a kind reminder that the thread is waiting for your reply. Are you interesting in CF entry [0]? Thanks! Best regards, Andrey Borodin. [0] https://commitfest.postgresql.org/47/4386/
Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED
> On 15 Mar 2024, at 17:12, Nazir Bilal Yavuz wrote: > > I did not have the time to check other things you mentioned but I > tested the read performance. The table size is 5.5GB, I did 20 runs in > total. Hi Nazir! Do you plan to review anything else? Or do you think it worth to look at by someone else? Or is the patch Ready for Committer? If so, please swap CF entry [0] to status accordingly, currently it's "Waiting on Author". Best regards, Andrey Borodin. [0] https://commitfest.postgresql.org/47/4763/
Re: Improve heapgetpage() performance, overhead from serializable
On Sun, Apr 7, 2024 at 11:49 AM Andres Freund wrote: > > Hi, > > On 2024-01-22 13:01:31 +0700, John Naylor wrote: > > On Mon, Jul 17, 2023 at 9:58 PM Andres Freund wrote: > > > And 397->320ms > > > for something as core as this, is imo worth considering on its own. > > > > Hi Andres, this interesting work seems to have fallen off the radar -- > > are you still planning to move forward with this for v17? > > I had completely forgotten about this patch, but some discussion around > streaming read reminded me of it. Here's a rebased version, with conflicts > resolved and very light comment polish and a commit message. Given that > there's been no changes otherwise in the last months, I'm inclined to push in > a few hours. Just in time ;-) The commit message should also have "reviewed by Zhang Mingli" and "tested by Quan Zongliang", who shared results in a thread for a withrawn CF entry with a similar idea but covering fewer cases: https://www.postgresql.org/message-id/2ef7ff1b-3d18-2283-61b1-bbd25fc6c7ce%40yeah.net
Re: Improve heapgetpage() performance, overhead from serializable
Hi, On 2024-01-22 13:01:31 +0700, John Naylor wrote: > On Mon, Jul 17, 2023 at 9:58 PM Andres Freund wrote: > > And 397->320ms > > for something as core as this, is imo worth considering on its own. > > Hi Andres, this interesting work seems to have fallen off the radar -- > are you still planning to move forward with this for v17? I had completely forgotten about this patch, but some discussion around streaming read reminded me of it. Here's a rebased version, with conflicts resolved and very light comment polish and a commit message. Given that there's been no changes otherwise in the last months, I'm inclined to push in a few hours. Greetings, Andres Freund >From f2158455ac1a10eb393e25f7ddf87433a98e2ab0 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Sat, 6 Apr 2024 20:51:07 -0700 Subject: [PATCH v2] Reduce branches in heapgetpage()'s per-tuple loop Until now, heapgetpage()'s loop over all tuples performed some conditional checks for each tuple, even though condition did not change across the loop. This commit fixes that by moving the loop into an inline function. By calling it with different constant arguments, the compiler can generate an optimized loop for the different conditions, at the price of two per-page checks. For cases of all-visible tables and an isolation level other than serializable, speedups of up to 25% have been measured. Reviewed-by: John Naylor Discussion: https://postgr.es/m/20230716015656.xjvemfbp5fysj...@awork3.anarazel.de --- src/backend/access/heap/heapam.c | 102 ++- 1 file changed, 74 insertions(+), 28 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index dada2ecd1e3..cbbc87dec9a 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -364,6 +364,56 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk scan->rs_numblocks = numBlks; } +/* + * Per-tuple loop for heapgetpage() in pagemode. Pulled out so it can be + * called multiple times, with constant arguments for all_visible, + * check_serializable. + */ +pg_attribute_always_inline +static int +heapgetpage_collect(HeapScanDesc scan, Snapshot snapshot, + Page page, Buffer buffer, + BlockNumber block, int lines, + bool all_visible, bool check_serializable) +{ + int ntup = 0; + OffsetNumber lineoff; + + for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++) + { + ItemId lpp = PageGetItemId(page, lineoff); + bool valid; + HeapTupleData loctup; + + if (!ItemIdIsNormal(lpp)) + continue; + + loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp); + loctup.t_len = ItemIdGetLength(lpp); + loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd); + ItemPointerSet(&(loctup.t_self), block, lineoff); + + if (all_visible) + valid = true; + else + valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer); + + if (check_serializable) + HeapCheckForSerializableConflictOut(valid, scan->rs_base.rs_rd, +&loctup, buffer, snapshot); + + if (valid) + { + scan->rs_vistuples[ntup] = lineoff; + ntup++; + } + } + + Assert(ntup <= MaxHeapTuplesPerPage); + + return ntup; +} + /* * heap_prepare_pagescan - Prepare current scan page to be scanned in pagemode * @@ -379,9 +429,8 @@ heap_prepare_pagescan(TableScanDesc sscan) Snapshot snapshot; Page page; int lines; - int ntup; - OffsetNumber lineoff; bool all_visible; + bool check_serializable; Assert(BufferGetBlockNumber(buffer) == block); @@ -403,7 +452,6 @@ heap_prepare_pagescan(TableScanDesc sscan) page = BufferGetPage(buffer); lines = PageGetMaxOffsetNumber(page); - ntup = 0; /* * If the all-visible flag indicates that all tuples on the page are @@ -426,37 +474,35 @@ heap_prepare_pagescan(TableScanDesc sscan) * tuple for visibility the hard way. */ all_visible = PageIsAllVisible(page) && !snapshot->takenDuringRecovery; + check_serializable = + CheckForSerializableConflictOutNeeded(scan->rs_base.rs_rd, snapshot); - for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++) + /* + * We call heapgetpage_collect() with constant arguments, for faster code, + * without having to duplicate too much logic. The separate calls with + * constant arguments are needed on several compilers to actually perform + * constant folding. + */ + if (likely(all_visible)) { - ItemId lpp = PageGetItemId(page, lineoff); - HeapTupleData loctup; - bool valid; - - if (!ItemIdIsNormal(lpp)) - continue; - - loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd); - loctup.t_data = (HeapTupleHeader) PageGetItem(page, lpp); - loctup.t_len = ItemIdGetLength(lpp); - ItemPointerSet(&(loctup.t_self), block, lineoff); - - if (all_visible) - valid = true; + if (likely(!check_serializable)) + scan->rs_ntuples = heapgetpage_collect(scan, snapshot, page, buffer, + block, lines, true, false); else - valid = HeapTupleSatisf
Re: CASE control block broken by a single line comment
Erik Wienhold writes: > On 2024-04-06 20:14 +0200, Michal Bartak wrote: >> The issue described bellow exists in postgresql ver 16.2 (found in some >> previous major versions) > Can confirm also on master. I'm sure it's been there a while :-( > I'm surprised that the comment is not skipped by the scanner at this > point. Maybe because the parser just reads the raw expression between > WHEN and THEN with plpgsql_append_source_text via read_sql_construct. > How about the attached patch? It's a workaround by simply adding a line > feed character between the raw expression and the closing parenthesis. I don't have time to look into this on this deadline weekend, but what's bothering me about this report is the worry that we've made the same mistake elsewhere, or will do so in future. I suspect it'd be much more robust if we could remove the comment from the expr->query string. No idea how hard that is. regards, tom lane
Re: remaining sql/json patches
hi. about v50. +/* + * JsonTableSiblingJoin - + * Plan to union-join rows of nested paths of the same level + */ +typedef struct JsonTableSiblingJoin +{ + JsonTablePlan plan; + + JsonTablePlan *lplan; + JsonTablePlan *rplan; +} JsonTableSiblingJoin; "Plan to union-join rows of nested paths of the same level" same level problem misleading? I think it means "Plan to union-join rows of top level columns clause is a nested path" + if (IsA(planstate->plan, JsonTableSiblingJoin)) + { + /* Fetch new from left sibling. */ + if (!JsonTablePlanNextRow(planstate->left)) + { + /* + * Left sibling ran out of rows, fetch new from right sibling. + */ + if (!JsonTablePlanNextRow(planstate->right)) + { + /* Right sibling and thus the plan has now more rows. */ + return false; + } + } + } /* Right sibling and thus the plan has now more rows. */ I think you mean: /* Right sibling ran out of rows and thus the plan has no more rows. */ in section, + | NESTED PATH json_path_specification AS path_name +COLUMNS ( json_table_column , ... ) maybe make it into one line. | NESTED PATH json_path_specification AS path_name COLUMNS ( json_table_column , ... ) since the surrounding pattern is the next line beginning with "[", meaning that next line is optional. + at arbitrary nesting levels. maybe + at arbitrary nested level. in src/tools/pgindent/typedefs.list, "JsonPathSpec" is unnecessary. other than that, it looks good to me.
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sun, Apr 07, 2024 at 02:27:43AM +0200, Tomas Vondra wrote: > On 4/6/24 23:34, Melanie Plageman wrote: > > ... > >> > >> I realized it makes more sense to add a FIXME (I used XXX. I'm not when > >> to use what) with a link to the message where Andres describes why he > >> thinks it is a bug. If we plan on fixing it, it is good to have a record > >> of that. And it makes it easier to put a clear and accurate comment. > >> Done in 0009. > >> > >>> OK, thanks. If think 0001-0008 are ready to go, with some minor tweaks > >>> per above (tuple vs. tuples etc.), and the question about the recheck > >>> flag. If you can do these tweaks, I'll get that committed today and we > >>> can try to get a couple more patches in tomorrow. > > > > Attached v19 rebases the rest of the commits from v17 over the first > > nine patches from v18. All patches 0001-0009 are unchanged from v18. I > > have made updates and done cleanup on 0010-0021. > > > > I've pushed 0001-0005, I'll get back to this tomorrow and see how much > more we can get in for v17. Thanks! I thought about it a bit more, and I got worried about the Assert(scan->rs_empty_tuples_pending == 0); in heap_rescan() and heap_endscan(). I was worried if we don't complete the scan it could end up tripping incorrectly. I tried to come up with a query which didn't end up emitting all of the tuples on the page (using a LIMIT clause), but I struggled to come up with an example that qualified for the skip fetch optimization and also returned before completing the scan. I could work a bit harder tomorrow to try and come up with something. However, I think it might be safer to just change these to: scan->rs_empty_tuples_pending = 0 > What bothers me on 0006-0008 is that the justification in the commit > messages is "future commit will do something". I think it's fine to have > a separate "prepareatory" patches (I really like how you structured the > patches this way), but it'd be good to have them right before that > "future" commit - I'd like not to have one in v17 and then the "future > commit" in v18, because that's annoying complication for backpatching, > (and probably also when implementing the AM?) etc. Yes, I was thinking about this also. > AFAICS for v19, the "future commit" for all three patches (0006-0008) is > 0012, which introduces the unified iterator. Is that correct? Actually, those patches (v19 0006-0008) were required for v19 0009, which is why I put them directly before it. 0009 eliminates use of the TBMIterateResult for control flow in BitmapHeapNext(). I've rephrased the commit messages to not mention future commits and instead focus on what the changes in the commit are enabling. v19-0006 actually squashed very easily with v19-0009 and is actually probably better that way. It is still easy to understand IMO. In v20, I've attached just the functionality from v19 0006-0009 but in three patches instead of four. > Also, for 0008 I'm not sure we could even split it between v17 and v18, > because even if heapam did not use the iterator, what if some other AM > uses it? Without 0012 it'd be a problem for the AM, no? The iterators in the TableScanDescData were introduced in v19-0009. It is okay for other AMs to use it. In fact, they will want to use it. It is still initialized and set up in BitmapHeapNext(). They would just need to call tbm_iterate()/tbm_shared_iterate() on it. As for how table AMs will cope without the TBMIterateResult passed to table_scan_bitmap_next_tuple() (which is what v19 0008 did): they can save the location of the tuples to be scanned somewhere in their scan descriptor. Heap AM already did this and actually didn't use the TBMIterateResult at all. > Would it make sense to move 0009 before these three patches? That seems > like a meaningful change on it's own, right? Since v19 0009 requires these patches, I don't think we could do that. I think up to and including 0009 would be an improvement in clarity and function. As for all the patches after 0009, I've dropped them from this version. We are out of time, and they need more thought. After we decided not to pursue streaming bitmapheapscan for 17, I wanted to make sure we removed the prefetch code table AM violation -- since we weren't deleting that code. So what started out as me looking for a way to clean up one commit ended up becoming a much larger project. Sorry about that last minute code explosion! I do think there is a way to do it right and make it nice. Also that violation would be gone if we figure out how to get streaming bitmapheapscan behaving correctly. So, there's just more motivation to make streaming bitmapheapscan awesome for 18! Given all that, I've only included the three patches I think we are considering (former v19 0006-0008). They are largely the same as you saw them last except for squashing the two commits I mentioned above and updating all of the commit messages. > FWIW I don't think it's very likely I'll commit the Un
Re: Fixing backslash dot for COPY FROM...CSV
Bruce Momjian writes: > On Sun, Apr 7, 2024 at 12:00:25AM +0200, Daniel Verite wrote: >> Agreed we don't want to document that, but also why doesn't \. in the >> contents represent just a dot (as opposed to being an error), >> just like \a is a? > I looked into this and started to realize that \. is the only copy > backslash command where we define the behavior only alone at the > beginning of a line, and not in other circumstances. The \a example > above suggests \. should be period in all other cases, as suggested, but > I don't know the ramifications if that. Here's the problem: if some client-side code thinks it's okay to quote "." as "\.", what is likely to happen when it's presented a data value consisting only of "."? It could very easily fall into the trap of producing an end-of-data marker. If we get rid of the whole concept of end-of-data markers, then it'd be totally reasonable to accept "\." as ".". But as long as we still have end-of-data markers, I think it's unwise to allow "\." to appear as anything but an end-of-data marker. It'd just add camouflage to the booby trap. regards, tom lane
Re: Popcount optimization using AVX512
On Sat, Apr 06, 2024 at 02:41:01PM -0500, Nathan Bossart wrote: > Here is what I have staged for commit, which I intend to do shortly. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Table AM Interface Enhancements
Hi, Pavel! On Fri, Apr 5, 2024 at 6:58 PM Pavel Borisov wrote: > On Tue, 2 Apr 2024 at 19:17, Jeff Davis wrote: >> >> On Tue, 2024-04-02 at 11:49 +0300, Alexander Korotkov wrote: >> > I don't like the idea that every custom table AM reltoptions should >> > begin with StdRdOptions. I would rather introduce the new data >> > structure with table options, which need to be accessed outside of >> > table AM. Then reloptions will be a backbox only directly used in >> > table AM, while table AM has a freedom on what to store in reloptions >> > and how to calculate externally-visible options. What do you think? >> >> Hi Alexander! >> >> I agree with all of that. It will take some refactoring to get there, >> though. >> >> One idea is to store StdRdOptions like normal, but if an unrecognized >> option is found, ask the table AM if it understands the option. In that >> case I think we'd just use a different field in pg_class so that it can >> use whatever format it wants to represent its options. >> >> Regards, >> Jeff Davis > > I tried to rework a patch regarding table am according to the input from > Alexander and Jeff. > > It splits table reloptions into two categories: > - common for all tables (stored in a fixed size structure and could be > accessed from outside) > - table-am specific (variable size, parsed and accessed by access method only) Thank you for your work. Please, check the revised patch. It makes CommonRdOptions a separate data structure, not directly involved in parsing the reloption. Instead table AM can fill it on the base of its reloptions or calculate the other way. Patch comes with a test module, which comes with heap-based table AM. This table AM has "enable_parallel" reloption, which is used as the base to set the value of CommonRdOptions.parallel_workers. -- Regards, Alexander Korotkov v10-0001-Custom-reloptions-for-table-AM.patch Description: Binary data
Re: Fixing backslash dot for COPY FROM...CSV
On Sun, Apr 7, 2024 at 12:00:25AM +0200, Daniel Verite wrote: > Tom Lane wrote: > > > This is sufficiently weird that I'm starting to come around to > > Daniel's original proposal that we just drop the server's recognition > > of \. altogether (which would allow removal of some dozens of lines of > > complicated and now known-buggy code) > > FWIW my plan was to not change anything in the TEXT mode, > but I wasn't aware it had this issue that you found when > \. is not in a line by itself. > > > Alternatively, we could fix it so that \. at the end of a line draws > > "end-of-copy marker corrupt" > > which would at least make things consistent, but I'm not sure that has > > any great advantage. I surely don't want to document the current > > behavioral details as being the right thing that we're going to keep > > doing. > > Agreed we don't want to document that, but also why doesn't \. in the > contents represent just a dot (as opposed to being an error), > just like \a is a? I looked into this and started to realize that \. is the only copy backslash command where we define the behavior only alone at the beginning of a line, and not in other circumstances. The \a example above suggests \. should be period in all other cases, as suggested, but I don't know the ramifications if that. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 1, 2024 at 9:54 AM Masahiko Sawada wrote: > > Thank you! I've attached the patch that I'm going to push tomorrow. Excellent! I've attached a mostly-polished update on runtime embeddable values, storing up to 3 offsets in the child pointer (1 on 32-bit platforms). As discussed, this includes a macro to cap max possible offset that can be stored in the bitmap, which I believe only reduces the valid offset range for 32kB pages on 32-bit platforms. Even there, it allows for more line pointers than can possibly be useful. It also splits into two parts for readability. It would be committed in two pieces as well, since they are independently useful. From 24bd672deb4a6fa14abfc8583b500d1cbc332032 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Fri, 5 Apr 2024 17:04:12 +0700 Subject: [PATCH v83 1/3] store offsets in the header --- src/backend/access/common/tidstore.c | 52 +++ .../test_tidstore/expected/test_tidstore.out | 14 + .../test_tidstore/sql/test_tidstore.sql | 5 ++ 3 files changed, 71 insertions(+) diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c index e1a7e82469..4eb5d46951 100644 --- a/src/backend/access/common/tidstore.c +++ b/src/backend/access/common/tidstore.c @@ -34,6 +34,9 @@ /* number of active words for a page: */ #define WORDS_PER_PAGE(n) ((n) / BITS_PER_BITMAPWORD + 1) +/* number of offsets we can store in the header of a BlocktableEntry */ +#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint16)) / sizeof(OffsetNumber)) + /* * This is named similarly to PagetableEntry in tidbitmap.c * because the two have a similar function. @@ -41,6 +44,10 @@ typedef struct BlocktableEntry { uint16 nwords; + + /* We can store a small number of offsets here to avoid wasting space with a sparse bitmap. */ + OffsetNumber full_offsets[NUM_FULL_OFFSETS]; + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; } BlocktableEntry; #define MaxBlocktableEntrySize \ @@ -316,6 +323,25 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, for (int i = 1; i < num_offsets; i++) Assert(offsets[i] > offsets[i - 1]); + memset(page, 0, offsetof(BlocktableEntry, words)); + + if (num_offsets <= NUM_FULL_OFFSETS) + { + for (int i = 0; i < num_offsets; i++) + { + OffsetNumber off = offsets[i]; + + /* safety check to ensure we don't overrun bit array bounds */ + if (!OffsetNumberIsValid(off)) +elog(ERROR, "tuple offset out of range: %u", off); + + page->full_offsets[i] = off; + } + + page->nwords = 0; + } + else + { for (wordnum = 0, next_word_threshold = BITS_PER_BITMAPWORD; wordnum <= WORDNUM(offsets[num_offsets - 1]); wordnum++, next_word_threshold += BITS_PER_BITMAPWORD) @@ -343,6 +369,7 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, page->nwords = wordnum; Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1])); +} if (TidStoreIsShared(ts)) shared_ts_set(ts->tree.shared, blkno, page); @@ -369,6 +396,18 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) if (page == NULL) return false; + if (page->nwords == 0) + { + /* we have offsets in the header */ + for (int i = 0; i < NUM_FULL_OFFSETS; i++) + { + if (page->full_offsets[i] == off) +return true; + } + return false; + } + else + { wordnum = WORDNUM(off); bitnum = BITNUM(off); @@ -378,6 +417,7 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) return (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0; } +} /* * Prepare to iterate through a TidStore. @@ -496,6 +536,17 @@ tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno, result->num_offsets = 0; result->blkno = blkno; + if (page->nwords == 0) + { + /* we have offsets in the header */ + for (int i = 0; i < NUM_FULL_OFFSETS; i++) + { + if (page->full_offsets[i] != InvalidOffsetNumber) +result->offsets[result->num_offsets++] = page->full_offsets[i]; + } + } + else + { for (wordnum = 0; wordnum < page->nwords; wordnum++) { bitmapword w = page->words[wordnum]; @@ -518,3 +569,4 @@ tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno, } } } +} diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out index 0ae2f970da..c40779859b 100644 --- a/src/test/modules/test_tidstore/expected/test_tidstore.out +++ b/src/test/modules/test_tidstore/expected/test_tidstore.out @@ -36,6 +36,20 @@ SELECT do_set_block_offsets(blk, array_agg(off)::int2[]) (VALUES (0), (1), (:maxblkno / 2), (:maxblkno - 1), (:maxblkno)) AS blocks(blk), (VALUES (1), (2), (:maxoffset / 2), (:maxoffset - 1), (:maxoffset)) AS offsets(off) GROUP BY blk; +-- Test offsets embedded in the bitmap header. +SELECT do_set_block_offsets(501, array[greatest((random() * :maxoffset)::int, 1)]::int2[]); + do_set_block_offsets +-- + 501 +(1 row) + +S
Re: Change GUC hashtable to use simplehash?
On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma wrote: > > But given that we know the data length and we have it in a register > already, it's easy enough to just mask out data past the end with a > shift. See patch 1. Performance benefit is about 1.5x Measured on a > small test harness that just hashes and finalizes an array of strings, > with a data dependency between consecutive hashes (next address > depends on the previous hash output). I pushed this with a couple cosmetic adjustments, after fixing the endianness issue. I'm not sure why valgrind is fine with this way, and the other ways I tried forming the (little-endian) mask raised errors. In addition to "zero_byte_low | (zero_byte_low - 1)", I tried "~zero_byte_low & (zero_byte_low - 1)" and "zero_byte_low ^ (zero_byte_low - 1)" to no avail. On Thu, Mar 28, 2024 at 12:37 PM Jeff Davis wrote: > 0001 looks good to me, thank you. > > > v21-0003 adds a new file hashfn_unstable.c for convenience functions > > and converts all the duplicate frontend uses of hash_string_pointer. > > Why not make hash_string() inline, too? I'm fine with it either way, > I'm just curious why you went to the trouble to create a new .c file so > it didn't have to be inlined. Thanks for looking! I pushed these, with hash_string() inlined. I've attached (not reindented for clarity) an update of something mentioned a few times already -- removing strlen calls for dynahash and dshash string keys. I'm not quite sure how the comments should be updated about string_hash being deprecated to call directly. This patch goes further and semi-deprecates calling it at all, so these comments seem a bit awkward now. From 2e41e683b2fe2bc76b808e58ca2fea9067bff4e1 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Fri, 5 Apr 2024 13:59:13 +0700 Subject: [PATCH v21] Use fasthash for string keys in dynahash and dshash This avoids strlen calls. string_hash is kept around in case extensions are using it. --- src/backend/lib/dshash.c | 5 +++-- src/backend/utils/hash/dynahash.c| 25 - src/common/hashfn.c | 4 +++- src/include/common/hashfn_unstable.h | 33 src/include/lib/dshash.h | 2 +- 5 files changed, 59 insertions(+), 10 deletions(-) diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c index 93a9e21ddd..8bebf92cb8 100644 --- a/src/backend/lib/dshash.c +++ b/src/backend/lib/dshash.c @@ -32,6 +32,7 @@ #include "postgres.h" #include "common/hashfn.h" +#include "common/hashfn_unstable.h" #include "lib/dshash.h" #include "storage/lwlock.h" #include "utils/dsa.h" @@ -605,14 +606,14 @@ dshash_strcmp(const void *a, const void *b, size_t size, void *arg) } /* - * A hash function that forwards to string_hash. + * A hash function that forwards to hash_string_with_len. */ dshash_hash dshash_strhash(const void *v, size_t size, void *arg) { Assert(strlen((const char *) v) < size); - return string_hash((const char *) v, size); + return hash_string_with_len((const char *) v, size - 1); } /* diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c index 145e058fe6..9b85af2743 100644 --- a/src/backend/utils/hash/dynahash.c +++ b/src/backend/utils/hash/dynahash.c @@ -98,6 +98,7 @@ #include "access/xact.h" #include "common/hashfn.h" +#include "common/hashfn_unstable.h" #include "port/pg_bitutils.h" #include "storage/shmem.h" #include "storage/spin.h" @@ -309,6 +310,18 @@ string_compare(const char *key1, const char *key2, Size keysize) return strncmp(key1, key2, keysize - 1); } +/* + * hash function used when HASH_STRINGS is set + * + * If the string exceeds keysize-1 bytes, we want to hash only that many, + * because when it is copied into the hash table it will be truncated at + * that length. + */ +static uint32 +default_string_hash(const void *key, Size keysize) +{ + return hash_string_with_len((const char *) key, keysize - 1); +} /** CREATE ROUTINES **/ @@ -420,8 +433,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags) else { /* - * string_hash used to be considered the default hash method, and in a - * non-assert build it effectively still is. But we now consider it + * string_hash used to be considered the default hash method, and + * it effectively still was until version 17. Since version 14 we consider it * an assertion error to not say HASH_STRINGS explicitly. To help * catch mistaken usage of HASH_STRINGS, we also insist on a * reasonably long string length: if the keysize is only 4 or 8 bytes, @@ -430,12 +443,12 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags) Assert(flags & HASH_STRINGS); Assert(info->keysize > 8); - hashp->hash = string_hash; + hashp->hash = default_string_hash; } /* * If you don't specify a match function, it defaults to string_compare if - * you used str
Re: Flushing large data immediately in pqcomm
Hi, On 2024-04-07 00:45:31 +0200, Jelte Fennema-Nio wrote: > On Sat, 6 Apr 2024 at 22:21, Andres Freund wrote: > > The small regression for small results is still kinda visible, I haven't yet > > tested the patch downthread. > > Thanks a lot for the faster test script, I'm also impatient. I still > saw the small regression with David his patch. Here's a v6 where I > think it is now gone. I added inline to internal_put_bytes too. I > think that helped especially because for two calls to > internal_put_bytes len is a constant (1 and 4) that is smaller than > PqSendBufferSize. So for those calls the compiler can now statically > eliminate the new codepath because "len >= PqSendBufferSize" is known > to be false at compile time. Nice. > Also I incorporated all of Ranier his comments. Changing the global vars to size_t seems mildly bogus to me. All it's achieving is to use slightly more memory. It also just seems unrelated to the change. Greetings, Andres Freund
Re: Streaming read-ready sequential scan code
On Sat, Apr 6, 2024 at 9:25 PM Thomas Munro wrote: > > I found a bug in read_stream.c that could be hit with Melanie's > streaming seq scan patch with parallelism enabled and certain buffer > pool conditions. Short version: there is an edge case where an "if" > needed to be a "while", or we could lose a few blocks. Here's the fix > for that, longer explanation in commit message. Attached v13 0001 is your fix and 0002 is a new version of the sequential scan streaming read user. Off-list Andres mentioned that I really ought to separate the parallel and serial sequential scan users into two different callbacks. I've done that in the attached. It actually makes the code used by the callbacks nicer and more readable anyway (putting aside performance). I was able to measure a small performance difference as well. I've also added a few comments and improved existing comments. - Melanie From eded321df22bf472f147bd8f94b596d465355c70 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 5 Apr 2024 13:32:14 +1300 Subject: [PATCH v13 2/2] Use streaming IO in heapam sequential and TID range scans Instead of calling ReadBuffer() for each block, heap sequential scans and TID range scans now use the streaming read API introduced in b5a9b18cd0. Author: Melanie Plageman Reviewed-by: Andres Freund Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com --- src/backend/access/heap/heapam.c | 234 +-- src/include/access/heapam.h | 15 ++ 2 files changed, 176 insertions(+), 73 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 01bb2f4cc16..9d10d42b69b 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -223,6 +223,66 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] = * */ +/* + * Streaming read API callback for parallel sequential scans. Returns the next + * block the caller wants from the read stream or InvalidBlockNumber when done. + */ +static BlockNumber +heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + Assert(ScanDirectionIsForward(scan->rs_dir)); + Assert(scan->rs_base.rs_parallel); + + if (unlikely(!scan->rs_inited)) + { + /* parallel scan */ + table_block_parallelscan_startblock_init(scan->rs_base.rs_rd, + scan->rs_parallelworkerdata, + (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel); + + /* may return InvalidBlockNumber if there are no more blocks */ + scan->rs_prefetch_block = table_block_parallelscan_nextpage(scan->rs_base.rs_rd, + scan->rs_parallelworkerdata, + (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel); + scan->rs_inited = true; + } + else + { + scan->rs_prefetch_block = table_block_parallelscan_nextpage(scan->rs_base.rs_rd, + scan->rs_parallelworkerdata, (ParallelBlockTableScanDesc) + scan->rs_base.rs_parallel); + } + + return scan->rs_prefetch_block; +} + +/* + * Streaming read API callback for serial sequential and TID range scans. + * Returns the next block the caller wants from the read stream or + * InvalidBlockNumber when done. + */ +static BlockNumber +heap_scan_stream_read_next_serial(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + if (unlikely(!scan->rs_inited)) + { + scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir); + scan->rs_inited = true; + } + else + scan->rs_prefetch_block = heapgettup_advance_block(scan, + scan->rs_prefetch_block, + scan->rs_dir); + + return scan->rs_prefetch_block; +} + /* * initscan - scan code common to heap_beginscan and heap_rescan * @@ -325,6 +385,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock) scan->rs_cbuf = InvalidBuffer; scan->rs_cblock = InvalidBlockNumber; + /* + * Initialize to ForwardScanDirection because it is most common and + * because heap scans go forward before going backward (e.g. CURSORs). + */ + scan->rs_dir = ForwardScanDirection; + scan->rs_prefetch_block = InvalidBlockNumber; + /* page-at-a-time fields are always invalid when not rs_inited */ /* @@ -462,12 +529,14 @@ heap_prepare_pagescan(TableScanDesc sscan) /* * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM. * - * Read the next block of the scan relation into a buffer and pin that buffer - * before saving it in the scan descriptor. + * Read the next block of the scan relation from the read stream and pin that + * buffer before saving it in the scan descriptor. */ static inline void heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir) { + Assert(scan->rs_read_
Re: Streaming read-ready sequential scan code
I found a bug in read_stream.c that could be hit with Melanie's streaming seq scan patch with parallelism enabled and certain buffer pool conditions. Short version: there is an edge case where an "if" needed to be a "while", or we could lose a few blocks. Here's the fix for that, longer explanation in commit message. From 43cef2d58141ba048e9349b0027afff148be5553 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 7 Apr 2024 12:36:44 +1200 Subject: [PATCH] Fix bug in read_stream.c. When we determine that a wanted block can't be combined with the current pending read, it's time to start that pending read to get it out of the way. An "if" in that code path should have been a "while", because it might take more than one go to get that job done. Otherwise the remaining part of a partially started read could be clobbered and we could lose some blocks. This was only broken for smaller ranges, as the more common case of io_combine_limit-sized ranges is handled earlier in the code and knows how to loop. Discovered while testing parallel sequential scans of partially cached tables. They have a ramp-down phase with ever smaller ranges of contiguous blocks, to be fair to parallel workers as the work runs out. Defect in commit b5a9b18c. --- src/backend/storage/aio/read_stream.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 9a70a81f7ae..f54dacdd914 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -363,7 +363,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice) } /* We have to start the pending read before we can build another. */ - if (stream->pending_read_nblocks > 0) + while (stream->pending_read_nblocks > 0) { read_stream_start_pending_read(stream, suppress_advice); suppress_advice = false; -- 2.44.0
Re: psql's FETCH_COUNT (cursor) is not being respected for CTEs
So what was really bothering me about this patchset was that I didn't think marginal performance gains were a sufficient reason to put a whole different operating mode into libpq. However, I've reconsidered after realizing that implementing FETCH_COUNT atop traditional single-row mode would require either merging single-row results into a bigger PGresult or persuading psql's results-printing code to accept an array of PGresults not just one. Either of those would be expensive and ugly, not to mention needing chunks of code we don't have today. Also, it doesn't really need to be a whole different operating mode. There's no reason that single-row mode shouldn't be exactly equivalent to chunk mode with chunk size 1, except for the result status code. (We've got to keep PGRES_SINGLE_TUPLE for the old behavior, but using that for a chunked result would be too confusing.) So I whacked the patch around till I liked it better, and pushed it. I hope my haste will not come back to bite me, but we are getting pretty hard up against the feature-freeze deadline. regards, tom lane
Re: Cluster::restart dumping logs when stop fails
Hi, On 2024-04-07 00:19:35 +0200, Daniel Gustafsson wrote: > > On 6 Apr 2024, at 23:44, Andres Freund wrote: > > > It might be useful to print a few lines, but the whole log files can be > > several megabytes worth of output. > > The non-context aware fix would be to just print the last 1024 (or something) > bytes from the logfile: That'd be better, yes. I'd mainly log the path to the logfile though, that's probably at least as helpful for actually investigating the issue? > diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm > b/src/test/perl/PostgreSQL/Test/Cluster.pm > index 54e1008ae5..53d4751ffc 100644 > --- a/src/test/perl/PostgreSQL/Test/Cluster.pm > +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm > @@ -951,8 +951,8 @@ sub start > > if ($ret != 0) > { > - print "# pg_ctl start failed; logfile:\n"; > - print PostgreSQL::Test::Utils::slurp_file($self->logfile); > + print "# pg_ctl start failed; logfile excerpt:\n"; > + print substr > PostgreSQL::Test::Utils::slurp_file($self->logfile), -1024; > > # pg_ctl could have timed out, so check to see if there's a pid > file; > # otherwise our END block will fail to shut down the new > postmaster. That's probably unnecessary optimization, but it seems a tad silly to read an entire, potentially sizable, file to just use the last 1k. Not sure if the way slurp_file() uses seek supports negative ofsets, the docs read to me like that may only be supported with SEEK_END. Greetings, Andres Freund
Re: Add bump memory context type and use it for tuplesorts
On Sun, 7 Apr 2024, 01:59 David Rowley, wrote: > On Sun, 7 Apr 2024 at 05:45, Matthias van de Meent > wrote: > > Malloc's docs specify the minimum chunk size at 4*sizeof(void*) and > itself uses , so using powers of 2 for chunks would indeed fail to detect > 1s in the 4th bit. I suspect you'll get different results when you check > the allocation patterns of multiples of 8 bytes, starting from 40, > especially on 32-bit arm (where MALLOC_ALIGNMENT is 8 bytes, rather than > the 16 bytes on i386 and 64-bit architectures, assuming [0] is accurate) I'd hazard a guess that > there are more instances of Postgres running on Windows today than on > 32-bit CPUs and we don't seem too worried about the bit-patterns used > for Windows. > Yeah, that is something I had some thoughts about too, but didn't check if there was historical context around. I don't think it's worth bothering right now though. >> Another reason not to make it 5 bits is that I believe that would make > >> the mcxt_methods[] array 2304 bytes rather than 576 bytes. 4 bits > >> makes it 1152 bytes, if I'm counting correctly. > > > > > > I don't think I understand why this would be relevant when only 5 of the > contexts are actually in use (thus in caches). Is that size concern about > TLB entries then? > > It's a static const array. I don't want to bloat the binary with > something we'll likely never need. If we one day need it, we can > reserve another bit using the same overlapping method. > Fair points. >> I revised the patch to simplify hdrmask logic. This started with me > >> having trouble finding the best set of words to document that the > >> offset is "half the bytes between the chunk and block". So, instead > >> of doing that, I've just made it so these two fields effectively > >> overlap. The lowest bit of the block offset is the same bit as the > >> high bit of what MemoryChunkGetValue returns. > > > > > > Works for me, I suppose. > > hmm. I don't detect much enthusiasm for it. > I had a tiring day leaving me short on enthousiasm, after which I realised there were some things to this patch that would need fixing. I could've worded this better, but nothing against this code. -Matthias
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/6/24 23:34, Melanie Plageman wrote: > ... >> >> I realized it makes more sense to add a FIXME (I used XXX. I'm not when >> to use what) with a link to the message where Andres describes why he >> thinks it is a bug. If we plan on fixing it, it is good to have a record >> of that. And it makes it easier to put a clear and accurate comment. >> Done in 0009. >> >>> OK, thanks. If think 0001-0008 are ready to go, with some minor tweaks >>> per above (tuple vs. tuples etc.), and the question about the recheck >>> flag. If you can do these tweaks, I'll get that committed today and we >>> can try to get a couple more patches in tomorrow. > > Attached v19 rebases the rest of the commits from v17 over the first > nine patches from v18. All patches 0001-0009 are unchanged from v18. I > have made updates and done cleanup on 0010-0021. > I've pushed 0001-0005, I'll get back to this tomorrow and see how much more we can get in for v17. What bothers me on 0006-0008 is that the justification in the commit messages is "future commit will do something". I think it's fine to have a separate "prepareatory" patches (I really like how you structured the patches this way), but it'd be good to have them right before that "future" commit - I'd like not to have one in v17 and then the "future commit" in v18, because that's annoying complication for backpatching, (and probably also when implementing the AM?) etc. AFAICS for v19, the "future commit" for all three patches (0006-0008) is 0012, which introduces the unified iterator. Is that correct? Also, for 0008 I'm not sure we could even split it between v17 and v18, because even if heapam did not use the iterator, what if some other AM uses it? Without 0012 it'd be a problem for the AM, no? Would it make sense to move 0009 before these three patches? That seems like a meaningful change on it's own, right? FWIW I don't think it's very likely I'll commit the UnifiedTBMIterator stuff. I do agree with the idea in general, but I think I'd need more time to think about the details. Sorry about that ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Add bump memory context type and use it for tuplesorts
On Sun, 7 Apr 2024 at 05:45, Matthias van de Meent wrote: > Malloc's docs specify the minimum chunk size at 4*sizeof(void*) and itself > uses , so using powers of 2 for chunks would indeed fail to detect 1s in the > 4th bit. I suspect you'll get different results when you check the allocation > patterns of multiples of 8 bytes, starting from 40, especially on 32-bit arm > (where MALLOC_ALIGNMENT is 8 bytes, rather than the 16 bytes on i386 and > 64-bit architectures, assuming [0] is accurate) I'm prepared to be overruled, but I just don't have strong feelings that 32-bit is worth making these reservations for. Especially so given the rate we're filling these slots. The only system that I see the 4th bit change is Cygwin. It doesn't look like a very easy system to protect against pfreeing of malloc'd chunks as the prior 8-bytes seem to vary depending on the malloc'd size and I see all bit patterns there, including the ones we use for our memory contexts. Since we can't protect against every possible bit pattern there, we need to draw the line somewhere. I don't think 32-bit systems are worth reserving these precious slots for. I'd hazard a guess that there are more instances of Postgres running on Windows today than on 32-bit CPUs and we don't seem too worried about the bit-patterns used for Windows. > In your updated 0001, you don't seem to fill the RESERVED_GLIBC memctx array > entries with BOGUS_MCTX(). Oops. Thanks >> Another reason not to make it 5 bits is that I believe that would make >> the mcxt_methods[] array 2304 bytes rather than 576 bytes. 4 bits >> makes it 1152 bytes, if I'm counting correctly. > > > I don't think I understand why this would be relevant when only 5 of the > contexts are actually in use (thus in caches). Is that size concern about TLB > entries then? It's a static const array. I don't want to bloat the binary with something we'll likely never need. If we one day need it, we can reserve another bit using the same overlapping method. >> I revised the patch to simplify hdrmask logic. This started with me >> having trouble finding the best set of words to document that the >> offset is "half the bytes between the chunk and block". So, instead >> of doing that, I've just made it so these two fields effectively >> overlap. The lowest bit of the block offset is the same bit as the >> high bit of what MemoryChunkGetValue returns. > > > Works for me, I suppose. hmm. I don't detect much enthusiasm for it. Personally, I quite like the method as it adds no extra instructions when encoding the MemoryChunk and only a simple bitwise-AND when decoding it. Your method added extra instructions in the encode and decode. I went to great lengths to make this code as fast as possible, so I know which method that I prefer. We often palloc and never do anything that requires the chunk header to be decoded, so not adding extra instructions on the encoding stage is a big win. The only method I see to avoid adding instructions in encoding and decoding is to reduce the bit-space for the MemoryChunkGetValue field to 29 bits. Effectively, that means non-external chunks can only be 512MB rather than 1GB. As far as I know, that just limits slab.c to only being able to do 512MB pallocs as generation.c and aset.c use external chunks well below that threshold. Restricting slab to 512MB is probably far from the end of the world. Anything close to that would be a terrible use case for slab. I was just less keen on using a bit from there as that's a field we allow the context implementation to do what they like with. Having bitspace for 2^30 possible values in there just seems nice given that it can store any possible value from zero up to MaxAllocSize. David
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
On second thoughts, I think the original "invalidate" terminology was fine, no need to invent a new term. I thought of a better name for the bufmgr.c function though: InvalidateUnpinnedBuffer(). That name seemed better to me after I festooned it with warnings about why exactly it's inherently racy and only for testing use. I suppose someone could propose an additional function pg_buffercache_invalidate(db, tbspc, rel, fork, blocknum) that would be slightly better in the sense that it couldn't accidentally evict some innocent block that happened to replace the real target just before it runs, but I don't think it matters much for this purpose and it would still be racy on return (vacuum decides to load your block back in) so I don't think it's worth bothering with. So this is the version I plan to commit. From 6a13349b788c8539b2d349f9553706d7c563f8f8 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 7 Apr 2024 09:13:17 +1200 Subject: [PATCH v6] Add pg_buffercache_invalidate() function for testing. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When testing buffer pool logic, it is useful to be able to evict arbitrary blocks. This function can be used in SQL queries over the pg_buffercache view to set up a wide range of buffer pool states. Of course, buffer mappings might change concurrently so you might evict a block other than the one you had in mind, and another session might bring it back in at any time. That's OK for the intended purpose of setting up developer testing scenarios, and more complicated interlocking schemes to give stronger guararantees about that would likely be less flexible for actual testing work anyway. Superuser-only. Author: Palak Chaturvedi Author: Thomas Munro (docs, small tweaks) Reviewed-by: Nitin Jadhav Reviewed-by: Andres Freund Reviewed-by: Cary Huang Reviewed-by: Cédric Villemain Reviewed-by: Jim Nasby Reviewed-by: Maxim Orlov Reviewed-by: Thomas Munro Discussion: https://postgr.es/m/calfch19pw48zwwzuorspsav9hqt0upyabpc4boz4w+c7ff5...@mail.gmail.com --- contrib/pg_buffercache/Makefile | 2 +- contrib/pg_buffercache/meson.build| 1 + .../pg_buffercache--1.4--1.5.sql | 6 ++ contrib/pg_buffercache/pg_buffercache.control | 2 +- contrib/pg_buffercache/pg_buffercache_pages.c | 20 ++ doc/src/sgml/pgbuffercache.sgml | 37 +-- src/backend/storage/buffer/bufmgr.c | 63 +++ src/include/storage/bufmgr.h | 2 + 8 files changed, 125 insertions(+), 8 deletions(-) create mode 100644 contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile index d6b58d4da9..eae65ead9e 100644 --- a/contrib/pg_buffercache/Makefile +++ b/contrib/pg_buffercache/Makefile @@ -8,7 +8,7 @@ OBJS = \ EXTENSION = pg_buffercache DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \ pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \ - pg_buffercache--1.3--1.4.sql + pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time" REGRESS = pg_buffercache diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build index c86e33cc95..1ca3452918 100644 --- a/contrib/pg_buffercache/meson.build +++ b/contrib/pg_buffercache/meson.build @@ -22,6 +22,7 @@ install_data( 'pg_buffercache--1.2--1.3.sql', 'pg_buffercache--1.2.sql', 'pg_buffercache--1.3--1.4.sql', + 'pg_buffercache--1.4--1.5.sql', 'pg_buffercache.control', kwargs: contrib_data_args, ) diff --git a/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql b/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql new file mode 100644 index 00..92c530bc19 --- /dev/null +++ b/contrib/pg_buffercache/pg_buffercache--1.4--1.5.sql @@ -0,0 +1,6 @@ +\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.5'" to load this file. \quit + +CREATE FUNCTION pg_buffercache_invalidate(IN int) +RETURNS bool +AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate' +LANGUAGE C PARALLEL SAFE VOLATILE STRICT; diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control index a82ae5f9bb..5ee875f77d 100644 --- a/contrib/pg_buffercache/pg_buffercache.control +++ b/contrib/pg_buffercache/pg_buffercache.control @@ -1,5 +1,5 @@ # pg_buffercache extension comment = 'examine the shared buffer cache' -default_version = '1.4' +default_version = '1.5' module_pathname = '$libdir/pg_buffercache' relocatable = true diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c index 3316732365..9617bfa47b 100644 --- a/contrib/pg_buffercache/pg_buffercache_pages.c +++ b/contrib/pg_buffercache/pg_buffercache_pages.c @@ -63,6 +63,7 @@ typedef struct PG_FUNCTION_INFO_V1(pg_buffercache_pages); PG_FUNCTION_INFO_V1(pg_buffercache_summa
Re: Flushing large data immediately in pqcomm
On Sat, 6 Apr 2024 at 22:21, Andres Freund wrote: > The small regression for small results is still kinda visible, I haven't yet > tested the patch downthread. Thanks a lot for the faster test script, I'm also impatient. I still saw the small regression with David his patch. Here's a v6 where I think it is now gone. I added inline to internal_put_bytes too. I think that helped especially because for two calls to internal_put_bytes len is a constant (1 and 4) that is smaller than PqSendBufferSize. So for those calls the compiler can now statically eliminate the new codepath because "len >= PqSendBufferSize" is known to be false at compile time. Also I incorporated all of Ranier his comments. v6-0001-Faster-internal_putbytes.patch Description: Binary data
Re: Add SPLIT PARTITION/MERGE PARTITIONS commands
Hi, Alexander! I didn't push 0003 for the following reasons. Thanks for clarifying. You are right, these are serious reasons. -- With best regards, Dmitry Koval Postgres Professional: http://postgrespro.com
Re: Add SPLIT PARTITION/MERGE PARTITIONS commands
Hi, Dmitry! On Fri, Apr 5, 2024 at 4:00 PM Dmitry Koval wrote: > > I've revised the patchset. > > Thanks for the corrections (especially ddl.sgml). > Could you also look at a small optimization for the MERGE PARTITIONS > command (in a separate file > v31-0003-Additional-patch-for-ALTER-TABLE-.-MERGE-PARTITI.patch, I wrote > about it in an email 2024-03-31 00:56:50)? > > Files v31-0001-*.patch, v31-0002-*.patch are the same as > v30-0001-*.patch, v30-0002-*.patch (after rebasing because patch stopped > applying due to changes in upstream). I've pushed 0001 and 0002. I didn't push 0003 for the following reasons. 1) This doesn't keep functionality equivalent to 0001. With 0003, the merged partition will inherit indexes, constraints, and so on from the one of merging partitions. 2) This is not necessarily an optimization. Without 0003 indexes on the merged partition are created after moving the rows in attachPartitionTable(). With 0003 we merge data into the existing partition which saves its indexes. That might cause a significant performance loss because mass inserts into indexes may be much slower than building indexes from scratch. I think both aspects need to be carefully considered. Even if we accept them, this needs to be documented. I think now it's too late for both of these. So, this should wait for v18. -- Regards, Alexander Korotkov
Re: Cluster::restart dumping logs when stop fails
> On 6 Apr 2024, at 23:44, Andres Freund wrote: > It might be useful to print a few lines, but the whole log files can be > several megabytes worth of output. The non-context aware fix would be to just print the last 1024 (or something) bytes from the logfile: diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm index 54e1008ae5..53d4751ffc 100644 --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -951,8 +951,8 @@ sub start if ($ret != 0) { - print "# pg_ctl start failed; logfile:\n"; - print PostgreSQL::Test::Utils::slurp_file($self->logfile); + print "# pg_ctl start failed; logfile excerpt:\n"; + print substr PostgreSQL::Test::Utils::slurp_file($self->logfile), -1024; # pg_ctl could have timed out, so check to see if there's a pid file; # otherwise our END block will fail to shut down the new postmaster. Would that be a reasonable fix? -- Daniel Gustafsson
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Fri, 5 Apr 2024 at 18:48, Robert Haas wrote: > Maybe we'd be better off adding a libpq connection > option that forces the use of a specific minor protocol version, but > then we'll need backward-compatibility code in libpq basically > forever. But maybe we need that anyway to avoid older and newer > servers being unable to communicate. I think this would be good because it makes testing easy and, like you said, I think we'll need this backward-compatibility code in libpq anyway to be able to connect to old servers. To have even better and more realistic test coverage though, I think we might also want to actually test new libpq against old postgres servers and vice-versa in a build farm animal though. > Plus, you've got all of the consequences for non-core drivers, which > have to both add support for the new wire protocol - if they don't > want to seem outdated and eventually obsolete - and also test that > they're still compatible with all supported server versions. I think for clients/drivers, the work would generally be pretty minimal. For almost all proposed changes, clients can "support" the protocol version update by simply not using the new features, e.g. a client can "support" the ParameterSet feature, by simply never sending the ParameterSet message. So binding it to a protocol version bump doesn't make it any harder for that client to support that protocol version. I'm not saying that is the case for all protocol changes, but based on what's being proposed so far that's definitely a very common theme. Overall, I think this is something to discuss for each protocol change in isolation: i.e. how to make supporting the new feature as painless as possible for clients/drivers. > Connection poolers have the same set of problems. For connection poolers this is indeed a bigger hassle, because they at least need to be able to handle all the new message types that a client can send and maybe do something special for them. But I think if we're careful to keep connection poolers in mind when designing the features themselves then I think this isn't necessarily a problem. And probably indeed for the features that we think are hard for connection poolers to implement, we should be using protocol extension parameter feature flags. But I think a lot of protocol would be fairly trivial for a connection pooler to support. > The whole thing is > almost a hole with no bottom. Keeping up with core changes in this > area could become a massive undertaking for lots and lots of people, > some of whom may be the sole maintainer of some important driver that > now needs a whole bunch of work. I agree with Dave here, if you want to benefit from new features there's some expectation to keep up with the changes. But to be clear, we'd still support old protocol versions too. So we wouldn't break connecting using those clients, they simply wouldn't benefit from some of the new features. I think that's acceptable. > I'm not sure how much it improves things if we imagine adding feature > flags to the existing protocol versions, rather than whole new > protocol versions, but at least it cuts down on the assumption that > adopting new features is mandatory, and that such features are > cumulative. If a driver wants to support TDE but not protocol > parameters or protocol parameters but not TDE, who are we to say no? > If in supporting those things we bump the protocol version to 3.2, and > then 3.3 fixes a huge performance problem, are drivers going to be > required to add support for features they don't care about to get the > performance fixes? I think there's an important trade-off here. On one side we don't want to make maintainers of clients/poolers do lots of work to support features they don't care about. And on the other side it seems quite useful to limit the amount of feature combinations that are used it the wild (both for users and for us) e.g. the combinations of backwards compatibility testing you were talking about would explode if every protocol change was a feature flag. I think this trade-off is something we should be deciding on based on the specific protocol change. But if work needed to "support" the feature is "minimal" (to-be-defined exactly what we consider minimal), I think making it part of a protocol version bump is reasonable. > I see some benefit in bumping the protocol version > for major changes, or for changes that we have an important reason to > make mandatory, or to make previously-optional features for which > support has become in practical terms universal part of the base > feature set. But I'm very skeptical of the idea that we should just > handle as many things as possible via a protocol version bump. We've > been avoiding protocol version bumps like the plague since forever, > and swinging all the way to the other extreme doesn't sound like the > right idea to me. I think there's two parts to a protocol version bump: 1. The changes that cause us to consider a protocol bump
Re: Fixing backslash dot for COPY FROM...CSV
Tom Lane wrote: > This is sufficiently weird that I'm starting to come around to > Daniel's original proposal that we just drop the server's recognition > of \. altogether (which would allow removal of some dozens of lines of > complicated and now known-buggy code) FWIW my plan was to not change anything in the TEXT mode, but I wasn't aware it had this issue that you found when \. is not in a line by itself. > Alternatively, we could fix it so that \. at the end of a line draws > "end-of-copy marker corrupt" > which would at least make things consistent, but I'm not sure that has > any great advantage. I surely don't want to document the current > behavioral details as being the right thing that we're going to keep > doing. Agreed we don't want to document that, but also why doesn't \. in the contents represent just a dot (as opposed to being an error), just like \a is a? I mean if eofdata contains foobar\a foobaz\aother then we get after import: f1 -- foobara foobazaother (2 rows) Reading the current doc on the text format, I can't see why importing: foobar\. foobar\.other is not supposed to produce f1 -- foobar. foobaz.other (2 rows) I see these rules in [1] about backslash: #1. "End of data can be represented by a single line containing just backslash-period (\.)." foobar\. and foobar\.other do not match that so #1 does not describe how they're interpreted. #2. "Backslash characters (\) can be used in the COPY data to quote data characters that might otherwise be taken as row or column delimiters." Dot is not a column delimiter (it's forbidden anyway), so #2 does not apply. #3. "In particular, the following characters must be preceded by a backslash if they appear as part of a column value: backslash itself, newline, carriage return, and the current delimiter character" Dot is not in that list so #3 does not apply. #4. "The following special backslash sequences are recognized by COPY FROM:" (followed by the table with \b \f, ...,) Dot is not mentioned. #5. "Any other backslashed character that is not mentioned in the above table will be taken to represent itself" Here we say that backslash dot represents a dot (unless other rules apply) foobar\. => foobar. foobar\.other => foobar.other #6. "However, beware of adding backslashes unnecessarily, since that might accidentally produce a string matching the end-of-data marker (\.) or the null string (\N by default)." So we *recommend* not to use \. but as I understand it, the warning with the EOD marker is about accidentally creating a line that matches #1, that is, \. alone on a line. #7 "These strings will be recognized before any other backslash processing is done." TBH I don't understand what #7 implies. The order in backslash processing looks like an implementation detail that should not matter in understanding the format? Considering this, it seems to me that #5 says that backslash-dot represents a dot unless #1 applies, and the other #2 #3 #4 #6 #7 rules do not state anything that would contradict that. [1] https://www.postgresql.org/docs/current/sql-copy.html Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Re: UUID v7
For every complex problem there is an answer that is clear, simple, and wrong. Since the RFC allows microsecond timestamp granularity, the first thing that comes to everyone's mind is to insert microsecond granularity into UUIDv7. And if the RFC allowed nanosecond timestamp granularity, then they would try to insert nanosecond granularity into UUIDv7. But I am categorically against abandoning the counter under pressure from the unfounded proposal to replace the counter with microsecond granularity. 1) The RFC specifies millisecond timestamp granularity by default. 2) All advanced UUIDv7 implementations include a counter:• for JavaScript https://www.npmjs.com/package/uuidv7• for Rust https://crates.io/crates/uuid7• for Go (Golang) https://pkg.go.dev/github.com/gofrs/uuid#NewV7• for Python https://github.com/oittaa/uuid6-python 3) The theoretical performance of generating UUIDv7 without loss of monotonicity for microsecond granularity is only 1000 UUIDv7 per millisecond. This is very low and insufficient generation performance! But the actual generation performance is even worse, since the generation demand is unevenly distributed within a millisecond. Therefore, a UUIDv7 will not be generated every microsecond. For a counter 18 bits long, with the most significant bit initialized to zero and the remaining bits initialized to a random number, the actual performance of generating a UUIDv7 without loss of monotonicity is between 2 to the power of 17 = 131072 UUIDv7 per millisecond (if the random number happens to be all ones) to 2 to the power of 18 = 262144 UUIDv7 per millisecond (if the random number happens to be all zeros). This is more than enough. 4) Microsecond timestamp fraction subtracts 10 bits from random data, which increases the risk of collision. In the counter, almost all bits are initialized with a random number, which reduces the risk of collision. The only reasonable use of microsecond granularity is when writing to a database table in parallel. However, monotonicity in this case can be ensured in another way, namely a single UUIDv7 generator per database table, similar to SERIAL (https://postgrespro.com/docs/postgresql/16/datatype-numeric#DATATYPE-SERIAL) in PostgreSQL. Best regards, Sergey prokhorenkosergeyprokhore...@yahoo.com.au On Thursday, 4 April 2024 at 09:12:17 pm GMT+3, Andrey M. Borodin wrote: ... At this point we can skip the counter\microseconds entirely, just fill everything after unix_ts_ms with randomness. It's still a valid UUIDv7, exhibiting much more data locality than UUIDv4. We can adjust this sortability measures later. Best regards, Andrey Borodin.
Re: [HACKERS] make async slave to wait for lsn to be replayed
Hi, Alvaro! Thank you for your care on this matter. On Fri, Apr 5, 2024 at 9:15 PM Alvaro Herrera wrote: > BTW I noticed that > https://coverage.postgresql.org/src/backend/commands/waitlsn.c.gcov.html > says that lsn_cmp is not covered by the tests. This probably indicates > that the tests are a little too light, but I'm not sure how much extra > effort we want to spend. I'm aware of this. Ivan promised to send a patch to improve the test. If he doesn't, I'll care about it. > I'm still concerned that WaitLSNCleanup is only called in ProcKill. > Does this mean that if a process throws an error while waiting, it'll > not get cleaned up until it exits? Maybe this is not a big deal, but it > seems odd. I've added WaitLSNCleanup() to the AbortTransaction(). Just pushed that together with the improvements upthread. -- Regards, Alexander Korotkov
Cluster::restart dumping logs when stop fails
Hi, I recently started to be bothered by regress_* logs after some kinds of test failures containing the whole log of a test failure. E.g. in https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-04-06%2016%3A28%3A38 ... ### Restarting node "standby" # Running: pg_ctl -w -D /home/bf/bf-build/serinus/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/data/t_035_standby_logical_decoding_standby_data/pgdata -l /home/bf/bf-build/serinus/HEAD/pgsql.build/testrun/recovery/035_standby_logical_decoding/log/035_standby_logical_decoding_standby.log restart waiting for server to shut down... failed pg_ctl: server does not shut down # pg_ctl restart failed; logfile: 2024-04-06 16:33:37.496 UTC [2628363][postmaster][:0] LOG: starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.1, 64-bit 2024-04-06 16:33:37.503 UTC [2628363][postmaster][:0] LOG: listening on Unix socket "/tmp/55kikMaTyW/.s.PGSQL.63274" Looks like the printing of the entire log was added in: commit 33774978c78175095da9e6c276e8bcdb177725f8 Author: Daniel Gustafsson Date: 2023-09-22 13:35:37 +0200 Avoid using internal test methods in SSL tests It might be useful to print a few lines, but the whole log files can be several megabytes worth of output. In the buildfarm that leads to the same information being collected multiple times, and locally it makes it hard to see where the "normal" contents of regress_log* continue. Greetings, Andres Freund
Re: Synchronizing slots from primary to standby
Hi, On 2024-04-06 10:58:32 +0530, Amit Kapila wrote: > On Sat, Apr 6, 2024 at 10:13 AM Amit Kapila wrote: > > > > There are still a few pending issues to be fixed in this feature but > otherwise, we have committed all the main patches, so I marked the CF > entry corresponding to this work as committed. There are a a fair number of failures of 040_standby_failover_slots_sync in the buildfarm. It'd be nice to get those fixed soon-ish. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-04-06%2020%3A58%3A50 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-04-06%2015%3A18%3A08 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-04-06%2010%3A13%3A58 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-04-05%2016%3A04%3A10 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-04-05%2014%3A59%3A40 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-05%2014%3A59%3A07 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-04-05%2014%3A18%3A07 The symptoms are similar, but not entirely identical across all of them, I think. I've also seen a bunch of failures of this test locally. Greetings, Andres Freund
Re: Flushing large data immediately in pqcomm
Hi, On Fri, 5 Apr 2024 at 03:28, Melih Mutlu wrote: >Right. It was a mistake, forgot to remove that. Fixed it in v5. If you don't mind, I have some suggestions for patch v5. 1. Shouldn't PqSendBufferSize be of type size_t? There are several comparisons with other size_t variables. static size_t PqSendBufferSize; /* Size send buffer */ I think this would prevent possible overflows. 2. If PqSendBufferSize is changed to size_t, in the function socket_putmessage_noblock, the variable which name is *required*, should be changed to size_t as well. static void socket_putmessage_noblock(char msgtype, const char *s, size_t len) { int res PG_USED_FOR_ASSERTS_ONLY; size_t required; 3. In the internal_putbytes function, the *amout* variable could have the scope safely reduced. else { size_t amount; amount = PqSendBufferSize - PqSendPointer; 4. In the function internal_flush_buffer, the variables named *bufptr* and *bufend* could be const char * type, like: static int internal_flush_buffer(const char *s, size_t *start, size_t *end) { static int last_reported_send_errno = 0; const char *bufptr = s + *start; const char *bufend = s + *end; best regards, Ranier Vilela
Re: Statistics Import and Export
> > > > I think it'll be a serious, serious error for this not to be > SECTION_DATA. Maybe POST_DATA is OK, but even that seems like > an implementation compromise not "the way it ought to be". > We'd have to split them on account of when the underlying object is created. Index statistics would be SECTION_POST_DATA, and everything else would be SECTION_DATA. Looking ahead, statistics data for extended statistics objects would also be POST. That's not a big change, but my first attempt at that resulted in a bunch of unrelated grants dumping in the wrong section.
Re: CASE control block broken by a single line comment
On 2024-04-06 20:14 +0200, Michal Bartak wrote: > The issue described bellow exists in postgresql ver 16.2 (found in some > previous major versions) Can confirm also on master. > The documentation defines a comment as: > > > A comment is a sequence of characters beginning with double dashes and > > extending to the end of the line > > > When using such a comment within CASE control block, it ends up with an > error: > > DO LANGUAGE plpgsql $$ > DECLARE > t TEXT = 'a'; > BEGIN > CASE t > WHEN 'a' -- my comment > THEN RAISE NOTICE 'a'; > WHEN 'b' > THEN RAISE NOTICE 'b'; > ELSE NULL; > END CASE; > END;$$; > > ERROR: syntax error at end of input > LINE 1: "__Case__Variable_2__" IN ('a' -- my comment) > ^ > QUERY: "__Case__Variable_2__" IN ('a' -- my comment) > CONTEXT: PL/pgSQL function inline_code_block line 5 at CASE I'm surprised that the comment is not skipped by the scanner at this point. Maybe because the parser just reads the raw expression between WHEN and THEN with plpgsql_append_source_text via read_sql_construct. How about the attached patch? It's a workaround by simply adding a line feed character between the raw expression and the closing parenthesis. -- Erik >From 85456a22f41a8a51703650f2853fb6d8c9711fc7 Mon Sep 17 00:00:00 2001 From: Erik Wienhold Date: Sat, 6 Apr 2024 22:36:54 +0200 Subject: [PATCH v1] plpgsql: create valid IN expression for CASE WHEN clause The expression in CASE x WHEN THEN may end with a line comment. This results in a syntax error when the WHEN clause is rewritten to x IN (). Cope with that by appending \n after to terminate the line comment. --- src/pl/plpgsql/src/expected/plpgsql_control.out | 17 + src/pl/plpgsql/src/pl_gram.y| 6 +- src/pl/plpgsql/src/sql/plpgsql_control.sql | 14 ++ 3 files changed, 36 insertions(+), 1 deletion(-) diff --git a/src/pl/plpgsql/src/expected/plpgsql_control.out b/src/pl/plpgsql/src/expected/plpgsql_control.out index 328bd48586..ccd4f54704 100644 --- a/src/pl/plpgsql/src/expected/plpgsql_control.out +++ b/src/pl/plpgsql/src/expected/plpgsql_control.out @@ -681,3 +681,20 @@ select case_test(13); other (1 row) +-- test line comment between WHEN and THEN +create or replace function case_comment(int) returns text as $$ +begin + case $1 +when 1 -- comment before THEN + then return 'one'; +else + return 'other'; + end case; +end; +$$ language plpgsql immutable; +select case_comment(1); + case_comment +-- + one +(1 row) + diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y index bef33d58a2..98339589ba 100644 --- a/src/pl/plpgsql/src/pl_gram.y +++ b/src/pl/plpgsql/src/pl_gram.y @@ -4161,7 +4161,11 @@ make_case(int location, PLpgSQL_expr *t_expr, /* Do the string hacking */ initStringInfo(&ds); - appendStringInfo(&ds, "\"%s\" IN (%s)", + /* +* The read expression may end in a line comment, so +* append \n after it to create a valid expression. +*/ + appendStringInfo(&ds, "\"%s\" IN (%s\n)", varname, expr->query); pfree(expr->query); diff --git a/src/pl/plpgsql/src/sql/plpgsql_control.sql b/src/pl/plpgsql/src/sql/plpgsql_control.sql index ed7231134f..8e007c51dc 100644 --- a/src/pl/plpgsql/src/sql/plpgsql_control.sql +++ b/src/pl/plpgsql/src/sql/plpgsql_control.sql @@ -486,3 +486,17 @@ select case_test(1); select case_test(2); select case_test(12); select case_test(13); + +-- test line comment between WHEN and THEN +create or replace function case_comment(int) returns text as $$ +begin + case $1 +when 1 -- comment before THEN + then return 'one'; +else + return 'other'; + end case; +end; +$$ language plpgsql immutable; + +select case_comment(1); -- 2.44.0
Re: Flushing large data immediately in pqcomm
Hi, On 2024-04-06 14:34:17 +1300, David Rowley wrote: > I don't see any issues with v5, so based on the performance numbers > shown on this thread for the latest patch, it would make sense to push > it. The problem is, I just can't recreate the performance numbers. > > I've tried both on my AMD 3990x machine and an Apple M2 with a script > similar to the test.sh from above. I mostly just stripped out the > buffer size stuff and adjusted the timing code to something that would > work with mac. I think there are a few issues with the test script leading to not seeing a gain: 1) I think using the textual protocol, with the text datatype, will make it harder to spot differences. That's a lot of overhead. 2) Afaict the test is connecting over the unix socket, I think we expect bigger wins for tcp 3) Particularly the larger string is bottlenecked due to pglz compression in toast. Where I had noticed the overhead of the current approach badly, was streaming out basebackups. Which is all binary, of course. I added WITH BINARY, SET STORAGE EXTERNAL and tested both unix socket and localhost. I also reduced row counts and iteration counts, because I am impatient, and I don't think it matters much here. Attached the modified version. On a dual xeon Gold 5215, turbo boost disabled, server pinned to one core, script pinned to another: unix: master: Run 100 100 100: 0.058482377 Run 1024 10240 10: 0.120909810 Run 1024 1048576 2000: 0.153027916 Run 1048576 1048576 1000: 0.154953512 v5: Run 100 100 100: 0.058760126 Run 1024 10240 10: 0.118831396 Run 1024 1048576 2000: 0.124282503 Run 1048576 1048576 1000: 0.123894962 localhost: master: Run 100 100 100: 0.067088000 Run 1024 10240 10: 0.170894273 Run 1024 1048576 2000: 0.230346632 Run 1048576 1048576 1000: 0.230336078 v5: Run 100 100 100: 0.067144036 Run 1024 10240 10: 0.167950948 Run 1024 1048576 2000: 0.135167027 Run 1048576 1048576 1000: 0.135347867 The perf difference for 1MB via TCP is really impressive. The small regression for small results is still kinda visible, I haven't yet tested the patch downthread. Greetings, Andres Freund #!/bin/bash set -e dbname=postgres port=5440 host=/tmp host=localhost test_cases=( "100 100 100" # only 100 bytes "1024 10240 10"# 1Kb and 10Kb "1024 1048576 2000" # 1Kb and 1Mb "1048576 1048576 1000" # all 1Mb ) insert_rows(){ psql -d $dbname -p $port -h $host -c " DO \$\$ DECLARE counter INT; BEGIN FOR counter IN 1..$3 LOOP IF counter % 2 = 1 THEN INSERT INTO test_table VALUES (repeat('a', $1)::text); ELSE INSERT INTO test_table VALUES (repeat('b', $2)::text); END IF; END LOOP; END \$\$; " > /dev/null } psql -d $dbname -p $port -c "CREATE EXTENSION IF NOT EXISTS pg_prewarm;" > /dev/null for case in "${test_cases[@]}" do psql -d $dbname -p $port -h $host -c "DROP TABLE IF EXISTS test_table;" > /dev/null psql -d $dbname -p $port -h $host -c "CREATE UNLOGGED TABLE test_table(data text not null);" > /dev/null psql -d $dbname -p $port -h $host -c "ALTER TABLE test_table ALTER data SET STORAGE EXTERNAL;" > /dev/null insert_rows $case psql -d $dbname -p $port -h $host -c "select pg_prewarm('test_table');" > /dev/null echo -n "Run $case: " elapsed_time=0 for a in {1..5} do start_time=$(perl -MTime::HiRes=time -e 'printf "%.9f\n", time') psql -d $dbname -p $port -h $host -c "COPY test_table TO STDOUT WITH BINARY;" > /dev/null end_time=$(perl -MTime::HiRes=time -e 'printf "%.9f\n", time') elapsed_time=$(perl -e "printf('%.9f', ($end_time - $start_time) + $elapsed_time)") done avg_elapsed_time_in_ms=$(perl -e "printf('%.9f', ($elapsed_time / 30))") echo $avg_elapsed_time_in_ms done
Re: Popcount optimization using AVX512
On Sat, Apr 06, 2024 at 02:51:39PM +1300, David Rowley wrote: > On Sat, 6 Apr 2024 at 14:17, Nathan Bossart wrote: >> On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote: >> > Won't Valgrind complain about this? >> > >> > +pg_popcount_avx512(const char *buf, int bytes) >> > >> > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); >> > >> > + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); >> >> I haven't been able to generate any complaints, at least with some simple >> tests. But I see your point. If this did cause such complaints, ISTM we'd >> just want to add it to the suppression file. Otherwise, I think we'd have >> to go back to the non-maskz approach (which I really wanted to avoid >> because of the weird function overhead juggling) or find another way to do >> a partial load into an __m512i. > > [1] seems to think it's ok. If this is true then the following > shouldn't segfault: > > The following seems to run without any issue and if I change the mask > to 1 it crashes, as you'd expect. Cool. Here is what I have staged for commit, which I intend to do shortly. At some point, I'd like to revisit converting TRY_POPCNT_FAST to a configure-time check and maybe even moving the "fast" and "slow" implementations to their own files, but since that's mostly for code neatness and we are rapidly approaching the v17 deadline, I'm content to leave that for v18. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 9eea49555cbd14c7871085e159c9b0b78e92 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [PATCH v28 1/2] Optimize pg_popcount() with AVX-512 instructions. Presently, pg_popcount() processes data in 32-bit or 64-bit chunks when possible. Newer hardware that supports AVX-512 instructions can perform these tasks in 512-bit chunks, which can provide a nice speedup, especially for larger buffers. This commit introduces the infrastructure required to detect both compiler and CPU support for the required AVX-512 intrinsic functions, and it makes use of that infrastructure in a new pg_popcount() implementation. If CPU support for this optimized implementation is detected at runtime, a function pointer is updated so that it is used for subsequent calls to pg_popcount(). Most of the existing in-tree calls to pg_popcount() should benefit nicely from these instructions, and calls for smaller buffers should not regress when compared to v16. The new infrastructure introduced by this commit can also be used to optimized visibilitymap_count(), but that work is left for a follow-up commit. Co-authored-by: Paul Amonson, Ants Aasma Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com --- config/c-compiler.m4 | 58 ++ configure| 252 +++ configure.ac | 51 ++ meson.build | 87 + src/Makefile.global.in | 5 + src/include/pg_config.h.in | 12 ++ src/include/port/pg_bitutils.h | 11 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 11 ++ src/port/meson.build | 6 +- src/port/pg_bitutils.c | 5 + src/port/pg_popcount_avx512.c| 82 + src/port/pg_popcount_avx512_choose.c | 87 + src/test/regress/expected/bit.out| 24 +++ src/test/regress/sql/bit.sql | 4 + 15 files changed, 696 insertions(+), 3 deletions(-) create mode 100644 src/port/pg_popcount_avx512.c create mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb..cfff48c1bc 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -694,3 +694,61 @@ if test x"$Ac_cachevar" = x"yes"; then fi undefine([Ac_cachevar])dnl ])# PGAC_LOONGARCH_CRC32C_INTRINSICS + +# PGAC_XSAVE_INTRINSICS +# - +# Check if the compiler supports the XSAVE instructions using the _xgetbv +# intrinsic function. +# +# An optional compiler flag can be passed as argument (e.g., -mxsave). If the +# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +AC_DEFUN([PGAC_XSAVE_INTRINSICS], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl +AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], +[pgac_save_CFLAGS=$CFLAGS +CFLAGS="$pgac_save_CFLAGS $1" +AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], + [return _xgetbv(0) & 0xe0;])], + [Ac_cachevar=yes], + [Ac_cachevar=no]) +CFLAGS="$pgac_save_CFLAGS"]) +if test x"$Ac_cachevar" = x"yes"; then + CFLAGS_XSAVE="$1" + pgac_xsave_intrinsics=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_XSAVE_INTRINSICS + +# PGAC_AVX512_POPCNT_INTRINSICS +# --
CASE control block broken by a single line comment
Hello all The issue described bellow exists in postgresql ver 16.2 (found in some previous major versions) The documentation defines a comment as: > A comment is a sequence of characters beginning with double dashes and > extending to the end of the line When using such a comment within CASE control block, it ends up with an error: DO LANGUAGE plpgsql $$ DECLARE t TEXT = 'a'; BEGIN CASE t WHEN 'a' -- my comment THEN RAISE NOTICE 'a'; WHEN 'b' THEN RAISE NOTICE 'b'; ELSE NULL; END CASE; END;$$; ERROR: syntax error at end of input LINE 1: "__Case__Variable_2__" IN ('a' -- my comment) ^ QUERY: "__Case__Variable_2__" IN ('a' -- my comment) CONTEXT: PL/pgSQL function inline_code_block line 5 at CASE With Regards Michal Bartak
Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?
> On 6 Apr 2024, at 16:04, Tom Lane wrote: > Daniel Gustafsson writes: >>> On 6 Apr 2024, at 08:02, Peter Eisentraut wrote: >>> Why do we need to check for the versions at all? We should just check for >>> the functions we need. At least that's always been the normal approach in >>> configure. > >> We could, but finding a stable set of functions which identifies the version >> of >> OpenSSL *and* LibreSSL that we want, and their successors, while not matching >> any older versions seemed more opaque than testing two numeric values. > > I don't think you responded to Peter's point at all. The way autoconf > is designed to work is explicitly NOT to try to identify the exact > version of $whatever. Rather, the idea is to probe for the API > features that you want to rely on: functions, macros, struct fields, > or the like. If you can't point to an important API difference > between 1.0.2 and 1.1.1, why drop support for 1.0.2? My apologies, I thought I did but clearly failed. My point was that this is a special/corner case where we try to find one of two different libraries (with different ideas about backwards compatability etc) for supporting a single thing. So instead I tested for the explicit versions like how we already test for the exact Perl version in config/perl.m4 (albeit that a library and a program are two different things of course). In bumping we want to move to 1.1.1 since that's the first version with the rewritten RNG which is fork-safe by design, something PostgreSQL clearly benefits from. There is no new API for this to gate on though. For LibreSSL we want 3.3.2 to a) ensure we have coverage in the BF and b) since it's the first version where the tests pass due to error message alignment with OpenSSL. The combination of these gets rid of lots of specialcased #ifdef soup. I wasn't however able to find a specific API call which is unique to the two version which we rely on. Testing for the presence of an API provided and introduced by both libraries in the version we're interested in, but which we don't use, is the alternative but I thought that would be more frowned upon. EVP_PKEY_new_CMAC_key() was introduced in 1.1.1 and LibreSSL 3.3.2, so an AC_CHECK_FUNCS for that, as in the attached, achieves the version check but pollutes pg_config.h with a define which will never be used which seemed a bit ugly. -- Daniel Gustafsson v6-0001-Remove-support-for-OpenSSL-1.0.2-and-1.1.0.patch Description: Binary data
Re: LogwrtResult contended spinlock
On Sat, 2024-04-06 at 18:13 +0200, Alvaro Herrera wrote: > my understanding of pg_atomic_compare_exchange_u64 is that > *expected is set to the value that's stored in *ptr prior to the > exchange. Sorry, my mistake. Your version looks good. Regards, Jeff Davis
Re: Add bump memory context type and use it for tuplesorts
On Sat, 6 Apr 2024, 14:36 David Rowley, wrote: > On Sat, 6 Apr 2024 at 02:30, Matthias van de Meent > wrote: > > > > On Thu, 4 Apr 2024 at 22:43, Tom Lane wrote: > > > > > > Matthias van de Meent writes: > > > > It extends memory context IDs to 5 bits (32 values), of which > > > > - 8 have glibc's malloc pattern of 001/010; > > > > - 1 is unused memory's 0 > > > > - 1 is wipe_mem's 1 > > > > - 4 are used by existing contexts > (Aset/Generation/Slab/AlignedRedirect) > > > > - 18 are newly available. > > > > > > This seems like it would solve the problem for a good long time > > > to come; and if we ever need more IDs, we could steal one more bit > > > by requiring the offset to the block header to be a multiple of 8. > > > (Really, we could just about do that today at little or no cost ... > > > machines with MAXALIGN less than 8 are very thin on the ground.) > > > > Hmm, it seems like a decent idea, but I didn't want to deal with the > > repercussions of that this late in the cycle when these 2 bits were > > still relatively easy to get hold of. > > Thanks for writing the patch. > > I think 5 bits is 1 too many. 4 seems fine. I also think you've > reserved too many slots in your patch as I disagree that we need to > reserve the glibc malloc pattern anywhere but in the 1 and 2 slots of > the mcxt_methods[] array. I looked again at the 8 bytes prior to a > glibc malloc'd chunk and I see the lowest 4 bits of the headers > consistently set to 0001 for all powers of 2 starting at 8 up to > 65536. Malloc's docs specify the minimum chunk size at 4*sizeof(void*) and itself uses , so using powers of 2 for chunks would indeed fail to detect 1s in the 4th bit. I suspect you'll get different results when you check the allocation patterns of multiples of 8 bytes, starting from 40, especially on 32-bit arm (where MALLOC_ALIGNMENT is 8 bytes, rather than the 16 bytes on i386 and 64-bit architectures, assuming [0] is accurate) 131072 seems to vary and beyond that, they seem to be set to > 0010. > In your updated 0001, you don't seem to fill the RESERVED_GLIBC memctx array entries with BOGUS_MCTX(). With that, there's no increase in the number of reserved slots from > what we have reserved today. Still 4. So having 4 bits instead of 3 > bits gives us a total of 12 slots rather than 4 slots. Having 3x > slots seems enough. We might need an extra bit for something else > sometime. I think keeping it up our sleeve is a good idea. > > Another reason not to make it 5 bits is that I believe that would make > the mcxt_methods[] array 2304 bytes rather than 576 bytes. 4 bits > makes it 1152 bytes, if I'm counting correctly. > I don't think I understand why this would be relevant when only 5 of the contexts are actually in use (thus in caches). Is that size concern about TLB entries then? > I revised the patch to simplify hdrmask logic. This started with me > having trouble finding the best set of words to document that the > offset is "half the bytes between the chunk and block". So, instead > of doing that, I've just made it so these two fields effectively > overlap. The lowest bit of the block offset is the same bit as the > high bit of what MemoryChunkGetValue returns. Works for me, I suppose. I also updated src/backend/utils/mmgr/README to explain this and > adjust the mentions of 3-bits and 61-bits to 4-bits and 60-bits. I > also explained the overlapping part. > Thanks! [0] https://sourceware.org/glibc/wiki/MallocInternals#Platform-specific_Thresholds_and_Constants >
Re: Fixing backslash dot for COPY FROM...CSV
I wrote: > So the current behavior is that \. that is on the end of a line, > but is not the whole line, is silently discarded and we keep going. > All versions throw "end-of-copy marker corrupt" if there is > something after \. on the same line. > This is sufficiently weird that I'm starting to come around to > Daniel's original proposal that we just drop the server's recognition > of \. altogether (which would allow removal of some dozens of lines of > complicated and now known-buggy code). I experimented with that and soon ran into a nasty roadblock: it breaks dump/restore, because pg_dump includes a "\." line after COPY data whether or not it really needs one. Worse, that's implemented by including the "\." line into the archive format, so that existing dump files contain it. Getting rid of it would require an archive format version bump, plus some hackery to allow removal of the line when reading old dump files. While that's surely doable with enough effort, it's not the kind of thing to be undertaking with less than 2 days to feature freeze. Not to mention that I'm not sure we have consensus to do it at all. More fun stuff: PQgetline actually invents a "\." line when it sees server end-of-copy, and we tell users of that function to check for that not an out-of-band return value to detect EOF. It looks like we have no callers of that in the core distro, but do we want to deprecate it completely? So I feel like we need to put this patch on the shelf for the moment and come back to it early in v18. Although it seems reasonably clear what to do on the CSV side of things, it's very much less clear what to do about text-format handling of EOD markers, and I don't want to change one of those things in v17 and the other in v18. Also it seems like there are more dependencies on "\." than we realized. There could be an argument for applying just the psql change now, to remove its unnecessary sending of "\.". That won't break anything and it would give us at least one year's leg up on compatibility issues. regards, tom lane
Re: Synchronizing slots from primary to standby
Hi, On Sat, Apr 06, 2024 at 10:13:00AM +0530, Amit Kapila wrote: > On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot > wrote: > > I think the new LSN can be visible only when the corresponding WAL is > written by XLogWrite(). I don't know what in XLogSetAsyncXactLSN() can > make it visible. In your experiment below, isn't it possible that in > the meantime WAL writer has written that WAL due to which you are > seeing an updated location? What I did is: session 1: select pg_current_wal_lsn();\watch 1 session 2: select pg_backend_pid(); terminal 1: tail -f logfile | grep -i snap terminal 2 : gdb -p ) at standby.c:1346 1346{ (gdb) n 1350 Then next, next until the DEBUG message is emitted (confirmed in terminal 1). At this stage the DEBUG message shows the new LSN while session 1 still displays the previous LSN. Then once XLogSetAsyncXactLSN() is done in the gdb session (terminal 2) then session 1 displays the new LSN. This is reproducible as desired. With more debugging I can see that when the spinlock is released in XLogSetAsyncXactLSN() then XLogWrite() is doing its job and then session 1 does see the new value (that happens in this order, and as you said that's expected). My point is that while the DEBUG message is emitted session 1 still see the old LSN (until the new LSN is vsible). I think that we should emit the DEBUG message once session 1 can see the new value (If not, I think the timestamp of the DEBUG message can be missleading during debugging purpose). > I think I am missing how exactly moving DEBUG2 can confirm the above theory. I meant to say that instead of seeing: 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest complete 739 next xid 740) 2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG: statement: SELECT '0/360' <= replay_lsn AND state = 'streaming' We would probably see something like: 2024-04-05 04:37:05. UTC [3866475][client backend][2/4:0] LOG: statement: SELECT '0/360' <= replay_lsn AND state = 'streaming' 2024-04-05 04:37:05.+xx UTC [3854278][background writer][:0] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest complete 739 next xid 740) And then it would be clear that the query has ran before the new LSN is visible. > > If the theory is proven then I'm not sure we need the extra complexity of > > injection point here, maybe just relying on the slots created via SQL API > > could > > be enough. > > > > Yeah, that could be the first step. We can probably add an injection > point to control the bgwrite behavior and then add tests involving > walsender performing the decoding. But I think it is important to have > sufficient tests in this area as I see they are quite helpful in > uncovering the issues. > Yeah agree. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: LogwrtResult contended spinlock
On 2024-Apr-05, Jeff Davis wrote: > Minor comments: > * Also, I assume that the Max() call in > pg_atomic_monotonic_advance_u64() is just for clarity? Surely the > currval cannot be less than _target when it returns. I'd probably just > do Assert(currval >= _target) and then return currval. Uhh ... my understanding of pg_atomic_compare_exchange_u64 is that *expected is set to the value that's stored in *ptr prior to the exchange. Its comment * Atomically compare the current value of ptr with *expected and store newval * iff ptr and *expected have the same value. The current value of *ptr will * always be stored in *expected. is actually not very clear, because what does "current" mean in this context? Is the original "current" value, or is it the "current" value after the exchange? Anyway, looking at the spinlock-emulated code for pg_atomic_compare_exchange_u32_impl in atomics.c, /* perform compare/exchange logic */ ret = ptr->value == *expected; *expected = ptr->value; if (ret) ptr->value = newval; it's clear that *expected receives the original value, not the new one. Because of this behavior, not doing the Max() would return the obsolete value rather than the one we just installed. (It would only work correctly without Max() when our cmpxchg operation "fails", that is, somebody else incremented the value past the one we want to install.) Another reason is that when I used pg_atomic_monotonic_advance_u64 with the Write and Flush pointers and did not have the Max(), the assertion failed pretty quickly :-) -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ Al principio era UNIX, y UNIX habló y dijo: "Hello world\n". No dijo "Hello New Jersey\n", ni "Hello USA\n".
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Apr 06, 2024 at 04:57:51PM +0200, Tomas Vondra wrote: > On 4/6/24 15:40, Melanie Plageman wrote: > > On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote: > >> > >> > >> On 4/6/24 01:53, Melanie Plageman wrote: > >>> On Fri, Apr 05, 2024 at 04:06:34AM -0400, Melanie Plageman wrote: > On Thu, Apr 04, 2024 at 04:35:45PM +0200, Tomas Vondra wrote: > > On 4/4/24 00:57, Melanie Plageman wrote: > >> On Sun, Mar 31, 2024 at 11:45:51AM -0400, Melanie Plageman wrote: > > I'd focus on the first ~8-9 commits or so for now, we can commit more if > > things go reasonably well. > > Sounds good. I will spend cleanup time on 0010-0013 tomorrow but would > love to know if you agree with the direction before I spend more time. > >>> > >>> In attached v16, I've split out 0010-0013 into 0011-0017. I think it is > >>> much easier to understand. > >>> > >> > >> Anyway, I've attached it as .tgz in order to not confuse cfbot. All the > >> review comments are marked with XXX, so grep for that in the patches. > >> There's two separate patches - the first one suggests a code change, so > >> it was better to not merge that with your code. The second has just a > >> couple XXX comments, I'm not sure why I kept it separate. > >> > >> A couple review comments: > >> > >> * I think 0001-0009 are 99% ready to. I reworded some of the commit > >> messages a bit - I realize it's a bit bold, considering you're native > >> speaker and I'm not. If you could check I didn't make it worse, that > >> would be great. > > > > Attached v17 has *only* patches 0001-0009 with these changes. I will > > work on applying the remaining patches, addressing feedback, and adding > > comments next. > > > > I have reviewed and incorporated all of your feedback on these patches. > > Attached v17 is your exact patches with 1 or 2 *very* slight tweaks to > > commit messages (comma splice removal and single word adjustments) as > > well as the changes listed below: > > > > I have open questions on the following: > > > > - 0003: should it be SO_NEED_TUPLES and need_tuples (instead of > > SO_NEED_TUPLE and need_tuple)? > > > > I think SO_NEED_TUPLES is more accurate, as we need all tuples from the > block. But either would work. Attached v18 changes it to TUPLES/tuples > > > - 0009 (your 0010) > > - Should I mention in the commit message that we added blockno and > > pfblockno in the BitmapHeapScanState only for validation or is > > that > > too specific? > > > > For the commit message I'd say it's too specific, I'd put it in the > comment before the struct. It is in the comment for the struct > > - I'm worried this comment is vague and or potentially not totally > > correct. Should we remove it? I don't think we have conclusive > > proof > > that this is true. > > /* > >* Adjusting the prefetch iterator before invoking > >* table_scan_bitmap_next_block() keeps prefetch distance higher across > >* the parallel workers. > >*/ > > > > TBH it's not clear to me what "higher across parallel workers" means. > But it sure shouldn't claim things that we think may not be correct. I > don't have a good idea how to reword it, though. I realized it makes more sense to add a FIXME (I used XXX. I'm not when to use what) with a link to the message where Andres describes why he thinks it is a bug. If we plan on fixing it, it is good to have a record of that. And it makes it easier to put a clear and accurate comment. Done in 0009. > OK, thanks. If think 0001-0008 are ready to go, with some minor tweaks > per above (tuple vs. tuples etc.), and the question about the recheck > flag. If you can do these tweaks, I'll get that committed today and we > can try to get a couple more patches in tomorrow. Sounds good. - Melanie >From 3e80ba8914dd5876ab921ca39540be3b639236f7 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Mon, 12 Feb 2024 18:50:29 -0500 Subject: [PATCH v18 01/10] BitmapHeapScan: begin scan after bitmap creation It makes more sense for BitmapHeapScan to scan the index, build the bitmap, and only then begin the scan on the underlying table. This is mostly a cosmetic change for now, but later commits will need to pass parameters to table_beginscan_bm() that are unavailable in ExecInitBitmapHeapScan(). Author: Melanie Plageman Reviewed-by: Tomas Vondra, Andres Freund, Heikki Linnakangas Discussion: https://postgr.es/m/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com --- src/backend/executor/nodeBitmapHeapscan.c | 27 +-- 1 file changed, 20 insertions(+), 7 deletions(-) diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c index cee7f45aabe..c8c466e3c5c 100644 --- a/src/backend/executor/nodeBitmapHeapscan.c +++ b/src/backend/executor/nodeBitmapHeapscan.c @@ -178,6 +178,21 @@ BitmapHeapNext(BitmapHeapScanState *node) } #endif /* US
Re: remaining sql/json patches
Hi, On Sat, Apr 6, 2024 at 3:55 PM jian he wrote: > On Sat, Apr 6, 2024 at 2:03 PM Amit Langote wrote: > > > > > > > > * problem with type "char". the view def output is not the same as > > > the select * from v1. > > > > > > create or replace view v1 as > > > SELECT col FROM s, > > > JSON_TABLE(jsonb '{"d": ["hello", "hello1"]}', '$' as c1 > > > COLUMNS(col "char" path '$.d' without wrapper keep quotes))sub; > > > > > > \sv v1 > > > CREATE OR REPLACE VIEW public.v1 AS > > > SELECT sub.col > > >FROM s, > > > JSON_TABLE( > > > '{"d": ["hello", "hello1"]}'::jsonb, '$' AS c1 > > > COLUMNS ( > > > col "char" PATH '$."d"' > > > ) > > > ) sub > > > one under the hood called JSON_QUERY_OP, another called JSON_VALUE_OP. > > > > Hmm, I don't see a problem as long as both are equivalent or produce > > the same result. Though, perhaps we could make > > get_json_expr_options() also deparse JSW_NONE explicitly into "WITHOUT > > WRAPPER" instead of a blank. But that's existing code, so will take > > care of it as part of the above open item. > > > > > I will do extensive checking for other types later, so far, other than > > > these two issues, > > > get_json_table_columns is pretty solid, I've tried nested columns with > > > nested columns, it just works. > > > > Thanks for checking. > > > After applying v50, this type also has some issues. > CREATE OR REPLACE VIEW t1 as > SELECT sub.* FROM JSON_TABLE(jsonb '{"d": ["hello", "hello1"]}', > '$' AS c1 COLUMNS ( > "tsvector0" tsvector path '$.d' without wrapper omit quotes, > "tsvector1" tsvector path '$.d' without wrapper keep quotes))sub; > table t1; > > return > tsvector0|tsvector1 > -+- > '"hello1"]' '["hello",' | '"hello1"]' '["hello",' > (1 row) > > src5=# \sv t1 > CREATE OR REPLACE VIEW public.t1 AS > SELECT tsvector0, > tsvector1 >FROM JSON_TABLE( > '{"d": ["hello", "hello1"]}'::jsonb, '$' AS c1 > COLUMNS ( > tsvector0 tsvector PATH '$."d"' OMIT QUOTES, > tsvector1 tsvector PATH '$."d"' > ) > ) sub > > but > > SELECT tsvector0, > tsvector1 >FROM JSON_TABLE( > '{"d": ["hello", "hello1"]}'::jsonb, '$' AS c1 > COLUMNS ( > tsvector0 tsvector PATH '$."d"' OMIT QUOTES, > tsvector1 tsvector PATH '$."d"' > ) > ) sub > > only return > tsvector0| tsvector1 > -+--- > '"hello1"]' '["hello",' | Yep, we *should* fix get_json_expr_options() to emit KEEP QUOTES and WITHOUT WRAPPER options so that transformJsonTableColumns() does the correct thing when you execute the \sv output. Like this: diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 283ca53cb5..5a6aabe100 100644 --- a/src/backend/utils/adt/ruleutils.c +++ b/src/backend/utils/adt/ruleutils.c @@ -8853,9 +8853,13 @@ get_json_expr_options(JsonExpr *jsexpr, deparse_context *context, appendStringInfo(context->buf, " WITH CONDITIONAL WRAPPER"); else if (jsexpr->wrapper == JSW_UNCONDITIONAL) appendStringInfo(context->buf, " WITH UNCONDITIONAL WRAPPER"); +else if (jsexpr->wrapper == JSW_NONE) +appendStringInfo(context->buf, " WITHOUT WRAPPER"); if (jsexpr->omit_quotes) appendStringInfo(context->buf, " OMIT QUOTES"); +else +appendStringInfo(context->buf, " KEEP QUOTES"); } Will get that pushed tomorrow. Thanks for the test case. -- Thanks, Amit Langote
Re: BitmapHeapScan streaming read user and prelim refactoring
Melanie Plageman writes: > On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote: >> * The one question I'm somewhat unsure about is why Tom chose to use the >> "wrong" recheck flag in the 2017 commit, when the correct recheck flag >> is readily available. Surely that had a reason, right? But I can't think >> of one ... > See above. Hi, I hadn't been paying attention to this thread, but Melanie pinged me off-list about this question. I think it's just a flat-out oversight in 7c70996eb. Looking at the mailing list thread (particularly [1][2]), it seems that Alexander hadn't really addressed the question of when to prefetch at all, but just skipped prefetch if the current page was skippable: + /* +* If we did not need to fetch the current page, +* we probably will not need to fetch the next. +*/ + return; It looks like I noticed that we could check the appropriate VM bits, but failed to notice that we could easily check the appropriate recheck flag as well. Feel free to change it. regards, tom lane [1] https://www.postgresql.org/message-id/a6434d5c-ed8d-b09c-a7c3-b2d1677e35b3%40postgrespro.ru [2] https://www.postgresql.org/message-id/5974.1509573988%40sss.pgh.pa.us
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/6/24 15:40, Melanie Plageman wrote: > On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote: >> >> >> On 4/6/24 01:53, Melanie Plageman wrote: >>> On Fri, Apr 05, 2024 at 04:06:34AM -0400, Melanie Plageman wrote: On Thu, Apr 04, 2024 at 04:35:45PM +0200, Tomas Vondra wrote: > On 4/4/24 00:57, Melanie Plageman wrote: >> On Sun, Mar 31, 2024 at 11:45:51AM -0400, Melanie Plageman wrote: > I'd focus on the first ~8-9 commits or so for now, we can commit more if > things go reasonably well. Sounds good. I will spend cleanup time on 0010-0013 tomorrow but would love to know if you agree with the direction before I spend more time. >>> >>> In attached v16, I've split out 0010-0013 into 0011-0017. I think it is >>> much easier to understand. >>> >> >> Anyway, I've attached it as .tgz in order to not confuse cfbot. All the >> review comments are marked with XXX, so grep for that in the patches. >> There's two separate patches - the first one suggests a code change, so >> it was better to not merge that with your code. The second has just a >> couple XXX comments, I'm not sure why I kept it separate. >> >> A couple review comments: >> >> * I think 0001-0009 are 99% ready to. I reworded some of the commit >> messages a bit - I realize it's a bit bold, considering you're native >> speaker and I'm not. If you could check I didn't make it worse, that >> would be great. > > Attached v17 has *only* patches 0001-0009 with these changes. I will > work on applying the remaining patches, addressing feedback, and adding > comments next. > > I have reviewed and incorporated all of your feedback on these patches. > Attached v17 is your exact patches with 1 or 2 *very* slight tweaks to > commit messages (comma splice removal and single word adjustments) as > well as the changes listed below: > > I have changed the following: > > - 0003 added an assert that rs_empty_tuples_pending is 0 on rescan and > endscan > OK > - 0004 (your 0005)-- I followed up with Tom, but for now I have just > removed the XXX and also reworded the message a bit > After the exercise I described a couple minutes ago, I think I'm convinced the assumption is unnecessary and we should use the correct recheck. Not that it'd make any difference in practice, considering none of the opclasses ever changes the recheck. Maybe the most prudent thing would be to skip this commit and maybe leave this for later, but I'm not forcing you to do that if it would mean a lot of disruption for the following patches. > - 0006 (your 0007) fixed up the variable name (you changed valid -> > valid_block but it had gotten changed back) > OK > I have open questions on the following: > > - 0003: should it be SO_NEED_TUPLES and need_tuples (instead of > SO_NEED_TUPLE and need_tuple)? > I think SO_NEED_TUPLES is more accurate, as we need all tuples from the block. But either would work. > - 0009 (your 0010) > - Should I mention in the commit message that we added blockno and > pfblockno in the BitmapHeapScanState only for validation or is > that > too specific? > For the commit message I'd say it's too specific, I'd put it in the comment before the struct. > - Should I mention that a future (imminent) commit will remove the > iterators from TableScanDescData and put them in > HeapScanDescData? I > imagine folks don't want those there, but it is easier for the > progression of commits to put them there first and then move > them > I'd try not to mention future commits as justification too often, if we don't know that the future commit lands shortly after. > - I'm worried this comment is vague and or potentially not totally > correct. Should we remove it? I don't think we have conclusive > proof > that this is true. > /* >* Adjusting the prefetch iterator before invoking >* table_scan_bitmap_next_block() keeps prefetch distance higher across >* the parallel workers. >*/ > TBH it's not clear to me what "higher across parallel workers" means. But it sure shouldn't claim things that we think may not be correct. I don't have a good idea how to reword it, though. > >> * I'm not sure extra_flags is the right way to pass the flag in 0003. >> The "extra_" name is a bit weird, and no other table AM functions do it >> this way and pass explicit bool flags instead. So my first "review" >> commit does it like that. Do you agree it's better that way? > > Yes. > Cool >> * The one question I'm somewhat unsure about is why Tom chose to use the >> "wrong" recheck flag in the 2017 commit, when the correct recheck flag >> is readily available. Surely that had a reason, right? But I can't think >> of one ... > > See above. > >>> While I was doing that, I realized that I should remove the call to >>> table_rescan() from ExecReScanBitmapHeapScan() and just rely on the ne
Re: BitmapHeapScan streaming read user and prelim refactoring
On 4/6/24 02:51, Tomas Vondra wrote: > > * The one question I'm somewhat unsure about is why Tom chose to use the > "wrong" recheck flag in the 2017 commit, when the correct recheck flag > is readily available. Surely that had a reason, right? But I can't think > of one ... > I've been wondering about this a bit more, so I decided to experiment and try to construct a case for which the current code prefetches the wrong blocks, and the patch fixes that. But I haven't been very successful so far :-( My understanding was that the current code should do the wrong thing if I alternate all-visible and not-all-visible pages. This understanding is not correct, as I learned, because the thing that needs to change is the recheck flag, not visibility :-( I'm still posting what I tried, perhaps you will have an idea how to alter it to demonstrate the incorrect behavior with current master. The test was very simple: create table t (a int, b int) with (fillfactor=10); insert into t select mod((i/22),2), (i/22) from generate_series(0,1000) s(i); create index on t (a); which creates a table with 46 pages, 22 rows per page, column "a" alternates between 0/1 on pages, column "b" increments on each page (so "b" identifies page). and then delete from t where mod(b,8) = 0; which deletes tuples on pages 0, 8, 16, 24, 32, 40, so these pages will need to be prefetched as not-all-visible by this query explain analyze select count(1) from t where a = 0 when forced to do bitmap heap scan. The other even-numbered pages remain all-visible. I added a bit of logging into BitmapPrefetch(), but even with master I get this: LOG: prefetching block 8 0 current block 6 0 LOG: prefetching block 16 0 current block 14 0 LOG: prefetching block 24 0 current block 22 0 LOG: prefetching block 32 0 current block 30 0 LOG: prefetching block 40 0 current block 38 0 So it prefetches the correct pages (the other value is the recheck flag for that block from the iterator result). Turns out (and I realize the comment about the assumption actually states that, I just failed to understand it) the thing that would have to differ for the blocks is the recheck flag. But that can't actually happen because that's set by the AM/opclass and the built-in ones do essentially this: .../hash.c: scan->xs_recheck = true; .../nbtree.c: scan->xs_recheck = false; gist opclasses (e.g. btree_gist): /* All cases served by this function are exact */ *recheck = false; spgist opclasses (e.g. geo_spgist): /* All tests are exact. */ out->recheck = false; If there's an opclass that alters the recheck flag, it's well hidden and I missed it. Anyway, after this exercise and learning more about the recheck flag, I think I agree the assumption is unnecessary. It's pretty harmless because none of the built-in opclasses alters the recheck flag, but the correct recheck flag is readily available. I'm still a bit puzzled why the 2017 commit even relied on this assumption, though. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: remaining sql/json patches
On Fri, Apr 5, 2024 at 8:35 PM Amit Langote wrote: > > On Thu, Apr 4, 2024 at 9:02 PM Amit Langote wrote: > > I'll post the rebased 0002 tomorrow after addressing your comments. > > Here's one. Main changes: > > * Fixed a bug in get_table_json_columns() which caused nested columns > to be deparsed incorrectly, something Jian reported upthread. > * Simplified the algorithm in JsonTablePlanNextRow() > > I'll post another revision or two maybe tomorrow, but posting what I > have now in case Jian wants to do more testing. > + else + { + /* + * Parent and thus the plan has no more rows. + */ + return false; + } in JsonTablePlanNextRow, the above comment seems strange to me. + /* + * Re-evaluate a nested plan's row pattern using the new parent row + * pattern, if present. + */ + Assert(parent != NULL); + if (!parent->current.isnull) + JsonTableResetRowPattern(planstate, parent->current.value); Is this assertion useful? if parent is null, then parent->current.isnull will cause segmentation fault. I tested with 3 NESTED PATH, it works! (I didn't fully understand JsonTablePlanNextRow though). the doc needs some polish work.
Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?
Daniel Gustafsson writes: >> On 6 Apr 2024, at 08:02, Peter Eisentraut wrote: >> Why do we need to check for the versions at all? We should just check for >> the functions we need. At least that's always been the normal approach in >> configure. > We could, but finding a stable set of functions which identifies the version > of > OpenSSL *and* LibreSSL that we want, and their successors, while not matching > any older versions seemed more opaque than testing two numeric values. I don't think you responded to Peter's point at all. The way autoconf is designed to work is explicitly NOT to try to identify the exact version of $whatever. Rather, the idea is to probe for the API features that you want to rely on: functions, macros, struct fields, or the like. If you can't point to an important API difference between 1.0.2 and 1.1.1, why drop support for 1.0.2? regards, tom lane
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Apr 06, 2024 at 02:51:45AM +0200, Tomas Vondra wrote: > > > On 4/6/24 01:53, Melanie Plageman wrote: > > On Fri, Apr 05, 2024 at 04:06:34AM -0400, Melanie Plageman wrote: > >> On Thu, Apr 04, 2024 at 04:35:45PM +0200, Tomas Vondra wrote: > >>> On 4/4/24 00:57, Melanie Plageman wrote: > On Sun, Mar 31, 2024 at 11:45:51AM -0400, Melanie Plageman wrote: > >>> I'd focus on the first ~8-9 commits or so for now, we can commit more if > >>> things go reasonably well. > >> > >> Sounds good. I will spend cleanup time on 0010-0013 tomorrow but would > >> love to know if you agree with the direction before I spend more time. > > > > In attached v16, I've split out 0010-0013 into 0011-0017. I think it is > > much easier to understand. > > > > Anyway, I've attached it as .tgz in order to not confuse cfbot. All the > review comments are marked with XXX, so grep for that in the patches. > There's two separate patches - the first one suggests a code change, so > it was better to not merge that with your code. The second has just a > couple XXX comments, I'm not sure why I kept it separate. > > A couple review comments: > > * I think 0001-0009 are 99% ready to. I reworded some of the commit > messages a bit - I realize it's a bit bold, considering you're native > speaker and I'm not. If you could check I didn't make it worse, that > would be great. Attached v17 has *only* patches 0001-0009 with these changes. I will work on applying the remaining patches, addressing feedback, and adding comments next. I have reviewed and incorporated all of your feedback on these patches. Attached v17 is your exact patches with 1 or 2 *very* slight tweaks to commit messages (comma splice removal and single word adjustments) as well as the changes listed below: I have changed the following: - 0003 added an assert that rs_empty_tuples_pending is 0 on rescan and endscan - 0004 (your 0005)-- I followed up with Tom, but for now I have just removed the XXX and also reworded the message a bit - 0006 (your 0007) fixed up the variable name (you changed valid -> valid_block but it had gotten changed back) I have open questions on the following: - 0003: should it be SO_NEED_TUPLES and need_tuples (instead of SO_NEED_TUPLE and need_tuple)? - 0009 (your 0010) - Should I mention in the commit message that we added blockno and pfblockno in the BitmapHeapScanState only for validation or is that too specific? - Should I mention that a future (imminent) commit will remove the iterators from TableScanDescData and put them in HeapScanDescData? I imagine folks don't want those there, but it is easier for the progression of commits to put them there first and then move them - I'm worried this comment is vague and or potentially not totally correct. Should we remove it? I don't think we have conclusive proof that this is true. /* * Adjusting the prefetch iterator before invoking * table_scan_bitmap_next_block() keeps prefetch distance higher across * the parallel workers. */ > * I'm not sure extra_flags is the right way to pass the flag in 0003. > The "extra_" name is a bit weird, and no other table AM functions do it > this way and pass explicit bool flags instead. So my first "review" > commit does it like that. Do you agree it's better that way? Yes. > * The one question I'm somewhat unsure about is why Tom chose to use the > "wrong" recheck flag in the 2017 commit, when the correct recheck flag > is readily available. Surely that had a reason, right? But I can't think > of one ... See above. > > While I was doing that, I realized that I should remove the call to > > table_rescan() from ExecReScanBitmapHeapScan() and just rely on the new > > table_rescan_bm() invoked from BitmapHeapNext(). That is done in the > > attached. > > > > 0010-0018 still need comments updated but I focused on getting the split > > out, reviewable version of them ready. I'll add comments (especially to > > 0011 table AM functions) tomorrow. I also have to double-check if I > > should add any asserts for table AMs about having implemented all of the > > new begin/re/endscan() functions. > > > > I added a couple more comments for those patches (10-12). Chances are > the split in v16 clarifies some of my questions, but it'll have to wait > till the morning ... Will address this in next mail. - Melanie >From e65a9baa52c99c93a9a9a281aebacd38b8299721 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Mon, 12 Feb 2024 18:50:29 -0500 Subject: [PATCH v17 1/9] BitmapHeapScan: begin scan after bitmap creation It makes more sense for BitmapHeapScan to scan the index, build the bitmap, and only then begin the scan on the underlying table. This is mostly a cosmetic change for now, but later commits will need to pass parameters to table_beginscan_bm() that are unavailab
Re: Flushing large data immediately in pqcomm
On Sat, 6 Apr 2024 at 23:17, Jelte Fennema-Nio wrote: > Weird that on your machines you don't see a difference. Are you sure > you didn't make a silly mistake, like not restarting postgres or > something? I'm sure. I spent quite a long time between the AMD and an Apple m2 trying. I did see the same regression as you on the smaller numbers. I experimented with the attached which macro'ifies internal_flush() and pg_noinlines internal_flush_buffer. Can you try that to see if it gets rid of the regression on the first two tests? David diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 6497100a1a..824b2f11a3 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -120,8 +120,8 @@ static List *sock_paths = NIL; static char *PqSendBuffer; static int PqSendBufferSize; /* Size send buffer */ -static int PqSendPointer; /* Next index to store a byte in PqSendBuffer */ -static int PqSendStart;/* Next index to send a byte in PqSendBuffer */ +static size_t PqSendPointer; /* Next index to store a byte in PqSendBuffer */ +static size_t PqSendStart; /* Next index to send a byte in PqSendBuffer */ static char PqRecvBuffer[PQ_RECV_BUFFER_SIZE]; static int PqRecvPointer; /* Next index to read a byte from PqRecvBuffer */ @@ -133,6 +133,7 @@ static int PqRecvLength; /* End of data available in PqRecvBuffer */ static bool PqCommBusy;/* busy sending data to the client */ static bool PqCommReadingMsg; /* in the middle of reading a message */ +#define internal_flush() internal_flush_buffer(PqSendBuffer, &PqSendStart, &PqSendPointer) /* Internal functions */ static void socket_comm_reset(void); @@ -144,7 +145,8 @@ static bool socket_is_send_pending(void); static int socket_putmessage(char msgtype, const char *s, size_t len); static void socket_putmessage_noblock(char msgtype, const char *s, size_t len); static int internal_putbytes(const char *s, size_t len); -static int internal_flush(void); +static pg_noinline int internal_flush_buffer(const char *s, size_t *start, + size_t *end); static int Lock_AF_UNIX(const char *unixSocketDir, const char *unixSocketPath); static int Setup_AF_UNIX(const char *sock_path); @@ -1282,14 +1284,32 @@ internal_putbytes(const char *s, size_t len) if (internal_flush()) return EOF; } - amount = PqSendBufferSize - PqSendPointer; - if (amount > len) - amount = len; - memcpy(PqSendBuffer + PqSendPointer, s, amount); - PqSendPointer += amount; - s += amount; - len -= amount; + + /* +* If the buffer is empty and data length is larger than the buffer +* size, send it without buffering. Otherwise, put as much data as +* possible into the buffer. +*/ + if (len >= PqSendBufferSize && PqSendStart == PqSendPointer) + { + size_t start = 0; + + socket_set_nonblocking(false); + if (internal_flush_buffer(s, &start, &len)) + return EOF; + } + else + { + amount = PqSendBufferSize - PqSendPointer; + if (amount > len) + amount = len; + memcpy(PqSendBuffer + PqSendPointer, s, amount); + PqSendPointer += amount; + s += amount; + len -= amount; + } } + return 0; } @@ -1315,19 +1335,19 @@ socket_flush(void) } /* - * internal_flush - flush pending output + * internal_flush_buffer - flush the given buffer content * * Returns 0 if OK (meaning everything was sent, or operation would block * and the socket is in non-blocking mode), or EOF if trouble. * */ -static int -internal_flush(void) +static pg_noinline int +internal_flush_buffer(const char *s, size_t *start, size_t *end) { static int last_reported_send_errno = 0; - char *bufptr = PqSendBuffer + PqSendStart; - char *bufend = PqSendBuffer + PqSendPointer; + char *bufptr = (char*) s + *start; + char *bufend = (char*) s + *end; while (bufptr < bufend) { @@ -1373,7 +1393,7 @@ internal_flush(void) * flag that'll cause the next CHECK_FOR_INTERRUPTS to terminate * the connection. */ - PqSendStart =
Re: Introduce XID age and inactive timeout based replication slot invalidation
On Sat, Apr 6, 2024 at 12:18 PM Amit Kapila wrote: > > Why the handling w.r.t active_pid in InvalidatePossiblyInactiveSlot() > is not similar to InvalidatePossiblyObsoleteSlot(). Won't we need to > ensure that there is no other active slot user? Is it sufficient to > check inactive_since for the same? If so, we need some comments to > explain the same. I removed the separate functions and with minimal changes, I've now placed the RS_INVAL_INACTIVE_TIMEOUT logic into InvalidatePossiblyObsoleteSlot and use that even in CheckPointReplicationSlots. > Can we avoid introducing the new functions like > SaveGivenReplicationSlot() and MarkGivenReplicationSlotDirty(), if we > do the required work in the caller? Hm. Removed them now. Please see the attached v38 patch. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v38-0001-Add-inactive_timeout-based-replication-slot-inva.patch Description: Binary data
Re: MultiXact\SLRU buffers configuration
> On 29 Feb 2024, at 06:59, Kyotaro Horiguchi wrote: > > At Sat, 3 Feb 2024 22:32:45 +0500, "Andrey M. Borodin" > wrote in >> Here's the test draft. This test reliably reproduces sleep on CV when >> waiting next multixact to be filled into "members" SLRU. > > By the way, I raised a question about using multiple CVs > simultaneously [1]. That is, I suspect that the current CV > implementation doesn't allow us to use multiple condition variables at > the same time, because all CVs use the same PCPROC member cvWaitLink > to accommodate different waiter sets. If this assumption is correct, > we should resolve the issue before spreading more uses of CVs. Alvaro, Kyotaro, what's our plan for this? It seems to late to deal with this pg_usleep(1000L) for PG17. I propose following course of action 1. Close this long-standing CF item 2. Start new thread with CV-sleep patch aimed at PG18 3. Create new entry in July CF What do you think? Best regards, Andrey Borodin.
Re: Flushing large data immediately in pqcomm
On Sat, 6 Apr 2024 at 03:34, David Rowley wrote: > Does anyone else want to try the attached script on the v5 patch to > see if their numbers are better? On my machine (i9-10900X, in Ubuntu 22.04 on WSL on Windows) v5 consistently beats master by ~0.25 seconds: master: Run 100 100 500: 1.948975205 Run 1024 10240 20: 3.039986587 Run 1024 1048576 2000: 2.444176276 Run 1048576 1048576 1000: 2.475328596 v5: Run 100 100 500: 1.997170909 Run 1024 10240 20: 3.057802598 Run 1024 1048576 2000: 2.199449857 Run 1048576 1048576 1000: 2.210328762 The first two runs are pretty much equal, and I ran your script a few more times and this seems like just random variance (sometimes v5 wins those, sometimes master does always quite close to each other). But the last two runs v5 consistently wins. Weird that on your machines you don't see a difference. Are you sure you didn't make a silly mistake, like not restarting postgres or something?
Re: DROP OWNED BY fails to clean out pg_init_privs grants
> On 6 Apr 2024, at 01:10, Tom Lane wrote: > (So now I'm wondering why *only* copperhead has shown this so far. > Are our other cross-version-upgrade testing animals AWOL?) Clicking around searching for Xversion animals I didn't spot any, but without access to the database it's nontrivial to know which animal does what. > I doubt this is something we'll have fixed by Monday, so I will > go add an open item for it. +1. Having opened the can of worms I'll have a look at it next week. -- Daniel Gustafsson
Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?
> On 6 Apr 2024, at 08:02, Peter Eisentraut wrote: > > On 05.04.24 23:48, Daniel Gustafsson wrote: >> The reason to expand the check is to ensure that we have the version we want >> for both OpenSSL and LibreSSL, and deprecating OpenSSL versions isn't all >> that >> commonly done so having to change the version in the check didn't seem that >> invasive to me. > > Why do we need to check for the versions at all? We should just check for > the functions we need. At least that's always been the normal approach in > configure. We could, but finding a stable set of functions which identifies the version of OpenSSL *and* LibreSSL that we want, and their successors, while not matching any older versions seemed more opaque than testing two numeric values. The suggested check is modelled on the LDAP check which tests for an explicit version in a header file (albeit not for erroring out). -- Daniel Gustafsson