Re: Making Vars outer-join aware
On Mon, Jul 11, 2022 at 3:38 AM Tom Lane wrote:
> Here's v2 of this patch series. It's functionally identical to v1,
> but I've rebased it over the recent auto-node-support-generation
> changes, and also extracted a few separable bits in hopes of making
> the main planner patch smaller. (It's still pretty durn large,
> unfortunately.) Unlike the original submission, each step will
> compile on its own, though the intermediate states mostly don't
> pass all regression tests.

I noticed a behavior change from previous versions regarding
PlaceHolderVar. Take the query below as an example:

    select a.i, ss.jj from a
      left join (select b.i, b.j + 1 as jj from b) ss
        on a.i = ss.i;

Previously the expression 'b.j + 1' would not be wrapped in a
PlaceHolderVar, since it contains a Var of the subquery and meanwhile it
does not contain any non-strict constructs. Now, with the patch, we
insert a PlaceHolderVar for it, in order to have a place to store
varnullingrels. So the plan for the above query becomes:

    # explain (verbose, costs off)
      select a.i, ss.jj from a
        left join (select b.i, b.j + 1 as jj from b) ss
          on a.i = ss.i;
                 QUERY PLAN
    ------------------------------------
     Hash Right Join
       Output: a.i, ((b.j + 1))
       Hash Cond: (b.i = a.i)
       ->  Seq Scan on public.b
             Output: b.i, (b.j + 1)
       ->  Hash
             Output: a.i
             ->  Seq Scan on public.a
                   Output: a.i
    (9 rows)

Note that the evaluation of expression 'b.j + 1' now occurs below the
outer join. Is this something we need to be concerned about?

Thanks
Richard
Re: Support TRUNCATE triggers on foreign tables
On Tue, 12 Jul 2022 09:24:20 +0900 Fujii Masao wrote:
> On 2022/07/08 17:13, Ian Lawrence Barwick wrote:
> >> If we want to add such prevention, we will need similar checks for
> >> INSERT/DELETE/UPDATE not only TRUNCATE. However, I think such fix is
> >> independent from this and it can be proposed as another patch.
> >
> > Ah OK, makes sense from that point of view. Thanks for the clarification!
>
> So I pushed the v2 patch that Yugo-san posted. Thanks!

Thanks!

> Regards,
>
> --
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION

--
Yugo NAGATA
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Tue, Jul 12, 2022 at 3:25 PM Amit Kapila wrote:
> On Tue, Jul 12, 2022 at 11:38 AM Masahiko Sawada wrote:
> > On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada wrote:
> > >
> > > I'm doing benchmark tests and will share the results.
> >
> > I've done benchmark tests to measure the overhead introduced by doing
> > bsearch() every time when decoding a commit record. I've simulated a
> > very intensified situation where we decode 1M commit records while
> > keeping the builder->catchange.xip array, but the overhead is negligible:
> >
> > HEAD: 584 ms
> > Patched: 614 ms
> >
> > I've attached the benchmark script I used. With LOG_SNAPSHOT_INTERVAL_MS
> > increased to 9, the last decoding by pg_logical_slot_get_changes()
> > decodes 1M commit records while keeping catalog modifying transactions.
>
> Thanks for the test. We should also see how it performs when (a) we
> don't change LOG_SNAPSHOT_INTERVAL_MS,

What point do you want to see in this test? I think the performance
overhead depends on how many times we do bsearch() and how many
transactions are in the list. I increased this value to easily simulate
the situation where we decode many commit records while keeping catalog
modifying transactions. But even if we don't change this value, the
result would not change if we don't change how many commit records we
decode.

> and (b) we have more DDL xacts
> so that the array to search is somewhat bigger

I've done the same performance tests while creating 64 catalog
modifying transactions. The result is:

HEAD: 595 ms
Patched: 628 ms

There was no big overhead.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
At Mon, 11 Jul 2022 22:22:28 -0400, Melanie Plageman wrote in
> Hi,
>
> In the attached patch set, I've added in missing IO operations for
> certain IO Paths as well as enumerating in the commit message which IO
> Paths and IO Operations are not currently counted and or not possible.
>
> There is a TODO in HandleWalWriterInterrupts() about removing
> pgstat_report_wal() since it is immediately before a proc_exit()

Right. walwriter does that without needing the explicit call.

> I was wondering if LocalBufferAlloc() should increment the counter or if
> I should wait until GetLocalBufferStorage() to increment the counter.

It depends on what "allocate" means. Unlike shared buffers, local
buffers are taken from the OS and then allocated to pages. OS-allocated
pages are restricted by num_temp_buffers, so I think what we're
interested in is the count incremented by LocalBufferAlloc(). (And it
is the parallel of alloc for shared buffers.)

> I also realized that I am not differentiating between IOPATH_SHARED and
> IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> of buffer we are fsync'ing by the time we call register_dirty_segment(),
> I'm not sure how we would fix this.

I think flushes scarcely happen for strategy-loaded buffers. If that is
sensible, IOOP_FSYNC would not make much sense for IOPATH_STRATEGY.

> On Wed, Jul 6, 2022 at 3:20 PM Andres Freund wrote:
> > On 2022-07-05 13:24:55 -0400, Melanie Plageman wrote:
> > > @@ -176,6 +176,8 @@ InitStandaloneProcess(const char *argv0)
> > >  {
> > >     Assert(!IsPostmasterEnvironment);
> > >
> > > +   MyBackendType = B_STANDALONE_BACKEND;
> >
> > Hm. This is used for singleuser mode as well as bootstrap. Should we
> > split those? It's not like bootstrap mode really matters for stats, so
> > I'm inclined not to.
>
> I have no opinion currently.
> It depends on how commonly you think developers might want separate
> bootstrap and single user mode IO stats.

Regarding the stats, I don't think separating them makes much sense.

> > > @@ -375,6 +376,8 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
> > >      * out the initial relation mapping files.
> > >      */
> > >     RelationMapFinishBootstrap();
> > > +   // TODO: should this be done for bootstrap?
> > > +   pgstat_report_io_ops();
> >
> > Hm. Not particularly useful, but also not harmful. But we don't need an
> > explicit call, because it'll be done at process exit too. At least I
> > think, it could be that it's different for bootstrap.
>
> I've removed this and other occurrences which were before proc_exit()
> (and thus redundant). (Though I did not explicitly check if it was
> different for bootstrap.)

pgstat_report_stat(true) is supposed to be called as needed via the
before_shmem_exit hook, so I think that's the right thing.

> > IMO it'd be a good idea to add pgstat_report_io_ops() to
> > pgstat_report_vacuum()/analyze(), so that the stats for a longrunning
> > autovac worker get updated more regularly.
>
> noted and fixed.

> > How about moving the pgstat_report_io_ops() into
> > pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems
> > unnecessary to have multiple pgstat_* calls in these places.
>
> noted and fixed.

+ * Also report IO Operations statistics

I think that the function comment also should mention this.

> > Wonder if it's worth making the lock specific to the backend type?
>
> I've added another Lock into PgStat_IOPathOps so that each BackendType
> can be locked separately. But, I've also kept the lock in
> PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> done easily.

The lock separation looks fine. By the way, in the following line:

+   &pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)];

backend_type_get_idx(x) is actually (x - 1) plus an assertion on the
value range, and this is its only use-case. There's a reverse function,
also used in only one place:

+   Datum backend_type_desc =
+       CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i)));

In this usage GetBackendTypeDesc() gracefully treats out-of-domain
values but idx_get_backend_type() keenly kills the process for the same.
This is inconsistent. My humble opinion on this is that we don't define
the two functions and replace the calls to them with (x +/- 1). In
addition, I think we should not abort() on invalid backend types. In
that sense, I wonder if we could use the B_INVALID'th element for this
purpose.

> > > +   LWLockAcquire(&stats_shmem->lock, LW_SHARED);
> > > +   memcpy(&reset, reset_offset, sizeof(stats_shmem->stats));
> > > +   LWLockRelease(&stats_shmem->lock);
> >
> > Which then also means that you don't need the reset offset stuff. It's
> > only there because with the changecount approach we can't take a lock to
> > reset the stats (since there is no lock). With a lock you can just reset
> > th
Re: replacing role-level NOINHERIT with a grant-level option
On 7/11/22 11:01 PM, Robert Haas wrote:
> Oops. Here is a rebased version of v3 which aims to fix this bug.

Thanks, the issue seems to be fixed with this patch.

--
regards, tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
Re: Weird behaviour with binary copy, arrays and column count
That's a good tip! Can't believe I hadn't even thought of that! :/

Cheers

Jim

On Mon, 11 Jul 2022 at 21:59, Greg Stark wrote:
> On Fri, 8 Jul 2022 at 13:09, James Vanns wrote:
> >
> > It does seem to smell of an alignment, padding, buffer overrun, parsing
> > kind of error.
>
> It does. I think you may need to bust out a debugger and see what
> array_recv is actually seeing...
>
> --
> greg

--
Jim Vanns
Principal Production Engineer
Industrial Light & Magic, London
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Tue, Jul 12, 2022 at 1:13 PM Masahiko Sawada wrote:
> On Tue, Jul 12, 2022 at 3:25 PM Amit Kapila wrote:
> > On Tue, Jul 12, 2022 at 11:38 AM Masahiko Sawada wrote:
> > > On Tue, Jul 12, 2022 at 10:28 AM Masahiko Sawada wrote:
> > > >
> > > > I'm doing benchmark tests and will share the results.
> > >
> > > I've done benchmark tests to measure the overhead introduced by doing
> > > bsearch() every time when decoding a commit record. I've simulated a
> > > very intensified situation where we decode 1M commit records while
> > > keeping the builder->catchange.xip array, but the overhead is negligible:
> > >
> > > HEAD: 584 ms
> > > Patched: 614 ms
> > >
> > > I've attached the benchmark script I used. With LOG_SNAPSHOT_INTERVAL_MS
> > > increased to 9, the last decoding by pg_logical_slot_get_changes()
> > > decodes 1M commit records while keeping catalog modifying transactions.
> >
> > Thanks for the test. We should also see how it performs when (a) we
> > don't change LOG_SNAPSHOT_INTERVAL_MS,
>
> What point do you want to see in this test? I think the performance
> overhead depends on how many times we do bsearch() and how many
> transactions are in the list.

Right, I am not expecting any visible performance difference in this
case. This is to ensure that we are not incurring any overhead in the
more usual scenarios (or default cases). As per my understanding, the
purpose of increasing the value of LOG_SNAPSHOT_INTERVAL_MS is to
simulate a stress case for the changes made by the patch, and keeping
its value default will test the more usual scenarios.

> I increased this value to easily
> simulate the situation where we decode many commit records while
> keeping catalog modifying transactions. But even if we don't change
> this value, the result would not change if we don't change how many
> commit records we decode.

> > and (b) we have more DDL xacts
> > so that the array to search is somewhat bigger
>
> I've done the same performance tests while creating 64 catalog
> modifying transactions. The result is:
>
> HEAD: 595 ms
> Patched: 628 ms
>
> There was no big overhead.

Yeah, especially considering you have simulated a stress case for the
patch.

--
With Regards,
Amit Kapila.
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Tue, Jul 12, 2022 8:49 AM Masahiko Sawada wrote:
>
> I've attached an updated patch.
>

Hi,

I met a segmentation fault in the test_decoding test after applying the
patch for the master branch. I've attached the backtrace. It happened
when executing the following code, because it tried to free a NULL
pointer (catchange_xip).

    /* be tidy */
    if (ondisk)
        pfree(ondisk);
+   if (catchange_xip)
+       pfree(catchange_xip);
 }

It seems to be related to the configure options. I could reproduce it
when using `./configure --enable-debug`, but I couldn't reproduce it
with `./configure --enable-debug CFLAGS="-Og -ggdb"`.

Regards,
Shi yu

#0  0x00910333 in GetMemoryChunkContext (pointer=0x0) at ../../../../src/include/utils/memutils.h:129
#1  pfree (pointer=0x0) at mcxt.c:1177
#2  0x0078186b in SnapBuildSerialize (builder=0x1fd5e78, lsn=25719712) at snapbuild.c:1792
#3  0x00782797 in SnapBuildProcessRunningXacts (builder=0x1fd5e78, lsn=25719712, running=0x27dfe50) at snapbuild.c:1199
#4  0x00774273 in standby_decode (ctx=0x1fc3e08, buf=0x7ffd8c95b5d0) at decode.c:346
#5  0x00773ab3 in LogicalDecodingProcessRecord (ctx=ctx@entry=0x1fc3e08, record=0x1fc41a0) at decode.c:119
#6  0x0077815d in pg_logical_slot_get_changes_guts (fcinfo=0x1fb5d88, confirm=, binary=) at logicalfuncs.c:271
#7  0x0067b63d in ExecMakeTableFunctionResult (setexpr=0x1fb42e0, econtext=0x1fb41b0, argContext=, expectedDesc=0x1fd7da8, randomAccess=false) at execSRF.c:234
#8  0x0068b76f in FunctionNext (node=node@entry=0x1fb3fa0) at nodeFunctionscan.c:94
#9  0x0067be37 in ExecScanFetch (recheckMtd=0x68b450 , accessMtd=0x68b470 , node=0x1fb3fa0) at execScan.c:133
#10 ExecScan (node=0x1fb3fa0, accessMtd=0x68b470 , recheckMtd=0x68b450 ) at execScan.c:199
#11 0x0067344b in ExecProcNode (node=0x1fb3fa0) at ../../../src/include/executor/executor.h:259
#12 ExecutePlan (execute_once=, dest=0x1fc02e8, direction=, numberTuples=0, sendTuples=, operation=CMD_SELECT, use_parallel_mode=, planstate=0x1fb3fa0, estate=0x1fb3d78) at execMain.c:1636
#13 standard_ExecutorRun (queryDesc=0x1f9e178, direction=, count=0, execute_once=) at execMain.c:363
#14 0x007dda9e in PortalRunSelect (portal=0x1f51338, forward=, count=0, dest=) at pquery.c:924
#15 0x007decf7 in PortalRun (portal=portal@entry=0x1f51338, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x1fc02e8, altdest=altdest@entry=0x1fc02e8, qc=0x7ffd8c95bbf0) at pquery.c:768
#16 0x007db4ff in exec_simple_query (query_string=0x1ee29a8 "SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');") at postgres.c:1243
#17 0x007dc742 in PostgresMain (dbname=, username=) at postgres.c:4482
#18 0x0076132b in BackendRun (port=, port=) at postmaster.c:4503
#19 BackendStartup (port=) at postmaster.c:4231
#20 ServerLoop () at postmaster.c:1805
#21 0x007621fb in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1edbf20) at postmaster.c:1477
#22 0x004f0f69 in main (argc=3, argv=0x1edbf20) at main.c:202
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Tue, Jul 12, 2022 at 5:58 PM shiy.f...@fujitsu.com wrote:
> On Tue, Jul 12, 2022 8:49 AM Masahiko Sawada wrote:
> >
> > I've attached an updated patch.
> >
>
> Hi,
>
> I met a segmentation fault in test_decoding test after applying the
> patch for master branch. Attach the backtrace.

Thank you for testing the patch!

> It happened when executing the following code because it tried to free
> a NULL pointer (catchange_xip).
>
>     /* be tidy */
>     if (ondisk)
>         pfree(ondisk);
> +   if (catchange_xip)
> +       pfree(catchange_xip);
>  }
>
> It seems to be related to configure option. I could reproduce it when using
> `./configure --enable-debug`.
> But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -ggdb"`.

Hmm, I could not reproduce this problem even if I use ./configure
--enable-debug. And it's weird that we checked that catchange_xip is not
NULL but still did pfree on it:

#1 pfree (pointer=0x0) at mcxt.c:1177
#2 0x0078186b in SnapBuildSerialize (builder=0x1fd5e78, lsn=25719712) at snapbuild.c:1792

Is it reproducible in your environment? If so, could you test it again
with the following changes?

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index d015c06ced..a6e76e3781 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1788,7 +1788,7 @@ out:
        /* be tidy */
        if (ondisk)
                pfree(ondisk);
-       if (catchange_xip)
+       if (catchange_xip != NULL)
                pfree(catchange_xip);
 }

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: Cleaning up historical portability baggage
On 12.07.22 03:10, Thomas Munro wrote:
> AFAIK we generally only use pg_whatever() when there's a good reason,
> such as an incompatibility, a complication or a different abstraction
> that you want to highlight to a reader. The reason here was temporary:
> we couldn't implement standard pread/pwrite perfectly on ancient HP-UX,
> but we *can* implement it on Windows, so the reason is gone. These
> particular pg_ prefixes have only been in our tree for a few years and
> I was hoping to boot them out again before they stick, like "Size".
> I like using standard interfaces where possible for the very basic
> stuff, to de-weird our stuff.

I agree. That's been the established approach.
Re: CREATE TABLE ( .. STORAGE ..)
Hi Peter,

> The "safety check: do not allow toasted storage modes unless column
> datatype is TOAST-aware" could be moved into GetAttributeStorage(), so
> it doesn't have to be repeated. (Note that GetAttributeCompression()
> does similar checking.)

Good point. Fixed.

> ATExecSetStorage() currently doesn't do any such check, and your patch
> isn't adding one. Is there a reason for that?

ATExecSetStorage() does do this check, but a bit further down [1]. In
v7 I moved that check to GetAttributeStorage() as well.

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/tablecmds.c#l8312

--
Best regards,
Aleksander Alekseev

v7-0001-Allow-specifying-STORAGE-attribute-for-a-new-tabl.patch
Description: Binary data
Re: [PATCH] Compression dictionaries for JSONB
Hi hackers!

Aleksander, please point me in the right direction if it was mentioned
before. I have a few questions:

1) It is not clear to me how you see the life cycle of such a
dictionary. If it is meant to keep growing without cleanup/rebuilding,
it could affect performance in an undesirable way, along with keeping
unused data without any means to get rid of it. Also, I agree with
Simon Riggs: using OIDs from the general pool for dictionary entries is
a bad idea.

2) From (1) follows another question - I haven't seen any means for
getting rid of unused keys (or any other means for dictionary cleanup).
How could it be done?

3) Is the following scenario legal - by some means a dictionary does
not contain some keys for entries? What happens then?

4) If one dictionary is used by several tables, I see future issues
with concurrent dictionary updates. This will surely affect performance
and can cause unpredictable behavior for queries.

If you have any questions on Pluggable TOAST, don't hesitate to ask me,
and on the JSONB Toaster you can ask Nikita Glukhov.

Thank you!

Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/

On Mon, Jul 11, 2022 at 6:41 PM Aleksander Alekseev <aleksan...@timescale.com> wrote:
> Hi hackers,
>
> > I invite anyone interested to join this effort as a co-author!
>
> Here is v5. Same as v4 but with a fixed compiler warning (thanks,
> cfbot). Sorry for the noise.
>
> --
> Best regards,
> Aleksander Alekseev
Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
Hi Aleksander!

Thank you for your efforts reviewing this patch.

On Thu, Jul 7, 2022 at 12:43 PM Aleksander Alekseev wrote:
> > I'm going to need more time to meditate on the proposed changes and to
> > figure out the performance impact.
>
> OK, turned out this patch is slightly more complicated than I
> initially thought, but I think I managed to get some vague
> understanding of what's going on.
>
> I tried to reproduce the case with concurrently updated tuples you
> described on the current `master` branch. I created a new table:
>
> ```
> CREATE TABLE phonebook(
>   "id" SERIAL PRIMARY KEY NOT NULL,
>   "name" NAME NOT NULL,
>   "phone" INT NOT NULL);
>
> INSERT INTO phonebook ("name", "phone")
> VALUES ('Alice', 123), ('Bob', 456), ('Charlie', 789);
> ```
>
> Then I opened two sessions and attached them with LLDB. I did:
>
> ```
> (lldb) b heapam_tuple_update
> (lldb) c
> ```
>
> ... in both cases because I wanted to see two calls (steps 2 and 4) to
> heapam_tuple_update() and check the return values.
>
> Then I did:
>
> ```
> session1 =# BEGIN;
> session2 =# BEGIN;
> session1 =# UPDATE phonebook SET name = 'Alex' WHERE name = 'Alice';
> ```
>
> This update succeeds and I see heapam_tuple_update() returning TM_Ok.
>
> ```
> session2 =# UPDATE phonebook SET name = 'Alfred' WHERE name = 'Alice';
> ```
>
> This update hangs on a lock.
>
> ```
> session1 =# COMMIT;
> ```
>
> Now session2 unfreezes and returns 'UPDATE 0'. table_tuple_update()
> was called once and returned TM_Updated. Also session2 sees an updated
> tuple now. So apparently the visibility check (step 3) didn't pass.

Yes. But it's not exactly a visibility check. Session2 re-evaluates the
WHERE condition on the most recent row version (bypassing the snapshot).
The WHERE condition is not true anymore, thus the row is not updated.

> At this point I'm slightly confused. I don't see where a performance
> improvement is expected, considering that session2 gets blocked until
> session1 commits.
>
> Could you please walk me through here? Am I using the right test case
> or maybe you had another one in mind? Which steps do you consider
> expensive and expect to be mitigated by the patch?

This patch is not intended to change any high-level logic. At the high
level, the transaction which updated the row still holds a lock on it
until it finishes. The possible positive performance impact I expect
comes from doing the work of the two calls tuple_update() and
tuple_lock() in one call of tuple_update(). If we do this in one call,
we can save some effort, for instance, by locking the same buffer once,
not twice.

--
Regards,
Alexander Korotkov
RE: proposal: Allocate work_mem From Pool
>> I think it would be better if work_mem was allocated from a pool
>> of memory

> I think this has been proposed before, and the issue/objection
> with this idea is probably that query plans will be inconsistent,
> and end up being sub-optimal.
> work_mem is considered at planning time, but I think you only
> consider its application at execution time. A query that was planned
> with the configured work_mem but can't obtain the expected
> amount at execution time might perform poorly. Maybe it
> should be replanned with lower work_mem, but that would
> lose the arms-length relationship between the planner and executor.
> Should an expensive query wait a bit to try to get more
> work_mem? What do you do if 3 expensive queries are all
> waiting?

Before I try to answer that, I need to know how the scheduler works.
Let's say there's a max of 8 worker processes and 12 queries trying to
run. When does query #9 run? After the first of 1-8 completes, simple
FIFO? Or something else?

Also, how long does a query hold a worker process? All the way to
completion? Or does it perform some unit of work and rotate to another
query?

Joseph D Wagner

P.S. If there's a link to all this somewhere, please let me know.
Parsing through years of email archives is not always user-friendly or
helpful.
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Tue, Jul 12, 2022 at 2:53 PM Masahiko Sawada wrote:
> On Tue, Jul 12, 2022 at 5:58 PM shiy.f...@fujitsu.com wrote:
> >
> > It happened when executing the following code because it tried to
> > free a NULL pointer (catchange_xip).
> >
> >     /* be tidy */
> >     if (ondisk)
> >         pfree(ondisk);
> > +   if (catchange_xip)
> > +       pfree(catchange_xip);
> >  }
> >
> > It seems to be related to configure option. I could reproduce it when using
> > `./configure --enable-debug`.
> > But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og -ggdb"`.
>
> Hmm, I could not reproduce this problem even if I use ./configure
> --enable-debug. And it's weird that we checked if catchange_xip is not
> null but we did pfree for it:

Yeah, this looks weird to me as well, but one difference in running the
tests could be the timing of the WAL record for XLOG_RUNNING_XACTS. That
may change the timing of SnapBuildSerialize. The other thing we can try
is checking the value of catchange_xcnt before pfree.

BTW, I think ReorderBufferGetCatalogChangesXacts should have an Assert
to ensure rb->catchange_ntxns and xcnt are equal. We can probably then
avoid having xcnt_p as an out parameter, as the caller can use
rb->catchange_ntxns instead.

--
With Regards,
Amit Kapila.
Re: Cleaning up historical portability baggage
On Mon, Jul 11, 2022 at 9:11 PM Thomas Munro wrote:
> Hmm, but that's not what we're doing in general. For example, on
> Windows we're redirecting open() to a replacement function of our own,
> we're not using "pg_open()" in our code. That's not an example based
> on AC_REPLACE_FUNCS, but there are plenty of those too. Isn't this
> quite well established?

Yes. I just don't care for it. Sounds like I'm in the minority, though.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: [PATCH] Compression dictionaries for JSONB
Hi Nikita,

> Aleksander, please point me in the right direction if it was mentioned
> before, I have a few questions:

Thanks for your feedback. These are good questions indeed.

> 1) It is not clear for me, how do you see the life cycle of such a
> dictionary? If it is meant to keep growing without cleaning
> up/rebuilding it could affect performance in an undesirable way, along
> with keeping unused data without any means to get rid of them.
> 2) From (1) follows another question - I haven't seen any means for
> getting rid of unused keys (or any other means for dictionary cleanup).
> How could it be done?

Good point. This was not a problem for ZSON, since the dictionary size
was limited to 2**16 entries, the dictionary was immutable, and the
dictionaries had versions. For compression dictionaries we removed the
2**16 entries limit and also decided to get rid of versions. The idea
was that you can simply continue adding new entries, but no one thought
about the fact that this will consume the memory required to decompress
the document indefinitely. Maybe we should return to the idea of a
limited dictionary size and versions. Objections?

> 4) If one dictionary is used by several tables - I see future issues in
> concurrent dictionary updates. This will for sure affect performance
> and can cause unpredictable behavior for queries.

You are right. Another reason to return to the idea of dictionary
versions.

> Also, I agree with Simon Riggs, using OIDs from the general pool for
> dictionary entries is a bad idea.

Yep, we agreed to stop using OIDs for this; however, this was not
changed in the patch at this point. Please don't hesitate to join the
effort if you want to. I wouldn't mind taking a short break from this
patch.

> 3) Is the possible scenario legal - by some means a dictionary does not
> contain some keys for entries? What happens then?

No, we should either forbid removing dictionary entries or check that
all the existing documents are not using the entries being removed.

> If you have any questions on Pluggable TOAST don't hesitate to ask me
> and on JSONB Toaster you can ask Nikita Glukhov.

Will do! Thanks for working on this, and I'm looking forward to the next
version of the patch for the next round of review.

--
Best regards,
Aleksander Alekseev
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Fri, Jul 8, 2022 at 10:26 PM Melih Mutlu wrote:
> > I think that won't work because each time on restart the slot won't be
> > fixed. Now, it is possible that we may drop the wrong slot if the
> > state of copying rel is SUBREL_STATE_DATASYNC. Also, it is possible
> > that while creating a slot, we fail because the same name slot already
> > exists due to some other worker which has created that slot has been
> > restarted. Also, what about origin_name, won't that have similar
> > problems? Also, if the state is already SUBREL_STATE_FINISHEDCOPY, if
> > the slot is not the same as we have used in the previous run of a
> > particular worker, it may start WAL streaming from a different point
> > based on the slot's confirmed_flush_location.
>
> You're right Amit. In case of a failure, the tablesync phase of a
> relation may continue with a different worker and replication slot due
> to this change in naming. It seems like the same replication slot
> should be used from start to end for a relation during tablesync.
> However, creating/dropping replication slots can be a major overhead
> in some cases. It would be nice if these slots were somehow reused.
>
> To overcome this issue, I've been thinking about making some changes
> in my patch. So far, my proposal would be as follows:
>
> Slot naming can be like pg__ instead of pg__. This way each worker can
> use the same replication slot during its lifetime. But if a worker is
> restarted, then it will switch to a new replication slot since its pid
> has changed.

I think using worker_pid also has similar risks of mixing slots from
different workers, because after a restart the same worker_pid could be
assigned to a totally different worker. Can we think of using a unique
64-bit number instead? This will be allocated when each worker starts
for the very first time, and after that we can refer to the catalog to
find it, as suggested in the idea below.

> pg_subscription_rel catalog can store the replication slot name for
> each non-ready relation. Then we can find the slot needed for that
> particular relation to complete tablesync.

Yeah, this is worth investigating. However, instead of storing the
slot_name, we can store just the unique number (as suggested above). We
should use the same for the origin name as well.

> If a worker syncs a relation without any error, everything works well
> and this new replication slot column from the catalog will not be
> needed. However, if a worker is restarted due to a failure, the
> previous run of that worker left its slot behind since it did not exit
> properly. And the restarted worker (with a different pid) will see
> that the relation is actually in SUBREL_STATE_FINISHEDCOPY and want to
> proceed to the catchup step. Then the worker can look for that
> particular relation's replication slot in the pg_subscription_rel
> catalog (the slot name should be there since the relation state is not
> ready). And tablesync can proceed with that slot.
>
> There might be some cases where some replication slots are left
> behind. An example of such a case would be when the slot is removed
> from the pg_subscription_rel catalog and detached from any relation,
> but the slot actually couldn't be dropped for some reason. For such
> cases, slot cleanup logic is needed. This cleanup can also be done by
> tablesync workers. Whenever a tablesync worker is created, it can look
> for existing replication slots that do not belong to any relation and
> any worker (the slot name has the pid for that), and drop those slots
> if it finds any.

This sounds tricky. Why not first drop the slot/origin and then detach
it from pg_subscription_rel? On restarts, it is possible that we may
error out after dropping the slot or origin but before updating the
catalog entry, but in such a case we can ignore the missing slot/origin
and detach them from pg_subscription_rel.

Also, if we use the unique number as suggested above, I think even if
we don't remove it after the relation state is ready, it should be
okay.

> What do you think about this new way of handling slots? Do you see any
> points of concern?
>
> I'm currently working on adding this change into the patch. And would
> appreciate any comment.

Thanks for making progress!

--
With Regards,
Amit Kapila.
Re: proposal: Allocate work_mem From Pool
On Tue, Jul 12, 2022 at 5:55 PM Joseph D Wagner wrote: > Before I try to answer that, I need to know how the scheduler works. As I understand the term used, there is no scheduler inside Postgres for user connections -- they're handled by the OS kernel. That's probably why it'd be a difficult project to be smart about memory -- step one might be to invent a scheduler. (The autovacuum launcher and checkpointer, etc have their own logic about when to do things, but while running they too are just OS processes scheduled by the kernel.) -- John Naylor EDB: http://www.enterprisedb.com
[PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi hackers, It is often not feasible to use `REPLICA IDENTITY FULL` on the publication, because it leads to a full table scan per tuple change on the subscription. This makes `REPLICA IDENTITY FULL` impractical for all but a small number of use cases. With this patch, I'm proposing the following change: if there is an index on the subscriber, use the index as long as the planner sub-module picks an index over a sequential scan. The majority of the logic on the subscriber side already exists in the code. The subscriber is already capable of doing (unique) index scans. With this patch, we use the index to iterate over the fetched tuples and only act when tuples are equal. Those familiar with this part of the code will recognize that the sequential scan code on the subscriber already implements the `tuples_equal()` function. In short, the changes on the subscriber mostly combine parts of the (unique) index scan and sequential scan codes. The decision on whether to use an index (or which index) is mostly derived from planner infrastructure. The idea is that on the subscriber we have all the columns. So, construct all the `Path`s with the restrictions on all columns, such as `col_1 = $1 AND col_2 = $2 ... AND col_n = $N`. Finally, let the planner sub-module -- `create_index_paths()` -- give us the relevant index `Path`s. On top of that, add the sequential scan `Path` as well. Finally, pick the cheapest `Path` among them. From the performance point of view, there are a few things to note. First, the patch aims not to change the behavior when a PRIMARY KEY or UNIQUE INDEX is used. Second, when REPLICA IDENTITY is FULL on the publisher and an index is used on the subscriber, the difference mostly comes down to `index scan` vs `sequential scan`. That's why it is hard to claim a specific amount of improvement. It mostly depends on the data size, the index, and the data distribution. 
Still, below I try to showcase the potential improvements using an index on the subscriber `pgbench_accounts(bid)`. With the index, the replication catches up around ~5 seconds. When the index is dropped, the replication takes around ~300 seconds. // init source db pgbench -i -s 100 -p 5432 postgres psql -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;" -p 5432 postgres psql -c "CREATE INDEX i1 ON pgbench_accounts(aid);" -p 5432 postgres psql -c "ALTER TABLE pgbench_accounts REPLICA IDENTITY FULL;" -p 5432 postgres psql -c "CREATE PUBLICATION pub_test_1 FOR TABLE pgbench_accounts;" -p 5432 postgres // init target db, drop existing primary key pgbench -i -p 9700 postgres psql -c "truncate pgbench_accounts;" -p 9700 postgres psql -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;" -p 9700 postgres psql -c "CREATE SUBSCRIPTION sub_test_1 CONNECTION 'host=localhost port=5432 user=onderkalaci dbname=postgres' PUBLICATION pub_test_1;" -p 9700 postgres // create one index, even on a low cardinality column psql -c "CREATE INDEX i2 ON pgbench_accounts(bid);" -p 9700 postgres // now, run some pgbench tests and observe replication pgbench -t 500 -b tpcb-like -p 5432 postgres What do hackers think about this change? Thanks, Onder Kalaci & Developing the Citus extension for PostgreSQL 0001_use_index_on_subs_when_pub_rep_ident_full.patch Description: Binary data
Re: Making Vars outer-join aware
Richard Guo writes: > Note that the evaluation of expression 'b.j + 1' now occurs below the > outer join. Is this something we need to be concerned about? It seems more formally correct to me, but perhaps somebody would complain about possibly-useless expression evals. We could likely re-complicate the logic that inserts PHVs during pullup so that it looks for Vars it can apply the nullingrels to. Maybe there's an opportunity to share code with flatten_join_alias_vars? regards, tom lane
Re: making relfilenodes 56 bits
On Mon, Jul 11, 2022 at 7:22 PM Andres Freund wrote: > I guess I'm not enthused in duplicating the necessary knowledge in evermore > places. We've forgotten one of the magic incantations in the past, and needing > to find all the places that need to be patched is a bit bothersome. > > Perhaps we could add extract helpers out of durable_rename()? > > OTOH, I don't really see what we gain by keeping things out of the critical > section? It does seem good to have the temp-file creation/truncation and write > separately, but after that I don't think it's worth much to avoid a > PANIC. What legitimate issue does it avoid? OK, so then I think we should just use durable_rename(). Here's a patch that does it that way. I briefly considered the idea of extracting helpers, but it doesn't seem worthwhile to me. There's not that much code in durable_rename() in the first place. In this version, I also removed the struct padding, changed the limit on the number of entries to a nice round 64, and made some comment updates. I considered trying to go further and actually make the file variable-size, so that we never again need to worry about the limit on the number of entries, but I don't actually think that's a good idea. It would require substantially more changes to the code in this file, and that means there's more risk of introducing bugs, and I don't see that there's much value anyway, because if we ever do hit the current limit, we can just raise the limit. If we were going to split up durable_rename(), the only intelligible split I can see would be to have a second version of the function, or a flag to the existing function, that caters to the situation where the old file is already known to have been fsync()'d. -- Robert Haas EDB: http://www.enterprisedb.com v2-0001-Remove-the-restriction-that-the-relmap-must-be-51.patch Description: Binary data
Introduce "log_connection_stages" setting.
Hello, The problem we face is excessive logging of connection information that clutters the logs and, in corner cases with many short-lived connections, leads to disk space exhaustion. Current connection log lines share significant parts of the information - host, port, very close timestamps etc. - in the common case of a successful connection: 2022-07-12 12:17:39.369 UTC,,,12875,"10.2.101.63:35616",62cd6663.324b,1,"",2022-07-12 12:17:39 UTC,,0,LOG,0,"connection received: host=10.2.101.63 port=35616","","not initialized" 2022-07-12 12:17:39.374 UTC,"some_user","postgres",12875,"10.2.101.63:35616",62cd6663.324b,2,"authentication",2022-07-12 12:17:39 UTC,18/4571,0,LOG,0,"connection authorized: user=some_user database=postgres SSL enabled (protocol=, cipher=, compression=)","","client backend" 2022-07-12 12:17:39.430 UTC,"some_user","postgres",12875,"10.2.101.63:35616",62cd6663.324b,3,"idle",2022-07-12 12:17:39 UTC,,0,LOG,0,"disconnection: session time: 0:00:00.060 user=some_user database=postgres host=10.2.101.63 port=35616","","client backend" Removing some of the lines should not harm log-based investigations in most cases, but will shrink the logs, improving readability and reducing space consumption. I would like to get feedback on the following idea: Add the `log_connection_stages` setting of type "string" with possible values "received", "authorized", "authenticated", "disconnected", and "all", with "all" being the default. The setting would have effect only when `log_connections` is on. Example: log_connection_stages='authorized,disconnected'. That also implies there would be no need for a separate "log_disconnection" setting. For the sake of completeness I have to mention that omitting 'received' from `log_connection_stages` would mean the logs contain no information about connections that do not complete initialization: for them only the "connection received" line is currently logged. The attachment contains a log excerpt to clarify the situation I am talking about. Regards, Sergey. 
2022-07-11 15:44:53.126 UTC,,,75,,62cbfa36.4b,3148,,2022-07-11 10:23:50 UTC,,0,DEBUG,0,"forked new backend, pid=270675 socket=12","","postmaster" 2022-07-11 15:44:53.127 UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,1,"",2022-07-11 15:44:53 UTC,,0,LOG,0,"connection received: host=10.2.80.0 port=1094","","not initialized" 2022-07-11 15:44:53.127 UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,2,"",2022-07-11 15:44:53 UTC,,0,DEBUG,0,"shmem_exit(0): 0 before_shmem_exit callbacks to make","","not initialized" 2022-07-11 15:44:53.127 UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,3,"",2022-07-11 15:44:53 UTC,,0,DEBUG,0,"shmem_exit(0): 0 on_shmem_exit callbacks to make","","not initialized" 2022-07-11 15:44:53.127 UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,4,"",2022-07-11 15:44:53 UTC,,0,DEBUG,0,"proc_exit(0): 1 callbacks to make","","not initialized" 2022-07-11 15:44:53.127 UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,5,"",2022-07-11 15:44:53 UTC,,0,DEBUG,0,"exit(0)","","not initialized" 2022-07-11 15:44:53.127 UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,6,"",2022-07-11 15:44:53 UTC,,0,DEBUG,0,"shmem_exit(-1): 0 before_shmem_exit callbacks to make","","not initialized" 2022-07-11 15:44:53.127 UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,7,"",2022-07-11 15:44:53 UTC,,0,DEBUG,0,"shmem_exit(-1): 0 on_shmem_exit callbacks to make","","not initialized" 2022-07-11 15:44:53.127 UTC,,,270675,"10.2.80.0:1094",62cc4575.42153,8,"",2022-07-11 15:44:53 UTC,,0,DEBUG,0,"proc_exit(-1): 0 callbacks to make","","not initialized" 2022-07-11 15:44:53.129 UTC,,,75,,62cbfa36.4b,3158,,2022-07-11 10:23:50 UTC,,0,DEBUG,0,"reaping dead processes ","","postmaster" 2022-07-11 15:44:53.129 UTC,,,75,,62cbfa36.4b,3159,,2022-07-11 10:23:50 UTC,,0,DEBUG,0,"server process (PID 270675) exited with exit code 0","","postmaster"
Re: Cleaning up historical portability baggage
Robert Haas writes: > On Mon, Jul 11, 2022 at 9:11 PM Thomas Munro wrote: >> Hmm, but that's not what we're doing in general. For example, on >> Windows we're redirecting open() to a replacement function of our own, >> we're not using "pg_open()" in our code. That's not an example based >> on AC_REPLACE_FUNCS, but there are plenty of those too. Isn't this >> quite well established? > Yes. I just don't care for it. > Sounds like I'm in the minority, though. I concur with your point that it's not great to use the standard name for a function that doesn't have exactly the standard semantics. But if it does, using a nonstandard name is not better. It's just one more thing that readers of our code have to learn about. Note that "exactly" only needs to mean "satisfies all the promises made by POSIX". If some caller is depending on behavior details not specified in the standard, that's the caller's bug not the wrapper function's. Otherwise, yeah, we couldn't ever be sure whether a wrapper function is close enough. regards, tom lane
Re: DropRelFileLocatorBuffers
On Tue, Jul 12, 2022 at 12:07 AM Dilip Kumar wrote: > I think the naming used in your patch looks better to me. So +1 for the > change. Committed. -- Robert Haas EDB: http://www.enterprisedb.com
Re: proposal: Allocate work_mem From Pool
On Tue, Jul 12, 2022 at 03:55:39AM -0700, Joseph D Wagner wrote: > Before I try to answer that, I need to know how the scheduler works. > > Let's say there's a max of 8 worker processes, and 12 queries trying to run. > When does query #9 run? After the first of 1-8 completes, simple FIFO? > Or something else? > > Also, how long does a query hold a worker process? All the way to > completion? Or does it perform some unit of work and rotate to > another query? I think what you're referring to as a worker process is what postgres refers to as a "client backend" (and not a "parallel worker", even though that sounds more similar to your phrase). > P.S. If there's a link to all this somewhere, please let me know. > Parsing through years of email archives is not always user friendly or > helpful. Looking at historic communication is probably the easy part. Here's some to start you out. https://www.postgresql.org/message-id/flat/4d39869f4bdc42b3a43004e3685ac45d%40index.de -- Justin
Re: Introduce "log_connection_stages" setting.
On Tue, Jul 12, 2022, at 10:52 AM, Sergey Dudoladov wrote: > The problem we face is excessive logging of connection information > that clutters the logs and in corner cases with many short-lived > connections leads to disk space exhaustion. You are proposing fine-grained control over the connection stages reported when log_connections = on. It seems useful if you're only interested in (a) malicious access or (b) authorized access (for audit purposes). > I would like to get feedback on the following idea: > > Add the `log_connection_stages` setting of type "string" with possible > values "received", "authorized", "authenticated", "disconnected", and > "all", with "all" being the default. > The setting would have effect only when `log_connections` is on. > Example: log_connection_stages='authorized,disconnected'. > That also implies there would be no need for a separate > "log_disconnection" setting. Your proposal will add more confusion to the already-confused logging-related GUCs. If you are proposing to introduce fine-grained control, the first step should be to merge log_connections and log_disconnections into a new GUC (?) and deprecate them. (I wouldn't introduce a new GUC that depends on the state of another GUC as you proposed.) There are 3 stages: connect (received), authorized (authenticated), disconnect. You can also add 'all', which mimics log_connections / log_disconnections enabled. Another question is whether we can reuse log_connections for this improvement instead of a new GUC. In this case, you would turn the boolean value into an enum value. Will it cause trouble while upgrading to this new version? It is one of the reasons to create a new GUC. I would suggest log_connection_messages or log_connection (without the 's' at the end -- since it is too similar to the current GUC name, I'm afraid it is not a good name for it). -- Euler Taveira EDB https://www.enterprisedb.com/
Re: making relfilenodes 56 bits
On Mon, Jul 11, 2022 at 9:49 PM Robert Haas wrote: > > It also makes me wonder why we're using macros rather than static > inline functions in buf_internals.h. I wonder whether we could do > something like this, for example, and keep InvalidForkNumber as -1: > > static inline ForkNumber > BufTagGetForkNum(BufferTag *tagPtr) > { > int8 ret; > > StaticAssertStmt(MAX_FORKNUM <= INT8_MAX); > ret = (int8) (tagPtr->relForkDetails[0] >> BUFFERTAG_RELNUMBER_BITS); > return (ForkNumber) ret; > } > > Even if we don't use that particular trick, I think we've generally > been moving toward using static inline functions rather than macros, > because it provides better type-safety and the code is often easier to > read. Maybe we should also approach it that way here. Or even commit a > preparatory patch replacing the existing macros with inline functions. > Or maybe it's best to leave it alone, not sure. I think it makes sense to convert the existing macros as well; I have attached a patch for that. > > > I had those changes in v7-0003, now I have merged with 0002. This has > > assert check while replaying the WAL for smgr create and smgr > > truncate, and while during normal path when allocating the new > > relfilenumber we are asserting for any existing file. > > I think a test-and-elog might be better. Most users won't be running > assert-enabled builds, but this seems worth checking regardless. IMHO we can convert the recovery-time asserts to elog, but the one we do after each GetNewRelFileNumber() is better kept as an assert, since it involves file access and can be costly. > > I have done some performance tests, with very small values I can see a > > lot of wait events for RelFileNumberGen but with bigger numbers like > > 256 or 512 it is not really bad. See results at the end of the > > mail[1] > > It's a little hard to interpret these results because you don't say > how often you were checking the wait events, or how long the > operation took to complete. 
I suppose we can guess the relative time > scale from the number of Activity events: if there were 190 > WalWriterMain events observed, then the time to complete the operation > is probably 190 times how often you were checking the wait events, but > was that every second or every half second or every tenth of a second? I am executing it every 0.5 sec using the script below in psql \t select wait_event_type, wait_event from pg_stat_activity where pid != pg_backend_pid() \watch 0.5 And running the test for 60 sec ./pgbench -c 32 -j 32 -T 60 -f create_script.sql -p 54321 postgres $ cat create_script.sql select create_table(100); // function body 'create_table' CREATE OR REPLACE FUNCTION create_table(count int) RETURNS void AS $$ DECLARE relname varchar; pid int; i int; BEGIN SELECT pg_backend_pid() INTO pid; relname := 'test_' || pid; FOR i IN 1..count LOOP EXECUTE format('CREATE TABLE %s(a int)', relname); EXECUTE format('DROP TABLE %s', relname); END LOOP; END; $$ LANGUAGE plpgsql; > > I have done these changes during GetNewRelFileNumber() this required > > to track the last logged record pointer as well but I think this looks > > clean. With this I can see some reduction in RelFileNumberGen wait > > event[1] > > I find the code you wrote here a little bit magical. I believe it > depends heavily on choosing to issue the new WAL record when we've > exhausted exactly 50% of the available space. I suggest having two > constants, one of which is the number of relfilenumber values per WAL > record, and the other of which is the threshold for issuing a new WAL > record. Maybe something like RFN_VALUES_PER_XLOG and > RFN_NEW_XLOG_THRESHOLD, or something. And then write code that works > correctly for any value of RFN_NEW_XLOG_THRESHOLD between 0 (don't log > new RFNs until old allocation is completely exhausted) and > RFN_VALUES_PER_XLOG - 1 (log new RFNs after using just 1 item from the > previous allocation). 
That way, if in the future someone decides to > change the constant values, they can do that and the code still works. ok -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com From 553a6b185ee7270bf206387421e50b48c7a8b97b Mon Sep 17 00:00:00 2001 From: Dilip Kumar Date: Tue, 12 Jul 2022 17:10:04 +0530 Subject: [PATCH v1] Convert buf_internal.h macros to static inline functions Readability-wise, inline functions are better than macros, and this will also help in writing cleaner, more readable code for the 64-bit relfilenode work, because as part of that patch we will have to do some complex bitwise operations, and doing that inside inline functions will be cleaner. --- src/backend/storage/buffer/buf_init.c | 2 +- src/backend/storage/buffer/bufmgr.c | 16 ++-- src/backend/storage/buffer/localbuf.c | 12 +-- src/include/storage/buf_internals.h | 156 +- 4 files changed, 111 insertions(+), 75 deletions(-) diff --git a/src/backend/storage/buffer/bu
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Thanks for the review! On Tue, Jul 12, 2022 at 4:06 AM Kyotaro Horiguchi wrote: > At Mon, 11 Jul 2022 22:22:28 -0400, Melanie Plageman < > melanieplage...@gmail.com> wrote in > > Hi, > > > > In the attached patch set, I've added in missing IO operations for > > certain IO Paths as well as enumerating in the commit message which IO > > Paths and IO Operations are not currently counted and/or not possible. > > > > There is a TODO in HandleWalWriterInterrupts() about removing > > pgstat_report_wal() since it is immediately before a proc_exit() > > Right. walwriter does that without needing the explicit call. > I have deleted it. > > > I was wondering if LocalBufferAlloc() should increment the counter or if > > I should wait until GetLocalBufferStorage() to increment the counter. > > Depends on what "allocate" means. Different from shared buffers, local > buffers are taken from OS then allocated to page. OS-allocated pages > are restricted by num_temp_buffers so I think what we're interested in > is the count incremented by LocalBufferAlloc(). (And it is the parallel > of alloc for shared-buffers) > I've left it in LocalBufferAlloc(). > > > I also realized that I am not differentiating between IOPATH_SHARED and > > IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type > > of buffer we are fsync'ing by the time we call register_dirty_segment(), > > I'm not sure how we would fix this. > > I think there scarcely happens flush for strategy-loaded buffers. If > that is sensible, IOOP_FSYNC would not make much sense for > IOPATH_STRATEGY. > Why would it be less likely for a backend to do its own fsync when flushing a dirty strategy buffer than a regular dirty shared buffer? > > > > IMO it'd be a good idea to add pgstat_report_io_ops() to > > > pgstat_report_vacuum()/analyze(), so that the stats for a longrunning > > > autovac worker get updated more regularly. > > > > > > > noted and fixed. 
> > > > How about moving the pgstat_report_io_ops() into > > > pgstat_report_bgwriter(), pgstat_report_autovacuum() etc? Seems > > > unnecessary to have multiple pgstat_* calls in these places. > > > > > > > > > > > noted and fixed. > > +* Also report IO Operations statistics > > I think that the function comment also should mention this. > I've added comments at the top of all these functions. > > > > Wonder if it's worth making the lock specific to the backend type? > > > > > > > I've added another Lock into PgStat_IOPathOps so that each BackendType > > can be locked separately. But, I've also kept the lock in > > PgStatShared_BackendIOPathOps so that reset_all and snapshot could be > > done easily. > > Looks fine about the lock separation. > Actually, I think it is not safe to use both of these locks. So, picking one method, it is probably better to go with the locks in PgStat_IOPathOps; that will be more efficient for flush (and not for fetching and resetting), so that is probably the way to go, right? > By the way, in the following line: > > + > &pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)]; > > backend_type_get_idx(x) is actually (x - 1) plus an assertion on the > value range. And the only use-case is here. There's a reverse > function, also used at only one place. > > + Datum backend_type_desc = > + > CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i))); > > In this usage GetBackendTypeDesc() gracefully treats out-of-domain > values but idx_get_backend_type keenly kills the process for the > same. This is inconsistent. > > My humble opinion on this is that we don't define the two functions and > replace the calls to them with (x +/- 1). In addition to that, I think > we should not abort() on invalid backend types. In that sense, I > wonder if we could use the B_INVALIDth element for this purpose. > I think that GetBackendTypeDesc() should probably also error out for an unknown value. 
I would be open to not using the helper functions. I thought it would be less error-prone, but since it is limited to the code in pgstat_io_ops.c, it is probably okay. Let me think a bit more. Could you explain more about what you mean about using B_INVALID BackendType? > > > > > + LWLockAcquire(&stats_shmem->lock, LW_SHARED); > > > > + memcpy(&reset, reset_offset, sizeof(stats_shmem->stats)); > > > > + LWLockRelease(&stats_shmem->lock); > > > > > > Which then also means that you don't need the reset offset stuff. It's > > > only there because with the changecount approach we can't take a lock > to > > > reset the stats (since there is no lock). With a lock you can just > reset > > > the shared state. > > > > > > > Yes, I believe I have cleaned up all of this embarrassing mess. I use the > > lock in PgStatShared_BackendIOPathOps for reset all and snapshot and the > > locks in PgStat_IOPathOps for flush. > > Looks fine, but I think pgstat_flush_io_ops() need more comments like > other pgstat_flush_* functions. > > + for (int
test_oat_hooks bug (was: Re: pgsql: Add copy/equal support for XID lists)
While looking for a place to host a test for XID lists support, I noticed a mistake in test_oat_hooks, fixed as per the attached. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ From 7005187f58a2a8b0300eceaa8fae8f825d5525a9 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 12 Jul 2022 17:18:23 +0200 Subject: [PATCH] Fix flag tests in src/test/modules/test_oat_hooks In what must have been a copy'n'paste mistake, all the flag tests use the same flag rather than a different flag each. The bug is not surprising, considering that it's dead code. --- src/test/modules/test_oat_hooks/test_oat_hooks.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/test/modules/test_oat_hooks/test_oat_hooks.c b/src/test/modules/test_oat_hooks/test_oat_hooks.c index 1f40d632e0..900d597f5d 100644 --- a/src/test/modules/test_oat_hooks/test_oat_hooks.c +++ b/src/test/modules/test_oat_hooks/test_oat_hooks.c @@ -1795,15 +1795,15 @@ accesstype_arg_to_string(ObjectAccessType access, void *arg) return psprintf("%s%s%s%s%s%s", ((drop_arg->dropflags & PERFORM_DELETION_INTERNAL) ? "internal action," : ""), -((drop_arg->dropflags & PERFORM_DELETION_INTERNAL) +((drop_arg->dropflags & PERFORM_DELETION_CONCURRENTLY) ? "concurrent drop," : ""), -((drop_arg->dropflags & PERFORM_DELETION_INTERNAL) +((drop_arg->dropflags & PERFORM_DELETION_QUIETLY) ? "suppress notices," : ""), -((drop_arg->dropflags & PERFORM_DELETION_INTERNAL) +((drop_arg->dropflags & PERFORM_DELETION_SKIP_ORIGINAL) ? "keep original object," : ""), -((drop_arg->dropflags & PERFORM_DELETION_INTERNAL) +((drop_arg->dropflags & PERFORM_DELETION_SKIP_EXTENSIONS) ? "keep extensions," : ""), -((drop_arg->dropflags & PERFORM_DELETION_INTERNAL) +((drop_arg->dropflags & PERFORM_DELETION_CONCURRENT_LOCK) ? "normal concurrent drop," : "")); } break; -- 2.30.2
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hi, On 2022-07-12 12:19:06 -0400, Melanie Plageman wrote: > > > I also realized that I am not differentiating between IOPATH_SHARED and > > > IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type > > > of buffer we are fsync'ing by the time we call register_dirty_segment(), > > > I'm not sure how we would fix this. > > > > I think there scarcely happens flush for strategy-loaded buffers. If > > that is sensible, IOOP_FSYNC would not make much sense for > > IOPATH_STRATEGY. > > > > Why would it be less likely for a backend to do its own fsync when > flushing a dirty strategy buffer than a regular dirty shared buffer? We really just don't expect a backend to do many segment fsyncs at all. Otherwise there's something wrong with the forwarding mechanism. It'd be different if we tracked WAL fsyncs more granularly - which would be quite interesting - but that's something for another day^Wpatch. > > > > Wonder if it's worth making the lock specific to the backend type? > > > > > > > > > > I've added another Lock into PgStat_IOPathOps so that each BackendType > > > can be locked separately. But, I've also kept the lock in > > > PgStatShared_BackendIOPathOps so that reset_all and snapshot could be > > > done easily. > > > > Looks fine about the lock separation. > > > > Actually, I think it is not safe to use both of these locks. So for > picking one method, it is probably better to go with the locks in > PgStat_IOPathOps, it will be more efficient for flush (and not for > fetching and resetting), so that is probably the way to go, right? I think it's good to just use one kind of lock, and efficiency of snapshotting / resetting is nearly irrelevant. But I don't see why it's not safe to use both kinds of locks? > > Looks fine, but I think pgstat_flush_io_ops() need more comments like > > other pgstat_flush_* functions. 
> > > > + for (int i = 0; i < BACKEND_NUM_TYPES; i++) > > + stats_shmem->stats[i].stat_reset_timestamp = ts; > > > > I'm not sure we need a separate reset timestamp for each backend type > > but SLRU counter does the same thing.. > > > > Yes, I think for SLRU stats it is because you can reset individual SLRU > stats. Also there is no wrapper data structure to put it in. I could > keep it in PgStatShared_BackendIOPathOps since you have to reset all IO > operation stats at once, but I am thinking of getting rid of > PgStatShared_BackendIOPathOps since it is not needed if I only keep the > locks in PgStat_IOPathOps and make the global shared value an array of > PgStat_IOPathOps. I'm strongly against introducing super granular reset timestamps. I think that was a mistake for SLRU stats, but we can't fix that as easily. > Currently, strategy allocs count only reuses of a strategy buffer (not > initial shared buffers which are added to the ring). > strategy writes count only the writing out of dirty buffers which are > already in the ring and are being reused. That seems right to me. > Alternatively, we could also count as strategy allocs all those buffers > which are added to the ring and count as strategy writes all those > shared buffers which are dirty when initially added to the ring. I don't think that'd provide valuable information. The whole reason that strategy writes are interesting is that they can lead to writing out data a lot sooner than they would be written out without a strategy being used. > Subject: [PATCH v24 2/3] Track IO operation statistics > > Introduce "IOOp", an IO operation done by a backend, and "IOPath", the > location or type of IO done by a backend. For example, the checkpointer > may write a shared buffer out. This would be counted as an IOOp write on > an IOPath IOPATH_SHARED by BackendType "checkpointer". I'm still not 100% happy with IOPath - seems a bit too easy to confuse with the file path. What about 'origin'? 
> Each IOOp (alloc, fsync, extend, write) is counted per IOPath > (direct, local, shared, or strategy) through a call to > pgstat_count_io_op(). It seems we should track reads too - it's quite interesting to know whether reads happened because of a strategy, for example. You do reference reads in a later part of the commit message even :) > The primary concern of these statistics is IO operations on data blocks > during the course of normal database operations. IO done by, for > example, the archiver or syslogger is not counted in these statistics. We could extend this at a later stage, if we really want to. But I'm not sure it's interesting or fully possible. E.g. the archiver's write are largely not done by the archiver itself, but by a command (or module these days) it shells out to. > Note that this commit does not add code to increment IOPATH_DIRECT. A > future patch adding wrappers for smgrwrite(), smgrextend(), and > smgrimmedsync() would provide a good location to call > pgstat_count_io_op() for unbuffered IO and avoid regressions for future > users of these functions. Hm. Perhaps we sh
Re: making relfilenodes 56 bits
Hi, On 2022-07-12 09:51:12 -0400, Robert Haas wrote: > On Mon, Jul 11, 2022 at 7:22 PM Andres Freund wrote: > > I guess I'm not enthused in duplicating the necessary knowledge in evermore > > places. We've forgotten one of the magic incantations in the past, and > > needing > > to find all the places that need to be patched is a bit bothersome. > > > > Perhaps we could add extract helpers out of durable_rename()? > > > > OTOH, I don't really see what we gain by keeping things out of the critical > > section? It does seem good to have the temp-file creation/truncation and > > write > > separately, but after that I don't think it's worth much to avoid a > > PANIC. What legitimate issue does it avoid? > > OK, so then I think we should just use durable_rename(). Here's a > patch that does it that way. I briefly considered the idea of > extracting helpers, but it doesn't seem worthwhile to me. There's not > that much code in durable_rename() in the first place. Cool. > In this version, I also removed the struct padding, changed the limit > on the number of entries to a nice round 64, and made some comment > updates. What does currently happen if we exceed that? I wonder if we should just reference a new define generated by genbki.pl documenting the number of relations that need to be tracked. Then we don't need to maintain this manually going forward. > I considered trying to go further and actually make the file > variable-size, so that we never again need to worry about the limit on > the number of entries, but I don't actually think that's a good idea. Yea, I don't really see what we'd gain. For this stuff to change we need to recompile anyway. > If we were going to split up durable_rename(), the only intelligible > split I can see would be to have a second version of the function, or > a flag to the existing function, that caters to the situation where > the old file is already known to have been fsync()'d. 
I was thinking of something like durable_rename_prep() that'd fsync the file/directories under their old names, and then durable_rename_exec() that actually renames and then fsyncs. But without a clear usecase... > + /* Write new data to the file. */ > + pgstat_report_wait_start(WAIT_EVENT_RELATION_MAP_WRITE); > + if (write(fd, newmap, sizeof(RelMapFile)) != sizeof(RelMapFile)) ... > + pgstat_report_wait_end(); > + Not for this patch, but we eventually should move this sequence into a wrapper. Perhaps combined with retry handling for short writes, the ENOSPC stuff and an error message when the write fails. It's a bit insane how many copies of this we have. > diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h > index b578e2ec75..5d3775ccde 100644 > --- a/src/include/utils/wait_event.h > +++ b/src/include/utils/wait_event.h > @@ -193,7 +193,7 @@ typedef enum > WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE, > WAIT_EVENT_LOGICAL_REWRITE_WRITE, > WAIT_EVENT_RELATION_MAP_READ, > - WAIT_EVENT_RELATION_MAP_SYNC, > + WAIT_EVENT_RELATION_MAP_RENAME, Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME, since it includes fsync etc? Greetings, Andres Freund
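For readers following along, the durable-replace sequence being discussed (write a temp file, fsync it, rename() it into place, then fsync the containing directory so the rename itself survives a crash) can be sketched as follows. durable_replace() is an invented name with condensed error handling, so treat it as an illustration rather than PostgreSQL's actual durable_rename():

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: durably replace 'finalpath' with new contents.  The rename
 * makes the swap atomic; the directory fsync makes it durable. */
int
durable_replace(const char *dir, const char *tmppath,
				const char *finalpath, const void *data, size_t len)
{
	int			fd;
	int			dirfd;

	/* 1. write the new data to a temporary file and fsync it */
	fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	if (close(fd) != 0)
		return -1;

	/* 2. atomically swap the new file into place */
	if (rename(tmppath, finalpath) != 0)
		return -1;

	/* 3. fsync the directory so the rename itself is durable */
	dirfd = open(dir, O_RDONLY);
	if (dirfd < 0)
		return -1;
	if (fsync(dirfd) != 0)
	{
		close(dirfd);
		return -1;
	}
	return close(dirfd);
}
```

A production-quality version would also need the short-write retry loop, ENOSPC handling, and wait-event reporting that Andres mentions above.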
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hi, On 2022-07-11 22:22:28 -0400, Melanie Plageman wrote: > Yes, per an off list suggestion by you, I have changed the tests to use a > sum of writes. I've also added a test for IOPATH_LOCAL and fixed some of > the missing calls to count IO Operations for IOPATH_LOCAL and > IOPATH_STRATEGY. > > I struggled to come up with a way to test writes for a particular > type of backend are counted correctly since a dirty buffer could be > written out by another type of backend before the target BackendType has > a chance to write it out. I guess temp file writes would be reliably done by one backend... Don't have a good idea otherwise. > I also struggled to come up with a way to test IO operations for > background workers. I'm not sure of a way to deterministically have a > background worker do a particular kind of IO in a test scenario. I think it's perfectly fine to not test that - for it to be broken we'd have to somehow screw up setting the backend type. Everything else is the same as other types of backends anyway. If you *do* want to test it, you probably could use SET parallel_leader_participation = false; SET force_parallel_mode = 'regress'; SELECT something_triggering_io; > I'm not sure how to cause a strategy "extend" for testing. COPY into a table should work. But might be unattractive due to the size of the COPY ringbuffer. > > Would be nice to have something testing that the ringbuffer stats stuff > > does something sensible - that feels not entirely trivial. > > > > > I've added a test to test that reused strategy buffers are counted as > allocs. I would like to add a test which checks that if a buffer in the > ring is pinned and thus not reused, that it is not counted as a strategy > alloc, but I found it challenging without a way to pause vacuuming, pin > a buffer, then resume vacuuming. Yea, that's probably too hard to make reliable to be worth it. Greetings, Andres Freund
Re: Reducing the chunk header sizes on all memory context types
Good day, David. В Вт, 12/07/2022 в 17:01 +1200, David Rowley пишет: > Over on [1], I highlighted that 40af10b57 (Use Generation memory > contexts to store tuples in sorts) could cause some performance > regressions for sorts when the size of the tuple is exactly a power of > 2. The reason for this is that the chunk header for generation.c > contexts is 8 bytes larger (on 64-bit machines) than the aset.c chunk > header. This means that we can store fewer tuples per work_mem during > the sort and that results in more batches being required. > > As I write this, this regression is still marked as an open item for > PG15 in [2]. So I've been working on this to try to assist the > discussion about if we need to do anything about that for PG15. > > Over on [3], I mentioned that Andres came up with an idea and a > prototype patch to reduce the chunk header size across the board by > storing the context type in the 3 least significant bits in a uint64 > header. > > I've taken Andres' patch and made some quite significant changes to > it. In the patch's current state, the sort performance regression in > PG15 vs PG14 is fixed. The generation.c context chunk header has been > reduced to 8 bytes from the previous 24 bytes as it is in master. > aset.c context chunk header is now 8 bytes instead of 16 bytes. > > We can use this 8-byte chunk header by using the remaining 61-bits of > the uint64 header to encode 2 30-bit values to store the chunk size > and also the number of bytes we must subtract from the given pointer > to find the block that the chunk is stored on. Once we have the > block, we can find the owning context by having a pointer back to the > context from the block. For memory allocations that are larger than > what can be stored in 30 bits, the chunk header gets an additional two > Size fields to store the chunk_size and the block offset. We can tell > the difference between the 2 chunk sizes by looking at the spare 1-bit > the uint64 portion of the header. 
I don't get why "large chunk" needs additional fields for size and offset. Large allocation sizes are certainly rounded to the page size. And allocations that don't fit in 1GB we could easily round to 1MB. Then we could simply store `size>>20`. It would limit MaxAllocHugeSize to `(1<<(30+20))-1` - 1PB. It is doubtful we will deal with such huge allocations in the near future. And to limit the block offset, we just have to limit maxBlockSize to 1GB, which is quite a reasonable limitation. Chunks that are larger than maxBlockSize go to separate blocks anyway, and therefore have a small block offset. > Aside from speeding up the sort case, this also seems to have a good > positive performance impact on pgbench read-only workload with -M > simple. I'm seeing about a 17% performance increase on my AMD > Threadripper machine. > > master + Reduced Memory Context Chunk Overhead > drowley@amd3990x:~$ pgbench -S -T 60 -j 156 -c 156 -M simple postgres > tps = 1876841.953096 (without initial connection time) > tps = 1919605.408254 (without initial connection time) > tps = 1891734.480141 (without initial connection time) > > Master > drowley@amd3990x:~$ pgbench -S -T 60 -j 156 -c 156 -M simple postgres > tps = 1632248.236962 (without initial connection time) > tps = 1615663.151604 (without initial connection time) > tps = 1602004.146259 (without initial connection time) The trick with the 3-bit context type is great. > The attached .png file shows the same results for PG14 and PG15 as I > showed in the blog [4] where I discovered the regression and adds the > results from current master + the attached patch. See bars in orange. > You can see that the regression at 64MB work_mem is fixed. Adding some > tracing to the sort shows that we're now doing 671745 tuples per batch > instead of 576845 tuples. This reduces the number of batches from 245 > down to 210. > > Drawbacks: > > There is at least one.
> It might be major; to reduce the AllocSet chunk > header from 16 bytes down to 8 bytes I had to get rid of the freelist > pointer that was reusing the "aset" field in the chunk header struct. > This works now by storing that pointer in the actual palloc'd memory. > This could lead to pretty hard-to-trace bugs if we have any code that > accidentally writes to memory after pfree. The slab.c context already > does this, but that's far less commonly used. If we decided this was > unacceptable then it does not change anything for the generation.c > context. The chunk header will still be 8 bytes instead of 24 there. > So the sort performance regression will still be fixed. At least we can still mark the free list pointer as VALGRIND_MAKE_MEM_NOACCESS and do VALGRIND_MAKE_MEM_DEFINED when fetching it from the free list, can't we? > To improve this situation, we might be able code it up so that > MEMORY_CONTEXT_CHECKING builds add an additional freelist pointer to > the header and also write it to the palloc'd memory then verify > they're set to the same thing when we reuse a chunk from the fre
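The 8-byte header encoding David describes (3 bits of context type, two 30-bit values for chunk size and block offset, and one spare bit flagging "large" chunks) can be sketched like this. The exact bit positions here are guesses for illustration only, not the patch's actual layout:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of the 8-byte chunk header: bits 0-2 = context
 * type, bit 3 = "large chunk" flag, bits 4-33 = chunk size,
 * bits 34-63 = offset back to the owning block. */
#define CTX_BITS	3
#define LARGE_BIT	(UINT64_C(1) << 3)
#define VAL_BITS	30
#define VAL_MASK	((UINT64_C(1) << VAL_BITS) - 1)

uint64_t
hdr_encode(unsigned ctx, uint64_t size, uint64_t blkoff)
{
	/* caller guarantees ctx < 8 and both values fit in 30 bits */
	return (uint64_t) ctx |
		((size & VAL_MASK) << 4) |
		((blkoff & VAL_MASK) << (4 + VAL_BITS));
}

unsigned
hdr_ctx(uint64_t h)
{
	return (unsigned) (h & ((1u << CTX_BITS) - 1));
}

uint64_t
hdr_size(uint64_t h)
{
	return (h >> 4) & VAL_MASK;
}

uint64_t
hdr_blkoff(uint64_t h)
{
	return (h >> (4 + VAL_BITS)) & VAL_MASK;
}
```

In the real patch, chunks whose size or offset exceeds 30 bits would set the "large" bit and carry two extra Size fields instead; the point of the sketch is just that 3 + 1 + 30 + 30 bits pack exactly into one uint64.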
remove_useless_groupby_columns is too enthusiastic
I looked into the complaint at [1] about the planner being much stupider when one side of a JOIN USING is referenced than the other side. It seemed to me that that shouldn't be happening, because the relevant decisions are made on the basis of EquivalenceClasses and both USING columns should be in the same EquivalenceClass. I was right about that ... but the damage is being done far upstream of any EquivalenceClass work. It turns out that the core of the issue is that the query looks like SELECT ... t1 JOIN t2 USING (x) GROUP BY x, t2.othercol ORDER BY t2.othercol LIMIT n In the "okay" case, we resolve "GROUP BY x" as GROUP BY t1.x. Later on, we are able to realize that ordering by t1.x is equivalent to ordering by t2.x (because same EquivalenceClass), and that it's equally good to consider the GROUP BY items in either order, and that ordering by t2.othercol, t2.x would allow us to perform the grouping using a GroupAggregate on data that's already sorted to meet the ORDER BY requirement. Since there happens to be an index on t2.othercol, t2.x, we can implement the query with no explicit sort, which wins big thanks to the small LIMIT. In the not-okay case, we resolve "GROUP BY x" as GROUP BY t2.x. Then remove_useless_groupby_columns notices that x is t2's primary key, so it figures that grouping by t2.othercol is redundant and throws away that element of GROUP BY. Now there is no apparent connection between the GROUP BY and ORDER BY lists, defeating the optimizations that would lead to a good choice of plan --- in fact, we conclude early on that that index's sort ordering is entirely useless to the query :-( I tried the attached quick-hack patch that just prevents remove_useless_groupby_columns from removing anything that appears in ORDER BY. That successfully fixes the complained-of case, and it doesn't change any existing regression test results. I'm not sure whether there are any cases that it makes significantly worse. 
(I also kind of wonder if the fundamental problem here is that remove_useless_groupby_columns is being done at the wrong time, and we ought to do it later when we have more semantic info. But I'm not volunteering to rewrite it completely.) Anyway, remove_useless_groupby_columns has been there since 9.6 and we've not previously seen reports of cases that it makes worse, so this seems like a corner-case problem. Hence I wouldn't risk back-patching this change. It seems worth considering for HEAD though, so I'll stick it in the September CF. regards, tom lane [1] https://www.postgresql.org/message-id/17544-ebd06b00b8836a04%40postgresql.org diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 06ad856eac..b2716abcc7 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -2733,12 +2733,15 @@ remove_useless_groupby_columns(PlannerInfo *root) /* * New list must include non-Vars, outer Vars, and anything not - * marked as surplus. + * marked as surplus. In addition, keep anything that appears in + * the ORDER BY clause, because otherwise we may falsely make it + * look like the GROUP BY and ORDER BY clauses are incompatible. */ if (!IsA(var, Var) || var->varlevelsup > 0 || !bms_is_member(var->varattno - FirstLowInvalidHeapAttributeNumber, - surplusvars[var->varno])) + surplusvars[var->varno]) || +list_member(parse->sortClause, sgc)) new_groupby = lappend(new_groupby, sgc); }
Re: Cleaning up historical portability baggage
Hi, On 2022-07-12 08:01:40 -0400, Robert Haas wrote: > On Mon, Jul 11, 2022 at 9:11 PM Thomas Munro wrote: > > Hmm, but that's not what we're doing in general. For example, on > > Windows we're redirecting open() to a replacement function of our own, > > we're not using "pg_open()" in our code. That's not an example based > > on AC_REPLACE_FUNCS, but there are plenty of those too. Isn't this > > quite well established? > > Yes. I just don't care for it. > > Sounds like I'm in the minority, though. I agree with you, at least largely. Redefining functions, be it by linking in something or by redefining function names via macros, is a mess. There's places where we then have to undefine some of these things to be able to include external headers etc. Some functions are only replaced in backends, others in frontend too. It makes it hard to know what exactly the assumed set of platform primitives is. Etc. Greetings, Andres Freund
Re: Reducing the chunk header sizes on all memory context types
Hi, On 2022-07-12 17:01:18 +1200, David Rowley wrote: > I've taken Andres' patch and made some quite significant changes to > it. In the patch's current state, the sort performance regression in > PG15 vs PG14 is fixed. The generation.c context chunk header has been > reduced to 8 bytes from the previous 24 bytes as it is in master. > aset.c context chunk header is now 8 bytes instead of 16 bytes. I think this is *very* cool. But I might be biased :) > There is at least one. It might be major; to reduce the AllocSet chunk > header from 16 bytes down to 8 bytes I had to get rid of the freelist > pointer that was reusing the "aset" field in the chunk header struct. > This works now by storing that pointer in the actual palloc'd memory. > This could lead to pretty hard-to-trace bugs if we have any code that > accidentally writes to memory after pfree. Can't we use the same trick for allocations in the freelist as we do for the header in a live allocation? I.e. split the 8 byte header into two and use part of it to point to the next element in the list using the offset from the start of the block, and part of it to indicate the size? Greetings, Andres Freund
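The freelist arrangement being debated — reusing a freed chunk's own payload to store the next-free pointer — might look roughly like this (invented names, not aset.c's code; the real allocator keeps one freelist per size class, but a single list is enough for illustration). It also shows why a write-after-pfree would corrupt the list:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* When a chunk is freed, we overlay this struct on its user memory and
 * chain it into the freelist.  Any write through a dangling pointer
 * after "pfree" would clobber 'next' -- the hard-to-trace bug above. */
typedef struct FreeListChunk
{
	struct FreeListChunk *next;
} FreeListChunk;

FreeListChunk *freelist = NULL;

void
sketch_free(void *user_mem)
{
	FreeListChunk *c = (FreeListChunk *) user_mem;	/* reuse the payload */

	c->next = freelist;
	freelist = c;
}

void *
sketch_alloc(size_t size)
{
	if (freelist)
	{
		/* pop the most recently freed chunk and recycle it */
		FreeListChunk *c = freelist;

		freelist = c->next;
		return c;
	}
	return malloc(size < sizeof(FreeListChunk) ? sizeof(FreeListChunk) : size);
}
```

Andres's alternative above would instead pack a block-relative offset to the next free element into the 8-byte header itself, keeping the payload untouched.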
Re: Reducing the chunk header sizes on all memory context types
Hi, On 2022-07-12 20:22:57 +0300, Yura Sokolov wrote: > I don't get, why "large chunk" needs additional fields for size and > offset. > Large allocation sizes are certainly rounded to page size. > And allocations which doesn't fit 1GB we could easily round to 1MB. > Then we could simply store `size>>20`. > It will limit MaxAllocHugeSize to `(1<<(30+20))-1` - 1PB. Doubdfully we > will deal with such huge allocations in near future. What would we gain by doing something like this? The storage density loss of storing an exact size is smaller than what you propose here. Greetings, Andres Freund
Re: Cleaning up historical portability baggage
Andres Freund writes: > Redefining functions, be it by linking in something or by redefining function > names via macros, is a mess. There's places where we then have to undefine > some of these things to be able to include external headers etc. Some > functions are only replaced in backends, others in frontend too. It makes it > hard to know what exactly the assumed set of platform primitives is. Etc. In the cases at hand, we aren't doing that, are we? The replacement function is only used on platforms that lack the relevant POSIX function, so it's hard to argue that we're replacing anything. regards, tom lane
Re: First draft of the PG 15 release notes
On Mon, Jul 11, 2022 at 11:31:32PM -0700, Noah Misch wrote: > On Mon, Jul 11, 2022 at 12:39:57PM -0400, Bruce Momjian wrote: > > I had trouble reading the sentences in the order you used so I > > restructured it: > > > > The new default is one of the secure schema usage patterns that > > <xref linkend="ddl-schemas-patterns"/> has recommended since the security > > release for CVE-2018-1058. The change applies to newly-created > > databases in existing clusters and for new clusters. Upgrading a > > cluster or restoring a database dump will preserve existing permissions. > > I agree with the sentence order change. Great. > > For existing databases, especially those having multiple users, consider > > issuing REVOKE to adopt this new default. For new > > databases having zero need to defend against insider threats, granting > > USAGE permission on their public > > schemas will yield the behavior of prior releases. > s/USAGE/CREATE/ in the last sentence. Looks good with that change. Ah, yes, of course. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: System catalog documentation chapter
On 08.07.22 18:07, Bruce Momjian wrote: > so I guess we can backpatch this with no issues. It inserts a new chapter, which would renumber all other chapters. That's a pretty big change to backpatch. I'm against that.
Re: System catalog documentation chapter
On 08.07.22 21:32, Bruce Momjian wrote: > On Fri, Jul 8, 2022 at 12:07:45PM -0400, Bruce Momjian wrote: > > On Fri, Jul 8, 2022 at 11:49:47AM -0400, Tom Lane wrote: > > > Bruce Momjian writes: > > > > Agreed. I don't want to break links into the documentation in final > > > > released versions, so head and PG15 seem wise. > > > I would not expect this to change the doc URLs for any individual > > > catalogs or views --- if it does, I won't be happy. > > Good point --- I thought the chapter was in the URL, but I now see it is > > just the section heading: > > https://www.postgresql.org/docs/devel/view-pg-available-extensions.html > > so I guess we can backpatch this with no issues. > Attached is a patch to accomplish this. Its output can be seen here: > https://momjian.us/tmp/pgsql/internals.html views.sgml is a pretty generic name for a chapter that just contains system views.
Re: Making CallContext and InlineCodeBlock less special-case-y
On 12.07.22 01:01, Tom Lane wrote: > I wrote: > > Peter Eisentraut writes: > > > On 10.07.22 01:50, Tom Lane wrote: > > > > As committed, gen_node_support.pl excludes CallContext and InlineCodeBlock from getting unneeded support functions via some very ad-hoc code. > > > Couldn't we just enable those support functions? I think they were just excluded because they didn't have any before and nobody bothered to make any. > > Well, we could I suppose, but that path leads to a lot of dead code in backend/nodes/ --- obviously these two alone are negligible, but I want a story other than "it's a hack" for execnodes.h and the other files we exclude from generation of support code. > Here's a proposed patch for this bit. Again, whether these two node types have unnecessary support functions is not the point --- obviously we could afford to waste that much space. Rather, what I'm after is to have a more explainable and flexible way of dealing with the file-level exclusions applied to a lot of other node types. This patch doesn't make any change in the script's output now, but it gives us flexibility for the future. Yeah, looks reasonable.
Re: automatically generating node support functions
On 11.07.22 19:57, Tom Lane wrote: > So at this point I'm rather attracted to the idea of reverting to a manually-maintained NodeTag enum. We know how to avoid ABI breakage with that, and it's not exactly the most painful part of adding a new node type. One of the nicer features is that you now get to see the numbers assigned to the enum tags, like T_LockingClause = 91, T_XmlSerialize = 92, T_PartitionElem = 93, so that when you get an error like "unsupported node type: %d", you can just look up what it is.
Re: WIN32 pg_import_system_collations
Please find attached a rebased version. I have split the patch into two parts trying to make it easier to review, one with the code changes and the other with the test. Other than that, there are minimal changes from the previous version to the code due to the update of _WIN32_WINNT and enabling the test on cirrus. Regards, Juan José Santamaría Flecha > v6-0001-WIN32-pg_import_system_collations.patch Description: Binary data v6-0002-Add-collate.windows.win1252-test.patch Description: Binary data
Re: Remove trailing newlines from pg_upgrade's messages
Peter Eisentraut writes: > On 14.06.22 20:57, Tom Lane wrote: >> Hence, the patch below removes trailing newlines from all of >> pg_upgrade's message strings, and teaches its logging infrastructure >> to print them where appropriate. As in logging.c, there's now an >> Assert that no format string passed to pg_log() et al ends with >> a newline. > This patch looks okay to me. I compared the output before and after in > a few scenarios and didn't see any problematic differences. Thanks, pushed after rebasing and adjusting some recently-added messages. > In this particular patch, the few empty lines that disappeared don't > bother me. In general, however, I think we can just fprintf(stderr, > "\n") directly as necessary. Hmm, if anyone wants to do that I think it'd be advisable to invent "pg_log_blank_line()" or something like that, so as to preserve the logging abstraction layer. But it's moot unless anyone's interested enough to send a patch for that. I'm not. (I think it *would* be a good idea to try to get rid of the leading newlines that appear in some of the messages, as discussed upthread. But I'm not going to trouble over that right now either.) regards, tom lane
Re: Remove trailing newlines from pg_upgrade's messages
Kyotaro Horiguchi writes: > FWIW, the following change makes sense to me according to the spec of > validate_exec()... > diff --git a/src/bin/pg_upgrade/exec.c b/src/bin/pg_upgrade/exec.c > index fadeea12ca..3cff186213 100644 > --- a/src/bin/pg_upgrade/exec.c > +++ b/src/bin/pg_upgrade/exec.c > @@ -430,10 +430,10 @@ check_exec(const char *dir, const char *program, bool > check_version) > ret = validate_exec(path); > if (ret == -1) > - pg_fatal("check for \"%s\" failed: not a regular file\n", > + pg_fatal("check for \"%s\" failed: does not exist or > inexecutable\n", >path); > else if (ret == -2) > - pg_fatal("check for \"%s\" failed: cannot execute (permission > denied)\n", > + pg_fatal("check for \"%s\" failed: cannot read (permission > denied)\n", >path); > snprintf(cmd, sizeof(cmd), "\"%s\" -V", path); I initially did this, but then I wondered why validate_exec() was making it so hard: why can't we just report the failure with %m? It turns out to take only a couple extra lines of code to ensure that something more-or-less appropriate is returned, so we don't need to guess about it here. Pushed that way. regards, tom lane
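The direction Tom describes — letting the validity check leave errno set so the caller can simply report with %m instead of guessing at the cause — could look roughly like this. check_exec_file() is an invented stand-in for pg_upgrade's validate_exec(), not its actual code:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return 0 if 'path' is a regular, executable file; otherwise return
 * -1 with errno set to something appropriate for a %m-style report. */
int
check_exec_file(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;				/* errno already set, e.g. ENOENT */
	if (!S_ISREG(st.st_mode))
	{
		errno = EACCES;			/* pick a sensible errno ourselves */
		return -1;
	}
	if (access(path, X_OK) != 0)
		return -1;				/* errno set by access() */
	return 0;
}
```

The caller can then report failures uniformly, e.g. fprintf(stderr, "check for \"%s\" failed: %s\n", path, strerror(errno)), without needing to map return codes back onto guessed messages.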
Re: automatically generating node support functions
Peter Eisentraut writes: > On 11.07.22 19:57, Tom Lane wrote: >> So at this point I'm rather attracted to the idea of reverting to >> a manually-maintained NodeTag enum. We know how to avoid ABI >> breakage with that, and it's not exactly the most painful part >> of adding a new node type. > One of the nicer features is that you now get to see the numbers > assigned to the enum tags, like > T_LockingClause = 91, > T_XmlSerialize = 92, > T_PartitionElem = 93, > so that when you get an error like "unsupported node type: %d", you can > just look up what it is. Yeah, I wasn't thrilled about reverting that either. I think the defenses I installed in eea9fa9b2 should be sufficient to deal with the risk. regards, tom lane
Re: should check interrupts in BuildRelationExtStatistics ?
I wrote: > Thomas Munro writes: >> On Wed, Jul 6, 2022 at 11:37 AM Tom Lane wrote: >>> qsort_interruptible >> +1 > So here's a patch that does it that way. Hearing no comments, pushed. regards, tom lane
Re: making relfilenodes 56 bits
On Tue, Jul 12, 2022 at 1:09 PM Andres Freund wrote: > What does currently happen if we exceed that? elog > > diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h > > index b578e2ec75..5d3775ccde 100644 > > --- a/src/include/utils/wait_event.h > > +++ b/src/include/utils/wait_event.h > > @@ -193,7 +193,7 @@ typedef enum > > WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE, > > WAIT_EVENT_LOGICAL_REWRITE_WRITE, > > WAIT_EVENT_RELATION_MAP_READ, > > - WAIT_EVENT_RELATION_MAP_SYNC, > > + WAIT_EVENT_RELATION_MAP_RENAME, > > Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME, > since it includes fsync etc? Sure, I had it that way for a while and changed it at the last minute. I can change it back. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg15b2: large objects lost on upgrade
On Mon, Jul 11, 2022 at 9:16 AM Robert Haas wrote: > I am not saying we shouldn't try to fix this up more thoroughly, just > that I think you are overestimating the consequences. I spent a bunch of time looking at this today and I have more sympathy for Justin's previous proposal now. I found it somewhat hacky that he was relying on the hard-coded value of LargeObjectRelationId and LargeObjectLOidPNIndexId, but I discovered that it's harder to do better than I had assumed. Suppose we don't want to compare against a hard-coded constant but against the value that is actually present before the dump overwrites the pg_class row's relfilenode. Well, we can't get that value from the database in question before restoring the dump, because restoring the dump either creates or recreates the database in all cases. The CREATE DATABASE command that will be part of the dump always specifies TEMPLATE template0, so if we want something other than a hard-coded constant, we need the pg_class.relfilenode values from template0 for pg_largeobject and pg_largeobject_loid_pn_index. But we can't connect to that database to query those values, because it has datallowconn = false. Oops. I have a few more ideas to try here. It occurs to me that we could fix this more cleanly if we could get the dump itself to set the relfilenode for pg_largeobject to the desired value. Right now, it's just overwriting the relfilenode stored in the catalog without actually doing anything that would cause a change on disk. But if we could make it change the relfilenode in a more principled way that would actually cause an on-disk change, then the orphaned-file problem would be fixed, because we'd always be installing the new file over top of the old file. I'm going to investigate how hard it would be to make that work. -- Robert Haas EDB: http://www.enterprisedb.com
Re: should check interrupts in BuildRelationExtStatistics ?
On Tue, Jul 12, 2022 at 1:31 PM Tom Lane wrote: > I wrote: > > Thomas Munro writes: > >> On Wed, Jul 6, 2022 at 11:37 AM Tom Lane wrote: > >>> qsort_interruptible > >> +1 > > So here's a patch that does it that way. > Hearing no comments, pushed. > regards, tom lane Hi, Looking at the files under src/backend/utils/sort/, it looks like the license header is missing from qsort_interruptible.c. Please consider the attached patch, which adds a license header to qsort_interruptible.c. Cheers qsort-interruptible-copyright.patch Description: Binary data
Re: making relfilenodes 56 bits
Re: staticAssertStmt(MAX_FORKNUM <= INT8_MAX); Have you really thought through making the ForkNum 8-bit? For example, this would limit a columnar storage with each column stored in its own fork (which I'd say is not entirely unreasonable) to just about ~120 columns (INT8_MAX is 127, less the forks already in use). And there can easily be other use cases where we do not want to limit the number of forks so much. Cheers Hannu On Tue, Jul 12, 2022 at 10:36 PM Robert Haas wrote: > > On Tue, Jul 12, 2022 at 1:09 PM Andres Freund wrote: > > What does currently happen if we exceed that? > > elog > > > > diff --git a/src/include/utils/wait_event.h > > > b/src/include/utils/wait_event.h > > > index b578e2ec75..5d3775ccde 100644 > > > --- a/src/include/utils/wait_event.h > > > +++ b/src/include/utils/wait_event.h > > > @@ -193,7 +193,7 @@ typedef enum > > > WAIT_EVENT_LOGICAL_REWRITE_TRUNCATE, > > > WAIT_EVENT_LOGICAL_REWRITE_WRITE, > > > WAIT_EVENT_RELATION_MAP_READ, > > > - WAIT_EVENT_RELATION_MAP_SYNC, > > > + WAIT_EVENT_RELATION_MAP_RENAME, > > > > Very minor nitpick: To me REPLACE would be a bit more accurate than RENAME, > > since it includes fsync etc? > > Sure, I had it that way for a while and changed it at the last minute. > I can change it back. > > -- > Robert Haas > EDB: http://www.enterprisedb.com > >
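For context, the packing that makes the 8-bit ForkNum attractive — fitting a 56-bit relfilenumber and the fork number into a single 64-bit value — can be sketched as follows. The field layout is assumed for illustration and is not necessarily the patch's actual representation:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packing: low 56 bits = relfilenumber, high 8 bits =
 * fork number.  Widening the fork field would shrink the number space
 * available for relfilenumbers, which is the trade-off above. */
typedef uint64_t RelFileNumber;	/* only the low 56 bits are used */

#define RELNUMBER_BITS	56
#define RELNUMBER_MASK	((UINT64_C(1) << RELNUMBER_BITS) - 1)

uint64_t
pack_relfile(RelFileNumber relnum, uint8_t forknum)
{
	return (relnum & RELNUMBER_MASK) |
		((uint64_t) forknum << RELNUMBER_BITS);
}

RelFileNumber
unpack_relnum(uint64_t v)
{
	return v & RELNUMBER_MASK;
}

uint8_t
unpack_fork(uint64_t v)
{
	return (uint8_t) (v >> RELNUMBER_BITS);
}
```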
Re: should check interrupts in BuildRelationExtStatistics ?
Zhihong Yu writes: > Looking at the files under src/backend/utils/sort/, looks like license > header is missing from qsort_interruptible.c [ shrug ... ] qsort.c and qsort_arg.c don't have that either. regards, tom lane
Re: System catalog documentation chapter
On Tue, Jul 12, 2022 at 08:56:01PM +0200, Peter Eisentraut wrote: > On 08.07.22 18:07, Bruce Momjian wrote: > > so I guess we can backpatch this with no issues. > > It inserts a new chapter, which would renumber all other chapters. That's a > pretty big change to backpatch. I'm against that. Okay, I can see the renumbering as being confusing so I will do PG 15 and head only. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: System catalog documentation chapter
On Tue, Jul 12, 2022 at 08:56:36PM +0200, Peter Eisentraut wrote: > On 08.07.22 21:32, Bruce Momjian wrote: > > On Fri, Jul 8, 2022 at 12:07:45PM -0400, Bruce Momjian wrote: > > > On Fri, Jul 8, 2022 at 11:49:47AM -0400, Tom Lane wrote: > > > > Bruce Momjian writes: > > > > > Agreed. I don't want to break links into the documentation in final > > > > > released versions, so head and PG15 seem wise. > > > > I would not expect this to change the doc URLs for any individual > > > > catalogs or views --- if it does, I won't be happy. > > > Good point --- I thought the chapter was in the URL, but I now see it is > > > just the section heading: > > > https://www.postgresql.org/docs/devel/view-pg-available-extensions.html > > > so I guess we can backpatch this with no issues. > > Attached is a patch to accomplish this. Its output can be seen here: > > https://momjian.us/tmp/pgsql/internals.html > views.sgml is a pretty generic name for a chapter that just contains system > views. Yes, I struggled with that. What made me choose "views" is that the current name was catalogs.sgml, not syscatalogs.sgml. Is it acceptable to use catalogs.sgml and sysviews.sgml? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Reducing Memory Consumption (aset and generation)
On Tue, Jul 12, 2022 at 02:35, David Rowley wrote: > On Mon, 11 Jul 2022 at 20:48, Matthias van de Meent > wrote: > > > 2) v1-002-generation-reduces-memory-consumption.patch > > > Reduces memory used by struct GenerationBlock, by minus 8 bits, > > > > That seems fairly straight-forward -- 8 bytes saved on each page isn't > > a lot, but it's something. > > I think 002 is likely the only patch here that has some merit. > However, it's hard to imagine any measurable performance gains from > it. I think the smallest generation block we have today is 8192 > bytes. Saving 8 bytes in that equates to a saving of 0.1% of memory. > For an 8MB page, it's 1024 times less than that. > > I imagine Ranier has been working on this due the performance > regression mentioned in [1]. I think it'll be much more worthwhile to > aim to reduce the memory chunk overheads rather than the block > overheads, as Ranier is doing here. I posted a patch in [2] which does > that. To make that work, I need to have the owning context in the > block. The 001 and 003 patch seems to remove those here. I saw the numbers at [2]; 17% is very impressive. Since you need the context in the block, 001 and 003 are more of a hindrance than a help. So, feel free to incorporate 002 into your patch if you wish. The best thing to do here is to close this and withdraw it from the commitfest. regards, Ranier Vilela
Re: System catalog documentation chapter
Bruce Momjian writes: > On Tue, Jul 12, 2022 at 08:56:36PM +0200, Peter Eisentraut wrote: >> views.sgml is a pretty generic name for a chapter that just contains system >> views. > Yes, I struggled with that. What made me choose "views" is that the > current name was catalogs.sgml, not syscatalogs.sgml. If is acceptable > to use catalogs.sgml and sysviews.sgml? "catalogs" isn't too confusable with user-defined objects, so I think that name is fine --- and anyway it has decades of history so changing it seems unwise. We seem to have been trending towards less-abbreviated .sgml file names over time, so personally I'd go for system-views.sgml. regards, tom lane
Re: Extending outfuncs support to utility statements
I wrote: > There might be enough node types that are raw-parse-tree-only, > but not involved in utility statements, to make it worth > continuing to suppress readfuncs support for them. But I kinda > doubt it. I'll try to get some numbers later today. Granting that we want write/read support for utility statements, it seems that what we can save by suppressing raw-parse-tree-only nodes is only about 10kB. That's clearly not worth troubling over in the grand scheme of things, so I suggest that we just open the floodgates as attached. regards, tom lane diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl index 7694e04d0a..96af17516a 100644 --- a/src/backend/nodes/gen_node_support.pl +++ b/src/backend/nodes/gen_node_support.pl @@ -166,23 +166,6 @@ push @scalar_types, qw(EquivalenceClass* EquivalenceMember*); # currently not required. push @scalar_types, qw(QualCost); -# XXX various things we are not publishing right now to stay level -# with the manual system -push @no_read_write, - qw(AccessPriv AlterTableCmd CreateOpClassItem FunctionParameter InferClause ObjectWithArgs OnConflictClause PartitionCmd RoleSpec VacuumRelation); -push @no_read, qw(A_ArrayExpr A_Indices A_Indirection AlterStatsStmt - CollateClause ColumnDef ColumnRef CreateForeignTableStmt CreateStatsStmt - CreateStmt FuncCall ImportForeignSchemaStmt IndexElem IndexStmt - JsonAggConstructor JsonArgument JsonArrayAgg JsonArrayConstructor - JsonArrayQueryConstructor JsonCommon JsonFuncExpr JsonKeyValue - JsonObjectAgg JsonObjectConstructor JsonOutput JsonParseExpr JsonScalarExpr - JsonSerializeExpr JsonTable JsonTableColumn JsonTablePlan LockingClause - MultiAssignRef PLAssignStmt ParamRef PartitionElem PartitionSpec - PlaceHolderVar PublicationObjSpec PublicationTable RangeFunction - RangeSubselect RangeTableFunc RangeTableFuncCol RangeTableSample RawStmt - ResTarget ReturnStmt SelectStmt SortBy StatsElem TableLikeClause - TriggerTransition TypeCast TypeName WindowDef 
WithClause XmlSerialize); - ## check that we have the expected number of files on the command line die "wrong number of input files, expected @all_input_files\n" @@ -795,14 +778,6 @@ foreach my $n (@node_types) next if elem $n, @nodetag_only; next if elem $n, @no_read_write; - # XXX For now, skip all "Stmt"s except that ones that were there before. - if ($n =~ /Stmt$/) - { - my @keep = - qw(AlterStatsStmt CreateForeignTableStmt CreateStatsStmt CreateStmt DeclareCursorStmt ImportForeignSchemaStmt IndexStmt NotifyStmt PlannedStmt PLAssignStmt RawStmt ReturnStmt SelectStmt SetOperationStmt); - next unless elem $n, @keep; - } - my $no_read = (elem $n, @no_read); # output format starts with upper case node type name @@ -827,13 +802,20 @@ _out${n}(StringInfo str, const $n *node) "; - print $rff " + if (!$no_read) + { + my $macro = + (@{ $node_type_info{$n}->{fields} } > 0) + ? 'READ_LOCALS' + : 'READ_LOCALS_NO_FIELDS'; + print $rff " static $n * _read${n}(void) { -\tREAD_LOCALS($n); +\t$macro($n); -" unless $no_read; +"; + } # print instructions for each field foreach my $f (@{ $node_type_info{$n}->{fields} }) @@ -883,6 +865,7 @@ _read${n}(void) print $rff "\tREAD_LOCATION_FIELD($f);\n" unless $no_read; } elsif ($t eq 'int' + || $t eq 'int16' || $t eq 'int32' || $t eq 'AttrNumber' || $t eq 'StrategyNumber') diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index 4d776e7b51..85ac0b5e21 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -415,79 +415,6 @@ _outExtensibleNode(StringInfo str, const ExtensibleNode *node) methods->nodeOut(str, node); } -static void -_outQuery(StringInfo str, const Query *node) -{ - WRITE_NODE_TYPE("QUERY"); - - WRITE_ENUM_FIELD(commandType, CmdType); - WRITE_ENUM_FIELD(querySource, QuerySource); - /* we intentionally do not print the queryId field */ - WRITE_BOOL_FIELD(canSetTag); - - /* - * Hack to work around missing outfuncs routines for a lot of the - * utility-statement node types. 
(The only one we actually *need* for - * rules support is NotifyStmt.) Someday we ought to support 'em all, but - * for the meantime do this to avoid getting lots of warnings when running - * with debug_print_parse on. - */ - if (node->utilityStmt) - { - switch (nodeTag(node->utilityStmt)) - { - case T_CreateStmt: - case T_IndexStmt: - case T_NotifyStmt: - case T_DeclareCursorStmt: -WRITE_NODE_FIELD(utilityStmt); -break; - default: -appendStringInfoString(str, " :utilityStmt ?"); -break; - } - } - else - appendStringInfoString(str, " :utilityStmt <>"); - - WRITE_INT_FIELD(resultRelation); - WRITE_BOOL_FIELD(hasAggs); - WRITE_BOOL_FIELD(hasWindowFuncs); - WRITE_BOOL_FIELD(hasTargetSRFs); - WRITE_BOOL_FIELD(hasSubLinks); - WRITE_BOOL_FIELD(hasDistinctOn); - WRITE_BOOL_FIELD(hasRecursive); - WRITE_BOOL_FIELD(hasModifyingCTE); - WRITE_BOO
Re: making relfilenodes 56 bits
Hi, Please don't top quote - as mentioned a couple times recently. On 2022-07-12 23:00:22 +0200, Hannu Krosing wrote: > Re: staticAssertStmt(MAX_FORKNUM <= INT8_MAX); > > Have you really thought through making the ForkNum 8-bit ? MAX_FORKNUM is way lower right now. And hardcoded. So this doesn't imply a new restriction. As we iterate over 0..MAX_FORKNUM in a bunch of places (with filesystem access each time), it's not feasible to make that number large. Greetings, Andres Freund
Re: making relfilenodes 56 bits
On Tue, Jul 12, 2022 at 6:02 PM Andres Freund wrote: > MAX_FORKNUM is way lower right now. And hardcoded. So this doesn't imply a new > restriction. As we iterate over 0..MAX_FORKNUM in a bunch of places (with > filesystem access each time), it's not feasible to make that number large. Yeah. TBH, what I'd really like to do is kill the entire fork system with fire and replace it with something more scalable, which would maybe permit the sort of thing Hannu suggests here. With the current system, forget it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Extending outfuncs support to utility statements
Peter Eisentraut writes: > This is also needed to be able to store utility statements in (unquoted) > SQL function bodies. I have some in-progress code for that that I need > to dust off. IIRC, there are still some nontrivial issues to work > through on the reading side. I don't have a problem with enabling the > outfuncs side in the meantime. BTW, I experimented with trying to enable WRITE_READ_PARSE_PLAN_TREES for utility statements, and found that the immediate problem is that Constraint and a couple of other node types lack read functions (they're the ones marked "custom_read_write, no_read" in parsenodes.h). They have out functions, so writing the inverses seems like it's just something nobody ever got around to. Perhaps there are deeper problems lurking behind that one, though. regards, tom lane
Some clean-up work in get_cheapest_group_keys_order()
I was rebasing a patch which requires me to make some changes in get_cheapest_group_keys_order(). I noticed a few things in there that I think we could do a little better on: * The code uses pfree() on a list and it should be using list_free() * There's a manually coded for loop over a list which seems to be done so we can skip the first n elements of the list. for_each_from() should be used for that. * I think list_truncate(list_copy(list), n) is a pretty bad way to copy the first n elements of a list, especially when n is likely to be 0 most of the time. I think we should just add a function called list_copy_head(). We already have list_copy_tail(). * We could reduce some of the branching in the while loop and just set cheapest_sort_cost to DBL_MAX to save having to check if we're doing the first loop. I think the first 3 are worth fixing in PG15 since all that code is new to that version. The 4th, I'm not so sure about. Does anyone else have any thoughts? David diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c index 9d8f4fd5c7..b969a52dd6 100644 --- a/src/backend/nodes/list.c +++ b/src/backend/nodes/list.c @@ -1584,6 +1584,27 @@ list_copy(const List *oldlist) return newlist; } +/* + * Return a shallow copy of the specified list containing only the first 'len' + * elements. If oldlist is shorter than 'len' then we copy the entire list. + */ +List * +list_copy_head(const List *oldlist, int len) +{ + List *newlist; + + len = Min(oldlist->length, len); + + if (len <= 0) + return NIL; + + newlist = new_list(oldlist->type, len); + memcpy(newlist->elements, oldlist->elements, len * sizeof(ListCell)); + + check_list_invariants(newlist); + return newlist; +} + /* + * Return a shallow copy of the specified list, without the first N elements. 
*/ diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c index 9775c4a722..849abbaaba 100644 --- a/src/backend/optimizer/path/pathkeys.c +++ b/src/backend/optimizer/path/pathkeys.c @@ -17,6 +17,8 @@ */ #include "postgres.h" +#include + #include "miscadmin.h" #include "access/stratnum.h" #include "catalog/pg_opfamily.h" @@ -591,7 +593,7 @@ get_cheapest_group_keys_order(PlannerInfo *root, double nrows, ListCell *cell; PathkeyMutatorState mstate; - double cheapest_sort_cost = -1.0; + double cheapest_sort_cost = DBL_MAX; int nFreeKeys; int nToPermute; @@ -620,23 +622,23 @@ get_cheapest_group_keys_order(PlannerInfo *root, double nrows, nToPermute = 4; if (nFreeKeys > nToPermute) { - int i; PathkeySortCost *costs = palloc(sizeof(PathkeySortCost) * nFreeKeys); + PathkeySortCost *cost = costs; - /* skip the pre-ordered pathkeys */ - cell = list_nth_cell(*group_pathkeys, n_preordered); - - /* estimate cost for sorting individual pathkeys */ - for (i = 0; cell != NULL; i++, (cell = lnext(*group_pathkeys, cell))) + /* +* Estimate cost for sorting individual pathkeys skipping the +* pre-ordered pathkeys. +*/ + for_each_from(cell, *group_pathkeys, n_preordered) { - List *to_cost = list_make1(lfirst(cell)); - - Assert(i < nFreeKeys); + PathKey*pathkey = (PathKey *) lfirst(cell); + List *to_cost = list_make1(pathkey); - costs[i].pathkey = lfirst(cell); - costs[i].cost = cost_sort_estimate(root, to_cost, 0, nrows); + cost->pathkey = pathkey; + cost->cost = cost_sort_estimate(root, to_cost, 0, nrows); + cost++; - pfree(to_cost); + list_free(to_cost); } /* sort the pathkeys by sort cost in ascending order */ @@ -646,9 +648,9 @@ get_cheapest_group_keys_order(PlannerInfo *root, double nrows, * Rebuild the list of pathkeys - first the preordered ones, then the * rest ordered by cost. 
*/ - new_group_pathkeys = list_truncate(list_copy(*group_pathkeys), n_preordered); + new_group_pathkeys = list_copy_head(*group_pathkeys, n_preordered); - for (i = 0; i < nFreeKeys; i++) + for (int i = 0; i < nFreeKeys; i++) new_group_pathkeys = lappend(new_group_pathkeys, costs[i].pathkey); pfree(costs); @@ -689,7 +691,7 @@ get_cheapest_group_keys_order(PlannerInfo *root, double nrows, cost = cost_sort_estimate(root, var_group_pathkeys, n_preordered, nrows); -
Re: Some clean-up work in get_cheapest_group_keys_order()
David Rowley writes: > * I think list_truncate(list_copy(list), n) is a pretty bad way to > copy the first n elements of a list, especially when n is likely to be > 0 most of the time. I think we should just add a function called > list_copy_head(). We already have list_copy_tail(). Agreed, but I think there are other instances of that idiom that should be cleaned up while you're at it. > I think the first 3 are worth fixing in PG15 since all that code is > new to that version. The 4th, I'm not so sure about. I'd say keeping v15 and v16 in sync here is worth something. regards, tom lane
Re: [PATCH] Log details for client certificate failures
On Sat, Jul 9, 2022 at 6:49 AM Graham Leggett wrote: > Please don’t invent another format, or try and truncate the data. This is a > huge headache when troubleshooting. I hear you, and I agree that correlating these things across machines is something we should be making easier. I'm just not convinced that the particular format you've proposed, with a new set of rules for quoting and escaping, needs to be part of this patch. (And I think there are good reasons to truncate unverified cert data, so there'd have to be clear benefits to offset the risk of opening it up.) Searching Google for "issuer rdnSequence" comes up with mostly false positives related to LDAP filtering and certificate dumps, and the true positives seem to be mail threads that you've participated in. Do many LDAP servers log certificate failures in this format by default? (For that matter, does httpd?) The discussion at the time you added this to httpd [1] seemed to be making the point that this was a niche format, suited mostly for interaction with LDAP filters -- and Kaspar additionally pointed out that it's not a canonical format, so all of our implementations would have to have an ad hoc agreement to choose exactly one encoding. If you're using randomized serial numbers, you should be able to grep for those by themselves and successfully match many different formats, no? To me, that seems good enough for a first patch, considering we don't currently log any of this information. --Jacob [1] https://lists.apache.org/thread/1665qc4mod7ppp58qk3bqc2l3wtl3lkn
Re: [PATCH] Log details for client certificate failures
On Mon, Jul 11, 2022 at 6:09 AM Peter Eisentraut wrote: > I squashed those two together. I also adjusted the error message a bit > more for project style. (We can put both lines into detail.) Oh, okay. Log parsers don't have any issues with that? > I had to read up on this "ex_data" API. Interesting. But I'm wondering > a bit about how the life cycle of these objects is managed. What > happens if the allocated error string is deallocated before its > containing object? Or vice versa? Yeah, I'm currently leaning heavily on the lack of any memory context switches here. And I end up leaking out a pointer to the stale stack of be_tls_open_server(), which is gross -- it works since there are no other clients, but that could probably come back to bite us. The ex_data API exposes optional callbacks for new/dup/free (I'm currently setting those to NULL), so we can run custom code whenever the SSL* is destroyed. If you'd rather the data have the same lifetime of the SSL* object, we can switch to malloc/strdup/free (or even OPENSSL_strdup() in later versions). But since we don't have any use for the ex_data outside of this function, maybe we should just clear it before we return, rather than carrying it around. > How do we ensure we don't > accidentally reuse the error string when the code runs again? (I guess > currently it can't?) The ex_data is associated with the SSL*, not the global SSL_CTX*, so that shouldn't be an issue. A new SSL* gets created at the start of be_tls_open_server(). > Maybe we should avoid this and just put the > errdetail itself into a static variable that we unset once the message > is printed? If you're worried about the lifetime of the palloc'd data being too short, does switching to a static variable help in that case? --Jacob
Re: pg15b2: large objects lost on upgrade
On Tue, Jul 12, 2022 at 04:51:44PM -0400, Robert Haas wrote: > I spent a bunch of time looking at this today and I have more sympathy > for Justin's previous proposal now. I found it somewhat hacky that he > was relying on the hard-coded value of LargeObjectRelationId and > LargeObjectLOidPNIndexId, but I discovered that it's harder to do > better than I had assumed. Suppose we don't want to compare against a > hard-coded constant but against the value that is actually present > before the dump overwrites the pg_class row's relfilenode. Well, we > can't get that value from the database in question before restoring > the dump, because restoring either the dump creates or recreates the > database in all cases. The CREATE DATABASE command that will be part > of the dump always specifies TEMPLATE template0, so if we want > something other than a hard-coded constant, we need the > pg_class.relfilenode values from template0 for pg_largeobject and > pg_largeobject_loid_pn_index. But we can't connect to that database to > query those values, because it has datallowconn = false. Oops. > > I have a few more ideas to try here. It occurs to me that we could fix > this more cleanly if we could get the dump itself to set the > relfilenode for pg_largeobject to the desired value. Right now, it's > just overwriting the relfilenode stored in the catalog without > actually doing anything that would cause a change on disk. But if we > could make it change the relfilenode in a more principled way that > would actually cause an on-disk change, then the orphaned-file problem > would be fixed, because we'd always be installing the new file over > top of the old file. I'm going to investigate how hard it would be to > make that work. Thanks for all the details here. This originally sounded like the new cluster was keeping around some orphaned relation files with the old LOs still stored in it. 
But as that's just the freshly initdb'd relfilenodes of pg_largeobject, that does not strike me as something absolutely critical to fix for v15 as orphaned relfilenodes are an existing problem. If we finish with a solution rather simple in design, I'd be fine to stick a fix in REL_15_STABLE, but playing with this stable branch more than necessary may be risky after beta2. At the end, I would be fine to drop the open item now that the main issue has been fixed. -- Michael signature.asc Description: PGP signature
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Tue, Jul 12, 2022 at 7:59 PM Amit Kapila wrote: > > On Tue, Jul 12, 2022 at 2:53 PM Masahiko Sawada wrote: > > > > On Tue, Jul 12, 2022 at 5:58 PM shiy.f...@fujitsu.com > > wrote: > > > > > > > > > It happened when executing the following code because it tried to free a > > > NULL > > > pointer (catchange_xip). > > > > > > /* be tidy */ > > > if (ondisk) > > > pfree(ondisk); > > > + if (catchange_xip) > > > + pfree(catchange_xip); > > > } > > > > > > It seems to be related to configure option. I could reproduce it when > > > using > > > `./configure --enable-debug`. > > > But I couldn't reproduce with `./configure --enable-debug CFLAGS="-Og > > > -ggdb"`. > > > > Hmm, I could not reproduce this problem even if I use ./configure > > --enable-debug. And it's weird that we checked if catchange_xip is not > > null but we did pfree for it: > > > > Yeah, this looks weird to me as well but one difference in running > tests could be the timing of WAL LOG for XLOG_RUNNING_XACTS. That may > change the timing of SnapBuildSerialize. The other thing we can try is > by checking the value of catchange_xcnt before pfree. Yeah, we can try that. While reading the code, I realized that we try to pfree both ondisk and catchange_xip also when we jumped to 'out:': out: ReorderBufferSetRestartPoint(builder->reorder, builder->last_serialized_snapshot); /* be tidy */ if (ondisk) pfree(ondisk); if (catchange_xip) pfree(catchange_xip); But we use both ondisk and catchange_xip only if we didn't jump to 'out:'. If this problem is related to compiler optimization with 'goto' statement, moving them before 'out:' might be worth trying. > > BTW, I think ReorderBufferGetCatalogChangesXacts should have an Assert > to ensure rb->catchange_ntxns and xcnt are equal. We can probably then > avoid having xcnt_p as an out parameter as the caller can use > rb->catchange_ntxns instead. > Agreed. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Re: Some clean-up work in get_cheapest_group_keys_order()
On Wed, 13 Jul 2022 at 11:02, Tom Lane wrote: > > David Rowley writes: > > * I think list_truncate(list_copy(list), n) is a pretty bad way to > > copy the first n elements of a list, especially when n is likely to be > > 0 most of the time. I think we should just add a function called > > list_copy_head(). We already have list_copy_tail(). > > Agreed, but I think there are other instances of that idiom that > should be cleaned up while you're at it. Agreed. I imagine we should just do the remaining cleanup in master only. Do you agree? David
Re: test_oat_hooks bug (was: Re: pgsql: Add copy/equal support for XID lists)
On Tue, Jul 12, 2022 at 05:20:59PM +0200, Alvaro Herrera wrote: > While looking for a place to host a test for XID lists support, I > noticed a mistake in test_oat_hooks, fixed as per the attached. Indeed. Good catch. -- Michael
Re: Some clean-up work in get_cheapest_group_keys_order()
David Rowley writes: > On Wed, 13 Jul 2022 at 11:02, Tom Lane wrote: >> Agreed, but I think there are other instances of that idiom that >> should be cleaned up while you're at it. > Agreed. I imagine we should just do the remaining cleanup in master > only. Do you agree? No objection. regards, tom lane
Re: DropRelFileLocatorBuffers
At Tue, 12 Jul 2022 10:30:20 -0400, Robert Haas wrote in > On Tue, Jul 12, 2022 at 12:07 AM Dilip Kumar wrote: > > I think the naming used in your patch looks better to me. So +1 for the > > change. > > Committed. Thank you, Robert and Dilip. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
At Tue, 12 Jul 2022 12:19:06 -0400, Melanie Plageman wrote in > > + > > &pgStatLocal.shmem->io_ops.stats[backend_type_get_idx(MyBackendType)]; > > > > backend_type_get_idx(x) is actually (x - 1) plus an assertion on the > > value range, and the only use-case is here. There's a reverse > > function, also used in only one place. > > > > + Datum backend_type_desc = > > + > > CStringGetTextDatum(GetBackendTypeDesc(idx_get_backend_type(i))); > > > > In this usage GetBackendTypeDesc() gracefully treats out-of-domain > > values but idx_get_backend_type keenly kills the process for the > > same. This is inconsistent. > > > > My humble opinion on this is that we don't define the two functions and > > replace the calls to them with (x +/- 1). In addition to that, I think > > we should not abort() on invalid backend types. In that sense, I > > wonder if we could use the B_INVALID'th element for this purpose. > > > > I think that GetBackendTypeDesc() should probably also error out for an > unknown value. > > I would be open to not using the helper functions. I thought it would be > less error-prone, but since it is limited to the code in > pgstat_io_ops.c, it is probably okay. Let me think a bit more. > > Could you explain more about what you mean about using B_INVALID > BackendType? I imagined using B_INVALID as a kind of "default" partition, which accepts all unknown backend types. We can just ignore those values, but then we lose the clue to a malfunction of the stats machinery. I thought of that backend type as the sentinel for malfunctions: we can emit logs instead. I feel that the stats machinery shouldn't stop the server if at all possible, and I think it is an overreaction to abort for invalid values that can be easily coped with. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hi, On 2022-07-13 11:00:07 +0900, Kyotaro Horiguchi wrote: > I imagined to use B_INVALID as a kind of "default" partition, which > accepts all unknown backend types. There shouldn't be any unknown backend types. Something has gone wrong if we get far without a backend type set. > We can just ignore that values but then we lose the clue for malfunction of > stats machinery. I thought that that backend-type as the sentinel for > malfunctions. Thus we can emit logs instead. > > I feel that the stats machinery shouldn't stop the server as possible, > or I think it is overreaction to abort for invalid values that can be > easily coped with. I strongly disagree. That just ends up with hard to find bugs. Greetings, Andres Freund
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
At Tue, 12 Jul 2022 19:18:22 -0700, Andres Freund wrote in > Hi, > > On 2022-07-13 11:00:07 +0900, Kyotaro Horiguchi wrote: > > I imagined to use B_INVALID as a kind of "default" partition, which > > accepts all unknown backend types. > > There shouldn't be any unknown backend types. Something has gone wrong if we > get far without a backend type set. > > > > We can just ignore that values but then we lose the clue for malfunction of > > stats machinery. I thought that that backend-type as the sentinel for > > malfunctions. Thus we can emit logs instead. > > > > I feel that the stats machinery shouldn't stop the server as possible, > > or I think it is overreaction to abort for invalid values that can be > > easily coped with. > > I strongly disagree. That just ends up with hard to find bugs. I was not sure about the policy on that since, as Melanie (and I) mentioned, GetBackendTypeDesc() is gracefully treating invalid values. Since both of you are agreeing on this point, I'm fine with Assert()ing assuming that GetBackendTypeDesc() (or other places backend-type is handled) is modified to behave the same way. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Some clean-up work in get_cheapest_group_keys_order()
On Wed, 13 Jul 2022 at 13:12, Tom Lane wrote: > > David Rowley writes: > > On Wed, 13 Jul 2022 at 11:02, Tom Lane wrote: > >> Agreed, but I think there are other instances of that idiom that > >> should be cleaned up while you're at it. > > > Agreed. I imagine we should just do the remaining cleanup in master > > only. Do you agree? > > No objection. I've now pushed the original patch to 15 and master and also pushed a cleanup commit to remove the list_truncate(list_copy instances from master only. Thanks for looking. David
Re: enable/disable broken for statement triggers on partitioned tables
Hi, On Fri, Jul 8, 2022 at 3:44 AM Dmitry Koval wrote: > I've looked through the code and everything looks good. > But there is one thing I doubt. > Patch changes result of test: > > > create function trig_nothing() returns trigger language plpgsql >as $$ begin return null; end $$; > create table parent (a int) partition by list (a); > create table child1 partition of parent for values in (1); > > create trigger tg after insert on parent >for each row execute procedure trig_nothing(); > select tgrelid::regclass, tgname, tgenabled from pg_trigger >where tgrelid in ('parent'::regclass, 'child1'::regclass) >order by tgrelid::regclass::text; > alter table only parent enable always trigger tg; -- no recursion > select tgrelid::regclass, tgname, tgenabled from pg_trigger >where tgrelid in ('parent'::regclass, 'child1'::regclass) >order by tgrelid::regclass::text; > alter table parent enable always trigger tg; -- recursion > select tgrelid::regclass, tgname, tgenabled from pg_trigger >where tgrelid in ('parent'::regclass, 'child1'::regclass) >order by tgrelid::regclass::text; > > drop table parent, child1; > drop function trig_nothing(); > > > Results of vanilla + patch: > > CREATE FUNCTION > CREATE TABLE > CREATE TABLE > CREATE TRIGGER > tgrelid | tgname | tgenabled > -++--- > child1 | tg | O > parent | tg | O > (2 rows) > > ALTER TABLE > tgrelid | tgname | tgenabled > -++--- > child1 | tg | O > parent | tg | A > (2 rows) > > ALTER TABLE > tgrelid | tgname | tgenabled > -++--- > child1 | tg | O > parent | tg | A > (2 rows) > > DROP TABLE > DROP FUNCTION > > > Results of vanilla: > > CREATE FUNCTION > CREATE TABLE > CREATE TABLE > CREATE TRIGGER > tgrelid | tgname | tgenabled > -++--- > child1 | tg | O > parent | tg | O > (2 rows) > > ALTER TABLE > tgrelid | tgname | tgenabled > -++--- > child1 | tg | O > parent | tg | A > (2 rows) > > ALTER TABLE > tgrelid | tgname | tgenabled > -++--- > child1 | tg | A > parent | tg | A > (2 rows) > > DROP TABLE > DROP FUNCTION > > The 
patch doesn't start recursion when the 'tgenabled' flag of the parent table is not changed. > Probably the vanilla result is more correct. Thanks for the review and this test case. I agree that the patch shouldn't have changed that behavior, so I've fixed the patch so that EnableDisableTrigger() recurses even if the parent trigger is unchanged. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com v2-0001-Fix-ENABLE-DISABLE-TRIGGER-to-handle-recursion-co.patch Description: Binary data
Re: Add --{no-,}bypassrls flags to createuser
On Thu, May 26, 2022 at 04:47:46PM +0900, Kyotaro Horiguchi wrote: > FWIW, the "fancy" here causes me to think about something likely to > cause syntax breakage of the query to be sent. > > createuser -a 'user"1' -a 'user"2' 'user"3' > createuser -v "2023-1-1'; DROP TABLE public.x; select '" hoge That would be mostly using spaces here, to make sure that quoting is correctly applied. > BUT, these should be prevented by the functions enumerated above. So, > I don't think we need them. Mostly. For example, the test for --valid-until can use a timestamp with spaces to validate the use of appendStringLiteralConn(). A second thing is that --member was checked, but not --admin, so I have renamed regress_user2 to "regress user2" to apply a maximum of coverage, and applied the patch. One thing that I found annoying is that this made the list of options of createuser much harder to follow. That's not something caused by this patch, as many options have accumulated across the years and there is a kind of pattern where the connection options were listed first, but I have cleaned that up while on it. A second area where this could be done is createdb, as it could be easily expanded if the backend query gains support for more stuff, but that can happen when it makes more sense. -- Michael
Re: "ERROR: latch already owned" on gharial
Thanks Robert. We are receiving the alerts from buildfarm-admins for anole and gharial not reporting. Who can help to stop these? Thanks On Wed, Jul 6, 2022 at 1:27 AM Robert Haas wrote: > On Sun, Jul 3, 2022 at 11:51 PM Thomas Munro > wrote: > > On Wed, Jun 1, 2022 at 12:55 AM Robert Haas > wrote: > > > OK, I have access to the box now. I guess I might as well leave the > > > crontab jobs enabled until the next time this happens, since Thomas > > > just took steps to improve the logging, but I do think these BF > > > members are overdue to be killed off, and would like to do that as > > > soon as it seems like a reasonable step to take. > > > > A couple of months later, there has been no repeat of that error. I'd > > happily forget about that and move on, if you want to decommission > > these. > > I have commented out the BF stuff in crontab on that machine. > > -- > Robert Haas > EDB: http://www.enterprisedb.com > > > -- Sandeep Thakkar
Re: POC: GROUP BY optimization
On Thu, 31 Mar 2022 at 12:19, Tomas Vondra wrote: > Pushed, after going through the patch once more, running check-world > under valgrind, and updating the commit message. I'm just in this general area of the code again today and wondered about the header comment for the preprocess_groupclause() function. It says: * In principle it might be interesting to consider other orderings of the * GROUP BY elements, which could match the sort ordering of other * possible plans (eg an indexscan) and thereby reduce cost. We don't * bother with that, though. Hashed grouping will frequently win anyway. I'd say this commit makes that paragraph mostly obsolete. It's only true now in the sense that we don't try orders that suit some index that would provide pre-sorted results for a GroupAggregate path. The comment leads me to believe that we don't do anything at all to find a better order, and that's not true now. David
Re: proposal: Allocate work_mem From Pool
>> Before I try to answer that, I need to know how the scheduler works. > As I understand the term used, there is no scheduler inside Postgres for user connections -- they're handled by the OS kernel. Then, I'm probably using the wrong term. Right now, I have max_worker_processes set to 16. What happens when query #17 wants some work done? What do you call the thing that handles that? What is its algorithm for allocating work to the processes? Or am I completely misunderstanding the role worker processes play in execution? Joseph Wagner
Re: making relfilenodes 56 bits
On Tue, Jul 12, 2022 at 7:21 PM Robert Haas wrote: > > In this version, I also removed the struct padding, changed the limit > on the number of entries to a nice round 64, and made some comment > updates. I considered trying to go further and actually make the file > variable-size, so that we never again need to worry about the limit on > the number of entries, but I don't actually think that's a good idea. > It would require substantially more changes to the code in this file, > and that means there's more risk of introducing bugs, and I don't see > that there's much value anyway, because if we ever do hit the current > limit, we can just raise the limit. > > If we were going to split up durable_rename(), the only intelligible > split I can see would be to have a second version of the function, or > a flag to the existing function, that caters to the situation where > the old file is already known to have been fsync()'d. The patch looks good except one minor comment + * corruption. Since the file might be more tha none standard-size disk + * sector in size, we cannot rely on overwrite-in-place. Instead, we generate typo "more tha none" -> "more than one" -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: remove_useless_groupby_columns is too enthusiastic
On Wed, 13 Jul 2022 at 05:31, Tom Lane wrote: > I tried the attached quick-hack patch that just prevents > remove_useless_groupby_columns from removing anything that > appears in ORDER BY. That successfully fixes the complained-of > case, and it doesn't change any existing regression test results. > I'm not sure whether there are any cases that it makes significantly > worse. In addition to this, could we also do a pre-verification step to see if the ORDER BY has anything in it that the GROUP BY does not? I don't think there's any harm in removing the GROUP BY item if the pre-check finds something extra in ORDER BY since preprocess_groupclause() is going to fail in that case anyway. David
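[Editorial note: the pre-verification step suggested here is essentially a subset check. A hedged sketch follows — a hypothetical standalone helper, not actual PostgreSQL code — operating on column numbers rather than SortGroupClause lists.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the proposed pre-check: report whether every ORDER BY
 * column already appears in the GROUP BY list.  If something extra is
 * found in ORDER BY, removing "useless" GROUP BY columns cannot make
 * the grouping match the ordering anyway.
 */
bool
order_by_subset_of_group_by(const int *orderby, size_t norder,
                            const int *groupby, size_t ngroup)
{
    for (size_t o = 0; o < norder; o++)
    {
        bool    found = false;

        for (size_t g = 0; g < ngroup; g++)
        {
            if (groupby[g] == orderby[o])
            {
                found = true;
                break;
            }
        }
        if (!found)
            return false;       /* ORDER BY has something extra */
    }
    return true;
}
```

The real implementation would compare sort/group clause expressions, but the containment logic is the same.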
Re: PG15 beta1 sort performance regression due to Generation context change
On Tue, 12 Jul 2022 at 17:15, David Rowley wrote: > So far only Robert has raised concerns with this regression for PG15 > (see [2]). Tom voted for leaving things as they are for PG15 in [3]. > John agrees, as quoted above. Does anyone else have any opinion? Let me handle this slightly differently. I've moved the open item for this into the "won't fix" section. If nobody shouts at me for that then I'll let that end the debate. Otherwise, we can consider the argument when it arrives. David
Re: Documentation about PL transforms
Hi I am doing a review of this patch and I have two comments 1. I am not sure if get_call_trftypes is a good name - the prefix get_call is used when some runtime data is processed. This function just returns reformatted data from the system catalogue. Maybe get_func_trftypes_list, or just replace function get_func_trftypes (now, the list is an array, so there should not be performance issues). For consistency, the new function should be used in plperl and plpython too. Probably this function is not widely used, so I don't see why we need to have two almost equal functions in code. 2. +It also contains the OID of the intended procedural language and whether +that procedural language is declared as TRUSTED, useful +if a single inline handler is supporting more than one procedural language. I am not sure if this sentence is in the correct place. Maybe it can be mentioned separately, so that handlers in general can be used by more than one procedural language. But maybe I don't understand this sentence. Regards Pavel
Re: Perform streaming logical transactions by background workers and parallel apply
Below are my review comments for the v16* patch set: v16-0001 1.0 There are places (comments, docs, errmsgs, etc) in the patch referring to "parallel mode". I think every one of those references should be found and renamed to "parallel streaming mode" or "streaming=parallel" or at the very least make sure that "streaming" is in the same sentence. IMO it's too vague just saying "parallel" without also saying the context is for the "streaming" parameter. I have commented on some of those examples below, but please search everything anyway (including the docs) to catch the ones I haven't explicitly mentioned. == 1.1 src/backend/commands/subscriptioncmds.c +defGetStreamingMode(DefElem *def) +{ + /* + * If no value given, assume "true" is meant. + */ Please fix this comment to be identical to this pushed patch [1] == 1.2 .../replication/logical/applybgworker.c - apply_bgworker_start + if (list_length(ApplyWorkersFreeList) > 0) + { + wstate = (ApplyBgworkerState *) llast(ApplyWorkersFreeList); + ApplyWorkersFreeList = list_delete_last(ApplyWorkersFreeList); + Assert(wstate->pstate->status == APPLY_BGWORKER_FINISHED); + } The Assert that the entries in the free-list are FINISHED seems like unnecessary checking. IIUC, code is already doing the Assert that entries are FINISHED before allowing them into the free-list in the first place. ~~~ 1.3 .../replication/logical/applybgworker.c - apply_bgworker_find + if (found) + { + char status = entry->wstate->pstate->status; + + /* If any workers (or the postmaster) have died, we have failed. */ + if (status == APPLY_BGWORKER_EXIT) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("background worker %u failed to apply transaction %u", + entry->wstate->pstate->n, + entry->wstate->pstate->stream_xid))); + + Assert(status == APPLY_BGWORKER_BUSY); + + return entry->wstate; + } Why not remove that Assert but change the condition to be: if (status != APPLY_BGWORKER_BUSY) ereport(...)
== 1.4 src/backend/replication/logical/proto.c - logicalrep_write_stream_abort @@ -1163,31 +1163,56 @@ logicalrep_read_stream_commit(StringInfo in, LogicalRepCommitData *commit_data) /* * Write STREAM ABORT to the output stream. Note that xid and subxid will be * same for the top-level transaction abort. + * + * If write_abort_lsn is true, send the abort_lsn and abort_time fields. + * Otherwise not. */ "Otherwise not." -> ", otherwise don't." ~~~ 1.5 src/backend/replication/logical/proto.c - logicalrep_read_stream_abort + * + * If read_abort_lsn is true, try to read the abort_lsn and abort_time fields. + * Otherwise not. */ void -logicalrep_read_stream_abort(StringInfo in, TransactionId *xid, - TransactionId *subxid) +logicalrep_read_stream_abort(StringInfo in, + LogicalRepStreamAbortData *abort_data, + bool read_abort_lsn) "Otherwise not." -> ", otherwise don't." == 1.6 src/backend/replication/logical/worker.c - file comment + * If streaming = parallel, We assign a new apply background worker (if + * available) as soon as the xact's first stream is received. The main apply "We" -> "we" ... or maybe better just remove it completely. ~~~ 1.7 src/backend/replication/logical/worker.c - apply_handle_stream_prepare + /* + * After sending the data to the apply background worker, wait for + * that worker to finish. This is necessary to maintain commit + * order which avoids failures due to transaction dependencies and + * deadlocks. + */ + apply_bgworker_send_data(wstate, s->len, s->data); + apply_bgworker_wait_for(wstate, APPLY_BGWORKER_FINISHED); + apply_bgworker_free(wstate); The comment should be changed how you had suggested [2], so that it will be formatted the same way as a couple of other similar comments. ~~~ 1.8 src/backend/replication/logical/worker.c - apply_handle_stream_abort + /* Check whether the publisher sends abort_lsn and abort_time. 
*/ + if (am_apply_bgworker()) + read_abort_lsn = MyParallelState->server_version >= 16; This is handling decisions about read/write of the protocol bytes. I feel like it will be better to be checking the server *protocol* version (not the server postgres version) to make this decision – e.g. this code should be using the new macro you introduced so it will end up looking much like how the pgoutput_stream_abort code is doing it. ~~~ 1.9 src/backend/replication/logical/worker.c - store_flush_position @@ -2636,6 +2999,10 @@ store_flush_position(XLogRecPtr remote_lsn) { FlushPosition *flushpos; + /* We only need to collect the LSN in main apply worker */ + if (am_apply_bgworker()) + return; + SUGGESTION /* Skip if not the main apply worker */ == 1.10 src/backend/replication/pgoutput/pgoutput.c @@ -1820,6 +1820,8 @@ pgoutput_stream_abort(struct LogicalDecodingContext *ctx, XLogRecPtr abort_lsn) { ReorderBufferTXN *toptxn; + bool write_abort_lsn = false; + PGOutputData *data = (PGOutputData *) ctx->output_plugin_private;
Re: PATCH: Add Table Access Method option to pgbench
On Thu, 30 Jun 2022 at 18:09, Michael Paquier wrote: > On Fri, Jul 01, 2022 at 10:06:49AM +0900, Michael Paquier wrote: > > And the conclusion back then is that one can already achieve this by > > using PGOPTIONS: > > PGOPTIONS='-c default_table_access_method=wuzza' pgbench [...] > > > > So there is no need to complicate more pgbench, particularly when it > > comes to partitioned tables where USING is not supported. Your patch > > touches this area of the client code to bypass the backend error. > > Actually, it could be a good thing to mention that directly in the > docs of pgbench. > I've attached a documentation patch that mentions and links to the PGOPTIONS documentation per your suggestion. I'll keep the other patch on the back burner, perhaps in the future there will be demand for a command line option as more TAMs are created. Thanks, -Michel > -- > Michael > diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 37fd80388c..e2d728e0c4 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -317,7 +317,7 @@ UPDATE pg_settings SET setting = reset_val WHERE name = 'configuration_parameter - + Parameter Interaction via the Shell diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml index 2acf55c2ac..f15825c293 100644 --- a/doc/src/sgml/ref/pgbench.sgml +++ b/doc/src/sgml/ref/pgbench.sgml @@ -1018,6 +1018,17 @@ pgbench options d always, auto and never. + + + The environment variable PGOPTIONS specifies database + configuration options that are passed to PostgreSQL via the command line + (See ). For example, a hypothetical + default Table Access Method for the tables that pgbench creates + called wuzza can be specified with: + +PGOPTIONS='-c default_table_access_method=wuzza' + +
Re: minor change for create_list_bounds()
On Thu, 30 Jun 2022 at 11:41, Nathan Bossart wrote: > > On Tue, Mar 08, 2022 at 11:05:10AM -0800, Zhihong Yu wrote: > > I was looking at commit db632fbca and noticed that, > > in create_list_bounds(), if index is added to boundinfo->interleaved_parts > > in the first if statement, there is no need to perform the second check > > involving call to partition_bound_accepts_nulls(). > > Given this change probably doesn't meaningfully impact performance or code > clarity, I'm personally -1 for this patch. Is there another motivation > that I am missing? While I agree that the gains from making this change are small (it just amounts to saving a call to bms_add_member() when we've already found the partition to be interleaved due to interleaved Datum values), I just disagree with not doing anything about it. My reasons are: 1. This code is new to PG15. We have the opportunity now to make a meaningful improvement and backpatch it. When PG15 is out, the bar is set significantly higher for fixing this type of thing due to having to consider the additional cost of backpatching conflicts with other future fixes in that area. 2. I think the code as I just pushed it is easier to understand than what was there before. 3. I'd like to encourage people to look at and critique our newly added code. Having a concern addressed seems like a good reward for the work. I've now pushed the patch along with some other minor adjustments in the area. Thanks for the report/patch. David
Re: Reducing the chunk header sizes on all memory context types
On Wed, 13 Jul 2022 at 05:44, Andres Freund wrote: > On 2022-07-12 20:22:57 +0300, Yura Sokolov wrote: > > I don't get, why "large chunk" needs additional fields for size and > > offset. > > Large allocation sizes are certainly rounded to page size. > > And allocations which don't fit 1GB we could easily round to 1MB. > > Then we could simply store `size>>20`. > > It will limit MaxAllocHugeSize to `(1<<(30+20))-1` - 1PB. Doubtfully we > > will deal with such huge allocations in near future. > > What would we gain by doing something like this? The storage density loss of > storing an exact size is smaller than what you propose here. I do agree that the 16-byte additional header size overhead for allocations >= 1GB is not really worth troubling too much over. However, if there was some way to make it so we always had an 8-byte header, it would simplify some of the code in places such as AllocSetFree(). For example, (ALLOC_BLOCKHDRSZ + hdrsize + chunksize) could be simplified at compile time if hdrsize was a known constant. I did consider that in all cases where the allocations are above allocChunkLimit that the chunk is put on a dedicated block and in fact, the blockoffset is always the same for those. I wondered if we could use the full 60 bits for the chunksize for those cases. The reason I didn't pursue that is because: #define MaxAllocHugeSize (SIZE_MAX / 2) That's 63-bits, so 60 isn't enough. Yeah, we likely could reduce that without upsetting anyone. It feels like it'll be a while before not being able to allocate a chunk of memory more than 1024 petabytes will be an issue, although, I do hope to grow old enough to one day come back here and laugh at that. David
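[Editorial note: the 8-byte-header idea under discussion amounts to packing the chunk size and the chunk's block offset into one 64-bit word. A hedged sketch follows; the field widths (30/30 with 4 flag bits) are illustrative only, not the layout that was committed.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative packing of an 8-byte chunk header: low 30 bits hold the
 * chunk size, the next 30 bits hold the chunk's offset from the start
 * of its block, and the top 4 bits are left for flags.
 */
#define HDR_SIZE_BITS   30
#define HDR_OFF_BITS    30
#define HDR_SIZE_MASK   ((UINT64_C(1) << HDR_SIZE_BITS) - 1)
#define HDR_OFF_MASK    ((UINT64_C(1) << HDR_OFF_BITS) - 1)

uint64_t
pack_chunk_header(uint64_t size, uint64_t blockoffset)
{
    /* both values must fit their fields */
    assert(size <= HDR_SIZE_MASK);
    assert(blockoffset <= HDR_OFF_MASK);
    return size | (blockoffset << HDR_SIZE_BITS);
}

uint64_t
header_chunk_size(uint64_t hdr)
{
    return hdr & HDR_SIZE_MASK;
}

uint64_t
header_block_offset(uint64_t hdr)
{
    return (hdr >> HDR_SIZE_BITS) & HDR_OFF_MASK;
}
```

The 30-bit size field is also why the >= 1GB (huge) allocations in the thread need separate handling: their exact size no longer fits the packed word.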
Re: Reducing the chunk header sizes on all memory context types
On Wed, 13 Jul 2022 at 05:42, Andres Freund wrote: > > There is at least one. It might be major; to reduce the AllocSet chunk > > header from 16 bytes down to 8 bytes I had to get rid of the freelist > > pointer that was reusing the "aset" field in the chunk header struct. > > This works now by storing that pointer in the actual palloc'd memory. > > This could lead to pretty hard-to-trace bugs if we have any code that > > accidentally writes to memory after pfree. > > Can't we use the same trick for allocations in the freelist as we do for the > header in a live allocation? I.e. split the 8 byte header into two and use > part of it to point to the next element in the list using the offset from the > start of the block, and part of it to indicate the size? That can't work as the next freelist item might be on some other block. David
Re: Reducing the chunk header sizes on all memory context types
Hi, On 2022-07-12 10:42:07 -0700, Andres Freund wrote: > On 2022-07-12 17:01:18 +1200, David Rowley wrote: > > There is at least one. It might be major; to reduce the AllocSet chunk > > header from 16 bytes down to 8 bytes I had to get rid of the freelist > > pointer that was reusing the "aset" field in the chunk header struct. > > This works now by storing that pointer in the actual palloc'd memory. > > This could lead to pretty hard-to-trace bugs if we have any code that > > accidentally writes to memory after pfree. > > Can't we use the same trick for allocations in the freelist as we do for the > header in a live allocation? I.e. split the 8 byte header into two and use > part of it to point to the next element in the list using the offset from the > start of the block, and part of it to indicate the size? So that doesn't work because the members in the freelist can be in different blocks and those can be further away from each other. Perhaps that could still be made to work somehow: To point to a block we don't actually need 64bit pointers, they're always at least of some certain size - assuming we can allocate them suitably aligned. And chunks are always 8 byte aligned. Unfortunately that doesn't quite get us far enough - assuming a 4kB minimum block size (larger than now, but potentially sensible as a common OS page size), we still only get to 2^12*8 = 32kB. It'd easily work if we made each context have an array of allocated non-huge blocks, so that the blocks can be addressed with a small index. The overhead of that could be reduced in the common case by embedding a small constant sized array in the Aset. That might actually be worth trying out. Greetings, Andres Freund
Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths
On Mon, Jul 11, 2022 at 02:26:46PM +0200, Matthias van de Meent wrote: > Thanks for reviewing. I think that v6 is over-engineered because there should be no need to add a check in xlogreader.c as long as the origin of the problem is blocked, no? And the origin here is when the record is assembled. At least this is the cleanest solution for HEAD, but not in the back-branches if we'd care about doing something with records already generated, and I am not sure that we need to care about other things than HEAD, TBH. So it seems to me that there is no need to create an XLogRecMaxLength which is close to a duplicate of DecodeXLogRecordRequiredSpace(). @@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included) { XLogRecData *rdt; - uint32 total_len = 0; + uint64 total_len = 0; This has no need to change. My suggestion from upthread was close to what you proposed, but I had in mind something simpler, as of: + /* + * Ensure that xlogreader.c can read the record. + */ + if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len)))) + elog(ERROR, "too much WAL data"); This would be the amount of data allocated by the WAL reader when it is possible to allocate an oversized record, related to the business of the circular buffer depending on if the read is blocking or not. Among the two problems to solve at hand, the parts where the APIs are changed and made more robust with unsigned types and where block data is not overflowed with its 16-bit limit are committable, so I'd like to do that first (still need to check its performance with some micro benchmark on XLogRegisterBufData()). The second part to block the creation of the assembled record is simpler, now DecodeXLogRecordRequiredSpace() would make the path a bit hotter, though we could inline it in the worst case? 
-- Michael From 2031f15fe90c94deb21d4e41b717ada9a8bdb520 Mon Sep 17 00:00:00 2001 From: Matthias van de Meent Date: Mon, 20 Jun 2022 10:21:22 +0200 Subject: [PATCH v7] Add protections in xlog record APIs against large numbers and overflows. Before this; it was possible for an extension to create malicious WAL records that were too large to replay; or that would overflow the xl_tot_len field, causing potential corruption in WAL record IO ops. Emitting invalid records was also possible through pg_logical_emit_message(), which allowed you to emit arbitrary amounts of data up to 2GB, much higher than the replay limit of 1GB. --- src/include/access/xloginsert.h | 4 ++-- src/backend/access/transam/xloginsert.c | 30 - 2 files changed, 27 insertions(+), 7 deletions(-) diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h index c04f77b173..aed4643d1c 100644 --- a/src/include/access/xloginsert.h +++ b/src/include/access/xloginsert.h @@ -43,12 +43,12 @@ extern void XLogBeginInsert(void); extern void XLogSetRecordFlags(uint8 flags); extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info); extern void XLogEnsureRecordSpace(int max_block_id, int ndatas); -extern void XLogRegisterData(char *data, int len); +extern void XLogRegisterData(char *data, uint32 len); extern void XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags); extern void XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator, ForkNumber forknum, BlockNumber blknum, char *page, uint8 flags); -extern void XLogRegisterBufData(uint8 block_id, char *data, int len); +extern void XLogRegisterBufData(uint8 block_id, char *data, uint32 len); extern void XLogResetInsertion(void); extern bool XLogCheckBufferNeedsBackup(Buffer buffer); diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index f3c29fa909..3788429b70 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -348,13 +348,13 @@ 
XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator, ForkNumber forknum, * XLogRecGetData(). */ void -XLogRegisterData(char *data, int len) +XLogRegisterData(char *data, uint32 len) { XLogRecData *rdata; Assert(begininsert_called); - if (num_rdatas >= max_rdatas) + if (unlikely(num_rdatas >= max_rdatas)) elog(ERROR, "too much WAL data"); rdata = &rdatas[num_rdatas++]; @@ -386,7 +386,7 @@ XLogRegisterData(char *data, int len) * limited) */ void -XLogRegisterBufData(uint8 block_id, char *data, int len) +XLogRegisterBufData(uint8 block_id, char *data, uint32 len) { registered_buffer *regbuf; XLogRecData *rdata; @@ -399,8 +399,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len) elog(ERROR, "no block with id %d registered with WAL insertion", block_id); - if (num_rdatas >= max_rdatas) + /* + * Check against max_rdatas and ensure we do not register more data per + * buffer than can be handled by the physical data format; i.e. that + * regbuf->rdata_
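[Editorial note: the overflow concern in this thread is that per-chunk lengths summed into a 32-bit total can wrap before any limit check fires. The widen-then-check pattern can be shown standalone; the names below are illustrative, not the patch itself, and MAX_RECORD_LEN merely stands in for the reader-side allocation limit (roughly 1GB in PostgreSQL, cf. MaxAllocSize).]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* stand-in for the reader-side allocation limit */
#define MAX_RECORD_LEN  (UINT64_C(1) << 30)

/*
 * Sum per-chunk lengths in a 64-bit accumulator so the sum cannot
 * wrap, and reject the record before narrowing the total back into a
 * 32-bit field (the analogue of xl_tot_len).
 */
bool
record_len_ok(const uint32_t *lens, size_t nlens, uint32_t *total_out)
{
    uint64_t    total = 0;

    for (size_t i = 0; i < nlens; i++)
        total += lens[i];       /* 64-bit sum: no wraparound for sane nlens */

    if (total > MAX_RECORD_LEN)
        return false;           /* would be unreplayable: reject at assembly */

    *total_out = (uint32_t) total;  /* now provably fits in 32 bits */
    return true;
}
```

The key point is that the check happens on the widened value, before the narrowing cast, which is where the corruption described in the commit message would otherwise originate.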
Re: Make name optional in CREATE STATISTICS
On Thu, 7 Jul 2022 at 11:58, Matthias van de Meent wrote: > > On Thu, 7 Jul 2022 at 12:55, Simon Riggs wrote: > > There are various other ways of doing this and, yes, we could refactor > > other parts of the grammar to make this work. There is a specific > > guideline about patch submission that says the best way to get a patch > > rejected is to include unnecessary changes. With that in mind, let's > > keep the patch simple and exactly aimed at the original purpose. > > > > I'll leave it for committers to decide whether other refactoring is wanted. > > Fair enough. > > > I have made the comment show that the name is optional, thank you. > > The updated comment implies that IF NOT EXISTS is allowed without a > defined name, which is false: > > > + *CREATE STATISTICS [IF NOT EXISTS] [stats_name] [(stat > > types)] > > A more correct version would be > > + *CREATE STATISTICS [ [IF NOT EXISTS] stats_name ] > [(stat types)] There you go -- Simon Riggshttp://www.EnterpriseDB.com/ create_stats_opt_name.v5.patch Description: Binary data
Re: [RFC] building postgres with meson -v9
On 06.07.22 15:21, Andres Freund wrote: - This patch is for unifying the list of languages in NLS, as previously discussed: https://commitfest.postgresql.org/38/3737/ There seems little downside to doing so, so ... This has been committed, so on the next rebase, the languages arguments can be removed from the i18n.gettext() calls.
Re: generic plans and "initial" pruning
Rebased over 964d01ae90c. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com v18-0002-Optimize-AcquireExecutorLocks-by-locking-only-un.patch Description: Binary data v18-0001-Move-PartitioPruneInfo-out-of-plan-nodes-into-Pl.patch Description: Binary data
Refactor to make use of a common function for GetSubscriptionRelations and GetSubscriptionNotReadyRelations.
Hi, Most of the code is common between GetSubscriptionRelations and GetSubscriptionNotReadyRelations. Added a parameter to GetSubscriptionRelations which could provide the same functionality as the existing GetSubscriptionRelations and GetSubscriptionNotReadyRelations. Attached patch has the changes for the same. Thoughts? Regards, Vignesh From 5fb15291abb489cb4469f43afa95106e421002c0 Mon Sep 17 00:00:00 2001 From: Vigneshwaran C Date: Wed, 13 Jul 2022 11:54:59 +0530 Subject: [PATCH v1] Refactor to make use of a common function for GetSubscriptionRelations and GetSubscriptionNotReadyRelations. Most of the code is common between GetSubscriptionRelations and GetSubscriptionNotReadyRelations. Added a parameter to GetSubscriptionRelations which could provide the same functionality for GetSubscriptionRelations and GetSubscriptionNotReadyRelations. --- src/backend/catalog/pg_subscription.c | 68 +++-- src/backend/commands/subscriptioncmds.c | 4 +- src/backend/replication/logical/tablesync.c | 2 +- src/include/catalog/pg_subscription_rel.h | 3 +- 4 files changed, 13 insertions(+), 64 deletions(-) diff --git a/src/backend/catalog/pg_subscription.c b/src/backend/catalog/pg_subscription.c index 8856ce3b50..5757e2b338 100644 --- a/src/backend/catalog/pg_subscription.c +++ b/src/backend/catalog/pg_subscription.c @@ -525,65 +525,14 @@ HasSubscriptionRelations(Oid subid) } /* - * Get all relations for subscription. + * Get the relations for subscription. * + * If all_rels is true, return all the relations for subscription. If false, + * return all the relations for subscription that are not in a ready state. * Returned list is palloc'ed in current memory context. 
*/ List * -GetSubscriptionRelations(Oid subid) -{ - List *res = NIL; - Relation rel; - HeapTuple tup; - ScanKeyData skey[1]; - SysScanDesc scan; - - rel = table_open(SubscriptionRelRelationId, AccessShareLock); - - ScanKeyInit(&skey[0], -Anum_pg_subscription_rel_srsubid, -BTEqualStrategyNumber, F_OIDEQ, -ObjectIdGetDatum(subid)); - - scan = systable_beginscan(rel, InvalidOid, false, - NULL, 1, skey); - - while (HeapTupleIsValid(tup = systable_getnext(scan))) - { - Form_pg_subscription_rel subrel; - SubscriptionRelState *relstate; - Datum d; - bool isnull; - - subrel = (Form_pg_subscription_rel) GETSTRUCT(tup); - - relstate = (SubscriptionRelState *) palloc(sizeof(SubscriptionRelState)); - relstate->relid = subrel->srrelid; - relstate->state = subrel->srsubstate; - d = SysCacheGetAttr(SUBSCRIPTIONRELMAP, tup, - Anum_pg_subscription_rel_srsublsn, &isnull); - if (isnull) - relstate->lsn = InvalidXLogRecPtr; - else - relstate->lsn = DatumGetLSN(d); - - res = lappend(res, relstate); - } - - /* Cleanup */ - systable_endscan(scan); - table_close(rel, AccessShareLock); - - return res; -} - -/* - * Get all relations for subscription that are not in a ready state. - * - * Returned list is palloc'ed in current memory context. 
- */ -List * -GetSubscriptionNotReadyRelations(Oid subid) +GetSubscriptionRelations(Oid subid, bool all_rels) { List *res = NIL; Relation rel; @@ -599,10 +548,11 @@ GetSubscriptionNotReadyRelations(Oid subid) BTEqualStrategyNumber, F_OIDEQ, ObjectIdGetDatum(subid)); - ScanKeyInit(&skey[nkeys++], -Anum_pg_subscription_rel_srsubstate, -BTEqualStrategyNumber, F_CHARNE, -CharGetDatum(SUBREL_STATE_READY)); + if (!all_rels) + ScanKeyInit(&skey[nkeys++], + Anum_pg_subscription_rel_srsubstate, + BTEqualStrategyNumber, F_CHARNE, + CharGetDatum(SUBREL_STATE_READY)); scan = systable_beginscan(rel, InvalidOid, false, NULL, nkeys, skey); diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index bdc1208724..f9563304af 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -785,7 +785,7 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data, pubrel_names = fetch_table_list(wrconn, sub->publications); /* Get local table list. */ - subrel_states = GetSubscriptionRelations(sub->oid); + subrel_states = GetSubscriptionRelations(sub->oid, true); /* * Build qsorted array of local table oids for faster lookup. This can @@ -1457,7 +1457,7 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel) * the apply and tablesync workers and they can't restart because of * exclusive lock on the subscription. */ - rstates = GetSubscriptionNotReadyRelations(subid); + rstates = GetSubscriptionRelations(subid, false); foreach(lc, rstates) { SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc); diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c index 670c6fcada..4b5b388f6c 100644 --- a/src/backend/replication/logical/tablesync.c +++ b/src/backend/replication/logical/tablesync.c @@ -1479,7 +1479,7 @@ FetchTableStates(bool *started_
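[Editorial note: the refactoring in this patch can be illustrated with a toy model in which the "scan" is just a filter over an array of per-relation state codes; the real patch instead adds an extra ScanKey when all_rels is false. The names mirror the patch but the logic is simplified.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SUBREL_STATE_READY  'r'     /* mirrors the catalog state code */

/*
 * One function with an all_rels flag replaces two near-duplicate
 * functions: all_rels == true returns every relation, all_rels ==
 * false skips relations already in the ready state.  Returns the
 * number of matching relations.
 */
size_t
get_subscription_relations(const char *states, size_t n, bool all_rels)
{
    size_t  count = 0;

    for (size_t i = 0; i < n; i++)
    {
        /* when filtering, skip relations already in the ready state */
        if (!all_rels && states[i] == SUBREL_STATE_READY)
            continue;
        count++;
    }
    return count;
}
```

The design trade-off is the usual one for merging near-duplicates: one branch on a flag versus two copies of the scan setup that must be kept in sync.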