Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On Tue, Aug 9, 2022 at 10:57 AM Thomas Munro wrote:
> OK, so there aren't many places in 0001 that error out where we
> previously did not.

Well, that's not true, I believe. The 0001 patch introduces get_dirent_type() with elevel ERROR, which means that lstat failures are now reported as ERRORs with ereport(ERROR, ...), as opposed to continuing as on HEAD:

-if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+if (get_dirent_type(path, mapping_de, false, ERROR) != PGFILETYPE_REG)
     continue;

-if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+if (get_dirent_type(path, snap_de, false, ERROR) != PGFILETYPE_REG)
 {

and so on. The main idea of using get_dirent_type() instead of lstat or stat is to benefit from avoiding system calls on platforms where the directory entry type is stored in the dirent structure, not to cause errors for lstat or stat failures where we previously didn't. I think 0001 must just replace lstat or stat with get_dirent_type(), using elevel ERROR where an lstat or stat failure was previously treated as an ERROR, and elevel LOG otherwise. I modified it as attached. Once this patch gets committed, we can continue the discussion as to whether to turn elog-LOG into elog-ERROR or vice versa.

I'm tempted to use get_dirent_type() in RemoveXlogFile(), but that requires passing struct dirent * from RemoveOldXlogFiles() (as attached 0002 for now). If done, this avoids one lstat() system call per WAL file. IMO it's okay to pass an extra function parameter to RemoveXlogFile() as it is a static function, and there can be many (thousands of) WAL files at times, so saving system calls (lstat() * number of WAL files) is definitely an advantage.

Thoughts?

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

v14-0001-Make-more-use-of-get_dirent_type.patch
Description: Binary data
v14-0002-Replace-lstat-with-get_dirent_type-in-xlog.c.patch
Description: Binary data
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Tue, Aug 9, 2022 at 9:46 PM Simon Riggs wrote:
> Those calls are unaffected, i.e. they both still work.
>
> Right now, we register all subxids in subtrans. But not all xids are
> subxids, so in fact, subtrans has many "holes" in it, where if you
> look up the parent for an xid it will just return
> InvalidTransactionId. There is a protection against that causing a
> problem because if you call TransactionIdDidCommit/Abort you can get a
> WARNING, or if you call SubTransGetTopmostTransaction() you can get an
> ERROR, but it is possible if you do a lookup for an inappropriate xid,
> i.e. if you call TransactionIdDidCommit() without first calling
> TransactionIdIsInProgress() as you are supposed to do.

IIUC, if SubTransGetParent() returns InvalidTransactionId then the SubTransGetTopmostTransaction() loop will break and return the previous xid. So if we pass any topxid to SubTransGetTopmostTransaction() it will return back the same xid, and that's fine, since next we are going to search in the snapshot->xip array. But suppose we call this function with a subxid that is present in the snapshot->subxip array: if we first call SubTransGetTopmostTransaction(), it will just return the same xid when no parent is set for it, and if we then search for that xid in the snapshot->xip array, we will get the wrong answer. So I still think some adjustment is required in XidInMVCCSnapshot() such that we first search the snapshot->subxip array. Am I still missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Using each rel as both outer and inner for JOIN_ANTI
On Tue, Aug 9, 2022 at 6:54 PM Alvaro Herrera wrote:
> I suppose this looks good as far as the plan goes, but the cost estimation
> might be a little bit too optimistic: it is reporting that the new plan
> costs 50% of the original, yet the execution time is only 5% lower.

Thanks for trying this patch. Yeah, the estimated cost doesn't match the execution time there. I tried the query locally and here is what I got (best of three):

Unpatched:

# explain analyze select * from foo left join bar on foo.a = bar.c where bar.c is null;
                                                  QUERY PLAN
---
 Hash Anti Join  (cost=154156.00..173691.19 rows=1 width=16) (actual time=1548.622..1548.624 rows=0 loops=1)
   Hash Cond: (foo.a = bar.c)
   ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8) (actual time=0.024..0.026 rows=10 loops=1)
   ->  Hash  (cost=72124.00..72124.00 rows=500 width=8) (actual time=1443.157..1443.158 rows=500 loops=1)
         Buckets: 262144  Batches: 64  Memory Usage: 5107kB
         ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=8) (actual time=0.045..481.059 rows=500 loops=1)
 Planning Time: 0.262 ms
 Execution Time: 1549.138 ms
(8 rows)

Patched:

# explain analyze select * from foo left join bar on foo.a = bar.c where bar.c is null;
                                                  QUERY PLAN
-
 Hash Right Anti Join  (cost=1.23..90875.33 rows=1 width=16) (actual time=985.773..985.775 rows=0 loops=1)
   Hash Cond: (bar.c = foo.a)
   ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=8) (actual time=0.095..438.333 rows=500 loops=1)
   ->  Hash  (cost=1.10..1.10 rows=10 width=8) (actual time=0.076..0.077 rows=10 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8) (actual time=0.060..0.064 rows=10 loops=1)
 Planning Time: 0.290 ms
 Execution Time: 985.830 ms
(8 rows)

Seems the cost matches the execution time better on my local box. The right-anti join plan has the same cost estimation as the right join plan in this case. So would you please help to test what the right join plan looks like in your env for the query below?

select * from foo left join bar on foo.a = bar.c;

Thanks
Richard
Re: [RFC] building postgres with meson - v10
Hi,

On 8/8/22 18:53, Andres Freund wrote:
> Bilal's version checked different directories for expected files, but I
> don't think that's necessary. Bilal, do you remember why you added that?

This was to avoid breaking the autoconf build: autoconf wasn't using expecteddir, so I checked different directories.

Greetings,
Nazir Bilal Yavuz
Re: createuser doesn't tell default settings for some options
> On 10 Aug 2022, at 08:12, Kyotaro Horiguchi wrote:
>
> (I suppose this is a pg15 issue)
>
> createuser --help shows the following help text.
>
>> --bypassrls           role can bypass row-level security (RLS) policy
>> --no-bypassrls        role cannot bypass row-level security (RLS) policy
>> --replication         role can initiate replication
>> --no-replication      role cannot initiate replication
>
> For other options the text tells which one is the default, which I
> think these two options should do as well.

Agreed. For --no-replication, the docs in createuser.sgml should be fixed to include a "This is the default." sentence like the others have as well.

> The interactive mode doesn't cover all options, but I'm not sure what
> we should do to the mode since I don't have a clear idea of how the
> mode is used. In the attached only --bypassrls is arbitrarily added.
> The remaining options omitted in the interactive mode are: password,
> valid-until, role, member and replication. (attached second)

I'm not convinced that we should add more to the interactive mode; it's IMO mostly a backwards-compat option for ancient (pre-9.2) createuser, where this was automatically done. Back then we had this in the documentation, which has since been removed: "You will be prompted for a name and other missing information if it is not specified on the command line."

> The ternary options are checked against decimal 0, but it should use
> TRI_DEFAULT instead. (attached third)

Agreed, nice catch.

--
Daniel Gustafsson
https://vmware.com/
Re: Fast COPY FROM based on batch insert
Hi,

On Wed, Aug 10, 2022 at 1:06 AM Zhihong Yu wrote:
> On Tue, Aug 9, 2022 at 4:45 AM Etsuro Fujita wrote:
>> * When running AFTER ROW triggers in CopyMultiInsertBufferFlush(), the
>> patch uses the slots passed to ExecForeignBatchInsert(), not the ones
>> returned by the callback function, but I don't think that that is
>> always correct, as the documentation about the callback function says:
>>
>>     The return value is an array of slots containing the data that was
>>     actually inserted (this might differ from the data supplied, for
>>     example as a result of trigger actions.)
>>     The passed-in slots can be re-used for this purpose.
>>
>> postgres_fdw re-uses the passed-in slots, but other FDWs might not, so
>> I modified the patch to reference the returned slots when running the
>> AFTER ROW triggers.

I noticed that my explanation was not correct. Let me explain. Before commit 82593b9a3, when batching into a view referencing a postgres_fdw foreign table that has WCO constraints, postgres_fdw used the passed-in slots to store the first tuple that was actually inserted into the remote table. But that commit disabled batching in that case, so postgres_fdw wouldn't use the passed-in slots (until we support batching when there are WCO constraints from the parent views and/or AFTER ROW triggers on the foreign table).

> +    /* If any rows were inserted, run AFTER ROW INSERT triggers. */
> ...
> +    for (i = 0; i < inserted; i++)
> +    {
> +        TupleTableSlot *slot = rslots[i];
> ...
> +        slot->tts_tableOid =
> +            RelationGetRelid(resultRelInfo->ri_RelationDesc);
>
> It seems the return value of
> `RelationGetRelid(resultRelInfo->ri_RelationDesc)` can be stored in a
> variable outside the for loop.
> Inside the for loop, assign this variable to slot->tts_tableOid.

Actually, I did this to match the code in ExecBatchInsert(), but that seems like a good idea, so I'll update the patch as such in the next version.

Thanks for reviewing! Best regards,
Etsuro Fujita
Re: Fix a typo in pgstatfuncs.c
On Wed, Aug 10, 2022 at 1:22 PM Drouvot, Bertrand wrote:
> Hi,
>
> It seems like there's the following typo in pgstatfuncs.c:
>
> -    /* Values only available to role member or pg_read_all_stats */
> +    /* Values only available to role member of pg_read_all_stats */
>
> Attaching a tiny patch to fix it.

I don't think it's a typo: the comment says that the values are only available to a user who has the privileges of the backend's role or of pg_read_all_stats — the macro HAS_PGSTAT_PERMISSIONS says it all. IMO, either of the following works better, if the existing comment is confusing:

/* Values only available to a member (or role or user) with privileges of the backend's role or pg_read_all_stats */
/* Values only available to a member (or role or user) that has membership in the backend's role or has privileges of pg_read_all_stats */

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Hello.

> Yes, that is correct.

Mmm. I believed that the log came from a single server run, since the PID (I believe the [359], [357] are PIDs) did not change through the log lines.

> 2022-08-05 18:50:13 UTC::@:[359]:LOG: creating missing WAL directory "pg_wal/archive_status"

This means that someone removed the contents of the pg_wal directory. Removing some WAL files in pg_wal may lead to this symptom.

> 2022-08-05 18:50:13 UTC::@:[359]:LOG: entering standby mode
> recovering 0002.history
> 2022-08-05 18:50:13 UTC::@:[359]:LOG: ### [S] @6/B8000198: abort=(0/0)0/0, miss=(0/0)0/0, SbyMode=0, SbyModeReq=1
> 2022-08-05 18:50:13 UTC::@:[359]:LOG: database system was not properly shut down; automatic recovery in progress
> 2022-08-05 18:50:13 UTC::@:[359]:LOG: redo starts at 6/B8E8
>
> And a few hours later, is when we see a panic

So, it seems that the *standby* received the inconsistent WAL stream (an aborted contrecord not followed by an overwriting-missing-contrecord record) from the primary. Thus the inconsistency happened on the primary, not on the standby. So... I'm still failing to draw the big picture of what is happening here. Could you show us the server configuration (the DB servers involved and their roles (primary/standby..)), and the exact steps when you restart the server after a crash?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Using each rel as both outer and inner for JOIN_ANTI
On 2022-Aug-10, Richard Guo wrote:
> The right-anti join plan has the same cost estimation with right join
> plan in this case. So would you please help to test what the right join
> plan looks like in your env for the query below?
>
> select * from foo left join bar on foo.a = bar.c;

You're right, it does.

55432 16devel 475322=# explain (analyze, buffers) select * from foo left join bar on foo.a = bar.c;
                                                  QUERY PLAN
──
 Hash Right Join  (cost=1.23..90875.24 rows=10 width=20) (actual time=456.410..456.415 rows=10 loops=1)
   Hash Cond: (bar.c = foo.a)
   Buffers: shared hit=15852 read=6273
   ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=12) (actual time=0.036..210.468 rows=500 loops=1)
         Buffers: shared hit=15852 read=6272
   ->  Hash  (cost=1.10..1.10 rows=10 width=8) (actual time=0.037..0.038 rows=10 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         Buffers: shared read=1
         ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8) (actual time=0.022..0.026 rows=10 loops=1)
               Buffers: shared read=1
 Planning:
   Buffers: shared hit=92 read=13
 Planning Time: 1.077 ms
 Execution Time: 456.458 ms
(14 filas)

55432 16devel 475322=# explain (analyze, buffers) select * from foo left join bar on foo.a = bar.c where bar.c is null;
                                                  QUERY PLAN
──
 Hash Right Anti Join  (cost=1.23..90875.24 rows=10 width=20) (actual time=451.747..451.751 rows=10 loops=1)
   Hash Cond: (bar.c = foo.a)
   Buffers: shared hit=15646 read=6479
   ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=12) (actual time=0.048..204.940 rows=500 loops=1)
         Buffers: shared hit=15645 read=6479
   ->  Hash  (cost=1.10..1.10 rows=10 width=8) (actual time=0.030..0.031 rows=10 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         Buffers: shared hit=1
         ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8) (actual time=0.017..0.020 rows=10 loops=1)
               Buffers: shared hit=1
 Planning Time: 0.227 ms
 Execution Time: 451.793 ms

--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Every machine is a smoke machine if you operate it wrong enough."
https://twitter.com/libseybieda/status/1541673325781196801
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On Wed, Aug 10, 2022 at 7:15 PM Bharath Rupireddy wrote:
> The main idea of using get_dirent_type() instead of lstat or stat is
> to benefit from avoiding system calls on platforms where the directory
> entry type is stored in dirent structure, but not to cause errors for
> lstat or stat system calls failures where we previously didn't.

Yeah, I get it... I'm OK with separating mechanical changes like lstat() -> get_dirent_type() from policy changes on error handling. I initially thought the ERROR was surely better than what we're doing now (making extra system calls to avoid unlinking unexpected things, for which our only strategy on failure is to try to unlink the thing anyway), but it's better to discuss that separately.

> I think 0001 must be just replacing lstat or stat with get_dirent_type()
> with elevel ERROR if lstat or stat failure is treated as ERROR
> previously, otherwise with elevel LOG. I modified it as attached. Once
> this patch gets committed, then we can continue the discussion as to
> whether we make elog-LOG to elog-ERROR or vice-versa.

Agreed, will look at this version tomorrow.

> I'm tempted to use get_dirent_type() in RemoveXlogFile() but that
> requires us to pass struct dirent * from RemoveOldXlogFiles() (as
> attached 0002 for now), If done, this avoids one lstat() system call
> per WAL file. IMO, that's okay to pass an extra function parameter to
> RemoveXlogFile() as it is a static function and there can be many
> (thousands of) WAL files at times, so saving system calls (lstat() *
> number of WAL files) is definitely an advantage.

-lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
+get_dirent_type(path, xlde, false, LOG) != PGFILETYPE_REG &&

I think you wanted ==, but maybe it'd be nicer to pass in a "recyclable" boolean to RemoveXlogFile()?

If you're looking for more, rmtree() looks like another.
Small typo in OAT README
Reading over the new object access hook test I spotted a small typo in the documentation. Will apply a fix shortly.

-A real-world OAT hook should certainly provide more fine-grained conrol than
+A real-world OAT hook should certainly provide more fine-grained control than

--
Daniel Gustafsson
https://vmware.com/

OAT_hooks_README_typo.diff
Description: Binary data
Re: pg_get_constraintdef: Schema qualify foreign tables unless pretty printing is enabled
On 2022-Aug-09, Lukas Fittl wrote:
> But I wonder, why do we have an explicit pretty printing flag on these
> functions, and PRETTYFLAG_SCHEMA in the code to represent this behavior.
> If we don't want pretty printing to affect schema qualification, why
> does that flag exist?

Because of CVE-2018-1058. See commit 815172ba8068. I imagine that that commit only touched the minimum necessary to solve the immediate security problem, but that further work is needed to make PRETTYFLAG_SCHEMA become a fully functional gadget; that would require that the whole of ruleutils.c (and everything downstream from it) behaves sanely. In other words, I think your patch is too small.

--
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock
On Wed, Aug 10, 2022 at 10:58 AM Andres Freund wrote:
> Hi,
>
> On 2022-08-09 20:21:19 -0700, Mark Dilger wrote:
>> > On Aug 9, 2022, at 7:26 PM, Andres Freund wrote:
>> >
>> > The relevant code triggering it:
>> >
>> >     newbuf = XLogInitBufferForRedo(record, 1);
>> >     _hash_initbuf(newbuf, xlrec->new_bucket, xlrec->new_bucket,
>> >                   xlrec->new_bucket_flag, true);
>> >     if (!IsBufferCleanupOK(newbuf))
>> >         elog(PANIC, "hash_xlog_split_allocate_page: failed to acquire cleanup lock");
>> >
>> > Why do we just crash if we don't already have a cleanup lock? That can't be
>> > right. Or is there supposed to be a guarantee this can't happen?
>>
>> Perhaps the code assumes that when the xl_hash_split_allocate_page record was
>> written, the new_bucket field referred to an unused page, and so during
>> replay it should also refer to an unused page, and being unused, that nobody
>> will have it pinned. But at least in heap we sometimes pin unused pages
>> just long enough to examine them and to see that they are unused. Maybe
>> something like that is happening here?
>
> I don't think it's a safe assumption that nobody would hold a pin on such a
> page during recovery. While not the case here, somebody else could have used
> pg_prewarm to read it in.
>
> But also, the checkpointer or bgwriter could have it temporarily pinned, to
> write it out, or another backend could try to write it out as a victim buffer
> and have it temporarily pinned.
>
> static int
> SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
> {
> ...
>     /*
>      * Pin it, share-lock it, write it. (FlushBuffer will do nothing if the
>      * buffer is clean by the time we've locked it.)
>      */
>     PinBuffer_Locked(bufHdr);
>     LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
>
> As you can see we acquire a pin without holding a lock on the page (and that
> can't be changed!).

I think this could be the probable reason for the failure, though I didn't try to debug/reproduce it yet. AFAIU, this is possible during recovery/replay of the WAL record XLOG_HASH_SPLIT_ALLOCATE_PAGE, as via XLogReadBufferForRedoExtended we can mark the buffer dirty while restoring it from a full-page image. OTOH, because during normal operation we didn't mark the page dirty, SyncOneBuffer would have skipped it due to the check (if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))).

> I assume this is trying to defend against some sort of deadlock by not
> actually getting a cleanup lock (by passing get_cleanup_lock = true to
> XLogReadBufferForRedoExtended()).

IIRC, this is just following what we do during normal operation, based on the theory that the meta-page is not updated yet so no backend will access it. I think we can do what you wrote unless there is some other reason behind this failure.

--
With Regards,
Amit Kapila.
Re: Perform streaming logical transactions by background workers and parallel apply
Here are some review comments for the patch v20-0001:

==

1. doc/src/sgml/catalogs.sgml

+          p = apply changes directly using a background
+          worker, if available, otherwise, it behaves the same as 't'

The different char values 'f','t','p' are separated by comma (,) in the list, which is normal for the pgdocs AFAIK. However, because of this I don't think it is a good idea to use those other commas within the description for 'p'. I suggest you remove them to avoid ambiguity with the separators.

==

2. doc/src/sgml/protocol.sgml

@@ -3096,7 +3096,7 @@ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
        Protocol version. Currently versions 1, 2,
-       and 3 are supported.
+       3 and 4 are supported.

Put a comma after the penultimate value like it had before.

==

3. src/backend/replication/logical/applybgworker.c

There are multiple function comments and other code comments in this file that are missing a terminating period (.)

==

4. src/backend/replication/logical/applybgworker.c - apply_bgworker_start

+/*
+ * Try to get a free apply background worker.
+ *
+ * If there is at least one worker in the free list, then take one. Otherwise,
+ * try to start a new apply background worker. If successful, cache it in
+ * ApplyBgworkersHash keyed by the specified xid.
+ */
+ApplyBgworkerState *
+apply_bgworker_start(TransactionId xid)

SUGGESTION (for function comment)
Return the apply background worker that will be used for the specified xid. If an apply background worker is found in the free list then re-use it, otherwise start a fresh one. Cache the worker in ApplyBgworkersHash keyed by the specified xid.

~~~

5.

+    /* Try to get a free apply background worker */
+    if (list_length(ApplyBgworkersFreeList) > 0)

AFAIK a non-empty list is guaranteed to be not NIL, and an empty list is guaranteed to be NIL. So if you want, the above can simply be written as:

if (ApplyBgworkersFreeList)

~~~

6. src/backend/replication/logical/applybgworker.c - apply_bgworker_find

+/*
+ * Try to look up worker assigned before (see function apply_bgworker_get_free)
+ * inside ApplyBgworkersHash for requested xid.
+ */
+ApplyBgworkerState *
+apply_bgworker_find(TransactionId xid)

SUGGESTION (for function comment)
Find the worker previously assigned/cached for this xid. (see function apply_bgworker_start)

~~~

7.

+        Assert(status == APPLY_BGWORKER_BUSY);
+
+        return entry->wstate;
+    }
+    else
+        return NULL;

IMO here it is better to just remove that 'else' and unconditionally return NULL at the end of this function.

~~~

8. src/backend/replication/logical/applybgworker.c - apply_bgworker_subxact_info_add

+ * Inside apply background worker we can figure out that new subtransaction was
+ * started if new change arrived with different xid. In that case we can define
+ * named savepoint, so that we were able to commit/rollback it separately
+ * later.
+ * Special case is if the first change comes from subtransaction, then
+ * we check that current_xid differs from stream_xid.
+ */
+void
+apply_bgworker_subxact_info_add(TransactionId current_xid)

It is not quite English. Can you improve it a bit?

SUGGESTION (maybe like this?)
The apply background worker can figure out if a new subtransaction was started by checking if the new change arrived with a different xid. In that case define a named savepoint, so that we are able to commit/rollback it separately later. A special case is when the first change comes from a subtransaction – this is determined by checking if the current_xid differs from stream_xid.

==

9. src/backend/replication/logical/launcher.c - WaitForReplicationWorkerAttach

+ *
+ * Return false if the attach fails. Otherwise return true.
  */
-static void
+static bool
 WaitForReplicationWorkerAttach(LogicalRepWorker *worker,

Why not just say "Return whether the attach was successful."

~~~

10. src/backend/replication/logical/launcher.c - logicalrep_worker_stop

+    /* Found the main worker, then try to stop it. */
+    if (worker)
+        logicalrep_worker_stop_internal(worker);

IMO the comment is kind of pointless because it only says what the code is clearly doing. If you really wanted to reinforce that this worker is a main apply worker then you can do that with code like:

if (worker)
{
    Assert(!worker->subworker);
    logicalrep_worker_stop_internal(worker);
}

~~~

11. src/backend/replication/logical/launcher.c - logicalrep_worker_detach

@@ -599,6 +632,29 @@ logicalrep_worker_attach(int slot)
 static void
 logicalrep_worker_detach(void)
 {
+    /*
+     * This is the main apply worker, stop all the apply background workers we
+     * started before.
+     */
+    if (!MyLogicalRepWorker->subworker)

SUGGESTION (for comment)
This is the main apply worker. Stop all apply background workers previously started from here.

~~~

12. src/backend/replication/logical/launcher.c - logicalrep_apply_bgworker_count

+/*
+ * Count the
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On Wed, Aug 10, 2022 at 2:11 PM Thomas Munro wrote:
> On Wed, Aug 10, 2022 at 7:15 PM Bharath Rupireddy wrote:
>> The main idea of using get_dirent_type() instead of lstat or stat is
>> to benefit from avoiding system calls on platforms where the directory
>> entry type is stored in dirent structure, but not to cause errors for
>> lstat or stat system calls failures where we previously didn't.
>
> Yeah, I get it... I'm OK with separating mechanical changes like
> lstat() -> get_dirent_type() from policy changes on error handling. I
> initially thought the ERROR was surely better than what we're doing
> now (making extra system calls to avoid unlinking unexpected things,
> for which our only strategy on failure is to try to unlink the thing
> anyway), but it's better to discuss that separately.
>
>> I think 0001 must be just replacing lstat or stat with get_dirent_type()
>> with elevel ERROR if lstat or stat failure is treated as ERROR
>> previously, otherwise with elevel LOG. I modified it as attached. Once
>> this patch gets committed, then we can continue the discussion as to
>> whether we make elog-LOG to elog-ERROR or vice-versa.
>
> Agreed, will look at this version tomorrow.

Thanks.

>> I'm tempted to use get_dirent_type() in RemoveXlogFile() but that
>> requires us to pass struct dirent * from RemoveOldXlogFiles() (as
>> attached 0002 for now), If done, this avoids one lstat() system call
>> per WAL file. IMO, that's okay to pass an extra function parameter to
>> RemoveXlogFile() as it is a static function and there can be many
>> (thousands of) WAL files at times, so saving system calls (lstat() *
>> number of WAL files) is definitely an advantage.
>
> -lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
> +get_dirent_type(path, xlde, false, LOG) != PGFILETYPE_REG &&
>
> I think you wanted ==, but maybe it'd be nicer to pass in a
> "recyclable" boolean to RemoveXlogFile()?

Ah, my bad, I corrected it now. If you mean to do "bool recyclable = (get_dirent_type(path, xlde, false, LOG) == PGFILETYPE_REG)" and pass it as a parameter to RemoveXlogFile(), then we would have to call get_dirent_type() for every WAL file even when any of (wal_recycle && *endlogSegNo <= recycleSegNo && XLogCtl->InstallXLogFileSegmentActive) is false, which is unnecessary; it's better to pass the dirent structure as a parameter.

> If you're looking for more, rmtree() looks like another.

Are you suggesting not to use pgfnames() in rmtree() and to do opendir()-readdir()-get_dirent_type() directly? If yes, I don't think it's a great idea, because rmtree() has recursive calls, and holding opendir()-readdir() open across all rmtree() recursive calls without closedir() may eat up dynamic memory if there are deeper directory levels. I would like to leave it that way. Do you have any other ideas in mind?

Please find the v15 patch; I merged the RemoveXlogFile() changes into 0001.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

v15-0001-Make-more-use-of-get_dirent_type-in-place-of-sta.patch
Description: Binary data
Re: NAMEDATALEN increase because of non-latin languages
I wrote:
> The syscache use of GETSTRUCT still uses a simple cast of the tuple (for
> pg_cast those calls live in parse_coerce.c, which is unchanged from master in
> v3). Next step I think is to see about the syscache piece -- teaching a
> syscache miss to deform the entire tuple into a struct and store it in the
> syscache.

v4 is a hackish attempt at getting the syscache to deform the tuple and store the struct. Using only pg_cast again for ease, it seems to work for that. It's now about as far as I can get without thinking about byref types.

0001 just adds copies of some syscache / catcache functions for private use by pg_cast, as scaffolding. 0002 teaches the temporary CatalogCacheCreateEntry_STRUCT() to call heap_deform_tuple(). This then calls a for-now handwritten function (similar to the one generated in v3) which pallocs space for the struct and copies the fields over. Only this function knows the catalog struct type, so a future design could call the per-cache function via a pointer in the catcache control array. Since we already have the isnull/values arrays, it's also trivial to copy the datums for the cache keys.

WIP: CatCTup->tuple is still declared a tuple, but the t_data contents are now bogus, so there is a temporary GETSTRUCT_NEW that knows it's looking directly at the struct.

Getting to varlen attributes: For one, I think it was mentioned above that we will need a way to perform a deep copy of structs that contain pointers to varlen fields. I imagine keeping track of the attributes' lengths will come up for some types and might be tricky. And before that, the catalog machinery will need some preprocessor tricks to declare pointers in the structs.
--
John Naylor
EDB: http://www.enterprisedb.com

From 20fba44412e8ef1bb4cd5b051b9d7e82618a6d93 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Wed, 10 Aug 2022 17:19:24 +0700
Subject: [PATCH v4 2/2] Teach catcache to store structs for pg_cast

---
 src/backend/parser/parse_coerce.c  |  6 +--
 src/backend/utils/cache/catcache.c | 73 --
 src/include/access/htup_details.h  |  2 +
 3 files changed, 55 insertions(+), 26 deletions(-)

diff --git a/src/backend/parser/parse_coerce.c b/src/backend/parser/parse_coerce.c
index 39b7e5707b..07a1b047e3 100644
--- a/src/backend/parser/parse_coerce.c
+++ b/src/backend/parser/parse_coerce.c
@@ -3056,7 +3056,7 @@ IsBinaryCoercible(Oid srctype, Oid targettype)
 							ObjectIdGetDatum(targettype));
 	if (!HeapTupleIsValid(tuple))
 		return false;			/* no cast */
-	castForm = (Form_pg_cast) GETSTRUCT(tuple);
+	castForm = GETSTRUCT_NEW(pg_cast, tuple);
 
 	result = (castForm->castmethod == COERCION_METHOD_BINARY &&
 			  castForm->castcontext == COERCION_CODE_IMPLICIT);
@@ -3120,7 +3120,7 @@ find_coercion_pathway(Oid targetTypeId, Oid sourceTypeId,
 
 	if (HeapTupleIsValid(tuple))
 	{
-		Form_pg_cast castForm = (Form_pg_cast) GETSTRUCT(tuple);
+		Form_pg_cast castForm = GETSTRUCT_NEW(pg_cast, tuple);
 		CoercionContext castcontext;
 
 		/* convert char value for castcontext to CoercionContext enum */
@@ -3287,7 +3287,7 @@ find_typmod_coercion_function(Oid typeId,
 
 	if (HeapTupleIsValid(tuple))
 	{
-		Form_pg_cast castForm = (Form_pg_cast) GETSTRUCT(tuple);
+		Form_pg_cast castForm = GETSTRUCT_NEW(pg_cast, tuple);
 
 		*funcid = castForm->castfunc;
 		ReleaseSysCache(tuple);
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index b1287bb6a0..8ddc109052 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -21,6 +21,7 @@
 #include "access/table.h"
 #include "access/valid.h"
 #include "access/xact.h"
+#include "catalog/pg_cast.h" // fixme
 #include "catalog/pg_collation.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_type.h"
@@ -2158,6 +2159,42 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 	return ct;
 }
 
+// WIP: generated functions would look like this and be called through a pointer
+// FIXME: ct->tuple is no longer a real tuple
+// XXX: for now assume the caller has switched to the right memory context
+static CatCTup *
+CatCArrayGetStruct_pg_cast(HeapTuple pg_cast_tuple, CatCTup *ct, Datum *values, bool *isnull)
+{
+	Form_pg_cast pg_cast_struct;
+
+	/* Allocate memory for CatCTup and the cached struct in one go */
+	ct = (CatCTup *) palloc(sizeof(CatCTup) +
+							MAXIMUM_ALIGNOF + sizeof(FormData_pg_cast));
+
+	/* copy the identification info */
+	// WIP: for caches we only need t_self, can we just have that as a
+	// separate field in CatCTup?
+	ct->tuple.t_len = pg_cast_tuple->t_len;
+	ct->tuple.t_self = pg_cast_tuple->t_self;
+	ct->tuple.t_tableOid = pg_cast_tuple->t_tableOid;
+
+	// WIP: treat t_data as a pointer to the struct
+	ct->tuple.t_data = (HeapTupleHeader)
+		MAXALIGN(((char *) ct) + sizeof(CatCTup));
+	pg_cast_struct = (Form_pg_cast) ct->tuple.t_data;
+
+	/* copy tuple contents */
+	// WIP: we can just assign because there are no varlen attribute
Re: Typo in misc_sanity.sql?
On 25.07.22 14:04, Japin Li wrote: Yeah, they do not have unique keys, however, here we check primary keys. So, IMO, the description of the exceptions should say they do not have primary keys, rather than that they do not have unique keys. The context of that check is that for each system catalog we pick one of the available unique keys and designate it as the one primary key. If a system catalog doesn't have a unique key to choose from, then we can't do that, hence the comment. Changing the comment as suggested would essentially be saying that this catalog has no primary key because it has no primary key, which wouldn't be helpful.
RE: logical replication restrictions
On Tuesday, August 9, 2022 6:47 AM Euler Taveira wrote: > I attached a v6. Hi, thank you for posting the updated patch. Minor review comments for v6. (1) commit message "If the subscriber sets min_apply_delay parameter, ..." I suggest we use subscription rather than subscriber, because this parameter refers to and is used for one subscription. My suggestion is "If one subscription sets min_apply_delay parameter, ..." If you agree, there are other places where this change applies. (2) commit message It might be better to write a note for the committer like "Bump catalog version" at the bottom of the commit message. (3) unit alignment between recovery_min_apply_delay and min_apply_delay The former interprets a unit-less input number as milliseconds, while the latter takes it as seconds. I feel it would be better to make them consistent. (4) catalogs.sgml + Delay the application of changes by a specified amount of time. The + unit is in milliseconds. As a column explanation, it'd be better to use a noun in the first sentence to make this description aligned with other places. My suggestion is "Application delay of changes by ". (5) pg_subscription.c There is a missing blank line before the if statement. One is present in AlterSubscription for the other cases. @@ -1100,6 +1130,12 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt, replaces[Anum_pg_subscription_subdisableonerr - 1] = true; } + if (IsSet(opts.specified_opts, SUBOPT_MIN_APPLY_DELAY)) (6) tab-complete.c The order of tab-complete parameters listed in the COMPLETE_WITH should follow alphabetical order. "min_apply_delay" can come before "origin". We can refer to commit d547f7c. (7) 032_apply_delay.pl There are missing spaces after the commas in the mod() calls. UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0; DELETE FROM test_tab WHERE mod(a,3) = 0; Best Regards, Takamichi Osumi
Re: Generalize ereport_startup_progress infrastructure
> > > Here's an attempt to generalize the ereport_startup_progress > > > infrastructure. The attached v1 patch places the code in elog.c/.h, > > > renames associated functions and variables, something like > > > ereport_startup_progress to ereport_progress, > > > log_startup_progress_interval to log_progress_report_interval and so > > > on. > > > > I'm not averse to reusing this infrastructure in other places, but I > > doubt we'd want all of those places to be controlled by a single GUC, > > especially because that GUC is also the on/off switch for the feature. > > Thanks Robert! How about we tweak the function a bit - > begin_progress_report_phase(int timeout), so that each process can use > their own timeout interval? In this case, do we want to retain > log_startup_progress_interval as-is specific to the startup process? > If yes, other processes might come up with their own GUCs (if they > don't want to use hard-coded timeouts) similar to > log_startup_progress_interval, which isn't the right way IMO. > > I think the notion of ereport_progress feature being disabled when the > timeout is 0, makes sense to me at least. > > On the flip side, what if we just have a single GUC > log_progress_report_interval (as proposed in the v1 patch)? Do we ever > want different processes to emit progress report messages at different > frequencies? Well, I can think of the startup process during standby > recovery needing to emit recovery progress report messages at a much > lower frequency than the startup process during the crash recovery. > Again, controlling the frequencies with different GUCs isn't the way > forward. But we can do something like: process 1 emits messages with a > frequency of 2*log_progress_report_interval, process 2 with a > frequency 4*log_progress_report_interval and so on without needing > additional GUCs. > > Thoughts? +1 for the idea to generalize the infrastructure. 
Given two options, option-1 is to use a single GUC across all kinds of long running operations and option-2 is to use multiple GUCs (one for each kind of long running operation), I go with option-1 because if a user is interested in seeing a log message every 10s for startup operations (or any other long running operation), it is likely that they want to see other long running operations at the same interval. It does not make sense to use different intervals for each kind of long running operation here. It also increases the number of GUCs, which makes things complex. So it is a good idea to use a single GUC here. But I am worried about the on/off switch as Robert mentioned. How about using a new GUC to indicate which features are on/off, say "log_long_running_operations", which contains a comma-separated string indicating the features to be enabled. For example, "log_long_running_operations = startup, postmaster" will enable logging for startup and postmaster operations and disable logging of other long running operations. With this the number of GUCs will be limited to 2 and it is simple and easy for the user. Thanks & Regards, Nitin Jadhav On Thu, Aug 4, 2022 at 9:58 AM Bharath Rupireddy wrote: > > On Wed, Aug 3, 2022 at 12:11 AM Robert Haas wrote: > > > > On Tue, Aug 2, 2022 at 3:25 AM Bharath Rupireddy > > wrote: > > > ereport_startup_progress infrastructure added by commit 9ce346e [1] > > > will be super-useful for reporting progress of any long-running server > > > operations, not just the startup process operations. For instance, > > > postmaster can use it for reporting progress of temp file and temp > > > relation file removals [2], checkpointer can use it for reporting > > > progress of snapshot or mapping file processing or even WAL file > > > processing and so on. 
And I'm sure there can be many places in the > > > code where we have while or for loops which can, at times, take a long > > > time to finish and having a log message there would definitely help. > > > > > > Here's an attempt to generalize the ereport_startup_progress > > > infrastructure. The attached v1 patch places the code in elog.c/.h, > > > renames associated functions and variables, something like > > > ereport_startup_progress to ereport_progress, > > > log_startup_progress_interval to log_progress_report_interval and so > > > on. > > > > I'm not averse to reusing this infrastructure in other places, but I > > doubt we'd want all of those places to be controlled by a single GUC, > > especially because that GUC is also the on/off switch for the feature. > > Thanks Robert! How about we tweak the function a bit - > begin_progress_report_phase(int timeout), so that each process can use > their own timeout interval? In this case, do we want to retain > log_startup_progress_interval as-is specific to the startup process? > If yes, other processes might come up with their own GUCs (if they > don't want to use hard-coded timeouts) similar to > log_startup_progress_interval, which isn't the right way IMO. > > I think the notion of erepor
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Wed, 10 Aug 2022 at 08:34, Dilip Kumar wrote: > > Am I still missing something? No, you have found a dependency between the patches that I was unaware of. So there is no bug if you apply both patches. Thanks for looking. > So I still think some adjustment is required in XidInMVCCSnapshot() That is one way to resolve the issue, but not the only one. I can also change AssignTransactionId() to recursively register parent xids for all of a subxid's parents. I will add in a test case and resolve the dependency in my next patch. Thanks again. -- Simon Riggs http://www.EnterpriseDB.com/
s390x builds on buildfarm
Hi, Have the builds been paused on s390x? It looks like the s390x builds were last run 15 days ago. If so, I wonder what the reason for the pause is and what is required to resume the builds. The OS the builds were running on seems to have reached end of life. Please let me know if we can help with getting them updated and resuming the builds. Regards, Vivian Kong Linux on IBM Z Open Source Ecosystem IBM Canada Toronto Lab
Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock
On Wed, Aug 10, 2022 at 12:39 AM Thomas Munro wrote: > Here's an email about that: > > https://www.postgresql.org/message-id/CAE9k0P=OXww6RQCGrmDNa8=L3EeB01SGbYuP23y-qZJ=4td...@mail.gmail.com Hmm. If I'm reading that email correctly, it indicates that I noticed this problem before commit and asked for it to be changed, but then for some reason it wasn't changed and I still committed it. I can't immediately think of a reason why it wouldn't be safe to insist on acquiring a cleanup lock there. -- Robert Haas EDB: http://www.enterprisedb.com
Re: PG 15 (and to a smaller degree 14) regression due to ExprEvalStep size
On Wed, Aug 10, 2022 at 3:57 AM Andres Freund wrote: > One way this code could be drastically simplified is to force all > type-coercions to go through the "io coercion" path, which could be > implemented as a single execution step (which thus could trivially > start/finish a subtransaction) and would remove a lot of the complicated code > around coercions. Could you please clarify how you think we might do the io coercion wrapped with a subtransaction all as a single execution step? I would've thought that we couldn't do the sub-transaction without leaving ExecInterpExpr() anyway, so maybe you meant the io coercion itself was done using some code outside ExecInterpExpr()? The current JsonExpr code does it by recursively calling ExecInterpExpr() using the nested ExprState expressly for the coercion. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tuesday, August 9, 2022 12:59 AM Önder Kalacı wrote: > Attaching v5 of the patch which reflects the review on this email, also few > minor test improvements. Hi, Thank you for the updated patch. FYI, I noticed that v5 causes a cfbot failure in [1]. Could you please fix it in the next version? [19:44:38.420] execReplication.c: In function ‘RelationFindReplTupleByIndex’: [19:44:38.420] execReplication.c:186:24: error: ‘eq’ may be used uninitialized in this function [-Werror=maybe-uninitialized] [19:44:38.420] 186 | if (!indisunique && !tuples_equal(outslot, searchslot, eq)) [19:44:38.420] | ^ [19:44:38.420] cc1: all warnings being treated as errors [1] - https://cirrus-ci.com/task/6544573026533376 Best Regards, Takamichi Osumi
something has gone wrong, but what is it?
Hi, Today while hacking I encountered this delight: 2022-08-10 09:30:29.025 EDT [27126] FATAL: something has gone wrong I actually already knew that something had gone wrong, because the code I was writing was incomplete. And if I hadn't known that, the word FATAL would have been a real good clue. What I was hoping was that the error message might tell me WHAT had gone wrong, but it didn't. This seems to be the fault of Andres's commit 5aa4a9d2077fa902b4041245805082fec6be0648. In his defense, the addition of any kind of elog() at that point in the code appears to be an improvement over the previous state of affairs. Nonetheless I feel we could do better still, as in the attached. -- Robert Haas EDB: http://www.enterprisedb.com v1-0001-Be-more-specific-about-exactly-what-has-gone-wron.patch Description: Binary data
Re: pg_get_constraintdef: Schema qualify foreign tables unless pretty printing is enabled
Alvaro Herrera writes: > On 2022-Aug-09, Lukas Fittl wrote: >> But I wonder, why do we have an explicit pretty printing flag on these >> functions, and PRETTYFLAG_SCHEMA in the code to represent this behavior. >> If we don't want pretty printing to affect schema qualification, why >> does that flag exist? > Because of CVE-2018-1058. See commit 815172ba8068. > I imagine that that commit only touched the minimum necessary to solve > the immediate security problem, but that further work is needed to make > PRETTYFLAG_SCHEMA become a fully functional gadget; but that would > require that the whole of ruleutils.c (and everything downstream from > it) behaves sanely. In other words, I think your patch is too small. What I'm inclined to do, rather than repeat the same finicky & undocumented coding pattern in one more place, is write a convenience function for it that can be named and documented to reflect the coding rule about which call sites should use it (rather than calling plain generate_relation_name). However, the first requirement for that is to have a clearly defined rule. I think the intent of 815172ba8068 was to convert all uses that would determine the object-creation schema in commands issued by pg_dump. Do we want to widen that, and if so by how much? I'd be on board I think with adjusting other ruleutils.c functions that could plausibly be used for building creation commands, but happen not to be called by pg_dump. I'm not on board with converting every single generate_relation_name call --- mainly because it'd be pointless unless you also qualify every single function name, operator name, etc; and that would be unreadable. regards, tom lane
Re: something has gone wrong, but what is it?
> On 10 Aug 2022, at 15:41, Robert Haas wrote: > I feel we could do better still, as in the attached. +1, LGTM. -- Daniel Gustafsson https://vmware.com/
Re: something has gone wrong, but what is it?
Robert Haas writes: - elog(ERROR, "something has gone wrong"); + elog(ERROR, "unrecognized AuxProcType: %d", (int) auxtype); +1 ... the existing message is clearly not up to project standard. regards, tom lane
Re: s390x builds on buildfarm
On 2022-08-10 We 09:04, Vivian Kong wrote: > > Hi, > > > > Are builds being paused on s390x as it looks like the s390x builds > were last run 15 days ago. If so, wondering what is the reason for > the pause and what is required to resume the builds? > The OS the builds were running on seems to have reached end of life. > Please let me know if we can help with getting them updated and resume > the builds. > > > > Mark, I think you run most or all of these. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
designated initializers
(Coming from https://postgr.es/m/20220809193616.5uucf33piwdxn452@alvherre.pgsql ) On 2022-Aug-09, Alvaro Herrera wrote: > On 2022-Aug-09, Andres Freund wrote: > > > Mildly wondering whether we ought to use designated initializers instead, > > given we're whacking it around already. Too easy to get the order wrong when > > adding new members, and we might want to have optional callbacks too. > > Strong +1. It makes code much easier to navigate (see XmlTableRoutine > and compare with heapam_methods, for example). For example, I propose the attached. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "He who lives for the future is a dreamer, and he who lives for the past, a fool" (Luis Adler, "Los tripulantes de la noche") From 66f5981751cdaeb27a0614ca2b0643632f7cc128 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 10 Aug 2022 15:57:09 +0200 Subject: [PATCH] use designated initializers --- .../libpqwalreceiver/libpqwalreceiver.c | 30 +-- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c index da9c359af1..2865024524 100644 --- a/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c +++ b/src/backend/replication/libpqwalreceiver/libpqwalreceiver.c @@ -82,21 +82,21 @@ static WalRcvExecResult *libpqrcv_exec(WalReceiverConn *conn, static void libpqrcv_disconnect(WalReceiverConn *conn); static WalReceiverFunctionsType PQWalReceiverFunctions = { - libpqrcv_connect, - libpqrcv_check_conninfo, - libpqrcv_get_conninfo, - libpqrcv_get_senderinfo, - libpqrcv_identify_system, - libpqrcv_server_version, - libpqrcv_readtimelinehistoryfile, - libpqrcv_startstreaming, - libpqrcv_endstreaming, - libpqrcv_receive, - libpqrcv_send, - libpqrcv_create_slot, - libpqrcv_get_backend_pid, - libpqrcv_exec, - libpqrcv_disconnect + .walrcv_connect = libpqrcv_connect, + .walrcv_check_conninfo = libpqrcv_check_conninfo, + 
.walrcv_get_conninfo = libpqrcv_get_conninfo, + .walrcv_get_senderinfo = libpqrcv_get_senderinfo, + .walrcv_identify_system = libpqrcv_identify_system, + .walrcv_server_version = libpqrcv_server_version, + .walrcv_readtimelinehistoryfile = libpqrcv_readtimelinehistoryfile, + .walrcv_startstreaming = libpqrcv_startstreaming, + .walrcv_endstreaming = libpqrcv_endstreaming, + .walrcv_receive = libpqrcv_receive, + .walrcv_send = libpqrcv_send, + .walrcv_create_slot = libpqrcv_create_slot, + .walrcv_get_backend_pid = libpqrcv_get_backend_pid, + .walrcv_exec = libpqrcv_exec, + .walrcv_disconnect = libpqrcv_disconnect }; /* Prototypes for private functions */ -- 2.30.2
Re: moving basebackup code to its own directory
On Tue, Aug 9, 2022 at 3:28 PM Robert Haas wrote: > On Tue, Aug 9, 2022 at 2:40 PM Justin Pryzby wrote: > > It looks like this updates the header comments in the .h files but not the > > .c > > files. > > > > Personally, I find these to be silly boilerplate .. > > Here is a version with some updates to the silly boilerplate. If there are no further comments on this I will go ahead and commit it. David Steele voted for back-patching this on the grounds that it would make future back-patching easier, which is an argument that seems to me to have some merit, although on the other hand, we are already into August so it's quite late in the day. Anyone else want to vote? -- Robert Haas EDB: http://www.enterprisedb.com
Re: moving basebackup code to its own directory
On Wed, Aug 10, 2022 at 10:08:02AM -0400, Robert Haas wrote: > On Tue, Aug 9, 2022 at 3:28 PM Robert Haas wrote: > > On Tue, Aug 9, 2022 at 2:40 PM Justin Pryzby wrote: > > > It looks like this updates the header comments in the .h files but not > > > the .c > > > files. > > > > > > Personally, I find these to be silly boilerplate .. > > > > Here is a version with some updates to the silly boilerplate. > > If there are no further comments on this I will go ahead and commit it. > > David Steele voted for back-patching this on the grounds that it would > make future back-patching easier, which is an argument that seems to > me to have some merit, although on the other hand, we are already into > August so it's quite late in the day. Anyone else want to vote? No objection to backpatching to v15, but if you don't, git ought to handle renamed files just fine. These look like similar precedent for "late" renaming+backpatching: 41dae3553, 47ca48364 -- Justin
Re: something has gone wrong, but what is it?
On Wed, Aug 10, 2022 at 9:53 AM Tom Lane wrote: > Robert Haas writes: > > - elog(ERROR, "something has gone wrong"); > + elog(ERROR, "unrecognized AuxProcType: %d", (int) > auxtype); > > +1 ... the existing message is clearly not up to project standard. After a bit of further looking around I noticed that there's another check for an invalid auxtype in this function which uses a slightly different message text and also PANIC rather than ERROR. I think we should adopt that here too, for consistency, as in the attached. The distinction between PANIC and ERROR doesn't really seem to matter here. Either way, the server goes into an infinite crash-and-restart loop. May as well be consistent. -- Robert Haas EDB: http://www.enterprisedb.com v2-0001-Be-more-specific-about-exactly-what-has-gone-wron.patch Description: Binary data
SPI versus read/write expanded datums
In the report at [1] we learned that the SQL-language function handler is too cavalier about read/write expanded datums that it receives as input. A function that receives such a datum is entitled to scribble on its value, or even delete it. If the function turns around and passes the datum on to some other function, the same applies there. So in general, it can only be safe to pass such a datum to *one* subsidiary function. If you want to use the value more than once, you'd better convert the pointer to read-only. fmgr_sql wasn't doing that, leading to the reported bug. After fixing that, I wondered if we had the same problem anywhere else, and it didn't take long to think of such a place: SPI. If you pass a read/write datum to SPI_execute_plan or one of its siblings, and the executed query references that datum more than once, you're potentially in trouble. Even if it does only reference it once, you might be surprised that your copy of the datum got modified. However, we can't install a 100% fix in SPI itself, because plpgsql intentionally exploits exactly this behavior to optimize things like "arr := array_append(arr, val)". I considered the idea of adding a 90% fix by making _SPI_convert_params() convert R/W pointers to R/O. That would protect places using the old-style "char *Nulls" APIs, and then we'd deem it the responsibility of callers using ParamListInfo APIs to protect themselves. I can't get terribly excited about that though, because it'd be adding complexity and cycles for a problem that seems entirely theoretical at this point. I can't find any SPI callers that would *actually* be passing a R/W datum to a query that'd be likely to modify it. The non-plpgsql PLs are at the most risk of calling a hazardous query, but they all pass "flat" datums that are the immediate result of a typinput function or the like. So my inclination is to do nothing about this now, and maybe nothing ever. 
But I thought it'd be a good idea to memorialize this issue for the archives. regards, tom lane [1] https://www.postgresql.org/message-id/flat/WScDU5qfoZ7PB2gXwNqwGGgDPmWzz08VdydcPFLhOwUKZcdWbblbo-0Lku-qhuEiZoXJ82jpiQU4hOjOcrevYEDeoAvz6nR0IU4IHhXnaCA%3D%40mackler.email
Re: something has gone wrong, but what is it?
Hi, On 2022-08-10 10:49:59 -0400, Robert Haas wrote: > On Wed, Aug 10, 2022 at 9:53 AM Tom Lane wrote: > > Robert Haas writes: > > > > - elog(ERROR, "something has gone wrong"); > > + elog(ERROR, "unrecognized AuxProcType: %d", (int) > > auxtype); > > > > +1 ... the existing message is clearly not up to project standard. > > After a bit of further looking around I noticed that there's another > check for an invalid auxtype in this function which uses a slightly > different message text and also PANIC rather than ERROR. > > I think we should adopt that here too, for consistency, as in the attached. > > The distinction between PANIC and ERROR doesn't really seem to matter > here. Either way, the server goes into an infinite crash-and-restart > loop. May as well be consistent. Makes sense. Greetings, Andres Freund
Re: Generalize ereport_startup_progress infrastructure
On Tue, Aug 9, 2022 at 11:54 AM Bharath Rupireddy wrote: > I'm attaching 0002 for reporting removal of temp files and temp > relation files by postmaster. > > If this looks okay, I can code 0003 for reporting processing of > snapshot, mapping and old WAL files by checkpointer. I think that someone is going to complain about the changes to timeout.c. Some trouble has been taken to allow things like SetLatch(MyLatch) to be unconditional. Aside from that, I am unsure how generally safe it is to use the timeout infrastructure in the postmaster. From a user-interface point of view, log_postmaster_progress_interval seems a bit awkward. It's really quite narrow, basically just checking for one thing. I'm not sure I like adding a GUC for something that specific, although I also don't have another idea at the moment either. Hmm. Maybe the checkpointer is a better candidate, but somehow I feel that we can't consider this sort of thing separate from the existing progress reporting that checkpointer already does. Perhaps we need to think of changing or improving that in some way rather than adding something wholly new alongside the existing system. -- Robert Haas EDB: http://www.enterprisedb.com
Allow logical replication to copy tables in binary format
Hey hackers, I see that logical replication subscriptions have an option to enable binary [1]. When it's enabled, the subscription asks the publisher to send data in binary format. But this is only the case for the apply phase. During tablesync, tables are still copied as text. Table copy uses the COPY command, and that command supports copying in binary format, so it seems possible to copy in binary for tablesync too. I'm not sure whether there is a reason to always copy tables in text format, but I couldn't see why not to do it in binary if it's enabled. You can find attached a small patch that simply enables binary copy. What do you think about this change? Does it make sense? Am I missing something? [1] https://www.postgresql.org/docs/15/sql-createsubscription.html Best, Melih 0001-Allow-logical-replication-to-copy-table-in-binary.patch Description: Binary data
Re: [PATCH] Expose port->authn_id to extensions and triggers
On Tue, Aug 9, 2022 at 3:39 AM Drouvot, Bertrand wrote: > Agree that it makes sense to work on those patches in this particular > order then. Sounds good. The ClientConnectionInfo patch (previously 0002) is attached, with the SQL function removed. Thanks, --Jacob From a22ff3ba36f5eb93c582a957c7c2caca07ed21c5 Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Wed, 23 Mar 2022 15:07:05 -0700 Subject: [PATCH] Allow parallel workers to read authn_id Move authn_id into a new global, MyClientConnectionInfo, which is intended to hold all the client information that needs to be shared between the backend and any parallel workers. MyClientConnectionInfo is serialized and restored using a new parallel key. --- src/backend/access/transam/parallel.c | 19 ++- src/backend/libpq/auth.c | 16 +++--- src/backend/utils/init/miscinit.c | 72 +++ src/include/libpq/libpq-be.h | 39 ++- src/include/miscadmin.h | 4 ++ 5 files changed, 129 insertions(+), 21 deletions(-) diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c index df0cd77558..bc93101ff7 100644 --- a/src/backend/access/transam/parallel.c +++ b/src/backend/access/transam/parallel.c @@ -76,6 +76,7 @@ #define PARALLEL_KEY_REINDEX_STATE UINT64CONST(0x000C) #define PARALLEL_KEY_RELMAPPER_STATE UINT64CONST(0x000D) #define PARALLEL_KEY_UNCOMMITTEDENUMS UINT64CONST(0x000E) +#define PARALLEL_KEY_CLIENTCONNINFO UINT64CONST(0x000F) /* Fixed-size parallel state. 
*/ typedef struct FixedParallelState @@ -212,6 +213,7 @@ InitializeParallelDSM(ParallelContext *pcxt) Size reindexlen = 0; Size relmapperlen = 0; Size uncommittedenumslen = 0; + Size clientconninfolen = 0; Size segsize = 0; int i; FixedParallelState *fps; @@ -272,8 +274,10 @@ InitializeParallelDSM(ParallelContext *pcxt) shm_toc_estimate_chunk(&pcxt->estimator, relmapperlen); uncommittedenumslen = EstimateUncommittedEnumsSpace(); shm_toc_estimate_chunk(&pcxt->estimator, uncommittedenumslen); + clientconninfolen = EstimateClientConnectionInfoSpace(); + shm_toc_estimate_chunk(&pcxt->estimator, clientconninfolen); /* If you add more chunks here, you probably need to add keys. */ - shm_toc_estimate_keys(&pcxt->estimator, 11); + shm_toc_estimate_keys(&pcxt->estimator, 12); /* Estimate space need for error queues. */ StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) == @@ -352,6 +356,7 @@ InitializeParallelDSM(ParallelContext *pcxt) char *session_dsm_handle_space; char *entrypointstate; char *uncommittedenumsspace; + char *clientconninfospace; Size lnamelen; /* Serialize shared libraries we have loaded. */ @@ -422,6 +427,12 @@ InitializeParallelDSM(ParallelContext *pcxt) shm_toc_insert(pcxt->toc, PARALLEL_KEY_UNCOMMITTEDENUMS, uncommittedenumsspace); + /* Serialize our ClientConnectionInfo. */ + clientconninfospace = shm_toc_allocate(pcxt->toc, clientconninfolen); + SerializeClientConnectionInfo(clientconninfolen, clientconninfospace); + shm_toc_insert(pcxt->toc, PARALLEL_KEY_CLIENTCONNINFO, + clientconninfospace); + /* Allocate space for worker information. 
*/ pcxt->worker = palloc0(sizeof(ParallelWorkerInfo) * pcxt->nworkers); @@ -1270,6 +1281,7 @@ ParallelWorkerMain(Datum main_arg) char *reindexspace; char *relmapperspace; char *uncommittedenumsspace; + char *clientconninfospace; StringInfoData msgbuf; char *session_dsm_handle_space; Snapshot tsnapshot; @@ -1479,6 +1491,11 @@ ParallelWorkerMain(Datum main_arg) false); RestoreUncommittedEnums(uncommittedenumsspace); + /* Restore the ClientConnectionInfo. */ + clientconninfospace = shm_toc_lookup(toc, PARALLEL_KEY_CLIENTCONNINFO, + false); + RestoreClientConnectionInfo(clientconninfospace); + /* Attach to the leader's serializable transaction, if SERIALIZABLE. */ AttachSerializableXact(fps->serializable_xact_handle); diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c index 2d9ab7edce..313a6ea701 100644 --- a/src/backend/libpq/auth.c +++ b/src/backend/libpq/auth.c @@ -342,15 +342,15 @@ auth_failed(Port *port, int status, const char *logdetail) * authorization will fail later. * * The provided string will be copied into TopMemoryContext, to match the - * lifetime of the Port, so it is safe to pass a string that is managed by an - * external library. + * lifetime of MyClientConnectionInfo, so it is safe to pass a string that is + * managed by an external library. */ static void set_authn_id(Port *port, const char *id) { Assert(id); - if (port->authn_id) + if (MyClientConnectionInfo.authn_id) { /* * An existing authn_id should never be overwritten; that means two @@ -361,17 +361,18 @@ set_authn_id(Port *port, const char *id) ereport(FATAL, (errmsg("authentication identifier set
fix stale help message
When parsing command-line options, the -f option supports disabling 8 scan and join methods; o, b, and t disable index-only scans, bitmap index scans, and TID scans respectively, but they are missing from the help message, so add them. diff --git a/src/backend/main/main.c b/src/backend/main/main.c index 5a964a0db6..f5da4260a1 100644 --- a/src/backend/main/main.c +++ b/src/backend/main/main.c @@ -351,7 +351,7 @@ help(const char *progname) printf(_(" -?, --help show this help, then exit\n")); printf(_("\nDeveloper options:\n")); - printf(_(" -f s|i|n|m|h forbid use of some plan types\n")); + printf(_(" -f s|i|o|b|t|n|m|h forbid use of some plan types\n")); printf(_(" -n do not reinitialize shared memory after abnormal exit\n")); printf(_(" -O allow system table structure changes\n")); printf(_(" -P disable system indexes\n")); -- Regards Junwang Zhao 0001-fix-stale-help-message.patch Description: Binary data
Re: SQL/JSON features for v15
On 2022-08-09 Tu 16:58, Jonathan S. Katz wrote: > Hi, > > (Personal hat, not RMT hat unless otherwise noted). > > This thread[1] raised some concerns around the implementation of the > SQL/JSON features that are slated for v15, which includes an > outstanding open item[2]. Given the current state of the discussion, > when the RMT met on Aug 8, they discussed several options, readable here[3]. > Given we are now into the later part of the release cycle, we need to > make some decisions on how to proceed with this feature given the > concerns raised. > > Per additional discussion on the thread, the group wanted to provide > more visibility into the discussion to get opinions on how to proceed > for the v15 release. > > Without rehashing the thread, the options presented were: > > 1. Fixing the concerns addressed in the thread around the v15 SQL/JSON > features implementation, noting that this would likely entail at least > one more beta release and would push the GA date past our normal > timeframe. > > 2. Try to commit a subset of the features that caused less debate. > This was ruled out. > > 3. Revert the v15 SQL/JSON features work. > > > Based on the current release timing and the open issues presented on > the thread, the RMT had recommended reverting, but preferred to > drive consensus on next steps. > > > From a release advocacy standpoint, I need about 6 weeks lead time to > put together the GA launch. We're at the point where I typically > deliver a draft release announcement. From this, given this involves a > high visibility feature, I would want some clarity on what option we > would like to pursue. Once the announcement translation process has > begun (and this is when we have consensus on the release > announcement), it becomes more challenging to change things out. > > From a personal standpoint (restating from[3]), I would like to see > what we could do to include support for this batch of the SQL/JSON > features in v15. 
What is included looks like it closes most of the gap > on what we've been missing syntactically since the standard was > adopted, and the JSON_TABLE work is very convenient for converting > JSON data into a relational format. I believe having this feature set > is important for maintaining standards compliance, interoperability, > tooling support, and general usability. Plus, JSON still seems to be > pretty popular. > > We're looking for additional input on what makes sense as a best > course of action, given what is presented in[3]. > > Thanks, > > Jonathan > > [1] > https://www.postgresql.org/message-id/flat/20220616233130.rparivafipt6doj3%40alap3.anarazel.de > [2] https://wiki.postgresql.org/wiki/PostgreSQL_15_Open_Items > [3] > https://www.postgresql.org/message-id/787cef45-15de-8f1d-ed58-a1c1435bfc0e%40postgresql.org To preserve options I will start preparing reversion patches. Given there are I think more than 20 commits all told that could be fun, and will probably take me a little while. The sad part is that to the best of my knowledge this code is producing correct results, and not disturbing the stability or performance of anything else. There was a performance issue but it's been dealt with AIUI. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
use argument instead of global variable
The caller of `get_stats_option_name` passes optarg as the argument; it's saner to use the argument instead of the global variable set by getopt, and safer since the argument has a *const* qualifier. diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 11e802eba9..68552b8779 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3598,9 +3598,9 @@ get_stats_option_name(const char *arg) switch (arg[0]) { case 'p': - if (optarg[1] == 'a') /* "parser" */ + if (arg[1] == 'a') /* "parser" */ return "log_parser_stats"; - else if (optarg[1] == 'l') /* "planner" */ + else if (arg[1] == 'l') /* "planner" */ return "log_planner_stats"; break; -- Regards Junwang Zhao
Re: moving basebackup code to its own directory
Robert Haas writes: > David Steele voted for back-patching this on the grounds that it would > make future back-patching easier, which is an argument that seems to > me to have some merit, although on the other hand, we are already into > August so it's quite late in the day. Anyone else want to vote? Seems like low-risk refactoring, so +1 for keeping v15 close to HEAD. regards, tom lane
Re: moving basebackup code to its own directory
On Wed, Aug 10, 2022 at 6:20 PM Tom Lane wrote: > > Robert Haas writes: > > David Steele voted for back-patching this on the grounds that it would > > make future back-patching easier, which is an argument that seems to > > me to have some merit, although on the other hand, we are already into > > August so it's quite late in the day. Anyone else want to vote? > > Seems like low-risk refactoring, so +1 for keeping v15 close to HEAD. +1, but I suggest also getting a hat-tip from the RMT on it. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: moving basebackup code to its own directory
On 8/10/22 12:32 PM, Magnus Hagander wrote: On Wed, Aug 10, 2022 at 6:20 PM Tom Lane wrote: Robert Haas writes: David Steele voted for back-patching this on the grounds that it would make future back-patching easier, which is an argument that seems to me to have some merit, although on the other hand, we are already into August so it's quite late in the day. Anyone else want to vote? Seems like low-risk refactoring, so +1 for keeping v15 close to HEAD. +1, but I suggest also getting a hat-tip from the RMT on it. With RMT hat on, given a few folks who maintain backup utilities seem to be in favor of backpatching to v15 and they are the ones to be most affected by this, it seems to me that this is an acceptable, noncontroversial course of action. Jonathan OpenPGP_signature Description: OpenPGP digital signature
Re: designated initializers
Hi, On 2022-08-10 16:03:00 +0200, Alvaro Herrera wrote: > (Coming from > https://postgr.es/m/20220809193616.5uucf33piwdxn452@alvherre.pgsql ) > > On 2022-Aug-09, Alvaro Herrera wrote: > > > On 2022-Aug-09, Andres Freund wrote: > > > > > Mildly wondering whether we ought to use designated initializers instead, > > > given we're whacking it around already. Too easy to get the order wrong > > > when > > > adding new members, and we might want to have optional callbacks too. > > > > Strong +1. It makes code much easier to navigate (see XmlTableRoutine > > and compare with heapam_methods, for example). > > For example, I propose the attached. +1 I've fought with this one when fixing a conflict when rebasing a patch... Greetings, Andres Freund
Re: moving basebackup code to its own directory
On 2022-Aug-10, Robert Haas wrote: > David Steele voted for back-patching this on the grounds that it would > make future back-patching easier, which is an argument that seems to > me to have some merit, although on the other hand, we are already into > August so it's quite late in the day. Anyone else want to vote? Given that 10 of these 11 files are new in 15, I definitely agree with backpatching the move. Moving the include/ files is going to cause some pain for any third-party code #including those files. I don't think this is a problem. -- Álvaro Herrera  Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: [PATCH] CF app: add "Returned: Needs more interest"
On Wed, Aug 3, 2022 at 02:53:23PM -0400, Tom Lane wrote: > Andres Freund writes: > > My impression is that a lot of the patches floating from CF to CF have > > gotten > > sceptical feedback and at best a minor amount of work to address that has > > been > > done. > > That I think is a distinct issue: nobody wants to take on the > unpleasant job of saying "no, we don't want this" in a final way. > We may raise some objections but actually rejecting a patch is hard. > So it tends to slide forward until the author gives up. Agreed. There is a sense when I look at patches in that status that they seem like a good idea to someone and could be useful to someone, but the overhead or complexity it would add to the software doesn't seem warranted. It is complex to explain that to someone, and since it is a judgement call, not worth the argument. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: [RFC] building postgres with meson
Hi, On 2022-06-02 10:26:09 -0700, Andres Freund wrote: > > Could we have the meson build check that, say, if gram.c exists it > > is newer than gram.y? Or get it to ignore an in-tree gram.c? > > I suspect the problem with ignoring is gram.h, that's probably a bit harder to > ignore. I tried to ignore various generated files in the source tree, but I don't think it's doable for all of them. Consider e.g. src/backend/utils/misc/guc-file.c which gets built via #include "guc-file.c" from gram.c. Because it's a "" include, the search path starts in the current directory and only then -I is searched. To my knowledge there's no way of changing that. Quoting the gcc manpage: -I dir -iquote dir -isystem dir -idirafter dir Add the directory dir to the list of directories to be searched for header files during preprocessing. If dir begins with = or $SYSROOT, then the = or $SYSROOT is replaced by the sysroot prefix; see --sysroot and -isysroot. Directories specified with -iquote apply only to the quote form of the directive, "#include "file"". Directories specified with -I, -isystem, or -idirafter apply to lookup for both the "#include "file"" and "#include <file>" directives. You can specify any number or combination of these options on the command line to search for header files in several directories. The lookup order is as follows: 1. For the quote form of the include directive, the directory of the current file is searched first. 2. For the quote form of the include directive, the directories specified by -iquote options are searched in left-to-right order, as they appear on the command line. 3. Directories specified with -I options are scanned in left-to-right order. [...] Except for copying guc.c from source to build tree before building, I don't see a way of ignoring the in-build-tree guc-file.c. Not sure what a good way of dealing with this is. For now I'll make it just error out if there's any known such file in the source tree, but that's not a good solution forever. 
If it were just "normal" build leftovers I'd propose to (optionally) just remove them, but that's not good for tarballs. Greetings, Andres Freund
Re: Get the statistics based on the application name and IP address
On Mon, Aug 8, 2022 at 10:11 PM Julien Rouhaud wrote: > Hi, > > On Mon, Aug 08, 2022 at 08:21:06PM +0500, Ibrar Ahmed wrote: > > While working on pg_stat_statements, I got some questions from customers to > > have statistics by application and IP address. > > [...] > > name. I did some POC and had a patch. But before sharing the patch. > > > > I need to know if there has been any previous discussion about this > topic; > > by the way, > > Thanks for the input. > I don't think there was any discussion on this exactly, but there have been > some related discussions. > > This would likely bring 2 problems. > First, for now each entry contains its own > query text in the query file. There can already be some duplication, which > isn't great, but adding the application_name and/or IP address will make > things > way worse, so you would probably need to fix that first. I doubt that makes it worse because these (IP and Application) will be part of the key, not the query text. But yes, I agree that it will increase the footprint of rows, excluding query text. I am not 100% sure about the query text duplication but will look at that in detail; if you have more insight, it will help to solve that. > There has been some > discussion about it recently (1) but more work and benchmarking are needed. > > The other problem is the multiplication of entries. It's a well known > limitation that pg_stat_statements evictions are so costly that it makes it > unusable. The last numbers I saw about it was ~55% overhead (2). Adding > application_name or ip address to the key would probably make > pg_stat_statements unusable for anyone who would actually need those > metrics. > I am sure adding a new item in the key does not affect the performance of evictions of the row, as it will not be part of that area. I am doing some benchmarking and hacking to reduce that and will send results with the patch. 
> [1]: > https://www.postgresql.org/message-id/flat/604E3199-2DD2-47DD-AC47-774A6F97DCA9%40amazon.com > [2]: https://twitter.com/AndresFreundTec/status/1105585237772263424 > -- Ibrar Ahmed. Senior Software Engineer, PostgreSQL Consultant.
Re: moving basebackup code to its own directory
On Wed, Aug 10, 2022 at 12:57 PM Alvaro Herrera wrote: > Given that 10 of these 11 files are new in 15, I definitely agree with > backpatching the move. OK, done. -- Robert Haas EDB: http://www.enterprisedb.com
Re: something has gone wrong, but what is it?
> On 10 Aug 2022, at 19:49, Robert Haas wrote: > > After a bit of further looking around I noticed that there's another > check for an invalid auxtype in this function which uses a slightly > different message text and also PANIC rather than ERROR. Is there a reason to do MyBackendType = B_INVALID; after PANIC or ERROR? Best regards, Andrey Borodin.
Re: shared-memory based stats collector - v70
On Tue, 9 Aug 2022 at 12:48, Andres Freund wrote: > > The reason I want a C function is I'm trying to get as far as I can > > without a connection to a database, without a transaction, without > > accessing the catalog, and as much as possible without taking locks. > > I assume you don't include lwlocks under locks? I guess it depends on which lwlock :) I would be leery of a monitoring system taking an lwlock that could interfere with regular transactions doing work. Or taking a lock that is itself the cause of the problem elsewhere that you really need stats to debug would be a deal breaker. > I don't think that's a large enough issue to worry about unless you're > polling at a very high rate, which'd be a bad idea in itself. If a backend > can't get the lock for some stats change it'll defer flushing the stats a bit, > so it'll not cause a lot of other problems. Hm. I wonder if we're on the same page about what constitutes a "high rate". I've seen people try to push Prometheus or other similar systems to 5s poll intervals. That would be challenging for Postgres due to the volume of statistics. The default is 30s and people often struggle to even have that function for large fleets. But if you had a small fleet, perhaps an IoT-style system with a "one large table" type of schema you might well want stats every 5s or even every 1s. > I'm *dead* set against including catalog names in shared memory stats. That'll > add a good amount of memory usage and complexity, without any sort of > commensurate gain. Well it's pushing the complexity there from elsewhere. If the labels aren't in the stats structures then the exporter needs to connect to each database, gather all the names into some local cache and then it needs to worry about keeping it up to date. And if there are any database problems such as disk errors or catalog objects being locked then your monitoring breaks though perhaps it can be limited to just missing some object names or having out of date names. 
> > I also think it would be nice to have a change counter for every stat > > object, or perhaps a change time. Prometheus wouldn't be able to make > > use of it but other monitoring software might be able to receive only > > metrics that have changed since the last update which would really > > help on databases with large numbers of mostly static objects. > > I think you're proposing adding overhead that doesn't even have a real user. I guess I'm just brainstorming here. I don't need it currently, no. It doesn't seem like significant overhead compared to the locking and copying, though? -- greg
Re: something has gone wrong, but what is it?
On Wed, Aug 10, 2022 at 2:06 PM Andrey Borodin wrote: > > On 10 Aug 2022, at 19:49, Robert Haas wrote: > > After a bit of further looking around I noticed that there's another > > check for an invalid auxtype in this function which uses a slightly > > different message text and also PANIC rather than ERROR. > > Is there a reason to do > MyBackendType = B_INVALID; > after PANIC or ERROR? That could probably be taken out, but it doesn't seem important to take it out. -- Robert Haas EDB: http://www.enterprisedb.com
Re: shared-memory based stats collector - v70
One thing that's bugging me is that the names we use for these stats are *all* over the place. The names go through three different stages pgstat structs -> pgstatfunc tupledescs -> pg_stat_* views (Followed by a fourth stage where pg_exporter or whatever names for the monitoring software) And for some reason both transitions (plus the exporter) felt the need to fiddle with the names or values. And not in any sort of even vaguely consistent way. So there are three (or four) different sets of names for the same metrics :( e.g. * Some of the struct elements have abbreviated words which are expanded in the tupledesc names or the view columns -- some have long names which get abbreviated. * Some struct members have n_ prefixes (presumably to avoid C keywords or other namespace issues?) and then lose them at one of the other stages. But then the relation stats do not have n_ prefixes and then the pg_stat view *adds* n_ prefixes in the SQL view! * Some columns are added together in the SQL view which seems like gratuitously hiding information from the user. The pg_stat_*_tables view actually looks up information from the indexes stats and combines them to get idx_scan and idx_tup_fetch. * The pg_stat_bgwriter view returns data from two different fixed entries, the checkpointer and the bgwriter, is there a reason those are kept separately but then reported as if they're one thing? Some of the simpler renaming could be transparently fixed by making the internal stats match the public facing names. But for many of them I think the internal names are better. And the cases where the views aggregate data in a way that loses information are not something I want to reproduce. I had intended to use the internal names directly, reasoning that transparency and consistency are the direction to be headed. But in some cases I think the current public names are the better choice -- I certainly don't want to remove n_* prefixes from some names but then add them to different names! 
And some of the cases where the data is combined or modified do seem like they would be missed. -- greg
Re: pg_auth_members.grantor is bunk
On Mon, Aug 1, 2022 at 3:51 PM Robert Haas wrote: > On Mon, Aug 1, 2022 at 1:38 PM Tom Lane wrote: > >> I think the latter --- the cfbot thinks the July CF is no longer relevant, > > but Jacob hasn't yet moved your patches forward. You could wait for > > him to do that, or do it yourself. > > Done. New patches attached. Well, CI isn't happy with this, and for good reason: ALTER GROUP regress_priv_group2 ADD USER regress_priv_user2; -- duplicate -NOTICE: role "regress_priv_user2" has already been granted membership in role "regress_priv_group2" by role "rhaas" +NOTICE: role "regress_priv_user2" has already been granted membership in role "regress_priv_group2" by role "postgres" The problem here is that I revised the error message to include the name of the grantor, since that's now a part of the identity of the grant. It would be misleading to say, as we did previously... NOTICE: role "regress_priv_user2" is already a member of role "regress_priv_group2" ...because them being in the group isn't relevant so much as them being in the group by means of the same grantor. However, I suspect that I can't persuade all of you that we should hard-code the name of the bootstrap superuser as "rhaas", so this test case needs some alteration. I found, however, that the original intent of the test case couldn't be preserved with the patch as written, because when you grant membership in one role to another role as the superuser or a CREATEROLE user, the grant is attributed to the bootstrap superuser, whose name is variable, as this test failure shows. Therefore, to fix the test, I needed to use ALTER GROUP as a non-CREATEROLE user, some user created as part of the test, for the results to be stable. But that was impossible, because even though "GRANT user_name TO group_name" requires *either* CREATEROLE *or* ADMIN OPTION on the group, the equivalent command "ALTER GROUP group_name ADD USER user_name" requires specifically CREATEROLE. 
I debated whether to fix that inconsistency or just remove this test case and eventually came down on the side of fixing the inconsistency, so the attached version does it that way. -- Robert Haas EDB: http://www.enterprisedb.com v5-0001-Ensure-that-pg_auth_members.grantor-is-always-val.patch Description: Binary data v5-0002-Make-role-grant-system-more-consistent-with-other.patch Description: Binary data
Re: logical replication restrictions
On Wed, Aug 10, 2022, at 9:39 AM, osumi.takami...@fujitsu.com wrote: > Minor review comments for v6. Thanks for your review. I'm attaching v7. > "If the subscriber sets min_apply_delay parameter, ..." > > I suggest we use subscription rather than subscriber, because > this parameter refers to and is used for one subscription. > My suggestion is > "If one subscription sets min_apply_delay parameter, ..." > In case if you agree, there are other places to apply this change. I changed the terminology to subscription. I also checked other "subscriber" occurrences but I don't think it should be changed. Some of them are used as publisher/subscriber pair. If you think there is another sentence to consider, point it out. > It might be better to write a note for committer > like "Bump catalog version" at the bottom of the commit message. It is a committer task to bump the catalog number. IMO it is easy to notice (using a git hook?) that it must bump it when we are modifying the catalog. AFAICS there is no recommendation to add such a warning. > The former interprets input number as milliseconds in case of no units, > while the latter takes it as seconds without units. > I feel it would be better to make them aligned. In a previous version I decided not to add a code to attach a unit when there isn't one. Instead, I changed the documentation to reflect what interval_in uses (seconds as unit). Under reflection, let's use ms as default unit if the user doesn't specify one. I fixed all the other suggestions too. -- Euler Taveira EDB https://www.enterprisedb.com/ From a28987c8adb70d6932558f5e39f9dd4c55223a30 Mon Sep 17 00:00:00 2001 From: Euler Taveira Date: Sat, 6 Nov 2021 11:31:10 -0300 Subject: [PATCH v7] Time-delayed logical replication subscriber Similar to physical replication, a time-delayed copy of the data for logical replication is useful for some scenarios (particularly to fix errors that might cause data loss). 
If the subscription sets min_apply_delay parameter, the logical replication worker will delay the transaction commit for min_apply_delay milliseconds. The delay is calculated between the WAL time stamp and the current time on the subscriber. The delay occurs only on WAL records for transaction begins. The main reason is to avoid keeping a transaction open for a long time. Regular and prepared transactions are covered. Streamed transactions are also covered. Author: Euler Taveira Discussion: https://postgr.es/m/CAB-JLwYOYwL=xtyaxkih5ctm_vm8kjkh7aaitckvmch4rzr...@mail.gmail.com --- doc/src/sgml/catalogs.sgml | 10 ++ doc/src/sgml/ref/alter_subscription.sgml | 5 +- doc/src/sgml/ref/create_subscription.sgml | 43 +- src/backend/catalog/pg_subscription.c | 1 + src/backend/catalog/system_views.sql | 7 +- src/backend/commands/subscriptioncmds.c| 59 +++- src/backend/replication/logical/worker.c | 100 + src/backend/utils/adt/timestamp.c | 32 src/bin/pg_dump/pg_dump.c | 17 ++- src/bin/pg_dump/pg_dump.h | 1 + src/bin/psql/describe.c| 9 +- src/bin/psql/tab-complete.c| 4 +- src/include/catalog/pg_subscription.h | 3 + src/include/datatype/timestamp.h | 2 + src/include/utils/timestamp.h | 2 + src/test/regress/expected/subscription.out | 165 - src/test/regress/sql/subscription.sql | 20 +++ src/test/subscription/t/032_apply_delay.pl | 129 18 files changed, 526 insertions(+), 83 deletions(-) create mode 100644 src/test/subscription/t/032_apply_delay.pl diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index cd2cc37aeb..291ebdafad 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -7833,6 +7833,16 @@ SCRAM-SHA-256$:&l + + + subapplydelay int8 + + + Application delay of changes by a specified amount of time. The + unit is in milliseconds. 
+ + + subname name diff --git a/doc/src/sgml/ref/alter_subscription.sgml b/doc/src/sgml/ref/alter_subscription.sgml index 64efc21f53..8901e1361c 100644 --- a/doc/src/sgml/ref/alter_subscription.sgml +++ b/doc/src/sgml/ref/alter_subscription.sgml @@ -208,8 +208,9 @@ ALTER SUBSCRIPTION name RENAME TO < are slot_name, synchronous_commit, binary, streaming, - disable_on_error, and - origin. + disable_on_error, + origin, and + min_apply_delay. diff --git a/doc/src/sgml/ref/create_subscription.sgml b/doc/src/sgml/ref/create_subscription.sgml index 7390c715bc..a794c07042 100644 --- a/doc/src/sgml/ref/create_subscription.sgml +++ b/doc/src/sgml/ref/create_subscription.sgml @@ -317,7 +317,36 @@ CREATE SUBSCRIPTION subscription_name - + + +min_apply_delay (integer) + +
Re: shared-memory based stats collector - v70
On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand wrote: > > Hi, > > On 8/9/22 6:00 PM, Greg Stark wrote: > > On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand wrote: > >> > >> What do you think about adding a function in core PG to provide such > >> functionality? (means being able to retrieve all the stats (+ eventually > >> add some filtering) without the need to connect to each database). > > I'm working on it myself too. I'll post a patch for discussion in a bit. > > Great! Thank you! So I was adding the code to pgstat.c because I had thought there were some data types I needed and/or static functions I needed. However you and Andres encouraged me to check again now. And indeed I was able, after fixing a couple things, to make the code work entirely externally. This is definitely not polished and there's a couple obvious things missing. But at the risk of embarrassment I've attached my WIP. Please be gentle :) I'll post the github link in a bit when I've fixed up some meta info. I'm definitely not wedded to the idea of using callbacks, it was just the most convenient way to get started, especially when I was putting the main loop in pgstat.c. Ideally I do want to keep open the possibility of streaming the results out without buffering the whole set in memory. > Out of curiosity, would you be also interested by such a feature for > previous versions (that will not get the patch in) ? I always had trouble understanding the existing stats code so I was hoping the new code would make it easier. It seems to have worked but it's possible I'm wrong and it was always possible and the problem was always just me :) -- greg /*- * * telemetry.c * * Most of this code was copied from pg_prewarm.c as a template. 
* * *- */ #include "postgres.h" #include #include #include #include "access/relation.h" #include "access/xact.h" #include "catalog/pg_class.h" #include "catalog/pg_type.h" #include "miscadmin.h" #include "pgstat.h" #include "postmaster/bgworker.h" #include "postmaster/interrupt.h" #include "storage/buf_internals.h" #include "storage/dsm.h" #include "storage/ipc.h" #include "storage/latch.h" #include "storage/lwlock.h" #include "storage/proc.h" #include "storage/procsignal.h" #include "storage/shmem.h" #include "storage/smgr.h" #include "tcop/tcopprot.h" #include "utils/acl.h" #include "utils/datetime.h" #include "utils/guc.h" #include "utils/memutils.h" #include "utils/rel.h" #include "utils/resowner.h" #include "telemetry.h" #include "telemetry_pgstat.h" PG_MODULE_MAGIC; /* We should already have included what we need to get uint64_t size_t fd_set socklen_t and struct sockaddr so don't let microhttpd include extra stuff for them */ #define MHD_PLATFORM_H #include /* MHD_USE_EPOLL */ /* MHD_USE_AUTO */ /* MHD_OPTION_CONNECTION_LIMIT */ /* MHD_OPTION_CONNECTION_TIMEOUT */ /* MHD_OPTION_EXTERNAL_LOGGER */ /* Background worker harness */ void _PG_init(void); /* Actual internal background worker main entry point */ static void telemetry_start_worker(unsigned short port); /* GUC variables. */ static int telemetry_port = 9187; /* TCP port to listen on for metrics */ static char *telemetry_listen_addresses; /* TCP listen addresses */ static bool telemetry = true; /* start worker automatically on default port */ static enum MHD_Result telemetry_handler(void *cls, struct MHD_Connection *connection, const char *url, const char *method, const char *version, const char *upload_data, size_t *upload_data_size, void **con_cls); /* * Module load callback. 
*/ void _PG_init(void) { DefineCustomIntVariable("telemetry.port", "TCP Port to serve metrics on by default", NULL, &telemetry_port, 9187, 1, 65535, PGC_SIGHUP, 0, /* flags */ NULL, NULL, NULL /* hooks */ ); DefineCustomStringVariable("telemetry.listen_addresses", "TCP Listen Addresses to serve metrics on by default", NULL, &telemetry_listen_addresses, "*", PGC_SIGHUP, GUC_LIST_INPUT, /* flags */ NULL, NULL, NULL /* hooks */ ); if (!process_shared_preload_libraries_in_progress) return; /* can't define PGC_POSTMASTER variable after startup */ DefineCustomBoolVariable("telemetry.start_server", "Starts the telemetry worker on startup.", NULL, &telemetry, true, PGC_POSTMASTER, 0, NULL, NULL, NULL); EmitWarningsOnPlaceholders("telemetry"); /* Register telemetry worker, if enabled. */ if (telemetry) telemetry_start_worker(telemetry_port); } static void telemetry_logger(void *arg, const char *fmt, va_list ap); struct MHD_OptionItem mhd_ops[] = { { MHD_OPTION_EXTERNAL_LOGGER, (intptr_t)telemetry_logger
Re: optimize lookups in snapshot [sub]xip arrays
On Wed, Aug 10, 2022 at 10:50:02AM +0700, John Naylor wrote: > LGTM, let's see what the buildfarm thinks of 0001. Thanks! I haven't noticed any related buildfarm failures yet. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: shared-memory based stats collector - v70
Hi, On 2022-08-10 14:18:25 -0400, Greg Stark wrote: > > I don't think that's a large enough issue to worry about unless you're > > polling at a very high rate, which'd be a bad idea in itself. If a backend > > can't get the lock for some stats change it'll defer flushing the stats a > > bit, > > so it'll not cause a lot of other problems. > > Hm. I wonder if we're on the same page about what constitutes a "high rate". > > I've seen people try to push Prometheus or other similar systems to 5s > poll intervals. That would be challenging for Postgres due to the > volume of statistics. The default is 30s and people often struggle to > even have that function for large fleets. But if you had a small > fleet, perhaps an IoT-style system with a "one large table" type of > schema you might well want stats every 5s or even every 1s. That's probably fine. Although I think you might run into trouble not from the stats subsystem side, but from the "amount of data" side. On a system with a lot of objects that can be a fair amount. If you really want to do very low latency stats reporting, I suspect you'd have to build an incremental system. > > I'm *dead* set against including catalog names in shared memory stats. > > That'll > > add a good amount of memory usage and complexity, without any sort of > > commensurate gain. > > Well it's pushing the complexity there from elsewhere. If the labels > aren't in the stats structures then the exporter needs to connect to > each database, gather all the names into some local cache and then it > needs to worry about keeping it up to date. And if there are any > database problems such as disk errors or catalog objects being locked > then your monitoring breaks though perhaps it can be limited to just > missing some object names or having out of date names. Shrug. If the stats system state desynchronizes from an alter table rename you'll also have a problem in monitoring. 
And even if you can benefit from having all that information, it'd still be an overhead borne by everybody for a very small share of users. > > > I also think it would be nice to have a change counter for every stat > > > object, or perhaps a change time. Prometheus wouldn't be able to make > > > use of it but other monitoring software might be able to receive only > > > metrics that have changed since the last update which would really > > > help on databases with large numbers of mostly static objects. > > > > I think you're proposing adding overhead that doesn't even have a real user. > > I guess I'm just brainstorming here. I don't need it currently, no. It > doesn't seem like significant overhead compared to the locking > and copying though? Yes, timestamps aren't cheap to determine (nor free to store, but that's a lesser issue). Greetings, Andres Freund
Re: shared-memory based stats collector - v70
Hi, On 2022-08-10 15:48:15 -0400, Greg Stark wrote: > One thing that's bugging me is that the names we use for these stats > are *all* over the place. Yes. I had a huge issue with this when polishing the patch. And Horiguchi-san did as well. I had to limit the amount of cleanup done to make it feasible to get anything committed. I think it's a bit less bad than before, but by no means good. > * The pg_stat_bgwriter view returns data from two different fixed > entries, the checkpointer and the bgwriter, is there a reason those > are kept separately but then reported as if they're one thing? Historical raisins. Checkpointer and bgwriter used to be one thing, but aren't anymore. Greetings, Andres Freund
Re: use SSE2 for is_valid_ascii
On Wed, Aug 10, 2022 at 01:50:14PM +0700, John Naylor wrote: > Here is an updated patch using the new USE_SSE2 symbol. The style is > different from the last one in that each stanza has platform-specific > code. I wanted to try it this way because is_valid_ascii() is already > written in SIMD-ish style using general purpose registers and bit > twiddling, so it seemed natural to see the two side-by-side. Sometimes > they can share the same comment. If we think this is bad for > readability, I can go back to one block each, but that way leads to > duplication of code and it's difficult to see what's different for > each platform, IMO. This is a neat patch. I don't know that we need an entirely separate code block for the USE_SSE2 path, but I do think that a little bit of extra commentary would improve the readability. IMO the existing comment for the zero accumulator has the right amount of detail.

+    /*
+     * Set all bits in each lane of the error accumulator where input
+     * bytes are zero.
+     */
+    error_cum = _mm_or_si128(error_cum,
+                             _mm_cmpeq_epi8(chunk, _mm_setzero_si128()));

I wonder if reusing a zero vector (instead of creating a new one every time) has any noticeable effect on performance. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
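For readers following along, the accumulator scheme being discussed can be put together as a small standalone C function. This is a simplified sketch, not the patch code: it assumes the input length is a multiple of 16 and omits the portable fallback path.

```c
#include <emmintrin.h>			/* SSE2 intrinsics */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the SSE2 approach from this thread: OR every 16-byte chunk
 * into an "error accumulator" so any byte with its high bit set (i.e.
 * non-ASCII) survives, also OR in the result of comparing each byte with
 * zero (embedded NUL bytes are invalid too), and test the accumulated
 * high bits once at the end.  Assumes len is a multiple of 16.
 */
static bool
is_valid_ascii_sse2(const unsigned char *s, size_t len)
{
	__m128i		error_cum = _mm_setzero_si128();

	for (size_t i = 0; i < len; i += 16)
	{
		__m128i		chunk = _mm_loadu_si128((const __m128i *) (s + i));

		/* Accumulate any high bits (non-ASCII bytes). */
		error_cum = _mm_or_si128(error_cum, chunk);

		/* Set all bits in each lane where the input byte is zero. */
		error_cum = _mm_or_si128(error_cum,
								 _mm_cmpeq_epi8(chunk, _mm_setzero_si128()));
	}

	/* Valid iff no lane of the accumulator has its high bit set. */
	return _mm_movemask_epi8(error_cum) == 0;
}
```

As the follow-up messages note, the `_mm_setzero_si128()` inside the loop is effectively free: the compiler materializes the zero register once before the loop.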
Re: Checking pgwin32_is_junction() errors
On Tue, Aug 9, 2022 at 10:59 PM wrote: > On 2022-08-09 05:44, Thomas Munro wrote: > > On Tue, Aug 9, 2022 at 8:30 AM Thomas Munro > > wrote: > >> On Mon, Aug 8, 2022 at 8:23 PM wrote: > >> > "C:/HOME" is the junction point to the second volume on my hard drive - > >> > "\??\Volume{GUID}\" which name pgreadlink() erroneously strips here: > >> > https://github.com/postgres/postgres/blob/7e29a79a46d30dc236d097825ab849158929d977/src/port/dirmod.c#L357. > > > >> ... Would it > >> be better to say: if it doesn't begin with "\??\X:", where X could be > >> any letter, then don't modify it? > > > > Concretely, I wonder if this is a good fix at least in the short term. > > Does this work for you, and do the logic and explanation make sense? > > Yes, this patch works well with my junction points. Thanks for testing! I did a bit more reading on this stuff, so that I could update the comments with the correct terminology from Windows APIs. I also realised that the pattern we could accept to symlink() and expect to work is not just "C:..." (could be RtlPathTypeDriveRelative, which wouldn't actually work in a junction point) but "C:\..." (RtlPathTypeDriveAbsolute). I tweaked it a bit to test for that.

> I checked a few variants:
>
> 21.07.2022  15:11    HOME  [\??\Volume{GUID}\]
> 09.08.2022  15:06    Test1 [\\?\Volume{GUID}\]
> 09.08.2022  15:06    Test2 [\\.\Volume{GUID}\]
> 09.08.2022  15:17    Test3 [\??\Volume{GUID}\]
> 09.08.2022  15:27    Test4 [C:\temp\1]
> 09.08.2022  15:28    Test5 [C:\HOME\Temp\1]

One more thing I wondered about, now that we're following junctions outside PGDATA: can a junction point to another junction? If so, I didn't allow for that: stat() gives up after one hop, because I figured that was enough for the stuff we expect inside PGDATA and I couldn't find any evidence in the man pages that referred to chains. 
But if you *are* allowed to create a junction "c:\huey" that points to junction "c:\dewey" that points to "c:\louie", and then you do initdb -D c:\huey\pgdata, then I guess it would fail. Would you mind checking if that is a real possibility, and if so, testing this chain-following patch to see if it fixes it?

> After hours of reading the documentation and debugging, it seems to me
> we can use REPARSE_GUID_DATA_BUFFER structure instead of our
> REPARSE_JUNCTION_DATA_BUFFER [1]. DataBuffer doesn't contain any prefixes,
> so we don't need to strip them. But we still need to construct a correct
> volume name if a junction point is a volume mount point. Is it worth
> checking this idea?

I don't know. I think I prefer our current approach, because it can handle anything (raw/full NT paths) and doesn't try to be very clever, and I don't want to change to a different scheme for no real benefit...

From 6ed3edc77275aa51d1d0b1012c0a36db65735d39 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Thu, 11 Aug 2022 10:42:13 +1200
Subject: [PATCH v2 1/2] Fix readlink() for general Windows junction points.

Our symlink() and readlink() replacements perform a naive transformation from DOS paths to NT paths and back, as required by the junction point APIs. This was enough for the "drive absolute" paths we expect users to provide for tablespaces, for example "d:\my\fast\storage".

Since commit c5cb8f3b taught stat() to follow symlinks, and since initdb uses pg_mkdir_p(), and that examines parent directories, our humble readlink() implementation can now be exposed to junction points not of PostgreSQL origin. Those might be corrupted by our naive path mangling, which doesn't really understand NT paths in general.

Simply decline to transform paths that don't look like a drive absolute path. That means that readlink() returns the NT path directly when checking a parent directory of PGDATA that happens to point to a drive using "rooted" format. 
That works for the purposes of our stat() emulation.

Reported-by: Roman Zharkov
Discussion: https://postgr.es/m/4590c37927d7b8ee84f9855d83229018%40postgrespro.ru
---
 src/port/dirmod.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/src/port/dirmod.c b/src/port/dirmod.c
index 2818bfd2e9..53310baafb 100644
--- a/src/port/dirmod.c
+++ b/src/port/dirmod.c
@@ -199,7 +199,11 @@ pgsymlink(const char *oldpath, const char *newpath)
 	if (dirhandle == INVALID_HANDLE_VALUE)
 		return -1;
 
-	/* make sure we have an unparsed native win32 path */
+	/*
+	 * We expect either an NT path or a "drive absolute" path like "C:\..."
+	 * that we convert to an NT path by bolting on a prefix and converting any
+	 * Unix-style path delimiters to NT-style.
+	 */
 	if (memcmp("\\??\\", oldpath, 4) != 0)
 		snprintf(nativeTarget, sizeof(nativeTarget), "\\??\\%s", oldpath);
 	else
@@ -351,10 +355,21 @@ pgreadlink(const char *path, char *buf, size_t size)
 	}
 
 	/*
-	 * If the path starts with "\??\", which it will do in most (all?) cases,
-	 * strip those out.
+	 * If the path starts with "\??\" follo
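The "drive absolute" test described in the commit message boils down to matching the pattern `\??\X:\` after the NT prefix. A minimal illustrative check might look like the following — the helper name and shape are invented for this sketch and are not taken from the patch:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/*
 * Illustrative helper only (not patch code): does an NT-prefixed path
 * look "drive absolute", i.e. "\??\X:\..."?  Anything else, such as
 * "\??\Volume{GUID}\..." or the drive-relative form "\??\X:foo", would
 * be left untransformed by readlink().
 */
static bool
looks_drive_absolute(const char *p)
{
	return p[0] == '\\' && p[1] == '?' && p[2] == '?' && p[3] == '\\' &&
		isalpha((unsigned char) p[4]) && p[5] == ':' && p[6] == '\\';
}
```

Note that the short-circuit evaluation keeps the check safe even for strings shorter than seven characters, since reading stops at the first mismatching (or terminating NUL) byte.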
Re: Get the statistics based on the application name and IP address
Hi, On Wed, Aug 10, 2022 at 10:42:31PM +0500, Ibrar Ahmed wrote: > On Mon, Aug 8, 2022 at 10:11 PM Julien Rouhaud wrote: > > > First, for now each entry contains its own > > query text in the query file. There can already be some duplication, which > > isn't great, but adding the application_name and/or IP address will make > > things > > way worse, so you would probably need to fix that first. > > I doubt that makes it worse because these (IP and Application) will be part > of > the key, not the query text. It's because you want to add new elements to the key that it would make it worse, as the exact same query text will be stored much more often. > But yes, I agree that it will increase the > footprint of rows, > excluding query text. I don't think that this part should be a concern. > I am not 100% sure about the query text duplication but will look at that > in detail, > if you have more insight, then it will help to solve that. You can refer to the mentioned thread for the (only?) discussion about that. > > There has been some > > discussion about it recently (1) but more work and benchmarking are needed. > > > > The other problem is the multiplication of entries. It's a well known > > limitation that pg_stat_statements evictions are so costly that it makes it > > unusable. The last numbers I saw about it were ~55% overhead (2). Adding > > application_name or ip address to the key would probably make > > pg_stat_statements unusable for anyone who would actually need those > > metrics. > > > > I am sure adding a new item in the key does not affect the performance of > evictions of the row, > as it will not be part of that area. I am doing some benchmarking and > hacking to reduce that and will > send results with the patch. Sorry if that was unclear. I didn't mean that adding new items to the key would make evictions way costlier (although it would have some impact), but much more frequent. 
Adding application_name and the IP to the key can very easily amplify the number of entries so much that you will either need an unreasonable value for pg_stat_statements.max (which would likely bring its own set of problems) if possible at all, or evict entries frequently.
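To make the multiplication concrete, here is a back-of-the-envelope calculation; the workload numbers are assumed for illustration, not measurements from the thread:

```c
#include <assert.h>

/*
 * Purely illustrative arithmetic.  Today the pg_stat_statements key is
 * roughly (userid, dbid, queryid), so each distinct query yields about
 * one entry.  Adding application_name and client IP to the key
 * multiplies the entry count instead of adding to it.
 */
static unsigned long
projected_entries(unsigned long distinct_queries,
				  unsigned long app_names,
				  unsigned long client_ips)
{
	return distinct_queries * app_names * client_ips;
}
```

With, say, 5000 distinct queries, 20 application names, and 50 client addresses, the projected entry count is 5 million — far beyond any reasonable pg_stat_statements.max, which is why frequent evictions would follow.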
Re: Allow logical replication to copy tables in binary format
On Wed, Aug 10, 2022, at 12:03 PM, Melih Mutlu wrote: > I see that logical replication subscriptions have an option to enable binary > [1]. > When it's enabled, subscription requests publisher to send data in binary > format. > But this is only the case for apply phase. In tablesync, tables are still > copied as text. This option could have been included in commit 9de77b54531; it wasn't. Maybe it wasn't considered because the initial table synchronization can be a separate step in your logical replication setup idk. I agree that the binary option should be available for the initial table synchronization. > To copy tables, COPY command is used and that command supports copying in > binary. So it seemed to me possible to copy in binary for tablesync too. > I'm not sure if there is a reason to always copy tables in text format. But I > couldn't see why not to do it in binary if it's enabled. The reason to use the text format is that the binary format is error prone. There are restrictions while using the binary format. For example, if your schema has different data types for a certain column, the copy will fail. Even with such restrictions, I think it is worth adding it. > You can find the small patch that only enables binary copy attached. I have a few points about your implementation.

* Are we considering supporting prior Postgres versions too? These releases support binary mode but it could be an unexpected behavior (initial sync in binary mode) for a publisher using 14 or 15 and a subscriber using 16. IMO you should only allow it for publishers on 16 or later.
* Docs should say that the binary option also applies to initial table synchronization and possibly emphasize some of the restrictions.
* Tests. Are the current tests enough? 014_binary.pl.

-- Euler Taveira EDB https://www.enterprisedb.com/
Re: optimize lookups in snapshot [sub]xip arrays
On Thu, Aug 11, 2022 at 4:46 AM Nathan Bossart wrote: > > On Wed, Aug 10, 2022 at 10:50:02AM +0700, John Naylor wrote: > > LGTM, let's see what the buildfarm thinks of 0001. > > Thanks! I haven't noticed any related buildfarm failures yet. I was waiting for all the Windows animals to report in, and it looks like they have, so I've pushed 0002. Thanks for picking this topic up again! -- John Naylor EDB: http://www.enterprisedb.com
Re: Using each rel as both outer and inner for JOIN_ANTI
On Wed, Aug 10, 2022 at 4:40 PM Alvaro Herrera wrote: > On 2022-Aug-10, Richard Guo wrote: > > > The right-anti join plan has the same cost estimation with right join > > plan in this case. So would you please help to test what the right join > > plan looks like in your env for the query below? > > > > select * from foo left join bar on foo.a = bar.c; > > You're right, it does.
>
> 55432 16devel 475322=# explain (analyze, buffers) select * from foo left join bar on foo.a = bar.c;
>                                QUERY PLAN
> ──────────────────────────────────────────────────────────────────────
>  Hash Right Join  (cost=1.23..90875.24 rows=10 width=20) (actual time=456.410..456.415 rows=10 loops=1)
>    Hash Cond: (bar.c = foo.a)
>    Buffers: shared hit=15852 read=6273
>    ->  Seq Scan on bar  (cost=0.00..72124.00 rows=500 width=12) (actual time=0.036..210.468 rows=500 loops=1)
>          Buffers: shared hit=15852 read=6272
>    ->  Hash  (cost=1.10..1.10 rows=10 width=8) (actual time=0.037..0.038 rows=10 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 9kB
>          Buffers: shared read=1
>          ->  Seq Scan on foo  (cost=0.00..1.10 rows=10 width=8) (actual time=0.022..0.026 rows=10 loops=1)
>                Buffers: shared read=1
>  Planning:
>    Buffers: shared hit=92 read=13
>  Planning Time: 1.077 ms
>  Execution Time: 456.458 ms
> (14 filas)

Thanks for helping test. Comparing the anti-join plan and the right join plan, the estimated costs and the actual execution times diverge a lot. It seems the cost estimate of the hashjoin path is not that precise for this case even in the unpatched code. Maybe this is something we need to improve.

Thanks
Richard
Re: [RFC] building postgres with meson
On Thu, Aug 11, 2022 at 12:19 AM Andres Freund wrote: > I tried to ignore various generated files in the source tree, but I don't > think it's doable for all of them. Consider > e.g. src/backend/utils/misc/guc-file.c which is gets built via #include > "guc-file.c" from gram.c With a bit of work, we could probably get rid of those includes. See 27199058d98ef7f for one example. -- John Naylor EDB: http://www.enterprisedb.com
Re: [RFC] building postgres with meson
John Naylor writes: > With a bit of work, we could probably get rid of those includes. See > 27199058d98ef7f for one example. Yeah --- it would mean creating gram.h files for all the bison grammars not just a few of them, but it's certainly do-able if there's motivation to make the changes. Most of the files that are done that way date from before we knew about flex's %top. regards, tom lane
Re: [RFC] building postgres with meson
I wrote: > John Naylor writes: >> With a bit of work, we could probably get rid of those includes. See >> 27199058d98ef7f for one example. > Yeah --- it would mean creating gram.h files for all the bison grammars > not just a few of them, but it's certainly do-able if there's motivation > to make the changes. Most of the files that are done that way date > from before we knew about flex's %top. BTW, 72b1e3a21 is another useful precedent in this area. regards, tom lane
Re: [RFC] building postgres with meson
On Thu, Aug 11, 2022 at 10:37 AM Tom Lane wrote: > > John Naylor writes: > > With a bit of work, we could probably get rid of those includes. See > > 27199058d98ef7f for one example. > > Yeah --- it would mean creating gram.h files for all the bison grammars > not just a few of them, but it's certainly do-able if there's motivation > to make the changes. Most of the files that are done that way date > from before we knew about flex's %top. I'll volunteer to work on this unless an easier solution happens to come along in the next couple days. (aside: guc-file.l doesn't have a grammar, so not yet sure if that makes the issue easier or harder...) -- John Naylor EDB: http://www.enterprisedb.com
tests and meson - test names and file locations
Hi,

One of the things motivating me to work on the meson conversion is the ability to run tests in an easily understandable way. Meson has a test runner that allows both running all tests at once and running subsets of tests.

= Test and testsuite naming =

Each test has a unique name, and 0-n labels (called 'suites'). One can run tests by their name, or select tests by suites. The way I've organized it so far is that tests are named like this:

  main/pg_regress
  main/isolation
  recovery/t/018_wal_optimize.pl
  pg_prewarm/t/001_basic.pl
  test_shm_mq/pg_regress
  psql/t/001_basic.pl

we could also name tests by their full path, but that makes the display very unwieldy. At the moment there are three suites differentiating by the type of test: 'pg_regress', 'isolation' and 'tap'. There's also a separate "axis" of suites, describing what's being tested, e.g. 'main', 'test_decoding', 'recovery' etc. That currently works out to each test having two suites, although I've wondered about adding a 'contrib' suite as well. Perhaps the pg_regress suite should just be named 'regress'? And perhaps the t/ in the tap tests is superfluous - the display is already fairly wide?

= Log and Data locations =

To make things like the selection of log files for a specific test easier, I've so far set it up so that test data and logs are stored in a separate directory from the sources:

  testrun///

The runner now creates a test.start at the start of a test and either test.success or test.failure at the end. That should make it pretty easy for e.g. the buildfarm and CI to make the logs for a failed test easily accessible. I've spent far too much time going through the ~hundred logs in src/test/recovery/ that the buildfarm displays as one thing.

I really like having all the test data separately from the sources, but I get that that's not what we've done so far. It's doable to just mirror the current choice, but I don't think we should. But I won't push too hard against keeping things the same. 
I do wonder if we should put test data and log files in a separate directory tree, but that'd be a bit more work probably. Any comments on the above?

= Example outputs =

Here's an example output that you mostly should be able to make sense of now:

$ m test --print-errorlogs
ninja: Entering directory `/tmp/meson'
ninja: no work to do.
  1/242 postgresql:setup / tmp_install                                 OK     0.33s
  2/242 postgresql:tap+pg_upgrade / pg_upgrade/t/001_basic.pl          OK     0.35s   8 subtests passed
  3/242 postgresql:tap+recovery / recovery/t/011_crash_recovery.pl     OK     2.67s   3 subtests passed
  4/242 postgresql:tap+recovery / recovery/t/013_crash_restart.pl      OK     2.91s   18 subtests passed
  5/242 postgresql:tap+recovery / recovery/t/014_unlogged_reinit.pl    OK     3.01s   23 subtests passed
  6/242 postgresql:tap+recovery / recovery/t/022_crash_temp_files.pl   OK     3.16s   11 subtests passed
  7/242 postgresql:tap+recovery / recovery/t/016_min_consistency.pl    OK     3.43s   1 subtests passed
  8/242 postgresql:tap+recovery / recovery/t/021_row_visibility.pl     OK     3.46s   10 subtests passed
  9/242 postgresql:isolation+tcn / tcn/isolation                       OK     3.42s
 10/242 postgresql:tap+recovery / recovery/t/023_pitr_prepared_xact.pl OK     3.63s   1 subtests passed
...
241/242 postgresql:isolation+main / main/isolation                     OK    46.69s
242/242 postgresql:tap+pg_upgrade / pg_upgrade/t/002_pg_upgrade.pl     OK    57.00s   13 subtests passed

Ok:                 242
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            0
Timeout:            0

Full log written to /tmp/meson/meson-logs/testlog.txt

The 'postgresql' is because meson supports subprojects (both to provide dependencies if needed, and "real" subprojects), and their tests can be run at once. 
If a test fails it'll show the error output at the time of test:

 39/242 postgresql:pg_regress+cube / cube/pg_regress   FAIL   3.74s   exit status 1
>>> REGRESS_SHLIB=/tmp/meson/src/test/regress/regress.so MALLOC_PERTURB_=44
>>> PG_TEST_EXTRA='kerberos ldap ssl'
>>> PG_REGRESS=/tmp/meson/src/test/regress/pg_regress
>>> PATH=/tmp/meson/tmp_install/tmp/meson-install/bin:/tmp/meson/contrib/cube:/home/andres/bin/perl5/bin:/home/andres/bin/pg:/home/andres/bin/bin:/usr/sbin:/sbin:/home/andres/bin/pg:/home/andres/bin/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/
Re: [RFC] building postgres with meson
Hi, On 2022-08-11 10:57:33 +0700, John Naylor wrote: > I'll volunteer to work on this unless an easier solution happens to > come along in the next couple days. Cool! > (aside: guc-file.l doesn't have a grammar, so not yet sure if that makes the > issue easier or harder...) I think we should consider compiling it separately from guc.c. guc.c already compiles quite slowly (iirc beat only by ecpg and main grammar), and it's a relatively commonly changed source file. It might even be a good idea to split guc.c so it only contains the settings arrays + direct dependencies... Greetings, Andres Freund
Re: use SSE2 for is_valid_ascii
On Thu, Aug 11, 2022 at 5:31 AM Nathan Bossart wrote: > > This is a neat patch. I don't know that we need an entirely separate code > block for the USE_SSE2 path, but I do think that a little bit of extra > commentary would improve the readability. IMO the existing comment for the > zero accumulator has the right amount of detail. > > + /* > +* Set all bits in each lane of the error accumulator where > input > +* bytes are zero. > +*/ > + error_cum = _mm_or_si128(error_cum, > + > _mm_cmpeq_epi8(chunk, _mm_setzero_si128())); Okay, I will think about the comments, thanks for looking. > I wonder if reusing a zero vector (instead of creating a new one every > time) has any noticeable effect on performance. Creating a zeroed register is just FOO PXOR FOO, which should get hoisted out of the (unrolled in this case) loop, and which a recent CPU will just map to a hard-coded zero in the register file, in which case the execution latency is 0 cycles. :-) -- John Naylor EDB: http://www.enterprisedb.com
Re: Refactor backup related code (was: Is it correct to say, "invalid data in file \"%s\"", BACKUP_LABEL_FILE in do_pg_backup_stop?)
On Mon, Aug 8, 2022 at 7:20 PM Bharath Rupireddy wrote: > > > Please review. > > I added this to current CF - https://commitfest.postgresql.org/39/3808/ Here's the v2 patch, no change from v1, just rebased because of commit a8c012869763c711abc9085f54b2a100b60a85fa (Move basebackup code to new directory src/backend/backup). -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/ v2-0001-Refactor-backup-related-code.patch Description: Binary data
Re: [RFC] building postgres with meson
John Naylor writes: > I'll volunteer to work on this unless an easier solution happens to > come along in the next couple days. (aside: guc-file.l doesn't have a > grammar, so not yet sure if that makes the issue easier or harder...) That one's probably mostly about the issue mentioned in the other commit you identified. Without %top, it's impossible to make a standalone flex module honor the rule about thou-shalt-have-no- other-includes-before-postgres.h. So embedding it in some other file was originally a necessity for that. Now that we know how to fix that, it's just a matter of making sure that any other stuff the scanner needs is available from a .h file. regards, tom lane
Re: optimize lookups in snapshot [sub]xip arrays
On Thu, Aug 11, 2022 at 09:50:54AM +0700, John Naylor wrote: > I was waiting for all the Windows animals to report in, and it looks > like they have, so I've pushed 0002. Thanks for picking this topic up > again! Thanks for reviewing and committing. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: making relfilenodes 56 bits
On Tue, Aug 9, 2022 at 8:51 PM Robert Haas wrote: > > On Fri, Aug 5, 2022 at 3:25 AM Dilip Kumar wrote: > > I think even if we start the range from the 4 billion we can not avoid > > keeping two separate ranges for system and user tables otherwise the > > next upgrade where old and new clusters both have 56 bits > > relfilenumber will get conflicting files. And, for the same reason we > > still have to call SetNextRelFileNumber() during upgrade. > > Well, my proposal to move everything from the new cluster up to higher > numbers would address this without requiring two ranges. > > > So the idea is, we will be having 2 ranges for relfilenumbers, system > > range will start from 4 billion and user range maybe something around > > 4.1 (I think we can keep it very small though, just reserve 50k > > relfilenumber for system for future expansion and start user range > > from there). > > A disadvantage of this is that it basically means all the file names > in new clusters are going to be 10 characters long. That's not a big > disadvantage, but it's not wonderful. File names that are only 5-7 > characters long are common today, and easier to remember. That's correct. > > So now system tables have no issues and also the user tables from the > > old cluster have no issues. But pg_largeobject might get conflict > > when both old and new cluster are using 56 bits relfilenumber, because > > it is possible that in the new cluster some other system table gets > > that relfilenumber which is used by pg_largeobject in the old cluster. > > > > This could be resolved if we allocate pg_largeobject's relfilenumber > > from the user range, that means this relfilenumber will always be the > > first value from the user range. 
> > So now if the old and new cluster > > both are using 56bits relfilenumber then pg_largeobject in both > > cluster would have got the same relfilenumber and if the old cluster > > is using the current 32 bits relfilenode system then the whole range > > of the new cluster is completely different than that of the old > > cluster. > > I think this can work, but it does rely to some extent on the fact > that there are no other tables which need to be treated like > pg_largeobject. If there were others, they'd need fixed starting > RelFileNumber assignments, or some other trick, like renumbering them > twice in the cluster, first to a known-unused value and then back to > the proper value. You'd have trouble if in the other cluster > pg_largeobject was 4bn+1 and pg_largeobject2 was 4bn+2 and in the new > cluster the reverse, without some hackery. Agreed, if there are more catalogs like pg_largeobject then it would require some hacking. > I do feel like your idea here has some advantages - my proposal > requires rewriting all the catalogs in the new cluster before we do > anything else, and that's going to take some time even though they > should be small. But I also feel like it has some disadvantages: it > seems to rely on complicated reasoning and special cases more than I'd > like. One other advantage of your approach is that we start the "nextrelfilenumber" after the old cluster's relfilenumber range, so we only need to set the "nextrelfilenumber" once at the beginning; while upgrading each object we don't need to set it every time because it is already higher than the complete old cluster range. In the other two approaches we will have to try to set the nextrelfilenumber every time we preserve the relfilenumber during upgrade. Other than these two approaches we have another approach (what the patch set is already doing) where we keep the system relfilenumber range same as Oid. 
I know you have already pointed out that this might have some hidden bug, but one advantage of this approach is that it is simpler than the above two approaches, in the sense that it doesn't need to maintain two ranges and it also doesn't need to rewrite all system tables in the new cluster. So I think it would be good if we can get others' opinions on all these 3 approaches. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
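As a rough illustration of the two-range scheme from upthread — all names and constants below are invented for the sketch, not taken from any patch version — system relfilenumbers start at 4 billion, a small window (50k in the example above) is reserved for future system use, and the user range starts right after it, so pg_largeobject, allocated first from the user range, always gets the same value:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t RelFileNumber;

/* Invented constants for illustration: 4 billion + a 50k reserved window. */
#define SYSTEM_RANGE_START	((RelFileNumber) 4000000000ULL)
#define USER_RANGE_START	(SYSTEM_RANGE_START + 50000)

static RelFileNumber next_system = SYSTEM_RANGE_START;
static RelFileNumber next_user = USER_RANGE_START;

/* Hypothetical allocator: system catalogs and user tables draw from disjoint ranges. */
static RelFileNumber
alloc_relfilenumber(bool is_system_catalog)
{
	if (is_system_catalog)
	{
		/* The reserved system window must not spill into the user range. */
		assert(next_system < USER_RANGE_START);
		return next_system++;
	}
	return next_user++;
}
```

The point of the split is that both ranges sit entirely above the 32-bit Oid-based relfilenumbers an old cluster can contain, while the first user-range value is deterministic across clusters.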
Re: SUBTRANS: Minimizing calls to SubTransSetParent()
On Wed, Aug 10, 2022 at 6:31 PM Simon Riggs wrote: > > On Wed, 10 Aug 2022 at 08:34, Dilip Kumar wrote: > > > > Am I still missing something? > > No, you have found a dependency between the patches that I was unaware > of. So there is no bug if you apply both patches. Right > > > So I still think some adjustment is required in XidInMVCCSnapdhot() > > That is one way to resolve the issue, but not the only one. I can also > change AssignTransactionId() to recursively register parent xids for > all of a subxid's parents. > > I will add in a test case and resolve the dependency in my next patch. Okay, thanks, I will look into the updated patch after you submit that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: use SSE2 for is_valid_ascii
On Thu, Aug 11, 2022 at 11:10:34AM +0700, John Naylor wrote: >> I wonder if reusing a zero vector (instead of creating a new one every >> time) has any noticeable effect on performance. > > Creating a zeroed register is just FOO PXOR FOO, which should get > hoisted out of the (unrolled in this case) loop, and which a recent > CPU will just map to a hard-coded zero in the register file, in which > case the execution latency is 0 cycles. :-)

Ah, indeed. At -O2, my compiler seems to zero out two registers before the loop with either approach:

    pxor    %xmm0, %xmm0    ; accumulator
    pxor    %xmm2, %xmm2    ; always zeros

And within the loop, I see the following:

    movdqu  (%rdi), %xmm1
    movdqu  (%rdi), %xmm3
    addq    $16, %rdi
    pcmpeqb %xmm2, %xmm1    ; check for zeros
    por     %xmm3, %xmm0    ; OR data into accumulator
    por     %xmm1, %xmm0    ; OR zero check results into accumulator
    cmpq    %rdi, %rsi

So the call to _mm_setzero_si128() within the loop is fine. Apologies for the noise. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Stats collector's idx_blks_hit value is highly misleading in practice
Hi again, Having played with the task for a little while, I am no longer sure it completely justifies the effort involved. The reason is that the task requires modifying the buffer pool in one way or the other, which implies (a) significant effort on performance testing and (b) changes in the buffer pool interfaces that the community might not welcome just to get 1-2 extra statistics numbers. Any ideas? Regards, Sergey
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Mon, Aug 8, 2022 at 9:34 AM Amit Kapila wrote: > > On Wed, Aug 3, 2022 at 1:20 PM Masahiko Sawada wrote: > > > > > > Oops, thanks for pointing it out. I've fixed it and attached updated > > patches for all branches so as not to confuse the patch version. There > > is no update from v12 patch on REL12 - master patches. > > > > Thanks for the updated patches, the changes look good to me. > Horiguchi-San, and others, do you have any further comments on this or > do you want to spend time in review of it? If not, I would like to > push this after the current minor version release. > Pushed. -- With Regards, Amit Kapila.
Re: shared-memory based stats collector - v70
Hi, On 8/10/22 11:25 PM, Greg Stark wrote: On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand wrote: Hi, On 8/9/22 6:00 PM, Greg Stark wrote: On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand wrote: What do you think about adding a function in core PG to provide such functionality? (means being able to retrieve all the stats (+ eventually add some filtering) without the need to connect to each database). I'm working on it myself too. I'll post a patch for discussion in a bit. Great! Thank you! So I was adding the code to pgstat.c because I had thought there were some data types I needed and/or static functions I needed. However you and Andres encouraged me to check again now. And indeed I was able, after fixing a couple things, to make the code work entirely externally. Nice! Though I still think to have an SQL API in core could be useful to. As Andres was not -1 about that idea (as it should be low cost to add and maintain) as long as somebody cares enough to write something: then I'll give it a try and submit a patch for it. This is definitely not polished and there's a couple obvious things missing. But at the risk of embarrassment I've attached my WIP. Please be gentle :) I'll post the github link in a bit when I've fixed up some meta info. Thanks! I will have a look at it on github (once you share the link). Regards, -- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
Re: Perform streaming logical transactions by background workers and parallel apply
On Thu, Aug 4, 2022 at 12:07 PM wangw.f...@fujitsu.com wrote: > > On Thurs, Jul 28, 2022 at 13:20 PM Kuroda, Hayato/黒田 隼人 > wrote: > > > > Dear Wang-san, > > > > Hi, I'm also interested in the patch and I started to review this. > > Followings are comments about 0001. > > Thanks for your kind review and comments. > To avoid making this thread too long, I will reply to all of your comments > (#1~#13) in this email. > > > 1. terminology > > > > In your patch a new worker "apply background worker" has been introduced, > > but I thought it might be confusing because PostgreSQL already has the worker > > "background worker". > > Both the apply worker and apply bgworker are categorized as bgworkers. > > Do you have any reasons not to use "apply parallel worker" or "apply > > streaming > > worker"? > > (Note that I'm not a native English speaker) > > Since we will later consider applying non-streamed transactions in parallel, I > think "apply streaming worker" might not be very suitable. I think PostgreSQL > also has the worker "parallel worker", so for "apply parallel worker" and > "apply background worker", I feel that "apply background worker" will make the > relationship between workers more clear. ("[main] apply worker" and "apply > background worker") > But, on similar lines, we do have vacuumparallel.c for parallelizing index vacuum. I agree with Kuroda-San on this point that the currently proposed terminology doesn't sound very clear. The other options that come to my mind are "apply streaming transaction worker", "apply parallel worker" and the file name could be applystreamworker.c, applyparallel.c, applyparallelworker.c, etc. I see the point why you are hesitant in calling it "apply parallel worker" but it is quite possible that even for non-streamed xacts, we will share quite some part of this code. Thoughts? -- With Regards, Amit Kapila.