Re: "make installcheck" fails in src/test/recovery
On Fri, Apr 19, 2019 at 11:23:04AM +0900, Michael Paquier wrote:
> The dependency is not strongly mandatory for the test coverage. I'll
> just go remove it if that's an annoyance.

And done.
--
Michael
RE: Global shared meta cache
From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
> Do you have any thoughts?

Hi, I updated my idea, hoping to get some feedback.

[TL;DR] The basic idea consists of the following four points:

A. Users can choose which databases put their caches (relation and catalog) in shared memory, and how much memory is used.
B. Caches of committed data live in shared memory; caches of uncommitted data live in local memory.
C. Caches in shared memory carry xid information (xmin, xmax).
D. Not-recently-used cache entries are evicted from shared memory.

[A]
Regarding point A, I can imagine that some databases are connected to by lots of clients while others aren't. So I introduced a new parameter in postgresql.conf, "shared_meta_cache", which is disabled by default and needs a server restart to enable, e.g. shared_meta_cache = 'db1:500MB, db2:100MB'. Some catcaches, like pg_database, are shared across all databases, so such shared catcaches are allocated in a dedicated space within shared memory. This space can be controlled by a "shared_meta_global_catcache" parameter, which is named after the global directory. But I want this parameter to be hidden in postgresql.conf to keep things simple for users; it's too detailed.

[B & C]
Regarding B and C, the motivation is that we don't want other backends to see uncommitted tables. The search order is local memory -> shared memory -> disk. A backend searches the cache in shared memory based on its own snapshot and the xid of the cache entry. When a cache entry is not found in shared memory, an entry with xmin is created in shared memory (but not in local memory). When a cache definition is changed by DDL, a new entry is created in local memory, and subsequent commands refer to the local cache if needed. On commit, the local cache is cleared and the shared cache is updated. This update is done by adding xmax to the old entry and also making a new one with xmin. The idea behind adding a new one is that a newly created cache entry (for a new or altered table) is likely to be used in subsequent transactions.
At this point maybe we can make use of the current invalidation mechanism, even though invalidation messages to other backends are not sent.

[D]
As for D, I'm thinking of benchmarking with a simple LRU. If the performance is bad, I'll switch to another algorithm such as Clock. We don't care about eviction of the local cache because its lifetime is a single transaction, and I don't want it to bloat.

best regards,
Takeshi Ideriha
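[Editor's sketch] To make points B and C above concrete, here is a minimal sketch of the visibility test a backend could apply to a shared cache entry. This is illustrative C, not PostgreSQL code: the names (SharedCacheEntry, entry_visible) and the single-xid snapshot model are invented, and real snapshot semantics (in-progress xids, subtransactions, wraparound) are considerably more involved.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t xid_t;

/* Simplified shared-cache entry: xmin is the xid that created it,
 * xmax is the xid that superseded it (0 = still live). */
typedef struct SharedCacheEntry
{
    xid_t xmin;
    xid_t xmax;
} SharedCacheEntry;

/* Hypothetical visibility test: an entry is visible to a snapshot if its
 * creator is older than everything the snapshot still considers running,
 * and it has not already been superseded by a visible change. */
bool
entry_visible(const SharedCacheEntry *e, xid_t snapshot_xmin)
{
    if (e->xmin >= snapshot_xmin)           /* creator not yet visible */
        return false;
    if (e->xmax != 0 && e->xmax < snapshot_xmin)   /* already superseded */
        return false;
    return true;
}
```

Under this model, a committed DDL would stamp the old entry's xmax and insert a replacement entry with a fresh xmin, so backends holding older snapshots keep seeing the old definition.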
Re: [HACKERS] proposal: schema variables
fresh rebase

Regards

Pavel

schema-variables-20180419.patch.gz
Patch: doc for pg_logical_emit_message()
Hi, Hackers

pg_logical_emit_message() can be used by any user, but the following documentation says that it can be used only by superusers:

> Table 9.88. Replication SQL Functions
> Use of these functions is restricted to superusers.

I think that pg_logical_emit_message() should be usable by any user, so I attach a documentation patch.

Ryo Matsumura

pg_logical_emit_message_doc.patch
Re: bug in update tuple routing with foreign partitions
On 2019/04/19 14:39, Etsuro Fujita wrote:
> (2019/04/19 13:00), Amit Langote wrote:
>> On 2019/04/18 22:10, Etsuro Fujita wrote:
>>> * I kept all the changes in the previous patch, because otherwise
>>> postgres_fdw would fail to release resources for a foreign-insert
>>> operation created by postgresBeginForeignInsert() for a tuple-routable
>>> foreign table (ie, a foreign-table subplan resultrel that has been updated
>>> already) during postgresEndForeignInsert().
>>
>> Hmm are you saying that the cases for which we'll still allow tuple
>> routing (foreign table receiving moved-in rows has already been updated),
>> there will be two fmstates to be released -- the original fmstate
>> (UPDATE's) and aux_fmstate (INSERT's)?
>
> Yeah, but I noticed that that explanation was not correct. (I think I was
> really in a hurry.) See the correction in [1].

Ah, I hadn't noticed your corrected description in [1] even though your message was in my inbox before I sent my email.

Thanks,
Amit
Re: Unhappy about API changes in the no-fsm-for-small-rels patch
On April 18, 2019 7:53:58 PM PDT, Tom Lane wrote:
> Amit Kapila writes:
>> I am thinking that we should at least give it a try to move the map to
>> rel cache level to see how easy or difficult it is and also let's wait
>> for a day or two to see if Andres/Tom has to say anything about this
>> or on the response by me above to improve the current patch.
>
> FWIW, it's hard for me to see how moving the map to the relcache isn't
> the right thing to do. You will lose state during a relcache flush,
> but that's still miles better than how often the state gets lost now.

+1
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: bug in update tuple routing with foreign partitions
(2019/04/19 13:17), Amit Langote wrote:
> OTOH, we don't mention at all in postgres-fdw.sgml that postgres_fdw
> supports tuple routing. Maybe we should, and list this limitation, or
> would it be too much burden to maintain?

I think it's better to add this limitation to postgres-fdw.sgml. Will do. Thanks for pointing that out!

Best regards,
Etsuro Fujita
Re: bug in update tuple routing with foreign partitions
(2019/04/19 13:00), Amit Langote wrote:
> On 2019/04/18 22:10, Etsuro Fujita wrote:
>> On 2019/04/18 14:06, Amit Langote wrote:
>>> On 2019/04/17 21:49, Etsuro Fujita wrote:
>>>> So what I'm thinking is to throw an error for cases like this.
>>>> (Though, I think we should keep to allow rows to be moved from local
>>>> partitions to a foreign-table subplan targetrel that has been updated
>>>> already.)
>>>
>>> Hmm, how would you distinguish (totally inside postgres_fdw I presume)
>>> the two cases?
>>
>> One thing I came up with to do that is this:
>>
>> @@ -1917,6 +1937,23 @@ postgresBeginForeignInsert(ModifyTableState *mtstate,
>>
>>     initStringInfo(&sql);
>>
>> +   /*
>> +    * If the foreign table is an UPDATE subplan resultrel that hasn't yet
>> +    * been updated, routing tuples to the table might yield incorrect
>> +    * results, because if routing tuples, routed tuples will be mistakenly
>> +    * read from the table and updated twice when updating the table --- it
>> +    * would be nice if we could handle this case; but for now, throw an error
>> +    * for safety.
>> +    */
>> +   if (plan && plan->operation == CMD_UPDATE &&
>> +       (resultRelInfo->ri_usesFdwDirectModify ||
>> +        resultRelInfo->ri_FdwState) &&
>> +       resultRelInfo > mtstate->resultRelInfo + mtstate->mt_whichplan)
>> +       ereport(ERROR,
>> +               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> +                errmsg("cannot route tuples into foreign table to be updated \"%s\"",
>> +                       RelationGetRelationName(rel))));
>
> Oh, I see. So IIUC, you're making postgresBeginForeignInsert() check two
> things:
>
> 1. Whether the foreign table it's about to begin inserting (moved) rows
> into is a subplan result rel, checked by
> (resultRelInfo->ri_usesFdwDirectModify || resultRelInfo->ri_FdwState)
>
> 2. Whether the foreign table it's about to begin inserting (moved) rows
> into will be updated later, checked by
> (resultRelInfo > mtstate->resultRelInfo + mtstate->mt_whichplan)
>
> This still allows a foreign table to receive rows moved from the local
> partitions if it has already finished the UPDATE operation. Seems
> reasonable to me.

Great!
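[Editor's sketch] The positional test in point 2 can be modeled very simply. The sketch below is a toy illustration with invented names; the real check compares ResultRelInfo pointers rather than integer indexes. Subplan result rels are processed in order, so a rel whose position comes strictly after the currently executing subplan has not been updated yet and must not receive routed tuples.

```c
#include <stdbool.h>

/* Toy model of the check in postgresBeginForeignInsert(): result rels are
 * updated in subplan order, so a rel at an index strictly greater than the
 * currently executing subplan (which_plan) has not been updated yet, and
 * routing tuples into it would let the later UPDATE see them. */
bool
routing_allowed(int rel_index, int which_plan)
{
    return rel_index <= which_plan;
}
```

In the real code, `resultRelInfo > mtstate->resultRelInfo + mtstate->mt_whichplan` is exactly this comparison done with pointer arithmetic over the result-rel array.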
>>> Forgot to say that since this is a separate issue from the original bug
>>> report, maybe we can first finish fixing the latter. What do you think?
>>
>> Yeah, I think we can do that, but my favorite would be to fix the two
>> issues in a single shot, because they seem to me rather close to each
>> other. So I updated a new version in a single patch, which I'm attaching.
>
> I agree that we can move to fix the other issue right away as the fix you
> outlined above seems reasonable, but I wonder if it wouldn't be better to
> commit the two fixes separately? The two fixes, although small, are
> somewhat complicated and combining them in a single commit might be
> confusing. Also, a combined commit might make it harder for the release
> note author to list down the exact set of problems being fixed.

OK, I'll separate the patch into two.

>> Notes:
>>
>> * I kept all the changes in the previous patch, because otherwise
>> postgres_fdw would fail to release resources for a foreign-insert
>> operation created by postgresBeginForeignInsert() for a tuple-routable
>> foreign table (ie, a foreign-table subplan resultrel that has been updated
>> already) during postgresEndForeignInsert().
>
> Hmm are you saying that the cases for which we'll still allow tuple
> routing (foreign table receiving moved-in rows has already been updated),
> there will be two fmstates to be released -- the original fmstate
> (UPDATE's) and aux_fmstate (INSERT's)?

Yeah, but I noticed that that explanation was not correct. (I think I was really in a hurry.) See the correction in [1].

>> * I revised a comment according to your previous comment, though I changed
>> a state name: s/sub_fmstate/aux_fmstate/
>
> That seems fine to me.

Cool!
> Some mostly cosmetic comments on the code changes:
>
> * In the following comment:
>
> +   /*
> +    * If the foreign table is an UPDATE subplan resultrel that hasn't yet
> +    * been updated, routing tuples to the table might yield incorrect
> +    * results, because if routing tuples, routed tuples will be mistakenly
> +    * read from the table and updated twice when updating the table --- it
> +    * would be nice if we could handle this case; but for now, throw an error
> +    * for safety.
> +    */
>
> the part that starts with "because if routing tuples..." reads a bit
> unclear to me. How about writing this as:
>
>    /*
>     * If the foreign table we are about to insert routed rows into is
>     * also an UPDATE result rel and the UPDATE hasn't been performed yet,
>     * proceeding with the INSERT will result in the later UPDATE
>     * incorrectly modifying those routed rows, so prevent the INSERT ---
>     * it would be nice if we could handle this case, but for now, throw
>     * an error for safety.
>     */

I think that's better than mine; will use that wording.

> I see that in all the hunks involving some manipulation of aux_fmst
Re: Do CustomScan as not projection capable node
Andrey Lepikhov writes:
> Can we include the CustomScan node in the list of nodes that do not
> support projection?

That seems like a pretty bad idea. Maybe it's a good thing for whatever unspecified extension you have in mind right now, but it's likely to be a net negative for many more. As an example, if some custom extension has a better way to calculate some output expression than the core code does, a restriction like this would prevent that from being implemented.

> Reason is that custom node can contain quite arbitrary logic that does
> not guarantee projection support.

I don't buy this for a minute. Where do you think projection is going to happen? There isn't any existing node type that *couldn't* support projection if we insisted that that be done across-the-board. I think it's mostly just a legacy thing that some don't.

regards, tom lane
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
On Wed, Apr 17, 2019 at 1:27 PM Paul Guo wrote:
> create db with tablespace
> drop database
> drop tablespace

Essentially, that sequence of operations causes crash recovery to fail if the "drop tablespace" transaction was committed before crashing. This is a bug in crash recovery in general and should be reproducible without configuring a standby. Is that right?

Your patch creates missing directories in the destination. Don't we need to create the tablespace symlink under pg_tblspc/?

I would prefer extending the invalid page mechanism to deal with this, as suggested by Ashwin off-list. It will allow us to avoid creating directories and files only to remove them shortly afterwards when the drop database and drop tablespace records are replayed.

Asim
Do CustomScan as not projection capable node
Can we include the CustomScan node in the list of nodes that do not support projection? The reason is that a custom node can contain quite arbitrary logic that does not guarantee projection support.

Secondly, if the planner does not need a separate Result node, it just assigns the tlist to the subplan (i.e., changes the targetlist of the custom node) and does not change custom_scan_tlist. Perhaps I do not fully understand the logic of using the custom_scan_tlist field, but if our custom node does not build its own custom_scan_tlist in the PlanCustomPath() routine (it may use tlist as the base for custom_scan_tlist), we will get errors in the set_customscan_references() call.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

From 938904d179e0a4e31cbb20fb70243d2b980d8dc2 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov"
Date: Fri, 19 Apr 2019 09:34:09 +0500
Subject: [PATCH] Add CustomScan node in the list of nodes that do not support
 projection

---
 src/backend/optimizer/plan/createplan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index efe073a3ee..682d4d4429 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -6691,6 +6691,7 @@ is_projection_capable_path(Path *path)
 		case T_ModifyTable:
 		case T_MergeAppend:
 		case T_RecursiveUnion:
+		case T_CustomScan:
 			return false;
 		case T_Append:
--
2.17.1
Re: bug in update tuple routing with foreign partitions
On 2019/04/19 13:00, Amit Langote wrote:
> BTW, do you think we should update the docs regarding the newly
> established limitation of row movement involving foreign tables?

Ah, maybe not, because it's a postgres_fdw limitation, not of the core tuple routing feature. OTOH, we don't mention at all in postgres-fdw.sgml that postgres_fdw supports tuple routing. Maybe we should, and list this limitation, or would it be too much burden to maintain?

Thanks,
Amit
Re: Race conditions with checkpointer and shutdown
>>> Maybe what we should be looking for is "why doesn't the walreceiver
>>> shut down"? But the dragonet log you quote above shows the walreceiver
>>> exiting, or at least starting to exit. Tis a puzzlement.

huh ... take a look at this little stanza in PostmasterStateMachine:

    if (pmState == PM_SHUTDOWN_2)
    {
        /*
         * PM_SHUTDOWN_2 state ends when there's no other children than
         * dead_end children left. There shouldn't be any regular backends
         * left by now anyway; what we're really waiting for is walsenders and
         * archiver.
         *
         * Walreceiver should normally be dead by now, but not when a fast
         * shutdown is performed during recovery.
         */
        if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0 &&
            WalReceiverPID == 0)
        {
            pmState = PM_WAIT_DEAD_END;
        }
    }

I'm too tired to think through exactly what that last comment might be suggesting, but it sure seems like it might be relevant to our problem. If the walreceiver *isn't* dead yet, what's going to ensure that we can move forward later?

regards, tom lane
Re: bug in update tuple routing with foreign partitions
Fujita-san,

On 2019/04/18 22:10, Etsuro Fujita wrote:
>> On 2019/04/18 14:06, Amit Langote wrote:
>>> On 2019/04/17 21:49, Etsuro Fujita wrote:
>>>> So what I'm thinking is to throw an error for cases like this.
>>>> (Though, I think we should keep to allow rows to be moved from local
>>>> partitions to a foreign-table subplan targetrel that has been updated
>>>> already.)
>>>
>>> Hmm, how would you distinguish (totally inside postgres_fdw I presume)
>>> the two cases?
>
> One thing I came up with to do that is this:
>
> @@ -1917,6 +1937,23 @@ postgresBeginForeignInsert(ModifyTableState *mtstate,
>
>     initStringInfo(&sql);
>
> +   /*
> +    * If the foreign table is an UPDATE subplan resultrel that hasn't yet
> +    * been updated, routing tuples to the table might yield incorrect
> +    * results, because if routing tuples, routed tuples will be mistakenly
> +    * read from the table and updated twice when updating the table --- it
> +    * would be nice if we could handle this case; but for now, throw an error
> +    * for safety.
> +    */
> +   if (plan && plan->operation == CMD_UPDATE &&
> +       (resultRelInfo->ri_usesFdwDirectModify ||
> +        resultRelInfo->ri_FdwState) &&
> +       resultRelInfo > mtstate->resultRelInfo + mtstate->mt_whichplan)
> +       ereport(ERROR,
> +               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                errmsg("cannot route tuples into foreign table to be updated \"%s\"",
> +                       RelationGetRelationName(rel))));

Oh, I see. So IIUC, you're making postgresBeginForeignInsert() check two things:

1. Whether the foreign table it's about to begin inserting (moved) rows into is a subplan result rel, checked by (resultRelInfo->ri_usesFdwDirectModify || resultRelInfo->ri_FdwState)

2. Whether the foreign table it's about to begin inserting (moved) rows into will be updated later, checked by (resultRelInfo > mtstate->resultRelInfo + mtstate->mt_whichplan)

This still allows a foreign table to receive rows moved from the local partitions if it has already finished the UPDATE operation. Seems reasonable to me.
>> Forgot to say that since this is a separate issue from the original bug
>> report, maybe we can first finish fixing the latter. What do you think?
>
> Yeah, I think we can do that, but my favorite would be to fix the two
> issues in a single shot, because they seem to me rather close to each
> other. So I updated a new version in a single patch, which I'm attaching.

I agree that we can move to fix the other issue right away as the fix you outlined above seems reasonable, but I wonder if it wouldn't be better to commit the two fixes separately? The two fixes, although small, are somewhat complicated and combining them in a single commit might be confusing. Also, a combined commit might make it harder for the release note author to list down the exact set of problems being fixed. But I guess your commit message will make it clear that two distinct problems are being solved, so maybe that shouldn't be a problem.

> Notes:
>
> * I kept all the changes in the previous patch, because otherwise
> postgres_fdw would fail to release resources for a foreign-insert
> operation created by postgresBeginForeignInsert() for a tuple-routable
> foreign table (ie, a foreign-table subplan resultrel that has been updated
> already) during postgresEndForeignInsert().

Hmm are you saying that the cases for which we'll still allow tuple routing (foreign table receiving moved-in rows has already been updated), there will be two fmstates to be released -- the original fmstate (UPDATE's) and aux_fmstate (INSERT's)?

> * I revised a comment according to your previous comment, though I changed
> a state name: s/sub_fmstate/aux_fmstate/

That seems fine to me.

> * DirectModify also has the latter issue.
> Here is an example:
>
> postgres=# create table p (a int, b text) partition by list (a);
> postgres=# create table p1 partition of p for values in (1);
> postgres=# create table p2base (a int check (a = 2 or a = 3), b text);
> postgres=# create foreign table p2 partition of p for values in (2, 3)
> server loopback options (table_name 'p2base');
>
> postgres=# insert into p values (1, 'foo');
> INSERT 0 1
> postgres=# explain verbose update p set a = a + 1;
>                                  QUERY PLAN
> -
>  Update on public.p  (cost=0.00..176.21 rows=2511 width=42)
>    Update on public.p1
>    Foreign Update on public.p2
>    ->  Seq Scan on public.p1  (cost=0.00..25.88 rows=1270 width=42)
>          Output: (p1.a + 1), p1.b, p1.ctid
>    ->  Foreign Update on public.p2  (cost=100.00..150.33 rows=1241 width=42)
>          Remote SQL: UPDATE public.p2base SET a = (a + 1)
> (7 rows)
>
> postgres=# update p set a = a + 1;
> UPDATE 2
> postgres=# select * from p;
>  a |
Re: Race conditions with checkpointer and shutdown
Thomas Munro writes:
> On Fri, Apr 19, 2019 at 10:22 AM Tom Lane wrote:
>> Maybe what we should be looking for is "why doesn't the walreceiver
>> shut down"? But the dragonet log you quote above shows the walreceiver
>> exiting, or at least starting to exit. Tis a puzzlement.
>
> ... Is there some way that the exit code could hang
> *after* that due to corruption of libc resources (FILE streams,
> malloc, ...)? It doesn't seem likely to me (we'd hopefully see some
> more clues) but I thought I'd mention the idea.

I agree it's not likely ... but that's part of the reason I was thinking about adding some postmaster logging. Whatever we're chasing here is "not likely", per the observed buildfarm failure rate.

regards, tom lane
Re: Race conditions with checkpointer and shutdown
On Fri, Apr 19, 2019 at 10:22 AM Tom Lane wrote:
> Thomas Munro writes:
>> 2019-04-16 08:23:24.178 CEST [8393] FATAL: terminating walreceiver
>> process due to administrator command
>
> Maybe what we should be looking for is "why doesn't the walreceiver
> shut down"? But the dragonet log you quote above shows the walreceiver
> exiting, or at least starting to exit. Tis a puzzlement.

One thing I noticed about this message: if you receive SIGTERM at a rare time when WalRcvImmediateInterruptOK is true, then that ereport() runs directly in the signal handler context. That's not strictly allowed, and could cause nasal demons. On the other hand, it probably wouldn't have managed to get the FATAL message out if that was the problem here (previously we've seen reports of signal handlers deadlocking while trying to ereport(), but they couldn't get their message out at all, because malloc or some such was already locked in the user context). Is there some way that the exit code could hang *after* that due to corruption of libc resources (FILE streams, malloc, ...)? It doesn't seem likely to me (we'd hopefully see some more clues), but I thought I'd mention the idea.

--
Thomas Munro
https://enterprisedb.com
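[Editor's sketch] For readers unfamiliar with the constraint being discussed: the conventional way to keep a handler async-signal-safe is to record the request in a volatile sig_atomic_t flag and defer all unsafe work (ereport, malloc, stdio) to the main loop. A minimal illustrative sketch with invented names, not walreceiver's actual code:

```c
#include <signal.h>

/* The handler only sets a flag; everything that is not
 * async-signal-safe is deferred to the main loop. */
static volatile sig_atomic_t shutdown_requested = 0;

static void
sigterm_handler(int signo)
{
    (void) signo;
    shutdown_requested = 1;     /* sig_atomic_t store: always safe */
}

void
install_handler(void)
{
    signal(SIGTERM, sigterm_handler);
}

/* Polled from the main loop at a safe point; this is where real code
 * would ereport(FATAL, ...) and exit. */
int
shutdown_pending(void)
{
    return shutdown_requested != 0;
}
```

The hazard Thomas describes arises exactly when this discipline is bypassed and complex code runs inside the handler itself.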
Re: Two pg_rewind patches (auto generate recovery conf and ensure clean shutdown)
Hi Michael,

I updated the patches as attached following previous discussions. The two patches:

v2-0001-Extact-common-functions-from-pg_basebackup-into-s.patch
v2-0002-Add-option-to-write-recovery-configuration-inform.patch

1. 0001 moves those common functions & variables to two new files (one .c and one .h) for both pg_rewind and pg_basebackup use; note the functions are slightly modified (e.g. because conn is probably NULL in pg_rewind). I do not know where it is more proper to put the new files. Currently, they are under pg_basebackup and are used in pg_rewind (Makefile modified to support that).

2. 0002 adds the option to write the recovery configuration.

The below patch runs single-user mode Postgres if needed to make sure the target is cleanly shut down. A new option is added (off by default).

v2-0001-Ensure-target-clean-shutdown-at-beginning-of-pg_r.patch

I've manually tested them and installcheck passes. Thanks.

On Wed, Mar 20, 2019 at 1:23 PM Paul Guo wrote:
> On Wed, Mar 20, 2019 at 1:20 PM Michael Paquier wrote:
>> On Wed, Mar 20, 2019 at 12:48:52PM +0800, Paul Guo wrote:
>>> This is a good suggestion also. Will do it.
>>
>> Please note also that we don't care about recovery.conf since v12 as
>> recovery parameters are now GUCs. I would suggest appending those
>> extra parameters to postgresql.auto.conf, which is what pg_basebackup
>> does.
>
> Yes, the recovery conf patch in the first email did like this, i.e.
> writing postgresql.auto.conf & standby.signal

v2-0001-Ensure-target-clean-shutdown-at-beginning-of-pg_r.patch
v2-0001-Extact-common-functions-from-pg_basebackup-into-s.patch
v2-0002-Add-option-to-write-recovery-configuration-inform.patch
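[Editor's note] For readers following along: since v12 the recovery parameters are plain GUCs, so a "write recovery configuration" option along the lines discussed would append something like the following to postgresql.auto.conf (the connection values here are placeholders, not from the patches):

```ini
# appended to $PGDATA/postgresql.auto.conf by the write-recovery-conf option
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
```

An empty $PGDATA/standby.signal file is then created so the server starts up as a standby, matching what pg_basebackup -R does.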
Re: Pathological performance when inserting many NULLs into a unique index
On Thu, Apr 18, 2019 at 5:46 PM Michael Paquier wrote:
> Adding an open item is adapted, nice you found this issue. Testing on
> my laptop with v11, the non-NULL case takes 5s and the NULL case 6s.
> On HEAD, the non-NULL case takes 6s, and the NULL case takes... I
> just gave up and cancelled it :)

This was one of those things that occurred to me out of the blue.

It just occurred to me that the final patch will need to be more careful about non-key attributes in INCLUDE indexes. It's not okay for it to avoid calling _bt_check_unique() just because a non-key attribute was NULL. It should only do that when a key attribute is NULL.

--
Peter Geoghegan
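[Editor's sketch] Concretely, the rule Peter describes is that the NULL test must iterate over only the leading nkeyatts key columns of an index tuple, ignoring trailing INCLUDE columns. A hedged sketch with an invented helper name, not the actual nbtree code:

```c
#include <stdbool.h>

/* Return true only if some *key* attribute is NULL.  isnull[] covers all
 * attributes of an INCLUDE index, but attributes past nkeyatts are non-key
 * payload, and a NULL there must not cause the uniqueness check
 * (_bt_check_unique() in PostgreSQL) to be skipped. */
bool
index_key_has_null(const bool *isnull, int nkeyatts)
{
    for (int i = 0; i < nkeyatts; i++)
    {
        if (isnull[i])
            return true;
    }
    return false;
}
```

With this shape, a tuple whose only NULL sits in an INCLUDE column still goes through the uniqueness check.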
Re: bug in update tuple routing with foreign partitions
(2019/04/18 22:10), Etsuro Fujita wrote:
> Notes:
>
> * I kept all the changes in the previous patch, because otherwise
> postgres_fdw would fail to release resources for a foreign-insert
> operation created by postgresBeginForeignInsert() for a tuple-routable
> foreign table (ie, a foreign-table subplan resultrel that has been updated
> already) during postgresEndForeignInsert().

I noticed that this explanation was not right. Let me correct myself. The reason why I kept those changes is: without them, we would fail to release the resources for a foreign-update operation (ie, the fmstate) created by postgresBeginForeignModify() and then replaced by the fmstate for the foreign-insert operation. When doing ExecEndPlan(), we first call postgresEndForeignModify() and then call postgresEndForeignInsert(); so without those changes we would *mistakenly* release the fmstate for the foreign-insert operation in postgresEndForeignModify(), and we would do nothing about the fmstate for the foreign-update operation either in that function or in the subsequent postgresEndForeignInsert().

Best regards,
Etsuro Fujita
Re: Race conditions with checkpointer and shutdown
Thomas Munro writes:
> Interesting, but I'm not sure how that could be though. Perhaps, a
> bit like the other thing that cropped up in the build farm after that
> commit, removing ~200ms of needless sleeping around an earlier online
> CHECKPOINT made some other pre-existing race condition more likely to
> go wrong.

The data that we've got is entirely consistent with the idea that there's a timing-sensitive bug that gets made more or less likely to trigger by "unrelated" changes in test cases or server code.

regards, tom lane
Re: Unhappy about API changes in the no-fsm-for-small-rels patch
Amit Kapila writes:
> I am thinking that we should at least give it a try to move the map to
> rel cache level to see how easy or difficult it is and also let's wait
> for a day or two to see if Andres/Tom has to say anything about this
> or on the response by me above to improve the current patch.

FWIW, it's hard for me to see how moving the map to the relcache isn't the right thing to do. You will lose state during a relcache flush, but that's still miles better than how often the state gets lost now.

regards, tom lane
Re: Race conditions with checkpointer and shutdown
Michael Paquier writes:
> On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
>> Anyway, this is *not* new in v12.
>
> Indeed. It seems to me that v12 makes the problem easier to appear
> though, and I got to wonder if c6c9474 is helping in that as more
> cases are popping up since mid-March.

Yeah. Whether that's due to a server code change, or new or modified test cases, is unknown. But looking at my summary of buildfarm runs that failed like this, there's a really clear breakpoint at

gull | 2018-08-24 03:27:16 | recoveryCheck | pg_ctl: server does not shut down

Since that, most of the failures with this message have been in the recoveryCheck step. Before that, the failures were all over the place, and now that I look closely a big fraction of them were in bursts on particular animals, suggesting it was more about some local problem on that animal than any real code issue. So it might be worth groveling around in the commit logs from last August...

regards, tom lane
Re: Race conditions with checkpointer and shutdown
On Fri, Apr 19, 2019 at 2:30 PM Michael Paquier wrote:
> On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
>> Anyway, this is *not* new in v12.
>
> Indeed. It seems to me that v12 makes the problem easier to appear
> though, and I got to wonder if c6c9474 is helping in that as more
> cases are popping up since mid-March.

Interesting, but I'm not sure how that could be though. Perhaps, a bit like the other thing that cropped up in the build farm after that commit, removing ~200ms of needless sleeping around an earlier online CHECKPOINT made some other pre-existing race condition more likely to go wrong.

--
Thomas Munro
https://enterprisedb.com
Re: Unhappy about API changes in the no-fsm-for-small-rels patch
On Thu, Apr 18, 2019 at 2:10 PM John Naylor wrote:
> On Thu, Apr 18, 2019 at 2:48 PM Amit Kapila wrote:
>> I respect and will follow whatever will be the consensus after
>> discussion. However, I request you to wait for some time to let the
>> discussion conclude. If we can't get to an agreement or one of John
>> or me can't implement what is decided, then we can anyway revert it.
>
> Agreed. I suspect the most realistic way to address most of the
> objections in a short amount of time would be to:
>
> 1. rip out the local map
> 2. restore hio.c to only checking the last block in the relation if
> there is no FSM (and lower the threshold to reduce wasted space)
> 3. reduce calls to smgr_exists()

Won't you need an extra call to RelationGetNumberOfBlocks to find the last block? Also, won't it be less efficient in terms of dealing with bloat as compared to the current patch? I think if we go this route, then we might need to revisit it in the future to optimize it, but maybe that is the best alternative as of now.

I am thinking that we should at least give it a try to move the map to rel cache level to see how easy or difficult it is, and also let's wait for a day or two to see if Andres/Tom has to say anything about this or on the response by me above to improve the current patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: Race conditions with checkpointer and shutdown
On Thu, Apr 18, 2019 at 05:57:39PM -0400, Tom Lane wrote:
> It's the latter. I searched the buildfarm database for failure logs
> including the string "server does not shut down" within the last three
> years, and got all of the hits attached. Not all of these look like
> the failure pattern Michael pointed to, but enough of them do to say
> that the problem has existed since at least mid-2017. To be concrete,
> we have quite a sample of cases where a standby server has received a
> "fast shutdown" signal and acknowledged that in its log, but it never
> gets to the expected "shutting down" message, meaning it never starts
> the shutdown checkpoint let alone finishes it. The oldest case that
> clearly looks like that is
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29

Interesting. I was sort of thinking about c6c3334 first, but this failed based on 9fcf670, which does not include the former.

> This leads me to suspect that the problem is (a) some very low-level issue
> in spinlocks or latches or the like, or (b) a timing problem that just
> doesn't show up on generic Intel-oid platforms. The timing theory is
> maybe a bit stronger given that one test case shows this more often than
> others. I've not got any clear ideas beyond that.
>
> Anyway, this is *not* new in v12.

Indeed. It seems to me that v12 makes the problem easier to appear though, and I got to wonder if c6c9474 is helping in that as more cases are popping up since mid-March.

--
Michael
Re: "make installcheck" fails in src/test/recovery
On Thu, Apr 18, 2019 at 09:31:21PM -0400, Tom Lane wrote:
> If we think that depending on pageinspect is worthwhile here, the right
> thing to do is just to fix the README to say that you need it. I'm
> merely asking whether it's really worth the extra dependency.

The dependency is not strongly mandatory for the test coverage. I'll just go remove it if that's an annoyance.

--
Michael
Re: ExecForceStoreMinimalTuple leaks memory like there's no tomorrow
Hi, On 2019-04-15 22:46:56 -0400, Tom Lane wrote: > Using HEAD, > > create table t1 as select generate_series(1,4000) id; > vacuum analyze t1; > explain select * from t1, t1 t1b where t1.id = t1b.id; > -- should indicate a hash join > explain analyze select * from t1, t1 t1b where t1.id = t1b.id; > > ... watch the process's memory consumption bloat. (It runs for > awhile before that starts to happen, but eventually it goes to > a couple of GB.) > > It looks to me like the problem is that ExecHashJoinGetSavedTuple > calls ExecForceStoreMinimalTuple with shouldFree = true, and > ExecForceStoreMinimalTuple's second code branch simply ignores > the requirement to free the supplied tuple. Thanks for finding. The fix is obviously easy - but looking through the code I think I found another similar issue. I'll fix both in one go tomorrow. Greetings, Andres Freund
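To make the ownership contract at issue concrete: when a caller passes shouldFree = true, it hands ownership of the tuple to the store function, which must free it once it has made its own copy; the bug was a branch that kept the copy but never freed the original. A minimal illustrative sketch in Python (not PostgreSQL's C code; the Arena class and names are invented to stand in for a memory context):

```python
class Arena:
    """Counts outstanding allocations, standing in for a memory context."""
    def __init__(self):
        self.live = 0
    def alloc(self):
        self.live += 1
        return object()
    def free(self, tup):
        self.live -= 1

def store_tuple(slot, tup, should_free, arena, buggy=False):
    # The slot keeps its own copy of the tuple's contents, so the
    # caller-supplied tuple is dead weight afterwards.  The reported bug
    # was a code branch that ignored should_free at this point.
    slot["contents"] = "copy of tuple"
    if should_free and not buggy:
        arena.free(tup)

slot = {}
leaky, fixed = Arena(), Arena()
for _ in range(10_000):
    store_tuple(slot, leaky.alloc(), True, leaky, buggy=True)
    store_tuple(slot, fixed.alloc(), True, fixed)
print(leaky.live, fixed.live)  # 10000 0
```

With the buggy branch, one tuple per fetched row stays live until the surrounding context is reset at executor shutdown, which is exactly the multi-GB bloat seen in the hash-join report above.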
Re: "make installcheck" fails in src/test/recovery
Michael Paquier writes: > I am wondering if it would be better to just install automatically all > the paths listed in EXTRA_INSTALL when invoking installcheck. Absolutely not! In the first place, "make installcheck" is supposed to test the installed tree, not editorialize upon what's in it; and in the second place, you wouldn't necessarily have permissions to change that tree. If we think that depending on pageinspect is worthwhile here, the right thing to do is just to fix the README to say that you need it. I'm merely asking whether it's really worth the extra dependency. regards, tom lane
Re: Pathological performance when inserting many NULLs into a unique index
On Wed, Apr 17, 2019 at 07:37:17PM -0700, Peter Geoghegan wrote:
> I'll create an open item for this, and begin work on a patch tomorrow.

Adding an open item is appropriate; nice that you found this issue. Testing on my laptop with v11, the non-NULL case takes 5s and the NULL case 6s. On HEAD, the non-NULL case takes 6s, and the NULL case takes... I just gave up and cancelled it :)
-- Michael
Re: finding changed blocks using WAL scanning
On Thu, Apr 18, 2019 at 03:51:14PM -0400, Robert Haas wrote:
> I was thinking that a dedicated background worker would be a good option, but Stephen Frost seems concerned (over on the other thread) about how much load that would generate. That never really occurred to me as a serious issue and I suspect for many people it wouldn't be, but there might be some.

WAL segment size can go up to 1GB, and this no longer depends on a compile-time option. So scanning a very large segment is not going to be free. I think that Stephen's performance concerns are legitimate, as the WAL partitions currently see sequential read and write patterns.

> It's cool that you have a command-line tool that does this as well. Over there, it was also discussed that we might want to have both a command-line tool and a background worker. I think, though, that we would want to get the output in some kind of compressed binary format, rather than text.

If you want to tweak the tool, please feel free to reuse it for any patch submitted to upstream. Reshaping or redirecting the data is not a big issue once the basics with the WAL reader are in place.
-- Michael
Re: Comments for lossy ORDER BY are lacking
Hi,

On 2019-04-18 17:30:20 -0700, Andres Freund wrote:
> For not the first time I was trying to remember why and when the whole nodeIndexscan.c:IndexNextWithReorder() business is needed. The comment about reordering
>
>  * IndexNextWithReorder
>  *
>  * Like IndexNext, but this version can also re-check ORDER BY
>  * expressions, and reorder the tuples as necessary.
>
> or
>
> + /* Initialize sort support, if we need to re-check ORDER BY exprs */
>
> or
>
> + /*
> +  * If there are ORDER BY expressions, look up the sort operators for
> +  * their datatypes.
> +  */

Secondary point: has anybody actually checked whether the extra reordering infrastructure is a measurable overhead? It's obviously fine for index scans that need reordering (i.e. lossy ones), but currently it's at least initialized for distance-based ORDER BYs. I guess that's largely because opclasses currently don't signal the fact that they might return lossy amcanorderby results, but that seems like something that could have been fixed back then?

Greetings,

Andres Freund
Re: Pathological performance when inserting many NULLs into a unique index
On Wed, Apr 17, 2019 at 7:37 PM Peter Geoghegan wrote: > Tentatively, I think that the fix here is to not "itup_key->scantid = > NULL" when a NULL value is involved (i.e. don't "itup_key->scantid = > NULL" when we see IndexTupleHasNulls(itup) within _bt_doinsert()). We > may also want to avoid calling _bt_check_unique() entirely in this > situation. > I'll create an open item for this, and begin work on a patch tomorrow. I came up with the attached patch, which effectively treats a unique index insertion as if the index was not unique at all in the case where new tuple is IndexTupleHasNulls() within _bt_doinsert(). This is correct because inserting a new tuple with a NULL value can neither contribute to somebody else's duplicate violation, nor have a duplicate violation error of its own. I need to test this some more, though I am fairly confident that I have the right idea. We didn't have this problem before my v12 work due to the "getting tired" strategy, but that's not the full story. We also didn't (and don't) move right within _bt_check_unique() when IndexTupleHasNulls(itup), because _bt_isequal() treats NULL values as not equal to any value, including even NULL -- this meant that there was no useless search to the right within _bt_check_unique(). Actually, the overall effect of how _bt_isequal() is used is that _bt_check_unique() does *no* useful work at all with a NULL scankey. It's far simpler to remove responsibility for new tuples/scankeys with NULL values from _bt_check_unique() entirely, which is what I propose to do with this patch. The patch actually ends up removing quite a lot more code than it adds, because it removes _bt_isequal() outright. I always thought that nbtinsert.c dealt with NULL values in unique indexes at the wrong level, and I'm glad to see _bt_isequal() go. Vadim's accompanying 1997 comment about "not using _bt_compare in real comparison" seems confused to me. 
The new _bt_check_unique() may still need to compare the scankey to some existing, adjacent tuple with a NULL value, but _bt_compare() is perfectly capable of recognizing that that adjacent tuple shouldn't be considered equal. IOW, _bt_compare() is merely incapable of behaving as if "NULL != NULL", which is a bad reason for keeping _bt_isequal() around. -- Peter Geoghegan 0001-Prevent-O-N-2-unique-index-insertion-edge-case.patch Description: Binary data
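The performance cliff being fixed can be shown with a small model. This is an illustrative Python sketch, not the nbtree code; the sorted-list representation and visit counting are invented for the example. The point it demonstrates: with the heap TID stripped from the scan key, every NULL-bearing insertion lands on the leftmost "equal" position and has to walk right across every previously inserted NULL, for O(N^2) total work, whereas descending with the full key needs no such walk.

```python
import bisect

NULL_KEY = 0  # stand-in: all NULL-bearing index tuples sort together

def insert(entries, key, tid, null_aware):
    """entries: sorted list of (key, heap_tid), like one long leaf level.
    Returns how many existing tuples the insertion had to visit."""
    if null_aware:
        # Patched behaviour for IndexTupleHasNulls(): no duplicate
        # violation is possible, so keep the heap TID in the scan key and
        # descend straight to a unique insertion point.
        bisect.insort(entries, (key, tid))
        return 0
    # Unpatched behaviour: scantid is stripped, so the search lands on
    # the leftmost tuple with an "equal" key, and insertion grovels
    # through every such tuple to its right.
    pos = bisect.bisect_left(entries, (key, -1))
    visited = 0
    while pos < len(entries) and entries[pos][0] == key:
        visited += 1
        pos += 1
    entries.insert(pos, (key, tid))
    return visited

visits = {}
for null_aware in (False, True):
    entries = []
    visits[null_aware] = sum(
        insert(entries, NULL_KEY, tid, null_aware) for tid in range(2000))
print(visits)  # {False: 1999000, True: 0}
```

2000 insertions cost roughly 2000^2/2 visits unpatched versus none patched, matching the "pathological" behaviour reported upthread.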
Comments for lossy ORDER BY are lacking
Hi,

For not the first time I was trying to remember why and when the whole nodeIndexscan.c:IndexNextWithReorder() business is needed. Neither the comment about reordering

 * IndexNextWithReorder
 *
 * Like IndexNext, but this version can also re-check ORDER BY
 * expressions, and reorder the tuples as necessary.

or

+ /* Initialize sort support, if we need to re-check ORDER BY exprs */

or

+ /*
+  * If there are ORDER BY expressions, look up the sort operators for
+  * their datatypes.
+  */

nor any other easy-to-spot ones really explain that. It's not even obvious that this isn't talking about an ordering by a column ("expression" could maybe be taken as a hint, but that's fairly thin). By reading enough code one can stitch together that this is really only needed for KNN-like ORDER BYs with lossy distance functions. It'd be good if one had to dig less for that.

That logic was (originally) added in:

commit 35fcb1b3d038a501f3f4c87c05630095abaaadab
Author: Heikki Linnakangas
Date: 2015-05-15 14:26:51 +0300

    Allow GiST distance function to return merely a lower-bound.

but I think some of the documentation & naming for related data structures was a bit hard to grasp before then too; it's e.g. IMO certainly not obvious that IndexPath.indexorderbys isn't about plain ORDER BYs.

Greetings,

Andres Freund
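The underlying technique (the index returns tuples ordered by a mere lower bound on the distance, and the executor holds each candidate back until its exact distance can no longer be beaten by anything the index has yet to return) can be sketched in a few lines. Illustrative Python, not the nodeIndexscan.c implementation; the toy data at the bottom is invented:

```python
import heapq, math

def reorder(candidates, exact_distance):
    """candidates: (lower_bound, item) pairs in nondecreasing lower_bound
    order, as a lossy index would return them.  Yields items in true
    distance order: an item leaves the pending heap only once its exact
    distance is <= every lower bound still to come."""
    pending = []  # min-heap of (exact distance, item)
    for lower_bound, item in candidates:
        while pending and pending[0][0] <= lower_bound:
            yield heapq.heappop(pending)[1]
        heapq.heappush(pending, (exact_distance(item), item))
    while pending:
        yield heapq.heappop(pending)[1]

# Toy check: true distance from 0, index orders by floor() of it (lossy).
items = [0.2, 1.7, 1.1, 2.9, 0.8, 2.2]
dist = lambda x: abs(x)
cands = sorted((math.floor(dist(x)), x) for x in items)
print(list(reorder(cands, dist)))  # [0.2, 0.8, 1.1, 1.7, 2.2, 2.9]
```

Since every exact distance is at least its own lower bound, popping a pending item whose exact distance is <= the current lower bound is always safe, and for a non-lossy index the heap never holds more than one entry.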
Re: "make installcheck" fails in src/test/recovery
On Thu, Apr 18, 2019 at 01:45:45AM -0400, Tom Lane wrote:
> Is this extra dependency actually essential? I'm not really happy about increasing the number of moving parts in this test.

Hmmm. I don't actually object to removing the part of the tests depending on pageinspect. Relying on the on-disk page format has proved to be more reliable for the buildfarm than I initially thought, and we are actually able to keep the same coverage without the dependency on pageinspect.

Now, I think that this is a problem not only for src/test/recovery/ but for any path using EXTRA_INSTALL. For example, if you take contrib/ltree_plpython/, then issue "make install" from this path followed by an installcheck, the tests complain about ltree missing from the installation. For the recovery tests, we already require test_decoding, so I would expect the problem to get worse with time, as we should not restrict the dependencies on other modules if they make sense for some TAP tests.

I am wondering if it would be better to just install automatically all the paths listed in EXTRA_INSTALL when invoking installcheck. We enforce the target in src/test/recovery/Makefile; still, we could use this opportunity to mark it with TAP_TESTS=1.
-- Michael
Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint
On Thu, Apr 18, 2019 at 03:16:17PM -0400, Robert Haas wrote:
> On Tue, Apr 16, 2019 at 2:45 AM Michael Paquier wrote:
> > I think that we have race conditions with checkpointing and shutdown on HEAD.
> Any idea whether it's something newly-introduced or of long standing?

I would suggest keeping the discussion on the more general thread I created a couple of days ago: https://www.postgresql.org/message-id/20190416070119.gk2...@paquier.xyz

Thanks,
-- Michael
Re: block-level incremental backup
Greetings, Ok, responding to the rest of this email. * Robert Haas (robertmh...@gmail.com) wrote: > On Wed, Apr 17, 2019 at 6:43 PM Stephen Frost wrote: > > Sadly, I haven't got any great ideas today. I do know that the WAL-G > > folks have specifically mentioned issues with the visibility map being > > large enough across enough of their systems that it kinda sucks to deal > > with. Perhaps we could do something like the rsync binary-diff protocol > > for non-relation files? This is clearly just hand-waving but maybe > > there's something reasonable in that idea. > > I guess it all comes down to how complicated you're willing to make > the client-server protocol. With the very simple protocol that I > proposed -- client provides a threshold LSN and server sends blocks > modified since then -- the client need not have access to the old > incremental backup to take a new one. Where is the client going to get the threshold LSN from? > Of course, if it happens to > have access to the old backup then it can delta-compress however it > likes after-the-fact, but that doesn't help with the amount of network > transfer. If it doesn't have access to the old backup, then I'm a bit confused as to how a incremental backup would be possible? Isn't that a requirement here? > That problem could be solved by doing something like what > you're talking about (with some probably-negligible false match rate) > but I have no intention of trying to implement anything that > complicated, and I don't really think it's necessary, at least not for > a first version. What I proposed would already allow, for most users, > a large reduction in transfer and storage costs; what you are talking > about here would help more, but also be a lot more work and impose > some additional requirements on the system. I don't object to you > implementing the more complex system, but I'll pass. 
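For reference, the server side of the simple protocol Robert describes in the quoted text (client supplies a threshold LSN, server returns the blocks modified since then) can be modeled compactly, because every heap page already carries the LSN of the last WAL record that touched it. This is an illustrative Python sketch with invented data structures, not the actual smgr or replication-protocol code, and it deliberately leaves open where the client's threshold comes from:

```python
def incremental_blocks(pages, threshold_lsn):
    """pages: list of (page_lsn, payload) indexed by block number.
    A block has changed since the threshold exactly when the LSN
    stamped in its page header is at or past the threshold."""
    return [(blkno, payload)
            for blkno, (page_lsn, payload) in enumerate(pages)
            if page_lsn >= threshold_lsn]

# One toy relation fork: four blocks with their page LSNs.
relation = [(100, "a"), (450, "b"), (120, "c"), (700, "d")]
print(incremental_blocks(relation, 300))  # [(1, 'b'), (3, 'd')]
```

The appeal of this scheme is that the server needs no access to the previous backup at all; the trade-off discussed in the surrounding thread is that it requires reading every block to inspect its LSN unless something like the WAL-scanning modblock data narrows the scan.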
I was talking about the rsync binary-diff specifically for the files that aren't easy to deal with in the WAL stream. I wouldn't think we'd use it for other files, and there is definitely a question of whether there's a way to do better than a binary-diff approach for those files.

> > There's something like 6 different backup tools, at least, for PostgreSQL that provide backup management, so I have a really hard time agreeing with this idea that users don't want a PG backup management system. Maybe that's not what you're suggesting here, but that's what came across to me.
> Let me be a little more clear. Different users want different things. Some people want a canned PostgreSQL backup solution, while other people just want access to a reasonable set of facilities from which they can construct their own solution. I believe that the proposal I am making here could be used either by backup tool authors to enhance their offerings, or by individuals who want to build up their own solution using facilities provided by core.

The last thing that I think users really want is to build up their own solution. There may be some organizations who would like to provide their own tool, but that's a bit different. Personally, I'd *really* like PG to have a good tool in this area, and I've been working, as I've said before, to try to get to a point where we at least have the option to add in such a tool that meets our various requirements. Further, I'm concerned that the approach being presented here won't be interesting to most of the external tools because it's limited and can't be used in a parallel fashion.

> > Unless maybe I'm misunderstanding and what you're suggesting here is that the "existing solution" is something like the external PG-specific backup tools? But then the rest doesn't seem to make sense, as only maybe one or two of those tools use pg_basebackup internally.
> Well, what I'm really talking about is in two pieces: providing some new facilities via the replication protocol, and making pg_basebackup able to use those facilities. Nothing would stop other tools from using those facilities directly if they wish.

If those facilities are developed and implemented in the same way as the protocol used by pg_basebackup works, then I strongly suspect that the existing backup tools will treat it similarly, which is to say, they'll largely end up ignoring it.

> > ... but this is exactly the situation we're in already with all of the *other* features around backup (parallel backup, backup management, WAL management, etc). Users want those features, pg_basebackup/PG core doesn't provide it, and therefore there's a bunch of other tools which have been written that do. In addition, saying that PG has incremental backup but no built-in management of those full-vs-incremental backups and telling users that they basically have to build that themselves really feels a lot like we're trying to address a check-box requirement rather than making something that our
Re: Race conditions with checkpointer and shutdown
Thomas Munro writes: > This is a curious thing from dragonet's log: > 2019-04-16 08:23:24.178 CEST [8335] LOG: received fast shutdown request > 2019-04-16 08:23:24.178 CEST [8335] LOG: aborting any active transactions > 2019-04-16 08:23:24.178 CEST [8393] FATAL: terminating walreceiver > process due to administrator command > 2019-04-16 08:28:23.166 CEST [8337] LOG: restartpoint starting: time > LogCheckpointStart() is the thing that writes "starting: ...", and it > prefers to report "shutdown" over "time", but it didn't. Yeah, but since we don't see "shutting down", we know that the shutdown checkpoint hasn't begun. Here's another similar case: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2018-11-30%2011%3A44%3A54 The relevant fragment of the standby server's log is 2018-11-30 05:09:22.996 PST [4229] LOG: received fast shutdown request 2018-11-30 05:09:23.628 PST [4229] LOG: aborting any active transactions 2018-11-30 05:09:23.649 PST [4231] LOG: checkpoint complete: wrote 17 buffers (13.3%); 0 WAL file(s) added, 0 removed, 1 recycled; write=3.021 s, sync=0.000 s, total=3.563 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16563 kB, estimate=16563 kB 2018-11-30 05:09:23.679 PST [4229] LOG: background worker "logical replication launcher" (PID 4276) exited with exit code 1 2018-11-30 05:11:23.757 PST [4288] master LOG: unexpected EOF on standby connection 2018-11-30 05:11:23.883 PST [4229] LOG: received immediate shutdown request 2018-11-30 05:11:23.907 PST [4229] LOG: database system is shut down To the extent that I've found logs in which the checkpointer prints anything at all during this interval, it seems to be just quietly plodding along with its usual business, without any hint that it's aware of the pending shutdown request. It'd be very easy to believe that the postmaster -> checkpointer SIGUSR2 is simply getting dropped, or never issued. Hmm ... 
actually, looking at the postmaster's logic, it won't issue SIGUSR2 to the checkpointer until the walreceiver (if any) is gone. And now that I think about it, several of these logs contain traces showing that the walreceiver is still live. Like the one quoted above: seems like the line from PID 4288 has to be from a walreceiver. Maybe what we should be looking for is "why doesn't the walreceiver shut down"? But the dragonet log you quote above shows the walreceiver exiting, or at least starting to exit. Tis a puzzlement. I'm a bit tempted to add temporary debug logging to the postmaster so that walreceiver start/stop is recorded at LOG level. We'd have to wait a few weeks to have any clear result from the buildfarm, but I'm not sure how we'll get any hard data without some such measures. regards, tom lane
Re: Race conditions with checkpointer and shutdown
On Fri, Apr 19, 2019 at 2:53 AM Tom Lane wrote:
> Question is, what other theory has anybody got?

I wondered if there might be a way for PostmasterStateMachine() to be reached without signals blocked, in the case where we fork a fresh checkpointer, and then it misses the SIGUSR2 that we immediately send because it hasn't installed its handler yet. But I can't see it.

This is a curious thing from dragonet's log:

2019-04-16 08:23:24.178 CEST [8335] LOG: received fast shutdown request
2019-04-16 08:23:24.178 CEST [8335] LOG: aborting any active transactions
2019-04-16 08:23:24.178 CEST [8393] FATAL: terminating walreceiver process due to administrator command
2019-04-16 08:28:23.166 CEST [8337] LOG: restartpoint starting: time

LogCheckpointStart() is the thing that writes "starting: ...", and it prefers to report "shutdown" over "time", but it didn't.
-- Thomas Munro
https://enterprisedb.com
Re: Race conditions with checkpointer and shutdown
Robert Haas wrote (in the other thread):
> Any idea whether it's something newly-introduced or of long standing?

It's the latter. I searched the buildfarm database for failure logs including the string "server does not shut down" within the last three years, and got all of the hits attached. Not all of these look like the failure pattern Michael pointed to, but enough of them do to say that the problem has existed since at least mid-2017. To be concrete, we have quite a sample of cases where a standby server has received a "fast shutdown" signal and acknowledged that in its log, but it never gets to the expected "shutting down" message, meaning it never starts the shutdown checkpoint let alone finishes it. The oldest case that clearly looks like that is

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29

A significant majority of the recent cases look just like the piculet failure Michael pointed to, that is we fail to shut down the "london" server while it's acting as standby in the recovery/t/009_twophase.pl test. But there are very similar failures in other tests. I also notice that the population of machines showing the problem seems heavily skewed towards, um, weird cases. For instance, in the set that have shown this type of failure since January, we have

dragonet: uses JIT
francolin: --disable-spinlocks
gull: armv7
mereswine: armv7
piculet: --disable-atomics
sidewinder: amd64, but running netbsd 7 (and this was 9.6, note)
spurfowl: fairly generic amd64

This leads me to suspect that the problem is (a) some very low-level issue in spinlocks or latches or the like, or (b) a timing problem that just doesn't show up on generic Intel-oid platforms. The timing theory is maybe a bit stronger given that one test case shows this more often than others. I've not got any clear ideas beyond that. Anyway, this is *not* new in v12.
regards, tom lane

 sysname   |      snapshot       |       stage       | l
-----------+---------------------+-------------------+-----------------------------------
 jacana    | 2016-07-23 06:15:32 | pg_upgradeCheck   | pg_ctl: server does not shut down
 pademelon | 2016-08-14 03:49:36 | ECPG-Check        | pg_ctl: server does not shut down
 mereswine | 2017-02-13 14:24:37 | Check             | pg_ctl: server does not shut down
 arapaima  | 2017-03-04 20:06:10 | StopDb-C:4        | pg_ctl: server does not shut down
 nightjar  | 2017-06-02 18:54:29 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-02 19:54:11 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-03 15:54:05 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-03 17:54:18 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-03 21:54:09 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-04 00:54:09 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-04 16:34:32 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-04 17:54:16 | SubscriptionCheck | pg_ctl: server does not shut down
 hornet    | 2017-06-05 16:22:09 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-05 16:54:09 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-05 20:26:24 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-06 03:30:02 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-06 15:54:18 | SubscriptionCheck | pg_ctl: server does not shut down
 hornet    | 2017-06-06 17:10:02 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-06 18:54:27 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-07 00:54:07 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-07 02:54:06 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-07 15:12:15 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-07 17:54:07 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-07 18:54:06 | SubscriptionCheck | pg_ctl: server does not shut down
 sungazer  | 2017-06-07 19:46:53 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-07 21:03:43 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-08 01:54:07 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-08 15:54:10 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-08 16:57:03 | SubscriptionCheck | pg_ctl: server does not shut down
 nightjar  | 2017-06-08 17:54:09 | SubscriptionChec
Re: finding changed blocks using WAL scanning
On 2019-04-18 17:47:56 -0400, Bruce Momjian wrote: > I can see a 1GB marker being used for that. It would prevent an > incremental backup from being done until the first 1G modblock files was > written, since until then there is no record of modified blocks, but > that seems fine. A 1G marker would allow for consistent behavior > independent of server restarts and base backups. That doesn't seem like a good idea - it'd make writing regression tests for this impractical.
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-17, Alvaro Herrera wrote: > On 2019-Apr-17, Tom Lane wrote: > > > Alvaro Herrera writes: > > > 1. pg_dump now uses regular CREATE TABLE followed by ALTER TABLE / ATTACH > > >PARTITION when creating partitions, rather than CREATE TABLE > > >PARTITION OF. pg_dump --binary-upgrade was already doing that, so > > >this part mostly removes some code. In order to make the partitions > > >reach the correct tablespace, the "default_tablespace" GUC is used. > > >No TABLESPACE clause is added to the dump. This is David's patch > > >near the start of the thread. > > > > This idea seems reasonable independently of all else, simply on the grounds > > of reducing code duplication. It also has the advantage that if you try > > to do a selective restore of just a partition, and the parent partitioned > > table isn't around, you can still do it (with an ignorable error). > > I'll get this part pushed, then. After looking at it again, I found that there's no significant duplication reduction -- the patch simply duplicates one block in a different location, putting half of the original code in each. And if we reject the idea of separating tablespaces, there's no reason to do things that way. So ISTM if we don't want the tablespace thing, we should not apply this part. FWIW, we got quite a few positive votes for handling tablespaces this way for partitioned tables [1] [2], so I resist the idea that we have to revert the initial commit, as some seem to be proposing. After re-reading the thread one more time, I found one more pretty reasonable point that Andres was complaining about, and I made things work the way he described. 
Namely, if you do this:

SET default_tablespace TO 'foo';
CREATE TABLE part (a int) PARTITION BY LIST (a);
SET default_tablespace TO 'bar';
CREATE TABLE part1 PARTITION OF part FOR VALUES IN (1);

then the partition must end up in tablespace bar, not in tablespace foo: the reason is that the default_tablespace is not "strong enough" to stick with the partitioned table. The partition would only end up in tablespace foo in this case:

CREATE TABLE part (a int) PARTITION BY LIST (a) TABLESPACE foo;
CREATE TABLE part1 PARTITION OF part FOR VALUES IN (1);

i.e. when the tablespace is explicitly indicated in the CREATE TABLE command for the partitioned table. Of course, you can still add a TABLESPACE clause to the partition to override it (and you can change the parent table's tablespace later.) So here's a proposed v5. I would appreciate others' eyes on this patch.

[1] https://postgr.es/m/cakjs1f9sxvzqdrgd1teosfd6jbmm0ueaa14_8mrvcwe19tu...@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAKJS1f9PXYcT%2Bj%3DoyL-Lquz%3DScNwpRtmD7u9svLASUygBdbN8w%40mail.gmail.com

-- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From 2e42c31985fd1ed38af7806c25539c6dba600dc1 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Mon, 8 Apr 2019 16:40:05 -0400
Subject: [PATCH v5 1/2] Make pg_dump emit ATTACH PARTITION instead of PARTITION OF

ca41030 modified how the TABLESPACE option worked with partitioned tables so that it was possible to specify and store a TABLESPACE for a partitioned table. This setting served only to mark what the default tablespace should be for new partitions which were created with the CREATE TABLE PARTITION OF syntax. The problem with this was that pg_dump used the PARTITION OF syntax, which meant that if a partition had been explicitly defined to have pg_default as the tablespace then, since reltablespace is set to 0 in this case, pg_dump would emit a CREATE TABLE ..
PARTITION OF statement and cause the partition to default to the partitioned table's tablespace rather than the one it was meant to be located in. We can work around this problem by simply performing the CREATE TABLE first then perform an ATTACH PARTITION later. This bypasses the check of the parent partitioned table's tablespace in DefineRelation() as the table is not a partition when it is defined. Doing this also means that when restoring partitions they now maintain their original column ordering rather than switch to their partitioned table's column ordering. This is perhaps minor, but it is noteworthy pg_dump in binary upgrade mode already worked this way, so it turns out this commit removes more code than it adds. --- src/bin/pg_dump/pg_dump.c| 118 ++- src/bin/pg_dump/t/002_pg_dump.pl | 12 +++- 2 files changed, 61 insertions(+), 69 deletions(-) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 7cfb67fd36e..bacb4de5c36 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -8618,9 +8618,11 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables) * Normally this is always true, but it's false for dropped columns, as well * as those that were inherited without any local definition. (
Re: finding changed blocks using WAL scanning
On Thu, Apr 18, 2019 at 04:25:24PM -0400, Robert Haas wrote:
> On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian wrote:
> > How would you choose the STARTLSN/ENDLSN? If you could do it per checkpoint, rather than per-WAL, I think that would be great.
> I thought of that too. It seems appealing, because you probably only really care whether a particular block was modified between one checkpoint and the next, not exactly when during that interval it was modified. However, the simple algorithm of "just stop when you get to a checkpoint record" does not work, because the checkpoint record itself points back to a much earlier LSN, and I think that it's that earlier LSN that is interesting. So if you want to make this work you have to be more clever, and I'm not sure I'm clever enough.

OK, so let's back up and study how this will be used. Someone wanting to make a useful incremental backup will need the changed blocks from the time of the start of the base backup. It is fine if they incrementally back up some blocks modified _before_ the base backup, but they need all blocks after, until some marker. They will obviously still use WAL to recover to a point after the incremental backup, so there is no need to get every modified block up to current, just up to some cut-off point where WAL can be discarded.

I can see a 1GB marker being used for that. It would prevent an incremental backup from being done until the first 1G modblock file was written, since until then there is no record of modified blocks, but that seems fine. A 1G marker would allow for consistent behavior independent of server restarts and base backups. How would the modblock file record all the modified blocks across restarts and crashes? I assume that 1G of WAL would not be available for scanning.
I suppose that writing a modblock file to some PGDATA location when WAL is removed would work since during a crash the modblock file could be updated with the contents of the existing pg_wal files. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
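A modblock file as discussed here is essentially the de-duplicated union of the block references found while scanning a range of WAL segments. An illustrative Python model of that summarisation step (the record and segment shapes are invented for the example; the real proposal would read actual WAL with a WAL reader and choose a binary on-disk format):

```python
def blocks_in_segment(segment):
    """Stand-in for a WAL reader: yield one (relfilenode, fork, blkno)
    reference per block touched by each record in the segment."""
    for record in segment:
        yield from record

def write_modblock_file(segments):
    """Summarise a range of WAL segments into one de-duplicated, sorted
    list of modified blocks -- the proposed 'modblock' file contents."""
    modified = set()
    for segment in segments:
        modified.update(blocks_in_segment(segment))
    return sorted(modified)

# Two toy segments; a record is just a list of block references here.
seg1 = [[(16384, 0, 1), (16384, 0, 2)], [(16384, 0, 1)]]
seg2 = [[(16385, 0, 7), (16384, 0, 2)]]
print(write_modblock_file([seg1, seg2]))
```

Because the result is a set union, modblock files covering adjacent WAL ranges can themselves be merged the same way, which is what an incremental backup spanning several markers would do.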
Re: block-level incremental backup
Greetings, I wanted to respond to this point specifically as I feel like it'll really help clear things up when it comes to the point of view I'm seeing this from. * Robert Haas (robertmh...@gmail.com) wrote: > > Perhaps that's what we're both saying too and just talking past each > > other, but I feel like the approach here is "make it work just for the > > simple pg_basebackup case and not worry too much about the other tools, > > since what we do for pg_basebackup will work for them too" while where > > I'm coming from is "focus on what the other tools need first, and then > > make pg_basebackup work with that if there's a sensible way to do so." > > I think perhaps the disconnect is that I just don't see how it can > fail to work for the external tools if it works for pg_basebackup. The existing backup protocol that pg_basebackup uses *does* *not* *work* for the external backup tools. If it worked, they'd use it, but they don't and that's because you can't do things like a parallel backup, which we *know* users want because there's a number of tools which implement that exact capability. I do *not* want another piece of functionality added in this space which is limited in the same way because it does *not* help the external backup tools at all. > Any given piece of functionality is either available in the > replication stream, or it's not. I suspect that for both BART and > pg_backrest, they won't be able to completely give up on having their > own backup engines solely because core has incremental backup, but I > don't know what the alternative to adding features to core one at a > time is. 
This idea that it's either "in the replication system" or "not in the replication system" is really bad, in my view, because it can be "in the replication system" and at the same time not at all useful to the existing external backup tools, but users and others will see the "checkbox" as ticked and assume that it's available in a useful fashion by the backend and then get upset when they discover the limitations. The existing base backup/replication protocol that's used by pg_basebackup is *not* useful to most of the backup tools, that's quite clear since they *don't* use it. Building on to that an incremental backup solution that is similarly limited isn't going to make things easier for the external tools. If the goal is to make things easier for the external tools by providing capability in the backend / replication protocol then we need to be looking at what those tools require and not at what would be minimally sufficient for pg_basebackup. If we don't care about the external tools and *just* care about making it work for pg_basebackup, then let's be clear about that, and accept that it'll have to be, most likely, ripped out and rewritten when we go to add parallel capabilities, for example, to pg_basebackup down the road. That's clearly the case for the existing "base backup" protocol, so I don't see why it'd be different for an incremental backup system that is similarly designed and implemented. To be clear, I'm all for adding features to core one at a time, but there are different ways to implement features and that's really what we're talking about here- what's the best way to implement this feature, ideally in a way that it's useful, practically, to both pg_basebackup and the other external backup utilities. Thanks! Stephen signature.asc Description: PGP signature
Re: Unhappy about API changes in the no-fsm-for-small-rels patch
Andres Freund writes: > My compromise suggestion would be to try to give John and Amit ~2 weeks > to come up with a cleanup proposal, and then decide whether to 1) revert > 2) apply the new patch, 3) decide to live with the warts for 12, and > apply the patch in 13. As we would already have a patch, 3) seems like > it'd be more tenable than without. Seems reasonable. I think we should shoot to have this resolved before the end of the month, but it doesn't have to be done immediately. regards, tom lane
Re: Unhappy about API changes in the no-fsm-for-small-rels patch
Hi, On 2019-04-17 12:20:29 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2019-04-17 15:49:29 +0530, Amit Kapila wrote: > >> OTOH, if we want to extend it later for whatever reason to a relation > >> level cache, it shouldn't be that difficult as the implementation is > >> mostly contained in freespace.c (fsm* functions) and I think the > >> relation is accessible in most places. We might need to rip out some > >> calls to clearlocalmap. > > > But it really isn't contained to freespace.c. That's my primary > > concern. You added new parameters (undocumented ones!), > > FSMClearLocalMap() needs to be called by callers and xlog, etc. > > Given where we are in the release cycle, and the major architectural > concerns that have been raised about this patch, should we just > revert it and try again in v13, rather than trying to fix it under > time pressure? It's not like there's not anything else on our > plates to fix before beta. Hm. I'm of split mind here: It's a nice improvement, and the fixes probably wouldn't be that hard. And we could have piped up a bit earlier about these concerns (I only noticed this when rebasing zheap onto the newest version of postgres). But as you say, it's also late, and there's other stuff to do. Although I think neither Amit nor John is heavily involved in any... My compromise suggestion would be to try to give John and Amit ~2 weeks to come up with a cleanup proposal, and then decide whether to 1) revert 2) apply the new patch, 3) decide to live with the warts for 12, and apply the patch in 13. As we would already have a patch, 3) seems like it'd be more tenable than without. Regards, Andres
Re: Pluggable Storage - Andres's take
Hi, On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote: > The comments for relation_set_new_relfilenode() callback say that the AM can > set *freezeXid and *minmulti to invalid. But when I did that, VACUUM hits > this assertion: > > TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId) > 3)))", File: "vacuum.c", Line: 1323) Hm, that necessary change unfortunately escaped into the zheap tree (which indeed doesn't set relfrozenxid). That's why I'd not noticed this. How about something like the attached? I found a related problem in VACUUM FULL / CLUSTER while working on the above, not fixed in the attached yet. Namely even if a relation doesn't yet have a valid relfrozenxid/relminmxid before a VACUUM FULL / CLUSTER, we'll set one after that. That's not great. I suspect the easiest fix would be to make the relevant relation_copy_for_cluster() FreezeXid, MultiXactCutoff arguments into pointers, and allow the AM to reset them to an invalid value if that's the appropriate one. It'd probably be better if we just moved the entire xid limit computation into the AM, but I'm worried that we actually need to move it *further up* instead - independent of this change. I don't think it's quite right to allow a table with a toast table to be independently VACUUM FULL/CLUSTERed from the toast table. GetOldestXmin() can go backwards for a myriad of reasons (or limited by old_snapshot_threshold), and I'm fairly certain that e.g. VACUUM FULLing the toast table, setting a lower old_snapshot_threshold, and VACUUM FULLing the main table would result in failures. I think we need to fix this for 12, rather than wait for 13. Does anybody disagree? Greetings, Andres Freund >From 5c84256ea5e41055b0cb9e0dc121a4daaca43336 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Thu, 18 Apr 2019 13:20:31 -0700 Subject: [PATCH] Allow pg_class xid & multixid horizons to not be set. This allows table AMs that don't need these horizons. 
This was already documented in the tableam relation_set_new_filenode callback, but an assert prevented it from actually working (the test AM code contained the necessary changes itself). Reported-By: Heikki Linnakangas Author: Andres Freund Discussion: https://postgr.es/m/9a7fb9cc-2419-5db7-8840-ddc10c93f...@iki.fi --- src/backend/access/heap/vacuumlazy.c | 4 +++ src/backend/commands/vacuum.c| 53 src/backend/postmaster/autovacuum.c | 4 +-- 3 files changed, 45 insertions(+), 16 deletions(-) diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index 8dc76fa8583..9364cd4c33f 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -213,6 +213,10 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params, Assert(params != NULL); Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT); + /* not every AM requires these to be valid, but heap does */ + Assert(TransactionIdIsNormal(onerel->rd_rel->relfrozenxid)); + Assert(MultiXactIdIsValid(onerel->rd_rel->relminmxid)); + /* measure elapsed time iff autovacuum logging requires it */ if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0) { diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 1a7291d94bc..94fb6f26063 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -1313,36 +1313,61 @@ vac_update_datfrozenxid(void) /* * Only consider relations able to hold unfrozen XIDs (anything else - * should have InvalidTransactionId in relfrozenxid anyway.) + * should have InvalidTransactionId in relfrozenxid anyway). 
*/ if (classForm->relkind != RELKIND_RELATION && classForm->relkind != RELKIND_MATVIEW && classForm->relkind != RELKIND_TOASTVALUE) + { + Assert(!TransactionIdIsValid(classForm->relfrozenxid)); + Assert(!MultiXactIdIsValid(classForm->relminmxid)); continue; - - Assert(TransactionIdIsNormal(classForm->relfrozenxid)); - Assert(MultiXactIdIsValid(classForm->relminmxid)); + } /* + * Some table AMs might not need per-relation xid / multixid + * horizons. It therefore seems reasonable to allow relfrozenxid and + * relminmxid to not be set (i.e. set to their respective Invalid*Id) + * independently. Thus validate and compute horizon for each only if + * set. + * * If things are working properly, no relation should have a * relfrozenxid or relminmxid that is "in the future". However, such * cases have been known to arise due to bugs in pg_upgrade. If we * see any entries that are "in the future", chicken out and don't do - * anything. This ensures we won't truncate clog before those - * relations have been scanned and cleaned up. + * anything. This ensures we won't truncate clog & multixact SLRUs + * before those relations have been scanned and cleaned up. */ - if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) || - MultiXactIdPrecedes(la
Re: block-level incremental backup
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost wrote: > > As I understand it, the problem is not with backing up an individual > > database or cluster, but rather dealing with backing up thousands of > > individual clusters with thousands of tables in each, leading to an > > awful lot of tables with lots of FSMs/VMs, all of which end up having to > > get copied and stored wholesale. I'll point this thread out to him and > > hopefully he'll have a chance to share more specific information. > > Sounds good. Ok, done. > > I can agree with the idea of having multiple options for how to collect > > up the set of changed blocks, though I continue to feel that a > > WAL-scanning approach isn't something that we'd have implemented in the > > backend at all since it doesn't require the backend and a given backend > > might not even have all of the WAL that is relevant. I certainly don't > > think it makes sense to have a backend go get WAL from the archive to > > then merge the WAL to provide the result to a client asking for it- > > that's adding entirely unnecessary load to the database server. > > My motivation for wanting to include it in the database server was twofold: > > 1. I was hoping to leverage the background worker machinery. The > WAL-scanner would just run all the time in the background, and start > up and shut down along with the server. If it's a standalone tool, > then it can run on a different server or when the server is down, both > of which are nice. The downside though is that now you probably have > to put it in crontab or under systemd or something, instead of just > setting a couple of GUCs and letting the server handle the rest. For > me that downside seems rather significant, but YMMV. Background workers can be used to do pretty much anything. 
I'm not suggesting that's a bad thing- just that it's such a completely generic tool that could be used to put anything/everything into the backend, so I'm not sure how much it makes sense as an argument when it comes to designing a new capability/feature. Yes, there's an advantage there when it comes to configuration since that means we don't need to set up a cronjob and can, instead, just set a few GUCs... but it also means that it *must* be done on the server and there's no option to do it elsewhere, as you say. When it comes to "this is something that I can do on the DB server or on some other server", the usual preference is to use another system for it, to reduce load on the server. If it comes down to something that needs to/should be an ongoing process, then the packaging can package that as a daemon-type tool which handles the systemd component to it, assuming the stand-alone tool supports that, which it hopefully would. > 2. In order for the information produced by the WAL-scanner to be > useful, it's got to be available to the server when the server is > asked for an incremental backup. If the information is constructed by > a standalone frontend tool, and stored someplace other than under > $PGDATA, then the server won't have convenient access to it. I guess > we could make it the client's job to provide that information to the > server, but I kind of liked the simplicity of not needing to give the > server anything more than an LSN. If the WAL-scanner tool is a stand-alone tool, and it handles picking out all of the FPIs and incremental page changes for each relation, then what does the tool to build out the "new" backup really need to tell the backend? I feel like it mainly needs to ask the backend for the non-relation files, which gets into at least one approach that I've thought about for redesigning the backup protocol: 1. Ask for a list of files and metadata about them 2. Allow asking for individual files 3. 
Support multiple connections asking for individual files Quite a few of the existing backup tools for PG use a model along these lines (or use tools underneath which do). > > A thought that occurs to me is to have the functions for supporting the > > WAL merging be included in libcommon and available to both the > > independent executable that's available for doing WAL merging, and to > > the backend to be able to WAL merging itself- > > Yeah, that might be possible. I feel like this would be necessary, as it's certainly delicate and critical code and having multiple implementations of it will be difficult to manage. That said... we already have independent work going on to do WAL merging (WAL-G, at least), and if we insist that the WAL replay code only exists in the backend, I strongly suspect we'll end up with independent implementations of that too. Sure, we can distance ourselves from that and say that we don't have to deal with any bugs from it... but it seems like the better approach would be to have a common library that provides it. > > but for a specific > > purpose: having a way to reduce the amount of WAL that needs to be sent
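[Editorial illustration] The three protocol steps listed above (fetch a file manifest, request individual files, spread requests over several connections) lend themselves to a simple parallel client. A rough sketch in Python, where `list_files` and `fetch_file` are purely hypothetical stand-ins for the proposed protocol commands -- no such commands exist in the replication protocol today:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_backup(list_files, fetch_file, workers=4):
    # Step 1: ask one connection for the manifest of files plus metadata.
    manifest = list_files()
    names = [name for name, _size in manifest]
    # Steps 2 and 3: several workers (each holding its own connection in a
    # real client) request individual files concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(names, pool.map(fetch_file, names)))
```

In a real implementation each worker would hold its own replication connection; here plain callables keep the sketch self-contained.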
Re: finding changed blocks using WAL scanning
On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian wrote: > How would you choose the STARTLSN/ENDLSN? If you could do it per > checkpoint, rather than per-WAL, I think that would be great. I thought of that too. It seems appealing, because you probably only really care whether a particular block was modified between one checkpoint and the next, not exactly when during that interval it was modified. However, the simple algorithm of "just stop when you get to a checkpoint record" does not work, because the checkpoint record itself points back to a much earlier LSN, and I think that it's that earlier LSN that is interesting. So if you want to make this work you have to be more clever, and I'm not sure I'm clever enough. I think it's important that a .modblock file not be too large, because then it will use too much memory, and that it not cover too much WAL, because then it will be too imprecise about when the blocks were modified. Perhaps we should have a threshold for each -- e.g. emit the next .modblock file after finding 2^20 distinct block references or scanning 1GB of WAL. Then individual files would probably be in the single-digit numbers of megabytes in size, assuming we do a decent job with the compression, and you never need to scan more than 1GB of WAL to regenerate one. If the starting point for a backup falls in the middle of such a file, and you include the whole file, at worst you have ~8GB of extra blocks to read, but in most cases less, because your writes probably have some locality and the file may not actually contain the full 2^20 block references. You could also make it more fine-grained than that if you don't mind having more smaller files floating around. It would definitely be better if we could set things up so that we could always switch to the next .modblock file when we cross a potential redo start point, but they're not noted in the WAL so I don't see how to do that. 
I don't know if it would be possible to insert some new kind of log record concurrently with fixing the redo location, so that redo always started at a record of this new type. That would certainly be helpful for this kind of thing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
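[Editorial illustration] The two-threshold scheme Robert describes (emit the next .modblock file after 2^20 distinct block references or after scanning 1GB of WAL, whichever comes first) could be sketched as follows; the class and all names are illustrative, not from any PostgreSQL source:

```python
class ModblockAccumulator:
    """Collect distinct block references from scanned WAL records and
    report when either threshold suggests emitting a .modblock file."""

    def __init__(self, max_refs=2 ** 20, max_wal_bytes=1 << 30):
        self.max_refs = max_refs
        self.max_wal_bytes = max_wal_bytes
        self.refs = set()       # distinct (relation OID, block number) pairs
        self.wal_bytes = 0      # WAL consumed since the last flush

    def add_record(self, record_len, block_refs):
        """Fold in one decoded WAL record; return True when a flush is due."""
        self.refs.update(block_refs)
        self.wal_bytes += record_len
        return (len(self.refs) >= self.max_refs
                or self.wal_bytes >= self.max_wal_bytes)

    def flush(self):
        """Hand back the accumulated references and reset both counters."""
        refs, self.refs, self.wal_bytes = self.refs, set(), 0
        return refs
```

Because `refs` is a set, repeated writes to the same block cost nothing extra, which is why files for workloads with write locality come out well under the 2^20 cap, as noted above.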
Re: finding changed blocks using WAL scanning
On Thu, Apr 18, 2019 at 03:43:30PM -0400, Robert Haas wrote: > You can make it kinda make sense by saying "the blocks modified by > records *beginning in* segment XYZ" or alternatively "the blocks > modified by records *ending in* segment XYZ", but that seems confusing > to me. For example, suppose you decide on the first one -- > 000100010068.modblock will contain all blocks modified by > records that begin in 000100010068. Well, that means that > to generate the 000100010068.modblock, you will need > access to 000100010068 AND probably also > 000100010069 and in rare cases perhaps > 00010001006A or even later files. I think that's actually > pretty confusing. > > It seems better to me to give the files names like > ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g. > 0001.00016858.0001687DBBB8.modblock, so that you can > see exactly which *records* are covered by that segment. How would you choose the STARTLSN/ENDLSN? If you could do it per checkpoint, rather than per-WAL, I think that would be great. > And I suspect it may also be a good idea to bunch up the records from > several WAL files. Especially if you are using 16MB WAL files, > collecting all of the block references from a single WAL file is going > to produce a very small file. I suspect that the modified block files > will end up being 100x smaller than the WAL itself, perhaps more, and > I don't think anybody will appreciate us adding another PostgreSQL > systems that spews out huge numbers of tiny little files. If, for > example, somebody's got a big cluster that is churning out a WAL > segment every second, they would probably still be happy to have a new > modified block file only, say, every 10 seconds. Agreed. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: finding changed blocks using WAL scanning
On Mon, Apr 15, 2019 at 11:45 PM Michael Paquier wrote: > On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote: > > That is pretty much exactly what I was intending to propose. > > Any caller of XLogWrite() could switch to a new segment once the > current one is done, and I am not sure that we would want some random > backend to potentially slow down to do that kind of operation. > > Or would a separate background worker do this work by itself? An > external tool can do that easily already: > https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks I was thinking that a dedicated background worker would be a good option, but Stephen Frost seems concerned (over on the other thread) about how much load that would generate. That never really occurred to me as a serious issue and I suspect for many people it wouldn't be, but there might be some. It's cool that you have a command-line tool that does this as well. Over there, it was also discussed that we might want to have both a command-line tool and a background worker. I think, though, that we would want to get the output in some kind of compressed binary format, rather than text. e.g. 4-byte database OID 4-byte tablespace OID any number of relation OID/block OID pairings for that database/tablespace combination 4-byte zero to mark the end of the relation OID/block OID list and then repeat all of the above any number of times That might be too dumb and I suspect we want some headers and a checksum, but we should try to somehow exploit the fact that there aren't likely to be many distinct databases or many distinct tablespaces mentioned -- whereas relation OID and block number will probably have a lot more entropy. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
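[Editorial illustration] The binary layout sketched above could be prototyped like this. This is a hedged sketch, not anything from PostgreSQL: it relies on relation OIDs never being zero, so the 4-byte zero terminator is unambiguous, and it deliberately omits the headers and checksum Robert notes would still be needed:

```python
import struct

def encode_modblock(groups):
    """groups: {(db_oid, tablespace_oid): [(rel_oid, block_no), ...]}"""
    out = bytearray()
    for (db_oid, ts_oid), blocks in groups.items():
        out += struct.pack("<II", db_oid, ts_oid)   # 4-byte database + tablespace OIDs
        for rel_oid, block_no in blocks:
            out += struct.pack("<II", rel_oid, block_no)
        out += struct.pack("<I", 0)                 # 4-byte zero ends this group's list
    return bytes(out)

def decode_modblock(data):
    groups, pos = {}, 0
    while pos < len(data):
        db_oid, ts_oid = struct.unpack_from("<II", data, pos)
        pos += 8
        blocks = []
        while True:
            (rel_oid,) = struct.unpack_from("<I", data, pos)
            pos += 4
            if rel_oid == 0:        # relation OIDs are never 0, so this is the terminator
                break
            (block_no,) = struct.unpack_from("<I", data, pos)
            pos += 4
            blocks.append((rel_oid, block_no))
        groups[(db_oid, ts_oid)] = blocks
    return groups
```

Grouping by database/tablespace is what exploits the low entropy of those two fields: each is stored once per group rather than once per block reference.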
Re: Runtime pruning problem
I wrote: > [ let's fix this by reverting ruleutils back to using Plans not PlanStates ] BTW, while I suspect the above wouldn't be a huge patch, it doesn't seem trivial either. Since the issue is (a) cosmetic and (b) not new (v11 behaves the same way), I don't think we should consider it to be an open item for v12. I suggest leaving this as a to-do for v13. regards, tom lane
Re: finding changed blocks using WAL scanning
On Mon, Apr 15, 2019 at 10:22 PM Bruce Momjian wrote: > > > I am thinking tools could retain modblock files along with WAL, could > > > pull full-page-writes from WAL, or from PGDATA. It avoids the need to > > > scan 16MB WAL files, and the WAL files and modblock files could be > > > expired independently. > > > > That is pretty much exactly what I was intending to propose. > > OK, good. Some of your wording was vague so I was unclear exactly what > you were suggesting. Well, I guess the part that isn't like what I was suggesting is the idea that there should be exactly one modified block file per segment. The biggest problem with that idea is that a single WAL record can be split across two segments (or, in pathological cases, perhaps more). I think it makes sense to talk about the blocks modified by WAL between LSN A and LSN B, but it doesn't make much sense to talk about the blocks modified by the WAL in segment XYZ. You can make it kinda make sense by saying "the blocks modified by records *beginning in* segment XYZ" or alternatively "the blocks modified by records *ending in* segment XYZ", but that seems confusing to me. For example, suppose you decide on the first one -- 000100010068.modblock will contain all blocks modified by records that begin in 000100010068. Well, that means that to generate the 000100010068.modblock, you will need access to 000100010068 AND probably also 000100010069 and in rare cases perhaps 00010001006A or even later files. I think that's actually pretty confusing. It seems better to me to give the files names like ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g. 0001.00016858.0001687DBBB8.modblock, so that you can see exactly which *records* are covered by that segment. And I suspect it may also be a good idea to bunch up the records from several WAL files. Especially if you are using 16MB WAL files, collecting all of the block references from a single WAL file is going to produce a very small file. 
I suspect that the modified block files will end up being 100x smaller than the WAL itself, perhaps more, and I don't think anybody will appreciate us adding another PostgreSQL system that spews out huge numbers of tiny little files. If, for example, somebody's got a big cluster that is churning out a WAL segment every second, they would probably still be happy to have a new modified block file only, say, every 10 seconds. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
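[Editorial illustration] The ${TLI}.${STARTLSN}.${ENDLSN}.modblock naming proposed above can be handled with two trivial helpers. A hedged sketch -- the zero-padding widths are inferred from Robert's single example and are an assumption, not a settled format:

```python
def format_modblock_name(tli, start_lsn, end_lsn):
    # Widths chosen to reproduce the example 0001.00016858.0001687DBBB8.modblock
    return "%04X.%08X.%012X.modblock" % (tli, start_lsn, end_lsn)

def parse_modblock_name(name):
    # Recover (timeline, start LSN, end LSN) from a .modblock file name.
    tli, start, end, suffix = name.split(".")
    if suffix != "modblock":
        raise ValueError("not a modblock file name: %s" % name)
    return int(tli, 16), int(start, 16), int(end, 16)
```

The point of encoding both LSN endpoints in the name is that a backup tool can decide which files it needs for a given threshold LSN from directory listings alone, without opening any file.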
Re: pgsql: Fix plan created for inherited UPDATE/DELETE with all tables exc
Amit Langote writes: > On 2019/02/23 2:23, Tom Lane wrote: >> Fix plan created for inherited UPDATE/DELETE with all tables excluded. > I noticed that we may have forgotten to fix one more thing in this commit > -- nominalRelation may refer to a range table entry that's not referenced > elsewhere in the finished plan: > create table parent (a int); > create table child () inherits (parent); > explain verbose update parent set a = a where false; > QUERY PLAN > ─── > Update on public.parent (cost=0.00..0.00 rows=0 width=0) >Update on public.parent >-> Result (cost=0.00..0.00 rows=0 width=0) > Output: a, ctid > One-Time Filter: false > (5 rows) > I think the "Update on public.parent" shown in the 2nd row is unnecessary, > because it won't really be updated. Well, perhaps, but nonetheless that's what the plan tree shows. Moreover, even though it will receive no row changes, we're going to fire statement-level triggers on it. So I'm not entirely convinced that suppressing it is a step forward ... > As you may notice, Result node's targetlist is shown differently than > before, that is, columns are shown prefixed with table name. ... and that definitely isn't one. I also think that your patch largely falsifies the discussion at 1543ff: * Set the nominal target relation of the ModifyTable node if not * already done. If the target is a partitioned table, we already set * nominalRelation to refer to the partition root, above. For * non-partitioned inheritance cases, we'll use the first child * relation (even if it's excluded) as the nominal target relation. * Because of the way expand_inherited_rtentry works, that should be * the RTE representing the parent table in its role as a simple * member of the inheritance set. * * It would be logically cleaner to *always* use the inheritance * parent RTE as the nominal relation; but that RTE is not otherwise * referenced in the plan in the non-partitioned inheritance case. 
* Instead the duplicate child RTE created by expand_inherited_rtentry * is used elsewhere in the plan, so using the original parent RTE * would give rise to confusing use of multiple aliases in EXPLAIN * output for what the user will think is the "same" table. OTOH, * it's not a problem in the partitioned inheritance case, because * there is no duplicate RTE for the parent. We'd have to rethink exactly what the goals are if we want to change the definition of nominalRelation like this. One idea, perhaps, is to teach explain.c to not list partitioned tables as subsidiary targets, independently of nominalRelation. But I'm not convinced that we need to do anything at all here. The existing output for this case is exactly like it was in v10 and v11. regards, tom lane
Re: proposal: psql PSQL_TABULAR_PAGER variable
Thu, Apr 18, 2019 at 21:12, Alvaro Herrera wrote: > On 2019-Apr-18, Pavel Stehule wrote: > > > I don't know any about other cases. Other results in psql has tabular > > format. > > What about EXPLAIN? > I forgot it, thank you Pavel > -- > Álvaro Herrera https://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >
Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint
On Tue, Apr 16, 2019 at 2:45 AM Michael Paquier wrote: > I think that we have race conditions with checkpointing and shutdown > on HEAD. Any idea whether it's something newly-introduced or of long standing? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: proposal: psql PSQL_TABULAR_PAGER variable
On 2019-Apr-18, Pavel Stehule wrote: > I don't know any about other cases. Other results in psql has tabular > format. What about EXPLAIN? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: ALTER TABLE with ADD COLUMN and ADD PRIMARY KEY USING INDEX throws spurious "column contains null values"
On Wed, Apr 17, 2019 at 11:55 PM Tom Lane wrote: > * I'm not sure whether we want to try to back-patch this, or how > far it should go. The misbehavior has been there a long time > (at least back to 8.4, I didn't check further); so the lack of > previous reports means people seldom try to do it. That may > indicate that it's not worth taking any risks of new bugs to > squash this one. (Also, I suspect that it might take a lot of > work to port this to before v10, because there are comments > suggesting that this code worked a good bit differently before.) > I do think we should shove this into v12 though. Shoving it into v12 but not back-patching seems like a reasonable compromise, although I have not reviewed the patch or tried to figure out how risky it is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: block-level incremental backup
Hi, > > Wow. I have to admit that I feel completely opposite of that- I'd > > *love* to have an independent tool (which ideally uses the same code > > through the common library, or similar) that can be run to apply WAL. > > > > In other words, I don't agree that it's the server's problem at all to > > solve that, or, at least, I don't believe that it needs to be. > > I mean, I guess I'd love to have that if I could get it by waving a > magic wand, but I wouldn't love it if I had to write the code or > maintain it. The routines for applying WAL currently all assume that > you have a whole bunch of server infrastructure present; that code > wouldn't run in a frontend environment, I think. I wouldn't want to > have a second copy of every WAL apply routine that might have its own > set of bugs. I'll fight tooth and nail not to have a second implementation of replay, even if it's just portions. The code we have is complicated and fragile enough, having a [partial] second version would be way worse. There's already plenty improvements we need to make to speed up replay, and a lot of them require multiple execution threads (be it processes or OS threads), something not easily feasible in a standalone tool. And without the already existing concurrent work during replay (primarily checkpointer doing a lot of the necessary IO), it'd also be pretty unattractive to use any separate tool. Unless you just define the server binary as that "independent tool". Which I think is entirely reasonable. With the 'consistent' and LSN recovery targets one already can get most of what's needed from such a tool, anyway. I'd argue the biggest issue there is that there's no equivalent to starting postgres with a private socket directory on windows, and perhaps an option or two making it easier to start postgres in a "private" mode for things like this. Greetings, Andres Freund
Re: Runtime pruning problem
On Tue, Apr 16, 2019 at 10:49 PM Amit Langote wrote: > Maybe, not show them? That may be a bit inconsistent, because the point > of VERBOSE is to show the targetlist among other things, but maybe the users > wouldn't mind not seeing it on such empty Append nodes. OTOH, they are > more likely to think seeing a subplan that's clearly prunable as a bug of > the pruning logic. Or maybe we could show them, but the Append could also be flagged in some way that indicates that its child is only a dummy. Everything Pruned: Yes Or something. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: block-level incremental backup
On Wed, Apr 17, 2019 at 6:43 PM Stephen Frost wrote: > Sadly, I haven't got any great ideas today. I do know that the WAL-G > folks have specifically mentioned issues with the visibility map being > large enough across enough of their systems that it kinda sucks to deal > with. Perhaps we could do something like the rsync binary-diff protocol > for non-relation files? This is clearly just hand-waving but maybe > there's something reasonable in that idea. I guess it all comes down to how complicated you're willing to make the client-server protocol. With the very simple protocol that I proposed -- client provides a threshold LSN and server sends blocks modified since then -- the client need not have access to the old incremental backup to take a new one. Of course, if it happens to have access to the old backup then it can delta-compress however it likes after-the-fact, but that doesn't help with the amount of network transfer. That problem could be solved by doing something like what you're talking about (with some probably-negligible false match rate) but I have no intention of trying to implement anything that complicated, and I don't really think it's necessary, at least not for a first version. What I proposed would already allow, for most users, a large reduction in transfer and storage costs; what you are talking about here would help more, but also be a lot more work and impose some additional requirements on the system. I don't object to you implementing the more complex system, but I'll pass. > There's something like 6 different backup tools, at least, for > PostgreSQL that provide backup management, so I have a really hard time > agreeing with this idea that users don't want a PG backup management > system. Maybe that's not what you're suggesting here, but that's what > came across to me. Let me be a little more clear. Different users want different things. 
Some people want a canned PostgreSQL backup solution, while other people just want access to a reasonable set of facilities from which they can construct their own solution. I believe that the proposal I am making here could be used either by backup tool authors to enhance their offerings, or by individuals who want to build up their own solution using facilities provided by core. > Unless maybe I'm misunderstanding and what you're suggesting here is > that the "existing solution" is something like the external PG-specific > backup tools? But then the rest doesn't seem to make sense, as only > maybe one or two of those tools use pg_basebackup internally. Well, what I'm really talking about is in two pieces: providing some new facilities via the replication protocol, and making pg_basebackup able to use those facilities. Nothing would stop other tools from using those facilities directly if they wish. > ... but this is exactly the situation we're in already with all of the > *other* features around backup (parallel backup, backup management, WAL > management, etc). Users want those features, pg_basebackup/PG core > doesn't provide it, and therefore there's a bunch of other tools which > have been written that do. In addition, saying that PG has incremental > backup but no built-in management of those full-vs-incremental backups > and telling users that they basically have to build that themselves > really feels a lot like we're trying to address a check-box requirement > rather than making something that our users are going to be happy with. I disagree. Yes, parallel backup, like incremental backup, needs to go in core. And pg_basebackup should be able to do a parallel backup. I will fight tooth, nail, and claw any suggestion that the server should know how to do a parallel backup but pg_basebackup should not have an option to exploit that capability. And similarly for incremental. > I don't think that I was very clear in what my specific concern here > was. 
I'm not asking for pg_basebackup to have parallel backup (at > least, not in this part of the discussion), I'm asking for the > incremental block-based protocol that's going to be built-in to core to > be able to be used in a parallel fashion. > > The existing protocol that pg_basebackup uses is basically, connect to > the server and then say "please give me a tarball of the data directory" > and that is then streamed on that connection, making that protocol > impossible to use for parallel backup. That's fine as far as it goes > because only pg_basebackup actually uses that protocol (note that nearly > all of the other tools for doing backups of PostgreSQL don't...). If > we're expecting the external tools to use the block-level incremental > protocol then that protocol really needs to have a way to be > parallelized, otherwise we're just going to end up with all of the > individual tools doing their own thing for block-level incremental > (though perhaps they'd reimplement whatever is done in core but in a way > that they could parallelize it...), if possible
Re: Runtime pruning problem
Amit Langote writes: > On 2019/04/17 13:10, Tom Lane wrote: >> No, the larger issue is that *any* plan node above the Append might >> be recursing down to/through the Append to find out what to print for >> a Var reference. We have to be able to support that. > I wonder why the original targetlist of these nodes, adjusted using just > fix_scan_list(), wouldn't have been better for EXPLAIN to use? So what I'm thinking is that I made a bad decision in 1cc29fe7c, which did this: ... In passing, simplify the EXPLAIN code by having it deal primarily in the PlanState tree rather than separately searching Plan and PlanState trees. This is noticeably cleaner for subplans, and about a wash elsewhere. It was definitely silly to have the recursion in explain.c passing down both Plan and PlanState nodes, when the former is always easily accessible from the latter. So that was an OK change, but at the same time I changed ruleutils.c to accept PlanState pointers not Plan pointers from explain.c, and that is now looking like a bad idea. If we were to revert that decision, then instead of assuming that an AppendState always has at least one live child, we'd only have to assume that an Append has at least one live child. Which is true. I don't recall that there was any really strong reason for switching ruleutils' API like that, although maybe if we look harder we'll find one. I think it was mainly just for consistency with the way that explain.c now looks at the world; which is not a negligible consideration, but it's certainly something we could overrule. > Another idea is to teach explain.c about this special case of run-time > pruning having pruned all child subplans even though appendplans contains > one element to cater for targetlist accesses. That is, Append will be > displayed with "Subplans Removed: All" and no child subplans listed below > it, even though appendplans[] has one. 
David already said he didn't do that in > the first place to avoid PartitionPruneInfo details creeping into other > modules, but maybe there's no other way? I tried simply removing the hack in nodeAppend.c (per quick-hack patch below), and it gets through the core regression tests without a crash, and with output diffs that seem fine to me. However, that just shows that we lack enough test coverage; we evidently have no regression cases where an upper node needs to print Vars that are coming from a fully-pruned Append. Given the test case mentioned in this thread, I get

regression=# explain verbose select * from t1 where dt = current_date + 400;
                  QUERY PLAN
-----------------------------------------------
 Append  (cost=0.00..198.42 rows=44 width=8)
   Subplans Removed: 4
(2 rows)

which seems fine, but

regression=# explain verbose select * from t1 where dt = current_date + 400 order by id;
psql: server closed the connection unexpectedly

It's dying trying to resolve Vars in the Sort node, of course.

			regards, tom lane

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index f3be242..3cd8940 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -143,22 +143,13 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 									  list_length(node->appendplans));
 
 		/*
-		 * The case where no subplans survive pruning must be handled
-		 * specially.  The problem here is that code in explain.c requires
-		 * an Append to have at least one subplan in order for it to
-		 * properly determine the Vars in that subplan's targetlist.  We
-		 * sidestep this issue by just initializing the first subplan and
-		 * setting as_whichplan to NO_MATCHING_SUBPLANS to indicate that
-		 * we don't really need to scan any subnodes.
+		 * If no subplans survive pruning, change as_whichplan to
+		 * NO_MATCHING_SUBPLANS, to indicate that we don't really need to
+		 * scan any subnodes.
 		 */
 		if (bms_is_empty(validsubplans))
-		{
 			appendstate->as_whichplan = NO_MATCHING_SUBPLANS;
-
-			/* Mark the first as valid so that it's initialized below */
-			validsubplans = bms_make_singleton(0);
-		}
-
 		nplans = bms_num_members(validsubplans);
 	}
 	else
@@ -174,10 +165,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 		 * immediately, preventing later calls to ExecFindMatchingSubPlans.
 		 */
 		if (!prunestate->do_exec_prune)
-		{
-			Assert(nplans > 0);
 			appendstate->as_valid_subplans = bms_add_range(NULL, 0,
 														   nplans - 1);
-		}
 	}
 	else
 	{
Re: Question about the holdable cursor
On Thu, Apr 18, 2019 at 10:09 PM Tom Lane wrote: > Andy Fan writes: > > when I fetch from holdable cursor, I found the fact is more complex than I > > expected. > > ... > > why the 3rd time is necessary and will the performance be bad due to this > > design? > > If you read the whole cursor output, then close the transaction and > persist the cursor, yes we'll read it twice, and yes it's bad for that > case. The design is intended to perform well in these other cases: > > Thank you Tom for the reply!! It looks like this situation is hard to produce, but I just hit it :( Please help me confirm my understanding:

1. We have two methods to reproduce it:

Method 1:
a) begin;  -- begin the transaction explicitly
b) declare c1 cursor WITH HOLD for select * from t;  -- declare the cursor with the HOLD option
c) fetch n c1;  -- this runs ExecutePlan the first time
d) commit;  -- commit the transaction explicitly, which causes the 2nd ExecutePlan and writes ALL the records into the tuplestore

Method 2:
a) declare c1 cursor WITH HOLD for select * from t; fetch n c1;  -- send 1 query with 2 statements, with an implicit transaction begin/commit (even so, I don't know how to send "declare c1 cursor WITH HOLD for select * from t; fetch n c1;" as one query in the psql shell)

2. A somewhat more normal case:

a) declare c1 cursor WITH HOLD for select * from t;  -- the transaction is started and committed implicitly; during the commit, ExecutePlan is called the first time, which GETS ALL THE RECORDS and stores all of them (what if the result is very big -- write it to a file?)
b) fetch 10 c1;  -- will not run ExecutePlan any more, even though GETTING ALL THE RECORDS at step a) was expensive

3. Without the HOLD option:

a) begin;
b) declare c1 cursor for select * from t;  -- without the HOLD option
c) fetch 1 c1;  -- this only scans 1 row
d) commit;

In this case the connection can't be used for other transactions until I commit the transaction for the cursor (which is something I dislike for now).

Could you help me confirm that my understanding of these 3 topics is correct? Thanks

> 1. The HOLD option isn't really being used, ie you just read and > close the cursor within the original transaction. This is important > because applications are frequently sloppy about marking cursors as > WITH HOLD. > > 2. You declare the cursor and persist it before reading anything from it. > (This is really the typical use-case for held cursors, IMV.) > > FWIW, I don't see any intermediate tuplestore in there when > dealing with a PORTAL_ONE_SELECT query, which is the only > case that's possible with a cursor no? > > regards, tom lane >
Re: block-level incremental backup
Hi, On 2019-04-18 11:34:32 -0400, Bruce Momjian wrote: > On Thu, Apr 18, 2019 at 05:32:57PM +0200, David Fetter wrote: > > On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote: > > > Also, instead of storing the file name and block number in the modblock > > > file, using the database oid, relfilenode, and block number (3 int32 > > > values) should be sufficient. > > > > Would doing it that way constrain the design of new table access > > methods in some meaningful way? > > I think these are the values used in WAL, so I assume table access > methods already have to map to those, unless they use their own. > I actually don't know. I don't think it'd be a meaningful restriction. Given that we use those for shared_buffer descriptors, WAL etc. Greetings, Andres Freund
Re: block-level incremental backup
On Wed, Apr 17, 2019 at 5:20 PM Stephen Frost wrote: > As I understand it, the problem is not with backing up an individual > database or cluster, but rather dealing with backing up thousands of > individual clusters with thousands of tables in each, leading to an > awful lot of tables with lots of FSMs/VMs, all of which end up having to > get copied and stored wholesale. I'll point this thread out to him and > hopefully he'll have a chance to share more specific information. Sounds good. > I can agree with the idea of having multiple options for how to collect > up the set of changed blocks, though I continue to feel that a > WAL-scanning approach isn't something that we'd have implemented in the > backend at all since it doesn't require the backend and a given backend > might not even have all of the WAL that is relevant. I certainly don't > think it makes sense to have a backend go get WAL from the archive to > then merge the WAL to provide the result to a client asking for it- > that's adding entirely unnecessary load to the database server. My motivation for wanting to include it in the database server was twofold: 1. I was hoping to leverage the background worker machinery. The WAL-scanner would just run all the time in the background, and start up and shut down along with the server. If it's a standalone tool, then it can run on a different server or when the server is down, both of which are nice. The downside though is that now you probably have to put it in crontab or under systemd or something, instead of just setting a couple of GUCs and letting the server handle the rest. For me that downside seems rather significant, but YMMV. 2. In order for the information produced by the WAL-scanner to be useful, it's got to be available to the server when the server is asked for an incremental backup. If the information is constructed by a standalone frontend tool, and stored someplace other than under $PGDATA, then the server won't have convenient access to it. 
I guess we could make it the client's job to provide that information to the server, but I kind of liked the simplicity of not needing to give the server anything more than an LSN. > A thought that occurs to me is to have the functions for supporting the > WAL merging be included in libcommon and available to both the > independent executable that's available for doing WAL merging, and to > the backend to be able to WAL merging itself- Yeah, that might be possible. > but for a specific > purpose: having a way to reduce the amount of WAL that needs to be sent > to a replica which has a replication slot but that's been disconnected > for a while. Of course, there'd have to be some way to handle the other > files for that to work to update a long out-of-date replica. Now, if we > taught the backup tool about having a replication slot then perhaps we > could have the backend effectively have the same capability proposed > above, but without the need to go get the WAL from the archive > repository. Hmm, but you can't just skip over WAL records or segments because there are checksums and previous-record pointers and things > I'm still not entirely sure that this makes sense to do in the backend > due to the additional load, this is really just some brainstorming. Would it really be that much load? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: proposal: psql PSQL_TABULAR_PAGER variable
čt 18. 4. 2019 v 18:35 odesílatel Bruce Momjian napsal: > On Thu, Apr 18, 2019 at 06:06:40PM +0200, Pavel Stehule wrote: > > > > > > čt 18. 4. 2019 v 17:58 odesílatel Bruce Momjian > napsal: > > > > On Thu, Apr 18, 2019 at 05:45:24PM +0200, Pavel Stehule wrote: > > > čt 18. 4. 2019 v 15:51 odesílatel Bruce Momjian > > napsal: > > > In testing pspg, it seems to work fine with tabular and > > > \x-non-tabular data. Are you asking for a pager option that is only used for > non-\x > > > display? What do people want the non-pspg pager to do? > > > > > > My idea is following - pseudocode > > > > > > else /* for \h xxx */ > > > > Well, normal output and \x look fine in pspg, and \h doesn't use the > > pager unless it is more than one screen. If I do '\h *' it uses > pspg, > > but how often do people do that? Most \h display doesn't use a > pager, > > so I don't see the point. > > > > > > It depends on terminal size. On my terminal the pager is used almost every time. > \? is > > the same. > > > > pspg can work like a classic pager, but it is not optimized for this > purpose. > > Uh, the odd thing is that \? and sometimes \h are the only cases I can > see where using the classic pager has much value. Are there more cases? > If not, I don't see the value in having a separate configuration > variable for this. I don't know of any other cases. Other results in psql have tabular format. Pavel > -- > Bruce Momjian http://momjian.us > EnterpriseDB http://enterprisedb.com > > + As you are, so once was I. As I am, so you will be. + > + Ancient Roman grave inscription + >
Re: proposal: psql PSQL_TABULAR_PAGER variable
On Thu, Apr 18, 2019 at 06:06:40PM +0200, Pavel Stehule wrote: > > > čt 18. 4. 2019 v 17:58 odesílatel Bruce Momjian napsal: > > On Thu, Apr 18, 2019 at 05:45:24PM +0200, Pavel Stehule wrote: > > čt 18. 4. 2019 v 15:51 odesílatel Bruce Momjian > napsal: > > In testing pspg, it seems to work fine with tabular and \x-non-tabular > > data. Are you asking for a pager option that is only used for non-\x > > display? What do people want the non-pspg pager to do? > > > > My idea is following - pseudocode > > > > else /* for \h xxx */ > > Well, normal output and \x look fine in pspg, and \h doesn't use the > pager unless it is more than one screen. If I do '\h *' it uses pspg, > but how often do people do that? Most \h display doesn't use a pager, > so I don't see the point. > > > It depends on terminal size. On my terminal the pager is used almost every time. > \? is > the same. > > pspg can work like a classic pager, but it is not optimized for this purpose. Uh, the odd thing is that \? and sometimes \h are the only cases I can see where using the classic pager has much value. Are there more cases? If not, I don't see the value in having a separate configuration variable for this. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: proposal: psql PSQL_TABULAR_PAGER variable
čt 18. 4. 2019 v 17:58 odesílatel Bruce Momjian napsal: > On Thu, Apr 18, 2019 at 05:45:24PM +0200, Pavel Stehule wrote: > > čt 18. 4. 2019 v 15:51 odesílatel Bruce Momjian > napsal: > > In testing pspg, it seems to work fine with tabular and > \x-non-tabular > > data. Are you asking for a pager option that is only used for non-\x > > display? What do people want the non-pspg pager to do? > > > > My idea is following - pseudocode > > > > else /* for \h xxx */ > > Well, normal output and \x look fine in pspg, and \h doesn't use the > pager unless it is more than one screen. If I do '\h *' it uses pspg, > but how often do people do that? Most \h display doesn't use a pager, > so I don't see the point. > It depends on the terminal size. On my terminal the pager is used almost every time; \? is the same. pspg can work like a classic pager, but it is not optimized for that purpose. > -- > Bruce Momjian http://momjian.us > EnterpriseDB http://enterprisedb.com > > + As you are, so once was I. As I am, so you will be. + > + Ancient Roman grave inscription + >
Re: proposal: psql PSQL_TABULAR_PAGER variable
On Thu, Apr 18, 2019 at 05:45:24PM +0200, Pavel Stehule wrote: > čt 18. 4. 2019 v 15:51 odesílatel Bruce Momjian napsal: > In testing pspg, it seems to work fine with tabular and \x-non-tabular > data. Are you asking for a pager option that is only used for non-\x > display? What do people want the non-pspg pager to do? > > My idea is following - pseudocode > > else /* for \h xxx */ Well, normal output and \x look fine in pspg, and \h doesn't use the pager unless it is more than one screen. If I do '\h *' it uses pspg, but how often do people do that? Most \h display doesn't use a pager, so I don't see the point. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: proposal: psql PSQL_TABULAR_PAGER variable
čt 18. 4. 2019 v 15:51 odesílatel Bruce Momjian napsal: > On Thu, Apr 18, 2019 at 07:20:37AM +0200, Pavel Stehule wrote: > > Hi > > > > I wrote a pspg pager https://github.com/okbob/pspg > > > > This pager is designed for tabular data. It can work in fallback mode as > > a classic pager, but it is not designed for this purpose (and I don't plan > to do > > it). Can we enhance the set of psql environment variables with a > > PSQL_TABULAR_PAGER variable? This pager would be used when psql displays > > tabular data. > > In testing pspg, it seems to work fine with tabular and \x-non-tabular > data. Are you asking for a pager option that is only used for non-\x > display? What do people want the non-pspg pager to do? > My idea is the following - pseudocode:

if view is a table
{
    if is_defined PSQL_TABULAR_PAGER
    {
        pager = PSQL_TABULAR_PAGER
    }
    else if is_defined PSQL_PAGER
    {
        pager = PSQL_PAGER
    }
    else
    {
        pager = PAGER
    }
}
else /* for \h xxx */
{
    if is_defined PSQL_PAGER
    {
        pager = PSQL_PAGER
    }
    else
    {
        pager = PAGER
    }
}

I expect some configuration like

PSQL_TABULAR_PAGER=pspg
PSQL_PAGER="less -S"

Regards Pavel > -- > Bruce Momjian http://momjian.us > EnterpriseDB http://enterprisedb.com > > + As you are, so once was I. As I am, so you will be. + > + Ancient Roman grave inscription + >
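For what it's worth, Pavel's pseudocode maps onto a small getenv() fallback chain. The sketch below is hypothetical -- choose_pager and the "more" default are invented, and psql's real pager handling lives elsewhere and differs in detail -- but it shows the intended precedence:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative-only pager selection, following Pavel's pseudocode:
 * tabular output may use PSQL_TABULAR_PAGER; everything (including
 * tabular output when that variable is unset) falls back to
 * PSQL_PAGER, then PAGER, then a built-in default.
 */
static const char *
choose_pager(bool tabular_output)
{
	const char *pager;

	if (tabular_output)
	{
		/* A pager tuned for tables, e.g. pspg, goes here. */
		pager = getenv("PSQL_TABULAR_PAGER");
		if (pager && *pager)
			return pager;
	}

	/* Generic psql pager, then the system-wide one. */
	pager = getenv("PSQL_PAGER");
	if (pager && *pager)
		return pager;

	pager = getenv("PAGER");
	if (pager && *pager)
		return pager;

	return "more";				/* hypothetical built-in default */
}
```

With PSQL_TABULAR_PAGER=pspg and PSQL_PAGER="less -S" set, tabular results would go through pspg while \h output would still go through less.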
Re: block-level incremental backup
On Thu, Apr 18, 2019 at 05:32:57PM +0200, David Fetter wrote: > On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote: > > Also, instead of storing the file name and block number in the modblock > > file, using the database oid, relfilenode, and block number (3 int32 > > values) should be sufficient. > > Would doing it that way constrain the design of new table access > methods in some meaningful way? I think these are the values used in WAL, so I assume table access methods already have to map to those, unless they use their own. I actually don't know. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: block-level incremental backup
On Wed, Apr 17, 2019 at 11:57:35AM -0400, Bruce Momjian wrote: > On Tue, Apr 16, 2019 at 06:40:44PM -0400, Robert Haas wrote: > > Yeah, I really hope we don't end up with dueling patches. I want to > > come up with an approach that can be widely-endorsed and then have > > everybody rowing in the same direction. On the other hand, I do think > > that we may support multiple options in certain places which may have > > the kinds of trade-offs that Bruce mentions. For instance, > > identifying changed blocks by scanning the whole cluster and checking > > the LSN of each block has an advantage in that it requires no prior > > setup or extra configuration. Like a sequential scan, it always > > works, and that is an advantage. Of course, for many people, the > > competing advantage of a WAL-scanning approach that can save a lot of > > I/O will appear compelling, but maybe not for everyone. I think > > there's room for two or three approaches there -- not in the sense of > > competing patches, but in the sense of giving users a choice based on > > their needs. > > Well, by having a separate modblock file for each WAL file, you can keep > both WAL and modblock files and use the modblock list to pull pages from > each WAL file, or from the heap/index files, and it can be done in > parallel. Having WAL and modblock files in the same directory makes > retention simpler. > > In fact, you can do an incremental backup just using the modblock files > and the heap/index files, so you don't even need the WAL. > > Also, instead of storing the file name and block number in the modblock > file, using the database oid, relfilenode, and block number (3 int32 > values) should be sufficient. Would doing it that way constrain the design of new table access methods in some meaningful way? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
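The (database oid, relfilenode, block number) triple Bruce mentions is compact. A hedged sketch of what one entry of such a "modblock" file might look like -- the struct and function names here are invented for illustration, not a settled on-disk format:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout for one entry of the proposed "modblock" file:
 * three 32-bit values identifying a modified block, as suggested in
 * the thread.  Not an agreed or actual PostgreSQL format.
 */
typedef struct ModBlockEntry
{
	uint32_t	dboid;			/* database OID */
	uint32_t	relfilenode;	/* relation file node */
	uint32_t	blockno;		/* block number within the relation */
} ModBlockEntry;

/*
 * Each entry costs a fixed 12 bytes, so even a WAL segment that touched
 * many distinct blocks yields a small companion file.
 */
static uint64_t
modblock_file_size(uint64_t nentries)
{
	return nentries * sizeof(ModBlockEntry);
}
```

At 12 bytes per entry, a million distinct modified blocks would need about 12 MB, which supports the idea of keeping one modblock file alongside each WAL segment.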
Re: Race conditions with checkpointer and shutdown
Thomas Munro writes: > On Wed, Apr 17, 2019 at 10:45 AM Tom Lane wrote: >> I think what we need to look for is reasons why (1) the postmaster >> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's >> main loop doesn't get to noticing shutdown_requested. >> >> A rather scary point for (2) is that said main loop seems to be >> assuming that MyLatch a/k/a MyProc->procLatch is not used for any >> other purposes in the checkpointer process. If there were something, >> like say a condition variable wait, that would reset MyLatch at any >> time during a checkpoint, then we could very easily go to sleep at the >> bottom of the loop and not notice that there's a pending shutdown request. > Agreed on the non-composability of that coding, but if there actually > is anything in that loop that can reach ResetLatch(), it's well > hidden... Well, it's easy to see that there's no other ResetLatch call in checkpointer.c. It's much less obvious that there's no such call anywhere in the code reachable from e.g. CreateCheckPoint(). To try to investigate that, I hacked things up to force an assertion failure if ResetLatch was called from any other place in the checkpointer process (dirty patch attached for amusement's sake). This gets through check-world without any assertions. That does not really prove that there aren't corner timing cases where a latch wait and reset could happen, but it does put a big dent in my theory. Question is, what other theory has anybody got? 
			regards, tom lane

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index a1e0423..b0ee1fd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -166,6 +166,8 @@ static double ckpt_cached_elapsed;
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;
 
+extern bool reset_latch_ok;
+
 /* Prototypes for private functions */
 
 static void CheckArchiveTimeout(void);
@@ -343,9 +345,13 @@ CheckpointerMain(void)
 		int			elapsed_secs;
 		int			cur_timeout;
 
+		reset_latch_ok = true;
+
 		/* Clear any already-pending wakeups */
 		ResetLatch(MyLatch);
 
+		reset_latch_ok = false;
+
 		/*
 		 * Process any requests or signals received recently.
 		 */
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 59fa917..1f0613d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -118,6 +118,8 @@ struct WaitEventSet
 #endif
 };
 
+bool		reset_latch_ok = true;
+
 #ifndef WIN32
 /* Are we currently in WaitLatch? The signal handler would like to know. */
 static volatile sig_atomic_t waiting = false;
@@ -521,6 +523,8 @@ ResetLatch(Latch *latch)
 	/* Only the owner should reset the latch */
 	Assert(latch->owner_pid == MyProcPid);
 
+	Assert(reset_latch_ok);
+
 	latch->is_set = false;
 
 	/*
Re: Question about the holdable cursor
Andy Fan writes: > when I fetch from holdable cursor, I found the fact is more complex than I > expected. > ... > why the 3rd time is necessary and will the performance be bad due to this > design? If you read the whole cursor output, then close the transaction and persist the cursor, yes we'll read it twice, and yes it's bad for that case. The design is intended to perform well in these other cases: 1. The HOLD option isn't really being used, ie you just read and close the cursor within the original transaction. This is important because applications are frequently sloppy about marking cursors as WITH HOLD. 2. You declare the cursor and persist it before reading anything from it. (This is really the typical use-case for held cursors, IMV.) FWIW, I don't see any intermediate tuplestore in there when dealing with a PORTAL_ONE_SELECT query, which is the only case that's possible with a cursor no? regards, tom lane
Re: proposal: psql PSQL_TABULAR_PAGER variable
On Thu, Apr 18, 2019 at 07:20:37AM +0200, Pavel Stehule wrote: > Hi > > I wrote a pspg pager https://github.com/okbob/pspg > > This pager is designed for tabular data. It can work in fallback mode as > a classic pager, but it is not designed for this purpose (and I don't plan to do > it). Can we enhance the set of psql environment variables with a > PSQL_TABULAR_PAGER variable? This pager would be used when psql displays > tabular data. In testing pspg, it seems to work fine with tabular and \x-non-tabular data. Are you asking for a pager option that is only used for non-\x display? What do people want the non-pspg pager to do? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: bug in update tuple routing with foreign partitions
Amit-san, Thanks for the comments! (2019/04/18 14:08), Amit Langote wrote: On 2019/04/18 14:06, Amit Langote wrote: On 2019/04/17 21:49, Etsuro Fujita wrote: I think the reason for that is: the row routed to remp is incorrectly fetched as a to-be-updated row when updating remp as a subplan targetrel. Yeah. In the fully-local case, that is, where both the source and the target partition of a row movement operation are local tables, heap AM ensures that tuples that are moved into a given relation in the same command (by way of row movement) are not returned as to-be-updated, because it deems such tuples invisible. The "same command" part being crucial for that to work. In the case where the target of a row movement operation is a foreign table partition, the INSERT used as part of row movement and subsequent UPDATE of the same foreign table are distinct commands for the remote server. So, the rows inserted by the 1st command (as part of the row movement) are deemed visible by the 2nd command (UPDATE) even if both are operating within the same transaction. Yeah, I think so too. I guess there's no easy way for postgres_fdw to make the remote server consider them as a single command. IOW, no way to make the remote server not touch those "moved-in" rows during the UPDATE part of the local query. I agree. The right way to fix this would be to have some way in postgres_fdw in which we don't fetch such rows when updating such subplan targetrels. I tried to figure out a (simple) way to do that, but I couldn't. Yeah, that seems a bit hard to ensure with our current infrastructure. Yeah, I think we should leave that for future work. So what I'm thinking is to throw an error for cases like this. (Though, I think we should keep allowing rows to be moved from local partitions to a foreign-table subplan targetrel that has been updated already.) Hmm, how would you distinguish (totally inside postgres_fdw I presume) the two cases?
One thing I came up with to do that is this:

@@ -1917,6 +1937,23 @@ postgresBeginForeignInsert(ModifyTableState *mtstate,
 
 	initStringInfo(&sql);
 
+	/*
+	 * If the foreign table is an UPDATE subplan resultrel that hasn't yet
+	 * been updated, routing tuples to the table might yield incorrect
+	 * results, because if routing tuples, routed tuples will be mistakenly
+	 * read from the table and updated twice when updating the table --- it
+	 * would be nice if we could handle this case; but for now, throw an
+	 * error for safety.
+	 */
+	if (plan && plan->operation == CMD_UPDATE &&
+		(resultRelInfo->ri_usesFdwDirectModify ||
+		 resultRelInfo->ri_FdwState) &&
+		resultRelInfo > mtstate->resultRelInfo + mtstate->mt_whichplan)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot route tuples into foreign table to be updated \"%s\"",
+						RelationGetRelationName(rel))));

Forgot to say that since this is a separate issue from the original bug report, maybe we can first finish fixing the latter. What do you think?

Yeah, I think we can do that, but my favorite would be to fix the two issues in a single shot, because they seem to me rather close to each other. So I created a new version in a single patch, which I'm attaching. Notes:

* I kept all the changes in the previous patch, because otherwise postgres_fdw would fail to release resources for a foreign-insert operation created by postgresBeginForeignInsert() for a tuple-routable foreign table (ie, a foreign-table subplan resultrel that has been updated already) during postgresEndForeignInsert().

* I revised a comment according to your previous comment, though I changed a state name: s/sub_fmstate/aux_fmstate/

* DirectModify also has the latter issue.
Here is an example:

postgres=# create table p (a int, b text) partition by list (a);
postgres=# create table p1 partition of p for values in (1);
postgres=# create table p2base (a int check (a = 2 or a = 3), b text);
postgres=# create foreign table p2 partition of p for values in (2, 3) server loopback options (table_name 'p2base');
postgres=# insert into p values (1, 'foo');
INSERT 0 1
postgres=# explain verbose update p set a = a + 1;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Update on public.p  (cost=0.00..176.21 rows=2511 width=42)
   Update on public.p1
   Foreign Update on public.p2
   ->  Seq Scan on public.p1  (cost=0.00..25.88 rows=1270 width=42)
         Output: (p1.a + 1), p1.b, p1.ctid
   ->  Foreign Update on public.p2  (cost=100.00..150.33 rows=1241 width=42)
         Remote SQL: UPDATE public.p2base SET a = (a + 1)
(7 rows)

postgres=# update p set a = a + 1;
UPDATE 2
postgres=# select * from p;
 a |  b
---+-----
 3 | foo
(1 row)

As shown in t
Remove page-read callback from XLogReaderState.
Hello.  As mentioned before [1], the read_page callback in XLogReaderState is
a cause of headaches.  Adding another remote-controlling mechanism to xlog
readers makes things messier [2].  I refactored the XLOG reading functions so
that we don't need the callback.  In short, with the attached patch set
ReadRecord now calls XLogPageRead directly:

| while (XLogReadRecord(xlogreader, RecPtr, &record, &errormsg)
|		 == XLREAD_NEED_DATA)
|	XLogPageRead(xlogreader, fetching_ckpt, emode, randAccess);

On the other hand, XLogReadRecord became a bit complex.  The patch makes
XLogReadRecord a state machine.  I'm not confident that the additional
complexity is worth it.  Anyway, I'll register this to the next CF.

[1] https://www.postgresql.org/message-id/47215279-228d-f30d-35d1-16af695e5...@iki.fi
[2] https://www.postgresql.org/message-id/20190412.122711.158276916.horiguchi.kyot...@lab.ntt.co.jp

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From 3de14fb47987e9dd8189bfad3ca7264c26b719eb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi
Date: Thu, 18 Apr 2019 10:22:49 +0900
Subject: [PATCH 01/10] Define macros to make XLogReadRecord a state machine

To minimize the apparent impact on the code, use some macros as syntax
sugar.  This is similar to the ExecInterpExpr machinery, but a bit
different.  The most significant difference is that this allows functions
to be left in the middle of their work and then continued later.
Roughly speaking, this is used as follows:

enum retval
some_func()
{
	static .. internal_variables;

	XLR_SWITCH();
	...
	XLR_LEAVE(STATUS1, RETVAL_CONTINUE);
	...
	XLR_LEAVE(STATUS2, RETVAL_CONTINUE2);
	...
	XLR_SWITCH_END();
	XLR_RETURN(RETVAL_FINISH);
}

The caller uses the function as follows:

	while (some_func() != RETVAL_FINISH)
	{
		;
	}
---
 src/backend/access/transam/xlogreader.c | 63 +++++++++++++++++++++++++++++
 src/include/access/xlogreader.h         |  3 ++
 2 files changed, 66 insertions(+)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 9196aa3aae..5299765040 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -29,6 +29,69 @@
 #include "utils/memutils.h"
 #endif
 
+/*
+ * Use computed-goto-based opcode dispatch when computed gotos are available.
+ * But use a separate symbol so that it's easy to adjust locally in this file
+ * for development and testing.
+ */
+#ifdef HAVE_COMPUTED_GOTO
+#define XLR_USE_COMPUTED_GOTO
+#endif							/* HAVE_COMPUTED_GOTO */
+
+/*
+ * Macros for opcode dispatch.
+ *
+ * XLR_SWITCH - just hides the switch if not in use.
+ * XLR_CASE - labels the implementation of named expression step type.
+ * XLR_DISPATCH - jump to the implementation of the step type for 'op'.
+ * XLR_LEAVE - leave the function and return here at the next call.
+ * XLR_RETURN - return from the function and set state to initial state.
+ * XLR_END - just hides the closing brace if not in use.
+ */
+#if defined(XLR_USE_COMPUTED_GOTO)
+#define XLR_SWITCH() \
+	/* Don't call duplicatedly */ \
+	static int callcnt = 0 PG_USED_FOR_ASSERTS_ONLY; \
+	do { \
+		if ((XLR_STATE).j) \
+			goto *((void *) (XLR_STATE).j); \
+		XLR_CASE(XLR_INIT_STATE); \
+		Assert(++callcnt == 1); \
+	} while (0)
+#define XLR_CASE(name) name:
+#define XLR_DISPATCH() goto *((void *) (XLR_STATE).j)
+#define XLR_LEAVE(name, code) \
+	do { \
+		(XLR_STATE).j = (&&name); return (code); \
+		XLR_CASE(name); \
+	} while (0)
+#define XLR_RETURN(code) \
+	do { \
+		Assert(--callcnt == 0); \
+		(XLR_STATE).j = (&&XLR_INIT_STATE); return (code); \
+	} while (0)
+#define XLR_SWITCH_END()
+#else							/* !XLR_USE_COMPUTED_GOTO */
+#define XLR_SWITCH() \
+	/* Don't call duplicatedly */ \
+	static int callcnt = 0 PG_USED_FOR_ASSERTS_ONLY; \
+	switch ((XLR_STATE).c) { \
+	XLR_CASE(XLR_INIT_STATE); \
+	Assert(++callcnt == 1);
+#define XLR_CASE(name) case name:
+#define XLR_DISPATCH() goto starteval
+#define XLR_LEAVE(name, code) \
+	do { \
+		(XLR_STATE).c = (name); return (code); \
+		XLR_CASE(name); \
+	} while (0)
+#define XLR_RETURN(code) \
+	do { \
+		Assert(--callcnt == 0); \
+		(XLR_STATE).c = (XLR_INIT_STATE); return (code); \
+	} while (0)
+#define XLR_SWITCH_END() }
+#endif							/* XLR_USE_COMPUTED_GOTO */
+
 static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f3bae0bf49..30500c35c7 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -240,6 +240,9 @@ extern bool DecodeXLogRecord(XLogReaderState *state, XLogRecord *record,
 #define XLogRecBlockImageApply(decoder, block_id) \
 	((decoder)->blocks[block_id].apply_ima
Question about the holdable cursor
When I fetch from a holdable cursor, I found the behavior is more complex
than I expected.  Suppose we fetch 20 rows:

1) The portal first fills a tuple store: the dest is not the client but
   DestTupleStore, so ExecutePlan is called once and receiveSlot is called
   20 times.
2) The portal then uses RunFromStore to send the result to the client, and
   receiveSlot is called 20 times again.
3) Finally, when we HoldPortal at commit, ExecutePlan is called once more
   and receiveSlot is called another 20 times:

```
 0 in ExecutePlan of execMain.c:1696
 1 in standard_ExecutorRun of execMain.c:366
 2 in ExecutorRun of execMain.c:309
 3 in PersistHoldablePortal of portalcmds.c:392
 4 in HoldPortal of portalmem.c:639
 5 in PreCommit_Portals of portalmem.c:733
 6 in CommitTransaction of xact.c:2007
 7 in CommitTransactionCommand of xact.c:2801
 8 in finish_xact_command of postgres.c:2529
 9 in exec_simple_query of postgres.c:1176
10 in exec_docdb_simple_query of postgres.c:5069
11 in _exec_query_with_intercept_exception of op_executor.c:38
12 in exec_op_query of op_executor.c:102
13 in exec_op_find of op_executor.c:204
14 in run_op_find_common of op_find_common.c:42
15 in _cmd_run_find of cmd_find.c:31
16 in run_commands of commands.c:610
17 in DocdbMain of postgres.c:4792
18 in DocdbBackendRun of postmaster.c:4715
19 in DocdbBackendStartup of postmaster.c:4196
20 in ServerLoop of postmaster.c:1760
21 in PostmasterMain of postmaster.c:1406
22 in main of main.c:228
```

Why is the third pass necessary, and will performance suffer because of
this design?  Thanks for your help!
Re: Unhappy about API changes in the no-fsm-for-small-rels patch
On Thu, Apr 18, 2019 at 2:48 PM Amit Kapila wrote:
> I respect and will follow whatever will be the consensus after
> discussion.  However, I request you to wait for some time to let the
> discussion conclude.  If we can't get to an agreement or one of John or
> me can't implement what is decided, then we can anyway revert it.

Agreed.  I suspect the most realistic way to address most of the objections
in a short amount of time would be to:

1. rip out the local map
2. restore hio.c to only checking the last block in the relation if there
   is no FSM (and lower the threshold to reduce wasted space)
3. reduce calls to smgr_exists()

Thoughts?

-- 
John Naylor  https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Fix handling of unlogged tables in FOR ALL TABLES publications
On 2019-03-25 09:04, Peter Eisentraut wrote: > So perhaps push the check down to GetRelationPublicationActions() > instead. That way we don't have to patch up two places and everything > "just works" even for possible other callers. See attached patch. This has been committed and backpatched to PG10. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services