Re: Make relfile tombstone files conditional on WAL level
On Mon, May 16, 2022 at 3:24 PM Thomas Munro wrote:
> I think you can get rid of SYNC_UNLINK_REQUEST, sync_unlinkfiletag,
> mdunlinkfiletag as these are all now unused.

Correct.

> Are there any special hazards here if the plan in [1] goes ahead?

IMHO we should not have any problem. In fact, we need this for [1], right? Otherwise, there is a risk of reusing the same relfilenode within the same checkpoint cycle, as discussed in [2].

> [1] https://www.postgresql.org/message-id/flat/CA%2BTgmoYmw%3D%3DTOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CA+TgmoZZDL_2E_zuahqpJ-WmkuxmUi8+g7=dLEny=18r-+c...@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
I think you can get rid of SYNC_UNLINK_REQUEST, sync_unlinkfiletag, mdunlinkfiletag as these are all now unused.

Are there any special hazards here if the plan in [1] goes ahead?  If the relfilenode allocation is logged and replayed then it should be fine to crash and recover multiple times in a row while creating and dropping tables, with wal_level=minimal, I think.  It would be bad if the allocator restarted from a value from the checkpoint, though.

[1] https://www.postgresql.org/message-id/flat/CA%2BTgmoYmw%3D%3DTOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA%40mail.gmail.com
Re: Make relfile tombstone files conditional on WAL level
On Thu, May 12, 2022 at 4:27 PM Amul Sul wrote:

Hi Amul,

Thanks for the review. Based on some comments from Robert we have planned to make some design changes, so I am planning to work on that for the July commitfest. I will try to incorporate all your review comments in the new version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
Hi Dilip,

On Fri, Mar 4, 2022 at 11:07 AM Dilip Kumar wrote:
> On Mon, Feb 21, 2022 at 1:21 PM Dilip Kumar wrote:
> > On Thu, Jan 6, 2022 at 1:43 PM Dilip Kumar wrote:
> > > 2) GetNewRelFileNode() will not loop for checking the file existence
> > > and retry with other relfilenode.
> >
> > Open Issues - there are currently 2 open issues in the patch: 1) the issue
> > as discussed above about removing the loop, so currently in this patch
> > the loop is removed; 2) during upgrade from the previous version we
> > need to advance the nextrelfilenode to the current relfilenode we are
> > setting for the object in order to avoid the conflict.
>
> In this version I have fixed both of these issues.  Thanks Robert for
> suggesting the solution for both of these problems in our offlist
> discussion.  Basically, for the first problem we can flush the xlog
> immediately, because we are actually logging the WAL only every time we
> allocate 64 relfilenodes, so this should not have much impact on the
> performance, and I have added the same in the comments.  And during
> pg_upgrade, whenever we are assigning the relfilenode as part of the
> upgrade, we will set that relfilenode + 1 as the nextRelFileNode to be
> assigned, so that we never generate a conflicting relfilenode.
>
> The only part I do not like in the patch is that before this patch we
> could directly access buftag->rnode.  But since now we no longer have
> the relfilenode directly as part of the buffer tag, and instead keep
> the individual fields (i.e. dbOid, tbsOid and relNode) in the buffer
> tag, if we need the relfilenode directly we have to construct it.
> However, those changes are limited to just one or two files, so maybe
> not that bad.
> v5 patch needs a rebase, and here are a few comments for 0002 I found while reading it; hope that helps:

+/* Number of RelFileNode to prefetch (preallocate) per XLOG write */
+#define VAR_RFN_PREFETCH		8192

Should it be 64, as per the comment in XLogPutNextRelFileNode for XLogFlush()?
---
+	/*
+	 * Check for the wraparound for the relnode counter.
+	 *
+	 * XXX Actually the relnode is 56 bits wide so we don't need to worry about
+	 * the wraparound case.
+	 */
+	if (ShmemVariableCache->nextRelNode > MAX_RELFILENODE)

Very rare case, should use unlikely()?
---
+/*
+ * Max value of the relfilnode. Relfilenode will be of 56bits wide for more
+ * details refer comments atop BufferTag.
+ */
+#define MAX_RELFILENODE		((((uint64) 1) << 56) - 1)

Should there be 57-bit shifts here? Instead, I think we should use INT64CONST(0xFFFFFFFFFFFFFF) to be consistent with the PG_*_MAX declarations, thoughts?
---
+	/* If we run out of logged for use RelNode then we must log more */
+	if (ShmemVariableCache->relnodecount == 0)

relnodecount might never go below zero, but just to be safer we should check <= 0 instead.
---
A few typos:

Simmialr
Simmilar
agains
idealy

Regards,
Amul
Re: Make relfile tombstone files conditional on WAL level
On Fri, Mar 4, 2022 at 12:37 AM Dilip Kumar wrote:
> In this version I have fixed both of these issues.

Here's a bit of review for these patches:

- The whole relnode vs. relfilenode thing is really confusing. I realize that there is some precedent for calling the number that pertains to the file on disk "relnode" and that value when combined with the database and tablespace OIDs "relfilenode," but it's definitely not the most obvious thing, especially since pg_class.relfilenode is a prominent case where we don't even adhere to that convention. I'm kind of tempted to think that we should go the other way and rename the RelFileNode struct to something like RelFileLocator, and then maybe call the new data type RelFileNumber. And then we could work toward removing references to "filenode" and "relfilenode" in favor of either (rel)filelocator or (rel)filenumber. Now the question (even assuming other people like this general direction) is how far do we go with it? Renaming pg_class.relfilenode itself wouldn't be the worst compatibility break we've ever had, but it would definitely cause some pain. I'd be inclined to leave the user-visible catalog column alone and just push in this direction for internal stuff.

- What you're doing to pg_buffercache here is completely unacceptable. You can't change the definition of an already-released version of the extension. Please study how such issues have been handled in the past.

- It looks to me like you need to give significantly more thought to the proper way of adjusting the relfilenode-related test cases in alter_table.out.

- I think BufTagGetFileNode and BufTagSetFileNode should be introduced in 0001 and then just update the definition in 0002 as required. Note that as things stand you end up with both BufTagGetFileNode and BufTagGetRelFileNode, which is an artifact of the relnode/filenode/relfilenode confusion I mention above, and, just to make matters worse, one returns a value while the other produces an out parameter.
I think the renaming I'm talking about up above might help somewhat here, but it seems like it might also be good to change the one that uses an out parameter by doing Get -> Copy, just to help the reader get a clue a little more easily.

- GetNewRelNode() needs to error out if we would wrap around, not wrap around. Probably similar to what happens if we exhaust 2^64 bytes of WAL.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Tue, Mar 8, 2022 at 1:27 AM Robert Haas wrote:
> You're talking here about just needing to introduce BufTagGetFileNode
> and BufTagSetFileNode, or something else? I don't find those macros to
> be problematic.

Yeah, I was talking about the BufTagGetFileNode macro only. The reason I did not like it is that earlier we could directly use buftag->rnode, but now whenever we want to use the rnode we first need a separate variable to prepare it using the BufTagGetFileNode macro. But these changes are very localized and in very few places, so I don't have much problem with them.

> BufTagSetFileNode could maybe assert that the OID isn't too big,
> though. We should ereport() before we get to this point if we somehow
> run out of values, but it might be nice to have a check here as a
> backup.

Yeah, we could do that; I will do it in the next version. Thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Fri, Mar 4, 2022 at 12:37 AM Dilip Kumar wrote:
> In this version I have fixed both of these issues. Thanks Robert for
> suggesting the solution for both of these problems in our offlist
> discussion. Basically, for the first problem we can flush the xlog
> immediately, because we are actually logging the WAL only every time we
> allocate 64 relfilenodes, so this should not have much impact on the
> performance, and I have added the same in the comments. And during
> pg_upgrade, whenever we are assigning the relfilenode as part of the
> upgrade, we will set that relfilenode + 1 as the nextRelFileNode to be
> assigned, so that we never generate a conflicting relfilenode.

Anyone else have an opinion on this?

> The only part I do not like in the patch is that before this patch we
> could directly access buftag->rnode. But since now we no longer have
> the relfilenode directly as part of the buffer tag, and instead keep
> the individual fields (i.e. dbOid, tbsOid and relNode) in the buffer
> tag, if we need the relfilenode directly we have to construct it.
> However, those changes are limited to just one or two files, so maybe
> not that bad.

You're talking here about just needing to introduce BufTagGetFileNode and BufTagSetFileNode, or something else? I don't find those macros to be problematic.

BufTagSetFileNode could maybe assert that the OID isn't too big, though. We should ereport() before we get to this point if we somehow run out of values, but it might be nice to have a check here as a backup.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Mon, Feb 21, 2022 at 2:51 AM Dilip Kumar wrote:
> While working on this I realized that even if we make the relfilenode
> 56 bits we can not remove the loop inside GetNewRelFileNode() for
> checking the file existence. Because it is always possible that the
> file reaches the disk even before the WAL for advancing the next
> relfilenode, and if the system crashes in between then we might
> generate a duplicate relfilenode, right?

I agree.

> I think the second paragraph in the XLogPutNextOid() function explains
> this issue, and even after we get the wider relfilenode we will still
> have it. Correct?

I think you are correct.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Mon, Feb 7, 2022 at 10:13 PM Robert Haas wrote:
> Are you sure that kind of stuff is really important enough to justify
> the code churn? I don't think RelFileNodeBackend is used widely enough
> or in sufficiently performance-critical places that we really need to
> care about a few bytes of alignment padding.

I agree with you that adding 4 extra bytes to these structures might not be really critical. I will make the changes based on this idea and see how they look.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Mon, Feb 7, 2022 at 11:31 AM Dilip Kumar wrote:
> For RelFileNode also we need to use two 32-bit integers so that we do
> not add extra alignment padding, because there are a few more
> structures that include RelFileNode, e.g. xl_xact_relfilenodes,
> RelFileNodeBackend, and many other structures.

Are you sure that kind of stuff is really important enough to justify the code churn? I don't think RelFileNodeBackend is used widely enough or in sufficiently performance-critical places that we really need to care about a few bytes of alignment padding. xl_xact_relfilenodes is more concerning because that goes into the WAL format, but I don't know that we use it often enough for an extra 4 bytes per record to really matter, especially considering that this proposal also adds 4 bytes *per relfilenode*, which has to be a much bigger deal than a few padding bytes after 'nrels'. The reason why BufferTag matters a lot is because (1) we have an array of this struct that can easily contain a million or eight entries, so the alignment padding adds up a lot more, and (2) access to that array is one of the most performance-critical parts of PostgreSQL.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Mon, Feb 7, 2022 at 9:42 PM Robert Haas wrote:
> Hmm. The parts of this I've looked at seem reasonably clean, but I
> don't think I like the design choice. You're inventing
> RelFileNodeSetFork(), but at present the RelFileNode struct doesn't
> include a fork number. I feel like we should leave that alone, and
> only change the definition of a BufferTag. What about adding accessors
> for all of the BufferTag fields in 0001, and then in 0002 changing it
> to look something like this:
>
> typedef struct BufferTag
> {
>     Oid         dbOid;
>     Oid         tablespaceOid;
>     uint32      fileNode_low;
>     uint32      fileNode_hi:24;
>     uint32      forkNumber:8;
>     BlockNumber blockNumber;
> } BufferTag;

Okay, we can do that. But we can not leave RelFileNode untouched; inside RelFileNode we will also have to change relNode to two 32-bit integers, like below:

typedef struct RelFileNode
{
    Oid     spcNode;
    Oid     dbNode;
    uint32  relNode_low;
    uint32  relNode_hi;
} RelFileNode;

For RelFileNode also we need to use two 32-bit integers so that we do not add extra alignment padding, because there are a few more structures that include RelFileNode, e.g. xl_xact_relfilenodes, RelFileNodeBackend, and many other structures.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Mon, Feb 7, 2022 at 12:26 AM Dilip Kumar wrote:
> I have split the patch into multiple patches which can be
> independently committable and easy to review. I have explained the
> purpose and scope of each patch in the respective commit messages.

Hmm. The parts of this I've looked at seem reasonably clean, but I don't think I like the design choice. You're inventing RelFileNodeSetFork(), but at present the RelFileNode struct doesn't include a fork number. I feel like we should leave that alone, and only change the definition of a BufferTag. What about adding accessors for all of the BufferTag fields in 0001, and then in 0002 changing it to look something like this:

typedef struct BufferTag
{
    Oid         dbOid;
    Oid         tablespaceOid;
    uint32      fileNode_low;
    uint32      fileNode_hi:24;
    uint32      forkNumber:8;
    BlockNumber blockNumber;
} BufferTag;

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Wed, Feb 2, 2022 at 6:57 PM Robert Haas wrote:
> But if you're talking about SQL data types, why? pg_class only stores
> the relfilenode and not the fork number currently, and I don't see why
> that would change. I think that the data type for the relfilenode
> column would change to a 64-bit signed integer (i.e. bigint or int8)
> that only ever uses the low-order 56 bits ...

Yeah, you're right. I think whenever we are using the combined value we can use the uint64 C type, and in pg_class we can keep it as int64, because that only represents the relfilenode part. I think I was just confused and tried to use the same data type everywhere, whether it is combined with the fork number or not. Thanks for your input; I will change this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Mon, Jan 31, 2022 at 9:37 AM Dilip Kumar wrote:
> I agree that we are using 8 bytes unsigned int multiple places in code
> as uint64. But I don't see it as an exposed data type and not used as
> part of any exposed function. But we will have to use the relfilenode
> in the exposed c function e.g.
> binary_upgrade_set_next_heap_relfilenode().

Oh, I thought we were talking about the C data type uint8, i.e. an 8-bit unsigned integer. Which in retrospect was a dumb thought, because you said you wanted to store the relfilenode AND the fork number there, which only makes sense if you were talking about SQL data types rather than C data types. It is confusing that we have an SQL data type called int8 and a C data type called int8 and they're not the same.

But if you're talking about SQL data types, why? pg_class only stores the relfilenode and not the fork number currently, and I don't see why that would change. I think that the data type for the relfilenode column would change to a 64-bit signed integer (i.e. bigint or int8) that only ever uses the low-order 56 bits, and then when you need to store a relfilenode and a fork number in the same 8-byte quantity you'd do that using either a struct with bit fields or something like:

combined = ((uint64) signed_representation_of_relfilenode) | (((uint64) forknumber) << 56);

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Mon, Jan 31, 2022 at 7:36 PM Robert Haas wrote:
> On Mon, Jan 31, 2022 at 9:04 AM Robert Haas wrote:
> > I'm confused. We use int8 in tons of places, so I feel like it must exist.
>
> Rather, we use uint8 in tons of places, so I feel like it must exist.

Hmm, at least pg_type doesn't have anything with a name like uint8.

postgres[101702]=# select oid, typname from pg_type where typname like '%int8';
 oid  | typname
------+---------
   20 | int8
 1016 | _int8
(2 rows)

postgres[101702]=# select oid, typname from pg_type where typname like '%uint%';
 oid | typname
-----+---------
(0 rows)

I agree that we are using an 8-byte unsigned int in multiple places in the code as uint64. But I don't see it as an exposed data type, and it is not used as part of any exposed function. However, we will have to use the relfilenode in exposed C functions, e.g. binary_upgrade_set_next_heap_relfilenode().

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Mon, Jan 31, 2022 at 9:04 AM Robert Haas wrote:
> On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar wrote:
> > the main one currently we do not have uint8 data
> > type only int8 is there so I have used int8 for storing relfilenode +
> > forknumber.
>
> I'm confused. We use int8 in tons of places, so I feel like it must exist.

Rather, we use uint8 in tons of places, so I feel like it must exist.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar wrote:
> the main one currently we do not have uint8 data
> type only int8 is there so I have used int8 for storing relfilenode +
> forknumber.

I'm confused. We use int8 in tons of places, so I feel like it must exist.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Wed, Jan 19, 2022 at 10:37 AM Dilip Kumar wrote:
> On Thu, Jan 6, 2022 at 7:22 PM Robert Haas wrote:
> > Personally I'm not sad about that. Values that are the same in simple
> > cases but diverge in more complex cases are kind of a trap for the
> > unwary. There's no real reason to have them ever match. It doesn't
> > seem worth continuing to contort the code for a property we can't
> > guarantee anyway.
>
> Makes sense, I have started working on this idea; I will try to post
> the first version by early next week.

Here is the first working patch; with this we no longer need to maintain the tombstone file until the next checkpoint. This is still a WIP patch, but with it I can see that the problem with WAL-logged ALTER DATABASE SET TABLESPACE, which Robert reported a couple of mails above in this thread, is solved.

General idea of the patch:

- Change RelFileNode.relNode to be 64 bits wide, of which 8 bits are for the fork number and 56 bits for the relNode, as shown below. [1]
- GetNewRelFileNode() will just generate a new unique relfilenode and check for file existence; if the file already exists it throws an error, so no loop. We also need logic for preserving nextRelNode across restarts and WAL-logging it, but that is similar to preserving nextOid.
- mdunlinkfork will directly forget and unlink the relfilenode, so we get rid of all the deferred unlinking code.
- Now we don't need any post-checkpoint unlinking activity.
[1]

/*
 * RelNodeId:
 *
 * this is a storage type for RelNode.  The reasoning behind using this is
 * the same as for BlockId, so refer to the comment atop BlockId.
 */
typedef struct RelNodeId
{
    uint32      rn_hi;
    uint32      rn_lo;
} RelNodeId;

typedef struct RelFileNode
{
    Oid         spcNode;    /* tablespace */
    Oid         dbNode;     /* database */
    RelNodeId   relNode;    /* relation */
} RelFileNode;

TODO: There are a couple of TODOs and FIXMEs which I am planning to improve by next week. I am also planning to do testing where the relfilenode consumes more than 32 bits; maybe for that we can set FirstNormalRelfileNode to a higher value for testing purposes. And, improve comments.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

From 4a6502c7950969262c6982388865bbc23e531cde Mon Sep 17 00:00:00 2001
From: Dilip Kumar
Date: Fri, 28 Jan 2022 18:32:35 +0530
Subject: [PATCH v1] Don't wait for next checkpoint to remove unwanted
 relfilenode

Currently, relfilenode is 32 bits wide, so if we remove the relfilenode
immediately after it is no longer needed then there is a risk of reusing
the same relfilenode within the same checkpoint.  To avoid that, we delay
cleaning up the relfilenode until the next checkpoint.  With this patch
we use 56 bits for the relfilenode.  Ideally we could make it a full 64
bits, but that would increase the size of the BufferTag; so, to keep that
size the same, we make RelFileNode.relNode 64 bits wide, of which 8 bits
are used for storing the fork number and the remaining 56 bits for the
relfilenode.
---
 contrib/pg_buffercache/pg_buffercache_pages.c   |  4 +-
 contrib/pg_prewarm/autoprewarm.c                |  4 +-
 src/backend/access/common/syncscan.c            |  3 +-
 src/backend/access/gin/ginxlog.c                |  5 +-
 src/backend/access/rmgrdesc/gistdesc.c          |  4 +-
 src/backend/access/rmgrdesc/heapdesc.c          |  4 +-
 src/backend/access/rmgrdesc/nbtdesc.c           |  4 +-
 src/backend/access/rmgrdesc/seqdesc.c           |  4 +-
 src/backend/access/rmgrdesc/xlogdesc.c          | 15 +++-
 src/backend/access/transam/varsup.c             | 42 +-
 src/backend/access/transam/xlog.c               | 57 ++---
 src/backend/access/transam/xloginsert.c         | 12 +++
 src/backend/access/transam/xlogutils.c          |  9 ++-
 src/backend/catalog/catalog.c                   | 61 +++---
 src/backend/catalog/heap.c                      |  6 +-
 src/backend/catalog/index.c                     |  4 +-
 src/backend/catalog/storage.c                   |  3 +-
 src/backend/commands/tablecmds.c                | 18 +++--
 src/backend/replication/logical/decode.c        |  1 +
 src/backend/replication/logical/reorderbuffer.c |  2 +-
 src/backend/storage/buffer/bufmgr.c             | 61 +++---
 src/backend/storage/buffer/localbuf.c           |  8 +-
 src/backend/storage/freespace/fsmpage.c         |  4 +-
 src/backend/storage/lmgr/lwlocknames.txt        |  1 +
 src/backend/storage/smgr/
Re: Make relfile tombstone files conditional on WAL level
On Thu, Jan 6, 2022 at 7:22 PM Robert Haas wrote:
> On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro wrote:
> > Another problem is that relfilenodes are normally allocated with
> > GetNewOidWithIndex(), and initially match a relation's OID. We'd need
> > a new allocator, and they won't be able to match the OID in general
> > (while we have 32 bit OIDs at least).
>
> Personally I'm not sad about that. Values that are the same in simple
> cases but diverge in more complex cases are kind of a trap for the
> unwary. There's no real reason to have them ever match. Yeah, in
> theory, it makes it easier to tell which file matches which relation,
> but in practice, you always have to double-check in case the table has
> ever been rewritten. It doesn't seem worth continuing to contort the
> code for a property we can't guarantee anyway.

Makes sense, I have started working on this idea; I will try to post the first version by early next week.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On 2022-01-06 08:52:01 -0500, Robert Haas wrote:
> On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro wrote:
> > Another problem is that relfilenodes are normally allocated with
> > GetNewOidWithIndex(), and initially match a relation's OID. We'd need
> > a new allocator, and they won't be able to match the OID in general
> > (while we have 32 bit OIDs at least).
>
> Personally I'm not sad about that. Values that are the same in simple
> cases but diverge in more complex cases are kind of a trap for the
> unwary.

+1
Re: Make relfile tombstone files conditional on WAL level
On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro wrote: > Another problem is that relfilenodes are normally allocated with > GetNewOidWithIndex(), and initially match a relation's OID. We'd need > a new allocator, and they won't be able to match the OID in general > (while we have 32 bit OIDs at least). Personally I'm not sad about that. Values that are the same in simple cases but diverge in more complex cases are kind of a trap for the unwary. There's no real reason to have them ever match. Yeah, in theory, it makes it easier to tell which file matches which relation, but in practice, you always have to double-check in case the table has ever been rewritten. It doesn't seem worth continuing to contort the code for a property we can't guarantee anyway. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Thu, Jan 6, 2022 at 9:13 PM Dilip Kumar wrote:
> On Thu, Jan 6, 2022 at 1:12 PM Dilip Kumar wrote:
> > > I think this idea is worth more consideration. It seems like 2^56
> > > relfilenodes ought to be enough for anyone, recalling that you can
> > > only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
> > > bunch of code that is there to guard against relfilenodes being
> > > reused. In particular, we can remove the code that leaves a 0-length
> > > tombstone file around until the next checkpoint to guard against
> > > relfilenode reuse.
> >
> > +1

+1

> IMHO a few top-level points for implementing this idea would be as
> listed here:
>
> 1) The "relfilenode" member inside the RelFileNode will now be 64 bits
>    wide, and the "forkNum" is removed altogether from the BufferTag.
>    So now whenever we want to use the relfilenode or fork number we
>    can use the respective mask and fetch their values.
> 2) GetNewRelFileNode() will not loop for checking the file existence
>    and retry with another relfilenode.
> 3) Modify mdunlinkfork() so that we immediately perform the unlink
>    request; make sure to register_forget_request() before unlink.
> 4) In the checkpointer, we no longer need any handling for
>    pendingUnlinks.

Another problem is that relfilenodes are normally allocated with GetNewOidWithIndex(), and initially match a relation's OID. We'd need a new allocator, and they won't be able to match the OID in general (while we have 32-bit OIDs at least).
Re: Make relfile tombstone files conditional on WAL level
On Thu, Jan 6, 2022 at 1:12 PM Dilip Kumar wrote:
> > I think this idea is worth more consideration. It seems like 2^56
> > relfilenodes ought to be enough for anyone, recalling that you can
> > only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
> > bunch of code that is there to guard against relfilenodes being
> > reused. In particular, we can remove the code that leaves a 0-length
> > tombstone file around until the next checkpoint to guard against
> > relfilenode reuse.
>
> +1

IMHO a few top-level points for implementing this idea would be as listed here:

1) The "relfilenode" member inside the RelFileNode will now be 64 bits wide, and the "forkNum" is removed altogether from the BufferTag. So now whenever we want to use the relfilenode or fork number we can use the respective mask and fetch their values.
2) GetNewRelFileNode() will not loop for checking the file existence and retry with another relfilenode.
3) Modify mdunlinkfork() so that we immediately perform the unlink request; make sure to register_forget_request() before unlink.
4) In the checkpointer, we no longer need any handling for pendingUnlinks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Thu, Jan 6, 2022 at 3:07 AM Robert Haas wrote:
> I think this idea is worth more consideration. It seems like 2^56
> relfilenodes ought to be enough for anyone, recalling that you can
> only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
> bunch of code that is there to guard against relfilenodes being
> reused. In particular, we can remove the code that leaves a 0-length
> tombstone file around until the next checkpoint to guard against
> relfilenode reuse.

+1

> I think this would also solve a problem Dilip mentioned to me today:
> suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's
> been trying to do. Then suppose you do "ALTER DATABASE foo SET
> TABLESPACE used_recently_but_not_any_more". You might get an error
> complaining that "some relations of database \"%s\" are already in
> tablespace \"%s\"" because there could be tombstone files in that
> database. With this combination of changes, you could just use the
> barrier mechanism from https://commitfest.postgresql.org/36/2962/ to
> wait for those files to disappear, because they've got to be
> previously-unlinked files that Windows is still returning because
> they're still open -- or else they could be a sign of a corrupted
> database, but there are no other possibilities.

Yes, this approach will solve the problem for the WAL-logged ALTER DATABASE we are facing.
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Mon, Aug 2, 2021 at 6:38 PM Andres Freund wrote:
> I guess there's a somewhat hacky way to get somewhere without actually
> increasing the size. We could take 3 bytes from the fork number and use that
> to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
> everyone.
>
> It's not like we can use those bytes in a useful way, due to alignment
> requirements. Declaring that the high 7 bytes are for the relNode portion and
> the low byte for the fork would still allow efficient comparisons and doesn't
> seem too ugly.

I think this idea is worth more consideration. It seems like 2^56 relfilenodes ought to be enough for anyone, recalling that you can only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a bunch of code that is there to guard against relfilenodes being reused. In particular, we can remove the code that leaves a 0-length tombstone file around until the next checkpoint to guard against relfilenode reuse.

On Windows, we still need https://commitfest.postgresql.org/36/2962/ because of the problem that Windows won't remove files from the directory listing until they are both unlinked and closed. But in general this seems like it would lead to cleaner code. For example, GetNewRelFileNode() needn't loop. If it allocates the smallest unsigned integer that the cluster (or database) has never previously assigned, the file should definitely not exist on disk, and if it does, an ERROR is appropriate, as the database is corrupted. This does assume that allocations from this new 56-bit relfilenode counter are properly WAL-logged.

I think this would also solve a problem Dilip mentioned to me today: suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's been trying to do. Then suppose you do "ALTER DATABASE foo SET TABLESPACE used_recently_but_not_any_more". You might get an error complaining that "some relations of database \"%s\" are already in tablespace \"%s\"" because there could be tombstone files in that database.
With this combination of changes, you could just use the barrier mechanism from https://commitfest.postgresql.org/36/2962/ to wait for those files to disappear, because they've got to be previously-unlinked files that Windows is still returning because they're still open -- or else they could be a sign of a corrupted database, but there are no other possibilities. I think, anyway.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On Tue, Oct 5, 2021 at 4:21 PM Thomas Munro wrote:
> On Thu, Sep 30, 2021 at 11:32 PM Thomas Munro wrote:
> > I managed to produce a case where live data is written to an unlinked
> > file and lost

In conclusion, there *is* something else that would break, so I'm withdrawing this CF entry (#3030) for now. Also, that something else is already subtly broken, so I'll try to come up with a fix for that separately.
Re: Make relfile tombstone files conditional on WAL level
On Thu, Sep 30, 2021 at 11:32 PM Thomas Munro wrote:
> I managed to produce a case where live data is written to an unlinked
> file and lost

I guess this must have been broken since release 9.2 moved checkpoints out of here [1]. The connection between checkpoints, tombstone files and file descriptor cache invalidation in auxiliary (non-sinval) backends was not documented as far as I can see (or at least not anywhere near the load-bearing parts).

How could it be fixed, simply and backpatchably? If BgBufferSync() did if-FirstCallSinceLastCheckpoint()-then-smgrcloseall() after locking each individual buffer and before flushing, then I think it might logically have the correct interlocking against relfilenode wraparound, but that sounds a tad expensive :-( I guess it could be made cheaper by using atomics for the checkpoint counter instead of spinlocks. Better ideas?

[1] https://www.postgresql.org/message-id/flat/CA%2BU5nMLv2ah-HNHaQ%3D2rxhp_hDJ9jcf-LL2kW3sE4msfnUw9gA%40mail.gmail.com
Re: Make relfile tombstone files conditional on WAL level
On Wed, Sep 29, 2021 at 4:07 PM Thomas Munro wrote:
> Hmm, on closer inspection, isn't the lack of real interlocking with
> checkpoints a bit suspect already?  What stops bgwriter from writing
> to the previous relfilenode generation's fd, if a relfilenode is
> recycled while BgBufferSync() is running?  Not sinval, and not the
> above code that only runs between BgBufferSync() invocations.

I managed to produce a case where live data is written to an unlinked file and lost, with a couple of tweaks to get the right timing and simulate OID wraparound. See attached. If you run the following commands repeatedly with shared_buffers=256kB and bgwriter_lru_multiplier=10, you should see a number lower than 10,000 from the last query in some runs, depending on timing.

create extension if not exists chaos;
create extension if not exists pg_prewarm;
drop table if exists t1, t2;
checkpoint;
vacuum pg_class;
select clobber_next_oid(20);
create table t1 as select 42 i from generate_series(1, 10000);
select pg_prewarm('t1'); -- fill buffer pool with t1
update t1 set i = i; -- dirty t1 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time
drop table t1;
checkpoint;
vacuum pg_class;
select clobber_next_oid(20);
create table t2 as select 0 i from generate_series(1, 10000);
select pg_prewarm('t2'); -- fill buffer pool with t2
update t2 set i = 1 where i = 0; -- dirty t2 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time
select pg_prewarm('pg_attribute'); -- evict all clean t2 buffers
select sum(i) as t2_sum_should_be_10000 from t2; -- have any updates been lost?

From b116b80b2775b004e35a9e7be0a057ee2724041b Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Thu, 30 Sep 2021 15:47:23 +1300
Subject: [PATCH 1/2] HACK: A function to control the OID allocator.
---
 src/test/modules/chaos/Makefile       | 21 +++++++++++++++++++++
 src/test/modules/chaos/chaos--1.0.sql |  8 ++++++++
 src/test/modules/chaos/chaos.c        | 26 ++++++++++++++++++++++++++
 src/test/modules/chaos/chaos.control  |  4 ++++
 4 files changed, 59 insertions(+)
 create mode 100644 src/test/modules/chaos/Makefile
 create mode 100644 src/test/modules/chaos/chaos--1.0.sql
 create mode 100644 src/test/modules/chaos/chaos.c
 create mode 100644 src/test/modules/chaos/chaos.control

diff --git a/src/test/modules/chaos/Makefile b/src/test/modules/chaos/Makefile
new file mode 100644
index 00..ac69721af6
--- /dev/null
+++ b/src/test/modules/chaos/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/chaos/Makefile
+
+MODULE_big = chaos
+OBJS = \
+	$(WIN32RES) \
+	chaos.o
+PGFILEDESC = "chaos - module in which to write throwaway fault-injection hacks"
+
+EXTENSION = chaos
+DATA = chaos--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/chaos
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/chaos/chaos--1.0.sql b/src/test/modules/chaos/chaos--1.0.sql
new file mode 100644
index 00..5016f7e586
--- /dev/null
+++ b/src/test/modules/chaos/chaos--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/chaos/chaos--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION chaos" to load this file. \quit
+
+CREATE FUNCTION clobber_next_oid(size BIGINT)
+	RETURNS pg_catalog.void STRICT
+	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/chaos/chaos.c b/src/test/modules/chaos/chaos.c
new file mode 100644
index 00..f1052f865e
--- /dev/null
+++ b/src/test/modules/chaos/chaos.c
@@ -0,0 +1,26 @@
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "fmgr.h"
+#include "storage/lwlock.h"
+
+#include <limits.h>
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(clobber_next_oid);
+
+Datum
+clobber_next_oid(PG_FUNCTION_ARGS)
+{
+	int64		oid = PG_GETARG_INT64(0);
+
+	if (oid < FirstNormalObjectId || oid > UINT_MAX)
+		elog(ERROR, "invalid oid");
+
+	LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
+	ShmemVariableCache->nextOid = oid;
+	LWLockRelease(OidGenLock);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/test/modules/chaos/chaos.control b/src/test/modules/chaos/chaos.control
new file mode 100644
index 00..137ab8a58d
--- /dev/null
+++ b/src/test/modules/chaos/chaos.control
@@ -0,0 +1,4 @@
+comment = 'Test module for throwaway fault-injection hacks...'
+default_version = '1.0'
+module_pathname = '$libdir/chaos'
+relocatable = true
--
2.30.2

From 2acc2ad31c1db268de0e8927d5c10ba2bb06e33c Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Thu, 30 Sep 2021 17:16:01 +1300
Subject: [PATCH 2/2] HACK: Slow the bgwriter down a bit.

---
 src/backend/postmaster/bgwriter.c   | 2 ++
 src/backend/storage/buffer/bufmgr.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc24..b65284b1f6
Re: Make relfile tombstone files conditional on WAL level
On Wed, Aug 4, 2021 at 3:22 AM Robert Haas wrote:
> It's not really clear to me what problem is at hand. The problems that
> the tombstone system created for the async I/O stuff weren't really
> explained properly, IMHO. And I don't think the current system is all
> that ugly. It's not the most beautiful thing in the world but we have
> lots of way worse hacks. And, it's easy to understand, requires very
> little code, and has few moving parts that can fail. As hacks go it's
> a quality hack, I would say.

It's not really an AIO problem. It's just that while testing the AIO stuff across a lot of operating systems, we had tests failing on Windows because the extra worker processes you get if you use io_method=worker were holding cached descriptors and causing stuff like DROP TABLESPACE to fail. AFAIK every problem we discovered in that vein is a current live bug in all versions of PostgreSQL for Windows (it just takes other backends or the bgwriter to hold an fd at the wrong moment). The solution I'm proposing for that general class of problem is https://commitfest.postgresql.org/34/2962/ .

In the course of thinking about that, it seemed natural to look into the possibility of getting rid of the tombstones, so that at least Unix systems don't find themselves having to suffer through a CHECKPOINT just to drop a tablespace that happens to contain a tombstone.
Re: Make relfile tombstone files conditional on WAL level
On Fri, Mar 5, 2021 at 11:02 AM Thomas Munro wrote:
> This is a WIP with an open question to research: what could actually
> break if we did this?

I thought this part of bgwriter.c might be a candidate:

		if (FirstCallSinceLastCheckpoint())
		{
			/*
			 * After any checkpoint, close all smgr files.  This is so we
			 * won't hang onto smgr references to deleted files indefinitely.
			 */
			smgrcloseall();
		}

Hmm, on closer inspection, isn't the lack of real interlocking with checkpoints a bit suspect already? What stops bgwriter from writing to the previous relfilenode generation's fd, if a relfilenode is recycled while BgBufferSync() is running? Not sinval, and not the above code that only runs between BgBufferSync() invocations.
Re: Make relfile tombstone files conditional on WAL level
On Mon, Aug 2, 2021 at 6:38 PM Andres Freund wrote:
> What I proposed in the past was to have a new shared table that tracks
> relfilenodes. I still think that's a decent solution for just the problem at
> hand.

It's not really clear to me what problem is at hand. The problems that the tombstone system created for the async I/O stuff weren't really explained properly, IMHO. And I don't think the current system is all that ugly. It's not the most beautiful thing in the world but we have lots of way worse hacks. And, it's easy to understand, requires very little code, and has few moving parts that can fail. As hacks go it's a quality hack, I would say.

> But it'd also potentially be the way to redesign relation forks and even
> slim down buffer tags:
>
> Right now a buffer tag is:
> - 4 byte tablespace oid
> - 4 byte database oid
> - 4 byte "relfilenode oid" (don't think we have a good name for this)
> - 4 byte fork number
> - 4 byte block number
>
> If we had such a shared table we could put at least tablespace, fork number
> into that table mapping them to an 8 byte "new relfilenode". That'd only make
> the "new relfilenode" unique within a database, but that'd be sufficient for
> our purposes. It'd give us a buffertag consisting of the following:
> - 4 byte database oid
> - 8 byte "relfilenode"
> - 4 byte block number

Yep. I think this is a good direction.

> Of course, it'd add some complexity too, because a buffertag alone wouldn't be
> sufficient to read data (as you'd need the tablespace oid from elsewhere). But
> that's probably ok, I think all relevant places would have that information.

I think the thing to look at would be the places that call relpathperm() or relpathbackend(). I imagine this can be worked out, but it might require some adjustment.
> It's probably possible to remove the database oid from the tag as well, but
> it'd make CREATE DATABASE trickier - we'd need to change the filenames of
> tables as we copy, to adjust them to the differing oid.

Yeah, I'm not really sure that works out to a win. I tend to think that we should be trying to make databases within the same cluster more rather than less independent of each other.

If we switch to using a radix tree for the buffer mapping table as you have previously proposed, then presumably each backend can cache a pointer to the second level, after the database OID has been resolved. Then you have no need to compare database OIDs for every lookup. That might turn out to be better for performance than shoving everything into the buffer tag anyway, because then backends in different databases would be accessing distinct parts of the buffer mapping data structure instead of contending with one another.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
Hi,

On 2021-08-02 16:03:31 -0400, Robert Haas wrote:
> The two most principled solutions to this problem that I can see are
> (1) remove wal_level=minimal and

I'm personally not opposed to this. It's not practically relevant and makes a lot of stuff more complicated. Imo we should rather focus on optimizing the things wal_level=minimal accelerates a lot than on adding complications for wal_level=minimal. Such optimizations would have practical relevance, and there's plenty of low-hanging fruit.

> (2) use 64-bit relfilenodes. I have
> been reluctant to support #1 because it's hard for me to believe that
> there aren't cases where being able to skip a whole lot of WAL-logging
> doesn't work out to a nice performance win, but I realize opinions on
> that topic vary. And I'm pretty sure that Andres, at least, will hate
> #2 because he's unhappy with the width of buffer tags already.

Yep :/

I guess there's a somewhat hacky way to get somewhere without actually increasing the size. We could take 3 bytes from the fork number and use that to get to a 7-byte relfilenode portion. 7 bytes are probably enough for everyone.

It's not like we can use those bytes in a useful way, due to alignment requirements. Declaring that the high 7 bytes are for the relNode portion and the low byte for the fork would still allow efficient comparisons and doesn't seem too ugly.

> So I don't really have a good idea. I agree this tombstone system is a
> bit of a wart, but I'm not sure that this patch really makes anything
> any better, and I'm not really seeing another idea that seems better
> either.
> Maybe I am missing something...

What I proposed in the past was to have a new shared table that tracks relfilenodes. I still think that's a decent solution for just the problem at hand.
But it'd also potentially be the way to redesign relation forks and even slim down buffer tags.

Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block number

If we had such a shared table we could put at least the tablespace and fork number into that table, mapping them to an 8 byte "new relfilenode". That'd only make the "new relfilenode" unique within a database, but that'd be sufficient for our purposes. It'd give us a buffertag consisting of the following:
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number

Of course, it'd add some complexity too, because a buffertag alone wouldn't be sufficient to read data (as you'd need the tablespace oid from elsewhere). But that's probably ok; I think all relevant places would have that information.

It's probably possible to remove the database oid from the tag as well, but it'd make CREATE DATABASE trickier - we'd need to change the filenames of tables as we copy, to adjust them to the differing oid.

Greetings,

Andres Freund
Re: Make relfile tombstone files conditional on WAL level
On Thu, Jun 10, 2021 at 6:47 AM Heikki Linnakangas wrote:
> It would indeed be nice to have some other mechanism to prevent the
> issue with wal_level=minimal, the tombstone files feel hacky and
> complicated. Maybe a new shared memory hash table to track the
> relfilenodes of dropped tables.

Just to summarize the issue here as I understand it: if a relfilenode is used for two unrelated relations during the same checkpoint cycle with wal_level=minimal, and if the WAL-skipping optimization is applied to the second of those but not to the first, then crash recovery will lose our only copy of the new relation's data, because we'll replay the removal of the old relfilenode but will not have logged the new data.

Furthermore, we've wondered about writing an end-of-recovery record in all cases rather than sometimes writing an end-of-recovery record and sometimes a checkpoint record. That would allow another version of this same problem, since a single checkpoint cycle could now span multiple server lifetimes.

At present, we dodge all this by keeping the first segment of the main fork around as a zero-length file for the rest of the checkpoint cycle, which I think prevents the problem in both cases. Now, apparently that caused some problem with the AIO patch set, so Thomas is curious about getting rid of it, and Heikki concurs that it's a hack.

I guess my concern about this patch is that it just seems to be reducing the number of cases where that hack is used without actually getting rid of it. Rarely-taken code paths are more likely to have undiscovered bugs, and that seems particularly likely in this case, because this is a low-probability scenario to begin with. A lot of clusters probably never have an OID counter wraparound ever, and even in those that do, getting an OID collision with just the right timing followed by a crash before a checkpoint can intervene has got to be super-unlikely.
Even as things are today, if this mechanism has subtle bugs, it seems entirely possible that they could have escaped notice up until now.

So I spent some time thinking about the question of getting rid of tombstone files altogether. I don't think that Heikki's idea of a shared memory hash table to track dropped relfilenodes can work. The hash table will have to be of some fixed size N, and whatever the value of N, the approach will break down if N+1 relfilenodes are dropped in the same checkpoint cycle.

The two most principled solutions to this problem that I can see are (1) remove wal_level=minimal and (2) use 64-bit relfilenodes. I have been reluctant to support #1 because it's hard for me to believe that there aren't cases where being able to skip a whole lot of WAL-logging doesn't work out to a nice performance win, but I realize opinions on that topic vary. And I'm pretty sure that Andres, at least, will hate #2 because he's unhappy with the width of buffer tags already.

So I don't really have a good idea. I agree this tombstone system is a bit of a wart, but I'm not sure that this patch really makes anything any better, and I'm not really seeing another idea that seems better either. Maybe I am missing something...

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Make relfile tombstone files conditional on WAL level
On 05/03/2021 00:02, Thomas Munro wrote:
> Hi, I'm starting a new thread for this patch that originated as a
> side-discussion in [1], to give it its own CF entry in the next cycle.
> This is a WIP with an open question to research: what could actually
> break if we did this?

I don't see a problem. It would indeed be nice to have some other mechanism to prevent the issue with wal_level=minimal, the tombstone files feel hacky and complicated. Maybe a new shared memory hash table to track the relfilenodes of dropped tables.

- Heikki
Make relfile tombstone files conditional on WAL level
Hi,

I'm starting a new thread for this patch that originated as a side-discussion in [1], to give it its own CF entry in the next cycle. This is a WIP with an open question to research: what could actually break if we did this?

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com

From 61a15ed286a1fd824b4e2b4b689cbe6688930e6e Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Tue, 2 Mar 2021 16:09:51 +1300
Subject: [PATCH] Make relfile tombstone files conditional on WAL level.

Traditionally we have left behind an empty file to be unlinked after
the next checkpoint, when dropping a relation.  That would prevent
GetNewRelFileNode() from recycling the relfilenode, avoiding a schedule
that could corrupt data in crash recovery, with wal_level=minimal.

Since the default wal_level changed to replica in release 10, and since
this mechanism introduces costs elsewhere, let's only do it if we have
to.  In particular, this avoids the need for DROP TABLESPACE to force
an expensive checkpoint just to clear out tombstone files.

XXX What would break if we did this?

Discussion: https://postgr.es/m/CA%2BhUKGLT3zibuLkn_j9xiPWn6hxH9Br-TsJoSaFgQOpxpEUnPQ%40mail.gmail.com
---
 src/backend/commands/tablespace.c | 14 +++++++++++---
 src/backend/storage/smgr/md.c     | 10 ++++++----
 src/include/access/xlog.h         |  6 ++++++
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 69ea155d50..c1d12f3d19 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -512,8 +512,10 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 	 */
 	if (!destroy_tablespace_directories(tablespaceoid, false))
 	{
+		bool		try_again = false;
+
 		/*
-		 * Not all files deleted?  However, there can be lingering empty files
+		 * Not all files deleted?  However, there can be lingering tombstones
 		 * in the directories, left behind by for example DROP TABLE, that
 		 * have been scheduled for deletion at next checkpoint (see comments
 		 * in mdunlink() for details).  We could just delete them immediately,
@@ -528,8 +530,14 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 		 * TABLESPACE should not give up on the tablespace becoming empty
 		 * until all relevant invalidation processing is complete.
 		 */
-		RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
-		if (!destroy_tablespace_directories(tablespaceoid, false))
+
+		if (XLogNeedRelFileTombstones())
+		{
+			RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
+			try_again = true;
+		}
+
+		if (!try_again || !destroy_tablespace_directories(tablespaceoid, false))
 		{
 			/* Still not empty, the files must be important then */
 			ereport(ERROR,
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 1e12cfad8e..447122519e 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -236,8 +236,9 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
  * forkNum can be a fork number to delete a specific fork, or InvalidForkNumber
  * to delete all forks.
  *
- * For regular relations, we don't unlink the first segment file of the rel,
- * but just truncate it to zero length, and record a request to unlink it after
+ * For regular relations, we don't always unlink the first segment file,
+ * depending on the WAL level.  If XLogNeedRelFileTombstones() is true, we
+ * just truncate it to zero length, and record a request to unlink it after
  * the next checkpoint.  Additional segments can be unlinked immediately,
  * however.  Leaving the empty file in place prevents that relfilenode
  * number from being reused.  The scenario this protects us from is:
@@ -321,7 +322,8 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
 	/*
 	 * Delete or truncate the first segment.
 	 */
-	if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
+	if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode) ||
+		!XLogNeedRelFileTombstones())
 	{
 		if (!RelFileNodeBackendIsTemp(rnode))
 		{
@@ -349,7 +351,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
 		/* Prevent other backends' fds from holding on to the disk space */
 		ret = do_truncate(path);

-		/* Register request to unlink first segment later */
+		/* Leave the file as a tombstone, to be unlinked at checkpoint time. */
 		register_unlink_segment(rnode, forkNum, 0 /* first seg */ );
 	}

diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..cca04a6aa8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -207,6 +207,12 @@ extern PGDLLIMPORT int wal_level;

 /* Do we need to WAL-log information required only for logical replication? */
 #define XLogLogicalInfoActive() (wal_level >= WAL_LEVEL_LOGICAL)

+/*
+ * Is