Re: Make relfile tombstone files conditional on WAL level

2022-05-16 Thread Dilip Kumar
On Mon, May 16, 2022 at 3:24 PM Thomas Munro  wrote:
>
> I think you can get rid of SYNC_UNLINK_REQUEST, sync_unlinkfiletag,
> mdunlinkfiletag as these are all now unused.

Correct.

> Are there any special hazards here if the plan in [1] goes ahead?

IMHO we should not have any problem.  In fact, we need this for [1]
right?  Otherwise, there is a risk of reusing the same relfilenode
within the same checkpoint cycle as discussed in [2].

> [1] 
> https://www.postgresql.org/message-id/flat/CA%2BTgmoYmw%3D%3DTOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA%40mail.gmail.com

[2] 
https://www.postgresql.org/message-id/CA+TgmoZZDL_2E_zuahqpJ-WmkuxmUi8+g7=dLEny=18r-+c...@mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-05-16 Thread Thomas Munro
I think you can get rid of SYNC_UNLINK_REQUEST, sync_unlinkfiletag,
mdunlinkfiletag as these are all now unused.
Are there any special hazards here if the plan in [1] goes ahead?  If
the relfilenode allocation is logged and replayed then it should be
fine to crash and recover multiple times in a row while creating and
dropping tables, with wal_level=minimal, I think.  It would be bad if
the allocator restarted from a value from the checkpoint, though.

[1] 
https://www.postgresql.org/message-id/flat/CA%2BTgmoYmw%3D%3DTOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA%40mail.gmail.com




Re: Make relfile tombstone files conditional on WAL level

2022-05-15 Thread Dilip Kumar
On Thu, May 12, 2022 at 4:27 PM Amul Sul  wrote:
>
Hi Amul,

Thanks for the review.  Based on some comments from Robert we have
planned to make some design changes, so I am planning to work on that
for the July commitfest.  I will try to incorporate all your review
comments in the new version.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-05-12 Thread Amul Sul
Hi Dilip,

On Fri, Mar 4, 2022 at 11:07 AM Dilip Kumar  wrote:
>
> On Mon, Feb 21, 2022 at 1:21 PM Dilip Kumar  wrote:
> >
> > On Thu, Jan 6, 2022 at 1:43 PM Dilip Kumar  wrote:
> >
> >  2) GetNewRelFileNode() will not loop for checking the file existence
> > > and retry with other relfilenode.
> >
> >
> > Open Issues- there are currently 2 open issues in the patch 1) Issue
> > as discussed above about removing the loop, so currently in this patch
> > the loop is removed.  2) During upgrade from the previous version we
> > need to advance the nextrelfilenode to the current relfilenode we are
> > setting for the object in order to avoid the conflict.
>
>
> In this version I have fixed both of these issues.  Thanks Robert for
> suggesting the solution for both of these problems in our offlist
> discussion.  Basically, for the first problem we can flush the xlog
> immediately because we are actually logging the WAL every time after
> we allocate 64 relfilenode so this should not have much impact on the
> performance and I have added the same in the comments.  And during
> pg_upgrade, whenever we are assigning the relfilenode as part of the
> upgrade we will set that relfilenode + 1 as nextRelFileNode to be
> assigned so that we never generate the conflicting relfilenode.
>
> The only part I do not like in the patch is that before this patch we
> could directly access buftag->rnode.  But now we no longer keep the
> relfilenode directly as part of the buffer tag; instead we keep the
> individual fields (i.e. dbOid, tbsOid and relNode) in the buffer tag.
> So if we have to get the relfilenode directly we need to generate it.
> However, those changes are very limited to just 1 or 2 files so maybe
> not that bad.
>

The v5 patch needs a rebase.  Here are a few comments for 0002 that I
found while reading it; hope that helps:

+/* Number of RelFileNode to prefetch (preallocate) per XLOG write */
+#define VAR_RFN_PREFETCH   8192
+

Should it be 64, as per the comment in XLogPutNextRelFileNode for XLogFlush()?
---

+   /*
+    * Check for the wraparound for the relnode counter.
+    *
+    * XXX Actually the relnode is 56 bits wide so we don't need to worry about
+    * the wraparound case.
+    */
+   if (ShmemVariableCache->nextRelNode > MAX_RELFILENODE)

Very rare case, should use unlikely()?
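
For illustration, the shape being suggested might look like this (the
error code and message here are made up, not taken from the patch):

    if (unlikely(ShmemVariableCache->nextRelNode > MAX_RELFILENODE))
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("relfilenode counter is exhausted")));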
---

+/*
+ * Max value of the relfilnode.  Relfilenode will be of 56bits wide for more
+ * details refer comments atop BufferTag.
+ */
+#define MAX_RELFILENODE    ((((uint64) 1) << 56) - 1)

Should there be 57-bit shifts here? Instead, I think we should use
INT64CONST(0x00FFFFFFFFFFFFFF) to be consistent with PG_*_MAX
declarations, thoughts?
---

+   /* If we run out of logged for use RelNode then we must log more */
+   if (ShmemVariableCache->relnodecount == 0)

relnodecount might never go below zero, but just to be safer we should
check for <= 0 instead.
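
For what it's worth, the refill could mirror how GetNewObjectId() handles
VAR_OID_PREFETCH; this is only a sketch, and the exact signature of
XLogPutNextRelFileNode() is assumed here rather than copied from the patch:

    uint64      result;

    if (ShmemVariableCache->relnodecount <= 0)
    {
        XLogPutNextRelFileNode(ShmemVariableCache->nextRelNode + VAR_RFN_PREFETCH);
        ShmemVariableCache->relnodecount = VAR_RFN_PREFETCH;
    }
    result = ShmemVariableCache->nextRelNode++;
    (ShmemVariableCache->relnodecount)--;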
---

Few typos:
Simmialr
Simmilar
agains
idealy

Regards,
Amul




Re: Make relfile tombstone files conditional on WAL level

2022-03-08 Thread Robert Haas
On Fri, Mar 4, 2022 at 12:37 AM Dilip Kumar  wrote:
> In this version I have fixed both of these issues.

Here's a bit of review for these patches:

- The whole relnode vs. relfilenode thing is really confusing. I
realize that there is some precedent for calling the number that
pertains to the file on disk "relnode" and that value when combined
with the database and tablespace OIDs "relfilenode," but it's
definitely not the most obvious thing, especially since
pg_class.relfilenode is a prominent case where we don't even adhere to
that convention. I'm kind of tempted to think that we should go the
other way and rename the RelFileNode struct to something like
RelFileLocator, and then maybe call the new data type RelFileNumber.
And then we could work toward removing references to "filenode" and
"relfilenode" in favor of either (rel)filelocator or (rel)filenumber.
Now the question (even assuming other people like this general
direction) is how far do we go with it? Renaming pg_class.relfilenode
itself wouldn't be the worst compatibility break we've ever had, but
it would definitely cause some pain. I'd be inclined to leave the
user-visible catalog column alone and just push in this direction for
internal stuff.

- What you're doing to pg_buffercache here is completely unacceptable.
You can't change the definition of an already-released version of the
extension. Please study how such issues have been handled in the past.

- It looks to me like you need to give significantly more thought to
the proper way of adjusting the relfilenode-related test cases in
alter_table.out.

- I think BufTagGetFileNode and BufTagGetSetFileNode should be
introduced in 0001 and then just update the definition in 0002 as
required. Note that as things stand you end up with both
BufTagGetFileNode and BuffTagGetRelFileNode which is an artifact of
the relnode/filenode/relfilenode confusion I mention above, and just
to make matters worse, one returns a value while the other produces an
out parameter. I think the renaming I'm talking about up above might
help somewhat here, but it seems like it might also be good to change
the one that uses an out parameter by doing Get -> Copy, just to help
the reader get a clue a little more easily.

- GetNewRelNode() needs to error out if it would wrap around, rather
than actually wrapping around. Probably similar to what happens if we
exhaust 2^64 bytes of WAL.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-03-07 Thread Dilip Kumar
On Tue, Mar 8, 2022 at 1:27 AM Robert Haas  wrote:

> > The only part I do not like in the patch is that before this patch we
> > could directly access buftag->rnode.  But now we no longer keep the
> > relfilenode directly as part of the buffer tag; instead we keep the
> > individual fields (i.e. dbOid, tbsOid and relNode) in the buffer tag.
> > So if we have to get the relfilenode directly we need to generate it.
> > However, those changes are very limited to just 1 or 2 files so maybe
> > not that bad.
>
> You're talking here about just needing to introduce BufTagGetFileNode
> and BufTagSetFileNode, or something else? I don't find those macros to
> be problematic.

Yeah, I was talking about the BufTagGetFileNode macro only.  The reason
I did not like it is that earlier we could use buftag->rnode directly,
but now whenever we want to use the rnode we first need a separate
variable and have to prepare the rnode using the BufTagGetFileNode
macro.  But these changes are very localized and in very few places, so
I don't have much of a problem with them.

>
> BufTagSetFileNode could maybe assert that the OID isn't too big,
> though. We should ereport() before we get to this point if we somehow
> run out of values, but it might be nice to have a check here as a
> backup.

Yeah, we could do that, I will do that in the next version.  Thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-03-07 Thread Robert Haas
On Fri, Mar 4, 2022 at 12:37 AM Dilip Kumar  wrote:
> In this version I have fixed both of these issues.  Thanks Robert for
> suggesting the solution for both of these problems in our offlist
> discussion.  Basically, for the first problem we can flush the xlog
> immediately because we are actually logging the WAL every time after
> we allocate 64 relfilenode so this should not have much impact on the
> performance and I have added the same in the comments.  And during
> pg_upgrade, whenever we are assigning the relfilenode as part of the
> upgrade we will set that relfilenode + 1 as nextRelFileNode to be
> assigned so that we never generate the conflicting relfilenode.

Anyone else have an opinion on this?

> The only part I do not like in the patch is that before this patch we
> could directly access buftag->rnode.  But now we no longer keep the
> relfilenode directly as part of the buffer tag; instead we keep the
> individual fields (i.e. dbOid, tbsOid and relNode) in the buffer tag.
> So if we have to get the relfilenode directly we need to generate it.
> However, those changes are very limited to just 1 or 2 files so maybe
> not that bad.

You're talking here about just needing to introduce BufTagGetFileNode
and BufTagSetFileNode, or something else? I don't find those macros to
be problematic.

BufTagSetFileNode could maybe assert that the OID isn't too big,
though. We should ereport() before we get to this point if we somehow
run out of values, but it might be nice to have a check here as a
backup.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-03-07 Thread Robert Haas
On Mon, Feb 21, 2022 at 2:51 AM Dilip Kumar  wrote:
> While working on this I realized that even if we make the relfilenode
> 56 bits we can not remove the loop inside GetNewRelFileNode() for
> checking the file existence.  Because it is always possible that the
> file reaches the disk even before the WAL record advancing the next
> relfilenode, and if the system crashes in between then we might
> generate a duplicate relfilenode, right?

I agree.

> I think the second paragraph in the XLogPutNextOid() function explains
> this issue, and even after we get the wider relfilenode we will still
> have it.  Correct?

I think you are correct.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-02-09 Thread Dilip Kumar
On Mon, Feb 7, 2022 at 10:13 PM Robert Haas  wrote:
>
> On Mon, Feb 7, 2022 at 11:31 AM Dilip Kumar  wrote:
> > For RelFileNode also we need to use two 32-bit integers so that we do
> > not add extra alignment padding, because there are a few more
> > structures that include RelFileNode, e.g. xl_xact_relfilenodes,
> > RelFileNodeBackend, and many other structures.
>
> Are you sure that kind of stuff is really important enough to justify
> the code churn? I don't think RelFileNodeBackend is used widely enough
> or in sufficiently performance-critical places that we really need to
> care about a few bytes of alignment padding. xl_xact_relfilenodes is
> more concerning because that goes into the WAL format, but I don't
> know that we use it often enough for an extra 4 bytes per record to
> really matter, especially considering that this proposal also adds 4
> bytes *per relfilenode* which has to be a much bigger deal than a few
> padding bytes after 'nrels'. The reason why BufferTag matters a lot is
> because (1) we have an array of this struct that can easily contain a
> million or eight entries, so the alignment padding adds up a lot more
> and (2) access to that array is one of the most performance-critical
> parts of PostgreSQL.

I agree with you that adding 4 extra bytes to these structures might
not be really critical.  I will make the changes based on this idea
and see how the changes look.



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-02-07 Thread Robert Haas
On Mon, Feb 7, 2022 at 11:31 AM Dilip Kumar  wrote:
> For RelFileNode also we need to use two 32-bit integers so that we do
> not add extra alignment padding, because there are a few more
> structures that include RelFileNode, e.g. xl_xact_relfilenodes,
> RelFileNodeBackend, and many other structures.

Are you sure that kind of stuff is really important enough to justify
the code churn? I don't think RelFileNodeBackend is used widely enough
or in sufficiently performance-critical places that we really need to
care about a few bytes of alignment padding. xl_xact_relfilenodes is
more concerning because that goes into the WAL format, but I don't
know that we use it often enough for an extra 4 bytes per record to
really matter, especially considering that this proposal also adds 4
bytes *per relfilenode* which has to be a much bigger deal than a few
padding bytes after 'nrels'. The reason why BufferTag matters a lot is
because (1) we have an array of this struct that can easily contain a
million or eight entries, so the alignment padding adds up a lot more
and (2) access to that array is one of the most performance-critical
parts of PostgreSQL.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-02-07 Thread Dilip Kumar
On Mon, Feb 7, 2022 at 9:42 PM Robert Haas  wrote:
>
> On Mon, Feb 7, 2022 at 12:26 AM Dilip Kumar  wrote:
> > I have split the patch into multiple patches which are independently
> > committable and easy to review. I have explained the purpose and
> > scope of each patch in the respective commit messages.
>
> Hmm. The parts of this I've looked at seem reasonably clean, but I
> don't think I like the design choice. You're inventing
> RelFileNodeSetFork(), but at present the RelFileNode struct doesn't
> include a fork number. I feel like we should leave that alone, and
> only change the definition of a BufferTag. What about adding accessors
> for all of the BufferTag fields in 0001, and then in 0002 change it to
> look like something this:
>
> typedef struct BufferTag
> {
> Oid dbOid;
> Oid tablespaceOid;
> uint32  fileNode_low;
> uint32  fileNode_hi:24;
> uint32  forkNumber:8;
> BlockNumber blockNumber;
> } BufferTag;

Okay, we can do that.  But we cannot leave RelFileNode untouched; I
mean inside RelFileNode we will also have to change the relNode to two
32-bit integers, like below.

typedef struct RelFileNode
{
    Oid     spcNode;
    Oid     dbNode;
    uint32  relNode_low;
    uint32  relNode_hi;
} RelFileNode;

For RelFileNode also we need to use two 32-bit integers so that we do
not add extra alignment padding, because there are a few more
structures that include RelFileNode, e.g. xl_xact_relfilenodes,
RelFileNodeBackend, and many other structures.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-02-07 Thread Robert Haas
On Mon, Feb 7, 2022 at 12:26 AM Dilip Kumar  wrote:
> I have split the patch into multiple patches which are independently
> committable and easy to review. I have explained the purpose and
> scope of each patch in the respective commit messages.

Hmm. The parts of this I've looked at seem reasonably clean, but I
don't think I like the design choice. You're inventing
RelFileNodeSetFork(), but at present the RelFileNode struct doesn't
include a fork number. I feel like we should leave that alone, and
only change the definition of a BufferTag. What about adding accessors
for all of the BufferTag fields in 0001, and then in 0002 change it to
look something like this:

typedef struct BufferTag
{
Oid dbOid;
Oid tablespaceOid;
uint32  fileNode_low;
uint32  fileNode_hi:24;
uint32  forkNumber:8;
BlockNumber blockNumber;
} BufferTag;
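
Just to illustrate the accessor idea (the names here are placeholders,
not from any patch), the getter could reassemble the 56-bit value from
the two fields, and the setter is a natural place for a range check:

static inline uint64
BufTagGetFileNode(const BufferTag *tag)
{
    return ((uint64) tag->fileNode_hi << 32) | tag->fileNode_low;
}

static inline void
BufTagSetFileNode(BufferTag *tag, uint64 fileNode)
{
    Assert(fileNode < (UINT64CONST(1) << 56));
    tag->fileNode_low = (uint32) fileNode;
    tag->fileNode_hi = (uint32) (fileNode >> 32);
}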

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-02-02 Thread Dilip Kumar
On Wed, Feb 2, 2022 at 6:57 PM Robert Haas  wrote:
>
> On Mon, Jan 31, 2022 at 9:37 AM Dilip Kumar  wrote:
> > I agree that we are using an 8-byte unsigned int in multiple places in
> > the code as uint64.  But I don't see it as an exposed data type, and it
> > is not used as part of any exposed function.  But we will have to use
> > the relfilenode in an exposed C function, e.g.
> > binary_upgrade_set_next_heap_relfilenode().
>
> Oh, I thought we were talking about the C data type uint8 i.e. an
> 8-bit unsigned integer. Which in retrospect was a dumb thought because
> you said you wanted to store the relfilenode AND the fork number
> there, which only make sense if you were talking about SQL data types
> rather than C data types. It is confusing that we have an SQL data
> type called int8 and a C data type called int8 and they're not the
> same.
>
> But if you're talking about SQL data types, why? pg_class only stores
> the relfilenode and not the fork number currently, and I don't see why
> that would change. I think that the data type for the relfilenode
> column would change to a 64-bit signed integer (i.e. bigint or int8)
> that only ever uses the low-order 56 bits, and then when you need to
> store a relfilenode and a fork number in the same 8-byte quantity
> you'd do that using either a struct with bit fields or by something
> like combined = ((uint64) signed_representation_of_relfilenode) |
> (((int) forknumber) << 56);

Yeah, you're right.  I think whenever we are using the combined value we
can use the uint64 C type, and in pg_class we can keep it as int64
because that only represents the relfilenode part.  I think I was just
confused and tried to use the same data type everywhere, whether it is
combined with the fork number or not.  Thanks for your input, I will
change this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-02-02 Thread Robert Haas
On Mon, Jan 31, 2022 at 9:37 AM Dilip Kumar  wrote:
> I agree that we are using an 8-byte unsigned int in multiple places in
> the code as uint64.  But I don't see it as an exposed data type, and it
> is not used as part of any exposed function.  But we will have to use
> the relfilenode in an exposed C function, e.g.
> binary_upgrade_set_next_heap_relfilenode().

Oh, I thought we were talking about the C data type uint8 i.e. an
8-bit unsigned integer. Which in retrospect was a dumb thought because
you said you wanted to store the relfilenode AND the fork number
there, which only makes sense if you were talking about SQL data types
rather than C data types. It is confusing that we have an SQL data
type called int8 and a C data type called int8 and they're not the
same.

But if you're talking about SQL data types, why? pg_class only stores
the relfilenode and not the fork number currently, and I don't see why
that would change. I think that the data type for the relfilenode
column would change to a 64-bit signed integer (i.e. bigint or int8)
that only ever uses the low-order 56 bits, and then when you need to
store a relfilenode and a fork number in the same 8-byte quantity
you'd do that using either a struct with bit fields or by something
like combined = ((uint64) signed_representation_of_relfilenode) |
(((int) forknumber) << 56);
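
Spelled out as a helper, purely for illustration (the widening cast
before the shift matters so the fork number isn't shifted as a 32-bit
int), it might look like:

static inline uint64
combine_relfilenode_and_fork(int64 relfilenode, ForkNumber forknum)
{
    Assert(relfilenode >= 0 && relfilenode < (INT64CONST(1) << 56));
    return ((uint64) relfilenode) | (((uint64) forknum) << 56);
}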

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-01-31 Thread Dilip Kumar
On Mon, Jan 31, 2022 at 7:36 PM Robert Haas  wrote:
>
> On Mon, Jan 31, 2022 at 9:04 AM Robert Haas  wrote:
> > On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar  wrote:
> > > the main one is that currently we do not have a uint8 data
> > > type, only int8, so I have used int8 for storing relfilenode +
> > > forknumber.
> >
> > I'm confused. We use int8 in tons of places, so I feel like it must exist.
>
> Rather, we use uint8 in tons of places, so I feel like it must exist.

Hmm, at least pg_type doesn't have anything with a name like uint8.

postgres[101702]=# select oid, typname from pg_type where typname like '%int8';
 oid  | typname
------+---------
   20 | int8
 1016 | _int8
(2 rows)

postgres[101702]=# select oid, typname from pg_type where typname like '%uint%';
 oid | typname
-----+---------
(0 rows)

I agree that we are using an 8-byte unsigned int in multiple places in
the code as uint64.  But I don't see it as an exposed data type, and it
is not used as part of any exposed function.  But we will have to use
the relfilenode in an exposed C function, e.g.
binary_upgrade_set_next_heap_relfilenode().

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-01-31 Thread Robert Haas
On Mon, Jan 31, 2022 at 9:04 AM Robert Haas  wrote:
> On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar  wrote:
> > the main one is that currently we do not have a uint8 data
> > type, only int8, so I have used int8 for storing relfilenode +
> > forknumber.
>
> I'm confused. We use int8 in tons of places, so I feel like it must exist.

Rather, we use uint8 in tons of places, so I feel like it must exist.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-01-31 Thread Robert Haas
On Mon, Jan 31, 2022 at 12:29 AM Dilip Kumar  wrote:
> the main one is that currently we do not have a uint8 data
> type, only int8, so I have used int8 for storing relfilenode +
> forknumber.

I'm confused. We use int8 in tons of places, so I feel like it must exist.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-01-28 Thread Dilip Kumar
On Wed, Jan 19, 2022 at 10:37 AM Dilip Kumar  wrote:
>
> On Thu, Jan 6, 2022 at 7:22 PM Robert Haas  wrote:
>>
>> On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro  wrote:
>> > Another problem is that relfilenodes are normally allocated with
>> > GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
>> > a new allocator, and they won't be able to match the OID in general
>> > (while we have 32 bit OIDs at least).
>>
>> Personally I'm not sad about that. Values that are the same in simple
>> cases but diverge in more complex cases are kind of a trap for the
>> unwary. There's no real reason to have them ever match. Yeah, in
>> theory, it makes it easier to tell which file matches which relation,
>> but in practice, you always have to double-check in case the table has
>> ever been rewritten. It doesn't seem worth continuing to contort the
>> code for a property we can't guarantee anyway.
>
>
> Makes sense.  I have started working on this idea; I will try to post
> the first version by early next week.

Here is the first working patch; with this we no longer need to
maintain the tombstone file until the next checkpoint.  This is still a
WIP patch, but with it I can see that the problem related to WAL-logged
ALTER DATABASE SET TABLESPACE, which Robert reported a couple of mails
above in the same thread, is solved.

General idea of the patch:
- Change RelFileNode.relNode to be 64 bits wide, out of which 8 bits
are for the fork number and 56 bits for the relNode, as shown below. [1]
- GetNewRelFileNode() will just generate a new unique relfilenode and
check for file existence; if the file already exists then throw an
error, so there is no loop.  We also need logic for preserving
nextRelNode across restarts and for WAL-logging it, but that is similar
to preserving nextOid.
- mdunlinkfork will directly forget and unlink the relfilenode, so we
get rid of the deferred unlinking code.
- Now we don't need any post-checkpoint unlinking activity.

[1]
/*
* RelNodeId:
*
* this is a storage type for RelNode. The reasoning behind using this is same
* as using the BlockId so refer comment atop BlockId.
*/
typedef struct RelNodeId
{
  uint32 rn_hi;
  uint32 rn_lo;
} RelNodeId;
typedef struct RelFileNode
{
   Oid spcNode; /* tablespace */
   Oid dbNode; /* database */
   RelNodeId relNode; /* relation */
} RelFileNode;
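
For clarity, the mask-and-shift helpers implied above could look roughly
like this, assuming (per Andres's suggestion elsewhere in the thread)
that the low byte holds the fork number and the high 56 bits the relNode
proper; the names and exact layout are illustrative, not lifted from the
patch:

#define RELNODE_FORK_BITS   8
#define RELNODE_FORK_MASK   ((UINT64CONST(1) << RELNODE_FORK_BITS) - 1)

static inline uint64
RelNodeIdGetCombined(RelNodeId id)
{
    return ((uint64) id.rn_hi << 32) | id.rn_lo;
}

static inline uint64
RelNodeIdGetRelNode(RelNodeId id)
{
    return RelNodeIdGetCombined(id) >> RELNODE_FORK_BITS;
}

static inline ForkNumber
RelNodeIdGetForkNum(RelNodeId id)
{
    return (ForkNumber) (RelNodeIdGetCombined(id) & RELNODE_FORK_MASK);
}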

TODO:

There are a couple of TODOs and FIXMEs which I am planning to address
by next week.  I am also planning to do testing where the relfilenode
consumes more than 32 bits; maybe for that we can set
FirstNormalRelfileNode to a higher value for testing purposes.  And
improve comments.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
From 4a6502c7950969262c6982388865bbc23e531cde Mon Sep 17 00:00:00 2001
From: Dilip Kumar 
Date: Fri, 28 Jan 2022 18:32:35 +0530
Subject: [PATCH v1] Don't wait for next checkpoint to remove unwanted
 relfilenode

Currently, relfilenode is 32 bits wide, so if we remove the relfilenode
immediately after it is no longer needed then there is a risk of reusing
the same relfilenode in the same checkpoint.  To avoid that we delay
cleaning up the relfilenode until the next checkpoint.  With this patch
we use 56 bits for the relfilenode.  Ideally we could make it 64 bits
wide, but that would increase the size of the BufferTag, so to keep that
size the same we make RelFileNode.relNode 64 bits wide, of which 8 bits
are used for storing the fork number and the remaining 56 bits for the
relfilenode.
---
 contrib/pg_buffercache/pg_buffercache_pages.c   |   4 +-
 contrib/pg_prewarm/autoprewarm.c|   4 +-
 src/backend/access/common/syncscan.c|   3 +-
 src/backend/access/gin/ginxlog.c|   5 +-
 src/backend/access/rmgrdesc/gistdesc.c  |   4 +-
 src/backend/access/rmgrdesc/heapdesc.c  |   4 +-
 src/backend/access/rmgrdesc/nbtdesc.c   |   4 +-
 src/backend/access/rmgrdesc/seqdesc.c   |   4 +-
 src/backend/access/rmgrdesc/xlogdesc.c  |  15 +++-
 src/backend/access/transam/varsup.c |  42 +-
 src/backend/access/transam/xlog.c   |  57 ++---
 src/backend/access/transam/xloginsert.c |  12 +++
 src/backend/access/transam/xlogutils.c  |   9 ++-
 src/backend/catalog/catalog.c   |  61 +++---
 src/backend/catalog/heap.c  |   6 +-
 src/backend/catalog/index.c |   4 +-
 src/backend/catalog/storage.c   |   3 +-
 src/backend/commands/tablecmds.c|  18 +++--
 src/backend/replication/logical/decode.c|   1 +
 src/backend/replication/logical/reorderbuffer.c |   2 +-
 src/backend/storage/buffer/bufmgr.c |  61 +++---
 src/backend/storage/buffer/localbuf.c   |   8 +-
 src/backend/storage/freespace/fsmpage.c |   4 +-
 src/backend/storage/lmgr/lwlocknames.txt|   1 +
 src/backend/storage/smgr/

Re: Make relfile tombstone files conditional on WAL level

2022-01-18 Thread Dilip Kumar
On Thu, Jan 6, 2022 at 7:22 PM Robert Haas  wrote:

> On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro 
> wrote:
> > Another problem is that relfilenodes are normally allocated with
> > GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
> > a new allocator, and they won't be able to match the OID in general
> > (while we have 32 bit OIDs at least).
>
> Personally I'm not sad about that. Values that are the same in simple
> cases but diverge in more complex cases are kind of a trap for the
> unwary. There's no real reason to have them ever match. Yeah, in
> theory, it makes it easier to tell which file matches which relation,
> but in practice, you always have to double-check in case the table has
> ever been rewritten. It doesn't seem worth continuing to contort the
> code for a property we can't guarantee anyway.
>

Makes sense.  I have started working on this idea; I will try to post
the first version by early next week.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: Make relfile tombstone files conditional on WAL level

2022-01-06 Thread Andres Freund
On 2022-01-06 08:52:01 -0500, Robert Haas wrote:
> On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro  wrote:
> > Another problem is that relfilenodes are normally allocated with
> > GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
> > a new allocator, and they won't be able to match the OID in general
> > (while we have 32 bit OIDs at least).
> 
> Personally I'm not sad about that. Values that are the same in simple
> cases but diverge in more complex cases are kind of a trap for the
> unwary.

+1




Re: Make relfile tombstone files conditional on WAL level

2022-01-06 Thread Robert Haas
On Thu, Jan 6, 2022 at 3:47 AM Thomas Munro  wrote:
> Another problem is that relfilenodes are normally allocated with
> GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
> a new allocator, and they won't be able to match the OID in general
> (while we have 32 bit OIDs at least).

Personally I'm not sad about that. Values that are the same in simple
cases but diverge in more complex cases are kind of a trap for the
unwary. There's no real reason to have them ever match. Yeah, in
theory, it makes it easier to tell which file matches which relation,
but in practice, you always have to double-check in case the table has
ever been rewritten. It doesn't seem worth continuing to contort the
code for a property we can't guarantee anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-01-06 Thread Thomas Munro
On Thu, Jan 6, 2022 at 9:13 PM Dilip Kumar  wrote:
> On Thu, Jan 6, 2022 at 1:12 PM Dilip Kumar  wrote:
> > > I think this idea is worth more consideration. It seems like 2^56
> > > relfilenodes ought to be enough for anyone, recalling that you can
> > > only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
> > > bunch of code that is there to guard against relfilenodes being
> > > reused. In particular, we can remove the code that leaves a 0-length
> > > tombstone file around until the next checkpoint to guard against
> > > relfilenode reuse.
> >
> > +1

+1

> IMHO a few top-level points for implementing this idea would be as
> listed here:
>
> 1) the "relfilenode" member inside the RelFileNode will now be 64
> bits wide, and the "forkNum" will be removed altogether from the
> BufferTag.  So now whenever we want to use the relfilenode or fork
> number we can use the respective mask and fetch their values.
> 2) GetNewRelFileNode() will not loop for checking the file existence
> and retry with other relfilenode.
> 3) Modify mdunlinkfork() so that we immediately perform the unlink
> request, make sure to register_forget_request() before unlink.
> 4) In checkpointer, now we don't need any handling for pendingUnlinks.

Another problem is that relfilenodes are normally allocated with
GetNewOidWithIndex(), and initially match a relation's OID.  We'd need
a new allocator, and they won't be able to match the OID in general
(while we have 32 bit OIDs at least).




Re: Make relfile tombstone files conditional on WAL level

2022-01-06 Thread Dilip Kumar
On Thu, Jan 6, 2022 at 1:12 PM Dilip Kumar  wrote:
>
> >
> > I think this idea is worth more consideration. It seems like 2^56
> > relfilenodes ought to be enough for anyone, recalling that you can
> > only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
> > bunch of code that is there to guard against relfilenodes being
> > reused. In particular, we can remove the code that leaves a 0-length
> > tombstone file around until the next checkpoint to guard against
> > relfilenode reuse.
>
> +1
>

IMHO a few top-level points for implementing this idea would be as listed here:

1) the "relfilenode" member inside the RelFileNode will now be 64
bits wide, and the "forkNum" will be removed altogether from the
BufferTag.  So now whenever we want to use the relfilenode or fork
number we can use the respective mask and fetch their values.
2) GetNewRelFileNode() will not loop for checking the file existence
and retry with other relfilenode.
3) Modify mdunlinkfork() so that we immediately perform the unlink
request, make sure to register_forget_request() before unlink.
4) In checkpointer, now we don't need any handling for pendingUnlinks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-01-05 Thread Dilip Kumar
On Thu, Jan 6, 2022 at 3:07 AM Robert Haas  wrote:
>
> On Mon, Aug 2, 2021 at 6:38 PM Andres Freund  wrote:
> > I guess there's a somewhat hacky way to get somewhere without actually
> > increasing the size. We could take 3 bytes from the fork number and use that
> > to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
> > everyone.
> >
> > It's not like we can use those bytes in a useful way, due to alignment
> > requirements. Declaring that the high 7 bytes are for the relNode portion
> > and the low byte for the fork would still allow efficient comparisons and
> > doesn't seem too ugly.
>
> I think this idea is worth more consideration. It seems like 2^56
> relfilenodes ought to be enough for anyone, recalling that you can
> only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
> bunch of code that is there to guard against relfilenodes being
> reused. In particular, we can remove the code that leaves a 0-length
> tombstone file around until the next checkpoint to guard against
> relfilenode reuse.

+1

>
> I think this would also solve a problem Dilip mentioned to me today:
> suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's
> been trying to do. Then suppose you do "ALTER DATABASE foo SET
> TABLESPACE used_recently_but_not_any_more". You might get an error
> complaining that "some relations of database \"%s\" are already in
> tablespace \"%s\"" because there could be tombstone files in that
> database. With this combination of changes, you could just use the
> barrier mechanism from https://commitfest.postgresql.org/36/2962/ to
> wait for those files to disappear, because they've got to be
> previously-unlinked files that Windows is still returning because
> they're still open -- or else they could be a sign of a corrupted
> database, but there are no other possibilities.

Yes, this approach will solve the problem for the WAL-logged ALTER
DATABASE we are facing.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2022-01-05 Thread Robert Haas
On Mon, Aug 2, 2021 at 6:38 PM Andres Freund  wrote:
> I guess there's a somewhat hacky way to get somewhere without actually
> increasing the size. We could take 3 bytes from the fork number and use that
> to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
> everyone.
>
> It's not like we can use those bytes in a useful way, due to alignment
> requirements. Declaring that the high 7 bytes are for the relNode portion and
> the low byte for the fork would still allow efficient comparisons and doesn't
> seem too ugly.

I think this idea is worth more consideration. It seems like 2^56
relfilenodes ought to be enough for anyone, recalling that you can
only ever have 2^64 bytes of WAL. So if we do this, we can eliminate a
bunch of code that is there to guard against relfilenodes being
reused. In particular, we can remove the code that leaves a 0-length
tombstone file around until the next checkpoint to guard against
relfilenode reuse. On Windows, we still need
https://commitfest.postgresql.org/36/2962/ because of the problem that
Windows won't remove files from the directory listing until they are
both unlinked and closed. But in general this seems like it would lead
to cleaner code. For example, GetNewRelFileNode() needn't loop. If it
allocates the smallest unsigned integer that the cluster (or database)
has never previously assigned, the file should definitely not exist on
disk, and if it does, an ERROR is appropriate, as the database is
corrupted. This does assume that allocations from this new 56-bit
relfilenode counter are properly WAL-logged.
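
As a rough sketch only (the lock and path-lookup names below are
invented, and the counter name follows the patch discussed elsewhere in
the thread), the allocation could then be as simple as:

    uint64      relnumber;

    LWLockAcquire(RelFileNodeGenLock, LW_EXCLUSIVE);    /* hypothetical lock */
    relnumber = ShmemVariableCache->nextRelNode++;      /* WAL-logged counter */
    LWLockRelease(RelFileNodeGenLock);

    /* A pre-existing file now means corruption, not a collision to retry. */
    if (access(path_for_relfilenode(relnumber), F_OK) == 0)   /* placeholder helper */
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("file for newly allocated relfilenode already exists")));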

I think this would also solve a problem Dilip mentioned to me today:
suppose you make ALTER DATABASE SET TABLESPACE WAL-logged, as he's
been trying to do. Then suppose you do "ALTER DATABASE foo SET
TABLESPACE used_recently_but_not_any_more". You might get an error
complaining that "some relations of database \"%s\" are already in
tablespace \"%s\"" because there could be tombstone files in that
database. With this combination of changes, you could just use the
barrier mechanism from https://commitfest.postgresql.org/36/2962/ to
wait for those files to disappear, because they've got to be
previously-unlinked files that Windows is still returning because
they're still open -- or else they could be a sign of a corrupted
database, but there are no other possibilities.

I think, anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2021-10-28 Thread Thomas Munro
On Tue, Oct 5, 2021 at 4:21 PM Thomas Munro  wrote:
> On Thu, Sep 30, 2021 at 11:32 PM Thomas Munro  wrote:
> > I managed to produce a case where live data is written to an unlinked
> > file and lost

In conclusion, there *is* something else that would break, so I'm
withdrawing this CF entry (#3030) for now.  Also, that something else
is already subtly broken, so I'll try to come up with a fix for that
separately.




Re: Make relfile tombstone files conditional on WAL level

2021-10-04 Thread Thomas Munro
On Thu, Sep 30, 2021 at 11:32 PM Thomas Munro  wrote:
> I managed to produce a case where live data is written to an unlinked
> file and lost

I guess this must have been broken since release 9.2 moved checkpoints
out of here[1].  The connection between checkpoints, tombstone files
and file descriptor cache invalidation in auxiliary (non-sinval)
backends was not documented as far as I can see (or at least not
anywhere near the load-bearing parts).

How could it be fixed, simply and backpatchably?  If BgSyncBuffer()
did if-FirstCallSinceLastCheckpoint()-then-smgrcloseall() after
locking each individual buffer and before flushing, then I think it
might logically have the correct interlocking against relfilenode
wraparound, but that sounds a tad expensive :-(  I guess it could be
made cheaper by using atomics for the checkpoint counter instead of
spinlocks.  Better ideas?

[1] 
https://www.postgresql.org/message-id/flat/CA%2BU5nMLv2ah-HNHaQ%3D2rxhp_hDJ9jcf-LL2kW3sE4msfnUw9gA%40mail.gmail.com




Re: Make relfile tombstone files conditional on WAL level

2021-09-30 Thread Thomas Munro
On Wed, Sep 29, 2021 at 4:07 PM Thomas Munro  wrote:
> Hmm, on closer inspection, isn't the lack of real interlocking with
> checkpoints a bit suspect already?  What stops bgwriter from writing
> to the previous relfilenode generation's fd, if a relfilenode is
> recycled while BgBufferSync() is running?  Not sinval, and not the
> above code that only runs between BgBufferSync() invocations.

I managed to produce a case where live data is written to an unlinked
file and lost, with a couple of tweaks to get the right timing and
simulate OID wraparound.  See attached.  If you run the following
commands repeatedly with shared_buffers=256kB and
bgwriter_lru_multiplier=10, you should see a number lower than 10,000
from the last query in some runs, depending on timing.

create extension if not exists chaos;
create extension if not exists pg_prewarm;

drop table if exists t1, t2;
checkpoint;
vacuum pg_class;

select clobber_next_oid(20);
create table t1 as select 42 i from generate_series(1, 10000);
select pg_prewarm('t1'); -- fill buffer pool with t1
update t1 set i = i; -- dirty t1 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time

drop table t1;
checkpoint;
vacuum pg_class;

select clobber_next_oid(20);
create table t2 as select 0 i from generate_series(1, 10000);
select pg_prewarm('t2'); -- fill buffer pool with t2
update t2 set i = 1 where i = 0; -- dirty t2 buffers so bgwriter writes some
select pg_sleep(2); -- give bgwriter some time

select pg_prewarm('pg_attribute'); -- evict all clean t2 buffers
select sum(i) as t2_sum_should_be_10000 from t2; -- have any updates been lost?
From b116b80b2775b004e35a9e7be0a057ee2724041b Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 30 Sep 2021 15:47:23 +1300
Subject: [PATCH 1/2] HACK: A function to control the OID allocator.

---
 src/test/modules/chaos/Makefile   | 21 +
 src/test/modules/chaos/chaos--1.0.sql |  8 
 src/test/modules/chaos/chaos.c| 26 ++
 src/test/modules/chaos/chaos.control  |  4 
 4 files changed, 59 insertions(+)
 create mode 100644 src/test/modules/chaos/Makefile
 create mode 100644 src/test/modules/chaos/chaos--1.0.sql
 create mode 100644 src/test/modules/chaos/chaos.c
 create mode 100644 src/test/modules/chaos/chaos.control

diff --git a/src/test/modules/chaos/Makefile b/src/test/modules/chaos/Makefile
new file mode 100644
index 00..ac69721af6
--- /dev/null
+++ b/src/test/modules/chaos/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/chaos/Makefile
+
+MODULE_big = chaos
+OBJS = \
+	$(WIN32RES) \
+	chaos.o
+PGFILEDESC = "chaos - module in which to write throwaway fault-injection hacks"
+
+EXTENSION = chaos
+DATA = chaos--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/chaos
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/chaos/chaos--1.0.sql b/src/test/modules/chaos/chaos--1.0.sql
new file mode 100644
index 00..5016f7e586
--- /dev/null
+++ b/src/test/modules/chaos/chaos--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/chaos/chaos--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION chaos" to load this file. \quit
+
+CREATE FUNCTION clobber_next_oid(size BIGINT)
+	RETURNS pg_catalog.void STRICT
+	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/chaos/chaos.c b/src/test/modules/chaos/chaos.c
new file mode 100644
index 00..f1052f865e
--- /dev/null
+++ b/src/test/modules/chaos/chaos.c
@@ -0,0 +1,26 @@
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "fmgr.h"
+#include "storage/lwlock.h"
+
+#include <limits.h>
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(clobber_next_oid);
+
+Datum
+clobber_next_oid(PG_FUNCTION_ARGS)
+{
+	int64		oid = PG_GETARG_INT64(0);
+
+	if (oid < FirstNormalObjectId || oid > UINT_MAX)
+		elog(ERROR, "invalid oid");
+
+	LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
+	ShmemVariableCache->nextOid = oid;
+	LWLockRelease(OidGenLock);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/test/modules/chaos/chaos.control b/src/test/modules/chaos/chaos.control
new file mode 100644
index 00..137ab8a58d
--- /dev/null
+++ b/src/test/modules/chaos/chaos.control
@@ -0,0 +1,4 @@
+comment = 'Test module for throwaway fault-injection hacks...'
+default_version = '1.0'
+module_pathname = '$libdir/chaos'
+relocatable = true
-- 
2.30.2

From 2acc2ad31c1db268de0e8927d5c10ba2bb06e33c Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 30 Sep 2021 17:16:01 +1300
Subject: [PATCH 2/2] HACK: Slow the bgwriter down a bit.

---
 src/backend/postmaster/bgwriter.c   | 2 ++
 src/backend/storage/buffer/bufmgr.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 5584f4bc24..b65284b1f6

Re: Make relfile tombstone files conditional on WAL level

2021-09-28 Thread Thomas Munro
On Wed, Aug 4, 2021 at 3:22 AM Robert Haas  wrote:
> It's not really clear to me what problem is at hand. The problems that
> the tombstone system created for the async I/O stuff weren't really
> explained properly, IMHO. And I don't think the current system is all
> that ugly. it's not the most beautiful thing in the world but we have
> lots of way worse hacks. And, it's easy to understand, requires very
> little code, and has few moving parts that can fail. As hacks go it's
> a quality hack, I would say.

It's not really an AIO problem.  It's just that while testing the AIO
stuff across a lot of operating systems, we had tests failing on
Windows because the extra worker processes you get if you use
io_method=worker were holding cached descriptors and causing stuff
like DROP TABLESPACE to fail.  AFAIK every problem we discovered in
that vein is a current live bug in all versions of PostgreSQL for
Windows (it just takes other backends or the bgwriter to hold an fd at
the wrong moment).  The solution I'm proposing to that general class
of problem is https://commitfest.postgresql.org/34/2962/ .

In the course of thinking about that, it seemed natural to look into
the possibility of getting rid of the tombstones, so that at least
Unix systems don't find themselves having to suffer through a
CHECKPOINT just to drop a tablespace that happens to contain a
tombstone.




Re: Make relfile tombstone files conditional on WAL level

2021-09-28 Thread Thomas Munro
On Fri, Mar 5, 2021 at 11:02 AM Thomas Munro  wrote:
> This is a WIP with an open question to research: what could actually
> break if we did this?

I thought this part of bgwriter.c might be a candidate:

   if (FirstCallSinceLastCheckpoint())
   {
       /*
        * After any checkpoint, close all smgr files.  This is so we
        * won't hang onto smgr references to deleted files indefinitely.
        */
       smgrcloseall();
   }

Hmm, on closer inspection, isn't the lack of real interlocking with
checkpoints a bit suspect already?  What stops bgwriter from writing
to the previous relfilenode generation's fd, if a relfilenode is
recycled while BgBufferSync() is running?  Not sinval, and not the
above code that only runs between BgBufferSync() invocations.




Re: Make relfile tombstone files conditional on WAL level

2021-08-03 Thread Robert Haas
On Mon, Aug 2, 2021 at 6:38 PM Andres Freund  wrote:
> What I proposed in the past was to have a new shared table that tracks
> relfilenodes. I still think that's a decent solution for just the problem at
> hand.

It's not really clear to me what problem is at hand. The problems that
the tombstone system created for the async I/O stuff weren't really
explained properly, IMHO. And I don't think the current system is all
that ugly. It's not the most beautiful thing in the world but we have
lots of way worse hacks. And, it's easy to understand, requires very
little code, and has few moving parts that can fail. As hacks go it's
a quality hack, I would say.

> But it'd also potentially be the way to redesign relation forks and even
> slim down buffer tags:
>
> Right now a buffer tag is:
> - 4 byte tablespace oid
> - 4 byte database oid
> - 4 byte "relfilenode oid" (don't think we have a good name for this)
> - 4 byte fork number
> - 4 byte block number
>
> If we had such a shared table we could put at least tablespace, fork number
> into that table mapping them to an 8 byte "new relfilenode". That'd only make
> the "new relfilenode" unique within a database, but that'd be sufficient for
> our purposes.  It'd give use a buffertag consisting out of the following:
> - 4 byte database oid
> - 8 byte "relfilenode"
> - 4 byte block number

Yep. I think this is a good direction.

> Of course, it'd add some complexity too, because a buffertag alone wouldn't be
> sufficient to read data (as you'd need the tablespace oid from elsewhere). But
> that's probably ok, I think all relevant places would have that information.

I think the thing to look at would be the places that call
relpathperm() or relpathbackend(). I imagine this can be worked out,
but it might require some adjustment.

> It's probably possible to remove the database oid from the tag as well, but
> it'd make CREATE DATABASE tricker - we'd need to change the filenames of
> tables as we copy, to adjust them to the differing oid.

Yeah, I'm not really sure that works out to a win. I tend to think
that we should be trying to make databases within the same cluster
more rather than less independent of each other. If we switch to using
a radix tree for the buffer mapping table as you have previously
proposed, then presumably each backend can cache a pointer to the
second level, after the database OID has been resolved. Then you have
no need to compare database OIDs for every lookup. That might turn out
to be better for performance than shoving everything into the buffer
tag anyway, because then backends in different databases would be
accessing distinct parts of the buffer mapping data structure instead
of contending with one another.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2021-08-02 Thread Andres Freund
Hi,

On 2021-08-02 16:03:31 -0400, Robert Haas wrote:
> The two most principled solutions to this problem that I can see are
> (1) remove wal_level=minimal and

I'm personally not opposed to this. It's not practically relevant and makes a
lot of stuff more complicated. We IMO should rather focus on optimizing the
things wal_level=minimal accelerates a lot than on adding complications for
wal_level=minimal. Such optimizations would have practical relevance, and
there are plenty of low-hanging fruits.


> (2) use 64-bit relfilenodes. I have
> been reluctant to support #1 because it's hard for me to believe that
> there aren't cases where being able to skip a whole lot of WAL-logging
> doesn't work out to a nice performance win, but I realize opinions on
> that topic vary. And I'm pretty sure that Andres, at least, will hate
> #2 because he's unhappy with the width of buffer tags already.

Yep :/

I guess there's a somewhat hacky way to get somewhere without actually
increasing the size. We could take 3 bytes from the fork number and use that
to get to a 7 byte relfilenode portion. 7 bytes are probably enough for
everyone.

It's not like we can use those bytes in a useful way, due to alignment
requirements. Declaring that the high 7 bytes are for the relNode portion and
the low byte for the fork would still allow efficient comparisons and doesn't
seem too ugly.


> So I don't really have a good idea. I agree this tombstone system is a
> bit of a wart, but I'm not sure that this patch really makes anything
> any better, and I'm not really seeing another idea that seems better
> either.

> Maybe I am missing something...

What I proposed in the past was to have a new shared table that tracks
relfilenodes. I still think that's a decent solution for just the problem at
hand. But it'd also potentially be the way to redesign relation forks and even
slim down buffer tags:

Right now a buffer tag is:
- 4 byte tablespace oid
- 4 byte database oid
- 4 byte "relfilenode oid" (don't think we have a good name for this)
- 4 byte fork number
- 4 byte block number

If we had such a shared table we could put at least tablespace, fork number
into that table mapping them to an 8 byte "new relfilenode". That'd only make
the "new relfilenode" unique within a database, but that'd be sufficient for
our purposes.  It'd give use a buffertag consisting out of the following:
- 4 byte database oid
- 8 byte "relfilenode"
- 4 byte block number
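
For concreteness, the two layouts could be sketched like this (field
names are illustrative, and the 8-byte value is shown split into two
32-bit halves so the struct doesn't pick up 8-byte alignment padding, a
point discussed elsewhere in the thread):

typedef struct BufferTagToday
{
    Oid         spcOid;        /* tablespace */
    Oid         dbOid;         /* database */
    Oid         relNumber;     /* "relfilenode oid" */
    ForkNumber  forkNum;
    BlockNumber blockNum;
} BufferTagToday;              /* 5 x 4 = 20 bytes */

typedef struct BufferTagProposed
{
    Oid         dbOid;         /* database */
    uint32      relNode_lo;    /* low half of 8-byte "new relfilenode" */
    uint32      relNode_hi;    /* high half; maps to tablespace + fork + file */
    BlockNumber blockNum;
} BufferTagProposed;           /* 4 x 4 = 16 bytes */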

Of course, it'd add some complexity too, because a buffertag alone wouldn't be
sufficient to read data (as you'd need the tablespace oid from elsewhere). But
that's probably ok, I think all relevant places would have that information.


It's probably possible to remove the database oid from the tag as well, but
it'd make CREATE DATABASE tricker - we'd need to change the filenames of
tables as we copy, to adjust them to the differing oid.

Greetings,

Andres Freund




Re: Make relfile tombstone files conditional on WAL level

2021-08-02 Thread Robert Haas
On Thu, Jun 10, 2021 at 6:47 AM Heikki Linnakangas  wrote:
> It would indeed be nice to have some other mechanism to prevent the
> issue with wal_level=minimal, the tombstone files feel hacky and
> complicated. Maybe a new shared memory hash table to track the
> relfilenodes of dropped tables.

Just to summarize the issue here as I understand it, if a relfilenode
is used for two unrelated relations during the same checkpoint cycle
with wal_level=minimal, and if the WAL-skipping optimization is
applied to the second of those but not to the first, then crash
recovery will lose our only copy of the new relation's data, because
we'll replay the removal of the old relfilenode but will not have
logged the new data. Furthermore, we've wondered about writing an
end-of-recovery record in all cases rather than sometimes writing an
end-of-recovery record and sometimes a checkpoint record. That would
allow another version of this same problem, since a single checkpoint
cycle could now span multiple server lifetimes. At present, we dodge
all this by keeping the first segment of the main fork around as a
zero-length file for the rest of the checkpoint cycle, which I think
prevents the problem in both cases. Now, apparently that caused some
problem with the AIO patch set so Thomas is curious about getting rid
of it, and Heikki concurs that it's a hack.

I guess my concern about this patch is that it just seems to be
reducing the number of cases where that hack is used without actually
getting rid of it. Rarely-taken code paths are more likely to have
undiscovered bugs, and that seems particularly likely in this case,
because this is a low-probability scenario to begin with. A lot of
clusters probably never have an OID counter wraparound ever, and even
in those that do, getting an OID collision with just the right timing
followed by a crash before a checkpoint can intervene has got to be
super-unlikely. Even as things are today, if this mechanism has subtle
bugs, it seems entirely possible that they could have escaped notice
up until now.

So I spent some time thinking about the question of getting rid of
tombstone files altogether. I don't think that Heikki's idea of a
shared memory hash table to track dropped relfilenodes can work. The
hash table will have to be of some fixed size N, and whatever the
value of N, the approach will break down if N+1 relfilenodes are
dropped in the same checkpoint cycle.

The two most principled solutions to this problem that I can see are
(1) remove wal_level=minimal and (2) use 64-bit relfilenodes. I have
been reluctant to support #1 because it's hard for me to believe that
there aren't cases where being able to skip a whole lot of WAL-logging
doesn't work out to a nice performance win, but I realize opinions on
that topic vary. And I'm pretty sure that Andres, at least, will hate
#2 because he's unhappy with the width of buffer tags already. So I
don't really have a good idea. I agree this tombstone system is a bit
of a wart, but I'm not sure that this patch really makes anything any
better, and I'm not really seeing another idea that seems better
either.

Maybe I am missing something...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2021-06-10 Thread Heikki Linnakangas

On 05/03/2021 00:02, Thomas Munro wrote:

> Hi,
>
> I'm starting a new thread for this patch that originated as a
> side-discussion in [1], to give it its own CF entry in the next cycle.
> This is a WIP with an open question to research: what could actually
> break if we did this?


I don't see a problem.

It would indeed be nice to have some other mechanism to prevent the 
issue with wal_level=minimal, the tombstone files feel hacky and 
complicated. Maybe a new shared memory hash table to track the 
relfilenodes of dropped tables.


- Heikki