Re: [PATCHES] Post-special page storage TDE support
Hi David,

> I have finished reworking this particular patch series, and have tried to organize it in such a way that it will be easily reviewable. It is constructed progressively so the reader can follow what is happening. As such, each individual commit is not guaranteed to compile on its own, so the whole series needs to be applied before it works. (It does pass CI tests.)
>
> Here is a brief roadmap of the patches; some of them have additional details in the commit message describing a little more about them.
>
> These two patches do some refactoring of existing code to make a common place to modify the definitions:
>
> v3-0001-refactor-Create-PageUsableSpace-to-represent-spac.patch
> v3-0002-refactor-Make-PageGetUsablePageSize-routine.patch
>
> These two patches add the ReservedPageSize variable and teach PageInit to use it to adjust sizing accordingly:
>
> v3-0003-feature-Add-ReservedPageSize-variable.patch
> v3-0004-feature-Adjust-page-sizes-at-PageInit.patch
>
> This patch modifies the definitions of 4 symbols to be computed based on PageUsableSpace:
>
> v3-0005-feature-Create-Calc-Limit-and-Dynamic-forms-for-f.patch
>
> The following 4 patches are mechanical replacements of all existing uses of these symbols; this provides visibility into where each existing symbol is used, as well as distinguishing between the parts that care about static allocation vs dynamic usage. The only non-mechanical change is to remove the definition of the old symbol, so we can be guaranteed that all uses have been considered:
>
> v3-0006-chore-Split-MaxHeapTuplesPerPage-into-Limit-and-D.patch
> v3-0007-chore-Split-MaxIndexTuplesPerPage-into-Limit-and-.patch
> v3-0008-chore-Split-MaxHeapTupleSize-into-Limit-and-Dynam.patch
> v3-0009-chore-Split-MaxTIDsPerBTreePage-into-Limit-and-Dy.patch
>
> The following patches make the changes required to support dynamic TOAST limits:
>
> v3-0010-feature-Add-hook-for-setting-reloptions-defaults-.patch
> v3-0011-feature-Dynamically-calculate-toast_tuple_target.patch
> v3-0012-feature-Add-Calc-options-for-toast-related-pieces.patch
> v3-0013-chore-Replace-TOAST_MAX_CHUNK_SIZE-with-ClusterTo.patch
> v3-0014-chore-Translation-updates-for-TOAST_MAX_CHUNK_SIZ.patch
>
> In order to calculate some of the sizes, we need to include nbtree.h internals, but we can't use that header in front-end apps, so we separate out the pieces we care about into a separate include and use that:
>
> v3-0015-chore-Split-nbtree.h-structure-defs-into-an-inter.patch
>
> This is the meat of the patch: provide a common location for these block-size-related constants to be computed using the infrastructure that has been set up so far, and ensure that we properly initialize this in both front-end and back-end code. A tricky piece here is that we have two separate include files for blocksize.h: one which exposes the externs as consts for optimizations, and one that blocksize.c itself uses without consts in order to create/initialize the variables:
>
> v3-0016-feature-Calculate-all-blocksize-constants-in-a-co.patch
>
> Add ControlFile and GUC support for reserved_page_size:
>
> v3-0017-feature-ControlFile-GUC-support-for-reserved_page.patch
>
> Add initdb support for reserving page space:
>
> v3-0018-feature-Add-reserved_page_size-to-initdb-bootstra.patch
>
> Fixes for pg_resetwal:
>
> v3-0019-feature-Updates-for-pg_resetwal.patch
>
> The following 4 patches mechanically replace the Dynamic forms with the new Cluster variables:
>
> v3-0020-chore-Rename-MaxHeapTupleSizeDynamic-to-ClusterMa.patch
> v3-0021-chore-Rename-MaxHeapTuplesPerPageDynamic-to-Clust.patch
> v3-0022-chore-Rename-MaxIndexTuplesPerPageDynamic-to-Clus.patch
> v3-0023-chore-Rename-MaxTIDsPerBTreePageDynamic-to-Cluste.patch
>
> Two pieces of optimization required for the visibility map:
>
> v3-0024-optimization-Add-support-for-fast-non-division-ba.patch
> v3-0025-optimization-Use-fastdiv-code-in-visibility-map.patch
>
> Update bufpage.h comments:
>
> v3-0026-doc-update-bufpage-docs-w-reserved-space-data.patch
>
> Fixes for bloom to use the runtime size:
>
> v3-0027-feature-Teach-bloom-about-PageUsableSpace.patch
>
> Fixes for FSM to use the runtime size:
>
> v3-0028-feature-teach-FSM-about-reserved-page-space.patch
>
> I hope this makes sense for reviewing; I know it's a big job, so breaking things up a little more and organizing them will hopefully help.

Just wanted to let you know that the patchset seems to need a rebase, according to cfbot.

Best regards,
Aleksander Alekseev
(wearing a co-CFM hat)
Re: [PATCHES] Post-special page storage TDE support
Hi again!

Per some offline discussion with Stephen, I've continued to work on some modifications here; this particular patchset is intended to facilitate review by highlighting the mechanical nature of many of these changes. As such, I have taken the following approach to this rework:

0001 - Create PageUsableSpace to represent space post-smgr
0002 - Add support for fast, non-division-based div/mod algorithms
0003 - Use fastdiv code in visibility map
0004 - Make PageUsableSpace incorporate variable-sized limit
0005 - Add Calc, Limit and Dynamic forms of all variable constants
0006 - Split MaxHeapTuplesPerPage into Limit and Dynamic variants
0007 - Split MaxIndexTuplesPerPage into Limit and Dynamic variants
0008 - Split MaxHeapTupleSize into Limit and Dynamic variants
0009 - Split MaxTIDsPerBTreePage into Limit and Dynamic variants

0001-0003 have appeared in this thread or in other forms on the list already, though 0001 refactors things slightly more aggressively; it adds StaticAssert()s to ensure that this change is still sane.

0004 adds the ReservedPageSpace variable, and also redefines the previous BLCKSZ - SizeOfPageHeaderData as PageUsableSpaceMax; there are a few related fixups.

0005 adds the macros to compute the former constants while leaving their original definitions evaluating to the same values (the infamous Calc* and *Limit, plus we invite *Dynamic to the party as well; the names are terrible and there must be something better).

0006-0009 all take the same approach: we undefine the old constant name and modify each existing use of the symbol to be either the *Limit or the *Dynamic form, depending on whether the changed available space would impact the calculations. Since we are touching every use of these symbols, this facilitates review of the impact, though I would contend that almost every piece I've spot-checked really does need to know about the runtime limit. Perhaps there is more we could do here.

I could also see a variable per constant rather than recalculating this every time, in which case the *Dynamic form would just be the variable, and we'd need a hook to initialize it, or otherwise set it on first use.

There are a number of additional things remaining to be done to get this to fully work, but I did want to get some of this out there for review. Still to do (almost all of it exists in some form in the original patch, so I just need to extract the relevant pieces):

- set reserved-page-size via initdb
- load reserved-page-size from pg_control
- apply to the running cluster
- some form of compatibility for these constants in common/ and ensuring bin/ works
- some toast-related changes (this requires a patch to support dynamic relopts, which I can extract, as the existing code is using a constant lookup table)
- probably some more pieces I'm forgetting

Thanks,

David

Attachments:
v2-0001-Create-PageUsableSpace-to-represent-space-post-sm.patch
v2-0002-Add-support-for-fast-non-division-based-div-mod-a.patch
v2-0003-Use-fastdiv-code-in-visibility-map.patch
v2-0004-Make-PageUsableSpace-incorporate-variable-sized-l.patch
v2-0005-Add-Calc-Limit-and-Dynamic-forms-of-all-variable-.patch
v2-0006-Split-MaxHeapTuplesPerPage-into-Limit-and-Dynamic.patch
v2-0007-Split-MaxIndexTuplesPerPage-into-Limit-and-Dynami.patch
v2-0008-Split-MaxHeapTupleSize-into-Limit-and-Dynamic-var.patch
v2-0009-Split-MaxTIDsPerBTreePage-into-Limit-and-Dynamic-.patch
Re: [PATCHES] Post-special page storage TDE support
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund wrote:
> Hi,
>
> - IMO the patch touches many places it shouldn't need to touch, because of essentially renaming a lot of existing macro names to *Limit, necessitating modifying a lot of users. I think instead the few places that care about the runtime limit should be modified.
>
>   As-is the patch would cause a lot of fallout in extensions that just do things like defining an on-stack array of Datums or such - even though all they'd need is to change the define to the *Limit one.
>
>   Even leaving extensions aside, it makes reviewing (and I'm sure maintaining) the patch very tedious.

Hi Andres et al,

I've been looking at alternate approaches to this issue and considering how to reduce churn, and I think we still need the *Limit variants.

Let's take a simple example, MaxHeapTuplesPerPage. Breaking down the instances in the code, loosely partitioned into whether the symbol is used as an array size or in some other way (this doesn't discriminate between code and comments, unfortunately), we get the following:

$ git grep -hoE '[[]?MaxHeapTuplesPerPage' | sort | uniq -c
     18 [MaxHeapTuplesPerPage
     51 MaxHeapTuplesPerPage

That is 18 places where we would need to mechanically substitute MaxHeapTuplesPerPageLimit for MaxHeapTuplesPerPage, vs some significant fraction of the 51 other uses; even if you assumed half of those were in comments, there would presumably need to be some adjustments in verbiage, since we are changing the interpretation of the symbol.

I am working on a patch to clean up some of the assumptions that smgr currently makes about its space usage and how the individual access methods consider it; they should only be calculating things based on how much space is available after smgr is done with it. That has traditionally been BLCKSZ - SizeOfPageHeaderData, but this patch (included) factors that out into a single expression that we can now use in access methods, so we can then reserve additional page space and not need to adjust the access methods further.

Building on top of this patch, we'd define something like the following to handle the #defines that need to be dynamic:

extern Size reserved_page_space;

#define PageUsableSpace (BLCKSZ - SizeOfPageHeaderData - reserved_page_space)
#define MaxHeapTuplesPerPage CalcMaxHeapTuplesPerPage(PageUsableSpace)
#define MaxHeapTuplesPerPageLimit \
    CalcMaxHeapTuplesPerPage(BLCKSZ - SizeOfPageHeaderData)
#define CalcMaxHeapTuplesPerPage(freesize) \
    ((int) ((freesize) / \
            (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

In my view, extensions that are expecting to need no changes when the interpretation of these symbols changes are better off only needing to change their static allocations in a mechanical sense than having to revisit every other use of the symbol; this seems more likely to guarantee a correct result than if you exceed the page space and start overwriting things you shouldn't because you weren't aware that you needed to check for dynamic limits on your own.

Take another symbol which would need adjusting for reserved page space, MaxHeapTupleSize:

$ git grep -ohE '[[]?MaxHeapTupleSize' | sort | uniq -c
      3 [MaxHeapTupleSize
     16 MaxHeapTupleSize

Here there are 3 static arrays which would need to be adjusted vs 16 other instances. If we kept the interpretation of MaxHeapTupleSize the same and didn't adjust an extension, it would compile just fine, but with too large a length compared to the smaller PageUsableSpace, so you could conceivably write into the reserved space depending on what you were doing. (By definition reserved_page_space >= 0, so PageUsableSpace will always be <= BLCKSZ - SizeOfPageHeaderData, and any expression that uses it as a basis will be smaller.)

In short, I think the approach I took originally will actually reduce errors out of core, and while churn is still necessary, it is necessary churn. I can produce a second patch which implements this Calc/Limit approach atop this first one as well.

Thanks,

David

Attachment:
0001-Create-PageUsableSpace-to-represent-space-post-smgr.patch
Re: [PATCHES] Post-special page storage TDE support
Greetings,

On Mon, Nov 13, 2023 at 16:53 David Christensen <david.christen...@crunchydata.com> wrote:
> On Mon, Nov 13, 2023 at 2:52 PM Andres Freund wrote:
> > > This scheme would open up space per page that would now be available for plenty of other things; the encoding in the header and the corresponding available space in the footer would seem to open up quite a few options now, no?
> >
> > Sure, if you're willing to rewrite the whole cluster to upgrade and willing to permanently sacrifice some data density. If the stored data is actually specific to the page - that is the place to put the data. If not, then the tradeoff is much more complicated IMO.
> >
> > Of course this isn't a new problem - storing the page size on each page was just silly; it's never going to change across the cluster and even more definitely not going to change within a single relation.
>
> Crazy idea: since the stored page size is already a fixed cost that likely isn't going away, what if, instead of using the pd_checksum field, we reinterpret pd_pagesize_version? 4 would mean "no page features", but anything 5 or higher could be looked up as an external page feature set, with storage semantics outside the realm of the page itself (other than what the page features code itself needs to know); i.e., move away from the on-page bitmap into a more abstract representation of features, which could be something along the lines of what you were suggesting, including extension support.
>
> It seems like this could also support adding/removing features on page read/write, as long as there was sufficient space in the reserved_page space: read the old feature set on page read, convert to the new feature set, and write out the page with the additional/changed format. Obviously there would be bookkeeping to be done in terms of making sure all pages had been converted from one format to another, but at the page level this would be straightforward.
>
> Just thinking aloud here...

In other crazy-idea space: if the page didn't have enough space to allow for the desired features, then make any insert/update actions forcibly choose a different page for the new tuple, while allowing deletes to do their usual thing; then, when vacuum comes along and is able to clean up the page and remove all the dead tuples, it could enable the desired features on the page...

Thanks,

Stephen
Re: [PATCHES] Post-special page storage TDE support
On Mon, Nov 13, 2023 at 2:52 PM Andres Freund wrote:
> > This scheme would open up space per page that would now be available for plenty of other things; the encoding in the header and the corresponding available space in the footer would seem to open up quite a few options now, no?
>
> Sure, if you're willing to rewrite the whole cluster to upgrade and willing to permanently sacrifice some data density. If the stored data is actually specific to the page - that is the place to put the data. If not, then the tradeoff is much more complicated IMO.
>
> Of course this isn't a new problem - storing the page size on each page was just silly; it's never going to change across the cluster and even more definitely not going to change within a single relation.

Crazy idea: since the stored page size is already a fixed cost that likely isn't going away, what if, instead of using the pd_checksum field, we reinterpret pd_pagesize_version? 4 would mean "no page features", but anything 5 or higher could be looked up as an external page feature set, with storage semantics outside the realm of the page itself (other than what the page features code itself needs to know); i.e., move away from the on-page bitmap into a more abstract representation of features, which could be something along the lines of what you were suggesting, including extension support.

It seems like this could also support adding/removing features on page read/write, as long as there was sufficient space in the reserved_page space: read the old feature set on page read, convert to the new feature set, and write out the page with the additional/changed format. Obviously there would be bookkeeping to be done in terms of making sure all pages had been converted from one format to another, but at the page level this would be straightforward.

Just thinking aloud here...

David
Re: [PATCHES] Post-special page storage TDE support
Hi,

On 2023-11-13 14:37:47 -0600, David Christensen wrote:
> On Mon, Nov 13, 2023 at 2:27 PM Andres Freund wrote:
> > On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote:
> > > On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost wrote:
> > > > In conversations with folks (my memory specifically is a discussion with Peter G, added to CC, and my apologies to Peter if I'm misremembering) there was a pretty strong push that a page should be able to 'stand alone' and not depend on something else (e.g. pg_control, or whatever) to provide the info needed to be able to interpret the page. For my part, I don't have a particularly strong feeling on that, but that's what led to this design.
> > >
> > > The term that I have used in the past is "self-contained". Meaning capable of being decoded more or less as-is, without any metadata, by tools like pg_filedump.
> >
> > I'm not finding that very convincing - without cluster-wide data, like keys, a tool like pg_filedump isn't going to be able to do much with encrypted pages. Given the need to look at some global data, figuring out the offset at which data starts based on a value in pg_control isn't meaningfully worse than having the data on each page.
> >
> > Storing redundant data in each page header, when we've wanted space in the page header for plenty of other things, just doesn't seem a good use of said space.
>
> This scheme would open up space per page that would now be available for plenty of other things; the encoding in the header and the corresponding available space in the footer would seem to open up quite a few options now, no?

Sure, if you're willing to rewrite the whole cluster to upgrade and willing to permanently sacrifice some data density. If the stored data is actually specific to the page - that is the place to put the data. If not, then the tradeoff is much more complicated IMO.

Of course this isn't a new problem - storing the page size on each page was just silly; it's never going to change across the cluster and even more definitely not going to change within a single relation.

Greetings,

Andres Freund
Re: [PATCHES] Post-special page storage TDE support
On Mon, Nov 13, 2023 at 2:27 PM Andres Freund wrote:
> Hi,
>
> On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote:
> > On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost wrote:
> > > In conversations with folks (my memory specifically is a discussion with Peter G, added to CC, and my apologies to Peter if I'm misremembering) there was a pretty strong push that a page should be able to 'stand alone' and not depend on something else (e.g. pg_control, or whatever) to provide the info needed to be able to interpret the page. For my part, I don't have a particularly strong feeling on that, but that's what led to this design.
> >
> > The term that I have used in the past is "self-contained". Meaning capable of being decoded more or less as-is, without any metadata, by tools like pg_filedump.
>
> I'm not finding that very convincing - without cluster-wide data, like keys, a tool like pg_filedump isn't going to be able to do much with encrypted pages. Given the need to look at some global data, figuring out the offset at which data starts based on a value in pg_control isn't meaningfully worse than having the data on each page.
>
> Storing redundant data in each page header, when we've wanted space in the page header for plenty of other things, just doesn't seem a good use of said space.

This scheme would open up space per page that would now be available for plenty of other things; the encoding in the header and the corresponding available space in the footer would seem to open up quite a few options now, no?
Re: [PATCHES] Post-special page storage TDE support
Hi,

On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote:
> On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost wrote:
> > In conversations with folks (my memory specifically is a discussion with Peter G, added to CC, and my apologies to Peter if I'm misremembering) there was a pretty strong push that a page should be able to 'stand alone' and not depend on something else (e.g. pg_control, or whatever) to provide the info needed to be able to interpret the page. For my part, I don't have a particularly strong feeling on that, but that's what led to this design.
>
> The term that I have used in the past is "self-contained". Meaning capable of being decoded more or less as-is, without any metadata, by tools like pg_filedump.

I'm not finding that very convincing - without cluster-wide data, like keys, a tool like pg_filedump isn't going to be able to do much with encrypted pages. Given the need to look at some global data, figuring out the offset at which data starts based on a value in pg_control isn't meaningfully worse than having the data on each page.

Storing redundant data in each page header, when we've wanted space in the page header for plenty of other things, just doesn't seem a good use of said space.

Greetings,

Andres Freund
Re: [PATCHES] Post-special page storage TDE support
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund wrote:
> Hi,
>
> On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > From: David Christensen
> > Date: Tue, 9 May 2023 16:56:15 -0500
> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will be ultimately used for encrypted data, extended checksums, and potentially other things. This data appears at the end of the Page, after any `pd_special` area, and will be calculated at runtime based on specific ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as we will require logical replication to move data into a cluster with different settings here.
>
> The first part of the last paragraph makes it sound like pg_upgrade won't be supported across this commit, rather than just between different settings...

Thanks for the review.

> I think as a whole this is not an insane idea. A few comments:
>
> - IMO the patch touches many places it shouldn't need to touch, because of essentially renaming a lot of existing macro names to *Limit, necessitating modifying a lot of users. I think instead the few places that care about the runtime limit should be modified.
>
>   As-is the patch would cause a lot of fallout in extensions that just do things like defining an on-stack array of Datums or such - even though all they'd need is to change the define to the *Limit one.
>
>   Even leaving extensions aside, it makes reviewing (and I'm sure maintaining) the patch very tedious.

You make a good point, and I think you're right that we could teach the places that care about runtime vs compile-time differences about the changes while leaving other callers alone. The *Limit forms were introduced because we need constant values from the Calc...() macros, but we could try keeping the existing semantics under the old name and switching things around. I suspect there would be the same amount of code churn, just less mechanical.

> - I'm a bit worried about how the extra special page will be managed - if there are multiple features that want to use it, who gets to put their data at what offset?
>
>   After writing this I saw that 0002 tries to address this - but I don't like the design. It introduces runtime overhead that seems likely to be visible.

Agreed, this could be optimized.

> - Checking for features using PageGetFeatureOffset() seems the wrong design to me - instead of a branch for some feature being disabled, perfectly predictable for the CPU, we need to do an external function call every time to figure out that, yes, checksums are *still* disabled.

This is probably not a supported approach (it felt a little icky), but I'd played around with const pointers to structs of const elements, where the initial values of a global variable are populated early on (set once and never changed post-init); the compiler didn't complain and things seemed to work OK. Not sure if this approach might help balance the early mutability and constant-lookup needs:

typedef struct PageFeatureOffsets
{
    const Size feature0offset;
    const Size feature1offset;
    ...
} PageFeatureOffsets;

PageFeatureOffsets offsets = {0};
const PageFeatureOffsets *exposedOffsets = &offsets;

void
InitOffsets()
{
    *((Size *) &offsets.feature0offset) = ...;
    *((Size *) &offsets.feature1offset) = ...;
    ...
}

> - Recomputing offsets every time in PageGetFeatureOffset() seems too expensive. The offsets can't change while running, and PageGetFeatureOffset() has enough information to distinguish between different kinds of relations

Yes, this was a simple approach for ease of implementation; there is certainly a way to precompute a lookup table from the page feature bitmask into the offsets themselves, or to otherwise precompute, turn the function call into an inline/macro, etc.

> - so why do we need to recompute offsets on every single page? I'd instead add a distinct offset variable for each feature.

This would work iff there is a single page feature set across all pages in the cluster; I'm not sure we don't want more flexibility here.

> - Modifying every single PageInit() call doesn't make sense to me. That'll just create a lot of breakage for - as far as I can tell - no win.

This was a placeholder to allow different features depending on page type; to keep things simple for now I just used the same values here, but we could move this inside PageInit() instead (again, assuming a single feature set per cluster).

> - Why is it worth sacrificing space on every page to indicate which features were enabled? I think there'd need to be some convincing reasons for introducing such overhead.

The point here is if we can use either GCM a
Re: [PATCHES] Post-special page storage TDE support
On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost wrote:
> In conversations with folks (my memory specifically is a discussion with Peter G, added to CC, and my apologies to Peter if I'm misremembering) there was a pretty strong push that a page should be able to 'stand alone' and not depend on something else (e.g. pg_control, or whatever) to provide the info needed to be able to interpret the page. For my part, I don't have a particularly strong feeling on that, but that's what led to this design.

The term that I have used in the past is "self-contained". Meaning capable of being decoded more or less as-is, without any metadata, by tools like pg_filedump.

Any design in this area should try to make things as easy to debug as possible, for the obvious reason: encrypted data that somehow becomes corrupt is bound to be a nightmare to debug. (Besides, we already support tools like pg_filedump, so this isn't a new principle.)

--
Peter Geoghegan
Re: [PATCHES] Post-special page storage TDE support
Greetings,

On Wed, Nov 8, 2023 at 20:55 David Christensen <david.christen...@crunchydata.com> wrote:
> On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost wrote:
> > * Andres Freund (and...@anarazel.de) wrote:
> > > On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > > > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > > > From: David Christensen
> > > > Date: Tue, 9 May 2023 16:56:15 -0500
> > > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> > > >
> > > > This space is reserved for extended data on the Page structure which will be ultimately used for encrypted data, extended checksums, and potentially other things. This data appears at the end of the Page, after any `pd_special` area, and will be calculated at runtime based on specific ControlFile features.
> > > >
> > > > No effort is made to ensure this is backwards-compatible with existing clusters for `pg_upgrade`, as we will require logical replication to move data into a cluster with different settings here.
> > >
> > > The first part of the last paragraph makes it sound like pg_upgrade won't be supported across this commit, rather than just between different settings...
>
> Yeah, that's vague, but you picked up on what I meant.
>
> > > I think as a whole this is not an insane idea. A few comments:
> >
> > Thanks for all the feedback!
> >
> > > - Why is it worth sacrificing space on every page to indicate which features were enabled? I think there'd need to be some convincing reasons for introducing such overhead.
> >
> > In conversations with folks (my memory specifically is a discussion with Peter G, added to CC, and my apologies to Peter if I'm misremembering) there was a pretty strong push that a page should be able to 'stand alone' and not depend on something else (e.g. pg_control, or whatever) to provide the info needed to be able to interpret the page. For my part, I don't have a particularly strong feeling on that, but that's what led to this design.
>
> Unsurprisingly, I agree that it's useful to keep these features on the page itself; from a forensic standpoint this makes it much easier to interpret what is happening, and it would also allow you to have different features on a given page or type of page depending on need. The initial patch uses pg_control to store the cluster page features, but there's no reason it couldn't depend on fork/page type, or be stored in pg_tablespace to utilize different features.

When it comes to authenticated encryption, it's also the case that it's unclear what value the checksum field has, if any... it's certainly not directly needed as a checksum, as the auth tag is much better for the purpose of seeing whether the page has been changed in some way. It's also not big enough to serve as an auth tag per NIST guidelines regarding the size of the authenticated data vs. the size of the tag. Using it to indicate which features are enabled on the page seems pretty useful, as David notes.

Thanks,

Stephen
Re: [PATCHES] Post-special page storage TDE support
On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost wrote: > Greetings, > > * Andres Freund (and...@anarazel.de) wrote: > > On 2023-05-09 17:08:26 -0500, David Christensen wrote: > > > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001 > > > From: David Christensen > > > Date: Tue, 9 May 2023 16:56:15 -0500 > > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure > > > > > > This space is reserved for extended data on the Page structure which > will be ultimately used for > > > encrypted data, extended checksums, and potentially other things. > This data appears at the end of > > > the Page, after any `pd_special` area, and will be calculated at > runtime based on specific > > > ControlFile features. > > > > > > No effort is made to ensure this is backwards-compatible with existing > clusters for `pg_upgrade`, as > > > we will require logical replication to move data into a cluster with > > > different settings here. > > > > The first part of the last paragraph makes it sound like pg_upgrade > won't be > > supported across this commit, rather than just between different > settings... > Yeah, that's vague, but you picked up on what I meant. > > I think as a whole this is not an insane idea. A few comments: > > Thanks for all the feedback! > > > - Why is it worth sacrificing space on every page to indicate which > features > > were enabled? I think there'd need to be some convincing reasons for > > introducing such overhead. > > In conversations with folks (my memory specifically is a discussion with > Peter G, added to CC, and my apologies to Peter if I'm misremembering) > there was a pretty strong push that a page should be able to 'stand > alone' and not depend on something else (eg: pg_control, or whatever) to > provide info needed be able to interpret the page. For my part, I don't > have a particularly strong feeling on that, but that's what lead to this > design. 
Unsurprisingly, I agree that it's useful to keep these features on the page itself; from a forensic standpoint it makes it much easier to interpret what is happening on a page, and it would also allow you to have different features on a given page or type of page depending on need. The initial patch utilizes pg_control to store the cluster page features, but there's no reason it couldn't be dependent on fork/page type or stored in pg_tablespace to utilize different features. Thanks, David
Re: [PATCHES] Post-special page storage TDE support
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2023-05-09 17:08:26 -0500, David Christensen wrote: > > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001 > > From: David Christensen > > Date: Tue, 9 May 2023 16:56:15 -0500 > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure > > > > This space is reserved for extended data on the Page structure which will > > be ultimately used for > > encrypted data, extended checksums, and potentially other things. This > > data appears at the end of > > the Page, after any `pd_special` area, and will be calculated at runtime > > based on specific > > ControlFile features. > > > > No effort is made to ensure this is backwards-compatible with existing > > clusters for `pg_upgrade`, as > > we will require logical replication to move data into a cluster with > > different settings here. > > The first part of the last paragraph makes it sound like pg_upgrade won't be > supported across this commit, rather than just between different settings... > > I think as a whole this is not an insane idea. A few comments: Thanks for all the feedback! > - Why is it worth sacrificing space on every page to indicate which features > were enabled? I think there'd need to be some convincing reasons for > introducing such overhead. In conversations with folks (my memory specifically is a discussion with Peter G, added to CC, and my apologies to Peter if I'm misremembering) there was a pretty strong push that a page should be able to 'stand alone' and not depend on something else (eg: pg_control, or whatever) to provide the info needed to be able to interpret the page. For my part, I don't have a particularly strong feeling on that, but that's what led to this design. Getting a consensus on whether that's a requirement or not would definitely be really helpful. Thanks, Stephen
Re: [PATCHES] Post-special page storage TDE support
Hi, On 2023-05-09 17:08:26 -0500, David Christensen wrote: > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001 > From: David Christensen > Date: Tue, 9 May 2023 16:56:15 -0500 > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure > > This space is reserved for extended data on the Page structure which will be > ultimately used for > encrypted data, extended checksums, and potentially other things. This data > appears at the end of > the Page, after any `pd_special` area, and will be calculated at runtime > based on specific > ControlFile features. > > No effort is made to ensure this is backwards-compatible with existing > clusters for `pg_upgrade`, as > we will require logical replication to move data into a cluster with > different settings here. The first part of the last paragraph makes it sound like pg_upgrade won't be supported across this commit, rather than just between different settings... I think as a whole this is not an insane idea. A few comments: - IMO the patch touches many places it shouldn't need to touch, because of essentially renaming a lot of existing macro names to *Limit, necessitating modifying a lot of users. I think instead the few places that care about the runtime limit should be modified. As-is the patch would cause a lot of fallout in extensions that just do things like defining an on-stack array of Datums or such - even though all they'd need is to change the define to the *Limit one. Even leaving extensions aside, it must make reviewing (and I'm sure maintaining) the patch very tedious. - I'm a bit worried about how the extra special page will be managed - if there are multiple features that want to use it, who gets to put their data at what offset? After writing this I saw that 0002 tries to address this - but I don't like the design. It introduces runtime overhead that seems likely to be visible.
- Checking for features using PageGetFeatureOffset() seems the wrong design to me - instead of a branch for some feature being disabled, perfectly predictable for the CPU, we need to do an external function call every time to figure out that yes, checksums are *still* disabled. - Recomputing offsets every time in PageGetFeatureOffset() seems too expensive. The offsets can't change while running - nor does PageGetFeatureOffset() have enough information to distinguish between different kinds of relations - so why do we need to recompute offsets on every single page? I'd instead add a distinct offset variable for each feature. - Modifying every single PageInit() call doesn't make sense to me. That'll just create a lot of breakage for - as far as I can tell - no win. - Why is it worth sacrificing space on every page to indicate which features were enabled? I think there'd need to be some convincing reasons for introducing such overhead. - Is it really useful to encode the set of features enabled in a cluster with a bitmask? That pretty much precludes utilizing extra page space in extensions. We could instead just have an extra cluster-wide file that defines a mapping of offset to feature. Greetings, Andres Freund
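The alternative Andres suggests - a distinct offset per feature, computed once at startup rather than recomputed inside PageGetFeatureOffset() on every call - could be sketched roughly as below. The feature names, sizes, and bitmask layout here are hypothetical stand-ins, not the patch's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192

/* Hypothetical feature list; the real patch series defines its own. */
enum PageFeature
{
    PF_EXT_CHECKSUM,            /* e.g. an 8-byte extended checksum */
    PF_ENCRYPTION_TAG,          /* e.g. a 16-byte authentication tag */
    PF_NUM_FEATURES
};

static const size_t feature_size[PF_NUM_FEATURES] = {8, 16};

/* Computed once (e.g. after reading pg_control); -1 means "disabled",
 * so the per-access check is a predictable branch, not a function call. */
static ptrdiff_t feature_offset[PF_NUM_FEATURES];
static size_t reserved_page_size;

static void
compute_feature_offsets(unsigned enabled_mask)
{
    size_t off = BLCKSZ;

    reserved_page_size = 0;
    for (int f = 0; f < PF_NUM_FEATURES; f++)
    {
        if (enabled_mask & (1u << f))
        {
            /* each enabled feature claims a slice of the page's tail */
            off -= feature_size[f];
            feature_offset[f] = (ptrdiff_t) off;
            reserved_page_size += feature_size[f];
        }
        else
            feature_offset[f] = -1;
    }
}
```

With this shape, a hot-path caller tests `feature_offset[PF_EXT_CHECKSUM] >= 0` directly, and the offsets are stable for the life of the cluster.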
Re: [PATCHES] Post-special page storage TDE support
On Fri, May 12, 2023 at 7:48 PM Stephen Frost wrote: > > Greetings, > > * David Christensen (david.christen...@crunchydata.com) wrote: > > Refreshing this with HEAD as of today, v4. > > Thanks for updating this! Thanks for the patience in my response here. > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure > > > > This space is reserved for extended data on the Page structure which will > > be ultimately used for > > encrypted data, extended checksums, and potentially other things. This > > data appears at the end of > > the Page, after any `pd_special` area, and will be calculated at runtime > > based on specific > > ControlFile features. > > > > No effort is made to ensure this is backwards-compatible with existing > > clusters for `pg_upgrade`, as > > we will require logical replication to move data into a cluster with > > different settings here. > > This initial patch, at least, does maintain pg_upgrade as the > reserved_page_size (maybe not a great name?) is set to 0, right? > Basically this is just introducing the concept of a reserved_page_size > and adjusting all of the code that currently uses BLCKSZ or > PageGetPageSize() to account for this extra space. Correct; a reserved_page_size of 0 would be the same page format as currently exists, so you could use pg_upgrade with no page features and be binary compatible with existing clusters. > Looking at the changes to bufpage.h, in particular ... > > > diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h > > > @@ -19,6 +19,14 @@ > > #include "storage/item.h" > > #include "storage/off.h" > > > > +extern PGDLLIMPORT int reserved_page_size; > > + > > +#define SizeOfPageReservedSpace() reserved_page_size > > +#define MaxSizeOfPageReservedSpace 0 > > + > > +/* strict upper bound on the amount of space occupied we have reserved on > > + * pages in this cluster */ > > This will eventually be calculated based on what features are supported > concurrently? 
Correct; these are fleshed out in later patches. > > @@ -36,10 +44,10 @@ > > * | v pd_upper | > > * +------------------------------------------------+ > > * | | tupleN ... | > > - * +-----------------------------+-----------------+ > > - * | ... tuple3 tuple2 tuple1 | "special space" | > > - * +-----------------------------+-----------------+ > > - * ^ pd_special > > + * +----------------------+-----------------+------------+ > > + * | ... tuple2 tuple1 | "special space" | "reserved" | > > + * +----------------------+-----------------+------------+ > > + * ^ pd_special ^ reserved_page_space > Right, adds a dynamic amount of space 'post-special area'. Dynamic as in "fixed at initdb time" instead of compile time. However, things are coded in such a way that the page feature bitmap is stored on a given page, so different pages could have different reserved_page_size depending on use case/code path. (Basically preserving future flexibility while minimizing code changes here.) We could utilize different features depending on what type of page it is, say, or have different relations or tablespaces with different page feature defaults. > > @@ -73,6 +81,8 @@ > > * stored as the page trailer. an access method should always > > * initialize its pages with PageInit and then set its own opaque > > * fields. > > + * > > + * XXX - update more comments here about reserved_page_space > > */ > Would be good to do. ;) Next revision... :D > > @@ -325,7 +335,7 @@ static inline void > > PageValidateSpecialPointer(Page page) > > { > > Assert(page); > > - Assert(((PageHeader) page)->pd_special <= BLCKSZ); > > + Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ); > > Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData); > > } > This is just one usage ... but seems like maybe we should be using PageGetPageSize() here instead of BLCKSZ, and that more-or-less throughout? Nearly everywhere we're using BLCKSZ today to give us that compile-time advantage of a fixed block size is going to lose that advantage anyway thanks to reserved_page_size being run-time.
> Now, one upside to this is that it'd also get us closer to being able to support dynamic block sizes concurrently, which would be quite interesting. That is, a special tablespace with a 32KB block size while the rest are the traditional 8KB. This would likely require multiple shared buffer pools, of course... I think multiple shared-buffer pools are a ways off; but sure, this would support this sort of use case as well. I am working on a new patch for this series (probably the first one in the series) which will actually just abstract away all existing compile-time usages of BLCKSZ.
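The BLCKSZ abstraction David describes is what the v3-0001/0002 refactoring patches in this thread grew into (PageUsableSpace / PageGetUsablePageSize). A standalone sketch of the idea follows; the helper body and the header-size constant are illustrative stand-ins, not the actual patch code.

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192
#define SizeOfPageHeaderData 24        /* stand-in for the real value */

static size_t reserved_page_size = 0;  /* fixed at initdb time */

/*
 * Sketch only: space available to the AM once the page header and the
 * reserved tail are excluded.  Callers that used to reason from the
 * compile-time BLCKSZ would go through a helper like this instead.
 */
static size_t
page_usable_space(void)
{
    return BLCKSZ - SizeOfPageHeaderData - reserved_page_size;
}
```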
Re: [PATCHES] Post-special page storage TDE support
Greetings, * David Christensen (david.christen...@crunchydata.com) wrote: > Refreshing this with HEAD as of today, v4. Thanks for updating this! > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure > > This space is reserved for extended data on the Page structure which will be > ultimately used for > encrypted data, extended checksums, and potentially other things. This data > appears at the end of > the Page, after any `pd_special` area, and will be calculated at runtime > based on specific > ControlFile features. > > No effort is made to ensure this is backwards-compatible with existing > clusters for `pg_upgrade`, as > we will require logical replication to move data into a cluster with > different settings here. This initial patch, at least, does maintain pg_upgrade as the reserved_page_size (maybe not a great name?) is set to 0, right? Basically this is just introducing the concept of a reserved_page_size and adjusting all of the code that currently uses BLCKSZ or PageGetPageSize() to account for this extra space. Looking at the changes to bufpage.h, in particular ... > diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h > @@ -19,6 +19,14 @@ > #include "storage/item.h" > #include "storage/off.h" > > +extern PGDLLIMPORT int reserved_page_size; > + > +#define SizeOfPageReservedSpace() reserved_page_size > +#define MaxSizeOfPageReservedSpace 0 > + > +/* strict upper bound on the amount of space occupied we have reserved on > + * pages in this cluster */ This will eventually be calculated based on what features are supported concurrently? > @@ -36,10 +44,10 @@ > * | v pd_upper | > * +------------------------------------------------+ > * | | tupleN ... | > - * +-----------------------------+-----------------+ > - * | ... tuple3 tuple2 tuple1 | "special space" | > - * +-----------------------------+-----------------+ > - * ^ pd_special > + * +----------------------+-----------------+------------+ > + * | ... tuple2 tuple1 | "special space" | "reserved" | > + * +----------------------+-----------------+------------+ > + * ^ pd_special ^ reserved_page_space Right, adds a dynamic amount of space 'post-special area'.
> @@ -73,6 +81,8 @@ > * stored as the page trailer. an access method should always > * initialize its pages with PageInit and then set its own opaque > * fields. > + * > + * XXX - update more comments here about reserved_page_space > */ Would be good to do. ;) > @@ -325,7 +335,7 @@ static inline void > PageValidateSpecialPointer(Page page) > { > Assert(page); > - Assert(((PageHeader) page)->pd_special <= BLCKSZ); > + Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ); > Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData); > } This is just one usage ... but seems like maybe we should be using PageGetPageSize() here instead of BLCKSZ, and that more-or-less throughout? Nearly everywhere we're using BLCKSZ today to give us that compile-time advantage of a fixed block size is going to lose that advantage anyway thanks to reserved_page_size being run-time. Now, one upside to this is that it'd also get us closer to being able to support dynamic block sizes concurrently, which would be quite interesting. That is, a special tablespace with a 32KB block size while the rest are the traditional 8KB. This would likely require multiple shared buffer pools, of course... > diff --git a/src/backend/storage/page/bufpage.c > b/src/backend/storage/page/bufpage.c > index 9a302ddc30..a93cd9df9f 100644 > --- a/src/backend/storage/page/bufpage.c > +++ b/src/backend/storage/page/bufpage.c > @@ -26,6 +26,8 @@ > /* GUC variable */ > bool ignore_checksum_failure = false; > > +int reserved_page_size = 0; /* how much page space to > reserve for extended unencrypted metadata */ > + > > /* > * Page support functions > @@ -43,7 +45,7 @@ PageInit(Page page, Size pageSize, Size specialSize) > { > PageHeader p = (PageHeader) page; > > - specialSize = MAXALIGN(specialSize); > + specialSize = MAXALIGN(specialSize) + reserved_page_size; Rather than make it part of specialSize, I would think we'd be better off just treating them independently.
Eg, the later pd_upper setting would be done by: p->pd_upper = pageSize - specialSize - reserved_page_size; etc. > @@ -186,7 +188,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int > flags) > * one that is both unused and deallocated. > * > * If flag PAI_IS_HEAP is set, we enforce that there can't be more than > - * MaxHeapTuplesPerPage
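Stephen's suggestion above - keep reserved_page_size out of specialSize and subtract both when laying out the page - can be sketched in a standalone way like this. The header struct is trimmed to three fields for illustration; it is not the real PageHeaderData, and page_init is a stand-in for PageInit, not the patch's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(LEN) \
    (((uintptr_t) (LEN) + (MAXIMUM_ALIGNOF - 1)) & ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

typedef struct PageHeaderData
{
    uint16_t pd_lower;          /* offset to start of free space */
    uint16_t pd_upper;          /* offset to end of free space */
    uint16_t pd_special;        /* offset to start of special space */
} PageHeaderData;               /* the real struct has more fields */

static size_t reserved_page_size = 0;   /* set once at startup */

static void
page_init(char *page, size_t pageSize, size_t specialSize)
{
    PageHeaderData *p = (PageHeaderData *) page;

    specialSize = MAXALIGN(specialSize);        /* reserved size kept separate */
    memset(page, 0, pageSize);

    p->pd_lower = sizeof(PageHeaderData);
    /* reserved area sits after the special area, at the very end of the page */
    p->pd_special = (uint16_t) (pageSize - reserved_page_size - specialSize);
    p->pd_upper = p->pd_special;        /* free space ends where special begins */
}
```

The point of the separation is exactly the `pageSize - specialSize - reserved_page_size` arithmetic quoted above: neither quantity silently inflates the other.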
Re: [PATCHES] Post-special page storage TDE support
Looking into some CF bot failures which didn't show up locally. Will send a v3 when resolved.
Re: [PATCHES] Post-special page storage TDE support
On Sat, 29 Oct 2022 at 00:25, David Christensen wrote: > > Hi Matthias, > > > Did you read the related thread with related discussion from last June, > > "Re: better page-level checksums" [0]? In that I argued that space at the > > end of a page is already allocated for the AM, and that reserving variable > > space at the end of the page for non-AM usage is wasting the AM's > > performance potential. > > Yes, I had read parts of that thread among others, but have given it a > re-read. I can see the point you're making here, and agree that if we > can allocate between pd_special and pd_upper that could make sense. I > am a little unclear as to what performance impacts for the AM there > would be if this additional space were ahead or behind the page > special area; it seems like if this is something that needs to live on > the page *somewhere* just being aligned correctly would be sufficient > from the AM's standpoint. It would be sufficient, but it is definitely suboptimal. See [0] as a patch that is being held back by putting stuff behind the special area. I don't really care much about the storage layout on-disk, but I do care that AMs have efficient access to their data. For the page header, line pointers, and special area, that is currently guaranteed by the current page layout. However, for the special area, that currently guaranteed offset of (BLCKSZ - MAXALIGN(sizeof(IndexOpaque))) will get broken as there would be more space in the special area than the AM would be expecting. Right now, our index AMs are doing pointer chasing during special area lookups for no good reason, but with the patch it would be required. I don't like that at all. I understand that it is a requirement to store this reserved space in a fixed place on the on-disk page (you must know where the checksum is at static places on the page, otherwise you'd potentially mis-validate a page), but that requirement is not there for in-memory storage.
I think it's a small price to pay to swap the fields around during R/W operations - the largest size of special area is currently 24 bytes, and the proposals I've seen for this extra storage area would not need it to be actually filled with data whilst the page is being used by the AM (checksum could be zeroed in in-memory operations, and it'd get set during writeback; same with all other fields I can imagine the storage system using). > Considering that I am trying to make this > have zero storage impact if these features are not active, the impact > on a cluster with no additional features would be moot from a storage > perspective, no? The issue is that I'd like to eliminate the redirection from the page header in the hot path. Currently, we can do that, and pd_special would be little more than a hint to the smgr and pd_linp code that that area is special and reserved for this access method's private data, so that it is not freed. If you stick something extra in there, it's not special for the AM's private data, and the AM won't be able to use pd_special for similar uses as pd_linp+pd_lower. I'd rather have the storage system use its own not-special area; choreographed by e.g. a reuse of pd_checksum for one more page offset. Swapping the fields around between on-disk and in-memory doesn't need to be an issue, as special areas are rarely very large. Every index type we support utilizes the special area. Wouldn't those in-memory operations have priority on this useful space, as opposed to a storage system that maybe will be used in new clusters, and even then only during R/W operations to disk (each at most once for N memory operations)? > > Apart from that: Is this variable-sized 'metadata' associated with smgr > > infrastructure only, or is it also available for AM features? If not; then > > this is a strong -1.
The amount of tasks smgr needs to do on a page is > > generally much less than the amount of tasks an AM needs to do; so in my > > view the AM has priority in prime page real estate, not smgr or related > > infrastructure. > > I will confess to a slightly wobbly understanding of the delineation > of responsibility here. I was under the impression that by modifying > any consumer of PageHeaderData this would be sufficient to cover all > AMs for the types of cluster-wide options we'd be concerned about (say > extended checksums, multiple page encryption schemes, or other > per-page information we haven't yet anticipated). Reading smgr/README > and the various access/*/README has not made the distinction clear to > me yet. pd_special has (historically) been reserved for access methods' page-level private data. If you add to this area, shouldn't that be space that the AM should be able to hook into as well? Or are all those features limited to the storage system only; i.e. the storage system decides what's best for the AM's page handling w.r.t. physical storage? > > re: PageFeatures > > I'm not sure I understand the goal, nor the reasoning. Shouldn't thi
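The R/W-time swap Matthias proposes earlier in the thread - keep the special area at its compile-time offset in memory, and shift it only in the on-disk image to make room for the reserved tail - might look roughly like the sketch below. The layout, sizes, and function name are hypothetical; nothing here is from the patch series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLCKSZ 8192

/*
 * Hypothetical sketch of the writeback-time transform: in memory the
 * AM's special area lives at BLCKSZ - specialSize (so AM lookups stay
 * free of pointer chasing); on disk it is shifted down so the reserved
 * data (e.g. an auth tag) can occupy the fixed page tail.
 */
static void
swap_for_writeback(char *page, size_t specialSize, size_t reservedSize,
                   const char *reserved_data)
{
    char tmp[64];               /* special areas are small (<= 24 bytes today) */

    assert(specialSize <= sizeof(tmp));
    /* in-memory position: special area at the very end of the page */
    memcpy(tmp, page + BLCKSZ - specialSize, specialSize);
    /* on-disk position: special area moves down below the reserved tail */
    memcpy(page + BLCKSZ - reservedSize - specialSize, tmp, specialSize);
    memcpy(page + BLCKSZ - reservedSize, reserved_data, reservedSize);
}
```

The cost is one small memcpy per write-out, paid once per N in-memory operations, which is the trade-off Matthias argues is worth making.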
Re: [PATCHES] Post-special page storage TDE support
Hi Matthias, > Did you read the related thread with related discussion from last June, "Re: > better page-level checksums" [0]? In that I argued that space at the end of a > page is already allocated for the AM, and that reserving variable space at > the end of the page for non-AM usage is wasting the AM's performance > potential. Yes, I had read parts of that thread among others, but have given it a re-read. I can see the point you're making here, and agree that if we can allocate between pd_special and pd_upper that could make sense. I am a little unclear as to what performance impacts for the AM there would be if this additional space were ahead or behind the page special area; it seems like if this is something that needs to live on the page *somewhere* just being aligned correctly would be sufficient from the AM's standpoint. Considering that I am trying to make this have zero storage impact if these features are not active, the impact on a cluster with no additional features would be moot from a storage perspective, no? > Apart from that: Is this variable-sized 'metadata' associated with smgr > infrastructure only, or is it also available for AM features? If not; then > this is a strong -1. The amount of tasks smgr needs to do on a page is > generally much less than the amount of tasks an AM needs to do; so in my view > the AM has priority in prime page real estate, not smgr or related > infrastructure. I will confess to a slightly wobbly understanding of the delineation of responsibility here. I was under the impression that by modifying any consumer of PageHeaderData this would be sufficient to cover all AMs for the types of cluster-wide options we'd be concerned about (say extended checksums, multiple page encryption schemes, or other per-page information we haven't yet anticipated). Reading smgr/README and the various access/*/README has not made the distinction clear to me yet. > re: PageFeatures > I'm not sure I understand the goal, nor the reasoning. 
Shouldn't this be part > of the storage manager (smgr) implementation / can't this be part of the smgr > of the relation? For at least the feature cases I'm anticipating, this would apply to any disk page that may have user data, set (at least initially) at initdb time, so should apply to any pages in the cluster, regardless of AM. > re: use of pd_checksum > I mentioned this in the above-mentioned thread too, in [1], that we could use > pd_checksum as an extra area marker for this storage-specific data, which > would be located between pd_upper and pd_special. I do think that we could indeed use this as an additional in-page pointer, but at least for this version I was keeping things backwards-compatible. Peter G (I think) also made some good points about how to include the various status bits on the page somehow in terms of making a page completely self-contained. > Re: patch contents > > 0001: > > + specialSize = MAXALIGN(specialSize) + reserved_page_size; > > This needs to be aligned, so MAXALIGN(specialSize + reserved_page_size), or > an assertion that reserved_page_size is MAXALIGNED, would be better. It is currently aligned via the space calculation return value, but I agree that folding it into an assert or reworking it explicitly is clearer. > > PageValidateSpecialPointer(Page page) > > { > > Assert(page); > > - Assert(((PageHeader) page)->pd_special <= BLCKSZ); > > + Assert((((PageHeader) page)->pd_special - reserved_page_size) <= BLCKSZ); > > This check is incorrect. With your code it would allow pd_special past the > end of the block. If you want to put the reserved_page_size effectively > inside the special area, this check should instead be: > > + Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size)); > > Or, equally valid > > + Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ); Yup, I think I inverted my logic there; thanks. > > + * +----------------------+-----------------+------------+ > > + * | ... tuple2 tuple1 | "special space" | "reserved" | > > + * +----------------------+-----------------+------------+ > > Could you fix the table display if / when you revise the patchset? It seems to me that the corners don't line up with the column borders. Sure thing. > 0002: > > Make the output of "select_views" test stable > > Changing the reserved_page_size has resulted in non-stable results for this > > test. > > This makes sense, what kind of instability are we talking about? Are there > different results for runs with the same binary, or is this across > compilations? When running with the same compilation/initdb settings, the test results are stable, but they differ depending on what options you chose, so `make installcheck` output will fail when testing a cluster with different options vs. upstream HEAD without these patches, etc. > 0003 and up were not yet reviewed in depth. Thanks, I appreciate the feedback so far.
Re: [PATCHES] Post-special page storage TDE support
Hi On Mon, 24 Oct 2022, 19:56 David Christensen, < david.christen...@crunchydata.com> wrote: > > Discussion is welcome and encouraged! Did you read the related thread with related discussion from last June, "Re: better page-level checksums" [0]? In that I argued that space at the end of a page is already allocated for the AM, and that reserving variable space at the end of the page for non-AM usage is wasting the AM's performance potential. Apart from that: Is this variable-sized 'metadata' associated with smgr infrastructure only, or is it also available for AM features? If not; then this is a strong -1. The amount of tasks smgr needs to do on a page is generally much less than the amount of tasks an AM needs to do; so in my view the AM has priority in prime page real estate, not smgr or related infrastructure. re: PageFeatures I'm not sure I understand the goal, nor the reasoning. Shouldn't this be part of the storage manager (smgr) implementation / can't this be part of the smgr of the relation? re: use of pd_checksum I mentioned this in the above-mentioned thread too, in [1], that we could use pd_checksum as an extra area marker for this storage-specific data, which would be located between pd_upper and pd_special. Re: patch contents 0001: > + specialSize = MAXALIGN(specialSize) + reserved_page_size; This needs to be aligned, so MAXALIGN(specialSize + reserved_page_size), or an assertion that reserved_page_size is MAXALIGNED, would be better. > PageValidateSpecialPointer(Page page) > { > Assert(page); > - Assert(((PageHeader) page)->pd_special <= BLCKSZ); > + Assert((((PageHeader) page)->pd_special - reserved_page_size) <= BLCKSZ); This check is incorrect. With your code it would allow pd_special past the end of the block.
If you want to put the reserved_page_size effectively inside the special area, this check should instead be: + Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size)); Or, equally valid: + Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ); > + * +----------------------+-----------------+------------+ > + * | ... tuple2 tuple1 | "special space" | "reserved" | > + * +----------------------+-----------------+------------+ Could you fix the table display if / when you revise the patchset? It seems to me that the corners don't line up with the column borders. 0002: > Make the output of "select_views" test stable > Changing the reserved_page_size has resulted in non-stable results for this test. This makes sense, what kind of instability are we talking about? Are there different results for runs with the same binary, or is this across compilations? 0003 and up were not yet reviewed in depth. Kind regards, Matthias van de Meent [0] https://www.postgresql.org/message-id/flat/CA%2BTgmoaCeQ2b-BVgVfF8go8zFoceDjJq9w4AFQX7u6Acfdn2uA%40mail.gmail.com#90badc63e568a89a76f51fc95f07ffaf [1] https://postgr.es/m/CAEze2Wi5wYinU7nYxyKe_C0DRc6uWYa8ivn5%3Dzg63nKtHBnn8A%40mail.gmail.com
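Matthias's corrected bound can be restated as a tiny standalone predicate (the constants are hypothetical stand-ins), showing that the two forms he gives are the same check - pd_special must lie between the end of the page header and BLCKSZ minus the reserved tail:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192
#define SizeOfPageHeaderData 24       /* stand-in for the real value */

/*
 * Standalone restatement of the corrected check.  The addition form
 * (pd_special + reserved <= BLCKSZ) is used here because, unlike the
 * subtraction form, it cannot underflow for small pd_special values.
 */
static bool
special_pointer_is_valid(uint16_t pd_special, size_t reserved_page_size)
{
    return pd_special >= SizeOfPageHeaderData &&
           (size_t) pd_special + reserved_page_size <= BLCKSZ;
}
```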
Re: [PATCHES] Post-special page storage TDE support
> > Explicitly > > locking (assuming you stay in your lane) should only need to guard > > against access from other > > backends of this type if using shared buffers, so will be use-case > > dependent. > > I'm not sure what you mean here? I'm mainly pointing out that the specific code that manages this feature is the only code that has to worry about modifying said page region. > > This does have a runtime overhead due to moving some offset > > calculations from compile time to > > runtime. It is thought that the utility of this feature will outweigh > > the costs here. > > Have you done some benchmarking to give an idea of how much overhead we're > talking about? Not yet, but I am going to work on this. I suspect the current code could be improved, but will try to get some sort of measurement of the additional overhead. > > Candidates for page features include 32-bit or 64-bit checksums, > > encryption tags, or additional > > per-page metadata. > > > > While we are not currently getting rid of the pd_checksum field, this > > mechanism could be used to > > free up that 16 bits for some other purpose. > > IIUC there's a hard requirement of initdb-time initialization, as there's > otherwise no guarantee that you will find enough free space in each page at > runtime. It seems like a very hard requirement for a full replacement of the > current checksum approach (even if I agree that the current implementation > limitations are far from ideal), especially since there's no technical reason > that would prevent us from dynamically enabling data-checksums without doing > all the work when the cluster is down. As implemented, that is correct; we are currently assuming this specific feature mechanism is set at initdb time only. Checksums are not the primary motivation here, but were something that I could use for an immediate illustration of the feature.
That said, presumably you could define a way to set the features per-relation (say with a template field in pg_class) which would propagate to a relation on rewrite, so there could be ways to handle things incrementally, were this an overall goal. Thanks for looking, David
Re: [PATCHES] Post-special page storage TDE support
Hi, On Mon, Oct 24, 2022 at 12:55:53PM -0500, David Christensen wrote: > > Explicitly > locking (assuming you stay in your lane) should only need to guard > against access from other > backends of this type if using shared buffers, so will be use-case dependent. I'm not sure what you mean here? > This does have a runtime overhead due to moving some offset > calculations from compile time to > runtime. It is thought that the utility of this feature will outweigh > the costs here. Have you done some benchmarking to give an idea of how much overhead we're talking about? > Candidates for page features include 32-bit or 64-bit checksums, > encryption tags, or additional > per-page metadata. > > While we are not currently getting rid of the pd_checksum field, this > mechanism could be used to > free up that 16 bits for some other purpose. IIUC there's a hard requirement of initdb-time initialization, as there's otherwise no guarantee that you will find enough free space in each page at runtime. It seems like a very hard requirement for a full replacement of the current checksum approach (even if I agree that the current implementation limitations are far from ideal), especially since there's no technical reason that would prevent us from dynamically enabling data-checksums without doing all the work when the cluster is down.