Re: [PATCHES] Post-special page storage TDE support

2024-03-12 Thread Aleksander Alekseev
Hi David,

> I have finished the reworking of this particular patch series, and have
> tried to organize this in such a way that it will be easily reviewable.  It
> is constructed progressively so that it is possible to follow what is
> happening here.  As such, each individual commit is not guaranteed to
> compile on its own, so the whole series would need to be applied before it
> works. (It does pass CI tests.)
>
> Here is a brief roadmap of the patches; some of them have additional
> details in the commit message describing a little more about them.
>
> These two patches do some refactoring of existing code to make a common
> place to modify the definitions:
>
> v3-0001-refactor-Create-PageUsableSpace-to-represent-spac.patch
> v3-0002-refactor-Make-PageGetUsablePageSize-routine.patch
>
> These two patches add the ReservedPageSize variable and teach PageInit to
> use it to adjust sizing accordingly:
>
> v3-0003-feature-Add-ReservedPageSize-variable.patch
> v3-0004-feature-Adjust-page-sizes-at-PageInit.patch
>
> This patch modifies the definitions of 4 symbols to be computed based on
> PageUsableSpace:
>
> v3-0005-feature-Create-Calc-Limit-and-Dynamic-forms-for-f.patch
>
> The following 4 patches are mechanical replacements of all existing uses
> of these symbols; this provides both visibility into where the existing
> symbol is used as well as distinguishing between parts that care about
> static allocation vs dynamic usage.  The only non-mechanical change is to
> remove the definition of the old symbol so we can be guaranteed that all
> uses have been considered:
>
> v3-0006-chore-Split-MaxHeapTuplesPerPage-into-Limit-and-D.patch
> v3-0007-chore-Split-MaxIndexTuplesPerPage-into-Limit-and-.patch
> v3-0008-chore-Split-MaxHeapTupleSize-into-Limit-and-Dynam.patch
> v3-0009-chore-Split-MaxTIDsPerBTreePage-into-Limit-and-Dy.patch
>
> The following patches are related to required changes to support dynamic toast
> limits:
>
> v3-0010-feature-Add-hook-for-setting-reloptions-defaults-.patch
> v3-0011-feature-Dynamically-calculate-toast_tuple_target.patch
> v3-0012-feature-Add-Calc-options-for-toast-related-pieces.patch
> v3-0013-chore-Replace-TOAST_MAX_CHUNK_SIZE-with-ClusterTo.patch
> v3-0014-chore-Translation-updates-for-TOAST_MAX_CHUNK_SIZ.patch
>
> In order to calculate some of the sizes, we need to include nbtree.h
> internals, but we can't use it in front-end apps, so we separate out the
> pieces we care about into a separate include and use that:
>
> v3-0015-chore-Split-nbtree.h-structure-defs-into-an-inter.patch
>
> This is the meat of the patch; provide a common location for these
> block-size-related constants to be computed using the infra that has been
> set up so far.  Also ensure that we are properly initializing this in
> front-end and back-end code.  A tricky piece here is we have two separate
> include files for blocksize.h; one which exposes externs as consts for
> optimizations, and one that blocksize.c itself uses without consts, which
> it uses to create/initialize the vars:
>
> v3-0016-feature-Calculate-all-blocksize-constants-in-a-co.patch
>
> Add ControlFile and GUC support for reserved_page_size:
>
> v3-0017-feature-ControlFile-GUC-support-for-reserved_page.patch
>
> Add initdb support for reserving page space:
>
> v3-0018-feature-Add-reserved_page_size-to-initdb-bootstra.patch
>
> Fixes for pg_resetwal:
>
> v3-0019-feature-Updates-for-pg_resetwal.patch
>
> The following 4 patches mechanically replace the Dynamic form to use the new
> Cluster variables:
>
> v3-0020-chore-Rename-MaxHeapTupleSizeDynamic-to-ClusterMa.patch
> v3-0021-chore-Rename-MaxHeapTuplesPerPageDynamic-to-Clust.patch
> v3-0022-chore-Rename-MaxIndexTuplesPerPageDynamic-to-Clus.patch
> v3-0023-chore-Rename-MaxTIDsPerBTreePageDynamic-to-Cluste.patch
>
> Two pieces of optimization required for visibility map:
>
> v3-0024-optimization-Add-support-for-fast-non-division-ba.patch
> v3-0025-optimization-Use-fastdiv-code-in-visibility-map.patch
>
> Update bufpage.h comments:
>
> v3-0026-doc-update-bufpage-docs-w-reserved-space-data.patch
>
> Fixes for bloom to use runtime size:
>
> v3-0027-feature-Teach-bloom-about-PageUsableSpace.patch
>
> Fixes for FSM to use runtime size:
>
> v3-0028-feature-teach-FSM-about-reserved-page-space.patch
>
> I hope this makes sense for reviewing; I know it's a big job, so breaking
> things up a little more and organizing them will hopefully help.

Just wanted to let you know that the patchset seems to need a rebase,
according to cfbot.

Best regards,
Aleksander Alekseev (wearing a co-CFM hat)




Re: [PATCHES] Post-special page storage TDE support

2023-12-22 Thread David Christensen
Hi again!

Per some offline discussion with Stephen, I've continued to work on some
modifications here; this particular patchset is intended to facilitate
review by highlighting the mechanical nature of many of these changes.  As
such, I have taken the following approach to this rework:

0001 - Create PageUsableSpace to represent space post-smgr
0002 - Add support for fast, non-division-based div/mod algorithms (see the
sketch after this list)
0003 - Use fastdiv code in visibility map
0004 - Make PageUsableSpace incorporate variable-sized limit
0005 - Add Calc, Limit and Dynamic forms of all variable constants
0006 - Split MaxHeapTuplesPerPage into Limit and Dynamic variants
0007 - Split MaxIndexTuplesPerPage into Limit and Dynamic variants
0008 - Split MaxHeapTupleSize into Limit and Dynamic variants
0009 - Split MaxTIDsPerBTreePage into Limit and Dynamic variant
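
As promised above, a minimal sketch of the round-up-reciprocal idea behind
0002 (illustrative only -- not the patch's actual code or naming): since
the divisor is fixed once the cluster's page layout is known, n / d can be
replaced by a multiply-high against a precomputed magic constant.

#include <stdint.h>

typedef struct fastdiv
{
	uint64_t	magic;			/* ceil(2^64 / divisor) */
} fastdiv;

static fastdiv
fastdiv_init(uint32_t divisor)
{
	fastdiv		fd;

	/* requires divisor > 1; UINT64_MAX / d + 1 == ceil(2^64 / d) */
	fd.magic = UINT64_MAX / divisor + 1;
	return fd;
}

static uint32_t
fastdiv_div(uint32_t n, fastdiv fd)
{
	/*
	 * High 64 bits of the 128-bit product; exact for all 32-bit n, since
	 * the rounding error n * (magic * d - 2^64) always stays below 2^64.
	 * (unsigned __int128 is a GCC/Clang extension.)
	 */
	return (uint32_t) (((unsigned __int128) n * fd.magic) >> 64);
}

With HEAPBLOCKS_PER_PAGE no longer a compile-time constant, the visibility
map can initialize one such magic value at startup instead of paying for a
runtime division on every lookup.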

0001 - 0003 have appeared in this thread or in other forms on the list
already, though 0001 refactors things slightly more aggressively and adds a
StaticAssert() to ensure that this change is still sane.
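
For illustration, the flavor of check meant here (a sketch only, using this
series' naming):

StaticAssertDecl(PageUsableSpace == BLCKSZ - SizeOfPageHeaderData,
				 "PageUsableSpace must match the historic usable page space");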

0004 adds the ReservedPageSpace variable, and also redefines the previous
BLCKSZ - SizeOfPageHeaderData as PageUsableSpaceMax; there are a few
related fixups.

0005 adds the macros to compute the former constants while leaving their
original definitions to evaluate to the same place (the infamous Calc* and
*Limit, plus we invite *Dynamic to the party as well; the names are
terrible and there must be something better)

0006 - 0009 are all the same approach; we undefine the old constant name
and modify the existing uses of this symbol to be either the *Limit or
*Dynamic form, depending on whether the changed available space would
impact the calculations.  Since we are touching every use of this symbol,
this facilitates review of the impact, though I would contend that almost
every piece I've spot-checked seems like it really does need to know about
the runtime limit.  Perhaps there is more we could do here.  I could also
see a variable per constant rather than recalculating this every time, in
which case the *Dynamic form would just be the variable and we'd need a
hook to initialize it or otherwise set it on first use.
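
That alternative might look something like this (a sketch only; the
function name and when it runs are hypothetical):

int			MaxHeapTuplesPerPageDynamic;

static void
InitDynamicPageLimits(void)
{
	/* run once reserved page space is known, e.g. after reading pg_control */
	MaxHeapTuplesPerPageDynamic =
		CalcMaxHeapTuplesPerPage(BLCKSZ - SizeOfPageHeaderData -
								 reserved_page_space);
}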

There are a number of additional things remaining to be done to get this to
fully work, but I did want to get some of this out there for review.

Still to do (almost all in some form in original patch, so just need to
extract the relevant pieces):
- set reserved-page-size via initdb
- load reserved-page-size from pg_control
- apply to the running cluster
- some form of compatibility for these constants in common and ensuring
bin/ works
- some toast-related changes (this requires a patch to support dynamic
relopts, which I can extract, as the existing code is using a constant
lookup table)
- probably some more pieces I'm forgetting

Thanks,
David


v2-0005-Add-Calc-Limit-and-Dynamic-forms-of-all-variable-.patch
Description: Binary data


v2-0001-Create-PageUsableSpace-to-represent-space-post-sm.patch
Description: Binary data


v2-0002-Add-support-for-fast-non-division-based-div-mod-a.patch
Description: Binary data


v2-0003-Use-fastdiv-code-in-visibility-map.patch
Description: Binary data


v2-0004-Make-PageUsableSpace-incorporate-variable-sized-l.patch
Description: Binary data


v2-0007-Split-MaxIndexTuplesPerPage-into-Limit-and-Dynami.patch
Description: Binary data


v2-0008-Split-MaxHeapTupleSize-into-Limit-and-Dynamic-var.patch
Description: Binary data


v2-0006-Split-MaxHeapTuplesPerPage-into-Limit-and-Dynamic.patch
Description: Binary data


v2-0009-Split-MaxTIDsPerBTreePage-into-Limit-and-Dynamic-.patch
Description: Binary data


Re: [PATCHES] Post-special page storage TDE support

2023-11-29 Thread David Christensen
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund  wrote:

> Hi,
>
> - IMO the patch touches many places it shouldn't need to touch, because of
>   essentially renaming a lot of existing macro names to *Limit,
>   necessitating modifying a lot of users. I think instead the few places
>   that care about the runtime limit should be modified.
>
>   As-is the patch would cause a lot of fallout in extensions that just do
>   things like defining an on-stack array of Datums or such - even though
>   all they'd need is to change the define to the *Limit one.
>
>   Even leaving extensions aside, it must make reviewing (and I'm sure
>   maintaining) the patch very tedious.
>

Hi Andres et al,

So I've been looking at alternate approaches to this issue and considering
how to reduce churn, and I think we still need the *Limit variants.  Let's
take a simple example:

Just looking at MaxHeapTuplesPerPage and breaking down the instances in the
code, loosely partitioning them by whether each is used as an array index
or some other usage (this doesn't distinguish code from comments,
unfortunately), we get the following breakdown:

$ git grep -hoE '[[]?MaxHeapTuplesPerPage' | sort | uniq -c
 18 [MaxHeapTuplesPerPage
 51 MaxHeapTuplesPerPage

This would be 18 places where we would need to adjust in a fairly
mechanical fashion to use MaxHeapTuplesPerPageLimit instead of
MaxHeapTuplesPerPage, vs some significant fraction of the other 51
uses--even if you assumed half of those were in comments, there would
presumably need to be some sort of adjustment in verbiage, since we are
going to be changing some of the interpretation.
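
To illustrate the intended split under this scheme (a sketch only;
deadoffsets/noffsets are illustrative names):

	/* static allocation: must switch to the compile-time upper bound */
	OffsetNumber deadoffsets[MaxHeapTuplesPerPageLimit];

	/* runtime logic: the old spelling now tracks PageUsableSpace */
	if (noffsets > MaxHeapTuplesPerPage)
		elog(ERROR, "too many dead tuples on page");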

I am working on a patch to cleanup some of the assumptions that smgr makes
currently about its space usage and how the individual access methods
consider it, as they should only be calculating things based on how much
space is available after smgr is done with it.  That has traditionally been
BLCKSZ - SizeOfPageHeaderData, but this patch (included) factors that out
into a single expression that we can now use in access methods, so we can
then reserve additional page space and not need to adjust the access
methods further.

Building on top of this patch, we'd define something like this to handle
the #defines that need to be dynamic:

extern Size reserved_page_space;

#define PageUsableSpace (BLCKSZ - SizeOfPageHeaderData - reserved_page_space)
#define MaxHeapTuplesPerPage CalcMaxHeapTuplesPerPage(PageUsableSpace)
#define MaxHeapTuplesPerPageLimit CalcMaxHeapTuplesPerPage(BLCKSZ - SizeOfPageHeaderData)
#define CalcMaxHeapTuplesPerPage(freesize) \
	((int) ((freesize) / \
			(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))
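
For concreteness, with the stock values (BLCKSZ = 8192, SizeOfPageHeaderData
= 24, MAXALIGN(SizeofHeapTupleHeader) = 24, sizeof(ItemIdData) = 4) and
reserved_page_space = 0, this evaluates to (8192 - 24) / 28 = 291, the
familiar MaxHeapTuplesPerPage value.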

In my view, extensions are better off needing only to change their static
allocations in a mechanical sense than having to revisit every other use in
their code when the interpretation of these constants changes; this seems
more likely to guarantee a correct result than if you exceed the page space
and start overwriting things you didn't intend to because you weren't aware
that you needed to check for dynamic limits on your own.

Take another thing which would need adjusting for reserving page space,
MaxHeapTupleSize:

$ git grep -ohE '[[]?MaxHeapTupleSize' | sort | uniq -c
  3 [MaxHeapTupleSize
 16 MaxHeapTupleSize

Here there are 3 static arrays which would need to be adjusted vs 16 other
instances.  If we kept the MaxHeapTupleSize interpretation the same and an
extension went unadjusted, it would compile just fine, but with too large a
length compared to the smaller PageUsableSpace, so you could conceivably
overwrite into the reserved space depending on what you were doing.

(Since by definition reserved_page_space >= 0, PageUsableSpace will always
be <= BLCKSZ - SizeOfPageHeaderData, so any expression computed from it
will be smaller.)
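
For concreteness: on a stock 8 kB page MaxHeapTupleSize works out to 8160
bytes today; reserving, say, 16 bytes per page would shrink the dynamic
value to 8144, while an unadjusted extension would still happily build
tuples of up to 8160 bytes.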

In short, I think the approach I took originally will actually reduce
errors out-of-core, and the churn, while extensive, is necessary churn.

I can produce a second patch which implements this calc/limit atop this
first one as well.

Thanks,

David


0001-Create-PageUsableSpace-to-represent-space-post-smgr.patch
Description: Binary data


Re: [PATCHES] Post-special page storage TDE support

2023-11-13 Thread Stephen Frost
Greetings,

On Mon, Nov 13, 2023 at 16:53 David Christensen <
david.christen...@crunchydata.com> wrote:

> On Mon, Nov 13, 2023 at 2:52 PM Andres Freund  wrote:
>
>>
>> > This scheme would open up space per page that would now be available for
>> > plenty of other things; the encoding in the header and the corresponding
>> > available space in the footer would seem to open up quite a few options
>> > now, no?
>>
>> Sure, if you're willing to rewrite the whole cluster to upgrade and
>> willing to
>> permanently sacrifice some data density.  If the stored data is actually
>> specific to the page - that is the place to put the data. If not, then the
>> tradeoff is much more complicated IMO.
>>
>> Of course this isn't a new problem - storing the page size on each page
>> was
>> just silly, it's never going to change across the cluster and even more
>> definitely not going to change within a single relation.
>>
>
> Crazy idea; since stored pagesize is already a fixed cost that likely
> isn't going away, what if instead of the pd_checksum field, we instead
> reinterpret pd_pagesize_version; 4 would mean "no page features", but
> anything 5 or higher could be looked up as an external page feature set,
> with storage semantics outside of the realm of the page itself (other than
what the page features code itself needs to know); i.e., move away from the
> on-page bitmap into a more abstract representation of features which could
> be something along the lines of what you were suggesting, including
> extension support.
>
> It seems like this could also support adding/removing features on page
> read/write as long as there was sufficient space in the reserved_page
> space; read the old feature set on page read, convert to the new feature
> set which will write out the page with the additional/changed format.
> Obviously there would be bookkeeping to be done in terms of making sure all
> pages had been converted from one format to another, but for the page level
> this would be straightforward.
>
> Just thinking aloud here...
>

In other crazy idea space … if the page didn’t have enough space to allow
for the desired features, then make any insert/update actions forcibly have
to choose a different page for the new tuple, while allowing deletes to do
their usual thing; then, when vacuum comes along and is able to clean up
the page and remove all the dead tuples, it could enable the desired
features on the page…

Thanks,

Stephen



Re: [PATCHES] Post-special page storage TDE support

2023-11-13 Thread David Christensen
On Mon, Nov 13, 2023 at 2:52 PM Andres Freund  wrote:

>
> > This scheme would open up space per page that would now be available for
> > plenty of other things; the encoding in the header and the corresponding
> > available space in the footer would seem to open up quite a few options
> > now, no?
>
> Sure, if you're willing to rewrite the whole cluster to upgrade and
> willing to
> permanently sacrifice some data density.  If the stored data is actually
> specific to the page - that is the place to put the data. If not, then the
> tradeoff is much more complicated IMO.
>
> Of course this isn't a new problem - storing the page size on each page was
> just silly, it's never going to change across the cluster and even more
> definitely not going to change within a single relation.
>

Crazy idea; since stored pagesize is already a fixed cost that likely isn't
going away, what if instead of the pd_checksum field, we instead
reinterpret pd_pagesize_version; 4 would mean "no page features", but
anything 5 or higher could be looked up as an external page feature set,
with storage semantics outside of the realm of the page itself (other than
what the page features code itself needs to know); i.e., move away from the
on-page bitmap into a more abstract representation of features which could
be something along the lines of what you were suggesting, including
extension support.

It seems like this could also support adding/removing features on page
read/write as long as there was sufficient space in the reserved_page
space; read the old feature set on page read, convert to the new feature
set which will write out the page with the additional/changed format.
Obviously there would be bookkeeping to be done in terms of making sure all
pages had been converted from one format to another, but for the page level
this would be straightforward.
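
A sketch of what the decoding side might look like (hypothetical; values
above the current layout version would select an externally stored feature
set):

static int
PageGetFeatureSetId(Page page)
{
	uint16		version = PageGetPageLayoutVersion(page);

	if (version <= PG_PAGE_LAYOUT_VERSION)	/* currently 4 */
		return 0;				/* classic page, no page features */

	/* 5 and up index into an external feature-set table */
	return version - PG_PAGE_LAYOUT_VERSION;
}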

Just thinking aloud here...

David


Re: [PATCHES] Post-special page storage TDE support

2023-11-13 Thread Andres Freund
Hi,

On 2023-11-13 14:37:47 -0600, David Christensen wrote:
> On Mon, Nov 13, 2023 at 2:27 PM Andres Freund  wrote:
> > On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote:
> > > On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost  wrote:
> > > > In conversations with folks (my memory specifically is a discussion
> > with
> > > > Peter G, added to CC, and my apologies to Peter if I'm misremembering)
> > > there was a pretty strong push that a page should be able to 'stand
> > > alone' and not depend on something else (eg: pg_control, or whatever)
> > > to provide info needed to be able to interpret the page.  For my part,
> > > I don't have a particularly strong feeling on that, but that's what
> > > led to this design.
> > >
> > > The term that I have used in the past is "self-contained". Meaning
> > > capable of being decoded more or less as-is, without any metadata, by
> > > tools like pg_filedump.
> >
> > I'm not finding that very convincing - without cluster wide data, like
> > keys, a
> > tool like pg_filedump isn't going to be able to do much with encrypted
> > pages. Given the need to look at some global data, figuring out the offset
> > at
> > which data starts based on a value in pg_control isn't meaningfully worse
> > than
> > having the data on each page.
> >
> > Storing redundant data in each page header, when we've wanted space in the
> > page header for plenty other things, just doesn't seem a good use of said
> > space.
> >
> 
> This scheme would open up space per page that would now be available for
> plenty of other things; the encoding in the header and the corresponding
> available space in the footer would seem to open up quite a few options
> now, no?

Sure, if you're willing to rewrite the whole cluster to upgrade and willing to
permanently sacrifice some data density.  If the stored data is actually
specific to the page - that is the place to put the data. If not, then the
tradeoff is much more complicated IMO.

Of course this isn't a new problem - storing the page size on each page was
just silly, it's never going to change across the cluster and even more
definitely not going to change within a single relation.

Greetings,

Andres Freund




Re: [PATCHES] Post-special page storage TDE support

2023-11-13 Thread David Christensen
On Mon, Nov 13, 2023 at 2:27 PM Andres Freund  wrote:

> Hi,
>
> On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote:
> > On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost  wrote:
> > > In conversations with folks (my memory specifically is a discussion
> > > with Peter G, added to CC, and my apologies to Peter if I'm
> > > misremembering) there was a pretty strong push that a page should be
> > > able to 'stand alone' and not depend on something else (eg: pg_control,
> > > or whatever) to provide info needed to be able to interpret the page.
> > > For my part, I don't have a particularly strong feeling on that, but
> > > that's what led to this design.
> >
> > The term that I have used in the past is "self-contained". Meaning
> > capable of being decoded more or less as-is, without any metadata, by
> > tools like pg_filedump.
>
> I'm not finding that very convincing - without cluster wide data, like
> keys, a
> tool like pg_filedump isn't going to be able to do much with encrypted
> pages. Given the need to look at some global data, figuring out the offset
> at
> which data starts based on a value in pg_control isn't meaningfully worse
> than
> having the data on each page.
>
> Storing redundant data in each page header, when we've wanted space in the
> page header for plenty other things, just doesn't seem a good use of said
> space.
>

This scheme would open up space per page that would now be available for
plenty of other things; the encoding in the header and the corresponding
available space in the footer would seem to open up quite a few options
now, no?


Re: [PATCHES] Post-special page storage TDE support

2023-11-13 Thread Andres Freund
Hi,

On 2023-11-08 18:47:56 -0800, Peter Geoghegan wrote:
> On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost  wrote:
> > In conversations with folks (my memory specifically is a discussion with
> > Peter G, added to CC, and my apologies to Peter if I'm misremembering)
> > there was a pretty strong push that a page should be able to 'stand
> > alone' and not depend on something else (eg: pg_control, or whatever) to
> > provide info needed to be able to interpret the page.  For my part, I
> > don't have a particularly strong feeling on that, but that's what led to
> > this design.
> 
> The term that I have used in the past is "self-contained". Meaning
> capable of being decoded more or less as-is, without any metadata, by
> tools like pg_filedump.

I'm not finding that very convincing - without cluster wide data, like keys, a
tool like pg_filedump isn't going to be able to do much with encrypted
pages. Given the need to look at some global data, figuring out the offset at
which data starts based on a value in pg_control isn't meaningfully worse than
having the data on each page.

Storing redundant data in each page header, when we've wanted space in the
page header for plenty other things, just doesn't seem a good use of said
space.

Greetings,

Andres Freund




Re: [PATCHES] Post-special page storage TDE support

2023-11-13 Thread David Christensen
On Tue, Nov 7, 2023 at 6:20 PM Andres Freund  wrote:

> Hi,
>
> On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > From: David Christensen 
> > Date: Tue, 9 May 2023 16:56:15 -0500
> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which
> > will be ultimately used for encrypted data, extended checksums, and
> > potentially other things.  This data appears at the end of the Page,
> > after any `pd_special` area, and will be calculated at runtime based on
> > specific ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing
> > clusters for `pg_upgrade`, as we will require logical replication to
> > move data into a cluster with different settings here.
>
> The first part of the last paragraph makes it sound like pg_upgrade won't
> be supported across this commit, rather than just between different
> settings...
>

Thanks for the review.


> I think as a whole this is not an insane idea. A few comments:
>
> - IMO the patch touches many places it shouldn't need to touch, because of
>   essentially renaming a lot of existing macro names to *Limit,
>   necessitating modifying a lot of users. I think instead the few places
>   that care about the runtime limit should be modified.
>
>   As-is the patch would cause a lot of fallout in extensions that just do
>   things like defining an on-stack array of Datums or such - even though
>   all they'd need is to change the define to the *Limit one.
>
>   Even leaving extensions aside, it must make reviewing (and I'm sure
>   maintaining) the patch very tedious.
>

You make a good point, and I think you're right that we could teach the
places that care about runtime vs compile-time differences about the
changes while leaving other callers alone.  The *Limit ones were introduced
since we need constant values here from the Calc...() macros, but we could
try keeping the existing *Limit semantics under the old name and switching
things around.  I suspect there would be the same amount of code churn, but
it would be less mechanical.


> - I'm a bit worried about how the extra special page will be managed - if
>   there are multiple features that want to use it, who gets to put their
>   data at what offset?
>
>   After writing this I saw that 0002 tries to address this - but I don't
>   like the design. It introduces runtime overhead that seems likely to be
>   visible.
>

Agreed this could be optimized.


> - Checking for features using PageGetFeatureOffset() seems the wrong
>   design to me - instead of a branch for some feature being disabled,
>   perfectly predictable for the CPU, we need to do an external function
>   call every time to figure out that yes, checksums are *still* disabled.
>

This is probably not a supported approach (it felt a little icky), but I'd
played around with const pointers to structs of const elements, where the
initial values of a global var were populated early on (so set once and
never changed post-init), and the compiler didn't complain and things
seemed to work OK; not sure if this approach might help balance the early
mutability and constant-lookup needs:

typedef struct PageFeatureOffsets {
  const Size feature0offset;
  const Size feature1offset;
  ...
} PageFeatureOffsets;

PageFeatureOffsets offsets = {0};
const PageFeatureOffsets *exposedOffsets = &offsets;

void InitOffsets() {
  *((Size *) &offsets.feature0offset) = ...;
  *((Size *) &offsets.feature1offset) = ...;
  ...
}

> - Recomputing offsets every time in PageGetFeatureOffset() seems too
>   expensive. The offsets can't change while running, as
>   PageGetFeatureOffset() has enough information to distinguish between
>   different kinds of relations
>

Yes, this was a simple approach for ease of implementation; there is
certainly a way to precompute a lookup table from the page feature bitmask
into the offsets themselves or otherwise precompute, turn from function
call into inline/macro, etc.
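
For example (a rough sketch, all names hypothetical): with a small feature
bitmask, the per-feature offsets could be computed once at startup, turning
PageGetFeatureOffset() into a simple array lookup:

#define MAX_PAGE_FEATURES	4

static Size feature_offsets[1 << MAX_PAGE_FEATURES][MAX_PAGE_FEATURES];

static void
InitPageFeatureOffsets(const Size *feature_sizes)
{
	for (int mask = 0; mask < (1 << MAX_PAGE_FEATURES); mask++)
	{
		Size		offset = BLCKSZ;

		/* features are laid out back-to-front from the end of the page */
		for (int feature = 0; feature < MAX_PAGE_FEATURES; feature++)
		{
			if (mask & (1 << feature))
			{
				offset -= feature_sizes[feature];
				feature_offsets[mask][feature] = offset;
			}
			else
				feature_offsets[mask][feature] = 0; /* feature absent */
		}
	}
}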


>   - so why do we need to recompute offsets on every single page?  I'd
>     instead add a distinct offset variable for each feature.
>

This would work iff there is a single page feature set across all pages in
the cluster; I'm not sure we don't want more flexibility here.


> - Modifying every single PageInit() call doesn't make sense to me. That'll
>   just create a lot of breakage for - as far as I can tell - no win.
>

This was a placeholder to allow different features depending on page type;
to keep things simple for now I just used the same values here, but we
could move this inside PageInit() instead (again, assuming single feature
set per cluster).


> - Why is it worth sacrificing space on every page to indicate which
>   features were enabled?  I think there'd need to be some convincing
>   reasons for introducing such overhead.
>

The point here is if we can use either GCM authtag or stronger checksums
then we've gained the 

Re: [PATCHES] Post-special page storage TDE support

2023-11-08 Thread Peter Geoghegan
On Wed, Nov 8, 2023 at 6:04 AM Stephen Frost  wrote:
> In conversations with folks (my memory specifically is a discussion with
> Peter G, added to CC, and my apologies to Peter if I'm misremembering)
> there was a pretty strong push that a page should be able to 'stand
> alone' and not depend on something else (eg: pg_control, or whatever) to
> provide info needed to be able to interpret the page.  For my part, I
> don't have a particularly strong feeling on that, but that's what led to
> this design.

The term that I have used in the past is "self-contained". Meaning
capable of being decoded more or less as-is, without any metadata, by
tools like pg_filedump.

Any design in this area should try to make things as easy to debug as
possible, for the obvious reason: encrypted data that somehow becomes
corrupt is bound to be a nightmare to debug. (Besides, we already
support tools like pg_filedump, so this isn't a new principle.)

-- 
Peter Geoghegan




Re: [PATCHES] Post-special page storage TDE support

2023-11-08 Thread Stephen Frost
Greetings,

On Wed, Nov 8, 2023 at 20:55 David Christensen <
david.christen...@crunchydata.com> wrote:

> On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost  wrote:
>
>> * Andres Freund (and...@anarazel.de) wrote:
>> > On 2023-05-09 17:08:26 -0500, David Christensen wrote:
>> > > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
>> > > From: David Christensen 
>> > > Date: Tue, 9 May 2023 16:56:15 -0500
>> > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
>> > >
>> > > This space is reserved for extended data on the Page structure which
>> > > will be ultimately used for encrypted data, extended checksums, and
>> > > potentially other things.  This data appears at the end of the Page,
>> > > after any `pd_special` area, and will be calculated at runtime based
>> > > on specific ControlFile features.
>> > >
>> > > No effort is made to ensure this is backwards-compatible with
>> > > existing clusters for `pg_upgrade`, as we will require logical
>> > > replication to move data into a cluster with different settings here.
>> >
>> > The first part of the last paragraph makes it sound like pg_upgrade
>> > won't be supported across this commit, rather than just between
>> > different settings...
>>
>
> Yeah, that's vague, but you picked up on what I meant.
>
>
>> > I think as a whole this is not an insane idea. A few comments:
>>
>> Thanks for all the feedback!
>>
>> > - Why is it worth sacrificing space on every page to indicate which
>> >   features were enabled?  I think there'd need to be some convincing
>> >   reasons for introducing such overhead.
>>
>> In conversations with folks (my memory specifically is a discussion with
>> Peter G, added to CC, and my apologies to Peter if I'm misremembering)
>> there was a pretty strong push that a page should be able to 'stand
>> alone' and not depend on something else (eg: pg_control, or whatever) to
>> provide info needed to be able to interpret the page.  For my part, I
>> don't have a particularly strong feeling on that, but that's what led to
>> this design.
>>
>
> Unsurprisingly, I agree that it's useful to keep these features on the
> page itself; from a forensic standpoint this makes it much easier to
> interpret what is happening, and it would also allow you to have different
> features on a given page or type of page depending on need.  The initial
> patch utilizes pg_control to store the cluster page features, but there's
> no reason it couldn't be dependent on fork/page type or stored in
> pg_tablespace to utilize different features.
>

When it comes to authenticated encryption, it’s also the case that it’s
unclear what value the checksum field has, if any…  it’s certainly not
directly needed as a checksum, as the auth tag is much better for the
purpose of seeing if the page has been changed in some way. It’s also not
big enough to serve as an auth tag per NIST guidelines regarding the size
of the authenticated data vs. the size of the tag.  Using it to indicate
what features are enabled on the page seems pretty useful, as David notes.

Thanks,

Stephen



Re: [PATCHES] Post-special page storage TDE support

2023-11-08 Thread David Christensen
On Wed, Nov 8, 2023 at 8:04 AM Stephen Frost  wrote:

> Greetings,
>
> * Andres Freund (and...@anarazel.de) wrote:
> > On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > > From: David Christensen 
> > > Date: Tue, 9 May 2023 16:56:15 -0500
> > > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> > >
> > > This space is reserved for extended data on the Page structure which
> > > will be ultimately used for encrypted data, extended checksums, and
> > > potentially other things.  This data appears at the end of the Page,
> > > after any `pd_special` area, and will be calculated at runtime based
> > > on specific ControlFile features.
> > >
> > > No effort is made to ensure this is backwards-compatible with existing
> > > clusters for `pg_upgrade`, as we will require logical replication to
> > > move data into a cluster with different settings here.
> >
> > The first part of the last paragraph makes it sound like pg_upgrade
> > won't be supported across this commit, rather than just between
> > different settings...
>

Yeah, that's vague, but you picked up on what I meant.


> > I think as a whole this is not an insane idea. A few comments:
>
> Thanks for all the feedback!
>
> > - Why is it worth sacrificing space on every page to indicate which
> >   features were enabled?  I think there'd need to be some convincing
> >   reasons for introducing such overhead.
>
> In conversations with folks (my memory specifically is a discussion with
> Peter G, added to CC, and my apologies to Peter if I'm misremembering)
> there was a pretty strong push that a page should be able to 'stand
> alone' and not depend on something else (eg: pg_control, or whatever) to
> provide info needed to be able to interpret the page.  For my part, I
> don't have a particularly strong feeling on that, but that's what led to
> this design.
>

Unsurprisingly, I agree that it's useful to keep these features on the page
itself; from a forensic standpoint this makes it much easier to interpret
what is happening, and it would also allow you to have different features
on a given page or type of page depending on need.  The initial patch
utilizes pg_control to store the cluster page features, but there's no
reason it couldn't be dependent on fork/page type or stored in
pg_tablespace to utilize different features.

Thanks,

David


Re: [PATCHES] Post-special page storage TDE support

2023-11-08 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> > From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> > From: David Christensen 
> > Date: Tue, 9 May 2023 16:56:15 -0500
> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will 
> > be ultimately used for
> > encrypted data, extended checksums, and potentially other things.  This 
> > data appears at the end of
> > the Page, after any `pd_special` area, and will be calculated at runtime 
> > based on specific
> > ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing 
> > clusters for `pg_upgrade`, as
> > we will require logical replication to move data into a cluster with
> > different settings here.
> 
> The first part of the last paragraph makes it sound like pg_upgrade won't be
> supported across this commit, rather than just between different settings...
> 
> I think as a whole this is not an insane idea. A few comments:

Thanks for all the feedback!

> - Why is it worth sacrificing space on every page to indicate which features
>   were enabled?  I think there'd need to be some convincing reasons for
>   introducing such overhead.

In conversations with folks (my memory specifically is a discussion with
Peter G, added to CC, and my apologies to Peter if I'm misremembering)
there was a pretty strong push that a page should be able to 'stand
alone' and not depend on something else (eg: pg_control, or whatever) to
provide info needed to be able to interpret the page.  For my part, I don't
have a particularly strong feeling on that, but that's what led to this
design.

Getting a consensus on whether that's a requirement or not would definitely
be really helpful.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [PATCHES] Post-special page storage TDE support

2023-11-07 Thread Andres Freund
Hi,

On 2023-05-09 17:08:26 -0500, David Christensen wrote:
> From 965309ea3517fa734c4bc89c144e2031cdf6c0c3 Mon Sep 17 00:00:00 2001
> From: David Christensen 
> Date: Tue, 9 May 2023 16:56:15 -0500
> Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
>
> This space is reserved for extended data on the Page structure which will be 
> ultimately used for
> encrypted data, extended checksums, and potentially other things.  This data 
> appears at the end of
> the Page, after any `pd_special` area, and will be calculated at runtime 
> based on specific
> ControlFile features.
>
> No effort is made to ensure this is backwards-compatible with existing 
> clusters for `pg_upgrade`, as
> we will require logical replication to move data into a cluster with
> different settings here.

The first part of the last paragraph makes it sound like pg_upgrade won't be
supported across this commit, rather than just between different settings...

I think as a whole this is not an insane idea. A few comments:

- IMO the patch touches many places it shouldn't need to touch, because of
  essentially renaming a lot of existing macro names to *Limit,
  necessitating modifying a lot of users. I think instead the few places that
  care about the runtime limit should be modified.

  As-is the patch would cause a lot of fallout in extensions that just do
  things like defining an on-stack array of Datums or such - even though all
  they'd need is to change the define to the *Limit one.

  Even leaving extensions aside, it must make reviewing (and I'm sure
  maintaining) the patch very tedious.


- I'm a bit worried about how the extra special page will be managed - if
  there are multiple features that want to use it, who gets to put their data
  at what offset?

  After writing this I saw that 0002 tries to address this - but I don't like
  the design. It introduces runtime overhead that seems likely to be visible.


- Checking for features using PageGetFeatureOffset() seems the wrong design to
  me - instead of a branch for some feature being disabled, perfectly
  predictable for the CPU, we need to do an external function call every time
  to figure out that yes, checksums are *still* disabled.


- Recomputing offsets every time in PageGetFeatureOffset() seems too
  expensive. The offsets can't change while running, as PageGetFeatureOffset()
  has enough information to distinguish between different kinds of relations
  - so why do we need to recompute offsets on every single page?  I'd instead
  add a distinct offset variable for each feature.


- Modifying every single PageInit() call doesn't make sense to me. That'll
  just create a lot of breakage for - as far as I can tell - no win.


- Why is it worth sacrificing space on every page to indicate which features
  were enabled?  I think there'd need to be some convincing reasons for
  introducing such overhead.

- Is it really useful to encode the set of features enabled in a cluster with
  a bitmask? That pretty much precludes utilizing extra page space in
  extensions. We could instead just have an extra cluster-wide file that
  defines a mapping of offset to feature.


Greetings,

Andres Freund




Re: [PATCHES] Post-special page storage TDE support

2023-05-30 Thread David Christensen
On Fri, May 12, 2023 at 7:48 PM Stephen Frost  wrote:
>
> Greetings,
>
> * David Christensen (david.christen...@crunchydata.com) wrote:
> > Refreshing this with HEAD as of today, v4.
>
> Thanks for updating this!

Thanks for the patience in my response here.

> > Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> >
> > This space is reserved for extended data on the Page structure which will 
> > be ultimately used for
> > encrypted data, extended checksums, and potentially other things.  This 
> > data appears at the end of
> > the Page, after any `pd_special` area, and will be calculated at runtime 
> > based on specific
> > ControlFile features.
> >
> > No effort is made to ensure this is backwards-compatible with existing 
> > clusters for `pg_upgrade`, as
> > we will require logical replication to move data into a cluster with 
> > different settings here.
>
> This initial patch, at least, does maintain pg_upgrade as the
> reserved_page_size (maybe not a great name?) is set to 0, right?
> Basically this is just introducing the concept of a reserved_page_size
> and adjusting all of the code that currently uses BLCKSZ or
> PageGetPageSize() to account for this extra space.

Correct; a reserved_page_size of 0 would be the same page format as
currently exists, so you could use pg_upgrade with no page features
and be binary compatible with existing clusters.

> Looking at the changes to bufpage.h, in particular ...
>
> > diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
>
> > @@ -19,6 +19,14 @@
> >  #include "storage/item.h"
> >  #include "storage/off.h"
> >
> > +extern PGDLLIMPORT int reserved_page_size;
> > +
> > +#define SizeOfPageReservedSpace() reserved_page_size
> > +#define MaxSizeOfPageReservedSpace 0
> > +
> > +/* strict upper bound on the amount of space occupied we have reserved on
> > + * pages in this cluster */
>
> This will eventually be calculated based on what features are supported
> concurrently?

Correct; these are fleshed out in later patches.

> > @@ -36,10 +44,10 @@
> >   * |             v pd_upper                                      |
> >   * +-------------+-----------------------------------------------+
> >   * |             | tupleN ...                                    |
> > - * +-------------+-------------------------+---------------------+
> > - * | ... tuple3 tuple2 tuple1              |   "special space"   |
> > - * +---------------------------------------+---------------------+
> > - *                                         ^ pd_special
> > + * +---------------------+-----------------+---------------------+
> > + * | ... tuple2 tuple1   | "special space" |     "reserved"      |
> > + * +---------------------+-----------------+---------------------+
> > + *                       ^ pd_special      ^ reserved_page_space
>
> Right, adds a dynamic amount of space 'post-special area'.

Dynamic as in "fixed at initdb time" instead of compile time. However,
things are coded in such a way that the page feature bitmap is stored
on a given page, so different pages could have different
reserved_page_size depending on use case/code path. (Basically
preserving future flexibility while minimizing code changes here.)  We
could utilize different features depending on what type of page it is,
say, or have different relations or tablespaces with different page
feature defaults.

> > @@ -73,6 +81,8 @@
> >   * stored as the page trailer.  an access method should always
> >   * initialize its pages with PageInit and then set its own opaque
> >   * fields.
> > + *
> > + * XXX - update more comments here about reserved_page_space
> >   */
>
> Would be good to do. ;)

Next revision... :D

> > @@ -325,7 +335,7 @@ static inline void
> >  PageValidateSpecialPointer(Page page)
> >  {
> >   Assert(page);
> > - Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> > + Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);
> >   Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);
> >  }
>
> This is just one usage ... but seems like maybe we should be using
> PageGetPageSize() here instead of BLCKSZ, and that more-or-less
> throughout?  Nearly everywhere we're using BLCKSZ today to give us that
> compile-time advantage of a fixed block size is going to lose that
> advantage anyway thanks to reserved_page_size being run-time.  Now, one
> up-side to this is that it'd also get us closer to being able to support
> dynamic block sizes concurrently which would be quite interesting.  That
> is, a special tablespace with a 32KB block size while the rest are the
> traditional 8KB.  This would likely require multiple shared buffer
> pools, of course...

I think multiple shared-buffer pools are a ways off; but sure, this
would support this sort of use case as well.  I am working on a new
patch for this series (probably the first one in the series) which
will actually just abstract away all existing compile-time usages of
BLCKSZ.  

Re: [PATCHES] Post-special page storage TDE support

2023-05-12 Thread Stephen Frost
Greetings,

* David Christensen (david.christen...@crunchydata.com) wrote:
> Refreshing this with HEAD as of today, v4.

Thanks for updating this!

> Subject: [PATCH v4 1/3] Add reserved_page_space to Page structure
> 
> This space is reserved for extended data on the Page structure which will be 
> ultimately used for
> encrypted data, extended checksums, and potentially other things.  This data 
> appears at the end of
> the Page, after any `pd_special` area, and will be calculated at runtime 
> based on specific
> ControlFile features.
> 
> No effort is made to ensure this is backwards-compatible with existing 
> clusters for `pg_upgrade`, as
> we will require logical replication to move data into a cluster with 
> different settings here.

This initial patch, at least, does maintain pg_upgrade as the
reserved_page_size (maybe not a great name?) is set to 0, right?
Basically this is just introducing the concept of a reserved_page_size
and adjusting all of the code that currently uses BLCKSZ or
PageGetPageSize() to account for this extra space.

Looking at the changes to bufpage.h, in particular ...

> diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h

> @@ -19,6 +19,14 @@
>  #include "storage/item.h"
>  #include "storage/off.h"
>  
> +extern PGDLLIMPORT int reserved_page_size;
> +
> +#define SizeOfPageReservedSpace() reserved_page_size
> +#define MaxSizeOfPageReservedSpace 0
> +
> +/* strict upper bound on the amount of space occupied we have reserved on
> + * pages in this cluster */

This will eventually be calculated based on what features are supported
concurrently?

> @@ -36,10 +44,10 @@
>   * |             v pd_upper                                      |
>   * +-------------+-----------------------------------------------+
>   * |             | tupleN ...                                    |
> - * +-------------+-------------------------+---------------------+
> - * | ... tuple3 tuple2 tuple1              |   "special space"   |
> - * +---------------------------------------+---------------------+
> - *                                         ^ pd_special
> + * +---------------------+-----------------+---------------------+
> + * | ... tuple2 tuple1   | "special space" |     "reserved"      |
> + * +---------------------+-----------------+---------------------+
> + *                       ^ pd_special      ^ reserved_page_space

Right, adds a dynamic amount of space 'post-special area'.

> @@ -73,6 +81,8 @@
>   * stored as the page trailer.  an access method should always
>   * initialize its pages with PageInit and then set its own opaque
>   * fields.
> + *
> + * XXX - update more comments here about reserved_page_space
>   */

Would be good to do. ;)

> @@ -325,7 +335,7 @@ static inline void
>  PageValidateSpecialPointer(Page page)
>  {
>   Assert(page);
> - Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> + Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);
>   Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);
>  }

This is just one usage ... but seems like maybe we should be using
PageGetPageSize() here instead of BLCKSZ, and that more-or-less
throughout?  Nearly everywhere we're using BLCKSZ today to give us that
compile-time advantage of a fixed block size is going to lose that
advantage anyway thanks to reserved_page_size being run-time.  Now, one
up-side to this is that it'd also get us closer to being able to support
dynamic block sizes concurrently which would be quite interesting.  That
is, a special tablespace with a 32KB block size while the rest are the
traditional 8KB.  This would likely require multiple shared buffer
pools, of course...

> diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
> index 9a302ddc30..a93cd9df9f 100644
> --- a/src/backend/storage/page/bufpage.c
> +++ b/src/backend/storage/page/bufpage.c
> @@ -26,6 +26,8 @@
>  /* GUC variable */
>  bool ignore_checksum_failure = false;
>  
> +int  reserved_page_size = 0; /* how much page space to reserve for extended unencrypted metadata */
> +
>  
>  /* 
>   *   Page support functions
> @@ -43,7 +45,7 @@ PageInit(Page page, Size pageSize, Size specialSize)
>  {
>   PageHeader  p = (PageHeader) page;
>  
> - specialSize = MAXALIGN(specialSize);
> + specialSize = MAXALIGN(specialSize) + reserved_page_size;

Rather than make it part of specialSize, I would think we'd be better
off just treating them independently.  Eg, the later pd_upper setting
would be done by:

p->pd_upper = pageSize - specialSize - reserved_page_size;

etc.

> @@ -186,7 +188,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
>   *   one that is both unused and deallocated.
>   *
>   *   If flag PAI_IS_HEAP is set, we enforce that there can't be more than
> - *   

Re: [PATCHES] Post-special page storage TDE support

2022-11-08 Thread David Christensen
Looking into some CF bot failures which didn't show up locally.  Will
send a v3 when resolved.




Re: [PATCHES] Post-special page storage TDE support

2022-10-31 Thread Matthias van de Meent
On Sat, 29 Oct 2022 at 00:25, David Christensen
 wrote:
>
> Hi Matthias,
>
> > Did you read the related thread with related discussion from last June, 
> > "Re: better page-level checksums" [0]? In that I argued that space at the 
> > end of a page is already allocated for the AM, and that reserving variable 
> > space at the end of the page for non-AM usage is wasting the AM's 
> > performance potential.
>
> Yes, I had read parts of that thread among others, but have given it a
> re-read.  I can see the point you're making here, and agree that if we
> can allocate between pd_special and pd_upper that could make sense.  I
> am a little unclear as to what performance impacts for the AM there
> would be if this additional space were ahead or behind the page
> special area; it seems like if this is something that needs to live on
> the page *somewhere* just being aligned correctly would be sufficient
> from the AM's standpoint.

It would be sufficient, but it is definitely suboptimal. See [0] as a
patch that is being held back by putting stuff behind the special
area.

I don't really care much about the storage layout on-disk, but I do
care that AMs have efficient access to their data. For the page
header, line pointers, and special area, that is currently guaranteed
by the current page layout. However, for the special area, that
currently guaranteed offset of (BLCKSZ -
MAXALIGN(sizeof(IndexOpaque))) will get broken as there would be more
space in the special area than the AM would be expecting. Right now,
our index AMs are doing pointer chasing during special area lookups
for no good reason, but with the patch it would be required. I don't
like that at all.

I understand that it is a requirement to store this reserved space in a
fixed place on the on-disk page (you must know where the checksum is at a
static place on the page, otherwise you'd potentially mis-validate a page),
but that requirement is not there for in-memory storage. I think it's a
small price to pay to swap the fields around
during R/W operations - the largest size of special area is currently
24 bytes, and the proposals I've seen for this extra storage area
would not need it to be actually filled with data whilst the page is
being used by the AM (checksum could be zeroed in in-memory
operations, and it'd get set during writeback; same with all other
fields I can imagine the storage system using).

> Considering that I am trying to make this
> have zero storage impact if these features are not active, the impact
> on a cluster with no additional features would be moot from a storage
> perspective, no?

The issue is that I'd like to eliminate the redirection from the page
header in the hot path. Currently, we can do that, and pd_special
would be little more than a hint to the smgr and pd_linp code that
that area is special and reserved for this access method's private
data, so that it is not freed. If you stick something extra in there,
it's not special for the AM's private data, and the AM won't be able
to use pd_special for similar uses as pd_linp+pd_lower. I'd rather
have the storage system use its own not-special area; choreographed
by e.g. a reuse of pd_checksum for one more page offset. Swapping the
fields around between on-disk and in-memory doesn't need to be an
issue, as special areas are rarely very large.

Every index type we support utilizes the special area. Wouldn't those
in-memory operations have priority on this useful space, as opposed to
a storage system that maybe will be used in new clusters, and even
then only during R/W operations to disk (each at most once for N
memory operations)?

> > Apart from that: Is this variable-sized 'metadata' associated with smgr 
> > infrastructure only, or is it also available for AM features? If not; then 
> > this is a strong -1. The amount of tasks smgr needs to do on a page is 
> > generally much less than the amount of tasks an AM needs to do; so in my 
> > view the AM has priority in prime page real estate, not smgr or related 
> > infrastructure.
>
> I will confess to a slightly wobbly understanding of the delineation
> of responsibility here. I was under the impression that by modifying
> any consumer of PageHeaderData this would be sufficient to cover all
> AMs for the types of cluster-wide options we'd be concerned about (say
> extended checksums, multiple page encryption schemes, or other
> per-page information we haven't yet anticipated).  Reading smgr/README
> and the various access/*/README has not made the distinction clear to
> me yet.

pd_special has (historically) been reserved for access methods'
page-level private data. If you add to this area, shouldn't that be
space that the AM should be able to hook into as well? Or are all
those features limited to the storage system only; i.e. the storage
system decides what's best for the AM's page handling w.r.t. physical
storage?

> > re: PageFeatures
> > I'm not sure I understand the goal, nor the reasoning. Shouldn't 

Re: [PATCHES] Post-special page storage TDE support

2022-10-28 Thread David Christensen
Hi Matthias,

> Did you read the related thread with related discussion from last June, "Re: 
> better page-level checksums" [0]? In that I argued that space at the end of a 
> page is already allocated for the AM, and that reserving variable space at 
> the end of the page for non-AM usage is wasting the AM's performance 
> potential.

Yes, I had read parts of that thread among others, but have given it a
re-read.  I can see the point you're making here, and agree that if we
can allocate between pd_special and pd_upper that could make sense.  I
am a little unclear as to what performance impacts for the AM there
would be if this additional space were ahead or behind the page
special area; it seems like if this is something that needs to live on
the page *somewhere* just being aligned correctly would be sufficient
from the AM's standpoint.  Considering that I am trying to make this
have zero storage impact if these features are not active, the impact
on a cluster with no additional features would be moot from a storage
perspective, no?

> Apart from that: Is this variable-sized 'metadata' associated with smgr 
> infrastructure only, or is it also available for AM features? If not; then 
> this is a strong -1. The amount of tasks smgr needs to do on a page is 
> generally much less than the amount of tasks an AM needs to do; so in my view 
> the AM has priority in prime page real estate, not smgr or related 
> infrastructure.

I will confess to a slightly wobbly understanding of the delineation
of responsibility here. I was under the impression that by modifying
any consumer of PageHeaderData this would be sufficient to cover all
AMs for the types of cluster-wide options we'd be concerned about (say
extended checksums, multiple page encryption schemes, or other
per-page information we haven't yet anticipated).  Reading smgr/README
and the various access/*/README has not made the distinction clear to
me yet.

> re: PageFeatures
> I'm not sure I understand the goal, nor the reasoning. Shouldn't this be part 
> of the storage manager (smgr) implementation / can't this be part of the smgr 
> of the relation?

For at least the feature cases I'm anticipating, this would apply to
any disk page that may have user data, set (at least initially) at
initdb time, so should apply to any pages in the cluster, regardless
of AM.

> re: use of pd_checksum
> I mentioned this in the above-mentioned thread too, in [1], that we could use 
> pd_checksum as an extra area marker for this storage-specific data, which 
> would be located between pd_upper and pd_special.

I do think that we could indeed use this as an additional in-page
pointer, but at least for this version I was keeping things
backwards-compatible.  Peter G (I think) also made some good points
about how to include the various status bits on the page somehow in
terms of making a page completely self-contained.
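
If we did go that route, I would picture something along these lines
(hypothetical sketch; the posted patches deliberately leave pd_checksum
untouched):

    /*
     * pd_checksum repurposed as a page offset, marking a metadata area
     * that sits between pd_upper and pd_special:
     */
    static inline char *
    PageGetExtendedMetadata(Page page)
    {
        PageHeader  phdr = (PageHeader) page;

        Assert(phdr->pd_upper <= phdr->pd_checksum &&
               phdr->pd_checksum <= phdr->pd_special);
        return (char *) page + phdr->pd_checksum;
    }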

> Re: patch contents
>
> 0001:
> >+ specialSize = MAXALIGN(specialSize) + reserved_page_size;
>
> This needs to be aligned, so MAXALIGN(specialSize + reserved_page_size), or 
> an assertion that reserved_page_size is MAXALIGNED, would be better.

It is currently aligned via the space calculation return value, but I
agree that folding it into an assertion or reworking it explicitly
would be clearer.
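
That is, something along these lines (a sketch of the rework being
agreed to here, not the final form):

    /*
     * reserved_page_size is fixed at startup and must itself be
     * MAXALIGN'd, so the resulting specialSize stays aligned for the AM:
     */
    Assert(reserved_page_size == MAXALIGN(reserved_page_size));
    specialSize = MAXALIGN(specialSize) + reserved_page_size;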

> >  PageValidateSpecialPointer(Page page)
> >  {
> >  Assert(page);
> > -Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> > +Assert((((PageHeader) page)->pd_special - reserved_page_size) <=
> > BLCKSZ);
>
> This check is incorrect. With your code it would allow pd_special past the 
> end of the block. If you want to put the reserved_page_size effectively
> inside the special area, this check should instead be:
>
> +Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size));
>
> Or, equally valid
>
> +Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);

Yup, I think I inverted my logic there; thanks.
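
With the fix folded in, the validation would read (a sketch following
your first form; the second assertion is the stock one from bufpage.h):

    static inline void
    PageValidateSpecialPointer(Page page)
    {
        Assert(page);
        Assert(((PageHeader) page)->pd_special <=
               (BLCKSZ - reserved_page_size));
        Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);
    }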

> > + * +-------------------+-----------------+------------+
> > + * | ... tuple2 tuple1 | "special space" | "reserved" |
> > + * +---------------------+---------------+------------+
>
> Could you fix the table display if / when you revise the patchset? It seems 
> to me that the corners don't line up with the column borders.

Sure thing.

> 0002:
> > Make the output of "select_views" test stable
> > Changing the reserved_page_size has resulted in non-stable results for this 
> > test.
>
> This makes sense; what kind of instability are we talking about? Are there
> different results for runs with the same binary, or is this across 
> compilations?

When running with the same compilation/initdb settings, the test
results are stable, but they differ depending on which options were
chosen, so `make installcheck` will fail when run against a cluster
initialized with different options than the expected output assumes
(e.g. upstream HEAD without these patches).

> 0003 and up were not yet reviewed in depth.

Thanks, I appreciate the feedback so far.




Re: [PATCHES] Post-special page storage TDE support

2022-10-27 Thread Matthias van de Meent
Hi

On Mon, 24 Oct 2022, 19:56 David Christensen,
<david.christen...@crunchydata.com> wrote:
>
> Discussion is welcome and encouraged!

Did you read the related thread from last June, "Re: better page-level
checksums" [0]? There I argued that space at the end of a page is
already allocated for the AM, and that reserving variable space at the
end of the page for non-AM usage wastes the AM's performance potential.

Apart from that: Is this variable-sized 'metadata' associated with smgr
infrastructure only, or is it also available for AM features? If not,
then this is a strong -1. The number of tasks smgr needs to perform on
a page is generally far smaller than the number an AM needs to perform;
so in my view the AM has priority in prime page real estate, not smgr
or related infrastructure.

re: PageFeatures
I'm not sure I understand the goal, nor the reasoning. Shouldn't this be
part of the storage manager (smgr) implementation / can't this be part of
the smgr of the relation?

re: use of pd_checksum
I mentioned this in the above-mentioned thread too, in [1], that we could
use pd_checksum as an extra area marker for this storage-specific data,
which would be located between pd_upper and pd_special.

Re: patch contents

0001:
>+ specialSize = MAXALIGN(specialSize) + reserved_page_size;

This needs to be aligned, so MAXALIGN(specialSize + reserved_page_size), or
an assertion that reserved_page_size is MAXALIGNED, would be better.

>  PageValidateSpecialPointer(Page page)
>  {
>  Assert(page);
> -Assert(((PageHeader) page)->pd_special <= BLCKSZ);
> +Assert((((PageHeader) page)->pd_special - reserved_page_size) <= BLCKSZ);

This check is incorrect. With your code it would allow pd_special past the
end of the block. If you want to put the reserved_page_size effectively
inside the special area, this check should instead be:

+Assert(((PageHeader) page)->pd_special <= (BLCKSZ - reserved_page_size));

Or, equally valid

+Assert((((PageHeader) page)->pd_special + reserved_page_size) <= BLCKSZ);

> + * +-------------------+-----------------+------------+
> + * | ... tuple2 tuple1 | "special space" | "reserved" |
> + * +---------------------+---------------+------------+

Could you fix the table display if / when you revise the patchset? It seems
to me that the corners don't line up with the column borders.

0002:
> Make the output of "select_views" test stable
> Changing the reserved_page_size has resulted in non-stable results for
> this test.

This makes sense; what kind of instability are we talking about? Are there
different results for runs with the same binary, or is this across
compilations?

0003 and up were not yet reviewed in depth.


Kind regards,

Matthias van de Meent


[0] https://www.postgresql.org/message-id/flat/CA%2BTgmoaCeQ2b-BVgVfF8go8zFoceDjJq9w4AFQX7u6Acfdn2uA%40mail.gmail.com#90badc63e568a89a76f51fc95f07ffaf
[1] https://postgr.es/m/CAEze2Wi5wYinU7nYxyKe_C0DRc6uWYa8ivn5%3Dzg63nKtHBnn8A%40mail.gmail.com


Re: [PATCHES] Post-special page storage TDE support

2022-10-25 Thread David Christensen
> > Explicitly
> > locking (assuming you stay in your lane) should only need to guard
> > against access from other
> > backends of this type if using shared buffers, so will be use-case 
> > dependent.
>
> I'm not sure what you mean here?

I'm mainly pointing out that the specific code that manages this
feature is the only code that has to worry about modifying that page
region.

> > This does have a runtime overhead due to moving some offset
> > calculations from compile time to
> > runtime. It is thought that the utility of this feature will outweigh
> > the costs here.
>
> Have you done some benchmarking to give an idea of how much overhead we're
> talking about?

Not yet, but I am going to work on this.  I suspect the current code
could be improved, but I will try to get some concrete measurement of
the additional overhead.
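
To show where that overhead enters, compare (a sketch; the runtime
variable's name here is illustrative, not the patch's):

    /*
     * Before: a compile-time constant (this is the stock definition
     * from access/htup_details.h), folded into each use site by the
     * compiler:
     */
    #define MaxHeapTuplesPerPage \
        ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
                (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

    /*
     * After (sketch): computed once at startup from the usable page
     * size, then read through a global at each use site, so the
     * compiler can no longer constant-fold around it:
     */
    extern int  max_heap_tuples_per_page;   /* hypothetical name */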

> > Candidates for page features include 32-bit or 64-bit checksums,
> > encryption tags, or additional
> > per-page metadata.
> >
> > While we are not currently getting rid of the pd_checksum field, this
> > mechanism could be used to
> > free up that 16 bits for some other purpose.
>
> IIUC there's a hard requirement of initdb-time initialization, as there's
> otherwise no guarantee that you will find enough free space in each page at
> runtime.  It seems like a very hard requirement for a full replacement of the
> current checksum approach (even if I agree that the current implementation
> limitations are far from ideal), especially since there's no technical reason
> that would prevent us from dynamically enabling data-checksums without doing
> all the work when the cluster is down.

As implemented, that is correct; we are currently assuming this
specific feature mechanism is set at initdb time only.  Checksums are
not the primary motivation here, but were something that I could use
for an immediate illustration of the feature.  That said, presumably
you could define a way to set the features per relation (say, with a
template field in pg_class) that would propagate to the relation on
rewrite, so there could be ways to handle things incrementally if that
were an overall goal.

Thanks for looking,

David




Re: [PATCHES] Post-special page storage TDE support

2022-10-24 Thread Julien Rouhaud
Hi,

On Mon, Oct 24, 2022 at 12:55:53PM -0500, David Christensen wrote:
>
> Explicitly
> locking (assuming you stay in your lane) should only need to guard
> against access from other
> backends of this type if using shared buffers, so will be use-case dependent.

I'm not sure what you mean here?

> This does have a runtime overhead due to moving some offset
> calculations from compile time to
> runtime. It is thought that the utility of this feature will outweigh
> the costs here.

Have you done some benchmarking to give an idea of how much overhead we're
talking about?

> Candidates for page features include 32-bit or 64-bit checksums,
> encryption tags, or additional
> per-page metadata.
>
> While we are not currently getting rid of the pd_checksum field, this
> mechanism could be used to
> free up that 16 bits for some other purpose.

IIUC there's a hard requirement of initdb-time initialization, as there's
otherwise no guarantee that you will find enough free space in each page at
runtime.  It seems like a very hard requirement for a full replacement of the
current checksum approach (even if I agree that the current implementation
limitations are far from ideal), especially since there's no technical reason
that would prevent us from dynamically enabling data-checksums without doing
all the work when the cluster is down.