Re: Proposal: SLRU to Buffer Cache

2018-08-22 Thread Andres Freund
Hi,

On 2018-08-22 13:35:47 +0500, Andrey Borodin wrote:
> > On 15 Aug 2018, at 2:35, Shawn Debnath wrote:
> > 
> > At the Unconference in Ottawa this year, I pitched the idea of moving
> > components off of SLRU and on to the buffer cache. The motivation
> > behind the idea was three fold:
> > 
> >  * Improve performance by eliminating fixed-size caches and simplistic
> >    scan and eviction algorithms.
> >  * Ensure durability and consistency by tracking LSNs and checksums
> >    per block.
> +1, I like this idea more than the current CF patch that adds checksums
> for SLRU pages.

Yea, I don't think it really makes sense to reimplement this logic for
SLRUs (and then UNDO) separately.
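
For context, the per-block LSN and checksum tracking that regular relation
pages already get comes from the standard page header in
src/include/storage/bufpage.h; pages that migrate from SLRU into shared
buffers would presumably reuse these fields rather than growing
SLRU-specific logic. The existing struct, slightly abridged:

typedef struct PageHeaderData
{
    PageXLogRecPtr pd_lsn;      /* LSN of last change to this page */
    uint16      pd_checksum;    /* checksum */
    uint16      pd_flags;       /* flag bits */
    LocationIndex pd_lower;     /* offset to start of free space */
    LocationIndex pd_upper;     /* offset to end of free space */
    LocationIndex pd_special;   /* offset to start of special space */
    uint16      pd_pagesize_version;
    TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
    ItemIdData  pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;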


> >  1. Implement a generic block storage manager that parameterizes
> > several options like segment sizes, fork and segment naming and
> > path schemes, concepts entrenched in md.c that are strongly tied to
> > relations. To mitigate risk, I am planning on not modifying md.c
> > for the time being.
> Probably I'm missing something, but why shouldn't this be in access
> methods?

I think it's not an absurd idea to put the reserved oid into pg_am
(under a separate amtype), although the fact that shared entries would
be in database-local tables is a bit weird. But I'm fairly certain that
we'd not put any actual data into it, not least because we need to be
able to access clog etc. from connections that cannot attach to a
database (say the startup process, which will never ever start reading
from a catalog table).  So I don't really see what you mean with:

> You can extend an AM to control its segment size and its ability to
> truncate unneeded pages. This may be useful, for example, in an LSM
> tree implementation or something similar.

that doesn't really seem like it could work. Nor am I even clear what
the above points really have to do with the AM layer.
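
For concreteness, the "resolvable without touching a catalog" requirement
could be met by a small mapping compiled into the server; everything below
(struct, names, OID values) is invented here purely for illustration, not
taken from any patch:

/* Hypothetical sketch: reserved pseudo-database OIDs and their components,
 * compiled in so that auxiliary processes (e.g. the startup process) can
 * resolve them without any catalog access.  Names and values are invented. */
typedef struct ReservedSmgrComponent
{
    Oid         dboid;          /* reserved pseudo-database OID */
    const char *name;           /* human-readable component name */
    int         smgr_which;     /* which smgr implementation handles it */
} ReservedSmgrComponent;

static const ReservedSmgrComponent ReservedSmgrComponents[] = {
    {9000, "clog", 1},              /* OIDs and smgr ids invented */
    {9001, "multixact_offsets", 1},
    {9002, "multixact_members", 1},
    {9003, "undo", 2},
};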

Greetings,

Andres Freund



Re: Proposal: SLRU to Buffer Cache

2018-08-22 Thread Andrey Borodin
Hi!

> On 15 Aug 2018, at 2:35, Shawn Debnath wrote:
> 
> At the Unconference in Ottawa this year, I pitched the idea of moving
> components off of SLRU and on to the buffer cache. The motivation
> behind the idea was three fold:
> 
>  * Improve performance by eliminating fixed-size caches and simplistic
>    scan and eviction algorithms.
>  * Ensure durability and consistency by tracking LSNs and checksums
>    per block.
+1, I like this idea more than the current CF patch that adds checksums
for SLRU pages.

>  1. Implement a generic block storage manager that parameterizes
> several options like segment sizes, fork and segment naming and
> path schemes, concepts entrenched in md.c that are strongly tied to
> relations. To mitigate risk, I am planning on not modifying md.c
> for the time being.
Probably I'm missing something, but why shouldn't this be in access methods?
You can extend an AM to control its segment size and its ability to truncate
unneeded pages. This may be useful, for example, in an LSM tree implementation
or something similar.

Best regards, Andrey Borodin.


Re: Proposal: SLRU to Buffer Cache

2018-08-21 Thread Andres Freund
Hi,

On 2018-08-21 09:53:21 -0400, Shawn Debnath wrote:
> > I was wondering what the point of exposing the OIDs to users in a
> > catalog would be though.  It's not necessary to do that to reserve
> > them (and even if it were, pg_database would be the place): the OIDs
> > we choose for undo, clog, ... just have to be in the system reserved
> > range to be safe from collisions.

Maybe I'm missing something, but how are conflicts prevented just by
being in the system range?  There are very commonly multiple patches
trying to use the same oid, and such conflicts are only discovered by
the 'duplicate_oids' script. But if there's no catalog representation, I
don't see how that script would discover them?



> > I suppose one benefit would be the
> > ability to join e.g. pg_buffercache against it to get a human-readable
> > name like "clog", but that'd be slightly odd because the DB OID field
> > would refer to entries in pg_database or pg_storage_manager depending
> > on the number range.

> Good points. However, there are very few cases where our internal
> representation using DB OIDs will be exposed, one such being
> pg_buffercache. I'm wondering whether updating the documentation here would
> be sufficient, as pg_buffercache is an extension used by developers and DBEs
> rather than by consumers. We can circle back to this after the initial
> set of patches are out.

Showing the oids in pg_database or such seems like it'd make it a bit
harder to change later, because people will rely on things like joining
against it.  I don't think I like that.  I'm kinda inclined toward
something somewhat crazy like instead having a reserved & shared pg_class
entry or such.  I don't like that that much either. Hm.


> > >   5. Due to the on-disk format changes, simply copying the segments
> > >  during upgrade wouldn't work anymore. Given the nature of data
> > >  stored within SLRU segments today, we can extend pg_upgrade to
> > >  translate the segment files by scanning from relfrozenxid and
> > >  relminmxid and recording the corresponding values at the new
> > >  offsets in the target segments.
> > 
> > +1
> > 
> > (Hmm, if we're going to change all this stuff, I wonder if there would
> > be any benefit to switching to 64 bit xids for the xid-based SLRUs
> > while we're here...)
> 
> Do you mean switching or reserving space for it on the block? The latter 
> I hope :-)

I'd make the addressing work in a way that never requires wraparounds,
but instead allows trimming at the beginning. That shouldn't require
any additional space, while allowing a full switch to 64-bit xids.
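
A rough sketch of what "never requires wraparounds" could look like for
clog, assuming 64-bit xids (constants mirror clog.c; the arithmetic ignores
whatever space a page header would take out of each block):

#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

/* With 64-bit xids the page number grows monotonically; nothing wraps. */
static inline uint64
clog_page_for_xid(uint64 xid)
{
    return xid / CLOG_XACTS_PER_PAGE;
}

/* "Trimming at the beginning": any page entirely before the oldest xid
 * still of interest can simply be unlinked. */
static inline bool
clog_page_is_trimmable(uint64 pageno, uint64 oldest_xid)
{
    return pageno < clog_page_for_xid(oldest_xid);
}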

Greetings,

Andres Freund



Re: Proposal: SLRU to Buffer Cache

2018-08-14 Thread Thomas Munro
Hi Shawn,

On Wed, Aug 15, 2018 at 9:35 AM, Shawn Debnath  wrote:
> At the Unconference in Ottawa this year, I pitched the idea of moving
> components off of SLRU and on to the buffer cache. The motivation
> behind the idea was three fold:
>
>   * Improve performance by eliminating fixed-size caches and simplistic
>     scan and eviction algorithms.
>   * Ensure durability and consistency by tracking LSNs and checksums
>     per block.
>   * Consolidate caching strategies in the engine to simplify the
>     codebase and benefit from future buffer cache optimizations.

Thanks for working on this.  These are good goals, and I've wondered
about doing exactly this myself for exactly those reasons.  I'm sure
we're not the only ones, and I heard only positive reactions to your
unconference pitch.  As you know, my undo log storage design interacts
with the buffer manager in the same way, so I'm interested in this
subject and will be keen to review and test what you come up with.
That said, I'm fairly new here myself and there are people on this
list with a decade or two more experience hacking on the buffer
manager and transam machinery.

> As the changes are quite invasive, I wanted to vet the approach with the
> community before digging in to implementation. The changes are strictly
> on the storage side and do not change the runtime behavior or protocols.
> Here's the current approach I am considering:
>
>   1. Implement a generic block storage manager that parameterizes
>  several options like segment sizes, fork and segment naming and
>  path schemes, concepts entrenched in md.c that are strongly tied to
>  relations. To mitigate risk, I am planning on not modifying md.c
>  for the time being.

+1 for doing it separately at first.

I've also vacillated between extending md.c and doing my own
undo_file.c thing.  It seems plausible that between SLRU and undo we
could at least share a common smgr implementation, and eventually
maybe md.c.  There are a few differences though, and the question is
whether we'd want to do yet another abstraction layer with
callbacks/vtable/configuration points to handle that parameterisation,
or just use the existing indirection in smgr and call it good.
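
For what it's worth, the parameterisation being discussed could be as small
as a per-component descriptor handed to the new smgr implementation (the
existing indirection being the f_smgr callback table in
src/backend/storage/smgr/smgr.c). A purely hypothetical sketch, nothing
here is taken from an actual patch:

/* Hypothetical: the knobs a generic block storage manager might expose,
 * covering the things md.c currently hard-codes for relations. */
typedef struct GenericBlockStorageOptions
{
    const char *base_dir;            /* e.g. "pg_xact", "pg_multixact/offsets" */
    const char *segment_name_fmt;    /* segment file naming scheme */
    int         blocks_per_segment;  /* segment size in blocks */
    int         nforks;              /* number of forks the component uses */
    bool        unlink_on_truncate;  /* delete empty segment files */
} GenericBlockStorageOptions;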

I'm keen to see what you come up with.  After we have a patch to
refactor and generalise the fsync stuff from md.c (about which more
below), let's see what is left and whether we can usefully combine
some code.

>   2. Introduce a new smgr_truncate_extended() API to allow truncation of
>      a range of blocks starting at a specific offset, with an option to
>      delete the file instead of simply truncating.

Hmm.  In my undo proposal I'm currently implementing only the minimum
smgr interface required to make bufmgr.c happy (basically read and
write blocks), but I'm managing segment files (creating, deleting,
recycling) directly via a separate interface UndoLogAllocate(),
UndoLogDiscard() defined in undolog.c.  That seemed necessary for me
because that's where I had machinery to track the meta-data (mostly
head and tail pointers) for each undo log explicitly, but I suppose I
could use a wider smgr interface as you are proposing to move the
filesystem operations over there.  Perhaps I should reconsider that
split.  I look forward to seeing your code.
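
For reference, my guess at what item 2's API could look like next to the
existing smgrtruncate() declaration in smgr.h; the extended signature is
only a guess, not the proposal's actual definition:

/* Existing interface: truncate the fork down to nblocks. */
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
                         BlockNumber nblocks);

/* Guessed shape of the proposed extension: truncate a range of blocks
 * starting at a given offset, optionally unlinking the affected segment
 * files instead of just truncating them. */
extern void smgr_truncate_extended(SMgrRelation reln, ForkNumber forknum,
                                   BlockNumber startblk, BlockNumber nblocks,
                                   bool unlink_files);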

>   3. I will continue to use the RelFileNode/SMgrRelation constructs
>  through the SMgr API. I will reserve OIDs within the engine that we
>  can use as DB ID in RelFileNode to determine which storage manager
>  to associate for a specific SMgrRelation. To increase the
>  visibility of the OID mappings to the user, I would expose a new
>  catalog where the OIDs can be reserved and mapped to existing
>  components for template db generation. Internally, SMgr wouldn't
>  rely on catalogs, but instead will have them defined in code to not
>  block bootstrap. This scheme should be compatible with the undo log
>  storage work by Thomas Munro, et al. [0].

+1 for the pseudo-DB OID scheme, for now.  I think we can reconsider
how we want to structure buffer tags in the longer term as part of
future projects that overhaul buffer mapping.  We shouldn't get hung
up on that now.
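
To make the pseudo-DB OID scheme concrete: RelFileNode (and hence the
buffer tag) stays exactly as it is today, and a reserved dbNode value simply
routes the SMgrRelation to a different implementation. The struct is real;
the reserved range and dispatch helper below are invented for illustration:

/* src/include/storage/relfilenode.h (unchanged) */
typedef struct RelFileNode
{
    Oid         spcNode;        /* tablespace */
    Oid         dbNode;         /* database */
    Oid         relNode;        /* relation */
} RelFileNode;

/* Hypothetical dispatch: a dbNode in a reserved, sub-FirstNormalObjectId
 * range selects the generic block storage manager instead of md.c.
 * The range and return values are invented for illustration. */
static int
smgr_which_for(RelFileNode rnode)
{
    if (rnode.dbNode >= 9000 && rnode.dbNode < 10000)
        return 1;               /* generic block storage manager */
    return 0;                   /* md.c */
}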

I was wondering what the point of exposing the OIDs to users in a
catalog would be though.  It's not necessary to do that to reserve
them (and even if it were, pg_database would be the place): the OIDs
we choose for undo, clog, ... just have to be in the system reserved
range to be safe from collisions.  I suppose one benefit would be the
ability to join e.g. pg_buffercache against it to get a human-readable
name like "clog", but that'd be slightly odd because the DB OID field
would refer to entries in pg_database or pg_storage_manager depending
on the number range.

>   4. For each component that will be transitioned over to the generic
>  block storage, I will introduce a page header at the beginning of
>  the block and re-work the associated offset