Re: [HACKERS] heap metapages

2012-05-25 Thread Simon Riggs
On 24 May 2012 23:02, Bruce Momjian br...@momjian.us wrote:
 On Tue, May 22, 2012 at 09:52:30AM +0100, Simon Riggs wrote:
 Having pg_upgrade touch data files is both dangerous and difficult to
 back out in case of mistake, so I am wary of putting the metapage at
 block 0. Doing it the way I suggest means the .meta files would be
 wholly new and can be deleted as a back-out. We can also clean away
 any unnecessary .vm/.fsm files as a later step.

 Pg_upgrade never modifies the old cluster, except to lock it in link
 mode, so there is never anything to back out.

Agreed. Robert's proposal was to make pg_upgrade modify the cluster,
which I was observing wasn't a good plan.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] heap metapages

2012-05-25 Thread Jim Nasby

On 5/22/12 12:09 PM, Simon Riggs wrote:

On 22 May 2012 13:52, Robert Haas robertmh...@gmail.com wrote:


It seems pretty clear to me that making pg_upgrade responsible for
emptying block zero is a non-starter.  But I don't think that's a
reason to throw out the design; I think it's a problem we can work
around.


I like your design better as well *if* you can explain how we can get
to it. My proposal was a practical alternative that would allow the
idea to proceed.


It occurred to me that having a metapage with information useful to recovery 
operations in *every segment* would be useful; it certainly seems worth the 
extra block. It then occurred to me that we've basically been stuck with 2 
places to store relation data; either at the relation level in pg_class or on 
each page. Sometimes neither one is a good fit.

ISTM that a lot of problems we've faced in the past few years are because 
there's not a good abstraction between a (mostly) linear tuplespace and the 
physical storage that goes underneath it.

- pg_upgrade progress is blocked because we can't deal with a new page that's > BLKSZ
- There's no good way to deal with table (or worse, index) bloat
- There's no good way to add the concept of a heap metapage
- Forks are being used to store data that might not belong there only because 
there's no other choice (visibility info)

Would it make sense to take a step back and think about ways to abstract 
between logical tuplespace and physical storage? What if 1GB segments had their 
own metadata? Or groups of segments? Could certain operations that currently 
have to rewrite an entire table be changed so that they slowly moved pages from 
one group of segments to another, with a means of marking old pages as having 
been moved?

Einstein said that problems cannot be solved by the same level of thinking that 
created them. Perhaps we're at the point where we need to take a step back from our 
current storage organization and look for a bigger picture?
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


Re: [HACKERS] heap metapages

2012-05-25 Thread Robert Haas
On Fri, May 25, 2012 at 5:57 PM, Jim Nasby j...@nasby.net wrote:
 It occurred to me that having a metapage with information useful to recovery
 operations in *every segment* would be useful; it certainly seems worth the
 extra block. It then occurred to me that we've basically been stuck with 2
 places to store relation data; either at the relation level in pg_class or
 on each page. Sometimes neither one is a good fit.

AFAICS, having metadata in every segment is mostly only helpful for
recovering from the situation where files have become disassociated
from their filenames, i.e. database -> lost+found.  From the
viewpoint of virtually the entire server, the block number space is just a
continuous sequence that starts at 0 and counts up forever (or,
anyway, until 2^32-1).  While it wouldn't be impossible to allow that
knowledge to percolate up to other parts of the server, it would
basically involve drilling a fairly arbitrary hole through an
abstraction boundary that has been intact for a very long time, and
it's not clear that there's anything magical about 1GB.
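
For reference, the 1GB boundary currently lives only in the storage
manager's segment arithmetic, which the rest of the server never sees; a
simplified sketch of that mapping (an illustration, not the actual md.c
code, though RELSEG_SIZE really is the blocks-per-segment constant in a
default build):

    #include <stdint.h>

    #define RELSEG_SIZE 131072u     /* blocks per segment: 1GB / 8kB pages */

    /* Map a logical block number to (segment file, block within segment).
     * Everything above the storage manager deals only in blkno. */
    static void
    block_to_segment(uint32_t blkno, uint32_t *segno, uint32_t *segoff)
    {
        *segno = blkno / RELSEG_SIZE;       /* selects e.g. relfilenode.2 */
        *segoff = blkno % RELSEG_SIZE;      /* block offset in that file */
    }
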
Notwithstanding the foregoing...

 ISTM that a lot of problems we've faced in the past few years are because
 there's not a good abstraction between a (mostly) linear tuplespace and the
 physical storage that goes underneath it.

...I agree with this.  I'm not sure exactly what the replacement model
would look like, but it's definitely worth some thought - e.g. perhaps
there ought to be another mapping layer between logical block numbers
and files on disk, so that we can effectively delete blocks out of the
middle of a relation without requiring any special OS support, and so
that we can multiplex many small relation forks onto a single physical
file to minimize inode consumption.
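
One very rough shape such a layer could take (all names here are invented
for illustration; nothing like this exists in the tree): an extent-granular
indirection table consulted before touching the filesystem, so block ranges
can be remapped, deleted from the middle of a relation, or pointed into a
file shared with other forks:

    #include <stdbool.h>
    #include <stdint.h>

    #define EXTENT_BLOCKS 1024u     /* logical blocks per mapped extent */

    typedef struct PhysExtent
    {
        uint32_t    file_id;        /* physical file; many small forks
                                     * could share one file to save inodes */
        uint32_t    file_offset;    /* extent's first block in that file */
        bool        deleted;        /* extent dropped from mid-relation */
    } PhysExtent;

    /* Resolve a logical block number to a physical (file, block) pair. */
    static bool
    extent_lookup(const PhysExtent *map, uint32_t blkno,
                  uint32_t *file_id, uint32_t *file_blkno)
    {
        const PhysExtent *e = &map[blkno / EXTENT_BLOCKS];

        if (e->deleted)
            return false;           /* block was deleted from the middle */
        *file_id = e->file_id;
        *file_blkno = e->file_offset + blkno % EXTENT_BLOCKS;
        return true;
    }
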

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] heap metapages

2012-05-24 Thread Bruce Momjian
On Tue, May 22, 2012 at 09:52:30AM +0100, Simon Riggs wrote:
 Having pg_upgrade touch data files is both dangerous and difficult to
 back out in case of mistake, so I am wary of putting the metapage at
 block 0. Doing it the way I suggest means the .meta files would be
 wholly new and can be deleted as a back-out. We can also clean away
 any unnecessary .vm/.fsm files as a later step.

Pg_upgrade never modifies the old cluster, except to lock it in link
mode, so there is never anything to back out.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: [HACKERS] heap metapages

2012-05-22 Thread Simon Riggs
On 22 May 2012 02:50, Robert Haas robertmh...@gmail.com wrote:

 Not very sure why a metapage is better than a catalog table.

 Mostly because there's no chance of the startup process accessing a
 catalog table during recovery, but it can read a metapage.

OK, sounds reasonable.

Based upon all you've said, I'd suggest that we make a new kind of
fork, in a separate file for this, .meta. But we also optimise the VM
and FSM in the way you suggest so that we can replace .vm and .fsm
with just .meta in most cases. Big tables would get a .vm and .fsm
appearing when they get big enough, but that won't challenge the inode
limits. When .vm and .fsm do appear, we remove that info from the
metapage - that means we keep all code as it is currently, except for
an optimisation of .vm and .fsm when those are small enough to do so.
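
As a sketch of the lookup rule that implies (threshold and names invented
purely for illustration): visibility info is read from the metapage while
the relation is small, and a real .vm fork takes over once it appears:

    #include <stdbool.h>
    #include <stdint.h>

    #define META_VM_BYTES 1024u     /* VM space reserved inside .meta */

    /* One VM bit per heap page, so .meta can describe this many pages. */
    #define META_VM_HEAP_PAGES (META_VM_BYTES * 8)  /* 8192 pages = 64MB */

    static bool
    vm_bit_lives_in_meta(uint32_t heap_blkno, bool vm_fork_exists)
    {
        return !vm_fork_exists && heap_blkno < META_VM_HEAP_PAGES;
    }
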

We can watermark data files using special space on block zero using
some code to sneak that in when the page is next written, but that is
regarded as optional, rather than an essential aspect of an
upgrade/normal operation.

Having pg_upgrade touch data files is both dangerous and difficult to
back out in case of mistake, so I am wary of putting the metapage at
block 0. Doing it the way I suggest means the .meta files would be
wholly new and can be deleted as a back-out. We can also clean away
any unnecessary .vm/.fsm files as a later step.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] heap metapages

2012-05-22 Thread Robert Haas
On Tue, May 22, 2012 at 4:52 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Based upon all you've said, I'd suggest that we make a new kind of
 fork, in a separate file for this, .meta. But we also optimise the VM
 and FSM in the way you suggest so that we can replace .vm and .fsm
 with just .meta in most cases. Big tables would get a .vm and .fsm
 appearing when they get big enough, but that won't challenge the inode
 limits. When .vm and .fsm do appear, we remove that info from the
 metapage - that means we keep all code as it is currently, except for
 an optimisation of .vm and .fsm when those are small enough to do so.

Well, let's see.  That would mean that a small heap relation has 2
forks instead of 3, and a large relation has 4 forks instead of 3.  In
my proposal, a small relation has 1 fork instead of 3, and a large
relation still has 3 forks.  So I like mine better.

Also, I think that we need a good chunk of the metadata here for both
tables and indexes.  For example, if we use the metapage to store
information about whether a relation is logged, unlogged, being
converted from logged to unlogged, or being converted from unlogged to
logged, we need that information both for tables and for indexes.
Now, there's no absolute reason why those cases have to be handled
symmetrically, but I think things will be a lot simpler if they are.
If we settle on the rule that block 0 of every relation contains a
certain chunk of metadata at a certain byte offset, then the code to
retrieve that data when needed is pretty darn simple.  If tables put
it in a separate fork and indexes put it in the main fork inside the
metablock somewhere, then things are not so simple.  And I sure don't
want to add a separate fork for every index just to hold the metadata:
that would be a huge hit in terms of total inode consumption.
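
Sketching that retrieval rule (struct, field names, and offset all invented
for illustration; the point is just that heaps and indexes share one code
path):

    #include <stdint.h>
    #include <string.h>

    #define META_OFFSET 24          /* fixed byte offset within block 0 */

    typedef struct RelMetaData
    {
        uint32_t    rm_flags;               /* e.g. logged/unlogged state */
        uint32_t    rm_oldest_page_version;
    } RelMetaData;

    /* Identical for a heap or an index: read block 0, copy the metadata
     * out of its fixed location. */
    static void
    read_relation_meta(const char *block0, RelMetaData *meta)
    {
        memcpy(meta, block0 + META_OFFSET, sizeof(RelMetaData));
    }
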

 We can watermark data files using special space on block zero using
 some code to sneak that in when the page is next written, but that is
 regarded as optional, rather than an essential aspect of an
 upgrade/normal operation.

 Having pg_upgrade touch data files is both dangerous and difficult to
 back out in case of mistake, so I am wary of putting the metapage at
 block 0. Doing it the way I suggest means the .meta files would be
 wholly new and can be deleted as a back-out. We can also clean away
 any unnecessary .vm/.fsm files as a later step.

It seems pretty clear to me that making pg_upgrade responsible for
emptying block zero is a non-starter.  But I don't think that's a
reason to throw out the design; I think it's a problem we can work
around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] heap metapages

2012-05-22 Thread Simon Riggs
On 22 May 2012 13:52, Robert Haas robertmh...@gmail.com wrote:

 It seems pretty clear to me that making pg_upgrade responsible for
 emptying block zero is a non-starter.  But I don't think that's a
 reason to throw out the design; I think it's a problem we can work
 around.

I like your design better as well *if* you can explain how we can get
to it. My proposal was a practical alternative that would allow the
idea to proceed.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] heap metapages

2012-05-21 Thread Robert Haas
At dinner on Friday night at PGCon, the end of the table that included
Tom Lane, Stephen Frost, and myself got to talking about the idea of
including some kind of metapage in every relation, including heap
relations.  At least some index relations already have something like
this (cf _bt_initmetapage, _hash_metapinit).  I believe that adding
this for all relations, including heaps, would allow us to make
improvements in several areas.

1. Tom was interested in the idea of trying to make the system catalog
entries which describe the system catalogs themselves completely
immutable, so that they can potentially be shared between databases.
For example, we might have shared catalogs pg_class_shared and
pg_attribute_shared, describing the structure of all the system
catalogs; and then we might also have pg_class and pg_attribute within
each database, describing the structure of tables which exist only
within that database.  Right now, this is not possible, because values
like relpages, reltuples, and relfrozenxid can vary from database to
database.  However, if those values were stored in a metapage
associated with the heap relation rather than in the system catalogs,
then potentially we could make this work.  The most obvious benefit of
this is that it would reduce the on-disk footprint of a new database,
but there are other possible benefits as well.  For example, a process
not bound to a database could read a shared catalog even if it weren't
nailed, and if we ever implement a prefork system for backends, they'd
be able to do more of their initialization steps before learning which
database they were to target.

2. I'm interested in having a cleaner way to associate
non-transactional state with a relation.  This has come up a few
times.  We currently handle this by having lazy VACUUM do in-place
heap updates to replace values like relpages, reltuples, and
relfrozenxid, but this feels like a kludge.  It's particularly scary
to think about relying on this for anything critical given that
non-inplace heap updates can be happening simultaneously, and the
consequences of losing an update to relfrozenxid in particular are
disastrous.  Plus, it requires hackery in pg_upgrade to preserve the
value between the old and new clusters; we've already had to fix two
data-destroying bugs in that logic.  There are several other things
that we might want to do that have similar requirements.  For example,
Pavan's idea of folding VACUUM's second heap pass into the next vacuum
cycle requires a relation-wide piece of state which can probably be
represented as a single bit, but putting that bit in pg_class would
require the same sorts of hacks there that we already have for
relfrozenxid, with similar consequences if it's not properly
preserved.  Making unlogged tables logged or the other way around
appears to require some piece of relation-level state *that can be
accessed during recovery*, and pg_class is not going to work for that.
 Page checksums have a similar requirement if the granularity for
turning them on and off is anything less than the entire cluster.
Whenever we decide to roll out a new page version, we'll want a place
to record the oldest page version that might be present in a
particular relation, so that we can easily check whether a cluster can
be upgraded to a new release that has dropped support for an old page
version.  Having a common framework for all of these things seems like
it will probably be easier than solving each problem individually, and
a metapage is a good place to store non-transactional state.
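
To make the shape of that state concrete, here is a hypothetical metapage
payload covering the items above (every field and flag name is invented,
not a committed design):

    #include <stdint.h>

    typedef uint32_t TransactionId;         /* stand-in for the real typedef */

    typedef struct HeapMetaPageData
    {
        uint32_t        hmp_magic;          /* marks a heap metapage */
        uint32_t        hmp_version;        /* metapage layout version */
        TransactionId   hmp_relfrozenxid;   /* today kept in pg_class via
                                             * in-place update */
        uint32_t        hmp_relpages;       /* non-transactional stats */
        double          hmp_reltuples;
        uint16_t        hmp_oldest_page_version;    /* oldest on-disk page
                                                     * layout in the relation */
        uint16_t        hmp_flags;
    } HeapMetaPageData;

    /* hmp_flags bits */
    #define HMP_UNLOGGED            0x0001
    #define HMP_LOGGED_TO_UNLOGGED  0x0002  /* conversion in progress,
                                             * readable during recovery */
    #define HMP_UNLOGGED_TO_LOGGED  0x0004
    #define HMP_CHECKSUMS_ENABLED   0x0008  /* per-relation checksum switch */
    #define HMP_VACUUM_2ND_PASS_DUE 0x0010  /* Pavan's deferred second heap
                                             * pass, a single bit of state */
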

3. Right now, a new table uses up a minimum of 3 inodes, even if it
has no indexes: one for the main fork, one for the visibility map, and
one for the free space map.  For people who have lots and lots of
little tiny tables, this is quite inefficient.  The amount of
information we'd have to store in a heap metapage would presumably not
be very big, so we could potentially move the first, say, 1K of the
visibility map into the heap metapage, meaning that tables less than
64MB would no longer require a separate visibility map fork.
Something similar could possibly be done with the free-space map,
though I am unsure of the details.  Right now, a relation containing
just one tuple consumes 5 8k blocks on disk (1 for the main fork, 3
for the FSM, and 1 for the VM) and 3 inodes; getting that down to 8kB
and 1 inode would be very nice.  The case of a completely-empty
relation is a bit annoying; that right now takes 1 inode and 0 blocks
and I suspect we'd end up with 1 inode and 1 block, but I think it
might still be a win overall.
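
The 64MB figure is just the density of the visibility map (one bit per heap
page) worked through, with constants as in a default build:

    #define BLCKSZ   8192           /* default page size */
    #define VM_SLICE 1024           /* VM bytes folded into the metapage */

    /* 1024 bytes * 8 bits/byte = 8192 heap pages described;       */
    /* 8192 pages * 8192 bytes/page = 67108864 bytes = 64MB.       */
    enum
    {
        VM_PAGES_COVERED = VM_SLICE * 8,
        VM_BYTES_COVERED = VM_PAGES_COVERED * BLCKSZ
    };
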

4. Every once in a while, somebody's database ends up in pieces in
lost+found.  We could make this a bit easier to recover from by
including the database OID, relfilenode, and table OID in the
metapage.  This wouldn't be perfect, since a relation over one GB
would still only have one metapage, so additional relation segments
would still be a problem.  But it would still be a huge ...

Re: [HACKERS] heap metapages

2012-05-21 Thread Merlin Moncure
On Mon, May 21, 2012 at 12:56 PM, Robert Haas robertmh...@gmail.com wrote:
 At dinner on Friday night at PGCon, the end of the table that included
 Tom Lane, Stephen Frost, and myself got to talking about the idea of
 including some kind of metapage in every relation, including heap
 relations.  At least some index relations already have something like
 this (cf _bt_initmetapage, _hash_metapinit).  I believe that adding
 this for all relations, including heaps, would allow us to make
 improvements in several areas.

The first thing that jumps to mind is: why can't the metapage be
extended to span multiple pages if necessary?  I've often wondered why
the visibility map isn't stored within the heap itself...

merlin


Re: [HACKERS] heap metapages

2012-05-21 Thread Robert Haas
On Mon, May 21, 2012 at 2:22 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Mon, May 21, 2012 at 12:56 PM, Robert Haas robertmh...@gmail.com wrote:
 At dinner on Friday night at PGCon, the end of the table that included
 Tom Lane, Stephen Frost, and myself got to talking about the idea of
 including some kind of metapage in every relation, including heap
 relations.  At least some index relations already have something like
 this (cf _bt_initmetapage, _hash_metapinit).  I believe that adding
 this for all relations, including heaps, would allow us to make
 improvements in several areas.

 The first thing that jumps to mind is: why can't the metapage be
 extended to span multiple pages if necessary?  I've often wondered why
 the visibility map isn't stored within the heap itself...

Well, the idea of a metapage, almost by definition, is that it stores
a small amount of information whose size is pretty much fixed and
which can be reasonably anticipated to always fit in one page.  If
you're trying to store some data that can get bigger than that (or
even come close to filling that up), you need a different system.
I'm anticipating that the amount of relation metadata we need to store
will fit into a 512-byte sector with significant room left over,
leaving us with the rest of the block for whatever we'd like to use it
for (e.g. bits of the FSM or VM).   If at some point in the future, we
need some kind of relation-level metadata that can grow beyond a
handful of bytes, we can either put it in its own fork, or store one
or more block pointers in the metapage indicating the blocks where
information is stored - but right now I'm not seeing the need for
anything that fancy.
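
That fits-in-one-sector assumption is cheap to enforce at compile time; a
sketch (the struct is the same kind of invented illustration as before,
with reserved space standing in for the significant room left over):

    #include <stdint.h>

    typedef struct RelMeta
    {
        uint32_t    flags;
        uint32_t    oldest_page_version;
        uint64_t    reserved[8];    /* room left over for future metadata */
    } RelMeta;

    /* Fail the build if the fixed metadata ever outgrows one sector. */
    _Static_assert(sizeof(RelMeta) <= 512,
                   "relation metadata must fit in a 512-byte sector");
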

Now, that having been said, I don't think there's any particular
reason why we couldn't multiplex all the relation forks onto a single
physical file if we were so inclined.  The FSM and VM are small enough
that interleaving them with the actual data probably wouldn't slow
down seq scans materially.  But on the other hand I am not sure that
we'd gain much by it in general.  I see the value of doing it for
small relations: it saves inodes, potentially quite a lot of inodes if
you're on a system that uses schemas to implement multi-tenancy.  But
it's not clear to me that it's worthwhile in general.  Sticking all
the FSM stuff in its own relation may allow the OS to lay out those
pages physically closer to each other on disk, whereas interleaving
them with the data blocks would probably give up that advantage, and
it's not clear to me what we'd be getting in exchange.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] heap metapages

2012-05-21 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 The FSM and VM are small enough
 that interleaving them with the actual data probably wouldn't slow
 down seq scans materially.  

Wouldn't that end up potentially causing lots of random i/o if you need
to look at many parts of the FSM or VM..?

Also, wouldn't having it at the start of the heap reduce the changes
needed to the SM?  Along with making such things easier to find, when
talking about forensics?

Of course, the real challenge here is dealing with such an on-disk
format change...  If we were starting from scratch, I doubt there would
be much resistance, but figuring out how to do this and still support
pg_upgrade could be quite ugly.

Thanks,

Stephen



Re: [HACKERS] heap metapages

2012-05-21 Thread Simon Riggs
On 21 May 2012 13:56, Robert Haas robertmh...@gmail.com wrote:

 At dinner on Friday night at PGCon, the end of the table that included
 Tom Lane, Stephen Frost, and myself got to talking about the idea of
 including some kind of metapage in every relation, including heap
 relations.  At least some index relations already have something like
 this (cf _bt_initmetapage, _hash_metapinit).  I believe that adding
 this for all relations, including heaps, would allow us to make
 improvements in several areas.

The only thing against these ideas is that you're putting the design
before the requirements, which always makes me nervous.

I very much like the idea of a common framework to support multiple
requirements. If we can view a couple of other designs as well it may
quickly become clear this is the right way. In any case, the topics
discussed here are important ones, so thanks for covering them.

What springs immediately to mind is why this would not be just another fork.

 1. Tom was interested in the idea of trying to make the system catalog
 entries which describe the system catalogs themselves completely
 immutable, so that they can potentially be shared between databases.
 For example, we might have shared catalogs pg_class_shared and
 pg_attribute_shared, describing the structure of all the system
 catalogs; and then we might also have pg_class and pg_attribute within
 each database, describing the structure of tables which exist only
 within that database.  Right now, this is not possible, because values
 like relpages, reltuples, and relfrozenxid can vary from database to
 database.  However, if those values were stored in a metapage
 associated with the heap relation rather than in the system catalogs,
 then potentially we could make this work.  The most obvious benefit of
 this is that it would reduce the on-disk footprint of a new database,
 but there are other possible benefits as well.  For example, a process
 not bound to a database could read a shared catalog even if it weren't
 nailed, and if we ever implement a prefork system for backends, they'd
 be able to do more of their initialization steps before learning which
 database they were to target.

This is important. I like the idea of breaking down the barriers
between databases to allow it to be an option for one backend to
access tables in multiple databases. The current mechanism doesn't
actually prevent looking at data from other databases using internal
APIs, so full security doesn't exist. It's a very common user
requirement to wish to join tables stored in different databases,
which ought to be possible more cleanly with correct privileges.

 2. I'm interested in having a cleaner way to associate
 non-transactional state with a relation.  This has come up a few
 times.  We currently handle this by having lazy VACUUM do in-place
 heap updates to replace values like relpages, reltuples, and
 relfrozenxid, but this feels like a kludge.  It's particularly scary
 to think about relying on this for anything critical given that
 non-inplace heap updates can be happening simultaneously, and the
 consequences of losing an update to relfrozenxid in particular are
 disastrous.  Plus, it requires hackery in pg_upgrade to preserve the
 value between the old and new clusters; we've already had to fix two
 data-destroying bugs in that logic.  There are several other things
 that we might want to do that have similar requirements.  For example,
 Pavan's idea of folding VACUUM's second heap pass into the next vacuum
 cycle requires a relation-wide piece of state which can probably be
 represented as a single bit, but putting that bit in pg_class would
 require the same sorts of hacks there that we already have for
 relfrozenxid, with similar consequences if it's not properly
 preserved.  Making unlogged tables logged or the other way around
 appears to require some piece of relation-level state *that can be
 accessed during recovery*, and pg_class is not going to work for that.
  Page checksums have a similar requirement if the granularity for
 turning them on and off is anything less than the entire cluster.
 Whenever we decide to roll out a new page version, we'll want a place
 to record the oldest page version that might be present in a
 particular relation, so that we can easily check whether a cluster can
 be upgraded to a new release that has dropped support for an old page
 version.  Having a common framework for all of these things seems like
 it will probably be easier than solving each problem individually, and
 a metapage is a good place to store non-transactional state.

I thought there was a patch that put that info in a separate table 1:1
with pg_class.

Not very sure why a metapage is better than a catalog table. We would
still want a view that allows us to access that data as if it were a
catalog table.

 3. Right now, a new table uses up a minimum of 3 inodes, even if it
 has no indexes: one for the main fork, one for the visibility map, ...

Re: [HACKERS] heap metapages

2012-05-21 Thread Stephen Frost
* Simon Riggs (si...@2ndquadrant.com) wrote:
 The only thing against these ideas is that you're putting the design
 before the requirements, which always makes me nervous.
[...]
 What springs immediately to mind is why this would not be just another fork.

One of the requirements, though perhaps it wasn't made very clear,
really is to reduce the on-disk footprint, both in terms of inodes and
actual disk usage, if possible.

 This is important. I like the idea of breaking down the barriers
 between databases to allow it to be an option for one backend to
 access tables in multiple databases. The current mechanism doesn't
 actually prevent looking at data from other databases using internal
 APIs, so full security doesn't exist. It's a very common user
 requirement to wish to join tables stored in different databases,
 which ought to be possible more cleanly with correct privileges.

That's really a whole different ball of wax and I don't believe what
Robert was proposing would actually allow that to happen due to the
other database-level things which are needed to keep everything
consistent...  That's my understanding, anyway.  I'd be as happy as anyone
if we could actually make it work, but isn't the SysCache stuff per
database?  Also, cross-database queries would actually make it more
difficult to have per-database roles, which is one thing that I was
hoping we might be able to work into this, though perhaps we could have
a shared roles table and a per-database roles table and only 'global'
roles would be able to issue cross-database queries..

 Not very sure why a metapage is better than a catalog table. We would
 still want a view that allows us to access that data as if it were a
 catalog table.

Right, we were discussing that, and what would happen if someone did a
'select *' against it...  Having to pass through all of the files on
disk wouldn't be good, but if we could make it use a cache to return
that information, perhaps it'd work.

 Again, there are other ways to optimise the FSM for small tables.

Sure, but is there one where we also reduce the number of inodes we
allocate for tiny tables..?

 That sounds like the requirement that is driving this idea.

Regarding forensics, it's a nice bonus, but I think the real requirement
is the reduction of inode and disk usage, both for the per-database
catalog and for tiny tables.

 You don't have to rewrite the table, you just need to update the rows
 so they migrate to another block.

Well, that depends on exactly how it gets implemented, but that's an
interesting idea, certainly..

Thanks,

Stephen



Re: [HACKERS] heap metapages

2012-05-21 Thread Robert Haas
On Mon, May 21, 2012 at 3:15 PM, Simon Riggs si...@2ndquadrant.com wrote:
 I very much like the idea of a common framework to support multiple
 requirements. If we can view a couple of other designs as well it may
 quickly become clear this is the right way. In any case, the topics
 discussed here are important ones, so thanks for covering them.

I considered a couple of other possibilities:

- We could split pg_class into pg_class and pg_class_nt
(non-transactional).  This would solve problem #1 (allowing
pg_class/pg_attribute entries for system catalogs to be shared across
all databases) but it doesn't do anything for problem #3 (excessive
inode consumption) or problem #4 (watermarking for crash recovery) and
isn't very good for problem #2 (maintenance of non-transactional
state) either, since part of the hope here is that we'd be able to get
at this state during recovery even when HS is not used.

- In lieu of adding an entire meta-page, we could just add some
special space to the first page, or maybe to every N'th page.  Adding
space to every N'th page would be the best solution to problem #4
(watermarking), and adding even a small amount of state to the first
page would be enough for problems #1 and #2.  However, I don't think
it would work for problem #3 (reducing inode consumption) because even
if the special space is pretty big, you won't really be able to mix
tuples and visibility map information (for example) on the same page
without complicating the buffer locking regimen unbearably.  The dance
we have to do to make the visibility map crash-safe is already a lot
hairier than I'd really prefer.  Also, I think we really need a lot of
this info for both tables and indexes, and I think it will be simpler
to decide that everything has a metapage rather than to decide that
some things have a metapage and some things just have a little extra
stuff crammed into the special space.

- I considered the idea of designing a crash-safe persistent hash
table, that would be sort of like a table but really more like a
key-value store with keys and values being C structs.  This would be
similar to the pg_class/pg_class_nt split idea, except that
pg_class_nt would be one of these new crash-safe persistent hash table
objects, rather than a normal table; and there's a decent possibility
we'd find other applications for such a beast.  However, it wouldn't
help with problem #3 or problem #4; and Tom seemed to be gravitating
toward the design in my OP rather than this idea.  One point that was
raised is that btree and hash indexes already have a metapage, so
sticking a little more data into it doesn't really cost anything; and
heap relations are pretty much going to end up nailing the visibility
map and free space map pages in cache, so it's not clear that this is
any less cache-efficient in those cases either.  For all that, I kind
of like the idea of a persistent hash table object, which I suspect
could be used to solve some problems not on the list in my OP as well
as some of the ones that are there, but I don't feel too bad laying
that idea aside for now.  If it's really a good idea, it'll come up
again.

 What springs immediately to mind is why this would not be just another fork.

This was pretty much the first thing I considered, but it makes
problem #3 worse, and I really don't want to do that.  I think 3 inodes
per table is already too many, and I expect the problem to get worse.
I feel like every third crazy feature idea I come up with involves
creating yet another relation fork, and I'm pretty sure I won't be the
last person to think about such things, and so we're probably headed
that way, but I think we'd better try to hold the line as much as is
reasonably possible.

One random idea would be to have pg_upgrade create a special one-block
relation fork for the heap metapage that would get folded into the
main fork the first time the table gets rewritten.  So we'd add
another fork, but only as a hack to facilitate in-place upgrade.

 This is important. I like the idea of breaking down the barriers
 between databases to allow it to be an option for one backend to
 access tables in multiple databases. The current mechanism doesn't
 actually prevent looking at data from other databases using internal
 APIs, so full security doesn't exist. It's a very common user
 requirement to wish to join tables stored in different databases,
 which ought to be possible more cleanly with correct privileges.

As Stephen says, this would require a lot more than just making
pg_class_shared/pg_attribute_shared work, and I don't particularly
believe it's a good idea anyway.  That having been said, if we decided
we wanted to go this way in some future release, having done this
first couldn't but help.

 I thought there was a patch that put that info in a separate table 1:1
 with pg_class.

 Not very sure why a metapage is better than a catalog table.

Mostly because there's no chance of the startup process accessing a
catalog table during recovery, but it can read a metapage.

Re: [HACKERS] heap metapages

2012-05-21 Thread Robert Haas
On Mon, May 21, 2012 at 3:15 PM, Stephen Frost sfr...@snowman.net wrote:
 * Robert Haas (robertmh...@gmail.com) wrote:
 The FSM and VM are small enough
 that interleaving them with the actual data probably wouldn't slow
 down seq scans materially.

 Wouldn't that end up potentially causing lots of random i/o if you need
 to look at many parts of the FSM or VM..?

I doubt it.  They probably stay in core anyway.

 Also, wouldn't having it at the start of the heap reduce the changes
 needed to the SM?  Along with making such things easier to find, when
 talking about forensics?

The metapage, surely yes.  If we wanted to fold the FSM and VM into
the main fork in their entirety, probably not.  But I don't have much
desire to do that.  I think it's fine for a BIG relation to eat a
couple of inodes.  I just don't want a little one to do that.

 Of course, the real challenge here is dealing with such an on-disk
 format change...  If we were starting from scratch, I doubt there would
 be much resistance, but figuring out how to do this and still support
 pg_upgrade could be quite ugly.

That does seem to be the ten million dollar question, but already
we've batted around a few solutions on this thread, so I suspect we'll
find a way to make it work.  I think my next step is going to be to
spend some more time studying what the various index AMs already have
in terms of metapages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
