On Mon, May 21, 2012 at 3:15 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> I very much like the idea of a common framework to support multiple
> requirements. If we can view a couple of other designs as well it may
> quickly become clear this is the right way. In any case, the topics
> discussed here are important ones, so thanks for covering them.
I considered a couple of other possibilities:

- We could split pg_class into pg_class and pg_class_nt (non-transactional). This would solve problem #1 (allowing pg_class/pg_attribute entries for system catalogs to be shared across all databases), but it doesn't do anything for problem #3 (excessive inode consumption) or problem #4 (watermarking for crash recovery), and it isn't very good for problem #2 (maintenance of non-transactional state) either, since part of the hope here is that we'd be able to get at this state during recovery even when HS is not used.

- In lieu of adding an entire metapage, we could just add some special space to the first page, or maybe to every N'th page. Adding space to every N'th page would be the best solution to problem #4 (watermarking), and adding even a small amount of state to the first page would be enough for problems #1 and #2. However, I don't think it would work for problem #3 (reducing inode consumption), because even if the special space is pretty big, you won't really be able to mix tuples and visibility map information (for example) on the same page without complicating the buffer locking regimen unbearably. The dance we have to do to make the visibility map crash-safe is already a lot hairier than I'd really prefer. Also, I think we really need a lot of this info for both tables and indexes, and I think it will be simpler to decide that everything has a metapage rather than that some things have a metapage and some things just have a little extra stuff crammed into the special space.

- I considered the idea of designing a crash-safe persistent hash table that would be sort of like a table, but really more like a key-value store with keys and values being C structs.
Such a hash table would be similar to the pg_class/pg_class_nt split idea, except that pg_class_nt would be one of these new crash-safe persistent hash table objects rather than a normal table; and there's a decent possibility we'd find other applications for such a beast. However, it wouldn't help with problem #3 or problem #4, and Tom seemed to be gravitating toward the design in my OP rather than this idea.

One point that was raised is that btree and hash indexes already have a metapage, so sticking a little more data into it doesn't really cost anything; and heap relations are pretty much going to end up nailing the visibility map and free space map pages in cache anyway, so it's not clear that this is any less cache-efficient in those cases either.

For all that, I kind of like the idea of a persistent hash table object, which I suspect could be used to solve some problems not on the list in my OP as well as some of the ones that are there, but I don't feel too bad laying that idea aside for now. If it's really a good idea, it'll come up again.

> What springs immediately to mind is why this would not be just another fork.

This was pretty much the first thing I considered, but it makes problem #3 worse, and I really don't want to do that. I think 3 inodes per table is already too many, and I expect the problem to get worse. I feel like every third crazy feature idea I come up with involves creating yet another relation fork, and I'm pretty sure I won't be the last person to think about such things, so we're probably headed that way, but I think we'd better try to hold the line as much as is reasonably possible. One random idea would be to have pg_upgrade create a special one-block relation fork for the heap metapage that would get folded into the main fork the first time the table gets rewritten. So we'd add another fork, but only as a hack to facilitate in-place upgrade.

> This is important.
> I like the idea of breaking down the barriers
> between databases to allow it to be an option for one backend to
> access tables in multiple databases. The current mechanism doesn't
> actually prevent looking at data from other databases using internal
> APIs, so full security doesn't exist. It's a very common user
> requirement to wish to join tables stored in different databases,
> which ought to be possible more cleanly with correct privileges.

As Stephen says, this would require a lot more than just making pg_class_shared/pg_attribute_shared work, and I don't particularly believe it's a good idea anyway. That having been said, if we decided we wanted to go this way in some future release, having done this first couldn't but help.

> I thought there was a patch that put that info in a separate table 1:1
> with pg_class.
>
> Not very sure why a metapage is better than a catalog table.

Mostly because there's no chance of the startup process accessing a catalog table during recovery, but it can read a metapage.

> We would
> still want a view that allows us to access that data as if it were a
> catalog table.

Agreed. Tom said the same.

> Again, there are other ways to optimise the FSM for small tables.

True, but that doesn't make this a bad one.

>> 4. Every once in a while, somebody's database ends up in pieces in
>> lost+found. We could make this a bit easier to recover from by
>> including the database OID, relfilenode, and table OID in the
>> metapage. This wouldn't be perfect, since a relation over one GB
>> would still only have one metapage, so additional relation segments
>> would still be a problem. But it would still be a huge improvement
>> over the status quo: some very large percentage of the work of putting
>> everything back where it goes could probably be done by a Perl script
>> that read all the metapages, and if you needed to know, say, which
>> file contained pg_class, that would be a whole lot easier, too.
> That sounds like the requirement that is driving this idea.

No, I listed it fourth because I think it's the least interesting benefit. It IS a benefit, but if this were the primary goal it would be a LOT simpler to shove a few bytes into every N'th heap page's special space. I coded up a patch for that on my other laptop, and then reformatted the hard drive without saving the patch (brilliant!), so I no longer have working code for this. But it's not that hard. I am much more interested in benefit #2, the ability to maintain non-transactional state that can be read by the startup process during recovery, than I am in this goal. Unfortunately that's harder, but I think it's worth the effort.

> You don't have to rewrite the table, you just need to update the rows
> so they migrate to another block.

True.

> That seems easy enough, but still not sure why you wouldn't just use
> another fork. Or another idea would be to have the first page have a
> non-zero pd_special.

See above for a discussion of these points.

> I know you were recording what was discussed as an initial starting
> point. Looks like a good set of problems to solve.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company