Re: [HACKERS] Rewriting Free Space Map
Heikki Linnakangas [EMAIL PROTECTED] writes:
> More precisely, on CVS HEAD it takes between 7.1% and 7.2%. After extending BufferTag with one uint32, it takes 7.4-7.5%. So the effect is measurable if you try hard enough, but not anything to get worried about.

And if we adopt the allegedly-faster hash_any that's in the patch queue, the difference should get smaller yet.

regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rewriting Free Space Map
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> Heikki Linnakangas [EMAIL PROTECTED] writes:
>>> I also wonder what the performance impact of extending BufferTag is.
>> That's a fair objection, and obviously something we'd need to check. But I don't recall seeing hash_any so high on any profile that I think it'd be a big problem.
> I do remember seeing hash_any in some oprofile runs. But that's fairly easy to test: we don't need to actually implement any of the stuff, other than add a field to BufferTag, and run pgbench.

I tried that. hash_any wasn't significant on pgbench, but I was able to construct a test case where it is. It goes like this:

BEGIN;
CREATE TEMPORARY TABLE foo (id int4);
INSERT INTO foo SELECT a FROM generate_series(1, 1) a;
INSERT INTO foo SELECT * FROM foo;
INSERT INTO foo SELECT * FROM foo;
INSERT INTO foo SELECT * FROM foo;
... (repeat multiple times)

oprofile says that hash_any consumes ~7% of CPU time on that test on my laptop. More precisely, on CVS HEAD it takes between 7.1% and 7.2%. After extending BufferTag with one uint32, it takes 7.4-7.5%. So the effect is measurable if you try hard enough, but not anything to get worried about.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
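To make the cost being measured above concrete, here is a minimal sketch of the kind of change under discussion: a buffer lookup key widened by one uint32 to say which fork a page belongs to. The struct layouts mirror PostgreSQL's in spirit only; the field and type names here are illustrative stand-ins, not the actual source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t BlockNumber;

/* Simplified stand-in for PostgreSQL's RelFileNode. */
typedef struct RelFileNode {
    uint32_t spcNode;           /* tablespace */
    uint32_t dbNode;            /* database */
    uint32_t relNode;           /* relation */
} RelFileNode;

/* Buffer lookup key, extended with a fork number as proposed. */
typedef struct BufferTag {
    RelFileNode rnode;
    uint32_t    forkNum;        /* NEW: which fork of the relation */
    BlockNumber blockNum;       /* block within that fork */
} BufferTag;

/* The tag is hashed and compared as raw bytes, so the extra field
 * slightly widens every lookup in the buffer mapping table -- which
 * is where the small hash_any overhead measured above comes from. */
int buffer_tags_equal(const BufferTag *a, const BufferTag *b)
{
    return memcmp(a, b, sizeof(BufferTag)) == 0;
}
```

Two tags that differ only in the fork field must compare unequal, so the same (relation, block) pair in different forks maps to different buffers.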
Re: [HACKERS] Rewriting Free Space Map
On Mon, Mar 17, 2008 at 01:23:46PM -0400, Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
>> Tom Lane wrote:
>>> The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple forks in a relation.
>> Can we call them maps or metadata maps? forks sounds weird.

Actually, I do like forks, but to add a little bit of diversity: facets? aspects?

FWIW, the idea of mapping a relation to a directory is quite compelling.

Regards
-- tomás
Re: [HACKERS] Rewriting Free Space Map
On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
>> The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple forks in a relation. In addition to the main data fork which contains the same info as now, there could be one or more map forks which are separate files in the filesystem.

Aren't we in a way doing this for indexes?

> I think something similar could be used to store tuple visibility bits separately from heap tuple data itself, so +1 to this idea.

Not just bits, but the whole visibility info (xmin, xmax, tmin, tmax, plus bits) should be stored separately. A separate fork for visibility should be organized as a b-tree index (as we already have well-honed mechanisms for dealing with those effectively), but the visibility fork is stored in a compressed form, by storing ranges of all-visible or all-deleted tuples as two endpoints only, and the tree is also reorganized when possible, similar to what we currently do for HOT updates. This will keep the visibility index really small for cases with few updates, most likely one or two pages regardless of table size. One important difference from indexes is that visibility info should be stored first, before writing data to heap and creating ordinary index entries.

> (The rough idea in my head was that you can do an indexscan and look up visibility bits without having to pull the whole heap along; and visibility updates are also cheaper, whether they come from indexscans or heap scans. Of course, the implicit cost is that a seqscan needs to fetch the visibility pages, too; and the locking is more complex.)

Another cost is heavy inserting/updating, where there will probably be more lock contention, as visibility info for new tuples will more often land on the same visibility pages due to visibility info being generally smaller.

Of course, with visibility info in a separate fork, very narrow tables will have the ratios reversed - for a one-byte-wide table, visibility info will be a few times bigger than the actual data, at least initially, before compression has kicked in.

Hannu
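Hannu's "two endpoints only" compression can be sketched concretely. The following is a toy illustration of storing a run of all-visible tuples as a single (start, end) pair rather than one entry per tuple; all names, the flat array, and the integer tuple ids are invented here for illustration and are not part of any actual proposal or of PostgreSQL.

```c
#include <assert.h>
#include <stdint.h>

/* A run of consecutive all-visible tuples, kept as two endpoints. */
typedef struct VisibleRange {
    uint64_t start;             /* first all-visible tuple id in the run */
    uint64_t end;               /* last all-visible tuple id in the run */
} VisibleRange;

#define MAX_RANGES 64

typedef struct VisMap {
    VisibleRange ranges[MAX_RANGES];
    int          nranges;
} VisMap;

/* Mark one tuple visible, extending the last range when possible, so
 * a mostly-append workload stays at one or two ranges total -- the
 * "really small regardless of table size" property claimed above. */
void vismap_mark_visible(VisMap *map, uint64_t tid)
{
    if (map->nranges > 0 &&
        map->ranges[map->nranges - 1].end + 1 == tid) {
        map->ranges[map->nranges - 1].end = tid;    /* extend the run */
        return;
    }
    assert(map->nranges < MAX_RANGES);
    map->ranges[map->nranges].start = tid;
    map->ranges[map->nranges].end = tid;
    map->nranges++;
}
```

Marking a thousand consecutive tuples visible costs one range entry; only a gap in the sequence starts a new one.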
Re: [HACKERS] Rewriting Free Space Map
Hannu Krosing [EMAIL PROTECTED] writes:
> On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple forks in a relation.
> Aren't we in a way doing this for indexes?

Not really --- indexes are closer to being independent entities, since they have their own relfilenode values, own pg_class entries, etc. What I'm imagining here is something that's so tightly tied to the core heap that there's no value in managing it as a distinct entity, thus the idea of same relfilenode with a different extension. The existence of multiple forks in a relation wouldn't be exposed at all at the SQL level.

>> I think something similar could be used to store tuple visibility bits separately from heap tuple data itself, so +1 to this idea.
> Not just bits, but whole visibility info (xmin, xmax, tmin, tmax, plus bits) should be stored separately.

I'm entirely un-sold on this idea, but yeah, it would be something that would be possible to experiment with once we have a multi-fork infrastructure.

regards, tom lane
Re: [HACKERS] Rewriting Free Space Map
Heikki Linnakangas [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> You're cavalierly waving away a whole boatload of problems that will arise as soon as you start trying to make the index AMs play along with this :-(.
> It doesn't seem very hard.

The problem is that the index AMs are no longer in control of what goes where within their indexes, which has always been their prerogative to determine. The fact that you think you can kluge btree to still work doesn't mean that it will work for other AMs.

>> Also, I don't think that "use the special space" will scale to handle other kinds of maps such as the proposed dead space map. (This is exactly why I said the other day that we need a design roadmap for all these ideas.)
> It works for anything that scales linearly with the relation itself. The proposed FSM and visibility map both fall into that category.

It can work only with prearrangement among all the maps that are trying to coexist in the same special space. Every time a new one comes along, we'd have to reconsider the space allocation, re-optimize tradeoffs, and force an initdb that could not possibly be implemented in-place. If we had a short list of maps in mind with no real prospect of changing, then I think this would be acceptable; but that doesn't seem to be the case. It's certainly foolish to start detail design on something like this when we don't even have the roadmap I asked for about what sorts of maps we are agreed we want to have.

>> The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple forks in a relation.
> Hmm. You also need to teach at least xlog.c and xlogutils.c about the map forks, for full page images and the invalid page tracking.

Well, you'd have to teach them something anyway, for any incarnation of maps that they might need to update.

> I also wonder what the performance impact of extending BufferTag is.

That's a fair objection, and obviously something we'd need to check. But I don't recall seeing hash_any so high on any profile that I think it'd be a big problem.

> My original thought was to have a separate RelFileNode for each of the maps. That would require no smgr or xlog changes, and not very many changes in the buffer manager, though I guess you'd need more catalog changes. You had doubts about that on the previous thread (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but the map forks idea certainly seems much more invasive than that.

The main problems with that are (a) the need to expose every type of map in pg_class and (b) the need to pass all those relfilenode numbers down to pretty low levels of the system. The nice thing about the fork idea is that you don't need any added info to uniquely identify what relation you're working on. The fork numbers would be hard-wired into whatever code needed to know about particular forks. (Of course, these same advantages apply to using special space in an existing file. I'm just suggesting that we can keep these advantages without buying into the restrictions that special space would have.)

regards, tom lane
Re: [HACKERS] Rewriting Free Space Map
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> I've started working on revamping Free Space Map, using the approach where we store a map of heap pages on every nth heap page. What we need now is discussion on the details of how exactly it should work.
> You're cavalierly waving away a whole boatload of problems that will arise as soon as you start trying to make the index AMs play along with this :-(.

It doesn't seem very hard. An indexam wanting to use the FSM needs a little bit of code where the relation is extended, to let the FSM initialize FSM pages. And then there's the B-tree metapage issue I mentioned. But that's all, AFAICS.

> Hash for instance has very narrow-minded ideas about page allocation within its indexes.

Hash doesn't use the FSM at all.

> Also, I don't think that "use the special space" will scale to handle other kinds of maps such as the proposed dead space map. (This is exactly why I said the other day that we need a design roadmap for all these ideas.)

It works for anything that scales linearly with the relation itself. The proposed FSM and visibility map both fall into that category. A separate file is certainly more flexible. I was leaning towards that option originally (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00142.php) for that reason.

> The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple forks in a relation. In addition to the main data fork which contains the same info as now, there could be one or more map forks which are separate files in the filesystem. They are named by relfilenode plus an extension, for instance a relation with relfilenode NNN would have a data fork in file NNN (plus perhaps NNN.1, NNN.2, etc) and a map fork named something like NNN.map (plus NNN.map.1 etc as needed). We'd have to add one more field to buffer lookup keys (BufferTag) to disambiguate which fork the referenced page is in. Having bitten that bullet, though, the idea trivially scales to any number of map forks with potentially different space requirements and different locking and WAL-logging requirements.

Hmm. You also need to teach at least xlog.c and xlogutils.c about the map forks, for full page images and the invalid page tracking. I also wonder what the performance impact of extending BufferTag is.

My original thought was to have a separate RelFileNode for each of the maps. That would require no smgr or xlog changes, and not very many changes in the buffer manager, though I guess you'd need more catalog changes. You had doubts about that on the previous thread (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but the map forks idea certainly seems much more invasive than that.

I like the map forks idea; it groups the maps nicely at the filesystem level, and I can see it being useful for all kinds of things in the future. The question is, is it really worth the extra code churn? If you think it is, I can try that approach.

> Another possible advantage is that a new map fork could be added to an existing table without much trouble. Which is certainly something we'd need if we ever hope to get update-in-place working.

Yep.

> The main disadvantage I can see is that for very small tables, the percentage overhead from multiple map forks of one page apiece is annoyingly high. However, most of the point of a map disappears if the table is small, so we might finesse that by not creating any maps until the table has reached some minimum size.

Yeah, the map fork idea is actually better than the "every nth heap page" approach from that point of view.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
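For contrast with the fork idea, the "map of heap pages on every nth heap page" approach being compared above can be sketched as simple block arithmetic. The interval and helper names here are invented for illustration; the real proposal's layout details were still under discussion.

```c
#include <assert.h>
#include <stdint.h>

/* Toy layout: every MAP_INTERVAL-th block of the relation itself is a
 * map page, and the blocks after it up to the next map page are the
 * heap pages it covers.  The interval value is an arbitrary example. */
#define MAP_INTERVAL 4096

/* A block is a map page if its number is a multiple of the interval. */
int is_map_block(uint32_t blkno)
{
    return blkno % MAP_INTERVAL == 0;
}

/* The map page responsible for a given block. */
uint32_t map_block_for(uint32_t blkno)
{
    return (blkno / MAP_INTERVAL) * MAP_INTERVAL;
}
```

This interleaving is what makes the scheme scale linearly with the relation, but it is also why very small tables pay little under it, as Heikki notes above: a table under one interval long needs only block 0 as its map.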
Re: [HACKERS] Rewriting Free Space Map
On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
>> The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple forks in a relation. In addition to the main data fork which contains the same info as now, there could be one or more map forks which are separate files in the filesystem.
> I think something similar could be used to store tuple visibility bits separately from heap tuple data itself, so +1 to this idea. (The rough idea in my head was that you can do an indexscan and look up visibility bits without having to pull the whole heap along; and visibility updates are also cheaper, whether they come from indexscans or heap scans. Of course, the implicit cost is that a seqscan needs to fetch the visibility pages, too; and the locking is more complex.)

I very much like the idea of a generic method for including additional bulk metadata for a relation (heap or index). That neatly provides a general infrastructure for lots of clever things such as dead space, visibility or other properties, while at the same time maintaining modularity.

Can we call them maps or metadata maps? forks sounds weird.

We don't need to assume anything about the maps themselves at this stage, so we might imagine tightly coupled maps that are always updated as a relation changes, or loosely coupled maps that are lazily updated by background processes. Autovacuum then becomes the vehicle by which we execute map maintenance procedures, defined according to which AMs are installed and what relation options are set. So we have a completely generalised data/metadata storage infrastructure.

Sensibly arranged, this could provide an entry point for powerful new features within existing and future index AMs. It also sounds like it might avoid a whole class of bugs and special cases that I regrettably foresee would be unavoidable in Heikki's proposal.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk
Re: [HACKERS] Rewriting Free Space Map
Tom Lane [EMAIL PROTECTED] writes:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> My original thought was to have a separate RelFileNode for each of the maps. That would require no smgr or xlog changes, and not very many changes in the buffer manager, though I guess you'd need more catalog changes. You had doubts about that on the previous thread (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but the map forks idea certainly seems much more invasive than that.
> The main problems with that are (a) the need to expose every type of map in pg_class and (b) the need to pass all those relfilenode numbers down to pretty low levels of the system. The nice thing about the fork idea is that you don't need any added info to uniquely identify what relation you're working on. The fork numbers would be hard-wired into whatever code needed to know about particular forks. (Of course, these same advantages apply to using special space in an existing file. I'm just suggesting that we can keep these advantages without buying into the restrictions that special space would have.)

One advantage of using separate relfilenodes would be that if we need to regenerate a map we could do it in a new relfilenode and swap it in like we do with heap rewrites.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
Re: [HACKERS] Rewriting Free Space Map
Gregory Stark wrote:
> One advantage of using separate relfilenodes would be that if we need to regenerate a map we could do it in a new relfilenode and swap it in like we do with heap rewrites.

Why can't you just do that with a different extension and a file rename? You'd need an exclusive lock while swapping in the new map, but you need that anyway, IIRC, and this way you don't even need a catalog change.

cheers andrew
Re: [HACKERS] Rewriting Free Space Map
Gregory Stark [EMAIL PROTECTED] writes:
> One advantage of using separate relfilenodes would be that if we need to regenerate a map we could do it in a new relfilenode and swap it in like we do with heap rewrites.

You could probably do that using a temporary fork number, if the situation ever came up.

regards, tom lane
Re: [HACKERS] Rewriting Free Space Map
Simon Riggs [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple forks in a relation.
> Can we call them maps or metadata maps? forks sounds weird.

I'm not wedded to forks, that's just the name that was used in the only previous example I've seen. Classic Mac had a resource fork and a data fork within each file. Don't think I like maps though, as (a) that prejudges what the alternate forks might be used for, and (b) the name fails to be inclusive of the data fork. Other suggestions, anyone?

BTW, thinking about the Mac precedent a little more, I believe the way they grafted that Classic concept onto BSD was that applications (which the user thinks of as single objects) are now directories with multiple files inside them. Probably it'd be overkill to think of turning each relation into a subdirectory, but then again maybe not?

regards, tom lane
Re: [HACKERS] Rewriting Free Space Map
On Mon, 2008-03-17 at 13:23 -0400, Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
>> Tom Lane wrote:
>>> The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple forks in a relation.
>> Can we call them maps or metadata maps? forks sounds weird.
> I'm not wedded to forks, that's just the name that was used in the only previous example I've seen. Classic Mac had a resource fork and a data fork within each file.

Layer? Slab? Sheet? Strata/um? Overlay?

Layer makes sense to me because of the way GIS and CAD systems work.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk
Re: [HACKERS] Rewriting Free Space Map
On Mon, Mar 17, 2008 at 01:23:46PM -0400, Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
>> Tom Lane wrote:
>>> The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple forks in a relation.
>> Can we call them maps or metadata maps? forks sounds weird.
> I'm not wedded to forks, that's just the name that was used in the only previous example I've seen. Classic Mac had a resource fork and a data fork within each file. Don't think I like maps though, as (a) that prejudges what the alternate forks might be used for, and (b) the name fails to be inclusive of the data fork. Other suggestions, anyone?

Segment? Section? Module?

Cheers, David.

-- David Fetter [EMAIL PROTECTED] http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: [EMAIL PROTECTED] Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] Rewriting Free Space Map
On Mon, Mar 17, 2008 at 6:23 PM, Tom Lane wrote:
> Simon Riggs writes:
>> Can we call them maps or metadata maps? forks sounds weird.
> I'm not wedded to forks, that's just the name that was used in the only previous example I've seen. Classic Mac had a resource fork and a data fork within each file.

Microsoft / NTFS calls them Data Streams: http://www.wikistc.org/wiki/Alternate_data_streams

Jochem
Re: [HACKERS] Rewriting Free Space Map
On Mon, 2008-03-17 at 13:23 -0400, Tom Lane wrote:
> I'm not wedded to forks, that's just the name that was used in the only previous example I've seen. Classic Mac had a resource fork and a data fork within each file. Don't think I like maps though, as (a) that prejudges what the alternate forks might be used for, and (b) the name fails to be inclusive of the data fork. Other suggestions, anyone?

I believe that in the world of NTFS the concept is called streams: http://support.microsoft.com/kb/105763.

HTH, Mark.

-- Mark Cave-Ayland Sirius Corporation - The Open Source Experts http://www.siriusit.co.uk T: +44 870 608 0063
Re: [HACKERS] Rewriting Free Space Map
Tom Lane [EMAIL PROTECTED] writes:
> I'm not wedded to forks, that's just the name that was used in the only previous example I've seen. Classic Mac had a resource fork and a data fork within each file.

FWIW, forks are not unique to Mac OS, cf.: http://en.wikipedia.org/wiki/Fork_%28filesystem%29

However, I'm not sure reusing any of these terms is such a hot idea. All it's going to do is confuse someone into thinking we're actually talking about HFS forks or NTFS data streams or whatever. Better to pick a term that isn't already being used for such things, so people don't get misled.

> BTW, thinking about the Mac precedent a little more, I believe the way they grafted that Classic concept onto BSD was that applications (which the user thinks of as single objects) are now directories with multiple files inside them. Probably it'd be overkill to think of turning each relation into a subdirectory, but then again maybe not?

Well, there are upsides and downsides. Many OSes have difficulties when you have many files in a single directory. This would tend to reduce that. On the other hand, it would drastically increase the number of directory files the OS has to keep track of and the total number of inodes being referenced.

-- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
Re: [HACKERS] Rewriting Free Space Map
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> Tom Lane wrote:
>>> You're cavalierly waving away a whole boatload of problems that will arise as soon as you start trying to make the index AMs play along with this :-(.
>> It doesn't seem very hard.
> The problem is that the index AMs are no longer in control of what goes where within their indexes, which has always been their prerogative to determine. The fact that you think you can kluge btree to still work doesn't mean that it will work for other AMs.

Well, it does work with all the existing AMs AFAICS. I do agree with the general point; it'd certainly be cleaner, more modular and more flexible if the AMs didn't need to know about the existence of the maps.

>>> The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple forks in a relation.
>> Hmm. You also need to teach at least xlog.c and xlogutils.c about the map forks, for full page images and the invalid page tracking.
> Well, you'd have to teach them something anyway, for any incarnation of maps that they might need to update.

Umm, the WAL code doesn't care where the pages it operates on came from. Sure, we'll need rmgr-specific code that knows what to do with the maps, but the full page image code would work without changes with the multiple-RelFileNode approach.

The essential change with the map fork idea is that a RelFileNode no longer uniquely identifies a file on disk (ignoring the segmentation, which is handled in smgr, for now). Anything that operates on RelFileNodes, without any higher-level information of what it is, needs to be modified to use RelFileNode+forkid instead. That includes at least the buffer manager, smgr, and the full page image code in xlog.c. It's probably a pretty mechanical change, even though it affects a lot of code.

We'd probably want to have a new struct, let's call it PhysFileId for now, for RelFileNode+forkid, and basically replace all occurrences of RelFileNode with PhysFileId in smgr, bufmgr and xlog code.

>> I also wonder what the performance impact of extending BufferTag is.
> That's a fair objection, and obviously something we'd need to check. But I don't recall seeing hash_any so high on any profile that I think it'd be a big problem.

I do remember seeing hash_any in some oprofile runs. But that's fairly easy to test: we don't need to actually implement any of the stuff, other than add a field to BufferTag, and run pgbench.

>> My original thought was to have a separate RelFileNode for each of the maps. That would require no smgr or xlog changes, and not very many changes in the buffer manager, though I guess you'd need more catalog changes. You had doubts about that on the previous thread (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but the map forks idea certainly seems much more invasive than that.
> The main problems with that are (a) the need to expose every type of map in pg_class and (b) the need to pass all those relfilenode numbers down to pretty low levels of the system.

(a) is certainly a valid point. Regarding (b), I don't think the low-level stuff (I assume you mean smgr, bufmgr, bgwriter, xlog by that) would need to be passed any additional relfilenode numbers. Or rather, they already work with relfilenodes, and they don't need to know whether the relfilenode is for an index, a heap, or an FSM attached to something else. The relfilenodes would be in RelationData, and we already have that around whenever we do anything that needs to differentiate between those.

Another consideration is which approach is easiest to debug. The map fork approach seems better on that front, as you can immediately see from the PhysFileId if a page is coming from an auxiliary map or the main data portion. That might turn out to be handy in the buffer manager or bgwriter as well; they don't currently have any knowledge of what a page contains.

> The nice thing about the fork idea is that you don't need any added info to uniquely identify what relation you're working on. The fork numbers would be hard-wired into whatever code needed to know about particular forks. (Of course, these same advantages apply to using special space in an existing file. I'm just suggesting that we can keep these advantages without buying into the restrictions that special space would have.)

I don't see that advantage. All the higher-level code that cares which relation you're working on already has Relation around. All the lower-level stuff doesn't care.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
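A minimal sketch of what the PhysFileId proposed above might look like: a RelFileNode paired with a hard-wired fork number, so that one struct again uniquely names a file on disk. The enum values and field names are illustrative guesses, not settled design.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hard-wired fork numbers, per the discussion above. */
typedef enum ForkNumber {
    MAIN_FORKNUM = 0,           /* heap/index data, as today */
    FSM_FORKNUM,                /* free space map */
    VISIBILITY_FORKNUM          /* proposed visibility map */
} ForkNumber;

/* Simplified stand-in for PostgreSQL's RelFileNode. */
typedef struct RelFileNode {
    uint32_t spcNode;
    uint32_t dbNode;
    uint32_t relNode;
} RelFileNode;

/* RelFileNode + fork: uniquely identifies a file on disk again. */
typedef struct PhysFileId {
    RelFileNode rnode;
    ForkNumber  forknum;
} PhysFileId;

/* The debugging advantage Heikki notes: a page from an auxiliary map
 * is immediately distinguishable from a data page by its id alone. */
int is_map_page(const PhysFileId *id)
{
    return id->forknum != MAIN_FORKNUM;
}
```

Replacing RelFileNode with such a struct throughout smgr, bufmgr and xlog is exactly the mechanical-but-widespread change described above.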
Re: [HACKERS] Rewriting Free Space Map
On Mon, Mar 17, 2008 at 6:23 PM, Tom Lane [EMAIL PROTECTED] wrote:
> I'm not wedded to "forks", that's just the name that was used in the only previous example I've seen. Classic Mac had a "resource fork" and a "data fork" within each file. Don't think I like "maps" though, as (a) that prejudges what the alternate forks might be used for, and (b) the name fails to be inclusive of the data fork. Other suggestions anyone?

"Shadow"? As each, err, fork trails each relfilenode? (Or perhaps "shade".) "Hints"? As something more generic than "map"?

Regards,
Dawid
Re: [HACKERS] Rewriting Free Space Map
Heikki Linnakangas [EMAIL PROTECTED] writes:
> I've started working on revamping Free Space Map, using the approach where we store a map of heap pages on every nth heap page. What we need now is discussion on the details of how exactly it should work.

You're cavalierly waving away a whole boatload of problems that will arise as soon as you start trying to make the index AMs play along with this :-(. Hash, for instance, has very narrow-minded ideas about page allocation within its indexes.

Also, I don't think that "use the special space" will scale to handle other kinds of maps, such as the proposed dead space map. (This is exactly why I said the other day that we need a design roadmap for all these ideas.)

The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple "forks" in a relation. In addition to the main data fork, which contains the same info as now, there could be one or more map forks which are separate files in the filesystem. They are named by relfilenode plus an extension; for instance, a relation with relfilenode NNN would have a data fork in file NNN (plus perhaps NNN.1, NNN.2, etc) and a map fork named something like NNN.map (plus NNN.map.1 etc as needed).

We'd have to add one more field to buffer lookup keys (BufferTag) to disambiguate which fork the referenced page is in. Having bitten that bullet, though, the idea trivially scales to any number of map forks with potentially different space requirements and different locking and WAL-logging requirements. Another possible advantage is that a new map fork could be added to an existing table without much trouble, which is certainly something we'd need if we ever hope to get update-in-place working.

The main disadvantage I can see is that for very small tables, the percentage overhead from multiple map forks of one page apiece is annoyingly high.
However, most of the point of a map disappears if the table is small, so we might finesse that by not creating any maps until the table has reached some minimum size.

regards, tom lane
Re: [HACKERS] Rewriting Free Space Map
Tom Lane wrote:
> The idea that's becoming attractive to me while contemplating the multiple-maps problem is that we should adopt something similar to the old Mac OS idea of multiple "forks" in a relation. In addition to the main data fork which contains the same info as now, there could be one or more map forks which are separate files in the filesystem.

I think something similar could be used to store tuple visibility bits separately from the heap tuple data itself, so +1 to this idea.

(The rough idea in my head was that you can do an indexscan and look up visibility bits without having to pull the whole heap along; and visibility updates are also cheaper, whether they come from indexscans or heap scans. Of course, the implicit cost is that a seqscan needs to fetch the visibility pages too, and the locking is more complex.)

-- 
Alvaro Herrera                         http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.