I've started working on revamping the Free Space Map, using the approach where we store a map of heap pages on every nth heap page. What we need now is discussion on the details of how exactly it should work.

Here's my rough plan on the implementation. It's long, sorry. I'm fairly confident with it, but please let me know if you see any problems or have any suggestions or better ideas.

Heap FSM
--------

The FSM is stored in the special area of every nth heap page. When extending the relation, the heapam checks if the block number of the new page is one that belongs to the FSM. If it is, it lets the FSM initialize the page by calling InitFSMPage() on it, and extends the relation again to get another, normal heap page.
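
As a rough illustration (not code from the patch), the check during extension could look something like this; FSM_EVERY_N_PAGES and InitFSMPage() are assumed names, and the interval value is just a placeholder:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* mirrors PostgreSQL's BlockNumber */

#define FSM_EVERY_N_PAGES 8192  /* "every nth page"; the real value depends on BLCKSZ and page overhead */

static bool
BlockIsFSMPage(BlockNumber blkno)
{
    /* blocks 0, n, 2n, ... are reserved for the map */
    return (blkno % FSM_EVERY_N_PAGES) == 0;
}

/*
 * Sketch of the extension logic:
 *
 *   blkno = extend the relation;
 *   if (BlockIsFSMPage(blkno))
 *   {
 *       InitFSMPage(blkno);            -- hand the new page to the FSM
 *       blkno = extend the relation;   -- and get a normal heap page
 *   }
 */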

I chose the "every nth page is an FSM page" approach, rather than using a separate relfilenode, which I also considered. The separate relfilenode approach has some advantages, like being able to scan all FSM pages in a sequential fashion, but it involves a fair amount of catalog and buffer manager changes.

It's convenient that the FSM uses up the whole page, leaving no room for heap tuples. It simplifies the locking, as we don't need to worry about the possibility that the caller of an FSM function is already holding a lock on the same page.

In an FSM page, there's one byte for each of the next N heap pages, starting from the FSM page. That one byte stores the amount of free space on the corresponding heap page, in BLCKSZ/256 byte precision (32 bytes with default block size).

The mapping of free space to these 256 "buckets" wouldn't necessarily have to be linear; we could, for example, have a single bucket for pages with more than BLCKSZ/2 bytes of free space and divide the rest linearly into 16 byte buckets. But let's keep it simple for now. Of course, we could also just use 2 bytes per page and store the amount of free space exactly, but 32 byte precision seems like enough to me.
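
For illustration, the simple linear mapping boils down to something like this (just a sketch; the names are mine, not from the patch, assuming the default 8 kB BLCKSZ):

#include <stdint.h>

#define BLCKSZ       8192               /* default block size */
#define FSM_CAT_STEP (BLCKSZ / 256)     /* 32 bytes per bucket */

/* Free bytes on a heap page -> the one-byte value stored in the FSM. */
static uint8_t
FreeSpaceToFSMByte(int freespace)
{
    return (uint8_t) (freespace / FSM_CAT_STEP);
}

/*
 * And back: a stored value of k guarantees at least k * 32 bytes free,
 * so this is a conservative estimate.
 */
static int
FSMByteToFreeSpace(uint8_t cat)
{
    return (int) cat * FSM_CAT_STEP;
}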


Index FSM
---------

Indexes use a similar scheme, but since we only need to keep track of whether a page is in use or not, we only need one bit per page. If the amount of free space on pages becomes interesting for an indexam in the future, it can use the heap FSM implementation instead. Or no FSM at all, like the hash indexam.
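
A sketch of the one-bit-per-page operations (the function names are made up; "map" would point into the FSM data on the page, and "slot" is the page's position within the range that FSM page covers):

#include <stdbool.h>
#include <stdint.h>

/* Mark whether the index page in the given slot is available for reuse. */
static void
IndexFSMSetAvail(uint8_t *map, int slot, bool avail)
{
    if (avail)
        map[slot / 8] |= (uint8_t) (1 << (slot % 8));
    else
        map[slot / 8] &= (uint8_t) ~(1 << (slot % 8));
}

static bool
IndexFSMGetAvail(const uint8_t *map, int slot)
{
    return (map[slot / 8] & (1 << (slot % 8))) != 0;
}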

To use the index FSM, the indexam needs to leave every nth page alone, like in the heap. The B-tree assumes that the metapage is at block 0, but that's also where we would like to store the first index FSM page. To overcome that, we can make the index FSM special area a little bit smaller, so that the B-tree metadata fits on the same page as the FSM information. That space will be wasted on FSM pages other than block 0, but we're only talking about 24 bytes per FSM page, and we only need one FSM page per ~512 MB of index pages (with default BLCKSZ).
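
The ~512 MB figure is back-of-the-envelope math along these lines (assuming, for simplicity, that roughly the whole page is usable as a bitmap):

#include <stdio.h>

#define BLCKSZ 8192

int
main(void)
{
    /* one bit per index page, roughly a whole page's worth of bits */
    long pages_tracked = (long) BLCKSZ * 8;         /* ~65000 index pages */
    long bytes_covered = pages_tracked * BLCKSZ;

    printf("one index FSM page covers ~%ld MB of index\n",
           bytes_covered / (1024 * 1024));          /* ~512 MB */
    return 0;
}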


Performance
-----------

The patch I'm working on currently uses a naive way to find a page in the FSM. To find a page with X bytes of free space, it scans the FSM pages until one is found, and it always starts the scan from the beginning, which is far from optimal. Worse, when there's no page with enough free space, it still has to scan all the FSM pages just to find out that we need to extend the relation.

To speed things up, we're going to need some mechanism to avoid that. First of all, we need to somehow cache the information that "there's no page with >= X bytes left", to avoid fruitless scanning. To speed up the case when there are only a few pages with enough free space, we can keep a limited-size list of such pages in addition to the map.

This information needs to be in shared memory, either on heap pages managed by the buffer manager like the FSM pages, or in a separate shmem block. I would like to go with normal bufmgr-managed pages, as fixed-size memory blocks have their problems, and the cached information should be stored to disk as well.

Let's have one special page in the heap file, called the Free Space List (FSL) page, in addition to the normal FSM pages. It has the following structure:

struct
{
    /* one bit per free-space bucket: is there any page in this bucket? */
    uint8       anypages[256 / 8];

    /* small cache of pages known to have free space */
    struct
    {
        BlockNumber blockno;
        uint8       freespace;      /* one-byte bucket, as in the FSM */
    } freespacelist[1];             /* actually as many entries as fit on the page */
};

Remember that we track the free space on each page using one byte; IOW, each page falls into one of 256 buckets of free space. In the anypages bitmap, we have one bit per bucket indicating "are there any pages with this much free space". When we look for a page with X bytes, we check the bits from the bucket corresponding to X bytes upward, and if none of them are set we know not to bother scanning the FSM pages.
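
In code, that check might look roughly like this (a sketch; the function name and the 32-byte bitmap representation are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ       8192
#define FSM_CAT_STEP (BLCKSZ / 256)

/*
 * Return false if the anypages bitmap tells us that no page in the
 * relation has "needed" bytes free, so the caller can skip the FSM scan
 * and extend the relation instead.
 */
static bool
FSLMightHaveSpace(const uint8_t anypages[256 / 8], int needed)
{
    int     minbucket = (needed + FSM_CAT_STEP - 1) / FSM_CAT_STEP;

    /* look at buckets minbucket .. 255: any page with this much space? */
    for (int bucket = minbucket; bucket < 256; bucket++)
    {
        if (anypages[bucket / 8] & (1 << (bucket % 8)))
            return true;
    }
    return false;
}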

To speed up the scan when there is space, we keep a simple list of pages with free space. This list is actually like the current FSM, but here we only use it as a small cache of the FSM pages. VACUUM and any other operations that update the FSM can put pages into the list when there are free slots.
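
For example, VACUUM could do something like this when it finds a page with free space (a sketch; the function name and entry layout are assumptions, matching the struct above):

#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct
{
    BlockNumber blockno;
    uint8_t     freespace;          /* same one-byte bucket as the FSM */
} FSLEntry;

/*
 * Record a page in the FSL cache if there is a free slot.  Losing the
 * entry is harmless, since the FSM pages hold the authoritative data.
 */
static void
FSLTryAddPage(FSLEntry *list, int nslots, BlockNumber blkno, uint8_t freespace)
{
    for (int i = 0; i < nslots; i++)
    {
        if (list[i].blockno == InvalidBlockNumber)
        {
            list[i].blockno = blkno;
            list[i].freespace = freespace;
            return;
        }
    }
    /* list full: drop the hint, the map itself is still up to date */
}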

We can store the FSL page at a magic location, say block #100. For relations smaller than that, there's no need for the FSL and we might as well scan the FSM page. I'm not sure if we should have more than one FSL page for large tables.

I'm not sure yet how the locking of FSL and FSM pages should work. It shouldn't be too hard, though, as the FSM/FSL information is just a hint anyway. We do need to take care that we don't permanently lose track of any significant amount of free space, because once we can do partial vacuums using the visibility map, VACUUM might not visit one part of a table even if other parts of it are frequently updated.


Page allocation algorithm
-------------------------

There are many different ways we can hand out pages from the FSM. Possible strategies, some of which are at odds with each other, include:

1. Try to spread out inserts of different backends, to avoid contention
2. Prefer low-numbered pages, to increase the chance of being able to truncate in VACUUM.
3. Reserve pages with lots of free space for big allocations, prefer almost full pages for small allocations. To use all space more efficiently.
4. If the table is clustered, try to use pages close to those with similar values.
5. On UPDATE, try to use pages close to the old tuple.
6. Prefer pages that are currently in cache, to avoid I/O.

The current FSM tries to achieve only 1, and there haven't been many complaints, so I wouldn't worry too much about the other goals. 4 and 6 would need some major changes to the indexam and buffer manager interfaces, respectively, and 3 could lead to more I/O when you do a lot of small inserts.

We can spread out the inserts of different backends by moving a page to the end of the Free Space List when it's handed out, or by removing it from the list altogether. When the FSL is empty, we can vary the place where we start to scan the FSM.
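
A sketch of that hand-out policy, using the same freespacelist entries as above (again, the names are made up):

#include <stdint.h>
#include <string.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct                      /* same FSLEntry as in the earlier sketch */
{
    BlockNumber blockno;
    uint8_t     freespace;
} FSLEntry;

/*
 * Return the first cached page with enough space, then move it to the end
 * of the list so the next backend gets a different page, spreading
 * concurrent inserts across blocks.
 */
static BlockNumber
FSLGetPage(FSLEntry *list, int nslots, uint8_t needed)
{
    for (int i = 0; i < nslots; i++)
    {
        if (list[i].blockno != InvalidBlockNumber &&
            list[i].freespace >= needed)
        {
            FSLEntry    found = list[i];

            /* shift the remaining entries up and put this one last */
            memmove(&list[i], &list[i + 1],
                    (nslots - i - 1) * sizeof(FSLEntry));
            list[nslots - 1] = found;
            return found.blockno;
        }
    }
    return InvalidBlockNumber;      /* fall back to scanning the FSM pages */
}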

To prefer low-numbered pages, and thus increase the chance of being able to truncate the relation later, VACUUM can favor lower-numbered blocks when deciding which blocks to put into the FSL. We can also bias the starting point of the FSM scan when the FSL is empty.


Visibility Map
--------------

So far I've only talked about the FSM, but it's important to consider how the Visibility Map fits into the scheme. My current thinking is that there will be one bit per heap page in the visibility map. The exact semantics of that one bit are still not totally clear, but that's not important right now.

There are a few alternatives:
1. Mix the visibility map with the FSM, stealing one bit of every FSM byte. There would then be 7 bits for storing how much space there is on each page, and 1 bit indicating the visibility.

2. Allocate part of the FSM pages for the visibility map. For example, the first 1/9 of the page for the VM, and 8/9 for the FSM. The VM part would be a straight bitmap with one bit per page, and the FSM part would use one byte per page.

3. Use different pages for the VM, for example every nth page for VM, and every (n+1)th page for FSM.

I'm leaning towards 2 at the moment. 3 is intriguing as well, though, because it would help with potential lock contention on the VM/FSM pages.
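
For alternative 2, the split could be laid out like this (a rough sketch that ignores page header and special-area overhead):

#include <stdint.h>

#define BLCKSZ   8192
#define MAPAREA  BLCKSZ                     /* pretend the whole page is map */

/*
 * With one VM bit plus one FSM byte (= 9 bits) of map data per heap page,
 * one map page covers MAPAREA * 8 / 9 heap pages: the VM bitmap takes
 * roughly the first 1/9 of the page, the FSM bytes the remaining 8/9.
 */
#define PAGES_PER_MAPPAGE  (MAPAREA * 8 / 9)            /* ~7281 heap pages */
#define VM_OFFSET          0                            /* bitmap starts here */
#define FSM_OFFSET         ((PAGES_PER_MAPPAGE + 7) / 8)    /* then one byte per page */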


Per-chunk relfrozenxid
----------------------
I'm imagining that we would set a bit in the VM if a page has no dead tuples at all, making it possible to use it for index-only scans. Such a page is also uninteresting for VACUUM. However, we would still need to scan the whole table to freeze tuples.

To alleviate that, we could allocate some space in the FSM pages to indicate the smallest Xid in a range of heap pages. IOW, like relfrozenxid, but with a granularity of N pages rather than the whole relation. You would still have to scan that range of pages to update it, but that's much better than scanning the whole relation.
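
A sketch of what that extra field could look like in the FSM page layout (the names are made up; TransactionId mirrors the PostgreSQL typedef):

#include <stdint.h>

typedef uint32_t TransactionId;     /* mirrors PostgreSQL's TransactionId */

/*
 * Hypothetical extra field in each FSM page: the oldest unfrozen xid in
 * the range of heap pages this FSM page covers.  A freeze scan could then
 * skip any range whose value is already recent enough.
 */
typedef struct
{
    TransactionId range_frozenxid;  /* like relfrozenxid, but per N pages */
    /* ... followed by the per-page free space bytes ... */
} FSMPageHeaderSketch;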

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
