On Thu, Nov 26, 2009 at 4:14 PM, Dan Eloff <dan.el...@gmail.com> wrote:
> Hi Hackers,
>
> I've familiarized myself a little with the architecture of postgresql,
> largely because it's interesting. There's one thing I can't quite
> figure out though, and it seems that there's no better group of people
> in the world to ask about it.
>
> At the lower levels in PG, reading from the disk into cache, and
> writing from the cache to the disk is always done in pages.
>
> Why does PG work this way? Is it any slower to write whole pages
> rather than just the region of the page that changed? Conversely, is
> it faster? From what I think I know of operating systems, reading
> should bring the whole page into the os buffers anyway, so reading the
> whole page instead of just part of it isn't much more expensive.
> Perhaps writing works similarly?
Yep. It's not just PG that organizes things into pages - disks and disk
caches and main memory and kernel buffers are similarly organized. So,
for example, to write 5 bytes to the disk, some part of the stack (disk
drive or kernel or application) must read the whole page (however large
it is) into memory, modify the 5 bytes, and write it back out. Changing
a larger fraction of the page contents is essentially free.

Another reason to use pages is that a lot of operations have a
substantial "per request" overhead. It would be extremely inefficient
to have the operating system request each desired byte from the disk
individually, or likewise the application from the kernel. In fact, I
think that disks are typically addressed in fixed-size sectors, and the
atomic operation is actually to read or write a sector rather than a
byte range. But even at the system call level, where the kernel
interface will LET you read or write bytes one at a time, it's
dreadfully slow (see the first sketch below).

Organizing things into pages also simplifies bookkeeping. For example,
suppose you want to keep track of which parts of your 256MB
shared-memory buffer need to be written out to disk. If you organize
your data into 8K pages, you can use a bitmap with 1 meaning dirty
(needs to be written) and 0 meaning clean: 256MB / 8K = 32768 pages, so
the whole data structure is 32768 bits and fits in 4K of memory (see
the second sketch below). If you need to track arbitrary ranges of
dirty bytes instead, you'll need a pair of 32-bit integers for each
range (starting offset and ending offset, or starting offset and
length). There's no reasonable fixed-size data structure which is
guaranteed to be large enough to hold all the ranges you might have
(and 4K is not even close to adequate), and inserting additional
ranges, or clearing out ranges that have been written, will be FAR more
expensive than under the bitmap implementation. You'll likely need to
set up some kind of balanced tree structure to make the performance
reasonable, which will be more complex and therefore more likely to
have bugs - and it'll still be slower.
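To put some numbers on that "per request" overhead, here's a minimal
sketch in C (not PostgreSQL code; the 8K page size, the test file name,
and the timing scaffolding are just assumptions for illustration). It
reads one page's worth of data byte-at-a-time and then again with a
single read():

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define PAGE_SIZE 8192

static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(int argc, char **argv)
{
    char buf[PAGE_SIZE];
    struct timespec t0, t1;
    /* Hypothetical test file; any file of at least 8K will do. */
    int fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);

    if (fd < 0) { perror("open"); return 1; }

    /* One byte per read(): PAGE_SIZE separate kernel crossings. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < PAGE_SIZE; i++)
        if (read(fd, buf + i, 1) != 1) break;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("byte-at-a-time: %.6f sec\n", elapsed(t0, t1));

    /* One page per read(): a single kernel crossing. */
    lseek(fd, 0, SEEK_SET);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (read(fd, buf, PAGE_SIZE) < 0) perror("read");
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("whole page:     %.6f sec\n", elapsed(t0, t1));

    close(fd);
    return 0;
}

Even when the kernel's page cache absorbs the actual disk I/O, the
byte-at-a-time loop pays the system call overhead 8192 times.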
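And here's a minimal sketch of the bitmap bookkeeping (invented names,
not PostgreSQL's actual buffer manager code), showing that the dirty
map for 32768 pages really is just 4K of memory:

#include <stdint.h>
#include <stdio.h>

#define BUFFER_BYTES (256 * 1024 * 1024)
#define PAGE_BYTES   8192
#define NPAGES       (BUFFER_BYTES / PAGE_BYTES)   /* 32768 pages */

static uint8_t dirty_map[NPAGES / 8];              /* exactly 4096 bytes */

/* Setting, clearing, and testing a dirty bit are all O(1). */
static void mark_dirty(int page) { dirty_map[page / 8] |= 1 << (page % 8); }
static void mark_clean(int page) { dirty_map[page / 8] &= ~(1 << (page % 8)); }
static int  is_dirty(int page)   { return (dirty_map[page / 8] >> (page % 8)) & 1; }

int main(void)
{
    mark_dirty(7);
    mark_dirty(32767);

    /* A "checkpoint": scan the map and write out whatever is dirty. */
    for (int page = 0; page < NPAGES; page++)
        if (is_dirty(page))
        {
            printf("would write page %d\n", page);
            mark_clean(page);
        }

    printf("bitmap size: %zu bytes\n", sizeof dirty_map);
    return 0;
}

Contrast that with tracking byte ranges: there's no fixed upper bound
on the number of ranges, so the structure has to grow and shrink, and
you're maintaining a balanced tree where the bitmap needs only a shift
and a mask.

...Robert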