Re: [HACKERS] Buffer Management
Curt Sampson [EMAIL PROTECTED] writes:
> Note that your proposal of using mmap to replace sysv shared memory relies on the behaviour I've described too.

True, but I was not envisioning mapping an actual file --- at least on HPUX, the only way to generate an arbitrary-sized shared memory region is to use MAP_ANONYMOUS and not have the mmap'd area connected to any file at all. It's not farfetched to think that this aspect of mmap might work differently from mapping pieces of actual files.

In practice of course we'd have to restrict use of any such implementation to platforms where mmap behaves reasonably ... according to our definition of reasonably.

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Buffer Management
Tom Lane wrote:
> Curt Sampson [EMAIL PROTECTED] writes:
> > Note that your proposal of using mmap to replace sysv shared memory relies on the behaviour I've described too.
>
> True, but I was not envisioning mapping an actual file --- at least on HPUX, the only way to generate an arbitrary-sized shared memory region is to use MAP_ANONYMOUS and not have the mmap'd area connected to any file at all. It's not farfetched to think that this aspect of mmap might work differently from mapping pieces of actual files.
>
> In practice of course we'd have to restrict use of any such implementation to platforms where mmap behaves reasonably ... according to our definition of reasonably.

Yes, I am told mapping /dev/zero is the same as the anon map.

--
Bruce Momjian | http://candle.pha.pa.us
[EMAIL PROTECTED] | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
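For comparison, a sketch of the /dev/zero variant mentioned above, assuming the platform provides /dev/zero and lets it be mapped MAP_SHARED. The function names are hypothetical. The region starts zero-filled and, like the anonymous case, is shared with children forked after the mapping is made.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: a shared region obtained by mapping /dev/zero.
 * Returns NULL if the device cannot be opened or mapped. */
void *devzero_shared_create(size_t size)
{
    assert(size > 0);
    int fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
    return p == MAP_FAILED ? NULL : p;
}

/* Same round trip as with an anonymous map: the child stores, the parent
 * reads. Returns -1 on failure or if the region is not zero-filled. */
int devzero_shared_roundtrip(int value)
{
    int *slot = devzero_shared_create(sizeof *slot);
    if (slot == NULL)
        return -1;

    int seen = -1;
    if (*slot == 0) {           /* pages from /dev/zero start zero-filled */
        pid_t pid = fork();
        if (pid == 0) {
            *slot = value;
            _exit(0);
        }
        if (pid > 0 && waitpid(pid, NULL, 0) == pid)
            seen = *slot;
    }

    munmap(slot, sizeof *slot);
    return seen;
}
```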
Re: [HACKERS] Buffer Management
On Wed, 26 Jun 2002, Tom Lane wrote:
> Curt Sampson [EMAIL PROTECTED] writes:
> > Note that your proposal of using mmap to replace sysv shared memory relies on the behaviour I've described too.
>
> True, but I was not envisioning mapping an actual file --- at least on HPUX, the only way to generate an arbitrary-sized shared memory region is to use MAP_ANONYMOUS and not have the mmap'd area connected to any file at all. It's not farfetched to think that this aspect of mmap might work differently from mapping pieces of actual files.

I find it somewhat farfetched, for a couple of reasons:

1. Memory mapped with the MAP_SHARED flag is shared memory, anonymous or not. POSIX is pretty explicit about how this works, and the standard for mmap that predates POSIX is the same. Anonymous memory does not behave differently. You could just as well say that some systems might exist such that one process can write() a block to a file, and then another might read() it afterwards but not see the changes. Postgres should not try to deal with hypothetical systems that are so completely broken.

2. Mmap is implemented as part of a unified buffer cache system on all of today's operating systems that I know of. The memory is backed by swap space when anonymous, and by a specified file when not anonymous; but the way these two are handled is *exactly* the same internally. Even on older systems without unified buffer cache, the behaviour is the same between anonymous and file-backed mmap'd memory. And there would be no point in making it otherwise. Mmap is designed to let you share memory; why make a broken implementation under certain circumstances?

> In practice of course we'd have to restrict use of any such implementation to platforms where mmap behaves reasonably ... according to our definition of reasonably.

Of course. As we do already with regular I/O.

cjs
--
Curt Sampson [EMAIL PROTECTED] +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Re: [HACKERS] Buffer Management
So, while we're at it, what's the current state of people's thinking on using mmap rather than shared memory for data file buffers? I see some pretty powerful advantages to this approach, and I'm not (yet :-)) convinced that the disadvantages are as bad as people think. I think I can address most of the concerns in doc/TODO.detail/mmap.

Is this worth pursuing a bit? (I.e., should I spend an hour or two writing up the advantages and thoughts on how to get around the problems?) Anybody got objections that aren't in doc/TODO.detail/mmap?

cjs
--
Curt Sampson
Re: [HACKERS] Buffer Management
Curt Sampson [EMAIL PROTECTED] writes:
> So, while we're at it, what's the current state of people's thinking on using mmap rather than shared memory for data file buffers?

There seem to be a couple of different threads in doc/TODO.detail/mmap. One envisions mmap as a one-for-one replacement for our current use of SysV shared memory, the main selling point being to get out from under kernels that don't have SysV support or have it configured too small. This might be worth doing, and I think it'd be relatively easy to do now that the shared memory support is isolated in one file and there are provisions for selecting a shmem implementation at configure time. The only thing you'd really have to think about is how to replace the current behavior that uses shmem attach counts to discover whether any old backends are left over from a previous crashed postmaster. I dunno if mmap offers any comparable facility.

The other discussion seemed to be considering how to mmap individual data files right into backends' address space. I do not believe this can possibly work, because of loss of control over visibility of data changes to other backends, timing of write-backs, etc. But as long as you stay away from interpretation #2 and go with mmap-as-a-shmget-substitute, it might be worthwhile.

(Hey Marc, can one do mmap in a BSD jail?)

regards, tom lane
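For readers unfamiliar with the attach counts mentioned above: System V shared memory reports the number of current attachments in the shm_nattch field returned by shmctl(IPC_STAT). The sketch below shows that interface on a throwaway private segment. The function names are hypothetical, and this is an illustration of the kernel facility, not a copy of the postmaster's actual check.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Number of processes currently attached to segment `shmid`,
 * or -1 if the segment cannot be examined. */
long shm_attach_count(int shmid)
{
    struct shmid_ds ds;
    if (shmctl(shmid, IPC_STAT, &ds) < 0)
        return -1;
    return (long) ds.shm_nattch;
}

/* Create a private segment and record its attach count before and after
 * attaching to it. Returns 0 on success, -1 on failure. */
int shm_attach_demo(long *before, long *after)
{
    assert(before != NULL && after != NULL);

    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;

    *before = shm_attach_count(shmid);

    void *addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1) {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }
    *after = shm_attach_count(shmid);

    shmdt(addr);
    shmctl(shmid, IPC_RMID, NULL);  /* remove the segment once we are done */
    return 0;
}
```

A nonzero count on a segment left behind by a crashed postmaster indicates that some process still has it attached. A plain mmap'd region has no equivalent counter, which is the gap being pointed out.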
Re: [HACKERS] Buffer Management
On Tue, 2002-06-25 at 09:09, Tom Lane wrote:
> Curt Sampson [EMAIL PROTECTED] writes:
> > So, while we're at it, what's the current state of people's thinking on using mmap rather than shared memory for data file buffers?

[snip]

> (Hey Marc, can one do mmap in a BSD jail?)

I believe the answer is YES. I can send you the man pages if you want.

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Re: [HACKERS] Buffer Management
On Tue, 25 Jun 2002, Tom Lane wrote:
> The only thing you'd really have to think about is how to replace the current behavior that uses shmem attach counts to discover whether any old backends are left over from a previous crashed postmaster. I dunno if mmap offers any comparable facility.

Sure. Just mmap a file, and it will be persistent.

> The other discussion seemed to be considering how to mmap individual data files right into backends' address space. I do not believe this can possibly work, because of loss of control over visibility of data changes to other backends, timing of write-backs, etc.

I don't understand why there would be any loss of visibility of changes. If two backends mmap the same block of a file, and it's shared, that's the same block of physical memory that they're accessing. Changes don't even need to propagate, because the memory is truly shared. You'd keep your locks in the page itself as well, of course.

Can you describe the problem in more detail?

> But as long as you stay away from interpretation #2 and go with mmap-as-a-shmget-substitute, it might be worthwhile.

It's #2 that I was really looking at. :-)

cjs
--
Curt Sampson
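The visibility claim above can be tried out with a small sketch: two processes each make their own MAP_SHARED mapping of the same page of a file, one stores a byte, and the other looks for it. The names are hypothetical, one system page stands in for a data block, and the scratch file under /tmp is an assumption. It demonstrates behaviour on a system with a unified buffer cache and does not show that every mmap implementation behaves this way.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map page number `pageno` of the file at `path`, shared and writable. */
static char *map_file_page(const char *path, long pageno, long pagesz)
{
    assert(pageno >= 0 && pagesz > 0);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, (size_t) pagesz, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, (off_t) pageno * pagesz);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}

/* Returns 1 if a store made through the child's own mapping is visible
 * through the parent's mapping, 0 if not, -1 if the setup fails. */
int file_page_is_shared(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    char path[] = "/tmp/mmap_shared_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    char *mine = NULL;
    if (pagesz > 0 && ftruncate(fd, 2 * pagesz) == 0)
        mine = map_file_page(path, 1, pagesz);
    close(fd);

    int result = -1;
    if (mine != NULL) {
        pid_t pid = fork();
        if (pid == 0) {
            /* The child opens and maps the file independently. */
            char *theirs = map_file_page(path, 1, pagesz);
            if (theirs == NULL)
                _exit(1);
            theirs[0] = 'x';
            _exit(0);
        }
        if (pid > 0 && waitpid(pid, NULL, 0) == pid)
            result = (mine[0] == 'x');
        munmap(mine, (size_t) pagesz);
    }

    unlink(path);
    return result;
}
```

Note that this only addresses visibility between processes. It says nothing about when the kernel writes the dirty page back to the file on disk.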
Re: [HACKERS] Buffer Management
Tom Lane wrote:
> Curt Sampson [EMAIL PROTECTED] writes:
> > So, while we're at it, what's the current state of people's thinking on using mmap rather than shared memory for data file buffers?
>
> There seem to be a couple of different threads in doc/TODO.detail/mmap. One envisions mmap as a one-for-one replacement for our current use of SysV shared memory, the main selling point being to get out from under kernels that don't have SysV support or have it configured too small. This might be worth doing, and I think it'd be relatively easy to do now that the shared memory support is isolated in one file and there are provisions for selecting a shmem implementation at configure time. The only thing you'd really have to think about is how to replace the current behavior that uses shmem attach counts to discover whether any old backends are left over from a previous crashed postmaster. I dunno if mmap offers any comparable facility.
>
> The other discussion seemed to be considering how to mmap individual data files right into backends' address space. I do not believe this can possibly work, because of loss of control over visibility of data changes to other backends, timing of write-backs, etc.

Agreed. Also, there was an interesting thread saying that mmap'ing /dev/zero is the same as anonmap for OSes that don't have anonmap. That should cover most of them. The only downside I can see is that SysV shared memory is locked into RAM on some/most OSes while mmap anon probably isn't. Locking in RAM is good in most cases, bad in others.

This will also work well when we have non-SysV semaphore support, like Posix semaphores, so we would be able to run with no SysV stuff.

--
Bruce Momjian
Re: [HACKERS] Buffer Management
Curt Sampson [EMAIL PROTECTED] writes:
> On Tue, 25 Jun 2002, Tom Lane wrote:
> > The other discussion seemed to be considering how to mmap individual data files right into backends' address space. I do not believe this can possibly work, because of loss of control over visibility of data changes to other backends, timing of write-backs, etc.
>
> I don't understand why there would be any loss of visibility of changes. If two backends mmap the same block of a file, and it's shared, that's the same block of physical memory that they're accessing.

Is it? You have a mighty narrow conception of the range of implementations that's possible for mmap.

But the main problem is that mmap doesn't let us control when changes to the memory buffer will get reflected back to disk --- AFAICT, the OS is free to do the write-back at any instant after you dirty the page, and that completely breaks the WAL algorithm. (WAL = write AHEAD log; the log entry describing a change must hit disk before the data page change itself does.)

regards, tom lane
Re: [HACKERS] Buffer Management
Isn't that what msync() is for? Or is this not portable?

-----Original Message-----
From: Tom Lane [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 25 June 2002 16:30
To: Curt Sampson
Cc: J. R. Nield; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] Buffer Management

Curt Sampson [EMAIL PROTECTED] writes:
> On Tue, 25 Jun 2002, Tom Lane wrote:
> > The other discussion seemed to be considering how to mmap individual data files right into backends' address space. I do not believe this can possibly work, because of loss of control over visibility of data changes to other backends, timing of write-backs, etc.
>
> I don't understand why there would be any loss of visibility of changes. If two backends mmap the same block of a file, and it's shared, that's the same block of physical memory that they're accessing.

Is it? You have a mighty narrow conception of the range of implementations that's possible for mmap.

But the main problem is that mmap doesn't let us control when changes to the memory buffer will get reflected back to disk --- AFAICT, the OS is free to do the write-back at any instant after you dirty the page, and that completely breaks the WAL algorithm. (WAL = write AHEAD log; the log entry describing a change must hit disk before the data page change itself does.)

regards, tom lane
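A sketch of what msync() does provide, for reference. The function name and the scratch file under /tmp are hypothetical. It dirties one byte of a shared file mapping, calls msync() with MS_SYNC, and reads the byte back through the file descriptor. The call gives an upper bound on when the change is written; it places no lower bound on it, since the kernel may write a dirty page before msync() is ever called.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Store `byte` through a shared file mapping, force it out with msync(),
 * then read it back with pread(). Returns the byte read, or -1 on failure. */
int msync_flush_demo(unsigned char byte)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    assert(pagesz > 0);

    char path[] = "/tmp/msync_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int result = -1;
    if (ftruncate(fd, pagesz) == 0) {
        unsigned char *page = mmap(NULL, (size_t) pagesz,
                                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (page != MAP_FAILED) {
            page[0] = byte;     /* the page is now dirty */

            /* Once this returns successfully the change has been written
             * out. Nothing prevented the kernel from writing it sooner. */
            if (msync(page, (size_t) pagesz, MS_SYNC) == 0) {
                unsigned char back;
                if (pread(fd, &back, 1, 0) == 1)
                    result = back;
            }
            munmap(page, (size_t) pagesz);
        }
    }

    close(fd);
    unlink(path);
    return result;
}
```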
Re: [HACKERS] Buffer Management
Mario Weilguni [EMAIL PROTECTED] writes:
> Isn't that what msync() is for? Or is this not portable?

msync can force not-yet-written changes down to disk. It does not prevent the OS from choosing to write changes *before* you invoke msync. For example, the HPUX man page for msync says:

    Normal system activity can cause pages to be written to disk. Therefore, there are no guarantees that msync() is the only control over when pages are or are not written to disk.

Our problem is that we want to enforce the write ordering WAL before data file. To do that, we write and fsync (or DSYNC, or something) a WAL entry before we issue the write() against the data file. We don't really care if the kernel delays the data file write beyond that point, but we can be certain that the data file write did not occur too early.

msync is designed to ensure exactly the opposite constraint: it can guarantee that no changes remain unwritten after time T, but it can't guarantee that changes aren't written before time T.

regards, tom lane
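The ordering described above can be sketched in a few lines: append the log record, fsync the log, and only then issue the write() against the data file. This is a simplified illustration with hypothetical function names, record contents, and scratch files under /tmp. It is not the actual PostgreSQL WAL code, which has its own record format and buffering.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Apply one change under the write-ahead rule. Returns 0 on success. */
int wal_then_data(int wal_fd, const void *rec, size_t rec_len,
                  int data_fd, const void *page, size_t page_len,
                  off_t page_off)
{
    assert(rec != NULL && page != NULL);

    /* 1. Write the log record describing the change. */
    if (write(wal_fd, rec, rec_len) != (ssize_t) rec_len)
        return -1;

    /* 2. Force the log record to disk before touching the data file. */
    if (fsync(wal_fd) != 0)
        return -1;

    /* 3. Only now hand the data page to the kernel. It may reach disk
     *    late, but it cannot reach disk before the log record did. */
    if (pwrite(data_fd, page, page_len, page_off) != (ssize_t) page_len)
        return -1;
    return 0;
}

/* Exercise wal_then_data() on two scratch files. Returns 0 if the data
 * file ends up holding the new page contents, -1 otherwise. */
int wal_then_data_demo(void)
{
    char wal_path[] = "/tmp/wal_demo_XXXXXX";
    char data_path[] = "/tmp/data_demo_XXXXXX";
    int wal_fd = mkstemp(wal_path);
    int data_fd = mkstemp(data_path);

    int rc = -1;
    if (wal_fd >= 0 && data_fd >= 0) {
        const char rec[] = "set page 0 to 'new'";
        const char page[] = "new";
        char back[sizeof page] = {0};

        if (wal_then_data(wal_fd, rec, sizeof rec,
                          data_fd, page, sizeof page, 0) == 0
            && pread(data_fd, back, sizeof back, 0) == (ssize_t) sizeof back
            && memcmp(back, page, sizeof page) == 0)
            rc = 0;
    }

    if (wal_fd >= 0) { close(wal_fd); unlink(wal_path); }
    if (data_fd >= 0) { close(data_fd); unlink(data_path); }
    return rc;
}
```

With a shared file mapping there is no step 3 to hold back: the store into the mapped page is the change, and the kernel may write that page out before step 2 has completed.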
Re: [HACKERS] Buffer Management
Tom Lane wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
> > This will also work well when we have non-SysV semaphore support, like Posix semaphores, so we would be able to run with no SysV stuff.
>
> You do realize that we can use Posix semaphores today? The Darwin (OS X) port uses 'em now. That's one reason I am more interested in mmap as

No, I didn't realize we had gotten that far.

--
Bruce Momjian
Re: [HACKERS] Buffer Management
Tom Lane wrote:
> Curt Sampson [EMAIL PROTECTED] writes:
> > On Tue, 25 Jun 2002, Tom Lane wrote:
> > > The other discussion seemed to be considering how to mmap individual data files right into backends' address space. I do not believe this can possibly work, because of loss of control over visibility of data changes to other backends, timing of write-backs, etc.
> >
> > I don't understand why there would be any loss of visibility of changes. If two backends mmap the same block of a file, and it's shared, that's the same block of physical memory that they're accessing.
>
> Is it? You have a mighty narrow conception of the range of implementations that's possible for mmap.
>
> But the main problem is that mmap doesn't let us control when changes to the memory buffer will get reflected back to disk --- AFAICT, the OS is free to do the write-back at any instant after you dirty the page, and that completely breaks the WAL algorithm. (WAL = write AHEAD log; the log entry describing a change must hit disk before the data page change itself does.)

Can we mmap WAL without problems? Not sure if there is any gain to it because we just write it and rarely read from it.

--
Bruce Momjian
Re: [HACKERS] Buffer Management
Bruce Momjian [EMAIL PROTECTED] writes:
> Can we mmap WAL without problems? Not sure if there is any gain to it because we just write it and rarely read from it.

Perhaps, but I don't see any point to it.

regards, tom lane
Re: [HACKERS] Buffer Management
Tom Lane wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
> > Can we mmap WAL without problems? Not sure if there is any gain to it because we just write it and rarely read from it.
>
> Perhaps, but I don't see any point to it.

Agreed. I have been poking around google looking for an article I read months ago saying that mmap of files is slightly faster in low memory usage situations, but much slower in high memory usage situations because the kernel doesn't know as much about the file access in mmap as it does with stdio. I will find it. :-)

--
Bruce Momjian