Re: [HACKERS] Buffer Management

2002-06-26 Thread Tom Lane

Curt Sampson [EMAIL PROTECTED] writes:
> Note that your proposal of using mmap to replace sysv shared memory
> relies on the behaviour I've described too.

True, but I was not envisioning mapping an actual file --- at least
on HPUX, the only way to generate an arbitrary-sized shared memory
region is to use MAP_ANONYMOUS and not have the mmap'd area connected
to any file at all.  It's not farfetched to think that this aspect
of mmap might work differently from mapping pieces of actual files.

In practice of course we'd have to restrict use of any such
implementation to platforms where mmap behaves reasonably ... according
to our definition of reasonably.

regards, tom lane



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster





Re: [HACKERS] Buffer Management

2002-06-26 Thread Bruce Momjian

Tom Lane wrote:
> Curt Sampson [EMAIL PROTECTED] writes:
> > Note that your proposal of using mmap to replace sysv shared memory
> > relies on the behaviour I've described too.
>
> True, but I was not envisioning mapping an actual file --- at least
> on HPUX, the only way to generate an arbitrary-sized shared memory
> region is to use MAP_ANONYMOUS and not have the mmap'd area connected
> to any file at all.  It's not farfetched to think that this aspect
> of mmap might work differently from mapping pieces of actual files.
>
> In practice of course we'd have to restrict use of any such
> implementation to platforms where mmap behaves reasonably ... according
> to our definition of reasonably.

Yes, I am told mapping /dev/zero is the same as the anon map.
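
For comparison, a minimal sketch of the /dev/zero variant, assuming a Unix-like system where /dev/zero can be opened read-write and mapped MAP_SHARED. The names devzero_shared_region and devzero_region_is_shared are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: obtain zero-filled shared memory by mapping
 * /dev/zero, for systems that lack an anonymous-mapping flag.
 */
static void *
devzero_shared_region(size_t size)
{
    int     fd;
    void   *p;

    assert(size > 0);
    fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return NULL;
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    return (p == MAP_FAILED) ? NULL : p;
}

/*
 * Hypothetical check: returns 1 if a store made by a forked child is
 * visible to the parent, 0 if it is not, and -1 on error.
 */
static int
devzero_region_is_shared(void)
{
    int    *cell = devzero_shared_region(sizeof(int));
    pid_t   pid;
    int     result;

    if (cell == NULL)
        return -1;
    if (*cell != 0)                 /* fresh pages should read as zero */
    {
        munmap(cell, sizeof(int));
        return -1;
    }

    pid = fork();
    if (pid < 0)
    {
        munmap(cell, sizeof(int));
        return -1;
    }
    if (pid == 0)
    {
        *cell = 1;                  /* child marks the shared cell */
        _exit(0);
    }

    result = (waitpid(pid, NULL, 0) < 0) ? -1 : *cell;
    munmap(cell, sizeof(int));
    return result;
}
```

The only difference from an anonymous mapping is the descriptor passed to mmap; the region is used the same way afterwards.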

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026








Re: [HACKERS] Buffer Management

2002-06-26 Thread Curt Sampson

On Wed, 26 Jun 2002, Tom Lane wrote:

> Curt Sampson [EMAIL PROTECTED] writes:
> > Note that your proposal of using mmap to replace sysv shared memory
> > relies on the behaviour I've described too.

> True, but I was not envisioning mapping an actual file --- at least
> on HPUX, the only way to generate an arbitrary-sized shared memory
> region is to use MAP_ANONYMOUS and not have the mmap'd area connected
> to any file at all.  It's not farfetched to think that this aspect
> of mmap might work differently from mapping pieces of actual files.

I find it somewhat farfetched, for a couple of reasons:

1. Memory mapped with the MAP_SHARED flag is shared memory,
anonymous or not. POSIX is pretty explicit about how this works,
and the standard for mmap that predates POSIX is the same.
Anonymous memory does not behave differently.

You could just as well say that some systems might exist such
that one process can write() a block to a file, and then another
might read() it afterwards but not see the changes. Postgres
should not try to deal with hypothetical systems that are so
completely broken.

2. Mmap is implemented as part of a unified buffer cache system
on all of today's operating systems that I know of. The memory
is backed by swap space when anonymous, and by a specified file
when not anonymous; but the way these two are handled is
*exactly* the same internally.

Even on older systems without unified buffer cache, the behaviour
is the same between anonymous and file-backed mmap'd memory.
And there would be no point in making it otherwise. Mmap is
designed to let you share memory; why make a broken implementation
under certain circumstances?
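
The file-backed case can be sketched in the same spirit. This is a minimal illustration, assuming a POSIX system with a writable /tmp; file_mappings_share_pages is a hypothetical name, and two independent mappings inside one process stand in for two backends to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical check: map the same file twice with MAP_SHARED, store
 * through one mapping and read through the other.  Returns 1 if the
 * store is visible, 0 if not, -1 on error.
 */
static int
file_mappings_share_pages(void)
{
    char            path[] = "/tmp/mmap_demo_XXXXXX";
    const size_t    len = 4096;
    int             fd = mkstemp(path);
    char           *a;
    char           *b;
    int             same;

    if (fd < 0)
        return -1;
    unlink(path);                   /* file disappears once unmapped */
    if (ftruncate(fd, (off_t) len) != 0)
    {
        close(fd);
        return -1;
    }

    a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (a == MAP_FAILED || b == MAP_FAILED)
    {
        if (a != MAP_FAILED)
            munmap(a, len);
        if (b != MAP_FAILED)
            munmap(b, len);
        return -1;
    }

    assert(a != b);                 /* two distinct virtual address ranges */
    strcpy(a, "tuple v2");          /* store through the first mapping */
    same = (strcmp(b, "tuple v2") == 0);    /* read through the second */

    munmap(a, len);
    munmap(b, len);
    return same;
}
```

No msync or other explicit propagation step appears between the store and the read; the sketch relies only on MAP_SHARED semantics.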

> In practice of course we'd have to restrict use of any such
> implementation to platforms where mmap behaves reasonably ... according
> to our definition of reasonably.

Of course. As we do already with regular I/O.

cjs
-- 
Curt Sampson  [EMAIL PROTECTED]   +81 90 7737 2974   http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light.  --XTC









Re: [HACKERS] Buffer Management

2002-06-25 Thread Curt Sampson


So, while we're at it, what's the current state of people's thinking
on using mmap rather than shared memory for data file buffers? I
see some pretty powerful advantages to this approach, and I'm not
(yet :-)) convinced that the disadvantages are as bad as people think.
I think I can address most of the concerns in doc/TODO.detail/mmap.

Is this worth pursuing a bit? (I.e., should I spend an hour or two
writing up the advantages and thoughts on how to get around the
problems?) Anybody got objections that aren't in doc/TODO.detail/mmap?

cjs









Re: [HACKERS] Buffer Management

2002-06-25 Thread Tom Lane

Curt Sampson [EMAIL PROTECTED] writes:
> So, while we're at it, what's the current state of people's thinking
> on using mmap rather than shared memory for data file buffers?

There seem to be a couple of different threads in doc/TODO.detail/mmap.

One envisions mmap as a one-for-one replacement for our current use of
SysV shared memory, the main selling point being to get out from under
kernels that don't have SysV support or have it configured too small.
This might be worth doing, and I think it'd be relatively easy to do
now that the shared memory support is isolated in one file and there's
provisions for selecting a shmem implementation at configure time.
The only thing you'd really have to think about is how to replace the
current behavior that uses shmem attach counts to discover whether any
old backends are left over from a previous crashed postmaster.  I dunno
if mmap offers any comparable facility.
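
For readers unfamiliar with attach counts, here is a minimal sketch of how a SysV segment reports them, assuming a system with SysV shared memory enabled. It uses the standard shmctl(IPC_STAT) call and the shm_nattch field; the function names shm_attach_count and attach_count_demo are hypothetical, and this is not a copy of what the postmaster actually does.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: number of current attachments, or -1 on error. */
static long
shm_attach_count(int shmid)
{
    struct shmid_ds ds;

    if (shmctl(shmid, IPC_STAT, &ds) != 0)
        return -1;
    return (long) ds.shm_nattch;
}

/*
 * Hypothetical demo: create a private segment, attach to it, detach
 * again, and report the attach count seen at each stage.
 * Returns 0 on success, -1 on error.
 */
static int
attach_count_demo(long *while_attached, long *after_detach)
{
    int     shmid;
    void   *addr;

    assert(while_attached != NULL && after_detach != NULL);

    shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0)
        return -1;

    addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1)
    {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }

    *while_attached = shm_attach_count(shmid);  /* this process is attached */
    shmdt(addr);
    *after_detach = shm_attach_count(shmid);    /* nobody is attached now */

    shmctl(shmid, IPC_RMID, NULL);              /* remove the segment */
    return 0;
}
```

A nonzero count on a segment left behind by a crashed postmaster would indicate that some process is still attached, which is the facility a plain mmap region has no obvious counterpart for.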

The other discussion seemed to be considering how to mmap individual
data files right into backends' address space.  I do not believe this
can possibly work, because of loss of control over visibility of data
changes to other backends, timing of write-backs, etc.

But as long as you stay away from interpretation #2 and go with
mmap-as-a-shmget-substitute, it might be worthwhile.

(Hey Marc, can one do mmap in a BSD jail?)

regards, tom lane








Re: [HACKERS] Buffer Management

2002-06-25 Thread Larry Rosenman

On Tue, 2002-06-25 at 09:09, Tom Lane wrote:
> Curt Sampson [EMAIL PROTECTED] writes:
> > So, while we're at it, what's the current state of people's thinking
> > on using mmap rather than shared memory for data file buffers?
>
[snip]
>
> (Hey Marc, can one do mmap in a BSD jail?)
I believe the answer is YES.  

I can send you the man pages if you want. 


 
-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749









Re: [HACKERS] Buffer Management

2002-06-25 Thread Curt Sampson

On Tue, 25 Jun 2002, Tom Lane wrote:

> The only thing you'd really have to think about is how to replace the
> current behavior that uses shmem attach counts to discover whether any
> old backends are left over from a previous crashed postmaster.  I dunno
> if mmap offers any comparable facility.

Sure. Just mmap a file, and it will be persistent.

> The other discussion seemed to be considering how to mmap individual
> data files right into backends' address space.  I do not believe this
> can possibly work, because of loss of control over visibility of data
> changes to other backends, timing of write-backs, etc.

I don't understand why there would be any loss of visibility of changes.
If two backends mmap the same block of a file, and it's shared, that's
the same block of physical memory that they're accessing. Changes don't
even need to propagate, because the memory is truly shared. You'd keep
your locks in the page itself as well, of course.

Can you describe the problem in more detail?

> But as long as you stay away from interpretation #2 and go with
> mmap-as-a-shmget-substitute, it might be worthwhile.

It's #2 that I was really looking at. :-)

cjs









Re: [HACKERS] Buffer Management

2002-06-25 Thread Bruce Momjian

Tom Lane wrote:
> Curt Sampson [EMAIL PROTECTED] writes:
> > So, while we're at it, what's the current state of people's thinking
> > on using mmap rather than shared memory for data file buffers?
>
> There seem to be a couple of different threads in doc/TODO.detail/mmap.
>
> One envisions mmap as a one-for-one replacement for our current use of
> SysV shared memory, the main selling point being to get out from under
> kernels that don't have SysV support or have it configured too small.
> This might be worth doing, and I think it'd be relatively easy to do
> now that the shared memory support is isolated in one file and there's
> provisions for selecting a shmem implementation at configure time.
> The only thing you'd really have to think about is how to replace the
> current behavior that uses shmem attach counts to discover whether any
> old backends are left over from a previous crashed postmaster.  I dunno
> if mmap offers any comparable facility.
>
> The other discussion seemed to be considering how to mmap individual
> data files right into backends' address space.  I do not believe this
> can possibly work, because of loss of control over visibility of data
> changes to other backends, timing of write-backs, etc.

Agreed.  Also, there was an interesting thread saying that mmap'ing /dev/zero
is the same as anonmap for OSes that don't have anonmap.  That should cover
most of them.  The only downside I can see is that SysV shared memory is
locked into RAM on some/most OSes while mmap anon probably isn't.
Locking in RAM is good in most cases, bad in others.

This will also work well when we have non-SysV semaphore support, like
Posix semaphores, so we would be able to run with no SysV stuff.









Re: [HACKERS] Buffer Management

2002-06-25 Thread Tom Lane

Curt Sampson [EMAIL PROTECTED] writes:
> On Tue, 25 Jun 2002, Tom Lane wrote:
> > The other discussion seemed to be considering how to mmap individual
> > data files right into backends' address space.  I do not believe this
> > can possibly work, because of loss of control over visibility of data
> > changes to other backends, timing of write-backs, etc.

> I don't understand why there would be any loss of visibility of changes.
> If two backends mmap the same block of a file, and it's shared, that's
> the same block of physical memory that they're accessing.

Is it?  You have a mighty narrow conception of the range of
implementations that's possible for mmap.

But the main problem is that mmap doesn't let us control when changes to
the memory buffer will get reflected back to disk --- AFAICT, the OS is
free to do the write-back at any instant after you dirty the page, and
that completely breaks the WAL algorithm.  (WAL = write AHEAD log;
the log entry describing a change must hit disk before the data page
change itself does.)

regards, tom lane








Re: [HACKERS] Buffer Management

2002-06-25 Thread Mario Weilguni

Isn't that what msync() is for? Or is this not portable?

-----Original Message-----
From: Tom Lane [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 25 June 2002 16:30
To: Curt Sampson
Cc: J. R. Nield; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] Buffer Management


Curt Sampson [EMAIL PROTECTED] writes:
> On Tue, 25 Jun 2002, Tom Lane wrote:
> > The other discussion seemed to be considering how to mmap individual
> > data files right into backends' address space.  I do not believe this
> > can possibly work, because of loss of control over visibility of data
> > changes to other backends, timing of write-backs, etc.

> I don't understand why there would be any loss of visibility of changes.
> If two backends mmap the same block of a file, and it's shared, that's
> the same block of physical memory that they're accessing.

Is it?  You have a mighty narrow conception of the range of
implementations that's possible for mmap.

But the main problem is that mmap doesn't let us control when changes to
the memory buffer will get reflected back to disk --- AFAICT, the OS is
free to do the write-back at any instant after you dirty the page, and
that completely breaks the WAL algorithm.  (WAL = write AHEAD log;
the log entry describing a change must hit disk before the data page
change itself does.)

regards, tom lane








Re: [HACKERS] Buffer Management

2002-06-25 Thread Tom Lane

Mario Weilguni [EMAIL PROTECTED] writes:
> Isn't that what msync() is for? Or is this not portable?

msync can force not-yet-written changes down to disk.  It does not
prevent the OS from choosing to write changes *before* you invoke msync.
For example, the HPUX man page for msync says:

 Normal system activity can cause pages to be written to disk.
 Therefore, there are no guarantees that msync() is the only control
 over when pages are or are not written to disk.

Our problem is that we want to enforce the write ordering WAL before
data file.  To do that, we write and fsync (or DSYNC, or something)
a WAL entry before we issue the write() against the data file.  We
don't really care if the kernel delays the data file write beyond that
point, but we can be certain that the data file write did not occur
too early.
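
The ordering described above can be sketched as follows. This is a minimal illustration, assuming POSIX pwrite/fsync and a writable /tmp; the names write_fully, logged_page_write and wal_order_demo are hypothetical and the record format is made up, so it should not be read as PostgreSQL's actual WAL code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all of buf at offset off; 0 on success. */
static int
write_fully(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0)
    {
        ssize_t n = pwrite(fd, p, len, off);

        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t) n;
        off += n;
    }
    return 0;
}

/*
 * Hypothetical write path: the log record is written and fsync'd before
 * the data file is touched at all.  The kernel may delay the data write
 * afterwards, but it cannot have happened before the log was durable.
 */
static int
logged_page_write(int wal_fd, off_t wal_off, const void *rec, size_t rec_len,
                  int data_fd, off_t page_off, const void *page,
                  size_t page_len)
{
    assert(rec_len > 0 && page_len > 0);

    if (write_fully(wal_fd, rec, rec_len, wal_off) != 0)
        return -1;
    if (fsync(wal_fd) != 0)         /* log record reaches disk first */
        return -1;
    return write_fully(data_fd, page, page_len, page_off);
}

/* Hypothetical demo on two temporary files; returns 0 on success. */
static int
wal_order_demo(void)
{
    char        wal_path[] = "/tmp/wal_demo_XXXXXX";
    char        data_path[] = "/tmp/data_demo_XXXXXX";
    const char  rec[] = "set page 0 to B";
    const char  page[] = "B";
    char        back[sizeof rec];
    int         wal_fd = mkstemp(wal_path);
    int         data_fd = mkstemp(data_path);
    int         ok = -1;

    if (wal_fd >= 0 && data_fd >= 0 &&
        logged_page_write(wal_fd, 0, rec, sizeof rec,
                          data_fd, 0, page, sizeof page) == 0 &&
        pread(wal_fd, back, sizeof rec, 0) == (ssize_t) sizeof rec &&
        memcmp(back, rec, sizeof rec) == 0)
        ok = 0;

    if (wal_fd >= 0)
    {
        close(wal_fd);
        unlink(wal_path);
    }
    if (data_fd >= 0)
    {
        close(data_fd);
        unlink(data_path);
    }
    return ok;
}
```

The guarantee comes from program order: the write() against the data file is simply not issued until fsync() on the log has returned. With a MAP_SHARED data page there is no such call to withhold, because the store into memory is itself what makes the page eligible for write-back.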

msync is designed to ensure exactly the opposite constraint: it can
guarantee that no changes remain unwritten after time T, but it can't
guarantee that changes aren't written before time T.
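
To make the direction of the msync guarantee concrete, here is a minimal sketch, assuming a POSIX system with a writable /tmp. The names flush_mapping and msync_demo are hypothetical. It shows the only thing msync can promise: once it returns successfully, the dirty range has been written.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: synchronously flush a mapped range.  This sets an
 * upper bound on when the data reaches disk; it says nothing about
 * whether the kernel already wrote some of it earlier.
 */
static int
flush_mapping(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    return msync(addr, len, MS_SYNC);
}

/* Hypothetical demo: dirty a mapped page, flush it, read it via the fd. */
static int
msync_demo(void)
{
    char            path[] = "/tmp/msync_demo_XXXXXX";
    const size_t    len = 4096;
    const char      msg[] = "dirty page";
    char            back[sizeof msg];
    int             fd = mkstemp(path);
    char           *map;
    int             ok = -1;

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t) len) != 0)
    {
        close(fd);
        return -1;
    }

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
    {
        close(fd);
        return -1;
    }

    memcpy(map, msg, sizeof msg);   /* the page is dirty from here on */

    /* The kernel was free to write the page at any point before this. */
    if (flush_mapping(map, len) == 0 &&
        pread(fd, back, sizeof msg, 0) == (ssize_t) sizeof msg &&
        memcmp(back, msg, sizeof msg) == 0)
        ok = 0;

    munmap(map, len);
    close(fd);
    return ok;
}
```

Nothing in this interface lets the caller say "do not write this page until I tell you", which is the constraint WAL needs.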

regards, tom lane








Re: [HACKERS] Buffer Management

2002-06-25 Thread Bruce Momjian

Tom Lane wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
> > This will also work well when we have non-SysV semaphore support, like
> > Posix semaphores, so we would be able to run with no SysV stuff.
>
> You do realize that we can use Posix semaphores today?  The Darwin (OS X)
> port uses 'em now.  That's one reason I am more interested in mmap as

No, I didn't realize we had gotten that far.









Re: [HACKERS] Buffer Management

2002-06-25 Thread Bruce Momjian

Tom Lane wrote:
> Curt Sampson [EMAIL PROTECTED] writes:
> > On Tue, 25 Jun 2002, Tom Lane wrote:
> > > The other discussion seemed to be considering how to mmap individual
> > > data files right into backends' address space.  I do not believe this
> > > can possibly work, because of loss of control over visibility of data
> > > changes to other backends, timing of write-backs, etc.
>
> > I don't understand why there would be any loss of visibility of changes.
> > If two backends mmap the same block of a file, and it's shared, that's
> > the same block of physical memory that they're accessing.
>
> Is it?  You have a mighty narrow conception of the range of
> implementations that's possible for mmap.
>
> But the main problem is that mmap doesn't let us control when changes to
> the memory buffer will get reflected back to disk --- AFAICT, the OS is
> free to do the write-back at any instant after you dirty the page, and
> that completely breaks the WAL algorithm.  (WAL = write AHEAD log;
> the log entry describing a change must hit disk before the data page
> change itself does.)

Can we mmap WAL without problems?  Not sure if there is any gain to it
because we just write it and rarely read from it.









Re: [HACKERS] Buffer Management

2002-06-25 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
> Can we mmap WAL without problems?  Not sure if there is any gain to it
> because we just write it and rarely read from it.

Perhaps, but I don't see any point to it.

regards, tom lane








Re: [HACKERS] Buffer Management

2002-06-25 Thread Bruce Momjian

Tom Lane wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
> > Can we mmap WAL without problems?  Not sure if there is any gain to it
> > because we just write it and rarely read from it.
>
> Perhaps, but I don't see any point to it.

Agreed.  I have been poking around Google looking for an article I read
months ago saying that mmap of files is slightly faster in low memory
usage situations, but much slower in high memory usage situations
because the kernel doesn't know as much about the file access in mmap as
it does with stdio.  I will find it.  :-)



