Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Xu Yifeng

Hello Alfred,

Friday, March 16, 2001, 3:21:09 PM, you wrote:

AP * Xu Yifeng [EMAIL PROTECTED] [010315 22:25] wrote:

 Could anyone consider forking a syncer process to sync data to disk?
 Build a shared sync queue; when a daemon process wants to do a sync after
 write() is called, it just puts a sync request on the queue. This can release
 the process from blocking on the write as soon as possible. Multiple sync
 requests for one file can be merged when the request is inserted into
 the queue.

AP I suggested this about a year ago. :)

AP The problem is that you need that process to potentially open and close
AP many files over and over.

AP I still think it's somewhat of a good idea.

I am not a DBMS guru.
Couldn't the syncer process cache opened files? Is there any problem I
didn't consider?

-- 
Best regards,
Xu Yifeng



---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl



Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Alfred Perlstein

* Xu Yifeng [EMAIL PROTECTED] [010316 01:15] wrote:
 Hello Alfred,
 
 Friday, March 16, 2001, 3:21:09 PM, you wrote:
 
 AP * Xu Yifeng [EMAIL PROTECTED] [010315 22:25] wrote:
 
  Could anyone consider forking a syncer process to sync data to disk?
  Build a shared sync queue; when a daemon process wants to do a sync after
  write() is called, it just puts a sync request on the queue. This can release
  the process from blocking on the write as soon as possible. Multiple sync
  requests for one file can be merged when the request is inserted into
  the queue.
 
 AP I suggested this about a year ago. :)
 
 AP The problem is that you need that process to potentially open and close
 AP many files over and over.
 
 AP I still think it's somewhat of a good idea.
 
 I am not a DBMS guru.

Hah, same here. :)

 Couldn't the syncer process cache opened files? Is there any problem I
 didn't consider?

1) IPC latency, the amount of time it takes to call fsync will
   increase by at least two context switches.

2) a working set (number of files needing to be fsync'd) that
   is larger than the number of files you wish to keep open.

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]





AW: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Zeugswetter Andreas SB


 Okay ... we can fall back to O_FSYNC if we don't see either of the
 others.  No problem.  Any other weird cases out there?  I think Andreas
 might've muttered something about AIX but I'm not sure now.

You can safely use O_DSYNC on AIX; the only AIX peculiarity is
that it makes no speed difference compared to O_SYNC. This is IMHO
because JFS only needs one synchronous write to the JFS journal for metadata
in either case (so that nobody misunderstands: both perform excellently).

Andreas




[HACKERS] Re: AW: Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Thomas Lockhart

  Okay ... we can fall back to O_FSYNC if we don't see either of the
  others.  No problem.  Any other weird cases out there?  I think Andreas
  might've muttered something about AIX but I'm not sure now.
 You can safely use O_DSYNC on AIX; the only AIX peculiarity is
 that it makes no speed difference compared to O_SYNC. This is IMHO
 because JFS only needs one synchronous write to the JFS journal for metadata
 in either case (so that nobody misunderstands: both perform excellently).

Hmm. Does everyone run jfs on AIX, or are there other file systems
available? The same issue should be raised for Linux (at least): have we
tried test cases with both journaling and non-journaling file systems?
Perhaps the flag choice would be markedly different for the different
options?

 - Thomas




[HACKERS] [Stephen C. Tweedie sct@redhat.com] Re: O_DSYNC flag for open

2001-03-16 Thread Doug McNaught

Just a quick delurk to pass along this tidbit from linux-kernel on
Linux *sync() behavior, since we've been talking about it a lot...

-Doug




Hi,

On Wed, Mar 14, 2001 at 10:26:42PM -0500, Tom Vier wrote:
 fdatasync() is the same as fsync(), in linux.

No, in 2.4 fdatasync does the right thing and skips the inode flush if
only the timestamps have changed.

 until fdatasync() is
 implemented (i.e., syncs the data only)

fdatasync is required to sync more than just the data: it has to sync
the inode too if any fields other than the timestamps have changed.
So, for appending to files or writing new files from scratch, fsync ==
fdatasync (because each write also changes the inode size).  Only for
updating existing files in place does fdatasync behave differently.

 #ifndef O_DSYNC
 # define O_DSYNC O_SYNC
 #endif

2.4's O_SYNC actually does a fdatasync internally.  This is also the
default behaviour of HPUX, which requires you to set a sysctl variable
if you want O_SYNC to flush timestamp changes to disk.

Cheers,
 Stephen







Re: Re[2]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Bruce Momjian

  Could anyone consider forking a syncer process to sync data to disk?
  Build a shared sync queue; when a daemon process wants to do a sync after
  write() is called, it just puts a sync request on the queue. This can release
  the process from blocking on the write as soon as possible. Multiple sync
  requests for one file can be merged when the request is inserted into
  the queue.
 
 I suggested this about a year ago. :)
 
 The problem is that you need that process to potentially open and close
 many files over and over.
 
 I still think it's somewhat of a good idea.

I like the idea too, but people want the transaction to return COMMIT
only after the data has been fsync'ed, so I don't see a big win.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026




Re: [HACKERS] Re: AW: Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Larry Rosenman


My UnixWare box runs Veritas' VXFS, and has Online-Data Manager 
installed. Documentation is available at http://www.lerctr.org:457/ 

There are MULTIPLE sync modes, and there are also hints an app can give 
to the FS. 

More info is available if you want. 

LER

-- 
Larry Rosenman
 http://www.lerctr.org/~ler/
Phone: +1 972 414 9812
 E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749 US
 Original Message 

On 3/16/01, 9:11:51 AM, Thomas Lockhart [EMAIL PROTECTED] wrote 
regarding [HACKERS] Re: AW: Allowing WAL fsync to be done via O_SYNC:


   Okay ... we can fall back to O_FSYNC if we don't see either of the
   others.  No problem.  Any other weird cases out there?  I think Andreas
   might've muttered something about AIX but I'm not sure now.
  You can safely use O_DSYNC on AIX; the only AIX peculiarity is
  that it makes no speed difference compared to O_SYNC. This is IMHO
  because JFS only needs one synchronous write to the JFS journal for metadata
  in either case (so that nobody misunderstands: both perform excellently).

 Hmm. Does everyone run jfs on AIX, or are there other file systems
 available? The same issue should be raised for Linux (at least): have we
 tried test cases with both journaling and non-journaling file systems?
 Perhaps the flag choice would be markedly different for the different
 options?

  - Thomas





Re: Re[2]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Alfred Perlstein

* Bruce Momjian [EMAIL PROTECTED] [010316 07:11] wrote:
   Could anyone consider forking a syncer process to sync data to disk?
   Build a shared sync queue; when a daemon process wants to do a sync after
   write() is called, it just puts a sync request on the queue. This can release
   the process from blocking on the write as soon as possible. Multiple sync
   requests for one file can be merged when the request is inserted into
   the queue.
  
  I suggested this about a year ago. :)
  
  The problem is that you need that process to potentially open and close
  many files over and over.
  
  I still think it's somewhat of a good idea.
 
 I like the idea too, but people want the transaction to return COMMIT
 only after the data has been fsync'ed, so I don't see a big win.

This isn't simply handing off the sync to the other process; it requires
an ack from the syncer before returning 'COMMIT'.

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]





Re: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Ken Hirsch

From: "Bruce Momjian" [EMAIL PROTECTED]
   Could anyone consider forking a syncer process to sync data to disk?
   Build a shared sync queue; when a daemon process wants to do a sync after
   write() is called, it just puts a sync request on the queue. This can
   release the process from blocking on the write as soon as possible.
   Multiple sync requests for one file can be merged when the request is
   inserted into the queue.
 
  I suggested this about a year ago. :)
 
  The problem is that you need that process to potentially open and close
  many files over and over.
 
  I still think it's somewhat of a good idea.

 I like the idea too, but people want the transaction to return COMMIT
 only after the data has been fsync'ed, so I don't see a big win.

For a log file on a busy system, this could improve throughput a lot--batch
commit.  You end up with fewer than one fsync() per transaction.






[HACKERS] Re: AW: Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Tom Lane

Thomas Lockhart [EMAIL PROTECTED] writes:
 tried test cases with both journaling and non-journaling file systems?
 Perhaps the flag choice would be markedly different for the different
 options?

Good point.  Another reason we don't have enough data to nail this down
yet.  Anyway, the code is in there and people can run test cases if they
please...

regards, tom lane




Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Tom Lane

Alfred Perlstein [EMAIL PROTECTED] writes:
  Couldn't the syncer process cache opened files? Is there any problem I
  didn't consider?

 1) IPC latency, the amount of time it takes to call fsync will
increase by at least two context switches.

  2) a working set (number of files needing to be fsync'd) that
     is larger than the number of files you wish to keep open.

These days we're really only interested in fsync'ing the current WAL
log file, so working set doesn't seem like a problem anymore.  However
context-switch latency is likely to be a big problem.  One thing we'd
definitely need before considering this is to replace the existing
spinlock mechanism with something more efficient.

Vadim has designed the WAL stuff in such a way that a separate
writer/syncer process would be easy to add; in fact it's almost that way
already, in that any backend can write or sync data that's been added
to the queue by any other backend.  The question is whether it'd
actually buy anything to have another process.  Good stuff to experiment
with for 7.2.

regards, tom lane




Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Alfred Perlstein

* Tom Lane [EMAIL PROTECTED] [010316 08:16] wrote:
 Alfred Perlstein [EMAIL PROTECTED] writes:
  Couldn't the syncer process cache opened files? Is there any problem I
  didn't consider?
 
  1) IPC latency, the amount of time it takes to call fsync will
 increase by at least two context switches.
 
   2) a working set (number of files needing to be fsync'd) that
      is larger than the number of files you wish to keep open.
 
 These days we're really only interested in fsync'ing the current WAL
 log file, so working set doesn't seem like a problem anymore.  However
 context-switch latency is likely to be a big problem.  One thing we'd
 definitely need before considering this is to replace the existing
 spinlock mechanism with something more efficient.

What sort of problems are you seeing with the spinlock code?

 Vadim has designed the WAL stuff in such a way that a separate
 writer/syncer process would be easy to add; in fact it's almost that way
 already, in that any backend can write or sync data that's been added
 to the queue by any other backend.  The question is whether it'd
 actually buy anything to have another process.  Good stuff to experiment
 with for 7.2.

The delayed/coalesced fsync looked interesting.

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]





Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Tom Lane

Alfred Perlstein [EMAIL PROTECTED] writes:
 definitely need before considering this is to replace the existing
 spinlock mechanism with something more efficient.

 What sort of problems are you seeing with the spinlock code?

It's great as long as you never block, but it sucks for making things
wait, because the wait interval will be some multiple of 10 msec rather
than just the time till the lock comes free.

We've speculated about using Posix semaphores instead, on platforms
where those are available.  I think Bruce was concerned about the
possible overhead of pulling in a whole thread-support library just to
get semaphores, however.

regards, tom lane




RE: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Mikheev, Vadim

  I was wondering if the multiple writes performed to the 
  XLOG could be grouped into one write().
 
 That would require fairly major restructuring of xlog.c, which I don't

Restructuring? Why? It's only XLogWrite() that makes writes.

 want to undertake at this point in the cycle (we're trying to push out
 a release candidate, remember?).  I'm not convinced it would be a huge
 win anyway.  It would be a win if your average transaction writes
 multiple blocks' worth of XLOG ... but if your average transaction
 writes less than a block then it won't help.

But in a multi-user environment, multiple transactions may write > 1 block
before commit.

 I think it probably is a good idea to restructure xlog.c so 
 that it can write more than one page at a time --- but it's
 not such a great idea that I want to hold up the release any
 more for it.

Agreed.

Vadim




Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread The Hermit Hacker

On Fri, 16 Mar 2001, Tom Lane wrote:

 Alfred Perlstein [EMAIL PROTECTED] writes:
  definitely need before considering this is to replace the existing
  spinlock mechanism with something more efficient.

  What sort of problems are you seeing with the spinlock code?

 It's great as long as you never block, but it sucks for making things
 wait, because the wait interval will be some multiple of 10 msec rather
 than just the time till the lock comes free.

 We've speculated about using Posix semaphores instead, on platforms
 where those are available.  I think Bruce was concerned about the
 possible overhead of pulling in a whole thread-support library just to
 get semaphores, however.

But, with shared libraries, are you really pulling in a "whole
thread-support library"?  My understanding of shared libraries (although it
may be totally off) was that instead of pulling in a whole library, you
pulled in the bits that you needed, pretty much as you needed them ...







Re: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Tom Lane

"Mikheev, Vadim" [EMAIL PROTECTED] writes:
 I was wondering if the multiple writes performed to the 
 XLOG could be grouped into one write().
 
 That would require fairly major restructuring of xlog.c, which I don't

 Restructuring? Why? It's only XLogWrite() that makes writes.

I was thinking of changing the data structure.  I guess you could keep
the data structure the same and make XLogWrite more complicated, though.

 I think it probably is a good idea to restructure xlog.c so 
 that it can write more than one page at a time --- but it's
 not such a great idea that I want to hold up the release any
 more for it.

 Agreed.

Yes, to-do item for 7.2.

regards, tom lane




RE: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Mikheev, Vadim

 We've speculated about using Posix semaphores instead, on platforms

For spinlocks we should use pthread mutexes.

 where those are available.  I think Bruce was concerned about the

And mutexes are more portable than semaphores.

Vadim




Re: [HACKERS] [Stephen C. Tweedie sct@redhat.com] Re: O_DSYNC flag for open

2001-03-16 Thread Doug McNaught

Tom Lane [EMAIL PROTECTED] writes:

 Doug McNaught [EMAIL PROTECTED] forwards:
  2.4's O_SYNC actually does a fdatasync internally.  This is also the
  default behaviour of HPUX, which requires you to set a sysctl variable
  if you want O_SYNC to flush timestamp changes to disk.
 
 Well, that guy might know all about Linux, but he doesn't know anything
 about HPUX (at least not any version I've ever run).  O_SYNC is
 distinctly different from O_DSYNC around here.

Y'know, I figured that might be the case.  ;)  He's a well-respected
Linux filesystem hacker, so I trust him on the Linux stuff.  

So are we still thinking about preallocating log files as a
performance hack?  It does seem that using preallocated files along
with O_DSYNC will eliminate pretty much all metadata writes under
Linux in future...

[NOT suggesting we try to add anything to 7.1, I'm eagerly awaiting RC1]

-Doug




Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Tom Lane

Larry Rosenman [EMAIL PROTECTED] writes:
 But, with shared libraries, are you really pulling in a "whole
 thread-support library"?

 Yes, you are.  On UnixWare, you need to add -Kthread, which CHANGES a LOT 
 of primitives to go through threads wrappers and scheduling.

Right, it's not so much that we care about referencing another shlib,
it's that -lpthreads may cause you to get a whole new thread-aware
version of libc, with attendant overhead that we don't need or want.

regards, tom lane




Re: [HACKERS] [Stephen C. Tweedie sct@redhat.com] Re: O_DSYNC flag for open

2001-03-16 Thread Tom Lane

Doug McNaught [EMAIL PROTECTED] writes:
 So are we still thinking about preallocating log files as a
 performance hack?

We're not just thinking about it, we're doing it in current sources ...

regards, tom lane




Re: [HACKERS] [Stephen C. Tweedie sct@redhat.com] Re: O_DSYNC flagfor open

2001-03-16 Thread Bruce Momjian

 So are we still thinking about preallocating log files as a
 performance hack?  It does seem that using preallocated files along
 with O_DSYNC will eliminate pretty much all metadata writes under
 Linux in future...
 
 [NOT suggesting we try to add anything to 7.1, I'm eagerly awaiting RC1]

I am pretty sure that is done.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026




AW: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Zeugswetter Andreas SB


  definitely need before considering this is to replace the existing
  spinlock mechanism with something more efficient.
 
  What sort of problems are you seeing with the spinlock code?
 
 It's great as long as you never block, but it sucks for making things

I like optimistic approaches :-)

 wait, because the wait interval will be some multiple of 10 msec rather
 than just the time till the lock comes free.

On the AIX platform usleep(3) is able to really sleep for microseconds without
busying the CPU when called for more than approx. 100 us (the longer the interval,
the less busy the CPU gets).
Would this not be ideal for spinlocks, or is usleep not very common?
The Linux man page says it comes from 4.3BSD.

postgres@s0188000zeu:/usr/postgres time ustest   # with 100 us
real    0m10.95s
user    0m0.40s
sys     0m0.74s

postgres@s0188000zeu:/usr/postgres time ustest   # with 10 us
real    0m18.62s
user    0m1.37s
sys     0m5.73s

Andreas

PS: sorry, off for the weekend now :-) Current looks good on AIX.


[Attachment: ustest.c]





Re: [HACKERS] Performance monitor signal handler

2001-03-16 Thread Tom Lane

Jan Wieck [EMAIL PROTECTED] writes:
 Uh - not much time to spend if the statistics should at least
 be half accurate. And it would become worse on SMP systems.
 So that was a nifty idea, but I think it'd cause much more
 statistics loss than I assumed at first.

 Back to drawing board. Maybe a SYS-V message queue can serve?

That would be the same as a pipe: backends would block if the collector
stopped accepting data.  I do like the "auto discard" aspect of this
UDP-socket approach.

I think Philip had the right idea: each backend should send totals,
not deltas, in its messages.  Then, it doesn't matter (much) if the
collector loses some messages --- that just means that sometimes it
has a slightly out-of-date idea about how much work some backends have
done.  It should be easy to design the software so that that just makes
a small, transient error in the currently displayed statistics.

regards, tom lane




Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Doug McNaught

Tom Lane [EMAIL PROTECTED] writes:

 Alfred Perlstein [EMAIL PROTECTED] writes:
  definitely need before considering this is to replace the existing
  spinlock mechanism with something more efficient.
 
  What sort of problems are you seeing with the spinlock code?
 
 It's great as long as you never block, but it sucks for making things
 wait, because the wait interval will be some multiple of 10 msec rather
 than just the time till the lock comes free.

Plus, using select() for the timeout is putting you into the kernel
multiple times in a short period, and causing a reschedule every time,
which is a big lose.  This was discussed in the linux-kernel thread
that was referred to a few days ago.

 We've speculated about using Posix semaphores instead, on platforms
 where those are available.  I think Bruce was concerned about the
 possible overhead of pulling in a whole thread-support library just to
 get semaphores, however.

Are Posix semaphores faster by definition than SysV semaphores (which
are described as "slow" in the source comments)?  I can't see how
they'd be much faster unless locking/unlocking an uncontended
semaphore avoids a system call, in which case you might run into the
same problems with userland backoff...

Just looked, and on Linux pthreads and POSIX semaphores are both
already in the C library.  Unfortunately, the Linux C library doesn't
support the PROCESS_SHARED attribute for either pthreads mutexes or
POSIX semaphores.  Grumble.  What's the point then?

Just some ignorant ramblings, thanks for listening...

-Doug




AW: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Zeugswetter Andreas SB


 For a log file on a busy system, this could improve throughput a lot--batch
 commit.  You end up with fewer than one fsync() per transaction.

This is not the issue, since that is already implemented.
The current bunching method might have room for improvement, but
there are currently fewer fsync's than transactions when appropriate.

Andreas




Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Bruce Momjian

 Yes, you are.  On UnixWare, you need to add -Kthread, which CHANGES a LOT 
 of primitives to go through threads wrappers and scheduling.

This was my concern;  the change that happens on startup and lib calls
when thread support comes in through a library.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026




Re: [HACKERS] Performance monitor signal handler

2001-03-16 Thread Philip Warner

At 17:10 15/03/01 -0800, Alfred Perlstein wrote:
 
 Which is why the backends should not do anything other than maintain the
 raw data. If there is atomic data that can cause inconsistency, then a
 dropped UDP packet will do the same.

The UDP packet (a COPY) can contain a consistent snapshot of the data.
If you have dependencies, you fit a consistent snapshot into a single
packet.

If we were going to go the shared memory way, then yes, as soon as we start
collecting dependent data we would need locking, but I/Os, locking stats,
flushes, cache hits/misses are not really in this category.

But I prefer the UDP/Collector model anyway; it gives us greater
flexibility + the ability to keep stats past backend termination, and, as
you say, removes any possible locking requirements from the backends.




Philip Warner
Albatross Consulting Pty. Ltd. (A.B.N. 75 008 659 498)
Tel: (+61) 0500 83 82 81 | Fax: (+61) 0500 83 82 82
http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371




Re: [HACKERS] pgmonitor patch for query string

2001-03-16 Thread Bruce Momjian

  I don't understand the attraction of the UDP stuff.  If we have the
  stuff in shared memory, we can add a collector program that gathers info
  from shared memory and allows others to access it, right?
 
 There are a couple of problems with shared memory. First you
 have to decide a size. That'll limit what you can put into it,
 and if you want to put things per table (#scans, #block-fetches,
 #cache-hits, ...), you might run out of mem either way with
 complicated, multi-thousand-table schemas.

 And the above illustrates too that the data structs in the
 shmem wouldn't be just some simple arrays of counters. So we
 have to deal with locking for both readers and writers of
 the statistics.

[ Jan, previous email was not sent to list, my mistake.]

OK, I understand the problem with pre-defined size.  That is why I was
looking for a way to dump the information out to a flat file somehow.

I think no matter how we deal with this, we will need some way to turn
on/off such reporting.  We can write into shared memory with little
penalty, but network or filesystem output is not going to be near-zero
cost.

OK, how about a shared buffer area that gets written in a loop so a
separate collection program can grab the info if it wants it, and if
not, it just gets overwritten later.  It can even be per-backend:

loops   start                                end (loop to start)
-----   [--------------------------------------------------]
  5      stat stat stat stat stat stat
                                 |^^^
                                 current pointer
-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026




Re: [HACKERS] Performance monitor signal handler

2001-03-16 Thread Jan Wieck

Alfred Perlstein wrote:
 * Jan Wieck [EMAIL PROTECTED] [010316 08:08] wrote:
  Philip Warner wrote:
  
   But I prefer the UDP/Collector model anyway; it gives us greater
   flexibility + the ability to keep stats past backend termination, and, as
   you say, removes any possible locking requirements from the backends.
 
  OK, did some tests...
 
  The postmaster can create a SOCK_DGRAM socket at startup and
  bind(2) it to "127.0.0.1:0", which causes the kernel to assign
  a non-privileged port number that can then be read with
  getsockname(2). No other process can have a socket with the
  same port number for the lifetime of the postmaster.

  If the socket gets ready, it'll read one backend message
  from it with recvfrom(2). The fromaddr must be
  "127.0.0.1:xxx" where xxx is the port number the kernel
  assigned to the above socket. Yes, this is its own one,
  shared with the postmaster and all backends. So both the
  postmaster and the backends can use this one UDP socket,
  which the backends inherit on fork(2), to send messages to
  the collector. If such a UDP packet really came from a
  process other than the postmaster or a backend, well, then the
  sysadmin has a more severe problem than manipulated DB
  runtime statistics :-)

 Doing this is a bad idea:

 a) it allows any program to start spamming localhost:randport with
 messages and screw with the postmaster.

 b) it may even allow remote people to mess with it, (see recent
 bugtraq articles about this)

So  it's  possible  for  a  UDP socket to recvfrom(2) and get
packets with  a  fromaddr  localhost:my_own_non_SO_REUSE_port
that really came from somewhere else?

If  that's  possible,  the  packets  must  be coming over the
network.  Otherwise it's the local superuser sending them, and
in  that case it's not worth any more discussion because root
on your system has more powerful possibilities to muck around
with  your  database. And if someone outside the local system
is doing it, it's time for some filter rules, isn't it?

 You should use a unix domain socket (at least when possible).

Unix domain UDP?


  Running  a 500MHz P-III, 192MB, RedHat 6.1 Linux 2.2.17 here,
  I've not lost a single message during the parallel
  regression  test,  if each backend sends one 1K sized message
  per query executed, and the collector simply sucks  them  out
  of  the  socket. Message losses start if the collector does a
  per message idle loop like this:
 
  for (i=0, sum=0; i < 25; i++, sum += 1);
 
  Uh - not much time to spend if the statistics should at least
  be  half  accurate. And it would become worse in SMP systems.
  So that was a nifty idea, but I think it'd  cause  much  more
  statistic losses than I assumed at first.
 
  Back to drawing board. Maybe a SYS-V message queue can serve?

 I wouldn't say back to the drawing board, I would say two steps back.

 What about instead of sending deltas, you send totals?  This would
 allow you to lose messages and still maintain accurate stats.

Similar problem as with shared  memory  -  size.  If  a  long
running  backend  of  a multithousand table database needs to
send access stats per table - and had accessed them all up to
now - it'll be a lot of wasted bandwidth.


 You can also enable SIGIO on the socket, then have a signal handler
 buffer packets that arrive when not actively select()ing on the
 UDP socket.  You can then use sigsetmask(2) to provide mutual
 exclusion with your SIGIO handler and general select()ing on the
 socket.

I  already thought about prioritizing the socket-drain this way:
there is a fairly big receive buffer. If the buffer is empty,
it  does  a  blocking  select(2). If it's not, it does a non-
blocking (0-timeout) one and only if the  non-blocking  tells
that  there  aren't  new  messages waiting, it'll process one
buffered message and try to receive again.

Will give it a shot.


Jan

--

#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



_
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Re: [SQL] Re: why the DB file size does not reduce when 'delete'the data in DB?

2001-03-16 Thread yves

On Fri, Mar 16, 2001 at 12:01:36AM +, Thomas Lockhart wrote:
   You are not quite factually correct above, even given your definition of
   "bug". PostgreSQL does reuse deleted record space, but requires an
   explicit maintenance step to do this.
  Could you tell us what that maintenance step is? Dumping the db and restoring into
a fresh one? :/
 
 :) No, "VACUUM" is your friend for this. Look in the reference manual
 for details.
 
  - Thomas

I'm having this problem:
I have a database that is 3 megabytes in size (measured using pg_dump). When
I go to the corresponding data directory (e.g. du -h data/base/mydbase), it
seems the real disk usage is 135 megabyte! Doing a VACUUM doesn't really
change the disk usage.

Also, query and update speed increases when I dump all data and restore
it into a fresh new database.

I'm running postgresql-7.0.2-6 on a Debian potato.

-Yves

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])



Re: [HACKERS] Performance monitor signal handler

2001-03-16 Thread Jan Wieck


Tom Lane wrote:
 Jan Wieck [EMAIL PROTECTED] writes:
  Uh - not much time to spend if the statistics should at least
  be  half  accurate. And it would become worse in SMP systems.
  So that was a nifty idea, but I think it'd  cause  much  more
  statistic losses than I assumed at first.

  Back to drawing board. Maybe a SYS-V message queue can serve?

 That would be the same as a pipe: backends would block if the collector
 stopped accepting data.  I do like the "auto discard" aspect of this
 UDP-socket approach.

Does  a pipe guarantee that a buffer, written with one atomic
write(2), never can get intermixed with  other  data  on  the
readers  end?   I know that you know what I mean, but for the
broader audience: Let's define a message to the collector  to
be  4byte-len,len-bytes.   Now  hundreds  of  backends hammer
messages into the (shared) writing end of the pipe, all  with
different sizes. Is  it  GUARANTEED  that  a
read(4bytes),read(nbytes) sequence will  always  return  one
complete  message  and  never  intermixed  parts of different
write(2)s?

With message queues, this is guaranteed. Also, message queues
would  make  it  easy  to query the collected statistics (see
below).

 I think Philip had the right idea: each backend should send totals,
 not deltas, in its messages.  Then, it doesn't matter (much) if the
 collector loses some messages --- that just means that sometimes it
 has a slightly out-of-date idea about how much work some backends have
 done.  It should be easy to design the software so that that just makes
 a small, transient error in the currently displayed statistics.

If we use two message queues (IPC_PRIVATE  is  enough  here),
one  into collector and one into backend direction, this'd be
an easy way to collect and query statistics.

The backends send delta stats messages to  the  collector  on
one  queue. Message queues block, by default, but the backend
could use IPC_NOWAIT and just go on and collect up,  as  long
as  it finally will use a blocking call before exiting. We'll
lose  statistics  for  backends  that  go  down  in   flames
(coredump), but who cares for statistics then?

To  query statistics, we have a set of new builtin functions.
All functions share  a  global  statistics  snapshot  in  the
backend.  If  on  function call the snapshot doesn't exist or
was generated by  another  XACT/commandcounter,  the  backend
sends  a  statistics  request  for  his  database  ID  to the
collector and waits for the messages to arrive on the  second
message  queue. It can pick up the messages meant for him via
message type, which is equal to his backend number +1, because
the  collector will send 'em as such.  For table access stats
for example, the snapshot will have slots identified  by  the
table's OID,  so  a  function pg_get_tables_seqscan_count(oid)
should be easy  to  implement.  And  setting  up  views  that
present access stats in readable format is a no-brainer.

Now  we  have communication only between the backends and the
collector.  And we're  certain  that  only  someone  able  to
SELECT from a system view will ever see this information.


Jan

--

#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #


_
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Performance monitor signal handler

2001-03-16 Thread Tom Lane

Jan Wieck [EMAIL PROTECTED] writes:
 Does  a pipe guarantee that a buffer, written with one atomic
 write(2), never can get intermixed with  other  data  on  the
 readers  end?

Yes.  The HPUX man page for write(2) sez:

  o  Write requests of {PIPE_BUF} bytes or less will not be
 interleaved with data from other processes doing writes on the
 same pipe.  Writes of greater than {PIPE_BUF} bytes may have
 data interleaved, on arbitrary boundaries, with writes by
 other processes, whether or not the O_NONBLOCK flag of the
 file status flags is set.

Stevens' _UNIX Network Programming_ (1990) states this is true for all
pipes (nameless or named) on all flavors of Unix, and furthermore states
that PIPE_BUF is at least 4K on all systems.  I don't have any relevant
Posix standards to look at, but I'm not worried about assuming this to
be true.

 With message queues, this is guaranteed. Also, message queues
 would  make  it  easy  to query the collected statistics (see
 below).

I will STRONGLY object to any proposal that we use message queues.
We've already had enough problems with the ridiculously low kernel
limits that are commonly imposed on shmem and SysV semaphores.
We don't need to buy into that silliness yet again with message queues.
I don't believe they gain us anything over pipes anyway.

The real problem with either pipes or message queues is that backends
will block if the collector stops collecting data.  I don't think we
want that.  I suppose we could have the backends write a pipe with
O_NONBLOCK and ignore failure, however:

  o  If the O_NONBLOCK flag is set, write() requests will  be
 handled differently, in the following ways:

 -  The write() function will not block the process.

 -  A write request for {PIPE_BUF} or fewer bytes  will have
the following effect:  If there is sufficient space
available in the pipe, write() will transfer all the data
and return the number of bytes  requested.  Otherwise,
write() will transfer no data and return -1 with errno set
to EAGAIN.

Since we already ignore SIGPIPE, we don't need to worry about losing the
collector entirely.

Now this would put a pretty tight time constraint on the collector:
fall more than 4K behind, you start losing data.  I am not sure if
a UDP socket would provide more buffering or not; anyone know?

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl



Re: [HACKERS] Performance monitor signal handler

2001-03-16 Thread Jan Wieck

Tom Lane wrote:
 Jan Wieck [EMAIL PROTECTED] writes:
  Does  a pipe guarantee that a buffer, written with one atomic
  write(2), never can get intermixed with  other  data  on  the
  readers  end?

 Yes.  The HPUX man page for write(2) sez:

   o  Write requests of {PIPE_BUF} bytes or less will not be
  interleaved with data from other processes doing writes on the
  same pipe.  Writes of greater than {PIPE_BUF} bytes may have
  data interleaved, on arbitrary boundaries, with writes by
  other processes, whether or not the O_NONBLOCK flag of the
  file status flags is set.

 Stevens' _UNIX Network Programming_ (1990) states this is true for all
 pipes (nameless or named) on all flavors of Unix, and furthermore states
 that PIPE_BUF is at least 4K on all systems.  I don't have any relevant
 Posix standards to look at, but I'm not worried about assuming this to
 be true.

That's good news - and maybe a Good Assumption (TM).

  With message queues, this is guaranteed. Also, message queues
  would  make  it  easy  to query the collected statistics (see
  below).

 I will STRONGLY object to any proposal that we use message queues.
 We've already had enough problems with the ridiculously low kernel
 limits that are commonly imposed on shmem and SysV semaphores.
 We don't need to buy into that silliness yet again with message queues.
 I don't believe they gain us anything over pipes anyway.

   OK.

 The real problem with either pipes or message queues is that backends
 will block if the collector stops collecting data.  I don't think we
 want that.  I suppose we could have the backends write a pipe with
 O_NONBLOCK and ignore failure, however:

   o  If the O_NONBLOCK flag is set, write() requests will  be
  handled differently, in the following ways:

  -  The write() function will not block the process.

  -  A write request for {PIPE_BUF} or fewer bytes  will have
 the following effect:  If there is sufficient space
 available in the pipe, write() will transfer all the data
 and return the number of bytes  requested.  Otherwise,
 write() will transfer no data and return -1 with errno set
 to EAGAIN.

 Since we already ignore SIGPIPE, we don't need to worry about losing the
 collector entirely.

That's  not  what  the manpage said. It said that in the case
you're inside PIPE_BUF size and using O_NONBLOCK, you  either
send complete messages or nothing, getting an EAGAIN then.

So  we  could  do the same here and write to the pipe. In the
case we cannot, just count up and try  again  next  year  (or
so).


 Now this would put a pretty tight time constraint on the collector:
 fall more than 4K behind, you start losing data.  I am not sure if
 a UDP socket would provide more buffering or not; anyone know?

Again,  this  isn't  what  the  manpage  said.  "If there is
sufficient space available in the pipe", combined  with  the
fact  that  PIPE_BUF is at least 4K, doesn't necessarily mean
that the pipe's buffer space is only 4K.

Well,  what  I'm  missing  is  the  ability  to  filter   out
statistics reports on the backend side via msgrcv(2)'s msgtype
:-(


Jan

--

#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



_
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Performance monitor signal handler

2001-03-16 Thread Jan Wieck

Tom Lane wrote:
 Now this would put a pretty tight time constraint on the collector:
 fall more than 4K behind, you start losing data.  I am not sure if
 a UDP socket would provide more buffering or not; anyone know?

Looks  like Linux has something around 16-32K of buffer space
for UDP sockets. Just from eyeballing the  fprintf(3)  output
of my destructively hacked postleprechaun.


Jan

--

#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #



_
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] [Stephen C. Tweedie sct@redhat.com] Re: O_DSYNC flag for open

2001-03-16 Thread Giles Lean


[ Drifting off topic ... ]

 Well, that guy might know all about Linux, but he doesn't know anything
 about HPUX (at least not any version I've ever run).  O_SYNC is
 distinctly different from O_DSYNC around here.

There is an HP-UX kernel flag 'o_sync_is_o_dsync' which will cause
O_DSYNC to be treated as O_SYNC.  It defaults to being off -- it
is/was a backward compatibility "feature" since HP-UX 9.X (which is
history now) had implemented O_SYNC as O_DSYNC.

http://docs.hp.com/cgi-bin/otsearch/getfile?id=/hpux/onlinedocs/os/KCparam.OsyncIsOdsync.html

Regards,

Giles




---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: AW: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC

2001-03-16 Thread Tom Lane

Zeugswetter Andreas SB  [EMAIL PROTECTED] writes:
 It's great as long as you never block, but it sucks for making things
 wait, because the wait interval will be some multiple of 10 msec rather
 than just the time till the lock comes free.

 On the AIX platform usleep (3) is able to really sleep microseconds without 
 busying the cpu when called for more than approx. 100 us (the longer the interval,
 the less busy the cpu gets) .
 Would this not be ideal for spin_lock, or is usleep not very common?
 Linux says it comes from 4.3BSD.

HPUX has usleep, but the man page says

 The usleep() function is included for its historical usage. The
 setitimer() function is preferred over this function.

In any case, I would expect that all these functions offer accuracy
no better than the scheduler's regular clock cycle (~ 100Hz) on most
kernels.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



[HACKERS] beta6 packaged ...

2001-03-16 Thread The Hermit Hacker


will do an announce later on tonight, to give the mirrors a chance to
start syncing ... can others confirm that the packaging once more looks
clean?

thanks ...

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



[HACKERS] Stuck spins in current

2001-03-16 Thread Mikheev, Vadim

Got it at spin.c:156 with 50 clients doing inserts into
50 tables (int4, text[1-256 bytes]).
-B 16384, -wal_buffers=256 (with default others wal params).

Vadim

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] Performance monitor signal handler

2001-03-16 Thread Jan Wieck

Jan Wieck wrote:
 Tom Lane wrote:
  Now this would put a pretty tight time constraint on the collector:
  fall more than 4K behind, you start losing data.  I am not sure if
  a UDP socket would provide more buffering or not; anyone know?

 Looks  like Linux has something around 16-32K of buffer space
 for UDP sockets. Just from eyeballing the  fprintf(3)  output
 of my destructively hacked postleprechaun.

Just  to  get  some  evidence  at hand - could some owners of
different platforms compile and run  the  attached  little  C
source please?

(The  program  tests how much data can be stuffed into a pipe
or a Sys-V message queue before the writer would block or get
an EAGAIN error).

My output on RedHat6.1 Linux 2.2.17 is:

Pipe buffer is 4096 bytes
Sys-V message queue buffer is 16384 bytes

Seems Tom is (unfortunately) right. The pipe blocks at 4K.

So  a  Sys-V  message  queue,  with the ability to distribute
messages from  the  collector  to  individual  backends  with
kernel  support  via  "mtype",  has four times the buffer here,
at an unestimated cost in complexity.  What does your system say?

I really never thought that Sys-V IPC is a good way to go  at
all.   I  hate  its  incompatibility  with the select(2) system
call and all these  OS/installation  dependent  restrictions.
But I'm tempted to reevaluate it "for this case".


Jan

--

#==#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.  #
#== [EMAIL PROTECTED] #




#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>


typedef struct  test_message
{
    long            mtype;
    char            mtext[512 - sizeof(long)];
} test_message;


static int  test_pipe(void);
static int  test_msg(void);


int
main(int argc, char *argv[])
{
    if (test_pipe() < 0)
        return 1;

    if (test_msg() < 0)
        return 1;

    return 0;
}


static int
test_pipe(void)
{
    int             p[2];
    char            buf[512];
    int             done;
    int             rc;

    if (pipe(p) < 0)
    {
        perror("pipe(2)");
        return -1;
    }

    if (fcntl(p[1], F_SETFL, O_NONBLOCK) < 0)
    {
        perror("fcntl(2)");
        return -1;
    }

    for (done = 0; ; )
    {
        if ((rc = write(p[1], buf, sizeof(buf))) != sizeof(buf))
        {
            if (rc < 0)
            {
                if (errno == EAGAIN)
                {
                    printf("Pipe buffer is %d bytes\n", done);
                    return 0;
                }

                perror("write(2)");
                return -1;
            }

            fprintf(stderr, "whatever happened - rc = %d on write(2)\n", rc);
            return -1;
        }
        done += rc;
    }

    fprintf(stderr, "Endless write loop returned - what's that?\n");
    return -1;
}


static int
test_msg(void)
{
    int             mq;
    test_message    msg;
    int             done;

    if ((mq = msgget(IPC_PRIVATE, IPC_CREAT | 0600)) < 0)
    {
        perror("msgget(2)");
        return -1;
    }

    for (done = 0; ; )
    {
        msg.mtype = 1;
        if (msgsnd(mq, &msg, sizeof(msg), IPC_NOWAIT) < 0)
        {
            if (errno == EAGAIN)
            {
                printf("Sys-V message queue buffer is %d bytes\n", done);
                return 0;
            }

            perror("msgsnd(2)");
            return -1;
        }
        done += sizeof(msg);
    }

    fprintf(stderr, "Endless write loop returned - what's that?\n");
    return -1;
}





---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] [Stephen C. Tweedie sct@redhat.com] Re: O_DSYNC flag for open

2001-03-16 Thread Giles Lean


 There is an HP-UX kernel flag 'o_sync_is_o_dsync' which will cause
 O_DSYNC to be treated as O_SYNC.  It defaults to being off -- it

... other way around there, of course.  Trying to clarify and
adding confusion instead. :-(

 is/was a backward compatibility "feature" since HP-UX 9.X (which is
 history now) had implemented O_SYNC as O_DSYNC.

Muttering,

Giles

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] beta6 packaged ...

2001-03-16 Thread Tom Lane

The Hermit Hacker [EMAIL PROTECTED] writes:
 will do an announce later on tonight, to give the mirrors a chance to
 start syncing ... can others confirm that the packaging once more looks
 clean?

The main tar.gz matches what I have here.  Didn't look at the partial
tarballs.

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Stuck spins in current

2001-03-16 Thread Tom Lane

"Mikheev, Vadim" [EMAIL PROTECTED] writes:
 Got it at spin.c:156 with 50 clients doing inserts into
 50 tables (int4, text[1-256 bytes]).
 -B 16384, -wal_buffers=256 (with default others wal params).

SpinAcquire() ... but on which lock?

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])



Re: [HACKERS] Stuck spins in current

2001-03-16 Thread Tom Lane

"Mikheev, Vadim" [EMAIL PROTECTED] writes:
 Got it at spin.c:156 with 50 clients doing inserts into
 50 tables (int4, text[1-256 bytes]).
 -B 16384, -wal_buffers=256 (with default others wal params).

 SpinAcquire() ... but on which lock?

After a little bit of thought I'll bet it's ControlFileLockId.

Likely we shouldn't be using a spinlock at all for that, but the
short-term solution might be a longer timeout for this particular lock.
Alternatively, could we avoid holding that lock while initializing a
new log segment?

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] Performance monitor signal handler

2001-03-16 Thread Tom Lane

Jan Wieck [EMAIL PROTECTED] writes:
 Just  to  get  some  evidence  at hand - could some owners of
 different platforms compile and run  the  attached  little  C
 source please?

HPUX 10.20:

Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



RE: [HACKERS] Stuck spins in current

2001-03-16 Thread Mikheev, Vadim

  How to synchronize with the checkpointer if wal_files > 0?
 
 I was sort of visualizing assigning the created xlog files 
 dynamically:
 
   create a temp file of a PID-dependent name
   fill it with zeroes and fsync it
   acquire ControlFileLockId
   rename temp file into place as next uncreated segment
   update pg_control
   release ControlFileLockId
 
 Since the things are just filled with 0's, there's no need to 
 know which segment it will be while you're filling it.
 
 This would leave you sometimes with more advance files than you really
 needed, but so what ...

Yes, it makes sense, but:

  And you know - I've run same tests on ~ Mar 9 snapshot
  without any problems.
 
 That was before I changed the code to pre-fill the file --- 
 now it takes longer to init a log segment.  And we're only
 using a plain SpinAcquire, not the flavor with a longer timeout.

xlog.c revision 1.55 from Feb 26 already had log file
zero-filling, so ...

Vadim

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])



Re: [HACKERS] Stuck spins in current

2001-03-16 Thread Tom Lane

"Mikheev, Vadim" [EMAIL PROTECTED] writes:
 Alternatively, could we avoid holding that lock while initializing a
 new log segment?

 How to synchronize with the checkpointer if wal_files > 0?

I was sort of visualizing assigning the created xlog files dynamically:

create a temp file of a PID-dependent name
fill it with zeroes and fsync it
acquire ControlFileLockId
rename temp file into place as next uncreated segment
update pg_control
release ControlFileLockId

Since the things are just filled with 0's, there's no need to know which
segment it will be while you're filling it.

This would leave you sometimes with more advance files than you really
needed, but so what ...

 And you know - I've run same tests on ~ Mar 9 snapshot
 without any problems.

That was before I changed the code to pre-fill the file --- now it takes
longer to init a log segment.  And we're only using a plain SpinAcquire,
not the flavor with a longer timeout.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



RE: [HACKERS] Stuck spins in current

2001-03-16 Thread Mikheev, Vadim

  Got it at spin.c:156 with 50 clients doing inserts into
  50 tables (int4, text[1-256 bytes]).
  -B 16384, -wal_buffers=256 (with default others wal params).
 
  SpinAcquire() ... but on which lock?
 
 After a little bit of thought I'll bet it's ControlFileLockId.

I see "XLogWrite: new log file created..." in postmaster' log -
backend writes this after releasing ControlFileLockId.

 Likely we shouldn't be using a spinlock at all for that, but the
 short-term solution might be a longer timeout for this 
 particular lock.
 Alternatively, could we avoid holding that lock while initializing a
 new log segment?

How to synchronize with the checkpointer if wal_files > 0?
And you know - I've run same tests on ~ Mar 9 snapshot
without any problems.

Vadim

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])



Re: [HACKERS] Performance monitor signal handler

2001-03-16 Thread Giles Lean


 Just  to  get  some  evidence  at hand - could some owners of
 different platforms compile and run  the  attached  little  C
 source please?

$ uname -srm
FreeBSD 4.1.1-STABLE
$ ./jan
Pipe buffer is 16384 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.5 alpha
$ ./jan
Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.5_BETA2 i386
$ ./jan
Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.4.2 i386
$ ./jan
Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.4.1 sparc
$ ./jan
Pipe buffer is 4096 bytes
Bad system call (core dumped)   # no SysV IPC in running kernel

$ uname -srm
HP-UX B.11.11 9000/800
$ ./jan
Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

$ uname -srm
HP-UX B.11.00 9000/813
$ ./jan
Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

$ uname -srm
HP-UX B.10.20 9000/871
$ ./jan
Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

HP-UX can also use STREAMS based pipes if the kernel parameter
streampipes is set.  Using STREAMS based pipes increases the pipe
buffer size by a lot:

# uname -srm 
HP-UX B.11.11 9000/800
# ./jan
Pipe buffer is 131072 bytes
Sys-V message queue buffer is 16384 bytes

# uname -srm
HP-UX B.11.00 9000/800
# ./jan
Pipe buffer is 131072 bytes
Sys-V message queue buffer is 16384 bytes

Regards,

Giles

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl



Re: [HACKERS] Stuck spins in current

2001-03-16 Thread Tom Lane

"Mikheev, Vadim" [EMAIL PROTECTED] writes:
 And you know - I've run same tests on ~ Mar 9 snapshot
 without any problems.
 
 That was before I changed the code to pre-fill the file --- 
 now it takes longer to init a log segment.  And we're only
 using a plain SpinAcquire, not the flavor with a longer timeout.

 xlog.c revision 1.55 from Feb 26 already had log file
 zero-filling, so ...

Oh, you're right, I didn't study the CVS log carefully enough.  Hmm,
maybe the control file lock isn't the problem.  The abort() in
s_lock_stuck should have left a core file --- what is the backtrace?

regards, tom lane

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl



RE: [HACKERS] Stuck spins in current

2001-03-16 Thread Mikheev, Vadim

  And you know - I've run same tests on ~ Mar 9 snapshot
  without any problems.
  
  That was before I changed the code to pre-fill the file --- 
  now it takes longer to init a log segment.  And we're only
  using a plain SpinAcquire, not the flavor with a longer timeout.
 
  xlog.c revision 1.55 from Feb 26 already had log file
  zero-filling, so ...
 
 Oh, you're right, I didn't study the CVS log carefully enough.  Hmm,
 maybe the control file lock isn't the problem.  The abort() in
 s_lock_stuck should have left a core file --- what is the backtrace?

After increasing DEFAULT_TIMEOUT in s_lock.c tenfold,
I got the abort at xlog.c:626 - waiting for insert_lck.
But the problem is near the new log file creation code: the
system goes to sleep just after a new one is created.

Vadim

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])



[HACKERS] pg_upgrade

2001-03-16 Thread Peter Eisentraut

Since pg_upgrade will not work for 7.1, should its installation be
prevented and the man page be disabled?

-- 
Peter Eisentraut  [EMAIL PROTECTED]   http://yi.org/peter-e/


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



[HACKERS] transaction timeout

2001-03-16 Thread Kevin T. Manley

Is there a timeout setting I can use to abort transactions that aren't
deadlocked, but which have been blocked waiting for locks greater than some
amount of time? I didn't see anything in the docs on this and observed with
2 instances of psql that a transaction waiting on a lock seems to wait
forever.

If pgsql doesn't have such a setting, has there been any discussion about
adding it?

Regards,
Kevin Manley



---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])



Re: [HACKERS] pg_upgrade

2001-03-16 Thread Bruce Momjian

 Since pg_upgrade will not work for 7.1, should its installation be
 prevented and the man page be disabled?

Probably.  I am not sure it will ever be used again now that we have
numeric file names.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Performance monitor signal handler

2001-03-16 Thread Larry Rosenman

* Jan Wieck [EMAIL PROTECTED] [010316 16:35]:
 Jan Wieck wrote:
  Tom Lane wrote:
   Now this would put a pretty tight time constraint on the collector:
   fall more than 4K behind, you start losing data.  I am not sure if
   a UDP socket would provide more buffering or not; anyone know?
 
  Looks  like Linux has something around 16-32K of buffer space
  for UDP sockets. Just from eyeballing the  fprintf(3)  output
  of my destructively hacked postleprechaun.
 
 Just  to  get  some  evidence  at hand - could some owners of
 different platforms compile and run  the  attached  little  C
 source please?
 
 (The  program  tests how much data can be stuffed into a pipe
 or a Sys-V message queue before the writer would block or get
 an EAGAIN error).
 
 My output on RedHat6.1 Linux 2.2.17 is:
 
 Pipe buffer is 4096 bytes
 Sys-V message queue buffer is 16384 bytes
 
 Seems Tom is (unfortunately) right. The pipe blocks at 4K.
 
 So a Sys-V message queue, with the ability to distribute
 messages from the collector to individual backends with
 kernel support via "mtype", is four times better here, at
 some unestimated cost in complexity.  What does your system say?
 
 I really never thought that Sys-V IPC is a good way to go at
 all.  I hate its incompatibility with the select(2) system
 call and all these OS/installation-dependent restrictions.
 But I'm tempted to reevaluate it "for this case".
 
 
 Jan
$ ./queuetest
Pipe buffer is 32768 bytes
Sys-V message queue buffer is 4096 bytes
$ uname -a
UnixWare lerami 5 7.1.1 i386 x86at SCO UNIX_SVR5
$ 

I think some of these are configurable...

LER

-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] Performance monitor signal handler

2001-03-16 Thread Larry Rosenman

* Larry Rosenman [EMAIL PROTECTED] [010316 20:47]:
 * Jan Wieck [EMAIL PROTECTED] [010316 16:35]:
 $ ./queuetest
 Pipe buffer is 32768 bytes
 Sys-V message queue buffer is 4096 bytes
 $ uname -a
 UnixWare lerami 5 7.1.1 i386 x86at SCO UNIX_SVR5
 $ 
 
 I think some of these are configurable...
They both are: FIFOBLKSIZE and MSGMNB, or some such kernel tunables.

I can get more info if you need it.

LER

-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] pg_upgrade

2001-03-16 Thread Bruce Momjian

  Since pg_upgrade will not work for 7.1, should its installation be
  prevented and the man page be disabled?
 
 Probably.  I am not sure it will ever be used again now that we have
 numeric file names.

Perhaps we should leave it in for 7.1, because people will complain when
they cannot find it.  Maybe we can mention that it may go away in the
next release.


-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



Re: [HACKERS] problems with startup script on upgrade

2001-03-16 Thread Tom Lane

"Martin A. Marques" [EMAIL PROTECTED] writes:
 Please define "doesn't work".  What happens exactly?  What messages
 are produced?

 root@ultra31 /space/pruebas/postgres-cvs # su postgres -c 
 '/dbs/postgres/bin/pg_ctl -o "-i" -D /dbs/postgres/data/ start -l 
 /dbs/postgres/sql.log'
 19054 Killed
 postmaster successfully started
 root@ultra31 /space/pruebas/postgres-cvs #

Hm, that 'Killed' looks suspicious.  What shows up in the
/dbs/postgres/sql.log file?

regards, tom lane

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] problems with startup script on upgrade

2001-03-16 Thread Tom Lane

"Martin A. Marques" [EMAIL PROTECTED] writes:
 Hm, that 'Killed' looks suspicious.  What shows up in the
 /dbs/postgres/sql.log file?

 Nothing at all.

That's no help :-(.  Please alter the command to trace the shell script,
ie

su postgres -c 'sh -x /dbs/postgres/bin/pg_ctl -o ... 2>tracefile'

and send the tracefile.

regards, tom lane

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



[HACKERS] Re: [GENERAL] Problems with outer joins in 7.1beta5

2001-03-16 Thread Tom Lane

Barry Lind [EMAIL PROTECTED] writes:
 What I would expect the syntax to be is:
 table as alias (columna as aliasa, columnb as aliasb,...)
 This will allow the query to work regardless of what the table column 
 order is.  Generally the SQL spec has tried not to tie query behaviour 
 to the table column order.

Unfortunately, the spec authors seem to have forgotten that basic design
rule when they wrote the aliasing syntax.  Column alias lists are
position-sensitive:

 <table reference> ::=
<table name> [ [ AS ] <correlation name>
[ <left paren> <derived column list> <right paren> ] ]
  | <derived table> [ AS ] <correlation name>
[ <left paren> <derived column list> <right paren> ]
  | <joined table>

 <derived column list> ::= <column name list>

 <column name list> ::=
  <column name> [ { <comma> <column name> }... ]

SQL99 seems to be no better.  Sorry.

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



[GENERAL] Problems with outer joins in 7.1beta5

2001-03-16 Thread Barry Lind

My problem is that my two outer joined tables have columns that have the 
same names.  Therefore when my select list tries to reference the 
columns they are ambiguously defined.  Looking at the doc I see the way 
to deal with this is by using the following syntax:

table as alias (column1alias, column2alias,...)

So we can alias the conflicting column names to resolve the problem. 
However, the problem with this is that the column aliases are positional 
per the table structure. Thus column1alias applies to the first column 
in the table. Code that relies on the order of columns in a table is 
very brittle. As adding a column always places it at the end of the 
table, it is very easy to have a newly installed site have one order 
(the order the create table command creates them in) and a site 
upgrading from an older version (where the upgrade simply adds the new 
columns) to have column orders be different.

My feeling is that postgres has misinterpreted the SQL92 spec in this 
regard. But I am having trouble finding an online copy of the SQL92 
spec so that I can verify.

What I would expect the syntax to be is:

table as alias (columna as aliasa, columnb as aliasb,...)

This will allow the query to work regardless of what the table column 
order is.  Generally the SQL spec has tried not to tie query behaviour 
to the table column order.

I will fix my code so that it works given how postgres currently 
supports the column aliases.

Can anyone point me to a copy of the SQL92 spec so that I can research 
this more?

thanks,
--Barry


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]



[HACKERS] Re: problems with startup script on upgrade

2001-03-16 Thread Thomas Lockhart

  Ah, but is the LD_LIBRARY_PATH the same inside that su?  A change of
  environment might explain why this works "by hand" and not through su
  ...
 This #$^%^*$% Solaris!!
 Check this out, and tell me I shouldn't yell out at SUN:
 root@ultra31 / # su - postgres -c 'echo $PATH'
 /usr/bin:
 root@ultra31 / # su - postgres
 postgres@ultra31:~  echo $PATH
/usr/local/bin:/usr/local/gcc/bin:/usr/local/php/bin:/opt/sfw/bin:/usr/local/a2p/bin:/usr/local/sql/bin:/usr/ccs/bin:/bin:/usr/bin/X11:/usr/bin:/usr/ucb:/dbs/postgres/bin:
 postgres@ultra31:~  logout
 root@ultra31 / #
 Can someone explain why Solaris is doing that, and why it started doing
 it after an upgrade? I have no words.

It may be that this is the first build of PostgreSQL which asks for
"libz.so", but that is just a guess.

Not sure about "after the upgrade", but I'll bet that the first (command
line) case does not have an attached terminal, while the second case,
where you actually connect to the session, does.

Does your .profile try doing some "terminal stuff"? Try adding echoes to
your .profile to verify that it starts, and that it runs to completion...

Also, PATH is not relevant for finding libz.so, so you need to figure
out what (if anything) is happening to LD_LIBRARY_PATH.

  - Thomas

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster