Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
Hello Alfred,

Friday, March 16, 2001, 3:21:09 PM, you wrote:

AP> * Xu Yifeng [EMAIL PROTECTED] [010315 22:25] wrote:
AP>> Could anyone consider forking a syncer process to sync data to disk?
AP>> Build a shared sync queue; when a daemon process wants to sync after
AP>> write() is called, it just puts a sync request on the queue. This
AP>> releases the process from blocking on the write as soon as possible.
AP>> Multiple sync requests for one file can be merged when a request is
AP>> inserted into the queue.

AP> I suggested this about a year ago. :)

AP> The problem is that you need that process to potentially open and close
AP> many files over and over.

AP> I still think it's somewhat of a good idea.

I am not a DBMS guru.

Couldn't the syncer process cache opened files? Is there any problem I
didn't consider?

--
Best regards,
Xu Yifeng

---(end of broadcast)---
TIP 6: Have you searched our list archives?
http://www.postgresql.org/search.mpl
Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Xu Yifeng [EMAIL PROTECTED] [010316 01:15] wrote:
> Hello Alfred,
>
> Friday, March 16, 2001, 3:21:09 PM, you wrote:
>
> AP> * Xu Yifeng [EMAIL PROTECTED] [010315 22:25] wrote:
> AP>> Could anyone consider forking a syncer process to sync data to disk?
> AP>> Build a shared sync queue; when a daemon process wants to sync after
> AP>> write() is called, it just puts a sync request on the queue. This
> AP>> releases the process from blocking on the write as soon as possible.
> AP>> Multiple sync requests for one file can be merged when a request is
> AP>> inserted into the queue.
>
> AP> I suggested this about a year ago. :)
>
> AP> The problem is that you need that process to potentially open and close
> AP> many files over and over.
>
> AP> I still think it's somewhat of a good idea.
>
> I am not a DBMS guru.

Hah, same here. :)

> Couldn't the syncer process cache opened files? Is there any problem I
> didn't consider?

1) IPC latency: the time it takes to call fsync will increase by at
   least two context switches.

2) A working set (the number of files that need to be fsync'd) that is
   larger than the number of files you wish to keep open.

--
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
AW: [HACKERS] Allowing WAL fsync to be done via O_SYNC
> Okay ... we can fall back to O_FSYNC if we don't see either of the
> others. No problem. Any other weird cases out there? I think Andreas
> might've muttered something about AIX but I'm not sure now.

You can safely use O_DSYNC on AIX. The only special thing about AIX is
that it makes no speed difference compared to O_SYNC. This is, imho,
because jfs only needs one sync write to the jfs journal for meta info
in either case (so that nobody misunderstands: both perform excellently).

Andreas
[HACKERS] Re: AW: Allowing WAL fsync to be done via O_SYNC
> > Okay ... we can fall back to O_FSYNC if we don't see either of the
> > others. No problem. Any other weird cases out there? I think Andreas
> > might've muttered something about AIX but I'm not sure now.
>
> You can safely use O_DSYNC on AIX. The only special thing about AIX is
> that it makes no speed difference compared to O_SYNC. This is, imho,
> because jfs only needs one sync write to the jfs journal for meta info
> in either case (so that nobody misunderstands: both perform excellently).

Hmm. Does everyone run jfs on AIX, or are there other file systems
available? The same issue should be raised for Linux (at least): have we
tried test cases with both journaling and non-journaling file systems?
Perhaps the flag choice would be markedly different for the different
options?

- Thomas
[HACKERS] [Stephen C. Tweedie sct@redhat.com] Re: O_DSYNC flag for open
Just a quick delurk to pass along this tidbit from linux-kernel on
Linux *sync() behavior, since we've been talking about it a lot...

-Doug

Hi,

On Wed, Mar 14, 2001 at 10:26:42PM -0500, Tom Vier wrote:

> fdatasync() is the same as fsync(), in linux.

No, in 2.4 fdatasync does the right thing and skips the inode flush if
only the timestamps have changed.

> until fdatasync() is implemented (ie, syncs the data only)

fdatasync is required to sync more than just the data: it has to sync
the inode too if any fields other than the timestamps have changed.
So, for appending to files or writing new files from scratch, fsync ==
fdatasync (because each write also changes the inode size). Only for
updating existing files in place does fdatasync behave differently.

> #ifndef O_DSYNC
> # define O_DSYNC O_SYNC
> #endif

2.4's O_SYNC actually does a fdatasync internally. This is also the
default behaviour of HPUX, which requires you to set a sysctl variable
if you want O_SYNC to flush timestamp changes to disk.

Cheers,
 Stephen
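The fallback order discussed in this thread (O_DSYNC if the platform has it, else O_SYNC, else O_FSYNC) can be sketched at compile time roughly as follows. This is only a minimal sketch and not the code in xlog.c: the macro LOG_SYNC_FLAG and the helpers open_log_file() and append_durably() are hypothetical names invented for illustration, and the final fallback to an explicit fsync() is an assumption about what a caller would do when no synchronous-open flag exists.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical macro: pick the cheapest synchronous-write flag available. */
#if defined(O_DSYNC)
#define LOG_SYNC_FLAG O_DSYNC
#elif defined(O_SYNC)
#define LOG_SYNC_FLAG O_SYNC
#elif defined(O_FSYNC)
#define LOG_SYNC_FLAG O_FSYNC
#else
#define LOG_SYNC_FLAG 0         /* no flag: caller must fsync() explicitly */
#endif

/* Open (or create) a log file with the selected flag. Returns fd or -1. */
int
open_log_file(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDWR | O_CREAT | LOG_SYNC_FLAG, 0600);
}

/* Write one record; fall back to fsync() if no sync flag was available. */
int
append_durably(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t) len)
        return -1;
    if (LOG_SYNC_FLAG == 0 && fsync(fd) != 0)
        return -1;
    return 0;
}
```

Whether O_DSYNC is actually cheaper than O_SYNC depends on the platform and file system, as the AIX/jfs and Linux 2.4 remarks in this thread suggest, so any such choice would need to be benchmarked.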
Re: Re[2]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
> > Could anyone consider forking a syncer process to sync data to disk?
> > Build a shared sync queue; when a daemon process wants to sync after
> > write() is called, it just puts a sync request on the queue. This
> > releases the process from blocking on the write as soon as possible.
> > Multiple sync requests for one file can be merged when a request is
> > inserted into the queue.
>
> I suggested this about a year ago. :)
>
> The problem is that you need that process to potentially open and close
> many files over and over.
>
> I still think it's somewhat of a good idea.

I like the idea too, but people want the transaction to return COMMIT
only after data has been fsync'ed so I don't see a big win.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                    |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Re: [HACKERS] Re: AW: Allowing WAL fsync to be done via O_SYNC
My UnixWare box runs Veritas' VXFS, and has Online-Data Manager
installed. Documentation is available at http://www.lerctr.org:457/

There are MULTIPLE sync modes, and there are also hints an app can give
to the FS.

More info is available if you want.

LER

--
Larry Rosenman                      http://www.lerctr.org/~ler/
Phone: +1 972 414 9812              E-Mail: [EMAIL PROTECTED]
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749 US

Original Message

On 3/16/01, 9:11:51 AM, Thomas Lockhart [EMAIL PROTECTED] wrote
regarding [HACKERS] Re: AW: Allowing WAL fsync to be done via O_SYNC:

> > > Okay ... we can fall back to O_FSYNC if we don't see either of the
> > > others. No problem. Any other weird cases out there? I think Andreas
> > > might've muttered something about AIX but I'm not sure now.
> >
> > You can safely use O_DSYNC on AIX, the only special on AIX is, that it
> > does not make a speed difference to O_SYNC. This is imho because the
> > jfs only needs one sync write to the jfs journal for meta info in
> > either case (so that nobody misunderstands: both perform excellent).
>
> Hmm. Does everyone run jfs on AIX, or are there other file systems
> available? The same issue should be raised for Linux (at least): have we
> tried test cases with both journaling and non-journaling file systems?
> Perhaps the flag choice would be markedly different for the different
> options?
>
> - Thomas
Re: Re[2]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Bruce Momjian [EMAIL PROTECTED] [010316 07:11] wrote:
> > > Could anyone consider forking a syncer process to sync data to disk?
> > > Build a shared sync queue; when a daemon process wants to sync after
> > > write() is called, it just puts a sync request on the queue. This
> > > releases the process from blocking on the write as soon as possible.
> > > Multiple sync requests for one file can be merged when a request is
> > > inserted into the queue.
> >
> > I suggested this about a year ago. :)
> >
> > The problem is that you need that process to potentially open and close
> > many files over and over.
> >
> > I still think it's somewhat of a good idea.
>
> I like the idea too, but people want the transaction to return COMMIT
> only after data has been fsync'ed so I don't see a big win.

This isn't simply handing off the sync to this other process; it
requires an ack from the syncer before returning 'COMMIT'.

--
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: [HACKERS] Allowing WAL fsync to be done via O_SYNC
From: "Bruce Momjian" [EMAIL PROTECTED]

> > > Could anyone consider forking a syncer process to sync data to disk?
> > > Build a shared sync queue; when a daemon process wants to sync after
> > > write() is called, it just puts a sync request on the queue. This
> > > releases the process from blocking on the write as soon as possible.
> > > Multiple sync requests for one file can be merged when a request is
> > > inserted into the queue.
> >
> > I suggested this about a year ago. :)
> >
> > The problem is that you need that process to potentially open and close
> > many files over and over.
> >
> > I still think it's somewhat of a good idea.
>
> I like the idea too, but people want the transaction to return COMMIT
> only after data has been fsync'ed so I don't see a big win.

For a log file on a busy system, this could improve throughput a lot:
batch commit. You end up with fewer than one fsync() per transaction.
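The "fewer than one fsync() per transaction" point can be illustrated with a small single-process sketch. It assumes a log that tracks how far it has been written and how far it has been flushed; a committer whose record is already covered by an earlier fsync() skips its own. The LogState structure and the log_open(), log_append() and log_flush() names are hypothetical and are not taken from xlog.c, and the locking a real multi-backend log would need is left out.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct
{
    int     fd;             /* log file descriptor */
    size_t  written;        /* bytes handed to write() so far */
    size_t  flushed;        /* bytes known to have been fsync'ed */
    int     fsync_calls;    /* counter kept only for illustration */
} LogState;

int
log_open(LogState *log, const char *path)
{
    assert(log != NULL && path != NULL);
    memset(log, 0, sizeof(*log));
    log->fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    return log->fd < 0 ? -1 : 0;
}

/* Append a record; returns the log position just past it, or 0 on error. */
size_t
log_append(LogState *log, const void *rec, size_t len)
{
    assert(log != NULL && rec != NULL && len > 0);
    if (write(log->fd, rec, len) != (ssize_t) len)
        return 0;
    log->written += len;
    return log->written;
}

/*
 * Make the log durable up to position 'upto'. If an earlier flush already
 * covered that position, no fsync() is issued at all.
 */
int
log_flush(LogState *log, size_t upto)
{
    size_t  target;

    if (upto <= log->flushed)
        return 0;               /* someone else's fsync covered us */
    target = log->written;      /* one fsync covers everything written */
    if (fsync(log->fd) != 0)
        return -1;
    log->fsync_calls++;
    log->flushed = target;
    return 0;
}
```

If three transactions append their commit records before any of them flushes, the first log_flush() call covers all three and the other two return immediately.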
[HACKERS] Re: AW: Allowing WAL fsync to be done via O_SYNC
Thomas Lockhart [EMAIL PROTECTED] writes:
> tried test cases with both journaling and non-journaling file systems?
> Perhaps the flag choice would be markedly different for the different
> options?

Good point. Another reason we don't have enough data to nail this down
yet. Anyway, the code is in there and people can run test cases if they
please...

			regards, tom lane
Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
Alfred Perlstein [EMAIL PROTECTED] writes:
>> couldn't the syncer process cache opened files? is there any problem I
>> didn't consider ?

> 1) IPC latency, the amount of time it takes to call fsync will
>    increase by at least two context switches.

> 2) a working set (number of files needed to be fsync'd) that
>    is larger than the amount of files you wish to keep open.

These days we're really only interested in fsync'ing the current WAL
log file, so working set doesn't seem like a problem anymore. However
context-switch latency is likely to be a big problem. One thing we'd
definitely need before considering this is to replace the existing
spinlock mechanism with something more efficient.

Vadim has designed the WAL stuff in such a way that a separate
writer/syncer process would be easy to add; in fact it's almost that way
already, in that any backend can write or sync data that's been added
to the queue by any other backend. The question is whether it'd actually
buy anything to have another process. Good stuff to experiment with for
7.2.

			regards, tom lane
Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
* Tom Lane [EMAIL PROTECTED] [010316 08:16] wrote:
> Alfred Perlstein [EMAIL PROTECTED] writes:
> >> couldn't the syncer process cache opened files? is there any problem I
> >> didn't consider ?
>
> > 1) IPC latency, the amount of time it takes to call fsync will
> >    increase by at least two context switches.
>
> > 2) a working set (number of files needed to be fsync'd) that
> >    is larger than the amount of files you wish to keep open.
>
> These days we're really only interested in fsync'ing the current WAL
> log file, so working set doesn't seem like a problem anymore. However
> context-switch latency is likely to be a big problem. One thing we'd
> definitely need before considering this is to replace the existing
> spinlock mechanism with something more efficient.

What sort of problems are you seeing with the spinlock code?

> Vadim has designed the WAL stuff in such a way that a separate
> writer/syncer process would be easy to add; in fact it's almost that way
> already, in that any backend can write or sync data that's been added
> to the queue by any other backend. The question is whether it'd actually
> buy anything to have another process. Good stuff to experiment with for
> 7.2.

The delayed/coalesced fsync looked interesting.

--
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
Alfred Perlstein [EMAIL PROTECTED] writes:
>> definitely need before considering this is to replace the existing
>> spinlock mechanism with something more efficient.

> What sort of problems are you seeing with the spinlock code?

It's great as long as you never block, but it sucks for making things
wait, because the wait interval will be some multiple of 10 msec rather
than just the time till the lock comes free.

We've speculated about using Posix semaphores instead, on platforms
where those are available. I think Bruce was concerned about the
possible overhead of pulling in a whole thread-support library just to
get semaphores, however.

			regards, tom lane
RE: [HACKERS] Allowing WAL fsync to be done via O_SYNC
> > I was wondering if the multiple writes performed to the XLOG could be
> > grouped into one write().
>
> That would require fairly major restructuring of xlog.c, which I don't

Restructuring? Why? It's only XLogWrite() that makes writes.

> want to undertake at this point in the cycle (we're trying to push out
> a release candidate, remember?). I'm not convinced it would be a huge
> win anyway. It would be a win if your average transaction writes
> multiple blocks' worth of XLOG ... but if your average transaction
> writes less than a block then it won't help.

But in a multi-user environment multiple transactions may write more
than 1 block before commit.

> I think it probably is a good idea to restructure xlog.c so that it can
> write more than one page at a time --- but it's not such a great idea
> that I want to hold up the release any more for it.

Agreed.

Vadim
Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
On Fri, 16 Mar 2001, Tom Lane wrote:

> Alfred Perlstein [EMAIL PROTECTED] writes:
> >> definitely need before considering this is to replace the existing
> >> spinlock mechanism with something more efficient.
>
> > What sort of problems are you seeing with the spinlock code?
>
> It's great as long as you never block, but it sucks for making things
> wait, because the wait interval will be some multiple of 10 msec rather
> than just the time till the lock comes free.
>
> We've speculated about using Posix semaphores instead, on platforms
> where those are available. I think Bruce was concerned about the
> possible overhead of pulling in a whole thread-support library just to
> get semaphores, however.

But, with shared libraries, are you really pulling in a "whole
thread-support library"? My understanding of shared libraries (although
it may be totally off) was that instead of pulling in a whole library,
you pulled in the bits that you needed, pretty much as you needed them ...
Re: [HACKERS] Allowing WAL fsync to be done via O_SYNC
"Mikheev, Vadim" [EMAIL PROTECTED] writes:
>>> I was wondering if the multiple writes performed to the XLOG could be
>>> grouped into one write().
>>
>> That would require fairly major restructuring of xlog.c, which I don't

> Restructuring? Why? It's only XLogWrite() that makes writes.

I was thinking of changing the data structure. I guess you could keep
the data structure the same and make XLogWrite more complicated, though.

>> I think it probably is a good idea to restructure xlog.c so that it can
>> write more than one page at a time --- but it's not such a great idea
>> that I want to hold up the release any more for it.

> Agreed.

Yes, to-do item for 7.2.

			regards, tom lane
RE: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
> We've speculated about using Posix semaphores instead, on platforms

For spinlocks we should use pthread mutexes.

> where those are available. I think Bruce was concerned about the

And mutexes are more portable than semaphores.

Vadim
Re: [HACKERS] [Stephen C. Tweedie sct@redhat.com] Re: O_DSYNC flag for open
Tom Lane [EMAIL PROTECTED] writes:

> Doug McNaught [EMAIL PROTECTED] forwards:
> > 2.4's O_SYNC actually does a fdatasync internally. This is also the
> > default behaviour of HPUX, which requires you to set a sysctl variable
> > if you want O_SYNC to flush timestamp changes to disk.
>
> Well, that guy might know all about Linux, but he doesn't know anything
> about HPUX (at least not any version I've ever run). O_SYNC is
> distinctly different from O_DSYNC around here.

Y'know, I figured that might be the case. ;) He's a well-respected
Linux filesystem hacker, so I trust him on the Linux stuff.

So are we still thinking about preallocating log files as a
performance hack? It does seem that using preallocated files along
with O_DSYNC will eliminate pretty much all metadata writes under
Linux in future...

[NOT suggesting we try to add anything to 7.1, I'm eagerly awaiting RC1]

-Doug
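The reasoning behind preallocation is that a write which extends a file changes the inode size, so fdatasync (or O_DSYNC) still has to flush metadata; overwriting blocks that already exist does not. A minimal sketch of zero-filling a log segment up front might look like this. The function name preallocate_log() and the block size are hypothetical choices for illustration and are not taken from the PostgreSQL sources.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PREALLOC_BLOCK 8192     /* hypothetical block size */

/*
 * Create 'path' and fill it with 'nblocks' zeroed blocks, then fsync so
 * that the final size is on disk. Later in-place writes no longer change
 * the file size. Returns an fd positioned at offset 0, or -1 on error.
 */
int
preallocate_log(const char *path, size_t nblocks)
{
    char    zeros[PREALLOC_BLOCK];
    size_t  i;
    int     fd;

    assert(path != NULL && nblocks > 0);
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    memset(zeros, 0, sizeof(zeros));
    for (i = 0; i < nblocks; i++)
    {
        if (write(fd, zeros, sizeof(zeros)) != (ssize_t) sizeof(zeros))
            goto fail;
    }
    if (fsync(fd) != 0 || lseek(fd, 0, SEEK_SET) < 0)
        goto fail;
    return fd;

fail:
    close(fd);
    unlink(path);
    return -1;
}
```

Whether this removes all metadata writes depends on the file system; the thread only claims it for Linux with fdatasync-style semantics.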
Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
Larry Rosenman [EMAIL PROTECTED] writes:
>> But, with shared libraries, are you really pulling in a "whole
>> thread-support library"?

> Yes, you are. On UnixWare, you need to add -Kthread, which CHANGES a LOT
> of primitives to go through threads wrappers and scheduling.

Right, it's not so much that we care about referencing another shlib,
it's that -lpthreads may cause you to get a whole new thread-aware
version of libc, with attendant overhead that we don't need or want.

			regards, tom lane
Re: [HACKERS] [Stephen C. Tweedie sct@redhat.com] Re: O_DSYNC flag for open
Doug McNaught [EMAIL PROTECTED] writes:
> So are we still thinking about preallocating log files as a
> performance hack?

We're not just thinking about it, we're doing it in current sources ...

			regards, tom lane
Re: [HACKERS] [Stephen C. Tweedie sct@redhat.com] Re: O_DSYNC flag for open
> So are we still thinking about preallocating log files as a
> performance hack? It does seem that using preallocated files along
> with O_DSYNC will eliminate pretty much all metadata writes under
> Linux in future...
>
> [NOT suggesting we try to add anything to 7.1, I'm eagerly awaiting RC1]

I am pretty sure that is done.

--
Bruce Momjian
AW: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
> >> definitely need before considering this is to replace the existing
> >> spinlock mechanism with something more efficient.
>
> > What sort of problems are you seeing with the spinlock code?
>
> It's great as long as you never block, but it sucks for making things

I like optimistic approaches :-)

> wait, because the wait interval will be some multiple of 10 msec rather
> than just the time till the lock comes free.

On the AIX platform usleep(3) is able to really sleep for microseconds
without busying the cpu when called for more than approx. 100 us (the
longer the interval, the less busy the cpu gets). Would this not be
ideal for spin_lock, or is usleep not very common? Linux says it is in
the BSD 4.3 standard.

postgres@s0188000zeu:/usr/postgres time ustest   # with 100 us
real    0m10.95s
user    0m0.40s
sys     0m0.74s

postgres@s0188000zeu:/usr/postgres time ustest   # with 10 us
real    0m18.62s
user    0m1.37s
sys     0m5.73s

Andreas

PS: sorry, off for the weekend now :-) Current looks good on AIX.

(attachment: ustest.c)
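The ustest.c attachment did not survive in this archive, so the following is only a guess at the kind of test that might produce timings like those above: call usleep() in a loop and measure the elapsed wall-clock time. The function name time_usleep_loop() and its parameters are hypothetical; nothing here is recovered from the original attachment.

```c
#define _DEFAULT_SOURCE         /* for usleep() on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Call usleep(interval_us) 'iterations' times and return the elapsed
 * wall-clock time in microseconds, or -1 if the clock cannot be read.
 */
long
time_usleep_loop(unsigned int interval_us, int iterations)
{
    struct timeval  start;
    struct timeval  end;
    int             i;

    assert(iterations > 0);
    if (gettimeofday(&start, NULL) != 0)
        return -1;
    for (i = 0; i < iterations; i++)
        (void) usleep(interval_us);
    if (gettimeofday(&end, NULL) != 0)
        return -1;
    return (long) (end.tv_sec - start.tv_sec) * 1000000L
        + (long) (end.tv_usec - start.tv_usec);
}
```

Comparing the result against iterations * interval_us shows how far the platform's real sleep granularity is from the requested one; on a system with 10 msec scheduling ticks the gap would be large for a 100 us request.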
Re: [HACKERS] Performance monitor signal handler
Jan Wieck [EMAIL PROTECTED] writes:
> Uh - not much time to spend if the statistics should at least
> be half accurate. And it would become worse in SMP systems.
> So that was a nifty idea, but I think it'd cause much more
> statistic losses than I assumed at first.

> Back to drawing board. Maybe a SYS-V message queue can serve?

That would be the same as a pipe: backends would block if the collector
stopped accepting data. I do like the "auto discard" aspect of this
UDP-socket approach.

I think Philip had the right idea: each backend should send totals,
not deltas, in its messages. Then, it doesn't matter (much) if the
collector loses some messages --- that just means that sometimes it
has a slightly out-of-date idea about how much work some backends have
done. It should be easy to design the software so that that just makes
a small, transient error in the currently displayed statistics.

			regards, tom lane
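The totals-versus-deltas argument can be made concrete with a small sketch. It assumes each message carries a backend's running totals and the collector simply keeps the latest message per backend, so a dropped message only delays the numbers instead of corrupting them. The StatsMsg and Collector types, the counter fields, and MAX_BACKENDS are all hypothetical; the actual message layout was not settled in this thread.

```c
#include <assert.h>
#include <string.h>

#define MAX_BACKENDS 64         /* hypothetical limit */

typedef struct
{
    int             backend_id;
    unsigned long   blocks_fetched; /* running total since backend start */
    unsigned long   cache_hits;     /* running total since backend start */
} StatsMsg;

typedef struct
{
    StatsMsg    latest[MAX_BACKENDS];   /* last message seen per backend */
} Collector;

void
collector_init(Collector *c)
{
    memset(c, 0, sizeof(*c));
}

/* Totals overwrite the previous state; nothing is accumulated. */
void
collector_receive(Collector *c, const StatsMsg *msg)
{
    assert(msg->backend_id >= 0 && msg->backend_id < MAX_BACKENDS);
    c->latest[msg->backend_id] = *msg;
}

unsigned long
collector_total_fetched(const Collector *c)
{
    unsigned long   sum = 0;
    int             i;

    for (i = 0; i < MAX_BACKENDS; i++)
        sum += c->latest[i].blocks_fetched;
    return sum;
}
```

If a backend sends totals of 10, 20 and 30 and the middle message is lost, the collector still ends up at 30. With deltas of 10 each, the same loss would leave it at 20 permanently.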
Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
Tom Lane [EMAIL PROTECTED] writes:

> Alfred Perlstein [EMAIL PROTECTED] writes:
> >> definitely need before considering this is to replace the existing
> >> spinlock mechanism with something more efficient.
>
> > What sort of problems are you seeing with the spinlock code?
>
> It's great as long as you never block, but it sucks for making things
> wait, because the wait interval will be some multiple of 10 msec rather
> than just the time till the lock comes free.

Plus, using select() for the timeout is putting you into the kernel
multiple times in a short period, and causing a reschedule every time,
which is a big lose. This was discussed in the linux-kernel thread
that was referred to a few days ago.

> We've speculated about using Posix semaphores instead, on platforms
> where those are available. I think Bruce was concerned about the
> possible overhead of pulling in a whole thread-support library just to
> get semaphores, however.

Are Posix semaphores faster by definition than SysV semaphores (which
are described as "slow" in the source comments)? I can't see how
they'd be much faster unless locking/unlocking an uncontended
semaphore avoids a system call, in which case you might run into the
same problems with userland backoff...

Just looked, and on Linux pthreads and POSIX semaphores are both
already in the C library. Unfortunately, the Linux C library doesn't
support the PROCESS_SHARED attribute for either pthreads mutexes or
POSIX semaphores. Grumble. What's the point then?

Just some ignorant ramblings, thanks for listening...

-Doug
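The spin-then-sleep behaviour being criticized here can be sketched as follows. This is not the PostgreSQL spinlock code: it substitutes a C11 atomic_flag for the platform-specific test-and-set instruction, and the type demo_lock_t, the functions demo_lock(), demo_trylock() and demo_unlock(), and both tuning constants are hypothetical. It only illustrates why a contended waiter ends up sleeping in select() for a whole timer tick even when the lock comes free sooner.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

#define SPINS_BEFORE_SLEEP 100      /* hypothetical tuning constant */
#define SLEEP_USEC         10000    /* 10 msec, the granularity discussed */

typedef atomic_flag demo_lock_t;    /* stand-in for a hardware TAS lock */

/* Returns 1 if the lock was acquired, 0 if it was already held. */
int
demo_trylock(demo_lock_t *lock)
{
    return !atomic_flag_test_and_set(lock);
}

void
demo_unlock(demo_lock_t *lock)
{
    atomic_flag_clear(lock);
}

/* Spin briefly, then back off by sleeping in select() with a timeout. */
void
demo_lock(demo_lock_t *lock)
{
    int spins = 0;

    assert(lock != NULL);
    while (!demo_trylock(lock))
    {
        if (++spins >= SPINS_BEFORE_SLEEP)
        {
            struct timeval  delay;

            delay.tv_sec = 0;
            delay.tv_usec = SLEEP_USEC;
            /* Each sleep is a system call and a reschedule. */
            (void) select(0, NULL, NULL, NULL, &delay);
            spins = 0;
        }
    }
}
```

A semaphore or mutex that the kernel can wake on release avoids the fixed sleep, which is the motivation for the alternatives being speculated about in this thread.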
AW: [HACKERS] Allowing WAL fsync to be done via O_SYNC
> For a log file on a busy system, this could improve throughput a lot--batch
> commit. You end up with fewer than one fsync() per transaction.

This is not the issue, since that is already implemented. The current
bunching method might have room for improvement, but there are currently
fewer fsync's than transactions when appropriate.

Andreas
Re: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
> Yes, you are. On UnixWare, you need to add -Kthread, which CHANGES a LOT
> of primitives to go through threads wrappers and scheduling.

This was my concern; the change that happens on startup and lib calls
when thread support comes in through a library.

--
Bruce Momjian
Re: [HACKERS] Performance monitor signal handler
At 17:10 15/03/01 -0800, Alfred Perlstein wrote:
>> Which is why the backends should not do anything other than maintain the
>> raw data.
>
> If there is atomic data that can cause inconsistency, then a
> dropped UDP packet will do the same.

The UDP packet (a COPY) can contain a consistent snapshot of the data.
If you have dependencies, you fit a consistent snapshot into a single
packet.

If we were going to go the shared memory way, then yes, as soon as we
start collecting dependent data we would need locking, but IOs, locking
stats, flushes, cache hits/misses are not really in this category.

But I prefer the UDP/Collector model anyway; it gives us greater
flexibility + the ability to keep stats past backend termination, and,
as you say, removes any possible locking requirements from the backends.

Philip Warner
Albatross Consulting Pty. Ltd. (A.B.N. 75 008 659 498)
Tel: (+61) 0500 83 82 81
Fax: (+61) 0500 83 82 82
Http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371
Re: [HACKERS] pgmonitor patch for query string
> > I don't understand the attraction of the UDP stuff. If we have the
> > stuff in shared memory, we can add a collector program that gathers info
> > from shared memory and allows others to access it, right?
>
> There are a couple of problems with shared memory. First you
> have to decide a size. That'll limit what you can put into it,
> and if you want to put things per table (#scans, #block-
> fetches, #cache-hits, ...), you might run out of mem either
> way with complicated, multi-thousand table schemas.
>
> And the above illustrates too that the data structs in the
> shmem wouldn't be just some simple arrays of counters. So we
> have to deal with locking for both readers and writers of
> the statistics.

[ Jan, previous email was not sent to list, my mistake.]

OK, I understand the problem with pre-defined size. That is why I was
looking for a way to dump the information out to a flat file somehow.

I think no matter how we deal with this, we will need some way to turn
on/off such reporting. We can write into shared memory with little
penalty, but network or filesystem output is not going to be near-zero
cost.

OK, how about a shared buffer area that gets written in a loop so a
separate collection program can grab the info if it wants it, and if
not, it just gets overwritten later. It can even be per-backend:

[ASCII diagram, garbled in the archive: a per-backend buffer with a
loop counter (5), a row of stat slots running from start to end, and a
current pointer that loops back to start]

--
Bruce Momjian
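The per-backend looping buffer proposed above can be sketched as a fixed-size ring with a wrap counter. This is only an illustration under assumptions: the StatEntry and StatRing types, the slot count, and the function names are hypothetical, and the synchronization a real shared-memory reader and writer would need is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 8            /* hypothetical number of stat slots */

typedef struct
{
    int     kind;               /* which counter this entry describes */
    long    value;
} StatEntry;

typedef struct
{
    unsigned long   loops;      /* how many times the writer has wrapped */
    size_t          current;    /* next slot the writer will fill */
    StatEntry       slots[RING_SLOTS];
} StatRing;

/* Writer side: store an entry, overwriting the oldest one when full. */
void
ring_put(StatRing *r, StatEntry e)
{
    assert(r != NULL && r->current < RING_SLOTS);
    r->slots[r->current] = e;
    if (++r->current == RING_SLOTS)
    {
        r->current = 0;
        r->loops++;
    }
}

/*
 * Reader side: how many entries were written since the reader last saw
 * (seen_loops, seen_current). A result above RING_SLOTS means some
 * entries were overwritten before the reader got to them.
 */
unsigned long
ring_written_since(const StatRing *r, unsigned long seen_loops,
                   size_t seen_current)
{
    return (r->loops - seen_loops) * RING_SLOTS + r->current - seen_current;
}
```

The writer never blocks and never checks for a reader, which matches the "if not, it just gets overwritten later" behaviour described above; the loop counter is what lets a collector notice that it fell behind.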
Re: [HACKERS] Performance monitor signal handler
Alfred Perlstein wrote: * Jan Wieck [EMAIL PROTECTED] [010316 08:08] wrote: Philip Warner wrote: But I prefer the UDP/Collector model anyway; it gives use greater flexibility + the ability to keep stats past backend termination, and,as you say, removes any possible locking requirements from the backends. OK, did some tests... The postmaster can create a SOCK_DGRAM socket at startup and bind(2) it to "127.0.0.1:0", what causes the kernel to assign a non-privileged port number that then can be read with getsockname(2). No other process can have a socket with the same port number for the lifetime of the postmaster. If the socket get's ready, it'll read one backend message from it with recvfrom(2). The fromaddr mustbe "127.0.0.1:xxx" where xxx is the port number the kernel assigned to the above socket. Yes, this is his own one, shared with postmaster and all backends. So both, the postmaster and the backends can use this one UDP socket, which the backends inherit on fork(2), to send messages to the collector. If such a UDP packet really came from a process other than the postmaster or a backend, well then the sysadmin has a more severe problem than manipulated DB runtime statistics :-) Doing this is a bad idea: a) it allows any program to start spamming localhost:randport with messages and screw with the postmaster. b) it may even allow remote people to mess with it, (see recent bugtraq articles about this) So it's possible for a UDP socket to recvfrom(2) and get packets with a fromaddr localhost:my_own_non_SO_REUSE_port that really came from somewhere else? If that's possible, the packets must be coming over the network. Oterwise it's the local superuser sending them, and in that case it's not worth any more discussion because root on your system has more powerful possibilities to muck around with your database. And if someone outside the local system is doing it, it's time for some filter rules, isn't it? You should use a unix domain socket (at least when possible). 
Unix domain UDP? Running a 500MHz P-III, 192MB, RedHat 6.1 Linux 2.2.17 here, I haven't lost a single message during the parallel regression test, if each backend sends one 1K sized message per query executed, and the collector simply sucks them out of the socket. Message losses start if the collector does a per message idle loop like this:

for (i=0,sum=0;i<25;i++,sum+=1);

Uh - not much time to spend if the statistics should at least be half accurate. And it would become worse in SMP systems. So that was a nifty idea, but I think it'd cause much more statistic losses than I assumed at first. Back to drawing board. Maybe a SYS-V message queue can serve?

I wouldn't say back to the drawing board, I would say two steps back. What about instead of sending deltas, you send totals? This would allow you to lose messages and still maintain accurate stats.

Similar problem as with shared memory - size. If a long running backend of a multithousand table database needs to send access stats per table - and had accessed them all up to now - it'll be a lot of wasted bandwidth.

You can also enable SIGIO on the socket, then have a signal handler buffer packets that arrive when not actively select()ing on the UDP socket. You can then use sigsetmask(2) to provide mutual exclusion with your SIGIO handler and general select()ing on the socket.

I had already thought of prioritizing the socket drain this way: there is a fairly big receive buffer. If the buffer is empty, it does a blocking select(2). If it's not, it does a non-blocking (0-timeout) one, and only if the non-blocking one tells that there aren't new messages waiting, it'll process one buffered message and try to receive again. Will give it a shot.

Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] #
---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
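[Editorial sketch] The socket setup described in the message above (bind to 127.0.0.1 with port 0, read the assigned port back with getsockname(2), then check the source address of each datagram) might look roughly like the following. The function names open_stats_socket and from_own_socket are made up for illustration and are not taken from any PostgreSQL source; error handling is minimal.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a UDP socket bound to 127.0.0.1 with a kernel-assigned port.
 * Returns the descriptor and stores the port (host byte order) in *port,
 * or returns -1 on error.  (Hypothetical helper, for illustration only.)
 */
int open_stats_socket(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd;

    assert(port != NULL);

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);       /* 0 = let the kernel pick a port */

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *) &addr, &len) < 0)
    {
        close(fd);
        return -1;
    }

    *port = ntohs(addr.sin_port);
    return fd;
}

/*
 * Return nonzero if a datagram's source address is 127.0.0.1:port,
 * i.e. it was sent through the one socket that owns that port.
 */
int from_own_socket(const struct sockaddr_in *from, unsigned short port)
{
    return from->sin_family == AF_INET &&
           from->sin_addr.s_addr == htonl(INADDR_LOOPBACK) &&
           ntohs(from->sin_port) == port;
}
```

A process that inherited the descriptor across fork(2) would sendto() the same 127.0.0.1:port address, and the reader would drop any datagram for which from_own_socket() returns 0. Whether that check is sufficient against spoofed packets is exactly the point under dispute in the thread.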
Re: [HACKERS] Re: [SQL] Re: why the DB file size does not reduce when 'delete'the data in DB?
On Fri, Mar 16, 2001 at 12:01:36AM +, Thomas Lockhart wrote: You are not quite factually correct above, even given your definition of "bug". PostgreSQL does reuse deleted record space, but requires an explicit maintenance step to do this. Could you tell us what that maintenance step is? dumping the db and restoring into a fresh one ? :/ :) No, "VACUUM" is your friend for this. Look in the reference manual for details. - Thomas

I'm having this problem: I have a database that is 3 megabytes in size (measured using pg_dump). When I go to the corresponding data directory (e.g. du -h data/base/mydbase), it seems the real disk usage is 135 megabytes! Doing a VACUUM doesn't really change the disk usage. Also, update queries get faster when I dump all data and restore it into a fresh new database. I'm running postgresql-7.0.2-6 on a Debian potato. -Yves
---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Performance monitor signal handler
Tom Lane wrote: Jan Wieck [EMAIL PROTECTED] writes: Uh - not much time to spend if the statistics should at least be half accurate. And it would become worse in SMP systems. So that was a nifty idea, but I think it'd cause much more statistic losses than I assumed at first. Back to drawing board. Maybe a SYS-V message queue can serve?

That would be the same as a pipe: backends would block if the collector stopped accepting data. I do like the "auto discard" aspect of this UDP-socket approach.

Does a pipe guarantee that a buffer, written with one atomic write(2), can never get intermixed with other data on the reader's end? I know that you know what I mean, but for the broader audience: Let's define a message to the collector to be 4byte-len,len-bytes. Now hundreds of backends hammer messages into the (shared) writing end of the pipe, all with different sizes. Is it GUARANTEED that a read(4bytes),read(nbytes) sequence will always return one complete message and never intermixed parts of different write(2)s? With message queues, this is guaranteed. Also, message queues would make it easy to query the collected statistics (see below).

I think Philip had the right idea: each backend should send totals, not deltas, in its messages. Then, it doesn't matter (much) if the collector loses some messages --- that just means that sometimes it has a slightly out-of-date idea about how much work some backends have done. It should be easy to design the software so that that just makes a small, transient error in the currently displayed statistics.

If we use two message queues (IPC_PRIVATE is enough here), one into collector and one into backend direction, this'd be an easy way to collect and query statistics. The backends send delta stats messages to the collector on one queue. Message queues block by default, but the backend could use IPC_NOWAIT and just go on and collect up, as long as it finally uses a blocking call before exiting. We'll lose statistics for backends that go down in flames (coredump), but who cares for statistics then?

To query statistics, we have a set of new builtin functions. All functions share a global statistics snapshot in the backend. If on function call the snapshot doesn't exist or was generated by another XACT/commandcounter, the backend sends a statistics request for its database ID to the collector and waits for the messages to arrive on the second message queue. It can pick up the messages meant for it via the message type, which is equal to its backend number +1, because the collector will send 'em as such. For table access stats, for example, the snapshot will have slots identified by the table's OID, so a function pg_get_tables_seqscan_count(oid) should be easy to implement. And setting up views that present access stats in readable format is a no-brainer.

Now we have communication only between the backends and the collector. And we're certain that only someone able to SELECT from a system view will ever see this information.

Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] #
---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
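[Editorial sketch] The "totals, not deltas" idea quoted in the message above can be illustrated with a toy collector. This is only a sketch under assumed names (StatsMsg, collector_receive, collector_total and MAX_BACKENDS are hypothetical); it tracks a single counter per backend and ignores backend exit and slot reuse.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BACKENDS 64                 /* hypothetical limit */

/* One statistics message: the backend's running TOTAL, not a delta. */
typedef struct StatsMsg
{
    int           backend_id;
    unsigned long seqscans;             /* total sequential scans so far */
} StatsMsg;

static unsigned long latest[MAX_BACKENDS];

/*
 * Record a message.  Because each message carries a running total, a
 * lost or reordered message only leaves the collector briefly out of
 * date; the next message that arrives repairs the value.
 */
void collector_receive(const StatsMsg *m)
{
    assert(m != NULL);
    if (m->backend_id < 0 || m->backend_id >= MAX_BACKENDS)
        return;
    if (m->seqscans > latest[m->backend_id])
        latest[m->backend_id] = m->seqscans;
}

/* Current best knowledge about one backend's total. */
unsigned long collector_total(int backend_id)
{
    assert(backend_id >= 0 && backend_id < MAX_BACKENDS);
    return latest[backend_id];
}
```

With deltas, a dropped message would be lost for good; here the collector's value is wrong only until the next message gets through. The cost, as noted elsewhere in the thread, is message size when a backend has touched many tables.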
Re: [HACKERS] Performance monitor signal handler
Jan Wieck [EMAIL PROTECTED] writes: Does a pipe guarantee that a buffer, written with one atomic write(2), never can get intermixed with other data on the readers end? Yes. The HPUX man page for write(2) sez: o Write requests of {PIPE_BUF} bytes or less will not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set. Stevens' _UNIX Network Programming_ (1990) states this is true for all pipes (nameless or named) on all flavors of Unix, and furthermore states that PIPE_BUF is at least 4K on all systems. I don't have any relevant Posix standards to look at, but I'm not worried about assuming this to be true. With message queues, this is guaranteed. Also, message queues would make it easy to query the collected statistics (see below). I will STRONGLY object to any proposal that we use message queues. We've already had enough problems with the ridiculously low kernel limits that are commonly imposed on shmem and SysV semaphores. We don't need to buy into that silliness yet again with message queues. I don't believe they gain us anything over pipes anyway. The real problem with either pipes or message queues is that backends will block if the collector stops collecting data. I don't think we want that. I suppose we could have the backends write a pipe with O_NONBLOCK and ignore failure, however: o If the O_NONBLOCK flag is set, write() requests will be handled differently, in the following ways: - The write() function will not block the process. - A write request for {PIPE_BUF} or fewer bytes will have the following effect: If there is sufficient space available in the pipe, write() will transfer all the data and return the number of bytes requested. Otherwise, write() will transfer no data and return -1 with errno set to EAGAIN. 
Since we already ignore SIGPIPE, we don't need to worry about losing the collector entirely. Now this would put a pretty tight time constraint on the collector: fall more than 4K behind, you start losing data. I am not sure if a UDP socket would provide more buffering or not; anyone know? regards, tom lane ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://www.postgresql.org/search.mpl
Re: [HACKERS] Performance monitor signal handler
Tom Lane wrote: Jan Wieck [EMAIL PROTECTED] writes: Does a pipe guarantee that a buffer, written with one atomic write(2), never can get intermixed with other data on the readers end? Yes. The HPUX man page for write(2) sez: o Write requests of {PIPE_BUF} bytes or less will not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set. Stevens' _UNIX Network Programming_ (1990) states this is true for all pipes (nameless or named) on all flavors of Unix, and furthermore states that PIPE_BUF is at least 4K on all systems. I don't have any relevant Posix standards to look at, but I'm not worried about assuming this to be true. That's good news - and maybe a Good Assumption (TM). With message queues, this is guaranteed. Also, message queues would make it easy to query the collected statistics (see below). I will STRONGLY object to any proposal that we use message queues. We've already had enough problems with the ridiculously low kernel limits that are commonly imposed on shmem and SysV semaphores. We don't need to buy into that silliness yet again with message queues. I don't believe they gain us anything over pipes anyway. OK. The real problem with either pipes or message queues is that backends will block if the collector stops collecting data. I don't think we want that. I suppose we could have the backends write a pipe with O_NONBLOCK and ignore failure, however: o If the O_NONBLOCK flag is set, write() requests will be handled differently, in the following ways: - The write() function will not block the process. - A write request for {PIPE_BUF} or fewer bytes will have the following effect: If there is sufficient space available in the pipe, write() will transfer all the data and return the number of bytes requested. 
Otherwise, write() will transfer no data and return -1 with errno set to EAGAIN. Since we already ignore SIGPIPE, we don't need to worry about losing the collector entirely. That's not what the manpage said. It said that in the case you're inside PIPE_BUF size and using O_NONBLOCK, you either send complete messages or nothing, getting an EAGAIN then. So we could do the same here and write to the pipe. In the case we cannot, just count up and try again next year (or so). Now this would put a pretty tight time constraint on the collector: fall more than 4K behind, you start losing data. I am not sure if a UDP socket would provide more buffering or not; anyone know? Again, this ain't what the manpage said. If there's sufficient space available in the pipe in combination with that PIPE_BUF is at least 4K doesn't necessarily mean that the pipes buffer space is 4K. Well, what I'm missing is the ability to filter out statistics reports on the backend side via msgrcv(2)s msgtype :-( Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # _ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Performance monitor signal handler
Tom Lane wrote: Now this would put a pretty tight time constraint on the collector: fall more than 4K behind, you start losing data. I am not sure if a UDP socket would provide more buffering or not; anyone know? Looks like Linux has something around 16-32K of buffer space for UDP sockets. Just from eyeballing the fprintf(3) output of my destructively hacked postleprechaun. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # _ Do You Yahoo!? Get your free @yahoo.com address at http://mail.yahoo.com ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
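[Editorial sketch] Rather than eyeballing debug output, the receive buffer size of a UDP socket can be queried directly with getsockopt(2) and SO_RCVBUF. The helper name udp_rcvbuf_size is made up for this example. Note that on Linux the reported value includes kernel bookkeeping overhead, so it is not exactly the number of payload bytes that can queue up.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Return the receive buffer size the kernel reports for a socket,
 * or -1 on error.
 */
int udp_rcvbuf_size(int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);

    assert(fd >= 0);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) < 0)
        return -1;
    return size;
}
```

Calling this on a freshly created SOCK_DGRAM socket gives the platform default; setsockopt(2) with the same option can request a larger buffer, subject to system limits.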
Re: [HACKERS] [Stephen C. Tweedie sct@redhat.com] Re: O_DSYNC flag for open
[ Drifting off topic ... ] Well, that guy might know all about Linux, but he doesn't know anything about HPUX (at least not any version I've ever run). O_SYNC is distinctly different from O_DSYNC around here. There is a HP_UX kernel flag 'o_sync_is_o_dsync' which will cause O_DSYNC to be treated as O_SYNC. It defaults to being off -- it is/was a backward compatibility "feature" since HP-UX 9.X (which is history now) had implemented O_SYNC as O_DSYNC. http://docs.hp.com/cgi-bin/otsearch/getfile?id=/hpux/onlinedocs/os/KCparam.OsyncIsOdsync.html Regards, Giles ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: AW: Re[4]: [HACKERS] Allowing WAL fsync to be done via O_SYNC
Zeugswetter Andreas SB [EMAIL PROTECTED] writes: It's great as long as you never block, but it sucks for making things wait, because the wait interval will be some multiple of 10 msec rather than just the time till the lock comes free.

On the AIX platform usleep(3) is able to really sleep microseconds without busying the CPU when called for more than approx. 100 us (the longer the interval, the less busy the CPU gets). Would this not be ideal for spin_lock, or is usleep not very common? Linux says it is in the BSD 4.3 standard.

HPUX has usleep, but the man page says: The usleep() function is included for its historical usage. The setitimer() function is preferred over this function. In any case, I would expect that all these functions offer accuracy no better than the scheduler's regular clock cycle (~ 100Hz) on most kernels. regards, tom lane
---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
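[Editorial sketch] The question of how precisely a short sleep is honoured can be checked empirically on a given platform. The sketch below uses nanosleep(2) and clock_gettime(2) instead of usleep(3); that substitution, and the name measure_sleep_usec, are choices made for this example and are not from the thread. Kernels with high-resolution timers may show far better granularity than the ~100Hz tick assumed above.

```c
#define _POSIX_C_SOURCE 199309L

#include <assert.h>
#include <time.h>

/*
 * Sleep for `usec` microseconds and return the time that actually
 * elapsed, in microseconds, measured with CLOCK_MONOTONIC.
 */
long measure_sleep_usec(long usec)
{
    struct timespec req, t0, t1;

    assert(usec >= 0 && usec < 1000000);

    req.tv_sec = 0;
    req.tv_nsec = usec * 1000L;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    nanosleep(&req, NULL);              /* may return early on a signal */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (long) (t1.tv_sec - t0.tv_sec) * 1000000L +
           (t1.tv_nsec - t0.tv_nsec) / 1000L;
}
```

Requesting, say, 100 microseconds and comparing against the returned value shows the effective granularity: a result near 10000 would indicate a 100Hz scheduler tick.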
[HACKERS] beta6 packaged ...
will do an announce later on tonight, to give the mirrors a chance to start syncing ... can others confirm that the packaging once more looks clean? thanks ... Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy Systems Administrator @ hub.org primary: [EMAIL PROTECTED] secondary: scrappy@{freebsd|postgresql}.org ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[HACKERS] Stuck spins in current
Got it at spin.c:156 with 50 clients doing inserts into 50 tables (int4, text[1-256 bytes]). -B 16384, -wal_buffers=256 (with default others wal params). Vadim ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Performance monitor signal handler
Jan Wieck wrote: Tom Lane wrote: Now this would put a pretty tight time constraint on the collector: fall more than 4K behind, you start losing data. I am not sure if a UDP socket would provide more buffering or not; anyone know? Looks like Linux has something around 16-32K of buffer space for UDP sockets. Just from eyeballing the fprintf(3) output of my destructively hacked postleprechaun. Just to get some evidence at hand - could some owners of different platforms compile and run the attached little C source please? (The program tests how much data can be stuffed into a pipe or a Sys-V message queue before the writer would block or get an EAGAIN error). My output on RedHat6.1 Linux 2.2.17 is: Pipe buffer is 4096 bytes Sys-V message queue buffer is 16384 bytes Seems Tom is (unfortunately) right. The pipe blocks at 4K. So a Sys-V message queue, with the ability to distribute messages from the collector to individual backends with kernel support via "mtype" is four times by unestimated complexity better here. What does your system say? I really never thought that Sys-V IPC is a good way to go at all. I hate it's incompatibility to the select(2) system call and all these OS/installation dependant restrictions. But I'm tempted to reevaluate it "for this case". Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. 
# #== [EMAIL PROTECTED] #

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* 512 bytes of message text; msgsnd(2) counts only mtext, not mtype */
typedef struct test_message
{
    long    mtype;
    char    mtext[512];
} test_message;

static int test_pipe(void);
static int test_msg(void);

int
main(int argc, char *argv[])
{
    if (test_pipe() < 0)
        return 1;
    if (test_msg() < 0)
        return 1;
    return 0;
}

/* Fill a non-blocking pipe in 512-byte writes until EAGAIN */
static int
test_pipe(void)
{
    int     p[2];
    char    buf[512] = {0};
    int     done;
    int     rc;

    if (pipe(p) < 0)
    {
        perror("pipe(2)");
        return -1;
    }
    if (fcntl(p[1], F_SETFL, O_NONBLOCK) < 0)
    {
        perror("fcntl(2)");
        return -1;
    }

    for (done = 0;;)
    {
        if ((rc = write(p[1], buf, sizeof(buf))) != sizeof(buf))
        {
            if (rc < 0)
            {
                if (errno == EAGAIN)
                {
                    printf("Pipe buffer is %d bytes\n", done);
                    return 0;
                }
                perror("write(2)");
                return -1;
            }
            fprintf(stderr, "whatever happened - rc = %d on write(2)\n", rc);
            return -1;
        }
        done += rc;
    }
}

/* Fill a private Sys-V message queue with IPC_NOWAIT until EAGAIN */
static int
test_msg(void)
{
    int             mq;
    test_message    msg = {0};
    int             done;

    if ((mq = msgget(IPC_PRIVATE, IPC_CREAT | 0600)) < 0)
    {
        perror("msgget(2)");
        return -1;
    }

    for (done = 0;;)
    {
        msg.mtype = 1;
        if (msgsnd(mq, &msg, sizeof(msg.mtext), IPC_NOWAIT) < 0)
        {
            int     save_errno = errno;

            /* remove the queue so it does not outlive the test */
            msgctl(mq, IPC_RMID, NULL);
            if (save_errno == EAGAIN)
            {
                printf("Sys-V message queue buffer is %d bytes\n", done);
                return 0;
            }
            errno = save_errno;
            perror("msgsnd(2)");
            return -1;
        }
        done += sizeof(msg.mtext);
    }
}

---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] [Stephen C. Tweedie sct@redhat.com] Re: O_DSYNC flag for open
There is a HP_UX kernel flag 'o_sync_is_o_dsync' which will cause O_DSYNC to be treated as O_SYNC. It defaults to being off -- it ... other way around there, of course. Trying to clarify and adding confusion instead. :-( is/was a backward compatibility "feature" since HP-UX 9.X (which is history now) had implemented O_SYNC as O_DSYNC. Muttering, Giles ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] beta6 packaged ...
The Hermit Hacker [EMAIL PROTECTED] writes: will do an announce later on tonight, to give the mirrors a chance to start syncing ... can others confirm that the packaging once more looks clean? The main tar.gz matches what I have here. Didn't look at the partial tarballs. regards, tom lane ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Stuck spins in current
"Mikheev, Vadim" [EMAIL PROTECTED] writes: Got it at spin.c:156 with 50 clients doing inserts into 50 tables (int4, text[1-256 bytes]). -B 16384, -wal_buffers=256 (with default others wal params). SpinAcquire() ... but on which lock? regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Stuck spins in current
"Mikheev, Vadim" [EMAIL PROTECTED] writes: Got it at spin.c:156 with 50 clients doing inserts into 50 tables (int4, text[1-256 bytes]). -B 16384, -wal_buffers=256 (with default others wal params). SpinAcquire() ... but on which lock? After a little bit of thought I'll bet it's ControlFileLockId. Likely we shouldn't be using a spinlock at all for that, but the short-term solution might be a longer timeout for this particular lock. Alternatively, could we avoid holding that lock while initializing a new log segment? regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Performance monitor signal handler
Jan Wieck [EMAIL PROTECTED] writes: Just to get some evidence at hand - could some owners of different platforms compile and run the attached little C source please? HPUX 10.20: Pipe buffer is 8192 bytes Sys-V message queue buffer is 16384 bytes regards, tom lane ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
RE: [HACKERS] Stuck spins in current
How to synchronize with checkpoint-er if wal_files > 0?

I was sort of visualizing assigning the created xlog files dynamically:

    create a temp file of a PID-dependent name
    fill it with zeroes and fsync it
    acquire ControlFileLockId
    rename temp file into place as next uncreated segment
    update pg_control
    release ControlFileLockId

Since the things are just filled with 0's, there's no need to know which segment it will be while you're filling it. This would leave you sometimes with more advance files than you really needed, but so what ...

Yes, it makes sense, but:

And you know - I've run same tests on ~ Mar 9 snapshot without any problems.

That was before I changed the code to pre-fill the file --- now it takes longer to init a log segment. And we're only using a plain SpinAcquire, not the flavor with a longer timeout.

xlog.c revision 1.55 from Feb 26 already had log file zero-filling, so ...

Vadim
---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
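[Editorial sketch] The proposal quoted above (fill a temp file outside the lock, then rename it into place under the lock) might be sketched like this. It is not the xlog.c code: prepare_segment, install_segment and the "xlogtemp" file name are invented for illustration, and the ControlFileLockId acquire/release and the pg_control update appear only as comments.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Slow part, done WITHOUT holding any lock: create a zero-filled file
 * of `size` bytes under a PID-dependent temporary name and fsync it.
 * The temp name is returned in tmpname.  Returns 0 on success, -1 on error.
 */
int prepare_segment(const char *dir, size_t size, char *tmpname, size_t tmplen)
{
    char    zeros[8192];
    size_t  left = size;
    int     fd, n;

    assert(dir != NULL && tmpname != NULL);

    n = snprintf(tmpname, tmplen, "%s/xlogtemp.%ld", dir, (long) getpid());
    if (n < 0 || (size_t) n >= tmplen)
        return -1;

    fd = open(tmpname, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    memset(zeros, 0, sizeof(zeros));
    while (left > 0)
    {
        size_t chunk = left < sizeof(zeros) ? left : sizeof(zeros);

        if (write(fd, zeros, chunk) != (ssize_t) chunk)
        {
            close(fd);
            unlink(tmpname);
            return -1;
        }
        left -= chunk;
    }

    if (fsync(fd) < 0)
    {
        close(fd);
        unlink(tmpname);
        return -1;
    }
    return close(fd);
}

/*
 * Fast part, to be called while holding the lock (ControlFileLockId in
 * the thread): move the prepared file into place as the next segment.
 * Updating pg_control would follow here before the lock is released.
 */
int install_segment(const char *tmpname, const char *finalname)
{
    return rename(tmpname, finalname);
}
```

The point of the split is that only the rename(2), which is quick, happens inside the locked section, so other backends spinning on the lock are not held up by the zero-fill and fsync.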
Re: [HACKERS] Stuck spins in current
"Mikheev, Vadim" [EMAIL PROTECTED] writes: Alternatively, could we avoid holding that lock while initializing a new log segment? How to synchronize with checkpoint-er if wal_files > 0?

I was sort of visualizing assigning the created xlog files dynamically:

    create a temp file of a PID-dependent name
    fill it with zeroes and fsync it
    acquire ControlFileLockId
    rename temp file into place as next uncreated segment
    update pg_control
    release ControlFileLockId

Since the things are just filled with 0's, there's no need to know which segment it will be while you're filling it. This would leave you sometimes with more advance files than you really needed, but so what ...

And you know - I've run same tests on ~ Mar 9 snapshot without any problems.

That was before I changed the code to pre-fill the file --- now it takes longer to init a log segment. And we're only using a plain SpinAcquire, not the flavor with a longer timeout. regards, tom lane
---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
RE: [HACKERS] Stuck spins in current
Got it at spin.c:156 with 50 clients doing inserts into 50 tables (int4, text[1-256 bytes]). -B 16384, -wal_buffers=256 (with default others wal params).

SpinAcquire() ... but on which lock? After a little bit of thought I'll bet it's ControlFileLockId.

I see "XLogWrite: new log file created..." in the postmaster's log - the backend writes this after releasing ControlFileLockId.

Likely we shouldn't be using a spinlock at all for that, but the short-term solution might be a longer timeout for this particular lock. Alternatively, could we avoid holding that lock while initializing a new log segment?

How to synchronize with checkpoint-er if wal_files > 0? And you know - I've run same tests on ~ Mar 9 snapshot without any problems.

Vadim
---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Performance monitor signal handler
Just to get some evidence at hand - could some owners of different platforms compile and run the attached little C source please?

$ uname -srm
FreeBSD 4.1.1-STABLE
$ ./jan
Pipe buffer is 16384 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.5 alpha
$ ./jan
Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.5_BETA2 i386
$ ./jan
Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.4.2 i386
$ ./jan
Pipe buffer is 4096 bytes
Sys-V message queue buffer is 2048 bytes

$ uname -srm
NetBSD 1.4.1 sparc
$ ./jan
Pipe buffer is 4096 bytes
Bad system call (core dumped)    # no SysV IPC in running kernel

$ uname -srm
HP-UX B.11.11 9000/800
$ ./jan
Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

$ uname -srm
HP-UX B.11.00 9000/813
$ ./jan
Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

$ uname -srm
HP-UX B.10.20 9000/871
$ ./jan
Pipe buffer is 8192 bytes
Sys-V message queue buffer is 16384 bytes

HP-UX can also use STREAMS based pipes if the kernel parameter streampipes is set. Using STREAMS based pipes increases the pipe buffer size by a lot:

# uname -srm
HP-UX B.11.11 9000/800
# ./jan
Pipe buffer is 131072 bytes
Sys-V message queue buffer is 16384 bytes

# uname -srm
HP-UX B.11.00 9000/800
# ./jan
Pipe buffer is 131072 bytes
Sys-V message queue buffer is 16384 bytes

Regards, Giles
---(end of broadcast)--- TIP 6: Have you searched our list archives? http://www.postgresql.org/search.mpl
Re: [HACKERS] Stuck spins in current
"Mikheev, Vadim" [EMAIL PROTECTED] writes: And you know - I've run same tests on ~ Mar 9 snapshot without any problems. That was before I changed the code to pre-fill the file --- now it takes longer to init a log segment. And we're only using a plain SpinAcquire, not the flavor with a longer timeout. xlog.c revision 1.55 from Feb 26 already had log file zero-filling, so ... Oh, you're right, I didn't study the CVS log carefully enough. Hmm, maybe the control file lock isn't the problem. The abort() in s_lock_stuck should have left a core file --- what is the backtrace? regards, tom lane ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://www.postgresql.org/search.mpl
RE: [HACKERS] Stuck spins in current
And you know - I've run same tests on ~ Mar 9 snapshot without any problems. That was before I changed the code to pre-fill the file --- now it takes longer to init a log segment. And we're only using a plain SpinAcquire, not the flavor with a longer timeout. xlog.c revision 1.55 from Feb 26 already had log file zero-filling, so ... Oh, you're right, I didn't study the CVS log carefully enough. Hmm, maybe the control file lock isn't the problem. The abort() in s_lock_stuck should have left a core file --- what is the backtrace? After 10 times increasing DEFAULT_TIMEOUT in s_lock.c I got abort in xlog.c:626 - waiting for insert_lck. But problem is near new log file creation code: system goes sleep just after new one is created. Vadim ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
[HACKERS] pg_upgrade
Since pg_upgrade will not work for 7.1, should its installation be prevented and the man page be disabled? -- Peter Eisentraut [EMAIL PROTECTED] http://yi.org/peter-e/ ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
[HACKERS] transaction timeout
Is there a timeout setting I can use to abort transactions that aren't deadlocked, but which have been blocked waiting for locks greater than some amount of time? I didn't see anything in the docs on this and observed with 2 instances of psql that a transaction waiting on a lock seems to wait forever. If pgsql doesn't have such a setting, has there been any discussion about adding it? Regards, Kevin Manley ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] pg_upgrade
Since pg_upgrade will not work for 7.1, should its installation be prevented and the man page be disabled? Probably. I am not sure it will ever be used again now that we have numeric file names. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Performance monitor signal handler
* Jan Wieck [EMAIL PROTECTED] [010316 16:35]: Jan Wieck wrote: Tom Lane wrote: Now this would put a pretty tight time constraint on the collector: fall more than 4K behind, you start losing data. I am not sure if a UDP socket would provide more buffering or not; anyone know? Looks like Linux has something around 16-32K of buffer space for UDP sockets. Just from eyeballing the fprintf(3) output of my destructively hacked postleprechaun. Just to get some evidence at hand - could some owners of different platforms compile and run the attached little C source please? (The program tests how much data can be stuffed into a pipe or a Sys-V message queue before the writer would block or get an EAGAIN error). My output on RedHat6.1 Linux 2.2.17 is: Pipe buffer is 4096 bytes Sys-V message queue buffer is 16384 bytes Seems Tom is (unfortunately) right. The pipe blocks at 4K. So a Sys-V message queue, with the ability to distribute messages from the collector to individual backends with kernel support via "mtype" is four times by unestimated complexity better here. What does your system say? I really never thought that Sys-V IPC is a good way to go at all. I hate it's incompatibility to the select(2) system call and all these OS/installation dependant restrictions. But I'm tempted to reevaluate it "for this case". Jan $ ./queuetest Pipe buffer is 32768 bytes Sys-V message queue buffer is 4096 bytes $ uname -a UnixWare lerami 5 7.1.1 i386 x86at SCO UNIX_SVR5 $ I think some of these are configurable... LER -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED] US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749 ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Performance monitor signal handler
* Larry Rosenman [EMAIL PROTECTED] [010316 20:47]: * Jan Wieck [EMAIL PROTECTED] [010316 16:35]: $ ./queuetest Pipe buffer is 32768 bytes Sys-V message queue buffer is 4096 bytes $ uname -a UnixWare lerami 5 7.1.1 i386 x86at SCO UNIX_SVR5 $ I think some of these are configurable... They both are. FIFOBLKSIZE and MSGMNB or some such kernel tunable. I can get more info if you need it. LER -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 972-414-9812 E-Mail: [EMAIL PROTECTED] US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749 ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] pg_upgrade
Since pg_upgrade will not work for 7.1, should its installation be prevented and the man page be disabled? Probably. I am not sure it will ever be used again now that we have numeric file names. Perhaps we should leave it for 7.1 because people will complain when they can not find it. Maybe we can mention this may go away in the next release. -- Bruce Momjian| http://candle.pha.pa.us [EMAIL PROTECTED] | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup.| Drexel Hill, Pennsylvania 19026 ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] problems with startup script on upgrade
"Martin A. Marques" [EMAIL PROTECTED] writes: Please define "doesn't work". What happens exactly? What messages are produced? root@ultra31 /space/pruebas/postgres-cvs # su postgres -c '/dbs/postgres/bin/pg_ctl -o "-i" -D /dbs/postgres/data/ start -l /dbs/postgres/sql.log' 19054 Killed postmaster successfully started root@ultra31 /space/pruebas/postgres-cvs # Hm, that 'Killed' looks suspicious. What shows up in the /dbs/postgres/sql.log file? regards, tom lane ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] problems with startup script on upgrade
"Martin A. Marques" [EMAIL PROTECTED] writes: Hm, that 'Killed' looks suspicious. What shows up in the /dbs/postgres/sql.log file? Nothing at all.

That's no help :-(. Please alter the command to trace the shell script, i.e.

    su postgres -c 'sh -x /dbs/postgres/bin/pg_ctl -o ... 2>tracefile'

and send the tracefile. regards, tom lane
---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] Re: [GENERAL] Problems with outer joins in 7.1beta5
Barry Lind [EMAIL PROTECTED] writes: What I would expect the syntax to be is: table as alias (columna as aliasa, columnb as aliasb,...) This will allow the query to work regardless of what the table column order is. Generally the SQL spec has tried not to tie query behaviour to the table column order.

Unfortunately, the spec authors seem to have forgotten that basic design rule when they wrote the aliasing syntax. Column alias lists are position-sensitive:

    <table reference> ::=
          <table name> [ [ AS ] <correlation name>
              [ <left paren> <derived column list> <right paren> ] ]
        | <derived table> [ AS ] <correlation name>
              [ <left paren> <derived column list> <right paren> ]
        | <joined table>

    <derived column list> ::= <column name list>

    <column name list> ::=
          <column name> [ { <comma> <column name> }... ]

SQL99 seems to be no better. Sorry. regards, tom lane
---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
[GENERAL] Problems with outer joins in 7.1beta5
My problem is that my two outer joined tables have columns that have the same names. Therefore, when my select list tries to reference the columns, they are ambiguously defined. Looking at the doc, I see the way to deal with this is by using the following syntax:

table as alias (column1alias, column2alias,...)

So we can alias the conflicting column names to resolve the problem. However, the problem with this is that the column aliases are positional per the table structure. Thus column1alias applies to the first column in the table. Code that relies on the order of columns in a table is very brittle. As adding a column always places it at the end of the table, it is very easy for a newly installed site to have one column order (the order the create table command creates them in) and for a site upgrading from an older version (where the upgrade simply adds the new columns) to have a different one.

My feeling is that postgres has misinterpreted the SQL92 spec in this regard. But I am having problems finding an online copy of the SQL92 spec so that I can verify. What I would expect the syntax to be is:

table as alias (columna as aliasa, columnb as aliasb,...)

This will allow the query to work regardless of what the table column order is. Generally the SQL spec has tried not to tie query behaviour to the table column order.

I will fix my code so that it works given how postgres currently supports the column aliases. Can anyone point me to a copy of the SQL92 spec so that I can research this more?

thanks,
--Barry
[HACKERS] Re: problems with startup script on upgrade
Ah, but is the LD_LIBRARY_PATH the same inside that su? A change of environment might explain why this works "by hand" and not through su ...

This #$^%^*$% Solaris!! Check this out, and tell me I shouldn't yell out at SUN:

root@ultra31 / # su - postgres -c 'echo $PATH'
/usr/bin:
root@ultra31 / # su - postgres
postgres@ultra31:~ echo $PATH
/usr/local/bin:/usr/local/gcc/bin:/usr/local/php/bin:/opt/sfw/bin:/usr/local/a2p/bin:/usr/local/sql/bin:/usr/ccs/bin:/bin:/usr/bin/X11:/usr/bin:/usr/ucb:/dbs/postgres/bin:
postgres@ultra31:~ logout
root@ultra31 / #

Can someone explain to me why Solaris is doing that, and why it started doing it after an upgrade? I have no words.

It may be that this is the first build of PostgreSQL which asks for "libz.so", but that is just a guess.

Not sure about "after the upgrade", but I'll bet that the first (command line) case does not have an attached terminal, while the second case, where you actually connect to the session, does. Does your .profile try doing some "terminal stuff"? Try adding echo statements to your .profile to verify that it starts, and that it runs to completion...

Also, PATH is not relevant for finding libz.so, so you need to figure out what (if anything) is happening to LD_LIBRARY_PATH.

- Thomas
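One way to follow up on the advice above is to capture the same variable in both contexts and see which entries go missing. The sketch below is only an illustration: `missing_entries` is a hypothetical helper, not part of any PostgreSQL or Solaris tooling, and the sample values are shortened from the transcript. The same comparison would apply to LD_LIBRARY_PATH, for instance with values collected via `su - postgres -c 'echo $LD_LIBRARY_PATH'` and from an interactive login.

```shell
# Hypothetical helper: print each entry of the colon-separated list $1
# that does not appear in the colon-separated list $2.
missing_entries() {
    old_ifs=$IFS
    IFS=:
    for entry in $1; do
        case ":$2:" in
            *":$entry:"*)
                ;;                      # entry exists in both lists
            *)
                if [ -n "$entry" ]; then
                    printf '%s\n' "$entry"
                fi
                ;;
        esac
    done
    IFS=$old_ifs
}

# Sample values, shortened from the transcript above.
interactive="/usr/local/bin:/usr/bin:/dbs/postgres/bin"
noninteractive="/usr/bin"

# Lists what the "su - postgres -c" environment lacks.
missing_entries "$interactive" "$noninteractive"
```

If the list printed for LD_LIBRARY_PATH includes the directory holding libz.so, that would be consistent with the postmaster failing only when started through su.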