Re: [HACKERS] sync_file_range()
ITAGAKI Takahiro [EMAIL PROTECTED] wrote: I'm interested in it, with which we could improve responsiveness during checkpoints. Though it is a Linux-specific system call, we could use the combination of mmap() and msync() instead; I mean we could use mmap only to flush dirty pages, not to read or write pages. Can you give more details? As the TODO item indicates, if we mmap a data file, a serious problem is that we don't know when the data pages hit the disk -- so we may violate the WAL rule. Regards, Qingqing ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] MultiXacts WAL
I would like to see some checking of this, though. Currently I'm doing testing of PostgreSQL under very large numbers of connections (2000+) and am finding that there's a huge volume of xlog output ... far more than comparable RDBMSes. So I think we are logging stuff we don't really have to. I think you really have to lengthen the checkpoint interval to reduce WAL overhead (20 min or so). Also, IMHO you cannot compare only the log size/activity, since other DBs write part of what PG writes to WAL to other areas (physical log, rollback segment, ...). If we cannot afford lengthening the checkpoint interval because of too heavy checkpoint load, we need to find ways to tune the bgwriter, and not reduce the checkpoint interval. Andreas
Re: [HACKERS] regresssion script hole
Michael Fuhr wrote: On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote: Maybe I'm misreading the packet, but I think the query is for ''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc) Correction: ''.kaltenbrunner.cc Yes, that is exactly the issue -- the postmaster tries to resolve ''.kaltenbrunner.cc multiple times during startup and gets SERVFAIL as a response from the upstream resolver. Stefan
Re: [HACKERS] Rethinking stats communication mechanisms
Great minds think alike ;-) ... I just committed exactly that protocol. I believe it is correct, because AFAICS there are only four possible risk cases: Congrats! For general culture you might be interested in reading these: http://en.wikipedia.org/wiki/Software_transactional_memory http://libcmt.sourceforge.net/
Re: [HACKERS] regresssion script hole
Andrew Dunstan wrote: Tom Lane wrote: Anyway, the tail end of the trace shows it repeatedly sending off a UDP packet and getting practically the same data back: I'm not too up on what the DNS protocol looks like on-the-wire, but I'll bet this is it. I think it's trying to look up kaltenbrunner.cc and failing. Why are we actually looking up anything? Just so we can bind to a listening socket? Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to direct it to use files first, something like lookup file bind Stefan, can you look into that? It would be a bit ugly if it's calling DNS (and failing) to resolve localhost. No -- resolving localhost works fine (both using /etc/hosts and through the DNS resolver), and I in fact verified that when we initially started to investigate this issue a while ago :-) Stefan
Re: [HACKERS] sync_file_range()
Qingqing Zhou [EMAIL PROTECTED] wrote: I'm interested in it, with which we could improve responsiveness during checkpoints. Though it is a Linux-specific system call, we could use the combination of mmap() and msync() instead; I mean we could use mmap only to flush dirty pages, not to read or write pages. Can you give more details? As the TODO item indicates, if we mmap a data file, a serious problem is that we don't know when the data pages hit the disk -- so we may violate the WAL rule. I'm thinking about fuzzy checkpoints, where we write and flush buffers as needed. Then sync_file_range() helps us control flushing of buffers at a better granularity. We can stretch a checkpoint's length to avoid storage overload in a burst, using sync_file_range() and cost-based delay, like vacuum. I did not want to modify buffers by mmap, just to suggest the following pseudo-code. (I don't know whether it works in fact...)

    my_sync_file_range(fd, offset, nbytes, ...)
    {
        void *p = mmap(NULL, nbytes, ..., fd, offset);
        msync(p, nbytes, MS_ASYNC);
        munmap(p, nbytes);
    }

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Re: [HACKERS] sync_file_range()
On Mon, 2006-06-19 at 15:32 +0800, Qingqing Zhou wrote: ITAGAKI Takahiro [EMAIL PROTECTED] wrote: I'm interested in it, with which we could improve responsiveness during checkpoints. Though it is a Linux-specific system call, we could use the combination of mmap() and msync() instead; I mean we could use mmap only to flush dirty pages, not to read or write pages. Can you give more details? As the TODO item indicates, if we mmap a data file, a serious problem is that we don't know when the data pages hit the disk -- so we may violate the WAL rule. Can't see where we'd use it. We fsync the xlog at transaction commit, so only the leading edge needs to be synced -- would the call help there? Presumably the OS can already locate all blocks associated with a particular file fairly quickly without doing a full cache scan. Other files are fsynced at checkpoint -- always all dirty blocks in the whole file. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] regresssion script hole
Stefan Kaltenbrunner wrote: Andrew Dunstan wrote: Why are we actually looking up anything? Just so we can bind to a listening socket? Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to direct it to use files first, something like lookup file bind Stefan, can you look into that? It would be a bit ugly if it's calling DNS (and failing) to resolve localhost. No -- resolving localhost works fine (both using /etc/hosts and through the DNS resolver), and I in fact verified that when we initially started to investigate this issue a while ago :-) Why are we looking up 'kaltenbrunner.cc' at all then? In any case, can we just try with that resolver line? The question isn't whether it succeeds, it's how long it takes to succeed. When I increased the pg_regress timeout it actually went through the whole regression test happily. I suspect we have two things eating up the 60s timeout here: loading the timezone db and resolving whatever it is we are trying to resolve. cheers andrew
Re: R: [HACKERS] Per-server univocal identifier
Giampaolo, On Sun, Jun 18, 2006 at 01:26:21AM +0200, Giampaolo Tomassoni wrote: Or... Can I put a custom variable in pgsql.conf? Like this, you mean?

    custom_variable_classes = 'identify'    # list of custom variable class names
    identify.id = 42

    template1=# show identify.id;
     identify.id
    -------------
     42

However, pg_settings does not contain variable classes, so it can be difficult to actually use this value. I wonder if this is a bug or a feature? Joachim
Re: [HACKERS] regresssion script hole
Andrew Dunstan [EMAIL PROTECTED] writes: The question isn't whether it succeeds, it's how long it takes to succeed. When I increased the pg_regress timeout it actually went through the whole regression test happily. I suspect we have two things eating up the 60s timeout here: loading the timezone db and resolving whatever it is we are trying to resolve. The behavior of loading the whole TZ database was there for a while before anyone noticed; I believe it could only be responsible for a few seconds. So the failed DNS responses must be the problem. Could we get a ktrace with timestamps on the syscalls to confirm that? Of course the $64 question is *why* is 8.0 trying to resolve that name, particularly seeing that the later branches apparently aren't. regards, tom lane
Re: [HACKERS] regresssion script hole
On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote: Of course the $64 question is *why* is 8.0 trying to resolve that name, particularly seeing that the later branches apparently aren't. The formatting of the message suggests it is a gethostbyname('') doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the configuration files? I wonder if it'd be worth adding some conditional code around gethostbyname() calls that warns if the call took longer than, say, 10 seconds. By printing out that and the string it's looking up you could save a lot of time confirming whether the delay is there or elsewhere... Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate.
Re: [HACKERS] Rethinking stats communication mechanisms
Might it not be a win to also store per-backend global values in the shared memory segment? Things like time of last command, number of transactions executed in this backend, backend start time and other values that are fixed-size? I'm including backend start time, command start time, etc. under the heading of current status, which'll be in the shared memory. However, I don't believe in trying to count events (like transaction commits) that way. If we do then we risk losing events whenever a backend quits and is replaced by another. Well, in many cases that's not a problem. It might, for example, be interesting to know that a backend has run nnn transactions before ending up in the state where it is now (say, idle in transaction and idle for a long time). The part about this being transient data that can go away along with a backend quit would still hold true. What were your thoughts about storing bgwriter and archiver statistics that way? Good or bad idea? I haven't yet looked through the stats in detail, but this approach basically presumes that we are only going to count events per-table and per-database --- I am thinking that the background stats collector process won't even keep track of individual backends anymore. (So, we'll fix the old problem of loss of backend-exit messages resulting in bogus displays.) Right. As I see you have now implemented ;-) /Magnus
Re: [HACKERS] regresssion script hole
Martijn van Oosterhout wrote: On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote: Of course the $64 question is *why* is 8.0 trying to resolve that name, particularly seeing that the later branches apparently aren't. The formatting of the message suggests it is a gethostbyname('') doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the configuration files? I tcpdump'd the DNS traffic on that box during a postmaster startup and it's definitely trying to look up ''.kaltenbrunner.cc a lot of times. And from what it looks like, it might be getting rate limited somehow by my ISP's recursive resolvers after doing the same query dozens of times and getting a SERVFAIL every time. At least the timestamps seem to indicate that the responses are getting delayed up to 10 seconds after a number of queries ... It might be a complete shot in the dark, but spoonbill worked fine on REL_8_0_STABLE until I disabled reporting three months ago. During this time the large escaping security fix/standard_strings patch went in -- could this be related in any way? Stefan
R: R: [HACKERS] Per-server univocal identifier
Giampaolo, On Sun, Jun 18, 2006 at 01:26:21AM +0200, Giampaolo Tomassoni wrote: Or... Can I put a custom variable in pgsql.conf? Like this, you mean?

    custom_variable_classes = 'identify'    # list of custom variable class names
    identify.id = 42

    template1=# show identify.id;
     identify.id
    -------------
     42

However, pg_settings does not contain variable classes, so it can be difficult to actually use this value. I wonder if this is a bug or a feature? Yes, that would be fine. It doesn't work for me, anyway. I guess the problem is that the setting has to be associated with a postgres module, which has to be responsible for the proper handling of the setting itself. Without an associated module, the setting is not available under the postgres environment. Joachim
Re: [HACKERS] regresssion script hole
Stefan Kaltenbrunner [EMAIL PROTECTED] writes: I tcpdump'd the DNS traffic on that box during a postmaster startup and it's definitely trying to look up ''.kaltenbrunner.cc a lot of times. I just strace'd a postmaster start on a Fedora box and can see nothing corresponding. Since this is a make check, we know that the PG configuration files it's using are stock ... so it must be something about the system config that's sending it round the bend. What do you have in /etc/hosts, /etc/resolv.conf, /etc/nsswitch.conf? regards, tom lane
Re: [HACKERS] regresssion script hole
Oh, I think I see the problem:

8.0 pg_regress:

    if [ $unix_sockets = no ]; then
        postmaster_options=$postmaster_options -c listen_addresses=$hostname
    else
        postmaster_options=$postmaster_options -c listen_addresses=''
    fi

8.1 pg_regress:

    if [ $unix_sockets = no ]; then
        postmaster_options=$postmaster_options -c listen_addresses=$hostname
    else
        postmaster_options=$postmaster_options -c listen_addresses=
    fi

regards, tom lane
Re: [HACKERS] regresssion script hole
Stefan Kaltenbrunner wrote: Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: The question isn't whether it succeeds, it's how long it takes to succeed. When I increased the pg_regress timeout it actually went through the whole regression test happily. I suspect we have two things eating up the 60s timeout here: loading the timezone db and resolving whatever it is we are trying to resolve. The behavior of loading the whole TZ database was there for a while before anyone noticed; I believe it could only be responsible for a few seconds. So the failed DNS responses must be the problem. Could we get a ktrace with timestamps on the syscalls to confirm that? Of course the $64 question is *why* is 8.0 trying to resolve that name, particularly seeing that the later branches apparently aren't. hmm, maybe the later branches are trying to resolve that too -- but only the combination of the TZ database loading + the failed DNS queries is pushing the startup time over the 60-second limit on this (quite slow) box? I will try to verify what the later branches are doing exactly ... Yes, we're on the margin here. The successful runs I saw got through the timeout in 5 or 10 seconds over the 60 that we currently allow. What interests me is where it even gets the string 'kaltenbrunner.cc' from. It looks to me like the most likely place is the search line in /etc/resolv.conf. It would be nice to know exactly what it is trying to resolve. cheers andrew
Re: [HACKERS] regresssion script hole
I wrote:

    8.0 pg_regress: postmaster_options=$postmaster_options -c listen_addresses=''
    8.1 pg_regress: postmaster_options=$postmaster_options -c listen_addresses=

and in fact here's the commit that changed that:

    2005-06-19 22:26  tgl
        * src/test/regress/pg_regress.sh: Change shell syntax that seems
        not to work right on FreeBSD 6-CURRENT buildfarm machines.

So apparently it's some generic disease of the BSD shell. I should have back-patched at the time but did not. Will take care of it. On the timezone search business, it's still the case that HEAD will search through all the timezones if it's not given an explicit setting (e.g. an explicit environment TZ value). We could suppress that by having pg_regress set TZ, but then the regression tests wouldn't exercise the search code at all, which is probably not a great idea. regards, tom lane
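The quoting hazard can be illustrated in any POSIX shell: quote characters stored inside a variable's value are data, not syntax, so when the 8.0 form is later expanded unquoted the server can end up seeing a literal two-character host name ''. A minimal sketch (variable name borrowed from pg_regress; this only demonstrates the general mechanism, not the exact BSD-shell behavior Tom hit):

```shell
# Store the 8.0-style option string.  The two single quotes are now
# ordinary characters inside the variable's value.
postmaster_options="-c listen_addresses=''"

# Expand it unquoted, the way shell scripts often build command lines.
# Word splitting happens, but the embedded quotes are NOT re-parsed:
set -- $postmaster_options
echo "second argument: $2"
```

Running this prints `second argument: listen_addresses=''` -- i.e. the literal quotes survive, which is exactly the kind of value that would make the postmaster try to resolve the host '' (two quote characters).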
Re: [HACKERS] regresssion script hole
Tom Lane wrote: Oh, I think I see the problem:

    8.0 pg_regress:

    if [ $unix_sockets = no ]; then
        postmaster_options=$postmaster_options -c listen_addresses=$hostname
    else
        postmaster_options=$postmaster_options -c listen_addresses=''
    fi

    8.1 pg_regress:

    if [ $unix_sockets = no ]; then
        postmaster_options=$postmaster_options -c listen_addresses=$hostname
    else
        postmaster_options=$postmaster_options -c listen_addresses=
    fi

Good catch! I'm impressed! This is surely the heart of the problem. That change (from rev 1.56) clearly needs to be backported to 8.0. cheers andrew
Re: [HACKERS] Rethinking stats communication mechanisms
* reader's read starts before and ends after writer's update: reader will certainly note a change in update counter.
* reader's read starts before and ends within writer's update: reader will note a change in update counter.
* reader's read starts within and ends after writer's update: reader will note a change in update counter.
* reader's read starts within and ends within writer's update: reader will see update counter as odd.

Am I missing anything? The only remaining concern would be the possibility of the reader thrashing because the writer is updating so often that the reader never gets the same counter twice. IIRC, the reader was only sampling, not trying to catch every entry, so that will help. But is it enough? Regards, Paul Bort
R: R: R: [HACKERS] Per-server univocal identifier
...omissis... yes, it's for contrib modules. but you can access it via SHOW so maybe it makes sense to include it in pg_settings as well. Not for now but for the future maybe... I agree: it could be a useful feature. giampaolo Joachim -- Joachim Wieland [EMAIL PROTECTED] C/ Usandizaga 12 1°B ICQ: 37225940 20002 Donostia / San Sebastian (Spain) GPG key available
Re: [HACKERS] regresssion script hole
Tom Lane wrote: Andrew Dunstan [EMAIL PROTECTED] writes: The question isn't whether it succeeds, it's how long it takes to succeed. When I increased the pg_regress timeout it actually went through the whole regression test happily. I suspect we have two things eating up the 60s timeout here: loading the timezone db and resolving whatever it is we are trying to resolve. The behavior of loading the whole TZ database was there for a while before anyone noticed; I believe it could only be responsible for a few seconds. So the failed DNS responses must be the problem. Could we get a ktrace with timestamps on the syscalls to confirm that? Of course the $64 question is *why* is 8.0 trying to resolve that name, particularly seeing that the later branches apparently aren't. hmm, maybe the later branches are trying to resolve that too -- but only the combination of the TZ database loading + the failed DNS queries is pushing the startup time over the 60-second limit on this (quite slow) box? I will try to verify what the later branches are doing exactly ... Stefan
[HACKERS] pl/tcl again.
Hi all, I'm still fighting with the pltcl test that doesn't return the error message when elog ERROR is called. I've played with pltcl.c's pltcl_error and removed the calls to PG_TRY, PG_CATCH and PG_END_TRY to prove that elog itself had a problem... How can I check what happens in elog? Each time elog is called with a level of ERROR or FATAL, PG_CATCH runs. Also, here are the server logs for the pltcl checks. It's amazing that the actual error message is in CONTEXT...

    LOG: database system was shut down at 2006-06-19 16:48:47 MET DST
    LOG: checkpoint record is at 0/22C6CE0
    LOG: redo record is at 0/22C6CE0; undo record is at 0/0; shutdown TRUE
    LOG: next transaction ID: 9130; next OID: 38895
    LOG: next MultiXactId: 1; next MultiXactOffset: 0
    LOG: database system is ready
    LOG: transaction ID wrap limit is 1073745208, limited by database regression
    LOG: transaction ID wrap limit is 1073745208, limited by database regression
    LOG: transaction ID wrap limit is 1073745208, limited by database regression
    ERROR: role regressgroup1 does not exist

    ERROR:
    CONTEXT: duplicate key '1', 'KEY1-3' for T_pkey2
        while executing
    elog ERROR duplicate key '$NEW(key1)', '$NEW(key2)' for T_pkey2
        invoked from within
    if {$n > 0} { elog ERROR \ duplicate key '$NEW(key1)', '$NEW(key2)' for T_pkey2 }
        (procedure __PLTcl_proc_38909_trigger_38900 line 32)
        invoked from within
    __PLTcl_proc_38909_trigger_38900 pkey2_before 38900 {{} key1 key2 txt} BEFORE ROW INSERT {key1 1 key2 {KEY1-3 } txt {should fail ...

    ERROR:
    CONTEXT: key for t_dta1 not in t_pkey1
        while executing
    elog ERROR key for $GD($planrel) not in $keyrel
        (procedure __PLTcl_proc_38913_trigger_38902 line 92)
        invoked from within
    __PLTcl_proc_38913_trigger_38902 dta1_before 38902 {{} tkey ref1 ref2} BEFORE ROW INSERT {tkey {trec 4} ref1 1 ref2 {key1-4 }} {} ref...

    ERROR:
    CONTEXT: key for t_dta2 not in t_pkey2
        while executing
    elog ERROR key for $GD($planrel) not in $keyrel
        (procedure __PLTcl_proc_38913_trigger_38904 line 92)
        invoked from within
    __PLTcl_proc_38913_trigger_38904 dta2_before 38904 {{} tkey ref1 ref2} BEFORE ROW INSERT {tkey {trec 4} ref1 1 ref2 {KEY1-4 }} {} ref...

    ERROR:
    CONTEXT: key '1', 'key1-1 ' referenced by T_dta1
        while executing
    elog ERROR key '$OLD(key1)', '$OLD(key2)' referenced by T_dta1
        invoked from within
    if {$check_old_ref} { # Check for references to OLD set n [spi_execp -count 1 $GD(plan_dta1) [list $OLD(key1) $OLD(key2)]] if {$n > 0}...
        (procedure __PLTcl_proc_38907_trigger_38898 line 79)
        invoked from within
    __PLTcl_proc_38907_trigger_38898 pkey1_before 38898 {{} key1 key2 txt} BEFORE ROW UPDATE {key1 1 key2 {key1-9 } txt {test key ...

    ERROR:
    CONTEXT: key '1', 'key1-2 ' referenced by T_dta1
        while executing
    elog ERROR key '$OLD(key1)', '$OLD(key2)' referenced by T_dta1
        invoked from within
    if {$check_old_ref} { # Check for references to OLD set n [spi_execp -count 1 $GD(plan_dta1) [list $OLD(key1) $OLD(key2)]] if {$n > 0}...
        (procedure __PLTcl_proc_38907_trigger_38898 line 79)
        invoked from within
    __PLTcl_proc_38907_trigger_38898 pkey1_before 38898 {{} key1 key2 txt} BEFORE ROW DELETE {} {key1 1 key2 {key1-2 } txt {test key ...

    NOTICE: updated 1 entries in T_dta2 for new key in T_pkey2
    NOTICE: deleted 1 entries from T_dta2

-- Olivier PRENANT
Tel: +33-5-61-50-97-00 (Work)  +33-5-61-50-97-01 (Fax)  +33-6-07-63-80-64 (GSM)
15, Chemin des Monges, 31190 AUTERIVE, FRANCE
Email: ohp@pyrenet.fr
-- Make your life a dream, make your dream a reality. (St Exupery)
[HACKERS] Getting rid of extra gettimeofday() calls
As of CVS tip, PG does up to four separate gettimeofday() calls upon the arrival of a new client command. This is because the statement_timestamp, stats_command_string, log_duration, and statement_timeout features each independently save an indication of statement start time. Given what we've found out recently about gettimeofday() being unduly expensive on some hardware, this cries out to get fixed. I propose that we do SetCurrentStatementStartTimestamp() immediately upon receiving a client message, and then make the other features copy that value instead of fetching their own. Another gettimeofday() call that I would like to get rid of is the one currently done at the end of statement when stats_command_string is enabled: we record current time when resetting the activity_string to IDLE. Would anyone be terribly upset if this used statement_timestamp instead? The effect would be that for an idle backend, pg_stat_activity.query_start would reflect the start time of its latest query instead of the time at which it finished the query. I can see some use for the current behavior but I don't really think it's worth the overhead of a gettimeofday() call. Preliminary tests say that after the shared-memory change I committed yesterday, the overhead of stats_command_string consists *entirely* of the two added gettimeofday() calls. If we get rid of both, the difference between having stats_command_string on and off is barely measurable (using Bruce's test case of 1 SELECT 1; statements). regards, tom lane
Re: [HACKERS] sync_file_range()
* Simon Riggs: Other files are fsynced at checkpoint - always all dirty blocks in the whole file. Optionally, sync_file_range does not block the calling process, so it's very easy to flush all files at once, which could in theory reduce seeking overhead.
Re: [HACKERS] Getting rid of extra gettimeofday() calls
On Mon, Jun 19, 2006 at 11:17:48AM -0400, Tom Lane wrote: instead? The effect would be that for an idle backend, pg_stat_activity.query_start would reflect the start time of its latest query instead of the time at which it finished the query. I can see some use for the current behavior but I don't really think it's worth the overhead of a gettimeofday() call. Perhaps make it a compile-time option... I suspect there are people making use of that info in their monitoring tools. Though those people are probably also likely to have log_duration=true, so maybe the same trick of one gettimeofday() at statement end and copying it as needed would work. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: [HACKERS] sync_file_range()
Simon Riggs [EMAIL PROTECTED] writes: On Mon, 2006-06-19 at 15:32 +0800, Qingqing Zhou wrote: ITAGAKI Takahiro [EMAIL PROTECTED] wrote: I'm interested in it, with which we could improve responsiveness during checkpoints. Though it is a Linux-specific system call, we could use the combination of mmap() and msync() instead; I mean we could use mmap only to flush dirty pages, not to read or write pages. Can you give more details? As the TODO item indicates, if we mmap a data file, a serious problem is that we don't know when the data pages hit the disk -- so we may violate the WAL rule. Can't see where we'd use it. We fsync the xlog at transaction commit, so only the leading edge needs to be synced -- would the call help there? Presumably the OS can already locate all blocks associated with a particular file fairly quickly without doing a full cache scan. Well, in theory the transaction being committed isn't necessarily the leading edge; there could be more work from other transactions since the last work this transaction actually did. However, I can't see that actually helping performance much, if at all. There can't be much, and it doesn't really matter much how much data it writes -- what really matters is rotational and seek latency anyway. Other files are fsynced at checkpoint -- always all dirty blocks in the whole file. Well, couldn't it be useful for checkpoints if there was some way to know which buffers had been touched since the last checkpoint? There could be a lot of buffers dirtied since the checkpoint began, and those don't really need to be synced, do they? Or it could be used to control the rate at which the files are checkpointed. Come to think of it, I wonder whether there's anything to be gained by using smaller files for tables. Instead of 1G files, maybe 256M files or something like that, to reduce the hit of fsyncing a file. -- greg
Re: [HACKERS] sync_file_range()
On Mon, 2006-06-19 at 15:04 -0400, Greg Stark wrote: We fsync the xlog at transaction commit, so only the leading edge needs to be synced - would the call help there? Presumably the OS can already locate all blocks associated with a particular file fairly quickly without doing a full cache scan. Well, in theory the transaction being committed isn't necessarily the leading edge; there could be more work from other transactions since the last work this transaction actually did. Near enough. Other files are fsynced at checkpoint - always all dirty blocks in the whole file. Well, couldn't it be useful for checkpoints if there was some way to know which buffers had been touched since the last checkpoint? There could be a lot of buffers dirtied since the checkpoint began, and those don't really need to be synced, do they? Qingqing had a proposal for something like that, but it seemed not worth it after analysis. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
[HACKERS] Generic Monitoring Framework Proposal
Motivation: -- The main goal for this Generic Monitoring Framework is to provide a common interface for adding instrumentation points or probes to Postgres so its behavior can be easily observed by developers and administrators even in production systems. This framework will allow Postgres to use the appropriate monitoring/tracing facility provided by each OS. For example, Solaris and FreeBSD will use DTrace, and other OSes can use their respective tool. What is DTrace? -- Some of you may have heard about or used DTrace already. In a nutshell, DTrace is a comprehensive dynamic tracing facility that is built into Solaris and FreeBSD (mostly working) that can be used by administrators and developers on live production systems to examine the behavior of both user programs and of the operating system. DTrace can help answer difficult questions about the OS and the application itself. For example, you may want to ask: - Show all functions that get invoked (userland kernel) and execution time when my function foo() is called. Seeing the path a function takes into the kernel may provide clues for performance tuning. - Show how many times a particular lock is acquired and how long it's held. This can help identity contentions in the system. The best way to appreciate DTrace capabilities is by seeing a demo or through hands-on experience, and I plan to show some interesting demos at the PG Summit. There are a numer of docs on Dtrace, and here's a quick start doc and a complete reference guide. http://www.sun.com/software/solaris/howtoguides/dtracehowto.jsp http://docs.sun.com/app/docs/doc/817-6223 Here is a recent DTrace for FreeBSD status http://marc.theaimsgroup.com/?l=freebsd-currentm=114854018213275w=2 Open source apps that provide user level probes (bottom of page) http://uadmin.blogspot.com/2006/05/what-is-dtrace.html Proposed Solution: This solution is actually quite simple and non-intrusive. 1. 
Define macros PG_TRACE, PG_TRACE1, etc, in a new header file called pg_trace.h with multiple #if defined(xxx) sections for Solaris, FreeBSD, Linux, etc, and add pg_trace.h to c.h which is included in postgres.h and included by every C file. The macros will have the following format: PG_TRACE[n](module_name, probe_name [, arg1, ..., arg5]) module_name = Name to identify PG module such as pg_backend, pg_psql, pg_plpgsql, etc probe_name = Probe name such as transaction_start, lwlock_acquire, etc arg1..arg5 = Any args to pass to the probe such as txn id, lock id, etc 2. Map PG_TRACE, PG_TRACE1, etc, to macros or functions appropriate for each OS. For OSes that don't have suitable tracing facility, just map the macros to nothing - doing this will not have any affect on performance or existing behavior. Sample of pg_trace.h #if defined(sun) || defined(FreeBSD) #include sys/sdt.h #define PG_TRACEDTRACE_PROBE #define PG_TRACE1 DTRACE_PROBE1 ... #define PG_TRACE5 DTRACE_PROBE5 #elif defined(__linux__) || defined(_AIX) || defined(__sgi) ... /* Map the macros to no-ops */ #define PG_TRACE(module, name) #define PG_TRACE1(module, name, arg1) ... #define PG_TRACE5(module, name, arg1, arg2, arg3, arg4, arg5) #endif 3. Add any file(s) to support the particular OS tracing facility 4. Update the Makefiles as necessary for each OS How to add probes: - To add a probe, just add a one line macro in the appropriate location in the source. Here's an example of two probes, one with no argument and the other with 2 arguments: PG_TRACE (pg_backend, fsync_start); PG_TRACE2 (pg_backend, lwlock_acquire, lockid, mode); If there are enough probes embedded in PG, its behavior can be easily observed. With the help of Gavin Sherry, we have added about 20 probes, and Gavin has suggested a number of other interesting areas for additional probes. Pervasive has also added some probes to PG 8.0.4 and posted the patch on http://pgfoundry.org/projects/dtrace/. 
I hope to combine the probes using this generic framework for 8.1.4, and make it available for folks to try. Since my knowledge of the PG source code is limited, I'm looking for assistance from experts to help identify some new interesting probe points. How to use probes: For DTrace, probes can be enabled using a D script. When the probes are not enabled, there is absolutely no performance hit whatsoever. Here is a simple example to print out the number of LWLock counts for each PG process. test.d: #!/usr/sbin/dtrace -s pg_backend*:::lwlock-acquire { @foo[pid] = count(); } dtrace:::END { printf("\n%10s %15s\n", "PID", "Count"); printa("%10d %@15d\n", @foo); } # ./test.d PID Count 1438 28 1447 7240 1448 9675 1449 11972 I have a prototype working, so if anyone wants to try it, I can provide a patch or give access to my test system. This is a proposal, so comments and suggestions are welcome.
Re: [HACKERS] Getting rid of extra gettimeofday() calls
On Mon, 2006-06-19 at 11:17 -0400, Tom Lane wrote: As of CVS tip, PG does up to four separate gettimeofday() calls upon the arrival of a new client command. This is because the statement_timestamp, stats_command_string, log_duration, and statement_timeout features each independently save an indication of statement start time. Given what we've found out recently about gettimeofday() being unduly expensive on some hardware, this cries out to get fixed. I propose that we do SetCurrentStatementStartTimestamp() immediately upon receiving a client message, and then make the other features copy that value instead of fetching their own. Yes. Well spotted. That should make each timed aspect more accurate also, since it's the same value. Presumably you don't mean *every* client message, just stmt start ones. Can we set that in GetTransactionSnapshot() - that way a serializable transaction won't need to update the time after each statement. We can then record this as the SetCurrentSnapshotStartTimestamp(). Another gettimeofday() call that I would like to get rid of is the one currently done at the end of statement when stats_command_string is enabled: we record current time when resetting the activity_string to IDLE. Would anyone be terribly upset if this used statement_timestamp instead? The effect would be that for an idle backend, pg_stat_activity.query_start would reflect the start time of its latest query instead of the time at which it finished the query. I can see some use for the current behavior but I don't really think it's worth the overhead of a gettimeofday() call. Presumably we have to do at least one at the end when doing statement logging? I notice there is also one in elog.c for when we have %t set. We probably don't need to do both when statement logging. Preliminary tests say that after the shared-memory change I committed yesterday, the overhead of stats_command_string consists *entirely* of the two added gettimeofday() calls. 
-- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Getting rid of extra gettimeofday() calls
Simon Riggs [EMAIL PROTECTED] writes: Presumably you don't mean *every* client message, just stmt start ones. At the moment I've got it setting the statement_timestamp on receipt of any message that could lead to execution of user-defined code; that includes Query, Parse, Bind, Execute, FunctionCall. Possibly we could dispense with the Bind one but I'm unconvinced. Can we set that in GetTransactionSnapshot() - that way a serializable transaction won't need to update the time after each statement. No, that's much too late, unless you want to do major rearrangement of the times at which reporting actions occur. Furthermore the entire point of statement_timestamp is that it advances for new commands within the same xact, so your proposal amounts to removing statement_timestamp entirely. The actual behavior of CVS tip is that transaction_timestamp copies from statement_timestamp, not vice versa; that seems fine to me. Presumably we have to do at least one at the end when doing statement logging? Only if you've got log_duration on. Per Jim's suggestion, maybe we could have the IDLE activity report advance activity_timestamp only if log_duration is true, ie, only if it's free to do so. I notice there is also one in elog.c for when we have %t set. We probably don't need to do both when statement logging. I'm inclined to think that that one is worth its keep. Sometimes you really wanna know exactly when a log message was emitted ... regards, tom lane
Re: [HACKERS] Generic Monitoring Framework Proposal
Robert Lor [EMAIL PROTECTED] writes: The main goal for this Generic Monitoring Framework is to provide a common interface for adding instrumentation points or probes to Postgres so its behavior can be easily observed by developers and administrators even in production systems. What is the overhead of a probe when you're not using it? The answer had better not include the phrase kernel call, or this is unlikely to pass muster... For DTrace, probes can be enabled using a D script. When the probes are not enabled, there is absolutely no performance hit whatsoever. If you believe that, I have a bridge in Brooklyn you might be interested in. What are the criteria going to be for where to put probe calls? If it has to be hard-wired into the source code, I foresee a lot of contention about which probes are worth their overhead, because we'll need one-size-fits-all answers. arg1..arg5 = Any args to pass to the probe such as txn id, lock id, etc Where is the data type of a probe argument defined? regards, tom lane
[HACKERS] I was offline
Hi, a quickie: I was offline last week due to my ADSL line going down, so I was unable to follow the discussions closely. I'll be back at the non-transactional catalogs and relminxid discussions later (hopefully tomorrow or on Wednesday). -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Getting rid of extra gettimeofday() calls
On Mon, 2006-06-19 at 11:17, Tom Lane wrote: As of CVS tip, PG does up to four separate gettimeofday() calls upon the arrival of a new client command. This is because the statement_timestamp, stats_command_string, log_duration, and statement_timeout features each independently save an indication of statement start time. Given what we've found out recently about gettimeofday() being unduly expensive on some hardware, this cries out to get fixed. I propose that we do SetCurrentStatementStartTimestamp() immediately upon receiving a client message, and then make the other features copy that value instead of fetching their own. Another gettimeofday() call that I would like to get rid of is the one currently done at the end of statement when stats_command_string is enabled: we record current time when resetting the activity_string to IDLE. Is it just IDLE or also IDLE in transaction? If we are going to change things anyway, I'd like the latter to show the time since start of transaction, so that I would at least have an easy way to write a transaction timeout script :) I don't really care about what plain IDLE uses. -- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
Re: [HACKERS] Generic Monitoring Framework Proposal
On Jun 19, 2006, at 4:40 PM, Tom Lane wrote: Robert Lor [EMAIL PROTECTED] writes: The main goal for this Generic Monitoring Framework is to provide a common interface for adding instrumentation points or probes to Postgres so its behavior can be easily observed by developers and administrators even in production systems. What is the overhead of a probe when you're not using it? The answer had better not include the phrase kernel call, or this is unlikely to pass muster... For DTrace, probes can be enabled using a D script. When the probes are not enabled, there is absolutely no performance hit whatsoever. If you believe that, I have a bridge in Brooklyn you might be interested in. Heh. Syscall probes and FBT probes in Dtrace have zero overhead. User-space probes do have overhead, but it is only a few instructions (two I think). Basically, the probe points are replaced by illegal instructions and the kernel infrastructure for Dtrace will fasttrap the ops and then act. So, it is tiny tiny overhead. Little enough that it isn't unreasonable to instrument things like s_lock which are tiny. What are the criteria going to be for where to put probe calls? If it has to be hard-wired into the source code, I foresee a lot of contention about which probes are worth their overhead, because we'll need one-size-fits-all answers. arg1..arg5 = Any args to pass to the probe such as txn id, lock id, etc Where is the data type of a probe argument defined? I assume it would depend on the probe implementation. In Dtrace they are implemented in .d files that will post-instrument the object before final linkage. Dtrace's whole purpose is to be low overhead and it really does it in a fantastic way. As an example, you can take an uninstrumented binary and add dynamic instrumentation to the entry, exit and every instruction op-code over every single routine in the process. And clearly, as the binary is uninstrumented, the overhead is indeed zero when the probes are not enabled. 
The reason that Robert proposes user-space probes (I assume) is that tracing C functions can be too granular and not conveniently expose the right information to make tracing useful. // Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/ // Ecelerity: Run with it.
Re: [HACKERS] Generic Monitoring Framework Proposal
Tom Lane wrote: Robert Lor [EMAIL PROTECTED] writes: The main goal for this Generic Monitoring Framework is to provide a common interface for adding instrumentation points or probes to Postgres so its behavior can be easily observed by developers and administrators even in production systems. What is the overhead of a probe when you're not using it? The answer had better not include the phrase kernel call, or this is unlikely to pass muster... Here's what the DTrace developers have to say in their Usenix paper. When not explicitly enabled, DTrace has zero probe effect - the system operates exactly as if DTrace were not present at all. http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf The technical details are beyond me, so I can't tell you exactly what happens internally. I can find out if you're interested! For DTrace, probes can be enabled using a D script. When the probes are not enabled, there is absolutely no performance hit whatsoever. If you believe that, I have a bridge in Brooklyn you might be interested in. What are the criteria going to be for where to put probe calls? If it has to be hard-wired into the source code, I foresee a lot of contention about which probes are worth their overhead, because we'll need one-size-fits-all answers. I think we need to be selective in terms of which probes to add since we don't want to scatter them all over the source files. For DTrace, the overhead is very minimal, but you're right, other implementation for the same probe may have more perf overhead. arg1..arg5 = Any args to pass to the probe such as txn id, lock id, etc Where is the data type of a probe argument defined? It's in a .d file which looks like below: provider pg_backend { probe fsync__start(void); probe fsync__end(void); probe lwlock__acquire (int, int); probe lwlock__release(int); ... 
} Regards, Robert
Re: [HACKERS] Generic Monitoring Framework Proposal
On Mon, Jun 19, 2006 at 05:20:31PM -0400, Theo Schlossnagle wrote: Heh. Syscall probes and FBT probes in Dtrace have zero overhead. User-space probes do have overhead, but it is only a few instructions (two I think). Basically, the probe points are replaced by illegal instructions and the kernel infrastructure for Dtrace will fasttrap the ops and then act. So, it is tiny tiny overhead. Little enough that it isn't unreasonable to instrument things like s_lock which are tiny. If someone wanted to, they should be able to do benchmarking with the DTrace patches on pgFoundry to see the overhead of just having the probes in, and then having the probes in and actually using them. If you *really* want to see the difference, add a probe in s_lock. :) -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: [HACKERS] Generic Monitoring Framework Proposal
[EMAIL PROTECTED] (Robert Lor) writes: For DTrace, probes can be enabled using a D script. When the probes are not enabled, there is absolutely no performance hit whatsoever. That seems inconceivable. In order to have a way of deciding whether or not the probes are enabled, there has *got* to be at least one instruction executed, and that can't be costless. -- output = reverse(gro.mca @ enworbbc) http://www.ntlug.org/~cbbrowne/wp.html ...while I know many people who emphatically believe in reincarnation, I have never met or read one who could satisfactorily explain population growth. -- Spider Robinson
Re: [HACKERS] Generic Monitoring Framework Proposal
Theo Schlossnagle wrote: Heh. Syscall probes and FBT probes in Dtrace have zero overhead. User-space probes do have overhead, but it is only a few instructions (two I think). Basically, the probe points are replaced by illegal instructions and the kernel infrastructure for Dtrace will fasttrap the ops and then act. So, it is tiny tiny overhead. Little enough that it isn't unreasonable to instrument things like s_lock which are tiny. Theo, you're a genius. FBT (function boundary tracing) probes have zero overhead (section 4.1) and user-space probes have two instructions of overhead (section 4.2). I was incorrect about making a general zero overhead statement. But it's so close to zero :-) http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf The reason that Robert proposes user-space probes (I assume) is that tracing C functions can be too granular and not conveniently expose the right information to make tracing useful. Yes, I'm proposing user-space probes (aka User Statically-Defined Tracing - USDT). USDT provides a high-level abstraction so the application can expose well defined probes without the user having to know the detailed implementation. For example, instead of having to know the function LWLockAcquire(), a well documented probe called lwlock_acquire with the appropriate args is much more usable. Regards, Robert
[HACKERS] CVS HEAD busted on Windows?
I notice buildfarm member snake is unhappy: The program postgres is needed by initdb but was not found in the same directory as C:/msys/1.0/local/build-farm/HEAD/pgsql.696/src/test/regress/tmp_check/install/usr/local/build-farm/HEAD/inst/bin/initdb.exe. Check your installation. I'm betting Peter's recent patch to merge postmaster and postgres missed some little thing or other ... regards, tom lane
Re: [HACKERS] Generic Monitoring Framework Proposal
On Jun 19, 2006, at 6:41 PM, Robert Lor wrote: Theo Schlossnagle wrote: Heh. Syscall probes and FBT probes in Dtrace have zero overhead. User-space probes do have overhead, but it is only a few instructions (two I think). Basically, the probe points are replaced by illegal instructions and the kernel infrastructure for Dtrace will fasttrap the ops and then act. So, it is tiny tiny overhead. Little enough that it isn't unreasonable to instrument things like s_lock which are tiny. Theo, you're a genius. FBT (function boundary tracing) probes have zero overhead (section 4.1) and user-space probes have two instructions of overhead (section 4.2). I was incorrect about making a general zero overhead statement. But it's so close to zero :-) http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf The reason that Robert proposes user-space probes (I assume) is that tracing C functions can be too granular and not conveniently expose the right information to make tracing useful. Yes, I'm proposing user-space probes (aka User Statically-Defined Tracing - USDT). USDT provides a high-level abstraction so the application can expose well defined probes without the user having to know the detailed implementation. For example, instead of having to know the function LWLockAcquire(), a well documented probe called lwlock_acquire with the appropriate args is much more usable. I am giving a talk at OSCON this year about PostgreSQL on big systems. Big is all relative, but I will be talking about dtrace a bit and the advantages of running PostgreSQL on Solaris, which is what we ended up doing after some extremely disturbing experiences on Linux. I was able to track a very acute memory leak in pl/perl (which Neil so kindly fixed) within a few moments -- and this is without explicit user-space trace points. If there were good user-space points, I likely wouldn't have had to dig in the source as a precursor to my dtrace efforts. 
The things you might be able to do with user-specific trace points: o better understand the block scatter (distance of block-level reads) for a specific query. o understand lock contention in vastly multiprocessor systems using plockstat (my hunch is that heavy-weight locks might be better). o our current box is a 4-way Opteron, but we have a 16-way T2000 as well. o report on queries including turn-around time, block-accesses, lock acquisitions grouped by query for specific time windows. The nice thing about dtrace is that it requires no prep to look at a problem. When something is acting odd in production, you don't want to attempt to repeat it in a test environment first. You want to observe it. Dtrace allows you to dig in really deep in production with an acceptable performance penalty and ask questions that couldn't be asked before. It is exceptionally clever stuff. Of all the new neat stuff in Solaris 10, it has my vote for coolest and most useful. I've nailed several production problems (outside of Postgres) using dtrace with accuracy and efficiency. When Solaris 10u2 is released, we'll be trying Postgres on ZFS, so my rankings may change :-) The idea of having intelligently placed dtrace probes in Postgres would allow us to deal with postgres as a first class app on Solaris 10 with respect to troubleshooting obtuse production problems. That, to me, is exciting stuff. Best regards, Theo // Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/ // Ecelerity: Run with it.
Re: [HACKERS] Generic Monitoring Framework Proposal
Jim C. Nasby wrote: On Mon, Jun 19, 2006 at 05:20:31PM -0400, Theo Schlossnagle wrote: Heh. Syscall probes and FBT probes in Dtrace have zero overhead. User-space probes do have overhead, but it is only a few instructions (two I think). Basically, the probe points are replaced by illegal instructions and the kernel infrastructure for Dtrace will fasttrap the ops and then act. So, it is tiny tiny overhead. Little enough that it isn't unreasonable to instrument things like s_lock which are tiny. If someone wanted to, they should be able to do benchmarking with the DTrace patches on pgFoundry to see the overhead of just having the probes in, and then having the probes in and actually using them. If you *really* want to see the difference, add a probe in s_lock. :) We will need to benchmark on FreeBSD to see if those comments about overhead stand up to scrutiny there too. I would think that even if (for instance) we find that there is no overhead on Solaris, those of us on platforms where DTrace is less mature would want the option of building without any probes at all in the code - I guess a configure option --without-dtrace, on by default on those platforms, would do it. regards Mark
Re: [HACKERS] Generic Monitoring Framework Proposal
On Jun 19, 2006, at 7:39 PM, Mark Kirkwood wrote: We will need to benchmark on FreeBSD to see if those comments about overhead stand up to scrutiny there too. I've followed the development of DTrace on FreeBSD and the design approach is mostly identical to the Solaris one. This would mean that if there is overhead on FreeBSD not present on Solaris it would be considered a bug and likely fixed. I would think that even if (for instance) we find that there is no overhead on Solaris, those of us on platforms where DTrace is less mature would want the option of building without any probes at all in the code - I guess a configure option --without-dtrace on by default on those platforms would do it. Absolutely. As they are all proposed as preprocessor macros, this would be trivial to accomplish. // Theo Schlossnagle // CTO -- http://www.omniti.com/~jesus/ // OmniTI Computer Consulting, Inc. -- http://www.omniti.com/ // Ecelerity: Run with it.
Re: [HACKERS] sync_file_range()
Greg Stark [EMAIL PROTECTED] writes: Come to think of it I wonder whether there's anything to be gained by using smaller files for tables. Instead of 1G files maybe 256M files or something like that to reduce the hit of fsyncing a file. Actually probably not. The weak part of our current approach is that we tell the kernel sync this file, then sync that file, etc, in a more or less random order. This leads to a probably non-optimal sequence of disk accesses to complete a checkpoint. What we would really like is a way to tell the kernel sync all these files, and let me know when you're done --- then the kernel and hardware have some shot at scheduling all the writes in an intelligent fashion. sync_file_range() is not that exactly, but since it lets you request syncing and then go back and wait for the syncs later, we could get the desired effect with two passes over the file list. (If the file list is longer than our allowed number of open files, though, the extra opens/closes could hurt.) Smaller files would make the I/O scheduling problem worse not better. Indeed, I've been wondering lately if we shouldn't resurrect LET_OS_MANAGE_FILESIZE and make that the default on systems with largefile support. If nothing else it would cut down on open/close overhead on very large relations. regards, tom lane
[HACKERS] shall we have a TRACE_MEMORY mode
As I follow Relyea Mike's recent post of a possible memory leak, I think that we lack a good way of identifying memory usage. Maybe we should also remember __FILE__, __LINE__ etc. for better memory usage diagnosis when TRACE_MEMORY is on? Regards, Qingqing
[HACKERS] PAM auth
Hi folks, I'm trying to use PAM auth on PostgreSQL, but I still cannot get success with PAM auth (with PG813 and RHEL3). pg_hba.conf has: host pamtest all 0.0.0.0/0 pam /etc/pam.d/postgresql is: #%PAM-1.0 auth required pam_stack.so service=system-auth account required pam_stack.so service=system-auth password required pam_stack.so service=system-auth And I've changed the user password with ALTER USER ... PASSWORD. However, my postmaster always denies my login. - % /usr/local/pgsql813/bin/psql -h localhost -W -U hoge pamtest Password for user hoge: LOG: pam_authenticate failed: Authentication failure FATAL: PAM authentication failed for user hoge psql: FATAL: PAM authentication failed for user hoge - What's wrong with that? BTW, I found an empty password ("") is passed to the CheckPAMAuth() function in auth.c. - #ifdef USE_PAM case uaPAM: pam_port_cludge = port; status = CheckPAMAuth(port, port->user_name, ""); break; #endif /* USE_PAM */ - /* * Check authentication against PAM. */ static int CheckPAMAuth(Port *port, char *user, char *password) { int retval; pam_handle_t *pamh = NULL; /* * Apparently, Solaris 2.6 is broken, and needs ugly static variable * workaround */ pam_passwd = password; /* * Set the application data portion of the conversation struct. This is * later used inside the PAM conversation to pass the password to the * authentication module. */ pam_passw_conv.appdata_ptr = (char *) password; /* from password above, * not allocated */ - What does it mean? I'm not familiar with PAM, so I can't get why the password can be empty here. Any suggestion? Thanks. -- NAGAYASU Satoshi [EMAIL PROTECTED]
[HACKERS] checking on buildfarm member thrush
I'm trying to determine why thrush has been failing on PG CVS HEAD for the past few days. Could you try running the attached program on that machine, and see what it prints? I suspect it will dump core :-( Note: you might need to use -D_GNU_SOURCE to get it to compile at all. regards, tom lane #include <stdio.h> #include <stdlib.h> #include <string.h> #include <errno.h> #include <fcntl.h> int main() { if (posix_fadvise(fileno(stdin), 0, 0, POSIX_FADV_DONTNEED)) printf("failed: %s\n", strerror(errno)); else printf("OK\n"); return 0; }
Re: [HACKERS] shall we have a TRACE_MEMORY mode
Qingqing Zhou wrote: As I follow Relyea Mike's recent post of a possible memory leak, I think that we lack a good way of identifying memory usage. Maybe we should also remember __FILE__, __LINE__ etc. for better memory usage diagnosis when TRACE_MEMORY is on? Hmm, this would have been a great help to me not long ago, so I'd say it would be nice to have. About the exact form we'd give the feature: maybe write each allocation/freeing to a per-backend file, say /tmp/pgmem.pid. Also memory context creation, destruction, reset. Having the __FILE__ and __LINE__ on each operation would be a good tracing tool as well. Then it's easy to write Perl tools to find specific problems. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] shall we have a TRACE_MEMORY mode
Alvaro Herrera [EMAIL PROTECTED] writes: About the exact form we'd give the feature: maybe write each allocation/freeing to a per-backend file, say /tmp/pgmem.pid. Also memory context creation, destruction, reset. Having the __FILE__ and __LINE__ on each operation would be a good tracing tool as well. Then it's easy to write Perl tools to find specific problems. That seems mostly the hard way to me, because our memory management scheme is *not* based around thou shalt free() what thou malloc()ed. You'd need a tool that understood about resetting memory contexts (recursively) to get anywhere at all in analyzing such a trace. I've had some success in the past with debugging memory leaks by trawling through the oversized memory contexts with gdb's x command and trying to understand what the bulk of the data was. This is certainly pretty painful though. One idea that comes to mind is to have a compile-time option to record the palloc __FILE__ and __LINE__ in every AllocChunk header. Then it would not be so hard to identify the culprit while trawling through memory. The overhead costs would be so high that you'd never turn it on by default though :-( Another thing to consider is that the proximate location of the palloc is frequently *not* very useful. For instance, if your memory is getting eaten by lists, all the palloc traces will point at new_tail_cell(). Not much help. I don't know what to do about that ... any ideas? regards, tom lane