Re: [HACKERS] sync_file_range()

2006-06-19 Thread Qingqing Zhou

ITAGAKI Takahiro [EMAIL PROTECTED] wrote


 I'm interested in it, with which we could improve responsiveness during
 checkpoints. Though it is a Linux-specific system call, we could use
 the combination of mmap() and msync() instead of it; I mean we can use
 mmap only to flush dirty pages, not to read or write pages.


Can you specify details? As the TODO item indicates, if we mmap the data file, a
serious problem is that we don't know when the data pages hit the disks --
so we may violate the WAL rule.

Regards,
Qingqing



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] MultiXacts WAL

2006-06-19 Thread Zeugswetter Andreas DCP SD

 I would like to see some checking of this, though.  Currently 
 I'm doing testing of PostgreSQL under very large numbers of 
 connections (2000+) and am finding that there's a huge volume 
 of xlog output ... far more than 
 comparable RDBMSes.   So I think we are logging stuff we 
 don't really have to.

I think you really have to lengthen the checkpoint interval to reduce
WAL overhead (20 min or so). Also, imho, you cannot compare only the log
size/activity, since other DBs write part of what pg writes to WAL to
other areas (physical log, rollback segment, ...).

If we cannot afford lengthening the checkpoint interval because of
too heavy checkpoint load, we need to find ways to tune bgwriter, and
not reduce the checkpoint interval.
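
(As an aside, an illustrative postgresql.conf sketch of the kind of tuning
meant here; the GUC names are the standard ones, but the values are made up
and entirely workload-dependent:)

checkpoint_timeout = 1200       # seconds; roughly the 20 min suggested above
checkpoint_segments = 64        # permit more WAL between checkpoints
bgwriter_delay = 200            # ms between bgwriter rounds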

Andreas

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] regresssion script hole

2006-06-19 Thread Stefan Kaltenbrunner
Michael Fuhr wrote:
 On Sun, Jun 18, 2006 at 07:18:07PM -0600, Michael Fuhr wrote:
 Maybe I'm misreading the packet, but I think the query is for
 ''kaltenbrunner.cc (two single quotes followed by kaltenbrunner.cc)
 
 Correction: ''.kaltenbrunner.cc

yes, that is exactly the issue - the postmaster tries to resolve
''.kaltenbrunner.cc multiple times during startup and gets ServFail
as a response from the upstream resolver.


Stefan

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Rethinking stats communication mechanisms

2006-06-19 Thread PFC



Great minds think alike ;-) ... I just committed exactly that protocol.
I believe it is correct, because AFAICS there are only four possible
risk cases:


Congrats!

For general culture you might be interested in reading these:

http://en.wikipedia.org/wiki/Software_transactional_memory
http://libcmt.sourceforge.net/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] regresssion script hole

2006-06-19 Thread Stefan Kaltenbrunner
Andrew Dunstan wrote:
 
 
 Tom Lane wrote:
 
 Anyway, the tail end of the trace
 shows it repeatedly sending off a UDP packet and getting practically the
 same data back:


 I'm not too up on what the DNS protocol looks like on-the-wire, but I'll
 bet this is it.  I think it's trying to look up kaltenbrunner.cc and
 failing.

  

 
 Why are we actually looking up anything? Just so we can bind to a
 listening socket?
 
 Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
 direct it to use files first, something like
 
  lookup file bind
 
 Stefan, can you look into that? It would be a bit ugly if it's calling
 DNS (and failing) to resolve localhost.


no - resolving localhost works fine (both using /etc/hosts and through
the dns-resolver) - and I in fact verified that when we initially started
to investigate that issue a while ago :-)


Stefan

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] sync_file_range()

2006-06-19 Thread ITAGAKI Takahiro
Qingqing Zhou [EMAIL PROTECTED] wrote:

  I'm interested in it, with which we could improve responsiveness during
  checkpoints. Though it is a Linux-specific system call, we could use
  the combination of mmap() and msync() instead of it; I mean we can use
  mmap only to flush dirty pages, not to read or write pages.
 
 Can you specify details? As the TODO item indicates, if we mmap the data file, a
 serious problem is that we don't know when the data pages hit the disks --
 so we may violate the WAL rule.

I'm thinking about fuzzy checkpoints, where we write and flush buffers
only as much as we need to. sync_file_range() would help us control buffer
flushing at a finer granularity. We could stretch out a checkpoint's length
to avoid overloading the storage in a burst, using sync_file_range() and
cost-based delay, like vacuum.

I did not mean to modify buffers via mmap, just to suggest the following
pseudo-code. (I don't know whether it actually works...)

my_sync_file_range(fd, offset, nbytes, ...)
{
    /* map the already-written range so msync can target it */
    void *p = mmap(NULL, nbytes, PROT_READ, MAP_SHARED, fd, offset);

    msync(p, nbytes, MS_ASYNC);     /* schedule write-back; don't wait */
    munmap(p, nbytes);
}


Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] sync_file_range()

2006-06-19 Thread Simon Riggs
On Mon, 2006-06-19 at 15:32 +0800, Qingqing Zhou wrote:
 ITAGAKI Takahiro [EMAIL PROTECTED] wrote
 
 
  I'm interested in it, with which we could improve responsiveness during
  checkpoints. Though it is a Linux-specific system call, we could use
  the combination of mmap() and msync() instead of it; I mean we can use
  mmap only to flush dirty pages, not to read or write pages.
 
 
 Can you specify details? As the TODO item indicates, if we mmap the data file, a
 serious problem is that we don't know when the data pages hit the disks --
 so we may violate the WAL rule.

Can't see where we'd use it.

We fsync the xlog at transaction commit, so only the leading edge needs
to be synced - would the call help there? Presumably the OS can already
locate all blocks associated with a particular file fairly quickly
without doing a full cache scan.

Other files are fsynced at checkpoint - always all dirty blocks in the
whole file.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] regresssion script hole

2006-06-19 Thread Andrew Dunstan



Stefan Kaltenbrunner wrote:


Andrew Dunstan wrote:
 


Why are we actually looking up anything? Just so we can bind to a
listening socket?

Anyway, maybe the box needs a lookup line in its /etc/resolv.conf to
direct it to use files first, something like

lookup file bind

Stefan, can you look into that? It would be a bit ugly if it's calling
DNS (and failing) to resolve localhost.
   




no - resolving localhost works fine (both using /etc/hosts and through
the dns-resolver) - and I in fact verified that when we initially started
to investigate that issue a while ago :-)


 



Why are we looking up 'kaltenbrunner.cc' at all then? In any case, can 
we just try with that resolver line?


The question isn't whether it succeeds, it's how long it takes to 
succeed. When I increased the pg_regress timeout it actually went 
through the whole regression test happily. I suspect we have 2 things 
eating up the 60s timeout here: loading the timezone db and resolving 
whatever it is we are trying to resolve.


cheers

andrew



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: R: [HACKERS] Per-server univocal identifier

2006-06-19 Thread Joachim Wieland
Giampaolo,

On Sun, Jun 18, 2006 at 01:26:21AM +0200, Giampaolo Tomassoni wrote:
 Or... Can I put a custom variable in pgsql.conf?

Like that you mean?


custom_variable_classes = 'identify'    # list of custom variable classnames
identify.id = 42




template1=# show identify.id;
 identify.id
-------------
 42


However, pg_settings does not contain custom variable classes, so it can be difficult
to actually use this value. I wonder if this is a bug or a feature?
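
For what it's worth, a workaround sketch: current_setting() can still read
the value even though pg_settings doesn't list it:

SELECT current_setting('identify.id');   -- returns '42' with the setting above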


Joachim


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] regresssion script hole

2006-06-19 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 The question isn't whether it succeeds, it's how long it takes to 
 succeed. When I increased the pg_regress timeout it actually went 
 through the whole regression test happily. I suspect we have 2 things 
 eating up the 60s timeout here: loading the timezone db and resolving 
 whatever it is we are trying to resolve.

The behavior of loading the whole TZ database was there for a while
before anyone noticed; I believe it could only be responsible for a
few seconds.  So the failed DNS responses must be the problem.  Could
we get a ktrace with timestamps on the syscalls to confirm that?

Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] regresssion script hole

2006-06-19 Thread Martijn van Oosterhout
On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote:
 Of course the $64 question is *why* is 8.0 trying to resolve that name,
 particularly seeing that the later branches apparently aren't.

The formatting of the message suggests it is a gethostbyname('')
doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the
configuration files?

I wonder if it'd be worth adding some conditional code around
gethostbyname() calls that warns if the call took longer than, say, 10
seconds. By printing out that and the string it's looking up you could
save a lot of time confirming whether the delay is there or elsewhere...
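
Something along these lines, purely illustrative (not existing PostgreSQL
code; the 10-second threshold is arbitrary):

#include <netdb.h>
#include <stdio.h>
#include <sys/time.h>

/* wrapper that warns when a lookup stalls */
struct hostent *
timed_gethostbyname(const char *name)
{
    struct timeval t0, t1;
    struct hostent *he;
    double elapsed;

    gettimeofday(&t0, NULL);
    he = gethostbyname(name);
    gettimeofday(&t1, NULL);
    elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    if (elapsed > 10.0)
        fprintf(stderr, "WARNING: gethostbyname(\"%s\") took %.1f seconds\n",
                name, elapsed);
    return he;
}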

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [HACKERS] Rethinking stats communication mechanisms

2006-06-19 Thread Magnus Hagander
  Might it not be a win to also store per backend global 
 values in the 
  shared memory segment? Things like time of last command, 
 number of 
  transactions executed in this backend, backend start 
 time and other 
  values that are fixed-size?
 
 I'm including backend start time, command start time, etc 
 under the heading of current status which'll be in the 
 shared memory.  However, I don't believe in trying to count 
 events (like transaction commits) that way.  If we do then we 
 risk losing events whenever a backend quits and is replaced 
 by another.

Well, in many cases that's not a problem. It might be interesting, for
example, to know that a backend has run nnn transactions before ending up in
the state where it is now (say, idle in transaction and idle for a long
time). The point about this being transient data that can go away along
with a quitting backend would still hold true.

What were your thoughts about storing bgwriter and archiver statistics
that way? Good or bad idea?


 I haven't yet looked through the stats in detail, but this 
 approach basically presumes that we are only going to count 
 events per-table and per-database --- I am thinking that the 
 background stats collector process won't even keep track of 
 individual backends anymore.  (So, we'll fix the old problem 
 of loss of backend-exit messages resulting in bogus displays.)

Right.  As I see you have now implemented ;-)

/Magnus

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] regresssion script hole

2006-06-19 Thread Stefan Kaltenbrunner
Martijn van Oosterhout wrote:
 On Mon, Jun 19, 2006 at 09:21:21AM -0400, Tom Lane wrote:
 Of course the $64 question is *why* is 8.0 trying to resolve that name,
 particularly seeing that the later branches apparently aren't.
 
 The formatting of the message suggests it is a gethostbyname('')
 doing it. Did any quoting rules change between 8.0 and 8.1 w.r.t. the
 configuration files?

I tcpdump'd the dns-traffic on that box during a postmaster startup and
it's definitely trying to look up ''.kaltenbrunner.cc a lot of times.
And from what it looks like it might be getting somehow rate-limited by
my ISP's recursive resolvers after doing the same query dozens of times
and getting a servfail every time.
At least the timestamps seem to indicate that the responses are getting
delayed up to 10 seconds after a number of queries ...
It might be a complete shot in the dark, but spoonbill worked fine on
REL_8_0_STABLE until I disabled reporting 3 months ago.
During this time the large escaping security fix/standard_strings patch
went in - could this be related in any way?


Stefan

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


R: R: [HACKERS] Per-server univocal identifier

2006-06-19 Thread Giampaolo Tomassoni
 Giampaolo,
 
 On Sun, Jun 18, 2006 at 01:26:21AM +0200, Giampaolo Tomassoni wrote:
  Or... Can I put a custom variable in pgsql.conf?
 
 Like that you mean?
 
 
 custom_variable_classes = 'identify'    # list of custom variable classnames
 identify.id = 42
 
 
 
 
 template1=# show identify.id;
  identify.id
 -------------
  42
 
 
 However pg_settings does not contain variable classes so it can be difficult
 to actually use this value. I wonder if this is a bug or a feature?

Yes, that would be fine. It doesn't work for me, anyway. I guess the problem is
that the setting must be associated with a postgres module, which has to be
responsible for the proper handling of the setting itself. Without an associated
module, the setting is not available under the postgres env.


 
 
 Joachim
 
 
 ---(end of broadcast)---
 TIP 3: Have you checked our extensive FAQ?
 
http://www.postgresql.org/docs/faq


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] regresssion script hole

2006-06-19 Thread Tom Lane
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 I tcpdump'd the dns-traffic on that box during a postmaster startup and
 it's definitely trying to look up ''.kaltenbrunner.cc a lot of times.

I just strace'd postmaster start on a Fedora box and can see nothing
corresponding.  Since this is a "make check" we know that the PG
configuration files it's using are stock ... so it must be something
about the system config that's sending it round the bend.  What do you
have in /etc/hosts, /etc/resolv.conf, /etc/nsswitch.conf?

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] regresssion script hole

2006-06-19 Thread Tom Lane
Oh, I think I see the problem:

8.0 pg_regress:

if [ "$unix_sockets" = no ]; then
    postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
    postmaster_options="$postmaster_options -c listen_addresses=''"
fi

8.1 pg_regress:

if [ "$unix_sockets" = no ]; then
    postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
    postmaster_options="$postmaster_options -c listen_addresses="
fi

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] regresssion script hole

2006-06-19 Thread Andrew Dunstan

Stefan Kaltenbrunner wrote:

Tom Lane wrote:
  

Andrew Dunstan [EMAIL PROTECTED] writes:

The question isn't whether it succeeds, it's how long it takes to 
succeed. When I increased the pg_regress timeout it actually went 
through the whole regression test happily. I suspect we have 2 things 
eating up the 60s timeout here: loading the timezone db and resolving 
whatever it is we are trying to resolve.
  

The behavior of loading the whole TZ database was there for awhile
before anyone noticed; I believe it could only be responsible for a
few seconds.  So the failed DNS responses must be the problem.  Could
we get a ktrace with timestamps on the syscalls to confirm that?

Of course the $64 question is *why* is 8.0 trying to resolve that name,
particularly seeing that the later branches apparently aren't.



hmm maybe the later branches are trying to resolve that too - but only
the combination of the TZ database loading + the failed DNS-queries is
pushing the startup time over the 60-second limit on this (quite slow) box?

I will try to verify what the later branches are doing exactly ...


  


Yes, we're on the margin here. The successful runs I saw got through the 
timeout in 5 or 10 seconds over the 60 that we currently allow.


What interests me is where it even gets the string 'kaltenbrunner.cc' 
from. It looks to me like the most likely place is the search line in 
/etc/resolv.conf. It would be nice to know exactly what it is trying to 
resolve.


cheers

andrew



---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] regresssion script hole

2006-06-19 Thread Tom Lane
I wrote:
 8.0 pg_regress:
 postmaster_options="$postmaster_options -c listen_addresses=''"
 8.1 pg_regress:
 postmaster_options="$postmaster_options -c listen_addresses="

and in fact here's the commit that changed that:

2005-06-19 22:26  tgl

* src/test/regress/pg_regress.sh: Change shell syntax that seems
not to work right on FreeBSD 6-CURRENT buildfarm machines.

So apparently it's some generic disease of the BSD shell.  I should have
back-patched at the time but did not.  Will take care of it.

On the timezone search business, it's still the case that HEAD will
search through all the timezones if it's not given an explicit setting
(eg an explicit environment TZ value).  We could suppress that by having
pg_regress set TZ, but then the regression tests wouldn't exercise the
search code at all, which is probably not a great idea.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] regresssion script hole

2006-06-19 Thread Andrew Dunstan

Tom Lane wrote:

Oh, I think I see the problem:

8.0 pg_regress:

if [ "$unix_sockets" = no ]; then
    postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
    postmaster_options="$postmaster_options -c listen_addresses=''"
fi

8.1 pg_regress:

if [ "$unix_sockets" = no ]; then
    postmaster_options="$postmaster_options -c listen_addresses=$hostname"
else
    postmaster_options="$postmaster_options -c listen_addresses="
fi


  


Good catch! I'm impressed! This is surely the heart of the problem.

That change (from rev 1.56) clearly needs to be backported to 8.0.

cheers

andrew


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Rethinking stats communication mechanisms

2006-06-19 Thread Bort, Paul
 
 * reader's read starts before and ends after writer's update: reader
 will certainly note a change in update counter.
 
 * reader's read starts before and ends within writer's update: reader
 will note a change in update counter.
 
 * reader's read starts within and ends after writer's update: reader
 will note a change in update counter.
 
 * reader's read starts within and ends within writer's update: reader
 will see update counter as odd.
 
 Am I missing anything?
 

The only remaining concern would be the possibility of the reader
thrashing because the writer is updating so often that the reader never
gets the same counter twice. IIRC, the reader was only sampling, not
trying to catch every entry, so that will help. But is it enough?
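
For illustration, a minimal C sketch of the odd/even update-counter protocol
being discussed (single writer assumed; all names here are invented, and a
real implementation would also need memory barriers on weakly ordered CPUs):

#include <stdint.h>
#include <string.h>

typedef struct
{
    volatile uint32_t counter;  /* odd while an update is in progress */
    char    activity[64];       /* shared status payload */
} StatusSlot;

/* writer: bump to odd, update, bump to even */
void
write_status(StatusSlot *slot, const char *msg)
{
    slot->counter++;            /* now odd: update in progress */
    strncpy(slot->activity, msg, sizeof(slot->activity) - 1);
    slot->counter++;            /* now even: update complete */
}

/* reader: retry until the counter is even and unchanged across the read */
void
read_status(StatusSlot *slot, char *buf, size_t len)
{
    uint32_t before, after;

    do
    {
        before = slot->counter;
        memcpy(buf, slot->activity, len);   /* len must be <= 64 */
        after = slot->counter;
    } while (before != after || (before & 1));
}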

Regards,
Paul Bort

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


R: R: R: [HACKERS] Per-server univocal identifier

2006-06-19 Thread Giampaolo Tomassoni

 ...omitted...

 yes, it's for contrib modules. but you can access it via SHOW so maybe it
 makes sense to include it in pg_settings as well. Not for now but for the
 future maybe...

I agree: it could be a useful feature.

giampaolo


 
 
 Joachim
 
 -- 
 Joachim Wieland  
 [EMAIL PROTECTED]
 C/ Usandizaga 12 1°B   
 ICQ: 37225940
 20002 Donostia / San Sebastian (Spain) GPG 
 key available


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] regresssion script hole

2006-06-19 Thread Stefan Kaltenbrunner
Tom Lane wrote:
 Andrew Dunstan [EMAIL PROTECTED] writes:
 The question isn't whether it succeeds, it's how long it takes to 
 succeed. When I increased the pg_regress timeout it actually went 
 through the whole regression test happily. I suspect we have 2 things 
 eating up the 60s timeout here: loading the timezone db and resolving 
 whatever it is we are trying to resolve.
 
 The behavior of loading the whole TZ database was there for awhile
 before anyone noticed; I believe it could only be responsible for a
 few seconds.  So the failed DNS responses must be the problem.  Could
 we get a ktrace with timestamps on the syscalls to confirm that?
 
 Of course the $64 question is *why* is 8.0 trying to resolve that name,
 particularly seeing that the later branches apparently aren't.

hmm maybe the later branches are trying to resolve that too - but only
the combination of the TZ database loading + the failed DNS-queries is
pushing the startup time over the 60-second limit on this (quite slow) box?

I will try to verify what the later branches are doing exactly ...


Stefan

---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] pl/tcl again.

2006-06-19 Thread ohp
Hi all,

I'm still fighting with the pltcl test that doesn't return the error message
when elog ERROR is called.

I've played with pltcl.c's pltcl_error and removed the calls to PG_TRY,
PG_CATCH and PG_END_TRY to prove that elog itself had a problem...

How can I check what happens in elog?

Each time elog is called with a level of ERROR or FATAL, PG_CATCH runs.

Also, here are the server logs for the pltcl checks.

It's amazing that the actual error message is in the CONTEXT...


LOG:  database system was shut down at 2006-06-19 16:48:47 MET DST
LOG:  checkpoint record is at 0/22C6CE0
LOG:  redo record is at 0/22C6CE0; undo record is at 0/0; shutdown TRUE
LOG:  next transaction ID: 9130; next OID: 38895
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  database system is ready
LOG:  transaction ID wrap limit is 1073745208, limited by database regression
LOG:  transaction ID wrap limit is 1073745208, limited by database regression
LOG:  transaction ID wrap limit is 1073745208, limited by database regression
ERROR:  role "regressgroup1" does not exist
ERROR:
CONTEXT:  duplicate key '1', 'KEY1-3' for T_pkey2
while executing
elog ERROR  duplicate key '$NEW(key1)', '$NEW(key2)' for T_pkey2
invoked from within
if {$n > 0} {
elog ERROR \
duplicate key '$NEW(key1)', '$NEW(key2)' for T_pkey2
}
(procedure __PLTcl_proc_38909_trigger_38900 line 32)
invoked from within
__PLTcl_proc_38909_trigger_38900 pkey2_before 38900 {{} key1 key2 txt} 
BEFORE ROW INSERT {key1 1 key2 {KEY1-3  } txt {should fail 
...
ERROR:
CONTEXT:  key for t_dta1 not in t_pkey1
while executing
elog ERROR key for $GD($planrel) not in $keyrel
(procedure __PLTcl_proc_38913_trigger_38902 line 92)
invoked from within
__PLTcl_proc_38913_trigger_38902 dta1_before 38902 {{} tkey ref1 ref2} 
BEFORE ROW INSERT {tkey {trec 4} ref1 1 ref2 {key1-4  }} {} 
ref...
ERROR:
CONTEXT:  key for t_dta2 not in t_pkey2
while executing
elog ERROR key for $GD($planrel) not in $keyrel
(procedure __PLTcl_proc_38913_trigger_38904 line 92)
invoked from within
__PLTcl_proc_38913_trigger_38904 dta2_before 38904 {{} tkey ref1 ref2} 
BEFORE ROW INSERT {tkey {trec 4} ref1 1 ref2 {KEY1-4  }} {} 
ref...
ERROR:
CONTEXT:  key '1', 'key1-1  ' referenced by T_dta1
while executing
elog ERROR  key '$OLD(key1)', '$OLD(key2)' referenced by T_dta1
invoked from within
if {$check_old_ref} {
#
# Check for references to OLD
#
set n [spi_execp -count 1 $GD(plan_dta1) [list $OLD(key1) 
$OLD(key2)]]
if {$n > 0}...
(procedure __PLTcl_proc_38907_trigger_38898 line 79)
invoked from within
__PLTcl_proc_38907_trigger_38898 pkey1_before 38898 {{} key1 key2 txt} 
BEFORE ROW UPDATE {key1 1 key2 {key1-9  } txt {test key
...
ERROR:
CONTEXT:  key '1', 'key1-2  ' referenced by T_dta1
while executing
elog ERROR  key '$OLD(key1)', '$OLD(key2)' referenced by T_dta1
invoked from within
if {$check_old_ref} {
#
# Check for references to OLD
#
set n [spi_execp -count 1 $GD(plan_dta1) [list $OLD(key1) 
$OLD(key2)]]
if {$n > 0}...
(procedure __PLTcl_proc_38907_trigger_38898 line 79)
invoked from within
__PLTcl_proc_38907_trigger_38898 pkey1_before 38898 {{} key1 key2 txt} 
BEFORE ROW DELETE {} {key1 1 key2 {key1-2  } txt {test key 
...
NOTICE:  updated 1 entries in T_dta2 for new key in T_pkey2
NOTICE:  deleted 1 entries from T_dta2
-- 
Olivier PRENANT Tel: +33-5-61-50-97-00 (Work)
15, Chemin des Monges+33-5-61-50-97-01 (Fax)
31190 AUTERIVE   +33-6-07-63-80-64 (GSM)
FRANCE  Email: ohp@pyrenet.fr
--
Make your life a dream, make your dream a reality. (St Exupery)

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


[HACKERS] Getting rid of extra gettimeofday() calls

2006-06-19 Thread Tom Lane
As of CVS tip, PG does up to four separate gettimeofday() calls upon the
arrival of a new client command.  This is because the statement_timestamp,
stats_command_string, log_duration, and statement_timeout features each
independently save an indication of statement start time.  Given what
we've found out recently about gettimeofday() being unduly expensive on
some hardware, this cries out to get fixed.  I propose that we do
SetCurrentStatementStartTimestamp() immediately upon receiving a client
message, and then make the other features copy that value instead of
fetching their own.
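
A hedged sketch of the idea (SetCurrentStatementStartTimestamp is the
function named above; everything else here is invented for illustration):

#include <sys/time.h>

static struct timeval stmt_start_time;  /* captured once per client message */

void
SetCurrentStatementStartTimestamp(void)
{
    gettimeofday(&stmt_start_time, NULL);
}

/* statement_timestamp, stats_command_string, log_duration and
 * statement_timeout would all copy this instead of re-reading the clock */
struct timeval
GetCurrentStatementStartTimestamp(void)
{
    return stmt_start_time;
}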

Another gettimeofday() call that I would like to get rid of is the one
currently done at the end of statement when stats_command_string is
enabled: we record current time when resetting the activity_string to
IDLE.  Would anyone be terribly upset if this used statement_timestamp
instead?  The effect would be that for an idle backend,
pg_stat_activity.query_start would reflect the start time of its latest
query instead of the time at which it finished the query.  I can see
some use for the current behavior but I don't really think it's worth
the overhead of a gettimeofday() call.

Preliminary tests say that after the shared-memory change I committed
yesterday, the overhead of stats_command_string consists *entirely*
of the two added gettimeofday() calls.  If we get rid of both, the
difference between having stats_command_string on and off is barely
measurable (using Bruce's test case of repeated "SELECT 1;" statements).

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] sync_file_range()

2006-06-19 Thread Florian Weimer
* Simon Riggs:

 Other files are fsynced at checkpoint - always all dirty blocks in the
 whole file.

Optionally, sync_file_range does not block the calling process, so
it's very easy to flush all files at once, which could in theory
reduce seeking overhead.
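
For reference, a sketch of that non-blocking pattern (Linux-only; the flags
are sync_file_range()'s documented flags, error handling omitted):

#define _GNU_SOURCE
#include <fcntl.h>

/* pass 1: queue write-out of every dirty page in the file; returns quickly */
void
start_flush(int fd)
{
    sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
}

/* pass 2: wait until the previously queued write-out has completed */
void
wait_flush(int fd)
{
    sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
                    SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
}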

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Getting rid of extra gettimeofday() calls

2006-06-19 Thread Jim C. Nasby
On Mon, Jun 19, 2006 at 11:17:48AM -0400, Tom Lane wrote:
 instead?  The effect would be that for an idle backend,
 pg_stat_activity.query_start would reflect the start time of its latest
 query instead of the time at which it finished the query.  I can see
 some use for the current behavior but I don't really think it's worth
 the overhead of a gettimeofday() call.

Perhaps make it a compile-time option... I suspect that there's people
making use of that info in their monitoring tools. Though, those people
are probably also likely to have log_duration=true, so maybe the same
trick of gettimeofday() once at statement end and copying it as needed
would work.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] sync_file_range()

2006-06-19 Thread Greg Stark
Simon Riggs [EMAIL PROTECTED] writes:

 On Mon, 2006-06-19 at 15:32 +0800, Qingqing Zhou wrote:
  ITAGAKI Takahiro [EMAIL PROTECTED] wrote
  
  
   I'm interested in it, with which we could improve responsiveness during
   checkpoints. Though it is a Linux-specific system call, we could use
   the combination of mmap() and msync() instead of it; I mean we can use
   mmap only to flush dirty pages, not to read or write pages.
  
  
  Can you specify details? As the TODO item indicates, if we mmap the data file, a
  serious problem is that we don't know when the data pages hit the disks --
  so we may violate the WAL rule.
 
 Can't see where we'd use it.
 
 We fsync the xlog at transaction commit, so only the leading edge needs
 to be synced - would the call help there? Presumably the OS can already
 locate all blocks associated with a particular file fairly quickly
 without doing a full cache scan.

Well, in theory the transaction being committed isn't necessarily the leading
edge; there could be more work from other transactions since the last work
this transaction actually did. However, I can't see that actually helping
performance much, if at all. There can't be much of it, and when writing the
data it doesn't really matter much how much data is written -- what really
matters is rotational and seek latency anyway.


 Other files are fsynced at checkpoint - always all dirty blocks in the
 whole file.

Well, couldn't it be useful for checkpoints if there were some way to know
which buffers had been touched since the last checkpoint? There could be a lot
of buffers dirtied since the checkpoint began, and those don't really need to
be synced, do they?

Or it could be used to control the rate at which the files are checkpointed.

Come to think of it I wonder whether there's anything to be gained by using
smaller files for tables. Instead of 1G files maybe 256M files or something
like that to reduce the hit of fsyncing a file.

-- 
greg


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] sync_file_range()

2006-06-19 Thread Simon Riggs
On Mon, 2006-06-19 at 15:04 -0400, Greg Stark wrote:

  We fsync the xlog at transaction commit, so only the leading edge needs
  to be synced - would the call help there? Presumably the OS can already
  locate all blocks associated with a particular file fairly quickly
  without doing a full cache scan.
 
 Well in theory the transaction being committed isn't necessarily the leading
 edge, there could be more work from other transactions since the last work
 this transaction actually did. 

Near enough.

  Other files are fsynced at checkpoint - always all dirty blocks in the
  whole file.
 
 Well couldn't it be useful for checkpoints if there were some way to know
 which buffers had been touched since the last checkpoint? There could be a lot
 of buffers dirtied since the checkpoint began and those don't really need to
 be synced do they?

Qingqing had a proposal for something like that, but it seemed not worth it
after analysis.

-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[HACKERS] Generic Monitoring Framework Proposal

2006-06-19 Thread Robert Lor


Motivation:
--

The main goal for this Generic Monitoring Framework is to provide a common
interface for adding instrumentation points or probes to Postgres so its
behavior can be easily observed by developers and administrators, even in
production systems. This framework will allow Postgres to use the appropriate
monitoring/tracing facility provided by each OS. For example, Solaris and
FreeBSD will use DTrace, and other OSes can use their respective tools.

What is DTrace?
--

Some of you may have heard about or used DTrace already. In a nutshell, DTrace
is a comprehensive dynamic tracing facility, built into Solaris and FreeBSD
(mostly working), that can be used by administrators and developers on live
production systems to examine the behavior of both user programs and the
operating system.


DTrace can help answer difficult questions about the OS and the application 
itself. For example, you may want to ask:

- Show all functions that get invoked (userland & kernel) and the execution
time when my function foo() is called. Seeing the path a function
takes into the kernel may provide clues for performance tuning.

- Show how many times a particular lock is acquired and how long it's held.
This can help identify contention in the system.

The best way to appreciate DTrace capabilities is by seeing a demo or through 
hands-on experience, and I plan to show some interesting
demos at the PG Summit.

There are a number of docs on DTrace; here's a quick-start doc and a
complete reference guide:
http://www.sun.com/software/solaris/howtoguides/dtracehowto.jsp
http://docs.sun.com/app/docs/doc/817-6223

Here is a recent DTrace for FreeBSD status update:
http://marc.theaimsgroup.com/?l=freebsd-current&m=114854018213275&w=2

Open source apps that provide user-level probes (bottom of page):
http://uadmin.blogspot.com/2006/05/what-is-dtrace.html


Proposed Solution:


This solution is actually quite simple and non-intrusive.

1. Define macros PG_TRACE, PG_TRACE1, etc, in a new header file
called pg_trace.h with multiple #if defined(xxx) sections for Solaris,
FreeBSD, Linux, etc, and add pg_trace.h to c.h which is included in postgres.h
and included by every C file.

The macros will have the following format:

PG_TRACE[n](module_name, probe_name [, arg1, ..., arg5]) 


module_name = Name to identify PG module such as pg_backend, pg_psql, 
pg_plpgsql, etc
probe_name  = Probe name such as transaction_start, lwlock_acquire, etc
arg1..arg5  = Any args to pass to the probe such as txn id, lock id, etc

2. Map PG_TRACE, PG_TRACE1, etc, to macros or functions appropriate for each OS.
For OSes that don't have a suitable tracing facility, just map the macros to
nothing - doing this will not have any effect on performance or
existing behavior.

Sample of pg_trace.h

#if defined(sun) || defined(FreeBSD)

#include <sys/sdt.h>
#define PG_TRACE   DTRACE_PROBE
#define PG_TRACE1  DTRACE_PROBE1
...
#define PG_TRACE5  DTRACE_PROBE5

#elif defined(__linux__) || defined(_AIX) || defined(__sgi) ...

/* Map the macros to no-ops */
#define PG_TRACE(module, name)
#define PG_TRACE1(module, name, arg1)
...
#define PG_TRACE5(module, name, arg1, arg2, arg3, arg4, arg5)

#endif

3. Add any file(s) to support the particular OS tracing facility

4. Update the Makefiles as necessary for each OS

How to add probes:
-

To add a probe, just add a one-line macro in the appropriate location in the
source. Here's an example of two probes, one with no arguments
and the other with 2 arguments:

PG_TRACE(pg_backend, fsync_start);
PG_TRACE2(pg_backend, lwlock_acquire, lockid, mode);

If there are enough probes embedded in PG, its behavior can be easily observed.

With the help of Gavin Sherry, we have added about 20 probes, and Gavin has 
suggested a number of other interesting areas for additional probes.
Pervasive has also added some probes to PG 8.0.4 and posted the patch on 
http://pgfoundry.org/projects/dtrace/.  I hope to combine the probes
using this generic framework for 8.1.4, and make it available for folks to try.

Since my knowledge of the PG source code is limited, I'm looking for assistance
from experts to help identify some new interesting probe points.


How to use probes:


For DTrace, probes can be enabled using a D script. When the probes are not
enabled, there is absolutely no performance hit whatsoever. Here is a simple
example that prints out the LWLock acquisition count for each PG process.


test.d

#!/usr/sbin/dtrace -s

pg_backend*:::lwlock-acquire
{
    @foo[pid] = count();
}

dtrace:::END
{
    printf("\n%10s %15s\n", "PID", "Count");
    printa("%10d %@15d\n", @foo);
}

# ./test.d

       PID           Count
      1438              28
      1447            7240
      1448            9675
      1449           11972


I have a prototype working, so if anyone wants to try it, I can provide a patch 
or give access to my test system.

This is a proposal, so comments, suggestions, and feedback are welcome.

Re: [HACKERS] Getting rid of extra gettimeofday() calls

2006-06-19 Thread Simon Riggs
On Mon, 2006-06-19 at 11:17 -0400, Tom Lane wrote:
 As of CVS tip, PG does up to four separate gettimeofday() calls upon the
 arrival of a new client command.  This is because the statement_timestamp,
 stats_command_string, log_duration, and statement_timeout features each
 independently save an indication of statement start time.  Given what
 we've found out recently about gettimeofday() being unduly expensive on
 some hardware, this cries out to get fixed.  I propose that we do
 SetCurrentStatementStartTimestamp() immediately upon receiving a client
 message, and then make the other features copy that value instead of
 fetching their own.

Yes. Well spotted. That should also make each timed aspect more accurate,
since it's the same value.

Presumably you don't mean *every* client message, just stmt start ones. 

Can we set that in GetTransactionSnapshot() - that way a serializable
transaction won't need to update the time after each statement. We can
then record this as the SetCurrentSnapshotStartTimestamp().

 Another gettimeofday() call that I would like to get rid of is the one
 currently done at the end of statement when stats_command_string is
 enabled: we record current time when resetting the activity_string to
 IDLE.  Would anyone be terribly upset if this used statement_timestamp
 instead?  The effect would be that for an idle backend,
 pg_stat_activity.query_start would reflect the start time of its latest
 query instead of the time at which it finished the query.  I can see
 some use for the current behavior but I don't really think it's worth
 the overhead of a gettimeofday() call.

Presumably we have to do at least one at the end when doing statement
logging?

I notice there is also one in elog.c for when we have %t set. We
probably don't need to do both when statement logging.

 Preliminary tests say that after the shared-memory change I committed
 yesterday, the overhead of stats_command_string consists *entirely*
 of the two added gettimeofday() calls. 
-- 
  Simon Riggs
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Getting rid of extra gettimeofday() calls

2006-06-19 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Presumably you don't mean *every* client message, just stmt start ones. 

At the moment I've got it setting the statement_timestamp on receipt of
any message that could lead to execution of user-defined code; that
includes Query, Parse, Bind, Execute, FunctionCall.  Possibly we could
dispense with the Bind one but I'm unconvinced.

 Can we set that in GetTransactionSnapshot() - that way a serializable
 transaction won't need to update the time after each statement.

No, that's much too late, unless you want to do major rearrangement of
the times at which reporting actions occur.  Furthermore the entire
point of statement_timestamp is that it advances for new commands within
the same xact, so your proposal amounts to removing statement_timestamp
entirely.

The actual behavior of CVS tip is that transaction_timestamp copies from
statement_timestamp, not vice versa; that seems fine to me.

 Presumably we have to do at least one at the end when doing statement
 logging?

Only if you've got log_duration on.  Per Jim's suggestion, maybe we
could have the IDLE activity report advance activity_timestamp only
if log_duration is true, ie, only if it's free to do so.

 I notice there is also one in elog.c for when we have %t set. We
 probably don't need to do both when statement logging.

I'm inclined to think that that one is worth its keep.  Sometimes you
really wanna know exactly when a log message was emitted ...

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Generic Monitoring Framework Proposal

2006-06-19 Thread Tom Lane
Robert Lor [EMAIL PROTECTED] writes:
 The main goal for this Generic Monitoring Framework is to provide a
 common interface for adding instrumentation points or probes to
 Postgres so its behavior can be easily observed by developers and
 administrators even in production systems.

What is the overhead of a probe when you're not using it?  The answer
had better not include the phrase "kernel call", or this is unlikely to
pass muster...

 For DTrace, probes can be enabled using a D script. When the probes are not 
 enabled, there is absolutely no performance hit whatsoever. 

If you believe that, I have a bridge in Brooklyn you might be interested
in.

What are the criteria going to be for where to put probe calls?  If it
has to be hard-wired into the source code, I foresee a lot of contention
about which probes are worth their overhead, because we'll need
one-size-fits-all answers.

 arg1..arg5  = Any args to pass to the probe such as txn id, lock id, etc

Where is the data type of a probe argument defined?

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] I was offline

2006-06-19 Thread Alvaro Herrera
Hi, a quickie:

I was offline last week due to my ADSL line going down, so I was unable
to follow the discussions closely.  I'll be back at the
non-transactional catalogs and relminxid discussions later (hopefully
tomorrow or on wednesday).

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Getting rid of extra gettimeofday() calls

2006-06-19 Thread Hannu Krosing
On Mon, 2006-06-19 at 11:17, Tom Lane wrote:
 As of CVS tip, PG does up to four separate gettimeofday() calls upon the
 arrival of a new client command.  This is because the statement_timestamp,
 stats_command_string, log_duration, and statement_timeout features each
 independently save an indication of statement start time.  Given what
 we've found out recently about gettimeofday() being unduly expensive on
 some hardware, this cries out to get fixed.  I propose that we do
 SetCurrentStatementStartTimestamp() immediately upon receiving a client
 message, and then make the other features copy that value instead of
 fetching their own.
 
 Another gettimeofday() call that I would like to get rid of is the one
 currently done at the end of statement when stats_command_string is
 enabled: we record current time when resetting the activity_string to
 <IDLE>. 

Is it just <IDLE> or also <IDLE> in transaction?

If we are going to change things anyway, I'd like the latter to show the
time since start of transaction, so that I would at least have an easy
way to write a transaction timeout script :)


I don't really care about what plain <IDLE> uses.
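
(As an illustration, a hedged sketch of such a timeout check -- assuming
query_start were made to track transaction start, and using the 8.x
pg_stat_activity columns; the threshold is arbitrary:)

SELECT procpid, now() - query_start AS idle_for
FROM pg_stat_activity
WHERE current_query = '<IDLE> in transaction'
  AND now() - query_start > interval '10 minutes';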

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Generic Monitoring Framework Proposal

2006-06-19 Thread Theo Schlossnagle


On Jun 19, 2006, at 4:40 PM, Tom Lane wrote:


Robert Lor [EMAIL PROTECTED] writes:

The main goal for this Generic Monitoring Framework is to provide a
common interface for adding instrumentation points or probes to
Postgres so its behavior can be easily observed by developers and
administrators even in production systems.


What is the overhead of a probe when you're not using it?  The answer
had better not include the phrase "kernel call", or this is unlikely to
pass muster...

For DTrace, probes can be enabled using a D script. When the probes
are not enabled, there is absolutely no performance hit whatsoever.


If you believe that, I have a bridge in Brooklyn you might be
interested in.


Heh.  Syscall probes and FBT probes in DTrace have zero overhead.
User-space probes do have overhead, but it is only a few instructions
(two I think).  Basically, the probe points are replaced by illegal
instructions and the kernel infrastructure for DTrace will fasttrap
the ops and then act.  So, it is tiny, tiny overhead.  Little enough
that it isn't unreasonable to instrument things like s_lock, which are
tiny.



What are the criteria going to be for where to put probe calls?  If it
has to be hard-wired into the source code, I foresee a lot of contention
about which probes are worth their overhead, because we'll need
one-size-fits-all answers.

arg1..arg5  = Any args to pass to the probe such as txn id, lock id, etc

Where is the data type of a probe argument defined?


I assume it would depend on the probe implementation.  In DTrace they
are implemented in .d files that will post-instrument the object
before final linkage.  DTrace's whole purpose is to be low overhead,
and it really does it in a fantastic way.


As an example, you can take an uninstrumented binary and add dynamic  
instrumentation to the entry, exit and every instruction op-code over  
every single routine in the process.  And clearly, as the binary is  
uninstrumented, the overhead is indeed zero when the probes are not  
enabled.


The reason that Robert proposes user-space probes (I assume) is that  
tracing C functions can be too granular and not conveniently expose  
the right information to make tracing useful.


// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
// Ecelerity: Run with it.



---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Generic Monitoring Framework Proposal

2006-06-19 Thread Robert Lor

Tom Lane wrote:


Robert Lor [EMAIL PROTECTED] writes:
 


The main goal for this Generic Monitoring Framework is to provide a
common interface for adding instrumentation points or probes to
Postgres so its behavior can be easily observed by developers and
administrators even in production systems.
   



What is the overhead of a probe when you're not using it?  The answer
had better not include the phrase kernel call, or this is unlikely to
pass muster...
 

Here's what the DTrace developers have to say in their Usenix paper:
"When not explicitly enabled, DTrace has zero probe effect - the system
operates exactly as if DTrace were not present at all."


http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf

The technical details are beyond me, so I can't tell you exactly what 
happens internally. I can find out if you're interested!


 

For DTrace, probes can be enabled using a D script. When the probes are not enabled, there is absolutely no performance hit whatsoever. 
   



If you believe that, I have a bridge in Brooklyn you might be interested
in.

What are the criteria going to be for where to put probe calls?  If it
has to be hard-wired into the source code, I foresee a lot of contention
about which probes are worth their overhead, because we'll need
one-size-fits-all answers.

 

I think we need to be selective in terms of which probes to add, since
we don't want to scatter them all over the source files. For DTrace,
the overhead is very minimal, but you're right, other implementations of
the same probe may have more perf overhead.



arg1..arg5  = Any args to pass to the probe such as txn id, lock id, etc
   


Where is the data type of a probe argument defined?
 


It's in a .d file which looks like below:

provider pg_backend {

    probe fsync__start(void);
    probe fsync__end(void);
    probe lwlock__acquire(int, int);
    probe lwlock__release(int);
    ...

};


Regards,
Robert


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Generic Monitoring Framework Proposal

2006-06-19 Thread Jim C. Nasby
On Mon, Jun 19, 2006 at 05:20:31PM -0400, Theo Schlossnagle wrote:
 Heh.  Syscall probes and FBT probes in Dtrace have zero overhead.   
 User-space probes do have overhead, but it is only a few instructions  
 (two I think).  Basically, the probe points are replaced by illegal
 instructions and the kernel infrastructure for Dtrace will fasttrap  
 the ops and then act.  So, it is tiny tiny overhead.  Little enough  
 that it isn't unreasonable to instrument things like s_lock which are  
 tiny.

If someone wanted to, they should be able to do benchmarking with the
DTrace patches on pgFoundry to see the overhead of just having the
probes in, and then having the probes in and actually using them. If you
*really* want to see the difference, add a probe in s_lock. :)
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Generic Monitoring Framework Proposal

2006-06-19 Thread Chris Browne
[EMAIL PROTECTED] (Robert Lor) writes:
 For DTrace, probes can be enabled using a D script. When the probes
 are not enabled, there is absolutely no performance hit whatsoever.

That seems inconceivable.

In order to have a way of deciding whether or not the probes are
enabled, there has *got* to be at least one instruction executed, and
that can't be costless.
-- 
output = reverse(gro.mca @ enworbbc)
http://www.ntlug.org/~cbbrowne/wp.html
...while   I   know  many   people   who   emphatically  believe   in
reincarnation, I have  never met or read one  who could satisfactorily
explain population growth. -- Spider Robinson

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Generic Monitoring Framework Proposal

2006-06-19 Thread Robert Lor

Theo Schlossnagle wrote:



Heh.  Syscall probes and FBT probes in Dtrace have zero overhead.   
User-space probes do have overhead, but it is only a few instructions  
(two I think).  Basically, the probe points are replaced by illegal
instructions and the kernel infrastructure for Dtrace will fasttrap  
the ops and then act.  So, it is tiny tiny overhead.  Little enough  
that it isn't unreasonable to instrument things like s_lock which are  
tiny.


Theo, you're a genius. FBT (function boundary tracing) probes have zero
overhead (section 4.1) and user-space probes have a two-instruction
overhead (section 4.2). I was incorrect in making a general zero-overhead
statement.  But it's so close to zero :-)


http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf



The reason that Robert proposes user-space probes (I assume) is that  
tracing C functions can be too granular and not conveniently expose  
the right information to make tracing useful.


Yes, I'm proposing user-space probes (aka User Statically-Defined
Tracing - USDT). USDT provides a high-level abstraction so the
application can expose well-defined probes without the user having to
know the detailed implementation.  For example, instead of having to
know the function LWLockAcquire(), a well-documented probe called
lwlock_acquire with the appropriate args is much more usable.


Regards,
Robert


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


[HACKERS] CVS HEAD busted on Windows?

2006-06-19 Thread Tom Lane
I notice buildfarm member snake is unhappy:

The program "postgres" is needed by initdb but was not found in the
same directory as
"C:/msys/1.0/local/build-farm/HEAD/pgsql.696/src/test/regress/tmp_check/install/usr/local/build-farm/HEAD/inst/bin/initdb.exe".
Check your installation.

I'm betting Peter's recent patch to merge postmaster and postgres missed
some little thing or other ...

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Generic Monitoring Framework Proposal

2006-06-19 Thread Theo Schlossnagle


On Jun 19, 2006, at 6:41 PM, Robert Lor wrote:


Theo Schlossnagle wrote:



Heh.  Syscall probes and FBT probes in DTrace have zero overhead.
User-space probes do have overhead, but it is only a few instructions
(two I think).  Basically, the probe points are replaced by illegal
instructions and the kernel infrastructure for DTrace will fasttrap
the ops and then act.  So, it is tiny, tiny overhead.  Little enough
that it isn't unreasonable to instrument things like s_lock, which are
tiny.


Theo, you're a genius. FBT (function boundary tracing) probes have
zero overhead (section 4.1) and user-space probes have a two-instruction
overhead (section 4.2). I was incorrect in making a general
zero-overhead statement.  But it's so close to zero :-)


http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf



The reason that Robert proposes user-space probes (I assume) is that
tracing C functions can be too granular and not conveniently expose
the right information to make tracing useful.


Yes, I'm proposing user-space probes (aka User Statically-Defined
Tracing - USDT). USDT provides a high-level abstraction so the
application can expose well-defined probes without the user having
to know the detailed implementation.  For example, instead of
having to know the function LWLockAcquire(), a well-documented
probe called lwlock_acquire with the appropriate args is much more
usable.


I am giving a talk at OSCON this year about PostgreSQL on big
systems.  "Big" is all relative, but I will be talking about DTrace a
bit and the advantages of running PostgreSQL on Solaris, which is what
we ended up doing after some extremely disturbing experiences on
Linux.  I was able to track a very acute memory leak in pl/perl
(which Neil so kindly fixed) within a few moments -- and this is
without explicit user-space trace points.  If there were good
user-space points, I likely wouldn't have had to dig in the source as a
precursor to my DTrace efforts.


The things you might be able to do with user-specific trace points:
  o better understand the block scatter (distance of block-level
    reads) for a specific query.
  o understand lock contention in vastly multiprocessor systems
    using plockstat (my hunch is that heavy-weight locks might be better);
    our current box is a 4-way Opteron, but we have a 16-way T2000
    as well.
  o report on queries, including turn-around time, block accesses, and
    lock acquisitions, grouped by query for specific time windows.


The nice thing about DTrace is that it requires no prep to look at
a problem.  When something is acting odd in production, you don't
want to attempt to repeat it in a test environment first.  You want
to observe it.  DTrace allows you to dig in really deep in
production with an acceptable performance penalty and ask questions
that couldn't be asked before.  It is exceptionally clever stuff.  Of
all the new neat stuff in Solaris 10, it has my vote for coolest
and most useful.  I've nailed several production problems (outside
of Postgres) using DTrace with accuracy and efficiency.  When Solaris
10u2 is released, we'll be trying Postgres on ZFS, so my rankings may
change :-)


The idea of having intelligently placed DTrace probes in Postgres
would allow us to deal with Postgres as a first-class app on
Solaris 10 with respect to troubleshooting obtuse production
problems.  That, to me, is exciting stuff.


Best regards,

Theo

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
// Ecelerity: Run with it.



---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Generic Monitoring Framework Proposal

2006-06-19 Thread Mark Kirkwood

Jim C. Nasby wrote:

On Mon, Jun 19, 2006 at 05:20:31PM -0400, Theo Schlossnagle wrote:
Heh.  Syscall probes and FBT probes in Dtrace have zero overhead.   
User-space probes do have overhead, but it is only a few instructions  
(two I think).  Besically, the probe points are replaced by illegal  
instructions and the kernel infrastructure for Dtrace will fasttrap  
the ops and then act.  So, it is tiny tiny overhead.  Little enough  
that it isn't unreasonable to instrument things like s_lock which are  
tiny.


If someone wanted to, they should be able to do benchmarking with the
DTrace patches on pgFoundry to see the overhead of just having the
probes in, and then having the probes in and actually using them. If you
*really* want to see the difference, add a probe in s_lock. :)


We will need to benchmark on FreeBSD to see if those comments about 
overhead stand up to scrutiny there too.


I would think that even if (for instance) we find that there is no 
overhead on Solaris, those of us on platforms where DTrace is less 
mature would want the option of building without any probes at all in 
the code - I guess a configure option --without-dtrace on by default 
on those platforms would do it.


regards

Mark


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Generic Monitoring Framework Proposal

2006-06-19 Thread Theo Schlossnagle


On Jun 19, 2006, at 7:39 PM, Mark Kirkwood wrote:
We will need to benchmark on FreeBSD to see if those comments about  
overhead stand up to scrutiny there too.


I've followed the development of DTrace on FreeBSD and the design  
approach is mostly identical to the Solaris one.  This would mean  
that if there is overhead on FreeBSD not present on Solaris it would  
be considered a big and likely fixed.


I would think that even if (for instance) we find that there is no  
overhead on Solaris, those of us on platforms where DTrace is less  
mature would want the option of building without any probes at all  
in the code - I guess a configure option --without-dtrace on by  
default on those platforms would do it.


Absolutely.  As they are all proposed as preprocessor macros, this  
would be trivial to accomplish.



// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/
// Ecelerity: Run with it.



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] sync_file_range()

2006-06-19 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 Come to think of it I wonder whether there's anything to be gained by using
 smaller files for tables. Instead of 1G files maybe 256M files or something
 like that to reduce the hit of fsyncing a file.

Actually probably not.  The weak part of our current approach is that we
tell the kernel sync this file, then sync that file, etc, in a more
or less random order.  This leads to a probably non-optimal sequence of
disk accesses to complete a checkpoint.  What we would really like is a
way to tell the kernel sync all these files, and let me know when
you're done --- then the kernel and hardware have some shot at
scheduling all the writes in an intelligent fashion.

sync_file_range() is not that exactly, but since it lets you request
syncing and then go back and wait for the syncs later, we could get the
desired effect with two passes over the file list.  (If the file list
is longer than our allowed number of open files, though, the extra
opens/closes could hurt.)

Smaller files would make the I/O scheduling problem worse not better.
Indeed, I've been wondering lately if we shouldn't resurrect
LET_OS_MANAGE_FILESIZE and make that the default on systems with
largefile support.  If nothing else it would cut down on open/close
overhead on very large relations.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] shall we have a TRACE_MEMORY mode

2006-06-19 Thread Qingqing Zhou
As I follow Relyea Mike's recent post of possible memory leak, I think that
we are lack of a good way of identifing memory usage. Maybe we should also
remember __FILE__, __LINE__ etc for better memory usage diagnose when
TRACE_MEMORY is on?

Regards,
Qingqing



---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] PAM auth

2006-06-19 Thread satoshi nagayasu
Hi folks,

I'm trying to use PAM auth on PostgreSQL, but I still cannot
get success on PAM auth (with PG813 and RHEL3).

pg_hba.conf has
 hostpamtest all 0.0.0.0/0 pam

/etc/pam.d/postgresql is
 #%PAM-1.0
 auth   required pam_stack.so service=system-auth
 accountrequired pam_stack.so service=system-auth
 password   required pam_stack.so service=system-auth

And I've changed user password with ALTER USER ... PASSWORD.

However, my postmaster always denies my login.
-
% /usr/local/pgsql813/bin/psql -h localhost -W -U hoge pamtest
Password for user hoge:
LOG:  pam_authenticate failed: Authentication failure
FATAL:  PAM authentication failed for user hoge
psql: FATAL:  PAM authentication failed for user hoge
-
What's wrong with that?

BTW, I found an empty password () is passed to CheckPAMAuth()
function in auth.c.
-
#ifdef USE_PAM
case uaPAM:
pam_port_cludge = port;
status = CheckPAMAuth(port, port-user_name, );
break;
#endif   /* USE_PAM */
-
/*
 * Check authentication against PAM.
 */
static int
CheckPAMAuth(Port *port, char *user, char *password)
{
int retval;
pam_handle_t *pamh = NULL;

/*
 * Apparently, Solaris 2.6 is broken, and needs ugly static variable
 * workaround
 */
pam_passwd = password;

/*
 * Set the application data portion of the conversation struct This is
 * later used inside the PAM conversation to pass the password to the
 * authentication module.
 */
pam_passw_conv.appdata_ptr = (char *) password; /* from password above,
 * not allocated */
-
What does it mean? I'm not familiar with PAM, so I can't get
why the password can be empty here.

Any suggestion?

Thanks.
-- 
NAGAYASU Satoshi [EMAIL PROTECTED]

---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] checking on buildfarm member thrush

2006-06-19 Thread Tom Lane
I'm trying to determine why thrush has been failing on PG CVS HEAD for
the past few days.  Could you try running the attached program on that
machine, and see what it prints?  I suspect it will dump core :-(

Note: you might need to use -D_GNU_SOURCE to get it to compile at all.

regards, tom lane


#include stdio.h
#include stdlib.h
#include string.h
#include errno.h
#include fcntl.h

int
main()
{
if (posix_fadvise(fileno(stdin), 0, 0, POSIX_FADV_DONTNEED))
printf(failed: %s\n, strerror(errno));
else
printf(OK\n);
return 0;
}

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] shall we have a TRACE_MEMORY mode

2006-06-19 Thread Alvaro Herrera
Qingqing Zhou wrote:
 As I follow Relyea Mike's recent post of possible memory leak, I think that
 we are lack of a good way of identifing memory usage. Maybe we should also
 remember __FILE__, __LINE__ etc for better memory usage diagnose when
 TRACE_MEMORY is on?

Hmm, this would have been a great help to me not long ago, so I'd say it
would be nice to have.

About the exact form we'd give the feature: maybe write each
allocation/freeing to a per-backend file, say /tmp/pgmem.pid.  Also
memory context creation, destruction, reset.  Having the __FILE__ and
__LINE__ on each operation would be a good tracing tool as well.  Then
it's easy to write Perl tools to find specific problems.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] shall we have a TRACE_MEMORY mode

2006-06-19 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 About the exact form we'd give the feature: maybe write each
 allocation/freeing to a per-backend file, say /tmp/pgmem.pid.  Also
 memory context creation, destruction, reset.  Having the __FILE__ and
 __LINE__ on each operation would be a good tracing tool as well.  Then
 it's easy to write Perl tools to find specific problems.

That seems mostly the hard way to me, because our memory management
scheme is *not* based around thou shalt free() what thou malloc()ed.
You'd need a tool that understood about resetting memory contexts
(recursively) to get anywhere at all in analyzing such a trace.

I've had some success in the past with debugging memory leaks by
trawling through the oversized memory contexts with gdb x and
trying to understand what the bulk of the data was.  This is certainly
pretty painful though.

One idea that comes to mind is to have a compile time option to record
the palloc __FILE__ and _LINE__ in every AllocChunk header.  Then it
would not be so hard to identify the culprit while trawling through
memory.  The overhead costs would be so high that you'd never turn it on
by default though :-(

Another thing to consider is that the proximate location of the palloc
is frequently *not* very useful.  For instance, if your memory is
getting eaten by lists, all the palloc traces will point at
new_tail_cell().  Not much help.  I don't know what to do about that
... any ideas?

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings