Re: [HACKERS] Windows buildfarm failures

2007-01-20 Thread Stefan Kaltenbrunner
Alvaro Herrera wrote:
 Tom Lane wrote:
  Alvaro Herrera [EMAIL PROTECTED] writes:
   Now, if some Windows-enabled person could step forward so that we can
   suggest some tests to run, that would be great.  Perhaps the solution to
   the problem is to relax the conditions a little, so that two scans are
   accepted on that table instead of only one; but it would be good to
   confirm whether the stat system is really working and it's really still
   counting stuff as it's supposed to do.
  
  No, you misread it: the check is for at least one new event, not exactly
  one.
 
 Doh :-(
 
  We've been seeing this intermittently for a long time, but it sure seems
  that autovac has raised the probability greatly.  That's pretty odd.
  If it's a timing thing, why are all and only the Windows machines
  affected?  Could it be that autovac is sucking all the spare cycles
  and keeping the stats collector from running?
 
 Hmm, that could explain it, but it's strange that only Windows machines
 are affected.  Maybe it's a scheduler issue, and the Unix machines are
 able to let pgstat do some work but the Windows ones are not.

maybe not only windows boxes:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=zebra&dt=2007-01-20%2015:25:05


Stefan

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Windows buildfarm failures

2007-01-20 Thread Tom Lane
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 Alvaro Herrera wrote:
  Hmm, that could explain it, but it's strange that only Windows machines
  are affected.  Maybe it's a scheduler issue, and the Unix machines are
  able to let pgstat do some work but the Windows ones are not.

 maybe not only windows boxes:
 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=zebra&dt=2007-01-20%2015:25:05

That one's interesting because only the first of the two queries failed.
I suppose that must mean that the stats file did update, but between
those two queries.

Maybe we just need to lengthen the sleep() even more?

regards, tom lane



Re: [HACKERS] Windows buildfarm failures

2007-01-20 Thread Tom Lane
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 maybe not only windows boxes:
 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=zebra&dt=2007-01-20%2015:25:05

Wow, I just saw the stats failure on my own machine, for the first time
ever.  Conclusions:
1. Enabling autovac has definitely raised the probability of failure.
2. It's not Windows-only, but the probability of failure is much higher
on Windows.

Not sure what that tells us, though ...

regards, tom lane



Re: [HACKERS] Windows buildfarm failures

2007-01-19 Thread Alvaro Herrera
Tom Lane wrote:

 I noticed today on my own machine several strange pauses while running
 the serial regression tests --- the machine didn't seem to be hitting
 the disk nor sucking lots of CPU, it just sat there for several seconds
 and then picked up again.  I wonder if that's related.  It sure seems it
 must be due to autovac being on now.

The only pauses I see are in the stats and the prepared_xacts
tests.  The latter is due to a test that uses statement_timeout to
detect a lock, and the stats test does a pg_sleep(2.0) call.

Do those explain what you are seeing?

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] Windows buildfarm failures

2007-01-19 Thread Stefan Kaltenbrunner
Alvaro Herrera wrote:
 Alvaro Herrera wrote:
  Stefan Kaltenbrunner wrote:
   yeah - looks like it's the autovacuum change - snake is now passing the
   numeric-test but still fails the stats one ...
  
  Interesting -- both yak and snake are failing in a very similar way.
  I'll investigate it tomorrow if no one beats me to it.
 
 All our Windows buildfarm machines are failing.  AFAICT, the first
 failure was on Yak, 
 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=yak&dt=2007-01-16%2021:55:20
 
 and the last successful run just before that seems to come from Snake,
 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-16%2014:30:00
 
 The only changes that went in during that period are the patch that enabled
 autovacuum by default, an information_schema fix and a TODO file change.
 The only one that could cause this problem seems to be the autovacuum enable
 bit.

I think this one:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bear&dt=2007-01-19%2006:06:02

is fallout from the autovacuum changes too - it seems that initdb is
picking a low value (20) for max_connections on that box, and autovacuum
is acting as an additional client that causes the maximum number of
allowed connections to be exceeded during the parallel tests, therefore
resulting in the failure.


Stefan



Re: [HACKERS] Windows buildfarm failures

2007-01-19 Thread Andrew Dunstan

Stefan Kaltenbrunner wrote:
 Alvaro Herrera wrote:
  Alvaro Herrera wrote:
   Stefan Kaltenbrunner wrote:
    yeah - looks like it's the autovacuum change - snake is now passing the
    numeric-test but still fails the stats one ...
   
   Interesting -- both yak and snake are failing in a very similar way.
   I'll investigate it tomorrow if no one beats me to it.
  
  All our Windows buildfarm machines are failing.  AFAICT, the first
  failure was on Yak, 
  http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=yak&dt=2007-01-16%2021:55:20
  
  and the last successful run just before that seems to come from Snake,
  http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-16%2014:30:00
  
  The only changes that went in during that period are the patch that enabled
  autovacuum by default, an information_schema fix and a TODO file change.
  The only one that could cause this problem seems to be the autovacuum enable
  bit.
 
 I think this one:
 
 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bear&dt=2007-01-19%2006:06:02
 
 is fallout from the autovacuum changes too - it seems that initdb is
 picking a low value (20) for max_connections on that box, and autovacuum
 is acting as an additional client that causes the maximum number of
 allowed connections to be exceeded during the parallel tests, therefore
 resulting in the failure.

If so, that's a case of driver error, I think. The buildfarm member 
should set MAX_CONNECTIONS => '10' or similar in the build_env stanza of 
the config file.


cheers

andrew




Re: [HACKERS] Windows buildfarm failures

2007-01-19 Thread Alvaro Herrera
Stefan Kaltenbrunner wrote:
 Alvaro Herrera wrote:

  All our Windows buildfarm machines are failing.  AFAICT, the first
  failure was on Yak, 
  http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=yak&dt=2007-01-16%2021:55:20
  
  and the last successful run just before that seems to come from Snake,
  http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-16%2014:30:00
 
 I think this one:
 
 http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bear&dt=2007-01-19%2006:06:02
 
 is fallout from the autovacuum changes too - it seems that initdb is
 picking a low value (20) for max_connections on that box, and autovacuum
 is acting as an additional client that causes the maximum number of
 allowed connections to be exceeded during the parallel tests, therefore
 resulting in the failure.

Sorry, I forgot to mention that I specifically skipped those errors not
directly related to the problem at hand.  This problem is clearly
something else (as well as the Mac OS X failures due to readline
misconfiguration, the ECPG-check failures, etc).  I concur with Andrew's
suggestion that it's really pilot error.

Maybe what we really ought to do is pick an internal max_connections
value that exceeds what the max_connections GUC parameter says, adjusting
per autovacuum configuration.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.



Re: [HACKERS] Windows buildfarm failures

2007-01-19 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 I noticed today on my own machine several strange pauses while running
 the serial regression tests ---

 Do those explain what you are seeing?

No, those are expected.  I'm having a hard time reproducing the behavior
right now, but IIRC the delays were in the vacuum and/or sanity_check
tests.  It's not unlikely that the foreground VACUUM was blocking on a
lock while autovac did the same work, except that that doesn't explain
the length of the pause, nor the lack of disk activity.

But I can't make it happen right now, so never mind until I figure out
how to reproduce it ...

regards, tom lane



Re: [HACKERS] Windows buildfarm failures

2007-01-19 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Maybe what we really ought to do is pick an internal max_connections
 value that exceeds what the max_connections GUC parameter says, adjusting
 per autovacuum configuration.

That's just cosmetic; it doesn't address the real issue, which is that
if SHMMAX or other kernel settings are too small, initdb will pick a
max_connections too low to allow the parallel regression tests to run.

The fact that the regression tests try to exercise 20 concurrent
sessions by default isn't just an accident; the thought was that if you
had a configuration too small to allow a reasonable number of concurrent
sessions, the tests ought to point it out to you.  (Indeed, these days
we probably oughta try to exercise more than 20 sessions.)

But this is somewhat in conflict with our desire that buildfarm members
not fall over for random reasons --- and we've seen it happen more than
once that a test run's initdb picks a smaller-than-normal
max_connections because of transient system loads.

Perhaps we could extend pg_regress to allow --max-connections=auto
which would instruct it to set its connection limit to the server's
actual max_connections minus superuser reserved slots (and probably
minus a couple more to allow for backend shutdown time etc).  Then the
buildfarm could use that, while we'd leave the behavior alone for normal
manual regression tests.
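
The arithmetic behind such an "auto" setting could be sketched roughly as follows. This is only an illustration: --max-connections=auto is a proposal in this message, not an existing pg_regress option, and the function name, the slack of 2 and the example values are all assumptions.

```python
def auto_connection_limit(max_connections: int,
                          superuser_reserved: int,
                          slack: int = 2) -> int:
    """Hypothetical helper: how many sessions a test driver could open.

    Subtracts the superuser-reserved slots and a small slack (assumed to
    be 2 here) to leave room for backends that are still shutting down.
    """
    limit = max_connections - superuser_reserved - slack
    # Never go below one session, or no test could run at all.
    return max(limit, 1)


# Assumed example: initdb fell back to max_connections = 20 with
# 3 slots reserved for superusers.
print(auto_connection_limit(20, 3))   # 15
print(auto_connection_limit(100, 3))  # 95
```

With the values assumed above, a server that came up with max_connections = 20 would run at most 15 concurrent test sessions instead of the default 20.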

regards, tom lane



Re: [HACKERS] Windows buildfarm failures

2007-01-19 Thread Andrew Dunstan

Tom Lane wrote:
 Perhaps we could extend pg_regress to allow --max-connections=auto
 which would instruct it to set its connection limit to the server's
 actual max_connections minus superuser reserved slots (and probably
 minus a couple more to allow for backend shutdown time etc).  Then the
 buildfarm could use that, while we'd leave the behavior alone for normal
 manual regression tests.

This seems needlessly complex. We can tolerate occasional intermittent 
failures on buildfarm, and if they are persistent there is already a 
configurable rate limiting mechanism available.


cheers

andrew




[HACKERS] Windows buildfarm failures

2007-01-18 Thread Alvaro Herrera
Alvaro Herrera wrote:
 Stefan Kaltenbrunner wrote:

  yeah - looks like it's the autovacuum change - snake is now passing the
  numeric-test but still fails the stats one ...
 
 Interesting -- both yak and snake are failing in a very similar way.
 I'll investigate it tomorrow if no one beats me to it.

All our Windows buildfarm machines are failing.  AFAICT, the first
failure was on Yak, 
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=yak&dt=2007-01-16%2021:55:20

and the last successful run just before that seems to come from Snake,
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snake&dt=2007-01-16%2014:30:00

The only changes that went in during that period are the patch that enabled
autovacuum by default, an information_schema fix and a TODO file change.
The only one that could cause this problem seems to be the autovacuum enable
bit.

The failures are all exactly alike:

*** ./expected/stats.out Thu Jan 18 08:48:12 2007
--- ./results/stats.out Thu Jan 18 09:02:53 2007
***************
*** 51,57 ****
   WHERE st.relname='tenk2' AND cl.relname='tenk2';
   ?column? | ?column? | ?column? | ?column? 
  ----------+----------+----------+----------
!  t        | t        | t        | t
  (1 row)
  
  SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
--- 51,57 ----
   WHERE st.relname='tenk2' AND cl.relname='tenk2';
   ?column? | ?column? | ?column? | ?column? 
  ----------+----------+----------+----------
!  f        | f        | f        | f
  (1 row)
  
  SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
***************
*** 60,66 ****
   WHERE st.relname='tenk2' AND cl.relname='tenk2';
   ?column? | ?column? 
  ----------+----------
!  t        | t
  (1 row)
  
  -- End of Stats Test
--- 60,66 ----
   WHERE st.relname='tenk2' AND cl.relname='tenk2';
   ?column? | ?column? 
  ----------+----------
!  f        | f
  (1 row)
  
  -- End of Stats Test


The full failing queries are these:

-- check effects
SELECT st.seq_scan >= pr.seq_scan + 1,
       st.seq_tup_read >= pr.seq_tup_read + cl.reltuples,
       st.idx_scan >= pr.idx_scan + 1,
       st.idx_tup_fetch >= pr.idx_tup_fetch + 1
  FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats AS pr
 WHERE st.relname='tenk2' AND cl.relname='tenk2';
 ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------
 t        | t        | t        | t
(1 row)

SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
       st.idx_blks_read + st.idx_blks_hit >= pr.idx_blks + 1
  FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
 WHERE st.relname='tenk2' AND cl.relname='tenk2';
 ?column? | ?column? 
----------+----------
 t        | t
(1 row)

The six booleans are false on Windows.

What could be the reason for this change?  The only thing that occurs to
me is that autovacuum fires just as that test is running, processes that
table, and increments the counters before the final SQL is run.

Now, if some Windows-enabled person could step forward so that we can
suggest some tests to run, that would be great.  Perhaps the solution to
the problem is to relax the conditions a little, so that two scans are
accepted on that table instead of only one; but it would be good to
confirm whether the stat system is really working and it's really still
counting stuff as it's supposed to do.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Windows buildfarm failures

2007-01-18 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Now, if some Windows-enabled person could step forward so that we can
 suggest some tests to run, that would be great.  Perhaps the solution to
 the problem is to relax the conditions a little, so that two scans are
 accepted on that table instead of only one; but it would be good to
 confirm whether the stat system is really working and it's really still
 counting stuff as it's supposed to do.

No, you misread it: the check is for at least one new event, not exactly
one.
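
To spell out the difference, here is a minimal sketch of an "at least one new event" comparison. The helper name and the sample counter values are made up for illustration; this is not the regression test itself.

```python
def counter_advanced(before: int, after: int, minimum: int = 1) -> bool:
    """True if the counter grew by at least `minimum` events."""
    return after >= before + minimum


prev_seq_scan = 10  # assumed snapshot taken before the test queries ran

print(counter_advanced(prev_seq_scan, 11))  # True: exactly one new scan
print(counter_advanced(prev_seq_scan, 12))  # True: an extra scan from any other source still passes
print(counter_advanced(prev_seq_scan, 10))  # False: the counter has not moved yet
```

Under such a check, an extra scan cannot turn the result false; only a counter that has not advanced at all can.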

We've been seeing this intermittently for a long time, but it sure seems
that autovac has raised the probability greatly.  That's pretty odd.
If it's a timing thing, why are all and only the Windows machines
affected?  Could it be that autovac is sucking all the spare cycles
and keeping the stats collector from running?  (Does autovac use
vacuum_cost_delay by default?  It probably should if not.)

I noticed today on my own machine several strange pauses while running
the serial regression tests --- the machine didn't seem to be hitting
the disk nor sucking lots of CPU, it just sat there for several seconds
and then picked up again.  I wonder if that's related.  It sure seems it
must be due to autovac being on now.

regards, tom lane



Re: [HACKERS] Windows buildfarm failures

2007-01-18 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Now, if some Windows-enabled person could step forward so that we can
  suggest some tests to run, that would be great.  Perhaps the solution to
  the problem is to relax the conditions a little, so that two scans are
  accepted on that table instead of only one; but it would be good to
  confirm whether the stat system is really working and it's really still
  counting stuff as it's supposed to do.
 
 No, you misread it: the check is for at least one new event, not exactly
 one.

Doh :-(

 We've been seeing this intermittently for a long time, but it sure seems
 that autovac has raised the probability greatly.  That's pretty odd.
 If it's a timing thing, why are all and only the Windows machines
 affected?  Could it be that autovac is sucking all the spare cycles
 and keeping the stats collector from running?

Hmm, that could explain it, but it's strange that only Windows machines
are affected.  Maybe it's a scheduler issue, and the Unix machines are
able to let pgstat do some work but the Windows ones are not.

 (Does autovac use vacuum_cost_delay by default?  It probably should if
 not.)

The default autovacuum_vacuum_cost_delay is -1, which means use the
system default, which in turn is 0.  So it's off by default.
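
The fallback just described can be sketched like this. It only illustrates the rule stated above; the function name is hypothetical and this is not the server's code.

```python
def effective_cost_delay(autovacuum_vacuum_cost_delay: int,
                         vacuum_cost_delay: int = 0) -> int:
    """Resolve the cost delay autovacuum would use, in milliseconds."""
    if autovacuum_vacuum_cost_delay < 0:
        # -1 means "fall back to the system-wide vacuum_cost_delay".
        return vacuum_cost_delay
    return autovacuum_vacuum_cost_delay


print(effective_cost_delay(-1))      # 0: both at their defaults, so no delay
print(effective_cost_delay(-1, 10))  # 10: inherited from vacuum_cost_delay
print(effective_cost_delay(20, 10))  # 20: the autovacuum-specific value wins
```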

 I noticed today on my own machine several strange pauses while running
 the serial regression tests --- the machine didn't seem to be hitting
 the disk nor sucking lots of CPU, it just sat there for several seconds
 and then picked up again.  I wonder if that's related.  It sure seems it
 must be due to autovac being on now.

Hmm, strange; I ran the tests several times today while testing Magnus's
changes, and I didn't notice any pauses.  It was mostly the parallel
tests though; I'll try serial.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
