Re: [HACKERS] Going for all green buildfarm results

2006-08-19 Thread Gregory Stark

Andrew Dunstan [EMAIL PROTECTED] writes:

 stark wrote:

  So I hacked psql to issue queries asynchronously and allow multiple
  database connections. That way you can switch connections while a blocked
  or slow transaction is still running and issue queries in other
  transactions.
 
 [snip]
 
 Can you please put the patch up somewhere so people can see what's involved?

I'll send it to pgsql-patches. 

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [HACKERS] Going for all green buildfarm results

2006-08-18 Thread Peter Eisentraut
On Thursday, 17 August 2006 17:17, stark wrote:
 Instead I just added a command to cause psql to wait for a time.

Do we need the full multiple-connection handling command set, or would 
asynchronous query support and a wait command be enough?

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] Going for all green buildfarm results

2006-08-18 Thread Martijn van Oosterhout
On Fri, Aug 18, 2006 at 02:46:39PM +0200, Peter Eisentraut wrote:
 On Thursday, 17 August 2006 17:17, stark wrote:
  Instead I just added a command to cause psql to wait for a time.
 
 Do we need the full multiple-connection handling command set, or would 
 asynchronous query support and a wait command be enough?

I am interested in this too. For example, the tool I posted a while ago
supported only this: it controlled multiple connections and only
supported sending async & wait.
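
For reference, a minimal sketch of the two client-side building blocks such
a tester needs, using libpq's asynchronous query API (illustrative only,
not the actual tool):

#include <stdio.h>
#include <libpq-fe.h>

/* "async": send a query on one connection without blocking for the result */
static int
send_async(PGconn *conn, const char *sql)
{
    return PQsendQuery(conn, sql);      /* returns 1 on success, 0 on error */
}

/* "wait": block until every pending result on a connection has arrived */
static void
wait_for(PGconn *conn)
{
    PGresult   *res;

    while ((res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) == PGRES_FATAL_ERROR)
            fprintf(stderr, "%s", PQerrorMessage(conn));
        PQclear(res);
    }
}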

It is enough to support fairly deterministic scenarios, for example,
testing if the locks block on each other as documented. However, it
works less well for non-deterministic testing. Yet a test suite has to
be deterministic, right?

From a client side, is there any testing method better than async and
wait? I've wondered about a tool that attaches to the backend with gdb
and, for testing, kills the backend when it hits a particular function.
By selecting different functions each time, once you'd covered a lot of
functions and tested recovery after each kill, you'd have a good idea
whether the recovery code works properly.

Has anyone seen a tool like that?
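
Something along those lines would be easy to script around gdb itself; an
untested sketch (the driver program and function-name argument are
hypothetical):

#include <stdio.h>
#include <stdlib.h>

/*
 * Attach gdb in batch mode to the given backend, run until it reaches the
 * named function, then kill the backend there so recovery gets exercised.
 * If the breakpoint is never reached, this blocks until gdb is killed.
 */
int
main(int argc, char **argv)
{
    char        cmd[512];

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s backend-pid function-name\n", argv[0]);
        return 1;
    }
    snprintf(cmd, sizeof(cmd),
             "gdb -batch -p %s -ex 'break %s' -ex 'continue' -ex 'kill'",
             argv[1], argv[2]);
    return system(cmd);
}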

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Going for all green buildfarm results

2006-08-18 Thread Andrew Dunstan

stark wrote:

Alvaro Herrera alvherre ( at ) commandprompt ( dot ) com writes:
 Maybe we could write a suitable test case using Martijn's concurrent
 testing framework.

The trick is to get process A to commit between the times that process B
looks at the new and old versions of the pg_class row (and it has to
happen to do so in that order ... although that's not a bad bet given
the way btree handles equal keys).

I think the reason we've not tracked this down before is that that's a
pretty small window.  You could force the problem by stopping process B
with a debugger breakpoint and then letting A do its thing, but short of
something like that you'll never reproduce it with high probability.

Actually I was already looking into a related issue and have some work here
that may help with this.

I wanted to test the online index build and to do that I figured you needed to
have regression tests like the ones we have now except with multiple database
sessions. So I hacked psql to issue queries asynchronously and allow multiple
database connections. That way you can switch connections while a blocked or
slow transaction is still running and issue queries in other transactions.

I thought it was a proof-of-concept kludge but actually it's worked out quite
well. There were a few conceptual gotchas but I think I have a reasonable
solution for each.

[snip]

Can you please put the patch up somewhere so people can see what's involved?

thanks

cheers

andrew



Re: [HACKERS] Going for all green buildfarm results

2006-08-18 Thread Tom Lane
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Vacuum's always had a race condition: it makes a list of rel OIDs and
 then tries to vacuum each one.  It narrows the window for failure by
 doing a SearchSysCacheExists test before relation_open, but there's
 still a window for failure.

 hmm yeah - missed the "VACUUM;" part of the regression diff.
 Still, this means we will have to live with (rare) failures once in a
 while during that test?

I thought of what seems a pretty simple solution for this: make VACUUM
lock the relation before doing the SearchSysCacheExists, ie instead
of the existing code

if (!SearchSysCacheExists(RELOID,
  ObjectIdGetDatum(relid),
  0, 0, 0))
// give up

lmode = vacstmt->full ? AccessExclusiveLock : ShareUpdateExclusiveLock;

onerel = relation_open(relid, lmode);

do

lmode = vacstmt->full ? AccessExclusiveLock : ShareUpdateExclusiveLock;

LockRelationOid(relid, lmode);

if (!SearchSysCacheExists(RELOID,
  ObjectIdGetDatum(relid),
  0, 0, 0))
// give up

onerel = relation_open(relid, NoLock);

Once we're holding lock, we can be sure there's not a DROP TABLE in
progress, so there's no race condition anymore.  It's OK to take a
lock on the OID of a relation that no longer exists, AFAICS; we'll
just drop it again immediately (the give up path includes transaction
exit, so there's not even any extra code needed).

This wasn't possible before the recent adjustments to the relation
locking protocol, but now it looks trivial ... am I missing anything?

Perhaps it is worth folding this test into a conditional_relation_open
function that returns NULL instead of failing if the rel no longer
exists.  I think there are potential uses in CLUSTER and perhaps REINDEX
as well as VACUUM.
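
For illustration, such a helper might look roughly like this (a sketch
only; the name, and the unlock-on-failure cleanup, are assumptions rather
than committed code):

Relation
conditional_relation_open(Oid relid, LOCKMODE lmode)
{
    /* Take the lock first, so a concurrent DROP TABLE must wait for us. */
    LockRelationOid(relid, lmode);

    /* With the lock held, recheck that the relation still exists. */
    if (!SearchSysCacheExists(RELOID,
                              ObjectIdGetDatum(relid),
                              0, 0, 0))
    {
        /* It's gone: release the now-useless lock and tell the caller. */
        UnlockRelationOid(relid, lmode);
        return NULL;
    }

    /* Safe to open; the lock we already hold suffices. */
    return relation_open(relid, NoLock);
}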

regards, tom lane



Re: [HACKERS] Going for all green buildfarm results

2006-08-17 Thread Stefan Kaltenbrunner
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
 Maybe we could write a suitable test case using Martijn's concurrent
 testing framework.
 
 The trick is to get process A to commit between the times that process B
 looks at the new and old versions of the pg_class row (and it has to
 happen to do so in that order ... although that's not a bad bet given
 the way btree handles equal keys).
 
 I think the reason we've not tracked this down before is that that's a
 pretty small window.  You could force the problem by stopping process B
 with a debugger breakpoint and then letting A do its thing, but short of
 something like that you'll never reproduce it with high probability.
 
 As far as Andrew's question goes: I have no doubt that this race
 condition is (or now, was) real and could explain Stefan's failure.
 It's not impossible that there's some other problem in there, though.
 If so we will still see the problem from time to time on HEAD, and
 know that we have more work to do.  But I don't think that continuing
 to see it on the back branches will teach us anything.

maybe the following buildfarm report means that we need a new theory  :-(

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sponge&dt=2006-08-16%2021:30:02


Stefan



Re: [HACKERS] Going for all green buildfarm results

2006-08-17 Thread Tom Lane
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 maybe the following buildfarm report means that we need a new theory  :-(

 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sponge&dt=2006-08-16%2021:30:02

Vacuum's always had a race condition: it makes a list of rel OIDs and
then tries to vacuum each one.  It narrows the window for failure by
doing a SearchSysCacheExists test before relation_open, but there's
still a window for failure.

The rel in question is most likely a temp rel of another backend,
because sanity_check is running by itself and so there shouldn't
be anything else happening except perhaps some other session's
post-disconnect cleanup.  Maybe we could put the check for "is
this a temp rel of another backend" into the initial list-making
step instead of waiting till after relation_open.  That doesn't
seem to solve the general problem though.

regards, tom lane



Re: [HACKERS] Going for all green buildfarm results

2006-08-17 Thread Stefan Kaltenbrunner
Tom Lane wrote:
 Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 maybe the following buildfarm report means that we need a new theory  :-(
 
  http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sponge&dt=2006-08-16%2021:30:02
 
 Vacuum's always had a race condition: it makes a list of rel OIDs and
 then tries to vacuum each one.  It narrows the window for failure by
 doing a SearchSysCacheExists test before relation_open, but there's
 still a window for failure.
 
 The rel in question is most likely a temp rel of another backend,
 because sanity_check is running by itself and so there shouldn't
 be anything else happening except perhaps some other session's
 post-disconnect cleanup.  Maybe we could put the check for "is
 this a temp rel of another backend" into the initial list-making
 step instead of waiting till after relation_open.  That doesn't
 seem to solve the general problem though.

hmm yeah - missed the "VACUUM;" part of the regression diff.
Still, this means we will have to live with (rare) failures once in a
while during that test?


Stefan



Re: [HACKERS] Going for all green buildfarm results

2006-08-17 Thread stark

 Alvaro Herrera alvherre ( at ) commandprompt ( dot ) com writes:
 Maybe we could write a suitable test case using Martijn's concurrent
 testing framework.
 
 The trick is to get process A to commit between the times that process B
 looks at the new and old versions of the pg_class row (and it has to
 happen to do so in that order ... although that's not a bad bet given
 the way btree handles equal keys).
 
 I think the reason we've not tracked this down before is that that's a
 pretty small window.  You could force the problem by stopping process B
 with a debugger breakpoint and then letting A do its thing, but short of
 something like that you'll never reproduce it with high probability.

Actually I was already looking into a related issue and have some work here
that may help with this.

I wanted to test the online index build and to do that I figured you needed to
have regression tests like the ones we have now except with multiple database
sessions. So I hacked psql to issue queries asynchronously and allow multiple
database connections. That way you can switch connections while a blocked or
slow transaction is still running and issue queries in other transactions.

I thought it was a proof-of-concept kludge but actually it's worked out quite
well. There were a few conceptual gotchas but I think I have a reasonable
solution for each.

The main issue was that any time you issue an asynchronous query that you
expect to block, you have a race condition in the test. You can't switch
connections and proceed right away or you may actually proceed with the other
connection before the first connection's command is received and acted on by
the backend.

The right solution to this would involve altering the backend and the
protocol to provide some form of feedback when an asynchronous query had
reached various states including when it was blocked. You would have to
annotate it with enough information that the client can determine it's
actually blocked on the right thing and not just on some uninteresting
transient lock too.

Instead I just added a command to cause psql to wait for a time. This is
nearly as good, since all the regression tests run fairly quickly: if you
wait even a fraction of a second you can be pretty certain the command has
been received, and if it were not going to block it would have finished and
printed output already. And it was *much* simpler.

Also, I think for interactive use we would want a somewhat more sophisticated
scheduling of output. It would be nice to print out results as they come in
even if we're on another connection. For the regression tests you certainly do
not want that since that would introduce unavoidable non-deterministic race
conditions in your output files all over the place. The way I've coded it now
takes care to print out output only from the active database connection and
the test cases need to be written to switch connections at each point they
want to test for possibly incorrect output.

Another issue was that I couldn't come up with a nice set of names for the
commands that didn't conflict with the myriad of one-letter commands already
in psql. So I just prefixed them all with c (for connection). I figured when
I submitted it I would just let the community hash out the names and take the
two seconds it would take to change them.

The test cases are actually super easy to write and read, at least considering
we're talking about concurrent SQL sessions here. I think it's far clearer
than trying to handle separate scripts, and nearly as clear as Martijn's
proposal from a while back to prepend a connection number on every line.

The commands I've added or altered are:

  \c[onnect][&] [DBNAME|- USER|- HOST|- PORT|-]
connect to new database (currently postgres)
if optional & is present, open a new connection without closing the existing one
  \cswitch n
switch to database connection n
  \clist
list database connections
  \cdisconnect
close current database connection
use \cswitch or \connect to select another connection
  \cnowait
issue next query without waiting for results
  \cwait [n]
if any queries are pending wait n seconds for results

Also I added % to the psql prompt format to indicate the current connection.

So the tests look like, for example:

postgres=# \c&
[2] You are now connected to database postgres.
postgres[2]=# begin;
BEGIN
postgres[2]=# create table foo (a integer);
CREATE TABLE
postgres[2]=# \cswitch 1
[1] You are now connected to database postgres
postgres[1]=# select * from foo;
ERROR:  relation "foo" does not exist
postgres[1]=# \cswitch 2
[2] You are now connected to database postgres
postgres[2]=# commit;
COMMIT
postgres[2]=# \cswitch 1
[1] You are now connected to database postgres
postgres[1]=# select * from foo;
 a 
---
(0 rows)

postgres[1]=# insert into foo values (1);
INSERT 0 1
postgres[1]=# begin;
BEGIN
postgres[1]=# update foo set 

Re: [HACKERS] Going for all green buildfarm results

2006-08-17 Thread Alvaro Herrera
stark wrote:

 Actually I was already looking into a related issue and have some work here
 that may help with this.
 
 I wanted to test the online index build and to do that I figured you needed to
 have regression tests like the ones we have now except with multiple database
 sessions. So I hacked psql to issue queries asynchronously and allow multiple
 database connections. That way you can switch connections while a blocked or
 slow transaction is still running and issue queries in other transactions.
 
 I thought it was a proof-of-concept kludge but actually it's worked out quite
 well. There were a few conceptual gotchas but I think I have a reasonable
 solution for each.

I have had an idea for some time that is actually much simpler -- just
launch several backends at once to do different things, and randomly
send SIGSTOP and SIGCONT to each.  If they keep doing whatever they are
doing in infinite loops, and you leave it enough time, it's very likely
that you'll get problems if the concurrent locking (or whatever) is not
right.

The nice thing about this is that it's completely random, i.e. you don't
have to introduce individual stop points in the backend (which may
themselves hide some bugs).  It acts (or at least, I expect it to act)
just like the kernel gave execution to another process.

The main difference with your approach is that I haven't tried it.
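
For what it's worth, the signal-sending side of that is only a few lines;
a minimal sketch (the PID list is a placeholder; a real harness would
collect backend PIDs from pg_stat_activity or the postmaster):

#include <sys/types.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    pid_t       backends[] = {12345, 12346};    /* placeholder backend PIDs */
    int         n = 2;

    srandom(getpid());
    for (;;)
    {
        pid_t       victim = backends[random() % n];

        kill(victim, SIGSTOP);          /* freeze it wherever it is */
        usleep(random() % 500000);      /* hold it stopped up to 0.5s */
        kill(victim, SIGCONT);          /* let it run again */
        usleep(random() % 500000);
    }
}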

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Going for all green buildfarm results

2006-08-17 Thread Jim C. Nasby
On Thu, Aug 17, 2006 at 04:17:01PM +0100, stark wrote:
 I wanted to test the online index build and to do that I figured you needed to
 have regression tests like the ones we have now except with multiple database
 sessions. So I hacked psql to issue queries asynchronously and allow multiple
 database connections. That way you can switch connections while a blocked or
 slow transaction is still running and issue queries in other transactions.
 
Wow, that's damn cool! FWIW, one thing I can think of that would be
useful is the ability to 'background' a long-running query. I see
\cnowait, but having something like & from unix shells would be even
easier. It'd also be great to have the equivalent of ^Z so that if you
got tired of waiting on a query, you could get back to the psql prompt
without killing it.

 Also, I think for interactive use we would want a somewhat more sophisticated
 scheduling of output. It would be nice to print out results as they come in
 even if we're on another connection. For the regression tests you certainly do
 not want that since that would introduce unavoidable non-deterministic race
 conditions in your output files all over the place. The way I've coded it now
 takes care to print out output only from the active database connection and
 the test cases need to be written to switch connections at each point they
 want to test for possibly incorrect output.
 
Thinking in terms of tcsh & co, there are a number of ways to handle this:

1) Output happens real-time
2) Only output from current connection (what you've done)
3) Only output after user input (ie: code that handles output is only
run after the user has entered a command). I think most shells
operate this way by default.
4) Provide an indication that output has come in from a background
connection, but don't provide the actual output. This could be
combined with #3.

#3 is nice because you won't get interrupted in the middle of entering
some long query. #4 could be useful for automated testing, especially if
the indicator was routed to another output channel, such as STDERR.

 Another issue was that I couldn't come up with a nice set of names for the
 commands that didn't conflict with the myriad of one-letter commands already
 in psql. So I just prefixed them all with c (for connection). I figured when
 I submitted it I would just let the community hash out the names and take
 the two seconds it would take to change them.
 
 The test cases are actually super easy to write and read, at least considering
 we're talking about concurrent SQL sessions here. I think it's far clearer
 than trying to handle separate scripts, and nearly as clear as Martijn's
 proposal from a while back to prepend a connection number on every line.
 
 The commands I've added or altered are:
 
   \c[onnect][&] [DBNAME|- USER|- HOST|- PORT|-]
 connect to new database (currently postgres)
 if optional & is present, open a new connection without closing the existing one
   \cswitch n
 switch to database connection n

I can see \1 - \9 as being a handy shortcut.
   \clist
 list database connections
   \cdisconnect
 close current database connection
 use \cswitch or \connect to select another connection

Would ^d have the same effect?
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Going for all green buildfarm results

2006-08-17 Thread Jim C. Nasby
On Thu, Aug 17, 2006 at 03:09:30PM -0400, Alvaro Herrera wrote:
 stark wrote:
 
  Actually I was already looking into a related issue and have some work here
  that may help with this.
  
  I wanted to test the online index build and to do that I figured you needed
  to have regression tests like the ones we have now except with multiple
  database sessions. So I hacked psql to issue queries asynchronously and
  allow multiple database connections. That way you can switch connections
  while a blocked or slow transaction is still running and issue queries in
  other transactions.
  
  I thought it was a proof-of-concept kludge but actually it's worked out
  quite well. There were a few conceptual gotchas but I think I have a
  reasonable solution for each.
 
 I have had an idea for some time that is actually much simpler -- just
 launch several backends at once to do different things, and randomly
 send SIGSTOP and SIGCONT to each.  If they keep doing whatever they are
 doing in infinite loops, and you leave it enough time, it's very likely
 that you'll get problems if the concurrent locking (or whatever) is not
 right.

This is probably worth doing as well, since it would simulate what an
IO-bound system would look like.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Going for all green buildfarm results

2006-08-17 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 On Thu, Aug 17, 2006 at 03:09:30PM -0400, Alvaro Herrera wrote:
 I have had an idea for some time that is actually much simpler -- just
 launch several backends at once to do different things, and randomly
 send SIGSTOP and SIGCONT to each.  If they keep doing whatever they are
 doing in infinite loops, and you leave it enough time, it's very likely
 that you'll get problems if the concurrent locking (or whatever) is not
 right.

 This is probably worth doing as well, since it would simulate what an
 IO-bound system would look like.

While that might be useful for testing, it'd absolutely suck for
debugging, because of the difficulty of reproducing a problem :-(

regards, tom lane



Re: [HACKERS] Going for all green buildfarm results

2006-07-31 Thread Jim C. Nasby
On Sun, Jul 30, 2006 at 11:44:44AM -0400, Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Stefan Kaltenbrunner wrote:
  FYI: lionfish just managed to hit that problem again:
  http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-07-29%2023:30:06
 
  The test alter_table, which is on the same parallel group as limit (the
  failing test), contains these lines:
  ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
  ALTER INDEX tmp_onek_unique1 RENAME TO onek_unique1;
 
 I bet Alvaro's spotted the problem.  ALTER INDEX RENAME doesn't seem to
 take any lock on the index's parent table, only on the index itself.
 That means that a query on onek could be trying to read the pg_class
 entries for onek's indexes concurrently with someone trying to commit
 a pg_class update to rename an index.  If the query manages to visit
 the new and old versions of the row in that order, and the commit
 happens between, *neither* of the versions would look valid.  MVCC
 doesn't save us because this is all SnapshotNow.
 
 Not sure what to do about this.  Trying to lock the parent table could
 easily be a cure-worse-than-the-disease, because it would create
 deadlock risks (we've already locked the index before we could look up
 and lock the parent).  Thoughts?
 
 The path of least resistance might just be to not run these tests in
 parallel.  The chance of this issue causing problems in the real world
 seems small.

It doesn't seem that unusual to want to rename an index on a running
system, and it certainly doesn't seem like the kind of operation that
should pose a problem. So at the very least, we'd need a big fat warning
in the docs about how renaming an index could cause other queries in the
system to fail, and the error message needs to be improved.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Going for all green buildfarm results

2006-07-31 Thread Stefan Kaltenbrunner
Jim C. Nasby wrote:
 On Sun, Jul 30, 2006 at 11:44:44AM -0400, Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
 Stefan Kaltenbrunner wrote:
 FYI: lionfish just managed to hit that problem again:
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-07-29%2023:30:06
 The test alter_table, which is on the same parallel group as limit (the
 failing test), contains these lines:
 ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
 ALTER INDEX tmp_onek_unique1 RENAME TO onek_unique1;
 I bet Alvaro's spotted the problem.  ALTER INDEX RENAME doesn't seem to
 take any lock on the index's parent table, only on the index itself.
 That means that a query on onek could be trying to read the pg_class
 entries for onek's indexes concurrently with someone trying to commit
 a pg_class update to rename an index.  If the query manages to visit
 the new and old versions of the row in that order, and the commit
 happens between, *neither* of the versions would look valid.  MVCC
 doesn't save us because this is all SnapshotNow.

 Not sure what to do about this.  Trying to lock the parent table could
 easily be a cure-worse-than-the-disease, because it would create
 deadlock risks (we've already locked the index before we could look up
 and lock the parent).  Thoughts?

 The path of least resistance might just be to not run these tests in
 parallel.  The chance of this issue causing problems in the real world
 seems small.
 
 It doesn't seem that unusual to want to rename an index on a running
 system, and it certainly doesn't seem like the kind of operation that
 should pose a problem. So at the very least, we'd need a big fat warning
 in the docs about how renaming an index could cause other queries in the
 system to fail, and the error message needs to be improved.

it is my understanding that Tom is already tackling the underlying issue
on a much more general basis ...


Stefan



Re: [HACKERS] Going for all green buildfarm results

2006-07-31 Thread Tom Lane
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 Jim C. Nasby wrote:
 On Sun, Jul 30, 2006 at 11:44:44AM -0400, Tom Lane wrote:
 The path of least resistance might just be to not run these tests in
 parallel.  The chance of this issue causing problems in the real world
 seems small.
 
 It doesn't seem that unusual to want to rename an index on a running
 system, and it certainly doesn't seem like the kind of operation that
 should pose a problem. So at the very least, we'd need a big fat warning
 in the docs about how renaming an index could cause other queries in the
 system to fail, and the error message needs to be improved.

 it is my understanding that Tom is already tackling the underlying issue
 on a much more general basis ...

Done in HEAD, but we might still wish to think about changing the
regression tests in the back branches, else we'll probably continue to
see this failure once in a while ...

regards, tom lane



Re: [HACKERS] Going for all green buildfarm results

2006-07-31 Thread Andrew Dunstan

Tom Lane wrote:

 Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
  Jim C. Nasby wrote:
   On Sun, Jul 30, 2006 at 11:44:44AM -0400, Tom Lane wrote:
    The path of least resistance might just be to not run these tests in
    parallel.  The chance of this issue causing problems in the real world
    seems small.

   It doesn't seem that unusual to want to rename an index on a running
   system, and it certainly doesn't seem like the kind of operation that
   should pose a problem. So at the very least, we'd need a big fat warning
   in the docs about how renaming an index could cause other queries in the
   system to fail, and the error message needs to be improved.

  it is my understanding that Tom is already tackling the underlying issue
  on a much more general basis ...

 Done in HEAD, but we might still wish to think about changing the
 regression tests in the back branches, else we'll probably continue to
 see this failure once in a while ...

How sure are we that this is the cause of the problem? The feeling I got 
was "this is a good guess". If so, do we want to prevent ourselves 
getting any further clues in case we're wrong? It's also an interesting 
case of a (low likelihood) bug which is not fixable on any stable branch.


cheers

andrew




Re: [HACKERS] Going for all green buildfarm results

2006-07-31 Thread Stefan Kaltenbrunner
Andrew Dunstan wrote:
 Tom Lane wrote:
  Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
   Jim C. Nasby wrote:
    On Sun, Jul 30, 2006 at 11:44:44AM -0400, Tom Lane wrote:
     The path of least resistance might just be to not run these tests in
     parallel.  The chance of this issue causing problems in the real world
     seems small.

    It doesn't seem that unusual to want to rename an index on a running
    system, and it certainly doesn't seem like the kind of operation that
    should pose a problem. So at the very least, we'd need a big fat warning
    in the docs about how renaming an index could cause other queries in the
    system to fail, and the error message needs to be improved.

   it is my understanding that Tom is already tackling the underlying issue
   on a much more general basis ...

  Done in HEAD, but we might still wish to think about changing the
  regression tests in the back branches, else we'll probably continue to
  see this failure once in a while ...

 How sure are we that this is the cause of the problem? The feeling I got
 was "this is a good guess". If so, do we want to prevent ourselves
 getting any further clues in case we're wrong? It's also an interesting
 case of a (low likelihood) bug which is not fixable on any stable branch.

well I have a lot of trust in Tom - though the main issue is that this
issue seems to be very hard to trigger.
afaik only one box (lionfish) ever managed to hit it, and even there only
2 times out of several hundred builds - I don't suppose we can come up
with a testcase that shows the issue more reliably?

Stefan



Re: [HACKERS] Going for all green buildfarm results

2006-07-31 Thread Alvaro Herrera
Stefan Kaltenbrunner wrote:
 Andrew Dunstan wrote:

  How sure are we that this is the cause of the problem? The feeling I got
  was "this is a good guess". If so, do we want to prevent ourselves
  getting any further clues in case we're wrong? It's also an interesting
  case of a (low likelihood) bug which is not fixable on any stable branch.
 
 well I have a lot of trust in Tom - though the main issue is that this
 issue seems to be very hard to trigger.
 afaik only one box (lionfish) ever managed to hit it, and even there only
 2 times out of several hundred builds - I don't suppose we can come up
 with a testcase that shows the issue more reliably?

Maybe we could write a suitable test case using Martijn's concurrent
testing framework.  Or with a pair of custom SQL scripts running under
pgbench, and a separate process sending random SIGSTOP/SIGCONT to
backends.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Going for all green buildfarm results

2006-07-31 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Maybe we could write a suitable test case using Martijn's concurrent
 testing framework.

The trick is to get process A to commit between the times that process B
looks at the new and old versions of the pg_class row (and it has to
happen to do so in that order ... although that's not a bad bet given
the way btree handles equal keys).

I think the reason we've not tracked this down before is that that's a
pretty small window.  You could force the problem by stopping process B
with a debugger breakpoint and then letting A do its thing, but short of
something like that you'll never reproduce it with high probability.

As far as Andrew's question goes: I have no doubt that this race
condition is (or now, was) real and could explain Stefan's failure.
It's not impossible that there's some other problem in there, though.
If so we will still see the problem from time to time on HEAD, and
know that we have more work to do.  But I don't think that continuing
to see it on the back branches will teach us anything.

regards, tom lane 



Re: [HACKERS] Going for all green buildfarm results

2006-07-31 Thread Andrew Dunstan

Tom Lane wrote:

As far as Andrew's question goes: I have no doubt that this race
condition is (or now, was) real and could explain Stefan's failure.
It's not impossible that there's some other problem in there, though.
If so we will still see the problem from time to time on HEAD, and
know that we have more work to do.  But I don't think that continuing
to see it on the back branches will teach us anything.

Fair enough.

cheers

andrew




Re: [HACKERS] Going for all green buildfarm results

2006-07-30 Thread Alvaro Herrera
Stefan Kaltenbrunner wrote:
 Tom Lane wrote:
  Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
  FWIW: lionfish had a weird make check error 3 weeks ago which I
  (unsuccessfully) tried to reproduce multiple times after that:
  
  http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-05-12%2005:30:14
  
  Weird.
  
SELECT ''::text AS eleven, unique1, unique2, stringu1 
  FROM onek WHERE unique1 < 50 
  ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
  ! ERROR:  could not open relation with OID 27035
  
  AFAICS, the only way to get that error in HEAD is if ScanPgRelation
  can't find a pg_class row with the mentioned OID.  Presumably 27035
  belongs to onek or one of its indexes.  The very next command also
  refers to onek, and doesn't fail, so what we seem to have here is
  a transient lookup failure.  We've found a btree bug like that once
  before ... wonder if there's still one left?
 
 FYI: lionfish just managed to hit that problem again:
 
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-07-29%2023:30:06

The error message this time is

! ERROR:  could not open relation with OID 27006

It's worth mentioning that the portals_p2 test, which happens in the
parallel group previous to where this test is run, also accesses the
onek table successfully.  It may be interesting to see exactly what
relation is 27006.

The test alter_table, which is on the same parallel group as limit (the
failing test), contains these lines:

ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
ALTER INDEX tmp_onek_unique1 RENAME TO onek_unique1;

Maybe this is related.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Going for all green buildfarm results

2006-07-30 Thread Stefan Kaltenbrunner
Alvaro Herrera wrote:
 Stefan Kaltenbrunner wrote:
 Tom Lane wrote:
 Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 FWIW: lionfish had a weird make check error 3 weeks ago which I
 (unsuccessfully) tried to reproduce multiple times after that:
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-05-12%2005:30:14
 Weird.

   SELECT ''::text AS eleven, unique1, unique2, stringu1 
 FROM onek WHERE unique1 < 50 
 ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
 ! ERROR:  could not open relation with OID 27035

 AFAICS, the only way to get that error in HEAD is if ScanPgRelation
 can't find a pg_class row with the mentioned OID.  Presumably 27035
 belongs to onek or one of its indexes.  The very next command also
 refers to onek, and doesn't fail, so what we seem to have here is
 a transient lookup failure.  We've found a btree bug like that once
 before ... wonder if there's still one left?
 FYI: lionfish just managed to hit that problem again:

 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-07-29%2023:30:06
 
 The error message this time is
 
 ! ERROR:  could not open relation with OID 27006

yeah and before it was:
! ERROR:  could not open relation with OID 27035

which looks quite related :-)

 
 It's worth mentioning that the portals_p2 test, which happens in the
 parallel group previous to where this test is run, also accesses the
 onek table successfully.  It may be interesting to see exactly what
 relation is 27006.

sorry but I don't have access to the cluster in question any more
(lionfish is quite resource starved and I only enabled keeping failed
builds on -HEAD after the last incident ...)

 
 The test alter_table, which is on the same parallel group as limit (the
 failing test), contains these lines:
 
 ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
 ALTER INDEX tmp_onek_unique1 RENAME TO onek_unique1;

hmm interesting - lionfish is a slow box (250MHz MIPS) and particularly
low on memory (48MB+140MB swap), so it is quite likely that the parallel
regression tests are driving it into swap - maybe some sort of subtle
timing issue?


Stefan



Re: [HACKERS] Going for all green buildfarm results

2006-07-30 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Stefan Kaltenbrunner wrote:
 FYI: lionfish just managed to hit that problem again:
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-07-29%2023:30:06

 The test alter_table, which is on the same parallel group as limit (the
 failing test), contains these lines:
 ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
 ALTER INDEX tmp_onek_unique1 RENAME TO onek_unique1;

I bet Alvaro's spotted the problem.  ALTER INDEX RENAME doesn't seem to
take any lock on the index's parent table, only on the index itself.
That means that a query on onek could be trying to read the pg_class
entries for onek's indexes concurrently with someone trying to commit
a pg_class update to rename an index.  If the query manages to visit
the new and old versions of the row in that order, and the commit
happens between, *neither* of the versions would look valid.  MVCC
doesn't save us because this is all SnapshotNow.

Not sure what to do about this.  Trying to lock the parent table could
easily be a cure-worse-than-the-disease, because it would create
deadlock risks (we've already locked the index before we could look up
and lock the parent).  Thoughts?

The path of least resistance might just be to not run these tests in
parallel.  The chance of this issue causing problems in the real world
seems small.

regards, tom lane



Re: [HACKERS] Going for all green buildfarm results

2006-07-29 Thread Stefan Kaltenbrunner
Tom Lane wrote:
 Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 FWIW: lionfish had a weird make check error 3 weeks ago which I
 (unsuccessfully) tried to reproduce multiple times after that:
 
 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-05-12%2005:30:14
 
 Weird.
 
   SELECT ''::text AS eleven, unique1, unique2, stringu1 
 FROM onek WHERE unique1 < 50 
 ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
 ! ERROR:  could not open relation with OID 27035
 
 AFAICS, the only way to get that error in HEAD is if ScanPgRelation
 can't find a pg_class row with the mentioned OID.  Presumably 27035
 belongs to onek or one of its indexes.  The very next command also
 refers to onek, and doesn't fail, so what we seem to have here is
 a transient lookup failure.  We've found a btree bug like that once
 before ... wonder if there's still one left?

FYI: lionfish just managed to hit that problem again:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-07-29%2023:30:06



Stefan



Re: [HACKERS] Going for all green buildfarm results

2006-06-22 Thread Andrew Dunstan



Mark Wong wrote:

  Now why are we failing on 7.3? What version of flex do you have? If
  it's too modern we'll just need to take 7.3 out of the cobra and
  stoat rotations - we'd really only make supercritical fixes on that
  branch these days.

 Flex is 2.5.33 on both systems.  I'm assuming that's too modern so
 I'll go ahead and stop building 7.3 for those systems.

You could be lucky that the others build. I believe our supported version is
still 2.5.4, which is what all my Linux systems have.


cheers

andrew





Re: [HACKERS] Going for all green buildfarm results

2006-06-22 Thread Neil Conway
On Thu, 2006-06-22 at 12:52 -0400, Andrew Dunstan wrote:
 I believe our supported version is still 2.5.4, which is
 what all my linux systems have.

It's not clear to me why some people have such antipathy toward recent
flex releases, but if our only supported flex version is 2.5.4, I think
this should be documented: the current installation instructions only
say that Flex 2.5.4 or later should be used for CVS builds.

-Neil





Re: [HACKERS] Going for all green buildfarm results

2006-06-22 Thread Tom Lane
Neil Conway [EMAIL PROTECTED] writes:
 On Thu, 2006-06-22 at 12:52 -0400, Andrew Dunstan wrote:
 I believe our supported version is still 2.5.4, which is
 what all my linux systems have.

 Its not clear to me why some people have such antipathy toward recent
 flex releases, but if our only supported flex version is 2.5.4, I think
 this should be documented: the current installation instructions only
 say that Flex 2.5.4 or later should be used for CVS builds.

Some of them appear to be actively broken :-(.  If we knew exactly which
ones they were, we'd document that, but I don't think we have a handle
on why some 2.5.x flexes barf and others don't.

regards, tom lane



Re: [HACKERS] Going for all green buildfarm results

2006-06-22 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Mark Wong wrote:
 Flex is 2.5.33 on both systems.  I'm assuming that's too modern so 
 I'll go ahead and stop building 7.3 for those systems.

 You could be lucky the others build. I believe our supported version is 
 still 2.5.4, which is what all my linux systems have.

I checked into this and it seems that what we'd need to do is backport
some subset of these 7.4 changes:

2003-09-13 22:18  tgl

* contrib/seg/: Makefile, README.seg, seg.c, seg.sql.in,
segparse.y, segscan.l, expected/seg.out: Make contrib/seg work with
flex 2.5.31.  Fix it up to have a real btree operator class, too,
since in PG 7.4 you can't GROUP without one.

2003-09-13 21:52  tgl

* contrib/cube/: Makefile, README.cube, cube.c, cube.sql.in,
cubeparse.y, cubescan.l, expected/cube.out: Make contrib/cube work
with flex 2.5.31.  Fix it up to have a real btree operator class,
too, since in PG 7.4 you can't GROUP without one.

2003-09-14 14:44  tgl

* contrib/: tsearch/parser.l, tsearch2/wordparser/parser.l:
Persuade tsearch/tsearch2 to work (or at least pass their
regression tests) when using flex 2.5.31.  The fix is to *not* try
to use palloc and pfree for allocations within the lexer; when you
do that, the yy_buffer_stack gets freed at inopportune times.  The
code is already set up to do manual deallocation, so I see no
particular advantage to using palloc anyway.

This is probably not worth doing.  We're really only maintaining 7.3 for
legacy platforms (the only one I care about is RHEL3 ;-)) and a legacy
platform is likely to have an old flex.  It's a tad annoying to lose
buildfarm coverage on it though ...

regards, tom lane



Re: [HACKERS] Going for 'all green' buildfarm results

2006-06-09 Thread ohp
I can take other if that helps.

Larry, could you help me in the setup?

Regards,
On Thu, 8 Jun 2006, Andrew Dunstan wrote:

 Date: Thu, 08 Jun 2006 10:54:09 -0400
 From: Andrew Dunstan [EMAIL PROTECTED]
 Newsgroups: pgsql.hackers
 Subject: Re: Going for 'all green' buildfarm results

 Larry Rosenman wrote:
  well, the changes didn't help.
 
  I've pulled ALL the cronjobs from firefly.
 
  consider it dead.
 
  Since it is an outlier, it's not useful.
 
 
 



 OK, I am marking firefly as retired. That means we have no coverage for
 Unixware.

 cheers

 andrew




-- 
Olivier PRENANT Tel: +33-5-61-50-97-00 (Work)
15, Chemin des Monges    +33-5-61-50-97-01 (Fax)
31190 AUTERIVE   +33-6-07-63-80-64 (GSM)
FRANCE  Email: ohp@pyrenet.fr
--
Make your life a dream, make your dream a reality. (St Exupery)



Re: [HACKERS] Going for 'all green' buildfarm results

2006-06-09 Thread ohp
On Fri, 9 Jun 2006 ohp@pyrenet.fr wrote:

 Date: Fri, 9 Jun 2006 11:12:07 +0200
 From: ohp@pyrenet.fr
 To: Andrew Dunstan [EMAIL PROTECTED], Larry Rosenman ler@lerctr.org
 Newsgroups: pgsql.hackers
 Subject: Re: Going for 'all green' buildfarm results

 I can take other if that helps.
Ooops... takeover :)

 Larry, could you help me in the setup?

 Regards,
 On Thu, 8 Jun 2006, Andrew Dunstan wrote:

  Date: Thu, 08 Jun 2006 10:54:09 -0400
  From: Andrew Dunstan [EMAIL PROTECTED]
  Newsgroups: pgsql.hackers
  Subject: Re: Going for 'all green' buildfarm results
 
  Larry Rosenman wrote:
   well, the changes didn't help.
  
   I've pulled ALL the cronjobs from firefly.
  
   consider it dead.
  
   Since it is an outlier, it's not useful.
  
  
  
 
 
 
  OK, I am marking firefly as retired. That means we have no coverage for
  Unixware.
 
  cheers
 
  andrew
 
 
 



-- 
Olivier PRENANT Tel: +33-5-61-50-97-00 (Work)
15, Chemin des Monges    +33-5-61-50-97-01 (Fax)
31190 AUTERIVE   +33-6-07-63-80-64 (GSM)
FRANCE  Email: ohp@pyrenet.fr
--
Make your life a dream, make your dream a reality. (St Exupery)



Re: [HACKERS] Going for 'all green' buildfarm results

2006-06-08 Thread Andrew Dunstan

Larry Rosenman wrote:

well, the changes didn't help.

I've pulled ALL the cronjobs from firefly.

consider it dead.

Since it is an outlier, it's not useful.

OK, I am marking firefly as retired. That means we have no coverage for 
Unixware.


cheers

andrew



Re: [HACKERS] Going for all green buildfarm results

2006-06-03 Thread Kris Jurka




 Original Message 
From:   Tom Lane [EMAIL PROTECTED]

kudu HEAD: one-time failure 6/1/06 in statement_timeout test, never seen
before.  Is it possible system was under enough load that the 1-second
timeout fired before control reached the exception block?



The load here was no different than any other day.  As to whether it's a 
real issue or not I have no idea.  It is a virtual machine that is subject 
to the load on other VMs, but none of them were scheduled to do 
anything at the time.



Kris Jurka



Re: [HACKERS] Going for all green buildfarm results

2006-06-02 Thread Stefan Kaltenbrunner
Tom Lane wrote:
 I've been making another pass over getting rid of buildfarm failures.
 The remaining ones I see at the moment are:
 
 firefly HEAD: intermittent failures in the stats test.  We seem to have
 fixed every other platform back in January, but not this one.
 
 kudu HEAD: one-time failure 6/1/06 in statement_timeout test, never seen
 before.  Is it possible system was under enough load that the 1-second
 timeout fired before control reached the exception block?

[...]

FWIW: lionfish had a weird make check error 3 weeks ago which I
(unsuccessfully) tried to reproduce multiple times after that:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-05-12%2005:30:14


[...]

 cobra, stoat, sponge 7.4: pilot error.  Either install Tk or configure
 --without-tk.

sorry for that but the issue with sponge on 7.4 was fixed nearly a week
ago though there have been no changes until today to trigger a new build ;-)


Stefan



Re: [HACKERS] Going for all green buildfarm results

2006-06-02 Thread Larry Rosenman
Tom Lane wrote:
 I've been making another pass over getting rid of buildfarm failures.
 The remaining ones I see at the moment are:
 
 firefly HEAD: intermittent failures in the stats test.  We seem to
 have fixed every other platform back in January, but not this one. 
 
 
 firefly 7.4: dblink test fails, with what looks like an rpath problem.
 Another one that we fixed awhile ago, and the fix worked on every
 platform but this one. 
 
 firefly 7.3: trivial regression diffs; we could install variant
 comparison files if anyone cared. 
 
 
 Firefly is obviously the outlier here.  I dunno if anyone cares
 enough about SCO to spend time investigating it (I don't).  Most of
 the others just need a little bit of attention from the machine
 owner.   

If I generate fixes for firefly (I'm the owner), would they have a prayer 
of being applied?

LER

 
   regards, tom lane
 



-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3683 US




Re: [HACKERS] Going for 'all green' buildfarm results

2006-06-02 Thread Andrew Dunstan
Larry Rosenman said:
 Tom Lane wrote:
 I've been making another pass over getting rid of buildfarm failures.
 The remaining ones I see at the moment are:

 firefly HEAD: intermittent failures in the stats test.  We seem to
 have fixed every other platform back in January, but not this one.


 firefly 7.4: dblink test fails, with what looks like an rpath problem.
 Another one that we fixed awhile ago, and the fix worked on every
 platform but this one.

 firefly 7.3: trivial regression diffs; we could install variant
 comparison files if anyone cared.


 Firefly is obviously the outlier here.  I dunno if anyone cares
 enough about SCO to spend time investigating it (I don't).  Most of
 the others just need a little bit of attention from the machine
 owner.

 If I generate fixes for firefly (I'm the owner), would they have a
 prayer of being applied?


Sure, although I wouldn't bother with 7.3 - just take 7.3 out of firefly's
build schedule. That's not carte blanche on fixes, of course - we'd have to
see them.

cheers

andrew





Re: [HACKERS] Going for all green buildfarm results

2006-06-02 Thread Andrew Dunstan

Tom Lane wrote:

Or is it worth improving buildfarm to be able to skip specific tests?

There is a session on buildfarm improvements scheduled for the Toronto 
conference. This is certainly one possibility.


cheers

andrew




Re: [HACKERS] Going for all green buildfarm results

2006-06-02 Thread Tom Lane
Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 FWIW: lionfish had a weird make check error 3 weeks ago which I
 (unsuccessfully) tried to reproduce multiple times after that:

 http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfish&dt=2006-05-12%2005:30:14

Weird.

  SELECT ''::text AS eleven, unique1, unique2, stringu1 
FROM onek WHERE unique1 < 50 
ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
! ERROR:  could not open relation with OID 27035

AFAICS, the only way to get that error in HEAD is if ScanPgRelation
can't find a pg_class row with the mentioned OID.  Presumably 27035
belongs to onek or one of its indexes.  The very next command also
refers to onek, and doesn't fail, so what we seem to have here is
a transient lookup failure.  We've found a btree bug like that once
before ... wonder if there's still one left?

regards, tom lane



Re: [HACKERS] Going for 'all green' buildfarm results

2006-06-02 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Larry Rosenman said:
 If I generate fixes for firefly (I'm the owner), would they have a
  prayer of being applied?

 Sure, although I wouldn't bother with 7.3 - just take 7.3 out of firefly's
 build schedule. That's not carte blanche on fixes, of course - we'd have to
 see them.

What he said ... it'd depend entirely on how ugly the fixes are ;-)

regards, tom lane



Re: [HACKERS] Going for 'all green' buildfarm results

2006-06-02 Thread Larry Rosenman
Tom Lane wrote:
 Andrew Dunstan [EMAIL PROTECTED] writes:
 Larry Rosenman said:
 If I generate fixes for firefly (I'm the owner), would they have a
  prayer of being applied?
 
 Sure, although I wouldn't bother with 7.3 - just take 7.3 out of
 firefly's build schedule. That's not carte blanche on fixes, of
 course - we'd have to see them.
 
 What he said ... it'd depend entirely on how ugly the fixes are ;-)
 
Ok, 7.3 is out of firefly's crontab.

I'll look into 7.4.

LER




-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893




Re: [HACKERS] Going for all green buildfarm results

2006-06-02 Thread Stefan Kaltenbrunner
Tom Lane wrote:
 Stefan Kaltenbrunner [EMAIL PROTECTED] writes:
 
FWIW: lionfish had a weird make check error 3 weeks ago which I
(unsuccessfully) tried to reproduce multiple times after that:
 
 
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=lionfishdt=2006-05-12%2005:30:14
 
 
 Weird.
 
   SELECT ''::text AS eleven, unique1, unique2, stringu1 
 FROM onek WHERE unique1 < 50 
 ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
 ! ERROR:  could not open relation with OID 27035
 
 AFAICS, the only way to get that error in HEAD is if ScanPgRelation
 can't find a pg_class row with the mentioned OID.  Presumably 27035
 belongs to onek or one of its indexes.  The very next command also
 refers to onek, and doesn't fail, so what we seem to have here is
 a transient lookup failure.  We've found a btree bug like that once
 before ... wonder if there's still one left?

If there is still one left it must be quite hard to trigger (using the
regression tests). Like i said before - I tried quite hard to reproduce
the issue back then - without any success.


Stefan



Re: [HACKERS] Going for 'all green' buildfarm results

2006-06-02 Thread Larry Rosenman
Larry Rosenman wrote:
 Tom Lane wrote:
 Andrew Dunstan [EMAIL PROTECTED] writes:
 Larry Rosenman said:
 If I generate fixes for firefly (I'm the owner), would they have a
  prayer of being applied?
 
 Sure, although I wouldn't bother with 7.3 - just take 7.3 out of
 firefly's build schedule. That's not carte blanche on fixes, of
 course - we'd have to see them.
 
 What he said ... it'd depend entirely on how ugly the fixes are ;-)
 
 Ok, 7.3 is out of firefly's crontab.
 
 I'll look into 7.4.
 
 LER

I've taken the cheater's way out for 7.4, and turned off the perl stuff for
now.

As to HEAD, I've played with the system send/recv space parms, and let's see
if that helps the stats stuff.

LER


-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893




Re: [HACKERS] Going for 'all green' buildfarm results

2006-06-02 Thread Larry Rosenman
Larry Rosenman wrote:
 Larry Rosenman wrote:
 Tom Lane wrote:
 Andrew Dunstan [EMAIL PROTECTED] writes:
 Larry Rosenman said:
 If I generate fixes for firefly (I'm the owner), would they have a
  prayer of being applied?
 
 Sure, although I wouldn't bother with 7.3 - just take 7.3 out of
 firefly's build schedule. That's not carte blanche on fixes, of
 course - we'd have to see them.
 
 What he said ... it'd depend entirely on how ugly the fixes are ;-)
 
 Ok, 7.3 is out of firefly's crontab.
 
 I'll look into 7.4.
 
 LER
 
  I've taken the cheater's way out for 7.4, and turned off the perl
  stuff for now.
  
  As to HEAD, I've played with the system send/recv space parms, and
  let's see if that helps the stats stuff.
 
 LER

well, the changes didn't help.

I've pulled ALL the cronjobs from firefly.

consider it dead.

Since it is an outlier, it's not useful.

LER


-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 512-248-2683 E-Mail: ler@lerctr.org
US Mail: 430 Valona Loop, Round Rock, TX 78681-3893




[HACKERS] Going for all green buildfarm results

2006-06-01 Thread Tom Lane
I've been making another pass over getting rid of buildfarm failures.
The remaining ones I see at the moment are:

firefly HEAD: intermittent failures in the stats test.  We seem to have
fixed every other platform back in January, but not this one.

kudu HEAD: one-time failure 6/1/06 in statement_timeout test, never seen
before.  Is it possible system was under enough load that the 1-second
timeout fired before control reached the exception block?

tapir HEAD: pilot error, insufficient SysV shmem settings

carp various: carp seems to have *serious* hardware problems, as it
has been failing randomly in all branches for a long time.  I suggest
putting that poor machine out to pasture.

penguin 8.0: fails in tsearch2.  Previous investigation says that the
failure is unfixable without initdb, which we are not going to force
for 8.0 branch.  I suggest retiring penguin from checking 8.0, as
there's not much point in continuing to see a failure there.  Or is
it worth improving buildfarm to be able to skip specific tests?

penguin 7.4: fails in initdb, with what seems to be a variant of the
alignment issue that kills tsearch2 in 8.0.  We won't fix this either,
so again might as well stop tracking this branch on this machine.

cobra, stoat, sponge 7.4: pilot error.  Either install Tk or configure
--without-tk.

firefly 7.4: dblink test fails, with what looks like an rpath problem.
Another one that we fixed awhile ago, and the fix worked on every
platform but this one.

firefly 7.3: trivial regression diffs; we could install variant
comparison files if anyone cared.

cobra, stoat, caribou 7.3: same Tk configuration error as in 7.4 branch

Firefly is obviously the outlier here.  I dunno if anyone cares enough
about SCO to spend time investigating it (I don't).  Most of the others
just need a little bit of attention from the machine owner.

regards, tom lane
