Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-08 Thread Jim C. Nasby
On Fri, Nov 04, 2005 at 08:46:27PM -0400, Marc G. Fournier wrote:
> On Fri, 4 Nov 2005, Jim C. Nasby wrote:
> For all the talk about "couldn't it be part of regression", I haven't seen
> anyone submit a patch that would test for it ... since I believe both you
> and Tom have stated that "for things like race conditions, I don't know
> that you can create reproducible cases", can you submit a patch for how
> you propose this should be added to the regression tests?

I have an idea, but it might be better if Robert could produce a test
case since it would cover both a context storm issue as well as this
race condition.

Barring that, my idea was to spawn a number of processes, all of which
were trying to insert/update a random value in a table using David
Fetter's plpgsql code for doing a merge. This would produce a heavy
workload that also used subtransactions (due to the exception handling
in plpgsql).
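
A minimal sketch of the idea above, in Python. Everything named here is an
assumption for illustration: the table merge_test, the function merge_kv
(modeled on the common UPDATE-then-INSERT retry pattern, not David Fetter's
actual code), and the use of one psql client per worker. Each SELECT runs as
its own transaction, and the inner BEGIN ... EXCEPTION block is what creates
a subtransaction.

```python
import os
import random
import subprocess
import tempfile

# Hypothetical schema and merge function; load this once before the run.
MERGE_DDL = """
CREATE TABLE merge_test (k integer PRIMARY KEY, v text);

CREATE FUNCTION merge_kv(key integer, data text) RETURNS void AS $$
BEGIN
    LOOP
        UPDATE merge_test SET v = data WHERE k = key;
        IF found THEN
            RETURN;
        END IF;
        BEGIN  -- a block with an EXCEPTION clause runs as a subtransaction
            INSERT INTO merge_test (k, v) VALUES (key, data);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- another session inserted first; loop and retry
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
"""


def worker_script(seed, calls, key_space=100):
    """Build one worker's SQL: `calls` merges on random keys."""
    rng = random.Random(seed)  # seeded, so a failing run can be replayed
    lines = []
    for _ in range(calls):
        key = rng.randrange(key_space)
        lines.append("SELECT merge_kv(%d, 'w%d');" % (key, seed))
    return "\n".join(lines) + "\n"


def run_storm(dbname, workers=8, calls=10000):
    """Run `workers` psql clients at once; return their exit codes."""
    procs, paths = [], []
    for seed in range(workers):
        with tempfile.NamedTemporaryFile("w", suffix=".sql",
                                         delete=False) as f:
            f.write(worker_script(seed, calls))
            path = f.name
        paths.append(path)
        procs.append(subprocess.Popen(
            ["psql", "-X", "-q", "-d", dbname, "-f", path],
            stdout=subprocess.DEVNULL))
    codes = [p.wait() for p in procs]
    for path in paths:
        os.unlink(path)
    return codes
```

A small key_space keeps the workers colliding on the same rows, which is what
forces the exception path. Whether this load is enough to hit the race is
untested; the server would also need to be built with --enable-cassert for
the assertion to fire at all.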

Suggestions for a better test welcome...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-08 Thread Robert Creager
On Tue, 08 Nov 2005 14:09:58 -0600
Jim C. Nasby [EMAIL PROTECTED] wrote:

> On Fri, Nov 04, 2005 at 08:46:27PM -0400, Marc G. Fournier wrote:
> > On Fri, 4 Nov 2005, Jim C. Nasby wrote:
> > For all the talk about "couldn't it be part of regression", I haven't seen
> > anyone submit a patch that would test for it ... since I believe both you
> > and Tom have stated that "for things like race conditions, I don't know
> > that you can create reproducible cases", can you submit a patch for how
> > you propose this should be added to the regression tests?
>
> I have an idea, but it might be better if Robert could produce a test
> case since it would cover both a context storm issue as well as this
> race condition.

Actually, I have a test case.  I just sent it out to Tom a couple of hours ago. 
The quick and dirty is that it shows the problem after running for about 20
minutes on my Xeon system with 8.1.0...  I cannot get it to fail on my AMD
system with a much higher load...

I can send it to others who are interested.  The e-mail with dump, module and
script is just over 1MB.

Cheers,
Rob



Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-08 Thread Jim C. Nasby
On Tue, Nov 08, 2005 at 02:09:35PM -0700, Robert Creager wrote:
> On Tue, 08 Nov 2005 14:09:58 -0600
> Jim C. Nasby [EMAIL PROTECTED] wrote:
>
> > On Fri, Nov 04, 2005 at 08:46:27PM -0400, Marc G. Fournier wrote:
> > > On Fri, 4 Nov 2005, Jim C. Nasby wrote:
> > > For all the talk about "couldn't it be part of regression", I haven't
> > > seen anyone submit a patch that would test for it ... since I believe
> > > both you and Tom have stated that "for things like race conditions, I
> > > don't know that you can create reproducible cases", can you submit a
> > > patch for how you propose this should be added to the regression tests?
> >
> > I have an idea, but it might be better if Robert could produce a test
> > case since it would cover both a context storm issue as well as this
> > race condition.
>
> Actually, I have a test case.  I just sent it out to Tom a couple of hours
> ago.  The quick and dirty is that it shows the problem after running for
> about 20 minutes on my Xeon system with 8.1.0...  I cannot get it to fail
> on my AMD system with a much higher load...
>
> I can send it to others who are interested.  The e-mail with dump, module and
> script is just over 1MB.

Just to clarify, did it show the assert failure, the context switch
storm, or both?

Yes, I'd like to take a look at this if you could send it on to me. Is
there any simple way to populate the database? I doubt people would be
keen on having a 1MB dump in CVS...
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-08 Thread Robert Creager
On Tue, 08 Nov 2005 15:36:18 -0600
Jim C. Nasby [EMAIL PROTECTED] wrote:
 
> Just to clarify, did it show the assert failure, the context switch
> storm, or both?

I didn't try for the assert after the patch.  I was developing the test when I
ran across the assert problem.  It should trigger the assert problem.

 
> Yes, I'd like to take a look at this if you could send it on to me. Is
> there any simple way to populate the database? I doubt people would be
> keen on having a 1MB dump in CVS...

Hmmm...  Should be possible to populate all the data algorithmically.  For the
most part, the specific data doesn't matter, just the general patterns in the
data.
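
One way to read "populate all the data algorithmically" is a seeded
generator that reproduces a rough access pattern rather than the real rows.
The sketch below is only an illustration: the table stress_data(id, k, v),
the column names, and the skew (most rows landing on a few hot keys) are all
assumptions, since the actual schema and data patterns in the dump are not
shown in this thread.

```python
import random


def synthetic_rows(n, seed=0, hot_keys=10, key_space=10000,
                   hot_fraction=0.8):
    """Yield n (id, k, v) rows; most rows land on a few hot keys."""
    rng = random.Random(seed)  # same seed gives the same data set
    for i in range(n):
        if rng.random() < hot_fraction:
            key = rng.randrange(hot_keys)
        else:
            key = rng.randrange(hot_keys, key_space)
        yield (i, key, "payload-%d" % rng.randrange(1000))


def as_copy(rows, table="stress_data"):
    """Render rows as a COPY ... FROM stdin block that psql can load."""
    lines = ["COPY %s (id, k, v) FROM stdin;" % table]
    for row in rows:
        lines.append("\t".join(str(col) for col in row))
    lines.append("\\.")  # end-of-data marker for COPY
    return "\n".join(lines) + "\n"
```

A script like this is a few dozen lines, so it could live in CVS where a 1MB
dump could not, and the seed makes every run load identical data.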

I'll re-send the e-mail to you.

Cheers,
Rob



Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-04 Thread Jim C. Nasby
On Wed, Nov 02, 2005 at 06:45:21PM -0500, Tom Lane wrote:
> Robert Creager [EMAIL PROTECTED] writes:
> > Ran with both for an hour with no problem, where I could produce the ASSERT
> > failure within minutes for the non patched version.
>
> Great.  I'll go ahead and commit the smaller fix into HEAD and the back
> branches, and hold the larger fix for 8.2.
>
> It's curious that two different people stumbled across this just
> recently, when the bug has been there since 7.2.  I suppose that the
> addition of pg_subtrans increased the probability of seeing the bug by
> a considerable amount, but I'm still surprised it wasn't identified
> before.  At the very least, we should have heard about it earlier in
> the 8.0 release cycle ...

Well, the common theme in each case IIRC is a fairly high transaction
rate; on the order of hundreds if not thousands per second.

Could something like that be added to regression, or maybe as a separate
test case for the buildfarm?
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-04 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
> Could something like that be added to regression, or maybe as a separate
> test case for the buildfarm?

If you don't have a self-contained, reproducible test case, it's a bit
pointless to suggest adding the nonexistent test case to the regression
suite.

regards, tom lane



Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-04 Thread Jim C. Nasby
On Fri, Nov 04, 2005 at 04:35:10PM -0500, Tom Lane wrote:
> Jim C. Nasby [EMAIL PROTECTED] writes:
> > Could something like that be added to regression, or maybe as a separate
> > test case for the buildfarm?
>
> If you don't have a self-contained, reproducible test case, it's a bit
> pointless to suggest adding the nonexistent test case to the regression
> suite.

Well, for things like race conditions I don't know that you can create
reproducible test cases. My point was that this bug was exposed by
databases with workloads that involved very high transaction rates. I
know in the case of my client this is due to some sub-optimal design
decisions, and I believe the other case was similar. My suggestion is
that having a test that involves a lot of row-by-row type operations
that generate a very high transaction rate would help expose these kinds
of bugs.

Of course if someone can come up with a self-contained reproducible test
case for this race condition that would be great as well. :)
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-04 Thread Andrew Dunstan



Jim C. Nasby wrote:


> On Fri, Nov 04, 2005 at 04:35:10PM -0500, Tom Lane wrote:
> > Jim C. Nasby [EMAIL PROTECTED] writes:
> > > Could something like that be added to regression, or maybe as a separate
> > > test case for the buildfarm?
> >
> > If you don't have a self-contained, reproducible test case, it's a bit
> > pointless to suggest adding the nonexistent test case to the regression
> > suite.
>
> Well, for things like race conditions I don't know that you can create
> reproducible test cases. My point was that this bug was exposed by
> databases with workloads that involved very high transaction rates. I
> know in the case of my client this is due to some sub-optimal design
> decisions, and I believe the other case was similar. My suggestion is
> that having a test that involves a lot of row-by-row type operations
> that generate a very high transaction rate would help expose these kinds
> of bugs.
>
> Of course if someone can come up with a self-contained reproducible test
> case for this race condition that would be great as well. :)
 



These conditions make it quite unsuitable for buildfarm, which is 
designed as a thin veneer over the postgres build process, and intended 
to run anywhere you can build postgres.


Maybe you could use one of the Linux labs, since your client is on RHEL.

cheers

andrew



Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-04 Thread Jim C. Nasby
On Fri, Nov 04, 2005 at 05:26:25PM -0500, Andrew Dunstan wrote:
> > Well, for things like race conditions I don't know that you can create
> > reproducible test cases. My point was that this bug was exposed by
> > databases with workloads that involved very high transaction rates. I
> > know in the case of my client this is due to some sub-optimal design
> > decisions, and I believe the other case was similar. My suggestion is
> > that having a test that involves a lot of row-by-row type operations
> > that generate a very high transaction rate would help expose these kinds
> > of bugs.
> >
> > Of course if someone can come up with a self-contained reproducible test
> > case for this race condition that would be great as well. :)
>
> These conditions make it quite unsuitable for buildfarm, which is
> designed as a thin veneer over the postgres build process, and intended
> to run anywhere you can build postgres.
>
> Maybe you could use one of the Linux labs, since your client is on RHEL.

I'm not worried about my client, I'm just thinking of a way to better
ferret out bugs like this. And there's no real reason why something like
this couldn't be part of regression, or an additional build target.

BTW, I just realized that part of the answer to Tom's musing about why
this hasn't been seen before now is that few (if any) regular users are
running with asserts turned on, so odds are good that they'd never know
if this problem occurred or not. Further argument for trying to test this
on the buildfarm and/or enabling assertions by default, IMHO.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-04 Thread Marc G. Fournier

On Fri, 4 Nov 2005, Jim C. Nasby wrote:


> On Fri, Nov 04, 2005 at 05:26:25PM -0500, Andrew Dunstan wrote:
> > > Well, for things like race conditions I don't know that you can create
> > > reproducible test cases. My point was that this bug was exposed by
> > > databases with workloads that involved very high transaction rates. I
> > > know in the case of my client this is due to some sub-optimal design
> > > decisions, and I believe the other case was similar. My suggestion is
> > > that having a test that involves a lot of row-by-row type operations
> > > that generate a very high transaction rate would help expose these kinds
> > > of bugs.
> > >
> > > Of course if someone can come up with a self-contained reproducible test
> > > case for this race condition that would be great as well. :)
> >
> > These conditions make it quite unsuitable for buildfarm, which is
> > designed as a thin veneer over the postgres build process, and intended
> > to run anywhere you can build postgres.
> >
> > Maybe you could use one of the Linux labs, since your client is on RHEL.
>
> I'm not worried about my client, I'm just thinking of a way to better
> ferret out bugs like this. And there's no real reason why something like
> this couldn't be part of regression, or an additional build target.


For all the talk about "couldn't it be part of regression", I haven't seen
anyone submit a patch that would test for it ... since I believe both you
and Tom have stated that "for things like race conditions, I don't know
that you can create reproducible cases", can you submit a patch for how
you propose this should be added to the regression tests?



Marc G. Fournier   Hub.Org Networking Services (http://www.hub.org)
Email: [EMAIL PROTECTED]   Yahoo!: yscrappy  ICQ: 7615664



[HACKERS] Assert failure found in 8.1RC1

2005-11-02 Thread Robert Creager

Hey all,

While trying to get a reproducible test case for my CS storm problem (see
http://archives.postgresql.org/pgsql-hackers/2005-10/msg00585.php), I upgraded
to 8.1RC1 and encountered the following assert:

TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c",
Line: 309)

On the good side, I can no longer get a sustained CS storm with this
level of code.  Looks like something might have changed for the better in the
last 2 weeks?

For the assert, I had 5 sets of my app running, each with 8 potential
outstanding queries.  I then threw my test at the db with 20 more queries, and
took the above failure.

creagrs=# select version();
                                version
-----------------------------------------------------------------------
 PostgreSQL 8.1RC1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.1
 (Mandrake Linux 9.2 3.3.1-2mdk)

BINDIR = /usr/local/pgsql810/bin
DOCDIR = /usr/local/pgsql810/doc
INCLUDEDIR = /usr/local/pgsql810/include
PKGINCLUDEDIR = /usr/local/pgsql810/include
INCLUDEDIR-SERVER = /usr/local/pgsql810/include/server
LIBDIR = /usr/local/pgsql810/lib
PKGLIBDIR = /usr/local/pgsql810/lib
LOCALEDIR =
MANDIR = /usr/local/pgsql810/man
SHAREDIR = /usr/local/pgsql810/share
SYSCONFDIR = /usr/local/pgsql810/etc
PGXS = /usr/local/pgsql810/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--enable-syslog' '--prefix=/usr/local/pgsql810' '--enable-debug'
'--enable-cassert'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Winline -Wendif-labels
-fno-strict-aliasing -g
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,/usr/local/pgsql810/lib
LDFLAGS_SL =
LIBS = -lpgport -lz -lreadline -lncurses -lcrypt -lresolv -lnsl -ldl -lm -lbsd
VERSION = PostgreSQL 8.1RC1

Thanks,
Rob

-- 
Robert Creager
Advisory Software Engineer
Data Management Group
Sun Microsystems
[EMAIL PROTECTED]
303.673.2365 Office
888.912.4458 Pager




Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-02 Thread Tom Lane
Robert Creager [EMAIL PROTECTED] writes:
> TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno &&
> shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c",
> Line: 309)

http://archives.postgresql.org/pgsql-hackers/2005-10/msg01385.php

If you can reproduce the failure with any reliability, please try
one or both of the proposed patches:

http://archives.postgresql.org/pgsql-patches/2005-10/msg00240.php
http://archives.postgresql.org/pgsql-patches/2005-10/msg00248.php

regards, tom lane



Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-02 Thread Robert Creager
On Wed, 02 Nov 2005 15:37:05 -0500
Tom Lane [EMAIL PROTECTED] wrote:

> Robert Creager [EMAIL PROTECTED] writes:
> > I can reproduce very quickly.  Looks like I should try the patch in 248
> > first to see if it fixes 8.1RC1?
>
> Excellent.  Yes, the second patch is higher priority, but please try
> both while you're at it.

I've put in patch 2.  I'm kicking the s**t out of it, with no problems so far. 
I'll let it run for a while longer.

One note is that I did hit the CS switch problem, but with a combination of
production app and my test app.  But, it took much more activity, wasn't as
severe (queries were typically staying  10 seconds) and the db came out of it a
few minutes after my test app stopped.

I'll put in the first patch and re-run the tests.

Cheers,
Rob



Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-02 Thread Robert Creager
On Wed, 02 Nov 2005 15:19:44 -0500
Tom Lane [EMAIL PROTECTED] wrote:

> Robert Creager [EMAIL PROTECTED] writes:
> > TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno &&
> > shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c",
> > Line: 309)
>
> http://archives.postgresql.org/pgsql-hackers/2005-10/msg01385.php
>
> If you can reproduce the failure with any reliability, please try
> one or both of the proposed patches:
>
> http://archives.postgresql.org/pgsql-patches/2005-10/msg00240.php
> http://archives.postgresql.org/pgsql-patches/2005-10/msg00248.php

Ran with both for an hour with no problem, where I could produce the ASSERT
failure within minutes for the non patched version.

Thanks,
Rob



Re: [HACKERS] Assert failure found in 8.1RC1

2005-11-02 Thread Tom Lane
Robert Creager [EMAIL PROTECTED] writes:
> Ran with both for an hour with no problem, where I could produce the ASSERT
> failure within minutes for the non patched version.

Great.  I'll go ahead and commit the smaller fix into HEAD and the back
branches, and hold the larger fix for 8.2.

It's curious that two different people stumbled across this just
recently, when the bug has been there since 7.2.  I suppose that the
addition of pg_subtrans increased the probability of seeing the bug by
a considerable amount, but I'm still surprised it wasn't identified
before.  At the very least, we should have heard about it earlier in
the 8.0 release cycle ...

regards, tom lane
