Re: [GENERAL] Various intermittent bugs/instability - how to debug?

2008-09-25 Thread Frederik Ramm

Hi,

   just to give an update on this:

Frederik Ramm wrote:
Every other night, the process aborts with some strange error message, 
and never at the same position:


[...]

Turns out it *was* a RAM defect on one of the machines. memtest86 ran 
for a day and didn't detect it, but when I started making copies of a 
100 GB file and m5dsumming it afterwards, I'd get one or two bad copies 
per day. I then juggled RAM chips until I had identified the defective 
one. As you can imagine, this took a while.


Lesson learnt: memtest86 running for a day and not finding an error does 
not mean your RAM is ok!


I still have occasional problems on the *other* machine (if you remember 
my original post, I had similar errors on two different machines) and 
even massive data copying did not show any error there but at least a 
PostGIS bug seems extremely unlikely now (and that's still running stock 
packages, not patched).


Bye
Frederik

--
Frederik Ramm   www.geofabrik.de
Geofabrik GmbH  Handelsregister: HRB Mannheim 703657
Rueppurrer Strasse 4Geschaeftsfuehrung: Frederik Ramm
76137 Karlsruhe Tel: 0721-1803560-0
[EMAIL PROTECTED]  Fax: 0721-1803560-9


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Various intermittent bugs/instability - how to debug?

2008-09-10 Thread Mark Cave-Ayland

Frederik Ramm wrote:

Dear 
PostgreSQL community,


   I hope you can help me with a problem I'm having - I'm stuck and 
don't know how to debug this further.


I have a rather large nightly process that imports a lot of data from 
the OpenStreetMap project into a PostGIS database, then proceeds doing 
all sorts of things - creating spatial indexes, computing bounding 
boxes, doing simplification of geometries, that kind of stuff. The whole 
job usually takes about five hours.


I'm running this on a Quad-Core Linux (Ubuntu, PostgreSQL 8.3) machine 
with 8 GB RAM.


Every other night, the process aborts with some strange error message, 
and never at the same position:


ERROR:  invalid page header in block 166406 of relation node_tags

ERROR:  could not open segment 2 of relation 1663/24253056/24253895 
(target block 1421295656): No such file or directory


ERROR:  Unknown geometry type: 10

When I continue the process after the failure, it will usually work.

I know you all think hardware problem now. Of course this was my first 
guess as well. I ran a memory test for a night, no results; I downgraded 
do failsafe defaults for all BIOS timings, again no change. Ran 
cpuburn and all sorts of other things to grill the hardware - nothing.


Then I bought an entirely new machine; similar setup, but using a 
Gigabyte instead of Asus mainboard, different chipset, slightly faster 
Quad-Core processor, and again 8 GB RAM and Ubuntu Hardy with 
PostgresSQL 8.3 and matching PostGIS.


Believe it or not, this machine shows the *same* problems. It is not 
100% reproducible, sometimes the job works fully, but every other day it 
just breaks down with one of the funny messages like above. No memtest 
errors here either.


Both machines are consumer quality, i.e. normal Intel processors and 
not the server (Xeon) stock.


I am at a loss - how can I proceed? This looks like a hardware problem 
alright, but so simliar problems on two so different machines? Is there 
something wrong with Intel's Quad-Core CPUs?


What could I do to have a better chance of reproducing the error and 
ultimately identifying the component responsible? Is there some kind of 
PostgresSQL load test, something like cpuburn for PostgreSQL?


Have there been other reports of intermittent problems like mine, and 
does anybody have any blind guesses...?


Thanks
Frederik



Hi Frederik,

We did find a memory clobber in the PostGIS ANALYZE routine a while 
back, but the fix hasn't yet made it into a release.


If you are building from source, please can you try applying the patch 
here: http://code.google.com/p/postgis/issues/detail?id=43 and reporting 
back whether it helps or not?



ATB,

Mark.

--
Mark Cave-Ayland
Sirius Corporation - The Open Source Experts
http://www.siriusit.co.uk
T: +44 870 608 0063

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Various intermittent bugs/instability - how to debug?

2008-09-09 Thread Frederik Ramm

Dear PostgreSQL community,

   I hope you can help me with a problem I'm having - I'm stuck and 
don't know how to debug this further.


I have a rather large nightly process that imports a lot of data from 
the OpenStreetMap project into a PostGIS database, then proceeds doing 
all sorts of things - creating spatial indexes, computing bounding 
boxes, doing simplification of geometries, that kind of stuff. The whole 
job usually takes about five hours.


I'm running this on a Quad-Core Linux (Ubuntu, PostgreSQL 8.3) machine 
with 8 GB RAM.


Every other night, the process aborts with some strange error message, 
and never at the same position:


ERROR:  invalid page header in block 166406 of relation node_tags

ERROR:  could not open segment 2 of relation 1663/24253056/24253895 
(target block 1421295656): No such file or directory


ERROR:  Unknown geometry type: 10

When I continue the process after the failure, it will usually work.

I know you all think hardware problem now. Of course this was my first 
guess as well. I ran a memory test for a night, no results; I downgraded 
do failsafe defaults for all BIOS timings, again no change. Ran 
cpuburn and all sorts of other things to grill the hardware - nothing.


Then I bought an entirely new machine; similar setup, but using a 
Gigabyte instead of Asus mainboard, different chipset, slightly faster 
Quad-Core processor, and again 8 GB RAM and Ubuntu Hardy with 
PostgresSQL 8.3 and matching PostGIS.


Believe it or not, this machine shows the *same* problems. It is not 
100% reproducible, sometimes the job works fully, but every other day it 
just breaks down with one of the funny messages like above. No memtest 
errors here either.


Both machines are consumer quality, i.e. normal Intel processors and 
not the server (Xeon) stock.


I am at a loss - how can I proceed? This looks like a hardware problem 
alright, but so simliar problems on two so different machines? Is there 
something wrong with Intel's Quad-Core CPUs?


What could I do to have a better chance of reproducing the error and 
ultimately identifying the component responsible? Is there some kind of 
PostgresSQL load test, something like cpuburn for PostgreSQL?


Have there been other reports of intermittent problems like mine, and 
does anybody have any blind guesses...?


Thanks
Frederik

--
Frederik Ramm   www.geofabrik.de
Geofabrik GmbH  Handelsregister: HRB Mannheim 703657
Rueppurrer Strasse 4Geschaeftsfuehrung: Frederik Ramm
76137 Karlsruhe Tel: 0721-1803560-0
[EMAIL PROTECTED]  Fax: 0721-1803560-9

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Various intermittent bugs/instability - how to debug?

2008-09-09 Thread Joshua D. Drake

Frederik Ramm wrote:

Dear PostgreSQL community,


Every other night, the process aborts with some strange error message, 
and never at the same position:


ERROR:  invalid page header in block 166406 of relation node_tags

ERROR:  could not open segment 2 of relation 1663/24253056/24253895 
(target block 1421295656): No such file or directory


What could I do to have a better chance of reproducing the error and 
ultimately identifying the component responsible? Is there some kind of 
PostgresSQL load test, something like cpuburn for PostgreSQL?


Have there been other reports of intermittent problems like mine, and 
does anybody have any blind guesses...?


Hard drives. You said you replaced the machines and ran cpu burn etc... 
You didn't say you checked the hard drives (or replaced them).


Joshua D. Drake




Thanks
Frederik




--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Various intermittent bugs/instability - how to debug?

2008-09-09 Thread Frederik Ramm

Hi,

Joshua D. Drake wrote:
Hard drives. You said you replaced the machines and ran cpu burn etc... 
You didn't say you checked the hard drives (or replaced them).


The new machine has new hard drives. On the old machine I had a hardware 
RAID5 array made up of 6x500 GB SATA with an Areca RAID controller and 
Linux LVM; the new machine has 3x750 GB SATA plugged directly into the 
main board (no RAID, no LVM). So even the drives are quite different, 
and it's different drivers in both cases...


Bye
Frederik

--
Frederik Ramm   www.geofabrik.de
Geofabrik GmbH  Handelsregister: HRB Mannheim 703657
Rueppurrer Strasse 4Geschaeftsfuehrung: Frederik Ramm
76137 Karlsruhe Tel: 0721-1803560-0
[EMAIL PROTECTED]  Fax: 0721-1803560-9

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Various intermittent bugs/instability - how to debug?

2008-09-09 Thread Scott Marlowe
On Tue, Sep 9, 2008 at 8:50 AM, Frederik Ramm
[EMAIL PROTECTED] wrote:
 Dear PostgreSQL community,

   I hope you can help me with a problem I'm having - I'm stuck and don't
 know how to debug this further.

 I have a rather large nightly process that imports a lot of data from the
 OpenStreetMap project into a PostGIS database, then proceeds doing all sorts
 of things - creating spatial indexes, computing bounding boxes, doing
 simplification of geometries, that kind of stuff. The whole job usually
 takes about five hours.

 I'm running this on a Quad-Core Linux (Ubuntu, PostgreSQL 8.3) machine with
 8 GB RAM.

 Every other night, the process aborts with some strange error message, and
 never at the same position:

 ERROR:  invalid page header in block 166406 of relation node_tags

 ERROR:  could not open segment 2 of relation 1663/24253056/24253895 (target
 block 1421295656): No such file or directory

 ERROR:  Unknown geometry type: 10

 When I continue the process after the failure, it will usually work.

 I know you all think hardware problem now. Of course this was my first
 guess as well. I ran a memory test for a night, no results; I downgraded do
 failsafe defaults for all BIOS timings, again no change. Ran cpuburn and
 all sorts of other things to grill the hardware - nothing.

You definitely are suffering from db corruption, and given the number
and differing type of errors, it would seem unlikely that pgsql has a
load of bugs only you are seeing.  OTOH, if the bug is hidden deep in
postgis or something, then who knows...

I'd definitely run something like bonnie++ for a few days and see if
it gets HD errors or not.

And definitely run memtest86 for a day or so and make sure you're not
getting any errors there.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Various intermittent bugs/instability - how to debug?

2008-09-09 Thread Scott Marlowe
Also, I'd try running a large, long running pgbench test to see if you
get any errors.  If such a simple schema / operation in large scale
gets errors it's almost certainly a hardware failure.

Have you had power removed while the system was running?  With a lot
of direct attached SATA drives you can get corruption from that as
well due to linux / sata often not obeying write barriers.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Various intermittent bugs/instability - how to debug?

2008-09-09 Thread Martijn van Oosterhout
On Tue, Sep 09, 2008 at 04:50:53PM +0200, Frederik Ramm wrote:
 I have a rather large nightly process that imports a lot of data from 
 the OpenStreetMap project into a PostGIS database, then proceeds doing 
 all sorts of things - creating spatial indexes, computing bounding 
 boxes, doing simplification of geometries, that kind of stuff. The whole 
 job usually takes about five hours.

Which program is this?

 ERROR:  invalid page header in block 166406 of relation node_tags

After this happens, can you find the relfilenode of the node_tags table
and extract the relevent block? Something like:

dd if=dboid/relfilenode bs=8k skip=166406 count=1 of=/tmp/block

What sort of data does it contain?

 ERROR:  could not open segment 2 of relation 1663/24253056/24253895 
 (target block 1421295656): No such file or directory

Does any file of that name exist? 

ls 1663/24253056/24253895*

Hope this helps,
-- 
Martijn van Oosterhout   [EMAIL PROTECTED]   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [GENERAL] Various intermittent bugs/instability - how to debug?

2008-09-09 Thread Tom Lane
Scott Marlowe [EMAIL PROTECTED] writes:
 On Tue, Sep 9, 2008 at 8:50 AM, Frederik Ramm
 I hope you can help me with a problem I'm having - I'm stuck and don't
 know how to debug this further.

 You definitely are suffering from db corruption, and given the number
 and differing type of errors, it would seem unlikely that pgsql has a
 load of bugs only you are seeing.  OTOH, if the bug is hidden deep in
 postgis or something, then who knows...

Yeah, I would wonder about a postgis bug.  Another thing to look into
is kernel bugs --- I gather that these boxes are running more or less
the same kernel?  If there's something about your hardware that the
kernel doesn't quite grok, all sorts of fun could ensue.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general