Re: 4.0 frozen
diego wrote:
> Federico, I have the same problem on 3.9:
> http://marc.theaimsgroup.com/?l=openbsd-misc&m=115192952225331&w=2
> My server is still running 3.9. Do you have the same problem with 4.0?
> Did you modify the kernel with NKMEMPAGES_MAX and it still freezes?

After the NKMEMPAGES_MAX change, the freezes became much rarer, but after the 4.0 upgrade they became much more frequent again.

Bye.

> Federico Giannici wrote:
>> Stephen Schaff wrote:
>>> I've got 4.0 running nicely on a server sitting in a data centre,
>>> thanks to the help of the members of this list.
>>> It's been up since Nov. 22nd and in production.
>>>
>>> Yesterday it inexplicably went dark. I went down to check it out, and
>>> hooked up the monitor and keyboard. I could see the welcoming login
>>> prompt, but it wouldn't accept any input. It wasn't accepting any
>>> pings from a remote system on the network either. The only word I
>>> have for that is frozen - if there's better terminology out there -
>>> please let me know.
>>
>> Welcome to the club! :-(
>>
>> A couple of minutes ago I restarted a frozen PC of mine.
>> This happens to different PCs, and I replaced ALL the hardware, but
>> nothing changed.
>>
>> It seems to happen usually during high disk/network activity, but I'm
>> not sure. The freezes certainly became much more frequent after the
>> upgrade from 3.9 to 4.0.
>>
>> I sent several emails here, but nobody seemed to have any real clue...
>>
>> Bye.

--
Federico Giannici [EMAIL PROTECTED]
http://www.neomedia.it
Chairman of the Board - Neomedia S.r.l.
Re: 4.0 frozen
Federico, I have the same problem on 3.9:
http://marc.theaimsgroup.com/?l=openbsd-misc&m=115192952225331&w=2
My server is still running 3.9. Do you have the same problem with 4.0?
Did you modify the kernel with NKMEMPAGES_MAX and it still freezes?

Regards.

Federico Giannici wrote:
> Stephen Schaff wrote:
>> I've got 4.0 running nicely on a server sitting in a data centre,
>> thanks to the help of the members of this list.
>> It's been up since Nov. 22nd and in production.
>>
>> Yesterday it inexplicably went dark. I went down to check it out, and
>> hooked up the monitor and keyboard. I could see the welcoming login
>> prompt, but it wouldn't accept any input. It wasn't accepting any
>> pings from a remote system on the network either. The only word I
>> have for that is frozen - if there's better terminology out there -
>> please let me know.
>
> Welcome to the club! :-(
>
> A couple of minutes ago I restarted a frozen PC of mine.
> This happens to different PCs, and I replaced ALL the hardware, but
> nothing changed.
>
> It seems to happen usually during high disk/network activity, but I'm
> not sure. The freezes certainly became much more frequent after the
> upgrade from 3.9 to 4.0.
>
> I sent several emails here, but nobody seemed to have any real clue...
>
> Bye.
Re: 4.0 frozen
Yeah. I did some testing last night - to no avail.

When it bailed today, I restarted it, expecting the raid to rebuild as it always does. This time it didn't! It booted right up using wd1 and failed wd0 in raid0. Kinda makes me happy I built it that way (special thanks to this page: http://www.argon18.com/raid_openbsd.html ).

So, I think that wd0 may be the cause of the whole problem, and I'll replace it right away and keep an eye on it to make sure that there aren't other problems.

Thanks everyone for your great suggestions. I've been exploring them all.

Best Regards,
Stephen

On 17-Dec-06, at 12:48 PM, Artur Grabowski wrote:
> Stephen Schaff <[EMAIL PROTECTED]> writes:
>> wd0(pciide1:0:0): timeout
>>         type: ata
>>         c_bcount: 65536
>>         c_skip: 0
>> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
>> wd0d: device timeout reading fsbn 234162112 of 234162112-234162239 (wd0 bn 235334857; cn 14648 tn 233 sn 58), retrying
>> wd0: soft error (corrected)
>> wd0(pciide1:0:0): timeout
>>         type: ata
>>         c_bcount: 65536
>>         c_skip: 0
>> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
>> wd0d: device timeout reading fsbn 234997440 of 234997440-234997567 (wd0 bn 236170185; cn 14700 tn 233 sn 6), retrying
>> wd0: soft error (corrected)
>> wd0(pciide1:0:0): timeout
>>         type: ata
>>         c_bcount: 65536
>>         c_skip: 0
>> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
>> wd0d: device timeout reading fsbn 235719872 of 235719872-23571 (wd0 bn 236892617; cn 14745 tn 225 sn 17), retrying
>> wd0: soft error (corrected)
>
> This is a pretty good indication of what's going wrong. Your disk is sad.
>
> //art
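For anyone who wants to keep an eye on messages like the ones quoted above, here is a rough sketch (not something anyone in this thread posted) that counts timeouts and corrected soft errors per disk in saved dmesg text. The function name summarize_disk_errors and the two regular expressions are my own, modelled only on the wd0 lines shown in this thread; other drivers or releases may word these messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the kernel messages quoted in this thread (assumption:
# other controllers or releases may print them differently).
TIMEOUT_RE = re.compile(r"\b(wd\d+)\(pciide\d+:\d+:\d+\): timeout")
SOFT_ERROR_RE = re.compile(r"\b(wd\d+): soft error \(corrected\)")


def summarize_disk_errors(dmesg_text):
    """Return {disk: {"timeouts": n, "soft_errors": m}} for disks with errors."""
    timeouts = Counter(TIMEOUT_RE.findall(dmesg_text))
    soft_errors = Counter(SOFT_ERROR_RE.findall(dmesg_text))
    disks = sorted(set(timeouts) | set(soft_errors))
    return {
        disk: {"timeouts": timeouts[disk], "soft_errors": soft_errors[disk]}
        for disk in disks
    }
```

One possible use is to save the output of dmesg to a file, pass its contents to summarize_disk_errors(), and compare the counts between reboots. A disk whose count keeps climbing while the others stay at zero is a reasonable first suspect, though as noted elsewhere in the thread the controller, cables or RAM can produce the same messages.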
Re: 4.0 frozen
> Original message
> >Date: Sun, 17 Dec 2006 02:57:56 +0100
> >From: Dimitry Andric <[EMAIL PROTECTED]>
> >Subject: Re: 4.0 frozen
> >To: Stephen Schaff <[EMAIL PROTECTED]>
> >Cc: misc@openbsd.org
> >
> >Stephen Schaff wrote:
> >> Yesterday it inexplicably went dark.
> >...
> >> wd0(pciide1:0:0): timeout
> >>         type: ata
> >>         c_bcount: 65536
> >>         c_skip: 0
> >> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> >> wd0d: device timeout reading fsbn 234162112 of 234162112-234162239 (wd0
> >> bn 235334857; cn 14648 tn 233 sn 58), retrying
> >> wd0: soft error (corrected)
> >> wd0(pciide1:0:0): timeout
> >>         type: ata
> >>         c_bcount: 65536
> >>         c_skip: 0
> >> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> >... more of those IDE errors ...
> >
> >Maybe dying disks?
>

Running

# atactl wd0 smartstatus

is also a quick way to check. I've got something in rc.local for that...

Travers Buda
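The rc.local snippet mentioned above was not posted, so what follows is only a guess at what such a check could look like, written as a small Python wrapper around the same atactl command. The function name smart_status and the injectable run parameter are hypothetical. The "No SMART threshold exceeded" string is what I believe atactl prints for a healthy disk, but treat that as an assumption and compare it with the output on your own system.

```python
import subprocess


def smart_status(device="wd0", run=subprocess.run):
    """Run `atactl <device> smartstatus` and return (ok, output).

    `run` defaults to subprocess.run and can be swapped for a stub in tests.
    """
    result = run(
        ["atactl", device, "smartstatus"],
        capture_output=True,
        text=True,
    )
    output = (result.stdout + result.stderr).strip()
    # Assumption: a healthy disk reports "No SMART threshold exceeded".
    ok = result.returncode == 0 and output.startswith("No SMART threshold exceeded")
    return ok, output
```

The command normally needs root, and SMART may have to be switched on first (atactl wd0 smartenable). A startup script could call smart_status() for each disk and log or mail the output whenever ok is False.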
Re: 4.0 frozen
Jacob Yocom-Piatt wrote:
>>> wd0d: device timeout reading fsbn 234162112 of 234162112-234162239 (wd0
>>> bn 235334857; cn 14648 tn 233 sn 58), retrying
>>> wd0: soft error (corrected)
>> Maybe dying disks?
> i must second this suggestion. almost every time i've seen these IDE timeout
> messages, it means that the disk(s) are damaged, close to dead or totally
> dead.

Note that these errors can also be caused by any other part of the IDE subsystem, e.g. the controller, the cables, etc. Or even by bad RAM...

For sanity's sake, do a full hardware diagnostic of the machine.
Re: 4.0 frozen
Original message
>Date: Sun, 17 Dec 2006 02:57:56 +0100
>From: Dimitry Andric <[EMAIL PROTECTED]>
>Subject: Re: 4.0 frozen
>To: Stephen Schaff <[EMAIL PROTECTED]>
>Cc: misc@openbsd.org
>
>Stephen Schaff wrote:
>> Yesterday it inexplicably went dark.
>...
>> wd0(pciide1:0:0): timeout
>>         type: ata
>>         c_bcount: 65536
>>         c_skip: 0
>> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
>> wd0d: device timeout reading fsbn 234162112 of 234162112-234162239 (wd0
>> bn 235334857; cn 14648 tn 233 sn 58), retrying
>> wd0: soft error (corrected)
>> wd0(pciide1:0:0): timeout
>>         type: ata
>>         c_bcount: 65536
>>         c_skip: 0
>> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
>... more of those IDE errors ...
>
>Maybe dying disks?
>

i must second this suggestion. almost every time i've seen these IDE timeout messages, it means that the disk(s) are damaged, close to dead or totally dead.

i find that doing disk intensive operations, e.g. extracting src.tar.gz, with the machine in question will likely reproduce the timeouts if this is the case.

cheers,
jake
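As a complement to extracting src.tar.gz, a sequential read of a large file on the suspect disk also generates sustained I/O. The sketch below is only an illustration of that idea: the name timed_read, the 64 KiB chunk size and the one-second threshold are arbitrary choices of mine, not anything recommended in this thread. It reads a file in chunks and records the reads that took unusually long.

```python
import time


def timed_read(path, chunk_size=64 * 1024, slow_seconds=1.0):
    """Read `path` sequentially and return (bytes_read, slow_chunks).

    slow_chunks is a list of (offset, seconds) for every read that took at
    least `slow_seconds`.
    """
    slow_chunks = []
    total = 0
    # buffering=0 so each read() call goes to the OS rather than a Python buffer.
    with open(path, "rb", buffering=0) as f:
        while True:
            start = time.monotonic()
            chunk = f.read(chunk_size)
            elapsed = time.monotonic() - start
            if not chunk:
                break
            if elapsed >= slow_seconds:
                slow_chunks.append((total, elapsed))
            total += len(chunk)
    return total, slow_chunks
```

Data already held in the buffer cache will be served from memory, so the file should be larger than RAM (or freshly written after a reboot) for the reads to reach the disk. Watching dmesg while this runs shows whether the timeouts come back.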
Re: 4.0 frozen
* Stephen Schaff wrote:
> So, I thought I would post my dmesg here and see if it grabs the
> attention of anyone who knows better than I do. Any insight would be
> much appreciated. It turns my stomach to think I'd have to reinstall
> with a different OS.

If this system is critical for you, you might consider installing a hardware watchdog timer which will then reboot the machine if it hangs.
Re: 4.0 frozen
On Sat, 16 Dec 2006 21:31:28 -0500 "STeve Andre'" <[EMAIL PROTECTED]> wrote:
>
> If things have been running for nearly a month and now you've
> crashed twice in two days, that says that the system was just
> fine, and now things have gone to hell. You have new hardware
> problems.
>
> I'd first suspect ram. Get memtest86 and run it for 24 hours or
> so. I'd also take the raid array and stuff it into another identical
> computer. You do have a spare system for this production
> service, don't you?
>

That's some good advice--if the problems are just now showing with great frequency, it's the hardware. I'd check the disk, RAM, and PSU in that order.

Travers Buda
Re: 4.0 frozen
On Saturday 16 December 2006 20:24, Stephen Schaff wrote:
> I've got 4.0 running nicely on a server sitting in a data centre,
> thanks to the help of the members of this list.
> It's been up since Nov. 22nd and in production.
>
> Yesterday it inexplicably went dark. I went down to check it out, and
> hooked up the monitor and keyboard. I could see the welcoming login
> prompt, but it wouldn't accept any input. It wasn't accepting any
> pings from a remote system on the network either. The only word I
> have for that is frozen - if there's better terminology out there -
> please let me know.
>
> Anyway, after hard booting the machine, and rebuilding the raid - I
> checked all the log files I could think of and can't find a thing.
> Nada. Then - it went down again today! I'm not sure what to do now.
>
> So, I thought I would post my dmesg here and see if it grabs the
> attention of anyone who knows better than I do. Any insight would be
> much appreciated. It turns my stomach to think I'd have to reinstall
> with a different OS.
>
> Best Regards,
> Stephen

If things have been running for nearly a month and now you've crashed twice in two days, that says that the system was just fine, and now things have gone to hell. You have new hardware problems.

I'd first suspect ram. Get memtest86 and run it for 24 hours or so. I'd also take the raid array and stuff it into another identical computer. You do have a spare system for this production service, don't you?

--STeve Andre'
Re: 4.0 frozen
Hi Stephen.

On 12/17/06, Stephen Schaff <[EMAIL PROTECTED]> wrote:
> wd0(pciide1:0:0): timeout
>         type: ata
>         c_bcount: 65536
>         c_skip: 0
> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> wd0d: device timeout reading fsbn 234162112 of 234162112-234162239 (wd0 bn 235334857; cn 14648 tn 233 sn 58), retrying
> wd0: soft error (corrected)
> wd0(pciide1:0:0): timeout
>         type: ata
>         c_bcount: 65536
>         c_skip: 0
> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> wd0d: device timeout reading fsbn 234997440 of 234997440-234997567 (wd0 bn 236170185; cn 14700 tn 233 sn 6), retrying
> wd0: soft error (corrected)
> wd0(pciide1:0:0): timeout
>         type: ata
>         c_bcount: 65536
>         c_skip: 0
> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> wd0d: device timeout reading fsbn 235719872 of 235719872-23571 (wd0 bn 236892617; cn 14745 tn 225 sn 17), retrying
> wd0: soft error (corrected)

I guess wd0 holds your root file system, right? I had the same problem with my OpenBSD access point over a year ago. After I replaced the disk, the system worked like a charm :)

I suggest that you replace the dying hard disk with a new one and give it a try.

HTH,
Andreas.

--
Hobbes : Shouldn't we read the instructions?
Calvin : Do I look like a sissy?
Re: 4.0 frozen
Stephen Schaff wrote:
> I've got 4.0 running nicely on a server sitting in a data centre,
> thanks to the help of the members of this list.
> It's been up since Nov. 22nd and in production.
>
> Yesterday it inexplicably went dark. I went down to check it out, and
> hooked up the monitor and keyboard. I could see the welcoming login
> prompt, but it wouldn't accept any input. It wasn't accepting any
> pings from a remote system on the network either. The only word I
> have for that is frozen - if there's better terminology out there -
> please let me know.

Welcome to the club! :-(

A couple of minutes ago I restarted a frozen PC of mine. This happens to different PCs, and I replaced ALL the hardware, but nothing changed.

It seems to happen usually during high disk/network activity, but I'm not sure. The freezes certainly became much more frequent after the upgrade from 3.9 to 4.0.

I sent several emails here, but nobody seemed to have any real clue...

Bye.

--
Federico Giannici [EMAIL PROTECTED]
http://www.neomedia.it
Re: 4.0 frozen
Stephen Schaff wrote:
> Yesterday it inexplicably went dark.
...
> wd0(pciide1:0:0): timeout
>         type: ata
>         c_bcount: 65536
>         c_skip: 0
> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
> wd0d: device timeout reading fsbn 234162112 of 234162112-234162239 (wd0
> bn 235334857; cn 14648 tn 233 sn 58), retrying
> wd0: soft error (corrected)
> wd0(pciide1:0:0): timeout
>         type: ata
>         c_bcount: 65536
>         c_skip: 0
> pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
... more of those IDE errors ...

Maybe dying disks?
Re: 4.0 frozen
On 12/16/06, Stephen Schaff <[EMAIL PROTECTED]> wrote:
> Yesterday it inexplicably went dark. I went down to check it out, and
> hooked up the monitor and keyboard. I could see the welcoming login
> prompt, but it wouldn't accept any input. It wasn't accepting any
> pings from a remote system on the network either. The only word I
> have for that is frozen - if there's better terminology out there -
> please let me know.
>
> Anyway, after hard booting the machine, and rebuilding the raid - I
> checked all the log files I could think of and can't find a thing.
> Nada. Then - it went down again today! I'm not sure what to do now.

Sounds like a physical problem.

I've seen this type of "hard freeze" with bad power, RAM, motherboard, or CPU. The problem is often related to heat.

If you can take it out of production for half a day or so, I would try UBCD, starting with the memory tests.

http://www.ultimatebootcd.com/

Kevin
4.0 frozen
I've got 4.0 running nicely on a server sitting in a data centre, thanks to the help of the members of this list.
It's been up since Nov. 22nd and in production.

Yesterday it inexplicably went dark. I went down to check it out, and hooked up the monitor and keyboard. I could see the welcoming login prompt, but it wouldn't accept any input. It wasn't accepting any pings from a remote system on the network either. The only word I have for that is frozen - if there's better terminology out there - please let me know.

Anyway, after hard booting the machine, and rebuilding the raid - I checked all the log files I could think of and can't find a thing. Nada. Then - it went down again today! I'm not sure what to do now.

So, I thought I would post my dmesg here and see if it grabs the attention of anyone who knows better than I do. Any insight would be much appreciated. It turns my stomach to think I'd have to reinstall with a different OS.

Best Regards,
Stephen

, addr 1
uhub1: 8 ports with 8 removable, self powered
pciide0 at pci0 dev 13 function 0 "NVIDIA MCP51 IDE" rev 0xa1: DMA, channel 0 configured to compatibility, channel 1 configured to compatibility
atapiscsi0 at pciide0 channel 0 drive 0
scsibus0 at atapiscsi0: 2 targets
cd0 at scsibus0 targ 0 lun 0: SCSI0 5/cdrom removable
cd0(pciide0:0:0): using PIO mode 4, DMA mode 2
pciide0: channel 1 disabled (no drives)
pciide1 at pci0 dev 14 function 0 "NVIDIA MCP51 SATA" rev 0xa1: DMA
pciide1: using irq 11 for native-PCI interrupt
wd0 at pciide1 channel 0 drive 0:
wd0: 16-sector PIO, LBA48, 238475MB, 488397168 sectors
wd0(pciide1:0:0): using PIO mode 4, Ultra-DMA mode 5
pciide2 at pci0 dev 15 function 0 "NVIDIA MCP51 SATA" rev 0xa1: DMA
pciide2: using irq 10 for native-PCI interrupt
wd1 at pciide2 channel 0 drive 0:
wd1: 16-sector PIO, LBA48, 238475MB, 488397168 sectors
wd1(pciide2:0:0): using PIO mode 4, Ultra-DMA mode 5
wd2 at pciide2 channel 1 drive 0:
wd2: 16-sector PIO, LBA48, 238475MB, 488397168 sectors
wd2(pciide2:1:0): using PIO mode 4, Ultra-DMA mode 5
ppb3 at pci0 dev 16 function 0 "NVIDIA MCP51 PCI-PCI" rev 0xa2
pci4 at ppb3 bus 4
"VIA VT6306 FireWire" rev 0x80 at pci4 dev 5 function 0 not configured
em0 at pci4 dev 9 function 0 "Intel PRO/1000GT (82541GI)" rev 0x05: irq 5, address 00:0e:0c:b1:4e:e6
azalia0 at pci0 dev 16 function 1 "NVIDIA MCP51 HD Audio" rev 0xa2: irq 5
azalia0: host: High Definition Audio rev. 1.0
azalia0: codec: 0x04x/0x11d4 (rev. 5.0), HDA version 1.0
audio0 at azalia0
nfe0 at pci0 dev 20 function 0 "NVIDIA MCP51 LAN" rev 0xa1: irq 5, address 00:13:d4:ff:0f:4b
eephy0 at nfe0 phy 1: Marvell 88E Gigabit PHY, rev. 2
pchb0 at pci0 dev 24 function 0 "AMD AMD64 HyperTransport" rev 0x00
pchb1 at pci0 dev 24 function 1 "AMD AMD64 Address Map" rev 0x00
pchb2 at pci0 dev 24 function 2 "AMD AMD64 DRAM Cfg" rev 0x00
pchb3 at pci0 dev 24 function 3 "AMD AMD64 Misc Cfg" rev 0x00
isa0 at pcib0
isadma0 at isa0
pckbc0 at isa0 port 0x60/5
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pcppi0 at isa0 port 0x61
midi0 at pcppi0:
spkr0 at pcppi0
lpt0 at isa0 port 0x378/4 irq 7
lm0 at isa0 port 0x290/8: unknown Winbond chip (ID 0xa1)
npx0 at isa0 port 0xf0/16: using exception 16
pccom0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
biomask ff6d netmask ff6d ttymask ffef
pctr: user-level cycle counter enabled
Kernelized RAIDframe activated
cd0(atapiscsi0:0:0): Check Condition (error 0x70) on opcode 0x0
    SENSE KEY: Not Ready
     ASC/ASCQ: Medium Not Present
raid0 (root): (RAID Level 1) total number of sectors is 487219200 (237900 MB) as root
dkcsum: wd0 matches BIOS drive 0x80
dkcsum: wd1 matches BIOS drive 0x81
dkcsum: wd2 matches BIOS drive 0x82
WARNING: / was not properly unmounted
swapmount: no device
raid0: Device already configured!
wd0(pciide1:0:0): timeout
        type: ata
        c_bcount: 65536
        c_skip: 0
pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
wd0d: device timeout reading fsbn 234162112 of 234162112-234162239 (wd0 bn 235334857; cn 14648 tn 233 sn 58), retrying
wd0: soft error (corrected)
wd0(pciide1:0:0): timeout
        type: ata
        c_bcount: 65536
        c_skip: 0
pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
wd0d: device timeout reading fsbn 234997440 of 234997440-234997567 (wd0 bn 236170185; cn 14700 tn 233 sn 6), retrying
wd0: soft error (corrected)
wd0(pciide1:0:0): timeout
        type: ata
        c_bcount: 65536
        c_skip: 0
pciide1:0:0: bus-master DMA error: missing interrupt, status=0x21
wd0d: device timeout reading fsbn 235719872 of 235719872-23571 (wd0 bn 236892617; cn 14745 tn 225 sn 17), retrying
wd0: soft error (corrected)
Warning: truncating spare disk /dev/wd2d to 487219200 blocks.
OpenBSD 4.0 (GENERIC) #0: Thu Nov 23 01:28:38 MST 2006
[EMAIL PROTECTED]:/mnt/sys/arch/i386/co