Re: 4.0 locked up over the weekend

2007-05-10 Thread Joachim Schipper
On Wed, May 09, 2007 at 11:46:13AM -0700, Bruce Bauer wrote:
 On 5/8/07, Joachim Schipper [EMAIL PROTECTED] wrote:
  On Tue, May 08, 2007 at 09:05:44AM -0700, Bruce Bauer wrote:
   Probably a good idea to put some load on the system anyway.

   Running make in ports/x11/kde should keep it busy for a while
   Not familiar with bonnie++, I'll check it out
  [snip: bonnie runs fine]
 Update:
 
 I've experienced 3 more hard lockups.
 No messages on the console screen. Nothing unusual in any of the log
 files that I've found. Make running in /usr/ports/x11/kde was
 interrupted at different tasks each time (downloading, compiling, and
 running a configure script). System recovered each time with no
 problems after a power cycle.
 
 Are there some system monitoring tools I should be running to keep
 track of various resources?

No, not really; very few things you could do would cause the system to
freeze.

Okay, so something is wrong. Troubleshooting it tends to be hard;
however, you are experiencing lock-ups, not crashes, which points a bit
more toward hardware. Perhaps the box simply gets too hot? Most modern
systems have sensors, which can usually be read via the hw.sensors
sysctl, or at least on the BIOS screen. Simply cleaning out the dust
tends to help here.
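If you want a record of the sensor readings leading up to the next lock-up, something like the following could run in the background. It is a minimal sketch, assuming a POSIX sh and that 'sysctl hw.sensors' lists the attached sensors (the it0 hardware monitor in this dmesg should show up there). The function names, the SENSORS_CMD override and the log path in the usage line are hypothetical. The sync after each snapshot is meant to push the readings to disk before a hang can lose them.

```sh
#!/bin/sh
# Append timestamped hw.sensors snapshots to a log so the readings taken
# just before a hard lock-up survive the power cycle.
# snapshot_sensors, log_sensors and SENSORS_CMD are illustrative names.

# Write one snapshot: a timestamp, the sensor list, and a blank line.
snapshot_sensors() {
	logfile=$1
	{
		date
		# Left unquoted on purpose so "sysctl hw.sensors" splits into
		# a command and its argument; SENSORS_CMD can override it.
		${SENSORS_CMD:-sysctl hw.sensors}
		echo
	} >> "$logfile"
	sync	# flush buffers so the snapshot reaches the disk
}

# Take COUNT snapshots INTERVAL seconds apart (COUNT 0 means forever).
log_sensors() {
	logfile=$1
	interval=$2
	count=${3:-0}
	n=0
	while :; do
		snapshot_sensors "$logfile"
		n=$((n + 1))
		if [ "$count" -ne 0 ] && [ "$n" -ge "$count" ]; then
			break
		fi
		sleep "$interval"
	done
}
```

For example, 'log_sensors /var/log/sensors.log 60 &' would take a reading every minute until the box hangs; after the next power cycle, the tail of that file shows the last values the kernel reported.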

Joachim

-- 
TFMotD: fpa, fea, fta (4) - DEC FDDI controller device driver



Re: 4.0 locked up over the weekend

2007-05-10 Thread Nick Holland

Bruce Bauer wrote:

This system has been running flawlessly since mid-March with GENERIC
plus the 010 patch. dmesg below
This morning I found it totally unresponsive both through network and
at the console.  Had to use the power switch to recover.

Where do I start trying to track this down?

The system is running sshd and openvpn only

DMESG:
OpenBSD 4.0 (GENERICp) #0: Fri Mar 16 19:07:33 MST 2007
   [EMAIL PROTECTED]:/usr/src/sys/arch/i386/compile/GENERICp
cpu0: AMD Sempron(tm) Processor 3000+ (AuthenticAMD 686-class, 256KB L2 cache) 1.61 GHz
cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SSE3,CX16

..
Is this an amd64 capable Sempron?  It looks like it is, based on the 
rest of the dmesg.


If so...this could be the i386 on amd64 bug, the symptoms sure seem 
to fit.  (granted, locked hard covers a lot of problems... :)


If that's the case, you might want to upgrade to 4.1, which should 
take care of the problem.


Nick.



Re: 4.0 locked up over the weekend

2007-05-10 Thread Bryan

Can you get rid of extraneous hardware?  Can you drop some RAM, and
the video card?  How about any of the AMD-specific processor settings,
like HyperTransport?

Can you disable apm?  Maybe there are some conflicts in the apm...

These are just a few ideas that I thought of...







Re: 4.0 locked up over the weekend

2007-05-10 Thread Tobias Weingartner
In article [EMAIL PROTECTED], Nick Holland wrote:
  cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SSE3,CX16
   
  ..
  Is this an amd64 capable Sempron?  It looks like it is, based on the 
  rest of the dmesg.

Nope, no LONG in those CPU flags...

-- 
 [100~Plax]sb16i0A2172656B63616820636420726568746F6E61207473754A[dZ1!=b]salax



Re: 4.0 locked up over the weekend

2007-05-10 Thread Tobias Weingartner
Tobias Weingartner wrote:
  In article [EMAIL PROTECTED], Nick Holland wrote:
   cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SSE3,CX16

   ..
   Is this an amd64 capable Sempron?  It looks like it is, based on the 
   rest of the dmesg.
 
  Nope, no LONG in that cpu flags...

And while this part is right (that CPU does not have LONG support), it
may still exhibit the PAE bug.
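Checking a saved dmesg for a given feature flag, as done by eye above, can be scripted. A minimal sketch, assuming the flags appear as a comma-separated list on a line starting with 'cpu0: ' (as in a dmesg that has not been re-wrapped by a mail client) and that the boot messages were saved to /var/run/dmesg.boot; the function name is hypothetical. It matches whole tokens, so PSE36 does not count as PSE.

```sh
#!/bin/sh
# has_cpu_flag FLAG [DMESG_FILE]
# Succeeds if FLAG appears as a whole token on any "cpu0: " line of the
# saved dmesg, e.g. PAE or LONG. The function name is illustrative.
has_cpu_flag() {
	flag=$1
	dmesgfile=${2:-/var/run/dmesg.boot}
	# Split the cpu0 lines on spaces and commas, one token per line,
	# then look for an exact, fixed-string whole-line match of the flag.
	grep '^cpu0: ' "$dmesgfile" | tr ' ,' '\n\n' | grep -qxF "$flag"
}
```

With the dmesg from this thread saved to a file, 'has_cpu_flag PAE file' should succeed and 'has_cpu_flag LONG file' should fail, matching the reading above.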




Re: 4.0 locked up over the weekend

2007-05-09 Thread Bruce Bauer

Update:

I've experienced 3 more hard lockups.
No messages on the console screen. Nothing unusual in any of the log
files that I've found. Make running in /usr/ports/x11/kde was
interrupted at different tasks each time (downloading, compiling, and
running a configure script). System recovered each time with no
problems after a power cycle.

Are there some system monitoring tools I should be running to keep
track of various resources?





Re: 4.0 locked up over the weekend

2007-05-08 Thread Bruce Bauer

Hmmm...

Probably a good idea to put some load on the system anyway.
See how the VPN data transfer holds up.
Downloading ports.tar.gz now
Running make in ports/x11/kde should keep it busy for a while
Not familiar with bonnie++, I'll check it out

Thanks,

Bruce





Re: 4.0 locked up over the weekend

2007-05-08 Thread Joachim Schipper
On Tue, May 08, 2007 at 09:05:44AM -0700, Bruce Bauer wrote:
 Probably a good idea to put some load on the system anyway.
 See how the VPN data transfer holds up.
 Downloading ports.tar.gz now
 Running make in ports/x11/kde should keep it busy for a while
 Not familiar with bonnie++, I'll check it out

Bonnie++ just generates a lot of I/O. The 'ghetto' version involves
running 'tar xzf src.tar.gz; rm -rf src' in a loop.
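The 'ghetto' loop could be wrapped in a small script along these lines. A sketch under a few assumptions: a POSIX sh with gzip and tar, and any gzipped tarball as input (a source tarball is just one example). It unpacks into a throwaway directory rather than relying on the archive having a top-level src directory; the function name and scratch directory name are made up for illustration.

```sh
#!/bin/sh
# io_stress TARBALL PASSES
# Repeatedly unpack a gzipped tarball into a scratch directory and delete
# it again, generating lots of disk I/O. PASSES 0 means loop forever.
# The function and scratch directory names are illustrative.
io_stress() {
	tarball=$1
	passes=$2
	scratch=io_stress.$$	# created under the current directory
	[ -r "$tarball" ] || return 1
	i=0
	while [ "$passes" -eq 0 ] || [ "$i" -lt "$passes" ]; do
		mkdir "$scratch" || return 1
		# gzip opens the tarball relative to the starting directory,
		# while tar extracts inside the scratch directory.
		if ! gzip -dc "$tarball" | (cd "$scratch" && tar xf -); then
			rm -rf "$scratch"
			return 1
		fi
		rm -rf "$scratch"
		i=$((i + 1))
	done
	echo "completed $i passes"
}
```

For example, 'io_stress /path/to/src.tar.gz 0 &' would keep the disk busy until it is killed (or the box locks up).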

Let us know how it goes...

Joachim

-- 
TFMotD: tht, thtc (4) - Tehuti Networks 10Gb Ethernet device



Re: 4.0 locked up over the weekend

2007-05-08 Thread Bruce Bauer

Initial results:

compiled bonnie++ from ports
make is running in ports/x11/kde
2 video streams passing through VPN tunnel at about 32 fps total
output from bonnie++:
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
roadrunner.for 300M 50379  46 49432   6  6322   1 25376  41 34974   4 130.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2542   5 +++++ +++  5113   8  2898   7 +++++ +++  5478   9
roadrunner.fortechsw.com,300M,50379,46,49432,6,6322,1,25376,41,34974,4,130.7,0,16,2542,5,+++++,+++,5113,8,2898,7,+++++,+++,5478,9

ran uptime after bonnie++ finished
11:21AM up 1 day, 2:15, 2 users, load averages: 4.08, 3.15, 2.55

Everything seems to be running smoothly

Bruce





4.0 locked up over the weekend

2007-05-07 Thread Bruce Bauer

This system has been running flawlessly since mid-March with GENERIC
plus the 010 patch. dmesg below
This morning I found it totally unresponsive both through network and
at the console.  Had to use the power switch to recover.

Where do I start trying to track this down?

The system is running sshd and openvpn only

DMESG:
OpenBSD 4.0 (GENERICp) #0: Fri Mar 16 19:07:33 MST 2007
   [EMAIL PROTECTED]:/usr/src/sys/arch/i386/compile/GENERICp
cpu0: AMD Sempron(tm) Processor 3000+ (AuthenticAMD 686-class, 256KB L2 cache) 1.61 GHz
cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SSE3,CX16
real mem  = 501706752 (489948K)
avail mem = 449642496 (439104K)
using 4256 buffers containing 25186304 bytes (24596K) of memory
mainbus0 (root)
bios0 at mainbus0: AT/286+(f0) BIOS, date 02/27/07, BIOS32 rev. 0 @ 0xfa820, SMBIOS rev. 2.4 @ 0xf (41 entries)
apm0 at bios0: Power Management spec V1.2
apm0: AC on, battery charge unknown
apm0: flags 70102 dobusy 1 doidle 1
pcibios0 at bios0: rev 3.0 @ 0xf/0xcfd4
pcibios0: PCI IRQ Routing Table rev 1.0 @ 0xfcee0/240 (13 entries)
pcibios0: bad IRQ table checksum
pcibios0: PCI BIOS has 13 Interrupt Routing table entries
pcibios0: PCI Exclusive IRQs: 5 10 11
pcibios0: no compatible PCI ICU found
pcibios0: Warning, unable to fix up PCI interrupt routing
pcibios0: PCI bus #3 is the last bus
bios0: ROM list: 0xc/0xde00 0xd/0x1800
cpu0 at mainbus0
pci0 at mainbus0 bus 0: configuration mode 1 (no bios)
NVIDIA C51 Host rev 0xa2 at pci0 dev 0 function 0 not configured
NVIDIA C51 Memory rev 0xa2 at pci0 dev 0 function 1 not configured
NVIDIA C51 Memory rev 0xa2 at pci0 dev 0 function 2 not configured
NVIDIA C51 Memory rev 0xa2 at pci0 dev 0 function 3 not configured
NVIDIA C51 Memory rev 0xa2 at pci0 dev 0 function 4 not configured
NVIDIA C51 Memory rev 0xa2 at pci0 dev 0 function 5 not configured
NVIDIA C51 Memory rev 0xa2 at pci0 dev 0 function 6 not configured
NVIDIA C51 Memory rev 0xa2 at pci0 dev 0 function 7 not configured
ppb0 at pci0 dev 3 function 0 NVIDIA C51 PCIE rev 0xa1
pci1 at ppb0 bus 1
ppb1 at pci0 dev 4 function 0 NVIDIA C51 PCIE rev 0xa1
pci2 at ppb1 bus 2
vga1 at pci0 dev 5 function 0 NVIDIA GeForce 6100 rev 0xa2
wsdisplay0 at vga1 mux 1: console (80x25, vt100 emulation)
wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
NVIDIA MCP51 Host rev 0xa2 at pci0 dev 9 function 0 not configured
pcib0 at pci0 dev 10 function 0 vendor NVIDIA, unknown product 0x0261 rev 0xa3
nviic0 at pci0 dev 10 function 1 NVIDIA MCP51 SMBus rev 0xa3
iic0 at nviic0
iic1 at nviic0
NVIDIA MCP51 Memory rev 0xa3 at pci0 dev 10 function 2 not configured
ohci0 at pci0 dev 11 function 0 NVIDIA MCP51 USB rev 0xa3: irq 10, version 1.0, legacy support
usb0 at ohci0: USB revision 1.0
uhub0 at usb0
uhub0: NVIDIA OHCI root hub, rev 1.00/1.00, addr 1
uhub0: 8 ports with 8 removable, self powered
ehci0 at pci0 dev 11 function 1 NVIDIA MCP51 USB rev 0xa3: irq 11
usb1 at ehci0: USB revision 2.0
uhub1 at usb1
uhub1: NVIDIA EHCI root hub, rev 2.00/1.00, addr 1
uhub1: 8 ports with 8 removable, self powered
pciide0 at pci0 dev 13 function 0 NVIDIA MCP51 IDE rev 0xa1: DMA, channel 0 configured to compatibility, channel 1 configured to compatibility
pciide0: channel 0 disabled (no drives)
atapiscsi0 at pciide0 channel 1 drive 0
scsibus0 at atapiscsi0: 2 targets
cd0 at scsibus0 targ 0 lun 0: Lite-On, LTN486 48x Max, YD01 SCSI0 5/cdrom removable
cd0(pciide0:1:0): using PIO mode 4, Ultra-DMA mode 2
pciide1 at pci0 dev 14 function 0 NVIDIA MCP51 SATA rev 0xa1: DMA
pciide1: using irq 11 for native-PCI interrupt
wd0 at pciide1 channel 0 drive 0: WDC WD800JD-00MSA1
wd0: 16-sector PIO, LBA48, 76319MB, 156301488 sectors
wd0(pciide1:0:0): using PIO mode 4, Ultra-DMA mode 5
ppb2 at pci0 dev 16 function 0 NVIDIA MCP51 PCI-PCI rev 0xa2
pci3 at ppb2 bus 3
auich0 at pci0 dev 16 function 2 NVIDIA MCP51 AC97 rev 0xa2: irq 11, MCP51 AC97
ac97: codec id 0x414c4760 (Avance Logic ALC655 rev 0)
audio0 at auich0
nfe0 at pci0 dev 20 function 0 NVIDIA MCP51 LAN rev 0xa3: irq 10, address 00:19:21:33:1d:93
ukphy0 at nfe0 phy 1: Generic IEEE 802.3u media interface, rev. 1: OUI 0x0050ef, model 0x0007
pchb0 at pci0 dev 24 function 0 AMD AMD64 HyperTransport rev 0x00
pchb1 at pci0 dev 24 function 1 AMD AMD64 Address Map rev 0x00
pchb2 at pci0 dev 24 function 2 AMD AMD64 DRAM Cfg rev 0x00
pchb3 at pci0 dev 24 function 3 AMD AMD64 Misc Cfg rev 0x00
isa0 at pcib0
isadma0 at isa0
pckbc0 at isa0 port 0x60/5
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pms0 at pckbc0 (aux slot)
pckbc0: using irq 12 for aux slot
wsmouse0 at pms0 mux 0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: PC speaker
spkr0 at pcppi0
lpt0 at isa0 port 0x378/4 irq 7
it0 at isa0 port 0x290/8: IT87
npx0 at isa0 port 0xf0/16: using exception 16
pccom0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
biomask ef6d 

Re: 4.0 locked up over the weekend

2007-05-07 Thread Jack J. Woehr
On May 7, 2007, at 12:20 PM, Bruce Bauer wrote:

 This system has been running flawlessly since mid-March with GENERIC
 plus the 010 patch. dmesg below
 This morning I found it totally unresponsive both through network and
 at the console.  Had to use the power switch to recover.

 Where do I start trying to track this down?

Open the box, check your power supply, and blow it out with air if
it's full of dust. Number one cause of mysterious lockups in my
personal experience. Next, run a memory test.

Only then start trying to debug software, e.g., OpenBSD.

-- 
Jack J. Woehr
Director of Development
Absolute Performance, Inc.
[EMAIL PROTECTED]
303-443-7000 ext. 527



Re: 4.0 locked up over the weekend

2007-05-07 Thread Joachim Schipper
On Mon, May 07, 2007 at 11:20:00AM -0700, Bruce Bauer wrote:
 This system has been running flawlessly since mid-March with GENERIC
 plus the 010 patch. dmesg below
 This morning I found it totally unresponsive both through network and
 at the console.  Had to use the power switch to recover.
 
 Where do I start trying to track this down?

If it happens again, try to see if there are any messages on the
console.

Otherwise, look at what was last written to the log files; that might or
might not contain a clue. (The kernel screaming at you about something
or other would be a solid clue, for instance.)
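To see what was last written before the power cycle, one option is to print the lines logged just before the most recent boot banner. A rough sketch, assuming syslogd records the kernel's boot messages in /var/log/messages with a '/bsd: OpenBSD ' prefix on the banner line; adjust the pattern if your logs look different. The function name and defaults are hypothetical.

```sh
#!/bin/sh
# before_last_boot LOGFILE [N] [PATTERN]
# Print up to N lines that precede the last line containing PATTERN
# (by default the kernel boot banner), i.e. the final entries written
# before the machine was power cycled. Names and defaults are illustrative.
before_last_boot() {
	logfile=$1
	n=${2:-20}
	pattern=${3:-"/bsd: OpenBSD "}
	awk -v n="$n" -v pat="$pattern" '
		{ line[NR] = $0 }		# keep every line in memory
		index($0, pat) { boot = NR }	# remember the latest banner
		END {
			start = boot - n
			if (start < 1)
				start = 1
			for (i = start; i < boot; i++)
				print line[i]
		}' "$logfile"
}
```

Running 'before_last_boot /var/log/messages 50' after the next lock-up would show the last 50 entries from before the reboot; no output means the pattern never matched.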

Joachim



Re: 4.0 locked up over the weekend

2007-05-07 Thread Bruce Bauer

On 5/7/07, Jack J. Woehr [EMAIL PROTECTED] wrote:



On May 7, 2007, at 12:20 PM, Bruce Bauer wrote:

This system has been running flawlessly since mid-March with GENERIC
plus the 010 patch. dmesg below
This morning I found it totally unresponsive both through network and
at the console.  Had to use the power switch to recover.

Where do I start trying to track this down?

Open the box, check your power supply, and blow it out with air if
it's full of dust. Number one cause of mysterious lockups in my
personal experience. Next, run a memory test.

Only then start trying to debug software, e.g., OpenBSD.





Thanks for the response.

OK, maybe a little less basic than that.  The system is sitting in a
restricted access server room.  Not a clean room, but very little
dust.  Nice and cool.  The system still looks brand new, inside and
out.

The purpose of this system is to receive streaming video data over the
VPN from IP webcams.  It doesn't do anything with the data except pass
it on to a DVR system over the local network.  Plans are to add
another network card so the VPN and the local network will be on
separate channels.  But, for now, it all goes through one card.

It has worked in this configuration for over a month with video from 2
cameras coming in.

Oops! Message from Joachim Schipper just came in:

There were no console messages.
The authlog does show that someone is trying to brute force an ssh
login. I think I'll turn off sshd for now...



Re: 4.0 locked up over the weekend

2007-05-07 Thread Joachim Schipper
On Mon, May 07, 2007 at 12:42:55PM -0700, Bruce Bauer wrote:
 On 5/7/07, Jack J. Woehr [EMAIL PROTECTED] wrote:
 On May 7, 2007, at 12:20 PM, Bruce Bauer wrote:
  This system has been running flawlessly since mid-March with GENERIC
  plus the 010 patch. dmesg below
  This morning I found it totally unresponsive both through network and
  at the console.  Had to use the power switch to recover.
  
  Where do I start trying to track this down?
 
 Open the box, check your power supply, and blow it out with air if
 it's full of dust. Number one cause of mysterious lockups in my
 personal experience. Next, run a memory test.
 
 Only then start trying to debug software, e.g., OpenBSD.

 Thanks for the response.
 
 OK, maybe a little less basic than that.  The system is sitting in a
 restricted access server room.  Not a clean room, but very little
 dust.  Nice and cool.  The system still looks brand new, inside and
 out.
 
 The purpose of this system is to receive streaming video data over the
 VPN from IP webcams.  It doesn't do anything with the data except pass
 it on to a DVR system over the local network.  Plans are to add
 another network card so the VPN and the local network will be on
 separate channels.  But, for now, it all goes through one card.
 
 It has worked in this configuration for over a month with video from 2
 cameras coming in.
 
 Oops! Message from Joachim Schipper  just came in:
 
 There were no console messages
 The authlog does show that someone is trying to brute force an ssh
 login. I think I'll turn off sshd for now...

Nah, script kiddies trying to brute-force SSH logins are so common that I
just tuned them out of the log parser altogether. Just use public keys,
or good passwords.

That said, Jack might be right to suspect some random hardware failure.
If so, how about some proper stress testing? Compiling the whole system
is fairly good at exercising the CPU and memory, and something like
bonnie++ might help you test the disk.
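A long, repeated build is an easy burn-in to leave running. Below is a sketch of a wrapper that reruns a command, logging each pass with a timestamp and stopping at the first failure; the function name and log location are hypothetical, and any build command can be plugged in.

```sh
#!/bin/sh
# repeat_until_failure PASSES LOGFILE COMMAND [ARGS...]
# Run COMMAND up to PASSES times (0 means forever), appending a timestamped
# line per pass to LOGFILE and stopping at the first failure.
# The function name is illustrative.
repeat_until_failure() {
	passes=$1
	logfile=$2
	shift 2
	i=1
	while [ "$passes" -eq 0 ] || [ "$i" -le "$passes" ]; do
		"$@"
		status=$?	# capture before anything else overwrites $?
		if [ "$status" -ne 0 ]; then
			echo "$(date): pass $i FAILED (exit $status)" >> "$logfile"
			return 1
		fi
		echo "$(date): pass $i ok" >> "$logfile"
		sync	# keep the log on disk in case the machine hangs
		i=$((i + 1))
	done
}
```

As a smaller stand-in for a full system build, a kernel build could serve as the workload, assuming the source tree is in /usr/src and 'config GENERIC' has already been run in /usr/src/sys/arch/i386/conf:

    repeat_until_failure 0 /var/log/burnin.log sh -c 'cd /usr/src/sys/arch/i386/compile/GENERIC && make clean && make depend && make'

If a pass fails, the log records which one and its exit status.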

If that doesn't turn anything up, the software might be at fault...

Joachim

-- 
TFMotD: piconv (1) - iconv(1), reinvented in perl