Re: [OpenIndiana-discuss] OI Crash

2013-01-24 Thread Dimitri Alexandris
I agree that this looks like a driver problem.

My OI box has two 1G Intel Ethernet ports bonded, and it crashes at random times.

There are also two 10G ports connected, and they work fine.

Symptom: OI crashes under heavy traffic on the bond (5 to 40 minutes
after the heavy traffic starts):

- Nightly rsync backups from other servers (when I throttle the bandwidth, it works OK)

- Heavy FTP traffic from PCs/servers

- Heavy SMB traffic from Windows users

Everything freezes, even the keyboard. Since it is a two-server box, the
only way to recover is to reboot via the IPMI web page, which shares (one
of?) the same Ethernet ports...

The other server in the box runs Proxmox (Debian), with no problems at
all on its Ethernet bond.

Very heavy traffic between the two servers over the internal Intel 10G
ports works fine!

When I ran Nexenta Core (before OI), everything was OK too.


Now I am thinking of switching to NAS4Free (it imports the pool OK).

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss


Re: [OpenIndiana-discuss] OI Crash

2013-01-24 Thread Sašo Kiselkov
Can you provide a crash dump or at least a stack trace of what was going
on when the system crashed?

Cheers,
--
Saso



Re: [OpenIndiana-discuss] OI Crash

2013-01-20 Thread David Scharbach
I have posted this at illumos.org as bug #3489 (with a lame title).

Cheers,

Dave

On 2013-01-20, at 4:31 AM, Albert Lee albert@nexenta.com wrote:

 Hi Dave,
 
 Please try to copy this and any other information you can obtain, as
 explained by others, into a bug report on illumos.org. Some of us are
 very interested in any problems with the CIFS service (which has
 crashed here).
 
 Thanks,
 -Albert
 

Re: [OpenIndiana-discuss] OI Crash

2013-01-19 Thread Reginald Beardsley
One time when I happened to look, I saw that the Ultra 60 I used at work had
been up for over 18 months.

If a sysadmin told me he wanted to reboot a system once a week, just in case,
he'd be looking for a new job very soon or else be sent back to the PC support
pool.

BTW, the reason that 11/780-era admins did not want to shut machines down was
primarily the problems posed by hundreds, if not thousands, of mechanical
connectors, some of which would lose contact if allowed to cool. The cure was
simple but tedious: you went around reseating circuit boards and cabling and
powered up again. There are a lot of boards and cables in a well-populated
11/780, especially if it's got an FPS-120B, a Gould-DeAnza graphics processor,
and a Versatec plotter attached along with the usual disk and tape drives.

One summer weekend in Dallas, my group moved across town, so our workstations
spent the day in a moving van, probably at 130+ F. Monday morning several would
not boot until I went around and reseated the disk drive cables.

Voodoo has no place in computing.

Have Fun!
Reg



Re: [OpenIndiana-discuss] OI Crash

2013-01-19 Thread Aurélien Larcher
Hi,
Has anyone mentioned using 'fmdump'?

With this tool I discovered that an unreliable disk controller on my
workstation was causing OI to freeze approximately every 2 months.
In my case ZFS gets the fault and stands by until the issue is resolved,
so disk I/O waits indefinitely to resume.
Best

Aurelien


-- 
---
LARCHER Aurélien            | KTH, School of Computer Science and Communication
Work: +46 (0) 8 790 71 42   | Lindstedtsvägen 5, Plan 5
Mob.: +46 (0) 7 09 46 40 17 | 100 44 Stockholm, SWEDEN
---
Praise the Caffeine embeddings ...


Re: [OpenIndiana-discuss] OI Crash

2013-01-19 Thread Reginald Beardsley
Having a console window open and checking it periodically can be very helpful.
Such events will get logged to the console. I recently had a correctable event
show up in mine. There's probably a way to have the events trigger an email if
desired.

Have Fun!
Reg



Re: [OpenIndiana-discuss] OI Crash

2013-01-19 Thread Jim Klimov

On 2013-01-19 20:04, Reginald Beardsley wrote:

 Having a console window open and checking it periodically can be very helpful.
 Such events will get logged to the console. I recently had a correctable event
 show up in mine. There's probably a way to have the events trigger an email if
 desired.


http://docs.oracle.com/cd/E19963-01/html/821-1462/fmd-1m.html

Notification Services
syslog (package service/fault-management)
Email (package service/fault-management/smtp-notify)
SNMP (package service/fault-management/snmp-notify)

These all are present in OI as well.

These should also help monitor SMF service state transitions (i.e. failures):
http://www.c0t0d0s0.org/archives/7051-New-Solaris-features-How-to-monitor-SMF-services-via-mail.html

https://blogs.oracle.com/gavinm/entry/notifications_for_smf_instance_state
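
For the email route listed above, a minimal sketch of the setup might look like the following. The package name is the one from the list; the service FMRI, the problem-diagnosed event class and the example address are assumptions that should be checked against smtp-notify(1M) and svccfg(1M) on your release. The function only prints the commands, so they can be reviewed before anything is run as root.

```shell
#!/bin/sh
# Print (do not run) the commands that would turn on FMA email notification.
# Assumed, not taken from this thread: the smtp-notify FMRI, the
# problem-diagnosed event class, and the recipient address.
fma_mail_cmds() {
    addr=${1:-admin@example.com}   # hypothetical recipient
    cat <<EOF
pkg install service/fault-management/smtp-notify
svcadm enable svc:/system/fm/smtp-notify:default
svccfg setnotify problem-diagnosed mailto:$addr
svccfg listnotify problem-diagnosed
EOF
}

# Show what would be run for a local mailbox.
fma_mail_cmds root@localhost
```

Once the printed commands look right for your system, the same output can be piped into a root shell instead of being typed by hand.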

HTH,
//Jim



Re: [OpenIndiana-discuss] OI Crash

2013-01-19 Thread David Scharbach
$ fmdump
TIME                 UUID                                 SUNW-MSG-ID   EVENT
Jan 17 20:08:28.9193 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6 SUNOS-8000-KL Diagnosed
$ uptime
 16:12pm  up 1 day 20:04,  2 users,  load average: 0.08, 0.14, 0.21

Given that today is the 19th, I think the timestamp in the fmdump output is
close to when the OI server last crashed. I don't know what the event means.
Can you let me know?

Cheers,

Dave




Re: [OpenIndiana-discuss] OI Crash

2013-01-19 Thread Aurélien Larcher
If you use the -m flag to get the details, what does it say?



Re: [OpenIndiana-discuss] OI Crash

2013-01-19 Thread David Scharbach
English is good.

$ fmdump -m
SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Jan 17 20:08:28 CST 2013
PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: openindiana
SOURCE: software-diagnosis, REV: 0.1
EVENT-ID: 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
DESC: The system has rebooted after a kernel panic.  Refer to 
http://illumos.org/msg/SUNOS-8000-KL for more information.
AUTO-RESPONSE: The failed system image was dumped to the dump device.  If 
savecore is enabled (see dumpadm(1M)) a copy of the dump will be written to the 
savecore directory /var/crash/openindiana.
IMPACT: There may be some performance impact while the panic is copied to the 
savecore directory.  Disk space usage by panics can be substantial.
REC-ACTION: If savecore is not enabled then please take steps to preserve the 
crash image.
Use 'fmdump -Vp -u 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6' to view more panic 
detail.  Please refer to the knowledge article for additional information.

With the extended info:

$ fmdump -Vp -u 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
TIME                       UUID                                 SUNW-MSG-ID
Jan 17 2013 20:08:28.91935 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6 SUNOS-8000-KL

  TIME                 CLASS                                         ENA
  Jan 17 20:08:28.9139 ireport.os.sunos.panic.dump_available         0x
  Jan 17 20:08:07.5900 ireport.os.sunos.panic.dump_pending_on_device 0x

nvlist version: 0
        version = 0x0
        class = list.suspect
        uuid = 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
        code = SUNOS-8000-KL
        diag-time = 1358474908 917149
        de = fmd:///module/software-diagnosis
        fault-list-sz = 0x1
        fault-list = (array of embedded nvlists)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = defect.sunos.kernel.panic
                certainty = 0x64
                asru = sw:///:path=/var/crash/openindiana/.809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
                resource = sw:///:path=/var/crash/openindiana/.809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
                savecore-succcess = 1
                dump-dir = /var/crash/openindiana
                dump-files = vmdump.0
                os-instance-uuid = 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
                panicstr = BAD TRAP: type=e (#pf Page fault) rp=ff003c913840 addr=77 occurred in module smbsrv due to a NULL pointer dereference
                panicstack = unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () | smbsrv:smb_mbc_vdecodef+b3 () | smbsrv:smb_mbc_decodef+98 () | smbsrv:smb_dispatch_request+ca () | smbsrv:smb_session_worker+6c () | genunix:taskq_d_thread+b1 () | unix:thread_start+8 () |
                crashtime = 1358409705
                panic-time = January 17, 2013 02:01:45 AM CST CST
        (end fault-list[0])

        fault-status = 0x1
        severity = Major
        __ttl = 0x1
        __tod = 0x50f8ae9c 0x36cc2af0

And as I am a n00b to OI, I still don't really know what is going on…

Thank you again,

Dave
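
The two epoch values in the fmdump -Vp output above make it possible to tell the panic apart from its diagnosis. Here is a small sketch of that arithmetic, assuming GNU date (for the -d @epoch form) and assuming CST is a fixed UTC-6 offset, written as the POSIX zone string CST6; the function name fmt_cst is only illustrative.

```shell
#!/bin/sh
# Values copied from the fmdump -Vp output above.
crashtime=1358409705   # "crashtime" field: when the kernel panicked
diagtime=1358474908    # first number of "diag-time": when fmd diagnosed it

# Format an epoch as CST wall-clock time.
# Assumes GNU date; CST6 means a fixed offset of UTC-6.
fmt_cst() {
    TZ=CST6 date -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

gap=$((diagtime - crashtime))

echo "panic:     $(fmt_cst "$crashtime") CST"
echo "diagnosed: $(fmt_cst "$diagtime") CST"
echo "gap:       $((gap / 3600))h $((gap % 3600 / 60))m"
```

With these inputs the panic comes out at 2013-01-17 02:01:45 and the diagnosis at 2013-01-17 20:08:28, about 18 hours later, which agrees with the panic-time and EVENT-TIME fields. So the 20:08 timestamp in the plain fmdump listing appears to mark the diagnosis after the reboot (it also lines up with the boot time implied by the uptime output), while the crash itself happened at 02:01.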


Re: [OpenIndiana-discuss] OI Crash

2013-01-19 Thread Jason Matthews

To this end, redirect your console to a serial port and put a serial recorder
on it. They cost maybe $60 but can be handy for catching output from panics.

j. 

Sent from Jasons' hand held



Re: [OpenIndiana-discuss] OI Crash

2013-01-19 Thread Sašo Kiselkov
Your dump device contains a crash dump from a kernel panic that your
machine previously encountered. See
http://wiki.illumos.org/display/illumos/How+To+Report+Problems for a
guide on how to extract useful information from the crash dump and post
it here. In particular, you'll want to run savecore (this copies the
compressed crash dump from your dump device into /var/crash/hostname),
then savecore -vf crashdump_filename to extract it, and then inspect it
using mdb to glean some useful info from it, such as the output of
::panicinfo and ::stack.

--
Saso
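
The steps described above could be scripted roughly as follows. This is a sketch under stated assumptions, not a tested procedure: it assumes the dump directory is /var/crash/<hostname> and the dump number is 0 (matching the dump-files = vmdump.0 line in the fmdump output), and it adds two dcmds, ::status and ::msgbuf, that were not mentioned above. The function names are illustrative, and nothing touches the system until inspect_dump is called as root on the affected machine.

```shell
#!/bin/sh
# Print the mdb dcmds to run, one per line, ready to be piped into mdb.
# ::panicinfo and ::stack come from the advice above; ::status and
# ::msgbuf are additions of this sketch.
panic_dcmds() {
    printf '%s\n' '::status' '::panicinfo' '::stack' '::msgbuf'
}

# Save, expand and inspect crash dump N (default 0).
# Runs in a subshell so the caller's working directory is left alone.
inspect_dump() (
    dir=${1:-/var/crash/$(hostname)}   # assumed savecore directory
    n=${2:-0}                          # assumed dump number
    cd "$dir" || exit 1
    # Step 1: copy the dump off the dump device if it is not saved yet.
    [ -f "vmdump.$n" ] || savecore "$dir" || exit 1
    # Step 2: expand vmdump.N into unix.N and vmcore.N.
    savecore -vf "vmdump.$n" || exit 1
    # Step 3: feed the dcmds to mdb on the expanded dump.
    panic_dcmds | mdb "unix.$n" "vmcore.$n"
)

# On the affected host, as root:
#   inspect_dump /var/crash/openindiana 0 > panic-report.txt
panic_dcmds
```

Redirecting the output to a file, as in the commented call, gives something that can be attached to a bug report.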

On 01/19/2013 11:28 PM, David Scharbach wrote:
 English is good.
 
 $ fmdump -m
 SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
 EVENT-TIME: Thu Jan 17 20:08:28 CST 2013
 PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: 
 openindiana
 SOURCE: software-diagnosis, REV: 0.1
 EVENT-ID: 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
 DESC: The system has rebooted after a kernel panic.  Refer to 
 http://illumos.org/msg/SUNOS-8000-KL for more information.
 AUTO-RESPONSE: The failed system image was dumped to the dump device.  If 
 savecore is enabled (see dumpadm(1M)) a copy of the dump will be written to 
 the savecore directory /var/crash/openindiana.
 IMPACT: There may be some performance impact while the panic is copied to the 
 savecore directory.  Disk space usage by panics can be substantial.
 REC-ACTION: If savecore is not enabled then please take steps to preserve the 
 crash image.
 Use 'fmdump -Vp -u 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6' to view more panic 
 detail.  Please refer to the knowledge article for additional information.
 
 With the extended info:
 
 $ fmdump -Vp -u 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
 TIME   UUID 
 SUNW-MSG-ID
 Jan 17 2013 20:08:28.91935 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6 
 SUNOS-8000-KL
 
   TIME CLASS ENA
   Jan 17 20:08:28.9139 ireport.os.sunos.panic.dump_available 
 0x
   Jan 17 20:08:07.5900 ireport.os.sunos.panic.dump_pending_on_device 
 0x
 
 nvlist version: 0
 version = 0x0
 class = list.suspect
 uuid = 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
 code = SUNOS-8000-KL
 diag-time = 1358474908 917149
 de = fmd:///module/software-diagnosis
 fault-list-sz = 0x1
 fault-list = (array of embedded nvlists)
 (start fault-list[0])
 nvlist version: 0
 version = 0x0
 class = defect.sunos.kernel.panic
 certainty = 0x64
 asru = 
 sw:///:path=/var/crash/openindiana/.809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
 resource = 
 sw:///:path=/var/crash/openindiana/.809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
 savecore-succcess = 1
 dump-dir = /var/crash/openindiana
 dump-files = vmdump.0
 os-instance-uuid = 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
 panicstr = BAD TRAP: type=e (#pf Page fault) 
 rp=ff003c913840 addr=77 occurred in module smbsrv due to a NULL pointer 
 dereference
 panicstack = unix:die+dd () | unix:trap+17db () | 
 unix:cmntrap+e6 () | smbsrv:smb_mbc_vdecodef+b3 () | 
 smbsrv:smb_mbc_decodef+98 () | smbsrv:smb_dispatch_request+ca () | 
 smbsrv:smb_session_worker+6c () | genunix:taskq_d_thread+b1 () | 
 unix:thread_start+8 () | 
 crashtime = 1358409705
 panic-time = January 17, 2013 02:01:45 AM CST CST
 (end fault-list[0])
 
 fault-status = 0x1
 severity = Major
 __ttl = 0x1
 __tod = 0x50f8ae9c 0x36cc2af0
 
 And as I am a n00b to OI, I still don't really know what is going on…
 
 Thanks you again,
 
 Dave
 
 
 On 2013-01-19, at 4:15 PM, David Scharbach david.scharb...@mac.com wrote:
 
 $ fmdump
 TIME UUID SUNW-MSG-ID EVENT
 Jan 17 20:08:28.9193 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6 SUNOS-8000-KL 
 Diagnosed
 $ uptime
 16:12pm  up 1 day 20:04,  2 users,  load average: 0.08, 0.14, 0.21

 Given today is the 19th and such, I think that timestamp on the fmdump is 
 near when the OI server last crashed.  I don't know what the event means.  
 Can you let me know?

 Cheers,

 Dave


 On 2013-01-19, at 12:30 PM, Aurélien Larcher aurelien.larc...@gmail.com 
 wrote:

 Hi,
 Has someone mentioned using 'fmdump' ?

 With this tool I discovered that I had issues with an unreliable disk
 controller on my workstation, with the consequence that OI froze approx.
 every 2 months.
 In my case ZFS gets the fault and goes on standby until the issue is
 resolved, which leaves disk I/O waiting indefinitely to resume.
 Best

 Aurelien


 On Sat, Jan 19, 2013 at 3:19 PM, Reginald Beardsley 
 pulask...@yahoo.com wrote:

 One time when I happened to look, I saw that the Ultra 60 I used at work
 had been up for over 18 months.

 If a sys admin told me 

Re: [OpenIndiana-discuss] OI Crash

2013-01-19 Thread Aurélien Larcher
I cannot tell what would be the next step to diagnose the problem but:

panicstr = BAD TRAP: type=e (#pf Page fault) rp=ff003c913840 addr=77
occurred in module smbsrv due to a NULL pointer dereference
panicstack = unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () |
smbsrv:smb_mbc_vdecodef+b3 () | smbsrv:smb_mbc_decodef+98 () |
smbsrv:smb_dispatch_request+ca () | smbsrv:smb_session_worker+6c () |
genunix:taskq_d_thread+b1 () | unix:thread_start+8 () |

It looks like a good start would be to check whether any bug has been
filed concerning Samba...
Best,

Aurelien



Re: [OpenIndiana-discuss] OI Crash

2013-01-19 Thread Jim Klimov

On 2013-01-19 23:50, Aurélien Larcher wrote:

I cannot tell what would be the next step to diagnose the problem but:

panicstr = BAD TRAP: type=e (#pf Page fault) rp=ff003c913840 addr=77
occurred in module smbsrv due to a NULL pointer dereference
panicstack = unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () |
smbsrv:smb_mbc_vdecodef+b3 () | smbsrv:smb_mbc_decodef+98 () |
smbsrv:smb_dispatch_request+ca () | smbsrv:smb_session_worker+6c () |
genunix:taskq_d_thread+b1 () | unix:thread_start+8 () |

looks like a good start would be to look if there is any bug filed
concerning Samba...


I believe this would not be Samba (a userspace project) but the Solaris
kernel implementation of CIFS (the server, in this case).
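
To make the module point concrete: each frame in panicstack has the form module:function+offset, so the text before the colon names the kernel module. Below is a small sketch that picks out the first frame outside the generic unix trap path. Treating that frame as the interesting one is a heuristic assumption, and faulting_frame is a made-up helper name.

```shell
#!/bin/sh
# Print the first panicstack frame whose module is not "unix".
# $1 = the panicstack string, frames separated by '|'.
faulting_frame() {
    printf '%s\n' "$1" |
        tr '|' '\n' |                      # one frame per line
        sed 's/^ *//; s/ *() *$//' |       # trim spaces and the trailing "()"
        awk -F: 'NF == 2 && $1 != "unix" { print; exit }'
}

# panicstack copied from the fmdump -Vp output in this thread.
stack='unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () | smbsrv:smb_mbc_vdecodef+b3 () | smbsrv:smb_mbc_decodef+98 () | smbsrv:smb_dispatch_request+ca () | smbsrv:smb_session_worker+6c () | genunix:taskq_d_thread+b1 () | unix:thread_start+8 () |'

faulting_frame "$stack"
```

For this stack it prints smbsrv:smb_mbc_vdecodef+b3, a frame in the smbsrv module, which is the same module the panicstr line names.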

Causes might be varied, but if there is integration with a Windows
network (MSAD domain), that is one thing worth researching: check whether
user account mapping (idmap, PAM), the server's Kerberos login to the
domain, naming services and similar pieces log errors of their own.
Not that these SHOULD cause kernel panics, but who knows what the module
can do if fed invalid inputs? ;) Such errors might also warn you
before the crash...

//Jim




Re: [OpenIndiana-discuss] OI Crash

2013-01-18 Thread dormitionsk...@hotmail.com
On Jan 17, 2013, at 8:47 PM, Reginald Beardsley wrote:

 As far as I'm concerned, problems like this are a bottomless abyss.  Which is 
 why I'm still putting up w/ my OI box hanging.  It's annoying, but not 
 critical.  It's also why critical stuff still runs on Solaris 10.
 
 Intermittent failures are the worst time sink there is. There is no assurance 
 that devoting all your time to the problem will fix it even at very high 
 skill levels w/ a full complement of the very best tools.
 
 If you're getting crash dumps there is hope of finding the cause, so that's a 
 big improvement.
 
 Good luck,
 Reg
 
 BTW Back in the 80's there was a VAX operator in Texas who went out to his 
 truck, got a .357 and shot the computer.  His employer was not happy.  But I 
 can certainly understand how the operator felt.


From 1992 to 1998, I used to work at the Denver Museum of Natural 
History -- now the Denver Museum of Nature and Science.  We had two or three 
DEC Vax's and an AIX machine there.  It was their policy that once a week we 
had to power each of the servers all the way down to clear out any memory 
problems -- or whatever -- as preventive maintenance.  

Since then, I've always had the habit of setting up a cron job to reboot my 
servers once a week.  It's not as good as a full power down, but it's better 
than nothing.  And in all these years, I've never had to deal with intermittent 
problems like this, except for a few brief times when I used Red Hat Linux ten 
plus years ago.  (I've tried most of Red Hat's versions since 6.2, and RHEL 6 
is the first version I've found that runs decent enough on our hardware, and 
that I'm happy enough with, for us to use.)

So, if you can do it, you might want to try setting up a cron job to reboot your 
server once a week -- or every night.  I reboot our LTSP thin client server 
every night just because it gets hit with running lots of desktop applications 
that I think give it a greater potential for these kinds of memory problems.  

On the other hand, we have all of our websites hosted on one of our 
parishioner's servers -- and he doesn't reboot his machines periodically like I 
do -- and about every two months, I have to call him up and tell him something 
is wrong.  And he goes and powers down his system -- sometimes he has to even 
unplug it -- and then turn it back on, and everything works again.

I know there are system admins that just love to brag about how great their 
up-times are on their machines -- but this might just save you a lot of time 
and grief.

Of course, if you're running a real high-volume server, this might not be 
workable for you; but it only takes 2-5 minutes or so to reboot... Perhaps in 
the middle of the night you might be able to spare it being down that short 
time?

Just a friendly suggestion.

Shared experience.

I know others may tell you that that's no longer necessary anymore in these 
more modern times; but my experience has been otherwise.

I hope it helps.

+Peter, hieromonk





Re: [OpenIndiana-discuss] OI Crash

2013-01-18 Thread Sašo Kiselkov
On 01/19/2013 01:53 AM, dormitionsk...@hotmail.com wrote:
 From 1992 to 1998, I used to work at the Denver Museum of Natural 
 History -- now the Denver Museum of Nature and Science.  We had two or three 
 DEC Vax's and an AIX machine there.  It was their policy that once a week we 
 had to power each of the servers all the way down to clear out any memory 
 problems -- or whatever -- as preventive maintenance.  
 
 Since then, I've always had the habit of setting up a cron job to reboot my 
 servers once a week.  It's not as good as a full power down, but it's better 
 than nothing.  And in all these years, I've never had to deal with 
 intermittent problems like this, except for a few brief times when I used Red 
 Hat Linux ten plus years ago.  (I've tried most of Red Hat's versions since 
 6.2, and RHEL 6 is the first version I've found that runs decent enough on 
 our hardware, and that I'm happy enough with, for us to use.)

Nice anecdote, but I find this kind of policy very strange. Sure,
regular maintenance downtime windows are important, but doing it to preempt
any problems in the OS seems just strange... not to mention that a
powercycle needlessly stresses the electromechanical components of the
server (HDD motors, fans, etc.)

Also, I don't know about VAX, but boot on a typical SPARC machine can
easily take upwards of 10 minutes (or more, depending on the level of
checks you enabled). Sun E10ks were famous for booting over half an hour
(checking all of their complicated hardware took a lot of time).

 So, if you can do it, you might want to try setting up a cron job to reboot your 
 server once a week -- or every night.  I reboot our LTSP thin client server 
 every night just because it gets hit with running lots of desktop 
 applications that I think give it a greater potential for these kinds of 
 memory problems.  

How about just killing these apps (e.g. forced logout of users) rather
than rebooting the whole machine? Do you suspect memory problems in the
base OS services?

 On the other hand, we have all of our websites hosted on one of our 
 parishioner's servers -- and he doesn't reboot his machines periodically like 
 I do -- and about every two months, I have to call him up and tell him 
 something is wrong.

I suggest switching hosting providers, as your server admin apparently
has next to no idea of what he's doing. I've been running web servers
for years without any trouble. Only the most drastic changes should
warrant a reboot (e.g. kernel update).

  And he goes and powers down his system -- sometimes he has to even
unplug it -- and then turn it back on, and everything works again.

What's up with this Windows 95-era powercycling voodoo? You are
obviously dealing with a serious issue and ignoring it.

 I know there are system admins that just love to brag about how great their 
 up-times are on their machines -- but this might just save you a lot of time 
 and grief.

Frequent rebooting and powercycling might have worked for you, but lots
of applications don't allow for that. Don't mistake an admin's pride of
a job well done for bragging.

 Of course, if you're running a real high-volume server, this might not be 
 workable for you; but it only takes 2-5 minutes or so to reboot... Perhaps in 
 the middle of the night you might be able to spare it being down that short 
 time?

This is just plastering over the problem - I've seen plenty of
solutions of this kind where the restart frequency of a service slowly
had to increase until it was no longer workable. In general, I'd
recommend doing what you say only as the absolute last option.

 Just a friendly suggestion.
 Shared experience.
 
 I know others may tell you that that's no longer necessary anymore in these 
 more modern times; but my experience has been otherwise.
 
 I hope it helps.

When you do encounter these kinds of problems, try and capture a crash
dump, file an Illumos issue and provide as much info on the problem as
possible to help debug it (that's what I recommended to David, he has
yet to respond). Nothing will improve if users keep issues to
themselves. I've been dealing with a serious (show stopper) network load
problem in Illumos a while back and after a little googling, mailing and
testing I managed to resolve it. Sticking one's head in the sand isn't a
good avenue of progress.

Anyway, just my two cents..

Cheers,
--
Saso



Re: [OpenIndiana-discuss] OI Crash

2013-01-18 Thread Doug Hughes


Haven't we passed the days of mystical sysadmin without understanding 
and characterization? Keeping up tradition for tradition's sake without 
understanding the underlying reasons really doesn't do anybody a favor. 
If there are memory leaks, we possess the technology to find them. My 
organization has thousands of machines that run jobs sometimes for 
months at a time. If I had to reboot servers once a week, my users would 
be at the doors with pitchforks. The only time we take downtime is when 
there are reasons to do so, including OS updates, hardware failures, and 
user software run amok. They can run a very long time like this.


Not that memory leaks never happen. Of course they do, but they 
eventually get found and fixed, or the program causing them passes into 
obsolescence. Always.


I encourage discovery rather than superstition, and diagnosis rather 
than repetition.


Be a knight, not a victim!




Re: [OpenIndiana-discuss] OI Crash

2013-01-18 Thread dormitionsk...@hotmail.com
Well, I don't think it's stressing the hardware all that much, when you 
consider our oldest server is 11 1/2 years old, with all its original hardware. 
 Our newest server is somewhere around 7 years old, without a hardware failure 
for at least five years.

I admit I'm not much of a system admin.  I've been forced into that role 
because there's nobody else here to do it. Our hosting provider situation is a 
similarly less than ideal situation, which we're working on.  Bosses kind of 
tend to get in the way of some of these things, too...

I have no idea about SPARC, or any of the real big server environments.  I 
can't even fathom working in an environment with thousands of servers, or why 
they would even need that many.  

And if you have the time and expertise to work through and find the problem so 
it can be resolved, that's obviously better.  But this archaic way of dealing 
with the problem actually works -- if a person can do it.  Like I said, it may 
not be practical for everyone's situation, though.  It's certainly not for big, 
professional admins.  For smaller environments, I believe it can be a 
reasonable option, though.

It's not being superstitious, or a victim.  It's simply trying to take the easy 
way out, and if it takes care of the problem, then you don't have to deal with 
it any more.  Or at least not right now.  If it doesn't, well, then, you have 
to fight your way through it.  

I think setting up periodic reboots is better as a preventive maintenance 
measure, than as a way of addressing a known issue.  But if nothing else, it 
might just buy you some time until you can work on it more at your convenience.

Oh, and I didn't make this reboot procedure up.  From what I understand, it 
used to be fairly common practice.  I figured some of the professionals would 
take exception to it.  But sometimes, older things can still be better than 
new.  

Unless, of course, you like fighting and beating your head against the wall 
trying to figure out why your system hangs, or whatever, instead of having a 
stable network and spending your time on less pressing and / or more mundane 
things... 

[]:-)

Cheers.

fp





Re: [OpenIndiana-discuss] OI Crash

2013-01-17 Thread David Scharbach
I ran memtest86 for 3 passes, everything was ok there.

Computer froze again today after only 1 day of uptime.  I now have a dump file 
but I am confused as to what to do with it.  Sorry to be a n00b but could you 
point me in the right direction?

Cheers,

On 2013-01-15, at 9:10 PM, Ian Collins i...@ianshome.com wrote:

 David Scharbach wrote:
 I have an OI installation that seems to crash about every 20 days.  Locks up 
 completely and needs a hard reset.  Not very much fun.
 
 Question I have is where would I start to look to see why?  I first thought 
 it may be due to scrubbing load on the LSI controller but that is not the 
 case.  It crashed today hours after a scrub was successful at 2AM.
 
 I am a new OI user and would really appreciate a bit of help on this one.  
 Basically looking for a crash dump of some sort and the locations that I 
 have looked at don't really help.
 
 System is
 
 i3 CPU
 32GB Ram
 LSI SAS2008
 Intel SAS expander
 13 SATA 7200RPM drives
 Asus MB
 
 Is there any evidence of a crash in the logs?
 
 Look in /var/adm/messages for any clues and under /var/crash/hostname for 
 any dumps.
 
 I'm not sure if there is a version of Solaris CAT (Crash Analysis Tool) that 
 works with OI.  If there is and you have a dump that's the best place to look.
 
 If there isn't any evidence of a crash, there's a fair chance you have a 
 hardware problem.  I'm guessing that with an i3 motherboard you won't have 
 ECC memory, so running memtest86 for a while would be a good start.
 
 -- 
 Ian.
 
 


Re: [OpenIndiana-discuss] OI Crash

2013-01-17 Thread David Scharbach
I checked and the P8V77-v that I am using seems to be listed, unless the LK 
suffix makes a big difference.

I just disabled my on-board NIC and installed an Intel NIC.  Shall see…

Thanks again,

On 2013-01-15, at 10:44 PM, Mehmet Erol Sanliturk m.e.sanlit...@gmail.com 
wrote:

 On Tue, Jan 15, 2013 at 6:50 PM, David Scharbach 
 david.scharb...@mac.com wrote:
 
 I have an OI installation that seems to crash about every 20 days.  Locks
 up completely and needs a hard reset.  Not very much fun.
 
 Question I have is where would I start to look to see why?  I first
 thought it may be due to scrubbing load on the LSI controller but that is
 not the case.  It crashed today hours after a scrub was successful at 2AM.
 
 I am a new OI user and would really appreciate a bit of help on this one.
 Basically looking for a crash dump of some sort and the locations that I
 have looked at don't really help.
 
 System is
 
 i3 CPU
 32GB Ram
 LSI SAS2008
 Intel SAS expander
 13 SATA 7200RPM drives
 Asus MB
 
 Thank you for any help.
 
 Cheers,
 
 Dave
 
 
 
 If your motherboard is NOT in the following list, then running Unix-like
 operating systems on it is a matter of chance, and crashes are very
 likely:
 
 http://www.asus.com/Static_WebPage/OS_Compatibility/
 http://www.asus.com/websites/global/aboutasus/OS/Linux1211.pdf
 
 Thank you very much .
 
 Mehmet Erol Sanliturk


Re: [OpenIndiana-discuss] OI Crash

2013-01-17 Thread David Scharbach
lol, you make it seem so easy :)

I just disabled the on board NIC.  We will see.  Next I will try the storage 
controller.  Then a hammer.

Cheers,

On 2013-01-16, at 9:01 AM, Edward Ned Harvey (openindiana) 
openindi...@nedharvey.com wrote:

 From: David Scharbach [mailto:david.scharb...@mac.com]
 
 I have an OI installation that seems to crash about every 20 days.  Locks up
 completely and needs a hard reset.  Not very much fun.
 
 Whenever I've seen this type of behavior before, it was hardware/driver 
 related, but we never were able to narrow it down to *which* piece of 
 hardware or driver, by any method other than blindly swapping out hardware.
 
 I'm not talking, necessarily, about failing hardware.  Just some sort of 
 incompatibility bug.  On one system, we greatly reduced the incidence of 
 crashes by disabling the on-board broadcom NIC, and buying the intel server 
 PCIE NIC instead.
 
 Likely candidates are the storage controller, and network adapter.  And 
 everything else in the system.
 
 


Re: [OpenIndiana-discuss] OI Crash

2013-01-17 Thread Reginald Beardsley
As far as I'm concerned, problems like this are a bottomless abyss.  Which is 
why I'm still putting up w/ my OI box hanging.  It's annoying, but not 
critical.  It's also why critical stuff still runs on Solaris 10.

Intermittent failures are the worst time sink there is. There is no assurance 
that devoting all your time to the problem will fix it even at very high skill 
levels w/ a full complement of the very best tools.

If you're getting crash dumps there is hope of finding the cause, so that's a 
big improvement.

Good luck,
Reg

BTW Back in the 80's there was a VAX operator in Texas who went out to his 
truck, got a .357 and shot the computer.  His employer was not happy.  But I 
can certainly understand how the operator felt.



Re: [OpenIndiana-discuss] OI Crash

2013-01-17 Thread Sašo Kiselkov
On 01/18/2013 03:20 AM, David Scharbach wrote:
 I ran memtest86 for 3 passes, everything was ok there.
 
 Computer froze again today after only 1 day of uptime.  I now have a dump 
 file but I am confused as to what to do with it.  Sorry to be a n00b but 
 could you point me in the right direction?

If you have a crash dump, follow
http://wiki.illumos.org/display/illumos/How+To+Report+Problems and send
your crash dump info (the crash.0 file as it is generated in that
guide). That should extract most of the relevant info from the crash
dump and give us a clue as to where exactly your system panic'ed.
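
As a rough illustration of what that extraction can look like: the sketch below expands the compressed dump with savecore and feeds a few standard mdb dcmds to the debugger, writing the result to crash.0. The dcmd list is my assumption of a useful minimum, not a copy of the wiki page, so treat the guide as authoritative. The default path assumes the dump sits in /var/crash/<hostname> as vmdump.0, as in David's fmdump output, and the real run needs root privileges. mdb_report_cmds is an illustrative helper name.

```shell
#!/bin/sh
# Print the mdb dcmds to run against the dump, one per line.
mdb_report_cmds() {
    printf '%s\n' '::status' '::panicinfo' '::stack' '::msgbuf'
}

# Directory holding vmdump.0; defaults to /var/crash/<hostname>.
dump_dir=${1:-/var/crash/$(hostname)}

if command -v mdb >/dev/null 2>&1 && [ -f "$dump_dir/vmdump.0" ]; then
    cd "$dump_dir" || exit 1
    # Expand the compressed dump into unix.0 and vmcore.0 in this directory.
    savecore -vf vmdump.0 "$dump_dir"
    # Run the dcmds non-interactively and keep the output for the bug report.
    mdb_report_cmds | mdb unix.0 vmcore.0 > crash.0
    echo "wrote $dump_dir/crash.0"
else
    # Not on an illumos box, or no dump present: just show what would be run.
    echo "mdb or vmdump.0 not found; dcmds that would be run:"
    mdb_report_cmds
fi
```

The resulting crash.0 is plain text and small enough to attach to a mail or an issue, unlike the dump itself.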

Cheers,
--
Saso



Re: [OpenIndiana-discuss] OI Crash

2013-01-17 Thread dormitionsk...@hotmail.com
 BTW Back in the 80's there was a VAX operator in Texas who went out to his 
 truck, got a .357 and shot the computer.  His employer was not happy.  But I 
 can certainly understand how the operator felt.


Ah.  That's too bad!  I used to love VAX!

A shotgun would have done a much better job, too.

[]:-)





Re: [OpenIndiana-discuss] OI Crash

2013-01-16 Thread James Carlson
On 01/15/13 23:02, Rich wrote:
 mkdir -p /var/crash/$(hostname)
 pfexec dumpadm -y
 
 And ideally, put set dump_plat_mincpu=0 in /etc/system, lest the
 core dump code try to thread and fail miserably.
 
 Next time you die, you should get a core dump in
 /var/crash/[hostname]/, presuming your dump device has enough space.

Good advice.  If you're seeing hangs, I also suggest this in /etc/system:

set snooping = 1

That causes the scheduler to panic if it fails to make progress.  It's
probably not something you want to have for the long term, but to help
identify the cause of a hard hang, it can be useful.
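
For anyone who prefers to script the two /etc/system settings mentioned in this thread, here is a cautious sketch. It appends a line only when that exact line is missing, and it works on a scratch file rather than the real /etc/system; the scratch path and the add_setting helper are hypothetical names for illustration. On a real system you would back up /etc/system first, run as root, and remember that the file is only read at boot.

```shell
#!/bin/sh
# Append a line to a settings file only if that exact line is not already present.
# $1 = file, $2 = line to add.
add_setting() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Scratch file standing in for /etc/system (hypothetical path, removed at the end).
scratch=${TMPDIR:-/tmp}/system.sketch.$$

add_setting "$scratch" 'set dump_plat_mincpu=0'
add_setting "$scratch" 'set snooping = 1'
add_setting "$scratch" 'set snooping = 1'    # repeated call changes nothing

cat "$scratch"
rm -f "$scratch"
```

Once the hang has been tracked down, the snooping line can simply be deleted again, as suggested above.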

-- 
James Carlson 42.703N 71.076W carls...@workingcode.com



Re: [OpenIndiana-discuss] OI Crash

2013-01-16 Thread Edward Ned Harvey (openindiana)
 From: David Scharbach [mailto:david.scharb...@mac.com]
 
 I have an OI installation that seems to crash about every 20 days.  Locks up
 completely and needs a hard reset.  Not very much fun.

Whenever I've seen this type of behavior before, it was hardware/driver 
related, but we never were able to narrow it down to *which* piece of hardware 
or driver, by any method other than blindly swapping out hardware.

I'm not talking, necessarily, about failing hardware.  Just some sort of 
incompatibility bug.  On one system, we greatly reduced the incidence of 
crashes by disabling the on-board broadcom NIC, and buying the intel server 
PCIE NIC instead.

Likely candidates are the storage controller, and network adapter.  And 
everything else in the system.




Re: [OpenIndiana-discuss] OI Crash

2013-01-16 Thread Daniel Kjar
I have had this issue and it turned out to be a power supply with too 
little power for the 6 hard drives stuffed into my Ultra 20.  Removed 2 
drives and everything was fine.  The drives themselves were perfectly fine.  
The other time I have run into this is when I would lose a required NFS 
mount (like a home drive).


Good luck.

The time I had a reset every 14 days turned out to be a problem with the 
server room's UPS sending me a 'shutdown' message every two weeks.  
Switched UPS and that disappeared.





--
Dr. Daniel Kjar
Assistant Professor of Biology
Division of Mathematics and Natural Sciences
Elmira College
1 Park Place
Elmira, NY 14901
607-735-1826
http://faculty.elmira.edu/dkjar

...humans send their young men to war; ants send their old ladies
-E. O. Wilson





Re: [OpenIndiana-discuss] OI Crash

2013-01-15 Thread Ian Collins

David Scharbach wrote:

I have an OI installation that seems to crash about every 20 days.  Locks up 
completely and needs a hard reset.  Not very much fun.

Question I have is where would I start to look to see why?  I first thought it 
may be due to scrubbing load on the LSI controller but that is not the case.  
It crashed today hours after a scrub was successful at 2AM.

I am a new OI user and would really appreciate a bit of help on this one.  
Basically looking for a crash dump of some sort and the locations that I have 
looked at don't really help.

System is

i3 CPU
32GB Ram
LSI SAS2008
Intel SAS expander
13 SATA 7200RPM drives
Asus MB


Is there any evidence of a crash in the logs?

Look in /var/adm/messages for any clues and under /var/crash/hostname 
for any dumps.
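
A minimal sketch of that check in shell.  The function name
find_crash_evidence and the search words "panic" and "reboot" are
assumptions for illustration, not an official list of what the kernel
logs:

```shell
#!/bin/sh
# find_crash_evidence LOGFILE CRASHDIR
# Print log lines that mention a panic or reboot, then list any saved dumps.
find_crash_evidence() {
    logfile=$1
    crashdir=$2

    echo "== panic/reboot lines in $logfile =="
    # -i: ignore case; -E: extended regex, needed for the alternation.
    grep -iE 'panic|reboot' "$logfile" || echo "(none found)"

    echo "== contents of $crashdir =="
    if [ -d "$crashdir" ]; then
        ls -l "$crashdir"
    else
        echo "(directory does not exist)"
    fi
}

# Typical call on the affected host, using the paths mentioned above:
#   find_crash_evidence /var/adm/messages "/var/crash/$(hostname)"
```

Older entries may have been rotated into /var/adm/messages.0 and so on,
so the same call can be repeated on those files.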


I'm not sure if there is a version of Solaris CAT (Crash Analysis Tool) 
that works with OI.  If there is and you have a dump that's the best 
place to look.


If there isn't any evidence of a crash, there's a fair chance you have a 
hardware problem.  I'm guessing that with an i3 motherboard you won't 
have ECC memory, so running memtest86 for a while would be a good start.


--
Ian.




Re: [OpenIndiana-discuss] OI Crash

2013-01-15 Thread David Scharbach
I will make a memtest ISO ASAP.  /var/adm/messages shows nothing.  /var/crash 
does not exist on my system.

Will see what memtest says.

Cheers,

Dave



Re: [OpenIndiana-discuss] OI Crash

2013-01-15 Thread Jason Matthews


On Jan 15, 2013, at 7:10 PM, Ian Collins i...@ianshome.com wrote:
 
 If there isn't any evidence of a crash, there's a fair chance you have a 
 hardware problem.

i have a decent number of identical production boxes. 

about once per quarter one of them spontaneously reboots, leaving no trace as to 
why. it is never the same box twice. 

the first couple of times i offlined the systems and ran diagnostics on them. i 
ran memtest for two weeks. i checked the SEL, etc., and found bupkis. 

i came to the conclusion it is just something that happens. it is likely a 
driver issue. since my architecture can absorb such failures i haven't spent 
a lot of time on it. 

i am still on 151a, so perhaps there is a fix that i just don't have yet. 
whatever my reboot problem is, it is not a hardware problem. 


j. 


Re: [OpenIndiana-discuss] OI Crash

2013-01-15 Thread Rich
mkdir -p /var/crash/$(hostname)
pfexec dumpadm -y

And ideally, put "set dump_plat_mincpu=0" in /etc/system, lest the
core dump code try to thread and fail miserably.

Next time you die, you should get a core dump in
/var/crash/[hostname]/, presuming your dump device has enough space.
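
Once a dump has been saved, something along these lines could be used to
get a first look at it with mdb.  This is only a sketch: summarize_dump
is a made-up helper name, it assumes savecore has written dump number N
into the given directory, and it will probably need pfexec if a
compressed vmdump.N still has to be expanded.

```shell
#!/bin/sh
# summarize_dump DIR N: print a short summary of crash dump number N in DIR.
summarize_dump() {
    dir=$1
    n=$2

    # savecore normally leaves a compressed vmdump.N; expand it into
    # unix.N and vmcore.N first if that has not been done yet.
    if [ -f "$dir/vmdump.$n" ] && [ ! -f "$dir/vmcore.$n" ]; then
        savecore -vf "$dir/vmdump.$n" "$dir" || return 1
    fi

    if [ ! -f "$dir/unix.$n" ] || [ ! -f "$dir/vmcore.$n" ]; then
        echo "no dump number $n found in $dir" >&2
        return 1
    fi

    # ::status shows the panic message, ::stack the panicking thread's
    # stack, ::msgbuf the kernel messages leading up to the panic.
    printf '::status\n::stack\n::msgbuf\n' | mdb "$dir/unix.$n" "$dir/vmcore.$n"
}

# Example call:
#   summarize_dump "/var/crash/$(hostname)" 0
```

Running dumpadm with no arguments prints the current dump device and
savecore directory, which is a quick way to confirm the settings above
took effect.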

- Rich



Re: [OpenIndiana-discuss] OI Crash

2013-01-15 Thread Reginald Beardsley
FYI I had to force a hard reboot via the power switch today (i.e. no shutdown or 
sync) :-(  The system hangs and will not take any input via the X server 
keyboard and mouse.  In this case it would not even do a clean reboot via the 
power switch monitor daemon.  The only option was to force it down.

I'm running OI 151.  I think a5, but that's from memory rather than from 
something reliable like uname(1).  From what I know at present this is an 
interrupt conflict between the keyboard/mouse drivers and the Nvidia graphics 
driver.

This is apparently an outstanding bug and as far as I can see not easily fixed.  
I did not have this problem w/ 148 and I would revert to that except I can't 
remember the root password :-(

This may not be related to your problem, but the system hanging really isn't a 
crash.  It's actually much worse.  

I still have scars from a MicroVAX that hung about every 60 days for 18 months. 
Because it never actually crashed it was very hard to get any help despite top 
grade support.  Eventually we discovered it was a bad thermal sensor shutting 
down one side of the split 15 V supply.  But that was only because it did it 
one day when DEC support was there and we had the skins off the machine and 
could see the LED status on the power supply.  

After over a year of this I had them living there trying to fix it.  We'd 
replaced almost everything in the machine except the backplane and cabinet.  
We'd already replaced the PS, so when the fault showed on the LED we knew it 
wasn't the PS which left the thermal sensor in the top of the box.  I never 
understood why that only shut down one side of the supply, but it did a great 
job of locking up the CPU.

If you don't get a crash dump, it is *really* hard to resolve the cause and fix 
it.  There's nothing to get a hold of.

Good luck and please keep us posted.

Reg



Re: [OpenIndiana-discuss] OI Crash

2013-01-15 Thread Mehmet Erol Sanliturk
On Tue, Jan 15, 2013 at 6:50 PM, David Scharbach david.scharb...@mac.com wrote:

 I have an OI installation that seems to crash about every 20 days.  Locks
 up completely and needs a hard reset.  Not very much fun.

 Question I have is where would I start to look to see why?  I first
 thought it may be due to scrubbing load on the LSI controller but that is
 not the case.  It crashed today hours after a scrub was successful at 2AM.

 I am a new OI user and would really appreciate a bit of help on this one.
  Basically looking for a crash dump of some sort and the locations that I
 have looked at don't really help.

 System is

 i3 CPU
 32GB Ram
 LSI SAS2008
 Intel SAS expander
 13 SATA 7200RPM drives
 Asus MB

 Thank you for any help.

 Cheers,

 Dave



If your motherboard is NOT in the following lists, then running a Unix-like
operating system on it is a matter of luck, and crashes are quite likely:

http://www.asus.com/Static_WebPage/OS_Compatibility/
http://www.asus.com/websites/global/aboutasus/OS/Linux1211.pdf

Thank you very much.

Mehmet Erol Sanliturk


Re: [OpenIndiana-discuss] OI Crash

2013-01-15 Thread Jim Klimov

On 2013-01-16 05:04, Reginald Beardsley wrote:

 This is apparently an outstanding bug and as far as I can see not easily fixed.  
 I did not have this problem w/ 148 and I would revert to that except I can't 
 remember the root password :-(


Can't you beadm mount oi_148 (insert proper BE name) and fix up
the /etc/shadow file inside there?  That is, if you know your current
root password, just copy-paste the cyphertext (the password hash) from
your running BE's /etc/shadow over the one in the other BE.

When done, don't forget to beadm umount before you beadm activate.

HTH, //Jim
