Re: [OpenIndiana-discuss] OI Crash
I agree that this looks like a driver problem. My OI box has two 1G Intel Ethernet ports bonded, and it crashes at random times. There are also two 10G ports connected and working fine.

Symptom: OI crashes when there is a lot of traffic on the bond (5 - 40 minutes after heavy traffic starts):
- Nightly rsync backups from other servers (when I throttle the bandwidth, it works OK)
- Heavy FTP traffic from PCs/servers
- Heavy SMB traffic from Windows users

Everything freezes, even the keyboard. Since it is a double server, the only way to recover is to reboot via the IPMI web page, which is shared with (one of?) the same Ethernet ports...

The other server in the box runs Proxmox (Debian), with no problems at all on its Ethernet bond. Very heavy traffic between the two servers via the internal Intel 10G ports works fine! When I had Nexenta Core (before OI), everything was OK too.

Now I am thinking of switching to Nas4free (it imports the pool OK).

___
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
On 01/24/2013 06:38 PM, Dimitri Alexandris wrote:
> I will agree with the driver problem. My OI has 2 1G Intel ethernet bonded, and crashes at random times.
> [...]

Can you provide a crash dump or at least a stack trace of what was going on when the system crashed?

Cheers,
--
Saso
Re: [OpenIndiana-discuss] OI Crash
I have a topic posted at illumos.org. Lame title for bug #3489.

Cheers,
Dave

On 2013-01-20, at 4:31 AM, Albert Lee albert@nexenta.com wrote:
> Hi Dave,
>
> Please try to copy this and any other information you can obtain, as explained by others, into a bug report on illumos.org. Some of us are very interested in any problems with the CIFS service (which has crashed here).
>
> Thanks,
> -Albert
>
> On Sat, Jan 19, 2013 at 5:28 PM, David Scharbach david.scharb...@mac.com wrote:
>> English is good.
>> $ fmdump -m
>> [...]
Re: [OpenIndiana-discuss] OI Crash
One time when I happened to look, I saw that the Ultra 60 I used at work had been up for over 18 months. If a sys admin told me he wanted to reboot a system once a week just in case, he'd be looking for a new job very soon or else be sent back to the PC support pool.

BTW, the reason that 11/780-era admins did not want to shut machines down was primarily the problems posed by hundreds, if not thousands, of mechanical connectors, some of which would lose contact if allowed to cool. The cure was simple but tedious: you went around reseating circuit boards and cabling and powered up again. There are a lot of boards and cables in a well-populated 11/780, especially if it's got an FPS-120B, a Gould-DeAnza graphics processor and a Versatec plotter attached along with the usual disk and tape drives.

One summer weekend in Dallas, my group moved across town, so our workstations spent the day in a moving van, probably at 130+ F. Monday morning several would not boot until I went around and reseated the disk drive cables.

Voodoo has no place in computing.

Have Fun!
Reg
Re: [OpenIndiana-discuss] OI Crash
Hi,

Has someone mentioned using 'fmdump'? With this tool I discovered that I had issues with an unreliable disk controller on my workstation, with the consequence that OI froze approximately every 2 months. In my case ZFS receives the fault and stands by until the issue is resolved, which results in an indefinite wait for disk I/O to resume.

Best,
Aurelien

On Sat, Jan 19, 2013 at 3:19 PM, Reginald Beardsley pulask...@yahoo.com wrote:
> One time when I happened to look, I saw that the Ultra 60 I used at work had been up for over 18 months.
> [...]

--
---
LARCHER Aurélien | KTH, School of Computer Science and Communication
Work: +46 (0) 8 790 71 42 | Lindstedtsvägen 5, Plan 5
Mob.: +46 (0) 7 09 46 40 17 | 100 44 Stockholm, SWEDEN
---
Praise the Caffeine embeddings ...
Re: [OpenIndiana-discuss] OI Crash
Having a console window open and checking it periodically can be very helpful. Such events will get logged to the console. I recently had a correctable event show up in mine. There's probably a way to have the events trigger an email if desired.

Have Fun!
Reg

--- On Sat, 1/19/13, Aurélien Larcher aurelien.larc...@gmail.com wrote:
> Hi, Has someone mentioned using 'fmdump' ?
> [...]
Re: [OpenIndiana-discuss] OI Crash
On 2013-01-19 20:04, Reginald Beardsley wrote:
> Having a console window open and checking it periodically can be very helpful. Such events will get logged to the console. I recently had a correctable event show up in mine. There's probably a way to have the events trigger an email if desired.

http://docs.oracle.com/cd/E19963-01/html/821-1462/fmd-1m.html

Notification Services:
- syslog (package service/fault-management)
- Email (package service/fault-management/smtp-notify)
- SNMP (package service/fault-management/snmp-notify)

These are all present in OI as well. They should also help monitor SMF service state transitions (i.e. failures):
http://www.c0t0d0s0.org/archives/7051-New-Solaris-features-How-to-monitor-SMF-services-via-mail.html
https://blogs.oracle.com/gavinm/entry/notifications_for_smf_instance_state

HTH,
//Jim
Re: [OpenIndiana-discuss] OI Crash
$ fmdump
TIME                 UUID                                 SUNW-MSG-ID   EVENT
Jan 17 20:08:28.9193 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6 SUNOS-8000-KL Diagnosed

$ uptime
16:12pm up 1 day 20:04, 2 users, load average: 0.08, 0.14, 0.21

Given today is the 19th and such, I think that timestamp on the fmdump is near when the OI server last crashed. I don't know what the event means. Can you let me know?

Cheers,
Dave

On 2013-01-19, at 12:30 PM, Aurélien Larcher aurelien.larc...@gmail.com wrote:
> Hi, Has someone mentioned using 'fmdump' ?
> [...]
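The fmdump and uptime outputs in this message can be cross-checked with simple date arithmetic. What follows is only a minimal sketch in Python: the year 2013 is assumed from the message date, and the variable names are made up for illustration.

```python
from datetime import datetime, timedelta

# Values copied from the `uptime` output in this message. The year is an
# assumption taken from the message date (2013-01-19).
now = datetime(2013, 1, 19, 16, 12)          # "16:12pm" on the 19th
up = timedelta(days=1, hours=20, minutes=4)  # "up 1 day 20:04"

# Approximate boot time (uptime only has minute resolution).
boot_time = now - up

# Timestamp of the SUNOS-8000-KL "Diagnosed" event from `fmdump`,
# with the fractional seconds dropped.
diagnosed = datetime(2013, 1, 17, 20, 8, 28)

print("booted at about:", boot_time)
print("event diagnosed:", diagnosed)
print("difference:     ", abs(diagnosed - boot_time))
```

The computed boot time falls within a minute of the fmdump timestamp, which suggests the event was diagnosed as the system came back up after the panic, not necessarily at the moment of the crash itself.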
Re: [OpenIndiana-discuss] OI Crash
If you use the -m flag to get the details, what does it say?

On Sat, Jan 19, 2013 at 11:15 PM, David Scharbach david.scharb...@mac.com wrote:
> $ fmdump
> TIME UUID SUNW-MSG-ID EVENT
> Jan 17 20:08:28.9193 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6 SUNOS-8000-KL Diagnosed
> [...]

--
LARCHER Aurélien | KTH, School of Computer Science and Communication
Re: [OpenIndiana-discuss] OI Crash
English is good.

$ fmdump -m
SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Jan 17 20:08:28 CST 2013
PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: openindiana
SOURCE: software-diagnosis, REV: 0.1
EVENT-ID: 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
DESC: The system has rebooted after a kernel panic. Refer to http://illumos.org/msg/SUNOS-8000-KL for more information.
AUTO-RESPONSE: The failed system image was dumped to the dump device. If savecore is enabled (see dumpadm(1M)) a copy of the dump will be written to the savecore directory /var/crash/openindiana.
IMPACT: There may be some performance impact while the panic is copied to the savecore directory. Disk space usage by panics can be substantial.
REC-ACTION: If savecore is not enabled then please take steps to preserve the crash image. Use 'fmdump -Vp -u 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6' to view more panic detail. Please refer to the knowledge article for additional information.

With the extended info:

$ fmdump -Vp -u 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
TIME UUID SUNW-MSG-ID
Jan 17 2013 20:08:28.91935 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6 SUNOS-8000-KL

  TIME CLASS ENA
  Jan 17 20:08:28.9139 ireport.os.sunos.panic.dump_available 0x
  Jan 17 20:08:07.5900 ireport.os.sunos.panic.dump_pending_on_device 0x

nvlist version: 0
  version = 0x0
  class = list.suspect
  uuid = 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
  code = SUNOS-8000-KL
  diag-time = 1358474908 917149
  de = fmd:///module/software-diagnosis
  fault-list-sz = 0x1
  fault-list = (array of embedded nvlists)
  (start fault-list[0])
  nvlist version: 0
    version = 0x0
    class = defect.sunos.kernel.panic
    certainty = 0x64
    asru = sw:///:path=/var/crash/openindiana/.809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
    resource = sw:///:path=/var/crash/openindiana/.809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
    savecore-succcess = 1
    dump-dir = /var/crash/openindiana
    dump-files = vmdump.0
    os-instance-uuid = 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
    panicstr = BAD TRAP: type=e (#pf Page fault) rp=ff003c913840 addr=77 occurred in module smbsrv due to a NULL pointer dereference
    panicstack = unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () | smbsrv:smb_mbc_vdecodef+b3 () | smbsrv:smb_mbc_decodef+98 () | smbsrv:smb_dispatch_request+ca () | smbsrv:smb_session_worker+6c () | genunix:taskq_d_thread+b1 () | unix:thread_start+8 () |
    crashtime = 1358409705
    panic-time = January 17, 2013 02:01:45 AM CST CST
  (end fault-list[0])
  fault-status = 0x1
  severity = Major
  __ttl = 0x1
  __tod = 0x50f8ae9c 0x36cc2af0

And as I am a n00b to OI, I still don't really know what is going on…

Thank you again,
Dave

On 2013-01-19, at 4:15 PM, David Scharbach david.scharb...@mac.com wrote:
> $ fmdump
> TIME UUID SUNW-MSG-ID EVENT
> Jan 17 20:08:28.9193 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6 SUNOS-8000-KL Diagnosed
> [...]
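The crashtime and diag-time fields in the fmdump -Vp output are epoch seconds, so they can be converted to see when the panic happened relative to the diagnosis. This is a minimal sketch in Python, assuming a fixed UTC-6 offset for CST; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-6 offset for CST. This is an assumption of the sketch; the
# report itself prints its times in CST.
CST = timezone(timedelta(hours=-6), "CST")

crashtime = 1358409705  # "crashtime" field from fmdump -Vp
diag_time = 1358474908  # seconds part of the "diag-time" field

crashed = datetime.fromtimestamp(crashtime, tz=CST)
diagnosed = datetime.fromtimestamp(diag_time, tz=CST)

print("panic:    ", crashed.strftime("%Y-%m-%d %H:%M:%S %Z"))
print("diagnosed:", diagnosed.strftime("%Y-%m-%d %H:%M:%S %Z"))
print("elapsed:  ", diagnosed - crashed)
```

With these inputs the panic lands at 02:01:45 CST on January 17, matching the panic-time field, while the diagnosis at 20:08:28 CST comes roughly 18 hours later. So the timestamp shown by plain fmdump is the time of diagnosis, and the crash itself happened earlier that day.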
Re: [OpenIndiana-discuss] OI Crash
To this end, redirect your console to a serial port and put a serial recorder on it. They cost maybe $60 but can be handy for catching output from panics.

j.

Sent from Jasons' hand held

On Jan 19, 2013, at 11:04 AM, Reginald Beardsley pulask...@yahoo.com wrote:
> Having a console window open and checking it periodically can be very helpful. Such events will get logged to the console.
> [...]
Re: [OpenIndiana-discuss] OI Crash
Your dump device contains a crash dump from a kernel panic that your machine previously encountered. See http://wiki.illumos.org/display/illumos/How+To+Report+Problems for a guide on how to extract useful information from the crash dump and post it here.

In particular, you'll want to:
- run savecore (this copies the compressed crash dump from your dump device into /var/crash/hostname),
- run savecore -vf crashdump_filename to extract it, and
- inspect it using mdb to glean some useful info from it, such as ::panicinfo and ::stack.

--
Saso

On 01/19/2013 11:28 PM, David Scharbach wrote:
> English is good.
> $ fmdump -m
> [...]
Re: [OpenIndiana-discuss] OI Crash
I cannot tell what would be the next step to diagnose the problem but: panicstr = BAD TRAP: type=e (#pf Page fault) rp=ff003c913840 addr=77 occurred in module smbsrv due to a NULL pointer dereference panicstack = unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () | smbsrv:smb_mbc_vdecodef+b3 () | smbsrv:smb_mbc_decodef+98 () | smbsrv:smb_dispatch_request+ca () | smbsrv:smb_session_worker+6c () | genunix:taskq_d_thread+b1 () | unix:thread_start+8 () | looks like a good start would be to look if there is any bug filed concerning Samba... Best, Aurelien On Sat, Jan 19, 2013 at 11:28 PM, David Scharbach david.scharb...@mac.comwrote: English is good. $ fmdump -m SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major EVENT-TIME: Thu Jan 17 20:08:28 CST 2013 PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: openindiana SOURCE: software-diagnosis, REV: 0.1 EVENT-ID: 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6 DESC: The system has rebooted after a kernel panic. Refer to http://illumos.org/msg/SUNOS-8000-KL for more information. AUTO-RESPONSE: The failed system image was dumped to the dump device. If savecore is enabled (see dumpadm(1M)) a copy of the dump will be written to the savecore directory /var/crash/openindiana. IMPACT: There may be some performance impact while the panic is copied to the savecore directory. Disk space usage by panics can be substantial. REC-ACTION: If savecore is not enabled then please take steps to preserve the crash image. Use 'fmdump -Vp -u 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6' to view more panic detail. Please refer to the knowledge article for additional information. 
With the extended info:

$ fmdump -Vp -u 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
TIME UUID SUNW-MSG-ID
Jan 17 2013 20:08:28.91935 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6 SUNOS-8000-KL

  TIME CLASS ENA
  Jan 17 20:08:28.9139 ireport.os.sunos.panic.dump_available 0x
  Jan 17 20:08:07.5900 ireport.os.sunos.panic.dump_pending_on_device 0x

nvlist version: 0
  version = 0x0
  class = list.suspect
  uuid = 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
  code = SUNOS-8000-KL
  diag-time = 1358474908 917149
  de = fmd:///module/software-diagnosis
  fault-list-sz = 0x1
  fault-list = (array of embedded nvlists)
  (start fault-list[0])
  nvlist version: 0
    version = 0x0
    class = defect.sunos.kernel.panic
    certainty = 0x64
    asru = sw:///:path=/var/crash/openindiana/.809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
    resource = sw:///:path=/var/crash/openindiana/.809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
    savecore-succcess = 1
    dump-dir = /var/crash/openindiana
    dump-files = vmdump.0
    os-instance-uuid = 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
    panicstr = BAD TRAP: type=e (#pf Page fault) rp=ff003c913840 addr=77 occurred in module smbsrv due to a NULL pointer dereference
    panicstack = unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () | smbsrv:smb_mbc_vdecodef+b3 () | smbsrv:smb_mbc_decodef+98 () | smbsrv:smb_dispatch_request+ca () | smbsrv:smb_session_worker+6c () | genunix:taskq_d_thread+b1 () | unix:thread_start+8 () |
    crashtime = 1358409705
    panic-time = January 17, 2013 02:01:45 AM CST CST
  (end fault-list[0])
  fault-status = 0x1
  severity = Major
  __ttl = 0x1
  __tod = 0x50f8ae9c 0x36cc2af0

And as I am a n00b to OI, I still don't really know what is going on…

Thank you again, Dave

On 2013-01-19, at 4:15 PM, David Scharbach david.scharb...@mac.com wrote:

$ fmdump
TIME UUID SUNW-MSG-ID EVENT
Jan 17 20:08:28.9193 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6 SUNOS-8000-KL Diagnosed

$ uptime
16:12pm up 1 day 20:04, 2 users, load average: 0.08, 0.14, 0.21

Given today is the 19th and such, I think that timestamp on the fmdump is near when the OI server last
crashed. I don't know what the event means. Can you let me know? Cheers, Dave On 2013-01-19, at 12:30 PM, Aurélien Larcher aurelien.larc...@gmail.com wrote: Hi, Has someone mentioned using 'fmdump' ? With this tool I discovered that I had issues with an unreliable disk controller on my workstation with the consequence of OI freezing approx. every 2months. In my case ZFS is getting the fault and standby until resolution of the issue, thus yielding an indefinite wait for disk I/O to resume. Best Aurelien On Sat, Jan 19, 2013 at 3:19 PM, Reginald Beardsley pulask...@yahoo.comwrote: One time when I happened to look, I saw that the Ultra 60 I used at work had been up for over 18
Re: [OpenIndiana-discuss] OI Crash
On 2013-01-19 23:50, Aurélien Larcher wrote:

I cannot tell what would be the next step to diagnose the problem but:

panicstr = BAD TRAP: type=e (#pf Page fault) rp=ff003c913840 addr=77 occurred in module smbsrv due to a NULL pointer dereference
panicstack = unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () | smbsrv:smb_mbc_vdecodef+b3 () | smbsrv:smb_mbc_decodef+98 () | smbsrv:smb_dispatch_request+ca () | smbsrv:smb_session_worker+6c () | genunix:taskq_d_thread+b1 () | unix:thread_start+8 () |

looks like a good start would be to look if there is any bug filed concerning Samba...

I believe this would not be Samba (a userspace project) but the Solaris kernel implementation of CIFS, the server side in this case. Causes might be varied, but if there is integration with the Windows network (MSAD domain), one thing worth researching is whether the user account mapping (mapid, PAM), the kerberos login of the server to the domain, naming services and such pieces log errors of their own... Not that these SHOULD cause kernel panics, but who knows what the module can do if fed invalid inputs? ;) - and these are things you might be warned about before the crash...

//Jim

___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
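For readers following along, the panicstack string quoted above is easier to read one frame per line. The following is a minimal sketch, not something posted in the thread: the function names split_panicstack and faulting_module are made up for illustration, and it assumes the "module:function+offset () |" layout shown in the fmdump output above and a POSIX tr, sed and awk.

```shell
# Print one "module:function+offset" frame per line from a panicstack string.
split_panicstack() {
    printf '%s\n' "$1" | tr '|' '\n' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*()[[:space:]]*$//' -e '/^$/d'
}

# Print the module of the first frame that is not in "unix"
# (the frames at the top are the generic trap/panic path).
faulting_module() {
    split_panicstack "$1" | awk -F: '$1 != "unix" { print $1; exit }'
}

# The stack string as reported by fmdump -Vp in this thread.
stack='unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () | smbsrv:smb_mbc_vdecodef+b3 () | smbsrv:smb_mbc_decodef+98 () | smbsrv:smb_dispatch_request+ca () | smbsrv:smb_session_worker+6c () | genunix:taskq_d_thread+b1 () | unix:thread_start+8 () |'

split_panicstack "$stack"
faulting_module "$stack"
```

For this stack the last command prints smbsrv, which agrees with the module named in panicstr. The function name in that first smbsrv frame (smb_mbc_vdecodef) is probably the most useful search term when looking for an existing bug report.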
Re: [OpenIndiana-discuss] OI Crash
On Jan 17, 2013, at 8:47 PM, Reginald Beardsley wrote:

As far as I'm concerned, problems like this are a bottomless abyss. Which is why I'm still putting up w/ my OI box hanging. It's annoying, but not critical. It's also why critical stuff still runs on Solaris 10. Intermittent failures are the worst time sink there is. There is no assurance that devoting all your time to the problem will fix it even at very high skill levels w/ a full complement of the very best tools. If you're getting crash dumps there is hope of finding the cause, so that's a big improvement. Good luck, Reg

BTW Back in the 80's there was a VAX operator in Texas who went out to his truck, got a .357 and shot the computer. His employer was not happy. But I can certainly understand how the operator felt.

From 1992 to 1998, I used to work at the Denver Museum of Natural History -- now the Denver Museum of Nature and Science. We had two or three DEC Vax's and an AIX machine there. It was their policy that once a week we had to power each of the servers all the way down to clear out any memory problems -- or whatever -- as preventive maintenance.

Since then, I've always had the habit of setting up a cron job to reboot my servers once a week. It's not as good as a full power down, but it's better than nothing. And in all these years, I've never had to deal with intermittent problems like this, except for a few brief times when I used Red Hat Linux ten plus years ago. (I've tried most of Red Hat's versions since 6.2, and RHEL 6 is the first version I've found that runs decent enough on our hardware, and that I'm happy enough with, for us to use.)

So, if you can do it, you might want to try setting up a cron job to reboot your server once a week -- or every night. I reboot our LTSP thin client server every night just because it gets hit with running lots of desktop applications that I think give it a greater potential for these kinds of memory problems.
On the other hand, we have all of our websites hosted on one of our parishioner's servers -- and he doesn't reboot his machines periodically like I do -- and about every two months, I have to call him up and tell him something is wrong. And he goes and powers down his system -- sometimes he has to even unplug it -- and then turn it back on, and everything works again. I know there are system admins that just love to brag about how great their up-times are on their machines -- but this might just save you a lot of time and grief. Of course, if you're running a real high-volume server, this might not be workable for you; but it only takes 2-5 minutes or so to reboot... Perhaps in the middle of the night you might be able to spare it being down that short time? Just a friendly suggestion. Shared experience. I know others may tell you that that's no longer necessary anymore in these more modern times; but my experience has been otherwise. I hope it helps. +Peter, hieromonk ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
On 01/19/2013 01:53 AM, dormitionsk...@hotmail.com wrote:

From 1992 to 1998, I used to work at the Denver Museum of Natural History -- now the Denver Museum of Nature and Science. We had two or three DEC Vax's and an AIX machine there. It was their policy that once a week we had to power each of the servers all the way down to clear out any memory problems -- or whatever -- as preventive maintenance. Since then, I've always had the habit of setting up a cron job to reboot my servers once a week. It's not as good as a full power down, but it's better than nothing. And in all these years, I've never had to deal with intermittent problems like this, except for a few brief times when I used Red Hat Linux ten plus years ago. (I've tried most of Red Hat's versions since 6.2, and RHEL 6 is the first version I've found that runs decent enough on our hardware, and that I'm happy enough with, for us to use.)

Nice anecdote, but I find this kind of policy very strange. Sure, regular maintenance downtime windows are important, but doing it to preempt any problems in the OS seems just strange... not to mention that a powercycle needlessly stresses the electromechanical components of the server (HDD motors, fans, etc.) Also, I don't know about VAX, but boot on a typical SPARC machine can easily take upwards of 10 minutes (or more, depending on the level of checks you enabled). Sun E10ks were famous for taking over half an hour to boot (checking all of their complicated hardware took a lot of time).

So, if you can do it, you might want to try setting up a cron job to reboot your server once a week -- or every night. I reboot our LTSP thin client server every night just because it gets hit with running lots of desktop applications that I think give it a greater potential for these kinds of memory problems.

How about just killing these apps (e.g. forced logout of users) rather than rebooting the whole machine? Do you suspect memory problems in the base OS services?
On the other hand, we have all of our websites hosted on one of our parishioner's servers -- and he doesn't reboot his machines periodically like I do -- and about every two months, I have to call him up and tell him something is wrong. I suggest switching hosting providers, as your server admin apparently has next to no idea of what he's doing. I've been running web servers for years without any trouble. Only the most drastic changes should warrant a reboot (e.g. kernel update). And he goes and powers down his system -- sometimes he has to even unplug it -- and then turn it back on, and everything works again. What's up with this Windows 95-era powercycling voodoo? You are obviously dealing with a serious issue and ignoring it. I know there are system admins that just love to brag about how great their up-times are on their machines -- but this might just save you a lot of time and grief. Frequent rebooting and powercycling might have worked for you, but lots of applications don't allow for that. Don't mistake an admin's pride of a job well done for bragging. Of course, if you're running a real high-volume server, this might not be workable for you; but it only takes 2-5 minutes or so to reboot... Perhaps in the middle of the night you might be able to spare it being down that short time? This is just plastering over the problem - I've seen plenty of solutions of this kind where the restart frequency of a service slowly had to increase until it was no longer workable. In general, I'd recommend doing what you say only as the absolute last option. Just a friendly suggestion. Shared experience. I know others may tell you that that's no longer necessary anymore in these more modern times; but my experience has been otherwise. I hope it helps. When you do encounter these kinds of problems, try and capture a crash dump, file an Illumos issue and provide as much info on the problem as possible to help debug it (that's what I recommended to David, he has yet to respond). 
Nothing will improve if users keep issues to themselves. I've been dealing with a serious (show stopper) network load problem in Illumos a while back and after a little googling, mailing and testing I managed to resolve it. Sticking one's head in the sand isn't a good avenue of progress. Anyway, just my two cents.. Cheers, -- Saso ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
On 1/18/2013 7:53 PM, dormitionsk...@hotmail.com wrote: On Jan 17, 2013, at 8:47 PM, Reginald Beardsley wrote: As far as I'm concerned, problems like this are a bottomless abyss. Which is why I'm still putting up w/ my OI box hanging. It's annoying, but not critical. It's also why critical stuff still runs on Solaris 10. Intermittent failures are the worst time sink there is. There is no assurance that devoting all your time to the problem will fix it even at very high skill levels w/ a full complement of the very best tools. If you're getting crash dumps there is hope of finding the cause, so that's a big improvement. Good luck, Reg BTW Back in the 80's there was a VAX operator in Texas who went out to his truck, got a .357 and shot the computer. His employer was not happy. But I can certainly understand how the operator felt. From 1992 to I used to 1998, I used to work at the Denver Museum of Natural History -- now the Denver Museum of Nature and Science. We had two or three DEC Vax's and an AIX machine there. It was their policy that once a week we had to power each of the servers all the way down to clear out any memory problems -- or whatever -- as preventive maintenance. Since then, I've always had the habit of setting up a cron job to reboot my servers once a week. It's not as good as a full power down, but it's better than nothing. And in all these years, I've never had to deal with intermittent problems like this, except for a few brief times when I used Red Hat Linux ten plus years ago. (I've tried most of Red Hat's versions since 6.2, and RHEL 6 is the first version I've found that runs decent enough on our hardware, and that I'm happy enough with, for us to use.) So, if you can do it, you might want try setting up a cron job to reboot your server once a week -- or every night. 
I reboot our LTSP thin client server every night just because it gets hit with running lots of desktop applications that I think give it a greater potential for these kinds of memory problems.

On the other hand, we have all of our websites hosted on one of our parishioner's servers -- and he doesn't reboot his machines periodically like I do -- and about every two months, I have to call him up and tell him something is wrong. And he goes and powers down his system -- sometimes he has to even unplug it -- and then turn it back on, and everything works again.

I know there are system admins that just love to brag about how great their up-times are on their machines -- but this might just save you a lot of time and grief. Of course, if you're running a real high-volume server, this might not be workable for you; but it only takes 2-5 minutes or so to reboot... Perhaps in the middle of the night you might be able to spare it being down that short time?

Just a friendly suggestion. Shared experience. I know others may tell you that that's no longer necessary anymore in these more modern times; but my experience has been otherwise. I hope it helps.

+Peter, hieromonk

Haven't we passed the days of mystical sysadmin without understanding and characterization? Keeping up tradition for tradition's sake without understanding the underlying reasons really doesn't do anybody a favor. If there are memory leaks, we possess the technology to find them.

My organization has thousands of machines that run jobs sometimes for months at a time. If I had to reboot servers once a week, my users would be at the doors with pitchforks. The only time we take downtime is when there are reasons to do so, including OS updates, hardware failures, and user software run amok. They can run a very long time like this.

Not that memory leaks never happen. Of course they do, but they eventually get found and fixed, or the program causing them passes into obsolescence. Always.
I encourage discovery rather than superstition, and diagnosis rather than repetition. Be a knight, not a victim! ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
Well, I don't think it's stressing the hardware all that much, when you consider our oldest server is 11 1/2 years old, with all its original hardware. Our newest server is somewhere around 7 years old, without a hardware failure for at least five years.

I admit I'm not much of a system admin. I've been forced into that role because there's nobody else here to do it. Our hosting provider situation is a similarly less than ideal situation, which we're working on. Bosses kind of tend to get in the way of some of these things, too...

I have no idea about SPARC, or any of the real big server environments. I can't even fathom working in an environment with thousands of servers, or why they would even need that many. And if you have the time and expertise to work through and find the problem so it can be resolved, that's obviously better.

But this archaic way of dealing with the problem actually works -- if a person can do it. Like I said, it may not be practical for everyone's situation, though. It's certainly not for big, professional admins. For smaller environments, I believe it can be a reasonable option, though.

It's not being superstitious, or a victim. It's simply trying to take the easy way out, and if it takes care of the problem, then you don't have to deal with it any more. Or at least not right now. If it doesn't, well, then, you have to fight your way through it. I think setting up periodic reboots is better as a preventive maintenance measure, than as a way of addressing a known issue. But if nothing else, it might just buy you some time until you can work on it more at your convenience.

Oh, and I didn't make this reboot procedure up. From what I understand, it used to be fairly common practice. I figured some of the professionals would take exception to it. But sometimes, older things can still be better than new.
Unless, of course, you like fighting and beating your head against the wall trying to figure out why your system hangs, or whatever, instead of having a stable network and spending your time on less pressing and / or more mundane things... []:-) Cheers. fp ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
I ran memtest86 for 3 passes, everything was ok there. Computer froze again today after only 1 day of uptime. I now have a dump file but I am confused as to what to do with it. Sorry to be a n00b but could you point me in the right direction?

Cheers,

On 2013-01-15, at 9:10 PM, Ian Collins i...@ianshome.com wrote:

David Scharbach wrote:

I have an OI installation that seems to crash about every 20 days. Locks up completely and needs a hard reset. Not very much fun. Question I have is where would I start to look to see why? I first thought it may be due to scrubbing load on the LSI controller but that is not the case. It crashed today hours after a scrub was successful at 2AM. I am a new OI user and would really appreciate a bit of help on this one. Basically looking for a crash dump of some sort and the locations that I have looked at don't really help.

System is
i3 CPU
32GB Ram
LSI SAS2008
Intel SAS expander
13 SATA 7200RPM drives
Asus MB

Is there any evidence of a crash in the logs? Look in /var/adm/messages for any clues and under /var/crash/hostname for any dumps. I'm not sure if there is a version of Solaris CAT (Crash Analysis Tool) that works with OI. If there is and you have a dump that's the best place to look. If there isn't any evidence of a crash, there's a fair chance you have a hardware problem. I'm guessing that with an i3 motherboard you won't have ECC memory, so running memtest86 for a while would be a good start.

-- Ian.

___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
I checked and the P8V77-v that I am using seems to be listed, unless the LK suffix makes a big difference. I just disabled my on-board NIC and installed an Intel NIC. Shall see…

Thanks again,

On 2013-01-15, at 10:44 PM, Mehmet Erol Sanliturk m.e.sanlit...@gmail.com wrote:

On Tue, Jan 15, 2013 at 6:50 PM, David Scharbach david.scharb...@mac.com wrote:

I have an OI installation that seems to crash about every 20 days. Locks up completely and needs a hard reset. Not very much fun. Question I have is where would I start to look to see why? I first thought it may be due to scrubbing load on the LSI controller but that is not the case. It crashed today hours after a scrub was successful at 2AM. I am a new OI user and would really appreciate a bit of help on this one. Basically looking for a crash dump of some sort and the locations that I have looked at don't really help.

System is
i3 CPU
32GB Ram
LSI SAS2008
Intel SAS expander
13 SATA 7200RPM drives
Asus MB

Thank you for any help. Cheers, Dave

If your motherboard is NOT present in the following list, it means that whether it works under Unix-like operating systems is a matter of chance, and crashes are very likely:

http://www.asus.com/Static_WebPage/OS_Compatibility/
http://www.asus.com/websites/global/aboutasus/OS/Linux1211.pdf

Thank you very much.

Mehmet Erol Sanliturk

___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
lol, you make it seem so easy :) I just disabled the on board NIC. We will see. Next I will try the storage controller. Then a hammer. Cheers, On 2013-01-16, at 9:01 AM, Edward Ned Harvey (openindiana) openindi...@nedharvey.com wrote: From: David Scharbach [mailto:david.scharb...@mac.com] I have an OI installation that seems to crash about every 20 days. Locks up completely and needs a hard reset. Not very much fun. Whenever I've seen this type of behavior before, it was hardware/driver related, but we never were able to narrow it down to *which* piece of hardware or driver, by any method other than blindly swapping out hardware. I'm not talking, necessarily, about failing hardware. Just some sort of incompatibility bug. On one system, we greatly reduced the incidence of crashes by disabling the on-board broadcom NIC, and buying the intel server PCIE NIC instead. Likely candidates are the storage controller, and network adapter. And everything else in the system. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
As far as I'm concerned, problems like this are a bottomless abyss. Which is why I'm still putting up w/ my OI box hanging. It's annoying, but not critical. It's also why critical stuff still runs on Solaris 10. Intermittent failures are the worst time sink there is. There is no assurance that devoting all your time to the problem will fix it even at very high skill levels w/ a full complement of the very best tools. If you're getting crash dumps there is hope of finding the cause, so that's a big improvement. Good luck, Reg BTW Back in the 80's there was a VAX operator in Texas who went out to his truck, got a .357 and shot the computer. His employer was not happy. But I can certainly understand how the operator felt. --- On Thu, 1/17/13, David Scharbach david.scharb...@mac.com wrote: From: David Scharbach david.scharb...@mac.com Subject: Re: [OpenIndiana-discuss] OI Crash To: Discussion list for OpenIndiana openindiana-discuss@openindiana.org Date: Thursday, January 17, 2013, 8:27 PM lol, you make it seem so easy :) I just disabled the on board NIC. We will see. Next I will try the storage controller. Then a hammer. Cheers, On 2013-01-16, at 9:01 AM, Edward Ned Harvey (openindiana) openindi...@nedharvey.com wrote: From: David Scharbach [mailto:david.scharb...@mac.com] I have an OI installation that seems to crash about every 20 days. Locks up completely and needs a hard reset. Not very much fun. Whenever I've seen this type of behavior before, it was hardware/driver related, but we never were able to narrow it down to *which* piece of hardware or driver, by any method other than blindly swapping out hardware. I'm not talking, necessarily, about failing hardware. Just some sort of incompatibility bug. On one system, we greatly reduced the incidence of crashes by disabling the on-board broadcom NIC, and buying the intel server PCIE NIC instead. Likely candidates are the storage controller, and network adapter. And everything else in the system. 
___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
On 01/18/2013 03:20 AM, David Scharbach wrote: I ran memtest86 for 3 passes, everything was ok there. Computer froze again today after only 1 day of uptime. I now have a dump file but I am confused as to what to do with it. Sorry to be a n00b but could you point me in the right direction? If you have a crash dump, follow http://wiki.illumos.org/display/illumos/How+To+Report+Problems and send your crash dump info (the crash.0 file as it is generated in that guide). That should extract most of the relevant info from the crash dump and give us a clue as to where exactly your system panic'ed. Cheers, -- Saso ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
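As a rough illustration of what "extracting the relevant info" from a dump can look like, here is a minimal sketch. It is not the procedure from the wiki page linked above, whose exact command list may differ; the function names expand_dump and crash_report are hypothetical, the dcmd selection (::status, ::panicinfo, ::msgbuf, ::stack) is just a small set of standard mdb commands, and it assumes the dump was saved as vmdump.N in the savecore directory and that you have the privileges savecore needs.

```shell
# Expand a compressed dump (vmdump.N) into unix.N and vmcore.N in the same directory.
expand_dump() {
    dir=$1; n=$2
    savecore -vf "$dir/vmdump.$n" "$dir"
}

# Feed a few standard mdb dcmds to the expanded dump and print the result.
crash_report() {
    dir=$1; n=$2
    (cd "$dir" &&
        printf '%s\n' '::status' '::panicinfo' '::msgbuf' '::stack' |
        mdb "unix.$n" "vmcore.$n")
}

# Example use on the affected machine (nothing is run when this file is sourced):
#   expand_dump /var/crash/openindiana 0
#   crash_report /var/crash/openindiana 0 > crash.0
```

The resulting text file is small enough to attach to a bug report or paste into a mail, unlike the dump itself.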
Re: [OpenIndiana-discuss] OI Crash
BTW Back in the 80's there was a VAX operator in Texas who went out to his truck, got a .357 and shot the computer. His employer was not happy. But I can certainly understand how the operator felt. Ah. That's too bad! I used to love VAX! A shotgun would have done a much better job, too. []:-) ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
On 01/15/13 23:02, Rich wrote:

mkdir -p /var/crash/$(hostname)
pfexec dumpadm -y

And ideally, put set dump_plat_mincpu=0 in /etc/system, lest the core dump code try to thread and fail miserably. Next time you die, you should get a core dump in /var/crash/[hostname]/, presuming your dump device has enough space.

Good advice. If you're seeing hangs, I also suggest this in /etc/system:

set snooping = 1

That causes the scheduler to panic if it fails to make progress. It's probably not something you want to have for the long term, but to help identify the cause of a hard hang, it can be useful.

-- James Carlson 42.703N 71.076W carls...@workingcode.com

___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
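The two suggestions above can be collected into one small script. This is only a sketch under some assumptions: the helper names ensure_line and enable_crash_dumps are invented here, the grep in use is assumed to support -q, -x and -F, the snooping line is written without spaces as in the usual documentation examples, and the script must run with enough privilege to write /etc/system and create the crash directory. Changes to /etc/system take effect at the next boot.

```shell
# Append a line to a file only if that exact line is not already present.
ensure_line() {
    file=$1; line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Apply the crash-dump settings suggested in this thread.
# The optional argument lets you try it on a copy of /etc/system first.
enable_crash_dumps() {
    system_file=${1:-/etc/system}
    mkdir -p "/var/crash/$(hostname)"                    # savecore directory
    pfexec dumpadm -y                                    # run savecore on reboot
    ensure_line "$system_file" 'set dump_plat_mincpu=0'  # avoid the threaded dump path
    ensure_line "$system_file" 'set snooping=1'          # panic on a hard hang
}

# Nothing is changed until you call: enable_crash_dumps
```

Because ensure_line checks for an exact existing line, running the script twice does not duplicate the entries. The snooping setting is the one to remove again once the hang has been diagnosed.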
Re: [OpenIndiana-discuss] OI Crash
From: David Scharbach [mailto:david.scharb...@mac.com] I have an OI installation that seems to crash about every 20 days. Locks up completely and needs a hard reset. Not very much fun. Whenever I've seen this type of behavior before, it was hardware/driver related, but we never were able to narrow it down to *which* piece of hardware or driver, by any method other than blindly swapping out hardware. I'm not talking, necessarily, about failing hardware. Just some sort of incompatibility bug. On one system, we greatly reduced the incidence of crashes by disabling the on-board broadcom NIC, and buying the intel server PCIE NIC instead. Likely candidates are the storage controller, and network adapter. And everything else in the system. ___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
I have had this issue and it turned out to be a power supply with too little power for the 6 hard drives stuffed into my Ultra 20. Removed 2 drives and everything was fine. The drives were perfectly fine. The other time I have run into this is when I would lose a required NFS mount (like a home drive). Good luck.

The time I had a reset every 14 days turned out to be a problem with the server room's UPS sending me a 'shutdown' message every two weeks. Switched UPS and that disappeared.

On 01/16/13 10:01 AM, Edward Ned Harvey (openindiana) wrote:

From: David Scharbach [mailto:david.scharb...@mac.com]

I have an OI installation that seems to crash about every 20 days. Locks up completely and needs a hard reset. Not very much fun.

Whenever I've seen this type of behavior before, it was hardware/driver related, but we never were able to narrow it down to *which* piece of hardware or driver, by any method other than blindly swapping out hardware. I'm not talking, necessarily, about failing hardware. Just some sort of incompatibility bug. On one system, we greatly reduced the incidence of crashes by disabling the on-board broadcom NIC, and buying the intel server PCIE NIC instead. Likely candidates are the storage controller, and network adapter. And everything else in the system.

-- Dr. Daniel Kjar Assistant Professor of Biology Division of Mathematics and Natural Sciences Elmira College 1 Park Place Elmira, NY 14901 607-735-1826 http://faculty.elmira.edu/dkjar ...humans send their young men to war; ants send their old ladies -E. O. Wilson

___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
David Scharbach wrote:

I have an OI installation that seems to crash about every 20 days. Locks up completely and needs a hard reset. Not very much fun. Question I have is where would I start to look to see why? I first thought it may be due to scrubbing load on the LSI controller but that is not the case. It crashed today hours after a scrub was successful at 2AM. I am a new OI user and would really appreciate a bit of help on this one. Basically looking for a crash dump of some sort and the locations that I have looked at don't really help.

System is
i3 CPU
32GB Ram
LSI SAS2008
Intel SAS expander
13 SATA 7200RPM drives
Asus MB

Is there any evidence of a crash in the logs? Look in /var/adm/messages for any clues and under /var/crash/hostname for any dumps. I'm not sure if there is a version of Solaris CAT (Crash Analysis Tool) that works with OI. If there is and you have a dump that's the best place to look. If there isn't any evidence of a crash, there's a fair chance you have a hardware problem. I'm guessing that with an i3 motherboard you won't have ECC memory, so running memtest86 for a while would be a good start.

-- Ian.

___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
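The two places to look mentioned above can be checked with a couple of small helpers. This is a sketch only: the function names panic_hints and list_dumps and the search pattern are assumptions chosen for illustration (they are not a complete list of what a panic can leave in the log), and the grep in use is assumed to support -E and -i.

```shell
# Show log lines that commonly accompany a panic, with line numbers.
panic_hints() {
    log=${1:-/var/adm/messages}
    grep -n -i -E 'panic|bad trap|savecore' "$log"
}

# List saved dumps, or say so if the crash directory does not exist.
list_dumps() {
    dir=${1:-/var/crash/$(hostname)}
    ls -l "$dir" 2>/dev/null || echo "no crash directory at $dir"
}

# Example use (nothing is run when this file is sourced):
#   panic_hints
#   list_dumps
```

If neither turns up anything after a lockup, that points toward either a hardware problem or a system that is not set up to save dumps at all.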
Re: [OpenIndiana-discuss] OI Crash
I will make a memtest ISO ASAP. /var/adm/messages shows nothing. /var/crash does not exist on my system. Will see what memtest says.

Cheers, Dave

On 2013-01-15, at 9:10 PM, Ian Collins i...@ianshome.com wrote:

David Scharbach wrote:

I have an OI installation that seems to crash about every 20 days. Locks up completely and needs a hard reset. Not very much fun. Question I have is where would I start to look to see why? I first thought it may be due to scrubbing load on the LSI controller but that is not the case. It crashed today hours after a scrub was successful at 2AM. I am a new OI user and would really appreciate a bit of help on this one. Basically looking for a crash dump of some sort and the locations that I have looked at don't really help.

System is
i3 CPU
32GB Ram
LSI SAS2008
Intel SAS expander
13 SATA 7200RPM drives
Asus MB

Is there any evidence of a crash in the logs? Look in /var/adm/messages for any clues and under /var/crash/hostname for any dumps. I'm not sure if there is a version of Solaris CAT (Crash Analysis Tool) that works with OI. If there is and you have a dump that's the best place to look. If there isn't any evidence of a crash, there's a fair chance you have a hardware problem. I'm guessing that with an i3 motherboard you won't have ECC memory, so running memtest86 for a while would be a good start.

-- Ian.

___ OpenIndiana-discuss mailing list OpenIndiana-discuss@openindiana.org http://openindiana.org/mailman/listinfo/openindiana-discuss
Re: [OpenIndiana-discuss] OI Crash
On Jan 15, 2013, at 7:10 PM, Ian Collins i...@ianshome.com wrote:
> If there isn't any evidence of a crash, there's a fair chance you have a hardware problem.

i have a decent number of identical production boxes. about once per quarter one of them spontaneously reboots, leaving no trace as to why. it is never the same box twice. the first couple of times i offlined the systems and ran diagnostics on them. i ran memtest for two weeks. i checked the SEL, etc. and found bupkis. i came to the conclusion it is just something that happens. it is likely a driver issue. since my architecture can absorb such failures i haven't spent a lot of time on it. i am still on 151a, so perhaps there is a fix that i just don't have yet.

whatever my reboot problem is, it is not a hardware problem.

j.
Re: [OpenIndiana-discuss] OI Crash
mkdir -p /var/crash/$(hostname)
pfexec dumpadm -y

And ideally, put

set dump_plat_mincpu=0

in /etc/system, lest the core dump code try to thread and fail miserably.

Next time you die, you should get a core dump in /var/crash/[hostname]/, presuming your dump device has enough space.

- Rich

On Tue, Jan 15, 2013 at 10:18 PM, David Scharbach david.scharb...@mac.com wrote:
> I will make a memtest ISO ASAP.
>
> /var/adm/messages shows nothing. /var/crash does not exist on my system.
>
> Will see what memtest says.
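The /etc/system step above is easy to apply twice by accident. The following is a small sketch of an idempotent edit, assuming you read the file into a string first, review the result, and write it back yourself. The helper name with_dump_setting is hypothetical and the code does not touch /etc/system on its own.

```python
# The tunable suggested in the message above.
SETTING = "set dump_plat_mincpu=0"


def with_dump_setting(system_text, setting=SETTING):
    """Return system_text with `setting` appended unless an active copy exists.

    A commented-out copy (a line starting with '*') does not count as
    active, because the whole line must equal the setting.
    """
    for line in system_text.splitlines():
        if line.strip() == setting:
            return system_text  # already present, leave the text unchanged
    # Make sure the new line starts on its own line.
    sep = "" if system_text == "" or system_text.endswith("\n") else "\n"
    return system_text + sep + setting + "\n"


if __name__ == "__main__":
    # Hypothetical usage: print the proposed contents for review only.
    with open("/etc/system") as fh:
        print(with_dump_setting(fh.read()), end="")
```

Running the function on its own output returns the text unchanged, so it is safe to call repeatedly. A reboot is still needed before a new /etc/system entry takes effect.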
Re: [OpenIndiana-discuss] OI Crash
FYI, I had to force a hard reboot via the power switch today (i.e. no shutdown or sync :-( ). The system hangs and will not take any input via the X server keyboard or mouse. In this case it would not even do a clean reboot via the power switch monitor daemon. The only option was to force it down.

I'm running OI 151. I think a5, but that's from memory rather than from something reliable like uname(1).

From what I know at present, this is an interrupt conflict between the keyboard/mouse drivers and the Nvidia graphics driver. This is apparently an outstanding bug and, as far as I can see, not easily fixed. I did not have this problem w/ 148 and I would revert to that except I can't remember the root password :-(

This may not be related to your problem, but the system hanging really isn't a crash. It's actually much worse. I still have scars from a MicroVAX that hung about every 60 days for 18 months. Because it never actually crashed, it was very hard to get any help despite top grade support. Eventually we discovered it was a bad thermal sensor shutting down one side of the split 15 V supply. But that was only because it did it one day when DEC support was there and we had the skins off the machine and could see the LED status on the power supply. After over a year of this I had them living there trying to fix it. We'd replaced almost everything in the machine except the backplane and cabinet. We'd already replaced the PS, so when the fault showed on the LED we knew it wasn't the PS, which left the thermal sensor in the top of the box. I never understood why that only shut down one side of the supply, but it did a great job of locking up the CPU.

If you don't get a crash dump, it is *really* hard to resolve the cause and fix it. There's nothing to get a hold of.

Good luck and please keep us posted.
Reg

--- On Tue, 1/15/13, David Scharbach david.scharb...@mac.com wrote:
> I will make a memtest ISO ASAP.
>
> /var/adm/messages shows nothing. /var/crash does not exist on my system.
>
> Will see what memtest says.
Re: [OpenIndiana-discuss] OI Crash
On Tue, Jan 15, 2013 at 6:50 PM, David Scharbach david.scharb...@mac.com wrote:
> I have an OI installation that seems to crash about every 20 days. Locks up completely and needs a hard reset.
> [...]
> System is i3 CPU, 32GB RAM, LSI SAS2008, Intel SAS expander, 13 SATA 7200RPM drives, Asus MB
>
> Thank you for any help.
>
> Cheers,
> Dave

If your motherboard is NOT present in the following list, then running a Unix-like operating system on it is a matter of chance and crashes are very likely:

http://www.asus.com/Static_WebPage/OS_Compatibility/
http://www.asus.com/websites/global/aboutasus/OS/Linux1211.pdf

Thank you very much.

Mehmet Erol Sanliturk
Re: [OpenIndiana-discuss] OI Crash
On 2013-01-16 05:04, Reginald Beardsley wrote:
> This is apparently an outstanding bug and as far as I can see not easily fixed. I did not have this problem w/ 148 and I would revert to that except I can't remember the root password :-(

Can't you beadm mount oi_148 (insert the proper BE name) and fix up the /etc/shadow file inside there? That is, if you know your current root password, just copy-paste the ciphertext from your running BE's /etc/shadow over the one in the other BE.

When done, don't forget to beadm umount before you beadm activate.

HTH,
//Jim
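The copy-paste step can also be done mechanically. Below is a rough sketch that works on the text of the two shadow files, assuming the usual colon-separated layout with the password hash in the second field. The function name copy_root_hash is made up for illustration. The code only returns the new text; reading the running BE's /etc/shadow, reading the copy under whatever mountpoint you gave beadm mount, and writing the result back are left to you.

```python
def copy_root_hash(running_shadow, other_shadow, user="root"):
    """Return other_shadow text with user's hash field taken from running_shadow.

    Both arguments are the full text of a shadow file. Raises KeyError if
    the user is missing from either one.
    """
    new_hash = None
    for line in running_shadow.splitlines():
        parts = line.split(":")
        if parts[0] == user and len(parts) > 1:
            new_hash = parts[1]  # second field holds the password hash
            break
    if new_hash is None:
        raise KeyError(user)

    out = []
    found = False
    for line in other_shadow.splitlines():
        parts = line.split(":")
        if parts[0] == user and len(parts) > 1:
            parts[1] = new_hash  # replace only the hash, keep other fields
            line = ":".join(parts)
            found = True
        out.append(line)
    if not found:
        raise KeyError(user)
    return "\n".join(out) + "\n"
```

Only the hash field is replaced, so the ageing fields in the old BE's entry stay as they were. Keeping a backup copy of the old BE's shadow file before overwriting it seems prudent.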