Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> What next?

The machine answered the question: it ran smoothly for close to an hour. Then I left for lunch. When I came back, the monitor was black and there was no reaction. I tried everything, with the power button as a last resort. That resulted in a cold start.

Since this was a good opportunity, I gave it a shot, ran a network cable to it, and disabled WLAN. It connected properly to the network, so the NIC is probably not broken, as one could have assumed. The message lines are as in my earlier mail, except that there are two more: one with bge0 link up, immediately followed by bad address 0.0.0.0. I issued ifconfig bge0 dhcp afterwards, and there are no more bge0 messages in the log.

So what we seem to encounter here is a bad architectural mistake in the kernel. Blame nwam for pulling the wrong cords, never mind. But the kernel must not allow this to happen: when bge0 can't connect, it monopolises all resources to load the 'correct' firmware to get it back up?

On top of that, I never used bge0, always wpi0. So there is no reason at all for the kernel to try to force bge0 to work.

--
This message posted from opensolaris.org
___
opensolaris-help mailing list
opensolaris-help@opensolaris.org
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> The machine answered the question: It was running smoothly for close to an hour. Then I left for lunch. When I came back, the monitor was black, no reaction. I tried all and everything, with power button as last resort. That resulted in a cold start.

Is it possible that the IP address assigned to the system's hostname was bound to the bge0 interface; that is, after disabling the bge driver, connections to the local machine using the machine's `hostname` failed? I suspect that the X screen blanking needs such a connection to wake up from screen blank mode...

> Since this was a good opportunity, I gave it a shot and pulled a network cable to it, and disabled WLAN. And it connected properly to the network; so the NIC is probably not broken, as one could assume. The message lines are as in my earlier mail, except that there are two more: one with bge0 link up, immediately followed by bad address 0.0.0.0 I had issued ifconfig bge0 dhcp afterwards, and there are no more bge0 messages in the log.

That is, when there is no cable connected to the bge NIC hardware, the machine starts to consume lots of kernel cpu time after a few minutes, and eventually hangs the system? And when a cable is connected, there is no excessive kernel cpu time usage, and the machine doesn't hang?

> So what we seem to encounter here, is a bad architectural mistake in the kernel. Blame nwam on pulling the wrong cords, nevermind. But the kernel must not allow this to happen: When bge0 can't connect, it monopolises all resources to load the 'correct' firmware to get it back up?

Yep; I'd say something is broken in the bge driver...

Maybe the BIOS has configured the NIC hardware to enter a power saving state after five minutes with no activity, and the Solaris bge driver is confused when the device enters that power saving state (it tries to recover by resetting the bge hardware, but fails to wake up the hardware, and then waits forever for the firmware to become ready)?

> On top of that, I never used bge0, always wpi0. So there is no reason at all for the kernel to try to force bge0 to work.

I think a possible workaround is to disable svc:/network/physical:nwam, enable svc:/network/physical:default, and manually configure the wpi0 interface (and not use the bge interface for now).
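The workaround suggested above could be sketched roughly as follows. This is only an outline under assumptions: the `run` and `workaround` helper names and the `APPLY` switch are made up for illustration, the script only prints the commands unless `APPLY=1` is set, and the wireless association step (ESSID, keys) is left out because it depends on the network.

```shell
#!/bin/sh
# Sketch of the suggested workaround: NWAM off, manual configuration on,
# and only wpi0 brought up. By default the commands are just printed;
# set APPLY=1 (as root, on the affected machine) to really run them.
APPLY="${APPLY:-0}"

# run CMD...: execute the command when APPLY=1, otherwise only show it
run() {
    if [ "$APPLY" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

workaround() {
    # Stop automatic network management, start the manual service
    run svcadm disable svc:/network/physical:nwam
    run svcadm enable svc:/network/physical:default
    # Plumb only the wireless interface and request a DHCP lease;
    # bge0 is deliberately left unconfigured
    run ifconfig wpi0 plumb
    run ifconfig wpi0 dhcp
}

workaround
```

Running it without `APPLY=1` first makes it easy to check the four commands before anything is changed on the system.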
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> Is it possible that the ip address assigned to the system's hostname was bound to the bge0 interface; that is, after disabling the bge driver, connections to the local machine using the machine' `hostname` failed ? I suspect that the X screen blank needs such a connection to wake up from screen blank mode...

No, bge0 came in later. That was when it was running wpi0, with bge0 disabled in the kernel.

> That is, when there is no cable connected to the bge NIC hardware, the machine starts to consume lots of kernel cpu time after a few minutes, and eventually hangs the system?

Correct.

> And when a cable is connected, there is no excessive kernel cpu time usage, and the machine doesn't hang?

Aside from those many mwaiti86 (or so), correct. At least the machine doesn't hang (as before, let's leave out the non-return after some 2 hours of my absence), and has a beautiful load of less than 0.10.

> Yep; I'd say something is broken in the bge driver...

And in the kernel architecture, I'd add. I don't consider it proper for the kernel to shoot itself by trying to wake up a little NIC at the periphery.

> Maybe the BIOS has configured the nic hardware to enter a power saving state after five minutes with no activity; and the Solaris bge driver is confused when the device enters that power saving state (it tries to recover by reseting the bge hardware, but fails to wake up the hardware, and tries to wait forever for the firmware to become ready) ?

As I mentioned before, the BIOS is most ugly in this machine; it doesn't allow much, and many settings seem to be inaccessible to the user. As for your theory above, we have two phases to consider:

1. For some reason, the kernel tries hard to wake up bge0 after some 5 minutes, consuming all of CPU0.
2. After a few minutes more, the system is completely dead. What else does it do, what does it try to achieve a few minutes later that kills it completely?

> I think a possible workaround is to disable svc:/network/physical:nwam, enable svc:/network/physical:default, and manually configure the wpi0 interface (and not use the bge interface for now).

My workaround is much more elegant: boot to Ubuntu, OpenBSD or XP. They all work pretty well on this machine. :(

Nevertheless, if there is anything more required from my side to help debug this situation, let me know!

Uwe
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
>> Looks like driver interrupts, on cpu #0, and at IPL 4. What interrupts are bound to cpu 0 / IPL 4, on your machine? This information is printed by
>>   echo ::interrupts | mdb -k

> This is whole lot while 'sane' (close to 0 CPU use):
>
> IRQ  Vect IPL Bus  Trg Type  CPU Share APIC/INT# ISR(s)
> ...
> 25   0x30 4   PCI  Edg MSI   0   1     -         pepb_intr_handler

So it could be related to the PCI-e / PCI bus bridge; maybe some hotplug or power management event interrupt. The five minute delay could be a hint that it is related to power management.

Are there perhaps BIOS setup options to enable / disable power management for PCI-e devices?
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> ISR(s) ... 25 0x30 4 PCI Edg MSI 0 1 - pepb_intr_handler
> could be related to PCI-e / PCI bus bridge; maybe some hotplug or power management event interrupt. The five minute delay could be a hint that it is related to power management.
> Are there perhaps BIOS setup options to enabled / disable power management for PCI-e devices?

Alas, no. As much as I have come to like the machine, the BIOS is atypical. Just a proprietary down to cannot set anything here from HP.
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
>> Ok, for system cpu time usage: try to run a kernel profile, to find out what kernel functions are consuming the time,
>>   lockstat -kIW -D 20 sleep 15

> I did one on the machine, and then quickly an ssh and another one in ssh for the screenshot:
>
> # lockstat -kIW -D 20 sleep 15
> Profiling interrupt: 3074 events in 15.841 seconds (194 events/sec)
>
> Count indv cuml rcnt     nsec Hottest CPU+PIL  Caller
> ---
>  2430  79%  79% 0.00     2682 cpu[0]           i86_mwait
>   279   9%  88% 0.00     1364 cpu[0]+4         tsc_read
>   113   4%  92% 0.00   554980 cpu[0]+4         ddi_mem_get32
>   103   3%  95% 0.00     1437 cpu[0]+4         tsc_gethrtime
>    53   2%  97% 0.00     1369 cpu[0]+4         mul32
>    35   1%  98% 0.00     1337 cpu[0]+4         gethrtime
>    28   1%  99% 0.00     1379 cpu[0]+4         drv_usecwait
> ...
>
> and 10 seconds later it was completely dead. Does this help, or do you need another one?

Hmm, the 79% i86_mwait() should be 79% idle time. The remaining ~20% of cpu time is spent accessing some memory mapped registers and reading the cpu's time stamp counter (tsc), on CPU #0 at priority level 4 (cpu[0]+4). Looks like the kernel is busy waiting for some time using drv_usecwait at priority level 4.

If you repeat that lockstat, does the result look similar? cpu usage by cpu[0]+4, in tsc_read(), ddi_mem_get32(), tsc_gethrtime(), ... drv_usecwait() ?

Maybe we can find out who's calling drv_usecwait(), using:

  lockstat -kIW -f drv_usecwait -s 10 sleep 15
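The "~20% at cpu[0]+4" reading can be checked by adding up the sample counts of the quoted rows. A minimal sketch, assuming the profile rows are available as text with the CPU+PIL field in column 6 (the `pil4_share` and `sample` names are made up for this illustration):

```shell
#!/bin/sh
# pil4_share TOTAL: read lockstat profile rows on stdin and report how many
# of TOTAL samples were taken on cpu[0] at priority level 4.
pil4_share() {
    awk -v total="$1" '
        $6 == "cpu[0]+4" { n += $1 }    # column 1 = Count, column 6 = CPU+PIL
        END { printf "%d of %d samples (%.1f%%)\n", n, total, 100 * n / total }
    '
}

# The seven rows quoted in the message above
sample() {
cat <<'EOF'
2430 79% 79% 0.00 2682 cpu[0] i86_mwait
279 9% 88% 0.00 1364 cpu[0]+4 tsc_read
113 4% 92% 0.00 554980 cpu[0]+4 ddi_mem_get32
103 3% 95% 0.00 1437 cpu[0]+4 tsc_gethrtime
53 2% 97% 0.00 1369 cpu[0]+4 mul32
35 1% 98% 0.00 1337 cpu[0]+4 gethrtime
28 1% 99% 0.00 1379 cpu[0]+4 drv_usecwait
EOF
}

sample | pil4_share 3074    # prints: 611 of 3074 samples (19.9%)
```

The 3074 is the event total from the "Profiling interrupt" line of the same run; only the rows shown in the quote are counted, so the elided tail of the table is not included.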
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> If you repeat that lockstat, does the result look similar? cpu usage by cpu[0]+4, in tsc_read(), ddi_mem_get32(), tsc_gethrtime(), ...drv_usecwait() ?
> Maybe we can find out who's calling drv_usecwait(), using:
>   lockstat -kIW -f drv_usecwait -s 10 sleep 15

Okay, think, I caught them all here:

First the two at sanity (~0% CPU):

# lockstat -kIW -f drv_usecwait -s 10 sleep 15
Profiling interrupt: 1 events in 15.041 seconds (0 events/sec)
---
Count indv cuml rcnt     nsec Hottest CPU+PIL  Caller
    1 100% 100% 0.00     1246 cpu[0]           drv_usecwait

      nsec -- Time Distribution -- count  Stack
      2048 |@@                     1      ec_wait_ibf_clear
                                          ec_rd
                                          ec_handler
                                          AcpiEvAddressSpaceDispatch
                                          AcpiExAccessRegion
                                          AcpiExFieldDatumIo
                                          AcpiExExtractFromField
                                          AcpiExReadDataFromField
                                          AcpiExResolveNodeToValue
---

# lockstat -kIW -D 20 sleep 15
Profiling interrupt: 2918 events in 15.042 seconds (194 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL  Caller
---
 2896  99%  99% 0.00     3174 cpu[1]           i86_mwait
   12   0% 100% 0.00     3050 cpu[0]           (usermode)
    2   0% 100% 0.00     2757 cpu[0]           mutex_enter
    1   0% 100% 0.00     1944 cpu[1]+11        savectx
    1   0% 100% 0.00     1886 cpu[1]           cv_broadcast
    1   0% 100% 0.00     4440 cpu[1]           page_get_mnode_freelist
    1   0% 100% 0.00     1777 cpu[1]           bt_getlowbit
    1   0% 100% 0.00     3452 cpu[0]           hwblkpagecopy
    1   0% 100% 0.00     3109 cpu[0]+5         ddi_mem_put8
    1   0% 100% 0.00     3844 cpu[0]           _sys_sysenter_post_swapgs
    1   0% 100% 0.00     1414 cpu[0]+2         dtrace_dynvar_clean
---

The first command usually returned nothing; I ran it around 10 times until I got that output above.

Next, the two at ~50% CPU use:

# lockstat -kIW -D 20 sleep 15
Profiling interrupt: 3268 events in 16.849 seconds (194 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL  Caller
---
 1601  49%  49% 0.00     1098 cpu[1]+9         i86_mwait
  781  24%  73% 0.00      881 cpu[0]+4         tsc_read
  315  10%  83% 0.00   531420 cpu[0]+4         ddi_getl
  245   7%  90% 0.00      871 cpu[0]+4         tsc_gethrtime
  136   4%  94% 0.00      864 cpu[0]+4         mul32
   83   3%  97% 0.00      860 cpu[0]+4         gethrtime
   73   2%  99% 0.00      869 cpu[0]+4         drv_usecwait
    8   0%  99% 0.00    75265 cpu[1]           (usermode)
    4   0%  99% 0.00     1023 cpu[1]+9         mutex_delay_default
    3   0%  99% 0.00     2278 cpu[0]+4         do_splx
    3   0% 100% 0.00     1653 cpu[0]           AcpiUtDebugPrint
    1   0% 100% 0.00     3645 cpu[1]+9         as_segcompar
    1   0% 100% 0.00     1710 cpu[1]+9         avl_find
    1   0% 100% 0.00     3877 cpu[1]+9         page_lookup_create
    1   0% 100% 0.00      976 cpu[1]+9         default_lock_delay
    1   0% 100% 0.00     3036 cpu[1]+9         mutex_enter
    1   0% 100% 0.00     3232 cpu[1]+9         inb
    1   0% 100% 0.00  1633692 cpu[1]+9         ddi_io_put32
    1   0% 100% 0.00   951528 cpu[1]+9         ddi_getw
    1   0% 100% 0.00  1419253 cpu[1]           ddi_getb
---

# lockstat -kIW -f drv_usecwait -s 10 sleep 15
Profiling interrupt: 88 events in 16.823 seconds (5 events/sec)
---
Count indv cuml rcnt
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
>> Maybe we can find out who's calling drv_usecwait(), using:
>>   lockstat -kIW -f drv_usecwait -s 10 sleep 15

> # lockstat -kIW -f drv_usecwait -s 10 sleep 15
> Profiling interrupt: 88 events in 16.823 seconds (5 events/sec)
> ---
> Count indv cuml rcnt     nsec Hottest CPU+PIL  Caller
>    86  98%  98% 0.00      867 cpu[0]+4         drv_usecwait
>
>       nsec -- Time Distribution -- count  Stack
>       1024 |@@                     76     bge_poll_firmware
>       2048 |@@@                    10     bge_chip_reset
>                                           bge_reset
>                                           bge_restart
>                                           bge_chip_factotum
>                                           av_dispatch_softvect
>                                           dispatch_softint
>                                           switch_sp_and_call

Hmm, looks like the bge driver is using software interrupts, and I think these could be running at priority level 4.

It seems that the bge hardware has some problem, and the driver tries to reset the bge network hardware in an attempt to recover from it. bge_poll_firmware() could be busy waiting for up to one second; I suspect this could explain the kernel cpu time usage.

Are there any error or warning messages logged to /var/adm/messages when the system starts consuming kernel cpu time?

Maybe the hang can be avoided when the bge NIC driver isn't used and the bge interface is unconfigured / unplumbed? Or when the bge NIC driver isn't allowed to load, by using the kernel option -B disable-bge=true ?
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> Hmm, looks like the bge driver is using software interrupts, and I think these could be running at priority level 4. Seems that the bge hardware has some problems, and the driver tries to reset the bge network hardware in an attempt to recover from the bge hardware problem. bge_poll_firmware() could be busy waiting for up to one second; I suspect this could explain the kernel cpu time usage.
> Are there any error or warning messages logged to /var/adm/messages when the system starts consuming kernel cpu time?
> Maybe the hang can be avoided when the bge nic driver isn't used and the bge interface is unconfigured / unplumbed? Or the bge nic driver isn't allowed to load, by using the kernel option -B disable-bge=true ?

I started at the end, with -B disable-bge=true. The network applet still shows bge0, but it doesn't try to configure it. ifconfig bge0 unplumb says bge0 is no interface, so the kernel option seems to have worked. Lockstat though still shows 98% of i86_mwait at 'sane' state.

I checked /var/adm/messages, but it is so long, and I don't know what I should look for. I tried 'excess' and 'consum', but neither had any hits. What looks strange to me, the layperson in kernel land:

Aug 8 22:05:34 OSolUwe mac: [ID 469746 kern.info] NOTICE: bge0 registered
Aug 8 22:05:34 OSolUwe pci_pci: [ID 370704 kern.info] PCI-device: pci103c,3...@e, bge0
Aug 8 22:05:34 OSolUwe genunix: [ID 936769 kern.info] bge0 is /p...@0,0/pci8086,2...@1e/pci103c,3...@e
Aug 8 22:05:46 OSolUwe genunix: [ID 408114 kern.info] /p...@0,0/pci8086,2...@1e/pci103c,3...@e (bge0) online
Aug 8 22:05:47 OSolUwe ip: [ID 856290 kern.notice] ip: joining multicasts failed (4) on bge0 - will use link layer broadcasts for multicast
Aug 8 22:05:50 OSolUwe in.ndpd[366]: [ID 169330 daemon.error] Interface bge0 has been removed from kernel. in.ndpd will no longer use it
Aug 8 22:05:54 OSolUwe genunix: [ID 408114 kern.info] /p...@0,0/pci8086,2...@1e/pci103c,3...@e (bge0) online

At least I can confirm that the system now keeps running normally, meaning that at least the symptoms have been suppressed by that kernel option.

What next?
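For a log that is too long to read, one way to narrow it down is to filter for the driver name and for kernel warnings instead of guessing at words like 'excess'. A small sketch, assuming the kernel prefixes warnings with WARNING: in the same way it prefixes the NOTICE: line above (the `bge_events` and `sample_log` names and the "unrelated" log line are made up for illustration):

```shell
#!/bin/sh
# bge_events: read a messages log on stdin and print, with line numbers,
# every line that mentions the bge driver or carries a kernel WARNING.
bge_events() {
    grep -n -E 'bge|WARNING'
}

# Two lines taken from the log excerpt above plus one invented filler line
sample_log() {
cat <<'EOF'
Aug 8 22:05:34 OSolUwe mac: [ID 469746 kern.info] NOTICE: bge0 registered
Aug 8 22:05:40 OSolUwe example: [ID 000000 kern.info] unrelated line (made up)
Aug 8 22:05:47 OSolUwe ip: [ID 856290 kern.notice] ip: joining multicasts failed (4) on bge0 - will use link layer broadcasts for multicast
EOF
}

sample_log | bge_events
```

On the real machine the input would be the log itself, for example `bge_events < /var/adm/messages | tail -40`, so that only the most recent matches around the time of the CPU jump are shown.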
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
>> Did you have a look at mpstat 1 output before and after the jump to 50% cpu usage? Does it consume user or system cpu time?

> 100% system time of CPU0 when the CPU usage bumps up to 50%. Before it is around 1-3 % for each CPU.

Ok, for system cpu time usage: try to run a kernel profile, to find out what kernel functions are consuming the time,

  lockstat -kIW -D 20 sleep 15
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> Ok, for system cpu time usage: try to run a kernel profile, to find out what kernel functions are consuming the time,
>   lockstat -kIW -D 20 sleep 15

I did one on the machine, and then quickly an ssh and another one in ssh for the screenshot:

# lockstat -kIW -D 20 sleep 15
Profiling interrupt: 3074 events in 15.841 seconds (194 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL  Caller
---
 2430  79%  79% 0.00     2682 cpu[0]           i86_mwait
  279   9%  88% 0.00     1364 cpu[0]+4         tsc_read
  113   4%  92% 0.00   554980 cpu[0]+4         ddi_mem_get32
  103   3%  95% 0.00     1437 cpu[0]+4         tsc_gethrtime
   53   2%  97% 0.00     1369 cpu[0]+4         mul32
   35   1%  98% 0.00     1337 cpu[0]+4         gethrtime
   28   1%  99% 0.00     1379 cpu[0]+4         drv_usecwait
   11   0%  99% 0.00     4931 cpu[0]           (usermode)
    4   0%  99% 0.00     2269 cpu[0]+4         do_splx
    2   0%  99% 0.00     2306 cpu[1]           fsflush_do_pages
    2   0% 100% 0.00   279710 cpu[0]           ddi_io_getw
    1   0% 100% 0.00     2382 cpu[1]           as_fault
    1   0% 100% 0.00     1887 cpu[1]           xsetitimer
    1   0% 100% 0.00    13510 cpu[1]           segvn_lockop
    1   0% 100% 0.00     1705 cpu[0]           poll_common
    1   0% 100% 0.00     4468 cpu[1]           syscall_mstate
    1   0% 100% 0.00     3378 cpu[1]+11        thread_lock
    1   0% 100% 0.00     2526 cpu[1]           page_trylock
    1   0% 100% 0.00     2103 cpu[1]           kstat_compare_bykid
    1   0% 100% 0.00     1403 cpu[1]+9         mutex_delay_default
---

and 10 seconds later it was completely dead. Does this help, or do you need another one?
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> Count indv cuml rcnt     nsec Hottest CPU+PIL  Caller
> ---
>  2430  79%  79% 0.00     2682 cpu[0]           i86_mwait
>   279   9%  88% 0.00     1364 cpu[0]+4         tsc_read
>   113   4%  92% 0.00   554980 cpu[0]+4         ddi_mem_get32
>   103   3%  95% 0.00     1437 cpu[0]+4         tsc_gethrtime
> ...

Looks like driver interrupts, on cpu #0, and at IPL 4. What interrupts are bound to cpu 0 / IPL 4, on your machine? This information is printed by

  echo ::interrupts | mdb -k

Does the command intrstat show lots of interrupts / lots of cpu time usage?
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> Looks like driver interrupts, on cpu #0, and at IPL 4. What interrupts are bound to cpu 0 / IPL 4, on your machine? This information is printed by
>   echo ::interrupts | mdb -k

This is whole lot while 'sane' (close to 0 CPU use):

IRQ  Vect IPL Bus  Trg Type  CPU Share APIC/INT# ISR(s)
1    0x41 5   ISA  Edg Fixed 1   1     0x0/0x1   i8042_intr
4    0xb0 12  ISA  Edg Fixed 0   1     0x0/0x4   asyintr
7    0x43 5   ISA  Edg Fixed 1   1     0x0/0x7   ecpp_isr
9    0x81 9   PCI  Lvl Fixed 1   1     0x0/0x9   acpi_wrapper_isr
12   0x42 5   ISA  Edg Fixed 0   1     0x0/0xc   i8042_intr
14   0x40 5   ISA  Edg Fixed 0   1     0x0/0xe   ata_intr
15   0x44 5   ISA  Edg Fixed 1   1     0x0/0xf   ata_intr
16   0x88 9   PCI  Lvl Fixed 1   3     0x0/0x10  drm_irq_handler_wrap, wpi_intr, bge_intr
18   0x84 9   PCI  Lvl Fixed 1   2     0x0/0x12  pcic_intr, uhci_intr
19   0x85 9   PCI  Lvl Fixed 0   2     0x0/0x13  hci1394_isr, uhci_intr
20   0x82 9   PCI  Lvl Fixed 1   2     0x0/0x14  uhci_intr, ehci_intr
21   0x83 9   PCI  Lvl Fixed 0   2     0x0/0x15  audiohd_intr, uhci_intr
22   0x20 1   PCI  Lvl Fixed 0   1     0x0/0x16  sdhost_intr
24   0x86 7   PCI  Edg MSI   1   1     -         pepb_intr_handler
25   0x30 4   PCI  Edg MSI   0   1     -         pepb_intr_handler
26   0x87 7   PCI  Edg MSI   1   1     -         pepb_intr_handler
160  0xa0 0        Edg IPI   all 0     -         poke_cpu
192  0xc0 13       Edg IPI   all 1     -         xc_serv
208  0xd0 14       Edg IPI   all 1     -         kcpc_hw_overflow_intr
209  0xd1 14       Edg IPI   all 1     -         cbe_fire
210  0xd3 14       Edg IPI   all 1     -         cbe_fire
240  0xe0 15       Edg IPI   all 1     -         xc_serv
241  0xe1 15       Edg IPI   all 1     -         apic_error_intr

   device |  cpu0 %tim  cpu1 %tim
----------+----------------------
    ata#1 |     0  0.0     0  0.0
audiohd#0 |     2  0.0     0  0.0
    bge#0 |     0  0.0    59  0.0
   ehci#0 |     0  0.0     4  0.0
hci1394#0 |     2  0.0     0  0.0
   i915#0 |     0  0.0    59  0.0
   pcic#0 |     0  0.0     2  0.0
   uhci#0 |     0  0.0     4  0.0
   uhci#1 |     2  0.0     0  0.0
   uhci#2 |     0  0.0     2  0.0
   uhci#3 |     2  0.0     0  0.0
    wpi#0 |     0  0.0    59  0.1

The latter doesn't change very much once the mess has started, though (this is after the CPU use has bumped to 50%):

   device |  cpu0 %tim  cpu1 %tim
----------+----------------------
    ata#0 |     1  0.0     0  0.0
    ata#1 |     0  0.0     0  0.0
audiohd#0 |     1  0.0     0  0.0
    bge#0 |     0  0.0     2  0.1
   ehci#0 |     0  0.0     1  0.0
hci1394#0 |     1  0.0     0  0.0
   i915#0 |     0  0.0     2  0.0
   pcic#0 |     0  0.0     1  0.0
   uhci#0 |     0  0.0     1  0.0
   uhci#1 |     1  0.0     0  0.0
   uhci#2 |     0  0.0     1  0.1
   uhci#3 |     1  0.0     0  0.0
    wpi#0 |     0  0.0     2  0.0

Uwe
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
Yes, it is the Core 2 Duo (isainfo -k: amd64), and it runs with 'SATA Native Mode' disabled. It wouldn't run XP the other way round. Bluetooth is disabled.

mdb says:

  Kernel              13%
  ZFS File Data        8%
  Anon                 9%
  Page cache           2%
  Free (cachelist)     1%
  Free                66%
  Total             3311
  Physical          3311

What I observed: the Gnome Resource Applet shows normal values; after around 5 minutes, whatever I do, the CPU will suddenly jump to 50% usage and stay there. Then, and this is also reproducible, the external mouse becomes almost unresponsive, while the touchpad only gets 'jumpy'. Then I have another 2 minutes or so to shut down, or the system will hang 100%.
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> What I observed: the Gnome Resource Applet shows normal values, after around 5 minutes, whatever I do, the CPU will suddenly jump to 50% usage and stay there. Then, and this is also reproducable, the external mouse becomes almost unresponsive, while the touchpad gets 'jumpy' only. Then I have another 2 minutes or so to shut down, or the system will hang 100%.

Did you have a look at mpstat 1 output before and after the jump to 50% cpu usage? Does it consume user or system cpu time?
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> Did you have a look at mpstat 1 output before and after the jump to 50% cpu usage? Does it consume user or system cpu time?

100% system time of CPU0 when the CPU usage bumps up to 50%. Before it is around 1-3 % for each CPU.
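The user-versus-system question can also be answered mechanically from mpstat output. A rough sketch, assuming the usual Solaris mpstat column layout in which sys is the 14th field; the two data rows below are invented to resemble the situation described (CPU0 fully in system time) and are not taken from the affected machine, and the `busy_sys` and `sample_mpstat` names are made up:

```shell
#!/bin/sh
# busy_sys THRESHOLD: read mpstat output on stdin and list every CPU whose
# system time (column 14, "sys") is at or above THRESHOLD percent.
busy_sys() {
    awk -v thresh="$1" '
        $1 ~ /^[0-9]+$/ && $14 >= thresh { print "cpu" $1 ": " $14 "% sys" }
    '
}

# Invented sample in mpstat layout (hypothetical numbers)
sample_mpstat() {
cat <<'EOF'
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   420  110  300    2   10    1    0   500    0 100   0   0
  1   12   0    0   380   90  410    5   12    2    0   900    2   1   0  97
EOF
}

sample_mpstat | busy_sys 90    # prints: cpu0: 100% sys
```

On a live system the input would come from something like `mpstat 1 5 | busy_sys 90`; the header lines are skipped because their first field is not a number.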
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> I gave OpenSolaris 2009.06 a shot, on an old HP nx6320.

I am running it daily on an equally old HP NX6310 (I *think* yours has got a Core 2 Duo CPU, whereas mine is the older Core Duo; 'isainfo -k' will confirm this).

> Already during the install, it slows down considerably, and got stuck completely. The third effort succeeded finally.

Check if the disk controller is configured to run in SATA mode.

> Now, after reboot, the system is normal, if not snappy, for a few minutes, then the mouse starts to react slow, that is, it jumps as if it would read the position data once per second, and after a few more minutes it freezes completely: no more keyboard, no more mouse, and it doesn't answer to ping-s any more.

Sounds like some interrupt flood to me. Try disabling bits that do not work on OpenSolaris, like the Bluetooth. I have never had anything similar on this laptop under OpenSolaris yet. On an older PIII box under Nevada snv-117 or thereabouts I have the top-left-corner jumping cursor, which is very annoying (but then I have the same on a different NetBSD machine under Xorg 1.6.1, so my take is that this is an upstream problem), so in the end I disabled cde-login and gdm and am using it solely as a server.

> Hard reboot, and the same starts all over. Since it is a fresh install from the original CD, I guess, there is a serious problem. The same notebook runs XP and Ubuntu very well; memory test goes through for 24 hours. I have tried an update, which resulted in an 'opensolaris-1' grub entry, and it behaves likewise. It has 4 GB of RAM.

2GB on mine...

> Just after boot, 31% thereof are consumes for programs, 0% for cache, by the desktop alone. Is that normal?

On the machine I am presently on (2GB memory, dual Opteron 246), after 20 days uptime (rebooted only because of an image-update) I have 45% memory use (Firefox 3.51 being the greatest hog with about 10 tabs open). ZFS is usually taking most of the rest for ARC as far as I can gather. I've got

  set zfs:zfs_arc_max=831155200

in my /etc/system to limit its use a little.

> Uwe

Chavdar Ivanov
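For scale, the zfs_arc_max value quoted above is given in bytes; converting it shows how large the ARC cap is. A tiny sketch (the `bytes_to_mib` helper name is made up for illustration):

```shell
#!/bin/sh
# bytes_to_mib N: integer number of MiB in N bytes (rounds down)
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

# Value from the /etc/system line: set zfs:zfs_arc_max=831155200
bytes_to_mib 831155200    # prints: 792
```

So the setting caps the ARC at a little under 800 MiB; a different cap would be written into /etc/system the same way, as a byte count.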
Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
> It has 4 GB of RAM. Just after boot, 31% thereof are consumes for programs, 0% for cache, by the desktop alone. Is that normal?

Where are these percentages coming from? The mdb -k ::memstat command? My 8GB box shows 0% (12MB) for Exec and libs, and 85% Free, and the rest is used by the kernel (just after a boot).

Could this be a memory leak in some userland program? Does prstat -s rss / prstat -s size list any processes that have quickly growing memory consumption?
[osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes
I gave OpenSolaris 2009.06 a shot, on an old HP nx6320. Already during the install, it slowed down considerably and got stuck completely. The third attempt finally succeeded.

Now, after reboot, the system is normal, if not snappy, for a few minutes; then the mouse starts to react slowly, that is, it jumps as if it would read the position data once per second, and after a few more minutes it freezes completely: no more keyboard, no more mouse, and it doesn't answer pings any more. Hard reboot, and the same starts all over.

Since it is a fresh install from the original CD, I guess there is a serious problem. The same notebook runs XP and Ubuntu very well; a memory test goes through for 24 hours. I have tried an update, which resulted in an 'opensolaris-1' grub entry, and it behaves likewise.

It has 4 GB of RAM. Just after boot, 31% thereof is consumed for programs, 0% for cache, by the desktop alone. Is that normal?

Uwe