Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-09 Thread Uwe Dippel
 What next?

The machine answered the question: it ran smoothly for close to an 
hour. Then I left for lunch. When I came back, the monitor was black and the 
machine did not react. I tried everything, with the power button as a last 
resort. That resulted in a cold start.

Since this was a good opportunity, I gave it a shot: I plugged a network cable 
into it and disabled WLAN. It connected properly to the network, so the NIC 
is probably not broken, as one might have assumed. The message lines are as in my 
earlier mail, except that there are two more: one with bge0 link up, 
immediately followed by bad address 0.0.0.0.
I then issued ifconfig bge0 dhcp, and there have been no more bge0 messages 
in the log since.
So what we seem to have here is a serious architectural mistake in the 
kernel. Blame nwam for pulling the wrong cords, never mind. But the kernel must 
not allow this to happen: when bge0 can't connect, it monopolises all resources 
to load the 'correct' firmware to get it back up? On top of that, I never used 
bge0, always wpi0. So there is no reason at all for the kernel to try to force 
bge0 to work.
-- 
This message posted from opensolaris.org
___
opensolaris-help mailing list
opensolaris-help@opensolaris.org


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-09 Thread Jürgen Keil
 The machine answered the question: It was running
 smoothly for close to an hour. Then I left for lunch.
 When I came back, the monitor was black, no reaction.
 I tried all and everything, with power button as last
 resort. That resulted in a cold start.

Is it possible that the IP address assigned to the
system's hostname was bound to the bge0 interface;
that is, after disabling the bge driver, connections
to the local machine using the machine's `hostname`
failed? I suspect that the X screen blanker needs such
a connection to wake up from screen blank mode...


 Since this was a good opportunity, I gave it a shot
 and pulled a network cable to it, and disabled WLAN.
 And it connected properly to the network; so the NIC
 is probably not broken, as one could assume. The
 message lines are as in my earlier mail, except that
 there are two more: one with bge0 link up,
 immediately followed by bad address 0.0.0.0
 I had issued ifconfig bge0 dhcp afterwards, and there
 are no more bge0 messages in the log.

That is, when there is no cable connected to the 
bge NIC hardware, the machine starts to consume
lots of kernel cpu time after a few minutes, and
eventually hangs the system?

And when a cable is connected, there is no
excessive kernel cpu time usage, and the machine
doesn't hang?


 So what we seem to encounter here, is a bad
 architectural mistake in the kernel. Blame nwam on
 pulling the wrong cords, nevermind. But the kernel
 must not allow this to happen: When bge0 can't
 connect, it monopolises all resources to load the
 'correct' firmware to get it back up?

Yep; I'd say something is broken in the bge driver...

Maybe the BIOS has configured the NIC hardware to enter
a power saving state after five minutes with no activity,
and the Solaris bge driver is confused when the device
enters that power saving state (it tries to recover by
resetting the bge hardware, but fails to wake up the hardware,
and then waits forever for the firmware to become ready)?


 On top of that,
 I never used bge0, always wpi0. So there is no reason
 at all for the kernel to try to force bge0 to work.

I think a possible workaround is to disable
svc:/network/physical:nwam, enable
svc:/network/physical:default, and manually
configure the wpi0 interface (and not use the
bge interface for now).
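
For reference, the service switch suggested above could look roughly like the following. This is a minimal Python dry-run sketch and not something posted in the thread: the two FMRIs are the ones named above, while the `ifconfig` steps for wpi0 and the helper names `workaround_commands` and `run_commands` are assumptions for illustration. Wireless association (ESSID, key) is deliberately left out.

```python
import subprocess

# FMRIs exactly as named in the message above.
NWAM = "svc:/network/physical:nwam"
DEFAULT = "svc:/network/physical:default"


def workaround_commands(iface="wpi0"):
    """Build the command list; the ifconfig steps are an assumed manual setup."""
    return [
        ["svcadm", "disable", NWAM],
        ["svcadm", "enable", DEFAULT],
        ["ifconfig", iface, "plumb"],          # assumption: plumb only the wireless NIC
        ["ifconfig", iface, "dhcp", "start"],  # assumption: address via DHCP
    ]


def run_commands(commands, dry_run=True):
    """Print the commands, or execute them as root when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


run_commands(workaround_commands())
```

With `dry_run=True` nothing is changed on the system; the commands are only printed so they can be reviewed first. Note that bge0 is never touched, which is the point of the workaround.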


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-09 Thread Uwe Dippel
 Is it possible that the ip address assigned to the
 system's hostname was bound to the bge0 interface;
 that is, after disabling the bge driver, connections
 to the local machine using the machine' `hostname`
 failed ?   I suspect that the X screen blank needs
 such
 a connection to wake up from screen blank mode...

No, bge0 came in later. That was when it was running wpi0, with bge0 disabled 
in the kernel.

 That is, when there is no cable connected to the 
 bge NIC hardware, the machine starts to consume
 lots of kernel cpu time after a few minutes, and
 eventually hangs the system?

Correct.

 And when a cable is connected, there is no
 excessive kernel cpu time usage, and the machine
 doesn't hang?

Aside from the many i86_mwait entries, correct. At least the machine doesn't 
hang (as before, let's leave out the failure to come back after some 2 hours of my 
absence), and it has a beautiful load of less than 0.10.

 Yep; I'd say something is broken in the bge
 driver...

And in the kernel architecture, I'd add. I don't consider it proper for the 
kernel to shoot itself by trying to wake up a little NIC at the periphery.

 Maybe the BIOS has configured the nic hardware to enter
 a power saving state after five minutes with no activity;
 and the Solaris bge driver is confused when the device
 enters that power saving state (it tries to recover by
 resetting the bge hardware, but fails to wake up the hardware,
 and tries to wait forever for the firmware to become ready) ?

As I mentioned before, the BIOS in this machine is really ugly: it doesn't allow 
much, and many settings seem to be inaccessible to the user.

As for your theory above, we have two phases to consider:
1. For some reason, the kernel tries hard to wake up bge0 after some 5 minutes, 
consuming all of CPU0
2. After a few minutes more, the system is completely dead. What else does it 
do, what does it try to achieve a few minutes later that kills it completely?

 I think a possible workaround is to disable
 svc:/network/physical:nwam, enable
 svc:/network/physical:default, and manually
 configure the wpi0 interface (and not use the
 bge interface for now).

My workaround is much more elegant: Boot to Ubuntu, OpenBSD or XP. They all 
work pretty well on this machine.  :(

Nevertheless, if anything more is required from my side to help debug 
this situation, let me know!

Uwe


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-08 Thread Jürgen Keil
  Looks like driver interrupts, on cpu #0, and at IPL 4.
  
  What interrupts are bound to cpu 0 / IPL 4, on your
  machine?  This information is printed by 
  
  echo ::interrupts | mdb -k
 
 This is whole lot while 'sane' (close to 0 CPU use):
 IRQ  Vect IPL Bus Trg Type  CPU Share APIC/INT# ISR(s)
 ...
 25   0x30 4   PCI Edg MSI   0   1     -         pepb_intr_handler

So it could be related to PCI-e / PCI bus bridge;
maybe some hotplug or power management event interrupt.

The five minute delay could be a hint that it is related to
power management.  Are there perhaps BIOS setup options
to enable / disable power management for PCI-e devices?


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-08 Thread Uwe Dippel
 ISR(s)
 ...
 25   0x30 4   PCI Edg MSI   0   1     -         pepb_intr_handler
 could be related to PCI-e / PCI bus bridge;
 maybe some hotplug or power management event interrupt.
 
 The five minute delay could be a hint that it is related to
 power management.  Are there perhaps BIOS setup options
 to enable / disable power management for PCI-e devices?

Alas, no. As much as I have come to like the machine, the BIOS is atypical: 
a proprietary, locked-down one from HP in which you cannot set anything.


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-08 Thread Jürgen Keil
  Ok, for system cpu time usage:  try to run a kernel
  profile, to find out what kernel functions are consuming
  the time,  lockstat -kIW -D 20 sleep 15
 
 I did one on the machine, and then quickly an ssh and
 another one in ssh for the screenshot:
 
 # lockstat -kIW -D 20 sleep 15
 
 Profiling interrupt: 3074 events in 15.841 seconds
 (194 events/sec)
 
 Count indv cuml rcnt     nsec Hottest CPU+PIL Caller
 ------------------------------------------------------------
  2430  79%  79% 0.00     2682 cpu[0]          i86_mwait
   279   9%  88% 0.00     1364 cpu[0]+4        tsc_read
   113   4%  92% 0.00   554980 cpu[0]+4        ddi_mem_get32
   103   3%  95% 0.00     1437 cpu[0]+4        tsc_gethrtime
    53   2%  97% 0.00     1369 cpu[0]+4        mul32
    35   1%  98% 0.00     1337 cpu[0]+4        gethrtime
    28   1%  99% 0.00     1379 cpu[0]+4        drv_usecwait
 ...

 and 10 seconds later it was completely dead.
 
 Does this help, or do you need another one?

Hmm, the 79% i86_mwait() should be 79% idle time.

The rest is ~ 20% cpu time usage for accessing some
memory mapped registers, reading the cpu's time
stamp counter (tsc); on CPU #0 at priority level 4
cpu[0]+4.  Looks like the kernel is busy waiting
for some time using drv_usecwait at priority level 4.
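
As a rough cross-check of the "~ 20%" figure, the sample counts can simply be summed per CPU and priority level. The following is a minimal Python sketch, not part of the original exchange: the rows and the event total are copied from the quoted lockstat output, and the helper name `share` is made up for illustration.

```python
# Rows copied from the quoted lockstat profile: (count, "CPU+PIL", caller).
ROWS = [
    (2430, "cpu[0]",   "i86_mwait"),
    (279,  "cpu[0]+4", "tsc_read"),
    (113,  "cpu[0]+4", "ddi_mem_get32"),
    (103,  "cpu[0]+4", "tsc_gethrtime"),
    (53,   "cpu[0]+4", "mul32"),
    (35,   "cpu[0]+4", "gethrtime"),
    (28,   "cpu[0]+4", "drv_usecwait"),
]
TOTAL_EVENTS = 3074  # from "Profiling interrupt: 3074 events"


def share(rows, total, predicate):
    """Percentage of all profiling samples whose row satisfies predicate."""
    hits = sum(count for count, cpu_pil, caller in rows
               if predicate(cpu_pil, caller))
    return 100.0 * hits / total


idle_pct = share(ROWS, TOTAL_EVENTS, lambda cpu, fn: fn == "i86_mwait")
ipl4_pct = share(ROWS, TOTAL_EVENTS, lambda cpu, fn: cpu == "cpu[0]+4")

print(f"idle (i86_mwait): {idle_pct:.1f}%")
print(f"cpu[0] at PIL 4:  {ipl4_pct:.1f}%")
```

Only the seven rows shown in the quote are counted; the rows elided by "..." account for the small remainder.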

If you repeat that lockstat, does the result look similar?
cpu usage by cpu[0]+4, in tsc_read(), ddi_mem_get32(),
tsc_gethrtime(), ...drv_usecwait()  ?


Maybe we can find out who's calling drv_usecwait(),
using:
lockstat -kIW -f drv_usecwait -s 10 sleep 15


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-08 Thread Uwe Dippel
 If you repeat that lockstat, does the result look
 similar?
 cpu usage by cpu[0]+4, in tsc_read(),
 ddi_mem_get32(),
 tsc_gethrtime(), ...drv_usecwait()  ?
 
 
 Maybe we can find out who's calling drv_usecwait(),
 using:
 lockstat -kIW -f drv_usecwait -s 10 sleep 15

Okay, I think I caught them all here:

First the two at sanity (~0% CPU):

# lockstat -kIW -f drv_usecwait -s 10 sleep 15

Profiling interrupt: 1 events in 15.041 seconds (0 events/sec)

------------------------------------------------------------
Count indv cuml rcnt     nsec Hottest CPU+PIL Caller
    1 100% 100% 0.00     1246 cpu[0]          drv_usecwait

  nsec -- Time Distribution -- count Stack
  2048 |@@                     1     ec_wait_ibf_clear
                                     ec_rd
                                     ec_handler
                                     AcpiEvAddressSpaceDispatch
                                     AcpiExAccessRegion
                                     AcpiExFieldDatumIo
                                     AcpiExExtractFromField
                                     AcpiExReadDataFromField
                                     AcpiExResolveNodeToValue
------------------------------------------------------------

# lockstat -kIW -D 20 sleep 15

Profiling interrupt: 2918 events in 15.042 seconds (194 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL Caller
------------------------------------------------------------
 2896  99%  99% 0.00     3174 cpu[1]          i86_mwait
   12   0% 100% 0.00     3050 cpu[0]          (usermode)
    2   0% 100% 0.00     2757 cpu[0]          mutex_enter
    1   0% 100% 0.00     1944 cpu[1]+11       savectx
    1   0% 100% 0.00     1886 cpu[1]          cv_broadcast
    1   0% 100% 0.00     4440 cpu[1]          page_get_mnode_freelist
    1   0% 100% 0.00     1777 cpu[1]          bt_getlowbit
    1   0% 100% 0.00     3452 cpu[0]          hwblkpagecopy
    1   0% 100% 0.00     3109 cpu[0]+5        ddi_mem_put8
    1   0% 100% 0.00     3844 cpu[0]          _sys_sysenter_post_swapgs
    1   0% 100% 0.00     1414 cpu[0]+2        dtrace_dynvar_clean
------------------------------------------------------------

The first command usually returned nothing; I ran it around 10 times until I 
got that output above.


Next, the two at ~50% CPU use:

# lockstat -kIW -D 20 sleep 15

Profiling interrupt: 3268 events in 16.849 seconds (194 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL Caller
------------------------------------------------------------
 1601  49%  49% 0.00     1098 cpu[1]+9        i86_mwait
  781  24%  73% 0.00      881 cpu[0]+4        tsc_read
  315  10%  83% 0.00   531420 cpu[0]+4        ddi_getl
  245   7%  90% 0.00      871 cpu[0]+4        tsc_gethrtime
  136   4%  94% 0.00      864 cpu[0]+4        mul32
   83   3%  97% 0.00      860 cpu[0]+4        gethrtime
   73   2%  99% 0.00      869 cpu[0]+4        drv_usecwait
    8   0%  99% 0.00    75265 cpu[1]          (usermode)
    4   0%  99% 0.00     1023 cpu[1]+9        mutex_delay_default
    3   0%  99% 0.00     2278 cpu[0]+4        do_splx
    3   0% 100% 0.00     1653 cpu[0]          AcpiUtDebugPrint
    1   0% 100% 0.00     3645 cpu[1]+9        as_segcompar
    1   0% 100% 0.00     1710 cpu[1]+9        avl_find
    1   0% 100% 0.00     3877 cpu[1]+9        page_lookup_create
    1   0% 100% 0.00      976 cpu[1]+9        default_lock_delay
    1   0% 100% 0.00     3036 cpu[1]+9        mutex_enter
    1   0% 100% 0.00     3232 cpu[1]+9        inb
    1   0% 100% 0.00  1633692 cpu[1]+9        ddi_io_put32
    1   0% 100% 0.00   951528 cpu[1]+9        ddi_getw
    1   0% 100% 0.00  1419253 cpu[1]          ddi_getb
------------------------------------------------------------

# lockstat -kIW -f drv_usecwait -s 10 sleep 15 

Profiling interrupt: 88 events in 16.823 seconds (5 events/sec)

---
Count indv cuml rcnt

Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-08 Thread Jürgen Keil
  Maybe we can find out who's calling drv_usecwait(),
  using:
  lockstat -kIW -f drv_usecwait -s 10 sleep 15
 # lockstat -kIW -f drv_usecwait -s 10 sleep 15 
 
 Profiling interrupt: 88 events in 16.823 seconds (5 events/sec)
 
 ------------------------------------------------------------
 Count indv cuml rcnt     nsec Hottest CPU+PIL Caller
    86  98%  98% 0.00      867 cpu[0]+4        drv_usecwait
 
   nsec -- Time Distribution -- count Stack
   1024 |@@                     76    bge_poll_firmware
   2048 |@@@                    10    bge_chip_reset
                                      bge_reset
                                      bge_restart
                                      bge_chip_factotum
                                      av_dispatch_softvect
                                      dispatch_softint
                                      switch_sp_and_call

Hmm, looks like the bge driver is using
software interrupts, and I think these could
be running at priority level 4.

Seems that the bge hardware has some
problems, and the driver tries to reset the
bge network hardware in an attempt to 
recover from the bge hardware problem.

bge_poll_firmware() could be busy waiting 
for up to one second; I suspect this could
explain the kernel cpu time usage.
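
To illustrate why such a polling loop shows up as pure kernel CPU time, here is a minimal Python sketch of a bounded busy-wait. It is not the bge driver code: the function `poll_firmware`, the poll count and the per-poll delay are assumptions, chosen only so that the ceiling works out to the "up to one second" mentioned above.

```python
def poll_firmware(firmware_ready, max_polls=1000, delay_us=1000):
    """Busy-wait until firmware_ready() is true or the poll budget runs out.

    Returns (ready, busy_us): whether the firmware answered, and how many
    microseconds were spent spinning. max_polls and delay_us are
    illustrative values, not taken from the bge driver source.
    """
    busy_us = 0
    for _ in range(max_polls):
        if firmware_ready():
            return True, busy_us
        # Stands in for drv_usecwait(delay_us), which spins the CPU
        # instead of sleeping.
        busy_us += delay_us
    return False, busy_us


# A device that never wakes up costs the full budget on every reset attempt.
ready, busy_us = poll_firmware(lambda: False)
print(ready, busy_us / 1_000_000, "s of busy waiting per reset attempt")
```

If the driver retries the reset again and again while the hardware stays unresponsive, each attempt burns its full budget, which would be consistent with one CPU being pinned in system time.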

Are there any error or warning messages
logged to /var/adm/messages when the
system starts consuming kernel cpu time?


Maybe the hang can be avoided when the
bge nic driver isn't used and the bge interface
is unconfigured / unplumbed?  Or the bge
nic driver isn't allowed to load, by using
the kernel option -B disable-bge=true ?


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-08 Thread Uwe Dippel
 Hmm, looks like the bge driver is using
 software interrupts, and I think these could
 be running at priority level 4.
 
 Seems that the bge hardware has some
 problems, and the driver tries to reset the
 bge network hardware in an attempt to 
 recover from the bge hardware problem.
 
 bge_poll_firmware() could be busy waiting 
 for up to one second; I suspect this could
 explain the kernel cpu time usage.
 
 Are there any error or warning messages
 logged to /var/adm/messages when the
 system starts consuming kernel cpu time?
 
 
 Maybe the hang can be avoided when the
 bge nic driver isn't used and the bge interface
 is unconfigured / unplumbed?  Or the bge
 nic driver isn't allowed to load, by using
 the kernel option -B disable-bge=true ?

I started at the end, with -B disable-bge=true. The network applet still shows 
bge0, but it doesn't try to configure it. ifconfig bge0 unplumb says bge0 is no 
interface, so the kernel option seems to have worked. Lockstat though still 
shows 98% of i86_mwait at 'sane' state.

I checked the /var/adm/messages, but it is so long, and I don't know what I 
should look for. I tried 'excess' and 'consum', but neither had any hits.

What looks strange to me, the layperson in kernel land:
Aug  8 22:05:34 OSolUwe mac: [ID 469746 kern.info] NOTICE: bge0 registered
Aug  8 22:05:34 OSolUwe pci_pci: [ID 370704 kern.info] PCI-device: 
pci103c,3...@e, bge0
Aug  8 22:05:34 OSolUwe genunix: [ID 936769 kern.info] bge0 is 
/p...@0,0/pci8086,2...@1e/pci103c,3...@e
Aug  8 22:05:46 OSolUwe genunix: [ID 408114 kern.info] 
/p...@0,0/pci8086,2...@1e/pci103c,3...@e (bge0) online
Aug  8 22:05:47 OSolUwe ip: [ID 856290 kern.notice] ip: joining multicasts 
failed (4) on bge0 - will use link layer broadcasts for multicast
Aug  8 22:05:50 OSolUwe in.ndpd[366]: [ID 169330 daemon.error] Interface bge0 
has been removed from kernel. in.ndpd will no longer use it
Aug  8 22:05:54 OSolUwe genunix: [ID 408114 kern.info] 
/p...@0,0/pci8086,2...@1e/pci103c,3...@e (bge0) online

At least, I can confirm that now the system keeps running normally; meaning 
that at least the symptoms have been suppressed by that kernel option.

What next?


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-07 Thread Jürgen Keil
  Did you have a look at mpstat 1 output before and
  after the jump to 50% cpu usage?
  
  Does it consume user or system cpu time?
 
 100% system time of CPU0 when the CPU usage bumps up
 to 50%. Before it is around 1-3 % for each CPU.

Ok, for system cpu time usage:  try to run a kernel
profile, to find out what kernel functions are consuming
the time,  lockstat -kIW -D 20 sleep 15


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-07 Thread Uwe Dippel
 Ok, for system cpu time usage:  try to run a kernel
 profile, to find out what kernel functions are
 consuming
 the time,  lockstat -kIW -D 20 sleep 15

I ran it once on the machine itself, then quickly opened an ssh session and ran 
it again over ssh to capture the output:

# lockstat -kIW -D 20 sleep 15

Profiling interrupt: 3074 events in 15.841 seconds (194 events/sec)

Count indv cuml rcnt     nsec Hottest CPU+PIL Caller
------------------------------------------------------------
 2430  79%  79% 0.00     2682 cpu[0]          i86_mwait
  279   9%  88% 0.00     1364 cpu[0]+4        tsc_read
  113   4%  92% 0.00   554980 cpu[0]+4        ddi_mem_get32
  103   3%  95% 0.00     1437 cpu[0]+4        tsc_gethrtime
   53   2%  97% 0.00     1369 cpu[0]+4        mul32
   35   1%  98% 0.00     1337 cpu[0]+4        gethrtime
   28   1%  99% 0.00     1379 cpu[0]+4        drv_usecwait
   11   0%  99% 0.00     4931 cpu[0]          (usermode)
    4   0%  99% 0.00     2269 cpu[0]+4        do_splx
    2   0%  99% 0.00     2306 cpu[1]          fsflush_do_pages
    2   0% 100% 0.00   279710 cpu[0]          ddi_io_getw
    1   0% 100% 0.00     2382 cpu[1]          as_fault
    1   0% 100% 0.00     1887 cpu[1]          xsetitimer
    1   0% 100% 0.00    13510 cpu[1]          segvn_lockop
    1   0% 100% 0.00     1705 cpu[0]          poll_common
    1   0% 100% 0.00     4468 cpu[1]          syscall_mstate
    1   0% 100% 0.00     3378 cpu[1]+11       thread_lock
    1   0% 100% 0.00     2526 cpu[1]          page_trylock
    1   0% 100% 0.00     2103 cpu[1]          kstat_compare_bykid
    1   0% 100% 0.00     1403 cpu[1]+9        mutex_delay_default
------------------------------------------------------------

and 10 seconds later it was completely dead.

Does this help, or do you need another one?


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-07 Thread Jürgen Keil
 Count indv cuml rcnt     nsec Hottest CPU+PIL Caller
 ------------------------------------------------------------
  2430  79%  79% 0.00     2682 cpu[0]          i86_mwait
   279   9%  88% 0.00     1364 cpu[0]+4        tsc_read
   113   4%  92% 0.00   554980 cpu[0]+4        ddi_mem_get32
   103   3%  95% 0.00     1437 cpu[0]+4        tsc_gethrtime
 ...

Looks like driver interrupts, on cpu #0, and at IPL 4.

What interrupts are bound to cpu 0 / IPL 4, on your
machine?  This information is printed by 

echo ::interrupts | mdb -k


Does the command intrstat show lots of interrupts / 
lots of cpu time usage?


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-07 Thread Uwe Dippel
 Looks like driver interrupts, on cpu #0, and at IPL
 4.
 
 What interrupts are bound to cpu 0 / IPL 4, on your
 machine?  This information is printed by 
 
 echo ::interrupts | mdb -k

This is the whole lot while 'sane' (close to 0 CPU use):
IRQ  Vect IPL Bus Trg Type  CPU Share APIC/INT# ISR(s)
1    0x41 5   ISA Edg Fixed 1   1     0x0/0x1   i8042_intr
4    0xb0 12  ISA Edg Fixed 0   1     0x0/0x4   asyintr
7    0x43 5   ISA Edg Fixed 1   1     0x0/0x7   ecpp_isr
9    0x81 9   PCI Lvl Fixed 1   1     0x0/0x9   acpi_wrapper_isr
12   0x42 5   ISA Edg Fixed 0   1     0x0/0xc   i8042_intr
14   0x40 5   ISA Edg Fixed 0   1     0x0/0xe   ata_intr
15   0x44 5   ISA Edg Fixed 1   1     0x0/0xf   ata_intr
16   0x88 9   PCI Lvl Fixed 1   3     0x0/0x10  drm_irq_handler_wrap, wpi_intr, bge_intr
18   0x84 9   PCI Lvl Fixed 1   2     0x0/0x12  pcic_intr, uhci_intr
19   0x85 9   PCI Lvl Fixed 0   2     0x0/0x13  hci1394_isr, uhci_intr
20   0x82 9   PCI Lvl Fixed 1   2     0x0/0x14  uhci_intr, ehci_intr
21   0x83 9   PCI Lvl Fixed 0   2     0x0/0x15  audiohd_intr, uhci_intr
22   0x20 1   PCI Lvl Fixed 0   1     0x0/0x16  sdhost_intr
24   0x86 7   PCI Edg MSI   1   1     -         pepb_intr_handler
25   0x30 4   PCI Edg MSI   0   1     -         pepb_intr_handler
26   0x87 7   PCI Edg MSI   1   1     -         pepb_intr_handler
160  0xa0 0       Edg IPI   all 0     -         poke_cpu
192  0xc0 13      Edg IPI   all 1     -         xc_serv
208  0xd0 14      Edg IPI   all 1     -         kcpc_hw_overflow_intr
209  0xd1 14      Edg IPI   all 1     -         cbe_fire
210  0xd3 14      Edg IPI   all 1     -         cbe_fire
240  0xe0 15      Edg IPI   all 1     -         xc_serv
241  0xe1 15      Edg IPI   all 1     -         apic_error_intr

      device |      cpu0 %tim      cpu1 %tim
-------------+------------------------------
       ata#1 |         0  0.0         0  0.0
   audiohd#0 |         2  0.0         0  0.0
       bge#0 |         0  0.0        59  0.0
      ehci#0 |         0  0.0         4  0.0
   hci1394#0 |         2  0.0         0  0.0
      i915#0 |         0  0.0        59  0.0
      pcic#0 |         0  0.0         2  0.0
      uhci#0 |         0  0.0         4  0.0
      uhci#1 |         2  0.0         0  0.0
      uhci#2 |         0  0.0         2  0.0
      uhci#3 |         2  0.0         0  0.0
       wpi#0 |         0  0.0        59  0.1

The latter doesn't change very much once the mess has started, though (this is 
after the CPU use has bumped to 50%):
      device |      cpu0 %tim      cpu1 %tim
-------------+------------------------------
       ata#0 |         1  0.0         0  0.0
       ata#1 |         0  0.0         0  0.0
   audiohd#0 |         1  0.0         0  0.0
       bge#0 |         0  0.0         2  0.1
      ehci#0 |         0  0.0         1  0.0
   hci1394#0 |         1  0.0         0  0.0
      i915#0 |         0  0.0         2  0.0
      pcic#0 |         0  0.0         1  0.0
      uhci#0 |         0  0.0         1  0.0
      uhci#1 |         1  0.0         0  0.0
      uhci#2 |         0  0.0         1  0.1
      uhci#3 |         1  0.0         0  0.0
       wpi#0 |         0  0.0         2  0.0

Uwe


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-06 Thread Uwe Dippel
Yes, it is the Core 2 Duo (isainfo -k: amd64), and it runs with 'SATA Native 
Mode' disabled. It wouldn't run XP the other way round. Bluetooth is disabled.

mdb says:
Kernel: 13%
ZFS File Data 8%
Anon 9%
Page cache 2%
Free (cachelist) 1%
Free 66%

Total 3311
Physical 3311

What I observed: the Gnome Resource Applet shows normal values; then, after around 5 
minutes, whatever I do, the CPU suddenly jumps to 50% usage and stays there.
Then, and this is also reproducible, the external mouse becomes almost 
unresponsive, while the touchpad only gets 'jumpy'. Then I have another 2 
minutes or so to shut down, or the system will hang 100%.


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-06 Thread Jürgen Keil
 What I observed: the Gnome Resource Applet shows
 normal values, after around 5 minutes, whatever I do,
 the CPU will suddenly jump to 50% usage and stay there.
 Then, and this is also reproducable, the external
 mouse becomes almost unresponsive, while the touchpad
 gets 'jumpy' only. Then I have another 2 minutes or
 so to shut down, or the system will hang 100%.

Did you have a look at mpstat 1 output before and 
after the jump to 50% cpu usage?

Does it consume user or system cpu time?


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-06 Thread Uwe Dippel
 Did you have a look at mpstat 1 output before and 
 after the jump to 50% cpu usage?
 
 Does it consume user or system cpu time?

100% system time of CPU0 when the CPU usage bumps up to 50%. Before it is 
around 1-3 % for each CPU.
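
The 50% shown by the applet is what one would expect if it averages over both cores, which is an assumption about the applet rather than something confirmed in this thread. A minimal Python sketch of that arithmetic, with a made-up helper name:

```python
def overall_busy_percent(per_cpu_busy):
    """Average the per-CPU busy percentages, as a system-wide meter might."""
    return sum(per_cpu_busy) / len(per_cpu_busy)


# CPU0 pinned at 100% system time; the 2% for CPU1 is an assumed value
# inside the "around 1-3 %" range reported for the normal state.
print(overall_busy_percent([100, 2]))
```

So a reading of about 50% on a dual-core machine can hide one core that is completely saturated.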


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-05 Thread Chavdar Ivanov
 I gave OpenSolaris 2009.06 a shot, on an old HP
 nx6320. 

I am running it daily on an equally old HP NX6310 (I *think* yours has got Core 
2 Duo CPU, whereas mine is the older Core Duo ; 'isainfo -k' will confirm 
this). 

 Already during the install, it slows down
 considerably, and got stuck completely. The third
 effort succeeded finally.

Check if the disk controller is configured to run in SATA mode. 

 Now, after reboot, the system is normal, if not
 snappy, for a few minutes, then the mouse starts to
 react slow, that is, it jumps as if it would read the
 position data once per second, and after a few more
 minutes it freezes completely: no more keyboard, no
 more mouse, and it doesn't answer to ping-s any more.

Sounds like some interrupt flood to me. Try disabling bits that do not work on 
OpenSolaris, like the BlueTooth. Never had anything similar on this laptop 
under OpenSolaris yet. On an older PIII box under Nevada snv-117 or thereabouts 
I have the top-left-corner jumping cursor, which is very annoying (but then I 
have the same on a different NetBSD machine under Xorg 1.6.1, so my take is 
this is an upstream problem), so at the end I disabled cde-login and gdm and am 
using it solely as a server. 
 
 Hard reboot, and the same starts all over.
 Since it is a fresh install from the original CD, I
 guess, there is a serious problem. The same notebook
 runs XP and Ubuntu very well; memory test goes
 through for  24 hours.
 
 I have tried an update, which resulted in an
 'opensolaris-1' grub entry, and it behaves likewise.
 It has 4 GB of RAM. 

2GB on mine...

 Just after boot, 31% thereof are
 consumes for programs, 0% for cache, by the desktop
 alone. Is that normal?

On the machine I am presently on (2GB memory, dual Opteron 246), after 20 days 
uptime (rebooted only because of an image-update) I have 45% memory use (Firefox 
3.51 being the greatest hog with about 10 tabs open). ZFS is usually taking 
most of the rest for ARC as far as I can gather. I've got

set zfs:zfs_arc_max=831155200

in my /etc/system to limit its use a little. 
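
To put that value into perspective: `zfs_arc_max` is given in bytes. A small Python sketch of the conversion follows; the 2 GiB figure assumes the setting belongs to the 2GB machine mentioned above, which the message does not state explicitly.

```python
ARC_MAX_BYTES = 831155200   # value from the /etc/system line above
RAM_BYTES = 2 * 2**30       # assumption: the 2GB machine mentioned above

arc_mib = ARC_MAX_BYTES / 2**20
arc_share = 100 * ARC_MAX_BYTES / RAM_BYTES

print(f"ARC cap: {arc_mib:.1f} MiB, {arc_share:.1f}% of RAM")
```

In other words, the cap leaves well over half of the memory outside the ARC on such a machine.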

 
 Uwe

Chavdar Ivanov


Re: [osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-05 Thread Jürgen Keil
 It has 4 GB of RAM. Just after boot, 31% thereof are
 consumes for programs, 0% for cache, by the desktop
 alone. Is that normal?

Where are these percentages coming from?
mdb -k ::memstat command?

My 8GB box shows 0% (12MB) for Exec and libs,
and 85% Free, and the rest is used by the kernel.
(Just after a boot)

Could this be a memory leak in some user land program?
Does prstat -s rss / prstat -s size list any processes
that have quickly growing memory consumption ?


[osol-help] OpenSolaris 2009.06 stalls always after some 5 minutes

2009-08-04 Thread Uwe Dippel
I gave OpenSolaris 2009.06 a shot on an old HP nx6320. Already during the 
install, it slowed down considerably and got stuck completely. The third attempt 
finally succeeded.
Now, after reboot, the system is normal, if not snappy, for a few minutes; then 
the mouse starts to react slowly, that is, it jumps as if it read the 
position data once per second, and after a few more minutes it freezes 
completely: no more keyboard, no more mouse, and it doesn't answer pings 
any more. Hard reboot, and the same starts all over.
Since it is a fresh install from the original CD, I guess there is a serious 
problem. The same notebook runs XP and Ubuntu very well; the memory test runs 
through for 24 hours.

I have tried an update, which resulted in an 'opensolaris-1' grub entry, and it 
behaves likewise. It has 4 GB of RAM. Just after boot, 31% thereof is consumed 
by programs, 0% by cache, for the desktop alone. Is that normal?

Uwe