Re: AHCI timeouts on S3 resume

2010-05-19 Thread Jeremy Chadwick
On Tue, May 18, 2010 at 10:14:03PM -0400, Damian Gerow wrote:
 A few months back, I swapped out my dying hard drive for a WD Scorpio Blue.
 Cheap, seemed reliable, and it was the only drive the local shop had in
 stock.  However, it seems that AHCI doesn't like this device, and is having
 troubles during an S3 resume.  It appears as though I'm experiencing two
 types of timeouts when resuming: recoverable, and non-recoverable.
 
 My question is: do I have a bad HDD, or is AHCI just not playing nicely?

Your hard disk looks generally OK; it isn't going bad.  The one thing I
can't tell is whether the disk is actually spinning back up on
resume; you'd have to literally listen for it, or look at SMART
Attribute #4 before and after a suspend/resume.  I'll discuss analysis
of SMART statistics further down.

The error messages you see coming from the AHCI driver indicate, to me,
one of three things: 1) The ICH9 controller being stuck (possibly resume
does something incorrectly to the controller), 2) FreeBSD not doing
something quite right when coming out of suspend mode, or 3) the disk
never waking up.  If I had to take a guess, I'd say #2.

mav@ might be able to help determine if something is being done
incorrectly in the AHCI driver after resume.  If the driver is doing the
Right Thing(tm), then the next thing to do would be to discuss the
problem on freebsd-a...@.  I can't help with these things.

I will point out, however, that you've set this value in loader.conf:

 hw.pci.do_power_nodriver=2

I've read the sysctl -d description for it, but I am not familiar with
sleep/power states so I don't know the implications.  I worry that this
value may be causing problems with your ICH9 controller.  If you could
comment this out and re-try suspend/resume to see if AHCI times out, you
might determine if it's responsible for the problem.

 The HDD is a WD Scorpio blue, model WD5000BEVT-22A0RT0, and isn't exactly
 the fastest drive on the planet.  SMART seems to be relatively clean, with
 some mild questions surrounding attributes 191, 9/193, and 194:
 
 -
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   3 Spin_Up_Time            0x0027   186   185   021    Pre-fail  Always       -       1675
   4 Start_Stop_Count        0x0032   055   055   000    Old_age   Always       -       45174
   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       723
 191 G-Sense_Error_Rate      0x0032   072   072   000    Old_age   Always       -       28
 193 Load_Cycle_Count        0x0032   162   162   000    Old_age   Always       -       115712
 194 Temperature_Celsius     0x0022   112   106   000    Old_age   Always       -       35
 -

Attribute #3 indicates the total amount of time it takes for the drive
to spin up (usually in milliseconds).  I'll point out that there are
drives out there (such as the WD Caviar Black) which report ~8s spin-up
times when powered on; this is normal.  The drive is actually able to
function during the spin-up, which is why those systems don't take a
full 8 seconds before they're able to read from the HD.  I wanted to
point out this attribute because you've brought up concerns over AHCI 15
second timeouts being hit.

Attribute #4 indicates the number of times the disk has been told by the
controller to spin up or spin down.  This counter should increase when
your laptop goes in/out of suspend/resume.  I wanted to point out this
attribute because of what I said in my first paragraph.
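
As a rough illustration of the before/after check on Attribute #4, here
is a minimal Python sketch.  It assumes the table above is smartctl-style
output (ID in the first column, RAW_VALUE in the last) and that you saved
one copy before suspending and one after resuming.  The function name
raw_value and the two sample rows are hypothetical; the "after" row is
made up to show what an increment would look like.

```python
def raw_value(smart_text, attr_id):
    """Return the RAW_VALUE (last column) of a SMART attribute row as an int.

    Assumes smartctl-style rows: the first field is the numeric ID and the
    last field is the raw value.  Returns None if the ID is not present.
    """
    for line in smart_text.splitlines():
        fields = line.split()
        if fields and fields[0] == str(attr_id) and fields[-1].isdigit():
            return int(fields[-1])
    return None


# Hypothetical snapshots taken before and after one suspend/resume cycle.
before = "  4 Start_Stop_Count 0x0032 055 055 000 Old_age Always - 45174"
after = "  4 Start_Stop_Count 0x0032 055 055 000 Old_age Always - 45175"

delta = raw_value(after, 4) - raw_value(before, 4)
print("Start_Stop_Count increased by", delta)
```

If the counter does not move across a suspend/resume cycle, that would
suggest the drive was never told to spin down and back up.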

Attribute #9 indicates the total amount of time the hard disk has been
powered on (read: not asleep) during its lifetime.  I can't tell you
whether or not this value is correct; only you would be able to
determine that, given your usage patterns.  I *have* seen desktop drives
which have reported this value incorrectly (meaning, servers I know have
been on for thousands of hours that show 4 for this RAW_VALUE;
probably a firmware bug).

Attribute #191 indicates a *rate* of G-shock events.  The drive has a
G-shock sensor inside of it.  This value being non-zero is perfectly
fine for laptops; people have a tendency to walk around with their
systems on, tilt them sideways, place them on the desk firmly, etc.
The sensor is sensitive, and it isn't intended to detect severity of
shock (e.g. throwing your laptop across the room); it's intended to
measure a rate.  The RAW_VALUE doesn't mean anything to me; 28 what?  We
don't know.  Only WD knows if that's a safe value or not.  So what do we
do in this case?  We look at the adjusted value VALUE and compare it to
WORST and THRESH.  SMART disk failure won't get triggered until VALUE
reaches 000, so 072 is pretty good.  I'd say don't worry about it.
(I'll use this opportunity to point out to readers that this is why
looking at RAW_VALUE explicitly is not always the correct way to read
SMART).
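
The VALUE-versus-THRESH comparison described above can be sketched in a
few lines of Python.  This is only an illustration: it assumes the
smartctl-style column order shown in the table (ID, NAME, FLAG, VALUE,
WORST, THRESH, ...), and the helper name smart_margin is hypothetical.

```python
def smart_margin(row):
    """Return (name, value - thresh) for one smartctl-style attribute row.

    Assumed column order: ID, NAME, FLAG, VALUE, WORST, THRESH, ...
    """
    fields = row.split()
    name = fields[1]
    value, thresh = int(fields[3]), int(fields[5])
    return name, value - thresh


# Row for Attribute #191, copied from the table above.
row = "191 G-Sense_Error_Rate 0x0032 072 072 000 Old_age Always - 28"
name, margin = smart_margin(row)
# A margin above zero means the normalized VALUE is still above THRESH.
print(name, "margin:", margin, "OK" if margin > 0 else "FAILING")
```

The point of the sketch is that the decision uses the normalized VALUE
and THRESH columns only; RAW_VALUE never enters into it.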

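Attribute #193 indicates the number of times the actuator arm (thus
heads) has been parked or come out of being parked.  There is a known
problem with some models of WD Green Power (GP) drives where the drive
spends an excessive amount of time parking, and this counter increases
rapidly.  One FreeBSD user who reported this problem to Western Digital
received a replacement firmware which addressed the problem.  The WD
Scorpio Blue drives (or some of them) may have this same problem --
HOWEVER, this model of hard disk (2.5" FF) is *specifically* intended
for laptops and low-power environments, so the behaviour seen in this
case could be 100% normal.  WD would hopefully know.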

network probs rxcsum

2010-05-19 Thread Mark Stapper
Hi,

I have two machines running FreeBSD amd64 8.0-Stable with custom kernels.
My newer box has had troubles with ssh from day one.
I hoped a kernel upgrade would help, but it didn't.
When I'd ssh into the box ssh would exit with errors:
Bad packet length xx
Disconnecting: Packet corrupt.

After issuing: ifconfig em0 -rxcons, everything was stable again.
First I figured it'd be a driver issue. However, I use the same NIC in
my other box!
What could be causing this problem?





Re: network probs rxcsum

2010-05-19 Thread Jeremy Chadwick
On Wed, May 19, 2010 at 12:34:17PM +0200, Mark Stapper wrote:
 I have two machines running FreeBSD amd64 8.0-Stable with custom kernels.
 My newer box has had troubles with ssh from day one.
 I hoped a kernel upgrade would help, but it didn't.
 When I'd ssh into the box ssh would exit with errors:
 Bad packet length xx
 Disconnecting: Packet corrupt.
 
 after issueing: ifconfig em0 -rxcons everything was stable again.
 First I figured it'd be a driver issue. However, I use the same NIC in
 my other box!
 What could be causing this problem?

I think you mean -rxcsum, not -rxcons.

Could you please provide output from the following commands?  Jack Vogel
will probably respond later about this, but said output would help him.

- uname -a
- dmesg | grep em0
- pciconf -lvc

Thanks.

-- 
| Jeremy Chadwick   j...@parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: network probs rxcsum

2010-05-19 Thread Mark Stapper
On 19/05/2010 12:44, Jeremy Chadwick wrote:
 On Wed, May 19, 2010 at 12:34:17PM +0200, Mark Stapper wrote:
   
 I have two machines running FreeBSD amd64 8.0-Stable with custom kernels.
 My newer box has had troubles with ssh from day one.
 I hoped a kernel upgrade would help, but it didn't.
 When I'd ssh into the box ssh would exit with errors:
 Bad packet length xx
 Disconnecting: Packet corrupt.

 after issueing: ifconfig em0 -rxcons everything was stable again.
 First I figured it'd be a driver issue. However, I use the same NIC in
 my other box!
 What could be causing this problem?
 
 I think you mean -rxcsum, not -rxcons.

 Could you please provide output from the following commands?  Jack Vogel
 will probably respond later about this, but said output would help him.

 - uname -a
 - dmesg | grep em0
 - pciconf -lvc

 Thanks.

   
Well, yes... something got garbled in my mind...
I'll provide the outputs when I get home, as the network connection just
went down for no particular reason...
Greets,
Mark







Re: AHCI timeouts on S3 resume

2010-05-19 Thread Damian Gerow
Jeremy Chadwick wrote:
: On Tue, May 18, 2010 at 10:14:03PM -0400, Damian Gerow wrote:
:  A few months back, I swapped out my dying hard drive for a WD Scorpio Blue.
:  Cheap, seemed reliable, and it was the only drive the local shop had in
:  stock.  However, it seems that AHCI doesn't like this device, and is having
:  troubles during an S3 resume.  It appears as though I'm experiencing two
:  types of timeouts when resuming: recoverable, and non-recoverable.
:  
:  My question is: do I have a bad HDD, or is AHCI just not playing nicely?
: 
: Your hard disk looks generally OK; it isn't going bad.  The one thing I
: can't tell is whether the disk is actually spinning back up on
: resume; you'd have to literally listen for it, or look at SMART
: Attribute #4 before and after a suspend/resume.  I'll discuss analysis
: of SMART statistics further down.

The disk spins back up immediately on resume.  I have no recollection of it
/not/ doing so (it's definitely noticeable), and I just confirmed it with a
few S3 cycles.

I also checked the WD spec sheet, and the average drive ready time is 4s.

: I will point out, however, that you've set this value in loader.conf:
: 
:  hw.pci.do_power_nodriver=2
: 
: I've read the sysctl -d description for it, but I am not familiar with
: sleep/power states so I don't know the implications.  I worry that this
: value may be causing problems with your ICH9 controller.  If you could
: comment this out and re-try suspend/resume to see if AHCI times out, you
: might determine if it's responsible for the problem.

That *should* just remove power from devices without a driver.  But I
removed it, rebooted, went through two S3 cycles, and I'm still seeing the
timeouts.  (Recoverable; of the two cycles I did, I didn't see a
non-recoverable timeout.)

:  The HDD is a WD Scorpio blue, model WD5000BEVT-22A0RT0, and isn't exactly
:  the fastest drive on the planet.  SMART seems to be relatively clean, with
:  some mild questions surrounding attributes 191, 9/193, and 194:
:  
:  -
:  ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
:    3 Spin_Up_Time            0x0027   186   185   021    Pre-fail  Always       -       1675
:    4 Start_Stop_Count        0x0032   055   055   000    Old_age   Always       -       45174
:    9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       723
:  191 G-Sense_Error_Rate      0x0032   072   072   000    Old_age   Always       -       28
:  193 Load_Cycle_Count        0x0032   162   162   000    Old_age   Always       -       115712
:  194 Temperature_Celsius     0x0022   112   106   000    Old_age   Always       -       35
:  -

: Attribute #9 indicates the total amount of time the hard disk has been
: powered on (read: not asleep) during its lifetime.  I can't tell you
: whether or not this value is correct; only you would be able to
: determine that, given your usage patterns.  I *have* seen desktop drives
: which have reported this value incorrectly (meaning, servers I know have
: been on for thousands of hours that show 4 for this RAW_VALUE;
: probably a firmware bug).

I combined attributes 9 and 193 together because it seems like a load cycle
count of ~116k with 723 power-on hours is a bit high.  I believe laptop HDDs
are designed to handle a higher rate of load cycle counts, but I've never
really paid attention to them -- save on my previously dying drive, which
had broken 1M, and started screeching when doing some seeks.

But yes, that 723 power-on hours seems accurate.

: Attribute #193 indicates the number of times the actuator arm (thus
: heads) has been parked or come out of being parked.  There is a known
: problem with some models of WD Green Power (GP) drives where the drive
: spends an excessive amount of time parking, and this counter increases
: rapidly.  One FreeBSD user who reported this problem to Western Digital
: received a replacement firmware which addressed the problem.  The WD
: Scorpio Blue drives (or some of them) may have this same problem --
: HOWEVER, this model of hard disk (2.5" FF) is *specifically* intended
: for laptops and low-power environments, so the behaviour seen in this
: case could be 100% normal.  WD would hopefully know.

I'm fairly certain that WD only includes that IntelliPark feature on the GP
drives.  At least, WD doesn't indicate that there's any of their fancy new
GP-related tricks on the Scorpio Blue line.

I'd actually recently dropped my vfs.zfs.txg.timeout to 5, as I was
experiencing some pretty horrible stalls when it was left at default (30, I
believe).  I was curious to see if this decreased the rate of my
Load_Cycle_Count, but I'm already at ~122k.  Given that this drive is rated
to handle 600k, it makes me wonder if there isn't something like IntelliPark
on this drive.
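
To put numbers on that, here is a small Python sketch of the arithmetic,
using the Load_Cycle_Count and Power_On_Hours raw values from the SMART
table quoted above and the 600k rating mentioned in this message.  It
assumes the load-cycle rate stays constant, which is a simplification
and not something the SMART data can confirm.

```python
# Figures from the SMART output quoted above.
load_cycles = 115712   # Load_Cycle_Count RAW_VALUE
power_on_hours = 723   # Power_On_Hours RAW_VALUE
rated_cycles = 600000  # rating mentioned in this message

# Average rate so far, and hours remaining if that rate never changes
# (the constant-rate part is an assumption).
cycles_per_hour = load_cycles / power_on_hours
hours_left = (rated_cycles - load_cycles) / cycles_per_hour

print(f"{cycles_per_hour:.0f} load cycles per power-on hour")
print(f"about {hours_left:.0f} power-on hours until the rated count")
```

That works out to roughly 160 load cycles per power-on hour, which is
why the count looks high next to the rating.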

: Hope this helps.

Aye.  It confirms that SMART clears my drive -- thanks!

7-stable compile broken: kern_ntptime

2010-05-19 Thread Michael Butler

After the recent kern_ntptime updates:

cc -c -O2 -pipe -fno-strict-aliasing -march=pentium4 -std=c99  -Wall
-Wredundant-decls -Wnested-externs -Wstrict-prototypes
-Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual  -Wundef
-Wno-pointer-sign -fformat-extensions -nostdinc  -I. -I/usr/src/sys
-I/usr/src/sys/contrib/altq -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS
-include opt_global.h -fno-common -finline-limit=8000 --param
inline-unit-growth=100 --param large-function-growth=1000
-fno-omit-frame-pointer -mno-align-long-strings
-mpreferred-stack-boundary=2  -mno-mmx -mno-3dnow -mno-sse -mno-sse2
-mno-sse3 -ffreestanding -Werror  /usr/src/sys/kern/kern_ntptime.c

cc1: warnings being treated as errors
/usr/src/sys/kern/kern_ntptime.c: In function 'periodic_resettodr':
/usr/src/sys/kern/kern_ntptime.c:985: warning: implicit declaration of function 'resettodr'
/usr/src/sys/kern/kern_ntptime.c:985: warning: nested extern declaration of 'resettodr'
/usr/src/sys/kern/kern_ntptime.c:989: warning: implicit declaration of function 'callout_schedule'
/usr/src/sys/kern/kern_ntptime.c:989: warning: nested extern declaration of 'callout_schedule'
*** Error code 1

Stop in /usr/obj/usr/src/sys/AUBURN.
*** Error code 1

Stop in /usr/src.
*** Error code 1

Stop in /usr/src.

imb



Re: 7-stable compile broken: kern_ntptime

2010-05-19 Thread Jeremy Chadwick
On Wed, May 19, 2010 at 11:13:51AM -0400, Michael Butler wrote:
 After the recent kern_ntptime updates:

 {snip CC warnings}

The problem was addressed 6 minutes ago.  You'll need to wait for the
cvsup mirrors to pick up the change, otherwise use
cvsup-master.freebsd.org (not recommended).

CVS commit:

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/kern_ntptime.c




Re: network probs rxcsum

2010-05-19 Thread Mark Stapper
On 05/19/10 12:44, Jeremy Chadwick wrote:
 On Wed, May 19, 2010 at 12:34:17PM +0200, Mark Stapper wrote:
   
 I have two machines running FreeBSD amd64 8.0-Stable with custom kernels.
 My newer box has had troubles with ssh from day one.
 I hoped a kernel upgrade would help, but it didn't.
 When I'd ssh into the box ssh would exit with errors:
 Bad packet length xx
 Disconnecting: Packet corrupt.

 after issueing: ifconfig em0 -rxcons everything was stable again.
 First I figured it'd be a driver issue. However, I use the same NIC in
 my other box!
 What could be causing this problem?
 
 I think you mean -rxcsum, not -rxcons.

 Could you please provide output from the following commands?  Jack Vogel
 will probably respond later about this, but said output would help him.

 - uname -a
 - dmesg | grep em0
 - pciconf -lvc

 Thanks.

   
Could it be a shared interrupt problem?
Even though ssh worked with rxcsum disabled, network performance was
horrible!
Using my onboard NIC instead of em0 cleared it right up!
em0 is a PCI add-on card.
Here are the outputs you requested:

[r...@mario ~]# uname -a
FreeBSD mario 8.0-STABLE FreeBSD 8.0-STABLE #0: Tue May 18 19:37:30 CEST
2010 root@:/usr/obj/usr/src/sys/mario  amd64
[r...@mario ~]# dmesg |grep em0
em0: Intel(R) PRO/1000 Legacy Network Connection 1.0.1 port
0x9c00-0x9c3f mem 0xfdfa-0xfdfb,0xfdfc-0xfdfd irq 18 at
device 6.0 on pci2
em0: [FILTER]
em0: Ethernet address: 00:1b:21:4b:8b:85
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
em0: link state changed to UP
em0: link state changed to DOWN
[r...@mario ~]# pciconf -lvc
no...@pci0:0:0:0:   class=0x05 card=0x02f010de chip=0x02f410de
rev=0xa2 hdr=0x00
vendor = 'NVIDIA Corporation'
device = 'C51 Host Bridge'
class  = memory
subclass   = RAM
cap 08[44] = HT slave
cap 08[e0] = HT MSI address window disabled at 0xfee0
no...@pci0:0:0:1:   class=0x05 card=0x02fa10de chip=0x02fa10de
rev=0xa2 hdr=0x00
vendor = 'NVIDIA Corporation'
device = 'C51 Memory Controller 0'
class  = memory
subclass   = RAM
no...@pci0:0:0:2:   class=0x05 card=0x02fe10de chip=0x02fe10de
rev=0xa2 hdr=0x00
vendor = 'NVIDIA Corporation'
device = 'C51 Memory Controller 1'
class  = memory
subclass   = RAM
no...@pci0:0:0:3:   class=0x05 card=0x02f810de chip=0x02f810de
rev=0xa2 hdr=0x00
vendor = 'NVIDIA Corporation'
device = 'C51 Memory Controller 5'
class  = memory
subclass   = RAM
no...@pci0:0:0:4:   class=0x05 card=0x02f910de chip=0x02f910de
rev=0xa2 hdr=0x00
vendor = 'NVIDIA Corporation'
device = 'C51 Memory Controller 4'
class  = memory
subclass   = RAM
no...@pci0:0:0:5:   class=0x05 card=0x02ff10de chip=0x02ff10de
rev=0xa2 hdr=0x00
vendor = 'NVIDIA Corporation'
device = 'C51 Host Bridge'
class  = memory
subclass   = RAM
cap 00[44] = unknown
no...@pci0:0:0:6:   class=0x05 card=0x027f10de chip=0x027f10de
rev=0xa2 hdr=0x00
vendor = 'NVIDIA Corporation'
device = 'C51 Memory Controller 3'
class  = memory
subclass   = RAM
no...@pci0:0:0:7:   class=0x05 card=0x027e10de chip=0x027e10de
rev=0xa2 hdr=0x00
vendor = 'NVIDIA Corporation'
device = 'C51 Memory Controller 2'
class  = memory
subclass   = RAM
pc...@pci0:0:4:0:   class=0x060400 card=0x10de chip=0x02fb10de
rev=0xa1 hdr=0x01
vendor = 'NVIDIA Corporation'
device = 'C51 PCIe Bridge'
class  = bridge
subclass   = PCI-PCI
cap 0d[40] = PCI Bridge card=0x10de
cap 01[48] = powerspec 2  supports D0 D3  current D0
cap 05[50] = MSI supports 2 messages, 64 bit
cap 08[60] = HT MSI address window disabled at 0xfee0
cap 10[80] = PCI-Express 1 root port max data 128(128) link x16(x16)
no...@pci0:0:8:0:   class=0x05 card=0xcb8410de chip=0x036910de
rev=0xa1 hdr=0x00
vendor = 'NVIDIA Corporation'
device = 'MCP55 Memory Controller'
class  = memory
subclass   = RAM
cap 08[44] = HT slave
cap 08[dc] = HT MSI address window enabled at 0xfee0
is...@pci0:0:9:0:   class=0x060100 card=0xcb8410de chip=0x036010de
rev=0xa2 hdr=0x00
vendor = 'NVIDIA Corporation'
device = 'MCP55 LPC Bridge'
class  = bridge
subclass   = PCI-ISA
no...@pci0:0:9:1:   class=0x0c0500 card=0xcb8410de chip=0x036810de
rev=0xa2 hdr=0x00
vendor = 'NVIDIA Corporation'
device = 'SMBus controller ((0xCB84 integrated chip nForce Pro
3400))'
class  = serial bus
subclass   = SMBus
cap 01[44] = powerspec 2  supports D0 D3  current D0
non...@pci0:0:9:3:  class=0x0b4000 

Re: network probs rxcsum

2010-05-19 Thread Jack Vogel
vmstat -i ?

Custom kernel? If you use stock kernel do you still see this problem?
If you use 8 RELEASE do you see the problem?

Jack



Re: Kernel panic when unpluggin AC adaptor

2010-05-19 Thread Brandon Gooch
On Tue, May 18, 2010 at 10:47 PM, Brandon Gooch
jamesbrandongo...@gmail.com wrote:
 On Tue, May 18, 2010 at 9:04 AM, Giovanni Trematerra
 giovanni.tremate...@gmail.com wrote:
 On Sat, May 15, 2010 at 9:12 PM, Brandon Gooch
 jamesbrandongo...@gmail.com wrote:
 On Thu, May 13, 2010 at 7:25 PM, Giovanni Trematerra
 giovanni.tremate...@gmail.com wrote:
 On Thu, May 13, 2010 at 1:09 AM, Brandon Gooch
 jamesbrandongo...@gmail.com wrote:
 On Wed, May 12, 2010 at 9:41 AM, Attilio Rao atti...@freebsd.org wrote:
 2010/5/12 David DEMELIER demelier.da...@gmail.com:
 I remove the patch, and built the kernel (I updated the src this
 morning) and it does not panic now. It's really odd. If it reappears
 soon I will tell you.

 I looked at the code with Giovanni and I have the feeling that the
 race with the idle thread may still be fatal.
 We need to fix that.

 Attilio


 That seems to be the case, as my laptop shows about an 80-85 % chance
 of experiencing a panic if left idle for long-ish periods of time (2
 to 4 hours). I usually rebuild world or big ports overnight, and more
 often than not I wake up to a panicked machine, same situation every
 time:

 ...
 rman_get_bushandle() at rman_get_bushandle+0x1
 sched_idletd() at sched_idletd+0x123
 fork_exit() at fork_exit+0x12a
 fork_trampoline() at fork_trampoline+0xe
 ...

 The kernel/userland is rebuilt, the ports are finished compiling --
 it's in the time AFTER the completion of all tasks that the machine
 gets bored and tries to kill itself :)

 I have seen the AC adapter plug/unplug hang in the past on this
 laptop, but I never made the connection between the events, as
 nowadays my laptop usually stays plugged in :(

 Attilio, I hope you can track this one down, let me know if I can do
 anything to help or test...


 Attilio and I came up with this patch. It seems ready for stress
 testing and review
 Please test and report back.

 Thank you

 P.S: all the faults are only mine.

 I tried the patch, and my kernel panics on boot. I have
 8.5MB(!) of JPG images (6 of them) if anyone needs to see them. I'm
 looking for a place to post them, but if anyone wants, I can send via
 e-mail...

 Hi Brandon,
 Could you please, try this new one? The panic at boot stage should be solved,
 at least I tried on a 8-way machine and all went ok at boot.
 Please, remove WITNESS_SKIPSPIN from your kernel config file.
 This patch might be sub-optimal and contains style(9) errors, but if it
 works we are on the right track.
 Let me know if it works for you.

 Applied the patch, built, installed, and booted new kernel: no panic!

 I will remove WITNESS_SKIPSPIN and build another kernel. Then I'll
 try to trigger the panic (by letting my laptop sit idle after a
 buildworld session).

 Thanks for giving this some attention, I hope you and/or others are
 able to get to the bottom of this...

Hey everyone, just reporting in:

The laptop has experienced the longest uptime it's seen in a while --
so far, so good!

I'll keep the machine up and running just in case...

-Brandon