Re: questions on NAPI processing latency and dropped network packets

2008-01-11 Thread Ray Lee
On Jan 10, 2008 9:24 AM, Chris Friesen [EMAIL PROTECTED] wrote:
 After a recent userspace app change, we've started seeing packets being
 dropped by the ethernet hardware (e1000, NAPI is enabled).  The
 error/dropped/fifo counts are going up in ethtool:

(These are perhaps too obvious, but I didn't see the questions or
answers in the thread.)

Can you reproduce it with a simple userspace cpu hog? (Two, really,
one per cpu.)

Can you reproduce it with the newer e1000?

Can you reproduce it with git head?

If the answer to the first one is yes, the last no, then bisect until
you get a kernel that doesn't show the problem. Backport the fix,
unless the fix happens to be CFS. However, I suspect that your
userspace app is just starving the system from time to time.
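
A minimal sketch of the "simple userspace cpu hog" idea above, assuming Linux and Python 3. The names spin, pinned_spin and hog_all are made up for illustration; the sketch starts one busy-looping process per CPU the current process is allowed to run on, which is one way to read "one per cpu".

```python
import multiprocessing
import os
import time


def spin(seconds):
    """Busy-loop for roughly `seconds`; returns the number of iterations."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def pinned_spin(cpu, seconds):
    """Pin the calling process to one CPU (Linux only), then busy-loop."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})
    spin(seconds)


def hog_all(seconds):
    """Start one hog per CPU this process may use and wait for all of them."""
    if hasattr(os, "sched_getaffinity"):
        cpus = sorted(os.sched_getaffinity(0))
    else:
        cpus = list(range(os.cpu_count() or 1))
    procs = [multiprocessing.Process(target=pinned_spin, args=(cpu, seconds))
             for cpu in cpus]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return len(procs)
```

Calling hog_all(60) while watching the ethtool counters would show whether plain CPU starvation is enough to reproduce the drops. Call it from under an `if __name__ == "__main__":` guard when running it as a script.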
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Regression: Wireshark sees no packets in 2.6.24-rc3

2007-12-15 Thread Ray Lee
On Dec 14, 2007 11:09 PM, Ray Lee [EMAIL PROTECTED] wrote:
 On Dec 14, 2007 6:41 PM, Gabriel C [EMAIL PROTECTED] wrote:
 Correct, absolutely no traffic. So if it works for you, then either
 it's something that got fixed between -rc3 and -rc5, or something odd
 when I did a make oldconfig, I suppose. (Or because I'm on an x86-64
 kernel?) Regardless, -rc5 is currently building, and I'll try it in
 the morning.

-rc5 works great. Really don't know what's different between my -rc3
and -rc5 builds. The diff of .config between the two doesn't show
anything obvious, so perhaps it was something fixed in the interim.

I've gone ahead and closed the bugzilla entry, btw. Thanks, and sorry
for the false (or tardy) alarm.

Ray


Re: Regression: Wireshark sees no packets in 2.6.24-rc3

2007-12-14 Thread Ray Lee
On Dec 14, 2007 6:41 PM, Gabriel C [EMAIL PROTECTED] wrote:
 Rafael J. Wysocki wrote:
  On Friday, 14 of December 2007, Ray Lee wrote:
  tshark -i eth0, eth1, lo are all empty. Works under 2.6.23.0 just
  fine. A quick scan of the log between 2.6.24-rc3 and current tip
  (-rc5) doesn't show any obvious fixes, but then again, what do I know.
  I'll check current tip on the weekend when I'll have the luxury to
  have my main system down long enough for a test. Right now I'm kinda
  up against a deadline, but didn't want to leave it unreported. Should
  be easy for someone else to confirm or deny whether current tip has
  the problem.
 
  FYI, I have created a bugzilla entry for this issue at:
  http://bugzilla.kernel.org/show_bug.cgi?id=9568

 Hmm, what do you mean by empty? Is it not capturing anything on that
 interface?

Correct, absolutely no traffic. So if it works for you, then either
it's something that got fixed between -rc3 and -rc5, or something odd
when I did a make oldconfig, I suppose. (Or because I'm on an x86-64
kernel?) Regardless, -rc5 is currently building, and I'll try it in
the morning.

 I do run -rc5-git with wireshark-0.99.6 and tshark -i eth0 or lo works here.

Excellent. Thank you for checking!

Rafael: I'll update the bugzilla as warranted after testing.

Ray


Regression: Wireshark sees no packets in 2.6.24-rc3

2007-12-13 Thread Ray Lee
tshark -i eth0, eth1, lo are all empty. Works under 2.6.23.0 just
fine. A quick scan of the log between 2.6.24-rc3 and current tip
(-rc5) doesn't show any obvious fixes, but then again, what do I know.
I'll check current tip on the weekend when I'll have the luxury to
have my main system down long enough for a test. Right now I'm kinda
up against a deadline, but didn't want to leave it unreported. Should
be easy for someone else to confirm or deny whether current tip has
the problem.

Ray


Re: [BUG] New Kernel Bugs

2007-11-13 Thread Ray Lee
On Nov 13, 2007 7:24 AM, Giacomo A. Catenazzi [EMAIL PROTECTED] wrote:
 As a long time kernel tester, I see some problem with the
 newer new development model. In the short merge windows,
 after too much time, there are too many patches.

I think the root issue there is that it's hard to get all testers to
run a bisect, but easy to ask them to test snapshots. Right now the
snapshots are generated nightly, but I think it would make more sense
if they were generated every N patches, for some value of N...

Of course, for that to really work, we have to ensure that the result
is always compilable, which has been getting better, but not perfect.
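
To make the "every N patches" idea concrete, here is a small sketch. The function name snapshot_points and the list-of-commits input are hypothetical; this is not how the existing nightly snapshots are produced.

```python
def snapshot_points(commits, n):
    """Pick every n-th commit (oldest first), plus the tip, as snapshots.

    A tester who reports "snapshot k works, snapshot k+1 is broken" has
    then narrowed a regression down to at most n patches without bisecting.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    points = commits[n - 1::n]
    if commits and (not points or points[-1] != commits[-1]):
        points.append(commits[-1])  # always include the current tip
    return points


# Ten patches, one snapshot every four, plus the tip.
print(snapshot_points(list(range(1, 11)), 4))
```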

Ray


Re: Weird network problems with 2.6.23-rc2

2007-11-13 Thread Ray Lee
Hello there Shish,

On Aug 10, 2007 11:39 PM, Shish [EMAIL PROTECTED] wrote:
 Something seems to have broken in 2.6.23-rc2, and I'm not sure what, or
 where I should look for further debugging. The info I have:

 On my 2.6.23-rc2 desktop, things run fine.

 On my test server, built from the same source tree, networking goes
 strange every few minutes, with the following symptoms:

 o) running ping against the server, the first ping goes through;
 further pings go AWOL until about icmp_seq=30, when I get 4-5 icmp
 replies (marked as DUP!), then no pings for a while, then dups, and so
 on.

 o) the server doesn't see ARP replies. According to tcpdump, the server
 will send e.g. "who has 192.168.0.2? tell 192.168.0.1"; the client in
 question will receive the packet and send a response, but nothing shows
 up in the server-side tcpdump.

 o) after a few minutes of random network troubles, everything will work
 fine again, (ping is normal, arp replies are seen, tcp sessions work)
 for a few minutes.

 o) The server's dmesg shows lots of "short udp packet" messages

 o) ifdown then ifup'ing the interfaces fixes things, temporarily.

 Reverting to 2.6.22, everything seems to be running fine (but no lguest,
 which is what I came for :( )

 I've also tried with the latest code from git, the behaviour is the same
 as 2.6.23-rc2.

Several questions. What network card do you have on your server? Is
this still reproducible with the latest code from git? If so, it would
be extremely helpful if you could do a bisect between 2.6.22 and
2.6.23-rc2. Feel free to ask for help if you need it.
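
For anyone who has not bisected before, this is a sketch of the search that git bisect automates. The names first_bad and is_bad are hypothetical. It assumes the commits are ordered oldest to newest, the first is known good, the last is known bad, and the breakage stays once introduced.

```python
def first_bad(commits, is_bad):
    """Return (first bad commit, number of kernels that had to be tested)."""
    good, bad = 0, len(commits) - 1   # indexes of a known-good and a known-bad commit
    tested = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tested += 1                   # one build-and-boot cycle per step
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad], tested


# 100 hypothetical commits where number 37 introduced the problem.
culprit, builds = first_bad(list(range(100)), lambda c: c >= 37)
print(culprit, builds)
```

The point is the cost: the number of kernels to test grows with the logarithm of the number of commits between the good and bad versions, so even a whole release cycle takes only a dozen or so builds.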

Ray


Re: [PATCH 00/23] per device dirty throttling -v8

2007-08-04 Thread Ray Lee
(adding netdev cc:)

On 8/4/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 On Sat, 4 Aug 2007, Ingo Molnar wrote:

  * Ingo Molnar [EMAIL PROTECTED] wrote:
 
  There are positive reports in the never-ending "my system crawls like
  an XT when copying large files" bugzilla entry:
 
   http://bugzilla.kernel.org/show_bug.cgi?id=7372
 
  i forgot this entry:
 
   We recently upgraded our office to gigabit Ethernet and got some big
   AMD64 / 3ware boxes for file and vmware servers... only to find them
   almost useless under any kind of real load. I've built some patched
   2.6.21.6 kernels (using the bdi throttling patch you mentioned) to
   see if our various Debian Etch boxes run better. So far my testing
   shows a *great* improvement over the stock Debian 2.6.18 kernel on
   our configurations.
 
  and bdi has been in -mm in the past i think, so we also know (to a
  certain degree) that it does not hurt those workloads that are fine
  either.
 
  [ my personal interest in this is the following regression: every time i
   start a large kernel build with DEBUG_INFO on a quad-core 4GB RAM box,
   i get up to 30 seconds complete pauses in Vim (and most other tasks),
   during plain editing of the source code. (which happens when Vim tries
   to write() to its swap/undo-file.) ]

 I have an issue that sounds like it's related.

 I've got a syslog server that's got two Opteron 246 cpu's, 16G ram, 2x140G
 15k rpm drives (fusion MPT hardware mirroring), 16x500G 7200rpm SATA
 drives on 3ware 9500 cards (software raid6) running 2.6.20.3 with hz set
 at default and preempt turned off.

 I have syslog doing buffered writes to the SCSI drives and every 5 min a
 cron job copies the data to the raid array.

 I've found that if I do anything significant on the large raid array that
 the system loses a significant amount of the UDP syslog traffic, even
 though there should be plenty of ram and cpu (and the spindles involved
 in the writes are not being touched), even a grep can cause up to 40%
 losses in the syslog traffic. I've experimented with nice levels (nicing
 down the grep and nicing up the syslogd) without a noticeable effect on the
 losses.

 I've been planning to try a new kernel with hz=1000 to see if that would
 help, and after that experiment with the various preempt settings, but it
 sounds like the per-device queues may actually be more relevant to the
 problem.

 what would you suggest I test, and in what order and combination?

At least on a surface level, your report has some similarities to
http://lkml.org/lkml/2007/5/21/84 . In that message, John Miller
mentions several things he tried without effect:

 - I increased the max allowed receive buffer through
 /proc/sys/net/core/rmem_max and the application calls the right
 syscall. netstat -su does not show any packet receive errors.

 - After getting kernel: swapper: page allocation failure.
 order:0, mode:0x20, I increased /proc/sys/vm/min_free_kbytes

 - ixgb.txt in kernel network documentation suggests to increase
 net.core.netdev_max_backlog to 30. This did not help.

 - I also had to increase net.core.optmem_max, because the default
 value was too small for 700 multicast groups.

As they're all pretty simple to test, it may be worthwhile to give
them a shot just to rule things out.
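
Before changing anything it helps to record the current values. A minimal sketch, assuming a Linux /proc/sys layout; the function name read_tunables is made up, and the four entries are the ones named in the quoted message.

```python
from pathlib import Path

# Tunables mentioned in the quoted message, relative to /proc/sys.
TUNABLES = [
    "net/core/rmem_max",
    "net/core/netdev_max_backlog",
    "net/core/optmem_max",
    "vm/min_free_kbytes",
]


def read_tunables(names=TUNABLES, root="/proc/sys"):
    """Return {name: current value as text, or None if unreadable}."""
    values = {}
    for name in names:
        try:
            values[name] = Path(root, name).read_text().strip()
        except OSError:
            values[name] = None  # not Linux, or the entry does not exist
    return values


for name, value in read_tunables().items():
    print(f"{name} = {value}")
```

Saving this output before and after each experiment makes it easier to say which knob, if any, changed the loss rate.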

Ray


Re: ieee80211 sleeping in invalid context

2006-12-12 Thread Ray Lee
Michael Buesch wrote:
 Congratulations to your decision ;)

Sometimes making decisions via Brownian motion has its advantages.

 Which kernel are you using?

Hmm, I'm using the mercurial repository, let me see if I can translate that to
a git head... Looks like git tree c2bb88baa52429b6b76e3ba4272cb2b29713c5a8.
(Which is from less than 24 hours ago.)

 There is some locking breakage in latest kernels with softmac.
 I attached the fixes for the known bugs.

Okay, I'll apply to my local copy, rebuild, and try again. I'll let you know
what happens.

Ray


ieee80211 sleeping in invalid context

2006-12-11 Thread Ray Lee
Hey all, more data on my bcm43xx problem report from a few weeks back.

By random chance I acquired a brain, and decided to rebuild my latest kernel
pull with as many debugging options on as I could stand. Got the below, plus
a dead keyboard (except for Magic SysRq) (but only if I let userspace come up
fully -- booting with init=/bin/bash is fine). Since the trace below mentions
scans, I'm hoping it's related to my problem.

In other news, now that I've moved my laptop back to my home office, I'm able
to recreate the dead-keyboard lockups I've been having again, about once every
day or two. What fun. So if there are patches I should try on top of the latest
git, let me know. (Though I'm hoping the below will be a smoking gun to someone
who has a clue, i.e., not me.)

Ray

Dec 11 19:34:18 phoenix syslogd 1.4.1#18ubuntu6: restart.
Dec 11 19:34:18 phoenix kernel: Inspecting /boot/System.map-2.6.19
Dec 11 19:34:19 phoenix kernel: Loaded 26330 symbols from /boot/System.map-2.6.19.
Dec 11 19:34:19 phoenix kernel: Symbols match kernel version 2.6.19.
Dec 11 19:34:19 phoenix kernel: No module symbols loaded - kernel modules not enabled.
Dec 11 19:34:19 phoenix kernel: [0.00] Linux version 2.6.19 ([EMAIL PROTECTED]) (gcc version 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)) #1 PREEMPT Mon Dec 11 12:52:41 PST 2006
Dec 11 19:34:19 phoenix kernel: [0.00] Command line: root=UUID=bf7dc35f-5eff-4a85-b398-590f37c5679e ro noapic
Dec 11 19:34:19 phoenix kernel: [0.00] BIOS-provided physical RAM map:
Dec 11 19:34:19 phoenix kernel: [0.00]  BIOS-e820:  - 0009fc00 (usable)
Dec 11 19:34:19 phoenix kernel: [0.00]  BIOS-e820: 0009fc00 - 000a (reserved)
Dec 11 19:34:19 phoenix kernel: [0.00]  BIOS-e820: 000e - 0010 (reserved)
Dec 11 19:34:20 phoenix kernel: [0.00]  BIOS-e820: 0010 - 37fd (usable)
Dec 11 19:34:21 phoenix kernel: [0.00]  BIOS-e820: 37fd - 37fefc00 (reserved)
Dec 11 19:34:21 phoenix kernel: [0.00]  BIOS-e820: 37fefc00 - 37ffb000 (ACPI NVS)
Dec 11 19:34:21 phoenix kernel: [0.00]  BIOS-e820: 37ffb000 - 4000 (reserved)
Dec 11 19:34:21 phoenix kernel: [0.00]  BIOS-e820: e000 - f000 (reserved)
Dec 11 19:34:21 phoenix kernel: [0.00]  BIOS-e820: fec0 - fec02000 (reserved)
Dec 11 19:34:21 phoenix kernel: [0.00]  BIOS-e820: ffb8 - ffc0 (reserved)
Dec 11 19:34:21 phoenix kernel: [0.00]  BIOS-e820: fff8 - 0001 (reserved)
Dec 11 19:34:21 phoenix kernel: [0.00] end_pfn_map = 1048576
Dec 11 19:34:21 phoenix kernel: [0.00] DMI 2.3 present.
Dec 11 19:34:23 phoenix kernel: [0.00] No mptable found.
Dec 11 19:34:23 phoenix kernel: [0.00] Zone PFN ranges:
Dec 11 19:34:23 phoenix kernel: [0.00]   DMA 0 - 4096
Dec 11 19:34:23 phoenix kernel: [0.00]   DMA32 4096 - 1048576
Dec 11 19:34:24 phoenix kernel: [0.00]   Normal 1048576 - 1048576
Dec 11 19:34:24 phoenix kernel: [0.00] early_node_map[2] active PFN ranges
Dec 11 19:34:24 phoenix kernel: [0.00] 0: 0 - 159
Dec 11 19:34:24 phoenix kernel: [0.00] 0: 256 - 229328
Dec 11 19:34:24 phoenix hpiod: 1.6.9 accepting connections at 2208...
Dec 11 19:34:25 phoenix kernel: [0.00] ACPI: PM-Timer IO Port: 0x8008
Dec 11 19:34:25 phoenix kernel: [0.00] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Dec 11 19:34:25 phoenix kernel: [0.00] Processor #0 (Bootup-CPU)
Dec 11 19:34:25 phoenix kernel: [0.00] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
Dec 11 19:34:25 phoenix kernel: [0.00] ACPI: Skipping IOAPIC probe due to 'noapic' option.
Dec 11 19:34:25 phoenix kernel: [0.00] arch/x86_64/mm/init.c:145: bad pte 810001c58fe8(8000fec01173).
Dec 11 19:34:25 phoenix kernel: [0.00] Nosave address range: 0009f000 - 000a
Dec 11 19:34:25 phoenix kernel: [0.00] Nosave address range: 000a - 000e
Dec 11 19:34:25 phoenix kernel: [0.00] Nosave address range: 000e - 0010
Dec 11 19:34:25 phoenix kernel: [0.00] Allocating PCI resources starting at 5000 (gap: 4000:a000)
Dec 11 19:34:25 phoenix kernel: [0.00] Built 1 zonelists.  Total pages: 223940
Dec 11 19:34:25 phoenix kernel: [0.00] Kernel command line: root=UUID=bf7dc35f-5eff-4a85-b398-590f37c5679e ro noapic
Dec 11 19:34:25 phoenix kernel: [0.00] Initializing CPU#0
Dec 11 19:34:25 phoenix kernel: [0.00] PID hash table entries: 4096 (order: 12, 32768 bytes)
Dec 11 19:34:25 phoenix kernel: [   13.705535] time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC

Re: bcm43xx regression 2.6.19rc3 - rc5, rtnl_lock trouble?

2006-11-18 Thread Ray Lee
Larry Finger wrote:
 Johannes Berg wrote:
 Hah, that's a lot more plausible than bcm43xx's drain patch actually
 causing this. So maybe somehow interrupts for bcm43xx aren't routed
 properly or something...

 Ray, please check /proc/interrupts when this happens.

When it happens, I can't. The keyboard is entirely dead (I'm in X, perhaps at
a console it would be okay). The only thing that works is magic SysRq. Even
ctrl-alt-f1 to get to a console doesn't work.

That said, /proc/interrupts doesn't show MSI routed things on my AMD64 laptop.

 I am convinced that the patch in question (drain tx status) is not
 causing this -- the patch should be a no-op in most cases anyway, and in
 those cases where it isn't a no-op it'll run only once at card init and
 remove some things from a hardware-internal FIFO.


Okay, I can buy that.

 I agree that drain tx status should not cause the problem.
 
 Ray, does -rc6 solve your problem as it did for Joseph?

I can't get it to repeat other than the first two times. However, I
accidentally stopped NetworkManager from handling my wireless a few days ago,
and haven't restarted it, so that may play into this.

Humor me one last time, I beg. Did you look at the messages file I posted? (Or
maybe I didn't include this second bit... Damn, I need to be more careful with
cutting and pasting...)

The second sysrq-t shows locking stuff going on, can you tell me if it looks
reasonable? It still seems to me that something acquiring and not releasing
rtnl_lock explains what I was seeing (rtnl lock is implicated in both sysrq-t
backtraces). I don't know if that thing is bcm43xx, though.

Is this part reasonable?:
 1 lock held by events/0/4:
  #0:  (bcm-mutex){--..}, at: [mutex_lock+9/16] mutex_lock+0x9/0x10
 2 locks held by NetworkManager/4837:
  #0:  (rtnl_mutex){--..}, at: [mutex_lock+9/16] mutex_lock+0x9/0x10
  #1:  (bcm-mutex){--..}, at: [mutex_lock+9/16] mutex_lock+0x9/0x10
 1 lock held by wpa_supplicant/5953:
  #0:  (rtnl_mutex){--..}, at: [mutex_lock+9/16] mutex_lock+0x9/0x10

(So locks A, AB, B)

...of the below...

 Showing all locks held in the system:
 1 lock held by events/0/4:
  #0:  (bcm-mutex){--..}, at: [mutex_lock+9/16] mutex_lock+0x9/0x10
 1 lock held by getty/4224:
  #0:  (tty-atomic_read_lock){--..}, at: [mutex_lock_interruptible+9/16] mutex_lock_interruptible+0x9/0x10
 1 lock held by getty/4225:
  #0:  (tty-atomic_read_lock){--..}, at: [mutex_lock_interruptible+9/16] mutex_lock_interruptible+0x9/0x10
 1 lock held by getty/4226:
  #0:  (tty-atomic_read_lock){--..}, at: [mutex_lock_interruptible+9/16] mutex_lock_interruptible+0x9/0x10
 1 lock held by getty/4227:
  #0:  (tty-atomic_read_lock){--..}, at: [mutex_lock_interruptible+9/16] mutex_lock_interruptible+0x9/0x10
 1 lock held by getty/4228:
  #0:  (tty-atomic_read_lock){--..}, at: [mutex_lock_interruptible+9/16] mutex_lock_interruptible+0x9/0x10
 1 lock held by getty/4229:
  #0:  (tty-atomic_read_lock){--..}, at: [mutex_lock_interruptible+9/16] mutex_lock_interruptible+0x9/0x10
 2 locks held by NetworkManager/4837:
  #0:  (rtnl_mutex){--..}, at: [mutex_lock+9/16] mutex_lock+0x9/0x10
  #1:  (bcm-mutex){--..}, at: [mutex_lock+9/16] mutex_lock+0x9/0x10
 1 lock held by wpa_supplicant/5953:
  #0:  (rtnl_mutex){--..}, at: [mutex_lock+9/16] mutex_lock+0x9/0x10
 1 lock held by less/29492:
  #0:  (tty-atomic_read_lock){--..}, at: [mutex_lock_interruptible+9/16] mutex_lock_interruptible+0x9/0x10
 1 lock held by bash/9871:
  #0:  (tty-atomic_read_lock){--..}, at: [mutex_lock_interruptible+9/16] mutex_lock_interruptible+0x9/0x10

 =

Regardless, I'm going to withdraw my regression report until I can reproduce
this. I can't justify holding anything up if we can't even finger a culprit to
look at. In the meantime I'll try running with rc6.
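
The "locks A, AB, B" pattern above can be pulled out of a lock dump mechanically. A minimal sketch, assuming the dump looks like the excerpt quoted in this message; the function name tasks_per_lock is made up. It only groups the tasks listed against each lock and makes no claim about which task really owns a lock and which is waiting on it.

```python
import re
from collections import defaultdict

TASK = re.compile(r"^\s*\d+ locks? held by (\S+)/(\d+):")
LOCK = re.compile(r"^\s*#\d+:\s+\(([^)]+)\)")

DUMP = """\
 1 lock held by events/0/4:
  #0:  (bcm-mutex){--..}, at: [mutex_lock+9/16] mutex_lock+0x9/0x10
 2 locks held by NetworkManager/4837:
  #0:  (rtnl_mutex){--..}, at: [mutex_lock+9/16] mutex_lock+0x9/0x10
  #1:  (bcm-mutex){--..}, at: [mutex_lock+9/16] mutex_lock+0x9/0x10
 1 lock held by wpa_supplicant/5953:
  #0:  (rtnl_mutex){--..}, at: [mutex_lock+9/16] mutex_lock+0x9/0x10
"""


def tasks_per_lock(dump):
    """Map each lock name to the tasks the dump lists against it."""
    tasks = defaultdict(list)
    current = None
    for line in dump.splitlines():
        match = TASK.match(line)
        if match:
            current = f"{match.group(1)}/{match.group(2)}"
            continue
        match = LOCK.match(line)
        if match and current:
            tasks[match.group(1)].append(current)
    return dict(tasks)


for lock, names in tasks_per_lock(DUMP).items():
    print(lock, "->", ", ".join(names))
```

Any lock with more than one task listed against it is a place to look first, since at most one of those tasks can actually own a mutex at a time.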

Ray


Re: bcm43xx regression 2.6.19rc3 - rc5, rtnl_lock trouble?

2006-11-16 Thread Ray Lee

First off, thanks for all your help.

Second off,

On 11/16/06, Larry Finger [EMAIL PROTECTED] wrote:

Ray Lee wrote:

 If I could figure out a way to make it repeatable, I'd happily do a blind
 bisect.

[...]

 I'm open to suggestions on how to make the problem trigger more than once
 every two days...

I don't know what might be causing the lock problems. I'm more concerned with
the NETDEV WATCHDOG timeouts. AFAIK, you are the only one still reporting this
error. On my system, I get an occasional MAC suspend failure, sometimes
followed by a BCM43xx_IRQ_XMIT_ERROR.


Last time I had trouble with 2.6.18-rcX, I wasn't the only one, just
the only one reporting it. Can you tell me why reverting the likely
culprit isn't an option? rc6 is out, and Linus is really pushing to
finalize 2.6.19 here soon.


From what I read in your post, the timeouts happen a lot more often than once
every two days. Once we get those fixed, then we can concentrate on the
locking.


It's becoming clear that I wasn't so clear :-). No, it doesn't happen
more than once every two (three, now) days. I'm saying that it's only
happened twice, as once the first timeout message starts, the timeouts
don't stop short of a reboot.

Or, in other words, it happened occasionally under 2.6.19-rc3, but
fixed itself. Under 2.6.19-rc5, it's happened less frequently (maybe),
but once it starts, it goes on solid until I reboot the computer.
Until I reboot, the laptop is fully unusable as things start hanging
on the rtnl_lock (X, apparently).

Please see http://madrabbit.org/~ray/messages.gz for the
/var/log/messages to understand what I mean by that. (Though, that was
captured before I'd rebuilt the module with debugging, unfortunately.
Regardless, it may help clarify what I mean here.)

So all the NETDEV WATCHDOG timeouts other than the first (of each of
the two events) appear to be bogus, or side effects of rtnl_lock being
held after the first time, and not clearing out.

thinks... Maybe I've got the culprit backward here. Perhaps
something else in my system is locking on rtnl_lock, and bcm43xx can't
acquire it? Could the NETDEV WATCHDOG timeouts be a side effect of
someone acquiring and not releasing the rtnl_lock()? Is that possible?
(ie, would it cause the effect I'm seeing?)

Ray


bcm43xx regression 2.6.19rc3 - rc5, rtnl_lock trouble?

2006-11-15 Thread Ray Lee
Hey all,

I ran 2.6.19-rc3 for almost two weeks or so with no difficulties (none related
to the bcm43xx driver, at least). However, Andrew asked me to double check the
latest release to see if my problem report against 2.6.18 (hard locks) was
fixed. Good news is that it still is fixed. Bad news is that 2.6.19-rc5 is
worse than rc3 in other ways.

I've come back to my laptop being mostly dead after hours of it being off on
its own (twice now). Mostly dead meaning the keyboard is nearly
non-responsive, but the mouse works great (I'm in X, of course). I say 'nearly
dead' as sysrq-t,b works, so I'm sorta stumped there. (x-session seems to use
netlink, so perhaps that's the connection? ctrl-alt-f[1-7] don't do anything,
however.)

It seems to be a locking problem, though lockdep isn't catching it. I'll let
you guys decide though.

Regardless, here's what I can see. My logs start filling with:

$ grep 'NETDEV WATCHDOG:' /var/log/messages |  cut -d '[' -f 2- | head
50025.388173] NETDEV WATCHDOG: eth1: transmit timed out
50029.019574] NETDEV WATCHDOG: eth1: transmit timed out
50030.835313] NETDEV WATCHDOG: eth1: transmit timed out
50032.651049] NETDEV WATCHDOG: eth1: transmit timed out
50034.466785] NETDEV WATCHDOG: eth1: transmit timed out
50036.282523] NETDEV WATCHDOG: eth1: transmit timed out
50038.098237] NETDEV WATCHDOG: eth1: transmit timed out
50039.913974] NETDEV WATCHDOG: eth1: transmit timed out
50041.729709] NETDEV WATCHDOG: eth1: transmit timed out
50043.545447] NETDEV WATCHDOG: eth1: transmit timed out
(...1249 of these, so it doesn't fix itself.)
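
A quick way to see how regular those timeouts are is to compute the gaps between the kernel timestamps. A minimal sketch, with a made-up function name (watchdog_gaps) and the first four lines above as sample input:

```python
import re

STAMP = re.compile(r"(\d+\.\d+)\]\s+NETDEV WATCHDOG")

SAMPLE = """\
50025.388173] NETDEV WATCHDOG: eth1: transmit timed out
50029.019574] NETDEV WATCHDOG: eth1: transmit timed out
50030.835313] NETDEV WATCHDOG: eth1: transmit timed out
50032.651049] NETDEV WATCHDOG: eth1: transmit timed out
""".splitlines()


def watchdog_gaps(lines):
    """Seconds between consecutive NETDEV WATCHDOG messages."""
    stamps = [float(m.group(1)) for m in map(STAMP.search, lines) if m]
    return [round(later - earlier, 6)
            for earlier, later in zip(stamps, stamps[1:])]


print(watchdog_gaps(SAMPLE))
```

On the lines shown, every gap after the first is a little over 1.8 seconds, which looks more like a timer re-firing on a stuck queue than like independent transmit failures.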

and then the system becomes pretty worthless. (Full /var/log/messages with
sysrq-t at: http://madrabbit.org/~ray/messages.gz ).

Interesting bits of that:

$ grep -B5 -A10 'Nov 13 01:5.*mutex' /var/log/messages | cut -d ']' -f2-
 DWARF2 unwinder stuck at child_rip+0xa/0x12

 Leftover inexact backtrace:

  [restore_args+0/48] restore_args+0x0/0x30
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [kthread+0/272] kthread+0x0/0x110
  [child_rip+0/18] child_rip+0x0/0x12

 khelper   S 810037fbe318 0 5  1 6 4 (L-TLB)
  810037907e60 0046 810037907e70 810037fbe140
  81001095f140 3b5d 810001e3e668 0286
  810037907e40 8026bbb2 810037907e70 810001e3e600
 Call Trace:
  [worker_thread+236/352] worker_thread+0xec/0x160
  [kthread+211/272] kthread+0xd3/0x110
--
 DWARF2 unwinder stuck at child_rip+0xa/0x12

 Leftover inexact backtrace:

  [restore_args+0/48] restore_args+0x0/0x30
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [kthread+0/272] kthread+0x0/0x110
  [child_rip+0/18] child_rip+0x0/0x12

 kthread   S 810037fad218 0 6  1252129 5 (L-TLB)
  810037f01e60 0046 810037f01e70 810037fad040
  81002b3df140 062b 810001e3e468 0286
  810037f01e40 8026bbb2 810037f01e70 810001e3e400
 Call Trace:
  [worker_thread+236/352] worker_thread+0xec/0x160
  [kthread+211/272] kthread+0xd3/0x110
--
 DWARF2 unwinder stuck at child_rip+0xa/0x12

 Leftover inexact backtrace:

  [restore_args+0/48] restore_args+0x0/0x30
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [kthread+0/272] kthread+0x0/0x110
  [child_rip+0/18] child_rip+0x0/0x12

 kblockd/0 S 810037989318 025  626   (L-TLB)
  81003798fe60 0046 81003798fe70 810037989140
  8100379a5100 078b 810037fa2468 0286
  81003798fe40 8026bbb2 81003798fe70 810037fa2400
 Call Trace:
  [worker_thread+236/352] worker_thread+0xec/0x160
  [kthread+211/272] kthread+0xd3/0x110
--
 NetworkManage D 810037943258 0  4833  1  4853  4809 (NOTLB)
  81002bfefbe8 0046 81002bfefb98 810037943080
  81002e6d2100 000122a6 8062ce80 0046
  0246 810037943080 81002e47b3f0 81002e47b3a0
 Call Trace:
  [__mutex_lock_slowpath+344/624] __mutex_lock_slowpath+0x158/0x270
  [mutex_lock+9/16] mutex_lock+0x9/0x10
  [_end+126343345/2126632680] :bcm43xx:bcm43xx_wx_get_mode+0x29/0x60
  [ioctl_standard_call+139/944] ioctl_standard_call+0x8b/0x3b0
  [wireless_process_ioctl+260/976] wireless_process_ioctl+0x104/0x3d0
  [dev_ioctl+854/944] dev_ioctl+0x356/0x3b0
  [sock_ioctl+576/624] sock_ioctl+0x240/0x270
  [do_ioctl+49/160] do_ioctl+0x31/0xa0
  [vfs_ioctl+683/720] vfs_ioctl+0x2ab/0x2d0
  [sys_ioctl+106/160] sys_ioctl+0x6a/0xa0
  [system_call+126/131] system_call+0x7e/0x83
 DWARF2 unwinder stuck at system_call+0x7e/0x83
--
 x-session-man D 81002ef02298 0  5625   4565  5672  4586 (NOTLB)
  810028a1fad8 0046 8062c500 81002ef020c0
  8100249a6040 8c5d  0046
  0246 81002ef020c0 805505b0 80550560
 Call Trace:
  [__mutex_lock_slowpath+344/624] 

Re: bcm43xx regression 2.6.19rc3 - rc5, rtnl_lock trouble?

2006-11-15 Thread Ray Lee
Larry Finger wrote:
 Ray Lee wrote:
 Michael Buesch wrote:
 On Wednesday 15 November 2006 20:01, Ray Lee wrote:
 Suggestions? Requests for shudder even more info?
 Yeah, enable bcm43xx debugging.

 Sigh, didn't even think to look for that. Okay, enabled and compiling
 a new kernel. This will take a few days to trigger, if the pattern holds, so
 in the meantime, any *other* thoughts?
 
 Which chip and revision do you have? Send me your equivalent of the line
 bcm43xx: Chip ID 0x4306, rev 0x2.

bcm43xx: Chip ID 0x4306, rev 0x3

Also, another thing I wasn't clear about in my first email was that the netdev
watchdog timeouts are new with rc5:

$ zgrep 'NETDEV WATCH' /var/log/messages{,.0,.1.gz} | cut -d: -f2 | cut -c 1-6 | uniq -c
   1249 Nov 13
      6 Nov  6
      1 Nov  7
      3 Nov  8
      2 Nov  9
   5717 Nov 10
   5652 Nov 11
      5 Oct 29
      3 Oct 30
      3 Oct 31
      4 Nov  1
      1 Nov  2
      1 Nov  3

I booted into 2.6.19-rc5 on November 10th. Previous to that was 2.6.19-rc3.
There really does seem to be something suspicious with that patch, yes?
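
For anyone who wants the same per-day tally without the shell pipeline, here is a rough Python equivalent. The function names open_log and watchdog_counts are made up, and it assumes standard syslog lines that begin with a six-character "Mon DD" date.

```python
import gzip
from collections import Counter


def open_log(path):
    """Open a plain or gzip-compressed log file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, errors="replace")


def watchdog_counts(lines, needle="NETDEV WATCH"):
    """Count matching lines per day, keyed on the 'Mon DD' prefix."""
    counts = Counter()
    for line in lines:
        if needle in line:
            counts[line[:6]] += 1
    return counts


sample = [
    "Nov 10 09:00:01 phoenix kernel: [1.0] NETDEV WATCHDOG: eth1: transmit timed out",
    "Nov 10 09:00:03 phoenix kernel: [3.0] NETDEV WATCHDOG: eth1: transmit timed out",
    "Nov 11 10:00:00 phoenix kernel: [9.0] NETDEV WATCHDOG: eth1: transmit timed out",
    "Nov 11 10:00:01 phoenix kernel: [9.5] something unrelated",
]
print(watchdog_counts(sample))
```

Unlike uniq -c, which counts adjacent runs, the Counter gives one total per day regardless of the order the files are read in.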

Thanks,

Ray


Re: Bcm43xx softMac Driver in 2.6.18

2006-09-29 Thread Ray Lee
(re-adding linux-kernel.)

Larry Finger wrote:
 Would you please test the attached patch that should be applied to a
 vanilla 2.6.18? I'm currently running it, but only for a few minutes. It
 comes up fine and I ran it through several ifdown/ifup cycles without
 any problem.

Okay, this is far better than vanilla 2.6.18 (or your other patch). I've
been running this for six hours so far with no troubles, when before I'd
have a hard system freeze within a minute or two of associating (or
trying to associate) with an access point.

As for -stable, the patch is sorta, y'know, ginormous:

 bcm43xx.h         |  181
 bcm43xx_debugfs.c |   80
 bcm43xx_debugfs.h |    1
 bcm43xx_dma.c     |  583
 bcm43xx_dma.h     |  296
 bcm43xx_leds.c    |   10
 bcm43xx_main.c    |  905
 bcm43xx_main.h    |    6
 bcm43xx_phy.c     |   48
 bcm43xx_pio.c     |    4
 bcm43xx_sysfs.c   |   46
 bcm43xx_wx.c      |  121
 12 files changed, 1426 insertions(+), 855 deletions(-)

OTOH, the current version is completely unusable on this system, so I
don't know if the right path is to revert the driver to 2.6.17's
version, or to try to move forward with the patch when it's had hard
review and testing.

I'm heading out on vacation for the next two weeks. I'll catch up with
any mail directed to me for more things to try (or report about this
specific system), if requested, when I get back. (Or catch me today.)

Thank you very much for your help,

Ray


Re: Bcm43xx softMac Driver in 2.6.18

2006-09-23 Thread Ray Lee

On 9/22/06, Larry Finger [EMAIL PROTECTED] wrote:

When we found the cause of NETDEV watchdog timeouts in the wireless-2.6 code,
I knew that the 2.6.18 release code would cause a serious regression.


I don't know if this is the lockup you're trying to address, but
2.6.18's bcm43xx has definitely regressed for me versus 2.6.17.x.

2.6.18 vanilla and 2.6.18 with your patch both lock my system hard
with bcm43xx. I've got an HP/Compaq nx6125 laptop. Symptoms are that
it will associate fine on its own and send traffic to/fro upon ifup,
but when I do an iwconfig, ifdown, ifup to change the access point,
the system locks (somewhat randomly) during one of those operations.
Well, the iwconfig or the ifup, actually.

lspci -v:

02:02.0 Network controller: Broadcom Corporation BCM4309 802.11a/b/g (rev 03)
   Subsystem: Hewlett-Packard Company Unknown device 12f9
   Flags: bus master, fast devsel, latency 64, IRQ 11
   Memory at d001 (32-bit, non-prefetchable) [size=8K]

./bcm43xx-fwcutter -i BCMWL5.SYS
 filename :  bcmwl5.sys
 version  :  4.10.40.1
 MD5  :  69f940672be0ecee5bd1e905706ba8ce

Wireless tools are Version: 28-1ubuntu2.

I've got multiple access points in view of the laptop, a g (54Mb), and
a b (11Mb). Neither with encryption enabled, if that makes a
difference (we live in the boonies).

It's 2.6.18 + your patch, compiled for x86_64, ubuntu devel.

Any suggestions or requests for tests?

Ray


Re: Bcm43xx softMac Driver in 2.6.18

2006-09-23 Thread Ray Lee
Rafael J. Wysocki wrote:
 2.6.18 vanilla and 2.6.18 with your patch both lock my system hard
 with bcm43xx. I've got an HP/Compaq nx6125 laptop. Symptoms are that
 it will associate fine on its own and send traffic to/fro upon ifup,
 but when I do an iwconfig, ifdown, ifup to change the access point,
 the system locks (somewhat randomly) during one of those operations.
 Well, the iwconfig or the ifup, actually.
 
 I have observed similar symptoms on HPC nx6325, although I haven't managed
 to get the adapter associate with an AP.

Yeah, I'm having the same troubles. Carefully watching the iwconfig
results showed me that only half of the time did my `iwconfig eth1 essid
AccessPointName` actually take. (It listed the essid of the ap I told it
to associate with, but then showed Access Point: Invalid or words to
that effect, until I issued the exact same iwconfig again.)

So, try it twice, double check the iwconfig output, then try bringing up
the interface. Though that seems awfully difficult to do as well (DHCP
is just sending out stuff with nothing coming back).

When I switch consoles while DHCP is plaintively asking for an IP, and
issue *another* iwconfig with the same essid, then it seems to kick
something in the driver and DHCP immediately associates. Happened twice
for me so far, though that could merely be a coincidence.

Ray
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 2/9] deadlock prevention core

2006-08-19 Thread Ray Lee

On 8/18/06, Andrew Morton [EMAIL PROTECTED] wrote:

  I assert that this can be solved by putting swap on local disks.  Peter
  asserts that this isn't acceptable due to disk unreliability.  I point
  out that local disk reliability can be increased via MD, all goes quiet.

  A good exposition which helps us to understand whether and why a
  significant proportion of the target user base still wishes to do
  swap-over-network would be useful.


Adding a hard drive adds $low per system, another failure point, and
more importantly ~3-10 Watts which then has to be paid for twice (once
to power it, again to cool it). For a hundred seats, that's
significant. For 500, it's ranging toward fully painful.
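
As a worked version of that arithmetic: the 3-10 W per drive and the "paid twice" factor come from the paragraph above, while the function name extra_watts is made up for the sketch.

```python
def extra_watts(seats, watts_per_drive, overhead=2.0):
    """Added load in watts: each drive's draw, paid once to power it and
    once more to cool it (overhead=2.0)."""
    return seats * watts_per_drive * overhead


for seats in (100, 500):
    low = extra_watts(seats, 3)
    high = extra_watts(seats, 10)
    print(f"{seats} seats: {low:.0f} to {high:.0f} W")
```

That works out to roughly 0.6-2 kW of continuous load for a hundred seats and 3-10 kW for five hundred, before counting the drives themselves.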

I'm in the process of designing the next upgrade for a VoIP call
center, and we want to go entirely diskless in the agent systems. We'd
also rather not swap over the network, but 'swap is as swap does.'

That said, it in no way invalidates using /proc/sys/vm/min_free_kbytes...

Ray