* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> On Thursday 19 July 2007 21:56, Ingo Molnar wrote:
> > nope - with this patch applied the box still has no network, symptoms
> > are similar. (should i apply the WARN_ON() patch too?)
>
> Yes, that would be nice. If that doesn't help, you can also throw
On Thursday 19 July 2007 21:56, Ingo Molnar wrote:
> nope - with this patch applied the box still has no network, symptoms
> are similar. (should i apply the WARN_ON() patch too?)
Yes, that would be nice. If that doesn't help, you can also throw in
the one below.
Olaf
--
Olaf Kirch | --- o --
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> Does the following help?
> --- build-2.6.orig/drivers/net/netconsole.c
> +++ build-2.6/drivers/net/netconsole.c
> @@ -70,7 +70,7 @@ static void write_msg(struct console *co
> int frag, left;
> unsigned long flags;
>
> - if (!np.dev)
> +
Does the following help?
Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[EMAIL PROTECTED] |/ | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax
Test patch
---
Index: build-2.6/drivers/net/netconsole.c
==
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> Here's a somewhat drastic modification that should not change any
> timing, but just verifies whether my patch is to blame at all. Can you
> give it a try?
> @@ -1027,7 +1027,7 @@ static inline void netif_rx_complete(str
>* But at least it does
On Thursday 19 July 2007 18:07, Ingo Molnar wrote:
> because i dont seem to be able to trigger Olaf's WARN_ON(), can you see
> anything in the ethtool output that i sent in the previous mail(s)?
If the WARN_ON doesn't trigger, I cannot see how my patch would affect
your system.
- IF we ent
On Thursday 19 July 2007 19:36, Olaf Kirch wrote:
> Can you confirm this by spraying the laptop with arp packets
> or broadcast pings while it's booting?
Sorry for the noise - didn't see your other message where you
described just that.
This sounds more like a hardware issue - Rx interrupt seems
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> On Thursday 19 July 2007 18:05, Ingo Molnar wrote:
> > that network-intense test also produced periodic broadcast packets that
> > got the e1000 out of its weird state before the tx timeout could hit.
> > Now that i've stopped the test, the network is q
On Thursday 19 July 2007 18:05, Ingo Molnar wrote:
> that network-intense test also produced periodic broadcast packets that
> got the e1000 out of its weird state before the tx timeout could hit.
> Now that i've stopped the test, the network is quiescent again and the
> e1000 hangs.
Can you co
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> > i'll now check whether removing ignore_on_loglevel (no other
> > changes) makes the hang go away. Maybe ignore_on_loglevel is buggy -
> > or it produces an immediate printk (going out to the interface)
> > during a particularly sensitive period of n
* Kok, Auke <[EMAIL PROTECTED]> wrote:
> > I don't have a fix ready yet - I hope I'll have something later this
> > afternoon.
>
> interesting, you seem to have found the cause all right. I can't confirm the
> problem but I know that netpoll and NAPI have historically been an
> issue. I look forward
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> i'll now check whether removing ignore_on_loglevel (no other changes)
> makes the hang go away. Maybe ignore_on_loglevel is buggy - or it
> produces an immediate printk (going out to the interface) during a
> particularly sensitive period of network i
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> ah! Just found the reason: the bug apparently depends on the precise
> kernel command-line contents. I accidentally dropped ignore_loglevel
> (found this while comparing with the older logs i sent to you), adding
> it back in produces hung networking
Olaf Kirch wrote:
On Thursday 19 July 2007 12:58, Ingo Molnar wrote:
i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine
hiccup symptoms, with no other bad symptoms such as lockups or crashes.
Duh, I found it.
The e1000 poll routine does this to leave polling mode.
net
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> ugh. Something really weird happened with this e1000 problem.
>
> i crashed the laptop in a weird way and had to power-cycle it in an
> unusual fashion. After that i wanted to try your latest BUG_ON()
> theory but the network hang went away!
>
> For
On Thursday 19 July 2007 17:07, Ingo Molnar wrote:
> i crashed the laptop in a weird way and had to power-cycle it in an
> unusual fashion. After that i wanted to try your latest BUG_ON() theory
> but the network hang went away!
Should I rejoice, or regret? :-)
> maybe it's not the power-cyclin
ugh. Something really weird happened with this e1000 problem.
i crashed the laptop in a weird way and had to power-cycle it in an
unusual fashion. After that i wanted to try your latest BUG_ON() theory
but the network hang went away!
For 3 hours i tried to reproduce the hang (i went back to th
On Thursday 19 July 2007 14:52, Olaf Kirch wrote:
> On Thursday 19 July 2007 12:58, Ingo Molnar wrote:
> > i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine
> > hiccup symptoms, with no other bad symptoms such as lockups or crashes.
>
> Duh, I found it.
The following patch shoul
On Thursday 19 July 2007 12:58, Ingo Molnar wrote:
> i.e. it's the classic 'eth0 got stuck somehow' tx/rx state machine
> hiccup symptoms, with no other bad symptoms such as lockups or crashes.
Duh, I found it.
The e1000 poll routine does this to leave polling mode.
netif_rx_complete(po
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> * Olaf Kirch <[EMAIL PROTECTED]> wrote:
>
> > On Thursday 19 July 2007 12:01, Ingo Molnar wrote:
> > > Calling initcall 0xc0603f55: netpoll_init+0x0/0x39()
> > > initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0.
> > > initcall 0xc0603f55 ran f
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> On Thursday 19 July 2007 12:01, Ingo Molnar wrote:
> > Calling initcall 0xc0603f55: netpoll_init+0x0/0x39()
> > initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0.
> > initcall 0xc0603f55 ran for 0 msecs: netpoll_init+0x0/0x39()
> > Calling initc
On Thursday 19 July 2007 12:01, Ingo Molnar wrote:
> Calling initcall 0xc0603f55: netpoll_init+0x0/0x39()
> initcall 0xc0603f55: netpoll_init+0x0/0x39() returned 0.
> initcall 0xc0603f55 ran for 0 msecs: netpoll_init+0x0/0x39()
> Calling initcall 0xc0604257: netlink_proto_init+0x0/0x12a()
> NE
* Ingo Molnar <[EMAIL PROTECTED]> wrote:
> the e1000 in this laptop is historically pretty robust. The only
> problem i ever had with it were some rx/tx hw-engine latency problems
> [pings from the outside took up to 1 second to propagate] that were
> quickly fixed by the e1000 driver guys. Ma
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> - You say that netconsole output continues to trickle after
> the network gets wedged. This could be caused by the
> e1000 watchdog, which triggers a NIC interrupt "to ensure
> rx ring is cleaned". I assume that this triggers the
>
On Thursday 19 July 2007 11:09, Ingo Molnar wrote:
> the e1000 in this laptop is historically pretty robust. The only problem
> i ever had with it were some rx/tx hw-engine latency problems [pings
> from the outside took up to 1 second to propagate] that were quickly
> fixed by the e1000 driver
i have your original patch applied to my working tree to be able to
observe this bug's behavior, and here's another observation: the problem
seems to go away if i turn on CONFIG_NO_HZ. So it looks timing related
indeed ...
but when the bug happens, it happens all the time, reboot after reboot.
On Wed, Jul 18, 2007 at 01:48:20PM +0200, Jarek Poplawski wrote:
...
> I'd be very glad if it could be verified and/or tested.
Jarek,
This patch is verified crap!
Regards,
Jarek P.
PS: Olaf,
You've written earlier that one of the main reasons for poll_napi is
to work when the kernel "doesn't e
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> > also, i'm using netconsole via the command line (both the network
> > driver and netconsole are built into the bzImage), maybe that makes a
> > difference?
>
> Possibly - but so far there's nothing in the code that jumped at me.
>
> Can you try the f
On Wednesday 18 July 2007 14:48, Ingo Molnar wrote:
> something i noticed: netconsole output seems to trickle through though,
> but very, very slowly (a packet once every 4 seconds or so). TCP/IP is
> not functional.
>
> also, i'm using netconsole via the command line (both the network driver
>
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> On Tuesday 17 July 2007 20:56, Ingo Molnar wrote:
> > i logged these not via netconsole but via logging on over the console
> > and using dmesg, so it should include everything. in the 100hz case the
> > following seems to show the anomaly:
> >
> > N
On Tuesday 17 July 2007 20:56, Ingo Molnar wrote:
> i logged these not via netconsole but via logging on over the console
> and using dmesg, so it should include everything. in the 100hz case the
> following seems to show the anomaly:
>
> NETDEV WATCHDOG: eth0: transmit timed out
So, it seems
Hi,
Here is my proposal of a solution based on dev->state flag,
but intended mainly to prevent poll_napi from disturbing
while net_rx_action is running and polling the device.
It doesn't look very nice or clean but I hope it could
guard net_rx_action enough with some room for netpoll too.
I'd be
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> On Tuesday 17 July 2007 20:18, Ingo Molnar wrote:
> > (one is HZ=100, the other HZ=1000. HZ=100 produces a hung network just
> > like HZ=250.)
> >
> > no 'rx_sched set' messages in either case. Network still hung for
> > HZ=100, and is working for HZ=1
On Tuesday 17 July 2007 20:18, Ingo Molnar wrote:
> (one is HZ=100, the other HZ=1000. HZ=100 produces a hung network just
> like HZ=250.)
>
> no 'rx_sched set' messages in either case. Network still hung for
> HZ=100, and is working for HZ=1000.
Is this from dmesg or the netconsole output? I d
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> Hi Ingo,
>
> On Tuesday 17 July 2007 18:57, Ingo Molnar wrote:
> > i've done the patch below, but it did not change the timeouts nor did it
> > solve the 'no network' problem. netconsole output hung earlier as well.
> Hm, pity.
>
> To rule out any e100
* David Miller <[EMAIL PROTECTED]> wrote:
> From: Ingo Molnar <[EMAIL PROTECTED]>
> Date: Tue, 17 Jul 2007 00:37:18 +0200
>
> > I think if you leaned back and thought it through, and if you
> > applied this scenario to a bad scheduler commit from me that broke
> > your box, you'd readily agree
Hi Ingo,
On Tuesday 17 July 2007 18:57, Ingo Molnar wrote:
> i've done the patch below, but it did not change the timeouts nor did it
> solve the 'no network' problem. netconsole output hung earlier as well.
Hm, pity.
To rule out any e1000 problem, can you try the following please,
both with
On Tue, 17 Jul 2007, Ingo Molnar wrote:
>
> i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes
> the problem go away. So it's somehow also related to jiffies.
No, I suspect it's just related to timing: you need to hit that window
when the LIST_FROZEN bit is set, and since i
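A minimal userspace sketch of the window discussed here, assuming the early-return hack quoted elsewhere in this thread (refill dev->quota from dev->weight and return while the poll list is frozen). struct model_dev, its bool flags and model_rx_complete() are hypothetical stand-ins, not kernel code, and the non-frozen path is only an assumption about what "leaving polling mode" means:

```c
#include <stdbool.h>

/* Userspace model only. The bool fields are hypothetical stand-ins
 * for the bits kept in dev->state in the real driver code. */
struct model_dev {
    bool rx_sched;         /* stands in for __LINK_STATE_RX_SCHED */
    bool poll_list_frozen; /* stands in for __LINK_STATE_POLL_LIST_FROZEN */
    bool on_poll_list;     /* whether the device is linked into the poll list */
    int quota;
    int weight;
};

/* Model of the quoted hack: while the poll list is frozen, only refill
 * the quota and stay scheduled. Otherwise (assumed behaviour) drop off
 * the poll list and clear the scheduled flag. */
void model_rx_complete(struct model_dev *dev)
{
    if (dev->poll_list_frozen) {
        dev->quota = dev->weight;
        return;
    }
    dev->on_poll_list = false;
    dev->rx_sched = false;
}
```

In this model a call made while the frozen flag is set leaves the device scheduled with a refilled quota; the same call made outside that window takes the device out of polling mode. Which of the two happens depends only on when the call lands, which is the timing dependence described above.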
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> Can you try what happens if you change netif_rx_complete to something
> like this:
>
> if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state)) {
> dev->quota = dev->weight;
> return;
> }
>
> This is just a hack
On Tuesday 17 July 2007 10:57, Ingo Molnar wrote:
> i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes
> the problem go away. So it's somehow also related to jiffies.
There are several "Tx Hang detected" messages in the log, which looks
a lot as if net_rx_action never runs, or
On Tue, Jul 17, 2007 at 10:57:48AM +0200, Ingo Molnar wrote:
>
> Olaf,
>
> i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes
> the problem go away. So it's somehow also related to jiffies.
IMHO it could be related to __LINK_STATE_RX_SCHED being set
too long e.g. between t
On Tue, Jul 17, 2007 at 10:28:34AM +0200, Olaf Kirch wrote:
> On Tuesday 17 July 2007 09:55, Olaf Kirch wrote:
> > What I find more problematic about this portion of code though
> > is that once a net_device is over quota, net_rx_action will
> > loop for up to one jiffy, even if there's just this o
Olaf,
i've got a new observation: changing CONFIG_HZ from 250 to 1000 makes
the problem go away. So it's somehow also related to jiffies.
Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at
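Since the observation above ties the hang to CONFIG_HZ, it may help to see how long one jiffy lasts at each of the HZ values mentioned in this thread. A small sketch; jiffy_usecs() and print_jiffy_table() are hypothetical helpers for illustration, not kernel APIs:

```c
#include <stdio.h>

/* Length of one jiffy (one timer tick) in microseconds for a given
 * CONFIG_HZ value. Hypothetical helper, not a kernel function. */
unsigned long jiffy_usecs(unsigned int hz)
{
    return 1000000UL / hz;
}

/* Print the tick length for the three HZ values tried in the thread. */
void print_jiffy_table(void)
{
    static const unsigned int hz_values[] = { 100, 250, 1000 };
    size_t i;

    for (i = 0; i < sizeof(hz_values) / sizeof(hz_values[0]); i++)
        printf("HZ=%u -> one jiffy = %lu us\n",
               hz_values[i], jiffy_usecs(hz_values[i]));
}
```

Calling print_jiffy_table() from main() prints one line per HZ value. A jiffy is 10000 us at HZ=100, 4000 us at HZ=250 and 1000 us at HZ=1000, so anything bounded by "one jiffy" runs for a much shorter time in the configuration that works.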
On Tuesday 17 July 2007 09:55, Olaf Kirch wrote:
> What I find more problematic about this portion of code though
> is that once a net_device is over quota, net_rx_action will
> loop for up to one jiffy, even if there's just this one device on
> the poll_list.
Duh, wrong. For every loop, it'll add
On Monday 16 July 2007 23:40, Linus Torvalds wrote:
> - The change seems to always set the LIST_FROZEN bit when calling
>   ->poll(), and at least on e1000, the NAPI poll() routine ends up doing
>   that netif_rx_complete(), so we're *guaranteed* to always take the
>   early exit and not do
On Tuesday 17 July 2007 08:14, Jarek Poplawski wrote:
> > If after poll_napi dev->quota <= 0 dev->poll is not run and
> > __LINK_STATE_RX_SCHED bit (plus dev->poll_list) stays uncleared.
>
> Or, more precisely dev->poll_list will be cleared just after this,
> and net_rx_action returns with __LINK_
On Tuesday 17 July 2007 00:08, David Miller wrote:
> Sure, but I thought it would be nice to give Olaf a day or two to
> figure out what's going on rather than have the knee-jerk reaction to
> just revert.
Oh, reverting is fine with me. I'll just resubmit the patch.
Olaf
--
Olaf Kirch | --- o
On Tue, Jul 17, 2007 at 07:46:39AM +0200, Jarek Poplawski wrote:
...
> > static void net_rx_action(struct softirq_action *h)
> > {
> > struct softnet_data *queue = &__get_cpu_var(softnet_data);
> > unsigned long start_time = jiffies;
> > int budget = netdev_budget;
> >
On 16-07-2007 11:12, Ingo Molnar wrote:
> current -git broke my main testbox. No TCP/IP networking to/from the box
> and e1000 would time out in xmit:
>
> NETDEV WATCHDOG: eth0: transmit timed out
> e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
...
Olaf, I think this error can trigger
On Mon, 16 Jul 2007, Matt Mackall wrote:
>
> Unfortunately the particular patch from Olaf is presumably covering up
> another bug that other people (including Olaf) had hit. So reverting
> it is going to introduce a different regression.
It's not a regression, it's an old problem.
And the rule
On Mon, Jul 16, 2007 at 03:29:15PM -0700, Linus Torvalds wrote:
>
>
> On Mon, 16 Jul 2007, David Miller wrote:
> >
> > Ingo is the only person hitting and reporting this and last time I
> > checked he is competent enough to revert the thing locally in his own
> > trees, right? :-)
>
> Umm. And
From: Ingo Molnar <[EMAIL PROTECTED]>
Date: Tue, 17 Jul 2007 00:37:18 +0200
> I think if you leaned back and thought it through, and if you applied
> this scenario to a bad scheduler commit from me that broke your box,
> you'd readily agree with me =B-) (which scenario is purely hypothetical,
>
From: Linus Torvalds <[EMAIL PROTECTED]>
Date: Mon, 16 Jul 2007 15:29:15 -0700 (PDT)
> If we knew something was wrong before the -rc1 release, all the better: we
> can avoid having that bug in -rc1, and the people who test it will tell us
> about the problems we did *not* know about.
>
> In con
* David Miller <[EMAIL PROTECTED]> wrote:
> From: Ingo Molnar <[EMAIL PROTECTED]>
> Date: Mon, 16 Jul 2007 23:51:17 +0200
>
> > i also offered to quickly try any test-version of the fixed patch, so
> > there's a real and deterministic path towards fixing the patch. The
> > regression is obviou
On Mon, 16 Jul 2007, David Miller wrote:
>
> Ingo is the only person hitting and reporting this and last time I
> checked he is competent enough to revert the thing locally in his own
> trees, right? :-)
Umm. And your suggestion is what? Wait until -rc1, when non-developers
(the kinds of peopl
From: Ingo Molnar <[EMAIL PROTECTED]>
Date: Mon, 16 Jul 2007 23:09:24 +0200
> so ... i can promise to test whatever new version of the patch Olaf
> sends me (the problem is easy to reproduce and easy to test, so i can
> check it all in a heartbeat), so to get things back on track, and to
> valu
From: Linus Torvalds <[EMAIL PROTECTED]>
Date: Mon, 16 Jul 2007 14:40:38 -0700 (PDT)
> If we don't know what caused a problem in the first place, or if the fix
> is known to be required for something else and reverting it would cause
> *another* regression, it would be another issue. But as it i
From: Ingo Molnar <[EMAIL PROTECTED]>
Date: Mon, 16 Jul 2007 23:51:17 +0200
> i also offered to quickly try any test-version of the fixed patch, so
> there's a real and deterministic path towards fixing the patch. The
> regression is obvious and triggers all the time.
For you.
* Linus Torvalds <[EMAIL PROTECTED]> wrote:
> With MSI, edge-triggered interrupts are making a comeback in a big
> way, and yeah, e1000 is one of the drivers that do MSI. Ingo might
> want to confirm whether it's actually enabled for him, and whether
> turning it off might hide the
On Mon, 16 Jul 2007, David Miller wrote:
>
> Well, let's figure out why before we revert because it
> is attempting to fix a legitimate bug.
I'm reverting it. I don't think there is any excuse for not reverting
something that provably breaks somebody's machine. I don't want this to be
on the
* David Miller <[EMAIL PROTECTED]> wrote:
> From: Ingo Molnar <[EMAIL PROTECTED]>
> Date: Mon, 16 Jul 2007 11:12:36 +0200
>
> > Applying the revert patch below makes it work again.
>
> Well, let's figure out why before we revert because it is attempting
> to fix a legitimate bug.
yeah, no dou
* Olaf Kirch <[EMAIL PROTECTED]> wrote:
> On Monday 16 July 2007 13:26, David Miller wrote:
> > Well, let's figure out why before we revert because it
> > is attempting to fix a legitimate bug.
> >
> > Olaf, any ideas?
>
> It seems as if the card is stuck in NAPI mode without being serviced
>
On Monday 16 July 2007 13:26, David Miller wrote:
> Well, let's figure out why before we revert because it
> is attempting to fix a legitimate bug.
>
> Olaf, any ideas?
It seems as if the card is stuck in NAPI mode without being
serviced by net_rx_action.
Ingo, is this a UP or SMP machine? Are y
From: Ingo Molnar <[EMAIL PROTECTED]>
Date: Mon, 16 Jul 2007 11:12:36 +0200
> Applying the revert patch below makes it work again.
Well, let's figure out why before we revert because it
is attempting to fix a legitimate bug.
Olaf, any ideas?
On Monday 16 July 2007 11:12, Ingo Molnar wrote:
> After a bisection session the bad commit turned out to be:
>
> 29578624e354f56143d92510fff33a8b2aaa2c03 is first bad commit
> commit 29578624e354f56143d92510fff33a8b2aaa2c03
> Author: Olaf Kirch <[EMAIL PROTECTED]>
> Date: Wed Jul 11 19:32:0
revert patch below
makes it work again.
Ingo
---------->
Subject: [patch] revert: [NET]: Fix races in net_rx_action vs netpoll
From: Ingo Molnar <[EMAIL PROTECTED]>
commit 29578624 causes netconsole failures:
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0: e1000_c