Re: intermittant petabyte usage reported with broadcom nic

2007-05-21 Thread Michael Chan
On Mon, 2 Apr 2007 11:43:19 +1000 CaT <[EMAIL PROTECTED]> wrote:
>  
> I take minute by minute snapshots of network traffic by sampling
> /proc/net/dev and most of the time everything works fine. Occasionally
> though I get petabyte byte traffic and corresponding packet traffic.

We were able to reproduce the problem and confirmed that it was a DMA
problem of the statistics block.  About once an hour on average, wrong
counter values will be DMA'ed to host memory.  Luckily, the DMA write
stays within the intended address range so it will not corrupt other
parts of memory.  Other types of DMA including traffic and buffer
descriptors are not affected.

If you happen to be reading /proc/net/dev within a second after the DMA
corruption, you'll see bogus counters.  One second later and until the
next bad DMA, the counters will be normal again.
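
A monitoring script can sidestep the bogus readings on its own side. The
following is a minimal sketch, not anything from the bnx2 driver: the
function name, the link speed and the sampling interval are assumptions
made for illustration. It rejects a sample whose byte delta exceeds what
the link could have carried, so the caller can re-read /proc/net/dev a
couple of seconds later, when the counters should be normal again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of bnx2 or any existing tool.
 * Decide whether a byte-counter sample is believable given the previous
 * sample, the sampling interval and the link speed.
 */
bool sample_is_plausible(uint64_t prev_bytes, uint64_t cur_bytes,
                         uint64_t interval_sec, uint64_t link_bits_per_sec)
{
    assert(interval_sec > 0);

    /* A counter that runs backwards is either a reset or corruption. */
    if (cur_bytes < prev_bytes)
        return false;

    /* Upper bound on the bytes the link can carry in the interval. */
    uint64_t max_bytes = (link_bits_per_sec / 8) * interval_sec;

    return (cur_bytes - prev_bytes) <= max_bytes;
}
```

With CaT's eth0 figures and an assumed 1 Gbit/s link, the jump from
17227166357 to 220898233988841368 bytes in one minute is rejected, while
17227166357 to 17227454818 over two minutes passes. A genuine counter
reset also fails the check, so a real collector would want to handle that
case separately.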

We are considering ways to work around the problem.  Thanks.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: intermittant petabyte usage reported with broadcom nic

2007-04-17 Thread Roland Dreier
I actually have a couple of Dell 1950 systems with bnx2 NICs too,
which I use for kernel development (ie one more crash is fine :)

If someone can give me an idea for what kind of load to use, I can try
this patch out to see if it triggers.

 - R.


Re: intermittant petabyte usage reported with broadcom nic

2007-04-17 Thread Jean-Daniel Pauget
On Tue, Apr 17, 2007 at 09:43:48AM +1000, CaT wrote:
> On Mon, Apr 16, 2007 at 12:10:51PM -0700, Michael Chan wrote:
> > On Sat, 2007-04-14 at 17:20 -0700, Michael Chan wrote:
> > 
> > Here's the debug patch for x86 only that will change the statistics
> > memory block to read-only.  If the kernel is corrupting it, you should
> > get a page fault that will crash the system.  If you continue to see
> > bogus counters, it is definitely a firmware or hardware problem.  Please
> > try it and let me know.  Thanks.
[.../...]
> Perhaps Jean-Daniel, who is also experiencing this problem and seemingly
> more frequently than I, has a box that he could run your patch on. I
> think we both run pretty-much the same hardware (Dell [12]950s).
Dell 1950/2950 indeed...

if there is any way to catch that write without crashing the system
(even at the price of some slowness) I can test it. if not, I can't,
because all my available targets are remotely administered and involved
in production processes.
if luckily one of them gets free, I'll try to apply the latest patch you
provide. I may also try it one day when I'm close to those machines, so
keep me on the list for up-to-date patches.

-- 
Jean-Daniel Pauget

Tél: +33 (0)2 33 17 20 16
2, rue André PELCA
50580 Denneville-Plage
France



Re: intermittant petabyte usage reported with broadcom nic

2007-04-16 Thread CaT
On Mon, Apr 16, 2007 at 12:10:51PM -0700, Michael Chan wrote:
> On Sat, 2007-04-14 at 17:20 -0700, Michael Chan wrote:
> 
> > I also like Andi's idea of using change_page_attr() to isolate the
> > problem.  I'll try to send you a debug patch in the next few days to try
> > that out.  Thanks.
> 
> Here's the debug patch for x86 only that will change the statistics
> memory block to read-only.  If the kernel is corrupting it, you should
> get a page fault that will crash the system.  If you continue to see
> bogus counters, it is definitely a firmware or hardware problem.  Please
> try it and let me know.  Thanks.

Ahh. Would truly love to but the moment you said 'crash the system' I
had to bail. These boxes are in production and as such a crash would be,
shall we say, unwelcome. I might be able to finagle something but I
very-much doubt it.

Perhaps Jean-Daniel, who is also experiencing this problem and seemingly
more frequently than I, has a box that he could run your patch on. I
think we both run pretty-much the same hardware (Dell [12]950s). I've
CCed him.

-- 
"To the extent that we overreact, we proffer the terrorists the
greatest tribute."
- High Court Judge Michael Kirby


Re: intermittant petabyte usage reported with broadcom nic

2007-04-16 Thread Michael Chan
On Sat, 2007-04-14 at 17:20 -0700, Michael Chan wrote:

> I also like Andi's idea of using change_page_attr() to isolate the
> problem.  I'll try to send you a debug patch in the next few days to try
> that out.  Thanks.
> 
Here's the debug patch for x86 only that will change the statistics
memory block to read-only.  If the kernel is corrupting it, you should
get a page fault that will crash the system.  If you continue to see
bogus counters, it is definitely a firmware or hardware problem.  Please
try it and let me know.  Thanks.

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 0b7aded..b7d491b 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -47,6 +47,7 @@
 #include <linux/prefetch.h>
 #include <linux/cache.h>
 #include <linux/zlib.h>
+#include <asm/cacheflush.h>
 
 #include "bnx2.h"
 #include "bnx2_fw.h"
@@ -436,6 +437,8 @@ bnx2_free_mem(struct bnx2 *bp)
}
}
if (bp->status_blk) {
+   change_page_attr(virt_to_page(bp->status_blk), 1, PAGE_KERNEL);
+   global_flush_tlb();
pci_free_consistent(bp->pdev, bp->status_stats_size,
bp->status_blk, bp->status_blk_mapping);
bp->status_blk = NULL;
@@ -501,6 +504,7 @@ bnx2_alloc_mem(struct bnx2 *bp)
bp->status_stats_size = status_blk_size +
sizeof(struct statistics_block);
 
+   bp->status_stats_size = PAGE_SIZE;
bp->status_blk = pci_alloc_consistent(bp->pdev, bp->status_stats_size,
  &bp->status_blk_mapping);
if (bp->status_blk == NULL)
@@ -508,6 +512,10 @@ bnx2_alloc_mem(struct bnx2 *bp)
 
memset(bp->status_blk, 0, bp->status_stats_size);
 
+   /* x86 debug code to see if the kernel is corrupting the statistics */
+   change_page_attr(virt_to_page(bp->status_blk), 1, PAGE_KERNEL_RO);
+   global_flush_tlb();
+
bp->stats_blk = (void *) ((unsigned long) bp->status_blk +
  status_blk_size);
 
@@ -4307,7 +4315,9 @@ bnx2_timer(unsigned long data)
msg = (u32) ++bp->fw_drv_pulse_wr_seq;
REG_WR_IND(bp, bp->shmem_base + BNX2_DRV_PULSE_MB, msg);
 
+#if 0
bp->stats_blk->stat_FwRxDrop = REG_RD_IND(bp, BNX2_FW_RX_DROP_COUNT);
+#endif
 
if (bp->phy_flags & PHY_SERDES_FLAG) {
if (CHIP_NUM(bp) == CHIP_NUM_5706)




Re: intermittant petabyte usage reported with broadcom nic

2007-04-14 Thread Michael Chan
On Mon, 2007-04-02 at 17:41 +1000, CaT wrote:
> On Mon, Apr 02, 2007 at 12:13:00AM -0700, Andrew Morton wrote:
> > On Mon, 2 Apr 2007 11:43:19 +1000 CaT <[EMAIL PROTECTED]> wrote:
> > 
> > > I take minute by minute snapshots of network traffic by sampling
> > > /proc/net/dev and most of the time everything works fine. Occasionally
> > > though I get petabyte byte traffic and corresponding packet traffic.
> > 
> > How frequently?
> 
> I can count about 6 over the past month.
> 
I did a quick test on a 64-bit kernel and did not see any problem with
the counters.  I'll ask the lab to set up a longer term test and monitor
the counters for bogus values.

I also like Andi's idea of using change_page_attr() to isolate the
problem.  I'll try to send you a debug patch in the next few days to try
that out.  Thanks.



Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread Andi Kleen
Roland Dreier <[EMAIL PROTECTED]> writes:

> [Adding Michael Chan, who seems to look after bnx2, to the cc list]
> 
>  > To clarify it's an Intel Dual Core Xeon (I just wound up as thinking of
>  > them all as amd64s). Network card driver in use is the one defined by
>  > CONFIG_BNX2. Kernel's monolithic.
> 
> From a quick look at bnx2.c, it seems that the driver gives the NIC
> (firmware?) a block of memory to DMA stats into, and just reads from
> that memory in its get_stats method.  So if you're seeing wonky stats
> from the NIC intermittently, my best guess would be that firmware is
> occasionally writing junk into the stats block.

When only the firmware is writing to that area it could be put
into an own page and then write protected with change_page_attr()
That would catch any corruption coming from the rest of the kernel.

-Andi


Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread Roland Dreier
[Adding Michael Chan, who seems to look after bnx2, to the cc list]

 > To clarify it's an Intel Dual Core Xeon (I just wound up as thinking of
 > them all as amd64s). Network card driver in use is the one defined by
 > CONFIG_BNX2. Kernel's monolithic.

From a quick look at bnx2.c, it seems that the driver gives the NIC
(firmware?) a block of memory to DMA stats into, and just reads from
that memory in its get_stats method.  So if you're seeing wonky stats
from the NIC intermittently, my best guess would be that firmware is
occasionally writing junk into the stats block.

 - R.


Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread CaT
On Thu, Apr 12, 2007 at 04:18:24PM -0700, Roland Dreier wrote:
> > > Apr 11 22:14:02 '  eth0:220898233988841368 66750274000 0  
> > > 0  86458738 52386430545 101089219 19931300 0  199313  
> > > 0 '
> 
> > > Apr 11 22:15:02 '  eth0:17227454818 81381144000 0 
> > >  0 0 33091307388 86658381000 0   0  0 
> > > '
> 
> > But in fact I think you're saying that the numbers go bad, and then stay 
> > bad.
> 
> Doesn't look like it -- one minute after the first hiccup the eth0 #s
> look reasonable again.

Yeah. Sorry for not making it clear. I included good values on either
side of the bad one.

-- 
"To the extent that we overreact, we proffer the terrorists the
greatest tribute."
- High Court Judge Michael Kirby


Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread Roland Dreier
 > > Apr 11 22:14:02 '  eth0:220898233988841368 66750274000 0   
 > >0  86458738 52386430545 101089219 19931300 0  199313
 > >   0 '

 > > Apr 11 22:15:02 '  eth0:17227454818 81381144000 0  
 > > 0 0 33091307388 86658381000 0   0  0 '

 > But in fact I think you're saying that the numbers go bad, and then stay bad.

Doesn't look like it -- one minute after the first hiccup the eth0 #s
look reasonable again.

 - R.


Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread Roland Dreier
 > Apr  9 06:19:04 '  eth0:14250798570591813804 2284720007938 1863800 
 > 18638  0  27375938 1556640980159 3345714490000 0 
 >   0  0 '

One odd thing is that the crazy number 14250798570591813804 is
c5c501cbc5c500ac in hex.  I dunno what the significance of the 0xc5 bit
pattern is though...

The other line has 220898233988841368, which is 0x310c9c6006a7f98, not
nearly so regular a pattern.

I don't think I'm helping much...
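
The conversions above are easy to reproduce. A small sketch in C (the
helper names are made up for this note) that formats a counter as hex and
counts how often a given byte value occurs in it:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format a 64-bit counter as 16 hex digits; buf must hold 17 bytes. */
void counter_to_hex(uint64_t value, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen >= 17);
    snprintf(buf, buflen, "%016" PRIx64, value);
}

/* Count how many of the eight bytes in value equal the given byte. */
int count_matching_bytes(uint64_t value, uint8_t byte)
{
    int matches = 0;

    for (int i = 0; i < 8; i++) {
        if (((value >> (8 * i)) & 0xff) == byte)
            matches++;
    }
    return matches;
}
```

For 14250798570591813804 this gives c5c501cbc5c500ac with four 0xc5
bytes; for 220898233988841368 it gives 0310c9c6006a7f98 with none.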


Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread Andrew Morton
On Fri, 13 Apr 2007 08:52:49 +1000
CaT <[EMAIL PROTECTED]> wrote:

> On Mon, Apr 02, 2007 at 12:13:00AM -0700, Andrew Morton wrote:
> > On Mon, 2 Apr 2007 11:43:19 +1000 CaT <[EMAIL PROTECTED]> wrote:
> > 
> > > I take minute by minute snapshots of network traffic by sampling
> > > /proc/net/dev and most of the time everything works fine. Occasionally
> > > though I get petabyte byte traffic and corresponding packet traffic.
> > 
> > How frequently?
> > 
> > Are you able to provide some actual numbers (expected and actual values),
> > so we can look at the bit patterns?
> 
> I have some now. These are raw lines from /proc/net/dev. In this case it's
> eth0 at 22:14 that chucked a wee wibbly.
> 
> Apr 11 22:13:02 '  eth0:17227166357 81379716000 0  0  
>0 33090495625 86656584000 0   0  0 '
> Apr 11 22:13:02 '  eth1:30708022097 91219466000 0  0  
>0 122989582024 125073786000 0   0  0 '
> Apr 11 22:14:02 '  eth0:220898233988841368 66750274000 0  
> 0  86458738 52386430545 101089219 19931300 0  199313  
> 0 '

0x310_c9c6_006a_7f98

Not sure what to make of that.

> Apr 11 22:14:02 '  eth1:30708307787 91220183000 0  0  
>0 122989665004 125074344000 0   0  0 '
> Apr 11 22:15:02 '  eth0:17227454818 81381144000 0  0  
>0 33091307388 86658381000 0   0  0 '
> Apr 11 22:15:02 '  eth1:30708569308 91220742000 0  0  
>0 122989732601 125074712000 0   0  0 '
> 
> On another server (same hardware except for 2ru case, more ram and more hds):
> 
> Apr  9 06:18:05 '  eth0:1556640056941 3598105481000 0 
>  0 0 2281147324747 3318270401000 0   0  0 
> '
> Apr  9 06:18:05 '  eth1:912389249044 1190286687000 0  
> 0 0 642943095469 991257887000 0   0  0 '
> Apr  9 06:19:04 '  eth0:14250798570591813804 2284720007938 1863800 
> 18638  0  27375938 1556640980159 3345714490000 0  
>  0  0 '

0xc5c5_01cb_c5c5_00ac and 0x213_f3ec_ab02

The first one looks like trashed memory: it got overwritten by kernel
addresses.  Except they're x86-32 kernel addresses, and you're running
x86_64 64-bit kernel.  hm.

I don't see any pattern here.
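
One way to read the "x86-32 kernel addresses" remark: on a 32-bit x86
kernel with the usual 3G/1G split, kernel virtual addresses start at
0xc0000000, and both 32-bit halves of the first value lie above that. A
sketch of the check, assuming that default PAGE_OFFSET (the macro and
function names are illustrative only):

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed default PAGE_OFFSET for the x86-32 3G/1G split. */
#define X86_32_PAGE_OFFSET 0xc0000000u

/* True if both 32-bit halves of value fall in the x86-32 kernel range. */
bool halves_look_like_x86_32_kernel_addrs(uint64_t value)
{
    uint32_t hi = (uint32_t)(value >> 32);
    uint32_t lo = (uint32_t)(value & 0xffffffffu);

    return hi >= X86_32_PAGE_OFFSET && lo >= X86_32_PAGE_OFFSET;
}
```

Both halves of 0xc5c501cbc5c500ac pass the test; 0x0310c9c6006a7f98 does
not.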

> Apr  9 06:19:04 '  eth1:912389281939 1190287072000 0  
> 0 0 642943219035 991258183000 0   0  0 '
> Apr  9 06:20:05 '  eth0:1556643514710 3598121584000 0 
>  0 0 2281154391794 3318284878000 0   0  0 
> '
> Apr  9 06:20:05 '  eth1:912389305767 1190287354000 0  
> 0 0 642943273879 991258351000 0   0  0 '
> 
> > > This happens on an AMD64, dual core smp box with Broadcom NetXtreme II
> > > nics.
> > 
> > What driver drives that?  b44.c?
> 
> To clarify it's an Intel Dual Core Xeon (I just wound up as thinking of
> them all as amd64s). Network card driver in use is the one defined by
> CONFIG_BNX2. Kernel's monolithic.
> 
> > We do perform racy 64-bit updates of some of the stats counters.  But
> > that'll only affect 32-bit kernels and I'm assuming you're running a 64-bit
> > kernel on that AMD64 box (are you?)
> 
> Yes. With 32bit compat for executables built in.

OK.  I was earlier assuming that you were seeing transient funny numbers. 
But in fact I think you're saying that the numbers go bad, and then stay
bad.



Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread CaT
On Mon, Apr 02, 2007 at 12:13:00AM -0700, Andrew Morton wrote:
> On Mon, 2 Apr 2007 11:43:19 +1000 CaT <[EMAIL PROTECTED]> wrote:
> 
> > I take minute by minute snapshots of network traffic by sampling
> > /proc/net/dev and most of the time everything works fine. Occasionally
> > though I get petabyte byte traffic and corresponding packet traffic.
> 
> How frequently?
> 
> Are you able to provide some actual numbers (expected and actual values),
> so we can look at the bit patterns?

I have some now. These are raw lines from /proc/net/dev. In this case it's
eth0 at 22:14 that chucked a wee wibbly.

Apr 11 22:13:02 '  eth0:17227166357 81379716000 0  0
 0 33090495625 86656584000 0   0  0 '
Apr 11 22:13:02 '  eth1:30708022097 91219466000 0  0
 0 122989582024 125073786000 0   0  0 '
Apr 11 22:14:02 '  eth0:220898233988841368 66750274000 0
  0  86458738 52386430545 101089219 19931300 0  199313  0 '
Apr 11 22:14:02 '  eth1:30708307787 91220183000 0  0
 0 122989665004 125074344000 0   0  0 '
Apr 11 22:15:02 '  eth0:17227454818 81381144000 0  0
 0 33091307388 86658381000 0   0  0 '
Apr 11 22:15:02 '  eth1:30708569308 91220742000 0  0
 0 122989732601 125074712000 0   0  0 '

On another server (same hardware except for 2ru case, more ram and more hds):

Apr  9 06:18:05 '  eth0:1556640056941 3598105481000 0  
0 0 2281147324747 3318270401000 0   0  0 '
Apr  9 06:18:05 '  eth1:912389249044 1190286687000 0  0 
0 642943095469 991257887000 0   0  0 '
Apr  9 06:19:04 '  eth0:14250798570591813804 2284720007938 1863800 
18638  0  27375938 1556640980159 3345714490000 0   
0  0 '
Apr  9 06:19:04 '  eth1:912389281939 1190287072000 0  0 
0 642943219035 991258183000 0   0  0 '
Apr  9 06:20:05 '  eth0:1556643514710 3598121584000 0  
0 0 2281154391794 3318284878000 0   0  0 '
Apr  9 06:20:05 '  eth1:912389305767 1190287354000 0  0 
0 642943273879 991258351000 0   0  0 '
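
For anyone wanting to repeat this kind of sampling, here is a rough
sketch of pulling the byte and packet counters out of one /proc/net/dev
line. It relies on the usual column order (receive: bytes, packets, errs,
drop, fifo, frame, compressed, multicast; then transmit: bytes, packets,
and so on). The struct and function names are invented for the example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct netdev_sample {
    char     ifname[32];
    uint64_t rx_bytes, rx_packets;
    uint64_t tx_bytes, tx_packets;
};

/*
 * Parse one interface line from /proc/net/dev.
 * Returns 0 on success, -1 if the line does not match the expected layout.
 */
int parse_netdev_line(const char *line, struct netdev_sample *out)
{
    assert(line != NULL && out != NULL);

    const char *colon = strchr(line, ':');
    if (colon == NULL)
        return -1;

    /* Interface name: skip leading blanks, copy up to the colon. */
    while (*line == ' ' || *line == '\t')
        line++;
    size_t len = (size_t)(colon - line);
    if (len == 0 || len >= sizeof out->ifname)
        return -1;
    memcpy(out->ifname, line, len);
    out->ifname[len] = '\0';

    /* rx bytes, rx packets, six skipped rx fields, tx bytes, tx packets. */
    int n = sscanf(colon + 1,
                   "%" SCNu64 " %" SCNu64
                   " %*s %*s %*s %*s %*s %*s"
                   " %" SCNu64 " %" SCNu64,
                   &out->rx_bytes, &out->rx_packets,
                   &out->tx_bytes, &out->tx_packets);
    return n == 4 ? 0 : -1;
}
```

The two header lines of /proc/net/dev contain no colon and are rejected
with -1, so a caller can simply feed every line of the file to the
function and keep the ones that return 0.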

> > This happens on an AMD64, dual core smp box with Broadcom NetXtreme II
> > nics.
> 
> What driver drives that?  b44.c?

To clarify it's an Intel Dual Core Xeon (I just wound up as thinking of
them all as amd64s). Network card driver in use is the one defined by
CONFIG_BNX2. Kernel's monolithic.

> We do perform racy 64-bit updates of some of the stats counters.  But
> that'll only affect 32-bit kernels and I'm assuming you're running a 64-bit
> kernel on that AMD64 box (are you?)

Yes. With 32bit compat for executables built in.

-- 
"To the extent that we overreact, we proffer the terrorists the
greatest tribute."
- High Court Judge Michael Kirby


Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread CaT
On Mon, Apr 02, 2007 at 12:13:00AM -0700, Andrew Morton wrote:
 On Mon, 2 Apr 2007 11:43:19 +1000 CaT [EMAIL PROTECTED] wrote:
 
  I take minute by minute snapshots of network traffic by sampling
  /proc/net/dev and most of the time everything works fine. Occasionally
  though I get petabyte byte traffic and corresponding packet traffic.
 
 How frequently?
 
 Are you able to provide some actual numbers (expected and actual values),
 so we can look at the bit patterns?

I have some now. These are raw lines from /proc/net/dev. In this case it's
eth0 at 22:14 that chucked a wee wibbly.

Apr 11 22:13:02 '  eth0:17227166357 81379716000 0  0
 0 33090495625 86656584000 0   0  0 '
Apr 11 22:13:02 '  eth1:30708022097 91219466000 0  0
 0 122989582024 125073786000 0   0  0 '
Apr 11 22:14:02 '  eth0:220898233988841368 66750274000 0
  0  86458738 52386430545 101089219 19931300 0  199313  0 '
Apr 11 22:14:02 '  eth1:30708307787 91220183000 0  0
 0 122989665004 125074344000 0   0  0 '
Apr 11 22:15:02 '  eth0:17227454818 81381144000 0  0
 0 33091307388 86658381000 0   0  0 '
Apr 11 22:15:02 '  eth1:30708569308 91220742000 0  0
 0 122989732601 125074712000 0   0  0 '

On another server (same hardware except for 2ru case, more ram and more hds):

Apr  9 06:18:05 '  eth0:1556640056941 3598105481000 0  
0 0 2281147324747 3318270401000 0   0  0 '
Apr  9 06:18:05 '  eth1:912389249044 1190286687000 0  0 
0 642943095469 991257887000 0   0  0 '
Apr  9 06:19:04 '  eth0:14250798570591813804 2284720007938 1863800 
18638  0  27375938 1556640980159 3345714490000 0   
0  0 '
Apr  9 06:19:04 '  eth1:912389281939 1190287072000 0  0 
0 642943219035 991258183000 0   0  0 '
Apr  9 06:20:05 '  eth0:1556643514710 3598121584000 0  
0 0 2281154391794 3318284878000 0   0  0 '
Apr  9 06:20:05 '  eth1:912389305767 1190287354000 0  0 
0 642943273879 991258351000 0   0  0 '

  This happens on an AMD64, dual core smp box with Broadcom NetXtreme II
  nics.
 
 What driver drivers that?  b44.c?

To clarify it's an Intel Dual Core Xeon (I just wound up as thinking of
them all as amd64s). Network card driver in use is the one defined by
CONFIG_BNX2. Kernel's monolithic.

 We do perform racy 64-bit updates of some of the stats counters.  But
 that'll only affect 32-bit kernels and I'm assuming you're running a 64-bit
 kernel on that AMD64 box (are you?)

Yes. With 32bit compat for executables built in.

-- 
To the extent that we overreact, we proffer the terrorists the
greatest tribute.
- High Court Judge Michael Kirby
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread Andrew Morton
On Fri, 13 Apr 2007 08:52:49 +1000
CaT [EMAIL PROTECTED] wrote:

 On Mon, Apr 02, 2007 at 12:13:00AM -0700, Andrew Morton wrote:
  On Mon, 2 Apr 2007 11:43:19 +1000 CaT [EMAIL PROTECTED] wrote:
  
   I take minute by minute snapshots of network traffic by sampling
   /proc/net/dev and most of the time everything works fine. Occasionally
   though I get petabyte byte traffic and corresponding packet traffic.
  
  How frequently?
  
  Are you able to provide some actual numbers (expected and actual values),
  so we can look at the bit patterns?
 
 I have some now. These are raw lines from /proc/net/dev. In this case it's
 eth0 at 22:14 that chucked a wee wibbly.
 
 Apr 11 22:13:02 '  eth0:17227166357 81379716000 0  0  
0 33090495625 86656584000 0   0  0 '
 Apr 11 22:13:02 '  eth1:30708022097 91219466000 0  0  
0 122989582024 125073786000 0   0  0 '
 Apr 11 22:14:02 '  eth0:220898233988841368 66750274000 0  
 0  86458738 52386430545 101089219 19931300 0  199313  
 0 '

0x310_c9c6_006a_7f98

Not sure what to make of that.

> Apr 11 22:14:02 '  eth1:30708307787 91220183000 0 0 0 122989665004 125074344000 0 0 0 '
> Apr 11 22:15:02 '  eth0:17227454818 81381144000 0 0 0 33091307388 86658381000 0 0 0 '
> Apr 11 22:15:02 '  eth1:30708569308 91220742000 0 0 0 122989732601 125074712000 0 0 0 '
>
> On another server (same hardware except for 2ru case, more ram and more hds):
>
> Apr  9 06:18:05 '  eth0:1556640056941 3598105481000 0 0 0 2281147324747 3318270401000 0 0 0 '
> Apr  9 06:18:05 '  eth1:912389249044 1190286687000 0 0 0 642943095469 991257887000 0 0 0 '
> Apr  9 06:19:04 '  eth0:14250798570591813804 2284720007938 1863800 18638 0 27375938 1556640980159 3345714490000 0 0 0 '

0xc5c5_01cb_c5c5_00ac and 0x213_f3ec_ab02

The first one looks like trashed memory: it got overwritten by kernel
addresses.  Except they're x86-32 kernel addresses, and you're running
x86_64 64-bit kernel.  hm.

I don't see any pattern here.

> Apr  9 06:19:04 '  eth1:912389281939 1190287072000 0 0 0 642943219035 991258183000 0 0 0 '
> Apr  9 06:20:05 '  eth0:1556643514710 3598121584000 0 0 0 2281154391794 3318284878000 0 0 0 '
> Apr  9 06:20:05 '  eth1:912389305767 1190287354000 0 0 0 642943273879 991258351000 0 0 0 '
>
> > > This happens on an AMD64, dual core smp box with Broadcom NetXtreme II
> > > nics.
> >
> > What driver drives that?  b44.c?
>
> To clarify it's an Intel Dual Core Xeon (I just wound up thinking of
> them all as amd64s). Network card driver in use is the one defined by
> CONFIG_BNX2. Kernel's monolithic.
>
> > We do perform racy 64-bit updates of some of the stats counters.  But
> > that'll only affect 32-bit kernels and I'm assuming you're running a 64-bit
> > kernel on that AMD64 box (are you?)
>
> Yes. With 32bit compat for executables built in.

OK.  I was earlier assuming that you were seeing transient funny numbers. 
But in fact I think you're saying that the numbers go bad, and then stay
bad.



Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread Roland Dreier
> Apr  9 06:19:04 '  eth0:14250798570591813804 2284720007938 1863800 18638 0 27375938 1556640980159 3345714490000 0 0 0 '

One odd thing is that the crazy number 14250798570591813804 is
c5c501cbc5c500ac in hex.  I dunno what the significance of the 0xc5 bit
pattern is though...

The other line has 220898233988841368, which is 0x310c9c6006a7f98, not
nearly so regular a pattern.

I don't think I'm helping much...
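
For anyone who wants to repeat the conversion, here is a minimal Python sketch. The helper name split_words is made up for illustration; the three values are the bogus counters from the /proc/net/dev lines quoted in this thread. It only prints each value in hex and splits it into 32-bit halves so the bit patterns are easier to compare; it does not explain them.

```python
# Bogus counter values copied from the /proc/net/dev lines quoted in this thread.
BOGUS = [14250798570591813804, 220898233988841368, 2284720007938]


def split_words(value):
    """Return the (high, low) 32-bit halves of a 64-bit counter value."""
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF


for value in BOGUS:
    high, low = split_words(value)
    # Zero-padded hex makes repeated byte patterns such as 0xc5c5 stand out.
    print(f"{value:>20d} = 0x{value:016x}  high=0x{high:08x}  low=0x{low:08x}")
```

The first value splits into 0xc5c501cb and 0xc5c500ac, which is the regular-looking pattern noted above; the other two show no such repetition.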


Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread Roland Dreier
> > Apr 11 22:14:02 '  eth0:220898233988841368 66750274000 0 0 86458738 52386430545 101089219 19931300 0 199313 0 '
>
> > Apr 11 22:15:02 '  eth0:17227454818 81381144000 0 0 0 33091307388 86658381000 0 0 0 '
>
> But in fact I think you're saying that the numbers go bad, and then stay bad.

Doesn't look like it -- one minute after the first hiccup the eth0 #s
look reasonable again.

 - R.


Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread CaT
On Thu, Apr 12, 2007 at 04:18:24PM -0700, Roland Dreier wrote:
> > > Apr 11 22:14:02 '  eth0:220898233988841368 66750274000 0 0 86458738 52386430545 101089219 19931300 0 199313 0 '
>
> > > Apr 11 22:15:02 '  eth0:17227454818 81381144000 0 0 0 33091307388 86658381000 0 0 0 '
>
> > But in fact I think you're saying that the numbers go bad, and then stay
> > bad.
>
> Doesn't look like it -- one minute after the first hiccup the eth0 #s
> look reasonable again.

Yeah. Sorry for not making it clear. I included good values on either
side of the bad one.

-- 
"To the extent that we overreact, we proffer the terrorists the
greatest tribute."
- High Court Judge Michael Kirby


Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread Roland Dreier
[Adding Michael Chan, who seems to look after bnx2, to the cc list]

> To clarify it's an Intel Dual Core Xeon (I just wound up as thinking of
> them all as amd64s). Network card driver in use is the one defined by
> CONFIG_BNX2. Kernel's monolithic.

From a quick look at bnx2.c, it seems that the driver gives the NIC
(firmware?) a block of memory to DMA stats into, and just reads from
that memory in its get_stats method.  So if you're seeing wonky stats
from the NIC intermittently, my best guess would be that firmware is
occasionally writing junk into the stats block.

 - R.


Re: intermittant petabyte usage reported with broadcom nic

2007-04-12 Thread Andi Kleen
Roland Dreier <[EMAIL PROTECTED]> writes:

> [Adding Michael Chan, who seems to look after bnx2, to the cc list]
>
> > To clarify it's an Intel Dual Core Xeon (I just wound up as thinking of
> > them all as amd64s). Network card driver in use is the one defined by
> > CONFIG_BNX2. Kernel's monolithic.
>
> From a quick look at bnx2.c, it seems that the driver gives the NIC
> (firmware?) a block of memory to DMA stats into, and just reads from
> that memory in its get_stats method.  So if you're seeing wonky stats
> from the NIC intermittently, my best guess would be that firmware is
> occasionally writing junk into the stats block.

If only the firmware writes to that area, it could be put into
its own page and then write-protected with change_page_attr().
That would catch any corruption coming from the rest of the kernel.

-Andi


Re: intermittant petabyte usage reported with broadcom nic

2007-04-02 Thread Jean-Daniel Pauget
I don't know if a me-too may help you, but I have exactly the same
trouble on a whole set of Dell servers, all with bnx2 drivers (SUSE 10.1
kernel), with values fetched by a homebrew daemon and collected via rrd.

> uname -a
Linux toronto 2.6.16.27-0.6-smp #1 SMP Wed Dec 13 09:34:50 UTC 2006 x86_64 x86_64 x86_64 GNU/Linux

.../...
<6>Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.4.31 (January 19, 2006)
<6>ACPI: PCI Interrupt :09:00.0[A] -> GSI 16 (level, low) -> IRQ 169
<6>usbcore: registered new driver hub
<6>eth0: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem f400, IRQ 169, node addr 0015c5f18146
<6>ACPI: PCI Interrupt :05:00.0[A] -> GSI 16 (level, low) -> IRQ 169
<6>eth1: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem f800, IRQ 169, node addr 0015c5f18144
.../...


On Mon, Apr 02, 2007 at 05:41:08PM +1000, CaT wrote:
> On Mon, Apr 02, 2007 at 12:13:00AM -0700, Andrew Morton wrote:
> > On Mon, 2 Apr 2007 11:43:19 +1000 CaT <[EMAIL PROTECTED]> wrote:
> > 
> > > I take minute by minute snapshots of network traffic by sampling
> > > /proc/net/dev and most of the time everything works fine. Occasionally
> > > though I get petabyte byte traffic and corresponding packet traffic.

On my side, measurements are taken every 10 seconds.

> > How frequently?
> 
> I can count about 6 over the past month.

almost once a day per machine.

> > Are you able to provide some actual numbers (expected and actual values),
> > so we can look at the bit patterns?

I can patch my app to give you those exact numbers (I'm afraid I'm not
enough of an rrd expert to extract the real past values it reported).
On the other hand, I cannot really test new drivers on those machines just
for these tests.

> > > This happens on an AMD64, dual core smp box with Broadcom NetXtreme II
> > > nics.
> >
> > What driver drives that?  b44.c?
> bnx2

> 
> > > The issue happens with both nics but at different times. The same
> > > sampling code runs on p4 boxes with ht on and e1000 nics without issues
> > > so I don't believe it's an issue with my code (famous last words :)

exactly the same, just xeons instead of AMD.

-- 
Jean-Daniel Pauget

Tél: +33 (0)2 33 17 20 16
2, rue André PELCA
50580 Denneville-Plage
France



Re: intermittant petabyte usage reported with broadcom nic

2007-04-02 Thread CaT
On Mon, Apr 02, 2007 at 12:13:00AM -0700, Andrew Morton wrote:
> On Mon, 2 Apr 2007 11:43:19 +1000 CaT <[EMAIL PROTECTED]> wrote:
> 
> > I take minute by minute snapshots of network traffic by sampling
> > /proc/net/dev and most of the time everything works fine. Occasionally
> > though I get petabyte byte traffic and corresponding packet traffic.
> 
> How frequently?

I can count about 6 over the past month.

> Are you able to provide some actual numbers (expected and actual values),
> so we can look at the bit patterns?

I have them in an rrd file. I think though that the numbers will be
'adjusted' to fit in with the timekeeping. The logging code I've added
should provide exact numbers as it'll just dump what it reads from /proc
into syslog.
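
A sampler along these lines could also reject readings that are physically impossible. The sketch below is not the logging code described above; the function names are made up, and it assumes a 1000Base-T link (at most 125,000,000 bytes per second in one direction) and that the first sample is good. It parses only the receive-byte counter from a /proc/net/dev line and compares each sample against the last accepted one.

```python
import re

# Assumption: 1000Base-T cannot carry more than 125,000,000 bytes/s one way.
LINK_BYTES_PER_SEC = 125_000_000

# Interface name, a colon, then the first counter (received bytes).
LINE_RE = re.compile(r"^\s*(\w+):\s*(\d+)")


def rx_bytes(line):
    """Extract (interface, rx_bytes) from one /proc/net/dev data line."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"not an interface line: {line!r}")
    return match.group(1), int(match.group(2))


def is_plausible(prev, cur, interval_sec):
    """True if the counter moved forward by no more than the link allows."""
    delta = cur - prev
    return 0 <= delta <= LINK_BYTES_PER_SEC * interval_sec


def split_samples(samples):
    """Split (time_sec, value) samples into accepted and rejected lists.

    Each sample is checked against the last accepted one, so a single bad
    reading does not also poison the good reading that follows it.
    Counter wrap-around is not handled in this sketch.
    """
    accepted = [samples[0]]  # assumption: the first sample is good
    rejected = []
    for when, value in samples[1:]:
        last_when, last_value = accepted[-1]
        if is_plausible(last_value, value, when - last_when):
            accepted.append((when, value))
        else:
            rejected.append((when, value))
    return accepted, rejected


# eth0 receive-byte counters reported in this thread, one minute apart.
eth0 = [(0, 17227166357), (60, 220898233988841368), (120, 17227454818)]
accepted, rejected = split_samples(eth0)
print("accepted:", accepted)
print("rejected:", rejected)
```

With the eth0 values from this thread, only the middle sample is rejected; the sample after it is accepted because it is compared with the last good one.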

> > This happens on an AMD64, dual core smp box with Broadcom NetXtreme II
> > nics.
> 
> What driver drives that?  b44.c?

bnx2

> > The issue happens with both nics but at different times. The same
> > sampling code runs on p4 boxes with ht on and e1000 nics without issues
> > so I don't believe it's an issue with my code (famous last words :)
> > which just does an re to extract the data on a per-line basis and prints
> > it out. Still, I'll be adding code to log any big readings and hopefully
> > it'll happen again sooner rather than later.
> >
> > There is no preemption involved and the kernel is a monolithic build of
> > 2.6.19.[12] (there are two servers).
> 
> We do perform racy 64-bit updates of some of the stats counters.  But
> that'll only affect 32-bit kernels and I'm assuming you're running a 64-bit
> kernel on that AMD64 box (are you?)

Correct. The environment is 64bit clean, though the kernel is compiled
with 32bit support so that I can run static 32bit binaries if need be.

> Plus it's odd that both the byte-counters and the packet-counters go wonky
> at the same time.

If you want I can toss you the rrd graphs that result from the data. The
values do not appear to be static. For example, the 2 most recent hits
(within 10 minutes of each other) gave almost 3 petabytes and just over 4
petabytes. Interestingly, the incoming data is driven up to petabytes
whilst the outgoing data hits megabytes at that point. This is
consistent and the server is generally quiet.

-- 
"To the extent that we overreact, we proffer the terrorists the
greatest tribute."
- High Court Judge Michael Kirby


Re: intermittant petabyte usage reported with broadcom nic

2007-04-02 Thread Andrew Morton
On Mon, 2 Apr 2007 11:43:19 +1000 CaT <[EMAIL PROTECTED]> wrote:

> I take minute by minute snapshots of network traffic by sampling
> /proc/net/dev and most of the time everything works fine. Occasionally
> though I get petabyte byte traffic and corresponding packet traffic.

How frequently?

Are you able to provide some actual numbers (expected and actual values),
so we can look at the bit patterns?

> This happens on an AMD64, dual core smp box with Broadcom NetXtreme II
> nics.

What driver drives that?  b44.c?

> The issue happens with both nics but at different times. The same
> sampling code runs on p4 boxes with ht on and e1000 nics without issues
> so I don't believe it's an issue with my code (famous last words :)
> which just does an re to extract the data on a per-line basis and prints
> it out. Still, I'll be adding code to log any big readings and hopefully
> it'll happen again sooner rather than later.
>
> There is no preemption involved and the kernel is a monolithic build of
> 2.6.19.[12] (there are two servers).

We do perform racy 64-bit updates of some of the stats counters.  But
that'll only affect 32-bit kernels and I'm assuming you're running a 64-bit
kernel on that AMD64 box (are you?)

Plus it's odd that both the byte-counters and the packet-counters go wonky
at the same time.
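
To illustrate what a racy 64-bit update means on a 32-bit machine, here is a small Python simulation. It is not kernel code and the class name is invented; it only models a 64-bit counter stored as two 32-bit words that are written one after the other, with a reader looking at the counter between the two stores.

```python
MASK32 = 0xFFFFFFFF


class SplitCounter:
    """A 64-bit counter stored as two separately written 32-bit words."""

    def __init__(self, value):
        self.low = value & MASK32
        self.high = (value >> 32) & MASK32

    def read(self):
        """Combine both words, as a reader on another CPU would."""
        return (self.high << 32) | self.low

    def add_steps(self, amount):
        """Add to the counter, yielding after each 32-bit store."""
        total = self.read() + amount
        self.low = total & MASK32            # first store: low word
        yield
        self.high = (total >> 32) & MASK32   # second store: high word
        yield


counter = SplitCounter(0x00000001_FFFFFF00)
observed = []
for _ in counter.add_steps(0x200):
    # The reader runs after every store, including the one in the middle.
    observed.append(counter.read())

print([hex(value) for value in observed])
```

The first observed value is the torn one: the low word has already wrapped but the high word has not been bumped yet, so the reader sees a counter that is 2^32 too small for an instant. On a 64-bit kernel the update is a single store, which is why this would not apply there.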


