Re: [Bonding-devel] quick help with bonding?

2007-03-29 Thread Chris Friesen

Chris Friesen wrote:
> No joy on the 2.6.14 backport, so I guess I'll try the RHEL4 route.

The bonding driver from 2.6.9-42.0.8.EL doesn't help at all, at least with
the module parameters I was using before.

Switching to miimon doesn't help either.

Chris
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bonding-devel] quick help with bonding?

2007-03-29 Thread Andy Gospodarek
On Thu, Mar 29, 2007 at 07:26:07PM -0600, Chris Friesen wrote:
> Chris Friesen wrote:
> > Andy Gospodarek wrote:
> > 
> >> If you are looking for a decent source for patches you could consider
> >> downloading the latest source-rpm from RHEL4/CentOS4.  The bonding
> >> driver in those releases has been updated to much later code and I can
> >> tell you from personal experience it works pretty well.
> 
> > I'm just about to load a kernel with a backport of bonding from 2.6.14. 
> >  I'll try it out and if it doesn't help I'll try the RHEL4 one.
> 
> No joy on the 2.6.14 backport, so I guess I'll try the RHEL4 route.
> 

Ah, ok.  I'm not too sure how much the bonding code differs between 2.6.9
and 2.6.10, so it might take a little tweaking, but I'm guessing there
won't be significant differences.


Re: [Bonding-devel] quick help with bonding?

2007-03-29 Thread Chris Friesen

Chris Friesen wrote:
> Andy Gospodarek wrote:
>
>> If you are looking for a decent source for patches you could consider
>> downloading the latest source-rpm from RHEL4/CentOS4.  The bonding
>> driver in those releases has been updated to much later code and I can
>> tell you from personal experience it works pretty well.
>
> I'm just about to load a kernel with a backport of bonding from 2.6.14.
> I'll try it out and if it doesn't help I'll try the RHEL4 one.

No joy on the 2.6.14 backport, so I guess I'll try the RHEL4 route.

Chris


Re: [Bonding-devel] quick help with bonding?

2007-03-29 Thread Chris Friesen

Andy Gospodarek wrote:

> If you are looking for a decent source for patches you could consider
> downloading the latest source-rpm from RHEL4/CentOS4.  The bonding
> driver in those releases has been updated to much later code and I can
> tell you from personal experience it works pretty well.  You may need
> to do some backporting to get the latest arp-monitoring features, but
> let me know if you need a hand with that; I might have some lying
> around. ;)

I'm just about to load a kernel with a backport of bonding from 2.6.14.
I'll try it out and if it doesn't help I'll try the RHEL4 one.

> Does eth6 use the same hardware/driver as eth4/5?  (Sorry if I missed
> that in the thread, but I didn't see if you indicated that it did.)

No, eth6 is an AMD-8111.

Chris


Re: [Bonding-devel] quick help with bonding?

2007-03-29 Thread Andy Gospodarek
On Thu, Mar 29, 2007 at 05:42:54PM -0600, Chris Friesen wrote:
> Jay Vosburgh wrote:
> 
> > 2.6.10 is pretty old, and there have been a number of fixes to
> > the bonding ARP monitor since then, so it may be that it is simply
> > misbehaving (presuming that you're running the 2.6.10 bonding driver).
> > Are you in a position to test against a more recent kernel (and/or
> > bonding driver)?  Does the miimon misbehave in a similar fashion?
> 
> Testing a more recent kernel is problematic.  A new bonding driver could 
> be possible, assuming the code hasn't changed too much.

If you are looking for a decent source for patches you could consider
downloading the latest source-rpm from RHEL4/CentOS4.  The bonding
driver in those releases has been updated to much later code and I can
tell you from personal experience it works pretty well.  You may need
to do some backporting to get the latest arp-monitoring features, but
let me know if you need a hand with that; I might have some lying
around. ;)
 
> I just did another experiment.  Normally we boot via eth4 (which then 
> becomes part of the bond  with eth5 at init time).  If I boot via eth6 
> instead, it appears as though the problem doesn't show up.

Does eth6 use the same hardware/driver as eth4/5?  (Sorry if I missed
that in the thread, but I didn't see if you indicated that it did.)


Re: [Bonding-devel] quick help with bonding?

2007-03-29 Thread Jay Vosburgh
Chris Friesen <[EMAIL PROTECTED]> wrote:

>Jay Vosburgh wrote:
>
>>  2.6.10 is pretty old, and there have been a number of fixes to
>> the bonding ARP monitor since then, so it may be that it is simply
>> misbehaving (presuming that you're running the 2.6.10 bonding driver).
>> Are you in a position to test against a more recent kernel (and/or
>> bonding driver)?  Does the miimon misbehave in a similar fashion?
>
>Testing a more recent kernel is problematic.  A new bonding driver could
>be possible, assuming the code hasn't changed too much.
>
>I just did another experiment.  Normally we boot via eth4 (which then
>becomes part of the bond  with eth5 at init time).  If I boot via eth6
>instead, it appears as though the problem doesn't show up.

Well, if you're still inclined to investigate, you may want to
inspect the ARP probes generated by bonding in the "bad" situation.  I
don't really have any evidence to back it up, but one guess is that the
IP detection stuff in the ARP monitor is getting messed up. I'd check to
see if the ARP probes have the correct source IP address (which, in the
2.6.10 era bonding, is determined only once by inspection of outbound
ARP traffic, and never updated).  If you're not using active-backup mode
(you didn't say, and I can't tell from your log excerpt), then the ARP
monitor may not work at all (since it will send ARP probes with an IP
source of all zeros).

If bad ARP probe source addresses are your problem, then that is
fixed in a later version of bonding, although the changes would require
some rework to backport to 2.6.10 (if they can be backported).
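
One way to check the probes described above is to capture traffic on a
slave and look at the sender IP field of each ARP request.  The sketch
below is only an illustration in Python: it assumes untagged Ethernet
frames already available as raw bytes (for example, read from a packet
capture), and the function names and the sender address 172.24.136.5 are
hypothetical, not taken from this thread.

```python
import socket
import struct

ETH_HEADER_LEN = 14
ARP_IPV4_LEN = 28
ETHERTYPE_ARP = 0x0806
ARP_FORMAT = "!HHBBH6s4s6s4s"  # htype, ptype, hlen, plen, oper, sha, spa, tha, tpa


def arp_sender_ip(frame):
    """Return the sender IP of an untagged Ethernet/IPv4 ARP frame, else None."""
    if len(frame) < ETH_HEADER_LEN + ARP_IPV4_LEN:
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_ARP:
        return None
    htype, ptype, hlen, plen, _oper, _sha, spa, _tha, _tpa = struct.unpack(
        ARP_FORMAT, frame[ETH_HEADER_LEN:ETH_HEADER_LEN + ARP_IPV4_LEN]
    )
    # Only handle ARP for IPv4 over Ethernet.
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    return socket.inet_ntoa(spa)


def has_zero_source(frame):
    """True if the frame is an ARP packet whose sender IP is all zeros."""
    return arp_sender_ip(frame) == "0.0.0.0"


def build_arp_request(src_mac, sender_ip, target_ip):
    """Build a broadcast ARP request, used here only to exercise the parser."""
    eth = b"\xff" * 6 + src_mac + struct.pack("!H", ETHERTYPE_ARP)
    arp = struct.pack(
        ARP_FORMAT, 1, 0x0800, 6, 4, 1,
        src_mac, socket.inet_aton(sender_ip),
        b"\x00" * 6, socket.inet_aton(target_ip),
    )
    return eth + arp


if __name__ == "__main__":
    mac = bytes.fromhex("0003cc51013e")  # eth5 HWaddr from the ifconfig output
    probe = build_arp_request(mac, "0.0.0.0", "172.24.136.0")
    print(arp_sender_ip(probe), has_zero_source(probe))
```

A probe flagged by has_zero_source() would match the all-zeros source
case mentioned above; a probe carrying a stale but non-zero address
would need to be compared against the bond's configured address by hand.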

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]


Re: [Bonding-devel] quick help with bonding?

2007-03-29 Thread Chris Friesen

Jay Vosburgh wrote:

> 2.6.10 is pretty old, and there have been a number of fixes to
> the bonding ARP monitor since then, so it may be that it is simply
> misbehaving (presuming that you're running the 2.6.10 bonding driver).
> Are you in a position to test against a more recent kernel (and/or
> bonding driver)?  Does the miimon misbehave in a similar fashion?

Testing a more recent kernel is problematic.  A new bonding driver could
be possible, assuming the code hasn't changed too much.

I just did another experiment.  Normally we boot via eth4 (which then
becomes part of the bond with eth5 at init time).  If I boot via eth6
instead, it appears as though the problem doesn't show up.


Chris


Re: [Bonding-devel] quick help with bonding?

2007-03-29 Thread Mark Huth



Jay Vosburgh wrote:
> Chris Friesen <[EMAIL PROTECTED]> wrote:
> [...]
>> I have a ppc64 blade running a customized 2.6.10.  At init time, two of
>> our gigE links (eth4 and eth5) are bonded together to form bond0.  This
>> link has an MTU of 9000, and uses arp monitoring.  We're using an ethernet
>> driver with a modified RX path for jumbo frames[1].  With the stock
>> driver, it seems to work fine.
>
> 2.6.10 is pretty old, and there have been a number of fixes to
> the bonding ARP monitor since then, so it may be that it is simply
> misbehaving (presuming that you're running the 2.6.10 bonding driver).
> Are you in a position to test against a more recent kernel (and/or
> bonding driver)?  Does the miimon misbehave in a similar fashion?
>
>> The problem is that eth5 seems to be bouncing up and down every 15 sec or
>> so (see the attached log excerpt).  Also, "ifconfig" shows that only 3
>> packets totalling 250 bytes have gone out eth5, when I know that the arp
>> monitoring code from the bond layer is sending 10 arps/sec out the link.
> [...]
>> Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: scheduling interface eth4 to be reset in 3 msec.
> [...]
>> Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: cancelled scheduled reset of interface eth5
>
> These two messages (which appear a number of times in your log
> excerpt) are not from the standard mainline bonding driver, even in
> 2.6.10.  I don't know what this is all about.
>
>> If I boot the system and then log in and manually create the bond link
>> (rather than it happening at init time) then I don't see the problem.
>
> I would hazard to guess that it's an ARP monitor problem; older
> versions of the ARP monitor had less than intelligent means to figure
> out what the bond's IP address is (to use for the probes).  This, along
> with some logic problems in the monitor code itself, led to various
> problems with the ARP probes and the sort of "up / down" cycle of
> behavior you seem to be seeing.
>
> -J
>
> ---
> -Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]

I'll second what Jay said.  I support a version of the 2.6.10 kernel 
with bonding, and I needed to upgrade the bonding that was native to 
2.6.10 to get reasonable behavior.  You may also need a newer ifenslave.


It also looks like the mii interface is not well-behaved, because of the 
initialization messages related to link speed.



Re: [Bonding-devel] quick help with bonding?

2007-03-29 Thread Jay Vosburgh

Chris Friesen <[EMAIL PROTECTED]> wrote:
[...]
>I have a ppc64 blade running a customized 2.6.10.  At init time, two of
>our gigE links (eth4 and eth5) are bonded together to form bond0.  This
>link has an MTU of 9000, and uses arp monitoring.  We're using an ethernet
>driver with a modified RX path for jumbo frames[1].  With the stock
>driver, it seems to work fine.

2.6.10 is pretty old, and there have been a number of fixes to
the bonding ARP monitor since then, so it may be that it is simply
misbehaving (presuming that you're running the 2.6.10 bonding driver).
Are you in a position to test against a more recent kernel (and/or
bonding driver)?  Does the miimon misbehave in a similar fashion?

>The problem is that eth5 seems to be bouncing up and down every 15 sec or
>so (see the attached log excerpt).  Also, "ifconfig" shows that only 3
>packets totalling 250 bytes have gone out eth5, when I know that the arp
>monitoring code from the bond layer is sending 10 arps/sec out the link.
[...]
>Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: scheduling interface eth4 to be reset in 3 msec.
[...]
>Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: cancelled scheduled reset of interface eth5

These two messages (which appear a number of times in your log
excerpt) are not from the standard mainline bonding driver, even in
2.6.10.  I don't know what this is all about.

>If I boot the system and then log in and manually create the bond link
>(rather than it happening at init time) then I don't see the problem.

I would hazard to guess that it's an ARP monitor problem; older
versions of the ARP monitor had less than intelligent means to figure
out what the bond's IP address is (to use for the probes).  This, along
with some logic problems in the monitor code itself, led to various
problems with the ARP probes and the sort of "up / down" cycle of
behavior you seem to be seeing.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]


Re: [Bonding-devel] quick help with bonding?

2007-03-29 Thread Chris Friesen

Andy Gospodarek wrote:

> Can you elaborate on what isn't going well with this driver/hardware?


I have a ppc64 blade running a customized 2.6.10.  At init time, two of 
our gigE links (eth4 and eth5) are bonded together to form bond0.  This 
link has an MTU of 9000, and uses arp monitoring.  We're using an 
ethernet driver with a modified RX path for jumbo frames[1].  With the 
stock driver, it seems to work fine.


The problem is that eth5 seems to be bouncing up and down every 15 sec 
or so (see the attached log excerpt).  Also, "ifconfig" shows that only 
3 packets totalling 250 bytes have gone out eth5, when I know that the 
arp monitoring code from the bond layer is sending 10 arps/sec out the link.



eth5      Link encap:Ethernet  HWaddr 00:03:CC:51:01:3E
          inet6 addr: fe80::203:ccff:fe51:13e/64 Scope:Link
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:9000  Metric:1
          RX packets:119325 errors:90283 dropped:90283 overruns:90283 frame:0
          TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:8978310 (8.5 MiB)  TX bytes:250 (250.0 b)
          Base address:0x3840 Memory:9222-9224


I had initially suspected that it might be due to the "u32 jiffies" 
stuff in bonding.h, but changing that doesn't seem to fix the issue.


If I boot the system and then log in and manually create the bond link 
(rather than it happening at init time) then I don't see the problem.


If it matters at all, normally the system boots from eth4.  I'm going to 
try booting from eth6 and see if the problem still occurs.



Chris




[1] I'm not sure if I'm supposed to mention the specific driver, as it 
hasn't been officially released yet, so I'll keep this high-level. 
Normally for jumbo frames you need to allocate a large physically 
contiguous buffer.  With the modified driver, rather than receiving into 
a contiguous buffer the incoming packet is split across multiple pages 
which are then reassembled into an sk_buff and passed up the link.

Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: ARP monitoring set to 100 ms with 2 target(s): 172.24.136.0 172.24.137.0
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: ARP monitoring set to 100 ms with 2 target(s): 172.25.136.0 172.25.137.0
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: Warning: failed to get speed/duplex from eth4, speed forced to 100Mbps, duplex forced to Full.
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: enslaving eth4 as an active interface with an up link.
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: Warning: failed to get speed/duplex from eth5, speed forced to 100Mbps, duplex forced to Full.
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: enslaving eth5 as an active interface with an up link.
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: scheduling interface eth5 to be reset in 3 msec.
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: interface eth5 is now down.
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: scheduling interface eth4 to be reset in 3 msec.
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: interface eth4 is now down.
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: now running without any active interface !
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: cancelled scheduled reset of interface eth5
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: link status definitely up for interface eth5
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: cancelled scheduled reset of interface eth4
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: interface eth4 is now up
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: scheduling interface eth5 to be reset in 3 msec.
Mar 29 20:54:08 base0-0-0-5-0-11-1 kernel: bonding: bond0: interface eth5 is now down.
Mar 29 20:54:09 base0-0-0-5-0-11-1 kernel: bonding: interface eth4 reset delay set to 600 msec.
Mar 29 20:54:59 base0-0-0-5-0-11-1 kernel: bonding: bond0: cancelled scheduled reset of interface eth5
Mar 29 20:54:59 base0-0-0-5-0-11-1 kernel: bonding: bond0: interface eth5 is now up
Mar 29 20:54:59 base0-0-0-5-0-11-1 kernel: bonding: bond0: scheduling interface eth5 to be reset in 3 msec.
Mar 29 20:54:59 base0-0-0-5-0-11-1 kernel: bonding: bond0: interface eth5 is now down.
Mar 29 20:55:15 base0-0-0-5-0-11-1 kernel: bonding: bond0: cancelled scheduled reset of interface eth5
Mar 29 20:55:15 base0-0-0-5-0-11-1 kernel: bonding: bond0: interface eth5 is now up
Mar 29 20:55:15 base0-0-0-5-0-11-1 kernel: bonding: bond0: scheduling interface eth5 to be reset in 3 msec.
Mar 29 20:55:15 base0-0-0-5-0-11-1 kernel: bonding: bond0: interface eth5 is now down.
Mar 29 20:55:30 base0-0-0-5-0-11-1 kernel: bonding: bond0: cancelled scheduled reset of interface eth5
Mar 29 20:55:30 base0-0-0-5-0-11-1 kernel: bonding: bond0: interface eth5 is now u
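
The bounce cycle in the log excerpt above can be measured rather than
eyeballed.  The following is a rough Python sketch, not part of the
original post: it assumes each log entry is on a single line, that all
entries fall on the same day, and the function name down_intervals is
made up for illustration.  The sample lines are copied from the excerpt;
the final, truncated entry is left out.

```python
import re

# Matches entries such as:
#   Mar 29 20:54:59 host kernel: bonding: bond0: interface eth5 is now up
EVENT_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}) \S+ kernel: bonding: \S+ "
    r"interface (\S+) is now (up|down)"
)


def down_intervals(lines, iface):
    """Return how many seconds each 'down' period of iface lasted."""
    intervals = []
    went_down = None
    for line in lines:
        match = EVENT_RE.search(line)
        if not match or match.group(4) != iface:
            continue
        hours, minutes, seconds = (int(x) for x in match.group(1, 2, 3))
        now = hours * 3600 + minutes * 60 + seconds
        if match.group(5) == "down":
            went_down = now
        elif went_down is not None:
            intervals.append(now - went_down)
            went_down = None
    return intervals


HOST = "base0-0-0-5-0-11-1"
SAMPLE = [
    f"Mar 29 20:54:08 {HOST} kernel: bonding: bond0: interface eth5 is now down.",
    f"Mar 29 20:54:59 {HOST} kernel: bonding: bond0: cancelled scheduled reset of interface eth5",
    f"Mar 29 20:54:59 {HOST} kernel: bonding: bond0: interface eth5 is now up",
    f"Mar 29 20:54:59 {HOST} kernel: bonding: bond0: interface eth5 is now down.",
    f"Mar 29 20:55:15 {HOST} kernel: bonding: bond0: interface eth5 is now up",
    f"Mar 29 20:55:15 {HOST} kernel: bonding: bond0: interface eth5 is now down.",
]

if __name__ == "__main__":
    print(down_intervals(SAMPLE, "eth5"))
```

On these lines eth5 is reported up and down within the same second each
time, so the link spends nearly the whole cycle marked down.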

Re: [Bonding-devel] quick help with bonding?

2007-03-29 Thread Andy Gospodarek
On Thu, Mar 29, 2007 at 10:14:40AM -0600, Chris Friesen wrote:
> 
> I'm doing some experimenting with a new network driver that receives 
> jumbo frames into multiple separate pages that are then joined together 
> in a single sk_buff using skb_fill_page_desc().
> 
> It behaved fairly well with standard networking, but it's behaving 
> strangely with bonding added to the mix.


Can you elaborate on what isn't going well with this driver/hardware?  



Re: [Bonding-devel] quick help with bonding?

2007-03-29 Thread Jay Vosburgh
Chris Friesen <[EMAIL PROTECTED]> wrote:

>Could someone either point me to the bonding high level design document 
>(couldn't find one at the sourceforge project page) or else give me a 
>quick overview of the code path followed by an incoming packet when 
>bonding is involved?

There really isn't a high level design document.

The input path goes from the driver, which (probably) calls
netif_receive_skb.  That function does its usual processing; the only
special step for bonding is the processing done by skb_bond(), which
assigns the packet to the bonding device.  In the current mainline,
skb_bond() also does some stuff to drop traffic on inactive slaves and
the like.

After that, the packet follows the regular input path in
netif_receive_skb.
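
As a rough illustration of the step described above, here is a toy
model in Python.  It is not kernel code: the NetDevice and Packet
classes and the "active" flag are invented for the example, and the
drop rule is a deliberate simplification of whatever the mainline
skb_bond() actually checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NetDevice:
    name: str
    master: Optional["NetDevice"] = None  # bond this device is enslaved to
    active: bool = True  # hypothetical stand-in for "active slave"


@dataclass
class Packet:
    dev: NetDevice
    payload: bytes = b""


def skb_bond(pkt):
    """Toy version: hand the packet to the bond device, or drop it (None)."""
    dev = pkt.dev
    if dev.master is not None:
        if not dev.active:
            return None  # simplified: traffic on an inactive slave is dropped
        pkt.dev = dev.master  # the rest of the input path sees bond0
    return pkt


if __name__ == "__main__":
    bond0 = NetDevice("bond0")
    eth4 = NetDevice("eth4", master=bond0)
    eth5 = NetDevice("eth5", master=bond0, active=False)
    eth6 = NetDevice("eth6")  # not enslaved, passes through untouched

    for dev in (eth4, eth5, eth6):
        result = skb_bond(Packet(dev))
        print(dev.name, "->", result.dev.name if result else "dropped")
```

The point of the model is only that a packet arriving on a slave is
re-labelled with the bond device before normal protocol processing, so
upper layers never see eth4 or eth5 directly.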

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]