from:"Jay Vosburgh"

Re: [PATCH] bonding: 3ad: update slave arr after initialize

2021-04-19 Thread Jay Vosburgh

jin yiting  wrote:
[...]
>>  The described issue is a race condition (in that
>> ad_agg_selection_logic clears agg->is_active under mode_lock, but
>> bond_open -> bond_update_slave_arr is inspecting agg->is_active outside
>> the lock).  I don't see how the above change will reliably manage this;
>> the real issue looks to be that bond_update_slave_arr is committing
>> changes to the array (via bond_reset_slave_arr) based on a racy
>> inspection of the active aggregator state while it is in flux.
>>
>>  Also, the description of the issue says "The best aggregator in
>> ad_agg_selection_logic has not changed, no need to update slave arr,"
>> but the change above does the opposite, and will set update_slave_arr
>> when the aggregator has not changed (update_slave_arr remains false at
>> return of ad_agg_selection_logic).
>>
>>  I believe I understand the described problem, but I don't see
>> how the patch fixes it.  I suspect (but haven't tested) that the proper
>> fix is to acquire mode_lock in bond_update_slave_arr while calling
>> bond_3ad_get_active_agg_info to avoid conflict with the state machine.
>>
>>  -J
>>
>> ---
>>  -Jay Vosburgh, jay.vosbu...@canonical.com
>> .
>>
>
>   Thank you for your reply. The last patch does have redundant
>update slave arr.Thank you for your correction.
>
>As you said, holding mode_lock in bond_update_slave_arr while
>calling bond_3ad_get_active_agg_info can avoid conflictwith the state
>machine. I have tested this patch, with ifdown/ifup operations for bond or
>slaves.
>
>But bond_update_slave_arr is expected to hold RTNL only and NO
>other lock. And it have WARN_ON(lockdep_is_held(>mode_lock)); in
>bond_update_slave_arr. I'm not sure that holding mode_lock in
>bond_update_slave_arr while calling bond_3ad_get_active_agg_info is a
>correct action.

That WARN_ON came up in discussion recently, and my opinion is
that it's incorrect, and is trying to insure bond_update_slave_arr is
safe for a potential sleep when allocating memory.

https://lore.kernel.org/netdev/20210322123846.3024549-1-maxi...@nvidia.com/

The original authors haven't replied, so I would suggest you
remove the WARN_ON and the surrounding CONFIG_LOCKDEP ifdefs as part of
your patch and replace it with a call to might_sleep.

The other callers of bond_3ad_get_active_agg_info are generally
obtaining the state in order to report it to user space, so I think it's
safe to leave those calls not holding the mode_lock.  The race is still
there, but the data returned to user space is a snapshot and so may
reflect an incomplete state during a transition.  Further, having the
inspection functions acquire the mode_lock permits user space to spam
the lock with little effort.

-J

>diff --git a/drivers/net/bonding/bond_main.c
>b/drivers/net/bonding/bond_main.c
>index 74cbbb2..db988e5 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -4406,7 +4406,9 @@ int bond_update_slave_arr(struct bonding *bond,
>struct slave *skipslave)
>if (BOND_MODE(bond) == BOND_MODE_8023AD) {
>struct ad_info ad_info;
>
>+   spin_lock_bh(>mode_lock);
>if (bond_3ad_get_active_agg_info(bond, _info)) {
>+   spin_unlock_bh(>mode_lock);
>pr_debug("bond_3ad_get_active_agg_info failed\n");
>/* No active aggragator means it's not safe to use
> * the previous array.
>@@ -4414,6 +4416,7 @@ int bond_update_slave_arr(struct bonding *bond,
>struct slave *skipslave)
>bond_reset_slave_arr(bond);
>goto out;
>}
>+   spin_unlock_bh(>mode_lock);
>agg_id = ad_info.aggregator_id;
>}
>bond_for_each_slave(bond, slave, iter) {

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding: 3ad: update slave arr after initialize

2021-04-15 Thread Jay Vosburgh

jinyiting  wrote:

>From: jin yiting 
>
>The bond works in mode 4, and performs down/up operations on the bond
>that is normally negotiated. The probability of bond-> slave_arr is NULL
>
>Test commands:
>ifconfig bond1 down
>ifconfig bond1 up
>
>The conflict occurs in the following process：
>
>__dev_open (CPU A)
> --bond_open
>   --queue_delayed_work(bond->wq,>ad_work,0);
>   --bond_update_slave_arr
> --bond_3ad_get_active_agg_info
>
>ad_work(CPU B)
> --bond_3ad_state_machine_handler
>   --ad_agg_selection_logic
>
>ad_work runs on cpu B. In the function ad_agg_selection_logic, all
>agg->is_active will be cleared. Before the new active aggregator is
>selected on CPU B, bond_3ad_get_active_agg_info failed on CPU A,
>bond->slave_arr will be set to NULL. The best aggregator in
>ad_agg_selection_logic has not changed, no need to update slave arr.
>
>Signed-off-by: jin yiting 
>---
> drivers/net/bonding/bond_3ad.c | 6 ++
> 1 file changed, 6 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>index 6908822..d100079 100644
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -2327,6 +2327,12 @@ void bond_3ad_state_machine_handler(struct work_struct 
>*work)
> 
>   aggregator = __get_first_agg(port);
>   ad_agg_selection_logic(aggregator, _slave_arr);
>+  if (!update_slave_arr) {
>+  struct aggregator *active = 
>__get_active_agg(aggregator);
>+
>+  if (active && active->is_active)
>+  update_slave_arr = true;
>+  }
>   }
>   bond_3ad_set_carrier(bond);
>   }

The described issue is a race condition (in that
ad_agg_selection_logic clears agg->is_active under mode_lock, but
bond_open -> bond_update_slave_arr is inspecting agg->is_active outside
the lock).  I don't see how the above change will reliably manage this;
the real issue looks to be that bond_update_slave_arr is committing
changes to the array (via bond_reset_slave_arr) based on a racy
inspection of the active aggregator state while it is in flux.

Also, the description of the issue says "The best aggregator in
ad_agg_selection_logic has not changed, no need to update slave arr,"
but the change above does the opposite, and will set update_slave_arr
when the aggregator has not changed (update_slave_arr remains false at
return of ad_agg_selection_logic).

I believe I understand the described problem, but I don't see
how the patch fixes it.  I suspect (but haven't tested) that the proper
fix is to acquire mode_lock in bond_update_slave_arr while calling
bond_3ad_get_active_agg_info to avoid conflict with the state machine.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next] bonding: Added -ENODEV interpret for slaves option

2021-03-13 Thread Jay Vosburgh

Jianlin Lv  wrote:

>After upgrading the kernel, the slave interface name is changed,
>Systemd cannot use the original configuration to create bond interface,
>thereby losing the connection with the host.
>
>Adding log for ENODEV will make it easier to find out such problem lies.

To be clear, this specifically affects add/remove of interfaces
to/from the bond via the "slaves" sysfs interface.

Please update your log to better describe this (that it affects
the sysfs API only) and resubmit.

I'm sympathetic to the problem this is trying to solve, and the
message shouldn't spam the kernel log particularly, but the commit log
needs to more clearly describe what the problem is and how it's being
fixed.

Thanks,

-J

>Signed-off-by: Jianlin Lv 
>---
> drivers/net/bonding/bond_options.c | 9 +
> 1 file changed, 9 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_options.c 
>b/drivers/net/bonding/bond_options.c
>index 77d7c38bd435..c9d3604ae129 100644
>--- a/drivers/net/bonding/bond_options.c
>+++ b/drivers/net/bonding/bond_options.c
>@@ -640,6 +640,15 @@ static void bond_opt_error_interpret(struct bonding *bond,
>   netdev_err(bond->dev, "option %s: unable to set because the 
> bond device is up\n",
>  opt->name);
>   break;
>+  case -ENODEV:
>+  if (val && val->string) {
>+  p = strchr(val->string, '\n');
>+  if (p)
>+  *p = '\0';
>+  netdev_err(bond->dev, "option %s: interface %s does not 
>exist!\n",
>+ opt->name, val->string);
>+  }
>+  break;
>   default:
>   break;
>   }
>-- 
>2.25.1
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: question about bonding mode 4

2021-01-29 Thread Jay Vosburgh

ry port
structure contains an aggregator structure, so there should always be
one available somewhere.  I'm going to speculate that there's a race
condition somewhere in the teardown processing vs the LACP state machine
that invalidates this presumption.

>> The detailed code analysis is as follows:

[...]

>>  /* if all aggregator's ports are READY_N == TRUE, set ready=TRUE
>>   * in all aggregator's ports, else set ready=FALSE in all
>>   * aggregator's ports
>>   */
>>  __set_agg_ports_ready(port->aggregator,
>>__agg_ports_are_ready(port->aggregator));
>> 
>> analysis: port->aggregator is still NULL, which causes problem.
>> 
>>  aggregator = __get_first_agg(port);
>>  ad_agg_selection_logic(aggregator, update_slave_arr);
>> 
>>  if (!port->aggregator->is_active)
>>  port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION;

Correct, if the "did not find a suitable aggregator" path is
taken, port->aggregator is NULL and bad things happen in the above
block.

This is something that needs to be fixed, but I'm also concerned
that there are other issues lurking, so I'd like to be able to reproduce
this.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next v2] bonding: add a vlan+mac tx hashing option

2021-01-14 Thread Jay Vosburgh

kb)
>{
>struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
>u32 srcmac = 0;
>u16 vlan;
>int i;
>
>for (i = 0; i < ETH_ALEN; i++)
>srcmac = (srcmac << 8) | mac_hdr->h_source[i];

I think this will shift h_source[0] and [1] into oblivion.

>if (!skb_vlan_tag_present(skb))
>return srcmac;
>
>vlan = skb_vlan_tag_get(skb);
>
>return vlan ^ srcmac;
>}
>
>Then the documentation is spot-on, and we're future-proof, though
>marginally less performant in calculating the hash, which may have been a
>consideration when the original function was written, but is probably
>basically irrelevant w/modern systems...
>
>> >The default value is layer2.  This option was added in bonding
>> >version 2.6.3.  In earlier versions of bonding, this parameter
>> >does not exist, and the layer2 policy is the only policy.  The
>> 
>> > +static inline u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
>> 
>> Can we drop the inline? It's a static function called once.
>
>Works for me. That was also inherited by copying bond_eth_hash(). :)
>
>> > +{
>> > +  struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
>> 
>> I don't see anything in the patch making sure the interface actually
>> has a L2 header. Should we validate somehow the ifc is Ethernet?
>
>I don't think it's necessary. There doesn't appear to be any explicit
>check for BOND_XMIT_POLICY_LAYER2 either. I believe we're guaranteed to
>not have anything but an ethernet header here, as the only other type I'm
>aware of being supported is Infiniband, but we limit that to active-backup
>only, and xmit_hash_policy isn't valid for active-backup.

This is correct, interfaces in a bond other than active-backup
will all be ARPHRD_ETHER.  I'm unaware of a way to get a packet in there
without at least an Ethernet header.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option

2021-01-12 Thread Jay Vosburgh

Jarod Wilson  wrote:

>On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote:
>> On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote:
>> > Jarod Wilson  wrote:
>> > 
>> > >This comes from an end-user request, where they're running multiple VMs on
>> > >hosts with bonded interfaces connected to some interest switch topologies,
>> > >where 802.3ad isn't an option. They're currently running a proprietary
>> > >solution that effectively achieves load-balancing of VMs and bandwidth
>> > >utilization improvements with a similar form of transmission algorithm.
>> > >
>> > >Basically, each VM has it's own vlan, so it always sends its traffic out
>> > >the same interface, unless that interface fails. Traffic gets split
>> > >between the interfaces, maintaining a consistent path, with failover still
>> > >available if an interface goes down.
>> > >
>> > >This has been rudimetarily tested to provide similar results, suitable for
>> > >them to use to move off their current proprietary solution.
>> > >
>> > >Still on the TODO list, if these even looks sane to begin with, is
>> > >fleshing out Documentation/networking/bonding.rst.
>> > 
>> >I'm sure you're aware, but any final submission will also need
>> > to include netlink and iproute2 support.
>> 
>> I believe everything for netlink support is already included, but I'll
>> double-check that before submitting something for inclusion consideration.
>
>I'm not certain if what you actually meant was that I'd have to patch
>iproute2 as well, which I've definitely stumbled onto today, but it's a
>2-line patch, and everything seems to be working fine with it:

Yes, that's what I meant.

>$ sudo ip link set bond0 type bond xmit_hash_policy 5

Does the above work with the text label (presumably "vlansrc")
as well as the number, and does "ip link add test type bond help" print
the correct text for XMIT_HASH_POLICY?

-J

>$ ip -d link show bond0
>11: bond0:  mtu 1500 qdisc noop state DOWN mode 
>DEFAULT group default qlen 1000
>link/ether ce:85:5e:24:ce:90 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 
> maxmtu 65535
>bond mode balance-xor miimon 0 updelay 0 downdelay 0 peer_notify_delay 0 
> use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any 
> primary_reselect always fail_over_mac none xmit_hash_policy vlansrc 
> resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 
> packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 
> addrgenmode eui64 numtxqueues 16 numrxqueues 16 gso_max_size 65536 
> gso_max_segs 65535
>$ grep Hash /proc/net/bonding/bond0
>Transmit Hash Policy: vlansrc (5)
>
>Nothing bad seems to happen on an older kernel if one tries to set the new
>hash, you just get told that it's an invalid argument.
>
>I *think* this is all ready for submission then, so I'll get both the kernel
>and iproute2 patches out soon.
>
>-- 
>Jarod Wilson
>ja...@redhat.com

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option

2020-12-18 Thread Jay Vosburgh

Jarod Wilson  wrote:

>This comes from an end-user request, where they're running multiple VMs on
>hosts with bonded interfaces connected to some interest switch topologies,
>where 802.3ad isn't an option. They're currently running a proprietary
>solution that effectively achieves load-balancing of VMs and bandwidth
>utilization improvements with a similar form of transmission algorithm.
>
>Basically, each VM has it's own vlan, so it always sends its traffic out
>the same interface, unless that interface fails. Traffic gets split
>between the interfaces, maintaining a consistent path, with failover still
>available if an interface goes down.
>
>This has been rudimetarily tested to provide similar results, suitable for
>them to use to move off their current proprietary solution.
>
>Still on the TODO list, if these even looks sane to begin with, is
>fleshing out Documentation/networking/bonding.rst.

I'm sure you're aware, but any final submission will also need
to include netlink and iproute2 support.

>Cc: Jay Vosburgh 
>Cc: Veaceslav Falico 
>Cc: Andy Gospodarek 
>Cc: "David S. Miller" 
>Cc: Jakub Kicinski 
>Cc: Thomas Davis 
>Cc: net...@vger.kernel.org
>Signed-off-by: Jarod Wilson 
>---
> drivers/net/bonding/bond_main.c| 27 +--
> drivers/net/bonding/bond_options.c |  1 +
> include/linux/netdevice.h  |  1 +
> include/uapi/linux/if_bonding.h|  1 +
> 4 files changed, 28 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 5fe5232cc3f3..151ce8c7a56f 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -164,7 +164,7 @@ module_param(xmit_hash_policy, charp, 0);
> MODULE_PARM_DESC(xmit_hash_policy, "balance-alb, balance-tlb, balance-xor, 
> 802.3ad hashing method; "
>  "0 for layer 2 (default), 1 for layer 3+4, "
>  "2 for layer 2+3, 3 for encap layer 2+3, "
>- "4 for encap layer 3+4");
>+ "4 for encap layer 3+4, 5 for vlan+srcmac");
> module_param(arp_interval, int, 0);
> MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
> module_param_array(arp_ip_target, charp, NULL, 0);
>@@ -1434,6 +1434,8 @@ static enum netdev_lag_hash bond_lag_hash_type(struct 
>bonding *bond,
>   return NETDEV_LAG_HASH_E23;
>   case BOND_XMIT_POLICY_ENCAP34:
>   return NETDEV_LAG_HASH_E34;
>+  case BOND_XMIT_POLICY_VLAN_SRCMAC:
>+  return NETDEV_LAG_HASH_VLAN_SRCMAC;
>   default:
>   return NETDEV_LAG_HASH_UNKNOWN;
>   }
>@@ -3494,6 +3496,20 @@ static bool bond_flow_ip(struct sk_buff *skb, struct 
>flow_keys *fk,
>   return true;
> }
> 
>+static inline u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
>+{
>+  struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
>+  u32 srcmac = mac_hdr->h_source[5];
>+  u16 vlan;
>+
>+  if (!skb_vlan_tag_present(skb))
>+  return srcmac;
>+
>+  vlan = skb_vlan_tag_get(skb);
>+
>+  return srcmac ^ vlan;

For the common configuration wherein multiple VLANs are
configured atop a single interface (and thus by default end up with the
same MAC address), this seems like a fairly weak hash.  The TCI is 16
bits (12 of which are the VID), but only 8 are used from the MAC, which
will be a constant.

Is this algorithm copying the proprietary solution you mention?

-J

>+}
>+
> /* Extract the appropriate headers based on bond's xmit policy */
> static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
> struct flow_keys *fk)
>@@ -3501,10 +3517,14 @@ static bool bond_flow_dissect(struct bonding *bond, 
>struct sk_buff *skb,
>   bool l34 = bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34;
>   int noff, proto = -1;
> 
>-  if (bond->params.xmit_policy > BOND_XMIT_POLICY_LAYER23) {
>+  switch (bond->params.xmit_policy) {
>+  case BOND_XMIT_POLICY_ENCAP23:
>+  case BOND_XMIT_POLICY_ENCAP34:
>   memset(fk, 0, sizeof(*fk));
>   return __skb_flow_dissect(NULL, skb, _keys_bonding,
> fk, NULL, 0, 0, 0, 0);
>+  default:
>+  break;
>   }
> 
>   fk->ports.ports = 0;
>@@ -3556,6 +3576,9 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff 
>*skb)
>   skb->l4_hash)
>   return skb->hash;
> 
>+  if (bond->params.xmit_policy == BOND_XMIT_PO

Re: [PATCH net-next] bonding: correct rr balancing during link failure

2020-12-08 Thread Jay Vosburgh

Jakub Kicinski  wrote:

>On Wed, 02 Dec 2020 20:55:57 + Lars Everbrand wrote:
>> This patch updates the sending algorithm for roundrobin to avoid
>> over-subscribing interface(s) when one or more interfaces in the bond is
>> not able to send packets. This happened when order was not random and
>> more than 2 interfaces were used.
>> 
>> Previously the algorithm would find the next available interface
>> when an interface failed to send by, this means that most often it is
>> current_interface + 1. The problem is that when the next packet is to be
>> sent and the "normal" algorithm then continues with interface++ which
>> then hits that same interface again.
>> 
>> This patch updates the resending algorithm to update the global counter
>> of the next interface to use.
>> 
>> Example (prior to patch):
>> 
>> Consider 6 x 100 Mbit/s interfaces in a rr bond. The normal order of links
>> being used to send would look like:
>> 1 2 3 4 5 6  1 2 3 4 5 6  1 2 3 4 5 6 ...
>> 
>> If, for instance, interface 2 where unable to send the order would have been:
>> 1 3 3 4 5 6  1 3 3 4 5 6  1 3 3 4 5 6 ...
>> 
>> The resulting speed (for TCP) would then become:
>> 50 + 0 + 100 + 50 + 50 + 50 = 300 Mbit/s
>> instead of the expected 500 Mbit/s.
>> 
>> If interface 3 also would fail the resulting speed would be half of the
>> expected 400 Mbit/s (33 + 0 + 0 + 100 + 33 + 33).

Are these bandwidth numbers from observation of the actual
behavior?  I'm not sure the real system would behave this way; my
suspicion is that it would increase the likelihood of drops on the
overused slave, not that the overall capacity would be limited.

>> Signed-off-by: Lars Everbrand 
>
>Thanks for the patch!
>
>Looking at the code in question it feels a little like we're breaking
>abstractions if we bump the counter directly in get_slave_by_id.

Agreed; I think a better way to fix this is to enable the slave
array for balance-rr mode, and then use the array to find the right
slave.  This way, we then avoid the problematic "skip unable to tx"
logic for free.

>For one thing when the function is called for IGMP packets the counter
>should not be incremented at all. But also if packets_per_slave is not
>1 we'd still be hitting the same leg multiple times (packets_per_slave
>/ 2). So it seems like we should round the counter up somehow?
>
>For IGMP maybe we don't have to call bond_get_slave_by_id() at all,
>IMHO, just find first leg that can TX. Then we can restructure
>bond_get_slave_by_id() appropriately for the non-IGMP case.

For IGMP, the theory is to confine that traffic to a single
device.  Normally, this will be curr_active_slave, which is updated even
in balance-rr mode as interfaces are added to or removed from the bond.
The call to bond_get_slave_by_id should be a fallback in case
curr_active_slave is empty, and should be the exception, and may not be
possible at all.

But either way, the IGMP path shouldn't mess with rr_tx_counter,
it should be out of band of the normal TX packet counting, so to speak.

-J

>> diff --git a/drivers/net/bonding/bond_main.c 
>> b/drivers/net/bonding/bond_main.c
>> index e0880a3840d7..e02d9c6d40ee 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -4107,6 +4107,7 @@ static struct slave *bond_get_slave_by_id(struct 
>> bonding *bond,
>>  if (--i < 0) {
>>  if (bond_slave_can_tx(slave))
>>  return slave;
>> +bond->rr_tx_counter++;
>>  }
>>  }
>>  
>> @@ -4117,6 +4118,7 @@ static struct slave *bond_get_slave_by_id(struct 
>> bonding *bond,
>>  break;
>>  if (bond_slave_can_tx(slave))
>>  return slave;
>> +bond->rr_tx_counter++;
>>  }
>>  /* no slave that can tx has been found */
>>  return NULL;
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net] bonding: reduce rtnl lock contention in mii monitor thread

2020-12-08 Thread Jay Vosburgh

Jarod Wilson  wrote:

>I'm seeing a system get stuck unable to bring a downed interface back up
>when it's got an updelay value set, behavior which ceased when logging
>spew was removed from bond_miimon_inspect(). I'm monitoring logs on this
>system over another network connection, and it seems that the act of
>spewing logs at all there increases rtnl lock contention, because
>instrumented code showed bond_mii_monitor() never able to succeed in it's
>attempts to call rtnl_trylock() to actually commit link state changes,
>leaving the downed link stuck in BOND_LINK_DOWN. The system in question
>appears to be fine with the log spew being moved to
>bond_commit_link_state(), which is called after the successful
>rtnl_trylock(). I'm actually wondering if perhaps we ultimately need/want
>some bond-specific lock here to prevent racing with bond_close() instead
>of using rtnl, but this shift of the output appears to work. I believe
>this started happening when de77ecd4ef02 ("bonding: improve link-status
>update in mii-monitoring") went in, but I'm not 100% on that.

We use RTNL not to avoid deadlock with bonding itself, but
because the "commit" side undertakes actions which require RTNL, e.g.,
various events will eventually call netdev_lower_state_changed.

However, the RTNL acquisition is a trylock to avoid the deadlock
with bond_close.  Moving that out of line here (e.g., putting the commit
into another work queue event or the like) has the same problem, in that
bond_close needs to wait for all of the work queue events to finish, and
it holds RTNL.

Also, a dim memory says that the various notification messages
were mostly placed in the "inspect" phase and not the "commit" phase to
avoid doing printk-like activities with RTNL held.  As a general
principle, I don't think we want to add more verbiage under RTNL.

>The addition of a case BOND_LINK_BACK in bond_miimon_inspect() is somewhat
>separate from the fix for the actual hang, but it eliminates a constant
>"invalid new link 3 on slave" message seen related to this issue, and it's
>not actually an invalid state here, so we shouldn't be reporting it as an
>error.

Do you mean bond_miimon_commit here and not bond_miimon_inspect
(which already has a case for BOND_LINK_BACK)?

In principle, bond_miimon_commit should not see _BACK or _FAIL
state as a new link state, because those states should be managed at the
bond_miimon_inspect level (as they are the result of updelay and
downdelay).  These states should not be "committed" in the sense of
causing notifications or doing actions that require RTNL.

My recollection is that the "invalid new link" messages were the
result of a bug in de77ecd4ef02, which was fixed in 1899bb325149
("bonding: fix state transition issue in link monitoring"), but maybe
the RTNL problem here induces that in some other fashion.

Either way, I believe this message is correct as-is.

-J

>CC: Mahesh Bandewar 
>CC: Jay Vosburgh 
>CC: Veaceslav Falico 
>CC: Andy Gospodarek 
>CC: "David S. Miller" 
>CC: Jakub Kicinski 
>CC: Thomas Davis 
>CC: net...@vger.kernel.org
>Signed-off-by: Jarod Wilson 
>---
> drivers/net/bonding/bond_main.c | 26 ++
> include/net/bonding.h   | 38 +
> 2 files changed, 44 insertions(+), 20 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 47afc5938c26..cdb6c64f16b6 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -2292,23 +2292,13 @@ static int bond_miimon_inspect(struct bonding *bond)
>   bond_propose_link_state(slave, BOND_LINK_FAIL);
>   commit++;
>   slave->delay = bond->params.downdelay;
>-  if (slave->delay) {
>-  slave_info(bond->dev, slave->dev, "link status 
>down for %sinterface, disabling it in %d ms\n",
>- (BOND_MODE(bond) ==
>-  BOND_MODE_ACTIVEBACKUP) ?
>-  (bond_is_active_slave(slave) ?
>-   "active " : "backup ") : "",
>- bond->params.downdelay * 
>bond->params.miimon);
>-  }
>+
>   fallthrough;
>   case BOND_LINK_FAIL:
>   if (link_state) {
>   /* recovered before downdelay expired */
>   bond_propose_link_state(slave, BOND_LIN

Re: [PATCH net v2] bonding: fix feature flag setting at init time

2020-12-02 Thread Jay Vosburgh

Jarod Wilson  wrote:

>On Wed, Dec 2, 2020 at 12:55 PM Jay Vosburgh  
>wrote:
>>
>> Jarod Wilson  wrote:
>>
>> >Don't try to adjust XFRM support flags if the bond device isn't yet
>> >registered. Bad things can currently happen when netdev_change_features()
>> >is called without having wanted_features fully filled in yet. Basically,
>> >this code was racing against register_netdevice() filling in
>> >wanted_features, and when it got there first, the empty wanted_features
>> >led to features also getting emptied out, which was definitely not the
>> >intended behavior, so prevent that from happening.
>>
>> Is this an actual race?  Reading Ivan's prior message, it sounds
>> like it's an ordering problem (in that bond_newlink calls
>> register_netdevice after bond_changelink).
>
>Sorry, yeah, this is not actually a race condition, just an ordering
>issue, bond_check_params() gets called at init time, which leads to
>bond_option_mode_set() being called, and does so prior to
>bond_create() running, which is where we actually call
>register_netdevice().

So this only happens if there's a "mode" module parameter?  That
doesn't sound like the call path that Ivan described (coming in via
bond_newlink).

-J

>> The change to bond_option_mode_set tests against reg_state, so
>> presumably it wants to skip the first(?) time through, before the
>> register_netdevice call; is that right?
>
>Correct. Later on, when the bonding driver is already loaded, and
>parameter changes are made, bond_option_mode_set() gets called and if
>the mode changes to or from active-backup, we do need/want this code
>to run to update wanted and features flags properly.
>
>
>-- 
>Jarod Wilson
>ja...@redhat.com

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net v2] bonding: fix feature flag setting at init time

2020-12-02 Thread Jay Vosburgh

Jarod Wilson  wrote:

>Don't try to adjust XFRM support flags if the bond device isn't yet
>registered. Bad things can currently happen when netdev_change_features()
>is called without having wanted_features fully filled in yet. Basically,
>this code was racing against register_netdevice() filling in
>wanted_features, and when it got there first, the empty wanted_features
>led to features also getting emptied out, which was definitely not the
>intended behavior, so prevent that from happening.

Is this an actual race?  Reading Ivan's prior message, it sounds
like it's an ordering problem (in that bond_newlink calls
register_netdevice after bond_changelink).

The change to bond_option_mode_set tests against reg_state, so
presumably it wants to skip the first(?) time through, before the
register_netdevice call; is that right?

-J

>Originally, I'd hoped to stop adjusting wanted_features at all in the
>bonding driver, as it's documented as being something only the network
>core should touch, but we actually do need to do this to properly update
>both the features and wanted_features fields when changing the bond type,
>or we get to a situation where ethtool sees:
>
>esp-hw-offload: off [requested on]
>
>I do think we should be using netdev_update_features instead of
>netdev_change_features here though, so we only send notifiers when the
>features actually changed.
>
>v2: rework based on further testing and suggestions from ivecera
>
>Fixes: a3b658cfb664 ("bonding: allow xfrm offload setup post-module-load")
>Reported-by: Ivan Vecera 
>Suggested-by: Ivan Vecera 
>Cc: Jay Vosburgh 
>Cc: Veaceslav Falico 
>Cc: Andy Gospodarek 
>Cc: "David S. Miller" 
>Cc: Jakub Kicinski 
>Cc: Thomas Davis 
>Cc: net...@vger.kernel.org
>Signed-off-by: Jarod Wilson 
>---
> drivers/net/bonding/bond_main.c| 10 --
> drivers/net/bonding/bond_options.c |  6 +-
> 2 files changed, 9 insertions(+), 7 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index e0880a3840d7..5fe5232cc3f3 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -4746,15 +4746,13 @@ void bond_setup(struct net_device *bond_dev)
>   NETIF_F_HW_VLAN_CTAG_FILTER;
> 
>   bond_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL;
>-#ifdef CONFIG_XFRM_OFFLOAD
>-  bond_dev->hw_features |= BOND_XFRM_FEATURES;
>-#endif /* CONFIG_XFRM_OFFLOAD */
>   bond_dev->features |= bond_dev->hw_features;
>   bond_dev->features |= NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_STAG_TX;
> #ifdef CONFIG_XFRM_OFFLOAD
>-  /* Disable XFRM features if this isn't an active-backup config */
>-  if (BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP)
>-  bond_dev->features &= ~BOND_XFRM_FEATURES;
>+  bond_dev->hw_features |= BOND_XFRM_FEATURES;
>+  /* Only enable XFRM features if this is an active-backup config */
>+  if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP)
>+  bond_dev->features |= BOND_XFRM_FEATURES;
> #endif /* CONFIG_XFRM_OFFLOAD */
> }
> 
>diff --git a/drivers/net/bonding/bond_options.c 
>b/drivers/net/bonding/bond_options.c
>index 9abfaae1c6f7..19205cfac751 100644
>--- a/drivers/net/bonding/bond_options.c
>+++ b/drivers/net/bonding/bond_options.c
>@@ -768,11 +768,15 @@ static int bond_option_mode_set(struct bonding *bond,
>   bond->params.tlb_dynamic_lb = 1;
> 
> #ifdef CONFIG_XFRM_OFFLOAD
>+  if (bond->dev->reg_state != NETREG_REGISTERED)
>+  goto noreg;
>+
>   if (newval->value == BOND_MODE_ACTIVEBACKUP)
>   bond->dev->wanted_features |= BOND_XFRM_FEATURES;
>       else
>   bond->dev->wanted_features &= ~BOND_XFRM_FEATURES;
>-  netdev_change_features(bond->dev);
>+  netdev_update_features(bond->dev);
>+noreg:
>
> #endif /* CONFIG_XFRM_OFFLOAD */
> 
>   /* don't cache arp_validate between modes */
>-- 
>2.28.0
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH 2/5] bonding: replace use of the term master where possible

2020-11-11 Thread Jay Vosburgh

Jarod Wilson  wrote:

>Simply refer to what was the bonding "master" as the "bond" or bonding
>device, depending on context. However, do retain compat code for the
>bonding_masters sysfs interface to avoid breaking userspace.
>
>Cc: Jay Vosburgh 
>Cc: Veaceslav Falico 
>Cc: Andy Gospodarek 
>Cc: "David S. Miller" 
>Cc: Jakub Kicinski 
>Cc: Thomas Davis 
>Cc: net...@vger.kernel.org
>Signed-off-by: Jarod Wilson 
>---
> drivers/infiniband/core/cma.c |  2 +-
> drivers/infiniband/core/lag.c |  2 +-
> drivers/infiniband/core/roce_gid_mgmt.c   |  6 +-
> drivers/net/bonding/bond_3ad.c|  2 +-
> drivers/net/bonding/bond_main.c   | 58 +--
> drivers/net/bonding/bond_procfs.c |  4 +-
> drivers/net/bonding/bond_sysfs.c  |  8 +--
> .../net/ethernet/mellanox/mlx4/en_netdev.c| 10 ++--
> .../ethernet/netronome/nfp/flower/lag_conf.c  |  2 +-
> .../ethernet/qlogic/netxen/netxen_nic_main.c  |  8 +--
> include/linux/netdevice.h |  8 +--
> include/net/bonding.h |  4 +-
> 12 files changed, 58 insertions(+), 56 deletions(-)
>
>diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
>index a77750b8954d..3a1679d16e19 100644
>--- a/drivers/infiniband/core/cma.c
>+++ b/drivers/infiniband/core/cma.c
>@@ -4753,7 +4753,7 @@ static int cma_netdev_callback(struct notifier_block 
>*self, unsigned long event,
>   if (event != NETDEV_BONDING_FAILOVER)
>   return NOTIFY_DONE;
> 
>-  if (!netif_is_bond_master(ndev))
>+  if (!netif_is_bond_dev(ndev))
>   return NOTIFY_DONE;
> 
>   mutex_lock();
>diff --git a/drivers/infiniband/core/lag.c b/drivers/infiniband/core/lag.c
>index 7063e41eaf26..2afaca2f9d0b 100644
>--- a/drivers/infiniband/core/lag.c
>+++ b/drivers/infiniband/core/lag.c
>@@ -128,7 +128,7 @@ struct net_device *rdma_lag_get_ah_roce_slave(struct 
>ib_device *device,
>   dev_hold(master);
>   rcu_read_unlock();
> 
>-  if (!netif_is_bond_master(master))
>+  if (!netif_is_bond_dev(master))
>   goto put;
> 
>   slave = rdma_get_xmit_slave_udp(device, master, ah_attr, flags);
>diff --git a/drivers/infiniband/core/roce_gid_mgmt.c 
>b/drivers/infiniband/core/roce_gid_mgmt.c
>index 6b8364bb032d..e06cf51f1773 100644
>--- a/drivers/infiniband/core/roce_gid_mgmt.c
>+++ b/drivers/infiniband/core/roce_gid_mgmt.c
>@@ -129,7 +129,7 @@ enum bonding_slave_state {
> static enum bonding_slave_state is_eth_active_slave_of_bonding_rcu(struct 
> net_device *dev,
>  struct 
> net_device *upper)
> {
>-  if (upper && netif_is_bond_master(upper)) {
>+  if (upper && netif_is_bond_dev(upper)) {
>   struct net_device *pdev =
>   bond_option_active_slave_get_rcu(netdev_priv(upper));
> 
>@@ -216,7 +216,7 @@ is_ndev_for_default_gid_filter(struct ib_device *ib_dev, 
>u8 port,
>* make sure that it the upper netdevice of rdma netdevice.
>*/
>   res = ((cookie_ndev == rdma_ndev && !netif_is_bond_slave(rdma_ndev)) ||
>- (netif_is_bond_master(cookie_ndev) &&
>+ (netif_is_bond_dev(cookie_ndev) &&
>   rdma_is_upper_dev_rcu(rdma_ndev, cookie_ndev)));
> 
>   rcu_read_unlock();
>@@ -271,7 +271,7 @@ is_upper_ndev_bond_master_filter(struct ib_device *ib_dev, 
>u8 port,
>   return false;
> 
>   rcu_read_lock();
>-  if (netif_is_bond_master(cookie_ndev) &&
>+  if (netif_is_bond_dev(cookie_ndev) &&
>   rdma_is_upper_dev_rcu(rdma_ndev, cookie_ndev))
>   match = true;
>   rcu_read_unlock();
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>index 0eb717b0bfc6..852b9c4f6a47 100644
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -2550,7 +2550,7 @@ void bond_3ad_handle_link_change(struct slave *slave, 
>char link)
> }
> 
> /**
>- * bond_3ad_set_carrier - set link state for bonding master
>+ * bond_3ad_set_carrier - set link state for bonding device
>  * @bond: bonding structure
>  *
>  * if we have an active aggregator, we're up, if not, we're down.
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index d79643f6b01e..e9cc7d68f3b9 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -469,8 +469,8 @@ static const struct xfrmdev_ops bond_xfrmdev_ops = {
> 
> /*--- Link stat

Re: [PATCH net-next v4 0/5] bonding: rename bond components

2020-11-11 Thread Jay Vosburgh

Jarod Wilson  wrote:

>The bonding driver's use of master and slave, while largely understood
>in technical circles, poses a barrier for inclusion to some potential
>members of the development and user community, due to the historical
>context of masters and slaves, particularly in the United States. This
>is a first full pass at replacing those phrases with more socially
>inclusive ones, opting for bond to replace master and port to
>replace slave, which is congruent with the bridge and team drivers.
>
>There are a few problems with this change. First up, "port" is used in
>the bonding 802.3ad code, so the first step here is to rename port to
>ad_port, so we can reuse port. Second, we have the issue of not wanting
>to break any existing userspace, which I believe this patchset
>accomplishes, preserving all existing sysfs and procfs interfaces, and
>adding module parameter aliases where necessary.
>
>Third, we do still have the issue of ease of backporting fixes to
>-stable trees. I've not had a huge amount of time to spend on it, but
>brief forays into coccinelle didn't really pay off (since it's meant to
>operate on code, not patches), and the best solution I can come up with
>is providing a shell script someone could run over git-format-patch
>output before git-am'ing the result to a -stable tree, though scripting
>these changes in the first place turned out to be not the best thing to
>do anyway, due to subtle cases where use of master or slave can NOT yet
>be replaced, so a large amount of work was done by hand, inspection,
>trial and error, which is why this set is a lot longer in coming than
>I'd originally hoped. I don't expect -stable backports to be horrible to
>figure out one way or another though, and I don't believe that a bit of
>inconvenience on that front is enough to warrant not making these
>changes.

I think this undersells the impact a bit; this will most likely
break the majority of cherry-picks for the bonding driver to stable
going forward should this patch set be committed.  Yes, the volume of
patches to bonding is relatively low, and the manual backports are not
likely to be technically difficult.  Nevertheless, I expect that most
bonding backports to stable that cross this patch set will require
manual intervention.

As such, I'd still like to see explicit direction from the
kernel development community leadership that change sets of this nature
(not technically driven, with long term maintenance implications) are
changes that should be undertaken rather than are merely permitted.

-J

>See here for further details on Red Hat's commitment to this work:
>https://www.redhat.com/en/blog/making-open-source-more-inclusive-eradicating-problematic-language
>
>As far as testing goes, I've manually operated on various bonds while
>working on this code, and have run it through multiple lnst test runs,
>which exercises the existing sysfs interfaces fairly extensively. As far
>as I can tell through testing and inspection, there is no breakage of
>any existing interfaces with this set.
>
>v2: legacy module parameters are retained this time, and we're trying
>out bond/port instead of aggregator/link in place of master/slave. The
>procfs interface legacy output is also duplicated or dropped, depending
>on Kconfig, rather than being replaced.
>
>v3: remove Kconfig knob, leave sysfs and procfs interfaces entirely
>untouched, but update documentation to reference their deprecated
>nature, explain the name changes, add references to NetworkManager,
>include more netlink/iproute2 examples and make note of netlink
>being the preferred interface for userspace interaction with bonds.
>
>v4: documentation table of contents fixes
>
>Cc: Jay Vosburgh 
>Cc: Veaceslav Falico 
>Cc: Andy Gospodarek 
>Cc: "David S. Miller" 
>Cc: Jakub Kicinski 
>Cc: Thomas Davis 
>Cc: net...@vger.kernel.org
>
>Jarod Wilson (5):
>  bonding: rename 802.3ad's struct port to ad_port
>  bonding: replace use of the term master where possible
>  bonding: rename slave to port where possible
>  bonding: rename bonding_sysfs_slave.c to _port.c
>  bonding: update Documentation for port/bond terminology
>
> .clang-format |4 +-
> Documentation/networking/bonding.rst  |  581 ++--
> drivers/infiniband/core/cma.c |2 +-
> drivers/infiniband/core/lag.c |2 +-
> drivers/infiniband/core/roce_gid_mgmt.c   |   10 +-
> drivers/infiniband/hw/mlx4/main.c |2 +-
> drivers/net/bonding/Makefile  |2 +-
> drivers/net/bonding/bond_3ad.c| 1701 ++--
> drivers/net/bonding/bond_alb.c|  689 ++---
> drivers/net/bonding/bond_debugfs

Re: [PATCH] net: bonding: alb disable balance for IPv6 multicast related mac

2020-10-28 Thread Jay Vosburgh

 will be
>enabled.
>
>[1] https://tools.ietf.org/html/rfc2464#section-7
>[2] https://tools.ietf.org/html/rfc4861
>[3] linux.git/tree/include/net/if_inet6.h#n209-n221
>[4] linux.git/tree/net/ipv6/ndisc.c#n291
>[5] linux.git/tree/net/ipv6/ndisc.c#n346-n348
>[6] https://en.citizendium.org/wiki/Neighbor_Discovery
>
>Signed-off-by: LIU Yulong 
>---
> drivers/net/bonding/bond_alb.c | 10 --
> include/linux/etherdevice.h| 12 
> 2 files changed, 16 insertions(+), 6 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index 095ea51..a4a30bd 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -24,9 +24,6 @@
> #include 
> #include 
> 
>-static const u8 mac_v6_allmcast[ETH_ALEN + 2] __long_aligned = {
>-  0x33, 0x33, 0x00, 0x00, 0x00, 0x01
>-};
> static const int alb_delta_in_ticks = HZ / ALB_TIMER_TICKS_PER_SEC;
> 
> #pragma pack(1)
>@@ -1422,10 +1419,11 @@ struct slave *bond_xmit_alb_slave_get(struct bonding 
>*bond,
>   break;
>   }
> 
>-  /* IPv6 uses all-nodes multicast as an equivalent to
>-   * broadcasts in IPv4.
>+  /* IPv6 multicast destination should disable the tx-balance 
>since
>+   * the pyhsical world may get into a mass status which will lead
>+   * to the IPv6 traffic broken.

I think this comment can be simplified to simply say that IPv6
multicast destinations should not be tx-balanced, which I suspect is the
real purpose.

>*/
>-  if (ether_addr_equal_64bits(eth_data->h_dest, mac_v6_allmcast)) 
>{
>+  if (is_ipv6_multicast_ether_addr(eth_data->h_dest)) {
>   do_tx_balance = false;
>   break;
>   }
>diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
>index 2e5debc..c6101ab 100644
>--- a/include/linux/etherdevice.h
>+++ b/include/linux/etherdevice.h
>@@ -178,6 +178,18 @@ static inline bool is_unicast_ether_addr(const u8 *addr)
> }
> 
> /**
>+ * is_ipv6_multicast_ether_addr - Determine if the Ethernet address is for
>+ *  IPv6 multicast (rfc2464).
>+ * @addr: Pointer to a six-byte array containing the Ethernet address
>+ *
>+ * Return true if the address is a multicast for IPv6.
>+ */
>+static inline bool is_ipv6_multicast_ether_addr(const u8 *addr)
>+{
>+  return (addr[0] & addr[1]) == 0x33;
>+}

I don't think this does what is intended.  It will return true
for a MAC that starts with any two values whose bitwise AND is 0x33,
e.g., 0x73 0x3b.  For IPv6 multicast, the first two octets of the MAC
must be exactly 0x33 0x33.

-J

>+
>+/**
>  * is_valid_ether_addr - Determine if the given Ethernet address is valid
>  * @addr: Pointer to a six-byte array containing the Ethernet address
>  *
>-- 
>1.8.3.1

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] net: bonding: alb disable balance for IPv6 multicast related mac

2020-10-28 Thread Jay Vosburgh

LIU Yulong  wrote:

>Hi Jay,
>
> 
>
>Thank you for your response and review. Please see my inline comments.

I'm still reviewing your commentary, but to answer your final
question regarding updating the patch, you'll need to repost the entire
patch (with the new changes).  This repost should change the Subject to
include "v2" in the PATCH brackets, and add a change log after the
signoff block.  The detailed expectations are laid out in

https://www.kernel.org/doc/Documentation/process/submitting-patches.rst

-J

> 
>
>>According to the RFC 2464 [1] the prefix "33:33:xx:xx:xx:xx" is defined
>to
>
>>construct the multicast destination MAC address for IPv6 multicast
>traffic.
>
>>The NDP (Neighbor Discovery Protocol for IPv6)[2] will comply with such
>
>>rule. The work steps [6] are:
>
>>  *) Let's assume a destination address of 2001:db8:1:1::1.
>
>>  *) This is mapped into the "Solicited Node Multicast Address" (SNMA)
>
>> format of ff02::1:ffXX:.
>
>>  *) The XX: represent the last 24 bits of the SNMA, and are derived
>
>> directly from the last 24 bits of the destination address.
>
>>  *) Resulting in a SNMA ff02::1:ff00:0001, or ff02::1:ff00:1.
>
>>  *) This, being a multicast address, can be mapped to a multicast MAC
>
>> address, using the format 33-33-XX-XX-XX-XX
>
>>  *) Resulting in 33-33-ff-00-00-01.
>
>>  *) This is a MAC address that is only being listened for by nodes
>
>> sharing the same last 24 bits.
>
>>  *) In other words, while there is a chance for a "address collision",
>
>> it is a vast improvement over ARP's guaranteed "collision".
>
>>Kernel related code can be found at [3][4][5].
>
>>
>
> 
>
>>The current bond alb has some leaks of such MAC ranges which will cause
>
>>the physical world failed to determain the back tunnel of the reply
>
>>packet during the response in a Spine-and-Leaf data center architecture.
>
>>The basic topology looks like this:
>
>>
>
> 
>
>>+-+
>
>>| |
>
>>+---| Border Leaf |-+
>
>>|   | | |
>
>>|   +-+ |
>
>>|   |
>
>>| tunnel-1  | tunnel-2
>
>>|   |
>
>>|   |
>
>>+---++   +--+-+
>
>>||   ||
>
>>| Leaf1  +--X-X-X-X--+  Leaf2 |  tunnel-3 will be checked to prevent loop
>
>>||  tunnel-3 ||
>
>>++   +-+--+
>
>> | |
>
>> | |
>
>> | |
>
>> | |
>
>> | |
>
>> | |
>
>> ++   ++
>
>>  +--+nic1+---+nic2+---+
>
>>  |  ++   ++   |
>
>>  |   bond6|
>
>>  ||
>
>>  |   HOST |
>
>>  ++
>
> 
>
>>This description is, overall, very comprehensive, and I believe
>
>>I generally understand what issue you're fixing (which seems to be a
>
>>complicated means to cause MAC flapping), although I'm unclear on a few
>
>>details, below.
>
> 
>
>>However, if you could make the ASCII art smaller I think that
>
>>would be better.
>
> 
>
>Done, I will update.
>
> 
>
>>When nic1 is sending the normal IPv6 traffic to the gateway in Border
>leaf,
>
>>the nic2 (slave) will send the NS packet out periodically, automatically
>
>>and implicitly as well. This is an example packet sending from the slave
>
>>nic2 which will broke the traffic.
>
> 
>
>>With this patch applied, what would happen if nic2 sends the
>
>>normal IPv6 traffic from the source MAC in question (because it is
>
>>tx-balanced there), and the Neighbor Solicitation multicast then goes
>
>>out via nic1?
>
> 
>
>Packets sent out from nic2 kernel will change the source MAC to the slave
>MAC,
>
>aka nic2's MAC, so nic2 normal traffic is just fine.
>
>Related code can be found here (I'm not pretty sure about the real code
>place, but this is what I see):
>
>https://github.com/torvalds/linux/blob/master/drivers/net/bonding/bond_alb.c#L1320
>
>
>The following MAC "ac:1f:6b:90:5c:eb" is the nic2 MAC, while the
>"fa:16:3e:ba:2d:8c"
>
>is the real packet source MAC.
>
> 
>
>>  ac:1f:6b:90:5c:eb > 33:33:ff:00:00:01, ethertype 802.1Q (0x8100),
>
>>  length 90: vlan 205, p 0, ethertype IPv6, (hlim 255,
>
>>  next-header ICMPv6 (58) payload length: 32)
>
>>  fe80::f816:3eff:feba:2d8c > ff02::1:ff00:1:
>
>>  [icmp6 sum ok] ICMP6, neighbor solicitation, length 32,
>
>>  who has 240e:980:2f00:4000::1
>
>>  source link-address option (1), length 8 (1): fa:16:3e:ba:2d:8c
>
> 
>
>>And perhaps trim out the hex dump here.
>
> 
>
>Done, removed.
>
> 
>
>>MAC "fa:16:3e:ba:2d:8c" was first learnt at Leaf1 based on the underlay
>
>>mechanism(BGP EVPN). When this example packet was sent to Border leaf and
>
>>replied with dst_MAC "fa:16:3e:ba:2d:8c", Leaf2 will try to send packet
>
>>back to tunnel-3 at this point dropping happens because of

Re: [PATCH net-next v2 6/6] bonding: make Kconfig toggle to disable legacy interfaces

2020-10-05 Thread Jay Vosburgh

Jarod Wilson  wrote:

>On Fri, Oct 2, 2020 at 6:57 PM David Miller  wrote:
>>
>> From: Jarod Wilson 
>> Date: Fri, 2 Oct 2020 16:23:46 -0400
>>
>> > I'd had a bit of feedback that people would rather see both, and be
>> > able to toggle off the old ones, rather than only having one or the
>> > other, depending on the toggle, so I thought I'd give this a try. I
>> > kind of liked the one or the other route, but I see the problems with
>> > that too.
>>
>> Please keep everything for the entire deprecation period, unconditionally.
>
>Okay, so 100% drop the Kconfig flag patch, but duplicate sysfs
>interface names are acceptable, correct? Then what about the procfs
>file having duplicate Slave and Port lines? Just leave them all as
>Slave?

My preference is to not alter the existing sysfs / proc behavior
at all, and instead create a netlink / iproute UAPI that becomes the
"preferred" UAPI going forward.  Any new functionality would only be
added to netlink as incentive to switch.

I don't see the value in adding duplicate fields, as userspace
code that parses them will not be portable if it only checks for the new
field name.  Then, down the road, deleting the old names will break the
userspace code that was never updated (which will likely be most of it).

I would rather see a single clean break from proc and sysfs to
netlink in one step than a piecemeal addition and removal from the
existing UAPI.  That makes for a much clearer flag day event for end
users.  By this I mean leave proc / sysfs as-is today, and then after a
suitable deprecation period, remove it wholesale (rather than a compile
time option).

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next 4/5] bonding: make Kconfig toggle to disable legacy interfaces

2020-09-24 Thread Jay Vosburgh

Jarod Wilson  wrote:

>On Tue, Sep 22, 2020 at 8:01 PM Stephen Hemminger
> wrote:
>>
>> On Tue, 22 Sep 2020 16:47:07 -0700
>> Jay Vosburgh  wrote:
>>
>> > Stephen Hemminger  wrote:
>> >
>> > >On Tue, 22 Sep 2020 09:37:30 -0400
>> > >Jarod Wilson  wrote:
>> > >
>> > >> By default, enable retaining all user-facing API that includes the use 
>> > >> of
>> > >> master and slave, but add a Kconfig knob that allows those that wish to
>> > >> remove it entirely do so in one shot.
>> > >>
>> > >> Cc: Jay Vosburgh 
>> > >> Cc: Veaceslav Falico 
>> > >> Cc: Andy Gospodarek 
>> > >> Cc: "David S. Miller" 
>> > >> Cc: Jakub Kicinski 
>> > >> Cc: Thomas Davis 
>> > >> Cc: net...@vger.kernel.org
>> > >> Signed-off-by: Jarod Wilson 
>> > >
>> > >Why not just have a config option to remove all the /proc and sysfs 
>> > >options
>> > >in bonding (and bridging) and only use netlink? New tools should be only 
>> > >able
>> > >to use netlink only.
>> >
>> >   I agree that new tooling should be netlink, but what value is
>> > provided by such an option that distros are unlikely to enable, and
>> > enabling will break the UAPI?
>
>Do you mean the initial proposed option, or what Stephen is
>suggesting? I think Red Hat actually will consider the former, the
>latter is less likely in the immediate future, since so many people
>still rely on the output of /proc/net/bonding/* for an overall view of
>their bonds' health and status. I don't know how close we are to
>having something comparable that could be spit out with a single
>invocation of something like 'ip' that would only be using netlink.
>It's entirely possible there's something akin to 'ip link bondX
>overview' already that outputs something similar, and I'm just not
>aware of it, but something like that would definitely need to exist
>and be well-documented for Red Hat to remove the procfs bits, I think.

At the present time, as much as the idea spurs the imagination,
removing the bonding /proc and sysfs stuff wholesale is not feasible.
As you explain, not everything in the proc file is available from other
sources.  I would rather freeze the /proc and sysfs bonding
functionality and move to a netlink / iproute API for all of it, and
then down the road remove the then-legacy APIs.

Even though "down the road" may practically be "never" (because
the removal breaks backwards compatibility for user space), unifying all
of the configuration and reporting to one place would be worthwhile.

For "initial proposed option," I'm not sure right off if that's
referring to CONFIG_BONDING_LEGACY_INTERFACES or "duplicate lines in
/proc/net/bonding."  I'm not sure it matters, since both have the same
problem, in that they create a Venn diagram of mutually incompatible
bonding UAPIs.  Portable user space code ends up having to handle all of
the permutations.

-J

>> > >Then you might convince maintainers to update documentation as well.
>> > >Last I checked there were still references to ifenslave.
>> >
>> >   Distros still include ifenslave, but it's now a shell script
>> > that uses sysfs.  I see it used in scripts from time to time.
>>
>> Some bleeding edge distros have already dropped ifenslave and even ifconfig.
>> The Enterprise ones never will.
>>
>> The one motivation would be for the embedded folks which are always looking
>> to trim out the fat. Although not sure if the minimal versions of commands
>> in busybox are pure netlink yet.
>
>Yeah, the bonding documentation is still filled with references to
>ifenslave. I believe Red Hat still includes it, though it's
>"deprecated" in documentation in favor of using ip. Similar with
>ifconfig. I could see them both getting dropped in a future major
>release of Red Hat Enterprise Linux, but they're definitely still here
>for at least the life of RHEL8.

As ifconfig is typically bundled in with the much-loved netstat
in the net-tools package, it will be difficult to remove.

Having an /sbin/ifenslave program isn't really the issue so much
as its reliance on the bonding sysfs UAPI.  It's a shell script, and
could likely be reworked to use ip link.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next 4/5] bonding: make Kconfig toggle to disable legacy interfaces

2020-09-22 Thread Jay Vosburgh

Stephen Hemminger  wrote:

>On Tue, 22 Sep 2020 09:37:30 -0400
>Jarod Wilson  wrote:
>
>> By default, enable retaining all user-facing API that includes the use of
>> master and slave, but add a Kconfig knob that allows those that wish to
>> remove it entirely do so in one shot.
>> 
>> Cc: Jay Vosburgh 
>> Cc: Veaceslav Falico 
>> Cc: Andy Gospodarek 
>> Cc: "David S. Miller" 
>> Cc: Jakub Kicinski 
>> Cc: Thomas Davis 
>> Cc: net...@vger.kernel.org
>> Signed-off-by: Jarod Wilson 
>
>Why not just have a config option to remove all the /proc and sysfs options
>in bonding (and bridging) and only use netlink? New tools should be only able
>to use netlink only.

I agree that new tooling should be netlink, but what value is
provided by such an option that distros are unlikely to enable, and
enabling will break the UAPI?

>Then you might convince maintainers to update documentation as well.
>Last I checked there were still references to ifenslave.

Distros still include ifenslave, but it's now a shell script
that uses sysfs.  I see it used in scripts from time to time.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next 0/5] bonding: rename bond components

2020-09-22 Thread Jay Vosburgh

Jarod Wilson  wrote:

>The bonding driver's use of master and slave, while largely understood
>in technical circles, poses a barrier for inclusion to some potential
>members of the development and user community, due to the historical
>context of masters and slaves, particularly in the United States. This
>is a first full pass at replacing those phrases with more socially
>inclusive ones, opting for aggregator to replace master and link to
>replace slave, as the bonding driver itself is a link aggregation
>driver.

First, I think there should be some direction from the kernel
development leadership as to whether or not this type of large-scale
search and replace of socially sensitive terms of art or other
terminology is a task that should be undertaken, given the noted issues
it will cause in stable release maintenance going forwards.

Second, on the merits of the proposed changes (presuming for the
moment that this goes forward), I would prefer different nomenclature
that does not reuse existing names for different purposes, i.e., "link"
and "aggregator."

Both of those have specific meanings in the current code, and
old kernels will retain that meaning.  Changing them to have new
meanings going forward will lead to confusion, in my opinion for no good
reason, as there are other names suited that do not conflict.

For example, instead of "master" call everything a "bond," which
matches common usage in discussion.  Changing "master" to "aggregator,"
the replacement results in cumbersome descriptions like "the
aggregator's active aggregator" in the context of LACP.

A replacement term for "slave" is trickier; my first choice
would be "port," but that may make more churn from a code change point
of view, although struct slave could become struct bond_port, and leave
the existing struct port for its current LACP use.

>There are a few problems with this change. First up, "link" is used for
>link state already in the bonding driver, so the first step here is to
>rename link to link_state. Second, aggregator is already used in the
>802.3ad code, but I feel the usage is actually consistent with referring
>to the bonding aggregation virtual device as the aggregator. Third, we
>have the issue of not wanting to break any existing userspace, which I
>believe this patchset accomplishes, while also adding alternative
>interfaces using new terminology, and a Kconfig option that will let
>people make the conscious decision to break userspace and no longer
>expose the original master/slave interfaces, once their userspace is
>able to cope with their removal.

I'm opposed to the Kconfig option because it will lead to
balkanization of the UAPI, which would be worse than a clean break
(which I'm also opposed to).

>Lastly, we do still have the issue of ease of backporting fixes to
>-stable trees. I've not had a huge amount of time to spend on it, but
>brief forays into coccinelle didn't really pay off (since it's meant to
>operate on code, not patches), and the best solution I can come up with
>is providing a shell script someone could run over git-format-patch
>output before git-am'ing the result to a -stable tree, though scripting
>these changes in the first place turned out to be not the best thing to
>do anyway, due to subtle cases where use of master or slave can NOT yet
>be replaced, so a large amount of work was done by hand, inspection,
>trial and error, which is why this set is a lot longer in coming than
>I'd originally hoped. I don't expect -stable backports to be horrible to
>figure out one way or another though, and I don't believe that a bit of
>inconvenience on that front is enough to warrant not making these
>changes.

I'm skeptical that, given the scope of the changes involved,
that it's really feasible to have effective automated massaging of
patches.  I think the reality is that a large fraction of the bonding
fixes in the future will have to be backported entirely by hand.  The
only saving grace here is that the quantity of such patches is generally
low (~40 in 2020 year to date).

-J

>See here for further details on Red Hat's commitment to this work:
>https://www.redhat.com/en/blog/making-open-source-more-inclusive-eradicating-problematic-language
>
>As far as testing goes, I've manually operated on various bonds while
>working on this code, and have run it through multiple lnst test runs,
>which exercises the existing sysfs interfaces fairly extensively. As far
>as I can tell, there is no breakage of existing interfaces with this
>set, unless the user consciously opts to do so via Kconfig.
>
>Jarod Wilson (5):
>  bonding: rename struct slave member link to link_state
>  bonding: rename slave to link whe

Re: [PATCH net] bonding: show saner speed for broadcast mode

2020-08-12 Thread Jay Vosburgh

Jarod Wilson  wrote:

>Broadcast mode bonds transmit a copy of all traffic simultaneously out of
>all interfaces, so the "speed" of the bond isn't really the aggregate of
>all interfaces, but rather, the speed of the lowest active interface.

Did you mean "slowest" here?

>Also, the type of the speed field is u32, not unsigned long, so adjust
>that accordingly, as required to make min() function here without
>complaining about mismatching types.
>
>Fixes: bb5b052f751b ("bond: add support to read speed and duplex via ethtool")
>CC: Jay Vosburgh 
>CC: Veaceslav Falico 
>CC: Andy Gospodarek 
>CC: "David S. Miller" 
>CC: net...@vger.kernel.org
>Signed-off-by: Jarod Wilson 

Did you notice this by inspection, or did it come up in use
somewhere?  I can't recall ever hearing of anyone using broadcast mode,
so I'm curious if there is a use for it, but this change seems
reasonable enough regardless.

-J

Acked-by: Jay Vosburgh 


>---
> drivers/net/bonding/bond_main.c | 21 ++---
> 1 file changed, 18 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 5ad43aaf76e5..c853ca67058c 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -4552,13 +4552,23 @@ static netdev_tx_t bond_start_xmit(struct sk_buff 
>*skb, struct net_device *dev)
>   return ret;
> }
> 
>+static u32 bond_mode_bcast_speed(struct slave *slave, u32 speed)
>+{
>+  if (speed == 0 || speed == SPEED_UNKNOWN)
>+  speed = slave->speed;
>+  else
>+  speed = min(speed, slave->speed);
>+
>+  return speed;
>+}
>+
> static int bond_ethtool_get_link_ksettings(struct net_device *bond_dev,
>  struct ethtool_link_ksettings *cmd)
> {
>   struct bonding *bond = netdev_priv(bond_dev);
>-  unsigned long speed = 0;
>   struct list_head *iter;
>   struct slave *slave;
>+  u32 speed = 0;
> 
>   cmd->base.duplex = DUPLEX_UNKNOWN;
>   cmd->base.port = PORT_OTHER;
>@@ -4570,8 +4580,13 @@ static int bond_ethtool_get_link_ksettings(struct 
>net_device *bond_dev,
>*/
>   bond_for_each_slave(bond, slave, iter) {
>   if (bond_slave_can_tx(slave)) {
>-  if (slave->speed != SPEED_UNKNOWN)
>-  speed += slave->speed;
>+  if (slave->speed != SPEED_UNKNOWN) {
>+  if (BOND_MODE(bond) == BOND_MODE_BROADCAST)
>+  speed = bond_mode_bcast_speed(slave,
>+speed);
>+  else
>+  speed += slave->speed;
>+  }
>   if (cmd->base.duplex == DUPLEX_UNKNOWN &&
>   slave->duplex != DUPLEX_UNKNOWN)
>   cmd->base.duplex = slave->duplex;
>-- 
>2.20.1
>

Re: [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices

2020-06-24 Thread Jay Vosburgh

Jay Vosburgh  wrote:

>sathyanarayanan.kuppusw...@linux.intel.com wrote:
>
>From: Kuppuswamy Sathyanarayanan 
>>
>>Fatal (DPC) error recovery is currently broken for non-hotplug
>>capable devices. With current implementation, after successful
>>fatal error recovery, non-hotplug capable device state won't be
>>restored properly. You can find related issues in following links.
>>
>>https://lkml.org/lkml/2020/5/27/290
>>https://lore.kernel.org/linux-pci/12115.1588207324@famine/
>>https://lkml.org/lkml/2020/3/28/328
>>
>>Current fatal error recovery implementation relies on hotplug handler
>>for detaching/re-enumerating the affected devices/drivers on DLLSC
>>state changes. So when dealing with non-hotplug capable devices,
>>recovery code does not restore the state of the affected devices
>>correctly. Correct implementation should call report_slot_reset()
>>function after resetting the link to restore the state of the
>>device/driver.
>>
>>So use PCI_ERS_RESULT_NEED_RESET as error status for successful
>>reset_link() operation and use PCI_ERS_RESULT_DISCONNECT for failure
>>case. PCI_ERS_RESULT_NEED_RESET error state will ensure slot_reset()
>>is called after reset link operation which will also fix the above
>>mentioned issue.
>>
>>[original patch is from jay.vosbu...@canonical.com]
>>[original patch link 
>>https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
>>Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
>>Signed-off-by: Jay Vosburgh 
>>Signed-off-by: Kuppuswamy Sathyanarayanan 
>>
>
>   I've tested this patch set on one of our test machines, and it
>resolves the issue.  I plan to test with other systems tomorrow.

I've done testing on two different systems that exhibit the
original issue and this patch set appears to behave as expected.

Has anyone else (Yicong?) had an opportunity to test this?

Can this be considered for acceptance, or is additional feedback
or review needed?

-J

>>---
>> drivers/pci/pcie/err.c | 24 ++--
>> 1 file changed, 22 insertions(+), 2 deletions(-)
>>
>>diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>index 14bb8f54723e..5fe8561c7185 100644
>>--- a/drivers/pci/pcie/err.c
>>+++ b/drivers/pci/pcie/err.c
>>@@ -165,8 +165,28 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>  pci_dbg(dev, "broadcast error_detected message\n");
>>  if (state == pci_channel_io_frozen) {
>>  pci_walk_bus(bus, report_frozen_detected, );
>>- status = reset_link(dev);
>>- if (status != PCI_ERS_RESULT_RECOVERED) {
>>+ /*
>>+  * After resetting the link using reset_link() call, the
>>+  * possible value of error status is either
>>+  * PCI_ERS_RESULT_DISCONNECT (failure case) or
>>+  * PCI_ERS_RESULT_NEED_RESET (success case).
>>+  * So ignore the return value of report_error_detected()
>>+  * call for fatal errors. Instead use
>>+  * PCI_ERS_RESULT_NEED_RESET as initial status value.
>>+  *
>>+  * Ignoring the status return value of report_error_detected()
>>+  * call will also help in case of EDR mode based error
>>+  * recovery. In EDR mode AER and DPC Capabilities are owned by
>>+  * firmware and hence report_error_detected() call will possibly
>>+  * return PCI_ERS_RESULT_NO_AER_DRIVER. So if we don't ignore
>>+  * the return value of report_error_detected() then
>>+  * pcie_do_recovery() would report incorrect status after
>>+  * successful recovery. Ignoring PCI_ERS_RESULT_NO_AER_DRIVER
>>+      * in non EDR case should not have any functional impact.
>>+  */
>>+ status = PCI_ERS_RESULT_NEED_RESET;
>>+ if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
>>+ status = PCI_ERS_RESULT_DISCONNECT;
>>  pci_warn(dev, "link reset failed\n");
>>  goto failed;
>>  }
>>-- 
>>2.17.1

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next 3/4] bonding: support hardware encryption offload to slaves

2020-06-08 Thread Jay Vosburgh

Jarod Wilson  wrote:

>Currently, this support is limited to active-backup mode, as I'm not sure
>about the feasilibity of mapping an xfrm_state's offload handle to
>multiple hardware devices simultaneously, and we rely on being able to
>pass some hints to both the xfrm and NIC driver about whether or not
>they're operating on a slave device.
>
>I've tested this atop an Intel x520 device (ixgbe) using libreswan in
>transport mode, succesfully achieving ~4.3Gbps throughput with netperf
>(more or less identical to throughput on a bare NIC in this system),
>as well as successful failover and recovery mid-netperf.
>
>v2: rebase on latest net-next and wrap with #ifdef CONFIG_XFRM_OFFLOAD
>v3: add new CONFIG_BOND_XFRM_OFFLOAD option and fix shutdown path
>
>CC: Jay Vosburgh 
>CC: Veaceslav Falico 
>CC: Andy Gospodarek 
>CC: "David S. Miller" 
>CC: Jeff Kirsher 
>CC: Jakub Kicinski 
>CC: Steffen Klassert 
>CC: Herbert Xu 
>CC: net...@vger.kernel.org
>CC: intel-wired-...@lists.osuosl.org
>Signed-off-by: Jarod Wilson 
>
>Signed-off-by: Jarod Wilson 
>---
> drivers/net/Kconfig |  11 
> drivers/net/bonding/bond_main.c | 111 +++-
> include/net/bonding.h   |   3 +
> 3 files changed, 122 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
>index c7d310ef1c83..938c4dd9bfb9 100644
>--- a/drivers/net/Kconfig
>+++ b/drivers/net/Kconfig
>@@ -56,6 +56,17 @@ config BONDING
> To compile this driver as a module, choose M here: the module
> will be called bonding.
> 
>+config BONDING_XFRM_OFFLOAD
>+  bool "Bonding driver IPSec XFRM cryptography-offload pass-through 
>support"
>+  depends on BONDING
>+  depends on XFRM_OFFLOAD
>+  default y
>+  select XFRM_ALGO
>+  ---help---
>+Enable support for IPSec offload pass-through in the bonding driver.
>+Currently limited to active-backup mode only, and requires slave
>+devices that support hardware crypto offload.
>+

Why is this a separate Kconfig option?  Is it reasonable to
expect users to enable XFRM_OFFLOAD but not BONDING_XFRM_OFFLOAD?

> config DUMMY
>   tristate "Dummy net driver support"
>   ---help---
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index a25c65d4af71..01b80cef492a 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -79,6 +79,7 @@
> #include 
> #include 
> #include 
>+#include 
> #include 
> #include 
> #include 
>@@ -278,8 +279,6 @@ const char *bond_mode_name(int mode)
>   return names[mode];
> }
> 
>-/*-- VLAN 
>---*/
>-
> /**
>  * bond_dev_queue_xmit - Prepare skb for xmit.
>  *
>@@ -302,6 +301,8 @@ netdev_tx_t bond_dev_queue_xmit(struct bonding *bond, 
>struct sk_buff *skb,
>   return dev_queue_xmit(skb);
> }
> 
>+/*-- VLAN 
>---*/
>+
> /* In the following 2 functions, bond_vlan_rx_add_vid and 
> bond_vlan_rx_kill_vid,
>  * We don't protect the slave list iteration with a lock because:
>  * a. This operation is performed in IOCTL context,
>@@ -372,6 +373,84 @@ static int bond_vlan_rx_kill_vid(struct net_device 
>*bond_dev,
>   return 0;
> }
> 
>+/*-- XFRM 
>---*/
>+
>+#ifdef CONFIG_BONDING_XFRM_OFFLOAD
>+/**
>+ * bond_ipsec_add_sa - program device with a security association
>+ * @xs: pointer to transformer state struct
>+ **/
>+static int bond_ipsec_add_sa(struct xfrm_state *xs)
>+{
>+  struct net_device *bond_dev = xs->xso.dev;
>+  struct bonding *bond = netdev_priv(bond_dev);
>+  struct slave *slave = rtnl_dereference(bond->curr_active_slave);
>+
>+  xs->xso.slave_dev = slave->dev;
>+  bond->xs = xs;
>+
>+  if (!(slave->dev->xfrmdev_ops
>+&& slave->dev->xfrmdev_ops->xdo_dev_state_add)) {
>+  slave_warn(bond_dev, slave->dev, "Slave does not support ipsec 
>offload\n");
>+  return -EINVAL;
>+  }
>+
>+  return slave->dev->xfrmdev_ops->xdo_dev_state_add(xs);
>+}
>+
>+/**
>+ * bond_ipsec_del_sa - clear out this specific SA
>+ * @xs: pointer to transformer state struct
>+ **/
>+static void bond_ipsec_del_sa(struct xfrm_state *xs)
>+{
>+  struct net_device *bond_dev = xs->xso.dev;
>+  struct bonding *bond = netdev_priv(bond_dev);
>+  struct slave *slave = rtnl_derefere

Re: [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices

2020-06-04 Thread Jay Vosburgh

sathyanarayanan.kuppusw...@linux.intel.com wrote:

>From: Kuppuswamy Sathyanarayanan 
>
>Fatal (DPC) error recovery is currently broken for non-hotplug
>capable devices. With current implementation, after successful
>fatal error recovery, non-hotplug capable device state won't be
>restored properly. You can find related issues in following links.
>
>https://lkml.org/lkml/2020/5/27/290
>https://lore.kernel.org/linux-pci/12115.1588207324@famine/
>https://lkml.org/lkml/2020/3/28/328
>
>Current fatal error recovery implementation relies on hotplug handler
>for detaching/re-enumerating the affected devices/drivers on DLLSC
>state changes. So when dealing with non-hotplug capable devices,
>recovery code does not restore the state of the affected devices
>correctly. Correct implementation should call report_slot_reset()
>function after resetting the link to restore the state of the
>device/driver.
>
>So use PCI_ERS_RESULT_NEED_RESET as error status for successful
>reset_link() operation and use PCI_ERS_RESULT_DISCONNECT for failure
>case. PCI_ERS_RESULT_NEED_RESET error state will ensure slot_reset()
>is called after reset link operation which will also fix the above
>mentioned issue.
>
>[original patch is from jay.vosbu...@canonical.com]
>[original patch link 
>https://lore.kernel.org/linux-pci/12115.1588207324@famine/]
>Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
>Signed-off-by: Jay Vosburgh 
>Signed-off-by: Kuppuswamy Sathyanarayanan 
>

I've tested this patch set on one of our test machines, and it
resolves the issue.  I plan to test with other systems tomorrow.

-J

>---
> drivers/pci/pcie/err.c | 24 ++--
> 1 file changed, 22 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>index 14bb8f54723e..5fe8561c7185 100644
>--- a/drivers/pci/pcie/err.c
>+++ b/drivers/pci/pcie/err.c
>@@ -165,8 +165,28 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   pci_dbg(dev, "broadcast error_detected message\n");
>   if (state == pci_channel_io_frozen) {
>   pci_walk_bus(bus, report_frozen_detected, );
>-  status = reset_link(dev);
>-  if (status != PCI_ERS_RESULT_RECOVERED) {
>+  /*
>+   * After resetting the link using reset_link() call, the
>+   * possible value of error status is either
>+   * PCI_ERS_RESULT_DISCONNECT (failure case) or
>+   * PCI_ERS_RESULT_NEED_RESET (success case).
>+   * So ignore the return value of report_error_detected()
>+   * call for fatal errors. Instead use
>+   * PCI_ERS_RESULT_NEED_RESET as initial status value.
>+   *
>+   * Ignoring the status return value of report_error_detected()
>+   * call will also help in case of EDR mode based error
>+   * recovery. In EDR mode AER and DPC Capabilities are owned by
>+   * firmware and hence report_error_detected() call will possibly
>+   * return PCI_ERS_RESULT_NO_AER_DRIVER. So if we don't ignore
>+   * the return value of report_error_detected() then
>+   * pcie_do_recovery() would report incorrect status after
>+   * successful recovery. Ignoring PCI_ERS_RESULT_NO_AER_DRIVER
>+   * in non EDR case should not have any functional impact.
>+   */
>+  status = PCI_ERS_RESULT_NEED_RESET;
>+  if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
>+  status = PCI_ERS_RESULT_DISCONNECT;
>   pci_warn(dev, "link reset failed\n");
>   goto failed;
>   }
>-- 
>2.17.1
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding: Fix reference count leak in bond_sysfs_slave_add.

2020-05-28 Thread Jay Vosburgh

wu000...@umn.edu wrote:

>From: Qiushi Wu 
>
>kobject_init_and_add() takes reference even when it fails.
>If this function returns an error, kobject_put() must be called to
>properly clean up the memory associated with the object. Previous
>commit "b8eb718348b8" fixed a similar problem.
>
>Fixes: 07699f9a7c8d ("bonding: add sysfs /slave dir for bond slave devices.")
>Signed-off-by: Qiushi Wu 

Acked-by: Jay Vosburgh 


>---
> drivers/net/bonding/bond_sysfs_slave.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_sysfs_slave.c 
>b/drivers/net/bonding/bond_sysfs_slave.c
>index 007481557191..9b8346638f69 100644
>--- a/drivers/net/bonding/bond_sysfs_slave.c
>+++ b/drivers/net/bonding/bond_sysfs_slave.c
>@@ -149,8 +149,10 @@ int bond_sysfs_slave_add(struct slave *slave)
> 
>   err = kobject_init_and_add(>kobj, _ktype,
>  &(slave->dev->dev.kobj), "bonding_slave");
>-  if (err)
>+  if (err) {
>+  kobject_put(>kobj);
>   return err;
>+  }
> 
>   for (a = slave_attrs; *a; ++a) {
>   err = sysfs_create_file(>kobj, &((*a)->attr));
>-- 
>2.17.1
>

Re: [PATCH v1 1/1] PCI/ERR: Handle fatal error recovery for non-hotplug capable devices

2020-05-12 Thread Jay Vosburgh

sathyanarayanan.kuppusw...@linux.intel.com wrote:

>From: Kuppuswamy Sathyanarayanan 
>
>If there are non-hotplug capable devices connected to a given
>port, then during the fatal error recovery(triggered by DPC or
>AER), after calling reset_link() function, we cannot rely on
>hotplug handler to detach and re-enumerate the device drivers
>in the affected bus. Instead, we will have to let the error
>recovery handler call report_slot_reset() for all devices in
>the bus to notify about the reset operation. Although this is
>only required for non hot-plug capable devices, doing it for
>hotplug capable devices should not affect the functionality.

Yicong,

Does the patch below also resolve the issue for you, as with
your changed version of my original patch?

-J

>Along with above issue, this fix also applicable to following
>issue.
>
>Commit 6d2c89441571 ("PCI/ERR: Update error status after
>reset_link()") added support to store status of reset_link()
>call. Although this fixed the error recovery issue observed if
>the initial value of error status is PCI_ERS_RESULT_DISCONNECT
>or PCI_ERS_RESULT_NO_AER_DRIVER, it also discarded the status
>result from report_frozen_detected. This can cause a failure to
>recover if _NEED_RESET is returned by report_frozen_detected and
>report_slot_reset is not invoked.
>
>Such an event can be induced for testing purposes by reducing the
>Max_Payload_Size of a PCIe bridge to less than that of a device
>downstream from the bridge, and then initiating I/O through the
>device, resulting in oversize transactions.  In the presence of DPC,
>this results in a containment event and attempted reset and recovery
>via pcie_do_recovery.  After 6d2c89441571 report_slot_reset is not
>invoked, and the device does not recover.
>
>[original patch is from jay.vosbu...@canonical.com]
>[original patch link 
>https://lore.kernel.org/linux-pci/18609.1588812972@famine/]
>Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")
>Signed-off-by: Jay Vosburgh 
>Signed-off-by: Kuppuswamy Sathyanarayanan 
>
>---
> drivers/pci/pcie/err.c | 19 +++
> 1 file changed, 15 insertions(+), 4 deletions(-)
>
>diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>index 14bb8f54723e..db80e1ecb2dc 100644
>--- a/drivers/pci/pcie/err.c
>+++ b/drivers/pci/pcie/err.c
>@@ -165,13 +165,24 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   pci_dbg(dev, "broadcast error_detected message\n");
>   if (state == pci_channel_io_frozen) {
>   pci_walk_bus(bus, report_frozen_detected, );
>-  status = reset_link(dev);
>-  if (status != PCI_ERS_RESULT_RECOVERED) {
>+  status = PCI_ERS_RESULT_NEED_RESET;
>+  } else {
>+  pci_walk_bus(bus, report_normal_detected, );
>+  }
>+
>+  if (status == PCI_ERS_RESULT_NEED_RESET) {
>+  if (reset_link) {
>+  if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED)
>+  status = PCI_ERS_RESULT_DISCONNECT;
>+  } else {
>+  if (pci_bus_error_reset(dev))
>+  status = PCI_ERS_RESULT_DISCONNECT;
>+  }
>+
>+  if (status == PCI_ERS_RESULT_DISCONNECT) {
>   pci_warn(dev, "link reset failed\n");
>   goto failed;
>   }
>-  } else {
>-  pci_walk_bus(bus, report_normal_detected, );
>   }
> 
>   if (status == PCI_ERS_RESULT_CAN_RECOVER) {
>-- 
>2.17.1
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH v3] bonding: force enable lacp port after link state recovery for 802.3ad

2019-09-19 Thread Jay Vosburgh

zhangsha (A)  wrote:

>> -Original Message-
>> From: zhangsha (A)
>> Sent: 2019年9月18日 21:06
>> To: jay.vosbu...@canonical.com; vfal...@gmail.com; a...@greyhouse.net;
>> da...@davemloft.net; net...@vger.kernel.org; linux-kernel@vger.kernel.org;
>> yuehaibing ; hunongda ;
>> Chenzhendong (alex) ; zhangsha (A)
>> 
>> Subject: [PATCH v3] bonding: force enable lacp port after link state 
>> recovery for
>> 802.3ad
>> 
>> From: Sha Zhang 
>> 
>> After the commit 334031219a84 ("bonding/802.3ad: fix slave link 
>> initialization
>> transition states") merged, the slave's link status will be changed to
>> BOND_LINK_FAIL from BOND_LINK_DOWN in the following scenario:
>> - Driver reports loss of carrier and
>>   bonding driver receives NETDEV_DOWN notifier
>> - slave's duplex and speed is zerod and
>>   its port->is_enabled is cleard to 'false';
>> - Driver reports link recovery and
>>   bonding driver receives NETDEV_UP notifier;
>> - If speed/duplex getting failed here, the link status
>>   will be changed to BOND_LINK_FAIL;
>> - The MII monotor later recover the slave's speed/duplex
>>   and set link status to BOND_LINK_UP, but remains
>>   the 'port->is_enabled' to 'false'.
>> 
>> In this scenario, the lacp port will not be enabled even its speed and 
>> duplex are
>> valid. The bond will not send LACPDU's, and its state is 'AD_STATE_DEFAULTED'
>> forever. The simplest fix I think is to call bond_3ad_handle_link_change() in
>> bond_miimon_commit, this function can enable lacp after port slave speed
>> check.
>> As enabled, the lacp port can run its state machine normally after link 
>> recovery.
>> 
>> Signed-off-by: Sha Zhang 
>> ---
>>  drivers/net/bonding/bond_main.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/net/bonding/bond_main.c
>> b/drivers/net/bonding/bond_main.c index 931d9d9..76324a5 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -2206,7 +2206,8 @@ static void bond_miimon_commit(struct bonding
>> *bond)
>>   */
>>  if (BOND_MODE(bond) == BOND_MODE_8023AD &&
>>  slave->link == BOND_LINK_UP)
>> -
>>  bond_3ad_adapter_speed_duplex_changed(slave);
>> +bond_3ad_handle_link_change(slave,
>> +BOND_LINK_UP);
>>  continue;
>> 
>>  case BOND_LINK_UP:
>
>Hi, David,
>I have replied your email for a while,  I guess you may miss my email, so I 
>resend it.
>The following link address is the last email, please review the new one again, 
>thank you.
>https://patchwork.ozlabs.org/patch/1151915/
>
>Last time, you doubted this is a driver specific problem,
>I prefer to believe it's not because I find the commit 4d2c0cda,
>its log says " Some NIC drivers don't have correct speed/duplex 
>settings at the time they send NETDEV_UP notification ...".
>
>Anyway, I think the lacp status should be fixed correctly,
>since link-monitoring (miimon) set SPEED/DUPLEX right here.

I suspect this is going to be related to the concurrent
discussion with Aleksei, and would like to see the instrumentation
results from his tests before adding another change to attempt to
resolve this.

Also, what device are you using for your testing, and are you
able to run the instrumentation patch that I provided to Aleksei and
provide its results?

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding: force enable lacp port after link state recovery for 802.3ad

2019-08-20 Thread Jay Vosburgh

 wrote:

>From: Sha Zhang 
>
>After the commit 334031219a84 ("bonding/802.3ad: fix slave link
>initialization transition states") merged,
>the slave's link status will be changed to BOND_LINK_FAIL
>from BOND_LINK_DOWN in the following scenario:
>- Driver reports loss of carrier and
>  bonding driver receives NETDEV_CHANGE notifier
>- slave's duplex and speed is zerod and
>  its port->is_enabled is cleard to 'false';
>- Driver reports link recovery and
>  bonding driver receives NETDEV_UP notifier;
>- If speed/duplex getting failed here, the link status
>  will be changed to BOND_LINK_FAIL;
>- The MII monotor later recover the slave's speed/duplex
>  and set link status to BOND_LINK_UP, but remains
>  the 'port->is_enabled' to 'false'.
>
>In this scenario, the lacp port will not be enabled even its speed
>and duplex are valid. The bond will not send LACPDU's, and its
>state is 'AD_STATE_DEFAULTED' forever. The simplest fix I think
>is to force enable lacp after port slave speed check in
>bond_miimon_commit. As enabled, the lacp port can run its state machine
>normally after link recovery.
>
>Signed-off-by: Sha Zhang 
>---
> drivers/net/bonding/bond_main.c | 8 +++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 931d9d9..379253a 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -2194,6 +2194,7 @@ static void bond_miimon_commit(struct bonding *bond)
> {
>   struct list_head *iter;
>   struct slave *slave, *primary;
>+  struct port *port;
> 
>   bond_for_each_slave(bond, slave, iter) {
>   switch (slave->new_link) {
>@@ -2205,8 +2206,13 @@ static void bond_miimon_commit(struct bonding *bond)
>* link status
>*/
>   if (BOND_MODE(bond) == BOND_MODE_8023AD &&
>-  slave->link == BOND_LINK_UP)
>+  slave->link == BOND_LINK_UP) {
>   bond_3ad_adapter_speed_duplex_changed(slave);
>+  if (slave->duplex == DUPLEX_FULL) {
>+  port = &(SLAVE_AD_INFO(slave)->port);
>+  port->is_enabled = true;
>+  }
>+  }

I don't believe that testing duplex here is correct; is_enabled
is not controlled by duplex, but by carrier state.  Duplex does affect
whether or not a port is permitted to aggregate, but that's entirely
separate logic (the AD_PORT_LACP_ENABLED flag).

Would it be better to call bond_3ad_handle_link_change() here,
instead of manually testing duplex and setting is_enabled?

-J

>   continue;
> 
>   case BOND_LINK_UP:
>-- 
>1.8.3.1
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [bonding][patch] Regarding a bonding lacp issue

2019-08-08 Thread Jay Vosburgh

Felix  wrote:

>Dear Mainteners,
>
>Recently I hit a packet drop issue in bonding driver on Linux 4.9. Please
>see details below. Please take a look to see if my understanding is
>correct. Many thanks.
>
>What is the problem?
>The bonding driver starts to send packets even if the Partner(Switch)'s
>Collecting bit is not enabled yet. Partner would drop all packets until
>its Collecting bit is enabled.
>
>What is the root cuase?
>According to LACP spec, the Actor need to check Partner's Sync and
>Collecting bits before enable its Distributing bit and Distributing
>function. Please see the PIC below.

The diagram you reference is found in 802.1AX-2014 figure 6-21,
which shows the state diagram for an independent control implementation,
i.e., collecting and distributing are managed independently.

However, Linux bonding implements coupled control, which is
shown in figure 6-22.  Here, there is no Partner.Collecting requirement
on the state transition from ATTACHED to COLLECTING_DISTRIBUTING.

To quote 802.1AX-2014 6.4.15:

As independent control is not possible, the coupled control
state machine does not wait for the Partner to signal that
collection has started before enabling both collection and
distribution.

Now, that said, I agree that what you're seeing is likely
explained by this behavior, and your fix should resolve the immediate
problem (that bonding sends packets before the peer has enabled
COLLECTING).

However, your fix does put bonding out of compliance with the
standard, as it does not really implement COLLECTING and DISTRIBUTING as
discrete states.  In particular, if the peer in your case were to later
clear Partner.Collecting, bonding will not react to this as a figure
6-21 independent control implementation would (which isn't a change from
current behavior, but currently this isn't expected).

So, in my opinion a patch like this should have a comment
attached noting that we are deliberately not in compliance with the
standard in this specific situation.  The proper fix is to implement
figure 6-21 separate state.

Lastly, are you able to test and generate a patch against
current upstream, instead of 4.9?

-J

>How to fix?
>Please see the diff as following. And the patch is attached.
>
>--- ../origin/linux-4.9.188/drivers/net/bonding/bond_3ad.c 2019-08-07
>00:29:42.0 +0800
>+++ drivers/net/bonding/bond_3ad.c 2019-08-08 23:13:29.015640197 +0800
>@@ -937,6 +937,7 @@
> */
>if ((port->sm_vars & AD_PORT_SELECTED) &&
>(port->partner_oper.port_state & AD_STATE_SYNCHRONIZATION) &&
>+   (port->partner_oper.port_state & AD_STATE_COLLECTING) &&
>!__check_agg_selection_timer(port)) {
> if (port->aggregator->is_active)
>  port->sm_mux_state =
>
>--
>Thanks,
>Felix

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding: Add vlan tx offload to hw_enc_features

2019-08-06 Thread Jay Vosburgh

YueHaibing  wrote:

>As commit 30d8177e8ac7 ("bonding: Always enable vlan tx offload")
>said, we should always enable bonding's vlan tx offload, pass the
>vlan packets to the slave devices with vlan tci, let them to handle
>vlan implementation.
>
>Now if encapsulation protocols like VXLAN is used, skb->encapsulation
>may be set, then the packet is passed to vlan devicec which based on

Typo: "devicec"

>bonding device. However in netif_skb_features(), the check of
>hw_enc_features:
>
>if (skb->encapsulation)
> features &= dev->hw_enc_features;
>
>clears NETIF_F_HW_VLAN_CTAG_TX/NETIF_F_HW_VLAN_STAG_TX. This results
>in same issue in commit 30d8177e8ac7 like this:
>
>vlan_dev_hard_start_xmit
>  -->dev_queue_xmit
>-->validate_xmit_skb
>  -->netif_skb_features //NETIF_F_HW_VLAN_CTAG_TX is cleared
>  -->validate_xmit_vlan
>-->__vlan_hwaccel_push_inside //skb->tci is cleared
>...
> --> bond_start_xmit
>   --> bond_xmit_hash //BOND_XMIT_POLICY_ENCAP34
> --> __skb_flow_dissect // nhoff point to IP header
>-->  case htons(ETH_P_8021Q)
> // skb_vlan_tag_present is false, so
> vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan),
> //vlan point to ip header wrongly
>
>Signed-off-by: YueHaibing 

Looks good to me; should this be tagged with

Fixes: 278339a42a1b ("bonding: propogate vlan_features to bonding master")

as 30d8177e8ac7 was?  If not, is there an appropriate commit id?

Acked-by: Jay Vosburgh 

-J

>---
> drivers/net/bonding/bond_main.c | 2 ++
> 1 file changed, 2 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 02fd782..931d9d9 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1126,6 +1126,8 @@ static void bond_compute_features(struct bonding *bond)
> done:
>   bond_dev->vlan_features = vlan_features;
>   bond_dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL |
>+  NETIF_F_HW_VLAN_CTAG_TX |
>+      NETIF_F_HW_VLAN_STAG_TX |
>   NETIF_F_GSO_UDP_L4;
>   bond_dev->mpls_features = mpls_features;
>   bond_dev->gso_max_segs = gso_max_segs;
>-- 
>2.7.4

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net] bonding/802.3ad: fix slave link initialization transition states

2019-05-24 Thread Jay Vosburgh

Jarod Wilson  wrote:

>Once in a while, with just the right timing, 802.3ad slaves will fail to
>properly initialize, winding up in a weird state, with a partner system
>mac address of 00:00:00:00:00:00. This started happening after a fix to
>properly track link_failure_count tracking, where an 802.3ad slave that
>reported itself as link up in the miimon code, but wasn't able to get a
>valid speed/duplex, started getting set to BOND_LINK_FAIL instead of
>BOND_LINK_DOWN. That was the proper thing to do for the general "my link
>went down" case, but has created a link initialization race that can put
>the interface in this odd state.

Reading back in the git history, the ultimate cause of this
"weird state" appears to be devices that assert NETDEV_UP prior to
actually being able to supply sane speed/duplex values, correct?

Presuming that this is the case, I don't see that there's much
else to be done here, and so:

Acked-by: Jay Vosburgh 

>The simple fix is to instead set the slave link to BOND_LINK_DOWN again,
>if the link has never been up (last_link_up == 0), so the link state
>doesn't bounce from BOND_LINK_DOWN to BOND_LINK_FAIL -- it hasn't failed
>in this case, it simply hasn't been up yet, and this prevents the
>unnecessary state change from DOWN to FAIL and getting stuck in an init
>failure w/o a partner mac.
>
>Fixes: ea53abfab960 ("bonding/802.3ad: fix link_failure_count tracking")
>CC: Jay Vosburgh 
>CC: Veaceslav Falico 
>CC: Andy Gospodarek 
>CC: "David S. Miller" 
>CC: net...@vger.kernel.org
>Tested-by: Heesoon Kim 
>Signed-off-by: Jarod Wilson 



>---
> drivers/net/bonding/bond_main.c | 15 ++-
> 1 file changed, 10 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 062fa7e3af4c..407f4095a37a 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -3122,13 +3122,18 @@ static int bond_slave_netdev_event(unsigned long event,
>   case NETDEV_CHANGE:
>   /* For 802.3ad mode only:
>* Getting invalid Speed/Duplex values here will put slave
>-   * in weird state. So mark it as link-fail for the time
>-   * being and let link-monitoring (miimon) set it right when
>-   * correct speeds/duplex are available.
>+   * in weird state. Mark it as link-fail if the link was
>+   * previously up or link-down if it hasn't yet come up, and
>+   * let link-monitoring (miimon) set it right when correct
>+   * speeds/duplex are available.
>*/
>   if (bond_update_speed_duplex(slave) &&
>-  BOND_MODE(bond) == BOND_MODE_8023AD)
>-  slave->link = BOND_LINK_FAIL;
>+  BOND_MODE(bond) == BOND_MODE_8023AD) {
>+  if (slave->last_link_up)
>+  slave->link = BOND_LINK_FAIL;
>+  else
>+  slave->link = BOND_LINK_DOWN;
>+  }
> 
>   if (BOND_MODE(bond) == BOND_MODE_8023AD)
>   bond_3ad_adapter_speed_duplex_changed(slave);
>-- 
>2.20.1
>

Re: [PATCH] bonding: show full hw address in sysfs for slave entries

2019-03-29 Thread Jay Vosburgh

Konstantin Khorenko  wrote:

>Bond expects ethernet hwaddr for its slave, but it can be longer than 6
>bytes - infiniband interface for example.
>
> # cat /sys/devices//net/ib0/address
> 80:00:02:08:fe:80:00:00:00:00:00:00:7c:fe:90:03:00:be:5d:e1
>
> # cat /sys/devices//net/ib0/bonding_slave/perm_hwaddr
> 80:00:02:08:fe:80
>
>So print full hwaddr in sysfs "bonding_slave/perm_hwaddr" as well.
>
>Signed-off-by: Konstantin Khorenko 
>---
> drivers/net/bonding/bond_sysfs_slave.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_sysfs_slave.c 
>b/drivers/net/bonding/bond_sysfs_slave.c
>index 2f120b2ffef0..4985268e2273 100644
>--- a/drivers/net/bonding/bond_sysfs_slave.c
>+++ b/drivers/net/bonding/bond_sysfs_slave.c
>@@ -55,7 +55,9 @@ static SLAVE_ATTR_RO(link_failure_count);
> 
> static ssize_t perm_hwaddr_show(struct slave *slave, char *buf)
> {
>-  return sprintf(buf, "%pM\n", slave->perm_hwaddr);
>+  return sprintf(buf, "%*phC\n",
>+ slave->dev->addr_len,
>+     slave->perm_hwaddr);
> }
> static SLAVE_ATTR_RO(perm_hwaddr);

Acked-by: Jay Vosburgh 

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH 1/2] net: bonding: fix restricted __be16 degrades to integer

2019-03-08 Thread Jay Vosburgh

Bo YU  wrote:

>There are some warning when:
>
>sudo make C=1 CF=-D__CHECK_ENDIAN__ drivers/net/bonding/
>
>drivers/net/bonding/bond_main.c:2385:26: warning: restricted __be16 degrades 
>to integer
>drivers/net/bonding/bond_main.c:2391:20: warning: restricted __be16 degrades 
>to integer
>...
>drivers/net/bonding/bond_main.c:3241:60: warning: restricted __be16 degrades 
>to integer
>drivers/net/bonding/bond_main.c:3241:60: warning: restricted __be16 degrades 
>to integer
>
>So fix it.
>
>Signed-off-by: Bo YU 
>---
> drivers/net/bonding/bond_main.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index b59708c35faf..135fec28daa9 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -2382,13 +2382,13 @@ static void bond_arp_send(struct net_device 
>*slave_dev, int arp_op,
>   return;
>   }
> 
>-  if (!tags || tags->vlan_proto == VLAN_N_VID)
>+  if (!tags || be16_to_cpu(tags->vlan_proto) == VLAN_N_VID)
>   goto xmit;
> 
>   tags++;
> 
>   /* Go through all the tags backwards and add them to the packet */
>-  while (tags->vlan_proto != VLAN_N_VID) {
>+  while (be16_to_cpu(tags->vlan_proto) != VLAN_N_VID) {

I believe both of the above are incorrect, as vlan_proto is set
explicitly to VLAN_N_VID (in host byte order) by bond_verify_device_path
as a sentinel value.  Byte swapping the tags->vlan_proto value would
cause the test or loop to miss the sentinel.


>   if (!tags->vlan_id) {
>   tags++;
>   continue;
>@@ -3238,7 +3238,7 @@ static inline u32 bond_eth_hash(struct sk_buff *skb)
> 
>   ep = skb_header_pointer(skb, 0, sizeof(hdr_tmp), _tmp);
>   if (ep)
>-  return ep->h_dest[5] ^ ep->h_source[5] ^ ep->h_proto;
>+  return ep->h_dest[5] ^ ep->h_source[5] ^ 
>be16_to_cpu(ep->h_proto);

This is probably harmless, other than adding work to the
transmit path.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH 2/2] net: bonding: fix incorrect type in assignment

2019-03-07 Thread Jay Vosburgh

Bo YU  wrote:

>There are some warning when:
>
>sudo make C=1 CF=-D__CHECK_ENDIAN__ drivers/net/bonding/
>
>drivers/net/bonding/bond_main.c:2438:40: warning: incorrect type in assignment 
>(different base types)
>drivers/net/bonding/bond_main.c:2438:40:expected restricted __be16 
>[usertype] vlan_proto
>drivers/net/bonding/bond_main.c:2438:40:
>...
>rivers/net/bonding/bond_options.c:1089:24: warning: incorrect type in 
>assignment (different base types)
>drivers/net/bonding/bond_options.c:1089:24:expected restricted __be32 
>[addressable] [usertype] target
>drivers/net/bonding/bond_options.c:1089:24:got unsigned long long const 
>[usertype] value
>
>So fix it
>
>Signed-off-by: Bo YU 
>---
> drivers/net/bonding/bond_main.c| 2 +-
> drivers/net/bonding/bond_options.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 135fec28daa9..07e52d863e91 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -2435,7 +2435,7 @@ struct bond_vlan_tag *bond_verify_device_path(struct 
>net_device *start_dev,
>   tags = kcalloc(level + 1, sizeof(*tags), GFP_ATOMIC);
>   if (!tags)
>   return ERR_PTR(-ENOMEM);
>-  tags[level].vlan_proto = VLAN_N_VID;
>+  tags[level].vlan_proto = cpu_to_be16(VLAN_N_VID);
>   return tags;
>   }
> 
>diff --git a/drivers/net/bonding/bond_options.c 
>b/drivers/net/bonding/bond_options.c
>index da1fc17295d9..3a196999bd1b 100644
>--- a/drivers/net/bonding/bond_options.c
>+++ b/drivers/net/bonding/bond_options.c
>@@ -1086,7 +1086,7 @@ static int bond_option_arp_ip_targets_set(struct bonding 
>*bond,
>   else
>   netdev_err(bond->dev, "no command found in 
> arp_ip_targets file - use + or -\n");
>   } else {
>-  target = newval->value;
>+  target = cpu_to_be32(newval->value);
>   ret = bond_option_arp_ip_target_add(bond, target);

I'm not sure this is correct; if I'm reading the call path
correctly, bond_changelink will

if (data[IFLA_BOND_ARP_IP_TARGET]) {
[...]
__be32 target;
[...]
target = nla_get_be32(attr);

bond_opt_initval(, (__force u64)target);
err = __bond_opt_set(bond, BOND_OPT_ARP_TARGETS,
 );

thus, newval.value is initially be32, but stored in a u64.
__bond_opt_set will call bond_opt_parse, which in turn will call
bond_option_arp_ip_targets_set (via .set), and the change above would
swap the newval.value back to host order (on little endian architectures
for which cpu_to_be32 is not a no-op).

Am I misunderstanding?  Did you test this change on an x86 or
other little endian system?

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c

2017-06-21 Thread Jay Vosburgh

Michael J Dilmore <michael.j.dilm...@gmail.com> wrote:

>On 21/06/17 22:56, David Miller wrote:
>
>> From: Michael D <michael.j.dilm...@gmail.com>
>> Date: Wed, 21 Jun 2017 22:41:07 +0100
>>
>>> I don't think you can stop it being dereferenced... you just need to
>>> prevent an attacker from exploiting the null pointer dereference
>>> vulnerability right? And this is done by returning the function right
>>> away?
>> What's all of this about an "attacker"?
>>
>> If there is a bug, we dererence a NULL pointer, and we should
>> fix that bug.
>>
>> The BUG_ON() helps us see where the problem is while at the
>> same time stopping the kernel before the NULL deref happens.
>Ok this is starting to make sense now - went a bit off track but think my
>general thinking is ok - i.e. if we return the function with an error code
>before the dereference then this basically does the same thing as BUG_ON
>but without crashing the kernel.
>
>Something like:
>
>if (WARN_ON(!new_active_slave) {
>netdev_dbg("Can't add new active slave - pointer null");
>return ERROR_CODE
>}

In general, yes, but in this case, the condition should be
impossible to hit, so BUG_ON seems appropriate.

If bond_slave_get_rtnl/rcu() returns NULL for an actual bonding
slave, other code paths (bond_fill_slave_info, bond_handle_frame) will
likely crash before getting to this one.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c

2017-06-21 Thread Jay Vosburgh

Michael J Dilmore  wrote:

>On 21/06/17 22:56, David Miller wrote:
>
>> From: Michael D 
>> Date: Wed, 21 Jun 2017 22:41:07 +0100
>>
>>> I don't think you can stop it being dereferenced... you just need to
>>> prevent an attacker from exploiting the null pointer dereference
>>> vulnerability right? And this is done by returning the function right
>>> away?
>> What's all of this about an "attacker"?
>>
>> If there is a bug, we dererence a NULL pointer, and we should
>> fix that bug.
>>
>> The BUG_ON() helps us see where the problem is while at the
>> same time stopping the kernel before the NULL deref happens.
>Ok this is starting to make sense now - went a bit off track but think my
>general thinking is ok - i.e. if we return the function with an error code
>before the dereference then this basically does the same thing as BUG_ON
>but without crashing the kernel.
>
>Something like:
>
>if (WARN_ON(!new_active_slave) {
>netdev_dbg("Can't add new active slave - pointer null");
>return ERROR_CODE
>}

In general, yes, but in this case, the condition should be
impossible to hit, so BUG_ON seems appropriate.

If bond_slave_get_rtnl/rcu() returns NULL for an actual bonding
slave, other code paths (bond_fill_slave_info, bond_handle_frame) will
likely crash before getting to this one.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c

2017-06-21 Thread Jay Vosburgh

David Miller <da...@davemloft.net> wrote:

>From: Michael D <michael.j.dilm...@gmail.com>
>Date: Wed, 21 Jun 2017 22:41:07 +0100
>
>> I don't think you can stop it being dereferenced... you just need to
>> prevent an attacker from exploiting the null pointer dereference
>> vulnerability right? And this is done by returning the function right
>> away?
>
>What's all of this about an "attacker"?
>
>If there is a bug, we dererence a NULL pointer, and we should
>fix that bug.
>
>The BUG_ON() helps us see where the problem is while at the
>same time stopping the kernel before the NULL deref happens.

Looking at the code more carefully than I did earlier, the only
way the BUG_ON will hit is if the rx_handler_data is NULL for a bonding
slave when this code executes.

This should be impossible, as there doesn't appear to be any way
to get into bond_option_active_slave_set for a slave prior to
bond_enslave registering the rx_handler for that slave, as these
operations are mutexed by RTNL.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c

2017-06-21 Thread Jay Vosburgh

David Miller  wrote:

>From: Michael D 
>Date: Wed, 21 Jun 2017 22:41:07 +0100
>
>> I don't think you can stop it being dereferenced... you just need to
>> prevent an attacker from exploiting the null pointer dereference
>> vulnerability right? And this is done by returning the function right
>> away?
>
>What's all of this about an "attacker"?
>
>If there is a bug, we dererence a NULL pointer, and we should
>fix that bug.
>
>The BUG_ON() helps us see where the problem is while at the
>same time stopping the kernel before the NULL deref happens.

Looking at the code more carefully than I did earlier, the only
way the BUG_ON will hit is if the rx_handler_data is NULL for a bonding
slave when this code executes.

This should be impossible, as there doesn't appear to be any way
to get into bond_option_active_slave_set for a slave prior to
bond_enslave registering the rx_handler for that slave, as these
operations are mutexed by RTNL.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c

2017-06-21 Thread Jay Vosburgh

Michael J Dilmore <michael.j.dilm...@gmail.com> wrote:

>The function below contains a BUG_ON where no active slave is detected. The 
>patch
>converts this to a WARN_ON to avoid crashing the kernel.
>
>Signed-off-by: Michael J Dilmore <michael.j.dilm...@gmail.com>
>---
> drivers/net/bonding/bond_options.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_options.c 
>b/drivers/net/bonding/bond_options.c
>index 1bcbb89..c4b4791 100644
>--- a/drivers/net/bonding/bond_options.c
>+++ b/drivers/net/bonding/bond_options.c
>@@ -778,7 +778,7 @@ static int bond_option_active_slave_set(struct bonding 
>*bond,
>   struct slave *old_active = 
> rtnl_dereference(bond->curr_active_slave);
>   struct slave *new_active = bond_slave_get_rtnl(slave_dev);
> 
>-  BUG_ON(!new_active);
>+  WARN_ON(!new_active);

This is a reasonable idea in principle, but will require
additional changes to prevent dereferencing new_active if it is NULL
(which would happen just below this point in the code).

-J

>   if (new_active == old_active) {
>   /* do nothing */
>-- 
>2.7.4
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c

2017-06-21 Thread Jay Vosburgh

Michael J Dilmore  wrote:

>The function below contains a BUG_ON where no active slave is detected. The 
>patch
>converts this to a WARN_ON to avoid crashing the kernel.
>
>Signed-off-by: Michael J Dilmore 
>---
> drivers/net/bonding/bond_options.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_options.c 
>b/drivers/net/bonding/bond_options.c
>index 1bcbb89..c4b4791 100644
>--- a/drivers/net/bonding/bond_options.c
>+++ b/drivers/net/bonding/bond_options.c
>@@ -778,7 +778,7 @@ static int bond_option_active_slave_set(struct bonding 
>*bond,
>   struct slave *old_active = 
> rtnl_dereference(bond->curr_active_slave);
>   struct slave *new_active = bond_slave_get_rtnl(slave_dev);
> 
>-  BUG_ON(!new_active);
>+  WARN_ON(!new_active);

This is a reasonable idea in principle, but will require
additional changes to prevent dereferencing new_active if it is NULL
(which would happen just below this point in the code).

-J

>   if (new_active == old_active) {
>   /* do nothing */
>-- 
>2.7.4
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] Convert multiple netdev_info messages to netdev_dbg

2017-06-15 Thread Jay Vosburgh

Joe Perches <j...@perches.com> wrote:

>On Thu, 2017-06-15 at 19:14 +0100, Michael J Dilmore wrote:
>> Multiple netdev_info messages clutter kernel output. Also add netdev_dbg for 
>> packets per slave.
>[]
>> diff --git a/drivers/net/bonding/bond_options.c 
>> b/drivers/net/bonding/bond_options.c
>[]
>> @@ -9,6 +9,8 @@
>>   * (at your option) any later version.
>>   */
>>   
>> +#define DEBUG 1
>
>Is defining DEBUG really worthwhile.

I don't believe so, since if CONFIG_DYNAMIC_DEBUG is not
enabled, having #define DEBUG will enable all of the netdev_dbg messages
unconditionally, which is the opposite of the stated purpose of the
patch.  If DYNAMIC_DEBUG is enabled, having DEBUG doesn't do anything
that I can see.

-J

>As well, it's almost always just
>#define DEBUG
>without any level value unless the
>level value is used in the code.
>
>> +
>>  #include 
>>  #include 
>>  #include 
>> @@ -719,13 +721,13 @@ static int bond_option_mode_set(struct bonding *bond,
>>  const struct bond_opt_value *newval)
>>  {
>>  if (!bond_mode_uses_arp(newval->value) && bond->params.arp_interval) {
>> -netdev_info(bond->dev, "%s mode is incompatible with arp 
>> monitoring, start mii monitoring\n",
>> +netdev_dbg(bond->dev, "%s mode is incompatible with arp 
>> monitoring, start mii monitoring\n",
>>  newval->string);
>
>Please realign any multiple line arguments to the
>open parenthesis at the same time.
>
>>  /* disable arp monitoring */
>>  bond->params.arp_interval = 0;
>>  /* set miimon to default value */
>>  bond->params.miimon = BOND_DEFAULT_MIIMON;
>> -netdev_info(bond->dev, "Setting MII monitoring interval to 
>> %d\n",
>> +netdev_dbg(bond->dev, "Setting MII monitoring interval to %d\n",
>>  bond->params.miimon);
>
>etc...
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] Convert multiple netdev_info messages to netdev_dbg

2017-06-15 Thread Jay Vosburgh

Joe Perches  wrote:

>On Thu, 2017-06-15 at 19:14 +0100, Michael J Dilmore wrote:
>> Multiple netdev_info messages clutter kernel output. Also add netdev_dbg for 
>> packets per slave.
>[]
>> diff --git a/drivers/net/bonding/bond_options.c 
>> b/drivers/net/bonding/bond_options.c
>[]
>> @@ -9,6 +9,8 @@
>>   * (at your option) any later version.
>>   */
>>   
>> +#define DEBUG 1
>
>Is defining DEBUG really worthwhile.

I don't believe so, since if CONFIG_DYNAMIC_DEBUG is not
enabled, having #define DEBUG will enable all of the netdev_dbg messages
unconditionally, which is the opposite of the stated purpose of the
patch.  If DYNAMIC_DEBUG is enabled, having DEBUG doesn't do anything
that I can see.

-J

>As well, it's almost always just
>#define DEBUG
>without any level value unless the
>level value is used in the code.
>
>> +
>>  #include 
>>  #include 
>>  #include 
>> @@ -719,13 +721,13 @@ static int bond_option_mode_set(struct bonding *bond,
>>  const struct bond_opt_value *newval)
>>  {
>>  if (!bond_mode_uses_arp(newval->value) && bond->params.arp_interval) {
>> -netdev_info(bond->dev, "%s mode is incompatible with arp 
>> monitoring, start mii monitoring\n",
>> +netdev_dbg(bond->dev, "%s mode is incompatible with arp 
>> monitoring, start mii monitoring\n",
>>  newval->string);
>
>Please realign any multiple line arguments to the
>open parenthesis at the same time.
>
>>  /* disable arp monitoring */
>>  bond->params.arp_interval = 0;
>>  /* set miimon to default value */
>>  bond->params.miimon = BOND_DEFAULT_MIIMON;
>> -netdev_info(bond->dev, "Setting MII monitoring interval to 
>> %d\n",
>> +netdev_dbg(bond->dev, "Setting MII monitoring interval to %d\n",
>>  bond->params.miimon);
>
>etc...
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] net: bonding: use new api ethtool_{get|set}_link_ksettings

2016-10-26 Thread Jay Vosburgh

Philippe Reynes <trem...@gmail.com> wrote:

>The ethtool api {get|set}_settings is deprecated.
>We move this driver to new api {get|set}_link_ksettings.

This is just an API change, i.e., no change to functionality?

-J

>Signed-off-by: Philippe Reynes <trem...@gmail.com>
>---
> drivers/net/bonding/bond_main.c |   16 
> 1 files changed, 8 insertions(+), 8 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index c9944d8..5708f17 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -4080,16 +4080,16 @@ static netdev_tx_t bond_start_xmit(struct sk_buff 
>*skb, struct net_device *dev)
>   return ret;
> }
> 
>-static int bond_ethtool_get_settings(struct net_device *bond_dev,
>-   struct ethtool_cmd *ecmd)
>+static int bond_ethtool_get_link_ksettings(struct net_device *bond_dev,
>+ struct ethtool_link_ksettings *cmd)
> {
>   struct bonding *bond = netdev_priv(bond_dev);
>   unsigned long speed = 0;
>   struct list_head *iter;
>   struct slave *slave;
> 
>-  ecmd->duplex = DUPLEX_UNKNOWN;
>-  ecmd->port = PORT_OTHER;
>+  cmd->base.duplex = DUPLEX_UNKNOWN;
>+  cmd->base.port = PORT_OTHER;
> 
>   /* Since bond_slave_can_tx returns false for all inactive or down 
> slaves, we
>* do not need to check mode.  Though link speed might not represent
>@@ -4100,12 +4100,12 @@ static int bond_ethtool_get_settings(struct net_device 
>*bond_dev,
>   if (bond_slave_can_tx(slave)) {
>   if (slave->speed != SPEED_UNKNOWN)
>   speed += slave->speed;
>-  if (ecmd->duplex == DUPLEX_UNKNOWN &&
>+  if (cmd->base.duplex == DUPLEX_UNKNOWN &&
>   slave->duplex != DUPLEX_UNKNOWN)
>-  ecmd->duplex = slave->duplex;
>+  cmd->base.duplex = slave->duplex;
>   }
>   }
>-  ethtool_cmd_speed_set(ecmd, speed ? : SPEED_UNKNOWN);
>+  cmd->base.speed = speed ? : SPEED_UNKNOWN;
> 
>   return 0;
> }
>@@ -4121,8 +4121,8 @@ static void bond_ethtool_get_drvinfo(struct net_device 
>*bond_dev,
> 
> static const struct ethtool_ops bond_ethtool_ops = {
>   .get_drvinfo= bond_ethtool_get_drvinfo,
>-  .get_settings   = bond_ethtool_get_settings,
>   .get_link   = ethtool_op_get_link,
>+  .get_link_ksettings = bond_ethtool_get_link_ksettings,
> };
> 
> static const struct net_device_ops bond_netdev_ops = {
>-- 
>1.7.4.4

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] net: bonding: use new api ethtool_{get|set}_link_ksettings

2016-10-26 Thread Jay Vosburgh

Philippe Reynes  wrote:

>The ethtool api {get|set}_settings is deprecated.
>We move this driver to new api {get|set}_link_ksettings.

This is just an API change, i.e., no change to functionality?

-J

>Signed-off-by: Philippe Reynes 
>---
> drivers/net/bonding/bond_main.c |   16 
> 1 files changed, 8 insertions(+), 8 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index c9944d8..5708f17 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -4080,16 +4080,16 @@ static netdev_tx_t bond_start_xmit(struct sk_buff 
>*skb, struct net_device *dev)
>   return ret;
> }
> 
>-static int bond_ethtool_get_settings(struct net_device *bond_dev,
>-   struct ethtool_cmd *ecmd)
>+static int bond_ethtool_get_link_ksettings(struct net_device *bond_dev,
>+ struct ethtool_link_ksettings *cmd)
> {
>   struct bonding *bond = netdev_priv(bond_dev);
>   unsigned long speed = 0;
>   struct list_head *iter;
>   struct slave *slave;
> 
>-  ecmd->duplex = DUPLEX_UNKNOWN;
>-  ecmd->port = PORT_OTHER;
>+  cmd->base.duplex = DUPLEX_UNKNOWN;
>+  cmd->base.port = PORT_OTHER;
> 
>   /* Since bond_slave_can_tx returns false for all inactive or down 
> slaves, we
>* do not need to check mode.  Though link speed might not represent
>@@ -4100,12 +4100,12 @@ static int bond_ethtool_get_settings(struct net_device 
>*bond_dev,
>   if (bond_slave_can_tx(slave)) {
>   if (slave->speed != SPEED_UNKNOWN)
>   speed += slave->speed;
>-  if (ecmd->duplex == DUPLEX_UNKNOWN &&
>+  if (cmd->base.duplex == DUPLEX_UNKNOWN &&
>   slave->duplex != DUPLEX_UNKNOWN)
>-  ecmd->duplex = slave->duplex;
>+  cmd->base.duplex = slave->duplex;
>   }
>   }
>-  ethtool_cmd_speed_set(ecmd, speed ? : SPEED_UNKNOWN);
>+  cmd->base.speed = speed ? : SPEED_UNKNOWN;
> 
>   return 0;
> }
>@@ -4121,8 +4121,8 @@ static void bond_ethtool_get_drvinfo(struct net_device 
>*bond_dev,
> 
> static const struct ethtool_ops bond_ethtool_ops = {
>   .get_drvinfo= bond_ethtool_get_drvinfo,
>-  .get_settings   = bond_ethtool_get_settings,
>   .get_link   = ethtool_op_get_link,
>+  .get_link_ksettings = bond_ethtool_get_link_ksettings,
> };
> 
> static const struct net_device_ops bond_netdev_ops = {
>-- 
>1.7.4.4

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding: Prevent deletion of a bond, or the last slave from a bond, with active usage.

2016-09-06 Thread Jay Vosburgh

= command + 2;
>+  else
>+  ifname = command + 1;
>+
>   if ((strlen(command) <= 1) ||
>   !dev_valid_name(ifname))
>   goto err_no_cmd;
>@@ -130,6 +135,32 @@ static ssize_t bonding_store_bonds(struct class *cls,
>   res = -ENODEV;
>   }
>   rtnl_unlock();
>+  } else if ((command[0] == '?') && (command[1] == '-')) {
>+  struct net_device *bond_dev;
>+
>+  rtnl_lock();
>+  bond_dev = bond_get_by_name(bn, ifname);
>+
>+  if (bond_dev) {
>+  struct in_device *in_dev;
>+  struct bonding *bond = netdev_priv(bond_dev);
>+
>+  in_dev = __in_dev_get_rtnl(bond_dev);
>+
>+  if (((in_dev->ifa_list) != NULL) &&
>+  (bond->slave_cnt > 0)) {
>+  pr_err("%s is in use. Unconfigure IP %pI4 
>before deletion.\n",
>+ ifname, _dev->ifa_list->ifa_local);
>+  rtnl_unlock();
>+  return -EBUSY;
>+  }
>+  pr_info("%s is being deleted...\n", ifname);
>+      unregister_netdevice(bond_dev);
>+  } else {
>+  pr_err("unable to delete non-existent %s\n", ifname);
>+  res = -ENODEV;
>+  }
>+  rtnl_unlock();
>   } else
>   goto err_no_cmd;
> 
>@@ -139,7 +170,7 @@ static ssize_t bonding_store_bonds(struct class *cls,
>   return res;
> 
> err_no_cmd:
>-  pr_err("no command found in bonding_masters - use +ifname or 
>-ifname\n");
>+  pr_err("no command found in bonding_masters - use +ifname or -ifname or 
>?-ifname\n");
>   return -EPERM;
> }
> 
>-- 
>1.8.3.1
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding: Prevent deletion of a bond, or the last slave from a bond, with active usage.

2016-09-06 Thread Jay Vosburgh

+
>   if ((strlen(command) <= 1) ||
>   !dev_valid_name(ifname))
>   goto err_no_cmd;
>@@ -130,6 +135,32 @@ static ssize_t bonding_store_bonds(struct class *cls,
>   res = -ENODEV;
>   }
>   rtnl_unlock();
>+  } else if ((command[0] == '?') && (command[1] == '-')) {
>+  struct net_device *bond_dev;
>+
>+  rtnl_lock();
>+  bond_dev = bond_get_by_name(bn, ifname);
>+
>+  if (bond_dev) {
>+  struct in_device *in_dev;
>+  struct bonding *bond = netdev_priv(bond_dev);
>+
>+  in_dev = __in_dev_get_rtnl(bond_dev);
>+
>+  if (((in_dev->ifa_list) != NULL) &&
>+  (bond->slave_cnt > 0)) {
>+  pr_err("%s is in use. Unconfigure IP %pI4 
>before deletion.\n",
>+ ifname, _dev->ifa_list->ifa_local);
>+  rtnl_unlock();
>+  return -EBUSY;
>+  }
>+  pr_info("%s is being deleted...\n", ifname);
>+  unregister_netdevice(bond_dev);
>+  } else {
>+  pr_err("unable to delete non-existent %s\n", ifname);
>+  res = -ENODEV;
>+  }
>+  rtnl_unlock();
>   } else
>   goto err_no_cmd;
> 
>@@ -139,7 +170,7 @@ static ssize_t bonding_store_bonds(struct class *cls,
>   return res;
> 
> err_no_cmd:
>-  pr_err("no command found in bonding_masters - use +ifname or 
>-ifname\n");
>+  pr_err("no command found in bonding_masters - use +ifname or -ifname or 
>?-ifname\n");
>   return -EPERM;
> }
> 
>-- 
>1.8.3.1
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding:Fix perfomance drop if one bonding device in configuration is repeatedly restarted

2016-07-14 Thread Jay Vosburgh

Nicholas Krause <xerofo...@gmail.com> wrote:

>This fixes a issue with the following test caseL
>
>1. Created a alb bonding(bond0) with three of 1Gb interfaces
>(eth0, eth1, eth2) under ipv4, set the bond0 ip address as
>192.168.10.101, run command "iperf -s" as the server(x86_64).
>
>2. Run command "iperf -c 192.168.10.101 -t 1 -i 1"
>on three linux client with CentOS6.5 x86_64
>
>3. Run the following commands on server repeatly,
>"ifconfig eth1 down; sleep 20; ifconfig eth1 up".
>monitor the bonding performance with program "
>nmon_x86_64_centos6", after some time, the
>performance droped from 3000 to 2000 Mbps,
>and could not go up to 3000Mbps any more,
>it seems that the bonding doesn't rebalance
>again.
>This is due to use incorrectly checking in the function,
>rlb_choose_channel if the ethernet address is not equal
>to 64 bits before updating our mac address from arp.
>However this is incorrect as the ethernet header length
>needs to be 64bits in order to update our mac address
>from arp without the above test case having a perfomance
>regression. Fix it by checking for the ethernet header
>having 64 bits and not.

NAK.

I don't understand what this is trying to say, but the change
below does not look valid.

>Signed-off-by: Nicholas Krause <xerofo...@gmail.com>
>---
> drivers/net/bonding/bond_alb.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index 551f0f8..d5ae8a5 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -593,7 +593,7 @@ static struct slave *rlb_choose_channel(struct sk_buff 
>*skb, struct bonding *bon
>   if ((client_info->ip_src == arp->ip_src) &&
>   (client_info->ip_dst == arp->ip_dst)) {
>   /* the entry is already assigned to this client */
>-  if (!ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) {
>+  if (ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) {
>   /* update mac address from arp */
>   ether_addr_copy(client_info->mac_dst, 
> arp->mac_dst);

Regardless of the description above, this test is insuring that
the code doesn't use the broadcast MAC address for the client.  Changing
it seems like a bad thing to do, as it would cause traffic for the
client to be sent to the ethernet broadcast address.

This might, in fact, make the described test run better, but it
has nothing to do with rebalancing and would negatively impact other
hosts on the network.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding:Fix perfomance drop if one bonding device in configuration is repeatedly restarted

2016-07-14 Thread Jay Vosburgh

Nicholas Krause  wrote:

>This fixes a issue with the following test caseL
>
>1. Created a alb bonding(bond0) with three of 1Gb interfaces
>(eth0, eth1, eth2) under ipv4, set the bond0 ip address as
>192.168.10.101, run command "iperf -s" as the server(x86_64).
>
>2. Run command "iperf -c 192.168.10.101 -t 1 -i 1"
>on three linux client with CentOS6.5 x86_64
>
>3. Run the following commands on server repeatly,
>"ifconfig eth1 down; sleep 20; ifconfig eth1 up".
>monitor the bonding performance with program "
>nmon_x86_64_centos6", after some time, the
>performance droped from 3000 to 2000 Mbps,
>and could not go up to 3000Mbps any more,
>it seems that the bonding doesn't rebalance
>again.
>This is due to use incorrectly checking in the function,
>rlb_choose_channel if the ethernet address is not equal
>to 64 bits before updating our mac address from arp.
>However this is incorrect as the ethernet header length
>needs to be 64bits in order to update our mac address
>from arp without the above test case having a perfomance
>regression. Fix it by checking for the ethernet header
>having 64 bits and not.

NAK.

I don't understand what this is trying to say, but the change
below does not look valid.

>Signed-off-by: Nicholas Krause 
>---
> drivers/net/bonding/bond_alb.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index 551f0f8..d5ae8a5 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -593,7 +593,7 @@ static struct slave *rlb_choose_channel(struct sk_buff 
>*skb, struct bonding *bon
>   if ((client_info->ip_src == arp->ip_src) &&
>   (client_info->ip_dst == arp->ip_dst)) {
>   /* the entry is already assigned to this client */
>-  if (!ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) {
>+  if (ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) {
>   /* update mac address from arp */
>   ether_addr_copy(client_info->mac_dst, 
> arp->mac_dst);

Regardless of the description above, this test is insuring that
the code doesn't use the broadcast MAC address for the client.  Changing
it seems like a bad thing to do, as it would cause traffic for the
client to be sent to the ethernet broadcast address.

This might, in fact, make the described test run better, but it
has nothing to do with rebalancing and would negatively impact other
hosts on the network.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [RFC PATCH net] net/core: don't increment rx_dropped on inactive slaves

2016-01-22 Thread Jay Vosburgh

Jarod Wilson  wrote:

>The network core tries to keep track of dropped packets, but some packets
>you wouldn't really call dropped, so much as intentionally ignored, under
>certain circumstances. One such case is that of bonding and team device
>slaves that are currently inactive. Their respective rx_handler functions
>return RX_HANDLER_EXACT (the only places in the kernel that return that),
>which ends up tracking into the network core's __netif_receive_skb_core()
>function's drop path, with no pt_prev set. On a noisy network, this can
>result in a very rapidly incrementing rx_dropped counter, not only on the
>inactive slave(s), but also on the master device, such as the following:
[...]
>In this scenario, p5p1, p5p2 and p7p1 are all inactive slaves in an
>active-backup bond0, and you can see that all three have high drop counts,
>with the master bond0 showing a tally of all three.
>
>I know that this was previously discussed some here:
>
>http://www.spinics.net/lists/netdev/msg226341.html
>
>It seems additional counters never came to fruition, but honestly, for
>this particular case, I'm not even sure they're warranted, I'd be inclined
>to say just silently drop these packets without incrementing a counter. At
>least, that's probably what would make someone who has complained loudly
>about this issue happy, as they have monitoring tools that are squaking
>loudly at any increments to rx_dropped.

I don't think the kernel should silently drop packets; there
should be a counter somewhere.  If a packet is being thrown away
deliberately, it should not just vanish into the screaming void of
space.  Someday someone will try and track down where that packet is
being dropped.

I've had that same conversation with customers who insist on
accounting for every packet drop (from the "any drop is an error"
mindset), so I understand the issue.

Thinking about the prior discussion, the rx_drop_inactive is
still a good idea, but I'd actually today get good use from a
"rx_drop_unforwardable" (or an equivalent but shorter name) counter that
counts every time a packet is dropped due to is_skb_forwardable()
returning false.  __dev_forward_skb does this (and hits rx_dropped), as
does the bridge (and does not count it).

-J

>CC: "David S. Miller" 
>CC: Eric Dumazet 
>CC: Jiri Pirko 
>CC: Daniel Borkmann 
>CC: Tom Herbert 
>CC: Jay Vosburgh 
>CC: Veaceslav Falico 
>CC: Andy Gospodarek 
>CC: net...@vger.kernel.org
>Signed-off-by: Jarod Wilson 
>---
> net/core/dev.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 8cba3d8..1354c7b 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -4153,8 +4153,11 @@ ncls:
>   else
>   ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>   } else {
>+  if (deliver_exact)
>+  goto inactive; /* bond or team inactive slave */
> drop:
>   atomic_long_inc(>dev->rx_dropped);
>+inactive:
>   kfree_skb(skb);
>   /* Jamal, now you will not able to escape explaining
>* me how you were going to use this. :-)
>-- 
>1.8.3.1
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [RFC PATCH net] net/core: don't increment rx_dropped on inactive slaves

2016-01-22 Thread Jay Vosburgh

Jarod Wilson <ja...@redhat.com> wrote:

>The network core tries to keep track of dropped packets, but some packets
>you wouldn't really call dropped, so much as intentionally ignored, under
>certain circumstances. One such case is that of bonding and team device
>slaves that are currently inactive. Their respective rx_handler functions
>return RX_HANDLER_EXACT (the only places in the kernel that return that),
>which ends up tracking into the network core's __netif_receive_skb_core()
>function's drop path, with no pt_prev set. On a noisy network, this can
>result in a very rapidly incrementing rx_dropped counter, not only on the
>inactive slave(s), but also on the master device, such as the following:
[...]
>In this scenario, p5p1, p5p2 and p7p1 are all inactive slaves in an
>active-backup bond0, and you can see that all three have high drop counts,
>with the master bond0 showing a tally of all three.
>
>I know that this was previously discussed some here:
>
>http://www.spinics.net/lists/netdev/msg226341.html
>
>It seems additional counters never came to fruition, but honestly, for
>this particular case, I'm not even sure they're warranted, I'd be inclined
>to say just silently drop these packets without incrementing a counter. At
>least, that's probably what would make someone who has complained loudly
>about this issue happy, as they have monitoring tools that are squaking
>loudly at any increments to rx_dropped.

I don't think the kernel should silently drop packets; there
should be a counter somewhere.  If a packet is being thrown away
deliberately, it should not just vanish into the screaming void of
space.  Someday someone will try and track down where that packet is
being dropped.

I've had that same conversation with customers who insist on
accounting for every packet drop (from the "any drop is an error"
mindset), so I understand the issue.

Thinking about the prior discussion, the rx_drop_inactive is
still a good idea, but I'd actually today get good use from a
"rx_drop_unforwardable" (or an equivalent but shorter name) counter that
counts every time a packet is dropped due to is_skb_forwardable()
returning false.  __dev_forward_skb does this (and hits rx_dropped), as
does the bridge (and does not count it).

-J

>CC: "David S. Miller" <da...@davemloft.net>
>CC: Eric Dumazet <eduma...@google.com>
>CC: Jiri Pirko <j...@mellanox.com>
>CC: Daniel Borkmann <dan...@iogearbox.net>
>CC: Tom Herbert <t...@herbertland.com>
>CC: Jay Vosburgh <j.vosbu...@gmail.com>
>CC: Veaceslav Falico <vfal...@gmail.com>
>CC: Andy Gospodarek <go...@cumulusnetworks.com>
>CC: net...@vger.kernel.org
>Signed-off-by: Jarod Wilson <ja...@redhat.com>
>---
> net/core/dev.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 8cba3d8..1354c7b 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -4153,8 +4153,11 @@ ncls:
>   else
>   ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>   } else {
>+  if (deliver_exact)
>+  goto inactive; /* bond or team inactive slave */
> drop:
>       atomic_long_inc(>dev->rx_dropped);
>+inactive:
>   kfree_skb(skb);
>   /* Jamal, now you will not able to escape explaining
>* me how you were going to use this. :-)
>-- 
>1.8.3.1
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH v4] net/bonding: send arp in interval if no active slave

2015-10-09 Thread Jay Vosburgh

Jarod Wilson  wrote:

>Jarod Wilson wrote:
>...
>> As Andy already stated I'm not a fan of such workarounds either but it's
>> necessary sometimes so if this is going to be actually considered then a
>> few things need to be fixed. Please make this a proper bonding option
>> which can be changed at runtime and not only via a module parameter.
>
>Is there any particular userspace tool that would need some updating, or
>is adding the sysfs knobs sufficient here? I think I've got all the sysfs
>stuff thrown together now, but still need to test.

Most (all?) bonding options should be configurable via iproute
(netlink) now.

>
>>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>>> using
>>> a different interval ? This seems like a very specific problem to have a
>>> whole new option for.
>>
>> ...I'll wait until we've heard confirmation from Uwe that intervals
>> other than 500ms don't fix things.
>
>Okay, so I believe the "only tested with 500ms" was in reference to
>testing with Uwe's initial patch. I do have supporting evidence in a
>bugzilla report that shows upwards of 5000ms still experience the problem
>here.

I did set up some switches and attempt to reproduce this
yesterday; I daisy-chained three switches (two Cisco and an HP) together
and connected the bonded interfaces to the "end" switches.  I tried
various ARP targets (the switch, hosts on various points of the switch)
and varying arp_intervals and was unable to reproduce the problem.

As I understand it, the working theory is something like this:

- host with two bonded interfaces, A and B.  For active-backup
mode, the interfaces have been assigned the same MAC address.

- switch has MAC for B in its forwarding table

- bonding goes from down to up, and thinks all its slaves are
down, and starts the "curr_arp_slave" search for an active
arp_ip_target.  In this case, it starts with A, and sends an ARP from A.

As an aside, I'm not 100% clear on what exactly is going on in
the "bonding goes from down to up" transition; this seems to be key in
reproducing the issue.

- switch sees source mac coming from port A, starts to update
its forwarding table

- meanwhile, switch forwards ARP request, and receives ARP
reply, which it forwards to port B.  Bonding drops this, as the slave is
inactive.

- switch finishes updating forwarding table, MAC is now assigned
to port A.

- bonding now tries sending on port B, and the cycle repeats.

If this is what's taking place, then the arp_interval itself is
irrelevant, the race is between the switch table update and the
generation of the ARP reply.

Also, presuming the above is what's going on, we could modify
the ARP "curr_arp_slave" logic a bit to resolve this without requiring
any magic knobs.

For example, we could change the "drop on inactive" logic to
recognise the "curr_arp_slave" search and accept the unicast ARP reply,
and perhaps make that receiving slave the next curr_arp_slave
automatically.

I also wonder if the fail_over_mac option would affect this
behavior, as it would cause the slaves to keep their MAC address for the
duration, so the switch would not see the MAC move from port to port.

Another thought would be to have the curr_arp_slave cycle
through the slaves in random order, but that could create
non-deterministic results even when things are working correctly.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v4] net/bonding: send arp in interval if no active slave

2015-10-09 Thread Jay Vosburgh

Jarod Wilson <ja...@redhat.com> wrote:

>Jarod Wilson wrote:
>...
>> As Andy already stated I'm not a fan of such workarounds either but it's
>> necessary sometimes so if this is going to be actually considered then a
>> few things need to be fixed. Please make this a proper bonding option
>> which can be changed at runtime and not only via a module parameter.
>
>Is there any particular userspace tool that would need some updating, or
>is adding the sysfs knobs sufficient here? I think I've got all the sysfs
>stuff thrown together now, but still need to test.

Most (all?) bonding options should be configurable via iproute
(netlink) now.

>
>>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>>> using
>>> a different interval ? This seems like a very specific problem to have a
>>> whole new option for.
>>
>> ...I'll wait until we've heard confirmation from Uwe that intervals
>> other than 500ms don't fix things.
>
>Okay, so I believe the "only tested with 500ms" was in reference to
>testing with Uwe's initial patch. I do have supporting evidence in a
>bugzilla report that shows upwards of 5000ms still experience the problem
>here.

I did set up some switches and attempt to reproduce this
yesterday; I daisy-chained three switches (two Cisco and an HP) together
and connected the bonded interfaces to the "end" switches.  I tried
various ARP targets (the switch, hosts on various points of the switch)
and varying arp_intervals and was unable to reproduce the problem.

As I understand it, the working theory is something like this:

- host with two bonded interfaces, A and B.  For active-backup
mode, the interfaces have been assigned the same MAC address.

- switch has MAC for B in its forwarding table

- bonding goes from down to up, and thinks all its slaves are
down, and starts the "curr_arp_slave" search for an active
arp_ip_target.  In this case, it starts with A, and sends an ARP from A.

As an aside, I'm not 100% clear on what exactly is going on in
the "bonding goes from down to up" transition; this seems to be key in
reproducing the issue.

- switch sees source mac coming from port A, starts to update
its forwarding table

- meanwhile, switch forwards ARP request, and receives ARP
reply, which it forwards to port B.  Bonding drops this, as the slave is
inactive.

- switch finishes updating forwarding table, MAC is now assigned
to port A.

- bonding now tries sending on port B, and the cycle repeats.

If this is what's taking place, then the arp_interval itself is
irrelevant, the race is between the switch table update and the
generation of the ARP reply.

Also, presuming the above is what's going on, we could modify
the ARP "curr_arp_slave" logic a bit to resolve this without requiring
any magic knobs.

For example, we could change the "drop on inactive" logic to
recognise the "curr_arp_slave" search and accept the unicast ARP reply,
and perhaps make that receiving slave the next curr_arp_slave
automatically.

I also wonder if the fail_over_mac option would affect this
behavior, as it would cause the slaves to keep their MAC address for the
duration, so the switch would not see the MAC move from port to port.

Another thought would be to have the curr_arp_slave cycle
through the slaves in random order, but that could create
non-deterministic results even when things are working correctly.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] net/bonding: send arp in interval if no active slave

2015-09-03 Thread Jay Vosburgh

Uwe Koziolek  wrote:

>On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote:
>> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
>>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>>>> Uwe Koziolek  wrote:
>>>>
>>>>> On2015-08-17 07:12 PM,Jarod Wilson wrote:
>>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>>>>>> From: Uwe Koziolek 
>>>>>>>>
>>>>>>>> With some very finicky switch hardware, active backup bonding can get
>>>>>>>> into
>>>>>>>> a situation where we play ping-pong between interfaces, trying to get
>>>>>>>> one
>>>>>>>> to come up as the active slave. There seems to be an issue with the
>>>>>>>> switch's arp replies either taking too long, or simply getting lost,
>>>>>>>> so we
>>>>>>>> wind up unable to get any interface up and active. Sometimes, the issue
>>>>>>>> sorts itself out after a while, sometimes it doesn't.
>>>>>>>>
>>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>>>>>> additional
>>>>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>>>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>>>>>>> with
>>>>>>>> this hardware combination.
>>>>>>> Sorry, I don't understand the logic of why it works, and what exactly
>>>>>>> are
>>>>>>> we fixiing here.
>>>>>>>
>>>>>>> It also breaks completely the logic for link state management in case
>>>>>>> of no
>>>>>>> current active slave for 2*arp_interval.
>>>>>>>
>>>>>>> Could you please elaborate what exactly is fixed here, and how it
>>>>>>> works? :)
>>>>>> I can either duplicate some information from the bug, or Uwe can, to
>>>>>> illustrate the exact nature of the problem.
>>>>>>
>>>>>>> p.s. num_grat_arp maybe could help?
>>>>>> That was my thought as well, but as I understand it, that route was
>>>>>> explored, and it didn't help any. I don't actually have a reproducer
>>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle
>>>>>> here...
>>>>>>
>>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>>>>> settings were tried that didn't help? I'm still of the mind that if
>>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off
>>>>>> num_grat_arp.
>>>>> The bonding slaves are connected to high available switches, each of the
>>>>> slaves is connected to a different switch. If the bond is starting, only
>>>>> the selected slave sends one arp-request. If a matching arp_response was
>>>>> received, this slave and the bond is going into state up, sending the
>>>>> gratitious arps...
>>>>> But if you got no arp reply the next slave was selected.
>>>>> With most of the newer switches, not overloaded, or with other software
>>>>> bugs, or with a single switch configuration, you would get a arp response
>>>>> on the first arp request.
>>>>> But in case of high availability configuration with non perfect switches
>>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a
>>>>> response on the first arp request.
>>>>>
>>>>> I have seen network snoops, there the switches are not responding to the
>>>>> first arp request on slave 1, the second arp request was sent on slave 2
>>>>> but the response was received on slave one,  and all following arp
>>>>> requests are anwsered on the wrong slave for a longer time.
>>>>Could you elaborate on the exact "high availability
>>>> configuration" here, including the model(s) of switch(es) involved?
>>>>
>>>>Is this some kind of race between the switch or switches
>>>> updating the forwarding tables and the bond flip flopping between the
>>>> slaves

Re: [PATCH] net/bonding: send arp in interval if no active slave

2015-09-03 Thread Jay Vosburgh

Uwe Koziolek <uwe.kozio...@redknee.com> wrote:

>On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote:
>> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
>>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>>>> Uwe Koziolek <uwe.kozio...@redknee.com> wrote:
>>>>
>>>>> On2015-08-17 07:12 PM,Jarod Wilson wrote:
>>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>>>>>> From: Uwe Koziolek <uwe.kozio...@redknee.com>
>>>>>>>>
>>>>>>>> With some very finicky switch hardware, active backup bonding can get
>>>>>>>> into
>>>>>>>> a situation where we play ping-pong between interfaces, trying to get
>>>>>>>> one
>>>>>>>> to come up as the active slave. There seems to be an issue with the
>>>>>>>> switch's arp replies either taking too long, or simply getting lost,
>>>>>>>> so we
>>>>>>>> wind up unable to get any interface up and active. Sometimes, the issue
>>>>>>>> sorts itself out after a while, sometimes it doesn't.
>>>>>>>>
>>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>>>>>> additional
>>>>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>>>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>>>>>>> with
>>>>>>>> this hardware combination.
>>>>>>> Sorry, I don't understand the logic of why it works, and what exactly
>>>>>>> are
>>>>>>> we fixiing here.
>>>>>>>
>>>>>>> It also breaks completely the logic for link state management in case
>>>>>>> of no
>>>>>>> current active slave for 2*arp_interval.
>>>>>>>
>>>>>>> Could you please elaborate what exactly is fixed here, and how it
>>>>>>> works? :)
>>>>>> I can either duplicate some information from the bug, or Uwe can, to
>>>>>> illustrate the exact nature of the problem.
>>>>>>
>>>>>>> p.s. num_grat_arp maybe could help?
>>>>>> That was my thought as well, but as I understand it, that route was
>>>>>> explored, and it didn't help any. I don't actually have a reproducer
>>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle
>>>>>> here...
>>>>>>
>>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>>>>> settings were tried that didn't help? I'm still of the mind that if
>>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off
>>>>>> num_grat_arp.
>>>>> The bonding slaves are connected to high available switches, each of the
>>>>> slaves is connected to a different switch. If the bond is starting, only
>>>>> the selected slave sends one arp-request. If a matching arp_response was
>>>>> received, this slave and the bond is going into state up, sending the
>>>>> gratitious arps...
>>>>> But if you got no arp reply the next slave was selected.
>>>>> With most of the newer switches, not overloaded, or with other software
>>>>> bugs, or with a single switch configuration, you would get a arp response
>>>>> on the first arp request.
>>>>> But in case of high availability configuration with non perfect switches
>>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a
>>>>> response on the first arp request.
>>>>>
>>>>> I have seen network snoops, there the switches are not responding to the
>>>>> first arp request on slave 1, the second arp request was sent on slave 2
>>>>> but the response was received on slave one,  and all following arp
>>>>> requests are anwsered on the wrong slave for a longer time.
>>>>Could you elaborate on the exact "high availability
>>>> configuration" here, including the model(s) of switch(es) involved?
>>>>
>>>>Is this some kind of race between the switch or switches
>>>&

Re: [PATCH] net/bonding: send arp in interval if no active slave

2015-08-17 Thread Jay Vosburgh

Uwe Koziolek  wrote:

>On2015-08-17 07:12 PM,Jarod Wilson wrote:
>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>> From: Uwe Koziolek 
>>>>
>>>> With some very finicky switch hardware, active backup bonding can get
>>>> into
>>>> a situation where we play ping-pong between interfaces, trying to get
>>>> one
>>>> to come up as the active slave. There seems to be an issue with the
>>>> switch's arp replies either taking too long, or simply getting lost,
>>>> so we
>>>> wind up unable to get any interface up and active. Sometimes, the issue
>>>> sorts itself out after a while, sometimes it doesn't.
>>>>
>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>> additional
>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>>> with
>>>> this hardware combination.
>>>
>>> Sorry, I don't understand the logic of why it works, and what exactly
>>> are
>>> we fixiing here.
>>>
>>> It also breaks completely the logic for link state management in case
>>> of no
>>> current active slave for 2*arp_interval.
>>>
>>> Could you please elaborate what exactly is fixed here, and how it
>>> works? :)
>>
>> I can either duplicate some information from the bug, or Uwe can, to
>> illustrate the exact nature of the problem.
>>
>>> p.s. num_grat_arp maybe could help?
>>
>> That was my thought as well, but as I understand it, that route was
>> explored, and it didn't help any. I don't actually have a reproducer
>> setup of my own, unfortunately, so I'm kind of caught in the middle
>> here...
>>
>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>> settings were tried that didn't help? I'm still of the mind that if
>> num_grat_arp *didn't* help, we probably need to do something keyed off
>> num_grat_arp.
>The bonding slaves are connected to high available switches, each of the
>slaves is connected to a different switch. If the bond is starting, only
>the selected slave sends one arp-request. If a matching arp_response was
>received, this slave and the bond is going into state up, sending the
>gratitious arps...
>But if you got no arp reply the next slave was selected.
>With most of the newer switches, not overloaded, or with other software
>bugs, or with a single switch configuration, you would get a arp response
>on the first arp request.
>But in case of high availability configuration with non perfect switches
>like HP ProCurve 54xx, also with some Cisco models, you may not get a
>response on the first arp request.
>
>I have seen network snoops, there the switches are not responding to the
>first arp request on slave 1, the second arp request was sent on slave 2
>but the response was received on slave one,  and all following arp
>requests are anwsered on the wrong slave for a longer time.

Could you elaborate on the exact "high availability
configuration" here, including the model(s) of switch(es) involved?

Is this some kind of race between the switch or switches
updating the forwarding tables and the bond flip flopping between the
slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
the forwarding table, but (for whatever reason) there is no reply.  ARP
on slave 2 is sent (using the same source MAC, unless you set
fail_over_mac), but forwarding tables still send that MAC to slave 1, so
reply is sent there.

>The proposed change sents up to 3 arp requests on a down bond using the
>same slave, delayed by arp_interval.
>Using problematic switches i have seen the the arp response on the right
>slave at latest on the second arp request. So the bond is going into state
>up.
>
>How does it works:
>The bonds in up state are handled on the beginning of bond_ab_arp_probe
>procedure, the other part of this procedure is handling the slave change.
>The proposed change is bypassing the slave change for 2 additional calls
>of bond_ab_arp_probe.
>Now the retries are not only for an up bond available, they are also
>implemented for a down bond.

Does this delay failover or bringup on switches that are not
"problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
this impact failover / recovery times?

-J

>The num_grat_arp has no chance to solve the problem. The num_grat_arp is
>only used, if a different slave is going active.
>But in our case, th

Re: [PATCH] net/bonding: send arp in interval if no active slave

2015-08-17 Thread Jay Vosburgh

Uwe Koziolek uwe.kozio...@redknee.com wrote:

On2015-08-17 07:12 PM,Jarod Wilson wrote:
 On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
 On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
 From: Uwe Koziolek uwe.kozio...@redknee.com

 With some very finicky switch hardware, active backup bonding can get
 into
 a situation where we play ping-pong between interfaces, trying to get
 one
 to come up as the active slave. There seems to be an issue with the
 switch's arp replies either taking too long, or simply getting lost,
 so we
 wind up unable to get any interface up and active. Sometimes, the issue
 sorts itself out after a while, sometimes it doesn't.

 Testing with num_grat_arp has proven fruitless, but sending an
 additional
 arp on curr_arp_slave if we're still in the arp_interval timeslice in
 bond_ab_arp_probe(), has shown to produce 100% reliability in testing
 with
 this hardware combination.

 Sorry, I don't understand the logic of why it works, and what exactly
 are
 we fixiing here.

 It also breaks completely the logic for link state management in case
 of no
 current active slave for 2*arp_interval.

 Could you please elaborate what exactly is fixed here, and how it
 works? :)

 I can either duplicate some information from the bug, or Uwe can, to
 illustrate the exact nature of the problem.

 p.s. num_grat_arp maybe could help?

 That was my thought as well, but as I understand it, that route was
 explored, and it didn't help any. I don't actually have a reproducer
 setup of my own, unfortunately, so I'm kind of caught in the middle
 here...

 Uwe, can you perhaps further enlighten us as to what num_grat_arp
 settings were tried that didn't help? I'm still of the mind that if
 num_grat_arp *didn't* help, we probably need to do something keyed off
 num_grat_arp.
The bonding slaves are connected to high available switches, each of the
slaves is connected to a different switch. If the bond is starting, only
the selected slave sends one arp-request. If a matching arp_response was
received, this slave and the bond is going into state up, sending the
gratitious arps...
But if you got no arp reply the next slave was selected.
With most of the newer switches, not overloaded, or with other software
bugs, or with a single switch configuration, you would get a arp response
on the first arp request.
But in case of high availability configuration with non perfect switches
like HP ProCurve 54xx, also with some Cisco models, you may not get a
response on the first arp request.

I have seen network snoops, there the switches are not responding to the
first arp request on slave 1, the second arp request was sent on slave 2
but the response was received on slave one,  and all following arp
requests are anwsered on the wrong slave for a longer time.

Could you elaborate on the exact high availability
configuration here, including the model(s) of switch(es) involved?

Is this some kind of race between the switch or switches
updating the forwarding tables and the bond flip flopping between the
slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
the forwarding table, but (for whatever reason) there is no reply.  ARP
on slave 2 is sent (using the same source MAC, unless you set
fail_over_mac), but forwarding tables still send that MAC to slave 1, so
reply is sent there.

The proposed change sents up to 3 arp requests on a down bond using the
same slave, delayed by arp_interval.
Using problematic switches i have seen the the arp response on the right
slave at latest on the second arp request. So the bond is going into state
up.

How does it works:
The bonds in up state are handled on the beginning of bond_ab_arp_probe
procedure, the other part of this procedure is handling the slave change.
The proposed change is bypassing the slave change for 2 additional calls
of bond_ab_arp_probe.
Now the retries are not only for an up bond available, they are also
implemented for a down bond.

Does this delay failover or bringup on switches that are not
problematic?  I.e., if arp_interval is, say, 1000 (1 second), will
this impact failover / recovery times?

-J

The num_grat_arp has no chance to solve the problem. The num_grat_arp is
only used, if a different slave is going active.
But in our case, the bonding slaves are not going into the state active
for a longer time.

 [jarod: manufacturing of changelog]
 CC: Jay Vosburgh j.vosbu...@gmail.com
 CC: Veaceslav Falico vfal...@gmail.com
 CC: Andy Gospodarek go...@cumulusnetworks.com
 CC: net...@vger.kernel.org
 Signed-off-by: Uwe Koziolek uwe.kozio...@redknee.com
 Signed-off-by: Jarod Wilson ja...@redhat.com
 ---
 drivers/net/bonding/bond_main.c | 5 +
 1 file changed, 5 insertions(+)

 diff --git a/drivers/net/bonding/bond_main.c
 b/drivers/net/bonding/bond_main.c
 index 0c627b4..60b9483 100644
 --- a/drivers/net/bonding/bond_main.c
 +++ b/drivers/net/bonding/bond_main.c
 @@ -2794,6 +2794,11 @@ static bool

Re: [PATCH v3] bonding: "primary_reselect" with "failure" is not working properly

2015-07-07 Thread Jay Vosburgh

Mazhar Rana  wrote:

>From: Mazhar Rana 
>
>When "primary_reselect" is set to "failure", primary interface should
>not become active until current active slave is down. But if we set first
>member of bond device as a "primary" interface and "primary_reselect"
>is set to "failure" then whenever primary interface's link get back(up)
>it become active slave even if current active slave is still up.
>
>With this patch, "bond_find_best_slave" will not traverse members if
>primary interface is not candidate for failover/reselection and current
>active slave is still up.
>
>Signed-off-by: Mazhar Rana 
>Signed-off-by: Jay Vosburgh 

I don't believe I posted a Signed-off-by, so you really
shouldn't include one for me without it being explicitly stated.

In any event, I'm good with the patch, so:

Signed-off-by: Jay Vosburgh 

-J

>---
>
>v1->v2: 
>return "curr" instead of "bond->curr_active_slave".
>
>v2->v3: 
>To make code more clear, replaced function bond_should_change_active
>with bond_choose_primary_or_current which will return slave device.
>
> drivers/net/bonding/bond_main.c | 51 +++--
> 1 file changed, 34 insertions(+), 17 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 19eb990..317a494 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -689,40 +689,57 @@ out:
> 
> }
> 
>-static bool bond_should_change_active(struct bonding *bond)
>+static struct slave *bond_choose_primary_or_current(struct bonding *bond)
> {
>   struct slave *prim = rtnl_dereference(bond->primary_slave);
>   struct slave *curr = rtnl_dereference(bond->curr_active_slave);
> 
>-  if (!prim || !curr || curr->link != BOND_LINK_UP)
>-  return true;
>+  if (!prim || prim->link != BOND_LINK_UP) {
>+  if (!curr || curr->link != BOND_LINK_UP)
>+  return NULL;
>+  return curr;
>+  }
>+
>   if (bond->force_primary) {
>   bond->force_primary = false;
>-  return true;
>+  return prim;
>+  }
>+
>+  if (!curr || curr->link != BOND_LINK_UP)
>+  return prim;
>+
>+  /* At this point, prim and curr are both up */
>+  switch (bond->params.primary_reselect) {
>+  case BOND_PRI_RESELECT_ALWAYS:
>+  return prim;
>+  case BOND_PRI_RESELECT_BETTER:
>+  if (prim->speed < curr->speed)
>+  return curr;
>+  if (prim->speed == curr->speed && prim->duplex <= curr->duplex)
>+  return curr;
>+  return prim;
>+  case BOND_PRI_RESELECT_FAILURE:
>+  return curr;
>+  default:
>+  netdev_err(bond->dev, "impossible primary_reselect %d\n",
>+ bond->params.primary_reselect);
>+  return curr;
>   }
>-  if (bond->params.primary_reselect == BOND_PRI_RESELECT_BETTER &&
>-  (prim->speed < curr->speed ||
>-   (prim->speed == curr->speed && prim->duplex <= curr->duplex)))
>-  return false;
>-  if (bond->params.primary_reselect == BOND_PRI_RESELECT_FAILURE)
>-  return false;
>-  return true;
> }
> 
> /**
>- * find_best_interface - select the best available slave to be the active one
>+ * bond_find_best_slave - select the best available slave to be the active one
>  * @bond: our bonding struct
>  */
> static struct slave *bond_find_best_slave(struct bonding *bond)
> {
>-  struct slave *slave, *bestslave = NULL, *primary;
>+  struct slave *slave, *bestslave = NULL;
>   struct list_head *iter;
>   int mintime = bond->params.updelay;
> 
>-  primary = rtnl_dereference(bond->primary_slave);
>-  if (primary && primary->link == BOND_LINK_UP &&
>-  bond_should_change_active(bond))
>-  return primary;
>+  slave = bond_choose_primary_or_current(bond);
>+  if (slave)
>+  return slave;
> 
>   bond_for_each_slave(bond, slave, iter) {
>   if (slave->link == BOND_LINK_UP)
>-- 
>1.9.1
>
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majord...@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3] bonding: primary_reselect with failure is not working properly

2015-07-07 Thread Jay Vosburgh

Mazhar Rana ranamazh...@gmail.com wrote:

From: Mazhar Rana mazhar.r...@cyberoam.com

When primary_reselect is set to failure, primary interface should
not become active until current active slave is down. But if we set first
member of bond device as a primary interface and primary_reselect
is set to failure then whenever primary interface's link get back(up)
it become active slave even if current active slave is still up.

With this patch, bond_find_best_slave will not traverse members if
primary interface is not candidate for failover/reselection and current
active slave is still up.

Signed-off-by: Mazhar Rana mazhar.r...@cyberoam.com
Signed-off-by: Jay Vosburgh j.vosbu...@gmail.com

I don't believe I posted a Signed-off-by, so you really
shouldn't include one for me without it being explicitly stated.

In any event, I'm good with the patch, so:

Signed-off-by: Jay Vosburgh jay.vosbu...@canonical.com

-J

---

v1-v2: 
return curr instead of bond-curr_active_slave.

v2-v3: 
To make code more clear, replaced function bond_should_change_active
with bond_choose_primary_or_current which will return slave device.

 drivers/net/bonding/bond_main.c | 51 +++--
 1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 19eb990..317a494 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -689,40 +689,57 @@ out:
 
 }
 
-static bool bond_should_change_active(struct bonding *bond)
+static struct slave *bond_choose_primary_or_current(struct bonding *bond)
 {
   struct slave *prim = rtnl_dereference(bond-primary_slave);
   struct slave *curr = rtnl_dereference(bond-curr_active_slave);
 
-  if (!prim || !curr || curr-link != BOND_LINK_UP)
-  return true;
+  if (!prim || prim-link != BOND_LINK_UP) {
+  if (!curr || curr-link != BOND_LINK_UP)
+  return NULL;
+  return curr;
+  }
+
   if (bond-force_primary) {
   bond-force_primary = false;
-  return true;
+  return prim;
+  }
+
+  if (!curr || curr-link != BOND_LINK_UP)
+  return prim;
+
+  /* At this point, prim and curr are both up */
+  switch (bond-params.primary_reselect) {
+  case BOND_PRI_RESELECT_ALWAYS:
+  return prim;
+  case BOND_PRI_RESELECT_BETTER:
+  if (prim-speed  curr-speed)
+  return curr;
+  if (prim-speed == curr-speed  prim-duplex = curr-duplex)
+  return curr;
+  return prim;
+  case BOND_PRI_RESELECT_FAILURE:
+  return curr;
+  default:
+  netdev_err(bond-dev, impossible primary_reselect %d\n,
+ bond-params.primary_reselect);
+  return curr;
   }
-  if (bond-params.primary_reselect == BOND_PRI_RESELECT_BETTER 
-  (prim-speed  curr-speed ||
-   (prim-speed == curr-speed  prim-duplex = curr-duplex)))
-  return false;
-  if (bond-params.primary_reselect == BOND_PRI_RESELECT_FAILURE)
-  return false;
-  return true;
 }
 
 /**
- * find_best_interface - select the best available slave to be the active one
+ * bond_find_best_slave - select the best available slave to be the active one
  * @bond: our bonding struct
  */
 static struct slave *bond_find_best_slave(struct bonding *bond)
 {
-  struct slave *slave, *bestslave = NULL, *primary;
+  struct slave *slave, *bestslave = NULL;
   struct list_head *iter;
   int mintime = bond-params.updelay;
 
-  primary = rtnl_dereference(bond-primary_slave);
-  if (primary  primary-link == BOND_LINK_UP 
-  bond_should_change_active(bond))
-  return primary;
+  slave = bond_choose_primary_or_current(bond);
+  if (slave)
+  return slave;
 
   bond_for_each_slave(bond, slave, iter) {
   if (slave-link == BOND_LINK_UP)
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] bonding: "primary_reselect" with "failure" is not working properly

2015-07-03 Thread Jay Vosburgh

GMAIL  wrote:

>Hi Jay,
>
>On Friday 03 July 2015 02:12 AM, Jay Vosburgh wrote:
>
>>  [ added netdev to cc ]
>>
>> Mazhar Rana  wrote:
>>
>>> When "primary_reselect" is set to "failure", primary interface should
>>> not become active until current active slave is up. But if we set first
>>  I think you mean "until current active slave is down" here, not
>> "up."
>
>Yes, It should be "up", grammatical mistake

"down," right?

[...]
>Below is the updated version of your patch. Any Comments or suggestions ?
[...]
> static struct slave *bond_find_best_slave(struct bonding *bond)
> {
>-  struct slave *slave, *bestslave = NULL, *primary;
>+  struct slave *slave = NULL, *bestslave = NULL, *primary;
>   struct list_head *iter;
>   int mintime = bond->params.updelay;
>   primary = rtnl_dereference(bond->primary_slave);
>-  if (primary && primary->link == BOND_LINK_UP &&
>-  bond_should_change_active(bond))
>-  return primary;
>+  if (primary && primary->link == BOND_LINK_UP)
>+  slave = bond_choose_primary_or_current(bond);
>+
>+  if (slave)
>+  return slave;
>   bond_for_each_slave(bond, slave, iter) {
>   if (slave->link == BOND_LINK_UP)

I think this will misbehave in the case that curr is up and
available, but primary is NULL (this can happen when the primary option
is cleared).  In this case, the above code will not call
bond_choose_primary_or_current, and will then run the loop to find a new
curr, which may select a different slave unnecessarily.

How does the following look?  I prefer to make the call to
choose_primary_or_current unconditional, and have it decide if the
search loop should be run.  In this version, _choose_ tests curr if prim
is not suitable.  Compile tested only.

Thoughts?

-J

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 19eb990..1e35e25 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -689,40 +689,57 @@ out:
 
 }
 
-static bool bond_should_change_active(struct bonding *bond)
+static struct slave *bond_choose_primary_or_current(struct bonding *bond)
 {
struct slave *prim = rtnl_dereference(bond->primary_slave);
struct slave *curr = rtnl_dereference(bond->curr_active_slave);
 
-   if (!prim || !curr || curr->link != BOND_LINK_UP)
-   return true;
+   if (!prim || !prim->link == BOND_LINK_UP) {
+   if (!curr || !curr->link == BOND_LINK_UP)
+   return NULL;
+   return curr;
+   }
+
if (bond->force_primary) {
bond->force_primary = false;
-   return true;
+   return prim;
+   }
+
+   if (!curr || curr->link != BOND_LINK_UP)
+   return prim;
+
+   /* At this point, prim and curr are both up */
+   switch (bond->params.primary_reselect) {
+   case BOND_PRI_RESELECT_ALWAYS:
+   return prim;
+   case BOND_PRI_RESELECT_BETTER:
+   if (prim->speed < curr->speed)
+   return curr;
+   if (prim->speed == curr->speed && prim->duplex <= curr->duplex)
+   return curr;
+   return prim;
+   case BOND_PRI_RESELECT_FAILURE:
+   return curr;
+   default:
+   netdev_err(bond->dev, "impossible primary_reselect %d\n",
+  bond->params.primary_reselect);
+   return curr;
}
-   if (bond->params.primary_reselect == BOND_PRI_RESELECT_BETTER &&
-   (prim->speed < curr->speed ||
-(prim->speed == curr->speed && prim->duplex <= curr->duplex)))
-   return false;
-   if (bond->params.primary_reselect == BOND_PRI_RESELECT_FAILURE)
-   return false;
-   return true;
 }
 
 /**
- * find_best_interface - select the best available slave to be the active one
+ * bond_find_best_slave - select the best available slave to be the active one
  * @bond: our bonding struct
  */
 static struct slave *bond_find_best_slave(struct bonding *bond)
 {
-   struct slave *slave, *bestslave = NULL, *primary;
+   struct slave *slave, *bestslave = NULL;
struct list_head *iter;
int mintime = bond->params.updelay;
 
-   primary = rtnl_dereference(bond->primary_slave);
-   if (primary && primary->link == BOND_LINK_UP &&
-   bond_should_change_active(bond))
-   return primary;
+   slave = bond_choose_primary_or_current(bond);
+

Re: [PATCH v2] bonding: primary_reselect with failure is not working properly

2015-07-03 Thread Jay Vosburgh

GMAIL ranamazh...@gmail.com wrote:

Hi Jay,

On Friday 03 July 2015 02:12 AM, Jay Vosburgh wrote:

  [ added netdev to cc ]

 Mazhar Rana ranamazh...@gmail.com wrote:

 When primary_reselect is set to failure, primary interface should
 not become active until current active slave is up. But if we set first
  I think you mean until current active slave is down here, not
 up.

Yes, It should be up, grammatical mistake

down, right?

[...]
Below is the updated version of your patch. Any Comments or suggestions ?
[...]
 static struct slave *bond_find_best_slave(struct bonding *bond)
 {
-  struct slave *slave, *bestslave = NULL, *primary;
+  struct slave *slave = NULL, *bestslave = NULL, *primary;
   struct list_head *iter;
   int mintime = bond-params.updelay;
   primary = rtnl_dereference(bond-primary_slave);
-  if (primary  primary-link == BOND_LINK_UP 
-  bond_should_change_active(bond))
-  return primary;
+  if (primary  primary-link == BOND_LINK_UP)
+  slave = bond_choose_primary_or_current(bond);
+
+  if (slave)
+  return slave;
   bond_for_each_slave(bond, slave, iter) {
   if (slave-link == BOND_LINK_UP)

I think this will misbehave in the case that curr is up and
available, but primary is NULL (this can happen when the primary option
is cleared).  In this case, the above code will not call
bond_choose_primary_or_current, and will then run the loop to find a new
curr, which may select a different slave unnecessarily.

How does the following look?  I prefer to make the call to
choose_primary_or_current unconditional, and have it decide if the
search loop should be run.  In this version, _choose_ tests curr if prim
is not suitable.  Compile tested only.

Thoughts?

-J

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 19eb990..1e35e25 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -689,40 +689,57 @@ out:
 
 }
 
-static bool bond_should_change_active(struct bonding *bond)
+static struct slave *bond_choose_primary_or_current(struct bonding *bond)
 {
struct slave *prim = rtnl_dereference(bond-primary_slave);
struct slave *curr = rtnl_dereference(bond-curr_active_slave);
 
-   if (!prim || !curr || curr-link != BOND_LINK_UP)
-   return true;
+   if (!prim || !prim-link == BOND_LINK_UP) {
+   if (!curr || !curr-link == BOND_LINK_UP)
+   return NULL;
+   return curr;
+   }
+
if (bond-force_primary) {
bond-force_primary = false;
-   return true;
+   return prim;
+   }
+
+   if (!curr || curr-link != BOND_LINK_UP)
+   return prim;
+
+   /* At this point, prim and curr are both up */
+   switch (bond-params.primary_reselect) {
+   case BOND_PRI_RESELECT_ALWAYS:
+   return prim;
+   case BOND_PRI_RESELECT_BETTER:
+   if (prim-speed  curr-speed)
+   return curr;
+   if (prim-speed == curr-speed  prim-duplex = curr-duplex)
+   return curr;
+   return prim;
+   case BOND_PRI_RESELECT_FAILURE:
+   return curr;
+   default:
+   netdev_err(bond-dev, impossible primary_reselect %d\n,
+  bond-params.primary_reselect);
+   return curr;
}
-   if (bond-params.primary_reselect == BOND_PRI_RESELECT_BETTER 
-   (prim-speed  curr-speed ||
-(prim-speed == curr-speed  prim-duplex = curr-duplex)))
-   return false;
-   if (bond-params.primary_reselect == BOND_PRI_RESELECT_FAILURE)
-   return false;
-   return true;
 }
 
 /**
- * find_best_interface - select the best available slave to be the active one
+ * bond_find_best_slave - select the best available slave to be the active one
  * @bond: our bonding struct
  */
 static struct slave *bond_find_best_slave(struct bonding *bond)
 {
-   struct slave *slave, *bestslave = NULL, *primary;
+   struct slave *slave, *bestslave = NULL;
struct list_head *iter;
int mintime = bond-params.updelay;
 
-   primary = rtnl_dereference(bond-primary_slave);
-   if (primary  primary-link == BOND_LINK_UP 
-   bond_should_change_active(bond))
-   return primary;
+   slave = bond_choose_primary_or_current(bond);
+   if (slave)
+   return slave;
 
bond_for_each_slave(bond, slave, iter) {
if (slave-link == BOND_LINK_UP)

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] bonding: "primary_reselect" with "failure" is not working properly

2015-07-02 Thread Jay Vosburgh

bond_find_best_slave(struct bonding *bond)
 {
-   struct slave *slave, *bestslave = NULL, *primary;
+   struct slave *slave, *bestslave = NULL;
struct list_head *iter;
int mintime = bond->params.updelay;
 
-   primary = rtnl_dereference(bond->primary_slave);
-   if (primary && primary->link == BOND_LINK_UP &&
-   bond_should_change_active(bond))
-   return primary;
+   slave = bond_choose_primary_or_current(bond);
+   if (slave)
+   return slave;
 
bond_for_each_slave(bond, slave, iter) {
if (slave->link == BOND_LINK_UP)

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] bonding: primary_reselect with failure is not working properly

2015-07-02 Thread Jay Vosburgh

 = bond_choose_primary_or_current(bond);
+   if (slave)
+   return slave;
 
bond_for_each_slave(bond, slave, iter) {
if (slave-link == BOND_LINK_UP)

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-27 Thread Jay Vosburgh

Paul E. McKenney  wrote:

>On Sat, Oct 25, 2014 at 11:18:27AM -0700, Paul E. McKenney wrote:
>> On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote:
>> > Paul E. McKenney  wrote:
>> > 
>> > >On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
>> > >> Looking at the dmesg, the early boot messages seem to be
>> > >> confused as to how many CPUs there are, e.g.,
>> > >> 
>> > >> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, 
>> > >> Nodes=1
>> > >> [0.00] Hierarchical RCU implementation.
>> > >> [0.00]  RCU debugfs-based tracing is enabled.
>> > >> [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
>> > >> [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
>> > >> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, 
>> > >> nr_cpu_ids=4
>> > >> [0.00] NR_IRQS:16640 nr_irqs:456 0
>> > >> [0.00]  Offload RCU callbacks from all CPUs
>> > >> [0.00]  Offload RCU callbacks from CPUs: 0-3.
>> > >> 
>> > >> but later shows 2:
>> > >> 
>> > >> [0.233703] x86: Booting SMP configuration:
>> > >> [0.236003]  node  #0, CPUs:  #1
>> > >> [0.255528] x86: Booted up 1 node, 2 CPUs
>> > >> 
>> > >> In any event, the E8400 is a 2 core CPU with no hyperthreading.
>> > >
>> > >Well, this might explain some of the difficulties.  If RCU decides to wait
>> > >on CPUs that don't exist, we will of course get a hang.  And rcu_barrier()
>> > >was definitely expecting four CPUs.
>> > >
>> > >So what happens if you boot with maxcpus=2?  (Or build with
>> > >CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
>> > >I might have some ideas for a real fix.
>> > 
>> >Booting with maxcpus=2 makes no difference (the dmesg output is
>> > the same).
>> > 
>> >Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
>> > dmesg has different CPU information at boot:
>> > 
>> > [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
>> > [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
>> >  [...]
>> > [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
>> > nr_node_ids:1
>> >  [...]
>> > [0.00] Hierarchical RCU implementation.
>> > [0.00] RCU debugfs-based tracing is enabled.
>> > [0.00] RCU dyntick-idle grace-period acceleration is enabled.
>> > [0.00] NR_IRQS:4352 nr_irqs:440 0
>> > [0.00] Offload RCU callbacks from all CPUs
>> > [0.00] Offload RCU callbacks from CPUs: 0-1.
>> 
>> Thank you -- this confirms my suspicions on the fix, though I must admit
>> to being surprised that maxcpus made no difference.
>
>And here is an alleged fix, lightly tested at this end.  Does this patch
>help?

This patch appears to make the problem go away; I've run about
10 iterations.  I applied this patch to the same -net tree I was using
previously (-net as of Oct 22), with all other test patches removed.

FWIW, dmesg is unchanged, and still shows messages like:

[0.00]  Offload RCU callbacks from CPUs: 0-3.

Tested-by: Jay Vosburgh 

-J

>   Thanx, Paul
>
>
>
>rcu: Make rcu_barrier() understand about missing rcuo kthreads
>
>Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
>avoids creating rcuo kthreads for CPUs that never come online.  This
>fixes a bug in many instances of firmware: Instead of lying about their
>age, these systems instead lie about the number of CPUs that they have.
>Before commit 35ce7f29a44a, this could result in huge numbers of useless
>rcuo kthreads being created.
>
>It appears that experience indicates that I should have told the
>people suffering from this problem to fix their broken firmware, but
>I instead produced what turned out to be a partial fix.   The missing
>piece supplied by this commit makes sure that rcu_barrier() knows not to
>post callbacks for no-CBs CPUs that have not yet come online, because
>otherwise rcu_barrier() will hang on systems having firmware that lies
>about the number of CPUs.
>
>It is tempting to simply have rcu_barrier

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-27 Thread Jay Vosburgh

Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Sat, Oct 25, 2014 at 11:18:27AM -0700, Paul E. McKenney wrote:
 On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote:
  Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
  
  On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
   Looking at the dmesg, the early boot messages seem to be
   confused as to how many CPUs there are, e.g.,
   
   [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, 
   Nodes=1
   [0.00] Hierarchical RCU implementation.
   [0.00]  RCU debugfs-based tracing is enabled.
   [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
   [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
   [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, 
   nr_cpu_ids=4
   [0.00] NR_IRQS:16640 nr_irqs:456 0
   [0.00]  Offload RCU callbacks from all CPUs
   [0.00]  Offload RCU callbacks from CPUs: 0-3.
   
   but later shows 2:
   
   [0.233703] x86: Booting SMP configuration:
   [0.236003]  node  #0, CPUs:  #1
   [0.255528] x86: Booted up 1 node, 2 CPUs
   
   In any event, the E8400 is a 2 core CPU with no hyperthreading.
  
  Well, this might explain some of the difficulties.  If RCU decides to wait
  on CPUs that don't exist, we will of course get a hang.  And rcu_barrier()
  was definitely expecting four CPUs.
  
  So what happens if you boot with maxcpus=2?  (Or build with
  CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
  I might have some ideas for a real fix.
  
 Booting with maxcpus=2 makes no difference (the dmesg output is
  the same).
  
 Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
  dmesg has different CPU information at boot:
  
  [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
  [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
   [...]
  [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
  nr_node_ids:1
   [...]
  [0.00] Hierarchical RCU implementation.
  [0.00] RCU debugfs-based tracing is enabled.
  [0.00] RCU dyntick-idle grace-period acceleration is enabled.
  [0.00] NR_IRQS:4352 nr_irqs:440 0
  [0.00] Offload RCU callbacks from all CPUs
  [0.00] Offload RCU callbacks from CPUs: 0-1.
 
 Thank you -- this confirms my suspicions on the fix, though I must admit
 to being surprised that maxcpus made no difference.

And here is an alleged fix, lightly tested at this end.  Does this patch
help?

This patch appears to make the problem go away; I've run about
10 iterations.  I applied this patch to the same -net tree I was using
previously (-net as of Oct 22), with all other test patches removed.

FWIW, dmesg is unchanged, and still shows messages like:

[0.00]  Offload RCU callbacks from CPUs: 0-3.

Tested-by: Jay Vosburgh jay.vosbu...@canonical.com

-J

   Thanx, Paul



rcu: Make rcu_barrier() understand about missing rcuo kthreads

Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
avoids creating rcuo kthreads for CPUs that never come online.  This
fixes a bug in many instances of firmware: Instead of lying about their
age, these systems instead lie about the number of CPUs that they have.
Before commit 35ce7f29a44a, this could result in huge numbers of useless
rcuo kthreads being created.

It appears that experience indicates that I should have told the
people suffering from this problem to fix their broken firmware, but
I instead produced what turned out to be a partial fix.   The missing
piece supplied by this commit makes sure that rcu_barrier() knows not to
post callbacks for no-CBs CPUs that have not yet come online, because
otherwise rcu_barrier() will hang on systems having firmware that lies
about the number of CPUs.

It is tempting to simply have rcu_barrier() refuse to post a callback on
any no-CBs CPU that does not have an rcuo kthread.  This unfortunately
does not work because rcu_barrier() is required to wait for all pending
callbacks.  It is therefore required to wait even for those callbacks
that cannot possibly be invoked.  Even if doing so hangs the system.

Given that posting a callback to a no-CBs CPU that does not yet have an
rcuo kthread can hang rcu_barrier(), It is tempting to report an error
in this case.  Unfortunately, this will result in false positives at
boot time, when it is perfectly legal to post callbacks to the boot CPU
before the scheduler has started, in other words, before it is legal
to invoke rcu_barrier().

So this commit instead has rcu_barrier() avoid posting callbacks to
CPUs having neither rcuo kthread nor pending callbacks, and has it
complain bitterly if it finds CPUs having no rcuo kthread but some

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Jay Vosburgh

Paul E. McKenney  wrote:

>On Fri, Oct 24, 2014 at 05:20:48PM -0700, Jay Vosburgh wrote:
>> Paul E. McKenney  wrote:
>> 
>> >On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
>> [...]
>> >> Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
>> >> need to take a hard look at the sleep/wakeup code.
>> >> 
>> >> Thank you for running this!!!
>> >
>> >Could you please try the following patch?  If no joy, could you please
>> >add rcu:rcu_nocb_wake to the list of ftrace events?
>> 
>>  I tried the patch, it did not change the behavior.
>> 
>>  I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
>> and ran it again (with this patch and the first patch from earlier
>> today); the trace output is a bit on the large side so I put it and the
>> dmesg log at:
>> 
>> http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt
>> 
>> http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt
>
>Thank you again!
>
>Very strange part of the trace.  The only sign of CPU 2 and 3 are:
>
>ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Begin 
> cpu -1 remaining 0 # 0
>ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Check 
> cpu -1 remaining 0 # 0
>ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched Inc1 
> cpu -1 remaining 0 # 1
>ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
> OnlineNoCB cpu 0 remaining 1 # 1
>ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 0 
> WakeNot
>ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
> OnlineNoCB cpu 1 remaining 2 # 1
>ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 1 
> WakeNot
>ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
> OnlineNoCB cpu 2 remaining 3 # 1
>ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 2 
> WakeNotPoll
>ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
> OnlineNoCB cpu 3 remaining 4 # 1
>ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 3 
> WakeNotPoll
>ovs-vswitchd-902   [000]    109.896843: rcu_barrier: rcu_sched Inc2 
> cpu -1 remaining 4 # 2
>
>The pair of WakeNotPoll trace entries says that at that point, RCU believed
>that the CPU 2's and CPU 3's rcuo kthreads did not exist.  :-/

On the test system I'm using, CPUs 2 and 3 really do not exist;
it is a 2 CPU system (Intel Core 2 Duo E8400). I mentioned this in an
earlier message, but perhaps you missed it in the flurry.

Looking at the dmesg, the early boot messages seem to be
confused as to how many CPUs there are, e.g.,

[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[0.00] Hierarchical RCU implementation.
[0.00]  RCU debugfs-based tracing is enabled.
[0.00]  RCU dyntick-idle grace-period acceleration is enabled.
[0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
[0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[0.00] NR_IRQS:16640 nr_irqs:456 0
[0.00]  Offload RCU callbacks from all CPUs
[0.00]  Offload RCU callbacks from CPUs: 0-3.

but later shows 2:

[0.233703] x86: Booting SMP configuration:
[0.236003]  node  #0, CPUs:  #1
[0.255528] x86: Booted up 1 node, 2 CPUs

In any event, the E8400 is a 2 core CPU with no hyperthreading.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Jay Vosburgh

Paul E. McKenney  wrote:

>On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
>>  Looking at the dmesg, the early boot messages seem to be
>> confused as to how many CPUs there are, e.g.,
>> 
>> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
>> [0.00] Hierarchical RCU implementation.
>> [0.00]  RCU debugfs-based tracing is enabled.
>> [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
>> [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
>> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
>> [0.00] NR_IRQS:16640 nr_irqs:456 0
>> [0.00]  Offload RCU callbacks from all CPUs
>> [0.00]  Offload RCU callbacks from CPUs: 0-3.
>> 
>>  but later shows 2:
>> 
>> [0.233703] x86: Booting SMP configuration:
>> [0.236003]  node  #0, CPUs:  #1
>> [0.255528] x86: Booted up 1 node, 2 CPUs
>> 
>>  In any event, the E8400 is a 2 core CPU with no hyperthreading.
>
>Well, this might explain some of the difficulties.  If RCU decides to wait
>on CPUs that don't exist, we will of course get a hang.  And rcu_barrier()
>was definitely expecting four CPUs.
>
>So what happens if you boot with maxcpus=2?  (Or build with
>CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
>I might have some ideas for a real fix.

Booting with maxcpus=2 makes no difference (the dmesg output is
the same).

Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
dmesg has different CPU information at boot:

[0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
[0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
 [...]
[0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
nr_node_ids:1
 [...]
[0.00] Hierarchical RCU implementation.
[0.00]  RCU debugfs-based tracing is enabled.
[0.00]  RCU dyntick-idle grace-period acceleration is enabled.
[0.00] NR_IRQS:4352 nr_irqs:440 0
[0.00]  Offload RCU callbacks from all CPUs
[0.00]  Offload RCU callbacks from CPUs: 0-1.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Jay Vosburgh

Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote:
  Looking at the dmesg, the early boot messages seem to be
 confused as to how many CPUs there are, e.g.,
 
 [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
 [0.00] Hierarchical RCU implementation.
 [0.00]  RCU debugfs-based tracing is enabled.
 [0.00]  RCU dyntick-idle grace-period acceleration is enabled.
 [0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
 [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
 [0.00] NR_IRQS:16640 nr_irqs:456 0
 [0.00]  Offload RCU callbacks from all CPUs
 [0.00]  Offload RCU callbacks from CPUs: 0-3.
 
  but later shows 2:
 
 [0.233703] x86: Booting SMP configuration:
 [0.236003]  node  #0, CPUs:  #1
 [0.255528] x86: Booted up 1 node, 2 CPUs
 
  In any event, the E8400 is a 2 core CPU with no hyperthreading.

Well, this might explain some of the difficulties.  If RCU decides to wait
on CPUs that don't exist, we will of course get a hang.  And rcu_barrier()
was definitely expecting four CPUs.

So what happens if you boot with maxcpus=2?  (Or build with
CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang.  If so,
I might have some ideas for a real fix.

Booting with maxcpus=2 makes no difference (the dmesg output is
the same).

Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and
dmesg has different CPU information at boot:

[0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2
[0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs
 [...]
[0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 
nr_node_ids:1
 [...]
[0.00] Hierarchical RCU implementation.
[0.00]  RCU debugfs-based tracing is enabled.
[0.00]  RCU dyntick-idle grace-period acceleration is enabled.
[0.00] NR_IRQS:4352 nr_irqs:440 0
[0.00]  Offload RCU callbacks from all CPUs
[0.00]  Offload RCU callbacks from CPUs: 0-1.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-25 Thread Jay Vosburgh

Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Fri, Oct 24, 2014 at 05:20:48PM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
 [...]
  Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
  need to take a hard look at the sleep/wakeup code.
  
  Thank you for running this!!!
 
 Could you please try the following patch?  If no joy, could you please
 add rcu:rcu_nocb_wake to the list of ftrace events?
 
  I tried the patch, it did not change the behavior.
 
  I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
 and ran it again (with this patch and the first patch from earlier
 today); the trace output is a bit on the large side so I put it and the
 dmesg log at:
 
 http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt
 
 http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt

Thank you again!

Very strange part of the trace.  The only sign of CPU 2 and 3 are:

ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Begin 
 cpu -1 remaining 0 # 0
ovs-vswitchd-902   [000]    109.896840: rcu_barrier: rcu_sched Check 
 cpu -1 remaining 0 # 0
ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched Inc1 
 cpu -1 remaining 0 # 1
ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 0 remaining 1 # 1
ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 0 
 WakeNot
ovs-vswitchd-902   [000]    109.896841: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 1 remaining 2 # 1
ovs-vswitchd-902   [000] d...   109.896841: rcu_nocb_wake: rcu_sched 1 
 WakeNot
ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 2 remaining 3 # 1
ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 2 
 WakeNotPoll
ovs-vswitchd-902   [000]    109.896842: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 3 remaining 4 # 1
ovs-vswitchd-902   [000] d...   109.896842: rcu_nocb_wake: rcu_sched 3 
 WakeNotPoll
ovs-vswitchd-902   [000]    109.896843: rcu_barrier: rcu_sched Inc2 
 cpu -1 remaining 4 # 2

The pair of WakeNotPoll trace entries says that at that point, RCU believed
that the CPU 2's and CPU 3's rcuo kthreads did not exist.  :-/

On the test system I'm using, CPUs 2 and 3 really do not exist;
it is a 2 CPU system (Intel Core 2 Duo E8400). I mentioned this in an
earlier message, but perhaps you missed it in the flurry.

Looking at the dmesg, the early boot messages seem to be
confused as to how many CPUs there are, e.g.,

[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[0.00] Hierarchical RCU implementation.
[0.00]  RCU debugfs-based tracing is enabled.
[0.00]  RCU dyntick-idle grace-period acceleration is enabled.
[0.00]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4.
[0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[0.00] NR_IRQS:16640 nr_irqs:456 0
[0.00]  Offload RCU callbacks from all CPUs
[0.00]  Offload RCU callbacks from CPUs: 0-3.

but later shows 2:

[0.233703] x86: Booting SMP configuration:
[0.236003]  node  #0, CPUs:  #1
[0.255528] x86: Booted up 1 node, 2 CPUs

In any event, the E8400 is a 2 core CPU with no hyperthreading.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh

Paul E. McKenney  wrote:

>On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
[...]
>> Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
>> need to take a hard look at the sleep/wakeup code.
>> 
>> Thank you for running this!!!
>
>Could you please try the following patch?  If no joy, could you please
>add rcu:rcu_nocb_wake to the list of ftrace events?

I tried the patch, it did not change the behavior.

I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
and ran it again (with this patch and the first patch from earlier
today); the trace output is a bit on the large side so I put it and the
dmesg log at:

http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt

http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt

-J


>   Thanx, Paul
>
>
>
>rcu: Kick rcuo kthreads after their CPU goes offline
>
>If a no-CBs CPU were to post an RCU callback with interrupts disabled
>after it entered the idle loop for the last time, there might be no
>deferred wakeup for the corresponding rcuo kthreads.  This commit
>therefore adds a set of calls to do_nocb_deferred_wakeup() after the
>CPU has gone completely offline.
>
>Signed-off-by: Paul E. McKenney 
>
>diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>index 84b41b3c6ebd..f6880052b917 100644
>--- a/kernel/rcu/tree.c
>+++ b/kernel/rcu/tree.c
>@@ -3493,8 +3493,10 @@ static int rcu_cpu_notify(struct notifier_block *self,
>   case CPU_DEAD_FROZEN:
>   case CPU_UP_CANCELED:
>   case CPU_UP_CANCELED_FROZEN:
>-  for_each_rcu_flavor(rsp)
>+  for_each_rcu_flavor(rsp) {
>   rcu_cleanup_dead_cpu(cpu, rsp);
>+  do_nocb_deferred_wakeup(per_cpu_ptr(rsp->rda, cpu));
>+  }
>   break;
>   default:
>   break;
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh

Paul E. McKenney  wrote:

>On Fri, Oct 24, 2014 at 03:02:04PM -0700, Jay Vosburgh wrote:
>> Paul E. McKenney  wrote:
>> 
[...]
>>  I've got an ftrace capture from unmodified -net, it looks like
>> this:
>> 
>> ovs-vswitchd-902   [000]    471.778441: rcu_barrier: rcu_sched Begin 
>> cpu -1 remaining 0 # 0
>> ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Check 
>> cpu -1 remaining 0 # 0
>> ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Inc1 
>> cpu -1 remaining 0 # 1
>> ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
>> OnlineNoCB cpu 0 remaining 1 # 1
>> ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
>> OnlineNoCB cpu 1 remaining 2 # 1
>> ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
>> OnlineNoCB cpu 2 remaining 3 # 1
>> ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched 
>> OnlineNoCB cpu 3 remaining 4 # 1
>
>OK, so it looks like your system has four CPUs, and rcu_barrier() placed
>callbacks on them all.

No, the system has only two CPUs.  It's an Intel Core 2 Duo
E8400, and /proc/cpuinfo agrees that there are only 2.  There is a
potentially relevant-sounding message early in dmesg that says:

[0.00] smpboot: Allowing 4 CPUs, 2 hotplug CPUs

>> ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched Inc2 
>> cpu -1 remaining 4 # 2
>
>The above removes the extra count used to avoid races between posting new
>callbacks and completion of previously posted callbacks.
>
>>  rcuos/0-9 [000] ..s.   471.793150: rcu_barrier: rcu_sched CB 
>> cpu -1 remaining 3 # 2
>>  rcuos/1-18[001] ..s.   471.793308: rcu_barrier: rcu_sched CB 
>> cpu -1 remaining 2 # 2
>
>Two of the four callbacks fired, but the other two appear to be AWOL.
>And rcu_barrier() won't return until they all fire.
>
>>  I let it sit through several "hung task" cycles but that was all
>> there was for rcu:rcu_barrier.
>> 
>>  I should have ftrace with the patch as soon as the kernel is
>> done building, then I can try the below patch (I'll start it building
>> now).
>
>Sounds very good, looking forward to hearing of the results.

Going to bounce it for ftrace now, but the cpu count mismatch
seemed important enough to mention separately.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh

013fc8e3c0 ...
[  240.412463]   1: 88013fc8e3c0 l:88013fc0e3c0 n:  (null) ...
[  240.419401]   2: 88013fd0e3c0 l:88013fd0e3c0 n:88013fd8e3c0 ...
[  240.426339]   3: 88013fd8e3c0 l:88013fd0e3c0 n:  (null) ...
[  360.432020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds.
[  360.438881]   Not tainted 3.17.0-testola+ #4
[  360.443484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  360.451289] ovs-vswitchdD 88013fc94600 0   902901 0x0004
[  360.451293]  8800ab20f7b8 0002 8800b3304b00 
8800ab20ffd8
[  360.451297]  00014600 00014600 8800b081 
8800b3304b00
[  360.451300]  8800b3304b00 81c59850 81c59858 
7fff
[  360.451303] Call Trace:
[  360.451311]  [] schedule+0x29/0x70
[  360.451314]  [] schedule_timeout+0x1dc/0x260
[  360.451317]  [] ? _cond_resched+0x29/0x40
[  360.451320]  [] ? wait_for_completion+0x28/0x160
[  360.451325]  [] ? queue_stop_cpus_work+0xc7/0xe0
[  360.451327]  [] wait_for_completion+0xa6/0x160
[  360.451331]  [] ? wake_up_state+0x20/0x20
[  360.451335]  [] _rcu_barrier+0x20c/0x480
[  360.451338]  [] rcu_barrier+0x15/0x20
[  360.451342]  [] netdev_run_todo+0x60/0x300
[  360.451345]  [] rtnl_unlock+0xe/0x10
[  360.451353]  [] internal_dev_destroy+0x55/0x80 
[openvswitch]
[  360.451358]  [] ovs_vport_del+0x32/0x40 [openvswitch]
[  360.451362]  [] ovs_dp_detach_port+0x30/0x40 [openvswitch]
[  360.451366]  [] ovs_vport_cmd_del+0xc5/0x110 [openvswitch]
[  360.451370]  [] genl_family_rcv_msg+0x1a5/0x3c0
[  360.451373]  [] ? genl_family_rcv_msg+0x3c0/0x3c0
[  360.451376]  [] genl_rcv_msg+0x91/0xd0
[  360.451379]  [] netlink_rcv_skb+0xc1/0xe0
[  360.451381]  [] genl_rcv+0x2c/0x40
[  360.451384]  [] netlink_unicast+0xf6/0x200
[  360.451387]  [] netlink_sendmsg+0x31d/0x780
[  360.451390]  [] ? netlink_rcv_wake+0x44/0x60
[  360.451394]  [] sock_sendmsg+0x93/0xd0
[  360.451399]  [] ? apparmor_capable+0x60/0x60
[  360.451402]  [] ? verify_iovec+0x47/0xd0
[  360.451406]  [] ___sys_sendmsg+0x399/0x3b0
[  360.451410]  [] ? kernfs_seq_stop_active+0x32/0x40
[  360.451414]  [] ? native_sched_clock+0x35/0x90
[  360.451417]  [] ? native_sched_clock+0x35/0x90
[  360.451419]  [] ? sched_clock+0x9/0x10
[  360.451424]  [] ? acct_account_cputime+0x1c/0x20
[  360.451427]  [] ? account_user_time+0x8b/0xa0
[  360.451431]  [] ? __fget_light+0x25/0x70
[  360.451434]  [] __sys_sendmsg+0x42/0x80
[  360.451437]  [] SyS_sendmsg+0x12/0x20
[  360.451440]  [] tracesys_phase2+0xd8/0xdd
[  360.451442] rcu_show_nocb_setup(): rcu_sched nocb state:
[  360.456737]   0: 88013fc0e600 l:88013fc0e600 n:88013fc8e600 ...
[  360.463676]   1: 88013fc8e600 l:88013fc0e600 n:  (null) ...
[  360.470614]   2: 88013fd0e600 l:88013fd0e600 n:88013fd8e600 N..
[  360.477554]   3: 88013fd8e600 l:88013fd0e600 n:  (null) N..
[  360.484494] rcu_show_nocb_setup(): rcu_bh nocb state:
[  360.489529]   0: 88013fc0e3c0 l:88013fc0e3c0 n:88013fc8e3c0 ...
[  360.496469]   1: 88013fc8e3c0 l:88013fc0e3c0 n:  (null) .G.
[  360.503407]   2: 88013fd0e3c0 l:88013fd0e3c0 n:88013fd8e3c0 ...
[  360.510346]   3: 88013fd8e3c0 l:88013fd0e3c0 n:  (null) ...

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh

;);
>   sched_show_task(t);
>   debug_show_held_locks(t);
>+  rcu_show_nocb_setup();
> 
>   touch_nmi_watchdog();
> 
>diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
>index 240fa9094f83..6b373e79ce0e 100644
>--- a/kernel/rcu/rcutorture.c
>+++ b/kernel/rcu/rcutorture.c
>@@ -1513,6 +1513,7 @@ rcu_torture_cleanup(void)
> {
>   int i;
> 
>+  rcu_show_nocb_setup();
>   rcutorture_record_test_transition();
>   if (torture_cleanup_begin()) {
>   if (cur_ops->cb_barrier != NULL)
>diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>index 927c17b081c7..285b3f6fb229 100644
>--- a/kernel/rcu/tree_plugin.h
>+++ b/kernel/rcu/tree_plugin.h
>@@ -2699,6 +2699,31 @@ static bool init_nocb_callback_list(struct rcu_data 
>*rdp)
> 
> #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
> 
>+void rcu_show_nocb_setup(void)
>+{
>+#ifdef CONFIG_RCU_NOCB_CPU
>+  int cpu;
>+  struct rcu_data *rdp;
>+  struct rcu_state *rsp;
>+
>+  for_each_rcu_flavor(rsp) {
>+  pr_alert("rcu_show_nocb_setup(): %s nocb state:\n", rsp->name);
>+  for_each_possible_cpu(cpu) {
>+  if (!rcu_is_nocb_cpu(cpu))
>+  continue;
>+  rdp = per_cpu_ptr(rsp->rda, cpu);
>+  pr_alert("%3d: %p l:%p n:%p %c%c%c\n",
>+   cpu,
>+   rdp, rdp->nocb_leader, rdp->nocb_next_follower,
>+   ".N"[!!rdp->nocb_head],
>+   ".G"[!!rdp->nocb_gp_head],
>+   ".F"[!!rdp->nocb_follower_head]);
>+  }
>+  }
>+#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
>+}
>+EXPORT_SYMBOL_GPL(rcu_show_nocb_setup);
>+
> /*
>  * An adaptive-ticks CPU can potentially execute in kernel mode for an
>  * arbitrarily long period of time with the scheduling-clock tick turned
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh

Paul E. McKenney  wrote:

>On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
>> On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
>> > On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
>> > > On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
>> > > > On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
>> > > > > On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
>> > > > > > On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
>> > > > > > > On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
>> > > > > > > > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
>> > > > > > > > > 
>> > > > > > > > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
>> > > > > > > > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti 
>> > > > > > > > > > wrote:
>> > > > 
>> > > > [ . . . ]
>> > > > 
>> > > > > > > Ok, unless I've messsed up something major, bisecting points to:
>> > > > > > > 
>> > > > > > > 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
>> > > > > > > 
>> > > > > > > Makes any sense ?
>> > > > > > 
>> > > > > > Good question.  ;-)
>> > > > > > 
>> > > > > > Are any of your online CPUs missing rcuo kthreads?  There should be
>> > > > > > kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each 
>> > > > > > online CPU.
>> > > > > 
>> > > > > Its a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a 
>> > > > > reverted, the rcuos are 8
>> > > > > and the modprobe ppp_generic testcase reliably works, libvirt also 
>> > > > > manages
>> > > > > to setup its bridge.
>> > > > > 
>> > > > > Just with linux-tip , the rcuos are 6 but the failure is as reliable 
>> > > > > as
>> > > > > before.
>> > > 
>> > > > Thank you, very interesting.  Which 6 of the rcuos are present?
>> > > 
>> > > Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like 
>> > > this   
>> > > Phenom II.
>> > 
>> > Ah, you get 8 without the patch because it creates them for potential
>> > CPUs as well as real ones.  OK, got it.
>> > 
>> > > > > Awating instructions: :)
>> > > > 
>> > > > Well, I thought I understood the problem until you found that only 6 of
>> > > > the expected 8 rcuos are present with linux-tip without the revert.  
>> > > > ;-)
>> > > > 
>> > > > I am putting together a patch for the part of the problem that I think
>> > > > I understand, of course, but it would help a lot to know which two of
>> > > > the rcuos are missing.  ;-)
>> > > 
>> > > Ready to test
>> > 
>> > Well, if you are feeling aggressive, give the following patch a spin.
>> > I am doing sanity tests on it in the meantime.
>> 
>> Doesn't seem to make a difference here
>
>OK, inspection isn't cutting it, so time for tracing.  Does the system
>respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
>the problem occurs, then dump the trace buffer after the problem occurs.

My system is up and responsive when the problem occurs, so this
shouldn't be a problem.

Do you want the ftrace with your patch below, or unmodified tip
of tree?

-J


>   Thanx, Paul
>
>> > 
>> > 
>> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> > index 29fb23f33c18..927c17b081c7 100644
>> > --- a/kernel/rcu/tree_plugin.h
>> > +++ b/kernel/rcu/tree_plugin.h
>> > @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct 
>> > rcu_state *rsp, int cpu)
>> >rdp->nocb_leader = rdp_spawn;
>> >if (rdp_last && rdp != rdp_spawn)
>> >rdp_last->nocb_next_follower = rdp;
>> > -  rdp_last = rdp;
>> > -  rdp = rdp->nocb_next_follower;
>> > -  rdp_last->nocb_next_follower = NULL;
>> > +  if (rdp == rdp_spawn) {
>> > +  rdp = rdp->nocb_next_follower;
>> > +  } else {
>> > +  rdp_last = rdp;
>> > +  rdp = rdp->nocb_next_follower;
>> > +  rdp_last->nocb_next_follower = NULL;
>> > +  }
>> >} while (rdp);
>> >rdp_spawn->nocb_next_follower = rdp_old_leader;
>> >}
>> > 

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh

Paul E. McKenney  wrote:

>On Thu, Oct 23, 2014 at 09:48:34PM -0700, Jay Vosburgh wrote:
>> Paul E. McKenney  wrote:
[...]
>> >Either way, my patch assumed that 39953dfd4007 (rcu: Avoid misordering in
>> >__call_rcu_nocb_enqueue()) would work and that 1772947bd012 (rcu: Handle
>> >NOCB callbacks from irq-disabled idle code) would fail.  Is that the case?
>> >If not, could you please bisect the commits between 11ed7f934cb8 (rcu:
>> >Make nocb leader kthreads process pending callbacks after spawning)
>> >and c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())?
>> 
>>  Just a note to add that I am also reliably inducing what appears
>> to be this issue on a current -net tree, when configuring openvswitch
>> via script.  I am available to test patches or bisect tomorrow (Friday)
>> US time if needed.
>
>Thank you, Jay!  Could you please check to see if reverting this commit
>fixes things for you?
>
>35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
>
>Reverting is not a long-term fix, as this commit is itself a bug fix,
>but would be good to check to see if you are seeing the same thing that
>Yanko is.  ;-)

    Just to confirm what Yanko found, reverting this commit makes
the problem go away for me.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh

Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Thu, Oct 23, 2014 at 09:48:34PM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
[...]
 Either way, my patch assumed that 39953dfd4007 (rcu: Avoid misordering in
 __call_rcu_nocb_enqueue()) would work and that 1772947bd012 (rcu: Handle
 NOCB callbacks from irq-disabled idle code) would fail.  Is that the case?
 If not, could you please bisect the commits between 11ed7f934cb8 (rcu:
 Make nocb leader kthreads process pending callbacks after spawning)
 and c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())?
 
  Just a note to add that I am also reliably inducing what appears
 to be this issue on a current -net tree, when configuring openvswitch
 via script.  I am available to test patches or bisect tomorrow (Friday)
 US time if needed.

Thank you, Jay!  Could you please check to see if reverting this commit
fixes things for you?

35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs

Reverting is not a long-term fix, as this commit is itself a bug fix,
but would be good to check to see if you are seeing the same thing that
Yanko is.  ;-)

Just to confirm what Yanko found, reverting this commit makes
the problem go away for me.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh

Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote:
   On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote:
On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote:
 On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote:
  On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote:
   On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote:
On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote:
 
 On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote:
  On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti 
  wrote:

[ . . . ]

   Ok, unless I've messsed up something major, bisecting points to:
   
   35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs
   
   Makes any sense ?
  
  Good question.  ;-)
  
  Are any of your online CPUs missing rcuo kthreads?  There should be
  kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each 
  online CPU.
 
 Its a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a 
 reverted, the rcuos are 8
 and the modprobe ppp_generic testcase reliably works, libvirt also 
 manages
 to setup its bridge.
 
 Just with linux-tip , the rcuos are 6 but the failure is as reliable 
 as
 before.
   
Thank you, very interesting.  Which 6 of the rcuos are present?
   
   Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like 
   this   
   Phenom II.
  
  Ah, you get 8 without the patch because it creates them for potential
  CPUs as well as real ones.  OK, got it.
  
 Awating instructions: :)

Well, I thought I understood the problem until you found that only 6 of
the expected 8 rcuos are present with linux-tip without the revert.  
;-)

I am putting together a patch for the part of the problem that I think
I understand, of course, but it would help a lot to know which two of
the rcuos are missing.  ;-)
   
   Ready to test
  
  Well, if you are feeling aggressive, give the following patch a spin.
  I am doing sanity tests on it in the meantime.
 
 Doesn't seem to make a difference here

OK, inspection isn't cutting it, so time for tracing.  Does the system
respond to user input?  If so, please enable rcu:rcu_barrier ftrace before
the problem occurs, then dump the trace buffer after the problem occurs.

My system is up and responsive when the problem occurs, so this
shouldn't be a problem.

Do you want the ftrace with your patch below, or unmodified tip
of tree?

-J


   Thanx, Paul

  
  
  diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
  index 29fb23f33c18..927c17b081c7 100644
  --- a/kernel/rcu/tree_plugin.h
  +++ b/kernel/rcu/tree_plugin.h
  @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct 
  rcu_state *rsp, int cpu)
 rdp-nocb_leader = rdp_spawn;
 if (rdp_last  rdp != rdp_spawn)
 rdp_last-nocb_next_follower = rdp;
  -  rdp_last = rdp;
  -  rdp = rdp-nocb_next_follower;
  -  rdp_last-nocb_next_follower = NULL;
  +  if (rdp == rdp_spawn) {
  +  rdp = rdp-nocb_next_follower;
  +  } else {
  +  rdp_last = rdp;
  +  rdp = rdp-nocb_next_follower;
  +  rdp_last-nocb_next_follower = NULL;
  +  }
 } while (rdp);
 rdp_spawn-nocb_next_follower = rdp_old_leader;
 }
  

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh

()) {
   if (cur_ops-cb_barrier != NULL)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 927c17b081c7..285b3f6fb229 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2699,6 +2699,31 @@ static bool init_nocb_callback_list(struct rcu_data 
*rdp)
 
 #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */
 
+void rcu_show_nocb_setup(void)
+{
+#ifdef CONFIG_RCU_NOCB_CPU
+  int cpu;
+  struct rcu_data *rdp;
+  struct rcu_state *rsp;
+
+  for_each_rcu_flavor(rsp) {
+  pr_alert(rcu_show_nocb_setup(): %s nocb state:\n, rsp-name);
+  for_each_possible_cpu(cpu) {
+  if (!rcu_is_nocb_cpu(cpu))
+  continue;
+  rdp = per_cpu_ptr(rsp-rda, cpu);
+  pr_alert(%3d: %p l:%p n:%p %c%c%c\n,
+   cpu,
+   rdp, rdp-nocb_leader, rdp-nocb_next_follower,
+   .N[!!rdp-nocb_head],
+   .G[!!rdp-nocb_gp_head],
+   .F[!!rdp-nocb_follower_head]);
+  }
+  }
+#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
+}
+EXPORT_SYMBOL_GPL(rcu_show_nocb_setup);
+
 /*
  * An adaptive-ticks CPU can potentially execute in kernel mode for an
  * arbitrarily long period of time with the scheduling-clock tick turned


---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh

]   2: 88013fd0e600 l:88013fd0e600 n:88013fd8e600 N..
[  240.393550]   3: 88013fd8e600 l:88013fd0e600 n:  (null) N..
[  240.400489] rcu_show_nocb_setup(): rcu_bh nocb state:
[  240.405525]   0: 88013fc0e3c0 l:88013fc0e3c0 n:88013fc8e3c0 ...
[  240.412463]   1: 88013fc8e3c0 l:88013fc0e3c0 n:  (null) ...
[  240.419401]   2: 88013fd0e3c0 l:88013fd0e3c0 n:88013fd8e3c0 ...
[  240.426339]   3: 88013fd8e3c0 l:88013fd0e3c0 n:  (null) ...
[  360.432020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds.
[  360.438881]   Not tainted 3.17.0-testola+ #4
[  360.443484] echo 0  /proc/sys/kernel/hung_task_timeout_secs disables this 
message.
[  360.451289] ovs-vswitchdD 88013fc94600 0   902901 0x0004
[  360.451293]  8800ab20f7b8 0002 8800b3304b00 
8800ab20ffd8
[  360.451297]  00014600 00014600 8800b081 
8800b3304b00
[  360.451300]  8800b3304b00 81c59850 81c59858 
7fff
[  360.451303] Call Trace:
[  360.451311]  [81722b99] schedule+0x29/0x70
[  360.451314]  [81725b6c] schedule_timeout+0x1dc/0x260
[  360.451317]  [81722f69] ? _cond_resched+0x29/0x40
[  360.451320]  [81723818] ? wait_for_completion+0x28/0x160
[  360.451325]  [811081a7] ? queue_stop_cpus_work+0xc7/0xe0
[  360.451327]  [81723896] wait_for_completion+0xa6/0x160
[  360.451331]  [81099980] ? wake_up_state+0x20/0x20
[  360.451335]  [810d0ecc] _rcu_barrier+0x20c/0x480
[  360.451338]  [810d1195] rcu_barrier+0x15/0x20
[  360.451342]  [81625010] netdev_run_todo+0x60/0x300
[  360.451345]  [8162f9ee] rtnl_unlock+0xe/0x10
[  360.451353]  [a01ffcc5] internal_dev_destroy+0x55/0x80 
[openvswitch]
[  360.451358]  [a01ff622] ovs_vport_del+0x32/0x40 [openvswitch]
[  360.451362]  [a01f8dd0] ovs_dp_detach_port+0x30/0x40 [openvswitch]
[  360.451366]  [a01f8ea5] ovs_vport_cmd_del+0xc5/0x110 [openvswitch]
[  360.451370]  [81651d75] genl_family_rcv_msg+0x1a5/0x3c0
[  360.451373]  [81651f90] ? genl_family_rcv_msg+0x3c0/0x3c0
[  360.451376]  [81652021] genl_rcv_msg+0x91/0xd0
[  360.451379]  [81650091] netlink_rcv_skb+0xc1/0xe0
[  360.451381]  [816505bc] genl_rcv+0x2c/0x40
[  360.451384]  [8164f626] netlink_unicast+0xf6/0x200
[  360.451387]  [8164fa4d] netlink_sendmsg+0x31d/0x780
[  360.451390]  [8164ca74] ? netlink_rcv_wake+0x44/0x60
[  360.451394]  [81606a53] sock_sendmsg+0x93/0xd0
[  360.451399]  [81337700] ? apparmor_capable+0x60/0x60
[  360.451402]  [81614f27] ? verify_iovec+0x47/0xd0
[  360.451406]  [81606e79] ___sys_sendmsg+0x399/0x3b0
[  360.451410]  [812598a2] ? kernfs_seq_stop_active+0x32/0x40
[  360.451414]  [8101c385] ? native_sched_clock+0x35/0x90
[  360.451417]  [8101c385] ? native_sched_clock+0x35/0x90
[  360.451419]  [8101c3e9] ? sched_clock+0x9/0x10
[  360.451424]  [811277fc] ? acct_account_cputime+0x1c/0x20
[  360.451427]  [8109ce6b] ? account_user_time+0x8b/0xa0
[  360.451431]  [81200bd5] ? __fget_light+0x25/0x70
[  360.451434]  [81607c02] __sys_sendmsg+0x42/0x80
[  360.451437]  [81607c52] SyS_sendmsg+0x12/0x20
[  360.451440]  [81727464] tracesys_phase2+0xd8/0xdd
[  360.451442] rcu_show_nocb_setup(): rcu_sched nocb state:
[  360.456737]   0: 88013fc0e600 l:88013fc0e600 n:88013fc8e600 ...
[  360.463676]   1: 88013fc8e600 l:88013fc0e600 n:  (null) ...
[  360.470614]   2: 88013fd0e600 l:88013fd0e600 n:88013fd8e600 N..
[  360.477554]   3: 88013fd8e600 l:88013fd0e600 n:  (null) N..
[  360.484494] rcu_show_nocb_setup(): rcu_bh nocb state:
[  360.489529]   0: 88013fc0e3c0 l:88013fc0e3c0 n:88013fc8e3c0 ...
[  360.496469]   1: 88013fc8e3c0 l:88013fc0e3c0 n:  (null) .G.
[  360.503407]   2: 88013fd0e3c0 l:88013fd0e3c0 n:88013fd8e3c0 ...
[  360.510346]   3: 88013fd8e3c0 l:88013fd0e3c0 n:  (null) ...

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh

Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Fri, Oct 24, 2014 at 03:02:04PM -0700, Jay Vosburgh wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
[...]
  I've got an ftrace capture from unmodified -net, it looks like
 this:
 
 ovs-vswitchd-902   [000]    471.778441: rcu_barrier: rcu_sched Begin 
 cpu -1 remaining 0 # 0
 ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Check 
 cpu -1 remaining 0 # 0
 ovs-vswitchd-902   [000]    471.778452: rcu_barrier: rcu_sched Inc1 
 cpu -1 remaining 0 # 1
 ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 0 remaining 1 # 1
 ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 1 remaining 2 # 1
 ovs-vswitchd-902   [000]    471.778453: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 2 remaining 3 # 1
 ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched 
 OnlineNoCB cpu 3 remaining 4 # 1

OK, so it looks like your system has four CPUs, and rcu_barrier() placed
callbacks on them all.

No, the system has only two CPUs.  It's an Intel Core 2 Duo
E8400, and /proc/cpuinfo agrees that there are only 2.  There is a
potentially relevant-sounding message early in dmesg that says:

[0.00] smpboot: Allowing 4 CPUs, 2 hotplug CPUs

 ovs-vswitchd-902   [000]    471.778454: rcu_barrier: rcu_sched Inc2 
 cpu -1 remaining 4 # 2

The above removes the extra count used to avoid races between posting new
callbacks and completion of previously posted callbacks.

  rcuos/0-9 [000] ..s.   471.793150: rcu_barrier: rcu_sched CB 
 cpu -1 remaining 3 # 2
  rcuos/1-18[001] ..s.   471.793308: rcu_barrier: rcu_sched CB 
 cpu -1 remaining 2 # 2

Two of the four callbacks fired, but the other two appear to be AWOL.
And rcu_barrier() won't return until they all fire.

  I let it sit through several hung task cycles but that was all
 there was for rcu:rcu_barrier.
 
  I should have ftrace with the patch as soon as the kernel is
 done building, then I can try the below patch (I'll start it building
 now).

Sounds very good, looking forward to hearing of the results.

Going to bounce it for ftrace now, but the cpu count mismatch
seemed important enough to mention separately.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-24 Thread Jay Vosburgh

Paul E. McKenney paul...@linux.vnet.ibm.com wrote:

On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote:
[...]
 Hmmm...  It sure looks like we have some callbacks stuck here.  I clearly
 need to take a hard look at the sleep/wakeup code.
 
 Thank you for running this!!!

Could you please try the following patch?  If no joy, could you please
add rcu:rcu_nocb_wake to the list of ftrace events?

I tried the patch, it did not change the behavior.

I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints
and ran it again (with this patch and the first patch from earlier
today); the trace output is a bit on the large side so I put it and the
dmesg log at:

http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt

http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt

-J


   Thanx, Paul



rcu: Kick rcuo kthreads after their CPU goes offline

If a no-CBs CPU were to post an RCU callback with interrupts disabled
after it entered the idle loop for the last time, there might be no
deferred wakeup for the corresponding rcuo kthreads.  This commit
therefore adds a set of calls to do_nocb_deferred_wakeup() after the
CPU has gone completely offline.

Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 84b41b3c6ebd..f6880052b917 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3493,8 +3493,10 @@ static int rcu_cpu_notify(struct notifier_block *self,
   case CPU_DEAD_FROZEN:
   case CPU_UP_CANCELED:
   case CPU_UP_CANCELED_FROZEN:
-  for_each_rcu_flavor(rsp)
+  for_each_rcu_flavor(rsp) {
   rcu_cleanup_dead_cpu(cpu, rsp);
+  do_nocb_deferred_wakeup(per_cpu_ptr(rsp-rda, cpu));
+  }
   break;
   default:
   break;


---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Jay Vosburgh

tex can be
>held by two tasks concurrently, given that it does not appear to be a
>reader-writer lock...
>
>Either way, my patch assumed that 39953dfd4007 (rcu: Avoid misordering in
>__call_rcu_nocb_enqueue()) would work and that 1772947bd012 (rcu: Handle
>NOCB callbacks from irq-disabled idle code) would fail.  Is that the case?
>If not, could you please bisect the commits between 11ed7f934cb8 (rcu:
>Make nocb leader kthreads process pending callbacks after spawning)
>and c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())?

Just a note to add that I am also reliably inducing what appears
to be this issue on a current -net tree, when configuring openvswitch
via script.  I am available to test patches or bisect tomorrow (Friday)
US time if needed.

The stack is as follows:

[ 1320.492020] INFO: task ovs-vswitchd:1303 blocked for more than 120 seconds.
[ 1320.498965]   Not tainted 3.17.0-testola+ #1
[ 1320.503570] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1320.511374] ovs-vswitchdD 88013fc14600 0  1303   1302 0x0004
[ 1320.511378]  8801388d77d8 0002 880031144b00 
8801388d7fd8
[ 1320.511382]  00014600 00014600 8800b092e400 
880031144b00
[ 1320.511385]  8800b1126000 81c58ad0 81c58ad8 
7fff
[ 1320.511389] Call Trace:
[ 1320.511396]  [] schedule+0x29/0x70
[ 1320.511399]  [] schedule_timeout+0x1dc/0x260
[ 1320.511404]  [] ? check_preempt_curr+0x8d/0xa0
[ 1320.511407]  [] ? ttwu_do_wakeup+0x1d/0xd0
[ 1320.511410]  [] wait_for_completion+0xa6/0x160
[ 1320.511413]  [] ? wake_up_state+0x20/0x20
[ 1320.511417]  [] _rcu_barrier+0x157/0x200
[ 1320.511419]  [] rcu_barrier+0x15/0x20
[ 1320.511423]  [] netdev_run_todo+0x60/0x300
[ 1320.511427]  [] rtnl_unlock+0xe/0x10
[ 1320.511435]  [] internal_dev_destroy+0x55/0x80 
[openvswitch]
[ 1320.511440]  [] ovs_vport_del+0x32/0x40 [openvswitch]
[ 1320.511444]  [] ovs_dp_detach_port+0x30/0x40 [openvswitch]
[ 1320.511448]  [] ovs_vport_cmd_del+0xc5/0x110 [openvswitch]
[ 1320.511452]  [] genl_family_rcv_msg+0x1a5/0x3c0
[ 1320.511455]  [] ? genl_family_rcv_msg+0x3c0/0x3c0
[ 1320.511458]  [] genl_rcv_msg+0x91/0xd0
[ 1320.511461]  [] netlink_rcv_skb+0xc1/0xe0
[ 1320.511463]  [] genl_rcv+0x2c/0x40
[ 1320.511466]  [] netlink_unicast+0xf6/0x200
[ 1320.511468]  [] netlink_sendmsg+0x31d/0x780
[ 1320.511472]  [] ? netlink_rcv_wake+0x44/0x60
[ 1320.511475]  [] ? netlink_recvmsg+0x1d3/0x3e0
[ 1320.511479]  [] sock_sendmsg+0x93/0xd0
[ 1320.511484]  [] ? apparmor_file_alloc_security+0x20/0x40
[ 1320.511487]  [] ? verify_iovec+0x47/0xd0
[ 1320.511491]  [] ___sys_sendmsg+0x399/0x3b0
[ 1320.511495]  [] ? kernfs_seq_stop_active+0x32/0x40
[ 1320.511499]  [] ? native_sched_clock+0x35/0x90
[ 1320.511502]  [] ? native_sched_clock+0x35/0x90
[ 1320.511505]  [] ? sched_clock+0x9/0x10
[ 1320.511509]  [] ? acct_account_cputime+0x1c/0x20
[ 1320.511512]  [] ? account_user_time+0x8b/0xa0
[ 1320.511516]  [] ? __fget_light+0x25/0x70
[ 1320.511519]  [] __sys_sendmsg+0x42/0x80
[ 1320.511521]  [] SyS_sendmsg+0x12/0x20
[ 1320.511525]  [] tracesys_phase2+0xd8/0xdd

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: localed stuck in recent 3.18 git in copy_net_ns?

2014-10-23 Thread Jay Vosburgh

[ 1320.511475]  [816632e3] ? netlink_recvmsg+0x1d3/0x3e0
[ 1320.511479]  [8161c463] sock_sendmsg+0x93/0xd0
[ 1320.511484]  [81332d00] ? apparmor_file_alloc_security+0x20/0x40
[ 1320.511487]  [8162a697] ? verify_iovec+0x47/0xd0
[ 1320.511491]  [8161cc79] ___sys_sendmsg+0x399/0x3b0
[ 1320.511495]  [81254e02] ? kernfs_seq_stop_active+0x32/0x40
[ 1320.511499]  [8101c385] ? native_sched_clock+0x35/0x90
[ 1320.511502]  [8101c385] ? native_sched_clock+0x35/0x90
[ 1320.511505]  [8101c3e9] ? sched_clock+0x9/0x10
[ 1320.511509]  [81122d5c] ? acct_account_cputime+0x1c/0x20
[ 1320.511512]  [8109ce6b] ? account_user_time+0x8b/0xa0
[ 1320.511516]  [811fc135] ? __fget_light+0x25/0x70
[ 1320.511519]  [8161d372] __sys_sendmsg+0x42/0x80
[ 1320.511521]  [8161d3c2] SyS_sendmsg+0x12/0x20
[ 1320.511525]  [8173e6a4] tracesys_phase2+0xd8/0xdd

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1

2014-03-31 Thread Jay Vosburgh

Zheng Li  wrote:

>In bond mode tlb and alb, inactive slaves should keep inactive flag to 1 to
>refuse to receive broadcast packets. Now, active slave send broadcast packets
>(for example ARP requests) which will arrive inactive slaves on same host from
>switch, but inactive slave's inactive flag is zero that cause bridge receive 
>the
>broadcast packets to produce a wrong entry in forward table. Typical situation
>is domu send some ARP request which go out from dom0 bond's active slave, then
>the ARP broadcast request packets go back to inactive slave from switch, 
>because
>the inactive slave's inactive flag is zero, kernel will receive the packets and
>pass them to bridge that cause dom0's bridge map domu's MAC address to port of
>bond, bridge should map domu's MAC to port of vif.

I think the patch is ok (I don't have a machine to test it on at
the moment), but the description above is leaving out some details about
how the problem is induced.

The actual problem being fixed here is that bond_open is not
setting the inactive flag correctly for some modes (alb and tlb),
resulting in the behavior described above if the bond has been
administratively set down and then back up.  This effect should not
occur when slaves are added while the bond is up; it's something that
only happens after a down/up bounce of the bond.

That said, the patch itself looks fine to me.

Signed-off-by: Jay Vosburgh 

-J

>Signed-off-by: Zheng Li 
>---
> drivers/net/bonding/bond_main.c |2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index e5628fc..f97d72e 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -3058,7 +3058,7 @@ static int bond_open(struct net_device *bond_dev)
>   if (bond_has_slaves(bond)) {
>   read_lock(>curr_slave_lock);
>   bond_for_each_slave(bond, slave, iter) {
>-  if ((bond->params.mode == BOND_MODE_ACTIVEBACKUP)
>+  if ((bond->params.mode == BOND_MODE_ACTIVEBACKUP || 
>bond_is_lb(bond))
>   && (slave != bond->curr_active_slave)) {
>   bond_set_slave_inactive_flags(slave,
>         
> BOND_SLAVE_NOTIFY_NOW);
>-- 
>1.7.6.5

---
-Jay Vosburgh, j.vosbu...@gmail.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1

2014-03-31 Thread Jay Vosburgh

Zheng Li zheng.x...@oracle.com wrote:

In bond mode tlb and alb, inactive slaves should keep inactive flag to 1 to
refuse to receive broadcast packets. Now, active slave send broadcast packets
(for example ARP requests) which will arrive inactive slaves on same host from
switch, but inactive slave's inactive flag is zero that cause bridge receive 
the
broadcast packets to produce a wrong entry in forward table. Typical situation
is domu send some ARP request which go out from dom0 bond's active slave, then
the ARP broadcast request packets go back to inactive slave from switch, 
because
the inactive slave's inactive flag is zero, kernel will receive the packets and
pass them to bridge that cause dom0's bridge map domu's MAC address to port of
bond, bridge should map domu's MAC to port of vif.

I think the patch is ok (I don't have a machine to test it on at
the moment), but the description above is leaving out some details about
how the problem is induced.

The actual problem being fixed here is that bond_open is not
setting the inactive flag correctly for some modes (alb and tlb),
resulting in the behavior described above if the bond has been
administratively set down and then back up.  This effect should not
occur when slaves are added while the bond is up; it's something that
only happens after a down/up bounce of the bond.

That said, the patch itself looks fine to me.

Signed-off-by: Jay Vosburgh j.vosbu...@gmail.com

-J

Signed-off-by: Zheng Li zheng.x...@oracle.com
---
 drivers/net/bonding/bond_main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e5628fc..f97d72e 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3058,7 +3058,7 @@ static int bond_open(struct net_device *bond_dev)
   if (bond_has_slaves(bond)) {
   read_lock(bond-curr_slave_lock);
   bond_for_each_slave(bond, slave, iter) {
-  if ((bond-params.mode == BOND_MODE_ACTIVEBACKUP)
+  if ((bond-params.mode == BOND_MODE_ACTIVEBACKUP || 
bond_is_lb(bond))
(slave != bond-curr_active_slave)) {
   bond_set_slave_inactive_flags(slave,
 
 BOND_SLAVE_NOTIFY_NOW);
-- 
1.7.6.5

---
-Jay Vosburgh, j.vosbu...@gmail.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1 in tlb and alb mode.

2014-03-27 Thread Jay Vosburgh

Zheng Li  wrote:

>In bond mode tlb and alb, inactive slaves should keep inactive flag to
>1 to refuse to receive broadcast packets. Now, active slave send broadcast 
>packets
>(for example ARP requests) which will arrive inactive slaves on same host from 
>switch,
>but inactive slave's inactive flag is zero that cause bridge receive the 
>broadcast
>packets to produce a wrong entry in forward table. Typical situation is domu 
>send some
>ARP request which go out from dom0 bond's active slave, then the ARP broadcast 
>request
>packets go back to inactive slave from switch, because the inactive slave's 
>inactive
>flag is zero, kernel will receive the packets and pass them to bridge, that 
>cause dom0's
>bridge map domu's MAC address to port of bond, bridge should map domu's MAC to 
>port of vif.

It's probably worth noting that this effect is something that
happens after the bonding master device is opened with slaves, i.e.,
it's got a bunch of slaves, and is then set administratively down for
whatever reason, and is now being set back up, and needs to set the
active or inactive state of all the slaves.

It'd also be a little easier to read if it was formatted for 80
columns.

>Signed-off-by: Zheng Li 
>---
> drivers/net/bonding/bond_main.c |2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index e5628fc..8761df6 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -3062,7 +3062,7 @@ static int bond_open(struct net_device *bond_dev)
>   && (slave != bond->curr_active_slave)) {
>   bond_set_slave_inactive_flags(slave,
> 
> BOND_SLAVE_NOTIFY_NOW);
>-  } else {
>+  } else if (!bond_is_lb(bond)) {
>   bond_set_slave_active_flags(slave,
>   
> BOND_SLAVE_NOTIFY_NOW);

This patch doesn't do anything for the modes that are
bond_is_lb, i.e., the balance-tlb and -alb modes.  I believe those two
should be set similarly to active-backup: the curr_active_slave is
active, other slaves are inactive.  The "inactive" setting for alb is
special, and means to not pass broadcast or multicast, but let unicast
through.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1 in tlb and alb mode.

2014-03-27 Thread Jay Vosburgh

Zheng Li zheng.x...@oracle.com wrote:

In bond mode tlb and alb, inactive slaves should keep inactive flag to
1 to refuse to receive broadcast packets. Now, active slave send broadcast 
packets
(for example ARP requests) which will arrive inactive slaves on same host from 
switch,
but inactive slave's inactive flag is zero that cause bridge receive the 
broadcast
packets to produce a wrong entry in forward table. Typical situation is domu 
send some
ARP request which go out from dom0 bond's active slave, then the ARP broadcast 
request
packets go back to inactive slave from switch, because the inactive slave's 
inactive
flag is zero, kernel will receive the packets and pass them to bridge, that 
cause dom0's
bridge map domu's MAC address to port of bond, bridge should map domu's MAC to 
port of vif.

It's probably worth noting that this effect is something that
happens after the bonding master device is opened with slaves, i.e.,
it's got a bunch of slaves, and is then set administratively down for
whatever reason, and is now being set back up, and needs to set the
active or inactive state of all the slaves.

It'd also be a little easier to read if it was formatted for 80
columns.

Signed-off-by: Zheng Li zheng.x...@oracle.com
---
 drivers/net/bonding/bond_main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e5628fc..8761df6 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3062,7 +3062,7 @@ static int bond_open(struct net_device *bond_dev)
(slave != bond-curr_active_slave)) {
   bond_set_slave_inactive_flags(slave,
 
 BOND_SLAVE_NOTIFY_NOW);
-  } else {
+  } else if (!bond_is_lb(bond)) {
   bond_set_slave_active_flags(slave,
   
 BOND_SLAVE_NOTIFY_NOW);

This patch doesn't do anything for the modes that are
bond_is_lb, i.e., the balance-tlb and -alb modes.  I believe those two
should be set similarly to active-backup: the curr_active_slave is
active, other slaves are inactive.  The inactive setting for alb is
special, and means to not pass broadcast or multicast, but let unicast
through.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1.

2014-03-21 Thread Jay Vosburgh

zheng.li  wrote:

>于 2014年03月21日 01:02, Jay Vosburgh 写道:
>> Zheng Li  wrote:
>> 
>>> Except bond mode 1, in other bond modes, inactive slaves should keep
>>> inactive flag to 1 to refuse to receive broadcast packets. Now, active
>>> slave send broadcast packets (for example ARP requests) which will
>>> arrive inactive slaves on same host from switch, but inactive slave's
>>> inactive flag is zero that cause bridge receive the broadcast packets
>>> to produce a wrong entry in forward table. Typical situation is domu
>>> send some ARP request which go out from dom0 bond's active slave, then
>>> the ARP broadcast request packets go back to inactive slave from
>>> switch, because the inactive slave's inactive flag is zero, kernel will
>>> receive the packets and pass them to bridge, that cause dom0's bridge
>>> map domu's MAC address to port of bond, bridge should map domu's MAC to
>>> port of vif.
>> 
>>  I suspect this will break LACP (802.3ad) and Etherchannel
>> (balance-xor, balance-rr) modes, as those modes can receive broadcast or
>> multicast on any slave.  In those cases, the switch knows about the
>> aggregation, and will only send the broadcast / multicast to one of the
>> ports, but the port selected is not always the same one.
>> 
>>  In which mode are you having trouble?
>> 
>>  -J
>
>Except bond mode 1, in other modes (major test in mode 6, and test all
>other mode,  except mode 1, all other modes has the bug), the bridge
>make a wrong entry which map guest MAC to the port of bond, it should
>map guest MAC to the port of vif.
>
>Env description: dom0's bridge contains bond1 and vif ports, bond1 as
>port 1 , vif as port 2, bond1 has two slaves which connect a switch.
>when from guest ping others ,the arp broadcast request will go out from
>bond1's active slave, and then go back to itself inactive slave from
>switch , in function of bond_should_deliver_exact_match will return
>false by inactive is zero, return false will cause bridge receive the
>arp request packets whose original is from guest through vif that let
>bridge consider the SRC MAC of guest is from bond1 by analyzing the arp
>broadcast packets, then make a wrong forward entry "MAC of guest, from
>port 1 (bond1)" , the correct entry should be "MAC of guest , from port
>2 (vif)".

I believe I understand; the crux of the problem is that the
broadcast is looped by the switch to the other bond port, which then
processes it as a received packet.

For the tlb and alb modes, you're logically correct; the slaves
other than the active slave should be set to inactive when the bond is
opened.  This is how they are set when configured normally to limit
inbound broadcast and multicast traffic to just one slave (because these
modes do not configure the switch ports into channel groups or
aggregators; the switch has no knowledge of the bond).

Your patch is still incorrect, though; the slaves should be set
to actually be inactive via bond_set_slave_inactive_flags, not left
"semi-active," i.e., BOND_STATE_ACTIVE but slave->inactive set. More on
the slave->inactive flag handling later, though.

For the balance-rr/-xor and 802.3ad modes, I suspect you have a
configuration problem of some sort.  For those modes, the switch should
not loop back the broadcast as you describe when correctly configured.

The balance-rr/-xor modes should be connected to switch ports
that are configured into a single Etherchannel group (static link
aggregation).  If they are not, the looping back behavior you describe
is expected, as the switch will flood the broadcast to all ports.

Similarly, the 802.3ad mode should be connected to a switch with
the ports configured for LACP, in which case the ports will
automatically configure into one or more aggregators, and again, the
switch will limit broadcast and multicast traffic to just one member of
an aggregator.

The -rr/-xor/802.3ad modes must have all slaves set to active in
order for them to correctly process incoming broadcast and multicast
traffic in their proper configuration.  Setting them to inactive will
cause packet loss in those configurations.

The 802.3ad mode is probably the easiest to examine; if you
configure the switch ports for LACP mode, the /proc/net/bonding/bond0
file should indicate that all the slaves are in the same aggregator
(have the same Aggregator ID), and that that aggregator is the active
one.  The switch should have similar indications, although the format is
vendor-specific.  If your switch is configured for LACP and you still
have issues, please post the /proc/net/bonding/bond0 contents.

Finally, getting back to the slave->inactive flag.  Th

Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1.

2014-03-21 Thread Jay Vosburgh

zheng.li zheng.x...@oracle.com wrote:

于 2014年03月21日 01:02, Jay Vosburgh 写道:
 Zheng Li zheng.x...@oracle.com wrote:
 
 Except bond mode 1, in other bond modes, inactive slaves should keep
 inactive flag to 1 to refuse to receive broadcast packets. Now, active
 slave send broadcast packets (for example ARP requests) which will
 arrive inactive slaves on same host from switch, but inactive slave's
 inactive flag is zero that cause bridge receive the broadcast packets
 to produce a wrong entry in forward table. Typical situation is domu
 send some ARP request which go out from dom0 bond's active slave, then
 the ARP broadcast request packets go back to inactive slave from
 switch, because the inactive slave's inactive flag is zero, kernel will
 receive the packets and pass them to bridge, that cause dom0's bridge
 map domu's MAC address to port of bond, bridge should map domu's MAC to
 port of vif.
 
  I suspect this will break LACP (802.3ad) and Etherchannel
 (balance-xor, balance-rr) modes, as those modes can receive broadcast or
 multicast on any slave.  In those cases, the switch knows about the
 aggregation, and will only send the broadcast / multicast to one of the
 ports, but the port selected is not always the same one.
 
  In which mode are you having trouble?
 
  -J

Except bond mode 1, in other modes (major test in mode 6, and test all
other mode,  except mode 1, all other modes has the bug), the bridge
make a wrong entry which map guest MAC to the port of bond, it should
map guest MAC to the port of vif.

Env description: dom0's bridge contains bond1 and vif ports, bond1 as
port 1 , vif as port 2, bond1 has two slaves which connect a switch.
when from guest ping others ,the arp broadcast request will go out from
bond1's active slave, and then go back to itself inactive slave from
switch , in function of bond_should_deliver_exact_match will return
false by inactive is zero, return false will cause bridge receive the
arp request packets whose original is from guest through vif that let
bridge consider the SRC MAC of guest is from bond1 by analyzing the arp
broadcast packets, then make a wrong forward entry MAC of guest, from
port 1 (bond1) , the correct entry should be MAC of guest , from port
2 (vif).

I believe I understand; the crux of the problem is that the
broadcast is looped by the switch to the other bond port, which then
processes it as a received packet.

For the tlb and alb modes, you're logically correct; the slaves
other than the active slave should be set to inactive when the bond is
opened.  This is how they are set when configured normally to limit
inbound broadcast and multicast traffic to just one slave (because these
modes do not configure the switch ports into channel groups or
aggregators; the switch has no knowledge of the bond).

Your patch is still incorrect, though; the slaves should be set
to actually be inactive via bond_set_slave_inactive_flags, not left
semi-active, i.e., BOND_STATE_ACTIVE but slave-inactive set. More on
the slave-inactive flag handling later, though.

For the balance-rr/-xor and 802.3ad modes, I suspect you have a
configuration problem of some sort.  For those modes, the switch should
not loop back the broadcast as you describe when correctly configured.

The balance-rr/-xor modes should be connected to switch ports
that are configured into a single Etherchannel group (static link
aggregation).  If they are not, the looping back behavior you describe
is expected, as the switch will flood the broadcast to all ports.

Similarly, the 802.3ad mode should be connected to a switch with
the ports configured for LACP, in which case the ports will
automatically configure into one or more aggregators, and again, the
switch will limit broadcast and multicast traffic to just one member of
an aggregator.

The -rr/-xor/802.3ad modes must have all slaves set to active in
order for them to correctly process incoming broadcast and multicast
traffic in their proper configuration.  Setting them to inactive will
cause packet loss in those configurations.

The 802.3ad mode is probably the easiest to examine; if you
configure the switch ports for LACP mode, the /proc/net/bonding/bond0
file should indicate that all the slaves are in the same aggregator
(have the same Aggregator ID), and that that aggregator is the active
one.  The switch should have similar indications, although the format is
vendor-specific.  If your switch is configured for LACP and you still
have issues, please post the /proc/net/bonding/bond0 contents.

Finally, getting back to the slave-inactive flag.  The only
difference between bond_set_slave_active_flags or _inactive_flags and
bond_set_slave_state is that the slave-inactive flag (whose only
purpose is duplicate suppression for incoming packets) is cleared by the
first (_flags variant), but not by the second.  Right now, the only
caller of _set_slave_state

Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1.

2014-03-20 Thread Jay Vosburgh

Zheng Li  wrote:

>Except bond mode 1, in other bond modes, inactive slaves should keep
>inactive flag to 1 to refuse to receive broadcast packets. Now, active
>slave send broadcast packets (for example ARP requests) which will
>arrive inactive slaves on same host from switch, but inactive slave's
>inactive flag is zero that cause bridge receive the broadcast packets
>to produce a wrong entry in forward table. Typical situation is domu
>send some ARP request which go out from dom0 bond's active slave, then
>the ARP broadcast request packets go back to inactive slave from
>switch, because the inactive slave's inactive flag is zero, kernel will
>receive the packets and pass them to bridge, that cause dom0's bridge
>map domu's MAC address to port of bond, bridge should map domu's MAC to
>port of vif.

I suspect this will break LACP (802.3ad) and Etherchannel
(balance-xor, balance-rr) modes, as those modes can receive broadcast or
multicast on any slave.  In those cases, the switch knows about the
aggregation, and will only send the broadcast / multicast to one of the
ports, but the port selected is not always the same one.

In which mode are you having trouble?

-J


>
>Signed-off-by: Zheng Li 
>---
> drivers/net/bonding/bond_main.c |2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index e5628fc..2f73f18 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -3063,7 +3063,7 @@ static int bond_open(struct net_device *bond_dev)
>   bond_set_slave_inactive_flags(slave,
> 
> BOND_SLAVE_NOTIFY_NOW);
>   } else {
>-  bond_set_slave_active_flags(slave,
>+  bond_set_slave_state(slave, BOND_STATE_ACTIVE,
>   
> BOND_SLAVE_NOTIFY_NOW);
>   }
>   }
>-- 
>1.7.6.5
>

---
-Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1.

2014-03-20 Thread Jay Vosburgh

Zheng Li zheng.x...@oracle.com wrote:

Except bond mode 1, in other bond modes, inactive slaves should keep
inactive flag to 1 to refuse to receive broadcast packets. Now, active
slave send broadcast packets (for example ARP requests) which will
arrive inactive slaves on same host from switch, but inactive slave's
inactive flag is zero that cause bridge receive the broadcast packets
to produce a wrong entry in forward table. Typical situation is domu
send some ARP request which go out from dom0 bond's active slave, then
the ARP broadcast request packets go back to inactive slave from
switch, because the inactive slave's inactive flag is zero, kernel will
receive the packets and pass them to bridge, that cause dom0's bridge
map domu's MAC address to port of bond, bridge should map domu's MAC to
port of vif.

I suspect this will break LACP (802.3ad) and Etherchannel
(balance-xor, balance-rr) modes, as those modes can receive broadcast or
multicast on any slave.  In those cases, the switch knows about the
aggregation, and will only send the broadcast / multicast to one of the
ports, but the port selected is not always the same one.

In which mode are you having trouble?

-J



Signed-off-by: Zheng Li zheng.x...@oracle.com
---
 drivers/net/bonding/bond_main.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e5628fc..2f73f18 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3063,7 +3063,7 @@ static int bond_open(struct net_device *bond_dev)
   bond_set_slave_inactive_flags(slave,
 
 BOND_SLAVE_NOTIFY_NOW);
   } else {
-  bond_set_slave_active_flags(slave,
+  bond_set_slave_state(slave, BOND_STATE_ACTIVE,
   
 BOND_SLAVE_NOTIFY_NOW);
   }
   }
-- 
1.7.6.5


---
-Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: https://lkml.org/lkml/2013/2/1/531

2013-05-22 Thread Jay Vosburgh

Smart Weblications GmbH - Florian Wiessner wrote:

>Am 22.05.2013 22:04, schrieb Greg KH:
>
>
>>>
>>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/jkirsher/net-next/+/567b871e503316b0927e54a3d7c86d50b722d955%5E!/
>> 
>> Ok, that's what we need.
>> 
>> Now, please cc: the developers / maintainers of that patch and ask them
>> to have it included in the 3.4-stable kernel series.
>> 
>> Then, if they agree, the network maintainer will pick it up and send it
>> to me for inclusion.
>> 
>
>i set committerDavid S. Miller  in cc already, 
>but do not
>know the network maintainer...
>
>this seems to me that "Matthew O'Connor"  sent this to
>netdev on 2013-02-01:
>
>http://lists.openwall.net/netdev/2013/02/01/86
>
>but i couldn't find a trace of the patch in 3.4.36?!

The patch in question here is in net-next; the commit is:

commit 567b871e503316b0927e54a3d7c86d50b722d955
Author: zheng.li 
Date:   Tue Nov 27 23:57:04 2012 +

bonding: rlb mode of bond should not alter ARP originating via bridge

The additional change in the backport from Matthew O'Connor (to
add ether_addr_equal_64bits) appears to still be necessary for 3.4.46.
Alternatively, the patch could utilize ether_addr_equal instead, to
minimize the change set.  Greg, do you have a preference there?

Submissions for stable from networking normally go through
Davem; I can check the patch and repost it to netdev against 3.4.46 if
everybody is ok with that.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: https://lkml.org/lkml/2013/2/1/531

2013-05-22 Thread Jay Vosburgh

Smart Weblications GmbH - Florian Wiessner wrote:

Am 22.05.2013 22:04, schrieb Greg KH:



 https://kernel.googlesource.com/pub/scm/linux/kernel/git/jkirsher/net-next/+/567b871e503316b0927e54a3d7c86d50b722d955%5E!/
 
 Ok, that's what we need.
 
 Now, please cc: the developers / maintainers of that patch and ask them
 to have it included in the 3.4-stable kernel series.
 
 Then, if they agree, the network maintainer will pick it up and send it
 to me for inclusion.
 

i set committerDavid S. Miller da...@davemloft.net in cc already, 
but do not
know the network maintainer...

this seems to me that Matthew O'Connor liquidho...@gmail.com sent this to
netdev on 2013-02-01:

http://lists.openwall.net/netdev/2013/02/01/86

but i couldn't find a trace of the patch in 3.4.36?!

The patch in question here is in net-next; the commit is:

commit 567b871e503316b0927e54a3d7c86d50b722d955
Author: zheng.li zheng.x...@oracle.com
Date:   Tue Nov 27 23:57:04 2012 +

bonding: rlb mode of bond should not alter ARP originating via bridge


The additional change in the backport from Matthew O'Connor (to
add ether_addr_equal_64bits) appears to still be necessary for 3.4.46.
Alternatively, the patch could utilize ether_addr_equal instead, to
minimize the change set.  Greg, do you have a preference there?

Submissions for stable from networking normally go through
Davem; I can check the patch and repost it to netdev against 3.4.46 if
everybody is ok with that.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7

2012-12-07 Thread Jay Vosburgh

Linda Walsh  wrote:

>Sorry for the delay my distro (Suse) has made rebooting my system
>a chore (have to often boot from rescue to get it to come up because
>they put mount libs in /usr/lib expecting they will always boot
>from their ram disk -- preventing those of use who boot directly
>from disk from doing so easily...grrr.
>
>Jay Vosburgh wrote:
>>  The miimon functionality is used to check link state and notice
>> when slaves lose carrier.
>---
>   If I am running 'rr' on 2 channels -- specifically for the purpose
>of link speed aggregation (getting 1 20Gb channel out of 2 10Gb channels)
>I'm not sure I see how miimon would provide benefit.  -- if 1 link dies,
>the other, being on the same card is likely to be dead too, so would
>it really serve a purpose?

Perhaps, but if the link partner experiences a failure, that may
be a different situation.  Not all failures will necessarily cause both
links to fail simultaneously.

>>   Running without it will not detect failure of
>> the bonding slaves, which is likely not what you want.  The mode,
>> balance-rr in your case, is what selects the load balance to use, and is
>> separate from the miimon.
>>   
>
>   Wouldn't the entire link die if a slave dies -- like RAID0, 1 disk
>dies, the entire link goes down? 

No; failure of a single slave does not cause the entire bond to
fail (unless that is the last available slave).  For round robin, a
failed slave is taken out of the set used to transmit traffic, and any
remaining slaves continue to round robin amongst themselves.

>   The other end (windows) doesn't dynamically config for a static-link
>aggregation, so I don't think it would provide benefit.

So it (windows) has no means to disable (and discontinue use of)
one channel of the aggregation should it fail, even in a static link
aggregation?

>>  That said, the problem you're seeing appears to be caused by two
>> things: bonding holds a lock (in addition to RTNL) when calling
>> __ethtool_get_settings, and an ixgbe function in the call path to
>> retrieve the settings, ixgbe_acquire_swfw_sync_X540, can sleep.
>>
>>  The test patch above handles one case in bond_enslave, but there
>> is another case in bond_miimon_commit when a slave changes link state
>> from down to up, which will occur shortly after the slave is added.
>>   
>
>   Added your 2nd patch -- no more error messages...
>
>   however -- likely unrelated, the max speed read or write I am seeing
>is about 500MB/s, and that is rare -- usually it's barely <3x a 1Gb
>network speed. (119/125 MB R/W).  I'm not at all sure it's really
>combining the links
>properly.  Anyway to verify that?

How are you testing the throughput?  If you configure the
aggregation with just one link, how does the throughput compare to the
aggregation with both links?

It most likely is combining links properly, but any link
aggregation scheme has tradeoffs, and the best load balance algorithm to
use depends upon the work load.  Two aggregated 10G links are not
interchangable with a single 20G link.

For a round robin transmission scheme, issues arise because
packets are delivered at the other end out of order.  This in turn
triggers various TCP behaviors to deal with what is perceived to be
transmission errors or lost packets (TCP fast retransmit being the most
notable).  This usually results in a single TCP connection being unable
to completely saturate a round-robin aggregated set of links.

There are a few parameters on linux that can be adjusted.  I
don't know what the windows equivalents might be.

On linux, adjusting the net.ipv4.tcp_reordering sysctl value
will increase the tolerance for out of order delivery.  

The sysctl is adjusted via something like

sysctl -w net.ipv4.tcp_reordering=10

the default value is 3, and higher values increase the tolerance
for out of order delivery.  If memory serves, the setting is applied to
connections as they are created, so existing connections will not see
changes.

Also, adjusting the packet coalescing setting for the receiving
devices may also permit higher throughput. The packet coalescing setting
is adjusted via ethtool; the current settings can be viewed via

ethtool -c eth0

and then adjusted via something like

ethtool -C eth0 rx-usecs 30

I've seen reports that raising the "rx-usecs" parameter at the
receiver can increase the round-robin throughput.  My recollection is
that the value used was 30, but the best settings will likely be
dependent upon your particular hardware and configuration.

>   On the windows side it shows the bond-link as a 20Gb connection, but
>I don't see anyplace for something similar on linux. 

There isn't any such in

Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7

2012-12-07 Thread Jay Vosburgh

Linda Walsh l...@tlinx.org wrote:

Sorry for the delay my distro (Suse) has made rebooting my system
a chore (have to often boot from rescue to get it to come up because
they put mount libs in /usr/lib expecting they will always boot
from their ram disk -- preventing those of use who boot directly
from disk from doing so easily...grrr.

Jay Vosburgh wrote:
  The miimon functionality is used to check link state and notice
 when slaves lose carrier.
---
   If I am running 'rr' on 2 channels -- specifically for the purpose
of link speed aggregation (getting 1 20Gb channel out of 2 10Gb channels)
I'm not sure I see how miimon would provide benefit.  -- if 1 link dies,
the other, being on the same card is likely to be dead too, so would
it really serve a purpose?

Perhaps, but if the link partner experiences a failure, that may
be a different situation.  Not all failures will necessarily cause both
links to fail simultaneously.

   Running without it will not detect failure of
 the bonding slaves, which is likely not what you want.  The mode,
 balance-rr in your case, is what selects the load balance to use, and is
 separate from the miimon.
   

   Wouldn't the entire link die if a slave dies -- like RAID0, 1 disk
dies, the entire link goes down? 

No; failure of a single slave does not cause the entire bond to
fail (unless that is the last available slave).  For round robin, a
failed slave is taken out of the set used to transmit traffic, and any
remaining slaves continue to round robin amongst themselves.

   The other end (windows) doesn't dynamically config for a static-link
aggregation, so I don't think it would provide benefit.

So it (windows) has no means to disable (and discontinue use of)
one channel of the aggregation should it fail, even in a static link
aggregation?

  That said, the problem you're seeing appears to be caused by two
 things: bonding holds a lock (in addition to RTNL) when calling
 __ethtool_get_settings, and an ixgbe function in the call path to
 retrieve the settings, ixgbe_acquire_swfw_sync_X540, can sleep.

  The test patch above handles one case in bond_enslave, but there
 is another case in bond_miimon_commit when a slave changes link state
 from down to up, which will occur shortly after the slave is added.
   

   Added your 2nd patch -- no more error messages...

   however -- likely unrelated, the max speed read or write I am seeing
is about 500MB/s, and that is rare -- usually it's barely 3x a 1Gb
network speed. (119/125 MB R/W).  I'm not at all sure it's really
combining the links
properly.  Anyway to verify that?

How are you testing the throughput?  If you configure the
aggregation with just one link, how does the throughput compare to the
aggregation with both links?

It most likely is combining links properly, but any link
aggregation scheme has tradeoffs, and the best load balance algorithm to
use depends upon the work load.  Two aggregated 10G links are not
interchangable with a single 20G link.

For a round robin transmission scheme, issues arise because
packets are delivered at the other end out of order.  This in turn
triggers various TCP behaviors to deal with what is perceived to be
transmission errors or lost packets (TCP fast retransmit being the most
notable).  This usually results in a single TCP connection being unable
to completely saturate a round-robin aggregated set of links.

There are a few parameters on linux that can be adjusted.  I
don't know what the windows equivalents might be.

On linux, adjusting the net.ipv4.tcp_reordering sysctl value
will increase the tolerance for out of order delivery.  

The sysctl is adjusted via something like

sysctl -w net.ipv4.tcp_reordering=10

the default value is 3, and higher values increase the tolerance
for out of order delivery.  If memory serves, the setting is applied to
connections as they are created, so existing connections will not see
changes.

Also, adjusting the packet coalescing setting for the receiving
devices may also permit higher throughput. The packet coalescing setting
is adjusted via ethtool; the current settings can be viewed via

ethtool -c eth0

and then adjusted via something like

ethtool -C eth0 rx-usecs 30

I've seen reports that raising the rx-usecs parameter at the
receiver can increase the round-robin throughput.  My recollection is
that the value used was 30, but the best settings will likely be
dependent upon your particular hardware and configuration.

   On the windows side it shows the bond-link as a 20Gb connection, but
I don't see anyplace for something similar on linux. 

There isn't any such indicator; bonding does not advertise its
link speed as the sum of its slaves link speeds.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com

--
To unsubscribe from this list: send the line unsubscribe linux

Re: [PATCH] bonding: rlb mode of bond should not alter ARP originating via bridge

2012-11-29 Thread Jay Vosburgh

Zheng Li  wrote:

>Do not modify or load balance ARP packets passing through balance-alb
>mode (wherein the ARP did not originate locally, and arrived via a bridge).
>
>Modifying pass-through ARP replies causes an incorrect MAC address 
>to be placed into the ARP packet, rendering peers unable to communicate 
>with the actual destination from which the ARP reply originated.
>
>Load balancing pass-through ARP requests causes an entry to be
>created for the peer in the rlb table, and bond_alb_monitor will
>occasionally issue ARP updates to all peers in the table instrucing them
>as to which MAC address they should communicate with; this occurs when
>some event sets rx_ntt.  In the bridged case, however, the MAC address
>used for the update would be the MAC of the slave, not the actual source
>MAC of the originating destination.  This would render peers unable to
>communicate with the destinations beyond the bridge.
>
>Signed-off-by: Zheng Li 
>Cc: Jay Vosburgh 
>Cc: Andy Gospodarek 
>Cc: "David S. Miller" 

Signed-off-by: Jay Vosburgh 

>---
> drivers/net/bonding/bond_alb.c |6 ++
> drivers/net/bonding/bonding.h  |   13 +
> 2 files changed, 19 insertions(+), 0 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index e15cc11..6fecb52 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -694,6 +694,12 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, 
>struct bonding *bond)
>   struct arp_pkt *arp = arp_pkt(skb);
>   struct slave *tx_slave = NULL;
>
>+  /* Don't modify or load balance ARPs that do not originate locally
>+   * (e.g.,arrive via a bridge).
>+   */
>+  if (!bond_slave_has_mac(bond, arp->mac_src))
>+  return NULL;
>+
>   if (arp->op_code == htons(ARPOP_REPLY)) {
>   /* the arp must be sent on the selected
>   * rx channel
>diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>index f8af2fc..6dded56 100644
>--- a/drivers/net/bonding/bonding.h
>+++ b/drivers/net/bonding/bonding.h
>@@ -22,6 +22,7 @@
> #include 
> #include 
> #include 
>+#include 
> #include "bond_3ad.h"
> #include "bond_alb.h"
>
>@@ -450,6 +451,18 @@ static inline void bond_destroy_proc_dir(struct bond_net 
>*bn)
> }
> #endif
>
>+static inline struct slave *bond_slave_has_mac(struct bonding *bond,
>+ const u8 *mac)
>+{
>+  int i = 0;
>+  struct slave *tmp;
>+
>+  bond_for_each_slave(bond, tmp, i)
>+  if (ether_addr_equal_64bits(mac, tmp->dev->dev_addr))
>+  return tmp;
>+
>+  return NULL;
>+}
>
> /* exported from bond_main.c */
> extern int bond_net_id;
>-- 
>1.7.6.5
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] bonding: rlb mode of bond should not alter ARP originating via bridge

2012-11-29 Thread Jay Vosburgh

Zheng Li zheng.x...@oracle.com wrote:

Do not modify or load balance ARP packets passing through balance-alb
mode (wherein the ARP did not originate locally, and arrived via a bridge).

Modifying pass-through ARP replies causes an incorrect MAC address 
to be placed into the ARP packet, rendering peers unable to communicate 
with the actual destination from which the ARP reply originated.

Load balancing pass-through ARP requests causes an entry to be
created for the peer in the rlb table, and bond_alb_monitor will
occasionally issue ARP updates to all peers in the table instrucing them
as to which MAC address they should communicate with; this occurs when
some event sets rx_ntt.  In the bridged case, however, the MAC address
used for the update would be the MAC of the slave, not the actual source
MAC of the originating destination.  This would render peers unable to
communicate with the destinations beyond the bridge.

Signed-off-by: Zheng Li zheng.x...@oracle.com
Cc: Jay Vosburgh fu...@us.ibm.com
Cc: Andy Gospodarek a...@greyhouse.net
Cc: David S. Miller da...@davemloft.net

Signed-off-by: Jay Vosburgh fu...@us.ibm.com

---
 drivers/net/bonding/bond_alb.c |6 ++
 drivers/net/bonding/bonding.h  |   13 +
 2 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index e15cc11..6fecb52 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -694,6 +694,12 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, 
struct bonding *bond)
   struct arp_pkt *arp = arp_pkt(skb);
   struct slave *tx_slave = NULL;

+  /* Don't modify or load balance ARPs that do not originate locally
+   * (e.g.,arrive via a bridge).
+   */
+  if (!bond_slave_has_mac(bond, arp-mac_src))
+  return NULL;
+
   if (arp-op_code == htons(ARPOP_REPLY)) {
   /* the arp must be sent on the selected
   * rx channel
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index f8af2fc..6dded56 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -22,6 +22,7 @@
 #include linux/in6.h
 #include linux/netpoll.h
 #include linux/inetdevice.h
+#include linux/etherdevice.h
 #include bond_3ad.h
 #include bond_alb.h

@@ -450,6 +451,18 @@ static inline void bond_destroy_proc_dir(struct bond_net 
*bn)
 }
 #endif

+static inline struct slave *bond_slave_has_mac(struct bonding *bond,
+ const u8 *mac)
+{
+  int i = 0;
+  struct slave *tmp;
+
+  bond_for_each_slave(bond, tmp, i)
+  if (ether_addr_equal_64bits(mac, tmp-dev-dev_addr))
+  return tmp;
+
+  return NULL;
+}

 /* exported from bond_main.c */
 extern int bond_net_id;
-- 
1.7.6.5


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7

2012-11-28 Thread Jay Vosburgh

et_settings+0x34/0x2b0
>[   59.791381]  [] __ethtool_get_settings+0x85/0x140
>[   59.791386]  [] bond_update_speed_duplex+0x23/0x60
>[   59.791389]  [] bond_mii_monitor+0x354/0x640
>[   59.791393]  [] process_one_work+0x1a7/0x680
>[   59.791396]  [] ? process_one_work+0x146/0x680
>[   59.791402]  [] ? put_lock_stats.isra.21+0xe/0x40
>[   59.791411]  [] ? bond_loadbalance_arp_mon+0x2c0/0x2c0
>[   59.791421]  [] worker_thread+0x18d/0x4f0
>[   59.791434]  [] ? sub_preempt_count+0x51/0x60
>[   59.791442]  [] ? manage_workers+0x320/0x320
>[   59.791453]  [] kthread+0x9d/0xb0
>[   59.791460]  [] kernel_thread_helper+0x4/0x10
>[   59.791464]  [] ? finish_task_switch+0x77/0x100
>[   59.791468]  [] ? _raw_spin_unlock_irq+0x36/0x60
>[   59.791472]  [] ? retint_restore_args+0xe/0xe
>[   59.791476]  [] ? flush_kthread_worker+0x160/0x160
>[   59.791480]  [] ? gs_change+0xb/0xb
>[   59.794932] BUG: scheduling while atomic: kworker/u:1/103/0x0002
>[   59.801333] 4 locks held by kworker/u:1/103:
>[   59.801340]  #0:  ((bond_dev->name)){..}, at: []
>process_one_work+0x146/0x680
>[   59.801345]  #1:  ((&(>mii_work)->work)){..}, at:
>[] process_one_work+0x146/0x680
>[   59.801350]  #2:  (rtnl_mutex){..}, at: []
>rtnl_trylock+0x10/0x20
>[   59.801356]  #3:  (>lock){..}, at: []
>bond_mii_monitor+0x2ed/0x640
>[   59.801365] Modules linked in: fan kvm_intel mousedev kvm iTCO_wdt
>iTCO_vendor_support acpi_cpufreq tpm_tis tpm tpm_bios mperf processor
>[   59.801368] Pid: 103, comm: kworker/u:1 Tainted: GW
>3.6.8-Isht-Van #4
>[   59.801369] Call Trace:
>[   59.801373]  [] __schedule_bug+0x5e/0x6c
>[   59.801380]  [] __schedule+0x77c/0x810
>[   59.801385]  [] schedule+0x24/0x70
>[   59.801391]  []
>schedule_hrtimeout_range_clock+0xfc/0x140
>[   59.801395]  [] ? update_rmtp+0x60/0x60
>[   59.801399]  [] ? update_rmtp+0x60/0x60
>[   59.801404]  [] ? hrtimer_start_range_ns+0xf/0x20
>[   59.801409]  [] schedule_hrtimeout_range+0xe/0x10
>[   59.801414]  [] usleep_range+0x3b/0x40
>[   59.801419]  [] ixgbe_release_swfw_sync_X540+0x4e/0x60
>[   59.801424]  [] ixgbe_read_phy_reg_generic+0x101/0x120
>[   59.801429]  []
>ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
>[   59.801433]  [] ? bond_mii_monitor+0x2ed/0x640
>[   59.801441]  [] ixgbe_get_settings+0x34/0x2b0
>[   59.801446]  [] __ethtool_get_settings+0x85/0x140
>[   59.801450]  [] bond_update_speed_duplex+0x23/0x60
>[   59.801471]  [] bond_mii_monitor+0x354/0x640
>[   59.801475]  [] process_one_work+0x1a7/0x680
>[   59.801477]  [] ? process_one_work+0x146/0x680
>[   59.801481]  [] ? put_lock_stats.isra.21+0xe/0x40
>[   59.801484]  [] ? bond_loadbalance_arp_mon+0x2c0/0x2c0
>[   59.801489]  [] worker_thread+0x18d/0x4f0
>[   59.801495]  [] ? sub_preempt_count+0x51/0x60
>[   59.801500]  [] ? manage_workers+0x320/0x320
>[   59.801505]  [] kthread+0x9d/0xb0
>[   59.801510]  [] kernel_thread_helper+0x4/0x10
>[   59.801515]  [] ? finish_task_switch+0x77/0x100
>[   59.801519]  [] ? _raw_spin_unlock_irq+0x36/0x60
>[   59.801524]  [] ? retint_restore_args+0xe/0xe
>[   59.801530]  [] ? flush_kthread_worker+0x160/0x160
>[   59.801536]  [] ? gs_change+0xb/0xb
>[   59.804986] bonding: bond0: link status definitely up for interface
>p2p2, 1 Mbps full duplex.

---
-Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7

2012-11-28 Thread Jay Vosburgh

] ? update_rmtp+0x60/0x60
[   59.791334]  [81065a1f] ? hrtimer_start_range_ns+0xf/0x20
[   59.791339]  [81684c3e] schedule_hrtimeout_range+0xe/0x10
[   59.791345]  [8104bddb] usleep_range+0x3b/0x40
[   59.791352]  [814d220c] ixgbe_acquire_swfw_sync_X540+0xbc/0x110
[   59.791357]  [814ce4dd] ixgbe_read_phy_reg_generic+0x3d/0x120
[   59.791361]  [814ce74c]
ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
[   59.791366]  [81480b5d] ? bond_mii_monitor+0x2ed/0x640
[   59.791372]  [814c6454] ixgbe_get_settings+0x34/0x2b0
[   59.791381]  [8159af55] __ethtool_get_settings+0x85/0x140
[   59.791386]  [8147c6e3] bond_update_speed_duplex+0x23/0x60
[   59.791389]  [81480bc4] bond_mii_monitor+0x354/0x640
[   59.791393]  [8105a9b7] process_one_work+0x1a7/0x680
[   59.791396]  [8105a956] ? process_one_work+0x146/0x680
[   59.791402]  [8108c7ce] ? put_lock_stats.isra.21+0xe/0x40
[   59.791411]  [81480870] ? bond_loadbalance_arp_mon+0x2c0/0x2c0
[   59.791421]  [8105b9ed] worker_thread+0x18d/0x4f0
[   59.791434]  [81070991] ? sub_preempt_count+0x51/0x60
[   59.791442]  [8105b860] ? manage_workers+0x320/0x320
[   59.791453]  [81060f7d] kthread+0x9d/0xb0
[   59.791460]  [816892e4] kernel_thread_helper+0x4/0x10
[   59.791464]  [8106c197] ? finish_task_switch+0x77/0x100
[   59.791468]  [81687526] ? _raw_spin_unlock_irq+0x36/0x60
[   59.791472]  [81687a5d] ? retint_restore_args+0xe/0xe
[   59.791476]  [81060ee0] ? flush_kthread_worker+0x160/0x160
[   59.791480]  [816892e0] ? gs_change+0xb/0xb
[   59.794932] BUG: scheduling while atomic: kworker/u:1/103/0x0002
[   59.801333] 4 locks held by kworker/u:1/103:
[   59.801340]  #0:  ((bond_dev-name)){..}, at: [8105a956]
process_one_work+0x146/0x680
[   59.801345]  #1:  (((bond-mii_work)-work)){..}, at:
[8105a956] process_one_work+0x146/0x680
[   59.801350]  #2:  (rtnl_mutex){..}, at: [815a4dd0]
rtnl_trylock+0x10/0x20
[   59.801356]  #3:  (bond-lock){..}, at: [81480b5d]
bond_mii_monitor+0x2ed/0x640
[   59.801365] Modules linked in: fan kvm_intel mousedev kvm iTCO_wdt
iTCO_vendor_support acpi_cpufreq tpm_tis tpm tpm_bios mperf processor
[   59.801368] Pid: 103, comm: kworker/u:1 Tainted: GW
3.6.8-Isht-Van #4
[   59.801369] Call Trace:
[   59.801373]  [8167bb36] __schedule_bug+0x5e/0x6c
[   59.801380]  [816859bc] __schedule+0x77c/0x810
[   59.801385]  [81685ad4] schedule+0x24/0x70
[   59.801391]  [81684bec]
schedule_hrtimeout_range_clock+0xfc/0x140
[   59.801395]  [81064c80] ? update_rmtp+0x60/0x60
[   59.801399]  [81064c80] ? update_rmtp+0x60/0x60
[   59.801404]  [81065a1f] ? hrtimer_start_range_ns+0xf/0x20
[   59.801409]  [81684c3e] schedule_hrtimeout_range+0xe/0x10
[   59.801414]  [8104bddb] usleep_range+0x3b/0x40
[   59.801419]  [814d213e] ixgbe_release_swfw_sync_X540+0x4e/0x60
[   59.801424]  [814ce5a1] ixgbe_read_phy_reg_generic+0x101/0x120
[   59.801429]  [814ce74c]
ixgbe_get_copper_link_capabilities_generic+0x2c/0x60
[   59.801433]  [81480b5d] ? bond_mii_monitor+0x2ed/0x640
[   59.801441]  [814c6454] ixgbe_get_settings+0x34/0x2b0
[   59.801446]  [8159af55] __ethtool_get_settings+0x85/0x140
[   59.801450]  [8147c6e3] bond_update_speed_duplex+0x23/0x60
[   59.801471]  [81480bc4] bond_mii_monitor+0x354/0x640
[   59.801475]  [8105a9b7] process_one_work+0x1a7/0x680
[   59.801477]  [8105a956] ? process_one_work+0x146/0x680
[   59.801481]  [8108c7ce] ? put_lock_stats.isra.21+0xe/0x40
[   59.801484]  [81480870] ? bond_loadbalance_arp_mon+0x2c0/0x2c0
[   59.801489]  [8105b9ed] worker_thread+0x18d/0x4f0
[   59.801495]  [81070991] ? sub_preempt_count+0x51/0x60
[   59.801500]  [8105b860] ? manage_workers+0x320/0x320
[   59.801505]  [81060f7d] kthread+0x9d/0xb0
[   59.801510]  [816892e4] kernel_thread_helper+0x4/0x10
[   59.801515]  [8106c197] ? finish_task_switch+0x77/0x100
[   59.801519]  [81687526] ? _raw_spin_unlock_irq+0x36/0x60
[   59.801524]  [81687a5d] ? retint_restore_args+0xe/0xe
[   59.801530]  [81060ee0] ? flush_kthread_worker+0x160/0x160
[   59.801536]  [816892e0] ? gs_change+0xb/0xb
[   59.804986] bonding: bond0: link status definitely up for interface
p2p2, 1 Mbps full duplex.

---
-Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] bonding: rlb mode of bond should not alter ARP originating via bridge

2012-11-27 Thread Jay Vosburgh

/* arp requests are broadcast and are sent on the primary
>>   * the arp request will collapse all clients on the subnet to
>>   * the primary slave. We must register these clients to be
>>   * updated with their assigned mac.
>>   */
>>  rlb_req_update_subnet_clients(bond, arp->ip_src);
>> 
>>  that arranges for clients to be given ARP updates for their
>> slave assignments (which may change to the active slave, due to the ARP
>> broadcast being sent via the active slave).
>> 
>>  I think the ARP reply side of this is fine (and is what is
>> described in teh changelog), but the ARP request behavior change is new
>> with this version.
>> 
>>  Since prior versions of the patch didn't cause this code to be
>> skipped, is this change intentional?
>> 
>>  Did you check to see if the above logic is necessary for ARP
>> requests passing through via a bridge to prevent peers from "stacking"
>> (in terms of load balance assignment) on the active slave due to bridged
>> ARP traffic?
>> 
>>  -J
>> 
>>> Signed-off-by: Zheng Li 
>>> Cc: Jay Vosburgh 
>>> Cc: Andy Gospodarek 
>>> Cc: "David S. Miller" 
>>>
>>> ---
>>> drivers/net/bonding/bond_alb.c |6 ++
>>> drivers/net/bonding/bonding.h  |   13 +
>>> 2 files changed, 19 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>>> index e15cc11..75f6f0d 100644
>>> --- a/drivers/net/bonding/bond_alb.c
>>> +++ b/drivers/net/bonding/bond_alb.c
>>> @@ -694,6 +694,12 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, 
>>> struct bonding *bond)
>>> struct arp_pkt *arp = arp_pkt(skb);
>>> struct slave *tx_slave = NULL;
>>>
>>> +   /* Only modify ARP's MAC if it originates locally;
>>> +* don't change ARPs arriving via a bridge.
>>> +*/
>>> +   if (!bond_slave_has_mac(bond, arp->mac_src))
>>> +   return NULL;
>>> +
>>> if (arp->op_code == htons(ARPOP_REPLY)) {
>>> /* the arp must be sent on the selected
>>> * rx channel
>>> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>>> index f8af2fc..6dded56 100644
>>> --- a/drivers/net/bonding/bonding.h
>>> +++ b/drivers/net/bonding/bonding.h
>>> @@ -22,6 +22,7 @@
>>> #include 
>>> #include 
>>> #include 
>>> +#include 
>>> #include "bond_3ad.h"
>>> #include "bond_alb.h"
>>>
>>> @@ -450,6 +451,18 @@ static inline void bond_destroy_proc_dir(struct 
>>> bond_net *bn)
>>> }
>>> #endif
>>>
>>> +static inline struct slave *bond_slave_has_mac(struct bonding *bond,
>>> +  const u8 *mac)
>>> +{
>>> +   int i = 0;
>>> +   struct slave *tmp;
>>> +
>>> +   bond_for_each_slave(bond, tmp, i)
>>> +   if (ether_addr_equal_64bits(mac, tmp->dev->dev_addr))
>>> +   return tmp;
>>> +
>>> +   return NULL;
>>> +}
>>>
>>> /* exported from bond_main.c */
>>> extern int bond_net_id;
>>> -- 
>>> 1.7.6.5

---
-Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] bonding: rlb mode of bond should not alter ARP originating via bridge

2012-11-27 Thread Jay Vosburgh

 to the active slave, due to the ARP
 broadcast being sent via the active slave).
 
  I think the ARP reply side of this is fine (and is what is
 described in teh changelog), but the ARP request behavior change is new
 with this version.
 
  Since prior versions of the patch didn't cause this code to be
 skipped, is this change intentional?
 
  Did you check to see if the above logic is necessary for ARP
 requests passing through via a bridge to prevent peers from stacking
 (in terms of load balance assignment) on the active slave due to bridged
 ARP traffic?
 
  -J
 
 Signed-off-by: Zheng Li zheng.x...@oracle.com
 Cc: Jay Vosburgh fu...@us.ibm.com
 Cc: Andy Gospodarek a...@greyhouse.net
 Cc: David S. Miller da...@davemloft.net

 ---
 drivers/net/bonding/bond_alb.c |6 ++
 drivers/net/bonding/bonding.h  |   13 +
 2 files changed, 19 insertions(+), 0 deletions(-)

 diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
 index e15cc11..75f6f0d 100644
 --- a/drivers/net/bonding/bond_alb.c
 +++ b/drivers/net/bonding/bond_alb.c
 @@ -694,6 +694,12 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, 
 struct bonding *bond)
 struct arp_pkt *arp = arp_pkt(skb);
 struct slave *tx_slave = NULL;

 +   /* Only modify ARP's MAC if it originates locally;
 +* don't change ARPs arriving via a bridge.
 +*/
 +   if (!bond_slave_has_mac(bond, arp-mac_src))
 +   return NULL;
 +
 if (arp-op_code == htons(ARPOP_REPLY)) {
 /* the arp must be sent on the selected
 * rx channel
 diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
 index f8af2fc..6dded56 100644
 --- a/drivers/net/bonding/bonding.h
 +++ b/drivers/net/bonding/bonding.h
 @@ -22,6 +22,7 @@
 #include linux/in6.h
 #include linux/netpoll.h
 #include linux/inetdevice.h
 +#include linux/etherdevice.h
 #include bond_3ad.h
 #include bond_alb.h

 @@ -450,6 +451,18 @@ static inline void bond_destroy_proc_dir(struct 
 bond_net *bn)
 }
 #endif

 +static inline struct slave *bond_slave_has_mac(struct bonding *bond,
 +  const u8 *mac)
 +{
 +   int i = 0;
 +   struct slave *tmp;
 +
 +   bond_for_each_slave(bond, tmp, i)
 +   if (ether_addr_equal_64bits(mac, tmp-dev-dev_addr))
 +   return tmp;
 +
 +   return NULL;
 +}

 /* exported from bond_main.c */
 extern int bond_net_id;
 -- 
 1.7.6.5

---
-Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] bonding: in balance-rr mode, set curr_active_slave only if it is up

2012-11-26 Thread Jay Vosburgh

Michal Kubecek  wrote:

>If all slaves of a balance-rr bond with ARP monitor are enslaved
>with down link state, bond keeps down state even after slaves
>go up.
>
>This is caused by bond_enslave() setting curr_active_slave to
>first slave not taking into account its link state. As
>bond_loadbalance_arp_mon() uses curr_active_slave to identify
>whether slave's down->up transition should update bond's link
>state, bond stays down even if slaves are up (until first slave
>goes from up to down at least once).

The bond_loadbalance_arp_mon function actually uses
curr_active_slave to determine whether or not to do a "failover" (select
a new active slave), which in turn will call bond_set_carrier() from
within bond_select_active_slave().

Other than that nitpick about the description, I see how setting
curr_active_slave to a down slave would cause loadbalance_arp_mon to
skip the "failover" step (because it presumes that an active slave is
always up, and therefore no new one needs to be selected), and thus skip
setting the master's carrier state.

-J

>Before commit f31c7937 "bonding: start slaves with link down for
>ARP monitor", this was masked by slaves always starting in UP
>state with ARP monitor (and MII monitor not relying on
>curr_active_slave being NULL if there is no slave up).
>
>Signed-off-by: Michal Kubecek 

Signed-off-by: Jay Vosburgh 


> drivers/net/bonding/bond_main.c |2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 5f5b69f..c8bff3e 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1838,7 +1838,7 @@ int bond_enslave(struct net_device *bond_dev, struct 
>net_device *slave_dev)
>* anyway (it holds no special properties of the bond device),
>* so we can change it without calling change_active_interface()
>*/
>-  if (!bond->curr_active_slave)
>+  if (!bond->curr_active_slave && new_slave->link == BOND_LINK_UP)
>   bond->curr_active_slave = new_slave;
>
>   break;
>-- 
>1.7.10.4
>

---
-Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 >

1 - 100 of 139 matches

Mail list logo