Re: [PATCH] bonding: 3ad: update slave arr after initialize
jin yiting wrote: [...] >> The described issue is a race condition (in that >> ad_agg_selection_logic clears agg->is_active under mode_lock, but >> bond_open -> bond_update_slave_arr is inspecting agg->is_active outside >> the lock). I don't see how the above change will reliably manage this; >> the real issue looks to be that bond_update_slave_arr is committing >> changes to the array (via bond_reset_slave_arr) based on a racy >> inspection of the active aggregator state while it is in flux. >> >> Also, the description of the issue says "The best aggregator in >> ad_agg_selection_logic has not changed, no need to update slave arr," >> but the change above does the opposite, and will set update_slave_arr >> when the aggregator has not changed (update_slave_arr remains false at >> return of ad_agg_selection_logic). >> >> I believe I understand the described problem, but I don't see >> how the patch fixes it. I suspect (but haven't tested) that the proper >> fix is to acquire mode_lock in bond_update_slave_arr while calling >> bond_3ad_get_active_agg_info to avoid conflict with the state machine. >> >> -J >> >> --- >> -Jay Vosburgh, jay.vosbu...@canonical.com >> . >> > > Thank you for your reply. The last patch does have redundant >update slave arr.Thank you for your correction. > >As you said, holding mode_lock in bond_update_slave_arr while >calling bond_3ad_get_active_agg_info can avoid conflictwith the state >machine. I have tested this patch, with ifdown/ifup operations for bond or >slaves. > >But bond_update_slave_arr is expected to hold RTNL only and NO >other lock. And it have WARN_ON(lockdep_is_held(>mode_lock)); in >bond_update_slave_arr. I'm not sure that holding mode_lock in >bond_update_slave_arr while calling bond_3ad_get_active_agg_info is a >correct action. That WARN_ON came up in discussion recently, and my opinion is that it's incorrect, and is trying to insure bond_update_slave_arr is safe for a potential sleep when allocating memory. https://lore.kernel.org/netdev/20210322123846.3024549-1-maxi...@nvidia.com/ The original authors haven't replied, so I would suggest you remove the WARN_ON and the surrounding CONFIG_LOCKDEP ifdefs as part of your patch and replace it with a call to might_sleep. The other callers of bond_3ad_get_active_agg_info are generally obtaining the state in order to report it to user space, so I think it's safe to leave those calls not holding the mode_lock. The race is still there, but the data returned to user space is a snapshot and so may reflect an incomplete state during a transition. Further, having the inspection functions acquire the mode_lock permits user space to spam the lock with little effort. -J >diff --git a/drivers/net/bonding/bond_main.c >b/drivers/net/bonding/bond_main.c >index 74cbbb2..db988e5 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -4406,7 +4406,9 @@ int bond_update_slave_arr(struct bonding *bond, >struct slave *skipslave) >if (BOND_MODE(bond) == BOND_MODE_8023AD) { >struct ad_info ad_info; > >+ spin_lock_bh(>mode_lock); >if (bond_3ad_get_active_agg_info(bond, _info)) { >+ spin_unlock_bh(>mode_lock); >pr_debug("bond_3ad_get_active_agg_info failed\n"); >/* No active aggragator means it's not safe to use > * the previous array. >@@ -4414,6 +4416,7 @@ int bond_update_slave_arr(struct bonding *bond, >struct slave *skipslave) >bond_reset_slave_arr(bond); >goto out; >} >+ spin_unlock_bh(>mode_lock); >agg_id = ad_info.aggregator_id; >} >bond_for_each_slave(bond, slave, iter) { --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding: 3ad: update slave arr after initialize
jinyiting wrote: >From: jin yiting > >The bond works in mode 4, and performs down/up operations on the bond >that is normally negotiated. The probability of bond-> slave_arr is NULL > >Test commands: >ifconfig bond1 down >ifconfig bond1 up > >The conflict occurs in the following process: > >__dev_open (CPU A) > --bond_open > --queue_delayed_work(bond->wq,>ad_work,0); > --bond_update_slave_arr > --bond_3ad_get_active_agg_info > >ad_work(CPU B) > --bond_3ad_state_machine_handler > --ad_agg_selection_logic > >ad_work runs on cpu B. In the function ad_agg_selection_logic, all >agg->is_active will be cleared. Before the new active aggregator is >selected on CPU B, bond_3ad_get_active_agg_info failed on CPU A, >bond->slave_arr will be set to NULL. The best aggregator in >ad_agg_selection_logic has not changed, no need to update slave arr. > >Signed-off-by: jin yiting >--- > drivers/net/bonding/bond_3ad.c | 6 ++ > 1 file changed, 6 insertions(+) > >diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c >index 6908822..d100079 100644 >--- a/drivers/net/bonding/bond_3ad.c >+++ b/drivers/net/bonding/bond_3ad.c >@@ -2327,6 +2327,12 @@ void bond_3ad_state_machine_handler(struct work_struct >*work) > > aggregator = __get_first_agg(port); > ad_agg_selection_logic(aggregator, _slave_arr); >+ if (!update_slave_arr) { >+ struct aggregator *active = >__get_active_agg(aggregator); >+ >+ if (active && active->is_active) >+ update_slave_arr = true; >+ } > } > bond_3ad_set_carrier(bond); > } The described issue is a race condition (in that ad_agg_selection_logic clears agg->is_active under mode_lock, but bond_open -> bond_update_slave_arr is inspecting agg->is_active outside the lock). I don't see how the above change will reliably manage this; the real issue looks to be that bond_update_slave_arr is committing changes to the array (via bond_reset_slave_arr) based on a racy inspection of the active aggregator state while it is in flux. Also, the description of the issue says "The best aggregator in ad_agg_selection_logic has not changed, no need to update slave arr," but the change above does the opposite, and will set update_slave_arr when the aggregator has not changed (update_slave_arr remains false at return of ad_agg_selection_logic). I believe I understand the described problem, but I don't see how the patch fixes it. I suspect (but haven't tested) that the proper fix is to acquire mode_lock in bond_update_slave_arr while calling bond_3ad_get_active_agg_info to avoid conflict with the state machine. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next] bonding: Added -ENODEV interpret for slaves option
Jianlin Lv wrote: >After upgrading the kernel, the slave interface name is changed, >Systemd cannot use the original configuration to create bond interface, >thereby losing the connection with the host. > >Adding log for ENODEV will make it easier to find out such problem lies. To be clear, this specifically affects add/remove of interfaces to/from the bond via the "slaves" sysfs interface. Please update your log to better describe this (that it affects the sysfs API only) and resubmit. I'm sympathetic to the problem this is trying to solve, and the message shouldn't spam the kernel log particularly, but the commit log needs to more clearly describe what the problem is and how it's being fixed. Thanks, -J >Signed-off-by: Jianlin Lv >--- > drivers/net/bonding/bond_options.c | 9 + > 1 file changed, 9 insertions(+) > >diff --git a/drivers/net/bonding/bond_options.c >b/drivers/net/bonding/bond_options.c >index 77d7c38bd435..c9d3604ae129 100644 >--- a/drivers/net/bonding/bond_options.c >+++ b/drivers/net/bonding/bond_options.c >@@ -640,6 +640,15 @@ static void bond_opt_error_interpret(struct bonding *bond, > netdev_err(bond->dev, "option %s: unable to set because the > bond device is up\n", > opt->name); > break; >+ case -ENODEV: >+ if (val && val->string) { >+ p = strchr(val->string, '\n'); >+ if (p) >+ *p = '\0'; >+ netdev_err(bond->dev, "option %s: interface %s does not >exist!\n", >+ opt->name, val->string); >+ } >+ break; > default: > break; > } >-- >2.25.1 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: question about bonding mode 4
ry port structure contains an aggregator structure, so there should always be one available somewhere. I'm going to speculate that there's a race condition somewhere in the teardown processing vs the LACP state machine that invalidates this presumption. >> The detailed code analysis is as follows: [...] >> /* if all aggregator's ports are READY_N == TRUE, set ready=TRUE >> * in all aggregator's ports, else set ready=FALSE in all >> * aggregator's ports >> */ >> __set_agg_ports_ready(port->aggregator, >>__agg_ports_are_ready(port->aggregator)); >> >> analysis: port->aggregator is still NULL, which causes problem. >> >> aggregator = __get_first_agg(port); >> ad_agg_selection_logic(aggregator, update_slave_arr); >> >> if (!port->aggregator->is_active) >> port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION; Correct, if the "did not find a suitable aggregator" path is taken, port->aggregator is NULL and bad things happen in the above block. This is something that needs to be fixed, but I'm also concerned that there are other issues lurking, so I'd like to be able to reproduce this. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next v2] bonding: add a vlan+mac tx hashing option
kb) >{ >struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb); >u32 srcmac = 0; >u16 vlan; >int i; > >for (i = 0; i < ETH_ALEN; i++) >srcmac = (srcmac << 8) | mac_hdr->h_source[i]; I think this will shift h_source[0] and [1] into oblivion. >if (!skb_vlan_tag_present(skb)) >return srcmac; > >vlan = skb_vlan_tag_get(skb); > >return vlan ^ srcmac; >} > >Then the documentation is spot-on, and we're future-proof, though >marginally less performant in calculating the hash, which may have been a >consideration when the original function was written, but is probably >basically irrelevant w/modern systems... > >> >The default value is layer2. This option was added in bonding >> >version 2.6.3. In earlier versions of bonding, this parameter >> >does not exist, and the layer2 policy is the only policy. The >> >> > +static inline u32 bond_vlan_srcmac_hash(struct sk_buff *skb) >> >> Can we drop the inline? It's a static function called once. > >Works for me. That was also inherited by copying bond_eth_hash(). :) > >> > +{ >> > + struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb); >> >> I don't see anything in the patch making sure the interface actually >> has a L2 header. Should we validate somehow the ifc is Ethernet? > >I don't think it's necessary. There doesn't appear to be any explicit >check for BOND_XMIT_POLICY_LAYER2 either. I believe we're guaranteed to >not have anything but an ethernet header here, as the only other type I'm >aware of being supported is Infiniband, but we limit that to active-backup >only, and xmit_hash_policy isn't valid for active-backup. This is correct, interfaces in a bond other than active-backup will all be ARPHRD_ETHER. I'm unaware of a way to get a packet in there without at least an Ethernet header. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option
Jarod Wilson wrote: >On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote: >> On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote: >> > Jarod Wilson wrote: >> > >> > >This comes from an end-user request, where they're running multiple VMs on >> > >hosts with bonded interfaces connected to some interest switch topologies, >> > >where 802.3ad isn't an option. They're currently running a proprietary >> > >solution that effectively achieves load-balancing of VMs and bandwidth >> > >utilization improvements with a similar form of transmission algorithm. >> > > >> > >Basically, each VM has it's own vlan, so it always sends its traffic out >> > >the same interface, unless that interface fails. Traffic gets split >> > >between the interfaces, maintaining a consistent path, with failover still >> > >available if an interface goes down. >> > > >> > >This has been rudimetarily tested to provide similar results, suitable for >> > >them to use to move off their current proprietary solution. >> > > >> > >Still on the TODO list, if these even looks sane to begin with, is >> > >fleshing out Documentation/networking/bonding.rst. >> > >> >I'm sure you're aware, but any final submission will also need >> > to include netlink and iproute2 support. >> >> I believe everything for netlink support is already included, but I'll >> double-check that before submitting something for inclusion consideration. > >I'm not certain if what you actually meant was that I'd have to patch >iproute2 as well, which I've definitely stumbled onto today, but it's a >2-line patch, and everything seems to be working fine with it: Yes, that's what I meant. >$ sudo ip link set bond0 type bond xmit_hash_policy 5 Does the above work with the text label (presumably "vlansrc") as well as the number, and does "ip link add test type bond help" print the correct text for XMIT_HASH_POLICY? -J >$ ip -d link show bond0 >11: bond0: mtu 1500 qdisc noop state DOWN mode >DEFAULT group default qlen 1000 >link/ether ce:85:5e:24:ce:90 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 > maxmtu 65535 >bond mode balance-xor miimon 0 updelay 0 downdelay 0 peer_notify_delay 0 > use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any > primary_reselect always fail_over_mac none xmit_hash_policy vlansrc > resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 > packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 > addrgenmode eui64 numtxqueues 16 numrxqueues 16 gso_max_size 65536 > gso_max_segs 65535 >$ grep Hash /proc/net/bonding/bond0 >Transmit Hash Policy: vlansrc (5) > >Nothing bad seems to happen on an older kernel if one tries to set the new >hash, you just get told that it's an invalid argument. > >I *think* this is all ready for submission then, so I'll get both the kernel >and iproute2 patches out soon. > >-- >Jarod Wilson >ja...@redhat.com --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option
Jarod Wilson wrote: >This comes from an end-user request, where they're running multiple VMs on >hosts with bonded interfaces connected to some interest switch topologies, >where 802.3ad isn't an option. They're currently running a proprietary >solution that effectively achieves load-balancing of VMs and bandwidth >utilization improvements with a similar form of transmission algorithm. > >Basically, each VM has it's own vlan, so it always sends its traffic out >the same interface, unless that interface fails. Traffic gets split >between the interfaces, maintaining a consistent path, with failover still >available if an interface goes down. > >This has been rudimetarily tested to provide similar results, suitable for >them to use to move off their current proprietary solution. > >Still on the TODO list, if these even looks sane to begin with, is >fleshing out Documentation/networking/bonding.rst. I'm sure you're aware, but any final submission will also need to include netlink and iproute2 support. >Cc: Jay Vosburgh >Cc: Veaceslav Falico >Cc: Andy Gospodarek >Cc: "David S. Miller" >Cc: Jakub Kicinski >Cc: Thomas Davis >Cc: net...@vger.kernel.org >Signed-off-by: Jarod Wilson >--- > drivers/net/bonding/bond_main.c| 27 +-- > drivers/net/bonding/bond_options.c | 1 + > include/linux/netdevice.h | 1 + > include/uapi/linux/if_bonding.h| 1 + > 4 files changed, 28 insertions(+), 2 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 5fe5232cc3f3..151ce8c7a56f 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -164,7 +164,7 @@ module_param(xmit_hash_policy, charp, 0); > MODULE_PARM_DESC(xmit_hash_policy, "balance-alb, balance-tlb, balance-xor, > 802.3ad hashing method; " > "0 for layer 2 (default), 1 for layer 3+4, " > "2 for layer 2+3, 3 for encap layer 2+3, " >- "4 for encap layer 3+4"); >+ "4 for encap layer 3+4, 5 for vlan+srcmac"); > module_param(arp_interval, int, 0); > MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds"); > module_param_array(arp_ip_target, charp, NULL, 0); >@@ -1434,6 +1434,8 @@ static enum netdev_lag_hash bond_lag_hash_type(struct >bonding *bond, > return NETDEV_LAG_HASH_E23; > case BOND_XMIT_POLICY_ENCAP34: > return NETDEV_LAG_HASH_E34; >+ case BOND_XMIT_POLICY_VLAN_SRCMAC: >+ return NETDEV_LAG_HASH_VLAN_SRCMAC; > default: > return NETDEV_LAG_HASH_UNKNOWN; > } >@@ -3494,6 +3496,20 @@ static bool bond_flow_ip(struct sk_buff *skb, struct >flow_keys *fk, > return true; > } > >+static inline u32 bond_vlan_srcmac_hash(struct sk_buff *skb) >+{ >+ struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb); >+ u32 srcmac = mac_hdr->h_source[5]; >+ u16 vlan; >+ >+ if (!skb_vlan_tag_present(skb)) >+ return srcmac; >+ >+ vlan = skb_vlan_tag_get(skb); >+ >+ return srcmac ^ vlan; For the common configuration wherein multiple VLANs are configured atop a single interface (and thus by default end up with the same MAC address), this seems like a fairly weak hash. The TCI is 16 bits (12 of which are the VID), but only 8 are used from the MAC, which will be a constant. Is this algorithm copying the proprietary solution you mention? -J >+} >+ > /* Extract the appropriate headers based on bond's xmit policy */ > static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb, > struct flow_keys *fk) >@@ -3501,10 +3517,14 @@ static bool bond_flow_dissect(struct bonding *bond, >struct sk_buff *skb, > bool l34 = bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34; > int noff, proto = -1; > >- if (bond->params.xmit_policy > BOND_XMIT_POLICY_LAYER23) { >+ switch (bond->params.xmit_policy) { >+ case BOND_XMIT_POLICY_ENCAP23: >+ case BOND_XMIT_POLICY_ENCAP34: > memset(fk, 0, sizeof(*fk)); > return __skb_flow_dissect(NULL, skb, _keys_bonding, > fk, NULL, 0, 0, 0, 0); >+ default: >+ break; > } > > fk->ports.ports = 0; >@@ -3556,6 +3576,9 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff >*skb) > skb->l4_hash) > return skb->hash; > >+ if (bond->params.xmit_policy == BOND_XMIT_PO
Re: [PATCH net-next] bonding: correct rr balancing during link failure
Jakub Kicinski wrote: >On Wed, 02 Dec 2020 20:55:57 + Lars Everbrand wrote: >> This patch updates the sending algorithm for roundrobin to avoid >> over-subscribing interface(s) when one or more interfaces in the bond is >> not able to send packets. This happened when order was not random and >> more than 2 interfaces were used. >> >> Previously the algorithm would find the next available interface >> when an interface failed to send by, this means that most often it is >> current_interface + 1. The problem is that when the next packet is to be >> sent and the "normal" algorithm then continues with interface++ which >> then hits that same interface again. >> >> This patch updates the resending algorithm to update the global counter >> of the next interface to use. >> >> Example (prior to patch): >> >> Consider 6 x 100 Mbit/s interfaces in a rr bond. The normal order of links >> being used to send would look like: >> 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 ... >> >> If, for instance, interface 2 where unable to send the order would have been: >> 1 3 3 4 5 6 1 3 3 4 5 6 1 3 3 4 5 6 ... >> >> The resulting speed (for TCP) would then become: >> 50 + 0 + 100 + 50 + 50 + 50 = 300 Mbit/s >> instead of the expected 500 Mbit/s. >> >> If interface 3 also would fail the resulting speed would be half of the >> expected 400 Mbit/s (33 + 0 + 0 + 100 + 33 + 33). Are these bandwidth numbers from observation of the actual behavior? I'm not sure the real system would behave this way; my suspicion is that it would increase the likelihood of drops on the overused slave, not that the overall capacity would be limited. >> Signed-off-by: Lars Everbrand > >Thanks for the patch! > >Looking at the code in question it feels a little like we're breaking >abstractions if we bump the counter directly in get_slave_by_id. Agreed; I think a better way to fix this is to enable the slave array for balance-rr mode, and then use the array to find the right slave. This way, we then avoid the problematic "skip unable to tx" logic for free. >For one thing when the function is called for IGMP packets the counter >should not be incremented at all. But also if packets_per_slave is not >1 we'd still be hitting the same leg multiple times (packets_per_slave >/ 2). So it seems like we should round the counter up somehow? > >For IGMP maybe we don't have to call bond_get_slave_by_id() at all, >IMHO, just find first leg that can TX. Then we can restructure >bond_get_slave_by_id() appropriately for the non-IGMP case. For IGMP, the theory is to confine that traffic to a single device. Normally, this will be curr_active_slave, which is updated even in balance-rr mode as interfaces are added to or removed from the bond. The call to bond_get_slave_by_id should be a fallback in case curr_active_slave is empty, and should be the exception, and may not be possible at all. But either way, the IGMP path shouldn't mess with rr_tx_counter, it should be out of band of the normal TX packet counting, so to speak. -J >> diff --git a/drivers/net/bonding/bond_main.c >> b/drivers/net/bonding/bond_main.c >> index e0880a3840d7..e02d9c6d40ee 100644 >> --- a/drivers/net/bonding/bond_main.c >> +++ b/drivers/net/bonding/bond_main.c >> @@ -4107,6 +4107,7 @@ static struct slave *bond_get_slave_by_id(struct >> bonding *bond, >> if (--i < 0) { >> if (bond_slave_can_tx(slave)) >> return slave; >> +bond->rr_tx_counter++; >> } >> } >> >> @@ -4117,6 +4118,7 @@ static struct slave *bond_get_slave_by_id(struct >> bonding *bond, >> break; >> if (bond_slave_can_tx(slave)) >> return slave; >> +bond->rr_tx_counter++; >> } >> /* no slave that can tx has been found */ >> return NULL; > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net] bonding: reduce rtnl lock contention in mii monitor thread
Jarod Wilson wrote: >I'm seeing a system get stuck unable to bring a downed interface back up >when it's got an updelay value set, behavior which ceased when logging >spew was removed from bond_miimon_inspect(). I'm monitoring logs on this >system over another network connection, and it seems that the act of >spewing logs at all there increases rtnl lock contention, because >instrumented code showed bond_mii_monitor() never able to succeed in it's >attempts to call rtnl_trylock() to actually commit link state changes, >leaving the downed link stuck in BOND_LINK_DOWN. The system in question >appears to be fine with the log spew being moved to >bond_commit_link_state(), which is called after the successful >rtnl_trylock(). I'm actually wondering if perhaps we ultimately need/want >some bond-specific lock here to prevent racing with bond_close() instead >of using rtnl, but this shift of the output appears to work. I believe >this started happening when de77ecd4ef02 ("bonding: improve link-status >update in mii-monitoring") went in, but I'm not 100% on that. We use RTNL not to avoid deadlock with bonding itself, but because the "commit" side undertakes actions which require RTNL, e.g., various events will eventually call netdev_lower_state_changed. However, the RTNL acquisition is a trylock to avoid the deadlock with bond_close. Moving that out of line here (e.g., putting the commit into another work queue event or the like) has the same problem, in that bond_close needs to wait for all of the work queue events to finish, and it holds RTNL. Also, a dim memory says that the various notification messages were mostly placed in the "inspect" phase and not the "commit" phase to avoid doing printk-like activities with RTNL held. As a general principle, I don't think we want to add more verbiage under RTNL. >The addition of a case BOND_LINK_BACK in bond_miimon_inspect() is somewhat >separate from the fix for the actual hang, but it eliminates a constant >"invalid new link 3 on slave" message seen related to this issue, and it's >not actually an invalid state here, so we shouldn't be reporting it as an >error. Do you mean bond_miimon_commit here and not bond_miimon_inspect (which already has a case for BOND_LINK_BACK)? In principle, bond_miimon_commit should not see _BACK or _FAIL state as a new link state, because those states should be managed at the bond_miimon_inspect level (as they are the result of updelay and downdelay). These states should not be "committed" in the sense of causing notifications or doing actions that require RTNL. My recollection is that the "invalid new link" messages were the result of a bug in de77ecd4ef02, which was fixed in 1899bb325149 ("bonding: fix state transition issue in link monitoring"), but maybe the RTNL problem here induces that in some other fashion. Either way, I believe this message is correct as-is. -J >CC: Mahesh Bandewar >CC: Jay Vosburgh >CC: Veaceslav Falico >CC: Andy Gospodarek >CC: "David S. Miller" >CC: Jakub Kicinski >CC: Thomas Davis >CC: net...@vger.kernel.org >Signed-off-by: Jarod Wilson >--- > drivers/net/bonding/bond_main.c | 26 ++ > include/net/bonding.h | 38 + > 2 files changed, 44 insertions(+), 20 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 47afc5938c26..cdb6c64f16b6 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -2292,23 +2292,13 @@ static int bond_miimon_inspect(struct bonding *bond) > bond_propose_link_state(slave, BOND_LINK_FAIL); > commit++; > slave->delay = bond->params.downdelay; >- if (slave->delay) { >- slave_info(bond->dev, slave->dev, "link status >down for %sinterface, disabling it in %d ms\n", >- (BOND_MODE(bond) == >- BOND_MODE_ACTIVEBACKUP) ? >- (bond_is_active_slave(slave) ? >- "active " : "backup ") : "", >- bond->params.downdelay * >bond->params.miimon); >- } >+ > fallthrough; > case BOND_LINK_FAIL: > if (link_state) { > /* recovered before downdelay expired */ > bond_propose_link_state(slave, BOND_LIN
Re: [PATCH net v2] bonding: fix feature flag setting at init time
Jarod Wilson wrote: >On Wed, Dec 2, 2020 at 12:55 PM Jay Vosburgh >wrote: >> >> Jarod Wilson wrote: >> >> >Don't try to adjust XFRM support flags if the bond device isn't yet >> >registered. Bad things can currently happen when netdev_change_features() >> >is called without having wanted_features fully filled in yet. Basically, >> >this code was racing against register_netdevice() filling in >> >wanted_features, and when it got there first, the empty wanted_features >> >led to features also getting emptied out, which was definitely not the >> >intended behavior, so prevent that from happening. >> >> Is this an actual race? Reading Ivan's prior message, it sounds >> like it's an ordering problem (in that bond_newlink calls >> register_netdevice after bond_changelink). > >Sorry, yeah, this is not actually a race condition, just an ordering >issue, bond_check_params() gets called at init time, which leads to >bond_option_mode_set() being called, and does so prior to >bond_create() running, which is where we actually call >register_netdevice(). So this only happens if there's a "mode" module parameter? That doesn't sound like the call path that Ivan described (coming in via bond_newlink). -J >> The change to bond_option_mode_set tests against reg_state, so >> presumably it wants to skip the first(?) time through, before the >> register_netdevice call; is that right? > >Correct. Later on, when the bonding driver is already loaded, and >parameter changes are made, bond_option_mode_set() gets called and if >the mode changes to or from active-backup, we do need/want this code >to run to update wanted and features flags properly. > > >-- >Jarod Wilson >ja...@redhat.com --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net v2] bonding: fix feature flag setting at init time
Jarod Wilson wrote: >Don't try to adjust XFRM support flags if the bond device isn't yet >registered. Bad things can currently happen when netdev_change_features() >is called without having wanted_features fully filled in yet. Basically, >this code was racing against register_netdevice() filling in >wanted_features, and when it got there first, the empty wanted_features >led to features also getting emptied out, which was definitely not the >intended behavior, so prevent that from happening. Is this an actual race? Reading Ivan's prior message, it sounds like it's an ordering problem (in that bond_newlink calls register_netdevice after bond_changelink). The change to bond_option_mode_set tests against reg_state, so presumably it wants to skip the first(?) time through, before the register_netdevice call; is that right? -J >Originally, I'd hoped to stop adjusting wanted_features at all in the >bonding driver, as it's documented as being something only the network >core should touch, but we actually do need to do this to properly update >both the features and wanted_features fields when changing the bond type, >or we get to a situation where ethtool sees: > >esp-hw-offload: off [requested on] > >I do think we should be using netdev_update_features instead of >netdev_change_features here though, so we only send notifiers when the >features actually changed. > >v2: rework based on further testing and suggestions from ivecera > >Fixes: a3b658cfb664 ("bonding: allow xfrm offload setup post-module-load") >Reported-by: Ivan Vecera >Suggested-by: Ivan Vecera >Cc: Jay Vosburgh >Cc: Veaceslav Falico >Cc: Andy Gospodarek >Cc: "David S. Miller" >Cc: Jakub Kicinski >Cc: Thomas Davis >Cc: net...@vger.kernel.org >Signed-off-by: Jarod Wilson >--- > drivers/net/bonding/bond_main.c| 10 -- > drivers/net/bonding/bond_options.c | 6 +- > 2 files changed, 9 insertions(+), 7 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index e0880a3840d7..5fe5232cc3f3 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -4746,15 +4746,13 @@ void bond_setup(struct net_device *bond_dev) > NETIF_F_HW_VLAN_CTAG_FILTER; > > bond_dev->hw_features |= NETIF_F_GSO_ENCAP_ALL; >-#ifdef CONFIG_XFRM_OFFLOAD >- bond_dev->hw_features |= BOND_XFRM_FEATURES; >-#endif /* CONFIG_XFRM_OFFLOAD */ > bond_dev->features |= bond_dev->hw_features; > bond_dev->features |= NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_STAG_TX; > #ifdef CONFIG_XFRM_OFFLOAD >- /* Disable XFRM features if this isn't an active-backup config */ >- if (BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) >- bond_dev->features &= ~BOND_XFRM_FEATURES; >+ bond_dev->hw_features |= BOND_XFRM_FEATURES; >+ /* Only enable XFRM features if this is an active-backup config */ >+ if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP) >+ bond_dev->features |= BOND_XFRM_FEATURES; > #endif /* CONFIG_XFRM_OFFLOAD */ > } > >diff --git a/drivers/net/bonding/bond_options.c >b/drivers/net/bonding/bond_options.c >index 9abfaae1c6f7..19205cfac751 100644 >--- a/drivers/net/bonding/bond_options.c >+++ b/drivers/net/bonding/bond_options.c >@@ -768,11 +768,15 @@ static int bond_option_mode_set(struct bonding *bond, > bond->params.tlb_dynamic_lb = 1; > > #ifdef CONFIG_XFRM_OFFLOAD >+ if (bond->dev->reg_state != NETREG_REGISTERED) >+ goto noreg; >+ > if (newval->value == BOND_MODE_ACTIVEBACKUP) > bond->dev->wanted_features |= BOND_XFRM_FEATURES; > else > bond->dev->wanted_features &= ~BOND_XFRM_FEATURES; >- netdev_change_features(bond->dev); >+ netdev_update_features(bond->dev); >+noreg: > > #endif /* CONFIG_XFRM_OFFLOAD */ > > /* don't cache arp_validate between modes */ >-- >2.28.0 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH 2/5] bonding: replace use of the term master where possible
Jarod Wilson wrote: >Simply refer to what was the bonding "master" as the "bond" or bonding >device, depending on context. However, do retain compat code for the >bonding_masters sysfs interface to avoid breaking userspace. > >Cc: Jay Vosburgh >Cc: Veaceslav Falico >Cc: Andy Gospodarek >Cc: "David S. Miller" >Cc: Jakub Kicinski >Cc: Thomas Davis >Cc: net...@vger.kernel.org >Signed-off-by: Jarod Wilson >--- > drivers/infiniband/core/cma.c | 2 +- > drivers/infiniband/core/lag.c | 2 +- > drivers/infiniband/core/roce_gid_mgmt.c | 6 +- > drivers/net/bonding/bond_3ad.c| 2 +- > drivers/net/bonding/bond_main.c | 58 +-- > drivers/net/bonding/bond_procfs.c | 4 +- > drivers/net/bonding/bond_sysfs.c | 8 +-- > .../net/ethernet/mellanox/mlx4/en_netdev.c| 10 ++-- > .../ethernet/netronome/nfp/flower/lag_conf.c | 2 +- > .../ethernet/qlogic/netxen/netxen_nic_main.c | 8 +-- > include/linux/netdevice.h | 8 +-- > include/net/bonding.h | 4 +- > 12 files changed, 58 insertions(+), 56 deletions(-) > >diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c >index a77750b8954d..3a1679d16e19 100644 >--- a/drivers/infiniband/core/cma.c >+++ b/drivers/infiniband/core/cma.c >@@ -4753,7 +4753,7 @@ static int cma_netdev_callback(struct notifier_block >*self, unsigned long event, > if (event != NETDEV_BONDING_FAILOVER) > return NOTIFY_DONE; > >- if (!netif_is_bond_master(ndev)) >+ if (!netif_is_bond_dev(ndev)) > return NOTIFY_DONE; > > mutex_lock(); >diff --git a/drivers/infiniband/core/lag.c b/drivers/infiniband/core/lag.c >index 7063e41eaf26..2afaca2f9d0b 100644 >--- a/drivers/infiniband/core/lag.c >+++ b/drivers/infiniband/core/lag.c >@@ -128,7 +128,7 @@ struct net_device *rdma_lag_get_ah_roce_slave(struct >ib_device *device, > dev_hold(master); > rcu_read_unlock(); > >- if (!netif_is_bond_master(master)) >+ if (!netif_is_bond_dev(master)) > goto put; > > slave = rdma_get_xmit_slave_udp(device, master, ah_attr, flags); >diff --git a/drivers/infiniband/core/roce_gid_mgmt.c >b/drivers/infiniband/core/roce_gid_mgmt.c >index 6b8364bb032d..e06cf51f1773 100644 >--- a/drivers/infiniband/core/roce_gid_mgmt.c >+++ b/drivers/infiniband/core/roce_gid_mgmt.c >@@ -129,7 +129,7 @@ enum bonding_slave_state { > static enum bonding_slave_state is_eth_active_slave_of_bonding_rcu(struct > net_device *dev, > struct > net_device *upper) > { >- if (upper && netif_is_bond_master(upper)) { >+ if (upper && netif_is_bond_dev(upper)) { > struct net_device *pdev = > bond_option_active_slave_get_rcu(netdev_priv(upper)); > >@@ -216,7 +216,7 @@ is_ndev_for_default_gid_filter(struct ib_device *ib_dev, >u8 port, >* make sure that it the upper netdevice of rdma netdevice. >*/ > res = ((cookie_ndev == rdma_ndev && !netif_is_bond_slave(rdma_ndev)) || >- (netif_is_bond_master(cookie_ndev) && >+ (netif_is_bond_dev(cookie_ndev) && > rdma_is_upper_dev_rcu(rdma_ndev, cookie_ndev))); > > rcu_read_unlock(); >@@ -271,7 +271,7 @@ is_upper_ndev_bond_master_filter(struct ib_device *ib_dev, >u8 port, > return false; > > rcu_read_lock(); >- if (netif_is_bond_master(cookie_ndev) && >+ if (netif_is_bond_dev(cookie_ndev) && > rdma_is_upper_dev_rcu(rdma_ndev, cookie_ndev)) > match = true; > rcu_read_unlock(); >diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c >index 0eb717b0bfc6..852b9c4f6a47 100644 >--- a/drivers/net/bonding/bond_3ad.c >+++ b/drivers/net/bonding/bond_3ad.c >@@ -2550,7 +2550,7 @@ void bond_3ad_handle_link_change(struct slave *slave, >char link) > } > > /** >- * bond_3ad_set_carrier - set link state for bonding master >+ * bond_3ad_set_carrier - set link state for bonding device > * @bond: bonding structure > * > * if we have an active aggregator, we're up, if not, we're down. >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index d79643f6b01e..e9cc7d68f3b9 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -469,8 +469,8 @@ static const struct xfrmdev_ops bond_xfrmdev_ops = { > > /*--- Link stat
Re: [PATCH net-next v4 0/5] bonding: rename bond components
Jarod Wilson wrote: >The bonding driver's use of master and slave, while largely understood >in technical circles, poses a barrier for inclusion to some potential >members of the development and user community, due to the historical >context of masters and slaves, particularly in the United States. This >is a first full pass at replacing those phrases with more socially >inclusive ones, opting for bond to replace master and port to >replace slave, which is congruent with the bridge and team drivers. > >There are a few problems with this change. First up, "port" is used in >the bonding 802.3ad code, so the first step here is to rename port to >ad_port, so we can reuse port. Second, we have the issue of not wanting >to break any existing userspace, which I believe this patchset >accomplishes, preserving all existing sysfs and procfs interfaces, and >adding module parameter aliases where necessary. > >Third, we do still have the issue of ease of backporting fixes to >-stable trees. I've not had a huge amount of time to spend on it, but >brief forays into coccinelle didn't really pay off (since it's meant to >operate on code, not patches), and the best solution I can come up with >is providing a shell script someone could run over git-format-patch >output before git-am'ing the result to a -stable tree, though scripting >these changes in the first place turned out to be not the best thing to >do anyway, due to subtle cases where use of master or slave can NOT yet >be replaced, so a large amount of work was done by hand, inspection, >trial and error, which is why this set is a lot longer in coming than >I'd originally hoped. I don't expect -stable backports to be horrible to >figure out one way or another though, and I don't believe that a bit of >inconvenience on that front is enough to warrant not making these >changes. I think this undersells the impact a bit; this will most likely break the majority of cherry-picks for the bonding driver to stable going forward should this patch set be committed. Yes, the volume of patches to bonding is relatively low, and the manual backports are not likely to be technically difficult. Nevertheless, I expect that most bonding backports to stable that cross this patch set will require manual intervention. As such, I'd still like to see explicit direction from the kernel development community leadership that change sets of this nature (not technically driven, with long term maintenance implications) are changes that should be undertaken rather than are merely permitted. -J >See here for further details on Red Hat's commitment to this work: >https://www.redhat.com/en/blog/making-open-source-more-inclusive-eradicating-problematic-language > >As far as testing goes, I've manually operated on various bonds while >working on this code, and have run it through multiple lnst test runs, >which exercises the existing sysfs interfaces fairly extensively. As far >as I can tell through testing and inspection, there is no breakage of >any existing interfaces with this set. > >v2: legacy module parameters are retained this time, and we're trying >out bond/port instead of aggregator/link in place of master/slave. The >procfs interface legacy output is also duplicated or dropped, depending >on Kconfig, rather than being replaced. > >v3: remove Kconfig knob, leave sysfs and procfs interfaces entirely >untouched, but update documentation to reference their deprecated >nature, explain the name changes, add references to NetworkManager, >include more netlink/iproute2 examples and make note of netlink >being the preferred interface for userspace interaction with bonds. > >v4: documentation table of contents fixes > >Cc: Jay Vosburgh >Cc: Veaceslav Falico >Cc: Andy Gospodarek >Cc: "David S. Miller" >Cc: Jakub Kicinski >Cc: Thomas Davis >Cc: net...@vger.kernel.org > >Jarod Wilson (5): > bonding: rename 802.3ad's struct port to ad_port > bonding: replace use of the term master where possible > bonding: rename slave to port where possible > bonding: rename bonding_sysfs_slave.c to _port.c > bonding: update Documentation for port/bond terminology > > .clang-format |4 +- > Documentation/networking/bonding.rst | 581 ++-- > drivers/infiniband/core/cma.c |2 +- > drivers/infiniband/core/lag.c |2 +- > drivers/infiniband/core/roce_gid_mgmt.c | 10 +- > drivers/infiniband/hw/mlx4/main.c |2 +- > drivers/net/bonding/Makefile |2 +- > drivers/net/bonding/bond_3ad.c| 1701 ++-- > drivers/net/bonding/bond_alb.c| 689 ++--- > drivers/net/bonding/bond_debugfs
Re: [PATCH] net: bonding: alb disable balance for IPv6 multicast related mac
will be >enabled. > >[1] https://tools.ietf.org/html/rfc2464#section-7 >[2] https://tools.ietf.org/html/rfc4861 >[3] linux.git/tree/include/net/if_inet6.h#n209-n221 >[4] linux.git/tree/net/ipv6/ndisc.c#n291 >[5] linux.git/tree/net/ipv6/ndisc.c#n346-n348 >[6] https://en.citizendium.org/wiki/Neighbor_Discovery > >Signed-off-by: LIU Yulong >--- > drivers/net/bonding/bond_alb.c | 10 -- > include/linux/etherdevice.h| 12 > 2 files changed, 16 insertions(+), 6 deletions(-) > >diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c >index 095ea51..a4a30bd 100644 >--- a/drivers/net/bonding/bond_alb.c >+++ b/drivers/net/bonding/bond_alb.c >@@ -24,9 +24,6 @@ > #include > #include > >-static const u8 mac_v6_allmcast[ETH_ALEN + 2] __long_aligned = { >- 0x33, 0x33, 0x00, 0x00, 0x00, 0x01 >-}; > static const int alb_delta_in_ticks = HZ / ALB_TIMER_TICKS_PER_SEC; > > #pragma pack(1) >@@ -1422,10 +1419,11 @@ struct slave *bond_xmit_alb_slave_get(struct bonding >*bond, > break; > } > >- /* IPv6 uses all-nodes multicast as an equivalent to >- * broadcasts in IPv4. >+ /* IPv6 multicast destination should disable the tx-balance >since >+ * the pyhsical world may get into a mass status which will lead >+ * to the IPv6 traffic broken. I think this comment can be simplified to simply say that IPv6 multicast destinations should not be tx-balanced, which I suspect is the real purpose. >*/ >- if (ether_addr_equal_64bits(eth_data->h_dest, mac_v6_allmcast)) >{ >+ if (is_ipv6_multicast_ether_addr(eth_data->h_dest)) { > do_tx_balance = false; > break; > } >diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h >index 2e5debc..c6101ab 100644 >--- a/include/linux/etherdevice.h >+++ b/include/linux/etherdevice.h >@@ -178,6 +178,18 @@ static inline bool is_unicast_ether_addr(const u8 *addr) > } > > /** >+ * is_ipv6_multicast_ether_addr - Determine if the Ethernet address is for >+ * IPv6 multicast (rfc2464). >+ * @addr: Pointer to a six-byte array containing the Ethernet address >+ * >+ * Return true if the address is a multicast for IPv6. >+ */ >+static inline bool is_ipv6_multicast_ether_addr(const u8 *addr) >+{ >+ return (addr[0] & addr[1]) == 0x33; >+} I don't think this does what is intended. It will return true for a MAC that starts with any two values whose bitwise AND is 0x33, e.g., 0x73 0x3b. For IPv6 multicast, the first two octets of the MAC must be exactly 0x33 0x33. -J >+ >+/** > * is_valid_ether_addr - Determine if the given Ethernet address is valid > * @addr: Pointer to a six-byte array containing the Ethernet address > * >-- >1.8.3.1 --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] net: bonding: alb disable balance for IPv6 multicast related mac
LIU Yulong wrote: >Hi Jay, > > > >Thank you for your response and review. Please see my inline comments. I'm still reviewing your commentary, but to answer your final question regarding updating the patch, you'll need to repost the entire patch (with the new changes). This repost should change the Subject to include "v2" in the PATCH brackets, and add a change log after the signoff block. The detailed expectations are laid out in https://www.kernel.org/doc/Documentation/process/submitting-patches.rst -J > > >>According to the RFC 2464 [1] the prefix "33:33:xx:xx:xx:xx" is defined >to > >>construct the multicast destination MAC address for IPv6 multicast >traffic. > >>The NDP (Neighbor Discovery Protocol for IPv6)[2] will comply with such > >>rule. The work steps [6] are: > >> *) Let's assume a destination address of 2001:db8:1:1::1. > >> *) This is mapped into the "Solicited Node Multicast Address" (SNMA) > >> format of ff02::1:ffXX:. > >> *) The XX: represent the last 24 bits of the SNMA, and are derived > >> directly from the last 24 bits of the destination address. > >> *) Resulting in a SNMA ff02::1:ff00:0001, or ff02::1:ff00:1. > >> *) This, being a multicast address, can be mapped to a multicast MAC > >> address, using the format 33-33-XX-XX-XX-XX > >> *) Resulting in 33-33-ff-00-00-01. > >> *) This is a MAC address that is only being listened for by nodes > >> sharing the same last 24 bits. > >> *) In other words, while there is a chance for a "address collision", > >> it is a vast improvement over ARP's guaranteed "collision". > >>Kernel related code can be found at [3][4][5]. > >> > > > >>The current bond alb has some leaks of such MAC ranges which will cause > >>the physical world failed to determain the back tunnel of the reply > >>packet during the response in a Spine-and-Leaf data center architecture. > >>The basic topology looks like this: > >> > > > >>+-+ > >>| | > >>+---| Border Leaf |-+ > >>| | | | > >>| +-+ | > >>| | > >>| tunnel-1 | tunnel-2 > >>| | > >>| | > >>+---++ +--+-+ > >>|| || > >>| Leaf1 +--X-X-X-X--+ Leaf2 | tunnel-3 will be checked to prevent loop > >>|| tunnel-3 || > >>++ +-+--+ > >> | | > >> | | > >> | | > >> | | > >> | | > >> | | > >> ++ ++ > >> +--+nic1+---+nic2+---+ > >> | ++ ++ | > >> | bond6| > >> || > >> | HOST | > >> ++ > > > >>This description is, overall, very comprehensive, and I believe > >>I generally understand what issue you're fixing (which seems to be a > >>complicated means to cause MAC flapping), although I'm unclear on a few > >>details, below. > > > >>However, if you could make the ASCII art smaller I think that > >>would be better. > > > >Done, I will update. > > > >>When nic1 is sending the normal IPv6 traffic to the gateway in Border >leaf, > >>the nic2 (slave) will send the NS packet out periodically, automatically > >>and implicitly as well. This is an example packet sending from the slave > >>nic2 which will broke the traffic. > > > >>With this patch applied, what would happen if nic2 sends the > >>normal IPv6 traffic from the source MAC in question (because it is > >>tx-balanced there), and the Neighbor Solicitation multicast then goes > >>out via nic1? > > > >Packets sent out from nic2 kernel will change the source MAC to the slave >MAC, > >aka nic2's MAC, so nic2 normal traffic is just fine. > >Related code can be found here (I'm not pretty sure about the real code >place, but this is what I see): > >https://github.com/torvalds/linux/blob/master/drivers/net/bonding/bond_alb.c#L1320 > > >The following MAC "ac:1f:6b:90:5c:eb" is the nic2 MAC, while the >"fa:16:3e:ba:2d:8c" > >is the real packet source MAC. > > > >> ac:1f:6b:90:5c:eb > 33:33:ff:00:00:01, ethertype 802.1Q (0x8100), > >> length 90: vlan 205, p 0, ethertype IPv6, (hlim 255, > >> next-header ICMPv6 (58) payload length: 32) > >> fe80::f816:3eff:feba:2d8c > ff02::1:ff00:1: > >> [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, > >> who has 240e:980:2f00:4000::1 > >> source link-address option (1), length 8 (1): fa:16:3e:ba:2d:8c > > > >>And perhaps trim out the hex dump here. > > > >Done, removed. > > > >>MAC "fa:16:3e:ba:2d:8c" was first learnt at Leaf1 based on the underlay > >>mechanism(BGP EVPN). When this example packet was sent to Border leaf and > >>replied with dst_MAC "fa:16:3e:ba:2d:8c", Leaf2 will try to send packet > >>back to tunnel-3 at this point dropping happens because of
Re: [PATCH net-next v2 6/6] bonding: make Kconfig toggle to disable legacy interfaces
Jarod Wilson wrote: >On Fri, Oct 2, 2020 at 6:57 PM David Miller wrote: >> >> From: Jarod Wilson >> Date: Fri, 2 Oct 2020 16:23:46 -0400 >> >> > I'd had a bit of feedback that people would rather see both, and be >> > able to toggle off the old ones, rather than only having one or the >> > other, depending on the toggle, so I thought I'd give this a try. I >> > kind of liked the one or the other route, but I see the problems with >> > that too. >> >> Please keep everything for the entire deprecation period, unconditionally. > >Okay, so 100% drop the Kconfig flag patch, but duplicate sysfs >interface names are acceptable, correct? Then what about the procfs >file having duplicate Slave and Port lines? Just leave them all as >Slave? My preference is to not alter the existing sysfs / proc behavior at all, and instead create a netlink / iproute UAPI that becomes the "preferred" UAPI going forward. Any new functionality would only be added to netlink as incentive to switch. I don't see the value in adding duplicate fields, as userspace code that parses them will not be portable if it only checks for the new field name. Then, down the road, deleting the old names will break the userspace code that was never updated (which will likely be most of it). I would rather see a single clean break from proc and sysfs to netlink in one step than a piecemeal addition and removal from the existing UAPI. That makes for a much clearer flag day event for end users. By this I mean leave proc / sysfs as-is today, and then after a suitable deprecation period, remove it wholesale (rather than a compile time option). -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next 4/5] bonding: make Kconfig toggle to disable legacy interfaces
Jarod Wilson wrote: >On Tue, Sep 22, 2020 at 8:01 PM Stephen Hemminger > wrote: >> >> On Tue, 22 Sep 2020 16:47:07 -0700 >> Jay Vosburgh wrote: >> >> > Stephen Hemminger wrote: >> > >> > >On Tue, 22 Sep 2020 09:37:30 -0400 >> > >Jarod Wilson wrote: >> > > >> > >> By default, enable retaining all user-facing API that includes the use >> > >> of >> > >> master and slave, but add a Kconfig knob that allows those that wish to >> > >> remove it entirely do so in one shot. >> > >> >> > >> Cc: Jay Vosburgh >> > >> Cc: Veaceslav Falico >> > >> Cc: Andy Gospodarek >> > >> Cc: "David S. Miller" >> > >> Cc: Jakub Kicinski >> > >> Cc: Thomas Davis >> > >> Cc: net...@vger.kernel.org >> > >> Signed-off-by: Jarod Wilson >> > > >> > >Why not just have a config option to remove all the /proc and sysfs >> > >options >> > >in bonding (and bridging) and only use netlink? New tools should be only >> > >able >> > >to use netlink only. >> > >> > I agree that new tooling should be netlink, but what value is >> > provided by such an option that distros are unlikely to enable, and >> > enabling will break the UAPI? > >Do you mean the initial proposed option, or what Stephen is >suggesting? I think Red Hat actually will consider the former, the >latter is less likely in the immediate future, since so many people >still rely on the output of /proc/net/bonding/* for an overall view of >their bonds' health and status. I don't know how close we are to >having something comparable that could be spit out with a single >invocation of something like 'ip' that would only be using netlink. >It's entirely possible there's something akin to 'ip link bondX >overview' already that outputs something similar, and I'm just not >aware of it, but something like that would definitely need to exist >and be well-documented for Red Hat to remove the procfs bits, I think. At the present time, as much as the idea spurs the imagination, removing the bonding /proc and sysfs stuff wholesale is not feasible. As you explain, not everything in the proc file is available from other sources. I would rather freeze the /proc and sysfs bonding functionality and move to a netlink / iproute API for all of it, and then down the road remove the then-legacy APIs. Even though "down the road" may practically be "never" (because the removal breaks backwards compatibility for user space), unifying all of the configuration and reporting to one place would be worthwhile. For "initial proposed option," I'm not sure right off if that's referring to CONFIG_BONDING_LEGACY_INTERFACES or "duplicate lines in /proc/net/bonding." I'm not sure it matters, since both have the same problem, in that they create a Venn diagram of mutually incompatible bonding UAPIs. Portable user space code ends up having to handle all of the permutations. -J >> > >Then you might convince maintainers to update documentation as well. >> > >Last I checked there were still references to ifenslave. >> > >> > Distros still include ifenslave, but it's now a shell script >> > that uses sysfs. I see it used in scripts from time to time. >> >> Some bleeding edge distros have already dropped ifenslave and even ifconfig. >> The Enterprise ones never will. >> >> The one motivation would be for the embedded folks which are always looking >> to trim out the fat. Although not sure if the minimal versions of commands >> in busybox are pure netlink yet. > >Yeah, the bonding documentation is still filled with references to >ifenslave. I believe Red Hat still includes it, though it's >"deprecated" in documentation in favor of using ip. Similar with >ifconfig. I could see them both getting dropped in a future major >release of Red Hat Enterprise Linux, but they're definitely still here >for at least the life of RHEL8. As ifconfig is typically bundled in with the much-loved netstat in the net-tools package, it will be difficult to remove. Having an /sbin/ifenslave program isn't really the issue so much as its reliance on the bonding sysfs UAPI. It's a shell script, and could likely be reworked to use ip link. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next 4/5] bonding: make Kconfig toggle to disable legacy interfaces
Stephen Hemminger wrote: >On Tue, 22 Sep 2020 09:37:30 -0400 >Jarod Wilson wrote: > >> By default, enable retaining all user-facing API that includes the use of >> master and slave, but add a Kconfig knob that allows those that wish to >> remove it entirely do so in one shot. >> >> Cc: Jay Vosburgh >> Cc: Veaceslav Falico >> Cc: Andy Gospodarek >> Cc: "David S. Miller" >> Cc: Jakub Kicinski >> Cc: Thomas Davis >> Cc: net...@vger.kernel.org >> Signed-off-by: Jarod Wilson > >Why not just have a config option to remove all the /proc and sysfs options >in bonding (and bridging) and only use netlink? New tools should be only able >to use netlink only. I agree that new tooling should be netlink, but what value is provided by such an option that distros are unlikely to enable, and enabling will break the UAPI? >Then you might convince maintainers to update documentation as well. >Last I checked there were still references to ifenslave. Distros still include ifenslave, but it's now a shell script that uses sysfs. I see it used in scripts from time to time. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next 0/5] bonding: rename bond components
Jarod Wilson wrote: >The bonding driver's use of master and slave, while largely understood >in technical circles, poses a barrier for inclusion to some potential >members of the development and user community, due to the historical >context of masters and slaves, particularly in the United States. This >is a first full pass at replacing those phrases with more socially >inclusive ones, opting for aggregator to replace master and link to >replace slave, as the bonding driver itself is a link aggregation >driver. First, I think there should be some direction from the kernel development leadership as to whether or not this type of large-scale search and replace of socially sensitive terms of art or other terminology is a task that should be undertaken, given the noted issues it will cause in stable release maintenance going forwards. Second, on the merits of the proposed changes (presuming for the moment that this goes forward), I would prefer different nomenclature that does not reuse existing names for different purposes, i.e., "link" and "aggregator." Both of those have specific meanings in the current code, and old kernels will retain that meaning. Changing them to have new meanings going forward will lead to confusion, in my opinion for no good reason, as there are other names suited that do not conflict. For example, instead of "master" call everything a "bond," which matches common usage in discussion. Changing "master" to "aggregator," the replacement results in cumbersome descriptions like "the aggregator's active aggregator" in the context of LACP. A replacement term for "slave" is trickier; my first choice would be "port," but that may make more churn from a code change point of view, although struct slave could become struct bond_port, and leave the existing struct port for its current LACP use. >There are a few problems with this change. First up, "link" is used for >link state already in the bonding driver, so the first step here is to >rename link to link_state. Second, aggregator is already used in the >802.3ad code, but I feel the usage is actually consistent with referring >to the bonding aggregation virtual device as the aggregator. Third, we >have the issue of not wanting to break any existing userspace, which I >believe this patchset accomplishes, while also adding alternative >interfaces using new terminology, and a Kconfig option that will let >people make the conscious decision to break userspace and no longer >expose the original master/slave interfaces, once their userspace is >able to cope with their removal. I'm opposed to the Kconfig option because it will lead to balkanization of the UAPI, which would be worse than a clean break (which I'm also opposed to). >Lastly, we do still have the issue of ease of backporting fixes to >-stable trees. I've not had a huge amount of time to spend on it, but >brief forays into coccinelle didn't really pay off (since it's meant to >operate on code, not patches), and the best solution I can come up with >is providing a shell script someone could run over git-format-patch >output before git-am'ing the result to a -stable tree, though scripting >these changes in the first place turned out to be not the best thing to >do anyway, due to subtle cases where use of master or slave can NOT yet >be replaced, so a large amount of work was done by hand, inspection, >trial and error, which is why this set is a lot longer in coming than >I'd originally hoped. I don't expect -stable backports to be horrible to >figure out one way or another though, and I don't believe that a bit of >inconvenience on that front is enough to warrant not making these >changes. I'm skeptical that, given the scope of the changes involved, that it's really feasible to have effective automated massaging of patches. I think the reality is that a large fraction of the bonding fixes in the future will have to be backported entirely by hand. The only saving grace here is that the quantity of such patches is generally low (~40 in 2020 year to date). -J >See here for further details on Red Hat's commitment to this work: >https://www.redhat.com/en/blog/making-open-source-more-inclusive-eradicating-problematic-language > >As far as testing goes, I've manually operated on various bonds while >working on this code, and have run it through multiple lnst test runs, >which exercises the existing sysfs interfaces fairly extensively. As far >as I can tell, there is no breakage of existing interfaces with this >set, unless the user consciously opts to do so via Kconfig. > >Jarod Wilson (5): > bonding: rename struct slave member link to link_state > bonding: rename slave to link whe
Re: [PATCH net] bonding: show saner speed for broadcast mode
Jarod Wilson wrote: >Broadcast mode bonds transmit a copy of all traffic simultaneously out of >all interfaces, so the "speed" of the bond isn't really the aggregate of >all interfaces, but rather, the speed of the lowest active interface. Did you mean "slowest" here? >Also, the type of the speed field is u32, not unsigned long, so adjust >that accordingly, as required to make min() function here without >complaining about mismatching types. > >Fixes: bb5b052f751b ("bond: add support to read speed and duplex via ethtool") >CC: Jay Vosburgh >CC: Veaceslav Falico >CC: Andy Gospodarek >CC: "David S. Miller" >CC: net...@vger.kernel.org >Signed-off-by: Jarod Wilson Did you notice this by inspection, or did it come up in use somewhere? I can't recall ever hearing of anyone using broadcast mode, so I'm curious if there is a use for it, but this change seems reasonable enough regardless. -J Acked-by: Jay Vosburgh >--- > drivers/net/bonding/bond_main.c | 21 ++--- > 1 file changed, 18 insertions(+), 3 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 5ad43aaf76e5..c853ca67058c 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -4552,13 +4552,23 @@ static netdev_tx_t bond_start_xmit(struct sk_buff >*skb, struct net_device *dev) > return ret; > } > >+static u32 bond_mode_bcast_speed(struct slave *slave, u32 speed) >+{ >+ if (speed == 0 || speed == SPEED_UNKNOWN) >+ speed = slave->speed; >+ else >+ speed = min(speed, slave->speed); >+ >+ return speed; >+} >+ > static int bond_ethtool_get_link_ksettings(struct net_device *bond_dev, > struct ethtool_link_ksettings *cmd) > { > struct bonding *bond = netdev_priv(bond_dev); >- unsigned long speed = 0; > struct list_head *iter; > struct slave *slave; >+ u32 speed = 0; > > cmd->base.duplex = DUPLEX_UNKNOWN; > cmd->base.port = PORT_OTHER; >@@ -4570,8 +4580,13 @@ static int bond_ethtool_get_link_ksettings(struct >net_device *bond_dev, >*/ > bond_for_each_slave(bond, slave, iter) { > if (bond_slave_can_tx(slave)) { >- if (slave->speed != SPEED_UNKNOWN) >- speed += slave->speed; >+ if (slave->speed != SPEED_UNKNOWN) { >+ if (BOND_MODE(bond) == BOND_MODE_BROADCAST) >+ speed = bond_mode_bcast_speed(slave, >+speed); >+ else >+ speed += slave->speed; >+ } > if (cmd->base.duplex == DUPLEX_UNKNOWN && > slave->duplex != DUPLEX_UNKNOWN) > cmd->base.duplex = slave->duplex; >-- >2.20.1 >
Re: [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices
Jay Vosburgh wrote: >sathyanarayanan.kuppusw...@linux.intel.com wrote: > >From: Kuppuswamy Sathyanarayanan >> >>Fatal (DPC) error recovery is currently broken for non-hotplug >>capable devices. With current implementation, after successful >>fatal error recovery, non-hotplug capable device state won't be >>restored properly. You can find related issues in following links. >> >>https://lkml.org/lkml/2020/5/27/290 >>https://lore.kernel.org/linux-pci/12115.1588207324@famine/ >>https://lkml.org/lkml/2020/3/28/328 >> >>Current fatal error recovery implementation relies on hotplug handler >>for detaching/re-enumerating the affected devices/drivers on DLLSC >>state changes. So when dealing with non-hotplug capable devices, >>recovery code does not restore the state of the affected devices >>correctly. Correct implementation should call report_slot_reset() >>function after resetting the link to restore the state of the >>device/driver. >> >>So use PCI_ERS_RESULT_NEED_RESET as error status for successful >>reset_link() operation and use PCI_ERS_RESULT_DISCONNECT for failure >>case. PCI_ERS_RESULT_NEED_RESET error state will ensure slot_reset() >>is called after reset link operation which will also fix the above >>mentioned issue. >> >>[original patch is from jay.vosbu...@canonical.com] >>[original patch link >>https://lore.kernel.org/linux-pci/12115.1588207324@famine/] >>Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()") >>Signed-off-by: Jay Vosburgh >>Signed-off-by: Kuppuswamy Sathyanarayanan >> > > I've tested this patch set on one of our test machines, and it >resolves the issue. I plan to test with other systems tomorrow. I've done testing on two different systems that exhibit the original issue and this patch set appears to behave as expected. Has anyone else (Yicong?) had an opportunity to test this? Can this be considered for acceptance, or is additional feedback or review needed? -J >>--- >> drivers/pci/pcie/err.c | 24 ++-- >> 1 file changed, 22 insertions(+), 2 deletions(-) >> >>diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c >>index 14bb8f54723e..5fe8561c7185 100644 >>--- a/drivers/pci/pcie/err.c >>+++ b/drivers/pci/pcie/err.c >>@@ -165,8 +165,28 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, >> pci_dbg(dev, "broadcast error_detected message\n"); >> if (state == pci_channel_io_frozen) { >> pci_walk_bus(bus, report_frozen_detected, ); >>- status = reset_link(dev); >>- if (status != PCI_ERS_RESULT_RECOVERED) { >>+ /* >>+ * After resetting the link using reset_link() call, the >>+ * possible value of error status is either >>+ * PCI_ERS_RESULT_DISCONNECT (failure case) or >>+ * PCI_ERS_RESULT_NEED_RESET (success case). >>+ * So ignore the return value of report_error_detected() >>+ * call for fatal errors. Instead use >>+ * PCI_ERS_RESULT_NEED_RESET as initial status value. >>+ * >>+ * Ignoring the status return value of report_error_detected() >>+ * call will also help in case of EDR mode based error >>+ * recovery. In EDR mode AER and DPC Capabilities are owned by >>+ * firmware and hence report_error_detected() call will possibly >>+ * return PCI_ERS_RESULT_NO_AER_DRIVER. So if we don't ignore >>+ * the return value of report_error_detected() then >>+ * pcie_do_recovery() would report incorrect status after >>+ * successful recovery. Ignoring PCI_ERS_RESULT_NO_AER_DRIVER >>+ * in non EDR case should not have any functional impact. >>+ */ >>+ status = PCI_ERS_RESULT_NEED_RESET; >>+ if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) { >>+ status = PCI_ERS_RESULT_DISCONNECT; >> pci_warn(dev, "link reset failed\n"); >> goto failed; >> } >>-- >>2.17.1 --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next 3/4] bonding: support hardware encryption offload to slaves
Jarod Wilson wrote: >Currently, this support is limited to active-backup mode, as I'm not sure >about the feasilibity of mapping an xfrm_state's offload handle to >multiple hardware devices simultaneously, and we rely on being able to >pass some hints to both the xfrm and NIC driver about whether or not >they're operating on a slave device. > >I've tested this atop an Intel x520 device (ixgbe) using libreswan in >transport mode, succesfully achieving ~4.3Gbps throughput with netperf >(more or less identical to throughput on a bare NIC in this system), >as well as successful failover and recovery mid-netperf. > >v2: rebase on latest net-next and wrap with #ifdef CONFIG_XFRM_OFFLOAD >v3: add new CONFIG_BOND_XFRM_OFFLOAD option and fix shutdown path > >CC: Jay Vosburgh >CC: Veaceslav Falico >CC: Andy Gospodarek >CC: "David S. Miller" >CC: Jeff Kirsher >CC: Jakub Kicinski >CC: Steffen Klassert >CC: Herbert Xu >CC: net...@vger.kernel.org >CC: intel-wired-...@lists.osuosl.org >Signed-off-by: Jarod Wilson > >Signed-off-by: Jarod Wilson >--- > drivers/net/Kconfig | 11 > drivers/net/bonding/bond_main.c | 111 +++- > include/net/bonding.h | 3 + > 3 files changed, 122 insertions(+), 3 deletions(-) > >diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig >index c7d310ef1c83..938c4dd9bfb9 100644 >--- a/drivers/net/Kconfig >+++ b/drivers/net/Kconfig >@@ -56,6 +56,17 @@ config BONDING > To compile this driver as a module, choose M here: the module > will be called bonding. > >+config BONDING_XFRM_OFFLOAD >+ bool "Bonding driver IPSec XFRM cryptography-offload pass-through >support" >+ depends on BONDING >+ depends on XFRM_OFFLOAD >+ default y >+ select XFRM_ALGO >+ ---help--- >+Enable support for IPSec offload pass-through in the bonding driver. >+Currently limited to active-backup mode only, and requires slave >+devices that support hardware crypto offload. >+ Why is this a separate Kconfig option? Is it reasonable to expect users to enable XFRM_OFFLOAD but not BONDING_XFRM_OFFLOAD? > config DUMMY > tristate "Dummy net driver support" > ---help--- >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index a25c65d4af71..01b80cef492a 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -79,6 +79,7 @@ > #include > #include > #include >+#include > #include > #include > #include >@@ -278,8 +279,6 @@ const char *bond_mode_name(int mode) > return names[mode]; > } > >-/*-- VLAN >---*/ >- > /** > * bond_dev_queue_xmit - Prepare skb for xmit. > * >@@ -302,6 +301,8 @@ netdev_tx_t bond_dev_queue_xmit(struct bonding *bond, >struct sk_buff *skb, > return dev_queue_xmit(skb); > } > >+/*-- VLAN >---*/ >+ > /* In the following 2 functions, bond_vlan_rx_add_vid and > bond_vlan_rx_kill_vid, > * We don't protect the slave list iteration with a lock because: > * a. This operation is performed in IOCTL context, >@@ -372,6 +373,84 @@ static int bond_vlan_rx_kill_vid(struct net_device >*bond_dev, > return 0; > } > >+/*-- XFRM >---*/ >+ >+#ifdef CONFIG_BONDING_XFRM_OFFLOAD >+/** >+ * bond_ipsec_add_sa - program device with a security association >+ * @xs: pointer to transformer state struct >+ **/ >+static int bond_ipsec_add_sa(struct xfrm_state *xs) >+{ >+ struct net_device *bond_dev = xs->xso.dev; >+ struct bonding *bond = netdev_priv(bond_dev); >+ struct slave *slave = rtnl_dereference(bond->curr_active_slave); >+ >+ xs->xso.slave_dev = slave->dev; >+ bond->xs = xs; >+ >+ if (!(slave->dev->xfrmdev_ops >+&& slave->dev->xfrmdev_ops->xdo_dev_state_add)) { >+ slave_warn(bond_dev, slave->dev, "Slave does not support ipsec >offload\n"); >+ return -EINVAL; >+ } >+ >+ return slave->dev->xfrmdev_ops->xdo_dev_state_add(xs); >+} >+ >+/** >+ * bond_ipsec_del_sa - clear out this specific SA >+ * @xs: pointer to transformer state struct >+ **/ >+static void bond_ipsec_del_sa(struct xfrm_state *xs) >+{ >+ struct net_device *bond_dev = xs->xso.dev; >+ struct bonding *bond = netdev_priv(bond_dev); >+ struct slave *slave = rtnl_derefere
Re: [PATCH v2 1/2] PCI/ERR: Fix fatal error recovery for non-hotplug capable devices
sathyanarayanan.kuppusw...@linux.intel.com wrote: >From: Kuppuswamy Sathyanarayanan > >Fatal (DPC) error recovery is currently broken for non-hotplug >capable devices. With current implementation, after successful >fatal error recovery, non-hotplug capable device state won't be >restored properly. You can find related issues in following links. > >https://lkml.org/lkml/2020/5/27/290 >https://lore.kernel.org/linux-pci/12115.1588207324@famine/ >https://lkml.org/lkml/2020/3/28/328 > >Current fatal error recovery implementation relies on hotplug handler >for detaching/re-enumerating the affected devices/drivers on DLLSC >state changes. So when dealing with non-hotplug capable devices, >recovery code does not restore the state of the affected devices >correctly. Correct implementation should call report_slot_reset() >function after resetting the link to restore the state of the >device/driver. > >So use PCI_ERS_RESULT_NEED_RESET as error status for successful >reset_link() operation and use PCI_ERS_RESULT_DISCONNECT for failure >case. PCI_ERS_RESULT_NEED_RESET error state will ensure slot_reset() >is called after reset link operation which will also fix the above >mentioned issue. > >[original patch is from jay.vosbu...@canonical.com] >[original patch link >https://lore.kernel.org/linux-pci/12115.1588207324@famine/] >Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()") >Signed-off-by: Jay Vosburgh >Signed-off-by: Kuppuswamy Sathyanarayanan > I've tested this patch set on one of our test machines, and it resolves the issue. I plan to test with other systems tomorrow. -J >--- > drivers/pci/pcie/err.c | 24 ++-- > 1 file changed, 22 insertions(+), 2 deletions(-) > >diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c >index 14bb8f54723e..5fe8561c7185 100644 >--- a/drivers/pci/pcie/err.c >+++ b/drivers/pci/pcie/err.c >@@ -165,8 +165,28 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, > pci_dbg(dev, "broadcast error_detected message\n"); > if (state == pci_channel_io_frozen) { > pci_walk_bus(bus, report_frozen_detected, ); >- status = reset_link(dev); >- if (status != PCI_ERS_RESULT_RECOVERED) { >+ /* >+ * After resetting the link using reset_link() call, the >+ * possible value of error status is either >+ * PCI_ERS_RESULT_DISCONNECT (failure case) or >+ * PCI_ERS_RESULT_NEED_RESET (success case). >+ * So ignore the return value of report_error_detected() >+ * call for fatal errors. Instead use >+ * PCI_ERS_RESULT_NEED_RESET as initial status value. >+ * >+ * Ignoring the status return value of report_error_detected() >+ * call will also help in case of EDR mode based error >+ * recovery. In EDR mode AER and DPC Capabilities are owned by >+ * firmware and hence report_error_detected() call will possibly >+ * return PCI_ERS_RESULT_NO_AER_DRIVER. So if we don't ignore >+ * the return value of report_error_detected() then >+ * pcie_do_recovery() would report incorrect status after >+ * successful recovery. Ignoring PCI_ERS_RESULT_NO_AER_DRIVER >+ * in non EDR case should not have any functional impact. >+ */ >+ status = PCI_ERS_RESULT_NEED_RESET; >+ if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) { >+ status = PCI_ERS_RESULT_DISCONNECT; > pci_warn(dev, "link reset failed\n"); > goto failed; > } >-- >2.17.1 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding: Fix reference count leak in bond_sysfs_slave_add.
wu000...@umn.edu wrote: >From: Qiushi Wu > >kobject_init_and_add() takes reference even when it fails. >If this function returns an error, kobject_put() must be called to >properly clean up the memory associated with the object. Previous >commit "b8eb718348b8" fixed a similar problem. > >Fixes: 07699f9a7c8d ("bonding: add sysfs /slave dir for bond slave devices.") >Signed-off-by: Qiushi Wu Acked-by: Jay Vosburgh >--- > drivers/net/bonding/bond_sysfs_slave.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_sysfs_slave.c >b/drivers/net/bonding/bond_sysfs_slave.c >index 007481557191..9b8346638f69 100644 >--- a/drivers/net/bonding/bond_sysfs_slave.c >+++ b/drivers/net/bonding/bond_sysfs_slave.c >@@ -149,8 +149,10 @@ int bond_sysfs_slave_add(struct slave *slave) > > err = kobject_init_and_add(>kobj, _ktype, > &(slave->dev->dev.kobj), "bonding_slave"); >- if (err) >+ if (err) { >+ kobject_put(>kobj); > return err; >+ } > > for (a = slave_attrs; *a; ++a) { > err = sysfs_create_file(>kobj, &((*a)->attr)); >-- >2.17.1 >
Re: [PATCH v1 1/1] PCI/ERR: Handle fatal error recovery for non-hotplug capable devices
sathyanarayanan.kuppusw...@linux.intel.com wrote: >From: Kuppuswamy Sathyanarayanan > >If there are non-hotplug capable devices connected to a given >port, then during the fatal error recovery(triggered by DPC or >AER), after calling reset_link() function, we cannot rely on >hotplug handler to detach and re-enumerate the device drivers >in the affected bus. Instead, we will have to let the error >recovery handler call report_slot_reset() for all devices in >the bus to notify about the reset operation. Although this is >only required for non hot-plug capable devices, doing it for >hotplug capable devices should not affect the functionality. Yicong, Does the patch below also resolve the issue for you, as with your changed version of my original patch? -J >Along with above issue, this fix also applicable to following >issue. > >Commit 6d2c89441571 ("PCI/ERR: Update error status after >reset_link()") added support to store status of reset_link() >call. Although this fixed the error recovery issue observed if >the initial value of error status is PCI_ERS_RESULT_DISCONNECT >or PCI_ERS_RESULT_NO_AER_DRIVER, it also discarded the status >result from report_frozen_detected. This can cause a failure to >recover if _NEED_RESET is returned by report_frozen_detected and >report_slot_reset is not invoked. > >Such an event can be induced for testing purposes by reducing the >Max_Payload_Size of a PCIe bridge to less than that of a device >downstream from the bridge, and then initiating I/O through the >device, resulting in oversize transactions. In the presence of DPC, >this results in a containment event and attempted reset and recovery >via pcie_do_recovery. After 6d2c89441571 report_slot_reset is not >invoked, and the device does not recover. > >[original patch is from jay.vosbu...@canonical.com] >[original patch link >https://lore.kernel.org/linux-pci/18609.1588812972@famine/] >Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()") >Signed-off-by: Jay Vosburgh >Signed-off-by: Kuppuswamy Sathyanarayanan > >--- > drivers/pci/pcie/err.c | 19 +++ > 1 file changed, 15 insertions(+), 4 deletions(-) > >diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c >index 14bb8f54723e..db80e1ecb2dc 100644 >--- a/drivers/pci/pcie/err.c >+++ b/drivers/pci/pcie/err.c >@@ -165,13 +165,24 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, > pci_dbg(dev, "broadcast error_detected message\n"); > if (state == pci_channel_io_frozen) { > pci_walk_bus(bus, report_frozen_detected, ); >- status = reset_link(dev); >- if (status != PCI_ERS_RESULT_RECOVERED) { >+ status = PCI_ERS_RESULT_NEED_RESET; >+ } else { >+ pci_walk_bus(bus, report_normal_detected, ); >+ } >+ >+ if (status == PCI_ERS_RESULT_NEED_RESET) { >+ if (reset_link) { >+ if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) >+ status = PCI_ERS_RESULT_DISCONNECT; >+ } else { >+ if (pci_bus_error_reset(dev)) >+ status = PCI_ERS_RESULT_DISCONNECT; >+ } >+ >+ if (status == PCI_ERS_RESULT_DISCONNECT) { > pci_warn(dev, "link reset failed\n"); > goto failed; > } >- } else { >- pci_walk_bus(bus, report_normal_detected, ); > } > > if (status == PCI_ERS_RESULT_CAN_RECOVER) { >-- >2.17.1 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH v3] bonding: force enable lacp port after link state recovery for 802.3ad
zhangsha (A) wrote: >> -Original Message- >> From: zhangsha (A) >> Sent: 2019年9月18日 21:06 >> To: jay.vosbu...@canonical.com; vfal...@gmail.com; a...@greyhouse.net; >> da...@davemloft.net; net...@vger.kernel.org; linux-kernel@vger.kernel.org; >> yuehaibing ; hunongda ; >> Chenzhendong (alex) ; zhangsha (A) >> >> Subject: [PATCH v3] bonding: force enable lacp port after link state >> recovery for >> 802.3ad >> >> From: Sha Zhang >> >> After the commit 334031219a84 ("bonding/802.3ad: fix slave link >> initialization >> transition states") merged, the slave's link status will be changed to >> BOND_LINK_FAIL from BOND_LINK_DOWN in the following scenario: >> - Driver reports loss of carrier and >> bonding driver receives NETDEV_DOWN notifier >> - slave's duplex and speed is zerod and >> its port->is_enabled is cleard to 'false'; >> - Driver reports link recovery and >> bonding driver receives NETDEV_UP notifier; >> - If speed/duplex getting failed here, the link status >> will be changed to BOND_LINK_FAIL; >> - The MII monotor later recover the slave's speed/duplex >> and set link status to BOND_LINK_UP, but remains >> the 'port->is_enabled' to 'false'. >> >> In this scenario, the lacp port will not be enabled even its speed and >> duplex are >> valid. The bond will not send LACPDU's, and its state is 'AD_STATE_DEFAULTED' >> forever. The simplest fix I think is to call bond_3ad_handle_link_change() in >> bond_miimon_commit, this function can enable lacp after port slave speed >> check. >> As enabled, the lacp port can run its state machine normally after link >> recovery. >> >> Signed-off-by: Sha Zhang >> --- >> drivers/net/bonding/bond_main.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/net/bonding/bond_main.c >> b/drivers/net/bonding/bond_main.c index 931d9d9..76324a5 100644 >> --- a/drivers/net/bonding/bond_main.c >> +++ b/drivers/net/bonding/bond_main.c >> @@ -2206,7 +2206,8 @@ static void bond_miimon_commit(struct bonding >> *bond) >> */ >> if (BOND_MODE(bond) == BOND_MODE_8023AD && >> slave->link == BOND_LINK_UP) >> - >> bond_3ad_adapter_speed_duplex_changed(slave); >> +bond_3ad_handle_link_change(slave, >> +BOND_LINK_UP); >> continue; >> >> case BOND_LINK_UP: > >Hi, David, >I have replied your email for a while, I guess you may miss my email, so I >resend it. >The following link address is the last email, please review the new one again, >thank you. >https://patchwork.ozlabs.org/patch/1151915/ > >Last time, you doubted this is a driver specific problem, >I prefer to believe it's not because I find the commit 4d2c0cda, >its log says " Some NIC drivers don't have correct speed/duplex >settings at the time they send NETDEV_UP notification ...". > >Anyway, I think the lacp status should be fixed correctly, >since link-monitoring (miimon) set SPEED/DUPLEX right here. I suspect this is going to be related to the concurrent discussion with Aleksei, and would like to see the instrumentation results from his tests before adding another change to attempt to resolve this. Also, what device are you using for your testing, and are you able to run the instrumentation patch that I provided to Aleksei and provide its results? -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding: force enable lacp port after link state recovery for 802.3ad
wrote: >From: Sha Zhang > >After the commit 334031219a84 ("bonding/802.3ad: fix slave link >initialization transition states") merged, >the slave's link status will be changed to BOND_LINK_FAIL >from BOND_LINK_DOWN in the following scenario: >- Driver reports loss of carrier and > bonding driver receives NETDEV_CHANGE notifier >- slave's duplex and speed is zerod and > its port->is_enabled is cleard to 'false'; >- Driver reports link recovery and > bonding driver receives NETDEV_UP notifier; >- If speed/duplex getting failed here, the link status > will be changed to BOND_LINK_FAIL; >- The MII monotor later recover the slave's speed/duplex > and set link status to BOND_LINK_UP, but remains > the 'port->is_enabled' to 'false'. > >In this scenario, the lacp port will not be enabled even its speed >and duplex are valid. The bond will not send LACPDU's, and its >state is 'AD_STATE_DEFAULTED' forever. The simplest fix I think >is to force enable lacp after port slave speed check in >bond_miimon_commit. As enabled, the lacp port can run its state machine >normally after link recovery. > >Signed-off-by: Sha Zhang >--- > drivers/net/bonding/bond_main.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 931d9d9..379253a 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -2194,6 +2194,7 @@ static void bond_miimon_commit(struct bonding *bond) > { > struct list_head *iter; > struct slave *slave, *primary; >+ struct port *port; > > bond_for_each_slave(bond, slave, iter) { > switch (slave->new_link) { >@@ -2205,8 +2206,13 @@ static void bond_miimon_commit(struct bonding *bond) >* link status >*/ > if (BOND_MODE(bond) == BOND_MODE_8023AD && >- slave->link == BOND_LINK_UP) >+ slave->link == BOND_LINK_UP) { > bond_3ad_adapter_speed_duplex_changed(slave); >+ if (slave->duplex == DUPLEX_FULL) { >+ port = &(SLAVE_AD_INFO(slave)->port); >+ port->is_enabled = true; >+ } >+ } I don't believe that testing duplex here is correct; is_enabled is not controlled by duplex, but by carrier state. Duplex does affect whether or not a port is permitted to aggregate, but that's entirely separate logic (the AD_PORT_LACP_ENABLED flag). Would it be better to call bond_3ad_handle_link_change() here, instead of manually testing duplex and setting is_enabled? -J > continue; > > case BOND_LINK_UP: >-- >1.8.3.1 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [bonding][patch] Regarding a bonding lacp issue
Felix wrote: >Dear Mainteners, > >Recently I hit a packet drop issue in bonding driver on Linux 4.9. Please >see details below. Please take a look to see if my understanding is >correct. Many thanks. > >What is the problem? >The bonding driver starts to send packets even if the Partner(Switch)'s >Collecting bit is not enabled yet. Partner would drop all packets until >its Collecting bit is enabled. > >What is the root cuase? >According to LACP spec, the Actor need to check Partner's Sync and >Collecting bits before enable its Distributing bit and Distributing >function. Please see the PIC below. The diagram you reference is found in 802.1AX-2014 figure 6-21, which shows the state diagram for an independent control implementation, i.e., collecting and distributing are managed independently. However, Linux bonding implements coupled control, which is shown in figure 6-22. Here, there is no Partner.Collecting requirement on the state transition from ATTACHED to COLLECTING_DISTRIBUTING. To quote 802.1AX-2014 6.4.15: As independent control is not possible, the coupled control state machine does not wait for the Partner to signal that collection has started before enabling both collection and distribution. Now, that said, I agree that what you're seeing is likely explained by this behavior, and your fix should resolve the immediate problem (that bonding sends packets before the peer has enabled COLLECTING). However, your fix does put bonding out of compliance with the standard, as it does not really implement COLLECTING and DISTRIBUTING as discrete states. In particular, if the peer in your case were to later clear Partner.Collecting, bonding will not react to this as a figure 6-21 independent control implementation would (which isn't a change from current behavior, but currently this isn't expected). So, in my opinion a patch like this should have a comment attached noting that we are deliberately not in compliance with the standard in this specific situation. The proper fix is to implement figure 6-21 separate state. Lastly, are you able to test and generate a patch against current upstream, instead of 4.9? -J >How to fix? >Please see the diff as following. And the patch is attached. > >--- ../origin/linux-4.9.188/drivers/net/bonding/bond_3ad.c 2019-08-07 >00:29:42.0 +0800 >+++ drivers/net/bonding/bond_3ad.c 2019-08-08 23:13:29.015640197 +0800 >@@ -937,6 +937,7 @@ > */ >if ((port->sm_vars & AD_PORT_SELECTED) && >(port->partner_oper.port_state & AD_STATE_SYNCHRONIZATION) && >+ (port->partner_oper.port_state & AD_STATE_COLLECTING) && >!__check_agg_selection_timer(port)) { > if (port->aggregator->is_active) > port->sm_mux_state = > >-- >Thanks, >Felix --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding: Add vlan tx offload to hw_enc_features
YueHaibing wrote: >As commit 30d8177e8ac7 ("bonding: Always enable vlan tx offload") >said, we should always enable bonding's vlan tx offload, pass the >vlan packets to the slave devices with vlan tci, let them to handle >vlan implementation. > >Now if encapsulation protocols like VXLAN is used, skb->encapsulation >may be set, then the packet is passed to vlan devicec which based on Typo: "devicec" >bonding device. However in netif_skb_features(), the check of >hw_enc_features: > >if (skb->encapsulation) > features &= dev->hw_enc_features; > >clears NETIF_F_HW_VLAN_CTAG_TX/NETIF_F_HW_VLAN_STAG_TX. This results >in same issue in commit 30d8177e8ac7 like this: > >vlan_dev_hard_start_xmit > -->dev_queue_xmit >-->validate_xmit_skb > -->netif_skb_features //NETIF_F_HW_VLAN_CTAG_TX is cleared > -->validate_xmit_vlan >-->__vlan_hwaccel_push_inside //skb->tci is cleared >... > --> bond_start_xmit > --> bond_xmit_hash //BOND_XMIT_POLICY_ENCAP34 > --> __skb_flow_dissect // nhoff point to IP header >--> case htons(ETH_P_8021Q) > // skb_vlan_tag_present is false, so > vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan), > //vlan point to ip header wrongly > >Signed-off-by: YueHaibing Looks good to me; should this be tagged with Fixes: 278339a42a1b ("bonding: propogate vlan_features to bonding master") as 30d8177e8ac7 was? If not, is there an appropriate commit id? Acked-by: Jay Vosburgh -J >--- > drivers/net/bonding/bond_main.c | 2 ++ > 1 file changed, 2 insertions(+) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 02fd782..931d9d9 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -1126,6 +1126,8 @@ static void bond_compute_features(struct bonding *bond) > done: > bond_dev->vlan_features = vlan_features; > bond_dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL | >+ NETIF_F_HW_VLAN_CTAG_TX | >+ NETIF_F_HW_VLAN_STAG_TX | > NETIF_F_GSO_UDP_L4; > bond_dev->mpls_features = mpls_features; > bond_dev->gso_max_segs = gso_max_segs; >-- >2.7.4 --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net] bonding/802.3ad: fix slave link initialization transition states
Jarod Wilson wrote: >Once in a while, with just the right timing, 802.3ad slaves will fail to >properly initialize, winding up in a weird state, with a partner system >mac address of 00:00:00:00:00:00. This started happening after a fix to >properly track link_failure_count tracking, where an 802.3ad slave that >reported itself as link up in the miimon code, but wasn't able to get a >valid speed/duplex, started getting set to BOND_LINK_FAIL instead of >BOND_LINK_DOWN. That was the proper thing to do for the general "my link >went down" case, but has created a link initialization race that can put >the interface in this odd state. Reading back in the git history, the ultimate cause of this "weird state" appears to be devices that assert NETDEV_UP prior to actually being able to supply sane speed/duplex values, correct? Presuming that this is the case, I don't see that there's much else to be done here, and so: Acked-by: Jay Vosburgh >The simple fix is to instead set the slave link to BOND_LINK_DOWN again, >if the link has never been up (last_link_up == 0), so the link state >doesn't bounce from BOND_LINK_DOWN to BOND_LINK_FAIL -- it hasn't failed >in this case, it simply hasn't been up yet, and this prevents the >unnecessary state change from DOWN to FAIL and getting stuck in an init >failure w/o a partner mac. > >Fixes: ea53abfab960 ("bonding/802.3ad: fix link_failure_count tracking") >CC: Jay Vosburgh >CC: Veaceslav Falico >CC: Andy Gospodarek >CC: "David S. Miller" >CC: net...@vger.kernel.org >Tested-by: Heesoon Kim >Signed-off-by: Jarod Wilson >--- > drivers/net/bonding/bond_main.c | 15 ++- > 1 file changed, 10 insertions(+), 5 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 062fa7e3af4c..407f4095a37a 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -3122,13 +3122,18 @@ static int bond_slave_netdev_event(unsigned long event, > case NETDEV_CHANGE: > /* For 802.3ad mode only: >* Getting invalid Speed/Duplex values here will put slave >- * in weird state. So mark it as link-fail for the time >- * being and let link-monitoring (miimon) set it right when >- * correct speeds/duplex are available. >+ * in weird state. Mark it as link-fail if the link was >+ * previously up or link-down if it hasn't yet come up, and >+ * let link-monitoring (miimon) set it right when correct >+ * speeds/duplex are available. >*/ > if (bond_update_speed_duplex(slave) && >- BOND_MODE(bond) == BOND_MODE_8023AD) >- slave->link = BOND_LINK_FAIL; >+ BOND_MODE(bond) == BOND_MODE_8023AD) { >+ if (slave->last_link_up) >+ slave->link = BOND_LINK_FAIL; >+ else >+ slave->link = BOND_LINK_DOWN; >+ } > > if (BOND_MODE(bond) == BOND_MODE_8023AD) > bond_3ad_adapter_speed_duplex_changed(slave); >-- >2.20.1 >
Re: [PATCH] bonding: show full hw address in sysfs for slave entries
Konstantin Khorenko wrote: >Bond expects ethernet hwaddr for its slave, but it can be longer than 6 >bytes - infiniband interface for example. > > # cat /sys/devices//net/ib0/address > 80:00:02:08:fe:80:00:00:00:00:00:00:7c:fe:90:03:00:be:5d:e1 > > # cat /sys/devices//net/ib0/bonding_slave/perm_hwaddr > 80:00:02:08:fe:80 > >So print full hwaddr in sysfs "bonding_slave/perm_hwaddr" as well. > >Signed-off-by: Konstantin Khorenko >--- > drivers/net/bonding/bond_sysfs_slave.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_sysfs_slave.c >b/drivers/net/bonding/bond_sysfs_slave.c >index 2f120b2ffef0..4985268e2273 100644 >--- a/drivers/net/bonding/bond_sysfs_slave.c >+++ b/drivers/net/bonding/bond_sysfs_slave.c >@@ -55,7 +55,9 @@ static SLAVE_ATTR_RO(link_failure_count); > > static ssize_t perm_hwaddr_show(struct slave *slave, char *buf) > { >- return sprintf(buf, "%pM\n", slave->perm_hwaddr); >+ return sprintf(buf, "%*phC\n", >+ slave->dev->addr_len, >+ slave->perm_hwaddr); > } > static SLAVE_ATTR_RO(perm_hwaddr); Acked-by: Jay Vosburgh -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH 1/2] net: bonding: fix restricted __be16 degrades to integer
Bo YU wrote: >There are some warning when: > >sudo make C=1 CF=-D__CHECK_ENDIAN__ drivers/net/bonding/ > >drivers/net/bonding/bond_main.c:2385:26: warning: restricted __be16 degrades >to integer >drivers/net/bonding/bond_main.c:2391:20: warning: restricted __be16 degrades >to integer >... >drivers/net/bonding/bond_main.c:3241:60: warning: restricted __be16 degrades >to integer >drivers/net/bonding/bond_main.c:3241:60: warning: restricted __be16 degrades >to integer > >So fix it. > >Signed-off-by: Bo YU >--- > drivers/net/bonding/bond_main.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index b59708c35faf..135fec28daa9 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -2382,13 +2382,13 @@ static void bond_arp_send(struct net_device >*slave_dev, int arp_op, > return; > } > >- if (!tags || tags->vlan_proto == VLAN_N_VID) >+ if (!tags || be16_to_cpu(tags->vlan_proto) == VLAN_N_VID) > goto xmit; > > tags++; > > /* Go through all the tags backwards and add them to the packet */ >- while (tags->vlan_proto != VLAN_N_VID) { >+ while (be16_to_cpu(tags->vlan_proto) != VLAN_N_VID) { I believe both of the above are incorrect, as vlan_proto is set explicitly to VLAN_N_VID (in host byte order) by bond_verify_device_path as a sentinel value. Byte swapping the tags->vlan_proto value would cause the test or loop to miss the sentinel. > if (!tags->vlan_id) { > tags++; > continue; >@@ -3238,7 +3238,7 @@ static inline u32 bond_eth_hash(struct sk_buff *skb) > > ep = skb_header_pointer(skb, 0, sizeof(hdr_tmp), _tmp); > if (ep) >- return ep->h_dest[5] ^ ep->h_source[5] ^ ep->h_proto; >+ return ep->h_dest[5] ^ ep->h_source[5] ^ >be16_to_cpu(ep->h_proto); This is probably harmless, other than adding work to the transmit path. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH 2/2] net: bonding: fix incorrect type in assignment
Bo YU wrote: >There are some warning when: > >sudo make C=1 CF=-D__CHECK_ENDIAN__ drivers/net/bonding/ > >drivers/net/bonding/bond_main.c:2438:40: warning: incorrect type in assignment >(different base types) >drivers/net/bonding/bond_main.c:2438:40:expected restricted __be16 >[usertype] vlan_proto >drivers/net/bonding/bond_main.c:2438:40: >... >rivers/net/bonding/bond_options.c:1089:24: warning: incorrect type in >assignment (different base types) >drivers/net/bonding/bond_options.c:1089:24:expected restricted __be32 >[addressable] [usertype] target >drivers/net/bonding/bond_options.c:1089:24:got unsigned long long const >[usertype] value > >So fix it > >Signed-off-by: Bo YU >--- > drivers/net/bonding/bond_main.c| 2 +- > drivers/net/bonding/bond_options.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 135fec28daa9..07e52d863e91 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -2435,7 +2435,7 @@ struct bond_vlan_tag *bond_verify_device_path(struct >net_device *start_dev, > tags = kcalloc(level + 1, sizeof(*tags), GFP_ATOMIC); > if (!tags) > return ERR_PTR(-ENOMEM); >- tags[level].vlan_proto = VLAN_N_VID; >+ tags[level].vlan_proto = cpu_to_be16(VLAN_N_VID); > return tags; > } > >diff --git a/drivers/net/bonding/bond_options.c >b/drivers/net/bonding/bond_options.c >index da1fc17295d9..3a196999bd1b 100644 >--- a/drivers/net/bonding/bond_options.c >+++ b/drivers/net/bonding/bond_options.c >@@ -1086,7 +1086,7 @@ static int bond_option_arp_ip_targets_set(struct bonding >*bond, > else > netdev_err(bond->dev, "no command found in > arp_ip_targets file - use + or -\n"); > } else { >- target = newval->value; >+ target = cpu_to_be32(newval->value); > ret = bond_option_arp_ip_target_add(bond, target); I'm not sure this is correct; if I'm reading the call path correctly, bond_changelink will if (data[IFLA_BOND_ARP_IP_TARGET]) { [...] __be32 target; [...] target = nla_get_be32(attr); bond_opt_initval(, (__force u64)target); err = __bond_opt_set(bond, BOND_OPT_ARP_TARGETS, ); thus, newval.value is initially be32, but stored in a u64. __bond_opt_set will call bond_opt_parse, which in turn will call bond_option_arp_ip_targets_set (via .set), and the change above would swap the newval.value back to host order (on little endian architectures for which cpu_to_be32 is not a no-op). Am I misunderstanding? Did you test this change on an x86 or other little endian system? -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c
Michael J Dilmore <michael.j.dilm...@gmail.com> wrote: >On 21/06/17 22:56, David Miller wrote: > >> From: Michael D <michael.j.dilm...@gmail.com> >> Date: Wed, 21 Jun 2017 22:41:07 +0100 >> >>> I don't think you can stop it being dereferenced... you just need to >>> prevent an attacker from exploiting the null pointer dereference >>> vulnerability right? And this is done by returning the function right >>> away? >> What's all of this about an "attacker"? >> >> If there is a bug, we dererence a NULL pointer, and we should >> fix that bug. >> >> The BUG_ON() helps us see where the problem is while at the >> same time stopping the kernel before the NULL deref happens. >Ok this is starting to make sense now - went a bit off track but think my >general thinking is ok - i.e. if we return the function with an error code >before the dereference then this basically does the same thing as BUG_ON >but without crashing the kernel. > >Something like: > >if (WARN_ON(!new_active_slave) { >netdev_dbg("Can't add new active slave - pointer null"); >return ERROR_CODE >} In general, yes, but in this case, the condition should be impossible to hit, so BUG_ON seems appropriate. If bond_slave_get_rtnl/rcu() returns NULL for an actual bonding slave, other code paths (bond_fill_slave_info, bond_handle_frame) will likely crash before getting to this one. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c
Michael J Dilmore wrote: >On 21/06/17 22:56, David Miller wrote: > >> From: Michael D >> Date: Wed, 21 Jun 2017 22:41:07 +0100 >> >>> I don't think you can stop it being dereferenced... you just need to >>> prevent an attacker from exploiting the null pointer dereference >>> vulnerability right? And this is done by returning the function right >>> away? >> What's all of this about an "attacker"? >> >> If there is a bug, we dererence a NULL pointer, and we should >> fix that bug. >> >> The BUG_ON() helps us see where the problem is while at the >> same time stopping the kernel before the NULL deref happens. >Ok this is starting to make sense now - went a bit off track but think my >general thinking is ok - i.e. if we return the function with an error code >before the dereference then this basically does the same thing as BUG_ON >but without crashing the kernel. > >Something like: > >if (WARN_ON(!new_active_slave) { >netdev_dbg("Can't add new active slave - pointer null"); >return ERROR_CODE >} In general, yes, but in this case, the condition should be impossible to hit, so BUG_ON seems appropriate. If bond_slave_get_rtnl/rcu() returns NULL for an actual bonding slave, other code paths (bond_fill_slave_info, bond_handle_frame) will likely crash before getting to this one. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c
David Miller <da...@davemloft.net> wrote: >From: Michael D <michael.j.dilm...@gmail.com> >Date: Wed, 21 Jun 2017 22:41:07 +0100 > >> I don't think you can stop it being dereferenced... you just need to >> prevent an attacker from exploiting the null pointer dereference >> vulnerability right? And this is done by returning the function right >> away? > >What's all of this about an "attacker"? > >If there is a bug, we dererence a NULL pointer, and we should >fix that bug. > >The BUG_ON() helps us see where the problem is while at the >same time stopping the kernel before the NULL deref happens. Looking at the code more carefully than I did earlier, the only way the BUG_ON will hit is if the rx_handler_data is NULL for a bonding slave when this code executes. This should be impossible, as there doesn't appear to be any way to get into bond_option_active_slave_set for a slave prior to bond_enslave registering the rx_handler for that slave, as these operations are mutexed by RTNL. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c
David Miller wrote: >From: Michael D >Date: Wed, 21 Jun 2017 22:41:07 +0100 > >> I don't think you can stop it being dereferenced... you just need to >> prevent an attacker from exploiting the null pointer dereference >> vulnerability right? And this is done by returning the function right >> away? > >What's all of this about an "attacker"? > >If there is a bug, we dererence a NULL pointer, and we should >fix that bug. > >The BUG_ON() helps us see where the problem is while at the >same time stopping the kernel before the NULL deref happens. Looking at the code more carefully than I did earlier, the only way the BUG_ON will hit is if the rx_handler_data is NULL for a bonding slave when this code executes. This should be impossible, as there doesn't appear to be any way to get into bond_option_active_slave_set for a slave prior to bond_enslave registering the rx_handler for that slave, as these operations are mutexed by RTNL. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c
Michael J Dilmore <michael.j.dilm...@gmail.com> wrote: >The function below contains a BUG_ON where no active slave is detected. The >patch >converts this to a WARN_ON to avoid crashing the kernel. > >Signed-off-by: Michael J Dilmore <michael.j.dilm...@gmail.com> >--- > drivers/net/bonding/bond_options.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_options.c >b/drivers/net/bonding/bond_options.c >index 1bcbb89..c4b4791 100644 >--- a/drivers/net/bonding/bond_options.c >+++ b/drivers/net/bonding/bond_options.c >@@ -778,7 +778,7 @@ static int bond_option_active_slave_set(struct bonding >*bond, > struct slave *old_active = > rtnl_dereference(bond->curr_active_slave); > struct slave *new_active = bond_slave_get_rtnl(slave_dev); > >- BUG_ON(!new_active); >+ WARN_ON(!new_active); This is a reasonable idea in principle, but will require additional changes to prevent dereferencing new_active if it is NULL (which would happen just below this point in the code). -J > if (new_active == old_active) { > /* do nothing */ >-- >2.7.4 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c
Michael J Dilmore wrote: >The function below contains a BUG_ON where no active slave is detected. The >patch >converts this to a WARN_ON to avoid crashing the kernel. > >Signed-off-by: Michael J Dilmore >--- > drivers/net/bonding/bond_options.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_options.c >b/drivers/net/bonding/bond_options.c >index 1bcbb89..c4b4791 100644 >--- a/drivers/net/bonding/bond_options.c >+++ b/drivers/net/bonding/bond_options.c >@@ -778,7 +778,7 @@ static int bond_option_active_slave_set(struct bonding >*bond, > struct slave *old_active = > rtnl_dereference(bond->curr_active_slave); > struct slave *new_active = bond_slave_get_rtnl(slave_dev); > >- BUG_ON(!new_active); >+ WARN_ON(!new_active); This is a reasonable idea in principle, but will require additional changes to prevent dereferencing new_active if it is NULL (which would happen just below this point in the code). -J > if (new_active == old_active) { > /* do nothing */ >-- >2.7.4 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] Convert multiple netdev_info messages to netdev_dbg
Joe Perches <j...@perches.com> wrote: >On Thu, 2017-06-15 at 19:14 +0100, Michael J Dilmore wrote: >> Multiple netdev_info messages clutter kernel output. Also add netdev_dbg for >> packets per slave. >[] >> diff --git a/drivers/net/bonding/bond_options.c >> b/drivers/net/bonding/bond_options.c >[] >> @@ -9,6 +9,8 @@ >> * (at your option) any later version. >> */ >> >> +#define DEBUG 1 > >Is defining DEBUG really worthwhile. I don't believe so, since if CONFIG_DYNAMIC_DEBUG is not enabled, having #define DEBUG will enable all of the netdev_dbg messages unconditionally, which is the opposite of the stated purpose of the patch. If DYNAMIC_DEBUG is enabled, having DEBUG doesn't do anything that I can see. -J >As well, it's almost always just >#define DEBUG >without any level value unless the >level value is used in the code. > >> + >> #include >> #include >> #include >> @@ -719,13 +721,13 @@ static int bond_option_mode_set(struct bonding *bond, >> const struct bond_opt_value *newval) >> { >> if (!bond_mode_uses_arp(newval->value) && bond->params.arp_interval) { >> -netdev_info(bond->dev, "%s mode is incompatible with arp >> monitoring, start mii monitoring\n", >> +netdev_dbg(bond->dev, "%s mode is incompatible with arp >> monitoring, start mii monitoring\n", >> newval->string); > >Please realign any multiple line arguments to the >open parenthesis at the same time. > >> /* disable arp monitoring */ >> bond->params.arp_interval = 0; >> /* set miimon to default value */ >> bond->params.miimon = BOND_DEFAULT_MIIMON; >> -netdev_info(bond->dev, "Setting MII monitoring interval to >> %d\n", >> +netdev_dbg(bond->dev, "Setting MII monitoring interval to %d\n", >> bond->params.miimon); > >etc... > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] Convert multiple netdev_info messages to netdev_dbg
Joe Perches wrote: >On Thu, 2017-06-15 at 19:14 +0100, Michael J Dilmore wrote: >> Multiple netdev_info messages clutter kernel output. Also add netdev_dbg for >> packets per slave. >[] >> diff --git a/drivers/net/bonding/bond_options.c >> b/drivers/net/bonding/bond_options.c >[] >> @@ -9,6 +9,8 @@ >> * (at your option) any later version. >> */ >> >> +#define DEBUG 1 > >Is defining DEBUG really worthwhile. I don't believe so, since if CONFIG_DYNAMIC_DEBUG is not enabled, having #define DEBUG will enable all of the netdev_dbg messages unconditionally, which is the opposite of the stated purpose of the patch. If DYNAMIC_DEBUG is enabled, having DEBUG doesn't do anything that I can see. -J >As well, it's almost always just >#define DEBUG >without any level value unless the >level value is used in the code. > >> + >> #include >> #include >> #include >> @@ -719,13 +721,13 @@ static int bond_option_mode_set(struct bonding *bond, >> const struct bond_opt_value *newval) >> { >> if (!bond_mode_uses_arp(newval->value) && bond->params.arp_interval) { >> -netdev_info(bond->dev, "%s mode is incompatible with arp >> monitoring, start mii monitoring\n", >> +netdev_dbg(bond->dev, "%s mode is incompatible with arp >> monitoring, start mii monitoring\n", >> newval->string); > >Please realign any multiple line arguments to the >open parenthesis at the same time. > >> /* disable arp monitoring */ >> bond->params.arp_interval = 0; >> /* set miimon to default value */ >> bond->params.miimon = BOND_DEFAULT_MIIMON; >> -netdev_info(bond->dev, "Setting MII monitoring interval to >> %d\n", >> +netdev_dbg(bond->dev, "Setting MII monitoring interval to %d\n", >> bond->params.miimon); > >etc... > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] net: bonding: use new api ethtool_{get|set}_link_ksettings
Philippe Reynes <trem...@gmail.com> wrote: >The ethtool api {get|set}_settings is deprecated. >We move this driver to new api {get|set}_link_ksettings. This is just an API change, i.e., no change to functionality? -J >Signed-off-by: Philippe Reynes <trem...@gmail.com> >--- > drivers/net/bonding/bond_main.c | 16 > 1 files changed, 8 insertions(+), 8 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index c9944d8..5708f17 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -4080,16 +4080,16 @@ static netdev_tx_t bond_start_xmit(struct sk_buff >*skb, struct net_device *dev) > return ret; > } > >-static int bond_ethtool_get_settings(struct net_device *bond_dev, >- struct ethtool_cmd *ecmd) >+static int bond_ethtool_get_link_ksettings(struct net_device *bond_dev, >+ struct ethtool_link_ksettings *cmd) > { > struct bonding *bond = netdev_priv(bond_dev); > unsigned long speed = 0; > struct list_head *iter; > struct slave *slave; > >- ecmd->duplex = DUPLEX_UNKNOWN; >- ecmd->port = PORT_OTHER; >+ cmd->base.duplex = DUPLEX_UNKNOWN; >+ cmd->base.port = PORT_OTHER; > > /* Since bond_slave_can_tx returns false for all inactive or down > slaves, we >* do not need to check mode. Though link speed might not represent >@@ -4100,12 +4100,12 @@ static int bond_ethtool_get_settings(struct net_device >*bond_dev, > if (bond_slave_can_tx(slave)) { > if (slave->speed != SPEED_UNKNOWN) > speed += slave->speed; >- if (ecmd->duplex == DUPLEX_UNKNOWN && >+ if (cmd->base.duplex == DUPLEX_UNKNOWN && > slave->duplex != DUPLEX_UNKNOWN) >- ecmd->duplex = slave->duplex; >+ cmd->base.duplex = slave->duplex; > } > } >- ethtool_cmd_speed_set(ecmd, speed ? : SPEED_UNKNOWN); >+ cmd->base.speed = speed ? : SPEED_UNKNOWN; > > return 0; > } >@@ -4121,8 +4121,8 @@ static void bond_ethtool_get_drvinfo(struct net_device >*bond_dev, > > static const struct ethtool_ops bond_ethtool_ops = { > .get_drvinfo= bond_ethtool_get_drvinfo, >- .get_settings = bond_ethtool_get_settings, > .get_link = ethtool_op_get_link, >+ .get_link_ksettings = bond_ethtool_get_link_ksettings, > }; > > static const struct net_device_ops bond_netdev_ops = { >-- >1.7.4.4 --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] net: bonding: use new api ethtool_{get|set}_link_ksettings
Philippe Reynes wrote: >The ethtool api {get|set}_settings is deprecated. >We move this driver to new api {get|set}_link_ksettings. This is just an API change, i.e., no change to functionality? -J >Signed-off-by: Philippe Reynes >--- > drivers/net/bonding/bond_main.c | 16 > 1 files changed, 8 insertions(+), 8 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index c9944d8..5708f17 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -4080,16 +4080,16 @@ static netdev_tx_t bond_start_xmit(struct sk_buff >*skb, struct net_device *dev) > return ret; > } > >-static int bond_ethtool_get_settings(struct net_device *bond_dev, >- struct ethtool_cmd *ecmd) >+static int bond_ethtool_get_link_ksettings(struct net_device *bond_dev, >+ struct ethtool_link_ksettings *cmd) > { > struct bonding *bond = netdev_priv(bond_dev); > unsigned long speed = 0; > struct list_head *iter; > struct slave *slave; > >- ecmd->duplex = DUPLEX_UNKNOWN; >- ecmd->port = PORT_OTHER; >+ cmd->base.duplex = DUPLEX_UNKNOWN; >+ cmd->base.port = PORT_OTHER; > > /* Since bond_slave_can_tx returns false for all inactive or down > slaves, we >* do not need to check mode. Though link speed might not represent >@@ -4100,12 +4100,12 @@ static int bond_ethtool_get_settings(struct net_device >*bond_dev, > if (bond_slave_can_tx(slave)) { > if (slave->speed != SPEED_UNKNOWN) > speed += slave->speed; >- if (ecmd->duplex == DUPLEX_UNKNOWN && >+ if (cmd->base.duplex == DUPLEX_UNKNOWN && > slave->duplex != DUPLEX_UNKNOWN) >- ecmd->duplex = slave->duplex; >+ cmd->base.duplex = slave->duplex; > } > } >- ethtool_cmd_speed_set(ecmd, speed ? : SPEED_UNKNOWN); >+ cmd->base.speed = speed ? : SPEED_UNKNOWN; > > return 0; > } >@@ -4121,8 +4121,8 @@ static void bond_ethtool_get_drvinfo(struct net_device >*bond_dev, > > static const struct ethtool_ops bond_ethtool_ops = { > .get_drvinfo= bond_ethtool_get_drvinfo, >- .get_settings = bond_ethtool_get_settings, > .get_link = ethtool_op_get_link, >+ .get_link_ksettings = bond_ethtool_get_link_ksettings, > }; > > static const struct net_device_ops bond_netdev_ops = { >-- >1.7.4.4 --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding: Prevent deletion of a bond, or the last slave from a bond, with active usage.
= command + 2; >+ else >+ ifname = command + 1; >+ > if ((strlen(command) <= 1) || > !dev_valid_name(ifname)) > goto err_no_cmd; >@@ -130,6 +135,32 @@ static ssize_t bonding_store_bonds(struct class *cls, > res = -ENODEV; > } > rtnl_unlock(); >+ } else if ((command[0] == '?') && (command[1] == '-')) { >+ struct net_device *bond_dev; >+ >+ rtnl_lock(); >+ bond_dev = bond_get_by_name(bn, ifname); >+ >+ if (bond_dev) { >+ struct in_device *in_dev; >+ struct bonding *bond = netdev_priv(bond_dev); >+ >+ in_dev = __in_dev_get_rtnl(bond_dev); >+ >+ if (((in_dev->ifa_list) != NULL) && >+ (bond->slave_cnt > 0)) { >+ pr_err("%s is in use. Unconfigure IP %pI4 >before deletion.\n", >+ ifname, _dev->ifa_list->ifa_local); >+ rtnl_unlock(); >+ return -EBUSY; >+ } >+ pr_info("%s is being deleted...\n", ifname); >+ unregister_netdevice(bond_dev); >+ } else { >+ pr_err("unable to delete non-existent %s\n", ifname); >+ res = -ENODEV; >+ } >+ rtnl_unlock(); > } else > goto err_no_cmd; > >@@ -139,7 +170,7 @@ static ssize_t bonding_store_bonds(struct class *cls, > return res; > > err_no_cmd: >- pr_err("no command found in bonding_masters - use +ifname or >-ifname\n"); >+ pr_err("no command found in bonding_masters - use +ifname or -ifname or >?-ifname\n"); > return -EPERM; > } > >-- >1.8.3.1 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding: Prevent deletion of a bond, or the last slave from a bond, with active usage.
+ > if ((strlen(command) <= 1) || > !dev_valid_name(ifname)) > goto err_no_cmd; >@@ -130,6 +135,32 @@ static ssize_t bonding_store_bonds(struct class *cls, > res = -ENODEV; > } > rtnl_unlock(); >+ } else if ((command[0] == '?') && (command[1] == '-')) { >+ struct net_device *bond_dev; >+ >+ rtnl_lock(); >+ bond_dev = bond_get_by_name(bn, ifname); >+ >+ if (bond_dev) { >+ struct in_device *in_dev; >+ struct bonding *bond = netdev_priv(bond_dev); >+ >+ in_dev = __in_dev_get_rtnl(bond_dev); >+ >+ if (((in_dev->ifa_list) != NULL) && >+ (bond->slave_cnt > 0)) { >+ pr_err("%s is in use. Unconfigure IP %pI4 >before deletion.\n", >+ ifname, _dev->ifa_list->ifa_local); >+ rtnl_unlock(); >+ return -EBUSY; >+ } >+ pr_info("%s is being deleted...\n", ifname); >+ unregister_netdevice(bond_dev); >+ } else { >+ pr_err("unable to delete non-existent %s\n", ifname); >+ res = -ENODEV; >+ } >+ rtnl_unlock(); > } else > goto err_no_cmd; > >@@ -139,7 +170,7 @@ static ssize_t bonding_store_bonds(struct class *cls, > return res; > > err_no_cmd: >- pr_err("no command found in bonding_masters - use +ifname or >-ifname\n"); >+ pr_err("no command found in bonding_masters - use +ifname or -ifname or >?-ifname\n"); > return -EPERM; > } > >-- >1.8.3.1 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding:Fix perfomance drop if one bonding device in configuration is repeatedly restarted
Nicholas Krause <xerofo...@gmail.com> wrote: >This fixes a issue with the following test caseL > >1. Created a alb bonding(bond0) with three of 1Gb interfaces >(eth0, eth1, eth2) under ipv4, set the bond0 ip address as >192.168.10.101, run command "iperf -s" as the server(x86_64). > >2. Run command "iperf -c 192.168.10.101 -t 1 -i 1" >on three linux client with CentOS6.5 x86_64 > >3. Run the following commands on server repeatly, >"ifconfig eth1 down; sleep 20; ifconfig eth1 up". >monitor the bonding performance with program " >nmon_x86_64_centos6", after some time, the >performance droped from 3000 to 2000 Mbps, >and could not go up to 3000Mbps any more, >it seems that the bonding doesn't rebalance >again. >This is due to use incorrectly checking in the function, >rlb_choose_channel if the ethernet address is not equal >to 64 bits before updating our mac address from arp. >However this is incorrect as the ethernet header length >needs to be 64bits in order to update our mac address >from arp without the above test case having a perfomance >regression. Fix it by checking for the ethernet header >having 64 bits and not. NAK. I don't understand what this is trying to say, but the change below does not look valid. >Signed-off-by: Nicholas Krause <xerofo...@gmail.com> >--- > drivers/net/bonding/bond_alb.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c >index 551f0f8..d5ae8a5 100644 >--- a/drivers/net/bonding/bond_alb.c >+++ b/drivers/net/bonding/bond_alb.c >@@ -593,7 +593,7 @@ static struct slave *rlb_choose_channel(struct sk_buff >*skb, struct bonding *bon > if ((client_info->ip_src == arp->ip_src) && > (client_info->ip_dst == arp->ip_dst)) { > /* the entry is already assigned to this client */ >- if (!ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) { >+ if (ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) { > /* update mac address from arp */ > ether_addr_copy(client_info->mac_dst, > arp->mac_dst); Regardless of the description above, this test is insuring that the code doesn't use the broadcast MAC address for the client. Changing it seems like a bad thing to do, as it would cause traffic for the client to be sent to the ethernet broadcast address. This might, in fact, make the described test run better, but it has nothing to do with rebalancing and would negatively impact other hosts on the network. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding:Fix perfomance drop if one bonding device in configuration is repeatedly restarted
Nicholas Krause wrote: >This fixes a issue with the following test caseL > >1. Created a alb bonding(bond0) with three of 1Gb interfaces >(eth0, eth1, eth2) under ipv4, set the bond0 ip address as >192.168.10.101, run command "iperf -s" as the server(x86_64). > >2. Run command "iperf -c 192.168.10.101 -t 1 -i 1" >on three linux client with CentOS6.5 x86_64 > >3. Run the following commands on server repeatly, >"ifconfig eth1 down; sleep 20; ifconfig eth1 up". >monitor the bonding performance with program " >nmon_x86_64_centos6", after some time, the >performance droped from 3000 to 2000 Mbps, >and could not go up to 3000Mbps any more, >it seems that the bonding doesn't rebalance >again. >This is due to use incorrectly checking in the function, >rlb_choose_channel if the ethernet address is not equal >to 64 bits before updating our mac address from arp. >However this is incorrect as the ethernet header length >needs to be 64bits in order to update our mac address >from arp without the above test case having a perfomance >regression. Fix it by checking for the ethernet header >having 64 bits and not. NAK. I don't understand what this is trying to say, but the change below does not look valid. >Signed-off-by: Nicholas Krause >--- > drivers/net/bonding/bond_alb.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c >index 551f0f8..d5ae8a5 100644 >--- a/drivers/net/bonding/bond_alb.c >+++ b/drivers/net/bonding/bond_alb.c >@@ -593,7 +593,7 @@ static struct slave *rlb_choose_channel(struct sk_buff >*skb, struct bonding *bon > if ((client_info->ip_src == arp->ip_src) && > (client_info->ip_dst == arp->ip_dst)) { > /* the entry is already assigned to this client */ >- if (!ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) { >+ if (ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) { > /* update mac address from arp */ > ether_addr_copy(client_info->mac_dst, > arp->mac_dst); Regardless of the description above, this test is insuring that the code doesn't use the broadcast MAC address for the client. Changing it seems like a bad thing to do, as it would cause traffic for the client to be sent to the ethernet broadcast address. This might, in fact, make the described test run better, but it has nothing to do with rebalancing and would negatively impact other hosts on the network. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [RFC PATCH net] net/core: don't increment rx_dropped on inactive slaves
Jarod Wilson wrote: >The network core tries to keep track of dropped packets, but some packets >you wouldn't really call dropped, so much as intentionally ignored, under >certain circumstances. One such case is that of bonding and team device >slaves that are currently inactive. Their respective rx_handler functions >return RX_HANDLER_EXACT (the only places in the kernel that return that), >which ends up tracking into the network core's __netif_receive_skb_core() >function's drop path, with no pt_prev set. On a noisy network, this can >result in a very rapidly incrementing rx_dropped counter, not only on the >inactive slave(s), but also on the master device, such as the following: [...] >In this scenario, p5p1, p5p2 and p7p1 are all inactive slaves in an >active-backup bond0, and you can see that all three have high drop counts, >with the master bond0 showing a tally of all three. > >I know that this was previously discussed some here: > >http://www.spinics.net/lists/netdev/msg226341.html > >It seems additional counters never came to fruition, but honestly, for >this particular case, I'm not even sure they're warranted, I'd be inclined >to say just silently drop these packets without incrementing a counter. At >least, that's probably what would make someone who has complained loudly >about this issue happy, as they have monitoring tools that are squaking >loudly at any increments to rx_dropped. I don't think the kernel should silently drop packets; there should be a counter somewhere. If a packet is being thrown away deliberately, it should not just vanish into the screaming void of space. Someday someone will try and track down where that packet is being dropped. I've had that same conversation with customers who insist on accounting for every packet drop (from the "any drop is an error" mindset), so I understand the issue. Thinking about the prior discussion, the rx_drop_inactive is still a good idea, but I'd actually today get good use from a "rx_drop_unforwardable" (or an equivalent but shorter name) counter that counts every time a packet is dropped due to is_skb_forwardable() returning false. __dev_forward_skb does this (and hits rx_dropped), as does the bridge (and does not count it). -J >CC: "David S. Miller" >CC: Eric Dumazet >CC: Jiri Pirko >CC: Daniel Borkmann >CC: Tom Herbert >CC: Jay Vosburgh >CC: Veaceslav Falico >CC: Andy Gospodarek >CC: net...@vger.kernel.org >Signed-off-by: Jarod Wilson >--- > net/core/dev.c | 3 +++ > 1 file changed, 3 insertions(+) > >diff --git a/net/core/dev.c b/net/core/dev.c >index 8cba3d8..1354c7b 100644 >--- a/net/core/dev.c >+++ b/net/core/dev.c >@@ -4153,8 +4153,11 @@ ncls: > else > ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); > } else { >+ if (deliver_exact) >+ goto inactive; /* bond or team inactive slave */ > drop: > atomic_long_inc(>dev->rx_dropped); >+inactive: > kfree_skb(skb); > /* Jamal, now you will not able to escape explaining >* me how you were going to use this. :-) >-- >1.8.3.1 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [RFC PATCH net] net/core: don't increment rx_dropped on inactive slaves
Jarod Wilson <ja...@redhat.com> wrote: >The network core tries to keep track of dropped packets, but some packets >you wouldn't really call dropped, so much as intentionally ignored, under >certain circumstances. One such case is that of bonding and team device >slaves that are currently inactive. Their respective rx_handler functions >return RX_HANDLER_EXACT (the only places in the kernel that return that), >which ends up tracking into the network core's __netif_receive_skb_core() >function's drop path, with no pt_prev set. On a noisy network, this can >result in a very rapidly incrementing rx_dropped counter, not only on the >inactive slave(s), but also on the master device, such as the following: [...] >In this scenario, p5p1, p5p2 and p7p1 are all inactive slaves in an >active-backup bond0, and you can see that all three have high drop counts, >with the master bond0 showing a tally of all three. > >I know that this was previously discussed some here: > >http://www.spinics.net/lists/netdev/msg226341.html > >It seems additional counters never came to fruition, but honestly, for >this particular case, I'm not even sure they're warranted, I'd be inclined >to say just silently drop these packets without incrementing a counter. At >least, that's probably what would make someone who has complained loudly >about this issue happy, as they have monitoring tools that are squaking >loudly at any increments to rx_dropped. I don't think the kernel should silently drop packets; there should be a counter somewhere. If a packet is being thrown away deliberately, it should not just vanish into the screaming void of space. Someday someone will try and track down where that packet is being dropped. I've had that same conversation with customers who insist on accounting for every packet drop (from the "any drop is an error" mindset), so I understand the issue. Thinking about the prior discussion, the rx_drop_inactive is still a good idea, but I'd actually today get good use from a "rx_drop_unforwardable" (or an equivalent but shorter name) counter that counts every time a packet is dropped due to is_skb_forwardable() returning false. __dev_forward_skb does this (and hits rx_dropped), as does the bridge (and does not count it). -J >CC: "David S. Miller" <da...@davemloft.net> >CC: Eric Dumazet <eduma...@google.com> >CC: Jiri Pirko <j...@mellanox.com> >CC: Daniel Borkmann <dan...@iogearbox.net> >CC: Tom Herbert <t...@herbertland.com> >CC: Jay Vosburgh <j.vosbu...@gmail.com> >CC: Veaceslav Falico <vfal...@gmail.com> >CC: Andy Gospodarek <go...@cumulusnetworks.com> >CC: net...@vger.kernel.org >Signed-off-by: Jarod Wilson <ja...@redhat.com> >--- > net/core/dev.c | 3 +++ > 1 file changed, 3 insertions(+) > >diff --git a/net/core/dev.c b/net/core/dev.c >index 8cba3d8..1354c7b 100644 >--- a/net/core/dev.c >+++ b/net/core/dev.c >@@ -4153,8 +4153,11 @@ ncls: > else > ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); > } else { >+ if (deliver_exact) >+ goto inactive; /* bond or team inactive slave */ > drop: > atomic_long_inc(>dev->rx_dropped); >+inactive: > kfree_skb(skb); > /* Jamal, now you will not able to escape explaining >* me how you were going to use this. :-) >-- >1.8.3.1 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH v4] net/bonding: send arp in interval if no active slave
Jarod Wilson wrote: >Jarod Wilson wrote: >... >> As Andy already stated I'm not a fan of such workarounds either but it's >> necessary sometimes so if this is going to be actually considered then a >> few things need to be fixed. Please make this a proper bonding option >> which can be changed at runtime and not only via a module parameter. > >Is there any particular userspace tool that would need some updating, or >is adding the sysfs knobs sufficient here? I think I've got all the sysfs >stuff thrown together now, but still need to test. Most (all?) bonding options should be configurable via iproute (netlink) now. > >>> Now, I saw that you've only tested with 500 ms, can't this be fixed by >>> using >>> a different interval ? This seems like a very specific problem to have a >>> whole new option for. >> >> ...I'll wait until we've heard confirmation from Uwe that intervals >> other than 500ms don't fix things. > >Okay, so I believe the "only tested with 500ms" was in reference to >testing with Uwe's initial patch. I do have supporting evidence in a >bugzilla report that shows upwards of 5000ms still experience the problem >here. I did set up some switches and attempt to reproduce this yesterday; I daisy-chained three switches (two Cisco and an HP) together and connected the bonded interfaces to the "end" switches. I tried various ARP targets (the switch, hosts on various points of the switch) and varying arp_intervals and was unable to reproduce the problem. As I understand it, the working theory is something like this: - host with two bonded interfaces, A and B. For active-backup mode, the interfaces have been assigned the same MAC address. - switch has MAC for B in its forwarding table - bonding goes from down to up, and thinks all its slaves are down, and starts the "curr_arp_slave" search for an active arp_ip_target. In this case, it starts with A, and sends an ARP from A. As an aside, I'm not 100% clear on what exactly is going on in the "bonding goes from down to up" transition; this seems to be key in reproducing the issue. - switch sees source mac coming from port A, starts to update its forwarding table - meanwhile, switch forwards ARP request, and receives ARP reply, which it forwards to port B. Bonding drops this, as the slave is inactive. - switch finishes updating forwarding table, MAC is now assigned to port A. - bonding now tries sending on port B, and the cycle repeats. If this is what's taking place, then the arp_interval itself is irrelevant, the race is between the switch table update and the generation of the ARP reply. Also, presuming the above is what's going on, we could modify the ARP "curr_arp_slave" logic a bit to resolve this without requiring any magic knobs. For example, we could change the "drop on inactive" logic to recognise the "curr_arp_slave" search and accept the unicast ARP reply, and perhaps make that receiving slave the next curr_arp_slave automatically. I also wonder if the fail_over_mac option would affect this behavior, as it would cause the slaves to keep their MAC address for the duration, so the switch would not see the MAC move from port to port. Another thought would be to have the curr_arp_slave cycle through the slaves in random order, but that could create non-deterministic results even when things are working correctly. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v4] net/bonding: send arp in interval if no active slave
Jarod Wilson <ja...@redhat.com> wrote: >Jarod Wilson wrote: >... >> As Andy already stated I'm not a fan of such workarounds either but it's >> necessary sometimes so if this is going to be actually considered then a >> few things need to be fixed. Please make this a proper bonding option >> which can be changed at runtime and not only via a module parameter. > >Is there any particular userspace tool that would need some updating, or >is adding the sysfs knobs sufficient here? I think I've got all the sysfs >stuff thrown together now, but still need to test. Most (all?) bonding options should be configurable via iproute (netlink) now. > >>> Now, I saw that you've only tested with 500 ms, can't this be fixed by >>> using >>> a different interval ? This seems like a very specific problem to have a >>> whole new option for. >> >> ...I'll wait until we've heard confirmation from Uwe that intervals >> other than 500ms don't fix things. > >Okay, so I believe the "only tested with 500ms" was in reference to >testing with Uwe's initial patch. I do have supporting evidence in a >bugzilla report that shows upwards of 5000ms still experience the problem >here. I did set up some switches and attempt to reproduce this yesterday; I daisy-chained three switches (two Cisco and an HP) together and connected the bonded interfaces to the "end" switches. I tried various ARP targets (the switch, hosts on various points of the switch) and varying arp_intervals and was unable to reproduce the problem. As I understand it, the working theory is something like this: - host with two bonded interfaces, A and B. For active-backup mode, the interfaces have been assigned the same MAC address. - switch has MAC for B in its forwarding table - bonding goes from down to up, and thinks all its slaves are down, and starts the "curr_arp_slave" search for an active arp_ip_target. In this case, it starts with A, and sends an ARP from A. As an aside, I'm not 100% clear on what exactly is going on in the "bonding goes from down to up" transition; this seems to be key in reproducing the issue. - switch sees source mac coming from port A, starts to update its forwarding table - meanwhile, switch forwards ARP request, and receives ARP reply, which it forwards to port B. Bonding drops this, as the slave is inactive. - switch finishes updating forwarding table, MAC is now assigned to port A. - bonding now tries sending on port B, and the cycle repeats. If this is what's taking place, then the arp_interval itself is irrelevant, the race is between the switch table update and the generation of the ARP reply. Also, presuming the above is what's going on, we could modify the ARP "curr_arp_slave" logic a bit to resolve this without requiring any magic knobs. For example, we could change the "drop on inactive" logic to recognise the "curr_arp_slave" search and accept the unicast ARP reply, and perhaps make that receiving slave the next curr_arp_slave automatically. I also wonder if the fail_over_mac option would affect this behavior, as it would cause the slaves to keep their MAC address for the duration, so the switch would not see the MAC move from port to port. Another thought would be to have the curr_arp_slave cycle through the slaves in random order, but that could create non-deterministic results even when things are working correctly. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] net/bonding: send arp in interval if no active slave
Uwe Koziolek wrote: >On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote: >> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote: >>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote: >>>> Uwe Koziolek wrote: >>>> >>>>> On2015-08-17 07:12 PM,Jarod Wilson wrote: >>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote: >>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote: >>>>>>>> From: Uwe Koziolek >>>>>>>> >>>>>>>> With some very finicky switch hardware, active backup bonding can get >>>>>>>> into >>>>>>>> a situation where we play ping-pong between interfaces, trying to get >>>>>>>> one >>>>>>>> to come up as the active slave. There seems to be an issue with the >>>>>>>> switch's arp replies either taking too long, or simply getting lost, >>>>>>>> so we >>>>>>>> wind up unable to get any interface up and active. Sometimes, the issue >>>>>>>> sorts itself out after a while, sometimes it doesn't. >>>>>>>> >>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an >>>>>>>> additional >>>>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in >>>>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing >>>>>>>> with >>>>>>>> this hardware combination. >>>>>>> Sorry, I don't understand the logic of why it works, and what exactly >>>>>>> are >>>>>>> we fixiing here. >>>>>>> >>>>>>> It also breaks completely the logic for link state management in case >>>>>>> of no >>>>>>> current active slave for 2*arp_interval. >>>>>>> >>>>>>> Could you please elaborate what exactly is fixed here, and how it >>>>>>> works? :) >>>>>> I can either duplicate some information from the bug, or Uwe can, to >>>>>> illustrate the exact nature of the problem. >>>>>> >>>>>>> p.s. num_grat_arp maybe could help? >>>>>> That was my thought as well, but as I understand it, that route was >>>>>> explored, and it didn't help any. I don't actually have a reproducer >>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle >>>>>> here... >>>>>> >>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp >>>>>> settings were tried that didn't help? I'm still of the mind that if >>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off >>>>>> num_grat_arp. >>>>> The bonding slaves are connected to high available switches, each of the >>>>> slaves is connected to a different switch. If the bond is starting, only >>>>> the selected slave sends one arp-request. If a matching arp_response was >>>>> received, this slave and the bond is going into state up, sending the >>>>> gratitious arps... >>>>> But if you got no arp reply the next slave was selected. >>>>> With most of the newer switches, not overloaded, or with other software >>>>> bugs, or with a single switch configuration, you would get a arp response >>>>> on the first arp request. >>>>> But in case of high availability configuration with non perfect switches >>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a >>>>> response on the first arp request. >>>>> >>>>> I have seen network snoops, there the switches are not responding to the >>>>> first arp request on slave 1, the second arp request was sent on slave 2 >>>>> but the response was received on slave one, and all following arp >>>>> requests are anwsered on the wrong slave for a longer time. >>>>Could you elaborate on the exact "high availability >>>> configuration" here, including the model(s) of switch(es) involved? >>>> >>>>Is this some kind of race between the switch or switches >>>> updating the forwarding tables and the bond flip flopping between the >>>> slaves
Re: [PATCH] net/bonding: send arp in interval if no active slave
Uwe Koziolek <uwe.kozio...@redknee.com> wrote: >On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote: >> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote: >>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote: >>>> Uwe Koziolek <uwe.kozio...@redknee.com> wrote: >>>> >>>>> On2015-08-17 07:12 PM,Jarod Wilson wrote: >>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote: >>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote: >>>>>>>> From: Uwe Koziolek <uwe.kozio...@redknee.com> >>>>>>>> >>>>>>>> With some very finicky switch hardware, active backup bonding can get >>>>>>>> into >>>>>>>> a situation where we play ping-pong between interfaces, trying to get >>>>>>>> one >>>>>>>> to come up as the active slave. There seems to be an issue with the >>>>>>>> switch's arp replies either taking too long, or simply getting lost, >>>>>>>> so we >>>>>>>> wind up unable to get any interface up and active. Sometimes, the issue >>>>>>>> sorts itself out after a while, sometimes it doesn't. >>>>>>>> >>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an >>>>>>>> additional >>>>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in >>>>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing >>>>>>>> with >>>>>>>> this hardware combination. >>>>>>> Sorry, I don't understand the logic of why it works, and what exactly >>>>>>> are >>>>>>> we fixiing here. >>>>>>> >>>>>>> It also breaks completely the logic for link state management in case >>>>>>> of no >>>>>>> current active slave for 2*arp_interval. >>>>>>> >>>>>>> Could you please elaborate what exactly is fixed here, and how it >>>>>>> works? :) >>>>>> I can either duplicate some information from the bug, or Uwe can, to >>>>>> illustrate the exact nature of the problem. >>>>>> >>>>>>> p.s. num_grat_arp maybe could help? >>>>>> That was my thought as well, but as I understand it, that route was >>>>>> explored, and it didn't help any. I don't actually have a reproducer >>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle >>>>>> here... >>>>>> >>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp >>>>>> settings were tried that didn't help? I'm still of the mind that if >>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off >>>>>> num_grat_arp. >>>>> The bonding slaves are connected to high available switches, each of the >>>>> slaves is connected to a different switch. If the bond is starting, only >>>>> the selected slave sends one arp-request. If a matching arp_response was >>>>> received, this slave and the bond is going into state up, sending the >>>>> gratitious arps... >>>>> But if you got no arp reply the next slave was selected. >>>>> With most of the newer switches, not overloaded, or with other software >>>>> bugs, or with a single switch configuration, you would get a arp response >>>>> on the first arp request. >>>>> But in case of high availability configuration with non perfect switches >>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a >>>>> response on the first arp request. >>>>> >>>>> I have seen network snoops, there the switches are not responding to the >>>>> first arp request on slave 1, the second arp request was sent on slave 2 >>>>> but the response was received on slave one, and all following arp >>>>> requests are anwsered on the wrong slave for a longer time. >>>>Could you elaborate on the exact "high availability >>>> configuration" here, including the model(s) of switch(es) involved? >>>> >>>>Is this some kind of race between the switch or switches >>>&
Re: [PATCH] net/bonding: send arp in interval if no active slave
Uwe Koziolek wrote: >On2015-08-17 07:12 PM,Jarod Wilson wrote: >> On 2015-08-17 12:55 PM, Veaceslav Falico wrote: >>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote: >>>> From: Uwe Koziolek >>>> >>>> With some very finicky switch hardware, active backup bonding can get >>>> into >>>> a situation where we play ping-pong between interfaces, trying to get >>>> one >>>> to come up as the active slave. There seems to be an issue with the >>>> switch's arp replies either taking too long, or simply getting lost, >>>> so we >>>> wind up unable to get any interface up and active. Sometimes, the issue >>>> sorts itself out after a while, sometimes it doesn't. >>>> >>>> Testing with num_grat_arp has proven fruitless, but sending an >>>> additional >>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in >>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing >>>> with >>>> this hardware combination. >>> >>> Sorry, I don't understand the logic of why it works, and what exactly >>> are >>> we fixiing here. >>> >>> It also breaks completely the logic for link state management in case >>> of no >>> current active slave for 2*arp_interval. >>> >>> Could you please elaborate what exactly is fixed here, and how it >>> works? :) >> >> I can either duplicate some information from the bug, or Uwe can, to >> illustrate the exact nature of the problem. >> >>> p.s. num_grat_arp maybe could help? >> >> That was my thought as well, but as I understand it, that route was >> explored, and it didn't help any. I don't actually have a reproducer >> setup of my own, unfortunately, so I'm kind of caught in the middle >> here... >> >> Uwe, can you perhaps further enlighten us as to what num_grat_arp >> settings were tried that didn't help? I'm still of the mind that if >> num_grat_arp *didn't* help, we probably need to do something keyed off >> num_grat_arp. >The bonding slaves are connected to high available switches, each of the >slaves is connected to a different switch. If the bond is starting, only >the selected slave sends one arp-request. If a matching arp_response was >received, this slave and the bond is going into state up, sending the >gratitious arps... >But if you got no arp reply the next slave was selected. >With most of the newer switches, not overloaded, or with other software >bugs, or with a single switch configuration, you would get a arp response >on the first arp request. >But in case of high availability configuration with non perfect switches >like HP ProCurve 54xx, also with some Cisco models, you may not get a >response on the first arp request. > >I have seen network snoops, there the switches are not responding to the >first arp request on slave 1, the second arp request was sent on slave 2 >but the response was received on slave one, and all following arp >requests are anwsered on the wrong slave for a longer time. Could you elaborate on the exact "high availability configuration" here, including the model(s) of switch(es) involved? Is this some kind of race between the switch or switches updating the forwarding tables and the bond flip flopping between the slaves? E.g., source MAC from ARP sent on slave 1 is used to populate the forwarding table, but (for whatever reason) there is no reply. ARP on slave 2 is sent (using the same source MAC, unless you set fail_over_mac), but forwarding tables still send that MAC to slave 1, so reply is sent there. >The proposed change sents up to 3 arp requests on a down bond using the >same slave, delayed by arp_interval. >Using problematic switches i have seen the the arp response on the right >slave at latest on the second arp request. So the bond is going into state >up. > >How does it works: >The bonds in up state are handled on the beginning of bond_ab_arp_probe >procedure, the other part of this procedure is handling the slave change. >The proposed change is bypassing the slave change for 2 additional calls >of bond_ab_arp_probe. >Now the retries are not only for an up bond available, they are also >implemented for a down bond. Does this delay failover or bringup on switches that are not "problematic"? I.e., if arp_interval is, say, 1000 (1 second), will this impact failover / recovery times? -J >The num_grat_arp has no chance to solve the problem. The num_grat_arp is >only used, if a different slave is going active. >But in our case, th
Re: [PATCH] net/bonding: send arp in interval if no active slave
Uwe Koziolek uwe.kozio...@redknee.com wrote: On2015-08-17 07:12 PM,Jarod Wilson wrote: On 2015-08-17 12:55 PM, Veaceslav Falico wrote: On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote: From: Uwe Koziolek uwe.kozio...@redknee.com With some very finicky switch hardware, active backup bonding can get into a situation where we play ping-pong between interfaces, trying to get one to come up as the active slave. There seems to be an issue with the switch's arp replies either taking too long, or simply getting lost, so we wind up unable to get any interface up and active. Sometimes, the issue sorts itself out after a while, sometimes it doesn't. Testing with num_grat_arp has proven fruitless, but sending an additional arp on curr_arp_slave if we're still in the arp_interval timeslice in bond_ab_arp_probe(), has shown to produce 100% reliability in testing with this hardware combination. Sorry, I don't understand the logic of why it works, and what exactly are we fixiing here. It also breaks completely the logic for link state management in case of no current active slave for 2*arp_interval. Could you please elaborate what exactly is fixed here, and how it works? :) I can either duplicate some information from the bug, or Uwe can, to illustrate the exact nature of the problem. p.s. num_grat_arp maybe could help? That was my thought as well, but as I understand it, that route was explored, and it didn't help any. I don't actually have a reproducer setup of my own, unfortunately, so I'm kind of caught in the middle here... Uwe, can you perhaps further enlighten us as to what num_grat_arp settings were tried that didn't help? I'm still of the mind that if num_grat_arp *didn't* help, we probably need to do something keyed off num_grat_arp. The bonding slaves are connected to high available switches, each of the slaves is connected to a different switch. If the bond is starting, only the selected slave sends one arp-request. If a matching arp_response was received, this slave and the bond is going into state up, sending the gratitious arps... But if you got no arp reply the next slave was selected. With most of the newer switches, not overloaded, or with other software bugs, or with a single switch configuration, you would get a arp response on the first arp request. But in case of high availability configuration with non perfect switches like HP ProCurve 54xx, also with some Cisco models, you may not get a response on the first arp request. I have seen network snoops, there the switches are not responding to the first arp request on slave 1, the second arp request was sent on slave 2 but the response was received on slave one, and all following arp requests are anwsered on the wrong slave for a longer time. Could you elaborate on the exact high availability configuration here, including the model(s) of switch(es) involved? Is this some kind of race between the switch or switches updating the forwarding tables and the bond flip flopping between the slaves? E.g., source MAC from ARP sent on slave 1 is used to populate the forwarding table, but (for whatever reason) there is no reply. ARP on slave 2 is sent (using the same source MAC, unless you set fail_over_mac), but forwarding tables still send that MAC to slave 1, so reply is sent there. The proposed change sents up to 3 arp requests on a down bond using the same slave, delayed by arp_interval. Using problematic switches i have seen the the arp response on the right slave at latest on the second arp request. So the bond is going into state up. How does it works: The bonds in up state are handled on the beginning of bond_ab_arp_probe procedure, the other part of this procedure is handling the slave change. The proposed change is bypassing the slave change for 2 additional calls of bond_ab_arp_probe. Now the retries are not only for an up bond available, they are also implemented for a down bond. Does this delay failover or bringup on switches that are not problematic? I.e., if arp_interval is, say, 1000 (1 second), will this impact failover / recovery times? -J The num_grat_arp has no chance to solve the problem. The num_grat_arp is only used, if a different slave is going active. But in our case, the bonding slaves are not going into the state active for a longer time. [jarod: manufacturing of changelog] CC: Jay Vosburgh j.vosbu...@gmail.com CC: Veaceslav Falico vfal...@gmail.com CC: Andy Gospodarek go...@cumulusnetworks.com CC: net...@vger.kernel.org Signed-off-by: Uwe Koziolek uwe.kozio...@redknee.com Signed-off-by: Jarod Wilson ja...@redhat.com --- drivers/net/bonding/bond_main.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 0c627b4..60b9483 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2794,6 +2794,11 @@ static bool
Re: [PATCH v3] bonding: "primary_reselect" with "failure" is not working properly
Mazhar Rana wrote: >From: Mazhar Rana > >When "primary_reselect" is set to "failure", primary interface should >not become active until current active slave is down. But if we set first >member of bond device as a "primary" interface and "primary_reselect" >is set to "failure" then whenever primary interface's link get back(up) >it become active slave even if current active slave is still up. > >With this patch, "bond_find_best_slave" will not traverse members if >primary interface is not candidate for failover/reselection and current >active slave is still up. > >Signed-off-by: Mazhar Rana >Signed-off-by: Jay Vosburgh I don't believe I posted a Signed-off-by, so you really shouldn't include one for me without it being explicitly stated. In any event, I'm good with the patch, so: Signed-off-by: Jay Vosburgh -J >--- > >v1->v2: >return "curr" instead of "bond->curr_active_slave". > >v2->v3: >To make code more clear, replaced function bond_should_change_active >with bond_choose_primary_or_current which will return slave device. > > drivers/net/bonding/bond_main.c | 51 +++-- > 1 file changed, 34 insertions(+), 17 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 19eb990..317a494 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -689,40 +689,57 @@ out: > > } > >-static bool bond_should_change_active(struct bonding *bond) >+static struct slave *bond_choose_primary_or_current(struct bonding *bond) > { > struct slave *prim = rtnl_dereference(bond->primary_slave); > struct slave *curr = rtnl_dereference(bond->curr_active_slave); > >- if (!prim || !curr || curr->link != BOND_LINK_UP) >- return true; >+ if (!prim || prim->link != BOND_LINK_UP) { >+ if (!curr || curr->link != BOND_LINK_UP) >+ return NULL; >+ return curr; >+ } >+ > if (bond->force_primary) { > bond->force_primary = false; >- return true; >+ return prim; >+ } >+ >+ if (!curr || curr->link != BOND_LINK_UP) >+ return prim; >+ >+ /* At this point, prim and curr are both up */ >+ switch (bond->params.primary_reselect) { >+ case BOND_PRI_RESELECT_ALWAYS: >+ return prim; >+ case BOND_PRI_RESELECT_BETTER: >+ if (prim->speed < curr->speed) >+ return curr; >+ if (prim->speed == curr->speed && prim->duplex <= curr->duplex) >+ return curr; >+ return prim; >+ case BOND_PRI_RESELECT_FAILURE: >+ return curr; >+ default: >+ netdev_err(bond->dev, "impossible primary_reselect %d\n", >+ bond->params.primary_reselect); >+ return curr; > } >- if (bond->params.primary_reselect == BOND_PRI_RESELECT_BETTER && >- (prim->speed < curr->speed || >- (prim->speed == curr->speed && prim->duplex <= curr->duplex))) >- return false; >- if (bond->params.primary_reselect == BOND_PRI_RESELECT_FAILURE) >- return false; >- return true; > } > > /** >- * find_best_interface - select the best available slave to be the active one >+ * bond_find_best_slave - select the best available slave to be the active one > * @bond: our bonding struct > */ > static struct slave *bond_find_best_slave(struct bonding *bond) > { >- struct slave *slave, *bestslave = NULL, *primary; >+ struct slave *slave, *bestslave = NULL; > struct list_head *iter; > int mintime = bond->params.updelay; > >- primary = rtnl_dereference(bond->primary_slave); >- if (primary && primary->link == BOND_LINK_UP && >- bond_should_change_active(bond)) >- return primary; >+ slave = bond_choose_primary_or_current(bond); >+ if (slave) >+ return slave; > > bond_for_each_slave(bond, slave, iter) { > if (slave->link == BOND_LINK_UP) >-- >1.9.1 > >-- >To unsubscribe from this list: send the line "unsubscribe netdev" in >the body of a message to majord...@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3] bonding: primary_reselect with failure is not working properly
Mazhar Rana ranamazh...@gmail.com wrote: From: Mazhar Rana mazhar.r...@cyberoam.com When primary_reselect is set to failure, primary interface should not become active until current active slave is down. But if we set first member of bond device as a primary interface and primary_reselect is set to failure then whenever primary interface's link get back(up) it become active slave even if current active slave is still up. With this patch, bond_find_best_slave will not traverse members if primary interface is not candidate for failover/reselection and current active slave is still up. Signed-off-by: Mazhar Rana mazhar.r...@cyberoam.com Signed-off-by: Jay Vosburgh j.vosbu...@gmail.com I don't believe I posted a Signed-off-by, so you really shouldn't include one for me without it being explicitly stated. In any event, I'm good with the patch, so: Signed-off-by: Jay Vosburgh jay.vosbu...@canonical.com -J --- v1-v2: return curr instead of bond-curr_active_slave. v2-v3: To make code more clear, replaced function bond_should_change_active with bond_choose_primary_or_current which will return slave device. drivers/net/bonding/bond_main.c | 51 +++-- 1 file changed, 34 insertions(+), 17 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 19eb990..317a494 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -689,40 +689,57 @@ out: } -static bool bond_should_change_active(struct bonding *bond) +static struct slave *bond_choose_primary_or_current(struct bonding *bond) { struct slave *prim = rtnl_dereference(bond-primary_slave); struct slave *curr = rtnl_dereference(bond-curr_active_slave); - if (!prim || !curr || curr-link != BOND_LINK_UP) - return true; + if (!prim || prim-link != BOND_LINK_UP) { + if (!curr || curr-link != BOND_LINK_UP) + return NULL; + return curr; + } + if (bond-force_primary) { bond-force_primary = false; - return true; + return prim; + } + + if (!curr || curr-link != BOND_LINK_UP) + return prim; + + /* At this point, prim and curr are both up */ + switch (bond-params.primary_reselect) { + case BOND_PRI_RESELECT_ALWAYS: + return prim; + case BOND_PRI_RESELECT_BETTER: + if (prim-speed curr-speed) + return curr; + if (prim-speed == curr-speed prim-duplex = curr-duplex) + return curr; + return prim; + case BOND_PRI_RESELECT_FAILURE: + return curr; + default: + netdev_err(bond-dev, impossible primary_reselect %d\n, + bond-params.primary_reselect); + return curr; } - if (bond-params.primary_reselect == BOND_PRI_RESELECT_BETTER - (prim-speed curr-speed || - (prim-speed == curr-speed prim-duplex = curr-duplex))) - return false; - if (bond-params.primary_reselect == BOND_PRI_RESELECT_FAILURE) - return false; - return true; } /** - * find_best_interface - select the best available slave to be the active one + * bond_find_best_slave - select the best available slave to be the active one * @bond: our bonding struct */ static struct slave *bond_find_best_slave(struct bonding *bond) { - struct slave *slave, *bestslave = NULL, *primary; + struct slave *slave, *bestslave = NULL; struct list_head *iter; int mintime = bond-params.updelay; - primary = rtnl_dereference(bond-primary_slave); - if (primary primary-link == BOND_LINK_UP - bond_should_change_active(bond)) - return primary; + slave = bond_choose_primary_or_current(bond); + if (slave) + return slave; bond_for_each_slave(bond, slave, iter) { if (slave-link == BOND_LINK_UP) -- 1.9.1 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] bonding: "primary_reselect" with "failure" is not working properly
GMAIL wrote: >Hi Jay, > >On Friday 03 July 2015 02:12 AM, Jay Vosburgh wrote: > >> [ added netdev to cc ] >> >> Mazhar Rana wrote: >> >>> When "primary_reselect" is set to "failure", primary interface should >>> not become active until current active slave is up. But if we set first >> I think you mean "until current active slave is down" here, not >> "up." > >Yes, It should be "up", grammatical mistake "down," right? [...] >Below is the updated version of your patch. Any Comments or suggestions ? [...] > static struct slave *bond_find_best_slave(struct bonding *bond) > { >- struct slave *slave, *bestslave = NULL, *primary; >+ struct slave *slave = NULL, *bestslave = NULL, *primary; > struct list_head *iter; > int mintime = bond->params.updelay; > primary = rtnl_dereference(bond->primary_slave); >- if (primary && primary->link == BOND_LINK_UP && >- bond_should_change_active(bond)) >- return primary; >+ if (primary && primary->link == BOND_LINK_UP) >+ slave = bond_choose_primary_or_current(bond); >+ >+ if (slave) >+ return slave; > bond_for_each_slave(bond, slave, iter) { > if (slave->link == BOND_LINK_UP) I think this will misbehave in the case that curr is up and available, but primary is NULL (this can happen when the primary option is cleared). In this case, the above code will not call bond_choose_primary_or_current, and will then run the loop to find a new curr, which may select a different slave unnecessarily. How does the following look? I prefer to make the call to choose_primary_or_current unconditional, and have it decide if the search loop should be run. In this version, _choose_ tests curr if prim is not suitable. Compile tested only. Thoughts? -J diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 19eb990..1e35e25 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -689,40 +689,57 @@ out: } -static bool bond_should_change_active(struct bonding *bond) +static struct slave *bond_choose_primary_or_current(struct bonding *bond) { struct slave *prim = rtnl_dereference(bond->primary_slave); struct slave *curr = rtnl_dereference(bond->curr_active_slave); - if (!prim || !curr || curr->link != BOND_LINK_UP) - return true; + if (!prim || !prim->link == BOND_LINK_UP) { + if (!curr || !curr->link == BOND_LINK_UP) + return NULL; + return curr; + } + if (bond->force_primary) { bond->force_primary = false; - return true; + return prim; + } + + if (!curr || curr->link != BOND_LINK_UP) + return prim; + + /* At this point, prim and curr are both up */ + switch (bond->params.primary_reselect) { + case BOND_PRI_RESELECT_ALWAYS: + return prim; + case BOND_PRI_RESELECT_BETTER: + if (prim->speed < curr->speed) + return curr; + if (prim->speed == curr->speed && prim->duplex <= curr->duplex) + return curr; + return prim; + case BOND_PRI_RESELECT_FAILURE: + return curr; + default: + netdev_err(bond->dev, "impossible primary_reselect %d\n", + bond->params.primary_reselect); + return curr; } - if (bond->params.primary_reselect == BOND_PRI_RESELECT_BETTER && - (prim->speed < curr->speed || -(prim->speed == curr->speed && prim->duplex <= curr->duplex))) - return false; - if (bond->params.primary_reselect == BOND_PRI_RESELECT_FAILURE) - return false; - return true; } /** - * find_best_interface - select the best available slave to be the active one + * bond_find_best_slave - select the best available slave to be the active one * @bond: our bonding struct */ static struct slave *bond_find_best_slave(struct bonding *bond) { - struct slave *slave, *bestslave = NULL, *primary; + struct slave *slave, *bestslave = NULL; struct list_head *iter; int mintime = bond->params.updelay; - primary = rtnl_dereference(bond->primary_slave); - if (primary && primary->link == BOND_LINK_UP && - bond_should_change_active(bond)) - return primary; + slave = bond_choose_primary_or_current(bond); +
Re: [PATCH v2] bonding: primary_reselect with failure is not working properly
GMAIL ranamazh...@gmail.com wrote: Hi Jay, On Friday 03 July 2015 02:12 AM, Jay Vosburgh wrote: [ added netdev to cc ] Mazhar Rana ranamazh...@gmail.com wrote: When primary_reselect is set to failure, primary interface should not become active until current active slave is up. But if we set first I think you mean until current active slave is down here, not up. Yes, It should be up, grammatical mistake down, right? [...] Below is the updated version of your patch. Any Comments or suggestions ? [...] static struct slave *bond_find_best_slave(struct bonding *bond) { - struct slave *slave, *bestslave = NULL, *primary; + struct slave *slave = NULL, *bestslave = NULL, *primary; struct list_head *iter; int mintime = bond-params.updelay; primary = rtnl_dereference(bond-primary_slave); - if (primary primary-link == BOND_LINK_UP - bond_should_change_active(bond)) - return primary; + if (primary primary-link == BOND_LINK_UP) + slave = bond_choose_primary_or_current(bond); + + if (slave) + return slave; bond_for_each_slave(bond, slave, iter) { if (slave-link == BOND_LINK_UP) I think this will misbehave in the case that curr is up and available, but primary is NULL (this can happen when the primary option is cleared). In this case, the above code will not call bond_choose_primary_or_current, and will then run the loop to find a new curr, which may select a different slave unnecessarily. How does the following look? I prefer to make the call to choose_primary_or_current unconditional, and have it decide if the search loop should be run. In this version, _choose_ tests curr if prim is not suitable. Compile tested only. Thoughts? -J diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 19eb990..1e35e25 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -689,40 +689,57 @@ out: } -static bool bond_should_change_active(struct bonding *bond) +static struct slave *bond_choose_primary_or_current(struct bonding *bond) { struct slave *prim = rtnl_dereference(bond-primary_slave); struct slave *curr = rtnl_dereference(bond-curr_active_slave); - if (!prim || !curr || curr-link != BOND_LINK_UP) - return true; + if (!prim || !prim-link == BOND_LINK_UP) { + if (!curr || !curr-link == BOND_LINK_UP) + return NULL; + return curr; + } + if (bond-force_primary) { bond-force_primary = false; - return true; + return prim; + } + + if (!curr || curr-link != BOND_LINK_UP) + return prim; + + /* At this point, prim and curr are both up */ + switch (bond-params.primary_reselect) { + case BOND_PRI_RESELECT_ALWAYS: + return prim; + case BOND_PRI_RESELECT_BETTER: + if (prim-speed curr-speed) + return curr; + if (prim-speed == curr-speed prim-duplex = curr-duplex) + return curr; + return prim; + case BOND_PRI_RESELECT_FAILURE: + return curr; + default: + netdev_err(bond-dev, impossible primary_reselect %d\n, + bond-params.primary_reselect); + return curr; } - if (bond-params.primary_reselect == BOND_PRI_RESELECT_BETTER - (prim-speed curr-speed || -(prim-speed == curr-speed prim-duplex = curr-duplex))) - return false; - if (bond-params.primary_reselect == BOND_PRI_RESELECT_FAILURE) - return false; - return true; } /** - * find_best_interface - select the best available slave to be the active one + * bond_find_best_slave - select the best available slave to be the active one * @bond: our bonding struct */ static struct slave *bond_find_best_slave(struct bonding *bond) { - struct slave *slave, *bestslave = NULL, *primary; + struct slave *slave, *bestslave = NULL; struct list_head *iter; int mintime = bond-params.updelay; - primary = rtnl_dereference(bond-primary_slave); - if (primary primary-link == BOND_LINK_UP - bond_should_change_active(bond)) - return primary; + slave = bond_choose_primary_or_current(bond); + if (slave) + return slave; bond_for_each_slave(bond, slave, iter) { if (slave-link == BOND_LINK_UP) --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] bonding: "primary_reselect" with "failure" is not working properly
bond_find_best_slave(struct bonding *bond) { - struct slave *slave, *bestslave = NULL, *primary; + struct slave *slave, *bestslave = NULL; struct list_head *iter; int mintime = bond->params.updelay; - primary = rtnl_dereference(bond->primary_slave); - if (primary && primary->link == BOND_LINK_UP && - bond_should_change_active(bond)) - return primary; + slave = bond_choose_primary_or_current(bond); + if (slave) + return slave; bond_for_each_slave(bond, slave, iter) { if (slave->link == BOND_LINK_UP) --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] bonding: primary_reselect with failure is not working properly
= bond_choose_primary_or_current(bond); + if (slave) + return slave; bond_for_each_slave(bond, slave, iter) { if (slave-link == BOND_LINK_UP) --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney wrote: >On Sat, Oct 25, 2014 at 11:18:27AM -0700, Paul E. McKenney wrote: >> On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote: >> > Paul E. McKenney wrote: >> > >> > >On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote: >> > >> Looking at the dmesg, the early boot messages seem to be >> > >> confused as to how many CPUs there are, e.g., >> > >> >> > >> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, >> > >> Nodes=1 >> > >> [0.00] Hierarchical RCU implementation. >> > >> [0.00] RCU debugfs-based tracing is enabled. >> > >> [0.00] RCU dyntick-idle grace-period acceleration is enabled. >> > >> [0.00] RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4. >> > >> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, >> > >> nr_cpu_ids=4 >> > >> [0.00] NR_IRQS:16640 nr_irqs:456 0 >> > >> [0.00] Offload RCU callbacks from all CPUs >> > >> [0.00] Offload RCU callbacks from CPUs: 0-3. >> > >> >> > >> but later shows 2: >> > >> >> > >> [0.233703] x86: Booting SMP configuration: >> > >> [0.236003] node #0, CPUs: #1 >> > >> [0.255528] x86: Booted up 1 node, 2 CPUs >> > >> >> > >> In any event, the E8400 is a 2 core CPU with no hyperthreading. >> > > >> > >Well, this might explain some of the difficulties. If RCU decides to wait >> > >on CPUs that don't exist, we will of course get a hang. And rcu_barrier() >> > >was definitely expecting four CPUs. >> > > >> > >So what happens if you boot with maxcpus=2? (Or build with >> > >CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang. If so, >> > >I might have some ideas for a real fix. >> > >> >Booting with maxcpus=2 makes no difference (the dmesg output is >> > the same). >> > >> >Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and >> > dmesg has different CPU information at boot: >> > >> > [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2 >> > [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs >> > [...] >> > [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 >> > nr_node_ids:1 >> > [...] >> > [0.00] Hierarchical RCU implementation. >> > [0.00] RCU debugfs-based tracing is enabled. >> > [0.00] RCU dyntick-idle grace-period acceleration is enabled. >> > [0.00] NR_IRQS:4352 nr_irqs:440 0 >> > [0.00] Offload RCU callbacks from all CPUs >> > [0.00] Offload RCU callbacks from CPUs: 0-1. >> >> Thank you -- this confirms my suspicions on the fix, though I must admit >> to being surprised that maxcpus made no difference. > >And here is an alleged fix, lightly tested at this end. Does this patch >help? This patch appears to make the problem go away; I've run about 10 iterations. I applied this patch to the same -net tree I was using previously (-net as of Oct 22), with all other test patches removed. FWIW, dmesg is unchanged, and still shows messages like: [0.00] Offload RCU callbacks from CPUs: 0-3. Tested-by: Jay Vosburgh -J > Thanx, Paul > > > >rcu: Make rcu_barrier() understand about missing rcuo kthreads > >Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs) >avoids creating rcuo kthreads for CPUs that never come online. This >fixes a bug in many instances of firmware: Instead of lying about their >age, these systems instead lie about the number of CPUs that they have. >Before commit 35ce7f29a44a, this could result in huge numbers of useless >rcuo kthreads being created. > >It appears that experience indicates that I should have told the >people suffering from this problem to fix their broken firmware, but >I instead produced what turned out to be a partial fix. The missing >piece supplied by this commit makes sure that rcu_barrier() knows not to >post callbacks for no-CBs CPUs that have not yet come online, because >otherwise rcu_barrier() will hang on systems having firmware that lies >about the number of CPUs. > >It is tempting to simply have rcu_barrier
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Sat, Oct 25, 2014 at 11:18:27AM -0700, Paul E. McKenney wrote: On Sat, Oct 25, 2014 at 09:38:16AM -0700, Jay Vosburgh wrote: Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote: Looking at the dmesg, the early boot messages seem to be confused as to how many CPUs there are, e.g., [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1 [0.00] Hierarchical RCU implementation. [0.00] RCU debugfs-based tracing is enabled. [0.00] RCU dyntick-idle grace-period acceleration is enabled. [0.00] RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4. [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4 [0.00] NR_IRQS:16640 nr_irqs:456 0 [0.00] Offload RCU callbacks from all CPUs [0.00] Offload RCU callbacks from CPUs: 0-3. but later shows 2: [0.233703] x86: Booting SMP configuration: [0.236003] node #0, CPUs: #1 [0.255528] x86: Booted up 1 node, 2 CPUs In any event, the E8400 is a 2 core CPU with no hyperthreading. Well, this might explain some of the difficulties. If RCU decides to wait on CPUs that don't exist, we will of course get a hang. And rcu_barrier() was definitely expecting four CPUs. So what happens if you boot with maxcpus=2? (Or build with CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang. If so, I might have some ideas for a real fix. Booting with maxcpus=2 makes no difference (the dmesg output is the same). Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and dmesg has different CPU information at boot: [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2 [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs [...] [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 nr_node_ids:1 [...] [0.00] Hierarchical RCU implementation. [0.00] RCU debugfs-based tracing is enabled. [0.00] RCU dyntick-idle grace-period acceleration is enabled. [0.00] NR_IRQS:4352 nr_irqs:440 0 [0.00] Offload RCU callbacks from all CPUs [0.00] Offload RCU callbacks from CPUs: 0-1. Thank you -- this confirms my suspicions on the fix, though I must admit to being surprised that maxcpus made no difference. And here is an alleged fix, lightly tested at this end. Does this patch help? This patch appears to make the problem go away; I've run about 10 iterations. I applied this patch to the same -net tree I was using previously (-net as of Oct 22), with all other test patches removed. FWIW, dmesg is unchanged, and still shows messages like: [0.00] Offload RCU callbacks from CPUs: 0-3. Tested-by: Jay Vosburgh jay.vosbu...@canonical.com -J Thanx, Paul rcu: Make rcu_barrier() understand about missing rcuo kthreads Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs) avoids creating rcuo kthreads for CPUs that never come online. This fixes a bug in many instances of firmware: Instead of lying about their age, these systems instead lie about the number of CPUs that they have. Before commit 35ce7f29a44a, this could result in huge numbers of useless rcuo kthreads being created. It appears that experience indicates that I should have told the people suffering from this problem to fix their broken firmware, but I instead produced what turned out to be a partial fix. The missing piece supplied by this commit makes sure that rcu_barrier() knows not to post callbacks for no-CBs CPUs that have not yet come online, because otherwise rcu_barrier() will hang on systems having firmware that lies about the number of CPUs. It is tempting to simply have rcu_barrier() refuse to post a callback on any no-CBs CPU that does not have an rcuo kthread. This unfortunately does not work because rcu_barrier() is required to wait for all pending callbacks. It is therefore required to wait even for those callbacks that cannot possibly be invoked. Even if doing so hangs the system. Given that posting a callback to a no-CBs CPU that does not yet have an rcuo kthread can hang rcu_barrier(), It is tempting to report an error in this case. Unfortunately, this will result in false positives at boot time, when it is perfectly legal to post callbacks to the boot CPU before the scheduler has started, in other words, before it is legal to invoke rcu_barrier(). So this commit instead has rcu_barrier() avoid posting callbacks to CPUs having neither rcuo kthread nor pending callbacks, and has it complain bitterly if it finds CPUs having no rcuo kthread but some
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney wrote: >On Fri, Oct 24, 2014 at 05:20:48PM -0700, Jay Vosburgh wrote: >> Paul E. McKenney wrote: >> >> >On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote: >> [...] >> >> Hmmm... It sure looks like we have some callbacks stuck here. I clearly >> >> need to take a hard look at the sleep/wakeup code. >> >> >> >> Thank you for running this!!! >> > >> >Could you please try the following patch? If no joy, could you please >> >add rcu:rcu_nocb_wake to the list of ftrace events? >> >> I tried the patch, it did not change the behavior. >> >> I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints >> and ran it again (with this patch and the first patch from earlier >> today); the trace output is a bit on the large side so I put it and the >> dmesg log at: >> >> http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt >> >> http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt > >Thank you again! > >Very strange part of the trace. The only sign of CPU 2 and 3 are: > >ovs-vswitchd-902 [000] 109.896840: rcu_barrier: rcu_sched Begin > cpu -1 remaining 0 # 0 >ovs-vswitchd-902 [000] 109.896840: rcu_barrier: rcu_sched Check > cpu -1 remaining 0 # 0 >ovs-vswitchd-902 [000] 109.896841: rcu_barrier: rcu_sched Inc1 > cpu -1 remaining 0 # 1 >ovs-vswitchd-902 [000] 109.896841: rcu_barrier: rcu_sched > OnlineNoCB cpu 0 remaining 1 # 1 >ovs-vswitchd-902 [000] d... 109.896841: rcu_nocb_wake: rcu_sched 0 > WakeNot >ovs-vswitchd-902 [000] 109.896841: rcu_barrier: rcu_sched > OnlineNoCB cpu 1 remaining 2 # 1 >ovs-vswitchd-902 [000] d... 109.896841: rcu_nocb_wake: rcu_sched 1 > WakeNot >ovs-vswitchd-902 [000] 109.896842: rcu_barrier: rcu_sched > OnlineNoCB cpu 2 remaining 3 # 1 >ovs-vswitchd-902 [000] d... 109.896842: rcu_nocb_wake: rcu_sched 2 > WakeNotPoll >ovs-vswitchd-902 [000] 109.896842: rcu_barrier: rcu_sched > OnlineNoCB cpu 3 remaining 4 # 1 >ovs-vswitchd-902 [000] d... 109.896842: rcu_nocb_wake: rcu_sched 3 > WakeNotPoll >ovs-vswitchd-902 [000] 109.896843: rcu_barrier: rcu_sched Inc2 > cpu -1 remaining 4 # 2 > >The pair of WakeNotPoll trace entries says that at that point, RCU believed >that the CPU 2's and CPU 3's rcuo kthreads did not exist. :-/ On the test system I'm using, CPUs 2 and 3 really do not exist; it is a 2 CPU system (Intel Core 2 Duo E8400). I mentioned this in an earlier message, but perhaps you missed it in the flurry. Looking at the dmesg, the early boot messages seem to be confused as to how many CPUs there are, e.g., [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1 [0.00] Hierarchical RCU implementation. [0.00] RCU debugfs-based tracing is enabled. [0.00] RCU dyntick-idle grace-period acceleration is enabled. [0.00] RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4. [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4 [0.00] NR_IRQS:16640 nr_irqs:456 0 [0.00] Offload RCU callbacks from all CPUs [0.00] Offload RCU callbacks from CPUs: 0-3. but later shows 2: [0.233703] x86: Booting SMP configuration: [0.236003] node #0, CPUs: #1 [0.255528] x86: Booted up 1 node, 2 CPUs In any event, the E8400 is a 2 core CPU with no hyperthreading. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney wrote: >On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote: >> Looking at the dmesg, the early boot messages seem to be >> confused as to how many CPUs there are, e.g., >> >> [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1 >> [0.00] Hierarchical RCU implementation. >> [0.00] RCU debugfs-based tracing is enabled. >> [0.00] RCU dyntick-idle grace-period acceleration is enabled. >> [0.00] RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4. >> [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4 >> [0.00] NR_IRQS:16640 nr_irqs:456 0 >> [0.00] Offload RCU callbacks from all CPUs >> [0.00] Offload RCU callbacks from CPUs: 0-3. >> >> but later shows 2: >> >> [0.233703] x86: Booting SMP configuration: >> [0.236003] node #0, CPUs: #1 >> [0.255528] x86: Booted up 1 node, 2 CPUs >> >> In any event, the E8400 is a 2 core CPU with no hyperthreading. > >Well, this might explain some of the difficulties. If RCU decides to wait >on CPUs that don't exist, we will of course get a hang. And rcu_barrier() >was definitely expecting four CPUs. > >So what happens if you boot with maxcpus=2? (Or build with >CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang. If so, >I might have some ideas for a real fix. Booting with maxcpus=2 makes no difference (the dmesg output is the same). Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and dmesg has different CPU information at boot: [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2 [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs [...] [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 nr_node_ids:1 [...] [0.00] Hierarchical RCU implementation. [0.00] RCU debugfs-based tracing is enabled. [0.00] RCU dyntick-idle grace-period acceleration is enabled. [0.00] NR_IRQS:4352 nr_irqs:440 0 [0.00] Offload RCU callbacks from all CPUs [0.00] Offload RCU callbacks from CPUs: 0-1. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Fri, Oct 24, 2014 at 09:33:33PM -0700, Jay Vosburgh wrote: Looking at the dmesg, the early boot messages seem to be confused as to how many CPUs there are, e.g., [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1 [0.00] Hierarchical RCU implementation. [0.00] RCU debugfs-based tracing is enabled. [0.00] RCU dyntick-idle grace-period acceleration is enabled. [0.00] RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4. [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4 [0.00] NR_IRQS:16640 nr_irqs:456 0 [0.00] Offload RCU callbacks from all CPUs [0.00] Offload RCU callbacks from CPUs: 0-3. but later shows 2: [0.233703] x86: Booting SMP configuration: [0.236003] node #0, CPUs: #1 [0.255528] x86: Booted up 1 node, 2 CPUs In any event, the E8400 is a 2 core CPU with no hyperthreading. Well, this might explain some of the difficulties. If RCU decides to wait on CPUs that don't exist, we will of course get a hang. And rcu_barrier() was definitely expecting four CPUs. So what happens if you boot with maxcpus=2? (Or build with CONFIG_NR_CPUS=2.) I suspect that this might avoid the hang. If so, I might have some ideas for a real fix. Booting with maxcpus=2 makes no difference (the dmesg output is the same). Rebuilding with CONFIG_NR_CPUS=2 makes the problem go away, and dmesg has different CPU information at boot: [0.00] smpboot: 4 Processors exceeds NR_CPUS limit of 2 [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs [...] [0.00] setup_percpu: NR_CPUS:2 nr_cpumask_bits:2 nr_cpu_ids:2 nr_node_ids:1 [...] [0.00] Hierarchical RCU implementation. [0.00] RCU debugfs-based tracing is enabled. [0.00] RCU dyntick-idle grace-period acceleration is enabled. [0.00] NR_IRQS:4352 nr_irqs:440 0 [0.00] Offload RCU callbacks from all CPUs [0.00] Offload RCU callbacks from CPUs: 0-1. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Fri, Oct 24, 2014 at 05:20:48PM -0700, Jay Vosburgh wrote: Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote: [...] Hmmm... It sure looks like we have some callbacks stuck here. I clearly need to take a hard look at the sleep/wakeup code. Thank you for running this!!! Could you please try the following patch? If no joy, could you please add rcu:rcu_nocb_wake to the list of ftrace events? I tried the patch, it did not change the behavior. I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints and ran it again (with this patch and the first patch from earlier today); the trace output is a bit on the large side so I put it and the dmesg log at: http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt Thank you again! Very strange part of the trace. The only sign of CPU 2 and 3 are: ovs-vswitchd-902 [000] 109.896840: rcu_barrier: rcu_sched Begin cpu -1 remaining 0 # 0 ovs-vswitchd-902 [000] 109.896840: rcu_barrier: rcu_sched Check cpu -1 remaining 0 # 0 ovs-vswitchd-902 [000] 109.896841: rcu_barrier: rcu_sched Inc1 cpu -1 remaining 0 # 1 ovs-vswitchd-902 [000] 109.896841: rcu_barrier: rcu_sched OnlineNoCB cpu 0 remaining 1 # 1 ovs-vswitchd-902 [000] d... 109.896841: rcu_nocb_wake: rcu_sched 0 WakeNot ovs-vswitchd-902 [000] 109.896841: rcu_barrier: rcu_sched OnlineNoCB cpu 1 remaining 2 # 1 ovs-vswitchd-902 [000] d... 109.896841: rcu_nocb_wake: rcu_sched 1 WakeNot ovs-vswitchd-902 [000] 109.896842: rcu_barrier: rcu_sched OnlineNoCB cpu 2 remaining 3 # 1 ovs-vswitchd-902 [000] d... 109.896842: rcu_nocb_wake: rcu_sched 2 WakeNotPoll ovs-vswitchd-902 [000] 109.896842: rcu_barrier: rcu_sched OnlineNoCB cpu 3 remaining 4 # 1 ovs-vswitchd-902 [000] d... 109.896842: rcu_nocb_wake: rcu_sched 3 WakeNotPoll ovs-vswitchd-902 [000] 109.896843: rcu_barrier: rcu_sched Inc2 cpu -1 remaining 4 # 2 The pair of WakeNotPoll trace entries says that at that point, RCU believed that the CPU 2's and CPU 3's rcuo kthreads did not exist. :-/ On the test system I'm using, CPUs 2 and 3 really do not exist; it is a 2 CPU system (Intel Core 2 Duo E8400). I mentioned this in an earlier message, but perhaps you missed it in the flurry. Looking at the dmesg, the early boot messages seem to be confused as to how many CPUs there are, e.g., [0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1 [0.00] Hierarchical RCU implementation. [0.00] RCU debugfs-based tracing is enabled. [0.00] RCU dyntick-idle grace-period acceleration is enabled. [0.00] RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=4. [0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4 [0.00] NR_IRQS:16640 nr_irqs:456 0 [0.00] Offload RCU callbacks from all CPUs [0.00] Offload RCU callbacks from CPUs: 0-3. but later shows 2: [0.233703] x86: Booting SMP configuration: [0.236003] node #0, CPUs: #1 [0.255528] x86: Booted up 1 node, 2 CPUs In any event, the E8400 is a 2 core CPU with no hyperthreading. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney wrote: >On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote: [...] >> Hmmm... It sure looks like we have some callbacks stuck here. I clearly >> need to take a hard look at the sleep/wakeup code. >> >> Thank you for running this!!! > >Could you please try the following patch? If no joy, could you please >add rcu:rcu_nocb_wake to the list of ftrace events? I tried the patch, it did not change the behavior. I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints and ran it again (with this patch and the first patch from earlier today); the trace output is a bit on the large side so I put it and the dmesg log at: http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt -J > Thanx, Paul > > > >rcu: Kick rcuo kthreads after their CPU goes offline > >If a no-CBs CPU were to post an RCU callback with interrupts disabled >after it entered the idle loop for the last time, there might be no >deferred wakeup for the corresponding rcuo kthreads. This commit >therefore adds a set of calls to do_nocb_deferred_wakeup() after the >CPU has gone completely offline. > >Signed-off-by: Paul E. McKenney > >diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c >index 84b41b3c6ebd..f6880052b917 100644 >--- a/kernel/rcu/tree.c >+++ b/kernel/rcu/tree.c >@@ -3493,8 +3493,10 @@ static int rcu_cpu_notify(struct notifier_block *self, > case CPU_DEAD_FROZEN: > case CPU_UP_CANCELED: > case CPU_UP_CANCELED_FROZEN: >- for_each_rcu_flavor(rsp) >+ for_each_rcu_flavor(rsp) { > rcu_cleanup_dead_cpu(cpu, rsp); >+ do_nocb_deferred_wakeup(per_cpu_ptr(rsp->rda, cpu)); >+ } > break; > default: > break; > --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney wrote: >On Fri, Oct 24, 2014 at 03:02:04PM -0700, Jay Vosburgh wrote: >> Paul E. McKenney wrote: >> [...] >> I've got an ftrace capture from unmodified -net, it looks like >> this: >> >> ovs-vswitchd-902 [000] 471.778441: rcu_barrier: rcu_sched Begin >> cpu -1 remaining 0 # 0 >> ovs-vswitchd-902 [000] 471.778452: rcu_barrier: rcu_sched Check >> cpu -1 remaining 0 # 0 >> ovs-vswitchd-902 [000] 471.778452: rcu_barrier: rcu_sched Inc1 >> cpu -1 remaining 0 # 1 >> ovs-vswitchd-902 [000] 471.778453: rcu_barrier: rcu_sched >> OnlineNoCB cpu 0 remaining 1 # 1 >> ovs-vswitchd-902 [000] 471.778453: rcu_barrier: rcu_sched >> OnlineNoCB cpu 1 remaining 2 # 1 >> ovs-vswitchd-902 [000] 471.778453: rcu_barrier: rcu_sched >> OnlineNoCB cpu 2 remaining 3 # 1 >> ovs-vswitchd-902 [000] 471.778454: rcu_barrier: rcu_sched >> OnlineNoCB cpu 3 remaining 4 # 1 > >OK, so it looks like your system has four CPUs, and rcu_barrier() placed >callbacks on them all. No, the system has only two CPUs. It's an Intel Core 2 Duo E8400, and /proc/cpuinfo agrees that there are only 2. There is a potentially relevant-sounding message early in dmesg that says: [0.00] smpboot: Allowing 4 CPUs, 2 hotplug CPUs >> ovs-vswitchd-902 [000] 471.778454: rcu_barrier: rcu_sched Inc2 >> cpu -1 remaining 4 # 2 > >The above removes the extra count used to avoid races between posting new >callbacks and completion of previously posted callbacks. > >> rcuos/0-9 [000] ..s. 471.793150: rcu_barrier: rcu_sched CB >> cpu -1 remaining 3 # 2 >> rcuos/1-18[001] ..s. 471.793308: rcu_barrier: rcu_sched CB >> cpu -1 remaining 2 # 2 > >Two of the four callbacks fired, but the other two appear to be AWOL. >And rcu_barrier() won't return until they all fire. > >> I let it sit through several "hung task" cycles but that was all >> there was for rcu:rcu_barrier. >> >> I should have ftrace with the patch as soon as the kernel is >> done building, then I can try the below patch (I'll start it building >> now). > >Sounds very good, looking forward to hearing of the results. Going to bounce it for ftrace now, but the cpu count mismatch seemed important enough to mention separately. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
013fc8e3c0 ... [ 240.412463] 1: 88013fc8e3c0 l:88013fc0e3c0 n: (null) ... [ 240.419401] 2: 88013fd0e3c0 l:88013fd0e3c0 n:88013fd8e3c0 ... [ 240.426339] 3: 88013fd8e3c0 l:88013fd0e3c0 n: (null) ... [ 360.432020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds. [ 360.438881] Not tainted 3.17.0-testola+ #4 [ 360.443484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 360.451289] ovs-vswitchdD 88013fc94600 0 902901 0x0004 [ 360.451293] 8800ab20f7b8 0002 8800b3304b00 8800ab20ffd8 [ 360.451297] 00014600 00014600 8800b081 8800b3304b00 [ 360.451300] 8800b3304b00 81c59850 81c59858 7fff [ 360.451303] Call Trace: [ 360.451311] [] schedule+0x29/0x70 [ 360.451314] [] schedule_timeout+0x1dc/0x260 [ 360.451317] [] ? _cond_resched+0x29/0x40 [ 360.451320] [] ? wait_for_completion+0x28/0x160 [ 360.451325] [] ? queue_stop_cpus_work+0xc7/0xe0 [ 360.451327] [] wait_for_completion+0xa6/0x160 [ 360.451331] [] ? wake_up_state+0x20/0x20 [ 360.451335] [] _rcu_barrier+0x20c/0x480 [ 360.451338] [] rcu_barrier+0x15/0x20 [ 360.451342] [] netdev_run_todo+0x60/0x300 [ 360.451345] [] rtnl_unlock+0xe/0x10 [ 360.451353] [] internal_dev_destroy+0x55/0x80 [openvswitch] [ 360.451358] [] ovs_vport_del+0x32/0x40 [openvswitch] [ 360.451362] [] ovs_dp_detach_port+0x30/0x40 [openvswitch] [ 360.451366] [] ovs_vport_cmd_del+0xc5/0x110 [openvswitch] [ 360.451370] [] genl_family_rcv_msg+0x1a5/0x3c0 [ 360.451373] [] ? genl_family_rcv_msg+0x3c0/0x3c0 [ 360.451376] [] genl_rcv_msg+0x91/0xd0 [ 360.451379] [] netlink_rcv_skb+0xc1/0xe0 [ 360.451381] [] genl_rcv+0x2c/0x40 [ 360.451384] [] netlink_unicast+0xf6/0x200 [ 360.451387] [] netlink_sendmsg+0x31d/0x780 [ 360.451390] [] ? netlink_rcv_wake+0x44/0x60 [ 360.451394] [] sock_sendmsg+0x93/0xd0 [ 360.451399] [] ? apparmor_capable+0x60/0x60 [ 360.451402] [] ? verify_iovec+0x47/0xd0 [ 360.451406] [] ___sys_sendmsg+0x399/0x3b0 [ 360.451410] [] ? kernfs_seq_stop_active+0x32/0x40 [ 360.451414] [] ? native_sched_clock+0x35/0x90 [ 360.451417] [] ? native_sched_clock+0x35/0x90 [ 360.451419] [] ? sched_clock+0x9/0x10 [ 360.451424] [] ? acct_account_cputime+0x1c/0x20 [ 360.451427] [] ? account_user_time+0x8b/0xa0 [ 360.451431] [] ? __fget_light+0x25/0x70 [ 360.451434] [] __sys_sendmsg+0x42/0x80 [ 360.451437] [] SyS_sendmsg+0x12/0x20 [ 360.451440] [] tracesys_phase2+0xd8/0xdd [ 360.451442] rcu_show_nocb_setup(): rcu_sched nocb state: [ 360.456737] 0: 88013fc0e600 l:88013fc0e600 n:88013fc8e600 ... [ 360.463676] 1: 88013fc8e600 l:88013fc0e600 n: (null) ... [ 360.470614] 2: 88013fd0e600 l:88013fd0e600 n:88013fd8e600 N.. [ 360.477554] 3: 88013fd8e600 l:88013fd0e600 n: (null) N.. [ 360.484494] rcu_show_nocb_setup(): rcu_bh nocb state: [ 360.489529] 0: 88013fc0e3c0 l:88013fc0e3c0 n:88013fc8e3c0 ... [ 360.496469] 1: 88013fc8e3c0 l:88013fc0e3c0 n: (null) .G. [ 360.503407] 2: 88013fd0e3c0 l:88013fd0e3c0 n:88013fd8e3c0 ... [ 360.510346] 3: 88013fd8e3c0 l:88013fd0e3c0 n: (null) ... --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
;); > sched_show_task(t); > debug_show_held_locks(t); >+ rcu_show_nocb_setup(); > > touch_nmi_watchdog(); > >diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c >index 240fa9094f83..6b373e79ce0e 100644 >--- a/kernel/rcu/rcutorture.c >+++ b/kernel/rcu/rcutorture.c >@@ -1513,6 +1513,7 @@ rcu_torture_cleanup(void) > { > int i; > >+ rcu_show_nocb_setup(); > rcutorture_record_test_transition(); > if (torture_cleanup_begin()) { > if (cur_ops->cb_barrier != NULL) >diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h >index 927c17b081c7..285b3f6fb229 100644 >--- a/kernel/rcu/tree_plugin.h >+++ b/kernel/rcu/tree_plugin.h >@@ -2699,6 +2699,31 @@ static bool init_nocb_callback_list(struct rcu_data >*rdp) > > #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */ > >+void rcu_show_nocb_setup(void) >+{ >+#ifdef CONFIG_RCU_NOCB_CPU >+ int cpu; >+ struct rcu_data *rdp; >+ struct rcu_state *rsp; >+ >+ for_each_rcu_flavor(rsp) { >+ pr_alert("rcu_show_nocb_setup(): %s nocb state:\n", rsp->name); >+ for_each_possible_cpu(cpu) { >+ if (!rcu_is_nocb_cpu(cpu)) >+ continue; >+ rdp = per_cpu_ptr(rsp->rda, cpu); >+ pr_alert("%3d: %p l:%p n:%p %c%c%c\n", >+ cpu, >+ rdp, rdp->nocb_leader, rdp->nocb_next_follower, >+ ".N"[!!rdp->nocb_head], >+ ".G"[!!rdp->nocb_gp_head], >+ ".F"[!!rdp->nocb_follower_head]); >+ } >+ } >+#endif /* #ifdef CONFIG_RCU_NOCB_CPU */ >+} >+EXPORT_SYMBOL_GPL(rcu_show_nocb_setup); >+ > /* > * An adaptive-ticks CPU can potentially execute in kernel mode for an > * arbitrarily long period of time with the scheduling-clock tick turned > --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney wrote: >On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote: >> On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote: >> > On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote: >> > > On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote: >> > > > On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote: >> > > > > On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote: >> > > > > > On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote: >> > > > > > > On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote: >> > > > > > > > On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote: >> > > > > > > > > >> > > > > > > > > On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote: >> > > > > > > > > > On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti >> > > > > > > > > > wrote: >> > > > >> > > > [ . . . ] >> > > > >> > > > > > > Ok, unless I've messsed up something major, bisecting points to: >> > > > > > > >> > > > > > > 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs >> > > > > > > >> > > > > > > Makes any sense ? >> > > > > > >> > > > > > Good question. ;-) >> > > > > > >> > > > > > Are any of your online CPUs missing rcuo kthreads? There should be >> > > > > > kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each >> > > > > > online CPU. >> > > > > >> > > > > Its a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a >> > > > > reverted, the rcuos are 8 >> > > > > and the modprobe ppp_generic testcase reliably works, libvirt also >> > > > > manages >> > > > > to setup its bridge. >> > > > > >> > > > > Just with linux-tip , the rcuos are 6 but the failure is as reliable >> > > > > as >> > > > > before. >> > > >> > > > Thank you, very interesting. Which 6 of the rcuos are present? >> > > >> > > Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like >> > > this >> > > Phenom II. >> > >> > Ah, you get 8 without the patch because it creates them for potential >> > CPUs as well as real ones. OK, got it. >> > >> > > > > Awating instructions: :) >> > > > >> > > > Well, I thought I understood the problem until you found that only 6 of >> > > > the expected 8 rcuos are present with linux-tip without the revert. >> > > > ;-) >> > > > >> > > > I am putting together a patch for the part of the problem that I think >> > > > I understand, of course, but it would help a lot to know which two of >> > > > the rcuos are missing. ;-) >> > > >> > > Ready to test >> > >> > Well, if you are feeling aggressive, give the following patch a spin. >> > I am doing sanity tests on it in the meantime. >> >> Doesn't seem to make a difference here > >OK, inspection isn't cutting it, so time for tracing. Does the system >respond to user input? If so, please enable rcu:rcu_barrier ftrace before >the problem occurs, then dump the trace buffer after the problem occurs. My system is up and responsive when the problem occurs, so this shouldn't be a problem. Do you want the ftrace with your patch below, or unmodified tip of tree? -J > Thanx, Paul > >> > >> > >> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h >> > index 29fb23f33c18..927c17b081c7 100644 >> > --- a/kernel/rcu/tree_plugin.h >> > +++ b/kernel/rcu/tree_plugin.h >> > @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct >> > rcu_state *rsp, int cpu) >> >rdp->nocb_leader = rdp_spawn; >> >if (rdp_last && rdp != rdp_spawn) >> >rdp_last->nocb_next_follower = rdp; >> > - rdp_last = rdp; >> > - rdp = rdp->nocb_next_follower; >> > - rdp_last->nocb_next_follower = NULL; >> > + if (rdp == rdp_spawn) { >> > + rdp = rdp->nocb_next_follower; >> > + } else { >> > + rdp_last = rdp; >> > + rdp = rdp->nocb_next_follower; >> > + rdp_last->nocb_next_follower = NULL; >> > + } >> >} while (rdp); >> >rdp_spawn->nocb_next_follower = rdp_old_leader; >> >} >> > --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney wrote: >On Thu, Oct 23, 2014 at 09:48:34PM -0700, Jay Vosburgh wrote: >> Paul E. McKenney wrote: [...] >> >Either way, my patch assumed that 39953dfd4007 (rcu: Avoid misordering in >> >__call_rcu_nocb_enqueue()) would work and that 1772947bd012 (rcu: Handle >> >NOCB callbacks from irq-disabled idle code) would fail. Is that the case? >> >If not, could you please bisect the commits between 11ed7f934cb8 (rcu: >> >Make nocb leader kthreads process pending callbacks after spawning) >> >and c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())? >> >> Just a note to add that I am also reliably inducing what appears >> to be this issue on a current -net tree, when configuring openvswitch >> via script. I am available to test patches or bisect tomorrow (Friday) >> US time if needed. > >Thank you, Jay! Could you please check to see if reverting this commit >fixes things for you? > >35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs > >Reverting is not a long-term fix, as this commit is itself a bug fix, >but would be good to check to see if you are seeing the same thing that >Yanko is. ;-) Just to confirm what Yanko found, reverting this commit makes the problem go away for me. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Thu, Oct 23, 2014 at 09:48:34PM -0700, Jay Vosburgh wrote: Paul E. McKenney paul...@linux.vnet.ibm.com wrote: [...] Either way, my patch assumed that 39953dfd4007 (rcu: Avoid misordering in __call_rcu_nocb_enqueue()) would work and that 1772947bd012 (rcu: Handle NOCB callbacks from irq-disabled idle code) would fail. Is that the case? If not, could you please bisect the commits between 11ed7f934cb8 (rcu: Make nocb leader kthreads process pending callbacks after spawning) and c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())? Just a note to add that I am also reliably inducing what appears to be this issue on a current -net tree, when configuring openvswitch via script. I am available to test patches or bisect tomorrow (Friday) US time if needed. Thank you, Jay! Could you please check to see if reverting this commit fixes things for you? 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs Reverting is not a long-term fix, as this commit is itself a bug fix, but would be good to check to see if you are seeing the same thing that Yanko is. ;-) Just to confirm what Yanko found, reverting this commit makes the problem go away for me. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Fri, Oct 24, 2014 at 08:35:26PM +0300, Yanko Kaneti wrote: On Fri-10/24/14-2014 10:20, Paul E. McKenney wrote: On Fri, Oct 24, 2014 at 08:09:31PM +0300, Yanko Kaneti wrote: On Fri-10/24/14-2014 09:54, Paul E. McKenney wrote: On Fri, Oct 24, 2014 at 07:29:43PM +0300, Yanko Kaneti wrote: On Fri-10/24/14-2014 08:40, Paul E. McKenney wrote: On Fri, Oct 24, 2014 at 12:08:57PM +0300, Yanko Kaneti wrote: On Thu-10/23/14-2014 15:04, Paul E. McKenney wrote: On Fri, Oct 24, 2014 at 12:45:40AM +0300, Yanko Kaneti wrote: On Thu, 2014-10-23 at 13:05 -0700, Paul E. McKenney wrote: On Thu, Oct 23, 2014 at 10:51:59PM +0300, Yanko Kaneti wrote: [ . . . ] Ok, unless I've messsed up something major, bisecting points to: 35ce7f29a44a rcu: Create rcuo kthreads only for onlined CPUs Makes any sense ? Good question. ;-) Are any of your online CPUs missing rcuo kthreads? There should be kthreads named rcuos/0, rcuos/1, rcuos/2, and so on for each online CPU. Its a Phenom II X6. With 3.17 and linux-tip with 35ce7f29a44a reverted, the rcuos are 8 and the modprobe ppp_generic testcase reliably works, libvirt also manages to setup its bridge. Just with linux-tip , the rcuos are 6 but the failure is as reliable as before. Thank you, very interesting. Which 6 of the rcuos are present? Well, the rcuos are 0 to 5. Which sounds right for a 6 core CPU like this Phenom II. Ah, you get 8 without the patch because it creates them for potential CPUs as well as real ones. OK, got it. Awating instructions: :) Well, I thought I understood the problem until you found that only 6 of the expected 8 rcuos are present with linux-tip without the revert. ;-) I am putting together a patch for the part of the problem that I think I understand, of course, but it would help a lot to know which two of the rcuos are missing. ;-) Ready to test Well, if you are feeling aggressive, give the following patch a spin. I am doing sanity tests on it in the meantime. Doesn't seem to make a difference here OK, inspection isn't cutting it, so time for tracing. Does the system respond to user input? If so, please enable rcu:rcu_barrier ftrace before the problem occurs, then dump the trace buffer after the problem occurs. My system is up and responsive when the problem occurs, so this shouldn't be a problem. Do you want the ftrace with your patch below, or unmodified tip of tree? -J Thanx, Paul diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 29fb23f33c18..927c17b081c7 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2546,9 +2546,13 @@ static void rcu_spawn_one_nocb_kthread(struct rcu_state *rsp, int cpu) rdp-nocb_leader = rdp_spawn; if (rdp_last rdp != rdp_spawn) rdp_last-nocb_next_follower = rdp; - rdp_last = rdp; - rdp = rdp-nocb_next_follower; - rdp_last-nocb_next_follower = NULL; + if (rdp == rdp_spawn) { + rdp = rdp-nocb_next_follower; + } else { + rdp_last = rdp; + rdp = rdp-nocb_next_follower; + rdp_last-nocb_next_follower = NULL; + } } while (rdp); rdp_spawn-nocb_next_follower = rdp_old_leader; } --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
()) { if (cur_ops-cb_barrier != NULL) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 927c17b081c7..285b3f6fb229 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2699,6 +2699,31 @@ static bool init_nocb_callback_list(struct rcu_data *rdp) #endif /* #else #ifdef CONFIG_RCU_NOCB_CPU */ +void rcu_show_nocb_setup(void) +{ +#ifdef CONFIG_RCU_NOCB_CPU + int cpu; + struct rcu_data *rdp; + struct rcu_state *rsp; + + for_each_rcu_flavor(rsp) { + pr_alert(rcu_show_nocb_setup(): %s nocb state:\n, rsp-name); + for_each_possible_cpu(cpu) { + if (!rcu_is_nocb_cpu(cpu)) + continue; + rdp = per_cpu_ptr(rsp-rda, cpu); + pr_alert(%3d: %p l:%p n:%p %c%c%c\n, + cpu, + rdp, rdp-nocb_leader, rdp-nocb_next_follower, + .N[!!rdp-nocb_head], + .G[!!rdp-nocb_gp_head], + .F[!!rdp-nocb_follower_head]); + } + } +#endif /* #ifdef CONFIG_RCU_NOCB_CPU */ +} +EXPORT_SYMBOL_GPL(rcu_show_nocb_setup); + /* * An adaptive-ticks CPU can potentially execute in kernel mode for an * arbitrarily long period of time with the scheduling-clock tick turned --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
] 2: 88013fd0e600 l:88013fd0e600 n:88013fd8e600 N.. [ 240.393550] 3: 88013fd8e600 l:88013fd0e600 n: (null) N.. [ 240.400489] rcu_show_nocb_setup(): rcu_bh nocb state: [ 240.405525] 0: 88013fc0e3c0 l:88013fc0e3c0 n:88013fc8e3c0 ... [ 240.412463] 1: 88013fc8e3c0 l:88013fc0e3c0 n: (null) ... [ 240.419401] 2: 88013fd0e3c0 l:88013fd0e3c0 n:88013fd8e3c0 ... [ 240.426339] 3: 88013fd8e3c0 l:88013fd0e3c0 n: (null) ... [ 360.432020] INFO: task ovs-vswitchd:902 blocked for more than 120 seconds. [ 360.438881] Not tainted 3.17.0-testola+ #4 [ 360.443484] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 360.451289] ovs-vswitchdD 88013fc94600 0 902901 0x0004 [ 360.451293] 8800ab20f7b8 0002 8800b3304b00 8800ab20ffd8 [ 360.451297] 00014600 00014600 8800b081 8800b3304b00 [ 360.451300] 8800b3304b00 81c59850 81c59858 7fff [ 360.451303] Call Trace: [ 360.451311] [81722b99] schedule+0x29/0x70 [ 360.451314] [81725b6c] schedule_timeout+0x1dc/0x260 [ 360.451317] [81722f69] ? _cond_resched+0x29/0x40 [ 360.451320] [81723818] ? wait_for_completion+0x28/0x160 [ 360.451325] [811081a7] ? queue_stop_cpus_work+0xc7/0xe0 [ 360.451327] [81723896] wait_for_completion+0xa6/0x160 [ 360.451331] [81099980] ? wake_up_state+0x20/0x20 [ 360.451335] [810d0ecc] _rcu_barrier+0x20c/0x480 [ 360.451338] [810d1195] rcu_barrier+0x15/0x20 [ 360.451342] [81625010] netdev_run_todo+0x60/0x300 [ 360.451345] [8162f9ee] rtnl_unlock+0xe/0x10 [ 360.451353] [a01ffcc5] internal_dev_destroy+0x55/0x80 [openvswitch] [ 360.451358] [a01ff622] ovs_vport_del+0x32/0x40 [openvswitch] [ 360.451362] [a01f8dd0] ovs_dp_detach_port+0x30/0x40 [openvswitch] [ 360.451366] [a01f8ea5] ovs_vport_cmd_del+0xc5/0x110 [openvswitch] [ 360.451370] [81651d75] genl_family_rcv_msg+0x1a5/0x3c0 [ 360.451373] [81651f90] ? genl_family_rcv_msg+0x3c0/0x3c0 [ 360.451376] [81652021] genl_rcv_msg+0x91/0xd0 [ 360.451379] [81650091] netlink_rcv_skb+0xc1/0xe0 [ 360.451381] [816505bc] genl_rcv+0x2c/0x40 [ 360.451384] [8164f626] netlink_unicast+0xf6/0x200 [ 360.451387] [8164fa4d] netlink_sendmsg+0x31d/0x780 [ 360.451390] [8164ca74] ? netlink_rcv_wake+0x44/0x60 [ 360.451394] [81606a53] sock_sendmsg+0x93/0xd0 [ 360.451399] [81337700] ? apparmor_capable+0x60/0x60 [ 360.451402] [81614f27] ? verify_iovec+0x47/0xd0 [ 360.451406] [81606e79] ___sys_sendmsg+0x399/0x3b0 [ 360.451410] [812598a2] ? kernfs_seq_stop_active+0x32/0x40 [ 360.451414] [8101c385] ? native_sched_clock+0x35/0x90 [ 360.451417] [8101c385] ? native_sched_clock+0x35/0x90 [ 360.451419] [8101c3e9] ? sched_clock+0x9/0x10 [ 360.451424] [811277fc] ? acct_account_cputime+0x1c/0x20 [ 360.451427] [8109ce6b] ? account_user_time+0x8b/0xa0 [ 360.451431] [81200bd5] ? __fget_light+0x25/0x70 [ 360.451434] [81607c02] __sys_sendmsg+0x42/0x80 [ 360.451437] [81607c52] SyS_sendmsg+0x12/0x20 [ 360.451440] [81727464] tracesys_phase2+0xd8/0xdd [ 360.451442] rcu_show_nocb_setup(): rcu_sched nocb state: [ 360.456737] 0: 88013fc0e600 l:88013fc0e600 n:88013fc8e600 ... [ 360.463676] 1: 88013fc8e600 l:88013fc0e600 n: (null) ... [ 360.470614] 2: 88013fd0e600 l:88013fd0e600 n:88013fd8e600 N.. [ 360.477554] 3: 88013fd8e600 l:88013fd0e600 n: (null) N.. [ 360.484494] rcu_show_nocb_setup(): rcu_bh nocb state: [ 360.489529] 0: 88013fc0e3c0 l:88013fc0e3c0 n:88013fc8e3c0 ... [ 360.496469] 1: 88013fc8e3c0 l:88013fc0e3c0 n: (null) .G. [ 360.503407] 2: 88013fd0e3c0 l:88013fd0e3c0 n:88013fd8e3c0 ... [ 360.510346] 3: 88013fd8e3c0 l:88013fd0e3c0 n: (null) ... --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Fri, Oct 24, 2014 at 03:02:04PM -0700, Jay Vosburgh wrote: Paul E. McKenney paul...@linux.vnet.ibm.com wrote: [...] I've got an ftrace capture from unmodified -net, it looks like this: ovs-vswitchd-902 [000] 471.778441: rcu_barrier: rcu_sched Begin cpu -1 remaining 0 # 0 ovs-vswitchd-902 [000] 471.778452: rcu_barrier: rcu_sched Check cpu -1 remaining 0 # 0 ovs-vswitchd-902 [000] 471.778452: rcu_barrier: rcu_sched Inc1 cpu -1 remaining 0 # 1 ovs-vswitchd-902 [000] 471.778453: rcu_barrier: rcu_sched OnlineNoCB cpu 0 remaining 1 # 1 ovs-vswitchd-902 [000] 471.778453: rcu_barrier: rcu_sched OnlineNoCB cpu 1 remaining 2 # 1 ovs-vswitchd-902 [000] 471.778453: rcu_barrier: rcu_sched OnlineNoCB cpu 2 remaining 3 # 1 ovs-vswitchd-902 [000] 471.778454: rcu_barrier: rcu_sched OnlineNoCB cpu 3 remaining 4 # 1 OK, so it looks like your system has four CPUs, and rcu_barrier() placed callbacks on them all. No, the system has only two CPUs. It's an Intel Core 2 Duo E8400, and /proc/cpuinfo agrees that there are only 2. There is a potentially relevant-sounding message early in dmesg that says: [0.00] smpboot: Allowing 4 CPUs, 2 hotplug CPUs ovs-vswitchd-902 [000] 471.778454: rcu_barrier: rcu_sched Inc2 cpu -1 remaining 4 # 2 The above removes the extra count used to avoid races between posting new callbacks and completion of previously posted callbacks. rcuos/0-9 [000] ..s. 471.793150: rcu_barrier: rcu_sched CB cpu -1 remaining 3 # 2 rcuos/1-18[001] ..s. 471.793308: rcu_barrier: rcu_sched CB cpu -1 remaining 2 # 2 Two of the four callbacks fired, but the other two appear to be AWOL. And rcu_barrier() won't return until they all fire. I let it sit through several hung task cycles but that was all there was for rcu:rcu_barrier. I should have ftrace with the patch as soon as the kernel is done building, then I can try the below patch (I'll start it building now). Sounds very good, looking forward to hearing of the results. Going to bounce it for ftrace now, but the cpu count mismatch seemed important enough to mention separately. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Fri, Oct 24, 2014 at 03:59:31PM -0700, Paul E. McKenney wrote: [...] Hmmm... It sure looks like we have some callbacks stuck here. I clearly need to take a hard look at the sleep/wakeup code. Thank you for running this!!! Could you please try the following patch? If no joy, could you please add rcu:rcu_nocb_wake to the list of ftrace events? I tried the patch, it did not change the behavior. I enabled the rcu:rcu_barrier and rcu:rcu_nocb_wake tracepoints and ran it again (with this patch and the first patch from earlier today); the trace output is a bit on the large side so I put it and the dmesg log at: http://people.canonical.com/~jvosburgh/nocb-wake-dmesg.txt http://people.canonical.com/~jvosburgh/nocb-wake-trace.txt -J Thanx, Paul rcu: Kick rcuo kthreads after their CPU goes offline If a no-CBs CPU were to post an RCU callback with interrupts disabled after it entered the idle loop for the last time, there might be no deferred wakeup for the corresponding rcuo kthreads. This commit therefore adds a set of calls to do_nocb_deferred_wakeup() after the CPU has gone completely offline. Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 84b41b3c6ebd..f6880052b917 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3493,8 +3493,10 @@ static int rcu_cpu_notify(struct notifier_block *self, case CPU_DEAD_FROZEN: case CPU_UP_CANCELED: case CPU_UP_CANCELED_FROZEN: - for_each_rcu_flavor(rsp) + for_each_rcu_flavor(rsp) { rcu_cleanup_dead_cpu(cpu, rsp); + do_nocb_deferred_wakeup(per_cpu_ptr(rsp-rda, cpu)); + } break; default: break; --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
tex can be >held by two tasks concurrently, given that it does not appear to be a >reader-writer lock... > >Either way, my patch assumed that 39953dfd4007 (rcu: Avoid misordering in >__call_rcu_nocb_enqueue()) would work and that 1772947bd012 (rcu: Handle >NOCB callbacks from irq-disabled idle code) would fail. Is that the case? >If not, could you please bisect the commits between 11ed7f934cb8 (rcu: >Make nocb leader kthreads process pending callbacks after spawning) >and c847f14217d5 (rcu: Avoid misordering in nocb_leader_wait())? Just a note to add that I am also reliably inducing what appears to be this issue on a current -net tree, when configuring openvswitch via script. I am available to test patches or bisect tomorrow (Friday) US time if needed. The stack is as follows: [ 1320.492020] INFO: task ovs-vswitchd:1303 blocked for more than 120 seconds. [ 1320.498965] Not tainted 3.17.0-testola+ #1 [ 1320.503570] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1320.511374] ovs-vswitchdD 88013fc14600 0 1303 1302 0x0004 [ 1320.511378] 8801388d77d8 0002 880031144b00 8801388d7fd8 [ 1320.511382] 00014600 00014600 8800b092e400 880031144b00 [ 1320.511385] 8800b1126000 81c58ad0 81c58ad8 7fff [ 1320.511389] Call Trace: [ 1320.511396] [] schedule+0x29/0x70 [ 1320.511399] [] schedule_timeout+0x1dc/0x260 [ 1320.511404] [] ? check_preempt_curr+0x8d/0xa0 [ 1320.511407] [] ? ttwu_do_wakeup+0x1d/0xd0 [ 1320.511410] [] wait_for_completion+0xa6/0x160 [ 1320.511413] [] ? wake_up_state+0x20/0x20 [ 1320.511417] [] _rcu_barrier+0x157/0x200 [ 1320.511419] [] rcu_barrier+0x15/0x20 [ 1320.511423] [] netdev_run_todo+0x60/0x300 [ 1320.511427] [] rtnl_unlock+0xe/0x10 [ 1320.511435] [] internal_dev_destroy+0x55/0x80 [openvswitch] [ 1320.511440] [] ovs_vport_del+0x32/0x40 [openvswitch] [ 1320.511444] [] ovs_dp_detach_port+0x30/0x40 [openvswitch] [ 1320.511448] [] ovs_vport_cmd_del+0xc5/0x110 [openvswitch] [ 1320.511452] [] genl_family_rcv_msg+0x1a5/0x3c0 [ 1320.511455] [] ? genl_family_rcv_msg+0x3c0/0x3c0 [ 1320.511458] [] genl_rcv_msg+0x91/0xd0 [ 1320.511461] [] netlink_rcv_skb+0xc1/0xe0 [ 1320.511463] [] genl_rcv+0x2c/0x40 [ 1320.511466] [] netlink_unicast+0xf6/0x200 [ 1320.511468] [] netlink_sendmsg+0x31d/0x780 [ 1320.511472] [] ? netlink_rcv_wake+0x44/0x60 [ 1320.511475] [] ? netlink_recvmsg+0x1d3/0x3e0 [ 1320.511479] [] sock_sendmsg+0x93/0xd0 [ 1320.511484] [] ? apparmor_file_alloc_security+0x20/0x40 [ 1320.511487] [] ? verify_iovec+0x47/0xd0 [ 1320.511491] [] ___sys_sendmsg+0x399/0x3b0 [ 1320.511495] [] ? kernfs_seq_stop_active+0x32/0x40 [ 1320.511499] [] ? native_sched_clock+0x35/0x90 [ 1320.511502] [] ? native_sched_clock+0x35/0x90 [ 1320.511505] [] ? sched_clock+0x9/0x10 [ 1320.511509] [] ? acct_account_cputime+0x1c/0x20 [ 1320.511512] [] ? account_user_time+0x8b/0xa0 [ 1320.511516] [] ? __fget_light+0x25/0x70 [ 1320.511519] [] __sys_sendmsg+0x42/0x80 [ 1320.511521] [] SyS_sendmsg+0x12/0x20 [ 1320.511525] [] tracesys_phase2+0xd8/0xdd -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: localed stuck in recent 3.18 git in copy_net_ns?
[ 1320.511475] [816632e3] ? netlink_recvmsg+0x1d3/0x3e0 [ 1320.511479] [8161c463] sock_sendmsg+0x93/0xd0 [ 1320.511484] [81332d00] ? apparmor_file_alloc_security+0x20/0x40 [ 1320.511487] [8162a697] ? verify_iovec+0x47/0xd0 [ 1320.511491] [8161cc79] ___sys_sendmsg+0x399/0x3b0 [ 1320.511495] [81254e02] ? kernfs_seq_stop_active+0x32/0x40 [ 1320.511499] [8101c385] ? native_sched_clock+0x35/0x90 [ 1320.511502] [8101c385] ? native_sched_clock+0x35/0x90 [ 1320.511505] [8101c3e9] ? sched_clock+0x9/0x10 [ 1320.511509] [81122d5c] ? acct_account_cputime+0x1c/0x20 [ 1320.511512] [8109ce6b] ? account_user_time+0x8b/0xa0 [ 1320.511516] [811fc135] ? __fget_light+0x25/0x70 [ 1320.511519] [8161d372] __sys_sendmsg+0x42/0x80 [ 1320.511521] [8161d3c2] SyS_sendmsg+0x12/0x20 [ 1320.511525] [8173e6a4] tracesys_phase2+0xd8/0xdd -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1
Zheng Li wrote: >In bond mode tlb and alb, inactive slaves should keep inactive flag to 1 to >refuse to receive broadcast packets. Now, active slave send broadcast packets >(for example ARP requests) which will arrive inactive slaves on same host from >switch, but inactive slave's inactive flag is zero that cause bridge receive >the >broadcast packets to produce a wrong entry in forward table. Typical situation >is domu send some ARP request which go out from dom0 bond's active slave, then >the ARP broadcast request packets go back to inactive slave from switch, >because >the inactive slave's inactive flag is zero, kernel will receive the packets and >pass them to bridge that cause dom0's bridge map domu's MAC address to port of >bond, bridge should map domu's MAC to port of vif. I think the patch is ok (I don't have a machine to test it on at the moment), but the description above is leaving out some details about how the problem is induced. The actual problem being fixed here is that bond_open is not setting the inactive flag correctly for some modes (alb and tlb), resulting in the behavior described above if the bond has been administratively set down and then back up. This effect should not occur when slaves are added while the bond is up; it's something that only happens after a down/up bounce of the bond. That said, the patch itself looks fine to me. Signed-off-by: Jay Vosburgh -J >Signed-off-by: Zheng Li >--- > drivers/net/bonding/bond_main.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index e5628fc..f97d72e 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -3058,7 +3058,7 @@ static int bond_open(struct net_device *bond_dev) > if (bond_has_slaves(bond)) { > read_lock(>curr_slave_lock); > bond_for_each_slave(bond, slave, iter) { >- if ((bond->params.mode == BOND_MODE_ACTIVEBACKUP) >+ if ((bond->params.mode == BOND_MODE_ACTIVEBACKUP || >bond_is_lb(bond)) > && (slave != bond->curr_active_slave)) { > bond_set_slave_inactive_flags(slave, > > BOND_SLAVE_NOTIFY_NOW); >-- >1.7.6.5 --- -Jay Vosburgh, j.vosbu...@gmail.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1
Zheng Li zheng.x...@oracle.com wrote: In bond mode tlb and alb, inactive slaves should keep inactive flag to 1 to refuse to receive broadcast packets. Now, active slave send broadcast packets (for example ARP requests) which will arrive inactive slaves on same host from switch, but inactive slave's inactive flag is zero that cause bridge receive the broadcast packets to produce a wrong entry in forward table. Typical situation is domu send some ARP request which go out from dom0 bond's active slave, then the ARP broadcast request packets go back to inactive slave from switch, because the inactive slave's inactive flag is zero, kernel will receive the packets and pass them to bridge that cause dom0's bridge map domu's MAC address to port of bond, bridge should map domu's MAC to port of vif. I think the patch is ok (I don't have a machine to test it on at the moment), but the description above is leaving out some details about how the problem is induced. The actual problem being fixed here is that bond_open is not setting the inactive flag correctly for some modes (alb and tlb), resulting in the behavior described above if the bond has been administratively set down and then back up. This effect should not occur when slaves are added while the bond is up; it's something that only happens after a down/up bounce of the bond. That said, the patch itself looks fine to me. Signed-off-by: Jay Vosburgh j.vosbu...@gmail.com -J Signed-off-by: Zheng Li zheng.x...@oracle.com --- drivers/net/bonding/bond_main.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index e5628fc..f97d72e 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -3058,7 +3058,7 @@ static int bond_open(struct net_device *bond_dev) if (bond_has_slaves(bond)) { read_lock(bond-curr_slave_lock); bond_for_each_slave(bond, slave, iter) { - if ((bond-params.mode == BOND_MODE_ACTIVEBACKUP) + if ((bond-params.mode == BOND_MODE_ACTIVEBACKUP || bond_is_lb(bond)) (slave != bond-curr_active_slave)) { bond_set_slave_inactive_flags(slave, BOND_SLAVE_NOTIFY_NOW); -- 1.7.6.5 --- -Jay Vosburgh, j.vosbu...@gmail.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1 in tlb and alb mode.
Zheng Li wrote: >In bond mode tlb and alb, inactive slaves should keep inactive flag to >1 to refuse to receive broadcast packets. Now, active slave send broadcast >packets >(for example ARP requests) which will arrive inactive slaves on same host from >switch, >but inactive slave's inactive flag is zero that cause bridge receive the >broadcast >packets to produce a wrong entry in forward table. Typical situation is domu >send some >ARP request which go out from dom0 bond's active slave, then the ARP broadcast >request >packets go back to inactive slave from switch, because the inactive slave's >inactive >flag is zero, kernel will receive the packets and pass them to bridge, that >cause dom0's >bridge map domu's MAC address to port of bond, bridge should map domu's MAC to >port of vif. It's probably worth noting that this effect is something that happens after the bonding master device is opened with slaves, i.e., it's got a bunch of slaves, and is then set administratively down for whatever reason, and is now being set back up, and needs to set the active or inactive state of all the slaves. It'd also be a little easier to read if it was formatted for 80 columns. >Signed-off-by: Zheng Li >--- > drivers/net/bonding/bond_main.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index e5628fc..8761df6 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -3062,7 +3062,7 @@ static int bond_open(struct net_device *bond_dev) > && (slave != bond->curr_active_slave)) { > bond_set_slave_inactive_flags(slave, > > BOND_SLAVE_NOTIFY_NOW); >- } else { >+ } else if (!bond_is_lb(bond)) { > bond_set_slave_active_flags(slave, > > BOND_SLAVE_NOTIFY_NOW); This patch doesn't do anything for the modes that are bond_is_lb, i.e., the balance-tlb and -alb modes. I believe those two should be set similarly to active-backup: the curr_active_slave is active, other slaves are inactive. The "inactive" setting for alb is special, and means to not pass broadcast or multicast, but let unicast through. -J --- -Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1 in tlb and alb mode.
Zheng Li zheng.x...@oracle.com wrote: In bond mode tlb and alb, inactive slaves should keep inactive flag to 1 to refuse to receive broadcast packets. Now, active slave send broadcast packets (for example ARP requests) which will arrive inactive slaves on same host from switch, but inactive slave's inactive flag is zero that cause bridge receive the broadcast packets to produce a wrong entry in forward table. Typical situation is domu send some ARP request which go out from dom0 bond's active slave, then the ARP broadcast request packets go back to inactive slave from switch, because the inactive slave's inactive flag is zero, kernel will receive the packets and pass them to bridge, that cause dom0's bridge map domu's MAC address to port of bond, bridge should map domu's MAC to port of vif. It's probably worth noting that this effect is something that happens after the bonding master device is opened with slaves, i.e., it's got a bunch of slaves, and is then set administratively down for whatever reason, and is now being set back up, and needs to set the active or inactive state of all the slaves. It'd also be a little easier to read if it was formatted for 80 columns. Signed-off-by: Zheng Li zheng.x...@oracle.com --- drivers/net/bonding/bond_main.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index e5628fc..8761df6 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -3062,7 +3062,7 @@ static int bond_open(struct net_device *bond_dev) (slave != bond-curr_active_slave)) { bond_set_slave_inactive_flags(slave, BOND_SLAVE_NOTIFY_NOW); - } else { + } else if (!bond_is_lb(bond)) { bond_set_slave_active_flags(slave, BOND_SLAVE_NOTIFY_NOW); This patch doesn't do anything for the modes that are bond_is_lb, i.e., the balance-tlb and -alb modes. I believe those two should be set similarly to active-backup: the curr_active_slave is active, other slaves are inactive. The inactive setting for alb is special, and means to not pass broadcast or multicast, but let unicast through. -J --- -Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1.
zheng.li wrote: >于 2014年03月21日 01:02, Jay Vosburgh 写道: >> Zheng Li wrote: >> >>> Except bond mode 1, in other bond modes, inactive slaves should keep >>> inactive flag to 1 to refuse to receive broadcast packets. Now, active >>> slave send broadcast packets (for example ARP requests) which will >>> arrive inactive slaves on same host from switch, but inactive slave's >>> inactive flag is zero that cause bridge receive the broadcast packets >>> to produce a wrong entry in forward table. Typical situation is domu >>> send some ARP request which go out from dom0 bond's active slave, then >>> the ARP broadcast request packets go back to inactive slave from >>> switch, because the inactive slave's inactive flag is zero, kernel will >>> receive the packets and pass them to bridge, that cause dom0's bridge >>> map domu's MAC address to port of bond, bridge should map domu's MAC to >>> port of vif. >> >> I suspect this will break LACP (802.3ad) and Etherchannel >> (balance-xor, balance-rr) modes, as those modes can receive broadcast or >> multicast on any slave. In those cases, the switch knows about the >> aggregation, and will only send the broadcast / multicast to one of the >> ports, but the port selected is not always the same one. >> >> In which mode are you having trouble? >> >> -J > >Except bond mode 1, in other modes (major test in mode 6, and test all >other mode, except mode 1, all other modes has the bug), the bridge >make a wrong entry which map guest MAC to the port of bond, it should >map guest MAC to the port of vif. > >Env description: dom0's bridge contains bond1 and vif ports, bond1 as >port 1 , vif as port 2, bond1 has two slaves which connect a switch. >when from guest ping others ,the arp broadcast request will go out from >bond1's active slave, and then go back to itself inactive slave from >switch , in function of bond_should_deliver_exact_match will return >false by inactive is zero, return false will cause bridge receive the >arp request packets whose original is from guest through vif that let >bridge consider the SRC MAC of guest is from bond1 by analyzing the arp >broadcast packets, then make a wrong forward entry "MAC of guest, from >port 1 (bond1)" , the correct entry should be "MAC of guest , from port >2 (vif)". I believe I understand; the crux of the problem is that the broadcast is looped by the switch to the other bond port, which then processes it as a received packet. For the tlb and alb modes, you're logically correct; the slaves other than the active slave should be set to inactive when the bond is opened. This is how they are set when configured normally to limit inbound broadcast and multicast traffic to just one slave (because these modes do not configure the switch ports into channel groups or aggregators; the switch has no knowledge of the bond). Your patch is still incorrect, though; the slaves should be set to actually be inactive via bond_set_slave_inactive_flags, not left "semi-active," i.e., BOND_STATE_ACTIVE but slave->inactive set. More on the slave->inactive flag handling later, though. For the balance-rr/-xor and 802.3ad modes, I suspect you have a configuration problem of some sort. For those modes, the switch should not loop back the broadcast as you describe when correctly configured. The balance-rr/-xor modes should be connected to switch ports that are configured into a single Etherchannel group (static link aggregation). If they are not, the looping back behavior you describe is expected, as the switch will flood the broadcast to all ports. Similarly, the 802.3ad mode should be connected to a switch with the ports configured for LACP, in which case the ports will automatically configure into one or more aggregators, and again, the switch will limit broadcast and multicast traffic to just one member of an aggregator. The -rr/-xor/802.3ad modes must have all slaves set to active in order for them to correctly process incoming broadcast and multicast traffic in their proper configuration. Setting them to inactive will cause packet loss in those configurations. The 802.3ad mode is probably the easiest to examine; if you configure the switch ports for LACP mode, the /proc/net/bonding/bond0 file should indicate that all the slaves are in the same aggregator (have the same Aggregator ID), and that that aggregator is the active one. The switch should have similar indications, although the format is vendor-specific. If your switch is configured for LACP and you still have issues, please post the /proc/net/bonding/bond0 contents. Finally, getting back to the slave->inactive flag. Th
Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1.
zheng.li zheng.x...@oracle.com wrote: 于 2014年03月21日 01:02, Jay Vosburgh 写道: Zheng Li zheng.x...@oracle.com wrote: Except bond mode 1, in other bond modes, inactive slaves should keep inactive flag to 1 to refuse to receive broadcast packets. Now, active slave send broadcast packets (for example ARP requests) which will arrive inactive slaves on same host from switch, but inactive slave's inactive flag is zero that cause bridge receive the broadcast packets to produce a wrong entry in forward table. Typical situation is domu send some ARP request which go out from dom0 bond's active slave, then the ARP broadcast request packets go back to inactive slave from switch, because the inactive slave's inactive flag is zero, kernel will receive the packets and pass them to bridge, that cause dom0's bridge map domu's MAC address to port of bond, bridge should map domu's MAC to port of vif. I suspect this will break LACP (802.3ad) and Etherchannel (balance-xor, balance-rr) modes, as those modes can receive broadcast or multicast on any slave. In those cases, the switch knows about the aggregation, and will only send the broadcast / multicast to one of the ports, but the port selected is not always the same one. In which mode are you having trouble? -J Except bond mode 1, in other modes (major test in mode 6, and test all other mode, except mode 1, all other modes has the bug), the bridge make a wrong entry which map guest MAC to the port of bond, it should map guest MAC to the port of vif. Env description: dom0's bridge contains bond1 and vif ports, bond1 as port 1 , vif as port 2, bond1 has two slaves which connect a switch. when from guest ping others ,the arp broadcast request will go out from bond1's active slave, and then go back to itself inactive slave from switch , in function of bond_should_deliver_exact_match will return false by inactive is zero, return false will cause bridge receive the arp request packets whose original is from guest through vif that let bridge consider the SRC MAC of guest is from bond1 by analyzing the arp broadcast packets, then make a wrong forward entry MAC of guest, from port 1 (bond1) , the correct entry should be MAC of guest , from port 2 (vif). I believe I understand; the crux of the problem is that the broadcast is looped by the switch to the other bond port, which then processes it as a received packet. For the tlb and alb modes, you're logically correct; the slaves other than the active slave should be set to inactive when the bond is opened. This is how they are set when configured normally to limit inbound broadcast and multicast traffic to just one slave (because these modes do not configure the switch ports into channel groups or aggregators; the switch has no knowledge of the bond). Your patch is still incorrect, though; the slaves should be set to actually be inactive via bond_set_slave_inactive_flags, not left semi-active, i.e., BOND_STATE_ACTIVE but slave-inactive set. More on the slave-inactive flag handling later, though. For the balance-rr/-xor and 802.3ad modes, I suspect you have a configuration problem of some sort. For those modes, the switch should not loop back the broadcast as you describe when correctly configured. The balance-rr/-xor modes should be connected to switch ports that are configured into a single Etherchannel group (static link aggregation). If they are not, the looping back behavior you describe is expected, as the switch will flood the broadcast to all ports. Similarly, the 802.3ad mode should be connected to a switch with the ports configured for LACP, in which case the ports will automatically configure into one or more aggregators, and again, the switch will limit broadcast and multicast traffic to just one member of an aggregator. The -rr/-xor/802.3ad modes must have all slaves set to active in order for them to correctly process incoming broadcast and multicast traffic in their proper configuration. Setting them to inactive will cause packet loss in those configurations. The 802.3ad mode is probably the easiest to examine; if you configure the switch ports for LACP mode, the /proc/net/bonding/bond0 file should indicate that all the slaves are in the same aggregator (have the same Aggregator ID), and that that aggregator is the active one. The switch should have similar indications, although the format is vendor-specific. If your switch is configured for LACP and you still have issues, please post the /proc/net/bonding/bond0 contents. Finally, getting back to the slave-inactive flag. The only difference between bond_set_slave_active_flags or _inactive_flags and bond_set_slave_state is that the slave-inactive flag (whose only purpose is duplicate suppression for incoming packets) is cleared by the first (_flags variant), but not by the second. Right now, the only caller of _set_slave_state
Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1.
Zheng Li wrote: >Except bond mode 1, in other bond modes, inactive slaves should keep >inactive flag to 1 to refuse to receive broadcast packets. Now, active >slave send broadcast packets (for example ARP requests) which will >arrive inactive slaves on same host from switch, but inactive slave's >inactive flag is zero that cause bridge receive the broadcast packets >to produce a wrong entry in forward table. Typical situation is domu >send some ARP request which go out from dom0 bond's active slave, then >the ARP broadcast request packets go back to inactive slave from >switch, because the inactive slave's inactive flag is zero, kernel will >receive the packets and pass them to bridge, that cause dom0's bridge >map domu's MAC address to port of bond, bridge should map domu's MAC to >port of vif. I suspect this will break LACP (802.3ad) and Etherchannel (balance-xor, balance-rr) modes, as those modes can receive broadcast or multicast on any slave. In those cases, the switch knows about the aggregation, and will only send the broadcast / multicast to one of the ports, but the port selected is not always the same one. In which mode are you having trouble? -J > >Signed-off-by: Zheng Li >--- > drivers/net/bonding/bond_main.c |2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index e5628fc..2f73f18 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -3063,7 +3063,7 @@ static int bond_open(struct net_device *bond_dev) > bond_set_slave_inactive_flags(slave, > > BOND_SLAVE_NOTIFY_NOW); > } else { >- bond_set_slave_active_flags(slave, >+ bond_set_slave_state(slave, BOND_STATE_ACTIVE, > > BOND_SLAVE_NOTIFY_NOW); > } > } >-- >1.7.6.5 > --- -Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bonding: Inactive slaves should keep inactive flag's value to 1.
Zheng Li zheng.x...@oracle.com wrote: Except bond mode 1, in other bond modes, inactive slaves should keep inactive flag to 1 to refuse to receive broadcast packets. Now, active slave send broadcast packets (for example ARP requests) which will arrive inactive slaves on same host from switch, but inactive slave's inactive flag is zero that cause bridge receive the broadcast packets to produce a wrong entry in forward table. Typical situation is domu send some ARP request which go out from dom0 bond's active slave, then the ARP broadcast request packets go back to inactive slave from switch, because the inactive slave's inactive flag is zero, kernel will receive the packets and pass them to bridge, that cause dom0's bridge map domu's MAC address to port of bond, bridge should map domu's MAC to port of vif. I suspect this will break LACP (802.3ad) and Etherchannel (balance-xor, balance-rr) modes, as those modes can receive broadcast or multicast on any slave. In those cases, the switch knows about the aggregation, and will only send the broadcast / multicast to one of the ports, but the port selected is not always the same one. In which mode are you having trouble? -J Signed-off-by: Zheng Li zheng.x...@oracle.com --- drivers/net/bonding/bond_main.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index e5628fc..2f73f18 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -3063,7 +3063,7 @@ static int bond_open(struct net_device *bond_dev) bond_set_slave_inactive_flags(slave, BOND_SLAVE_NOTIFY_NOW); } else { - bond_set_slave_active_flags(slave, + bond_set_slave_state(slave, BOND_STATE_ACTIVE, BOND_SLAVE_NOTIFY_NOW); } } -- 1.7.6.5 --- -Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: https://lkml.org/lkml/2013/2/1/531
Smart Weblications GmbH - Florian Wiessner wrote: >Am 22.05.2013 22:04, schrieb Greg KH: > > >>> >>> https://kernel.googlesource.com/pub/scm/linux/kernel/git/jkirsher/net-next/+/567b871e503316b0927e54a3d7c86d50b722d955%5E!/ >> >> Ok, that's what we need. >> >> Now, please cc: the developers / maintainers of that patch and ask them >> to have it included in the 3.4-stable kernel series. >> >> Then, if they agree, the network maintainer will pick it up and send it >> to me for inclusion. >> > >i set committerDavid S. Miller in cc already, >but do not >know the network maintainer... > >this seems to me that "Matthew O'Connor" sent this to >netdev on 2013-02-01: > >http://lists.openwall.net/netdev/2013/02/01/86 > >but i couldn't find a trace of the patch in 3.4.36?! The patch in question here is in net-next; the commit is: commit 567b871e503316b0927e54a3d7c86d50b722d955 Author: zheng.li Date: Tue Nov 27 23:57:04 2012 + bonding: rlb mode of bond should not alter ARP originating via bridge The additional change in the backport from Matthew O'Connor (to add ether_addr_equal_64bits) appears to still be necessary for 3.4.46. Alternatively, the patch could utilize ether_addr_equal instead, to minimize the change set. Greg, do you have a preference there? Submissions for stable from networking normally go through Davem; I can check the patch and repost it to netdev against 3.4.46 if everybody is ok with that. -J --- -Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: https://lkml.org/lkml/2013/2/1/531
Smart Weblications GmbH - Florian Wiessner wrote: Am 22.05.2013 22:04, schrieb Greg KH: https://kernel.googlesource.com/pub/scm/linux/kernel/git/jkirsher/net-next/+/567b871e503316b0927e54a3d7c86d50b722d955%5E!/ Ok, that's what we need. Now, please cc: the developers / maintainers of that patch and ask them to have it included in the 3.4-stable kernel series. Then, if they agree, the network maintainer will pick it up and send it to me for inclusion. i set committerDavid S. Miller da...@davemloft.net in cc already, but do not know the network maintainer... this seems to me that Matthew O'Connor liquidho...@gmail.com sent this to netdev on 2013-02-01: http://lists.openwall.net/netdev/2013/02/01/86 but i couldn't find a trace of the patch in 3.4.36?! The patch in question here is in net-next; the commit is: commit 567b871e503316b0927e54a3d7c86d50b722d955 Author: zheng.li zheng.x...@oracle.com Date: Tue Nov 27 23:57:04 2012 + bonding: rlb mode of bond should not alter ARP originating via bridge The additional change in the backport from Matthew O'Connor (to add ether_addr_equal_64bits) appears to still be necessary for 3.4.46. Alternatively, the patch could utilize ether_addr_equal instead, to minimize the change set. Greg, do you have a preference there? Submissions for stable from networking normally go through Davem; I can check the patch and repost it to netdev against 3.4.46 if everybody is ok with that. -J --- -Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7
Linda Walsh wrote: >Sorry for the delay my distro (Suse) has made rebooting my system >a chore (have to often boot from rescue to get it to come up because >they put mount libs in /usr/lib expecting they will always boot >from their ram disk -- preventing those of use who boot directly >from disk from doing so easily...grrr. > >Jay Vosburgh wrote: >> The miimon functionality is used to check link state and notice >> when slaves lose carrier. >--- > If I am running 'rr' on 2 channels -- specifically for the purpose >of link speed aggregation (getting 1 20Gb channel out of 2 10Gb channels) >I'm not sure I see how miimon would provide benefit. -- if 1 link dies, >the other, being on the same card is likely to be dead too, so would >it really serve a purpose? Perhaps, but if the link partner experiences a failure, that may be a different situation. Not all failures will necessarily cause both links to fail simultaneously. >> Running without it will not detect failure of >> the bonding slaves, which is likely not what you want. The mode, >> balance-rr in your case, is what selects the load balance to use, and is >> separate from the miimon. >> > > Wouldn't the entire link die if a slave dies -- like RAID0, 1 disk >dies, the entire link goes down? No; failure of a single slave does not cause the entire bond to fail (unless that is the last available slave). For round robin, a failed slave is taken out of the set used to transmit traffic, and any remaining slaves continue to round robin amongst themselves. > The other end (windows) doesn't dynamically config for a static-link >aggregation, so I don't think it would provide benefit. So it (windows) has no means to disable (and discontinue use of) one channel of the aggregation should it fail, even in a static link aggregation? >> That said, the problem you're seeing appears to be caused by two >> things: bonding holds a lock (in addition to RTNL) when calling >> __ethtool_get_settings, and an ixgbe function in the call path to >> retrieve the settings, ixgbe_acquire_swfw_sync_X540, can sleep. >> >> The test patch above handles one case in bond_enslave, but there >> is another case in bond_miimon_commit when a slave changes link state >> from down to up, which will occur shortly after the slave is added. >> > > Added your 2nd patch -- no more error messages... > > however -- likely unrelated, the max speed read or write I am seeing >is about 500MB/s, and that is rare -- usually it's barely <3x a 1Gb >network speed. (119/125 MB R/W). I'm not at all sure it's really >combining the links >properly. Anyway to verify that? How are you testing the throughput? If you configure the aggregation with just one link, how does the throughput compare to the aggregation with both links? It most likely is combining links properly, but any link aggregation scheme has tradeoffs, and the best load balance algorithm to use depends upon the work load. Two aggregated 10G links are not interchangable with a single 20G link. For a round robin transmission scheme, issues arise because packets are delivered at the other end out of order. This in turn triggers various TCP behaviors to deal with what is perceived to be transmission errors or lost packets (TCP fast retransmit being the most notable). This usually results in a single TCP connection being unable to completely saturate a round-robin aggregated set of links. There are a few parameters on linux that can be adjusted. I don't know what the windows equivalents might be. On linux, adjusting the net.ipv4.tcp_reordering sysctl value will increase the tolerance for out of order delivery. The sysctl is adjusted via something like sysctl -w net.ipv4.tcp_reordering=10 the default value is 3, and higher values increase the tolerance for out of order delivery. If memory serves, the setting is applied to connections as they are created, so existing connections will not see changes. Also, adjusting the packet coalescing setting for the receiving devices may also permit higher throughput. The packet coalescing setting is adjusted via ethtool; the current settings can be viewed via ethtool -c eth0 and then adjusted via something like ethtool -C eth0 rx-usecs 30 I've seen reports that raising the "rx-usecs" parameter at the receiver can increase the round-robin throughput. My recollection is that the value used was 30, but the best settings will likely be dependent upon your particular hardware and configuration. > On the windows side it shows the bond-link as a 20Gb connection, but >I don't see anyplace for something similar on linux. There isn't any such in
Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7
Linda Walsh l...@tlinx.org wrote: Sorry for the delay my distro (Suse) has made rebooting my system a chore (have to often boot from rescue to get it to come up because they put mount libs in /usr/lib expecting they will always boot from their ram disk -- preventing those of use who boot directly from disk from doing so easily...grrr. Jay Vosburgh wrote: The miimon functionality is used to check link state and notice when slaves lose carrier. --- If I am running 'rr' on 2 channels -- specifically for the purpose of link speed aggregation (getting 1 20Gb channel out of 2 10Gb channels) I'm not sure I see how miimon would provide benefit. -- if 1 link dies, the other, being on the same card is likely to be dead too, so would it really serve a purpose? Perhaps, but if the link partner experiences a failure, that may be a different situation. Not all failures will necessarily cause both links to fail simultaneously. Running without it will not detect failure of the bonding slaves, which is likely not what you want. The mode, balance-rr in your case, is what selects the load balance to use, and is separate from the miimon. Wouldn't the entire link die if a slave dies -- like RAID0, 1 disk dies, the entire link goes down? No; failure of a single slave does not cause the entire bond to fail (unless that is the last available slave). For round robin, a failed slave is taken out of the set used to transmit traffic, and any remaining slaves continue to round robin amongst themselves. The other end (windows) doesn't dynamically config for a static-link aggregation, so I don't think it would provide benefit. So it (windows) has no means to disable (and discontinue use of) one channel of the aggregation should it fail, even in a static link aggregation? That said, the problem you're seeing appears to be caused by two things: bonding holds a lock (in addition to RTNL) when calling __ethtool_get_settings, and an ixgbe function in the call path to retrieve the settings, ixgbe_acquire_swfw_sync_X540, can sleep. The test patch above handles one case in bond_enslave, but there is another case in bond_miimon_commit when a slave changes link state from down to up, which will occur shortly after the slave is added. Added your 2nd patch -- no more error messages... however -- likely unrelated, the max speed read or write I am seeing is about 500MB/s, and that is rare -- usually it's barely 3x a 1Gb network speed. (119/125 MB R/W). I'm not at all sure it's really combining the links properly. Anyway to verify that? How are you testing the throughput? If you configure the aggregation with just one link, how does the throughput compare to the aggregation with both links? It most likely is combining links properly, but any link aggregation scheme has tradeoffs, and the best load balance algorithm to use depends upon the work load. Two aggregated 10G links are not interchangable with a single 20G link. For a round robin transmission scheme, issues arise because packets are delivered at the other end out of order. This in turn triggers various TCP behaviors to deal with what is perceived to be transmission errors or lost packets (TCP fast retransmit being the most notable). This usually results in a single TCP connection being unable to completely saturate a round-robin aggregated set of links. There are a few parameters on linux that can be adjusted. I don't know what the windows equivalents might be. On linux, adjusting the net.ipv4.tcp_reordering sysctl value will increase the tolerance for out of order delivery. The sysctl is adjusted via something like sysctl -w net.ipv4.tcp_reordering=10 the default value is 3, and higher values increase the tolerance for out of order delivery. If memory serves, the setting is applied to connections as they are created, so existing connections will not see changes. Also, adjusting the packet coalescing setting for the receiving devices may also permit higher throughput. The packet coalescing setting is adjusted via ethtool; the current settings can be viewed via ethtool -c eth0 and then adjusted via something like ethtool -C eth0 rx-usecs 30 I've seen reports that raising the rx-usecs parameter at the receiver can increase the round-robin throughput. My recollection is that the value used was 30, but the best settings will likely be dependent upon your particular hardware and configuration. On the windows side it shows the bond-link as a 20Gb connection, but I don't see anyplace for something similar on linux. There isn't any such indicator; bonding does not advertise its link speed as the sum of its slaves link speeds. -J --- -Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com -- To unsubscribe from this list: send the line unsubscribe linux
Re: [PATCH] bonding: rlb mode of bond should not alter ARP originating via bridge
Zheng Li wrote: >Do not modify or load balance ARP packets passing through balance-alb >mode (wherein the ARP did not originate locally, and arrived via a bridge). > >Modifying pass-through ARP replies causes an incorrect MAC address >to be placed into the ARP packet, rendering peers unable to communicate >with the actual destination from which the ARP reply originated. > >Load balancing pass-through ARP requests causes an entry to be >created for the peer in the rlb table, and bond_alb_monitor will >occasionally issue ARP updates to all peers in the table instrucing them >as to which MAC address they should communicate with; this occurs when >some event sets rx_ntt. In the bridged case, however, the MAC address >used for the update would be the MAC of the slave, not the actual source >MAC of the originating destination. This would render peers unable to >communicate with the destinations beyond the bridge. > >Signed-off-by: Zheng Li >Cc: Jay Vosburgh >Cc: Andy Gospodarek >Cc: "David S. Miller" Signed-off-by: Jay Vosburgh >--- > drivers/net/bonding/bond_alb.c |6 ++ > drivers/net/bonding/bonding.h | 13 + > 2 files changed, 19 insertions(+), 0 deletions(-) > >diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c >index e15cc11..6fecb52 100644 >--- a/drivers/net/bonding/bond_alb.c >+++ b/drivers/net/bonding/bond_alb.c >@@ -694,6 +694,12 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, >struct bonding *bond) > struct arp_pkt *arp = arp_pkt(skb); > struct slave *tx_slave = NULL; > >+ /* Don't modify or load balance ARPs that do not originate locally >+ * (e.g.,arrive via a bridge). >+ */ >+ if (!bond_slave_has_mac(bond, arp->mac_src)) >+ return NULL; >+ > if (arp->op_code == htons(ARPOP_REPLY)) { > /* the arp must be sent on the selected > * rx channel >diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h >index f8af2fc..6dded56 100644 >--- a/drivers/net/bonding/bonding.h >+++ b/drivers/net/bonding/bonding.h >@@ -22,6 +22,7 @@ > #include > #include > #include >+#include > #include "bond_3ad.h" > #include "bond_alb.h" > >@@ -450,6 +451,18 @@ static inline void bond_destroy_proc_dir(struct bond_net >*bn) > } > #endif > >+static inline struct slave *bond_slave_has_mac(struct bonding *bond, >+ const u8 *mac) >+{ >+ int i = 0; >+ struct slave *tmp; >+ >+ bond_for_each_slave(bond, tmp, i) >+ if (ether_addr_equal_64bits(mac, tmp->dev->dev_addr)) >+ return tmp; >+ >+ return NULL; >+} > > /* exported from bond_main.c */ > extern int bond_net_id; >-- >1.7.6.5 > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bonding: rlb mode of bond should not alter ARP originating via bridge
Zheng Li zheng.x...@oracle.com wrote: Do not modify or load balance ARP packets passing through balance-alb mode (wherein the ARP did not originate locally, and arrived via a bridge). Modifying pass-through ARP replies causes an incorrect MAC address to be placed into the ARP packet, rendering peers unable to communicate with the actual destination from which the ARP reply originated. Load balancing pass-through ARP requests causes an entry to be created for the peer in the rlb table, and bond_alb_monitor will occasionally issue ARP updates to all peers in the table instrucing them as to which MAC address they should communicate with; this occurs when some event sets rx_ntt. In the bridged case, however, the MAC address used for the update would be the MAC of the slave, not the actual source MAC of the originating destination. This would render peers unable to communicate with the destinations beyond the bridge. Signed-off-by: Zheng Li zheng.x...@oracle.com Cc: Jay Vosburgh fu...@us.ibm.com Cc: Andy Gospodarek a...@greyhouse.net Cc: David S. Miller da...@davemloft.net Signed-off-by: Jay Vosburgh fu...@us.ibm.com --- drivers/net/bonding/bond_alb.c |6 ++ drivers/net/bonding/bonding.h | 13 + 2 files changed, 19 insertions(+), 0 deletions(-) diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c index e15cc11..6fecb52 100644 --- a/drivers/net/bonding/bond_alb.c +++ b/drivers/net/bonding/bond_alb.c @@ -694,6 +694,12 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond) struct arp_pkt *arp = arp_pkt(skb); struct slave *tx_slave = NULL; + /* Don't modify or load balance ARPs that do not originate locally + * (e.g.,arrive via a bridge). + */ + if (!bond_slave_has_mac(bond, arp-mac_src)) + return NULL; + if (arp-op_code == htons(ARPOP_REPLY)) { /* the arp must be sent on the selected * rx channel diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index f8af2fc..6dded56 100644 --- a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -22,6 +22,7 @@ #include linux/in6.h #include linux/netpoll.h #include linux/inetdevice.h +#include linux/etherdevice.h #include bond_3ad.h #include bond_alb.h @@ -450,6 +451,18 @@ static inline void bond_destroy_proc_dir(struct bond_net *bn) } #endif +static inline struct slave *bond_slave_has_mac(struct bonding *bond, + const u8 *mac) +{ + int i = 0; + struct slave *tmp; + + bond_for_each_slave(bond, tmp, i) + if (ether_addr_equal_64bits(mac, tmp-dev-dev_addr)) + return tmp; + + return NULL; +} /* exported from bond_main.c */ extern int bond_net_id; -- 1.7.6.5 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7
et_settings+0x34/0x2b0 >[ 59.791381] [] __ethtool_get_settings+0x85/0x140 >[ 59.791386] [] bond_update_speed_duplex+0x23/0x60 >[ 59.791389] [] bond_mii_monitor+0x354/0x640 >[ 59.791393] [] process_one_work+0x1a7/0x680 >[ 59.791396] [] ? process_one_work+0x146/0x680 >[ 59.791402] [] ? put_lock_stats.isra.21+0xe/0x40 >[ 59.791411] [] ? bond_loadbalance_arp_mon+0x2c0/0x2c0 >[ 59.791421] [] worker_thread+0x18d/0x4f0 >[ 59.791434] [] ? sub_preempt_count+0x51/0x60 >[ 59.791442] [] ? manage_workers+0x320/0x320 >[ 59.791453] [] kthread+0x9d/0xb0 >[ 59.791460] [] kernel_thread_helper+0x4/0x10 >[ 59.791464] [] ? finish_task_switch+0x77/0x100 >[ 59.791468] [] ? _raw_spin_unlock_irq+0x36/0x60 >[ 59.791472] [] ? retint_restore_args+0xe/0xe >[ 59.791476] [] ? flush_kthread_worker+0x160/0x160 >[ 59.791480] [] ? gs_change+0xb/0xb >[ 59.794932] BUG: scheduling while atomic: kworker/u:1/103/0x0002 >[ 59.801333] 4 locks held by kworker/u:1/103: >[ 59.801340] #0: ((bond_dev->name)){..}, at: [] >process_one_work+0x146/0x680 >[ 59.801345] #1: ((&(>mii_work)->work)){..}, at: >[] process_one_work+0x146/0x680 >[ 59.801350] #2: (rtnl_mutex){..}, at: [] >rtnl_trylock+0x10/0x20 >[ 59.801356] #3: (>lock){..}, at: [] >bond_mii_monitor+0x2ed/0x640 >[ 59.801365] Modules linked in: fan kvm_intel mousedev kvm iTCO_wdt >iTCO_vendor_support acpi_cpufreq tpm_tis tpm tpm_bios mperf processor >[ 59.801368] Pid: 103, comm: kworker/u:1 Tainted: GW >3.6.8-Isht-Van #4 >[ 59.801369] Call Trace: >[ 59.801373] [] __schedule_bug+0x5e/0x6c >[ 59.801380] [] __schedule+0x77c/0x810 >[ 59.801385] [] schedule+0x24/0x70 >[ 59.801391] [] >schedule_hrtimeout_range_clock+0xfc/0x140 >[ 59.801395] [] ? update_rmtp+0x60/0x60 >[ 59.801399] [] ? update_rmtp+0x60/0x60 >[ 59.801404] [] ? hrtimer_start_range_ns+0xf/0x20 >[ 59.801409] [] schedule_hrtimeout_range+0xe/0x10 >[ 59.801414] [] usleep_range+0x3b/0x40 >[ 59.801419] [] ixgbe_release_swfw_sync_X540+0x4e/0x60 >[ 59.801424] [] ixgbe_read_phy_reg_generic+0x101/0x120 >[ 59.801429] [] >ixgbe_get_copper_link_capabilities_generic+0x2c/0x60 >[ 59.801433] [] ? bond_mii_monitor+0x2ed/0x640 >[ 59.801441] [] ixgbe_get_settings+0x34/0x2b0 >[ 59.801446] [] __ethtool_get_settings+0x85/0x140 >[ 59.801450] [] bond_update_speed_duplex+0x23/0x60 >[ 59.801471] [] bond_mii_monitor+0x354/0x640 >[ 59.801475] [] process_one_work+0x1a7/0x680 >[ 59.801477] [] ? process_one_work+0x146/0x680 >[ 59.801481] [] ? put_lock_stats.isra.21+0xe/0x40 >[ 59.801484] [] ? bond_loadbalance_arp_mon+0x2c0/0x2c0 >[ 59.801489] [] worker_thread+0x18d/0x4f0 >[ 59.801495] [] ? sub_preempt_count+0x51/0x60 >[ 59.801500] [] ? manage_workers+0x320/0x320 >[ 59.801505] [] kthread+0x9d/0xb0 >[ 59.801510] [] kernel_thread_helper+0x4/0x10 >[ 59.801515] [] ? finish_task_switch+0x77/0x100 >[ 59.801519] [] ? _raw_spin_unlock_irq+0x36/0x60 >[ 59.801524] [] ? retint_restore_args+0xe/0xe >[ 59.801530] [] ? flush_kthread_worker+0x160/0x160 >[ 59.801536] [] ? gs_change+0xb/0xb >[ 59.804986] bonding: bond0: link status definitely up for interface >p2p2, 1 Mbps full duplex. --- -Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: BUG: scheduling while atomic: ifup-bonding/3711/0x00000002 -- V3.6.7
] ? update_rmtp+0x60/0x60 [ 59.791334] [81065a1f] ? hrtimer_start_range_ns+0xf/0x20 [ 59.791339] [81684c3e] schedule_hrtimeout_range+0xe/0x10 [ 59.791345] [8104bddb] usleep_range+0x3b/0x40 [ 59.791352] [814d220c] ixgbe_acquire_swfw_sync_X540+0xbc/0x110 [ 59.791357] [814ce4dd] ixgbe_read_phy_reg_generic+0x3d/0x120 [ 59.791361] [814ce74c] ixgbe_get_copper_link_capabilities_generic+0x2c/0x60 [ 59.791366] [81480b5d] ? bond_mii_monitor+0x2ed/0x640 [ 59.791372] [814c6454] ixgbe_get_settings+0x34/0x2b0 [ 59.791381] [8159af55] __ethtool_get_settings+0x85/0x140 [ 59.791386] [8147c6e3] bond_update_speed_duplex+0x23/0x60 [ 59.791389] [81480bc4] bond_mii_monitor+0x354/0x640 [ 59.791393] [8105a9b7] process_one_work+0x1a7/0x680 [ 59.791396] [8105a956] ? process_one_work+0x146/0x680 [ 59.791402] [8108c7ce] ? put_lock_stats.isra.21+0xe/0x40 [ 59.791411] [81480870] ? bond_loadbalance_arp_mon+0x2c0/0x2c0 [ 59.791421] [8105b9ed] worker_thread+0x18d/0x4f0 [ 59.791434] [81070991] ? sub_preempt_count+0x51/0x60 [ 59.791442] [8105b860] ? manage_workers+0x320/0x320 [ 59.791453] [81060f7d] kthread+0x9d/0xb0 [ 59.791460] [816892e4] kernel_thread_helper+0x4/0x10 [ 59.791464] [8106c197] ? finish_task_switch+0x77/0x100 [ 59.791468] [81687526] ? _raw_spin_unlock_irq+0x36/0x60 [ 59.791472] [81687a5d] ? retint_restore_args+0xe/0xe [ 59.791476] [81060ee0] ? flush_kthread_worker+0x160/0x160 [ 59.791480] [816892e0] ? gs_change+0xb/0xb [ 59.794932] BUG: scheduling while atomic: kworker/u:1/103/0x0002 [ 59.801333] 4 locks held by kworker/u:1/103: [ 59.801340] #0: ((bond_dev-name)){..}, at: [8105a956] process_one_work+0x146/0x680 [ 59.801345] #1: (((bond-mii_work)-work)){..}, at: [8105a956] process_one_work+0x146/0x680 [ 59.801350] #2: (rtnl_mutex){..}, at: [815a4dd0] rtnl_trylock+0x10/0x20 [ 59.801356] #3: (bond-lock){..}, at: [81480b5d] bond_mii_monitor+0x2ed/0x640 [ 59.801365] Modules linked in: fan kvm_intel mousedev kvm iTCO_wdt iTCO_vendor_support acpi_cpufreq tpm_tis tpm tpm_bios mperf processor [ 59.801368] Pid: 103, comm: kworker/u:1 Tainted: GW 3.6.8-Isht-Van #4 [ 59.801369] Call Trace: [ 59.801373] [8167bb36] __schedule_bug+0x5e/0x6c [ 59.801380] [816859bc] __schedule+0x77c/0x810 [ 59.801385] [81685ad4] schedule+0x24/0x70 [ 59.801391] [81684bec] schedule_hrtimeout_range_clock+0xfc/0x140 [ 59.801395] [81064c80] ? update_rmtp+0x60/0x60 [ 59.801399] [81064c80] ? update_rmtp+0x60/0x60 [ 59.801404] [81065a1f] ? hrtimer_start_range_ns+0xf/0x20 [ 59.801409] [81684c3e] schedule_hrtimeout_range+0xe/0x10 [ 59.801414] [8104bddb] usleep_range+0x3b/0x40 [ 59.801419] [814d213e] ixgbe_release_swfw_sync_X540+0x4e/0x60 [ 59.801424] [814ce5a1] ixgbe_read_phy_reg_generic+0x101/0x120 [ 59.801429] [814ce74c] ixgbe_get_copper_link_capabilities_generic+0x2c/0x60 [ 59.801433] [81480b5d] ? bond_mii_monitor+0x2ed/0x640 [ 59.801441] [814c6454] ixgbe_get_settings+0x34/0x2b0 [ 59.801446] [8159af55] __ethtool_get_settings+0x85/0x140 [ 59.801450] [8147c6e3] bond_update_speed_duplex+0x23/0x60 [ 59.801471] [81480bc4] bond_mii_monitor+0x354/0x640 [ 59.801475] [8105a9b7] process_one_work+0x1a7/0x680 [ 59.801477] [8105a956] ? process_one_work+0x146/0x680 [ 59.801481] [8108c7ce] ? put_lock_stats.isra.21+0xe/0x40 [ 59.801484] [81480870] ? bond_loadbalance_arp_mon+0x2c0/0x2c0 [ 59.801489] [8105b9ed] worker_thread+0x18d/0x4f0 [ 59.801495] [81070991] ? sub_preempt_count+0x51/0x60 [ 59.801500] [8105b860] ? manage_workers+0x320/0x320 [ 59.801505] [81060f7d] kthread+0x9d/0xb0 [ 59.801510] [816892e4] kernel_thread_helper+0x4/0x10 [ 59.801515] [8106c197] ? finish_task_switch+0x77/0x100 [ 59.801519] [81687526] ? _raw_spin_unlock_irq+0x36/0x60 [ 59.801524] [81687a5d] ? retint_restore_args+0xe/0xe [ 59.801530] [81060ee0] ? flush_kthread_worker+0x160/0x160 [ 59.801536] [816892e0] ? gs_change+0xb/0xb [ 59.804986] bonding: bond0: link status definitely up for interface p2p2, 1 Mbps full duplex. --- -Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bonding: rlb mode of bond should not alter ARP originating via bridge
/* arp requests are broadcast and are sent on the primary >> * the arp request will collapse all clients on the subnet to >> * the primary slave. We must register these clients to be >> * updated with their assigned mac. >> */ >> rlb_req_update_subnet_clients(bond, arp->ip_src); >> >> that arranges for clients to be given ARP updates for their >> slave assignments (which may change to the active slave, due to the ARP >> broadcast being sent via the active slave). >> >> I think the ARP reply side of this is fine (and is what is >> described in teh changelog), but the ARP request behavior change is new >> with this version. >> >> Since prior versions of the patch didn't cause this code to be >> skipped, is this change intentional? >> >> Did you check to see if the above logic is necessary for ARP >> requests passing through via a bridge to prevent peers from "stacking" >> (in terms of load balance assignment) on the active slave due to bridged >> ARP traffic? >> >> -J >> >>> Signed-off-by: Zheng Li >>> Cc: Jay Vosburgh >>> Cc: Andy Gospodarek >>> Cc: "David S. Miller" >>> >>> --- >>> drivers/net/bonding/bond_alb.c |6 ++ >>> drivers/net/bonding/bonding.h | 13 + >>> 2 files changed, 19 insertions(+), 0 deletions(-) >>> >>> diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c >>> index e15cc11..75f6f0d 100644 >>> --- a/drivers/net/bonding/bond_alb.c >>> +++ b/drivers/net/bonding/bond_alb.c >>> @@ -694,6 +694,12 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, >>> struct bonding *bond) >>> struct arp_pkt *arp = arp_pkt(skb); >>> struct slave *tx_slave = NULL; >>> >>> + /* Only modify ARP's MAC if it originates locally; >>> +* don't change ARPs arriving via a bridge. >>> +*/ >>> + if (!bond_slave_has_mac(bond, arp->mac_src)) >>> + return NULL; >>> + >>> if (arp->op_code == htons(ARPOP_REPLY)) { >>> /* the arp must be sent on the selected >>> * rx channel >>> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h >>> index f8af2fc..6dded56 100644 >>> --- a/drivers/net/bonding/bonding.h >>> +++ b/drivers/net/bonding/bonding.h >>> @@ -22,6 +22,7 @@ >>> #include >>> #include >>> #include >>> +#include >>> #include "bond_3ad.h" >>> #include "bond_alb.h" >>> >>> @@ -450,6 +451,18 @@ static inline void bond_destroy_proc_dir(struct >>> bond_net *bn) >>> } >>> #endif >>> >>> +static inline struct slave *bond_slave_has_mac(struct bonding *bond, >>> + const u8 *mac) >>> +{ >>> + int i = 0; >>> + struct slave *tmp; >>> + >>> + bond_for_each_slave(bond, tmp, i) >>> + if (ether_addr_equal_64bits(mac, tmp->dev->dev_addr)) >>> + return tmp; >>> + >>> + return NULL; >>> +} >>> >>> /* exported from bond_main.c */ >>> extern int bond_net_id; >>> -- >>> 1.7.6.5 --- -Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bonding: rlb mode of bond should not alter ARP originating via bridge
to the active slave, due to the ARP broadcast being sent via the active slave). I think the ARP reply side of this is fine (and is what is described in teh changelog), but the ARP request behavior change is new with this version. Since prior versions of the patch didn't cause this code to be skipped, is this change intentional? Did you check to see if the above logic is necessary for ARP requests passing through via a bridge to prevent peers from stacking (in terms of load balance assignment) on the active slave due to bridged ARP traffic? -J Signed-off-by: Zheng Li zheng.x...@oracle.com Cc: Jay Vosburgh fu...@us.ibm.com Cc: Andy Gospodarek a...@greyhouse.net Cc: David S. Miller da...@davemloft.net --- drivers/net/bonding/bond_alb.c |6 ++ drivers/net/bonding/bonding.h | 13 + 2 files changed, 19 insertions(+), 0 deletions(-) diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c index e15cc11..75f6f0d 100644 --- a/drivers/net/bonding/bond_alb.c +++ b/drivers/net/bonding/bond_alb.c @@ -694,6 +694,12 @@ static struct slave *rlb_arp_xmit(struct sk_buff *skb, struct bonding *bond) struct arp_pkt *arp = arp_pkt(skb); struct slave *tx_slave = NULL; + /* Only modify ARP's MAC if it originates locally; +* don't change ARPs arriving via a bridge. +*/ + if (!bond_slave_has_mac(bond, arp-mac_src)) + return NULL; + if (arp-op_code == htons(ARPOP_REPLY)) { /* the arp must be sent on the selected * rx channel diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h index f8af2fc..6dded56 100644 --- a/drivers/net/bonding/bonding.h +++ b/drivers/net/bonding/bonding.h @@ -22,6 +22,7 @@ #include linux/in6.h #include linux/netpoll.h #include linux/inetdevice.h +#include linux/etherdevice.h #include bond_3ad.h #include bond_alb.h @@ -450,6 +451,18 @@ static inline void bond_destroy_proc_dir(struct bond_net *bn) } #endif +static inline struct slave *bond_slave_has_mac(struct bonding *bond, + const u8 *mac) +{ + int i = 0; + struct slave *tmp; + + bond_for_each_slave(bond, tmp, i) + if (ether_addr_equal_64bits(mac, tmp-dev-dev_addr)) + return tmp; + + return NULL; +} /* exported from bond_main.c */ extern int bond_net_id; -- 1.7.6.5 --- -Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bonding: in balance-rr mode, set curr_active_slave only if it is up
Michal Kubecek wrote: >If all slaves of a balance-rr bond with ARP monitor are enslaved >with down link state, bond keeps down state even after slaves >go up. > >This is caused by bond_enslave() setting curr_active_slave to >first slave not taking into account its link state. As >bond_loadbalance_arp_mon() uses curr_active_slave to identify >whether slave's down->up transition should update bond's link >state, bond stays down even if slaves are up (until first slave >goes from up to down at least once). The bond_loadbalance_arp_mon function actually uses curr_active_slave to determine whether or not to do a "failover" (select a new active slave), which in turn will call bond_set_carrier() from within bond_select_active_slave(). Other than that nitpick about the description, I see how setting curr_active_slave to a down slave would cause loadbalance_arp_mon to skip the "failover" step (because it presumes that an active slave is always up, and therefore no new one needs to be selected), and thus skip setting the master's carrier state. -J >Before commit f31c7937 "bonding: start slaves with link down for >ARP monitor", this was masked by slaves always starting in UP >state with ARP monitor (and MII monitor not relying on >curr_active_slave being NULL if there is no slave up). > >Signed-off-by: Michal Kubecek Signed-off-by: Jay Vosburgh > drivers/net/bonding/bond_main.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 5f5b69f..c8bff3e 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -1838,7 +1838,7 @@ int bond_enslave(struct net_device *bond_dev, struct >net_device *slave_dev) >* anyway (it holds no special properties of the bond device), >* so we can change it without calling change_active_interface() >*/ >- if (!bond->curr_active_slave) >+ if (!bond->curr_active_slave && new_slave->link == BOND_LINK_UP) > bond->curr_active_slave = new_slave; > > break; >-- >1.7.10.4 > --- -Jay Vosburgh, IBM Linux Technology Center, fu...@us.ibm.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/