date:20150716

Re: [PATCH net-next] rocker: forward packets to CPU when port is joined to openvswitch

2015-07-16 Thread Scott Feldman

On Wed, Jul 15, 2015 at 11:58 PM, Jiri Pirko  wrote:
> Thu, Jul 16, 2015 at 08:40:31AM CEST, sfel...@gmail.com wrote:
>>On Wed, Jul 15, 2015 at 6:39 PM, Simon Horman
>> wrote:
>>> Teach rocker to forward packets to CPU when a port is joined to Open 
>>> vSwitch.
>>> There is scope to later refine what is passed up as per Open vSwitch flows
>>> on a port.
>>>
>>> This does not change the behaviour of rocker ports that are
>>> not joined to Open vSwitch.
>>>
>>> Signed-off-by: Simon Horman 
>>
>>Acked-by: Scott Feldman 
>>
>>Now, OVS flows on a port.  Strangely enough, that was the first RFC
>>implementation for switchdev/rocker where we hooked into ovs-kernel
>>module and programmed flows into hw.  We pulled all of that code
>>because, IIRC, the ovs folks didn't want us hooking into the kernel
>>module directly.  We dropped the ovs hooks and focused on hooking
>>kernel's L2/L3.  The device (rocker) didn't really change: OF-DPA
>>pipeline was used for both.  Might be interesting to try hooking it
>>again.
>
>
> I think that now we have an infrastructure prepared for that. I mean,
> what we need to do is to introduce another generic switchdev object
> called "ntupleflow" and hook-up again into ovs datapath and cls_flower
> and insert/remove the object from those codes. Should be pretty easy to do.

That sounds right.  Is the ovs datapath hooking still happening in the
ovs-kernel module?  Remind me again, what was the objection the last
time we tried that?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Revert "net: fec: Ensure clocks are enabled while using mdio bus"

2015-07-16 Thread Clemens Gruber

On Wed, Jul 15, 2015 at 05:32:13PM -0700, David Miller wrote:
> From: Clemens Gruber 
> Date: Thu, 16 Jul 2015 02:04:04 +0200
> 
> > This reverts commit 6c3e921b18edca290099adfddde8a50236bf2d80.
> > 
> > The change did break ethernet support on the i.MX6Q and possibly also on
> > other platforms: The PHY was not detected anymore and eth0 was not found.
> > 
> > Signed-off-by: Clemens Gruber 
> 
> This patch was already posted and in my queue (did you look?) and I applied
> it earlier today.

Oh, nice. Thanks! (Yes, before the commit. Must have overlapped, sorry!)

[no subject]

2015-07-16 Thread Clemens Gruber

unsubscribe

Re: Sending IPv6 packets broken in net-next

2015-07-16 Thread Phil Sutter

On Thu, Jul 16, 2015 at 12:20:46AM +0200, Phil Sutter wrote:
> Commit 9131f3d ("ipv6: Do not iterate over all interfaces when finding
> source address on specific interface") breaks local output of IPv6
> packets. Here is a simple reproducer:

I just noticed, a patch fixing the issue has already been sent to this
list[1]. So never mind and sorry for the traffic!

Cheers, Phil

[1]: http://www.spinics.net/lists/netdev/msg335586.html

[PATCH net-next] ipv6: Remove unused arguments for __ipv6_dev_get_saddr().

2015-07-16 Thread YOSHIFUJI Hideaki

Signed-off-by: YOSHIFUJI Hideaki 
---
 net/ipv6/addrconf.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4c9a024..32153c2 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1360,8 +1360,6 @@ out:
 
 static int __ipv6_dev_get_saddr(struct net *net,
 				struct ipv6_saddr_dst *dst,
-				unsigned int prefs,
-				const struct in6_addr *saddr,
 				struct inet6_dev *idev,
 				struct ipv6_saddr_score *scores,
 				int hiscore_idx)
@@ -1485,13 +1483,13 @@ int ipv6_dev_get_saddr(struct net *net, const struct net_device *dst_dev,
 
 	if (use_oif_addr) {
 		if (idev)
-			hiscore_idx = __ipv6_dev_get_saddr(net, &dst, prefs, saddr, idev, scores, hiscore_idx);
+			hiscore_idx = __ipv6_dev_get_saddr(net, &dst, idev, scores, hiscore_idx);
 	} else {
 		for_each_netdev_rcu(net, dev) {
 			idev = __in6_dev_get(dev);
 			if (!idev)
 				continue;
-			hiscore_idx = __ipv6_dev_get_saddr(net, &dst, prefs, saddr, idev, scores, hiscore_idx);
+			hiscore_idx = __ipv6_dev_get_saddr(net, &dst, idev, scores, hiscore_idx);
 		}
 	}
 	rcu_read_unlock();
-- 
1.9.1


-- 
Hideaki Yoshifuji 
Technical Division, MIRACLE LINUX CORPORATION

Re: [PATCH net-next] ipv6: Remove unused arguments for __ipv6_dev_get_saddr().

2015-07-16 Thread David Miller

From: YOSHIFUJI Hideaki 
Date: Thu, 16 Jul 2015 16:51:30 +0900

> Signed-off-by: YOSHIFUJI Hideaki 

Applied, thanks.

[PATCH net-next v2 4/5] rocker: add offload_fwd_mark support

2015-07-16 Thread sfeldma

From: Scott Feldman 

If device flags ingress packet as "fwd offload", mark the
skb->offload_fwd_mark using the ingress port's dev->offload_fwd_mark.  This
will be the hint to the kernel that this packet has already been forwarded
by device to egress ports matching skb->offload_fwd_mark.

For rocker, derive port dev->offload_fwd_mark based on device switch ID and
port ifindex.  If port is bridged, use the bridge ifindex rather than the
port ifindex.

Signed-off-by: Scott Feldman 
---
 drivers/net/ethernet/rocker/rocker.c |   11 +++
 drivers/net/ethernet/rocker/rocker.h |1 +
 2 files changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
index 9324283..0fdfa47 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -4800,6 +4800,7 @@ static int rocker_port_rx_proc(const struct rocker *rocker,
const struct rocker_tlv *attrs[ROCKER_TLV_RX_MAX + 1];
struct sk_buff *skb = rocker_desc_cookie_ptr_get(desc_info);
size_t rx_len;
+   u16 rx_flags = 0;
 
if (!skb)
return -ENOENT;
@@ -4807,6 +4808,8 @@ static int rocker_port_rx_proc(const struct rocker *rocker,
rocker_tlv_parse_desc(attrs, ROCKER_TLV_RX_MAX, desc_info);
if (!attrs[ROCKER_TLV_RX_FRAG_LEN])
return -EINVAL;
+   if (attrs[ROCKER_TLV_RX_FLAGS])
+   rx_flags = rocker_tlv_get_u16(attrs[ROCKER_TLV_RX_FLAGS]);
 
rocker_dma_rx_ring_skb_unmap(rocker, attrs);
 
@@ -4814,6 +4817,9 @@ static int rocker_port_rx_proc(const struct rocker *rocker,
skb_put(skb, rx_len);
skb->protocol = eth_type_trans(skb, rocker_port->dev);
 
+   if (rx_flags & ROCKER_RX_FLAGS_FWD_OFFLOAD)
+   skb->offload_fwd_mark = rocker_port->dev->offload_fwd_mark;
+
rocker_port->dev->stats.rx_packets++;
rocker_port->dev->stats.rx_bytes += skb->len;
 
@@ -4951,6 +4957,8 @@ static int rocker_probe_port(struct rocker *rocker, unsigned int port_number)
}
rocker->ports[port_number] = rocker_port;
 
+   switchdev_port_fwd_mark_set(rocker_port->dev, NULL, false);
+
rocker_port_set_learning(rocker_port, SWITCHDEV_TRANS_NONE);
 
err = rocker_port_ig_tbl(rocker_port, SWITCHDEV_TRANS_NONE, 0);
@@ -5230,6 +5238,7 @@ static int rocker_port_bridge_join(struct rocker_port *rocker_port,
rocker_port_internal_vlan_id_get(rocker_port, bridge->ifindex);
 
rocker_port->bridge_dev = bridge;
+   switchdev_port_fwd_mark_set(rocker_port->dev, bridge, true);
 
return rocker_port_vlan_add(rocker_port, SWITCHDEV_TRANS_NONE,
untagged_vid, 0);
@@ -5250,6 +5259,8 @@ static int rocker_port_bridge_leave(struct rocker_port *rocker_port)
rocker_port_internal_vlan_id_get(rocker_port,
 rocker_port->dev->ifindex);
 
+   switchdev_port_fwd_mark_set(rocker_port->dev, rocker_port->bridge_dev,
+   false);
rocker_port->bridge_dev = NULL;
 
err = rocker_port_vlan_add(rocker_port, SWITCHDEV_TRANS_NONE,
diff --git a/drivers/net/ethernet/rocker/rocker.h b/drivers/net/ethernet/rocker/rocker.h
index 08b2c3d..12490b2 100644
--- a/drivers/net/ethernet/rocker/rocker.h
+++ b/drivers/net/ethernet/rocker/rocker.h
@@ -246,6 +246,7 @@ enum {
 #define ROCKER_RX_FLAGS_TCP			BIT(5)
 #define ROCKER_RX_FLAGS_UDP			BIT(6)
 #define ROCKER_RX_FLAGS_TCP_UDP_CSUM_GOOD	BIT(7)
+#define ROCKER_RX_FLAGS_FWD_OFFLOAD		BIT(8)
 
 enum {
ROCKER_TLV_TX_UNSPEC,
-- 
1.7.10.4


[PATCH net-next v2 0/5] switchdev: avoid duplicate packet forwarding

2015-07-16 Thread sfeldma

From: Scott Feldman 

v2:

 - Per davem review: in sk_buff, union fwd_mark with secmark to save space
   since features appear to be mutually exclusive.
 - Per Simon Horman review:
   - fix grammar in switchdev.txt wrt fwd_mark
   - remove some unrelated changes that snuck in
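
A minimal userspace sketch of the space saving mentioned in the first item
above, assuming a C11 compiler for the anonymous union.  The struct names
`separate` and `shared` are hypothetical and carry only the fields relevant
here; the real struct sk_buff is far larger and guards these fields with
config options.

```c
#include <stdint.h>

/* Hypothetical layout with the two marks stored side by side. */
struct separate {
	uint32_t secmark;
	uint32_t offload_fwd_mark;
	uint32_t mark;
};

/*
 * Hypothetical layout with the two marks sharing storage.  This is only
 * safe under the assumption stated in the changelog: the two features are
 * not used on the same packet at the same time.
 */
struct shared {
	union {
		uint32_t secmark;
		uint32_t offload_fwd_mark;
	};
	uint32_t mark;
};
```

With these definitions `sizeof(struct shared)` is four bytes smaller than
`sizeof(struct separate)`, and a write to one union member is visible
through the other, which is exactly why the features must be exclusive.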

v1:

This patchset was previously submitted as RFC.  No changes from the last
version (v2) sent under RFC.  Including RFC version history here for reference.

RFC v2:

 - s/fwd_mark/offload_fwd_mark
 - use consume_skb rather than kfree_skb when dropping pkt on egress.
 - Use Jiri's suggestion to use ifindex of one of the ports in a group
   as the mark for all the ports in the group.  This can be done with
   no additional storage (no hashtable from v1).  To pull it off, we
   need some simple recursive routines to walk the netdev tree ensuring
   all leaves in the tree (ports) in the same group (e.g. bridge)
   belonging to the same switch device will have the same offload fwd mark.
   Maybe someone sees a better design for the recursive routines?  They're
   not too bad, and should cover the stacked driver cases.
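
The join-time rule in the last item can be sketched in userspace as
follows.  This is a deliberately flat model, not the recursive netdev tree
walk the series uses: `struct sw_port`, `group_join()` and the string
switch IDs are hypothetical stand-ins, and leaving a group is not modeled.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a switch port netdev. */
struct sw_port {
	int ifindex;
	const char *switch_id;	/* stand-in for the parent (switch) ID */
	int in_group;		/* non-zero once the port joined the group */
	uint32_t mark;
};

/*
 * Join port p to the group formed by ports[0..n-1].  If a member on the
 * same switch is already in the group, reuse its mark; otherwise the port
 * keeps its own ifindex as the mark.  No extra table is needed.
 */
static void group_join(struct sw_port *ports, int n, struct sw_port *p)
{
	int i;

	p->mark = (uint32_t)p->ifindex;
	for (i = 0; i < n; i++) {
		if (&ports[i] != p && ports[i].in_group &&
		    strcmp(ports[i].switch_id, p->switch_id) == 0) {
			p->mark = ports[i].mark;
			break;
		}
	}
	p->in_group = 1;
}
```

Which ifindex ends up as the shared mark depends on join order in this
model: the first port of a given switch to join donates its ifindex to the
later ones.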

RFC v1:

With switchdev support for offloading L2/L3 forwarding data path to a
switch device, we have a general problem where both the device and the
kernel may forward the packet, resulting in duplicate packets on the wire.
Anytime a packet is forwarded by the device and a copy is sent to the CPU,
there is potential for duplicate forwarding, as the kernel may also do a
forwarding lookup and send the packet on the wire.

The specific problem this patch series is interested in solving is avoiding
duplicate packets on bridged ports.  There was a previous RFC from Roopa
(http://marc.info/?l=linux-netdev&m=142687073314252&w=2) to address this
problem, but didn't solve the problem of mixed ports in the bridge from
different devices; there was no way to exclude some ports from forwarding
and include others.  This RFC solves that problem by tagging the ingressing
packet with a unique mark, and then comparing the packet mark with the
egress port mark, and skip forwarding when there is a match.  For the mixed
ports bridge case, only those ports with matching marks are skipped.
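
As a rough userspace illustration of the comparison described above, the
following sketch counts how many bridge ports the software path would still
transmit on.  `struct demo_pkt`, `struct demo_port` and both functions are
hypothetical names; exclusion of the ingress port by the bridge itself is
not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the skb and for an egress port netdev. */
struct demo_pkt  { uint32_t offload_fwd_mark; };
struct demo_port { const char *name; uint32_t offload_fwd_mark; };

/*
 * True when software forwarding should be skipped for this port: the
 * packet carries a non-zero mark equal to the port's mark, so the device
 * is assumed to have forwarded it there already.
 */
static bool fwd_already_offloaded(const struct demo_pkt *pkt,
				  const struct demo_port *port)
{
	return pkt->offload_fwd_mark != 0 &&
	       pkt->offload_fwd_mark == port->offload_fwd_mark;
}

/* Count the ports the software path still has to transmit on. */
static int sw_forward_count(const struct demo_pkt *pkt,
			    const struct demo_port *ports, int n)
{
	int i, count = 0;

	for (i = 0; i < n; i++)
		if (!fwd_already_offloaded(pkt, &ports[i]))
			count++;
	return count;
}
```

In a mixed bridge with two ports carrying mark 2, one port carrying mark 5
and one unmarked port, a packet marked 2 is skipped on the first two ports
only; an unmarked packet is forwarded on all four.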

The switchdev port driver must do two things:

1) Generate a fwd_mark for each switch port, using some unique key of the
   switch device (and optionally port).  This is done when the port netdev
   is registered or if the port's group membership changes (joins/leaves
   a bridge, for example).

2) On packet ingress from port, mark the skb with the ingress port's
   fwd_mark.  If the device supports it, it's useful to only mark skbs
   which were already forwarded by the device.  If the device does not
   support such indication, all skbs can be marked, even if they're
   local dst.
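
The two driver duties listed above might look like this in a simplified
userspace model.  All names here (`rx_port`, `rx_pkt`, `port_register()`,
`port_rx()`, `RX_FLAG_FWD_OFFLOAD`) are hypothetical; the flag stands for
whatever per-packet indication the device provides, and the ungrouped case
is assumed, where the mark is simply the port's ifindex.

```c
#include <stdint.h>

/* Hypothetical "device already forwarded this packet" RX flag. */
#define RX_FLAG_FWD_OFFLOAD (1u << 8)

struct rx_port { int ifindex; uint32_t offload_fwd_mark; };
struct rx_pkt  { uint32_t offload_fwd_mark; };

/* Duty 1: give the port a mark when it is registered (ungrouped case). */
static void port_register(struct rx_port *p, int ifindex)
{
	p->ifindex = ifindex;
	p->offload_fwd_mark = (uint32_t)ifindex;
}

/*
 * Duty 2: on ingress, copy the port's mark into the packet, but only when
 * the device reported that it forwarded the packet itself.
 */
static void port_rx(const struct rx_port *p, uint16_t rx_flags,
		    struct rx_pkt *pkt)
{
	pkt->offload_fwd_mark = 0;
	if (rx_flags & RX_FLAG_FWD_OFFLOAD)
		pkt->offload_fwd_mark = p->offload_fwd_mark;
}
```

A device that cannot report per-packet forwarding would, per the text
above, mark every received packet instead of testing a flag.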

Two new 32-bit fields are added to struct sk_buff and struct net_device to
hold the fwd_mark.  I've wrapped these with CONFIG_NET_SWITCHDEV for now. I
tried using skb->mark for this purpose, but ebtables can overwrite the
skb->mark before the bridge gets it, so that will not work.

In general, this fwd_mark can be used for any case where a packet is
forwarded by the device and a copy is sent to the CPU, to avoid the kernel
re-forwarding the packet.  sFlow is another use-case that comes to mind,
but I haven't explored the details.


Scott Feldman (5):
  net: don't reforward packets already forwarded by offload device
  net: add phys ID compare helper to test if two IDs are the same
  switchdev: add offload_fwd_mark generator helper
  rocker: add offload_fwd_mark support
  switchdev: update documentation for offload_fwd_mark

 Documentation/networking/switchdev.txt |   14 +++-
 drivers/net/ethernet/rocker/rocker.c   |   11 
 drivers/net/ethernet/rocker/rocker.h   |1 +
 include/linux/netdevice.h  |   13 
 include/linux/skbuff.h |   11 +++-
 include/net/switchdev.h|9 +++
 net/core/dev.c |   10 +++
 net/switchdev/switchdev.c  |  111 ++--
 8 files changed, 171 insertions(+), 9 deletions(-)

-- 
1.7.10.4


[PATCH net-next v2 2/5] net: add phys ID compare helper to test if two IDs are the same

2015-07-16 Thread sfeldma

From: Scott Feldman 

Signed-off-by: Scott Feldman 
---
 include/linux/netdevice.h |7 +++
 net/switchdev/switchdev.c |8 ++--
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8364f29..607b5f4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -766,6 +766,13 @@ struct netdev_phys_item_id {
unsigned char id_len;
 };
 
+static inline bool netdev_phys_item_id_same(struct netdev_phys_item_id *a,
+   struct netdev_phys_item_id *b)
+{
+   return a->id_len == b->id_len &&
+  memcmp(a->id, b->id, a->id_len) == 0;
+}
+
 typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
   struct sk_buff *skb);
 
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 9f2add3..4e5bba5 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -910,13 +910,9 @@ static struct net_device *switchdev_get_dev_by_nhs(struct fib_info *fi)
if (switchdev_port_attr_get(dev, &attr))
return NULL;
 
-   if (nhsel > 0) {
-   if (prev_attr.u.ppid.id_len != attr.u.ppid.id_len)
+   if (nhsel > 0 &&
+   !netdev_phys_item_id_same(&prev_attr.u.ppid, &attr.u.ppid))
return NULL;
-   if (memcmp(prev_attr.u.ppid.id, attr.u.ppid.id,
-  attr.u.ppid.id_len))
-   return NULL;
-   }
 
prev_attr = attr;
}
-- 
1.7.10.4


[PATCH net-next v2 3/5] switchdev: add offload_fwd_mark generator helper

2015-07-16 Thread sfeldma

From: Scott Feldman 

skb->offload_fwd_mark and dev->offload_fwd_mark are 32-bit and should be
unique for device and may even be unique for a sub-set of ports within
device, so add switchdev helper function to generate unique marks based on
port's switch ID and group_ifindex.  group_ifindex would typically be the
container dev's ifindex, such as the bridge's ifindex.

The generator keeps no separate table: it walks the group's lower devices
and reuses the mark of a port with the same switch ID, falling back to the
port's own ifindex.

Signed-off-by: Scott Feldman 
---
 include/net/switchdev.h   |9 
 net/switchdev/switchdev.c |  103 +
 2 files changed, 112 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index d5671f1..89da893 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -157,6 +157,9 @@ int switchdev_port_fdb_del(struct ndmsg *ndm, struct nlattr *tb[],
 int switchdev_port_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
struct net_device *dev,
struct net_device *filter_dev, int idx);
+void switchdev_port_fwd_mark_set(struct net_device *dev,
+struct net_device *group_dev,
+bool joining);
 
 #else
 
@@ -271,6 +274,12 @@ static inline int switchdev_port_fdb_dump(struct sk_buff *skb,
return -EOPNOTSUPP;
 }
 
+static inline void switchdev_port_fwd_mark_set(struct net_device *dev,
+  struct net_device *group_dev,
+  bool joining)
+{
+}
+
 #endif
 
 #endif /* _LINUX_SWITCHDEV_H_ */
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 4e5bba5..33bafa2 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -1039,3 +1039,106 @@ void switchdev_fib_ipv4_abort(struct fib_info *fi)
fi->fib_net->ipv4.fib_offload_disabled = true;
 }
 EXPORT_SYMBOL_GPL(switchdev_fib_ipv4_abort);
+
+static bool switchdev_port_same_parent_id(struct net_device *a,
+ struct net_device *b)
+{
+   struct switchdev_attr a_attr = {
+   .id = SWITCHDEV_ATTR_PORT_PARENT_ID,
+   .flags = SWITCHDEV_F_NO_RECURSE,
+   };
+   struct switchdev_attr b_attr = {
+   .id = SWITCHDEV_ATTR_PORT_PARENT_ID,
+   .flags = SWITCHDEV_F_NO_RECURSE,
+   };
+
+   if (switchdev_port_attr_get(a, &a_attr) ||
+   switchdev_port_attr_get(b, &b_attr))
+   return false;
+
+   return netdev_phys_item_id_same(&a_attr.u.ppid, &b_attr.u.ppid);
+}
+
+static u32 switchdev_port_fwd_mark_get(struct net_device *dev,
+  struct net_device *group_dev)
+{
+   struct net_device *lower_dev;
+   struct list_head *iter;
+
+   netdev_for_each_lower_dev(group_dev, lower_dev, iter) {
+   if (lower_dev == dev)
+   continue;
+   if (switchdev_port_same_parent_id(dev, lower_dev))
+   return lower_dev->offload_fwd_mark;
+   return switchdev_port_fwd_mark_get(dev, lower_dev);
+   }
+
+   return dev->ifindex;
+}
+
+static void switchdev_port_fwd_mark_reset(struct net_device *group_dev,
+ u32 old_mark, u32 *reset_mark)
+{
+   struct net_device *lower_dev;
+   struct list_head *iter;
+
+   netdev_for_each_lower_dev(group_dev, lower_dev, iter) {
+   if (lower_dev->offload_fwd_mark == old_mark) {
+   if (!*reset_mark)
+   *reset_mark = lower_dev->ifindex;
+   lower_dev->offload_fwd_mark = *reset_mark;
+   }
+   switchdev_port_fwd_mark_reset(lower_dev, old_mark, reset_mark);
+   }
+}
+
+/**
+ * switchdev_port_fwd_mark_set - Set port offload forwarding mark
+ *
+ * @dev: port device
+ * @group_dev: containing device
+ * @joining: true if dev is joining group; false if leaving group
+ *
+ * An ungrouped port's offload mark is just its ifindex.  A grouped
+ * port's (member of a bridge, for example) offload mark is the ifindex
+ * of one of the ports in the group with the same parent (switch) ID.
+ * Ports on the same device in the same group will have the same mark.
+ *
+ * Example:
+ *
+ * br0 ifindex=9
+ *   sw1p1 ifindex=2   mark=2
+ *   sw1p2 ifindex=3   mark=2
+ *   sw2p1 ifindex=4   mark=5
+ *   sw2p2 ifindex=5   mark=5
+ *
+ * If sw2p2 leaves the bridge, we'll have:
+ *
+ * br0 ifindex=9
+ *   sw1p1 ifindex=2   mark=2
+ *   sw1p2 ifindex=3   mark=2
+ *   sw2p1 ifindex=4   mark=4
+ * sw2p2   ifindex=5

[PATCH net-next v2 5/5] switchdev: update documentation for offload_fwd_mark

2015-07-16 Thread sfeldma

From: Scott Feldman 

Signed-off-by: Scott Feldman 
---
 Documentation/networking/switchdev.txt |   14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
index c5d7ade..9825f32 100644
--- a/Documentation/networking/switchdev.txt
+++ b/Documentation/networking/switchdev.txt
@@ -279,8 +279,18 @@ and unknown unicast packets to all ports in domain, if allowed by port's
 current STP state.  The switch driver, knowing which ports are within which
 vlan L2 domain, can program the switch device for flooding.  The packet should
 also be sent to the port netdev for processing by the bridge driver.  The
-bridge should not reflood the packet to the same ports the device flooded.
-XXX: the mechanism to avoid duplicate flood packets is being discuseed.
+bridge should not reflood the packet to the same ports the device flooded,
+otherwise there will be duplicate packets on the wire.
+
+To avoid duplicate packets, the device/driver should mark a packet as already
+forwarded using skb->offload_fwd_mark.  The same mark is set on the device
+ports in the domain using dev->offload_fwd_mark.  If the skb->offload_fwd_mark
+is non-zero and matches the forwarding egress port's dev->offload_fwd_mark, the kernel
+will drop the skb right before transmit on the egress port, with the
+understanding that the device already forwarded the packet on same egress port.
+The driver can use switchdev_port_fwd_mark_set() to set a globally unique mark
+for port's dev->offload_fwd_mark, based on the port's parent ID (switch ID) and
+a group ifindex.
 
 It is possible for the switch device to not handle flooding and push the
 packets up to the bridge driver for flooding.  This is not ideal as the number
-- 
1.7.10.4


[PATCH net-next v2 1/5] net: don't reforward packets already forwarded by offload device

2015-07-16 Thread sfeldma

From: Scott Feldman 

Just before queuing skb for xmit on port, check if skb has been marked by
switchdev port driver as already forwarded by device.  If so, drop skb.  A
non-zero skb->offload_fwd_mark field is set by the switchdev port
driver/device on ingress to indicate the skb has already been forwarded by
the device to egress ports with matching dev->offload_fwd_mark.  The
switchdev port driver would assign a non-zero dev->offload_fwd_mark for each
device port netdev during registration, for example.

Signed-off-by: Scott Feldman 
---
 include/linux/netdevice.h |6 ++
 include/linux/skbuff.h|   11 ++-
 net/core/dev.c|   10 ++
 3 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 45cfd79..8364f29 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1456,6 +1456,8 @@ enum netdev_priv_flags {
  *
  * @xps_maps:  XXX: need comments on this one
  *
+ * @offload_fwd_mark:  Offload device fwding mark
+ *
  * @trans_start:   Time (in jiffies) of last Tx
  * @watchdog_timeo:Represents the timeout that is used by
  * the watchdog ( see dev_watchdog() )
@@ -1697,6 +1699,10 @@ struct net_device {
struct xps_dev_maps __rcu *xps_maps;
 #endif
 
+#ifdef CONFIG_NET_SWITCHDEV
+   u32 offload_fwd_mark;
+#endif
+
/* These may be needed for future network-power-down code. */
 
/*
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d6cdd6e..2edcf50 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -506,6 +506,7 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
  * @no_fcs:  Request NIC to treat last 4 bytes as Ethernet FCS
   *@napi_id: id of the NAPI struct this skb came from
  * @secmark: security marking
+ * @offload_fwd_mark: fwding offload mark
  * @mark: Generic packet mark
  * @vlan_proto: vlan encapsulation protocol
  * @vlan_tci: vlan tag control information
@@ -650,9 +651,17 @@ struct sk_buff {
unsigned intsender_cpu;
};
 #endif
+   union {
 #ifdef CONFIG_NETWORK_SECMARK
-   __u32   secmark;
+   __u32   secmark;
+#endif
+#ifdef CONFIG_NET_SWITCHDEV
+   __u32   offload_fwd_mark;
 #endif
+   };
+
+   union {};
+
union {
__u32   mark;
__u32   reserved_tailroom;
diff --git a/net/core/dev.c b/net/core/dev.c
index 8810b6b..2ee15af 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3061,6 +3061,16 @@ static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
else
skb_dst_force(skb);
 
+#ifdef CONFIG_NET_SWITCHDEV
+   /* Don't forward if offload device already forwarded */
+   if (skb->offload_fwd_mark &&
+   skb->offload_fwd_mark == dev->offload_fwd_mark) {
+   consume_skb(skb);
+   rc = NET_XMIT_SUCCESS;
+   goto out;
+   }
+#endif
+
txq = netdev_pick_tx(dev, skb, accel_priv);
q = rcu_dereference_bh(txq->qdisc);
 
-- 
1.7.10.4


[PATCH net-next v3] ipv6: sysctl to restrict candidate source addresses

2015-07-16 Thread Erik Kline

Per RFC 6724, section 4, "Candidate Source Addresses":

It is RECOMMENDED that the candidate source addresses be the set
of unicast addresses assigned to the interface that will be used
to send to the destination (the "outgoing" interface).

Add a sysctl to enable this behaviour.

Signed-off-by: Erik Kline 
---
 Documentation/networking/ip-sysctl.txt |  7 +++
 include/linux/ipv6.h   |  1 +
 include/uapi/linux/ipv6.h  |  1 +
 net/ipv6/addrconf.c| 22 +++---
 4 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index f63aeef..27007c5 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1460,6 +1460,13 @@ router_solicitations - INTEGER
routers are present.
Default: 3
 
+use_oif_addr - BOOLEAN
+   When enabled, the candidate source addresses for destinations
+   routed via this interface are restricted to the set of addresses
+   configured on this interface (vis. RFC 6724, section 4).
+
+   Default: false
+
 use_tempaddr - INTEGER
Preference for Privacy Extensions (RFC3041).
  <= 0 : disable Privacy Extensions
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 1319a6b..190b22b 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -57,6 +57,7 @@ struct ipv6_devconf {
bool initialized;
struct in6_addr secret;
} stable_secret;
+   __s32   use_oif_addr;
void*sysctl;
 };
 
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index 5efa54a..cf9d65a 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -171,6 +171,7 @@ enum {
DEVCONF_USE_OPTIMISTIC,
DEVCONF_ACCEPT_RA_MTU,
DEVCONF_STABLE_SECRET,
+   DEVCONF_USE_OIF_ADDR,
DEVCONF_MAX
 };
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4c9a024..a7c49bb 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -211,7 +211,8 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = {
.accept_ra_mtu  = 1,
.stable_secret  = {
.initialized = false,
-   }
+   },
+   .use_oif_addr   = 0,
 };
 
 static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
@@ -253,6 +254,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
.stable_secret  = {
.initialized = false,
},
+   .use_oif_addr   = 0,
 };
 
 /* Check if a valid qdisc is available */
@@ -1474,11 +1476,16 @@ int ipv6_dev_get_saddr(struct net *net, const struct net_device *dst_dev,
 *include addresses assigned to interfaces
 *belonging to the same site as the outgoing
 *interface.)
+*  - "It is RECOMMENDED that the candidate source addresses
+*be the set of unicast addresses assigned to the
+*interface that will be used to send to the destination
+*(the 'outgoing' interface)." (RFC 6724)
 */
if (dst_dev) {
+   idev = __in6_dev_get(dst_dev);
if ((dst_type & IPV6_ADDR_MULTICAST) ||
-   dst.scope <= IPV6_ADDR_SCOPE_LINKLOCAL) {
-   idev = __in6_dev_get(dst_dev);
+   dst.scope <= IPV6_ADDR_SCOPE_LINKLOCAL ||
+   (idev && idev->cnf.use_oif_addr)) {
use_oif_addr = true;
}
}
@@ -4609,6 +4616,7 @@ static inline void ipv6_store_devconf(struct ipv6_devconf *cnf,
array[DEVCONF_ACCEPT_RA_FROM_LOCAL] = cnf->accept_ra_from_local;
array[DEVCONF_ACCEPT_RA_MTU] = cnf->accept_ra_mtu;
/* we omit DEVCONF_STABLE_SECRET for now */
+   array[DEVCONF_USE_OIF_ADDR] = cnf->use_oif_addr;
 }
 
 static inline size_t inet6_ifla6_size(void)
@@ -5608,6 +5616,14 @@ static struct addrconf_sysctl_table
.proc_handler   = addrconf_sysctl_stable_secret,
},
{
+   .procname   = "use_oif_addr",
+   .data   = &ipv6_devconf.use_oif_addr,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+
+   },
+   {
/* sentinel */
}
},
-- 
2.4.3.573.g4eafbef


Re: [PATCH net-next] rocker: forward packets to CPU when port is joined to openvswitch

2015-07-16 Thread Jiri Pirko

Thu, Jul 16, 2015 at 09:09:39AM CEST, sfel...@gmail.com wrote:
>On Wed, Jul 15, 2015 at 11:58 PM, Jiri Pirko  wrote:
>> Thu, Jul 16, 2015 at 08:40:31AM CEST, sfel...@gmail.com wrote:
>>>On Wed, Jul 15, 2015 at 6:39 PM, Simon Horman
>>> wrote:
>>>> Teach rocker to forward packets to CPU when a port is joined to Open
>>>> vSwitch.
>>>> There is scope to later refine what is passed up as per Open vSwitch flows
>>>> on a port.
>>>>
>>>> This does not change the behaviour of rocker ports that are
>>>> not joined to Open vSwitch.
>>>>
>>>> Signed-off-by: Simon Horman
>>>
>>>Acked-by: Scott Feldman 
>>>
>>>Now, OVS flows on a port.  Strangely enough, that was the first RFC
>>>implementation for switchdev/rocker where we hooked into ovs-kernel
>>>module and programmed flows into hw.  We pulled all of that code
>>>because, IIRC, the ovs folks didn't want us hooking into the kernel
>>>module directly.  We dropped the ovs hooks and focused on hooking
>>>kernel's L2/L3.  The device (rocker) didn't really change: OF-DPA
>>>pipeline was used for both.  Might be interesting to try hooking it
>>>again.
>>
>>
>> I think that now we have an infrastructure prepared for that. I mean,
>> what we need to do is to introduce another generic switchdev object
>> called "ntupleflow" and hook-up again into ovs datapath and cls_flower
>> and insert/remove the object from those codes. Should be pretty easy to do.
>
>That sounds right.  Is the ovs datapath hooking still happening in the
>ovs-kernel module?  Remind me again, what was the objection the last
>time we tried that?

Yep, we need to hook there. Otherwise it won't be transparent.

Last time the objection was that this would be ovs specific. But that no
longer applies today. We have switchdev infra with objects, we have cls_flower
which would use the same object. I say let's do this now.


Re: [PATCH net-next v2 5/5] switchdev: update documentation for offload_fwd_mark

2015-07-16 Thread Jiri Pirko

Thu, Jul 16, 2015 at 10:04:57AM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 

Re: [PATCH net-next v2 2/5] net: add phys ID compare helper to test if two IDs are the same

2015-07-16 Thread Jiri Pirko

Thu, Jul 16, 2015 at 10:04:54AM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 

Re: [PATCH net-next v2 3/5] switchdev: add offload_fwd_mark generator helper

2015-07-16 Thread Jiri Pirko

Thu, Jul 16, 2015 at 10:04:55AM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>skb->offload_fwd_mark and dev->offload_fwd_mark are 32-bit and should be
>unique for device and may even be unique for a sub-set of ports within
>device, so add switchdev helper function to generate unique marks based on
>port's switch ID and group_ifindex.  group_ifindex would typically be the
>container dev's ifindex, such as the bridge's ifindex.
>
>The generator keeps no separate table: it walks the group's lower devices
>and reuses the mark of a port with the same switch ID, falling back to the
>port's own ifindex.
>
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 
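
Editorial sketch (not part of the patch): the generator described above hands out one mark per {switch ID, group_ifindex} key. The user-space model below only illustrates that mapping. The name fwd_mark_get, the size limits and the linear table (standing in for the global hash table) are all assumptions made for illustration, not the switchdev API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_ID_LEN 32	/* assumed upper bound on switch ID length */
#define MAX_MARKS  64	/* illustrative table capacity */

/* One entry per {switch ID, group_ifindex} key (hypothetical layout). */
struct mark_entry {
	unsigned char id[MAX_ID_LEN];
	unsigned char id_len;
	int group_ifindex;
	uint32_t mark;
};

static struct mark_entry mark_table[MAX_MARKS];
static uint32_t mark_count;

/* Return the mark for the key, allocating a new one on first use.
 * Returns 0 ("no mark") if the ID is too long or the table is full.
 */
uint32_t fwd_mark_get(const unsigned char *id, unsigned char id_len,
		      int group_ifindex)
{
	struct mark_entry *e;
	uint32_t i;

	assert(id != NULL);
	if (id_len > MAX_ID_LEN)
		return 0;

	/* Linear search keeps the sketch short; a hash lookup would go here. */
	for (i = 0; i < mark_count; i++) {
		e = &mark_table[i];
		if (e->id_len == id_len &&
		    e->group_ifindex == group_ifindex &&
		    memcmp(e->id, id, id_len) == 0)
			return e->mark;
	}

	if (mark_count == MAX_MARKS)
		return 0;

	e = &mark_table[mark_count++];
	memcpy(e->id, id, id_len);
	e->id_len = id_len;
	e->group_ifindex = group_ifindex;
	e->mark = mark_count;	/* 1-based, so a valid mark is never zero */
	return e->mark;
}
```

Two ports of the same switch under the same bridge would pass the same key and so share a mark, while a different bridge or a different switch yields a different one.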

Re: [PATCH net-next v2 4/5] rocker: add offload_fwd_mark support

2015-07-16 Thread Jiri Pirko

Thu, Jul 16, 2015 at 10:04:56AM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>If device flags ingress packet as "fwd offload", mark the
>skb->offload_fwd_mark using the ingress port's dev->offload_fwd_mark.  This
>will be the hint to the kernel that this packet has already been forwarded
>by device to egress ports matching skb->offload_fwd_mark.
>
>For rocker, derive port dev->offload_fwd_mark based on device switch ID and
>port ifindex.  If port is bridged, use the bridge ifindex rather than the
>port ifindex.
>
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 

Re: [PATCH net-next v2 1/5] net: don't reforward packets already forwarded by offload device

2015-07-16 Thread Jiri Pirko

Thu, Jul 16, 2015 at 10:04:53AM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>Just before queuing skb for xmit on port, check if skb has been marked by
>switchdev port driver as already forwarded by device.  If so, drop skb.  A
>non-zero skb->offload_fwd_mark field is set by the switchdev port
>driver/device on ingress to indicate the skb has already been forwarded by
>the device to egress ports with matching dev->offload_fwd_mark.  The
>switchdev port driver would assign a non-zero dev->offload_fwd_mark for
>each device port netdev during registration, for example.
>
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 

Re: [PATCH net-next v2 0/5] switchdev: avoid duplicate packet forwarding

2015-07-16 Thread Jiri Pirko


..
>
>Scott Feldman (5):
>  net: don't reforward packets already forwarded by offload device
>  net: add phys ID compare helper to test if two IDs are the same
>  switchdev: add offload_fwd_mark generator helper
>  rocker: add offload_fwd_mark support
>  switchdev: update documentation for offload_fwd_mark
>
> Documentation/networking/switchdev.txt |   14 +++-
> drivers/net/ethernet/rocker/rocker.c   |   11 
> drivers/net/ethernet/rocker/rocker.h   |    1 +
> include/linux/netdevice.h              |   13 
> include/linux/skbuff.h                 |   11 +++-
> include/net/switchdev.h                |    9 +++
> net/core/dev.c                         |   10 +++
> net/switchdev/switchdev.c              |  111 ++--
> 8 files changed, 171 insertions(+), 9 deletions(-)
>
>-- 
>1.7.10.4

Thanks for taking care of this Scott!


[PATCH 1/2] can: mcp251x: fix resume when device is down

2015-07-16 Thread Marc Kleine-Budde

From: Stefan Agner 

If a valid power regulator or a dummy regulator is used (which
happens to be the case when no regulator is specified), restart_work
is queued no matter whether the device was running or not at suspend
time. Since work queues get initialized in the ndo_open callback,
resuming leads to a NULL pointer exception.

Reverse exactly the steps executed at suspend time:
- Enable the power regulator in any case
- Enable the transceiver regulator if the device was running, even in
  case we have a power regulator
- Queue restart_work only in case the device was running

Fixes: bf66f3736a94 ("can: mcp251x: Move to threaded interrupts instead of workqueues.")
Signed-off-by: Stefan Agner 
Cc: linux-stable 
Signed-off-by: Marc Kleine-Budde 
---
 drivers/net/can/spi/mcp251x.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/net/can/spi/mcp251x.c b/drivers/net/can/spi/mcp251x.c
index c1a95a34d62e..3b2f34e6ba9b 100644
--- a/drivers/net/can/spi/mcp251x.c
+++ b/drivers/net/can/spi/mcp251x.c
@@ -1222,17 +1222,16 @@ static int __maybe_unused mcp251x_can_resume(struct device *dev)
 	struct spi_device *spi = to_spi_device(dev);
 	struct mcp251x_priv *priv = spi_get_drvdata(spi);
 
-	if (priv->after_suspend & AFTER_SUSPEND_POWER) {
+	if (priv->after_suspend & AFTER_SUSPEND_POWER)
 		mcp251x_power_enable(priv->power, 1);
+
+	if (priv->after_suspend & AFTER_SUSPEND_UP) {
+		mcp251x_power_enable(priv->transceiver, 1);
 		queue_work(priv->wq, &priv->restart_work);
 	} else {
-		if (priv->after_suspend & AFTER_SUSPEND_UP) {
-			mcp251x_power_enable(priv->transceiver, 1);
-			queue_work(priv->wq, &priv->restart_work);
-		} else {
-			priv->after_suspend = 0;
-		}
+		priv->after_suspend = 0;
 	}
+
 	priv->force_quit = 0;
 	enable_irq(spi->irq);
 	return 0;
-- 
2.1.4
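
Editorial sketch (not part of the patch): the three resume rules listed in the commit message can be checked with a small user-space model. The flag values, the resume_plan struct and plan_resume() are assumptions made for illustration; the real driver acts on regulators and a workqueue rather than returning a plan.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; only the two flags used by the patch. */
#define AFTER_SUSPEND_UP	1
#define AFTER_SUSPEND_POWER	4

/* What the resume path would do for a given after_suspend value. */
struct resume_plan {
	bool power_on;		/* enable the power regulator */
	bool transceiver_on;	/* enable the transceiver regulator */
	bool restart_queued;	/* queue restart_work */
	int after_suspend;	/* value left in priv->after_suspend */
};

struct resume_plan plan_resume(int after_suspend)
{
	struct resume_plan p = { false, false, false, after_suspend };

	assert((after_suspend &
		~(AFTER_SUSPEND_UP | AFTER_SUSPEND_POWER)) == 0);

	/* Power regulator is handled independently of the device state. */
	if (after_suspend & AFTER_SUSPEND_POWER)
		p.power_on = true;

	/* Transceiver and restart only if the device was running. */
	if (after_suspend & AFTER_SUSPEND_UP) {
		p.transceiver_on = true;
		p.restart_queued = true;
	} else {
		p.after_suspend = 0;
	}
	return p;
}
```

With only AFTER_SUSPEND_POWER set (device was down at suspend time), the model powers the chip but queues no work, which is the case that previously hit the uninitialized workqueue.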


pull-request: can 2015-07-16

2015-07-16 Thread Marc Kleine-Budde

Hello David,

this is a pull request of 2 patches by Stefan Agner. He fixes the resume
operation in the mcp251x driver.

regards,
Marc

---

The following changes since commit 15afb10df4a3f1bd781373ffd968e70cc4b21a42:

  Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth (2015-07-15 21:59:23 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can.git tags/linux-can-fixes-for-4.2-20150716

for you to fetch changes up to 69da3f2ac528642acbd06ed14564ac1b9a918394:

  can: mcp251x: get regulators optionally (2015-07-16 09:04:22 +0200)


linux-can-fixes-for-4.2-20150716


Stefan Agner (2):
  can: mcp251x: fix resume when device is down
  can: mcp251x: get regulators optionally

 drivers/net/can/spi/mcp251x.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)


[PATCH] bonding: correct the MAC address for "follow" fail_over_mac policy

2015-07-16 Thread Ding Tianhong

The "follow" fail_over_mac policy is useful for multiport devices that
either become confused or incur a performance penalty when multiple
ports are programmed with the same MAC address. However, two slaves can
still end up with the same MAC address under this policy through the
following steps:

1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
   bond0 takes the MAC address of eth0, which is MAC1.

2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
   eth1 is the backup slave and has MAC2.

3) ifconfig eth0 down
   eth1 becomes the active slave and the bond swaps the MACs of eth0
   and eth1, so eth1 has MAC1 and eth0 has MAC2.

4) ifconfig eth1 down
   There is no active slave; eth1 still has MAC1 and eth0 has MAC2.

5) ifconfig eth0 up
   eth0 becomes the active slave again and the bond sets eth0 to MAC1.

At this point something is wrong: if eth1 is then brought up, eth0 and
eth1 have the same MAC address, which breaks this policy for
ACTIVE_BACKUP mode.

This patch fixes the problem by finding the old active slave and
swapping the MAC addresses before changing the active slave.

Signed-off-by: Ding Tianhong 
---
 drivers/net/bonding/bond_main.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 317a494..efdb6a4 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -625,6 +625,23 @@ static void bond_set_dev_addr(struct net_device *bond_dev,
 	call_netdevice_notifiers(NETDEV_CHANGEADDR, bond_dev);
 }
 
+static struct slave *bond_get_old_active(struct bonding *bond,
+					 struct slave *new_active)
+{
+	struct slave *slave;
+	struct list_head *iter;
+
+	bond_for_each_slave(bond, slave, iter) {
+		if (slave == new_active)
+			continue;
+
+		if (ether_addr_equal(bond->dev->dev_addr, slave->dev->dev_addr))
+			return slave;
+	}
+
+	return NULL;
+}
+
 /* bond_do_fail_over_mac
  *
  * Perform special MAC address swapping for fail_over_mac settings
@@ -652,6 +669,9 @@ static void bond_do_fail_over_mac(struct bonding *bond,
 		if (!new_active)
 			return;
 
+		if (!old_active)
+			old_active = bond_get_old_active(bond, new_active);
+
 		if (old_active) {
 			ether_addr_copy(tmp_mac, new_active->dev->dev_addr);
 			ether_addr_copy(saddr.sa_data,
-- 
1.8.0
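
Editorial sketch (not part of the patch): the lookup that bond_get_old_active() performs can be modelled in user space as "find a slave, other than the new active one, whose current MAC equals the bond's MAC". The types and the name find_old_active are hypothetical stand-ins, not the bonding driver's structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAC_LEN 6

/* Minimal stand-in for a bonding slave (hypothetical). */
struct slave {
	unsigned char mac[MAC_LEN];
};

/* Return the slave (other than new_active) that currently carries the
 * bond's MAC address, or NULL if there is none.
 */
const struct slave *find_old_active(const unsigned char *bond_mac,
				    const struct slave *slaves, size_t n,
				    const struct slave *new_active)
{
	size_t i;

	assert(bond_mac != NULL && slaves != NULL);

	for (i = 0; i < n; i++) {
		if (&slaves[i] == new_active)
			continue;
		if (memcmp(bond_mac, slaves[i].mac, MAC_LEN) == 0)
			return &slaves[i];
	}
	return NULL;
}
```

In the scenario from the commit message, after step 4 the bond and eth1 both hold MAC1, so a lookup with eth0 as the new active slave returns eth1 and the swap can proceed.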



[PATCH 2/2] can: mcp251x: get regulators optionally

2015-07-16 Thread Marc Kleine-Budde

From: Stefan Agner 

The regulators power and transceiver are optional. If those are not
present, the pointer (or error pointer) is correctly handled by the
driver, hence we can use devm_regulator_get_optional safely, which
avoids regulators getting created.

Signed-off-by: Stefan Agner 
Signed-off-by: Marc Kleine-Budde 
---
 drivers/net/can/spi/mcp251x.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/can/spi/mcp251x.c b/drivers/net/can/spi/mcp251x.c
index 3b2f34e6ba9b..b7e83c212023 100644
--- a/drivers/net/can/spi/mcp251x.c
+++ b/drivers/net/can/spi/mcp251x.c
@@ -1086,8 +1086,8 @@ static int mcp251x_can_probe(struct spi_device *spi)
 	if (ret)
 		goto out_clk;
 
-	priv->power = devm_regulator_get(&spi->dev, "vdd");
-	priv->transceiver = devm_regulator_get(&spi->dev, "xceiver");
+	priv->power = devm_regulator_get_optional(&spi->dev, "vdd");
+	priv->transceiver = devm_regulator_get_optional(&spi->dev, "xceiver");
 	if ((PTR_ERR(priv->power) == -EPROBE_DEFER) ||
 	    (PTR_ERR(priv->transceiver) == -EPROBE_DEFER)) {
 		ret = -EPROBE_DEFER;
-- 
2.1.4


[PATCH] inet: frags: fix defragmented packet's IP header for af_packet

2015-07-16 Thread Edward Jee

When ip_frag_queue() computes positions, it assumes that the passed
sk_buff does not contain L2 headers. However, when
PACKET_FANOUT_FLAG_DEFRAG is used, IP defragmentation functions can be
called on outgoing packets that contain L2 headers. Also, IPv4
checksum is not corrected after defragmentation.

Fixes: 7736d33f4262 ("packet: Add pre-defragmentation support for ipv4 fanouts.")
Signed-off-by: Edward Hyunkoo Jee 
Acked-by: Eric Dumazet 
Cc: Willem de Bruijn 
Cc: Jerry Chu 
---
 net/ipv4/ip_fragment.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index a50dc6d..44c8f2a 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -351,7 +351,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	ihl = ip_hdrlen(skb);
 
 	/* Determine the position of this fragment. */
-	end = offset + skb->len - ihl;
+	end = offset + skb->len - skb_network_offset(skb) - ihl;
 	err = -EINVAL;
 
 	/* Is this the final fragment? */
@@ -381,7 +381,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		goto err;
 
 	err = -ENOMEM;
-	if (!pskb_pull(skb, ihl))
+	if (!pskb_pull(skb, skb_network_offset(skb) + ihl))
 		goto err;
 
 	err = pskb_trim_rcsum(skb, end - offset);
@@ -641,6 +641,8 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		iph->frag_off = 0;
 	}
 
+	ip_send_check(ip_hdr(skb));
+
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
 	qp->q.fragments_tail = NULL;
-- 
2.4.3.573.g4eafbef
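
Editorial sketch (not part of the patch): two pieces of arithmetic sit behind this fix. The first is the fragment end position once an L2 header may precede the IP header; the second is the standard IPv4 header checksum (one's complement sum of 16-bit words) that has to be recomputed after reassembly. The function names below are hypothetical user-space helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* End offset of this fragment's payload within the original datagram.
 * network_offset is the number of bytes (e.g. an L2 header) that
 * precede the IP header in the buffer; it is 0 on the normal receive
 * path.
 */
unsigned int frag_end(unsigned int offset, unsigned int buf_len,
		      unsigned int network_offset, unsigned int ihl)
{
	assert(buf_len >= network_offset + ihl);
	return offset + buf_len - network_offset - ihl;
}

/* IPv4 header checksum over ihl_bytes of header, treating the checksum
 * field (bytes 10 and 11) as zero. Returned in host order.
 */
uint16_t ipv4_header_checksum(const uint8_t *hdr, size_t ihl_bytes)
{
	uint32_t sum = 0;
	size_t i;

	assert(hdr != NULL && ihl_bytes >= 20 && ihl_bytes % 2 == 0);

	for (i = 0; i < ihl_bytes; i += 2) {
		if (i == 10)
			continue;	/* skip the checksum field itself */
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
	}
	/* Fold carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

For a buffer holding a 14-byte Ethernet header, a 20-byte IP header and 8 bytes of payload, the old formula (without the network offset term) would give an end of 22 instead of 8.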


Re: [RFC net-next 22/22] openvswitch: Use regular GRE net_device instead of vport

2015-07-16 Thread Simon Horman

On Fri, Jul 10, 2015 at 04:19:24PM +0200, Thomas Graf wrote:
> From: Pravin Shelar 
> 
> Removes all of the OVS specific GRE code and makes OVS use a
> GRE net_device.
> 
> Signed-off-by: Pravin B Shelar 

[snip]

> @@ -115,6 +117,8 @@ static bool log_ecn_error = true;
>  module_param(log_ecn_error, bool, 0644);
>  MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
>  
> +#define GRE_TAP_FB_NAME "gretap0"
> +
>  static struct rtnl_link_ops ipgre_link_ops __read_mostly;
>  static int ipgre_tunnel_init(struct net_device *dev);
>  

[snip]

> @@ -690,12 +836,27 @@ static const struct net_device_ops gre_tap_netdev_ops = {
>   .ndo_get_iflink = ip_tunnel_get_iflink,
>  };
>  
> +static const struct net_device_ops gre_fb_netdev_ops = {
> + .ndo_init   = gre_tap_init,
> + .ndo_uninit = ip_tunnel_uninit,
> + .ndo_start_xmit = gre_fb_xmit,
> + .ndo_set_mac_address= eth_mac_addr,
> + .ndo_validate_addr  = eth_validate_addr,
> + .ndo_change_mtu = ip_tunnel_change_mtu,
> + .ndo_get_stats64= ip_tunnel_get_stats64,
> + .ndo_get_iflink = ip_tunnel_get_iflink,
> +};
> +
>  static void ipgre_tap_setup(struct net_device *dev)
>  {
>   ether_setup(dev);
> - dev->netdev_ops = &gre_tap_netdev_ops;
>   dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
>   ip_tunnel_setup(dev, gre_tap_net_id);
> +
> + if (!strcmp(dev->name, GRE_TAP_FB_NAME))
> + dev->netdev_ops = &gre_fb_netdev_ops;
> + else
> + dev->netdev_ops = &gre_tap_netdev_ops;
>  }
>  
>  static int ipgre_newlink(struct net *src_net, struct net_device *dev,

[snip]

Is there a side-effect of the above: if a user creates a gretap device
whose name is "gretap0", will the device use gre_fb_netdev_ops instead
of gre_tap_netdev_ops? If so, does that imply a change in behaviour for
gretap devices created with that name?
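
Editorial sketch (not part of the thread): the concern can be restated as "the ops table is chosen purely by device name". The enum and select_ops() below are hypothetical; they only mirror the strcmp() test in the quoted hunk and say nothing about whether a user can actually reach that path.

```c
#include <assert.h>
#include <string.h>

#define GRE_TAP_FB_NAME "gretap0"

/* Which ops table a device would get (hypothetical labels). */
enum ops_kind { OPS_TAP, OPS_FB };

/* Mirror of the name test in the quoted ipgre_tap_setup() hunk. */
enum ops_kind select_ops(const char *dev_name)
{
	assert(dev_name != NULL);
	return strcmp(dev_name, GRE_TAP_FB_NAME) == 0 ? OPS_FB : OPS_TAP;
}
```

Under this model any device named "gretap0" gets the fallback ops regardless of who created it, which is the behaviour change being asked about.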

Re: [PATCH net v4] rtnl/bond: don't send rtnl msg for unregistered iface

2015-07-16 Thread Nicolas Dichtel


On 13/07/2015 16:11, Kristian Evensen wrote:
> Hello,
>
> I have a quick question about this patch.
>
> On Wed, May 13, 2015 at 2:19 PM, Nicolas Dichtel wrote:
>> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>> index 837d30b5ffed..7b25f1ef3d75 100644
>> --- a/net/core/rtnetlink.c
>> +++ b/net/core/rtnetlink.c
>> @@ -2415,6 +2415,9 @@ void rtmsg_ifinfo(int type, struct net_device *dev, unsigned int change,
>>  {
>>  	struct sk_buff *skb;
>>
>> +	if (dev->reg_state != NETREG_REGISTERED)
>> +		return;
>> +
>
> Is this check correct, or placed at the correct location? The reason I
> am asking is as follows. In rollback_registered_many(), dev->reg_state
> is set to NETREG_UNREGISTERING for devices that will be unregistered.
> When rtmsg_ifinfo_build_skb(RTM_DELLINK, ...) is called in the
> following loop in rollback_registered_many, this comparison will
> always be true and no DELLINK event generated.
>
> This change led to some applications I have not behaving as expected
> due to missing DELLINK when network devices are removed. I also see no
> DELLINK with ip mon link. Removing the check restores the old behavior
> (DELLINK events are generated). My machine is running 3.18.18, which
> includes this fix.

Ok, I see the problem. My patch depends on commit
395eea6ccf2b ("rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()")
which was introduced in 3.19 (and has not been backported to 3.18).

After 3.19, rtmsg_ifinfo_build_skb() just builds a message (an skb), it never
calls rtmsg_ifinfo(). After this call, rollback_registered_many() calls
rtmsg_ifinfo_send() (which also never calls rtmsg_ifinfo()).

I will submit a patch for 3.18.


Regards,
Nicolas

Re: [PATCH] bonding: correct the MAC address for "follow" fail_over_mac policy

2015-07-16 Thread Nikolay Aleksandrov

On 07/16/2015 10:30 AM, Ding Tianhong wrote:
> The "follow" fail_over_mac policy is useful for multiport devices that
> either become confused or incur a performance penalty when multiple
> ports are programmed with the same MAC address. However, two slaves can
> still end up with the same MAC address under this policy through the
> following steps:
> 
> 1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
>    bond0 takes the MAC address of eth0, which is MAC1.
> 
> 2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
>    eth1 is the backup slave and has MAC2.
> 
> 3) ifconfig eth0 down
>    eth1 becomes the active slave and the bond swaps the MACs of eth0
>    and eth1, so eth1 has MAC1 and eth0 has MAC2.
> 
> 4) ifconfig eth1 down
>    There is no active slave; eth1 still has MAC1 and eth0 has MAC2.
> 
> 5) ifconfig eth0 up
>    eth0 becomes the active slave again and the bond sets eth0 to MAC1.
> 
> At this point something is wrong: if eth1 is then brought up, eth0 and
> eth1 have the same MAC address, which breaks this policy for
> ACTIVE_BACKUP mode.
> 
> This patch fixes the problem by finding the old active slave and
> swapping the MAC addresses before changing the active slave.
> 
> Signed-off-by: Ding Tianhong 
> ---
>  drivers/net/bonding/bond_main.c | 20 
>  1 file changed, 20 insertions(+)
> 

This doesn't seem to be true:
~# cat /sys/class/net/bond0/bonding/fail_over_mac
follow 2
root@debian:~# ip l sh eth1
3: eth1:  mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
root@debian:~# ip l sh eth2
4: eth2:  mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
root@debian:~# ip l sh bond0
26: bond0:  mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

*eth1 is the first and active slave and bond0 has taken its mac.
Now trying your steps:
Step 3) (bringing down the active eth1)
root@debian:~# ip l set eth1 down
root@debian:~# ip l sh bond0
26: bond0:  mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
root@debian:~# ip l sh eth1
3: eth1:  mtu 1500 qdisc pfifo_fast master bond0 state DOWN mode DEFAULT group default qlen 1000
    link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
root@debian:~# ip l sh eth2
4: eth2:  mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

*The mac addresses of eth1 and eth2 are correctly swapped, so far so good.

Step 4) (bringing down the active eth2)
root@debian:~# ip l set eth2 down
3: eth1:  mtu 1500 qdisc pfifo_fast master bond0 state DOWN mode DEFAULT group default qlen 1000
    link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
4: eth2:  mtu 1500 qdisc pfifo_fast master bond0 state DOWN mode DEFAULT group default qlen 1000
    link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
26: bond0:  mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

*eth2 has kept the mac address of the bond and they're both down now

Step 5) (bring eth1 up again and observe the macs)
~# ip l set eth1 up
3: eth1:  mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
4: eth2:  mtu 1500 qdisc pfifo_fast master bond0 state DOWN mode DEFAULT group default qlen 1000
    link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
26: bond0:  mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

*The macs are correctly swapped and there's no such bug.

Step 6(?) bring eth2 up
~# ip l set eth2 up
3: eth1:  mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
4: eth2:  mtu 1500 qdisc pfifo_fast master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
26: bond0:  mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

*Still correct.

Also, the mac address that gets set is dev_addr, which is changed when
the swapping is done. If you'd like to get the original mac address
you should be using slave->perm_hwaddr.

Cheers,
 Nik

[PATCH linux-3.18.y] rtnl: restore notifications for deleted interfaces

2015-07-16 Thread Nicolas Dichtel

Commit 984ff7a3e060 is a backport of an upstream commit. However, it depends on
commit 395eea6ccf2b ("rtnetlink: delay RTM_DELLINK notification until after ndo_uninit()")
which has not been backported to 3.18.y.

Before commit 395eea6ccf2b, rollback_registered_many() uses rtmsg_ifinfo().
The call to this function is done with dev->reg_state set to
NETREG_UNREGISTERING, thus testing this reg_state in rtmsg_ifinfo() is
wrong.

This patch partially reverts commit 984ff7a3e060.

Fixes: 984ff7a3e060 ("rtnl/bond: don't send rtnl msg for unregistered iface")
Reported-by: Kristian Evensen 
Signed-off-by: Nicolas Dichtel 
---
 net/core/rtnetlink.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 24d3242f0e01..c522f7a00eab 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2229,9 +2229,6 @@ void rtmsg_ifinfo(int type, struct net_device *dev, unsigned int change,
 	int err = -ENOBUFS;
 	size_t if_info_size;
 
-	if (dev->reg_state != NETREG_REGISTERED)
-		return;
-
 	skb = nlmsg_new((if_info_size = if_nlmsg_size(dev, 0)), flags);
 	if (skb == NULL)
 		goto errout;
-- 
2.4.2
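
Editorial sketch (not part of the patch): the regression can be reproduced on paper with a tiny model. On 3.18.y the notification for a device being removed goes through rtmsg_ifinfo() while the device is already in NETREG_UNREGISTERING, so a "registered only" guard swallows RTM_DELLINK. The enum is a reduced stand-in for the kernel's states and notification_sent() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced model of the device registration states. */
enum reg_state {
	NETREG_UNINITIALIZED,
	NETREG_REGISTERED,
	NETREG_UNREGISTERING,
};

/* Would rtmsg_ifinfo() emit a message for a device in this state?
 * has_guard models the presence of the reg_state check that this
 * patch removes from 3.18.y.
 */
bool notification_sent(enum reg_state state, bool has_guard)
{
	assert(state <= NETREG_UNREGISTERING);
	if (has_guard && state != NETREG_REGISTERED)
		return false;
	return true;
}
```

With the guard in place, the UNREGISTERING case returns false, which matches the missing DELLINK events reported by Kristian Evensen; without it, the event is sent again.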


Re: [PATCH net-next v2 0/5] switchdev: avoid duplicate packet forwarding

2015-07-16 Thread Simon Horman

On Thu, Jul 16, 2015 at 01:04:52AM -0700, sfel...@gmail.com wrote:
> From: Scott Feldman 

[snip]

> Scott Feldman (5):
>   net: don't reforward packets already forwarded by offload device
>   net: add phys ID compare helper to test if two IDs are the same
>   switchdev: add offload_fwd_mark generator helper
>   rocker: add offload_fwd_mark support
>   switchdev: update documentation for offload_fwd_mark

[snip]

All patches:
Reviewed-by: Simon Horman 


[PATCH net-next]r8169: set bits on Register Interrupt status on limit

2015-07-16 Thread Corcodel Marian


 Limit the bits written to the interrupt status register to those
 given by the chip configuration (critical).
 Not all bits are in use on every chip and some are reserved; this patch
 addresses that.

 Committer: Corcodel Marian 
 Changes to be committed:
modified:   drivers/net/ethernet/realtek/r8169.c

Signed-off-by: Corcodel Marian 
---
 drivers/net/ethernet/realtek/r8169.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 744f90a..714af89 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -8181,6 +8181,9 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	else
 		tp->rx_buf_sz = 16383;
 
+	tp->event_slow = cfg->event_slow;
+
+
 	rtl_init_rxcfg(tp);
 
 	rtl_irq_disable(tp);
@@ -8189,7 +8192,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	rtl_hw_reset(tp);
 
-	rtl_ack_events(tp, 0x);
+	rtl_ack_events(tp, 0x & tp->event_slow);
 
 	pci_set_master(pdev);
 
@@ -8325,7 +8328,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	dev->hw_features |= NETIF_F_RXFCS;
 
 	tp->hw_start = cfg->hw_start;
-	tp->event_slow = cfg->event_slow;
+	//tp->event_slow = cfg->event_slow;
 
 	tp->opts1_mask = (tp->mac_version != RTL_GIGA_MAC_VER_01) ?
 		~(RxBOVF | RxFOVF) : ~0;

Re: [virtio-dev] [PATCH] virtio_net: don't require ANY_LAYOUT with VERSION_1

2015-07-16 Thread Stefan Hajnoczi

On Wed, Jul 15, 2015 at 03:26:19PM +0300, Michael S. Tsirkin wrote:
> ANY_LAYOUT is a compatibility feature. It's implied
> for VERSION_1 devices, and non-transitional devices
> might not offer it. Change code to behave accordingly.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/net/virtio_net.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 



[PATCH v7 1/3] if_link: Add control trust VF

2015-07-16 Thread Hiroshi Shimamoto

From: Hiroshi Shimamoto 

Add netlink directives and an ndo entry to trust a VF user.

This controls a special permission of the VF user.
The administrator explicitly trusts a VF user to use some features
which impact security and/or performance.

The administrator should never turn it on unless the VF user is fully trusted.

Signed-off-by: Hiroshi Shimamoto 
CC: Choi, Sy Jong 
---
 include/linux/if_link.h  |  1 +
 include/linux/netdevice.h|  3 +++
 include/uapi/linux/if_link.h |  6 ++
 net/core/rtnetlink.c | 24 +---
 4 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index ae5d0d2..f923d15 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -24,5 +24,6 @@ struct ifla_vf_info {
__u32 min_tx_rate;
__u32 max_tx_rate;
__u32 rss_query_en;
+   __u32 trusted;
 };
 #endif /* _LINUX_IF_LINK_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e20979d..a034fb8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -873,6 +873,7 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  * int (*ndo_set_vf_rate)(struct net_device *dev, int vf, int min_tx_rate,
  *   int max_tx_rate);
  * int (*ndo_set_vf_spoofchk)(struct net_device *dev, int vf, bool setting);
+ * int (*ndo_set_vf_trust)(struct net_device *dev, int vf, bool setting);
  * int (*ndo_get_vf_config)(struct net_device *dev,
  * int vf, struct ifla_vf_info *ivf);
  * int (*ndo_set_vf_link_state)(struct net_device *dev, int vf, int link_state);
@@ -1095,6 +1096,8 @@ struct net_device_ops {
   int max_tx_rate);
int (*ndo_set_vf_spoofchk)(struct net_device *dev,
   int vf, bool setting);
+   int (*ndo_set_vf_trust)(struct net_device *dev,
+   int vf, bool setting);
int (*ndo_get_vf_config)(struct net_device *dev,
 int vf,
 struct ifla_vf_info *ivf);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 2c7e8e3..891050c 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -485,6 +485,7 @@ enum {
 * on/off switch
 */
IFLA_VF_STATS,  /* network device statistics */
+   IFLA_VF_TRUST,  /* Trust VF */
__IFLA_VF_MAX,
 };
 
@@ -546,6 +547,11 @@ enum {
 
 #define IFLA_VF_STATS_MAX (__IFLA_VF_STATS_MAX - 1)
 
+struct ifla_vf_trust {
+   __u32 vf;
+   __u32 setting;
+};
+
 /* VF ports management section
  *
  * Nested layout of set/get msg is:
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 9e433d5..803b80c 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -831,7 +831,8 @@ static inline int rtnl_vfinfo_size(const struct net_device *dev,
 /* IFLA_VF_STATS_BROADCAST */
 nla_total_size(sizeof(__u64)) +
 /* IFLA_VF_STATS_MULTICAST */
-nla_total_size(sizeof(__u64)));
+nla_total_size(sizeof(__u64)) +
+nla_total_size(sizeof(struct ifla_vf_trust)));
return size;
} else
return 0;
@@ -1151,6 +1152,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
struct ifla_vf_link_state vf_linkstate;
struct ifla_vf_rss_query_en vf_rss_query_en;
struct ifla_vf_stats vf_stats;
+   struct ifla_vf_trust vf_trust;
 
/*
 * Not all SR-IOV capable drivers support the
@@ -1160,6 +1162,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 */
ivi.spoofchk = -1;
ivi.rss_query_en = -1;
+   ivi.trusted = -1;
memset(ivi.mac, 0, sizeof(ivi.mac));
/* The default value for VF link state is "auto"
 * IFLA_VF_LINK_STATE_AUTO which equals zero
@@ -1173,7 +1176,8 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
vf_tx_rate.vf =
vf_spoofchk.vf =
vf_linkstate.vf =
-   vf_rss_query_en.vf = ivi.vf;
+   vf_rss_query_en.vf =
+   vf_trust.vf = ivi.vf;
 
memcpy(vf_mac.mac, ivi.mac, sizeof(ivi.mac));

[PATCH v7 2/3] ixgbe: Add new ndo to trust VF

2015-07-16 Thread Hiroshi Shimamoto

From: Hiroshi Shimamoto 

Implements the new netdev op to trust VF in ixgbe.

The administrator can turn VF trust on and off with an ip command that
supports the trust message:
 # ip link set dev eth0 vf 1 trust on
or
 # ip link set dev eth0 vf 1 trust off

Send a ping to reset the VF when its trust status changes.
The VF driver will reconfigure its features on reset.

Signed-off-by: Hiroshi Shimamoto 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h   |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 37 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h |  1 +
 4 files changed, 40 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index edf1fb9..fb72622 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -152,6 +152,7 @@ struct vf_data_storage {
u16 vlan_count;
u8 spoofchk_enabled;
bool rss_query_enabled;
+   u8 trusted;
unsigned int vf_api;
 };
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 935fce7..b26b64e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8365,6 +8365,7 @@ static const struct net_device_ops ixgbe_netdev_ops = {
.ndo_set_vf_rate= ixgbe_ndo_set_vf_bw,
.ndo_set_vf_spoofchk= ixgbe_ndo_set_vf_spoofchk,
.ndo_set_vf_rss_query_en = ixgbe_ndo_set_vf_rss_query_en,
+   .ndo_set_vf_trust   = ixgbe_ndo_set_vf_trust,
.ndo_get_vf_config  = ixgbe_ndo_get_vf_config,
.ndo_get_stats64= ixgbe_get_stats64,
 #ifdef CONFIG_IXGBE_DCB
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 1d17b58..65aeb58 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -116,6 +116,9 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter)
 * we want to disable the querying by default.
 */
adapter->vfinfo[i].rss_query_enabled = 0;
+
+   /* Untrust all VFs */
+   adapter->vfinfo[i].trusted = false;
}
 
return 0;
@@ -1124,6 +1127,17 @@ void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter)
IXGBE_WRITE_REG(hw, IXGBE_VFRE(1), 0);
 }
 
+static inline void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vf)
+{
+   struct ixgbe_hw *hw = &adapter->hw;
+   u32 ping;
+
+   ping = IXGBE_PF_CONTROL_MSG;
+   if (adapter->vfinfo[vf].clear_to_send)
+   ping |= IXGBE_VT_MSGTYPE_CTS;
+   ixgbe_write_mbx(hw, &ping, 1, vf);
+}
+
 void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
 {
struct ixgbe_hw *hw = &adapter->hw;
@@ -1416,6 +1430,28 @@ int ixgbe_ndo_set_vf_rss_query_en(struct net_device *netdev, int vf,
return 0;
 }
 
+int ixgbe_ndo_set_vf_trust(struct net_device *netdev, int vf, bool setting)
+{
+   struct ixgbe_adapter *adapter = netdev_priv(netdev);
+
+   if (vf >= adapter->num_vfs)
+   return -EINVAL;
+
+   /* nothing to do */
+   if (adapter->vfinfo[vf].trusted == setting)
+   return 0;
+
+   adapter->vfinfo[vf].trusted = setting;
+
+   /* reset VF to reconfigure features */
+   adapter->vfinfo[vf].clear_to_send = false;
+   ixgbe_ping_vf(adapter, vf);
+
+   e_info(drv, "VF %u is %strusted\n", vf, setting ? "" : "not ");
+
+   return 0;
+}
+
 int ixgbe_ndo_get_vf_config(struct net_device *netdev,
int vf, struct ifla_vf_info *ivi)
 {
@@ -1430,5 +1466,6 @@ int ixgbe_ndo_get_vf_config(struct net_device *netdev,
ivi->qos = adapter->vfinfo[vf].pf_qos;
ivi->spoofchk = adapter->vfinfo[vf].spoofchk_enabled;
ivi->rss_query_en = adapter->vfinfo[vf].rss_query_enabled;
+   ivi->trusted = adapter->vfinfo[vf].trusted;
return 0;
 }
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
index 2c197e6..dad9257 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
@@ -49,6 +49,7 @@ int ixgbe_ndo_set_vf_bw(struct net_device *netdev, int vf, int min_tx_rate,
 int ixgbe_ndo_set_vf_spoofchk(struct net_device *netdev, int vf, bool setting);
 int ixgbe_ndo_set_vf_rss_query_en(struct net_device *netdev, int vf,
  bool setting);
+int ixgbe_ndo_set_vf_trust(struct net_device *netdev, int vf, bool setting);
 int ixgbe_ndo_get_vf_config(struct net_device *netdev,
int vf, struct ifla_vf_info *ivi);
 void ixgbe_check_vf_rate_limit(struct ixgbe_adapter *adapter);
-- 
1.8.3.1


[PATCH v7 3/3] ixgbe, ixgbevf: Add new mbox API xcast mode

2015-07-16 Thread Hiroshi Shimamoto

From: Hiroshi Shimamoto 

The limit on the number of multicast addresses per VF is too low
for large-scale servers that use SR-IOV. IPv6 requires a multicast
MAC address for each IP address in order to handle Neighbor Solicitation
messages, so we could not assign more than 30 IPv6 addresses to a single VF.

This patch introduces a new mailbox API, IXGBE_VF_UPDATE_XCAST_MODE,
to update the multicast mode of a VF. It adds 3 modes:
  - NONE     only L2 exact match addresses or Flow Director enabled
  - MULTI    BAM and ROMPE set
  - ALLMULTI BAM, ROMPE and MPE set

If a guest VF user wants more than 30 multicast MAC addresses, setting
IFF_ALLMULTI requests that the PF update the xcast mode to enable VF
multicast promiscuous mode.

On the other hand, enabling VF multicast promiscuous mode may affect
security and performance in the network of the NIC, so only a trusted VF
can enable it. The behavior of an untrusted VF is the same as in the
previous version.

Signed-off-by: Hiroshi Shimamoto 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |  7 +++
 drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h  |  2 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c| 59 +++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  6 +++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  8 +++
 drivers/net/ethernet/intel/ixgbevf/mbx.h  |  2 +
 drivers/net/ethernet/intel/ixgbevf/vf.c   | 41 
 drivers/net/ethernet/intel/ixgbevf/vf.h   |  1 +
 8 files changed, 126 insertions(+)
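
For readers following along, the mode handling described above can be
sketched as a small userspace model. This is only an illustration under
stated assumptions: the SKETCH_VMOLR_* bit positions are placeholders (not
the real IXGBE_VMOLR_* register values), the function names are
hypothetical, and the register is a plain variable rather than hardware.
The clear-then-set form should be equivalent to the disable/enable masks
used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit positions for illustration; NOT the real IXGBE_VMOLR_* values. */
#define SKETCH_VMOLR_ROMPE (1u << 0)	/* accept multicast matching the MTA table */
#define SKETCH_VMOLR_BAM   (1u << 1)	/* accept broadcast */
#define SKETCH_VMOLR_MPE   (1u << 2)	/* multicast promiscuous */

enum sketch_xcast_mode {
	SKETCH_XCAST_MODE_NONE = 0,
	SKETCH_XCAST_MODE_MULTI,
	SKETCH_XCAST_MODE_ALLMULTI,
};

/* An untrusted VF that asks for more than MULTI is clamped to MULTI. */
static int sketch_clamp_mode(int requested, bool trusted)
{
	if (requested > SKETCH_XCAST_MODE_MULTI && !trusted)
		return SKETCH_XCAST_MODE_MULTI;
	return requested;
}

/* Update only the three xcast bits of *vmolr; return -1 for an unknown mode. */
static int sketch_apply_mode(uint32_t *vmolr, int mode)
{
	const uint32_t all = SKETCH_VMOLR_BAM | SKETCH_VMOLR_ROMPE |
			     SKETCH_VMOLR_MPE;
	uint32_t enable;

	switch (mode) {
	case SKETCH_XCAST_MODE_NONE:
		enable = 0;
		break;
	case SKETCH_XCAST_MODE_MULTI:
		enable = SKETCH_VMOLR_BAM | SKETCH_VMOLR_ROMPE;
		break;
	case SKETCH_XCAST_MODE_ALLMULTI:
		enable = all;
		break;
	default:
		return -1;
	}

	/* Clear every xcast bit, then set the ones this mode needs. */
	*vmolr = (*vmolr & ~all) | enable;
	return 0;
}
```

In this model an untrusted VF requesting ALLMULTI ends up with only the BAM
and ROMPE bits set, which matches the "same as previous version" behavior
stated in the description.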

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index fb72622..17250ef 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -153,9 +153,16 @@ struct vf_data_storage {
u8 spoofchk_enabled;
bool rss_query_enabled;
u8 trusted;
+   int xcast_mode;
unsigned int vf_api;
 };
 
+enum ixgbevf_xcast_modes {
+   IXGBEVF_XCAST_MODE_NONE = 0,
+   IXGBEVF_XCAST_MODE_MULTI,
+   IXGBEVF_XCAST_MODE_ALLMULTI,
+};
+
 struct vf_macvlans {
struct list_head l;
int vf;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
index b1e4703..8daa95f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
@@ -102,6 +102,8 @@ enum ixgbe_pfvf_api_rev {
 #define IXGBE_VF_GET_RETA  0x0a/* VF request for RETA */
 #define IXGBE_VF_GET_RSS_KEY   0x0b/* get RSS key */
 
+#define IXGBE_VF_UPDATE_XCAST_MODE 0x0c
+
 /* length of permanent address message returned from PF */
 #define IXGBE_VF_PERMADDR_MSG_LEN 4
 /* word in permanent address message with the current multicast type */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 65aeb58..ac071e5 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -119,6 +119,9 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter)
 
/* Untrust all VFs */
adapter->vfinfo[i].trusted = false;
+
+   /* set the default xcast mode */
+   adapter->vfinfo[i].xcast_mode = IXGBEVF_XCAST_MODE_NONE;
}
 
return 0;
@@ -1004,6 +1007,59 @@ static int ixgbe_get_vf_rss_key(struct ixgbe_adapter *adapter,
return 0;
 }
 
+static int ixgbe_update_vf_xcast_mode(struct ixgbe_adapter *adapter,
+ u32 *msgbuf, u32 vf)
+{
+   struct ixgbe_hw *hw = &adapter->hw;
+   int xcast_mode = msgbuf[1];
+   u32 vmolr, disable, enable;
+
+   /* verify the PF is supporting the correct APIs */
+   switch (adapter->vfinfo[vf].vf_api) {
+   case ixgbe_mbox_api_12:
+   break;
+   default:
+   return -1;
+   }
+
+   if (xcast_mode > IXGBEVF_XCAST_MODE_MULTI &&
+   !adapter->vfinfo[vf].trusted) {
+   xcast_mode = IXGBEVF_XCAST_MODE_MULTI;
+   }
+
+   if (adapter->vfinfo[vf].xcast_mode == xcast_mode)
+   goto out;
+
+   switch (xcast_mode) {
+   case IXGBEVF_XCAST_MODE_NONE:
+   disable = IXGBE_VMOLR_BAM | IXGBE_VMOLR_ROMPE | IXGBE_VMOLR_MPE;
+   enable = 0;
+   break;
+   case IXGBEVF_XCAST_MODE_MULTI:
+   disable = IXGBE_VMOLR_MPE;
+   enable = IXGBE_VMOLR_BAM | IXGBE_VMOLR_ROMPE;
+   break;
+   case IXGBEVF_XCAST_MODE_ALLMULTI:
+   disable = 0;
+   enable = IXGBE_VMOLR_BAM | IXGBE_VMOLR_ROMPE | IXGBE_VMOLR_MPE;
+   break;
+   default:
+   return -1;
+   }
+
+   vmolr = IXGBE_READ_REG(hw, IXGBE_VMOLR(vf));
+   vmolr &= ~disable;
+   vmolr |= enable;
+   IXGBE_WRITE_REG(hw, IXGBE_VMOLR(vf), vmolr);
+
+   a

Re: [PATCH -next 0/6] Per network namespace netfilter chains

2015-07-16 Thread Pablo Neira Ayuso

On Wed, Jul 15, 2015 at 03:05:00PM -0500, Eric W. Biederman wrote:
> Pablo Neira Ayuso  writes:
[...]
> >> There are lots of other possible and desirable cleanups but this one is
> >> a core change needed to make the other changes independent small
> >> changes.
> >
> > The state->net field will kill that dev_net(...) ? x : y; all over the
> > code, that would be nice.
> 
> Yes it will.  I intend to do that after this patchset settles so I am
> not dealing with more than one issue at a time.  Otherwise there
> is too much work at once.

Sure, thanks.

> > Some comments on your patchset:
> >
> > * 1/6 netfilter: nf_queue: Don't recompute the hook_list head
> >
> > I already passed this to current nf as you insisted on getting this,
> > and for the sake of correctness, so it's basically already in David's
> > net tree.
> 
> I would have expected this patch to be somewhere.  I did not see
> this change in net-next when I wrote the patchset (which seemed
> like a good approximation of the latest thing available).  If I
> overlooked and the patch has already made it to Dave then my apologies
> for being redundant.
> 
> I still don't see this patch in your pending branch.
> 
> Am I missing something?

It doesn't show yet in net-next, but it's already in net:

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=f307170d6e591a48529425b1ed6ca835790995a9

David periodically pulls net into net-next, so it will show there too
at some point.

Re: sit: Set SKB_GSO_SIT bit when performing GRO

2015-07-16 Thread Wolfgang Walter

On Thursday, 16 July 2015, 08:23:50, Herbert Xu wrote:
> On Wed, Jul 15, 2015 at 02:25:59PM +0200, Wolfgang Walter wrote:
> > Yes. Switching TSO off and leaving GRO on works, too.
> 
> OK, could you please try this patch?

Patch works here.

Thanks,

Wolfgang

> 
> ---8<---
> We need to set the SKB_GSO_SIT bit if we detect a 6-in-4 tunnel
> when doing GRO.  Otherwise we may throw a packet at TSO hardware
> that doesn't know what to do with it.
> 
> Fixes: 19424e052fb4 ("sit: Add gro callbacks to sit_offload")
> Reported-by: Wolfgang Walter 
> Signed-off-by: Herbert Xu 
> 
> diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
> index e893cd1..1252eac 100644
> --- a/net/ipv6/ip6_offload.c
> +++ b/net/ipv6/ip6_offload.c
> @@ -289,11 +289,21 @@ static struct packet_offload ipv6_packet_offload
> __read_mostly = { },
>  };
> 
> +static int sit_gro_complete(struct sk_buff *skb, int nhoff)
> +{
> + int err = ipv6_gro_complete(skb, nhoff);
> +
> + skb->encapsulation = 1;
> + skb_shinfo(skb)->gso_type |= SKB_GSO_SIT;
> +
> + return err;
> +}
> +
>  static const struct net_offload sit_offload = {
>   .callbacks = {
>   .gso_segment= ipv6_gso_segment,
>   .gro_receive= ipv6_gro_receive,
> - .gro_complete   = ipv6_gro_complete,
> + .gro_complete   = sit_gro_complete,
>   },
>  };
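
The effect of the quoted patch can be sketched with a small userspace
model. This is an illustration only: struct sketch_pkt, the SKETCH_GSO_*
flag values and the feature check are hypothetical stand-ins, not the
kernel's sk_buff, SKB_GSO_* or netdev feature code. The assumption being
modeled is that a device may only be handed a packet if it advertises
every GSO type bit the packet carries.

```c
#include <assert.h>

/* Placeholder flag values; NOT the kernel's SKB_GSO_* constants. */
#define SKETCH_GSO_TCPV6 (1u << 0)
#define SKETCH_GSO_SIT   (1u << 1)

struct sketch_pkt {
	unsigned int gso_type;
	int encapsulation;
};

/* Stand-in for the inner IPv6 completion step: marks the packet as TCPv6. */
static int sketch_ipv6_gro_complete(struct sketch_pkt *p)
{
	p->gso_type |= SKETCH_GSO_TCPV6;
	return 0;
}

/* Mirrors the shape of sit_gro_complete(): run the inner step, then tag the tunnel. */
static int sketch_sit_gro_complete(struct sketch_pkt *p)
{
	int err = sketch_ipv6_gro_complete(p);

	p->encapsulation = 1;
	p->gso_type |= SKETCH_GSO_SIT;
	return err;
}

/* A device can segment the packet only if it supports all of its GSO types. */
static int sketch_device_can_segment(const struct sketch_pkt *p,
				     unsigned int dev_features)
{
	return (p->gso_type & ~dev_features) == 0;
}
```

Without the SIT bit, a device that only supports plain TCPv6 segmentation
would accept the merged packet in this model; with the bit set, the check
fails and the packet would have to take another path.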

-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

Re: [PATCH v2 0/2] Avoid link dependency of dlm on sctp module

2015-07-16 Thread Neil Horman

On Wed, Jul 15, 2015 at 01:59:18PM -0700, David Miller wrote:
> From: Neil Horman 
> Date: Wed, 15 Jul 2015 09:57:14 -0400
> 
> > Series
> > Acked-by: Neil Horman 
> 
> I don't like this at all.
> 
> I know it's a pain in the ass to have this dependency on SCTP, but
> calling exported functions is absolutely the right way to handle
> this kind of situation.

I have to disagree.  It's certainly true that a direct call from the kernel is
the right current way to handle a need for functionality in an external module,
but I think it's important to offer a mechanism that allows modules to not
load functionality that they don't need at run time (that is to say, if dlm
decides to use tcp rather than sctp, the sctp module shouldn't be loaded).

It's not a crisis in either case, but it sure would be nice to not load a module
just because a symbol reference exists.

Neil
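
As a rough userspace analogy of the trade-off discussed above (not the
mechanism used by the series, and not kernel code), a consumer can look up
a transport through a registry at run time instead of referencing the
transport's symbols directly. All names here (sketch_transport_ops,
sketch_register_transport, sketch_find_transport) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Operations a transport provides; the consumer only ever sees this table. */
struct sketch_transport_ops {
	const char *name;
	int (*connect)(int node);
};

#define SKETCH_MAX_TRANSPORTS 4
static const struct sketch_transport_ops *sketch_registry[SKETCH_MAX_TRANSPORTS];

/* Called by a transport when it is "loaded"; returns -1 if the table is full. */
static int sketch_register_transport(const struct sketch_transport_ops *ops)
{
	for (size_t i = 0; i < SKETCH_MAX_TRANSPORTS; i++) {
		if (!sketch_registry[i]) {
			sketch_registry[i] = ops;
			return 0;
		}
	}
	return -1;
}

/* Run-time lookup by name; NULL means that transport was never loaded. */
static const struct sketch_transport_ops *sketch_find_transport(const char *name)
{
	for (size_t i = 0; i < SKETCH_MAX_TRANSPORTS; i++) {
		if (sketch_registry[i] &&
		    strcmp(sketch_registry[i]->name, name) == 0)
			return sketch_registry[i];
	}
	return NULL;
}

/* A toy "tcp" transport; no "sctp" transport is defined or referenced at all. */
static int sketch_tcp_connect(int node)
{
	return 100 + node;
}

static const struct sketch_transport_ops sketch_tcp_ops = {
	.name = "tcp",
	.connect = sketch_tcp_connect,
};
```

Because the consumer never names an sctp function, nothing forces an sctp
provider to exist unless someone asks for it by name. The cost is the extra
indirection and a lookup that can fail at run time.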


RE: [V2 3/7] Drivers: hv: vmbus: add APIs to send/recv hvsock packet and get the r/w-ability

2015-07-16 Thread Dexuan Cui

> -Original Message-
> From: David Miller
> Sent: Thursday, July 16, 2015 12:16
> 
> From: Dexuan Cui
> Date: Tue, 14 Jul 2015 02:58:56 -0700
> 
> > +int vmbus_sendpacket_hvsock(struct vmbus_channel *channel, void *buf, u32 len)
> > +{
> > +   struct vmpacket_descriptor desc;
> > +   struct vmpipe_proto_header pipe_hdr;
> > +   u32 packetlen;
> > +   u32 packetlen_aligned;
> > +   struct kvec bufferlist[4];
> > +   u64 aligned_data = 0;
> > +   int ret;
> > +   bool signal = false;
> 
> Reverse christmas-tree (longest to shortest line) order these local
> variables, please.
OK.
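
For reference, the requested ordering can be checked mechanically. The
sketch below is a hypothetical userspace helper (not a kernel tool): it
treats each declaration as a string and verifies that line lengths never
increase. The first array is the order quoted above; the second is one
possible reordering of the same declarations.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if no declaration is longer than the one before it, else 0. */
static int sketch_is_reverse_xmas_tree(const char *const decls[], size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if (strlen(decls[i]) > strlen(decls[i - 1]))
			return 0;
	}
	return 1;
}

#define SKETCH_COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Order as posted in the patch. */
static const char *const sketch_original_order[] = {
	"struct vmpacket_descriptor desc;",
	"struct vmpipe_proto_header pipe_hdr;",
	"u32 packetlen;",
	"u32 packetlen_aligned;",
	"struct kvec bufferlist[4];",
	"u64 aligned_data = 0;",
	"int ret;",
	"bool signal = false;",
};

/* Same declarations, longest line first. */
static const char *const sketch_reordered[] = {
	"struct vmpipe_proto_header pipe_hdr;",
	"struct vmpacket_descriptor desc;",
	"struct kvec bufferlist[4];",
	"u32 packetlen_aligned;",
	"u64 aligned_data = 0;",
	"bool signal = false;",
	"u32 packetlen;",
	"int ret;",
};
```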

> 
> > +EXPORT_SYMBOL(vmbus_sendpacket_hvsock);
> 
> EXPORT_SYMBOL_GPL()
Oh, sorry. I'll fix it.
 
> > +int vmbus_recvpacket_hvsock(struct vmbus_channel *channel, void *buffer,
> > +   u32 bufferlen, u32 *buffer_actual_len)
> > +{
> > +   struct vmpacket_descriptor *desc;
> > +   struct vmpipe_proto_header *pipe_hdr;
> > +   u32 packet_len, payload_len;
> > +   int ret;
> > +   bool signal = false;
> 
> Again, please use reverse christmas-tree order.
OK.


> > +void vmbus_get_hvsock_rw_status(struct vmbus_channel *channel,
> > +  bool *can_read, bool *can_write)
> 
> Second line is not properly indented, it should start exactly one
> column after the opening parenthesis on the previous line.
OK.
I wasn't aware of this issue. Thanks for reminding me!
The patch did pass the scripts/checkpatch.pl check. :-)
I found that scripts/Lindent can detect this kind of issue.
I'll run scripts/Lindent against my code and fix all of them in V3.

> > +   hv_get_ringbuffer_availbytes(inring_info,
> > +   bytes_avail_toread,
> > +   bytes_avail_towrite);
> 
> Again, improperly indented.
OK. will fix it.

> > +extern int vmbus_sendpacket_hvsock(struct vmbus_channel *channel,
> > +   void *buf, u32 len);
> > +
> 
> Likewise.
OK. will fix it.

> > +extern int vmbus_recvpacket_hvsock(struct vmbus_channel *channel, void *buffer,
> > +   u32 bufferlen, u32 *buffer_actual_len);
> > +
> > +extern void vmbus_get_hvsock_rw_status(struct vmbus_channel *channel,
> > +  bool *can_read, bool *can_write);
> 
> Likewise.
OK. will fix it.

-- Dexuan

RE: [V2 1/7] Drivers: hv: vmbus: define the new offer type for Hyper-V socket (hvsock)

2015-07-16 Thread Dexuan Cui

> From: David Miller
> Sent: Thursday, July 16, 2015 12:13
> 
> From: Dexuan Cui
> Date: Tue, 14 Jul 2015 02:58:03 -0700
> 
> > A helper function is also added.
> >
> > diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> > @@ -236,6 +236,7 @@ struct vmbus_channel_offer {
> >  #define VMBUS_CHANNEL_LOOPBACK_OFFER   0x100
> >  #define VMBUS_CHANNEL_PARENT_OFFER 0x200
> >  #define VMBUS_CHANNEL_REQUEST_MONITORED_NOTIFICATION   0x400
> > +#define VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER 0x2000
> >
> >  struct vmpacket_descriptor {
> > u16 type;
> > @@ -758,6 +759,12 @@ struct vmbus_channel {
> > struct list_head percpu_list;
> >  };
> >
> > +static inline bool is_hvsock_channel(const struct vmbus_channel *c)
> > +{
> > +   return !!(c->offermsg.offer.chn_flags &
> > +   VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER);
> > +}
> > +
> 
> This is not indented properly, plus it makes no sense to add a flag before
> anyone even sets the flag.

Hi David,
Thanks for pointing out the indentation issue!  I'll fix it in V3.

The flag is set by the host: the c->offermsg is in the shared VMBus ringbuffer
between the host and the guest, so it makes sense for us to check the flag. :-)
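
To illustrate the check in isolation, here is a minimal userspace sketch.
The flag value is taken from the quoted patch; struct sketch_offer and
sketch_offer_is_hvsock() are hypothetical stand-ins for the real
vmbus_channel layout and helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag value as it appears in the quoted patch. */
#define VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER 0x2000

/* Hypothetical stand-in for the offer data the host fills in. */
struct sketch_offer {
	uint16_t chn_flags;
};

/* True when the host marked this offer as an hvsock channel. */
static inline bool sketch_offer_is_hvsock(const struct sketch_offer *o)
{
	return !!(o->chn_flags & VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER);
}
```

The guest only reads the flag here; in this model, as in the explanation
above, setting it is entirely up to the host side.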

-- Dexuan

RE: [V2 6/7] hvsock: introduce Hyper-V VM Sockets feature

2015-07-16 Thread Dexuan Cui

> From: David Miller
> Sent: Thursday, July 16, 2015 12:19
> 
> From: Dexuan Cui
> Date: Tue, 14 Jul 2015 03:00:48 -0700
> 
> > +   pr_debug("hvsock_sk_destruct: called\n");
> 
> Debug logging just to state that a function is called is not appropriate,
> we have very sophisticated tracing facilities in the kernel that can do
> that transparently, and more.
> 
> Please remove this.
OK. 
 
> > +   if (hvsk->channel) {
> > +   pr_debug("hvsock_sk_destruct: calling vmbus_close()\n");
> 
> Likewise, these kinds of debug logs are totally inappropriate.
OK, I'll remove all the pr_debug() in the patch.
 
> > +static int hvsock_release(struct socket *sock)
> > +{
> > +   /* sock->sk is NULL, if accept() is interrupted by a signal */
> > +   if (sock->sk) {
> > +   __hvsock_release(sock->sk);
> > +   sock->sk = NULL;
> > +   }
> > +
> > +   sock->state = SS_FREE;
> > +   pr_debug("hvsock_release called\n\n");
> 
> Likewise.
OK. Will fix it.

Thanks,
-- Dexuan

What queues/buffers does tc-netem use?

2015-07-16 Thread Motejlek, Petr

Hello,

I was wondering which queues/buffers netem uses, and how one can control or 
monitor them.

I could not find this information anywhere, and I am not good enough at reading 
the sources to work it out myself :) Considering only the case where netem is 
the root qdisc for a particular interface, I would imagine it uses the txqueue 
of that interface, but I am not sure whether that is really the case...

Thanks,
Petr MOTEJLEK



Re: [PATCH] bonding: correct the MAC address for "follow" fail_over_mac policy

2015-07-16 Thread Ding Tianhong

On 2015/7/16 17:24, Nikolay Aleksandrov wrote:
> On 07/16/2015 10:30 AM, Ding Tianhong wrote:
>> The "follow" fail_over_mac policy is useful for multiport devices that
>> either become confused or incur a performance penalty when multiple
>> ports are programmed with the same MAC address, but the same MAC
>> address still may happened by this steps for this policy:
>>
>> 1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
>>bond0 has the same mac address with eth0, it is MAC1.
>>
>> 2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
>>eth1 is backup, eth1 has MAC2.
>>
>> 3) ifconfig eth0 down
>>eth1 became active slave, bond will swap MAC for eth0 and eth1,
>>so eth1 has MAC1, and eth0 has MAC2.
>>
>> 4) ifconfig eth1 down
>>there is no active slave, and eth1 still has MAC1, eth0 has MAC2.
>>
>> 5) ifconfig eth0 up
>>the eth0 became active slave again, the bond set eth0 to MAC1.
>>
>> Something wrong here, then if you set eth1 up, the eth0 and eth1 will have 
>> the same
>> MAC address, it will break this policy for ACTIVE_BACKUP mode.
>>
>> This patch will fix this problem by finding the old active slave and
>> swap them MAC address before change active slave.
>>
>> Signed-off-by: Ding Tianhong 
>> ---
>>  drivers/net/bonding/bond_main.c | 20 
>>  1 file changed, 20 insertions(+)
>>
> 
> This doesn't seem to be true:
> ~# cat /sys/class/net/bond0/bonding/fail_over_mac 
> follow 2
> root@debian:~# ip l sh eth1
> 3: eth1:  mtu 1500 qdisc pfifo_fast 
> master bond0 state UP mode DEFAULT group default qlen 1000
> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
> root@debian:~# ip l sh eth2
> 4: eth2:  mtu 1500 qdisc pfifo_fast 
> master bond0 state UP mode DEFAULT group default qlen 1000
> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
> root@debian:~# ip l sh bond0
> 26: bond0:  mtu 1500 qdisc noqueue 
> state UP mode DEFAULT group default 
> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
> 
> *eth1 is the first and active slave and bond0 has taken its mac.
> Now trying your steps:
> Step 3) (bringing down the active eth1)
> root@debian:~# ip l set eth1 down
> root@debian:~# ip l sh bond0
> 26: bond0:  mtu 1500 qdisc noqueue 
> state UP mode DEFAULT group default 
> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
> root@debian:~# ip l sh eth1
> 3: eth1:  mtu 1500 qdisc pfifo_fast master bond0 
> state DOWN mode DEFAULT group default qlen 1000
> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
> root@debian:~# ip l sh eth2
> 4: eth2:  mtu 1500 qdisc pfifo_fast 
> master bond0 state UP mode DEFAULT group default qlen 1000
> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
> 
> *The mac addresses of eth1 and eth2 are correctly swapped, so far so good.
> 
> Step 4) (bringing down the active eth2)
> root@debian:~# ip l set eth2 down
> 3: eth1:  mtu 1500 qdisc pfifo_fast master bond0 
> state DOWN mode DEFAULT group default qlen 1000
> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
> 4: eth2:  mtu 1500 qdisc pfifo_fast master bond0 
> state DOWN mode DEFAULT group default qlen 1000
> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
> 26: bond0:  mtu 1500 qdisc noqueue 
> state DOWN mode DEFAULT group default 
> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
> 
> *eth2 has kept the mac address of the bond and they're both down now
> 
> Step 5) (bring eth1 up again and observe the macs)
> ~# ip l set eth1 up
> 3: eth1:  mtu 1500 qdisc pfifo_fast 
> master bond0 state UP mode DEFAULT group default qlen 1000
> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
> 4: eth2:  mtu 1500 qdisc pfifo_fast master bond0 
> state DOWN mode DEFAULT group default qlen 1000
> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
> 26: bond0:  mtu 1500 qdisc noqueue 
> state UP mode DEFAULT group default 
> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
> 
> *The macs are correctly swapped and there's no such bug. 
> 
> Step 6(?) bring eth2 up
> ~# ip l set eth2 up
> 3: eth1:  mtu 1500 qdisc pfifo_fast 
> master bond0 state UP mode DEFAULT group default qlen 1000
> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
> 4: eth2:  mtu 1500 qdisc pfifo_fast 
> master bond0 state UP mode DEFAULT group default qlen 1000
> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
> 26: bond0:  mtu 1500 qdisc noqueue 
> state UP mode DEFAULT group default 
> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
> 
> *Still correct.
> 
> Also the mac address that gets set is dev_addr which is changed when
> the swapping is done, if you'd like to get the original mac address
> you should be using slave->perm_hwaddr.
> 

Hi Nik:

Which kernel version are you using? I tested this on kernels 3.19.8 and 
4.2-rc2, and the problem exists on both versions.
Maybe I am missing something?
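
To make the sequence I described easier to discuss, here is a toy userspace
model of it. It is only a sketch of the steps in the patch description, not
the bonding driver's code, and whether the driver really behaves this way
is exactly the open question. The assumed rule is: on failover the new
active slave swaps MACs with the old active slave if there is one,
otherwise it takes the bond's MAC. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_slave {
	int mac;	/* small integer standing in for a MAC address */
};

struct sketch_bond {
	int mac;
	struct sketch_slave *active;	/* NULL when no slave is active */
};

/* Assumed "follow" rule: swap with the old active slave, else take the bond MAC. */
static void sketch_change_active(struct sketch_bond *bond,
				 struct sketch_slave *new_active)
{
	struct sketch_slave *old_active = bond->active;

	if (new_active) {
		if (old_active) {
			int tmp = new_active->mac;

			new_active->mac = old_active->mac;
			old_active->mac = tmp;
		} else {
			new_active->mac = bond->mac;
		}
	}
	bond->active = new_active;
}
```

Running steps 3 to 5 through this model (eth0 down, eth1 down, eth0 up)
leaves both slaves with MAC1, because nothing swaps eth1 back once there is
no old active slave.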

Ding


> Cheers,
>  Nik
> 
> .
> 



Re: [PATCH] bonding: correct the MAC address for "follow" fail_over_mac policy

2015-07-16 Thread Nikolay Aleksandrov

On 07/16/2015 01:48 PM, Ding Tianhong wrote:
> On 2015/7/16 17:24, Nikolay Aleksandrov wrote:
>> On 07/16/2015 10:30 AM, Ding Tianhong wrote:
>>> The "follow" fail_over_mac policy is useful for multiport devices that
>>> either become confused or incur a performance penalty when multiple
>>> ports are programmed with the same MAC address, but the same MAC
>>> address still may happened by this steps for this policy:
>>>
>>> 1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
>>>bond0 has the same mac address with eth0, it is MAC1.
>>>
>>> 2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
>>>eth1 is backup, eth1 has MAC2.
>>>
>>> 3) ifconfig eth0 down
>>>eth1 became active slave, bond will swap MAC for eth0 and eth1,
>>>so eth1 has MAC1, and eth0 has MAC2.
>>>
>>> 4) ifconfig eth1 down
>>>there is no active slave, and eth1 still has MAC1, eth0 has MAC2.
>>>
>>> 5) ifconfig eth0 up
>>>the eth0 became active slave again, the bond set eth0 to MAC1.
>>>
>>> Something wrong here, then if you set eth1 up, the eth0 and eth1 will have 
>>> the same
>>> MAC address, it will break this policy for ACTIVE_BACKUP mode.
>>>
>>> This patch will fix this problem by finding the old active slave and
>>> swap them MAC address before change active slave.
>>>
>>> Signed-off-by: Ding Tianhong 
>>> ---
>>>  drivers/net/bonding/bond_main.c | 20 
>>>  1 file changed, 20 insertions(+)
>>>
>>
>> This doesn't seem to be true:
>> ~# cat /sys/class/net/bond0/bonding/fail_over_mac 
>> follow 2
>> root@debian:~# ip l sh eth1
>> 3: eth1:  mtu 1500 qdisc pfifo_fast 
>> master bond0 state UP mode DEFAULT group default qlen 1000
>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>> root@debian:~# ip l sh eth2
>> 4: eth2:  mtu 1500 qdisc pfifo_fast 
>> master bond0 state UP mode DEFAULT group default qlen 1000
>> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
>> root@debian:~# ip l sh bond0
>> 26: bond0:  mtu 1500 qdisc noqueue 
>> state UP mode DEFAULT group default 
>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>
>> *eth1 is the first and active slave and bond0 has taken its mac.
>> Now trying your steps:
>> Step 3) (bringing down the active eth1)
>> root@debian:~# ip l set eth1 down
>> root@debian:~# ip l sh bond0
>> 26: bond0:  mtu 1500 qdisc noqueue 
>> state UP mode DEFAULT group default 
>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>> root@debian:~# ip l sh eth1
>> 3: eth1:  mtu 1500 qdisc pfifo_fast master bond0 
>> state DOWN mode DEFAULT group default qlen 1000
>> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
>> root@debian:~# ip l sh eth2
>> 4: eth2:  mtu 1500 qdisc pfifo_fast 
>> master bond0 state UP mode DEFAULT group default qlen 1000
>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>
>> *The mac addresses of eth1 and eth2 are correctly swapped, so far so good.
>>
>> Step 4) (bringing down the active eth2)
>> root@debian:~# ip l set eth2 down
>> 3: eth1:  mtu 1500 qdisc pfifo_fast master bond0 
>> state DOWN mode DEFAULT group default qlen 1000
>> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
>> 4: eth2:  mtu 1500 qdisc pfifo_fast master bond0 
>> state DOWN mode DEFAULT group default qlen 1000
>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>> 26: bond0:  mtu 1500 qdisc noqueue 
>> state DOWN mode DEFAULT group default 
>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>
>> *eth2 has kept the mac address of the bond and they're both down now
>>
>> Step 5) (bring eth1 up again and observe the macs)
>> ~# ip l set eth1 up
>> 3: eth1:  mtu 1500 qdisc pfifo_fast 
>> master bond0 state UP mode DEFAULT group default qlen 1000
>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>> 4: eth2:  mtu 1500 qdisc pfifo_fast master bond0 
>> state DOWN mode DEFAULT group default qlen 1000
>> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
>> 26: bond0:  mtu 1500 qdisc noqueue 
>> state UP mode DEFAULT group default 
>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>
>> *The macs are correctly swapped and there's no such bug. 
>>
>> Step 6(?) bring eth2 up
>> ~# ip l set eth2 up
>> 3: eth1:  mtu 1500 qdisc pfifo_fast 
>> master bond0 state UP mode DEFAULT group default qlen 1000
>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>> 4: eth2:  mtu 1500 qdisc pfifo_fast 
>> master bond0 state UP mode DEFAULT group default qlen 1000
>> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
>> 26: bond0:  mtu 1500 qdisc noqueue 
>> state UP mode DEFAULT group default 
>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>
>> *Still correct.
>>
>> Also the mac address that gets set is dev_addr which is changed when
>> the swapping is done, if you'd like to get the original mac address
>> you should be using slave->perm_hwaddr.
>>
> 
> Hi Nik:
> 
> Which kernel version do you use, I test this on kernel 3.19.8 and 4.2-rc2, 
> this problem exist on both version,
> maybe I miss something

Re: [PATCH] bonding: correct the MAC address for "follow" fail_over_mac policy

2015-07-16 Thread Nikolay Aleksandrov

On 07/16/2015 01:50 PM, Nikolay Aleksandrov wrote:
> On 07/16/2015 01:48 PM, Ding Tianhong wrote:
>> On 2015/7/16 17:24, Nikolay Aleksandrov wrote:
>>> On 07/16/2015 10:30 AM, Ding Tianhong wrote:
>>>> The "follow" fail_over_mac policy is useful for multiport devices that
>>>> either become confused or incur a performance penalty when multiple
>>>> ports are programmed with the same MAC address, but the same MAC
>>>> address still may happened by this steps for this policy:
>>>>
>>>> 1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
>>>>bond0 has the same mac address with eth0, it is MAC1.
>>>>
>>>> 2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
>>>>eth1 is backup, eth1 has MAC2.
>>>>
>>>> 3) ifconfig eth0 down
>>>>eth1 became active slave, bond will swap MAC for eth0 and eth1,
>>>>so eth1 has MAC1, and eth0 has MAC2.
>>>>
>>>> 4) ifconfig eth1 down
>>>>there is no active slave, and eth1 still has MAC1, eth0 has MAC2.
>>>>
>>>> 5) ifconfig eth0 up
>>>>the eth0 became active slave again, the bond set eth0 to MAC1.
>>>>
>>>> Something wrong here, then if you set eth1 up, the eth0 and eth1 will have 
>>>> the same
>>>> MAC address, it will break this policy for ACTIVE_BACKUP mode.
>>>>
>>>> This patch will fix this problem by finding the old active slave and
>>>> swap them MAC address before change active slave.
>>>>
>>>> Signed-off-by: Ding Tianhong 
>>>> ---
>>>>  drivers/net/bonding/bond_main.c | 20 
>>>>  1 file changed, 20 insertions(+)
>>>>
>>>
>>> This doesn't seem to be true:
>>> ~# cat /sys/class/net/bond0/bonding/fail_over_mac 
>>> follow 2
>>> root@debian:~# ip l sh eth1
>>> 3: eth1:  mtu 1500 qdisc pfifo_fast 
>>> master bond0 state UP mode DEFAULT group default qlen 1000
>>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>> root@debian:~# ip l sh eth2
>>> 4: eth2:  mtu 1500 qdisc pfifo_fast 
>>> master bond0 state UP mode DEFAULT group default qlen 1000
>>> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
>>> root@debian:~# ip l sh bond0
>>> 26: bond0:  mtu 1500 qdisc noqueue 
>>> state UP mode DEFAULT group default 
>>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>>
>>> *eth1 is the first and active slave and bond0 has taken its mac.
>>> Now trying your steps:
>>> Step 3) (bringing down the active eth1)
>>> root@debian:~# ip l set eth1 down
>>> root@debian:~# ip l sh bond0
>>> 26: bond0:  mtu 1500 qdisc noqueue 
>>> state UP mode DEFAULT group default 
>>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>> root@debian:~# ip l sh eth1
>>> 3: eth1:  mtu 1500 qdisc pfifo_fast master bond0 
>>> state DOWN mode DEFAULT group default qlen 1000
>>> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
>>> root@debian:~# ip l sh eth2
>>> 4: eth2:  mtu 1500 qdisc pfifo_fast 
>>> master bond0 state UP mode DEFAULT group default qlen 1000
>>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>>
>>> *The mac addresses of eth1 and eth2 are correctly swapped, so far so good.
>>>
>>> Step 4) (bringing down the active eth2)
>>> root@debian:~# ip l set eth2 down
>>> 3: eth1:  mtu 1500 qdisc pfifo_fast master bond0 
>>> state DOWN mode DEFAULT group default qlen 1000
>>> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
>>> 4: eth2:  mtu 1500 qdisc pfifo_fast master bond0 
>>> state DOWN mode DEFAULT group default qlen 1000
>>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>> 26: bond0:  mtu 1500 qdisc 
>>> noqueue state DOWN mode DEFAULT group default 
>>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>>
>>> *eth2 has kept the mac address of the bond and they're both down now
>>>
>>> Step 5) (bring eth1 up again and observe the macs)
>>> ~# ip l set eth1 up
>>> 3: eth1:  mtu 1500 qdisc pfifo_fast 
>>> master bond0 state UP mode DEFAULT group default qlen 1000
>>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>> 4: eth2:  mtu 1500 qdisc pfifo_fast master bond0 
>>> state DOWN mode DEFAULT group default qlen 1000
>>> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
>>> 26: bond0:  mtu 1500 qdisc noqueue 
>>> state UP mode DEFAULT group default 
>>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>>
>>> *The macs are correctly swapped and there's no such bug. 
>>>
>>> Step 6(?) bring eth2 up
>>> ~# ip l set eth2 up
>>> 3: eth1:  mtu 1500 qdisc pfifo_fast 
>>> master bond0 state UP mode DEFAULT group default qlen 1000
>>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>> 4: eth2:  mtu 1500 qdisc pfifo_fast 
>>> master bond0 state UP mode DEFAULT group default qlen 1000
>>> link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
>>> 26: bond0:  mtu 1500 qdisc noqueue 
>>> state UP mode DEFAULT group default 
>>> link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
>>>
>>> *Still correct.
>>>
>>> Also the mac address that gets set is dev_addr which is changed when
>>> the swapping is done, if you'd like to get the original mac address
>>> you should be using slave->per

Re: [PATCH] bonding: correct the MAC address for "follow" fail_over_mac policy

2015-07-16 Thread Nikolay Aleksandrov

On 07/16/2015 01:54 PM, Nikolay Aleksandrov wrote:
> On 07/16/2015 01:50 PM, Nikolay Aleksandrov wrote:
>> On 07/16/2015 01:48 PM, Ding Tianhong wrote:
>>> On 2015/7/16 17:24, Nikolay Aleksandrov wrote:
 On 07/16/2015 10:30 AM, Ding Tianhong wrote:
> The "follow" fail_over_mac policy is useful for multiport devices that
> either become confused or incur a performance penalty when multiple
> ports are programmed with the same MAC address, but the same MAC
> address still may happened by this steps for this policy:
>
> 1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
>bond0 has the same mac address with eth0, it is MAC1.
>
> 2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
>eth1 is backup, eth1 has MAC2.
>
> 3) ifconfig eth0 down
>eth1 became active slave, bond will swap MAC for eth0 and eth1,
>so eth1 has MAC1, and eth0 has MAC2.
>
> 4) ifconfig eth1 down
>there is no active slave, and eth1 still has MAC1, eth0 has MAC2.
>
> 5) ifconfig eth0 up
>the eth0 became active slave again, the bond set eth0 to MAC1.
>
> Something is wrong here: if you then set eth1 up, eth0 and eth1 will have
> the same MAC address, which breaks this policy for ACTIVE_BACKUP mode.
>
> This patch fixes the problem by finding the old active slave and
> swapping the MAC addresses before changing the active slave.
>
> Signed-off-by: Ding Tianhong 
> ---
>  drivers/net/bonding/bond_main.c | 20 
>  1 file changed, 20 insertions(+)
>

 This doesn't seem to be true:
 ~# cat /sys/class/net/bond0/bonding/fail_over_mac 
 follow 2
 root@debian:~# ip l sh eth1
 3: eth1:  mtu 1500 qdisc pfifo_fast 
 master bond0 state UP mode DEFAULT group default qlen 1000
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
 root@debian:~# ip l sh eth2
 4: eth2:  mtu 1500 qdisc pfifo_fast 
 master bond0 state UP mode DEFAULT group default qlen 1000
 link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
 root@debian:~# ip l sh bond0
 26: bond0:  mtu 1500 qdisc noqueue 
 state UP mode DEFAULT group default 
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

 *eth1 is the first and active slave and bond0 has taken its mac.
 Now trying your steps:
 Step 3) (bringing down the active eth1)
 root@debian:~# ip l set eth1 down
 root@debian:~# ip l sh bond0
 26: bond0:  mtu 1500 qdisc noqueue 
 state UP mode DEFAULT group default 
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
 root@debian:~# ip l sh eth1
 3: eth1:  mtu 1500 qdisc pfifo_fast master 
 bond0 state DOWN mode DEFAULT group default qlen 1000
 link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
 root@debian:~# ip l sh eth2
 4: eth2:  mtu 1500 qdisc pfifo_fast 
 master bond0 state UP mode DEFAULT group default qlen 1000
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

 *The mac addresses of eth1 and eth2 are correctly swapped, so far so good.

 Step 4) (bringing down the active eth2)
 root@debian:~# ip l set eth2 down
 3: eth1:  mtu 1500 qdisc pfifo_fast master 
 bond0 state DOWN mode DEFAULT group default qlen 1000
 link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
 4: eth2:  mtu 1500 qdisc pfifo_fast master 
 bond0 state DOWN mode DEFAULT group default qlen 1000
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
 26: bond0:  mtu 1500 qdisc 
 noqueue state DOWN mode DEFAULT group default 
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

 *eth2 has kept the mac address of the bond and they're both down now

 Step 5) (bring eth1 up again and observe the macs)
 ~# ip l set eth1 up
 3: eth1:  mtu 1500 qdisc pfifo_fast 
 master bond0 state UP mode DEFAULT group default qlen 1000
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
 4: eth2:  mtu 1500 qdisc pfifo_fast master 
 bond0 state DOWN mode DEFAULT group default qlen 1000
 link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
 26: bond0:  mtu 1500 qdisc noqueue 
 state UP mode DEFAULT group default 
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

 *The macs are correctly swapped and there's no such bug. 

 Step 6(?) bring eth2 up
 ~# ip l set eth2 up
 3: eth1:  mtu 1500 qdisc pfifo_fast 
 master bond0 state UP mode DEFAULT group default qlen 1000
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
 4: eth2:  mtu 1500 qdisc pfifo_fast 
 master bond0 state UP mode DEFAULT group default qlen 1000
 link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
 26: bond0:  mtu 1500 qdisc noqueue 
 state UP mode DEFAULT group default 
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

 *Still correct.

 Also th

Re: [PATCH] bonding: correct the MAC address for "follow" fail_over_mac policy

2015-07-16 Thread Ding Tianhong

On 2015/7/16 19:54, Nikolay Aleksandrov wrote:
> On 07/16/2015 01:50 PM, Nikolay Aleksandrov wrote:
>> On 07/16/2015 01:48 PM, Ding Tianhong wrote:
>>> On 2015/7/16 17:24, Nikolay Aleksandrov wrote:
 On 07/16/2015 10:30 AM, Ding Tianhong wrote:
> The "follow" fail_over_mac policy is useful for multiport devices that
> either become confused or incur a performance penalty when multiple
> ports are programmed with the same MAC address, but duplicate MAC
> addresses can still occur under this policy via the following steps:
>
> 1) echo +eth0 > /sys/class/net/bond0/bonding/slaves
>bond0 has the same mac address with eth0, it is MAC1.
>
> 2) echo +eth1 > /sys/class/net/bond0/bonding/slaves
>eth1 is backup, eth1 has MAC2.
>
> 3) ifconfig eth0 down
>eth1 became active slave, bond will swap MAC for eth0 and eth1,
>so eth1 has MAC1, and eth0 has MAC2.
>
> 4) ifconfig eth1 down
>there is no active slave, and eth1 still has MAC1, eth0 has MAC2.
>
> 5) ifconfig eth0 up
>the eth0 became active slave again, the bond set eth0 to MAC1.
>
> Something is wrong here: if you then set eth1 up, eth0 and eth1 will have
> the same MAC address, which breaks this policy for ACTIVE_BACKUP mode.
>
> This patch fixes the problem by finding the old active slave and
> swapping the MAC addresses before changing the active slave.
>
> Signed-off-by: Ding Tianhong 
> ---
>  drivers/net/bonding/bond_main.c | 20 
>  1 file changed, 20 insertions(+)
>

 This doesn't seem to be true:
 ~# cat /sys/class/net/bond0/bonding/fail_over_mac 
 follow 2
 root@debian:~# ip l sh eth1
 3: eth1:  mtu 1500 qdisc pfifo_fast 
 master bond0 state UP mode DEFAULT group default qlen 1000
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
 root@debian:~# ip l sh eth2
 4: eth2:  mtu 1500 qdisc pfifo_fast 
 master bond0 state UP mode DEFAULT group default qlen 1000
 link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
 root@debian:~# ip l sh bond0
 26: bond0:  mtu 1500 qdisc noqueue 
 state UP mode DEFAULT group default 
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

 *eth1 is the first and active slave and bond0 has taken its mac.
 Now trying your steps:
 Step 3) (bringing down the active eth1)
 root@debian:~# ip l set eth1 down
 root@debian:~# ip l sh bond0
 26: bond0:  mtu 1500 qdisc noqueue 
 state UP mode DEFAULT group default 
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
 root@debian:~# ip l sh eth1
 3: eth1:  mtu 1500 qdisc pfifo_fast master 
 bond0 state DOWN mode DEFAULT group default qlen 1000
 link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
 root@debian:~# ip l sh eth2
 4: eth2:  mtu 1500 qdisc pfifo_fast 
 master bond0 state UP mode DEFAULT group default qlen 1000
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

 *The mac addresses of eth1 and eth2 are correctly swapped, so far so good.

 Step 4) (bringing down the active eth2)
 root@debian:~# ip l set eth2 down
 3: eth1:  mtu 1500 qdisc pfifo_fast master 
 bond0 state DOWN mode DEFAULT group default qlen 1000
 link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
 4: eth2:  mtu 1500 qdisc pfifo_fast master 
 bond0 state DOWN mode DEFAULT group default qlen 1000
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
 26: bond0:  mtu 1500 qdisc 
 noqueue state DOWN mode DEFAULT group default 
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

 *eth2 has kept the mac address of the bond and they're both down now

 Step 5) (bring eth1 up again and observe the macs)
 ~# ip l set eth1 up
 3: eth1:  mtu 1500 qdisc pfifo_fast 
 master bond0 state UP mode DEFAULT group default qlen 1000
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
 4: eth2:  mtu 1500 qdisc pfifo_fast master 
 bond0 state DOWN mode DEFAULT group default qlen 1000
 link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
 26: bond0:  mtu 1500 qdisc noqueue 
 state UP mode DEFAULT group default 
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

 *The macs are correctly swapped and there's no such bug. 

 Step 6(?) bring eth2 up
 ~# ip l set eth2 up
 3: eth1:  mtu 1500 qdisc pfifo_fast 
 master bond0 state UP mode DEFAULT group default qlen 1000
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff
 4: eth2:  mtu 1500 qdisc pfifo_fast 
 master bond0 state UP mode DEFAULT group default qlen 1000
 link/ether 52:54:00:4f:a5:99 brd ff:ff:ff:ff:ff:ff
 26: bond0:  mtu 1500 qdisc noqueue 
 state UP mode DEFAULT group default 
 link/ether 52:54:00:3f:47:69 brd ff:ff:ff:ff:ff:ff

 *Still correct.

 Also the ma

Re: [PATCH nf-next] netfilter: nf_ct_sctp: minimal multihoming support

2015-07-16 Thread Michal Kubecek

On Wed, Jul 15, 2015 at 05:35:08PM -0300, Marcelo Ricardo Leitner wrote:
> Hi,
> 
> On Tue, Jul 14, 2015 at 06:42:25PM +0200, Michal Kubecek wrote:
> > On Tue, Jul 14, 2015 at 03:42:03PM +0200, Florian Westphal wrote:
> > > Michal Kubecek  wrote:
> > > > +   case SCTP_CID_HEARTBEAT:
> > > > +   pr_debug("SCTP_CID_HEARTBEAT");
> > > > +   i = 9;
> > > > +   break;
> > > > +   case SCTP_CID_HEARTBEAT_ACK:
> > > > +   pr_debug("SCTP_CID_HEARTBEAT_ACK");
> > > > +   i = 10;
> > > > +   break;
> > > > default:
> > > > /* Other chunks like DATA, SACK, HEARTBEAT and
> > > > its ACK do not cause a change in state */
> > > > @@ -329,6 +351,8 @@ static int sctp_packet(struct nf_conn *ct,
> > > > !test_bit(SCTP_CID_COOKIE_ECHO, map) &&
> > > > !test_bit(SCTP_CID_ABORT, map) &&
> > > > !test_bit(SCTP_CID_SHUTDOWN_ACK, map) &&
> > > > +   !test_bit(SCTP_CID_HEARTBEAT, map) &&
> > > > +   !test_bit(SCTP_CID_HEARTBEAT_ACK, map) &&
> > > > sh->vtag != ct->proto.sctp.vtag[dir]) {
> > > > pr_debug("Verification tag check failed\n");
> > > > goto out;
> > > > @@ -357,6 +381,16 @@ static int sctp_packet(struct nf_conn *ct,
> > > > /* Sec 8.5.1 (D) */
> > > > if (sh->vtag != ct->proto.sctp.vtag[dir])
> > > > goto out_unlock;
> > > > +   } else if (sch->type == SCTP_CID_HEARTBEAT ||
> > > > +  sch->type == SCTP_CID_HEARTBEAT_ACK) {
> > > > +   if (ct->proto.sctp.vtag[dir] == 0) {
> > > > +   pr_debug("Setting vtag %x for dir %d\n",
> > > > +sh->vtag, dir);
> > > > +   ct->proto.sctp.vtag[dir] = sh->vtag;
> > > 
> > > Could you please elaborate on the [dir] == 0 test?
> > > 
> > > I see this might happen for SCTP_CID_HEARTBEAT_ACK, but why is this
> > > needed for SCTP_CID_HEARTBEAT ?
> > > 
> > > We found a conntrack entry so shouldn't the vtag[dir] already be > 0?
> > 
> > Yes, you are right. This was originally intended to handle the case when
> > a HEARTBEAT in the reply direction is seen before the HEARTBEAT-ACK but
> > such HEARTBEAT would be dropped anyway in current version.
> 
> And we have to keep the first vtag attempted because otherwise an
> attacker could just probe for the right one until she gets a reply.
> 
> IOW, if a different vtag is attempted, we should drop it as the packet
> doesn't belong to that association/conntrack entry.
> 
> As vtags are always != 0 in such case, that's a way to know if we
> already have that information or not.
> 
> > On the other hand, an alternative would be
> > 
> > } else if (sch->type == SCTP_CID_HEARTBEAT_ACK &&
> >ct->proto.sctp.vtag[dir] == 0) {
> > pr_debug("Setting vtag %x for dir %d\n",
> >  sh->vtag, dir);
> > ct->proto.sctp.vtag[dir] = sh->vtag;
> > } else if ((sch->type == SCTP_CID_HEARTBEAT ||
> > sch->type == SCTP_CID_HEARTBEAT_ACK) &&
> >sh->vtag != ct->proto.sctp.vtag[dir]) {
> > pr_debug("Verification tag check failed\n");
> > goto out_unlock;
> > }
> > 
> > I'm not sure it looks better.
> 
> Now it seems swapped, we should save the tag on HB and check on
> HB_ACK only and would have to check against !dir entry. Like:

I forgot to include the explanation of vtag setting/checking logic into
the commit message. It is supposed to work like this:

Normally, vtag is set from the INIT chunk for the reply direction and
from the INIT-ACK chunk for the originating direction (i.e. each of
these defines vtag value for the opposite direction). For secondary
conntracks, we can't rely on seeing INIT/INIT-ACK and even if we have
seen them, we would need to connect two different conntracks. Therefore
simplified logic is applied: vtag of first packet in each direction
(HEARTBEAT in the originating and HEARTBEAT-ACK in reply direction) is
saved and all following packets in that direction are compared with this
saved value. While INIT and INIT-ACK define vtag for the opposite
direction (that's where "!dir" comes from), vtags extracted from
HEARTBEAT and HEARTBEAT-ACK are always for their direction. And we have
to check vtags on packets with HEARTBEAT chunks as well because their
vtags should match vtag of the first (set in sctp_new()).

  Michal Kubecek

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RE: [PATCH v1 08/12] IB/cma: Add net_dev and private data checks to RDMA CM

2015-07-16 Thread Liran Liss

> From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com]

> > After all, it is the payload that designates the entity that you
> > want to establish a connection to, rather than the packet headers,
> > which are just meant to relay the packet to the proper CM
> 
> No, that isn't right. The IBA uses the GMP's destination first, then
> serviceID as the demux. Services IDs are not globally unique, they are
> scoped by the destination.
> 

The destination is the physical CA port and kernel CM agent, so I don't think 
the answer is that clear.
Going forward along these lines:
- Name space lookup is done based on BTH.pkey, private_data.IP, and optionally 
GRH.DGID (if present, for extra validation)
-- Primary and alternate paths are not considered at all

- If P_Key enforcement is set up via cgroups:
-- For CM processing, we only check BTH.pkey
--- Upon conflict, the packet is dropped
-- The primary/alternate path pkeys are not validated by CM, but will be 
validated during QP modification

Does this sound OK?

In any case, let's complete the namespace lookup first, and then follow up with 
a cgroup patchset.

> The path data is just *routing* it doesn't describe at all the entity
> we want to talk to, it is only a proposal for how to flow data to it.
> 
> In any event, both the GMP headers and the path data needs to be
> checked against the container's pkey list. I don't know why this is so
> contentious.
> 
> Jason

[PATCH v2] jhash: Deinline jhash, jhash2 and __jhash_nwords

2015-07-16 Thread Denys Vlasenko

This patch deinlines jhash, jhash2 and __jhash_nwords.

It also removes rhashtable_jhash2(key, length, seed)
because it was merely calling jhash2(key, length, seed).

With this .config: http://busybox.net/~vda/kernel_config,
after deinlining these functions have sizes and callsite counts
as follows:

__jhash_nwords: 72 bytes, 75 calls
jhash: 297 bytes, 111 calls
jhash2: 205 bytes, 136 calls

Total size decrease is about 38,000 bytes:

text data  bss   dec hex filename
90663567 17221960 36659200 144544727 89d93d7 vmlinux5
90625577 17221864 36659200 144506641 89cff11 vmlinux.after

Signed-off-by: Denys Vlasenko 
CC: Thomas Graf 
CC: Alexander Duyck 
CC: Jozsef Kadlecsik 
CC: Herbert Xu 
CC: netdev@vger.kernel.org
CC: linux-ker...@vger.kernel.org
---
Changes in v2: created a new source file, jhash.c

 include/linux/jhash.h | 123 +
 lib/Makefile  |   2 +-
 lib/jhash.c   | 149 ++
 lib/rhashtable.c  |  13 +++--
 4 files changed, 160 insertions(+), 127 deletions(-)
 create mode 100644 lib/jhash.c

diff --git a/include/linux/jhash.h b/include/linux/jhash.h
index 348c6f4..0b3f55d 100644
--- a/include/linux/jhash.h
+++ b/include/linux/jhash.h
@@ -31,131 +31,14 @@
 /* Mask the hash value, i.e (value & jhash_mask(n)) instead of (value % n) */
 #define jhash_mask(n)   (jhash_size(n)-1)
 
-/* __jhash_mix -- mix 3 32-bit values reversibly. */
-#define __jhash_mix(a, b, c)   \
-{  \
-   a -= c;  a ^= rol32(c, 4);  c += b; \
-   b -= a;  b ^= rol32(a, 6);  a += c; \
-   c -= b;  c ^= rol32(b, 8);  b += a; \
-   a -= c;  a ^= rol32(c, 16); c += b; \
-   b -= a;  b ^= rol32(a, 19); a += c; \
-   c -= b;  c ^= rol32(b, 4);  b += a; \
-}
-
-/* __jhash_final - final mixing of 3 32-bit values (a,b,c) into c */
-#define __jhash_final(a, b, c) \
-{  \
-   c ^= b; c -= rol32(b, 14);  \
-   a ^= c; a -= rol32(c, 11);  \
-   b ^= a; b -= rol32(a, 25);  \
-   c ^= b; c -= rol32(b, 16);  \
-   a ^= c; a -= rol32(c, 4);   \
-   b ^= a; b -= rol32(a, 14);  \
-   c ^= b; c -= rol32(b, 24);  \
-}
-
 /* An arbitrary initial parameter */
 #define JHASH_INITVAL  0xdeadbeef
 
-/* jhash - hash an arbitrary key
- * @k: sequence of bytes as key
- * @length: the length of the key
- * @initval: the previous hash, or an arbitray value
- *
- * The generic version, hashes an arbitrary sequence of bytes.
- * No alignment or length assumptions are made about the input key.
- *
- * Returns the hash value of the key. The result depends on endianness.
- */
-static inline u32 jhash(const void *key, u32 length, u32 initval)
-{
-   u32 a, b, c;
-   const u8 *k = key;
-
-   /* Set up the internal state */
-   a = b = c = JHASH_INITVAL + length + initval;
-
-   /* All but the last block: affect some 32 bits of (a,b,c) */
-   while (length > 12) {
-   a += __get_unaligned_cpu32(k);
-   b += __get_unaligned_cpu32(k + 4);
-   c += __get_unaligned_cpu32(k + 8);
-   __jhash_mix(a, b, c);
-   length -= 12;
-   k += 12;
-   }
-   /* Last block: affect all 32 bits of (c) */
-   /* All the case statements fall through */
-   switch (length) {
-   case 12: c += (u32)k[11]<<24;
-   case 11: c += (u32)k[10]<<16;
-   case 10: c += (u32)k[9]<<8;
-   case 9:  c += k[8];
-   case 8:  b += (u32)k[7]<<24;
-   case 7:  b += (u32)k[6]<<16;
-   case 6:  b += (u32)k[5]<<8;
-   case 5:  b += k[4];
-   case 4:  a += (u32)k[3]<<24;
-   case 3:  a += (u32)k[2]<<16;
-   case 2:  a += (u32)k[1]<<8;
-   case 1:  a += k[0];
-__jhash_final(a, b, c);
-   case 0: /* Nothing left to add */
-   break;
-   }
-
-   return c;
-}
-
-/* jhash2 - hash an array of u32's
- * @k: the key which must be an array of u32's
- * @length: the number of u32's in the key
- * @initval: the previous hash, or an arbitray value
- *
- * Returns the hash value of the key.
- */
-static inline u32 jhash2(const u32 *k, u32 length, u32 initval)
-{
-   u32 a, b, c;
-
-   /* Set up the internal state */
-   a = b = c = JHASH_INITVAL + (length<<2) + initval;
-
-   /* Handle most of the key */
-   while (length > 3) {
-   a += k[0];
-   b += k[1];
-   c += k[2];
-   __jhash_mix(a, b, c);
-   length -= 3;
-   k += 3;
-   }
-
-   /* Handle the last 3 u32's: all the case statements fall through */
-   switch (length) {
-   case 3: c += k[2];
-   case 2: b += k[1];
-   case 1: a += k[0];
-   __jhash_f

[PATCH 1/2] iwlwifi: convert hex_dump_to_buffer() to %*ph

2015-07-16 Thread Andy Shevchenko

There is no need to use hex_dump_to_buffer() in the cases like this:

hex_dump_to_buffer(buf, len, 16, 1, outbuf, outlen, false);  /* len <= 16 */
sprintf("%s\n", outbuf);

since it may easily be converted to the simpler:

sprintf("%*ph\n", len, buf);

Note: in one case the output is grouped by 2 bytes, which looks like a
typo. Thus, the patch changes that to a plain byte stream.

Signed-off-by: Andy Shevchenko 
---
 drivers/net/wireless/iwlwifi/dvm/debugfs.c | 8 ++--
 drivers/net/wireless/iwlwifi/mvm/debugfs.c | 7 +--
 2 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/drivers/net/wireless/iwlwifi/dvm/debugfs.c b/drivers/net/wireless/iwlwifi/dvm/debugfs.c
index 0ffb6ff..b15e44f 100644
--- a/drivers/net/wireless/iwlwifi/dvm/debugfs.c
+++ b/drivers/net/wireless/iwlwifi/dvm/debugfs.c
@@ -310,12 +310,8 @@ static ssize_t iwl_dbgfs_nvm_read(struct file *file,
pos += scnprintf(buf + pos, buf_size - pos,
 "NVM version: 0x%x\n", nvm_ver);
for (ofs = 0 ; ofs < eeprom_len ; ofs += 16) {
-   pos += scnprintf(buf + pos, buf_size - pos, "0x%.4x ", ofs);
-   hex_dump_to_buffer(ptr + ofs, 16 , 16, 2, buf + pos,
-  buf_size - pos, 0);
-   pos += strlen(buf + pos);
-   if (buf_size - pos > 0)
-   buf[pos++] = '\n';
+   pos += scnprintf(buf + pos, buf_size - pos, "0x%.4x %16ph\n",
+ofs, ptr + ofs);
}
 
ret = simple_read_from_buffer(user_buf, count, ppos, buf, pos);
diff --git a/drivers/net/wireless/iwlwifi/mvm/debugfs.c b/drivers/net/wireless/iwlwifi/mvm/debugfs.c
index ffb4b5c..98abd31 100644
--- a/drivers/net/wireless/iwlwifi/mvm/debugfs.c
+++ b/drivers/net/wireless/iwlwifi/mvm/debugfs.c
@@ -1200,12 +1200,7 @@ static ssize_t iwl_dbgfs_d3_sram_read(struct file *file, char __user *user_buf,
if (ptr) {
for (ofs = 0; ofs < len; ofs += 16) {
pos += scnprintf(buf + pos, bufsz - pos,
-"0x%.4x ", ofs);
-   hex_dump_to_buffer(ptr + ofs, 16, 16, 1, buf + pos,
-  bufsz - pos, false);
-   pos += strlen(buf + pos);
-   if (bufsz - pos > 0)
-   buf[pos++] = '\n';
+"0x%.4x %16ph\n", ofs, ptr + ofs);
}
} else {
pos += scnprintf(buf + pos, bufsz - pos,
-- 
2.1.4


[PATCH 2/2] iwlegacy: convert hex_dump_to_buffer() to %*ph

2015-07-16 Thread Andy Shevchenko

There is no need to use hex_dump_to_buffer() in the cases like this:

hex_dump_to_buffer(buf, len, 16, 1, outbuf, outlen, false);  /* len <= 16 */
sprintf("%s\n", outbuf);

since it may easily be converted to the simpler:

sprintf("%*ph\n", len, buf);

Note: in one case the output is grouped by 2 bytes, which looks like a
typo. Thus, the patch changes that to a plain byte stream.

Signed-off-by: Andy Shevchenko 
---
 drivers/net/wireless/iwlegacy/3945-mac.c | 2 +-
 drivers/net/wireless/iwlegacy/debug.c| 8 ++--
 2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wireless/iwlegacy/3945-mac.c b/drivers/net/wireless/iwlegacy/3945-mac.c
index 7f4cb69..af1b3e6 100644
--- a/drivers/net/wireless/iwlegacy/3945-mac.c
+++ b/drivers/net/wireless/iwlegacy/3945-mac.c
@@ -3259,7 +3259,7 @@ il3945_show_measurement(struct device *d, struct device_attribute *attr,
 
while (size && PAGE_SIZE - len) {
hex_dump_to_buffer(data + ofs, size, 16, 1, buf + len,
-  PAGE_SIZE - len, 1);
+  PAGE_SIZE - len, true);
len = strlen(buf);
if (PAGE_SIZE - len)
buf[len++] = '\n';
diff --git a/drivers/net/wireless/iwlegacy/debug.c b/drivers/net/wireless/iwlegacy/debug.c
index 3440101..908b9f4 100644
--- a/drivers/net/wireless/iwlegacy/debug.c
+++ b/drivers/net/wireless/iwlegacy/debug.c
@@ -515,12 +515,8 @@ il_dbgfs_nvm_read(struct file *file, char __user *user_buf, size_t count,
scnprintf(buf + pos, buf_size - pos, "EEPROM " "version: 0x%x\n",
  eeprom_ver);
for (ofs = 0; ofs < eeprom_len; ofs += 16) {
-   pos += scnprintf(buf + pos, buf_size - pos, "0x%.4x ", ofs);
-   hex_dump_to_buffer(ptr + ofs, 16, 16, 2, buf + pos,
-  buf_size - pos, 0);
-   pos += strlen(buf + pos);
-   if (buf_size - pos > 0)
-   buf[pos++] = '\n';
+   pos += scnprintf(buf + pos, buf_size - pos, "0x%.4x %16ph\n",
+ofs, ptr + ofs);
}
 
ret = simple_read_from_buffer(user_buf, count, ppos, buf, pos);
-- 
2.1.4


Re: What queues/buffers does tc-netem use?

2015-07-16 Thread Motejlek, Petr

Hello Hagen,

Could you please give me some example of such a tc command that would tell me 
the statistics? I am not sure what you mean.

Is there a way I can manipulate the internal rbtree queue size, please?

Thank you

Petr MOTEJLEK

From: Hagen Paul Pfeifer 
Sent: Thursday, July 16, 2015 2:42 PM
To: netdev@vger.kernel.org; Motejlek, Petr
Cc: ne...@osdl.org
Subject: Re: What queues/buffers does tc-netem use?

> On July 16, 2015 at 1:28 PM "Motejlek, Petr" 
> wrote:

> I was wondering what queues/buffers does netem use and how does one
> control or monitor them?

netem uses its own rbtree-based queue. You can use tc(1) to get
statistics.

> I could not find this information anywhere and I am not that good in
> reading the sources to be able to tell enough about this :) If we talk
> only about the situation where netem is the root qdisc for a particular
> interface, I would imagine it might be using the txqueue of that
> interface, but I am not sure if that's really the case...

Sadly, there is no netem implementation documentation, but the source code
is straightforward. You may take a look:

http://lxr.free-electrons.com/source/net/sched/sch_netem.c

Cheers, Hagen

Re: What queues/buffers does tc-netem use?

2015-07-16 Thread Hagen Paul Pfeifer

> On July 16, 2015 at 1:28 PM "Motejlek, Petr" 
> wrote:

> I was wondering what queues/buffers does netem use and how does one
> control or monitor them?

netem uses its own rbtree-based queue. You can use tc(1) to get
statistics.

> I could not find this information anywhere and I am not that good in
> reading the sources to be able to tell enough about this :) If we talk
> only about the situation where netem is the root qdisc for a particular
> interface, I would imagine it might be using the txqueue of that
> interface, but I am not sure if that's really the case...

Sadly, there is no netem implementation documentation, but the source code
is straightforward. You may take a look:

http://lxr.free-electrons.com/source/net/sched/sch_netem.c

Cheers, Hagen

Re: What queues/buffers does tc-netem use?

2015-07-16 Thread Hagen Paul Pfeifer

> On July 16, 2015 at 2:48 PM "Motejlek, Petr" 
> wrote:
>
> Could you please give me some example of such a tc command that
> would tell me the statistics? I am not sure what you mean.

tc -s qdisc show dev eth0

> Is there a way I can manipulate the internal rbtree queue size, please?

Sure, the option is called limit.
 
> Thank you

You are welcome!

Hagen

Re: [RFC PATCH net-next] sctp: fix src address selection if using secondary addresses

2015-07-16 Thread Vlad Yasevich

On 07/15/2015 03:03 PM, Marcelo Ricardo Leitner wrote:
> On Fri, Jul 10, 2015 at 03:27:02PM -0300, Marcelo Ricardo Leitner wrote:
>> On Fri, Jul 10, 2015 at 01:14:21PM -0400, Vlad Yasevich wrote:
>>> On 07/10/2015 12:17 PM, Marcelo Ricardo Leitner wrote:
 On Fri, Jul 10, 2015 at 11:35:28AM -0400, Vlad Yasevich wrote:
> ...
> have been numerous times where I've seen weak host model in use on the 
> wire
> even with a BSD peer.
>
> This also puts a very big nail through many suggestions we've had over 
> the years
> to allow source based path multihoming in addition to destination based 
> multihoming
> we currently support.
>
> It might be a good idea to make rp-filter like behavior best effort, and 
> have
> the old behavior as fallback.  I am still trying to think up different 
> scenarios
> where rp-filter behavior will cause things to fail prematurely...

 The old behavior is like "if we don't have a src yet and can't find a
 preferred src for this dst, use the 1st bound address". We can add it
 but as I said, I'm afraid it is just doing wrong and not worth. If such
 randomly src addressed packet is meant to be routed, the router will
 likely drop it as it is seen as a spoof. And if it reaches the peer, it
 will probably come back through a different path.

 I'm tempted to say that current usual use cases are handled by the first
 check on this function, which returns the preferred/primary address for
 the interface and checks against bound addresses. Whenever you reach the
 second check, it just allows you to use that 1st bound address that is
 checked. I mean, I can't see use cases that we would be breaking with
 this change. 
>>>
>>> Yes,  the secondary check didn't amount to much, but we've kept it since 2.5
>>> days (when sctp was introduced).  I've made attempts over the years to
>>> try to make it stricter, but that never amounted to anything that worked 
>>> well.
>>>

 But yeah, it impacts source based routing, and I'm not aware of previous
 discussions on it. I'll try to dig some up but if possible, please share
 some pointers on it.
>>>
>>> It's been suggested a few times that we should support source based 
>>> multihoming
>>> particularly for the case where one peer has only 1 address.
>>> We've always punted on this, but people still ask every now and then.
>>
>> Ah okay, now I see it.
>>  
>>> I do have a question about the code though.. Have you tried with mutlipath 
>>> routing
>>> enabled.  I see rp_filter checks have special code to handle that.  Seem 
>>> like we
>>> might get false negatives in sctp.
>>
>> In the sense of CONFIG_IP_ROUTE_MULTIPATH=y, yes, but just that. My
>> routes were simple ones, either 2 peers attaches to 2 local subnets, or
>> with a gateway in the middle (with 2 subnets on each side, but mapped
>> 1-1, no crossing. Aka subnet1<->subnet2 and subnet3<->subnet4 while not
>> (subnet1<->subnet4 or subnet3<->subnet2).
>>
>> Note that this is not rp_filter strictly speaking, as it's mirrored.
>> rp_filter needs to calculate all possible output routes (actually until
>> it finds a valid one) for finding one that would match the one used for
>> incoming. 
>>
>> This check already has an output path, and it's calculating if such
>> input would be acceptable. We can't really expect/check for other hits
>> because it invalidates the chosen output path.
>>
>> Hmmm... but we could support multipath in the output selection, ie in
>> the outputs of ip_route_output_key(), probably in another patch then?
> 
> Thinking further.. we could just compare it with the addresses assigned to the
> interface instead of doing a whole new routing. Cheaper/faster, provides the
> results I'm looking for and the consequences are easier to see.
> 
> Something like (not tested, just illustrating the idea):
> 
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -489,22 +489,33 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
> list_for_each_entry_rcu(laddr, &bp->address_list, list) {
> if (!laddr->valid)
> continue;
> if ((laddr->state == SCTP_ADDR_SRC) &&
> (AF_INET == laddr->a.sa.sa_family)) {
> +   struct net_device *odev;
> +
> fl4->fl4_sport = laddr->a.v4.sin_port;
> flowi4_update_output(fl4,
>  asoc->base.sk->sk_bound_dev_if,
>  RT_CONN_FLAGS(asoc->base.sk),
>  daddr->v4.sin_addr.s_addr,
>  laddr->a.v4.sin_addr.s_addr);
>  
> rt = ip_route_output_key(sock_net(sk), fl4);
> -   if (!IS_ERR(rt)) {
> -   dst = &rt->dst;
> -

Re: [PATCH nf-next] netfilter: nf_ct_sctp: minimal multihoming support

2015-07-16 Thread Marcelo Ricardo Leitner

On Thu, Jul 16, 2015 at 02:05:12PM +0200, Michal Kubecek wrote:
> On Wed, Jul 15, 2015 at 05:35:08PM -0300, Marcelo Ricardo Leitner wrote:
> > Hi,
> > 
> > On Tue, Jul 14, 2015 at 06:42:25PM +0200, Michal Kubecek wrote:
> > > On Tue, Jul 14, 2015 at 03:42:03PM +0200, Florian Westphal wrote:
> > > > Michal Kubecek  wrote:
> > > > > + case SCTP_CID_HEARTBEAT:
> > > > > + pr_debug("SCTP_CID_HEARTBEAT");
> > > > > + i = 9;
> > > > > + break;
> > > > > + case SCTP_CID_HEARTBEAT_ACK:
> > > > > + pr_debug("SCTP_CID_HEARTBEAT_ACK");
> > > > > + i = 10;
> > > > > + break;
> > > > >   default:
> > > > >   /* Other chunks like DATA, SACK, HEARTBEAT and
> > > > >   its ACK do not cause a change in state */
> > > > > @@ -329,6 +351,8 @@ static int sctp_packet(struct nf_conn *ct,
> > > > >   !test_bit(SCTP_CID_COOKIE_ECHO, map) &&
> > > > >   !test_bit(SCTP_CID_ABORT, map) &&
> > > > >   !test_bit(SCTP_CID_SHUTDOWN_ACK, map) &&
> > > > > + !test_bit(SCTP_CID_HEARTBEAT, map) &&
> > > > > + !test_bit(SCTP_CID_HEARTBEAT_ACK, map) &&
> > > > >   sh->vtag != ct->proto.sctp.vtag[dir]) {
> > > > >   pr_debug("Verification tag check failed\n");
> > > > >   goto out;
> > > > > @@ -357,6 +381,16 @@ static int sctp_packet(struct nf_conn *ct,
> > > > >   /* Sec 8.5.1 (D) */
> > > > >   if (sh->vtag != ct->proto.sctp.vtag[dir])
> > > > >   goto out_unlock;
> > > > > + } else if (sch->type == SCTP_CID_HEARTBEAT ||
> > > > > +sch->type == SCTP_CID_HEARTBEAT_ACK) {
> > > > > + if (ct->proto.sctp.vtag[dir] == 0) {
> > > > > + pr_debug("Setting vtag %x for dir %d\n",
> > > > > +  sh->vtag, dir);
> > > > > + ct->proto.sctp.vtag[dir] = sh->vtag;
> > > > 
> > > > Could you please elaborate on the [dir] == 0 test?
> > > > 
> > > > I see this might happen for SCTP_CID_HEARTBEAT_ACK, but why is this
> > > > needed for SCTP_CID_HEARTBEAT ?
> > > > 
> > > > We found a conntrack entry so shouldn't the vtag[dir] already be > 0?
> > > 
> > > Yes, you are right. This was originally intended to handle the case when
> > > a HEARTBEAT in the reply direction is seen before the HEARTBEAT-ACK but
> > > such HEARTBEAT would be dropped anyway in current version.
> > 
> > And we have to keep the first vtag attempted because otherwise an
> > attacker could just probe for the right one until she gets a reply.
> > 
> > IOW, if a different vtag is attempted, we should drop it as the packet
> > doesn't belong to that association/conntrack entry.
> > 
> > As vtags are always != 0 in such case, that's a way to know if we
> > already have that information or not.
> > 
> > > On the other hand, an alternative would be
> > > 
> > >   } else if (sch->type == SCTP_CID_HEARTBEAT_ACK &&
> > >  ct->proto.sctp.vtag[dir] == 0) {
> > >   pr_debug("Setting vtag %x for dir %d\n",
> > >sh->vtag, dir);
> > >   ct->proto.sctp.vtag[dir] = sh->vtag;
> > >   } else if ((sch->type == SCTP_CID_HEARTBEAT ||
> > >   sch->type == SCTP_CID_HEARTBEAT_ACK) &&
> > >  sh->vtag != ct->proto.sctp.vtag[dir]) {
> > >   pr_debug("Verification tag check failed\n");
> > >   goto out_unlock;
> > >   }
> > > 
> > > I'm not sure it looks better.
> > 
> > Now it seems swapped, we should save the tag on HB and check on
> > HB_ACK only and would have to check against !dir entry. Like:
> 
> I forgot to include the explanation of vtag setting/checking logic into
> the commit message. It is supposed to work like this:
> 
> Normally, vtag is set from the INIT chunk for the reply direction and
> from the INIT-ACK chunk for the originating direction (i.e. each of
> these defines vtag value for the opposite direction). For secondary

Erf, indeed. I totally confused it and thought they would be equal on
both directions.

> conntracks, we can't rely on seeing INIT/INIT-ACK and even if we have
> seen them, we would need to connect two different conntracks. Therefore
> simplified logic is applied: vtag of first packet in each direction
> (HEARTBEAT in the originating and HEARTBEAT-ACK in reply direction) is
> saved and all following packets in that direction are compared with this
> saved value. While INIT and INIT-ACK define vtag for the opposite
> direction (that's where "!dir" comes from), vtags extracted from
> HEARTBEAT and HEARTBEAT-ACK are always for their direction. And we have
> to check vtags on packets with HEARTBEAT chunks as well because their
> vtags should match vtag of the first (set in sctp_new()).
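
The per-direction rule described above can be sketched in plain C. This is a simplified userspace model and not the conntrack code itself: `struct hb_track` and `hb_vtag_ok()` are hypothetical names, and a stored value of 0 is assumed to mean "no vtag recorded yet for this direction" (the thread notes that vtags are always != 0 in this case).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a secondary conntrack entry: one saved
 * verification tag per direction (0 = original, 1 = reply).
 * A stored value of 0 means no tag has been recorded yet.
 */
struct hb_track {
	uint32_t vtag[2];
};

/* Returns true if a HEARTBEAT/HEARTBEAT-ACK packet carrying vtag is
 * acceptable in direction dir. The first tag seen in a direction is
 * saved; every later packet in that direction must carry the same tag.
 */
static bool hb_vtag_ok(struct hb_track *t, int dir, uint32_t vtag)
{
	if (t->vtag[dir] == 0) {
		t->vtag[dir] = vtag;
		return true;
	}
	return t->vtag[dir] == vtag;
}
```

Unlike INIT/INIT-ACK handling, the tag is stored under `dir` and not `!dir`, since the vtag of a HEARTBEAT or HEARTBEAT-ACK applies to its own direction.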

Yes, that's pretty much it. Original code

Re: [PATCH v2 1/2] sctp: add new getsockopt option SCTP_SOCKOPT_PEELOFF_KERNEL

2015-07-16 Thread Vlad Yasevich

On 07/14/2015 01:13 PM, Marcelo Ricardo Leitner wrote:
> SCTP has this operation to peel off associations from a given socket and
> create a new socket using this association. We currently have two ways
> to use this operation:
> - via getsockopt(), on which it will also create and return a file
>   descriptor for this new socket
> - via sctp_do_peeloff(), which is for kernel only
> 
> The caveat with using sctp_do_peeloff() directly is that it creates a
> dependency to SCTP module, while all other operations are handled via
> kernel_{socket,sendmsg,getsockopt...}() interface. This causes the
> kernel to load SCTP module even when it's not really used.
> 
> This patch then creates a new sockopt that is to be used only by kernel
> users of this protocol. This new sockopt will not allocate a file
> descriptor but instead just return the socket pointer directly.
> 
> Kernel users are actually identified by whether the parent socket has
> a fd attached to it or not. If not, it's a kernel user.
> 
> If called by a user application, it will just return -EPERM.
> 
> Even though it's not intended for user applications, it's listed under
> the uapi header. That's because hiding this wouldn't add any extra security
> and to keep the sockopt list in one place, so it's easy to check
> available numbers to use.
> 
> Signed-off-by: Marcelo Ricardo Leitner 
> ---
>  include/uapi/linux/sctp.h | 12 
>  net/sctp/socket.c | 37 +
>  2 files changed, 49 insertions(+)
> 
> diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
> index 
> ce70fe6b45df3e841c35accbdb6379c16563893c..b3aad3ce456ab3c1ebf4d81fdb7269ba40b3d92a
>  100644
> --- a/include/uapi/linux/sctp.h
> +++ b/include/uapi/linux/sctp.h
> @@ -105,6 +105,10 @@ typedef __s32 sctp_assoc_t;
>  #define SCTP_SOCKOPT_BINDX_ADD   100 /* BINDX requests for adding 
> addrs */
>  #define SCTP_SOCKOPT_BINDX_REM   101 /* BINDX requests for removing 
> addrs. */
>  #define SCTP_SOCKOPT_PEELOFF 102 /* peel off association. */
> +#define SCTP_SOCKOPT_PEELOFF_KERNEL  103 /* peel off association.
> +  * only valid for kernel
> +  * users
> +  */

I am not sure how much I like stuff like this in the uapi.  This stuff is
exposed to the user and I'd much rather we try and hide this from
the user completely.

I understand that you are dealing with a rather ugly dependency, but this is
not the only one in the kernel.  There are dependencies like this elsewhere
as well.

I am not familiar enough with DLM and its history, but my question is this:
If dlm always peels off a socket for new associations, why is it using the
1-to-many api in the first place?  Doing a quick scan of the DLM lowcomms code
for sctp specific things, I see nothing that has specific dependencies
on the 1-to-many api.  It might be simpler to switch to using the 1-to-1 api,
similar to dlm tcp, and eliminate this dependency.

Is that a naive point of view?
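
For readers less familiar with the two API styles, here is a minimal userspace sketch of the difference at socket creation time. DLM itself runs in the kernel and would use the in-kernel socket helpers, so this is only an illustration; `enum sctp_style_sketch`, `sctp_socket_type()` and `open_sctp_socket()` are names made up for this example.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical selector for the two SCTP socket API styles. */
enum sctp_style_sketch { STYLE_ONE_TO_ONE, STYLE_ONE_TO_MANY };

/* 1-to-1 sockets are SOCK_STREAM (one association per socket, TCP-like);
 * 1-to-many sockets are SOCK_SEQPACKET (many associations per socket).
 */
static int sctp_socket_type(enum sctp_style_sketch style)
{
	return style == STYLE_ONE_TO_ONE ? SOCK_STREAM : SOCK_SEQPACKET;
}

/* Returns a new socket fd, or -1 if SCTP is not available on this host. */
static int open_sctp_socket(enum sctp_style_sketch style)
{
	return socket(AF_INET, sctp_socket_type(style), IPPROTO_SCTP);
}
```

With the 1-to-1 style, accept() on a listening socket already yields one socket per association, so no peeloff step is needed.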

Thanks
-vlad

>  /* Options 104-106 are deprecated and removed. Do not use this space */
>  #define SCTP_SOCKOPT_CONNECTX_OLD107 /* CONNECTX old requests. */
>  #define SCTP_GET_PEER_ADDRS  108 /* Get all peer address. */
> @@ -892,6 +896,14 @@ typedef struct {
>   int sd;
>  } sctp_peeloff_arg_t;
>  
> +/* This is the union that is passed as an argument(optval) to
> + * getsockopt(SCTP_SOCKOPT_PEELOFF_KERNEL).
> + */
> +typedef union {
> + sctp_assoc_t associd;
> + struct socket *socket;
> +} sctp_peeloff_kernel_arg_t;
> +
>  /*
>   *  Peer Address Thresholds socket option
>   */
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 
> f1a65398f3118ab5d3a884e9c875620560e6b5ef..7968de7a1aeabd5cd0a0398461dbf2081bd4c5b7
>  100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -4504,6 +4504,39 @@ out:
>   return retval;
>  }
>  
> +static int sctp_getsockopt_peeloff_kernel(struct sock *sk, int len,
> +   char __user *optval, int __user 
> *optlen)
> +{
> + sctp_peeloff_kernel_arg_t peeloff;
> + struct socket *newsock;
> + int retval = 0;
> +
> + /* We only allow this operation if parent socket also hadn't a
> +  * file descriptor allocated to it, mainly as a way to make sure
> +  * that this is really a kernel socket.
> +  */
> + if (sk->sk_socket->file)
> + return -EPERM;
> +
> + if (len < sizeof(sctp_peeloff_kernel_arg_t))
> + return -EINVAL;
> + len = sizeof(sctp_peeloff_kernel_arg_t);
> + if (copy_from_user(&peeloff, optval, len))
> + return -EFAULT;
> +
> + retval = sctp_do_peeloff(sk, peeloff.associd, &newsock);
> + if (retval < 0)
> + goto out;
> +
> + peeloff.socket = newsock;
> + if (copy_to_user(optval, &peeloff, len)) {
> + sock_release(newsock);
> +

Re: [PATCH nf-next] netfilter: nf_ct_sctp: minimal multihoming support

2015-07-16 Thread Marcelo Ricardo Leitner

On Tue, Jul 14, 2015 at 02:23:11PM +0200, Michal Kubecek wrote:
> Currently nf_conntrack_proto_sctp module handles only packets between
> primary addresses used to establish the connection. Any packets between
> secondary addresses are classified as invalid so that usual firewall
> configurations drop them. Allowing HEARTBEAT and HEARTBEAT-ACK chunks to
> establish a new conntrack would allow traffic between secondary
> addresses to pass through. A more sophisticated solution based on the
> addresses advertised in the initial handshake (and possibly also later
> dynamic address addition and removal) would be much harder to implement.
> Moreover, in general we cannot assume to always see the initial
> handshake as it can be routed through a different path.
> 
> The patch adds two new conntrack states:
> 
>   SCTP_CONNTRACK_HB_SENT  - a HEARTBEAT chunk seen but not acked
>   SCTP_CONNTRACK_HB_ACKED - a HEARTBEAT acked by HEARTBEAT-ACK
> 
> State transition rules:
> 
> - HB_SENT responds to usual chunks the same way as NONE (so that the
>   behaviour changes as little as possible)
> - HB_ACKED responds to usual chunks the same way as ESTABLISHED does,
>   except the resulting state is HB_ACKED rather than ESTABLISHED
> - previously existing states except NONE are preserved when HEARTBEAT or
>   HEARTBEAT-ACK is seen
> - NONE (in the initial direction) changes to HB_SENT on HEARTBEAT
>   and to CLOSED on HEARTBEAT-ACK
> - HB_SENT changes to HB_ACKED on HEARTBEAT-ACK in the reply direction
> - HB_SENT and HB_ACKED are preserved on HEARTBEAT/HEARTBEAT-ACK
>   otherwise
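
The transition rules listed above, restricted to HEARTBEAT and HEARTBEAT-ACK chunks, can be sketched as a small function. This is a simplified model and not the sctp_conntracks[] table from the patch: the enum names are hypothetical, all pre-existing states other than NONE are collapsed into whatever value is passed in, and cases the list does not mention (NONE in the reply direction) return an explicit "unspecified" marker instead of guessing.

```c
/* Hypothetical, reduced set of states for illustration only. */
enum ct_state {
	ST_NONE, ST_CLOSED, ST_ESTABLISHED,
	ST_HB_SENT, ST_HB_ACKED,
	ST_UNSPECIFIED	/* case not covered by the rules above */
};
enum hb_chunk { CHUNK_HEARTBEAT, CHUNK_HEARTBEAT_ACK };
enum ct_dir { DIR_ORIGINAL, DIR_REPLY };

/* Next state when a HEARTBEAT or HEARTBEAT-ACK chunk is seen. */
static enum ct_state hb_next_state(enum ct_state cur, enum ct_dir dir,
				   enum hb_chunk chunk)
{
	switch (cur) {
	case ST_NONE:
		if (dir != DIR_ORIGINAL)
			return ST_UNSPECIFIED;
		/* HEARTBEAT opens a secondary flow; a lone ACK does not. */
		return chunk == CHUNK_HEARTBEAT ? ST_HB_SENT : ST_CLOSED;
	case ST_HB_SENT:
		if (dir == DIR_REPLY && chunk == CHUNK_HEARTBEAT_ACK)
			return ST_HB_ACKED;
		return ST_HB_SENT;
	default:
		/* HB_ACKED and all previously existing states are preserved. */
		return cur;
	}
}
```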
> 
> Default timeout values for new states are
> 
>   HB_SENT: 30 seconds (default hb_interval)
>   HB_ACKED: 210 seconds (hb_interval * path_max_retry + max_rto)
> 
> (We cannot expect to see the shutdown sequence so that the HB_ACKED
> timeout shouldn't be too long.)
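
As a quick check of the HB_ACKED figure: the commit message gives the formula hb_interval * path_max_retry + max_rto but only states hb_interval (30 seconds). The sketch below assumes the other two values are the usual SCTP defaults suggested by RFC 4960 (Path.Max.Retrans = 5, RTO.Max = 60 seconds); those numbers are an assumption and are not taken from the patch.

```c
/* Assumed SCTP defaults in seconds (RFC 4960 suggested values). */
#define HB_INTERVAL_SECS	30
#define PATH_MAX_RETRANS	5
#define RTO_MAX_SECS		60

/* hb_interval * path_max_retry + max_rto, as in the commit message. */
static int hb_acked_timeout_secs(void)
{
	return HB_INTERVAL_SECS * PATH_MAX_RETRANS + RTO_MAX_SECS;
}
```

Under those assumptions the result is 30 * 5 + 60 = 210 seconds, matching the value in the patch.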
> 
> Signed-off-by: Michal Kubecek 
> ---
>  include/uapi/linux/netfilter/nf_conntrack_sctp.h |   2 +
>  net/netfilter/nf_conntrack_proto_sctp.c  | 110 
> ++-
>  2 files changed, 90 insertions(+), 22 deletions(-)
> 
> diff --git a/include/uapi/linux/netfilter/nf_conntrack_sctp.h 
> b/include/uapi/linux/netfilter/nf_conntrack_sctp.h
> index ceeefe6681b5..3ec7a6082457 100644
> --- a/include/uapi/linux/netfilter/nf_conntrack_sctp.h
> +++ b/include/uapi/linux/netfilter/nf_conntrack_sctp.h
> @@ -13,6 +13,8 @@ enum sctp_conntrack {
>   SCTP_CONNTRACK_SHUTDOWN_SENT,
>   SCTP_CONNTRACK_SHUTDOWN_RECD,
>   SCTP_CONNTRACK_SHUTDOWN_ACK_SENT,
> + SCTP_CONNTRACK_HB_SENT,
> + SCTP_CONNTRACK_HB_ACKED,
>   SCTP_CONNTRACK_MAX
>  };
>  
> diff --git a/net/netfilter/nf_conntrack_proto_sctp.c 
> b/net/netfilter/nf_conntrack_proto_sctp.c
> index b45da90fad32..efb6d5b16393 100644
> --- a/net/netfilter/nf_conntrack_proto_sctp.c
> +++ b/net/netfilter/nf_conntrack_proto_sctp.c
> @@ -42,6 +42,8 @@ static const char *const sctp_conntrack_names[] = {
>   "SHUTDOWN_SENT",
>   "SHUTDOWN_RECD",
>   "SHUTDOWN_ACK_SENT",
> + "HEARTBEAT_SENT",
> + "HEARTBEAT_ACKED",
>  };
>  
>  #define SECS  * HZ
> @@ -57,6 +59,8 @@ static unsigned int sctp_timeouts[SCTP_CONNTRACK_MAX] 
> __read_mostly = {
>   [SCTP_CONNTRACK_SHUTDOWN_SENT]  = 300 SECS / 1000,
>   [SCTP_CONNTRACK_SHUTDOWN_RECD]  = 300 SECS / 1000,
>   [SCTP_CONNTRACK_SHUTDOWN_ACK_SENT]  = 3 SECS,
> + [SCTP_CONNTRACK_HB_SENT]= 30 SECS,
> + [SCTP_CONNTRACK_HB_ACKED]   = 210 SECS,
>  };
>  
>  #define sNO SCTP_CONNTRACK_NONE
> @@ -67,6 +71,8 @@ static unsigned int sctp_timeouts[SCTP_CONNTRACK_MAX] 
> __read_mostly = {
>  #define  sSS SCTP_CONNTRACK_SHUTDOWN_SENT
>  #define  sSR SCTP_CONNTRACK_SHUTDOWN_RECD
>  #define  sSA SCTP_CONNTRACK_SHUTDOWN_ACK_SENT
> +#define  sHS SCTP_CONNTRACK_HB_SENT
> +#define  sHA SCTP_CONNTRACK_HB_ACKED
>  #define  sIV SCTP_CONNTRACK_MAX
>  
>  /*
> @@ -88,6 +94,10 @@ SHUTDOWN_ACK_SENT - We have seen a SHUTDOWN_ACK chunk in 
> the direction opposite
>   to that of the SHUTDOWN chunk.
>  CLOSED- We have seen a SHUTDOWN_COMPLETE chunk in the direction 
> of
>   the SHUTDOWN chunk. Connection is closed.
> +HEARTBEAT_SENT- We have seen a HEARTBEAT in a new flow.
> +HEARTBEAT_ACKED   - We have seen a HEARTBEAT-ACK in the direction opposite to
> + that of the HEARTBEAT chunk. Secondary connection is
> + established.
>  */
>  
>  /* TODO
> @@ -97,36 +107,40 @@ CLOSED- We have seen a SHUTDOWN_COMPLETE 
> chunk in the direction of
>   - Check the error type in the reply dir before transitioning from
>  cookie echoed to closed.
>   - Sec 5.2.4 of RFC 2960
> - - Multi Homing support.
> + - Full Multi Homing support.
>  */
>  
>  /* SCTP conntrack state transitions */
> -static const u8 sctp_conntracks[2][9][SCTP_CONNTRACK_MAX] = {
> +static const u8 sctp_conntracks[2][11][SCTP_CONNTRACK_MAX] = {
>

Re: [PATCH v2 1/2] sctp: add new getsockopt option SCTP_SOCKOPT_PEELOFF_KERNEL

2015-07-16 Thread Marcelo Ricardo Leitner

On Thu, Jul 16, 2015 at 09:50:16AM -0400, Vlad Yasevich wrote:
> On 07/14/2015 01:13 PM, Marcelo Ricardo Leitner wrote:
> > SCTP has this operation to peel off associations from a given socket and
> > create a new socket using this association. We currently have two ways
> > to use this operation:
> > - via getsockopt(), on which it will also create and return a file
> >   descriptor for this new socket
> > - via sctp_do_peeloff(), which is for kernel only
> > 
> > The caveat with using sctp_do_peeloff() directly is that it creates a
> > dependency to SCTP module, while all other operations are handled via
> > kernel_{socket,sendmsg,getsockopt...}() interface. This causes the
> > kernel to load SCTP module even when it's not really used.
> > 
> > This patch then creates a new sockopt that is to be used only by kernel
> > users of this protocol. This new sockopt will not allocate a file
> > descriptor but instead just return the socket pointer directly.
> > 
> > Kernel users are actually identified by whether the parent socket has
> > a fd attached to it or not. If not, it's a kernel user.
> > 
> > If called by a user application, it will just return -EPERM.
> > 
> > Even though it's not intended for user applications, it's listed under
> > the uapi header. That's because hiding this wouldn't add any extra security
> > and to keep the sockopt list in one place, so it's easy to check
> > available numbers to use.
> > 
> > Signed-off-by: Marcelo Ricardo Leitner 
> > ---
> >  include/uapi/linux/sctp.h | 12 
> >  net/sctp/socket.c | 37 +
> >  2 files changed, 49 insertions(+)
> > 
> > diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
> > index 
> > ce70fe6b45df3e841c35accbdb6379c16563893c..b3aad3ce456ab3c1ebf4d81fdb7269ba40b3d92a
> >  100644
> > --- a/include/uapi/linux/sctp.h
> > +++ b/include/uapi/linux/sctp.h
> > @@ -105,6 +105,10 @@ typedef __s32 sctp_assoc_t;
> >  #define SCTP_SOCKOPT_BINDX_ADD 100 /* BINDX requests for adding 
> > addrs */
> >  #define SCTP_SOCKOPT_BINDX_REM 101 /* BINDX requests for removing 
> > addrs. */
> >  #define SCTP_SOCKOPT_PEELOFF   102 /* peel off association. */
> > +#define SCTP_SOCKOPT_PEELOFF_KERNEL103 /* peel off association.
> > +* only valid for kernel
> > +* users
> > +*/
> 
> I am not sure how much I like stuff like this in the uapi.  This stuff is
> exposed to the user and I'd much rather we try and hide this from
> the user completely.

We can hide it, but as is it would create hidden IDs, and adding new ones
gets complicated: one would have to check two different lists to find free
IDs. Neil's suggestion is much cleaner in this respect, but has the
caveat of changing the sockopt arg format.

> I understand that you are dealing with a rather ugly dependency, but this is
> not the only one in the kernel.  There are dependencies like this elsewhere
> as well.

Doesn't mean we cannot fix one or another every now and then, right?

> I am not familiar enough with DLM and its history, but my question is this:
> If dlm always peels off a socket for new associations, why is it using the
> 1-to-many api in the first place?  Doing a quick scan of the DLM lowcomms code
> for sctp specific things, I see nothing that has specific dependencies
> on the 1-to-many api.  It might be simpler to switch to using the 1-to-1 api,
> similar to dlm tcp, and eliminate this dependency.
> 
> Is that a naive point of view?

Not at all, that's a very good question. I also don't know much of DLM
code itself, I'll check that.

Thanks,
Marcelo

> Thanks
> -vlad
> 
> >  /* Options 104-106 are deprecated and removed. Do not use this space */
> >  #define SCTP_SOCKOPT_CONNECTX_OLD  107 /* CONNECTX old requests. */
> >  #define SCTP_GET_PEER_ADDRS108 /* Get all peer 
> > address. */
> > @@ -892,6 +896,14 @@ typedef struct {
> > int sd;
> >  } sctp_peeloff_arg_t;
> >  
> > +/* This is the union that is passed as an argument(optval) to
> > + * getsockopt(SCTP_SOCKOPT_PEELOFF_KERNEL).
> > + */
> > +typedef union {
> > +   sctp_assoc_t associd;
> > +   struct socket *socket;
> > +} sctp_peeloff_kernel_arg_t;
> > +
> >  /*
> >   *  Peer Address Thresholds socket option
> >   */
> > diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> > index 
> > f1a65398f3118ab5d3a884e9c875620560e6b5ef..7968de7a1aeabd5cd0a0398461dbf2081bd4c5b7
> >  100644
> > --- a/net/sctp/socket.c
> > +++ b/net/sctp/socket.c
> > @@ -4504,6 +4504,39 @@ out:
> > return retval;
> >  }
> >  
> > +static int sctp_getsockopt_peeloff_kernel(struct sock *sk, int len,
> > + char __user *optval, int __user 
> > *optlen)
> > +{
> > +   sctp_peeloff_kernel_arg_t peeloff;
> > +   struct socket *newsock;
> > +   int retval = 0;
> > +
> > +   /* We only allow this operation if par

Re: [RFC PATCH net-next] sctp: fix src address selection if using secondary addresses

2015-07-16 Thread Marcelo Ricardo Leitner

On Thu, Jul 16, 2015 at 09:09:57AM -0400, Vlad Yasevich wrote:
> On 07/15/2015 03:03 PM, Marcelo Ricardo Leitner wrote:
> > On Fri, Jul 10, 2015 at 03:27:02PM -0300, Marcelo Ricardo Leitner wrote:
> >> On Fri, Jul 10, 2015 at 01:14:21PM -0400, Vlad Yasevich wrote:
> >>> On 07/10/2015 12:17 PM, Marcelo Ricardo Leitner wrote:
> >>>> On Fri, Jul 10, 2015 at 11:35:28AM -0400, Vlad Yasevich wrote:
> >>>>> ...
> >>>>> have been numerous times where I've seen weak host model in use on the wire
> >>>>> even with a BSD peer.
> >>>>>
> >>>>> This also puts a very big nail through many suggestions we've had over the years
> >>>>> to allow source based path multihoming in addition to destination based multihoming
> >>>>> we currently support.
> >>>>>
> >>>>> It might be a good idea to make rp-filter like behavior best effort, and have
> >>>>> the old behavior as fallback.  I am still trying to think up different scenarios
> >>>>> where rp-filter behavior will cause things to fail prematurely...
> >>>>
> >>>> The old behavior is like "if we don't have a src yet and can't find a
> >>>> preferred src for this dst, use the 1st bound address". We can add it
> >>>> but as I said, I'm afraid it is just doing wrong and not worth. If such
> >>>> randomly src addressed packet is meant to be routed, the router will
> >>>> likely drop it as it is seen as a spoof. And if it reaches the peer, it
> >>>> will probably come back through a different path.
> >>>>
> >>>> I'm tempted to say that current usual use cases are handled by the first
> >>>> check on this function, which returns the preferred/primary address for
> >>>> the interface and checks against bound addresses. Whenever you reach the
> >>>> second check, it just allows you to use that 1st bound address that is
> >>>> checked. I mean, I can't see use cases that we would be breaking with
> >>>> this change.
> >>>
> >>> Yes,  the secondary check didn't amount to much, but we've kept it since 
> >>> 2.5
> >>> days (when sctp was introduced).  I've made attempts over the years to
> >>> try to make it stricter, but that never amounted to anything that worked 
> >>> well.
> >>>
> >>>>
> >>>> But yeah, it impacts source based routing, and I'm not aware of previous
> >>>> discussions on it. I'll try to dig some up but if possible, please share
> >>>> some pointers on it.
> >>>
> >>> It's been suggested a few times that we should support source based 
> >>> multihoming
> >>> particularly for the case where one peer has only 1 address.
> >>> We've always punted on this, but people still ask every now and then.
> >>
> >> Ah okay, now I see it.
> >>  
> >>> I do have a question about the code though.. Have you tried with multipath routing
> >>> enabled?  I see rp_filter checks have special code to handle that.  Seems like we
> >>> might get false negatives in sctp.
> >>
> >> In the sense of CONFIG_IP_ROUTE_MULTIPATH=y, yes, but just that. My
> >> routes were simple ones, either 2 peers attaches to 2 local subnets, or
> >> with a gateway in the middle (with 2 subnets on each side, but mapped
> >> 1-1, no crossing. Aka subnet1<->subnet2 and subnet3<->subnet4 while not
> >> (subnet1<->subnet4 or subnet3<->subnet2).
> >>
> >> Note that this is not rp_filter strictly speaking, as it's mirrored.
> >> rp_filter needs to calculate all possible output routes (actually until
> >> it finds a valid one) for finding one that would match the one used for
> >> incoming. 
> >>
> >> This check already has an output path, and it's calculating if such
> >> input would be acceptable. We can't really expect/check for other hits
> >> because it invalidates the chosen output path.
> >>
> >> Hmmm... but we could support multipath in the output selection, ie in
> >> the outputs of ip_route_output_key(), probably in another patch then?
> > 
> > Thinking further.. we could just compare it with the addresses assigned to 
> > the
> > interface instead of doing a whole new routing. Cheaper/faster, provides the
> > results I'm looking for and the consequences are easier to see.
> > 
> > Something like (not tested, just illustrating the idea):
> > 
> > --- a/net/sctp/protocol.c
> > +++ b/net/sctp/protocol.c
> > @@ -489,22 +489,33 @@ static void sctp_v4_get_dst(struct sctp_transport *t, 
> > union sctp_addr *saddr,
> > list_for_each_entry_rcu(laddr, &bp->address_list, list) {
> > if (!laddr->valid)
> > continue;
> > if ((laddr->state == SCTP_ADDR_SRC) &&
> > (AF_INET == laddr->a.sa.sa_family)) {
> > +   struct net_device *odev;
> > +
> > fl4->fl4_sport = laddr->a.v4.sin_port;
> > flowi4_update_output(fl4,
> >  asoc->base.sk->sk_bound_dev_if,
> >  RT_CONN_FLAGS(asoc->base.sk),
> >  dadd

Re: [PATCH v2] jhash: Deinline jhash, jhash2 and __jhash_nwords

2015-07-16 Thread Alexander Duyck


On 07/16/2015 05:40 AM, Denys Vlasenko wrote:

This patch deinlines jhash, jhash2 and __jhash_nwords.

It also removes rhashtable_jhash2(key, length, seed)
because it was merely calling jhash2(key, length, seed).
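
For reference, a userspace sketch of what an out-of-line jhash2() looks like, built from the __jhash_mix/__jhash_final macros quoted in the diff. The names `jhash2_sketch`, `JHASH_MIX` and `JHASH_FINAL` and the local `rol32()` helper are stand-ins for this example; the tail of the switch is cut off in the quoted patch, so the last cases here are filled in by analogy with the byte-wise jhash() tail and should be treated as an assumption.

```c
#include <stdint.h>

#define JHASH_INITVAL 0xdeadbeefu

/* Rotate left; only called with shifts in 1..31 here. */
static inline uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << shift) | (word >> (32 - shift));
}

/* Mix 3 32-bit values reversibly (same steps as __jhash_mix). */
#define JHASH_MIX(a, b, c) do {				\
	a -= c;  a ^= rol32(c, 4);  c += b;		\
	b -= a;  b ^= rol32(a, 6);  a += c;		\
	c -= b;  c ^= rol32(b, 8);  b += a;		\
	a -= c;  a ^= rol32(c, 16); c += b;		\
	b -= a;  b ^= rol32(a, 19); a += c;		\
	c -= b;  c ^= rol32(b, 4);  b += a;		\
} while (0)

/* Final mixing of (a,b,c) into c (same steps as __jhash_final). */
#define JHASH_FINAL(a, b, c) do {			\
	c ^= b; c -= rol32(b, 14);			\
	a ^= c; a -= rol32(c, 11);			\
	b ^= a; b -= rol32(a, 25);			\
	c ^= b; c -= rol32(b, 16);			\
	a ^= c; a -= rol32(c, 4);			\
	b ^= a; b -= rol32(a, 14);			\
	c ^= b; c -= rol32(b, 24);			\
} while (0)

/* Hash an array of u32 words; length is the number of words. */
static uint32_t jhash2_sketch(const uint32_t *k, uint32_t length,
			      uint32_t initval)
{
	uint32_t a, b, c;

	a = b = c = JHASH_INITVAL + (length << 2) + initval;

	while (length > 3) {
		a += k[0];
		b += k[1];
		c += k[2];
		JHASH_MIX(a, b, c);
		length -= 3;
		k += 3;
	}

	switch (length) {
	case 3: c += k[2];	/* fall through */
	case 2: b += k[1];	/* fall through */
	case 1: a += k[0];
		JHASH_FINAL(a, b, c);
		break;
	case 0:	/* nothing left to add */
		break;
	}
	return c;
}
```

In the kernel the body would live in lib/jhash.c with only a prototype left in the header, so each of the callsites listed below becomes a function call instead of an inlined copy.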

With this .config: http://busybox.net/~vda/kernel_config,
after deinlining these functions have sizes and callsite counts
as follows:

__jhash_nwords: 72 bytes, 75 calls
jhash: 297 bytes, 111 calls
jhash2: 205 bytes, 136 calls

Total size decrease is about 38,000 bytes:

 text data  bss   dec hex filename
90663567 17221960 36659200 144544727 89d93d7 vmlinux5
90625577 17221864 36659200 144506641 89cff11 vmlinux.after
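
The "about 38,000 bytes" figure can be checked directly from the two size(1) rows above. A trivial sketch (the helper name `size_delta` is made up for this example):

```c
/* Difference between a column of the "before" and "after" rows. */
static long size_delta(long before, long after)
{
	return before - after;
}
```

Applied to the table: text shrinks by 90663567 - 90625577 = 37990 bytes, data by 17221960 - 17221864 = 96 bytes, bss is unchanged, and the dec total drops by 38086 bytes.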

Signed-off-by: Denys Vlasenko 
CC: Thomas Graf 
CC: Alexander Duyck 
CC: Jozsef Kadlecsik 
CC: Herbert Xu 
CC: netdev@vger.kernel.org
CC: linux-ker...@vger.kernel.org
---
Changes in v2: created a new source file, jhash.c

  include/linux/jhash.h | 123 +
  lib/Makefile  |   2 +-
  lib/jhash.c   | 149 ++
  lib/rhashtable.c  |  13 +++--
  4 files changed, 160 insertions(+), 127 deletions(-)
  create mode 100644 lib/jhash.c

diff --git a/include/linux/jhash.h b/include/linux/jhash.h
index 348c6f4..0b3f55d 100644
--- a/include/linux/jhash.h
+++ b/include/linux/jhash.h
@@ -31,131 +31,14 @@
  /* Mask the hash value, i.e (value & jhash_mask(n)) instead of (value % n) */
  #define jhash_mask(n)   (jhash_size(n)-1)
  
-/* __jhash_mix -- mix 3 32-bit values reversibly. */

-#define __jhash_mix(a, b, c)   \
-{  \
-   a -= c;  a ^= rol32(c, 4);  c += b; \
-   b -= a;  b ^= rol32(a, 6);  a += c; \
-   c -= b;  c ^= rol32(b, 8);  b += a; \
-   a -= c;  a ^= rol32(c, 16); c += b; \
-   b -= a;  b ^= rol32(a, 19); a += c; \
-   c -= b;  c ^= rol32(b, 4);  b += a; \
-}
-
-/* __jhash_final - final mixing of 3 32-bit values (a,b,c) into c */
-#define __jhash_final(a, b, c) \
-{  \
-   c ^= b; c -= rol32(b, 14);  \
-   a ^= c; a -= rol32(c, 11);  \
-   b ^= a; b -= rol32(a, 25);  \
-   c ^= b; c -= rol32(b, 16);  \
-   a ^= c; a -= rol32(c, 4);   \
-   b ^= a; b -= rol32(a, 14);  \
-   c ^= b; c -= rol32(b, 24);  \
-}
-
  /* An arbitrary initial parameter */
  #define JHASH_INITVAL 0xdeadbeef
  
-/* jhash - hash an arbitrary key

- * @k: sequence of bytes as key
- * @length: the length of the key
- * @initval: the previous hash, or an arbitray value
- *
- * The generic version, hashes an arbitrary sequence of bytes.
- * No alignment or length assumptions are made about the input key.
- *
- * Returns the hash value of the key. The result depends on endianness.
- */
-static inline u32 jhash(const void *key, u32 length, u32 initval)
-{
-   u32 a, b, c;
-   const u8 *k = key;
-
-   /* Set up the internal state */
-   a = b = c = JHASH_INITVAL + length + initval;
-
-   /* All but the last block: affect some 32 bits of (a,b,c) */
-   while (length > 12) {
-   a += __get_unaligned_cpu32(k);
-   b += __get_unaligned_cpu32(k + 4);
-   c += __get_unaligned_cpu32(k + 8);
-   __jhash_mix(a, b, c);
-   length -= 12;
-   k += 12;
-   }
-   /* Last block: affect all 32 bits of (c) */
-   /* All the case statements fall through */
-   switch (length) {
-   case 12: c += (u32)k[11]<<24;
-   case 11: c += (u32)k[10]<<16;
-   case 10: c += (u32)k[9]<<8;
-   case 9:  c += k[8];
-   case 8:  b += (u32)k[7]<<24;
-   case 7:  b += (u32)k[6]<<16;
-   case 6:  b += (u32)k[5]<<8;
-   case 5:  b += k[4];
-   case 4:  a += (u32)k[3]<<24;
-   case 3:  a += (u32)k[2]<<16;
-   case 2:  a += (u32)k[1]<<8;
-   case 1:  a += k[0];
-__jhash_final(a, b, c);
-   case 0: /* Nothing left to add */
-   break;
-   }
-
-   return c;
-}
-
-/* jhash2 - hash an array of u32's
- * @k: the key which must be an array of u32's
- * @length: the number of u32's in the key
- * @initval: the previous hash, or an arbitray value
- *
- * Returns the hash value of the key.
- */
-static inline u32 jhash2(const u32 *k, u32 length, u32 initval)
-{
-   u32 a, b, c;
-
-   /* Set up the internal state */
-   a = b = c = JHASH_INITVAL + (length<<2) + initval;
-
-   /* Handle most of the key */
-   while (length > 3) {
-   a += k[0];
-   b += k[1];
-   c += k[2];
-   __jhash_mix(a, b, c);
-   length -= 3;
-   k += 3;
-   }
-
-   /* Handle the last 3 u32's: all the case statements fall through */
-   switch (length) {
-   case 3: c += k[2];
-   case 2: b

Re: [PATCH net-next] rocker: forward packets to CPU when port is joined to openvswitch

2015-07-16 Thread John Fastabend

On 15-07-16 01:14 AM, Jiri Pirko wrote:
> Thu, Jul 16, 2015 at 09:09:39AM CEST, sfel...@gmail.com wrote:
>> On Wed, Jul 15, 2015 at 11:58 PM, Jiri Pirko  wrote:
>>> Thu, Jul 16, 2015 at 08:40:31AM CEST, sfel...@gmail.com wrote:
>>>> On Wed, Jul 15, 2015 at 6:39 PM, Simon Horman
>>>> wrote:
>>>>> Teach rocker to forward packets to CPU when a port is joined to Open vSwitch.
>>>>> There is scope to later refine what is passed up as per Open vSwitch flows
>>>>> on a port.
>>>>>
>>>>> This does not change the behaviour of rocker ports that are
>>>>> not joined to Open vSwitch.
>>>>>
>>>>> Signed-off-by: Simon Horman
>>>>
>>>> Acked-by: Scott Feldman
>>>>
>>>> Now, OVS flows on a port.  Strange enough, that was the first RFC
>>>> implementation for switchdev/rocker where we hooked into ovs-kernel
>>>> module and programmed flows into hw.  We pulled all of that code
>>>> because, IIRC, the ovs folks didn't want us hooking into the kernel
>>>> module directly.  We dropped the ovs hooks and focused on hooking
>>>> kernel's L2/L3.  The device (rocker) didn't really change: OF-DPA
>>>> pipeline was used for both.  Might be interesting to try hooking it
>>>> again.
>>>
>>>
>>> I think that now we have an infrastructure prepared for that. I mean,
>>> what we need to do is to introduce another generic switchdev object
>>> called "ntupleflow" and hook-up again into ovs datapath and cls_flower
>>> and insert/remove the object from those codes. Should be pretty easy to do.
>>
>> That sounds right.  Is the ovs datapath hooking still happening in the
>> ovs-kernel module?  Remind me again, what was the objection the last
>> time we tried that?
> 
> Yep, we need to hook there. Otherwise it won't be transparent.
> 
> Last time the objection was that this would be ovs specific. But that is
> in the past today. We have switchdev infra with objects, we have cls_flower
> which would use the same object. I say let's do this now.
> 

My objection wasn't that it was OVS specific but was based on two
observations. First, the user-kernel interface for OVS would need
to be changed to optimally use hardware, and then userspace would need
to be changed to pack rules optimally for hardware. The reason is that
hardware typically has wildcards _and_ priority fields. This is a
different structure than we would want to use in software. Maybe
there is value in having a sub-optimal 'transparent' implementation
though. Note I can't see how you can possibly reverse engineer this
from what the kernel gets from userspace today and build out an
optimal solution.
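
To make the "wildcards and priority" point concrete, here is a toy model of a hardware-style match table, in the spirit of a TCAM. It is an illustration only and not the rocker or OVS data structure; `struct hw_rule` and `hw_lookup()` are hypothetical names, and a single 32-bit key stands in for a full flow key.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware-style rule: masked match plus priority. */
struct hw_rule {
	uint32_t value;
	uint32_t mask;		/* 1 bits are compared, 0 bits are wildcards */
	int priority;		/* larger value wins when several rules match */
	int action;
};

/* Returns the highest-priority matching rule, or NULL if none matches. */
static const struct hw_rule *hw_lookup(const struct hw_rule *rules, size_t n,
				       uint32_t key)
{
	const struct hw_rule *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if ((key & rules[i].mask) != (rules[i].value & rules[i].mask))
			continue;
		if (!best || rules[i].priority > best->priority)
			best = &rules[i];
	}
	return best;
}
```

Because overlapping rules are resolved by priority, the same behaviour can be expressed with few wide rules in hardware, whereas a flow set handed down without priorities does not say how overlapping entries should be packed.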

Second I was hoping to use the interface as a "better" ethtool flow
classifier with a control plane in user space that controllers in the
network could interface with. In this mode I'm not running on a TOR but
at the edge. In this case I want to do some pre-processing of packets
before sending them up to the kernel to complete processing. Examples
like partial completion of classification and rule chaining where I
implement some rules in software and others in hardware. Perhaps this is
not OVS and I should just write a better ethtool flow classifier. But
with a 'bit' similar to how we do L2 I would get this.

All that said seeing a switchdev infra object could be interesting.

.John

[PATCH v4 0/3] net: enable inband link state negotiation only when explicitly requested

2015-07-16 Thread Stas Sergeev

Hello.

Currently the link status auto-negotiation is enabled
for any SGMII link with fixed-link DT binding.
The regression was reported:
https://lkml.org/lkml/2015/7/8/865
Apparently not all HW that implements SGMII protocol, generates the
inband status for the auto-negotiation to work.
More details here:
https://lkml.org/lkml/2015/7/10/206

The following patches revert to the old behavior by default,
which is to not enable the auto-negotiation for fixed-link.
A new DT property is added that allows the auto-negotiation to be
requested explicitly.

Those who were affected by the change, please send your Tested-by,
Thanks!

[PATCH 1/3] fixed_phy: handle link-down case

2015-07-16 Thread Stas Sergeev


Currently the fixed_phy driver recognizes only the link-up state.
This simple patch adds an implementation of the link-down state.
It fixes the status registers when the link is down, and also allows
registering the fixed-phy with link down without specifying the speed.
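
A reduced userspace sketch of the idea: when the link is down, none of the link-dependent bits are set. This is not the driver code; the struct and function names are hypothetical, speed handling is left out, and the registers start from zero for simplicity. The bit values are the ones from the kernel's uapi linux/mii.h.

```c
#include <stdint.h>

/* Bit values as defined in the kernel's uapi linux/mii.h. */
#define BMSR_LSTATUS		0x0004
#define BMSR_ANEGCOMPLETE	0x0020
#define BMCR_FULLDPLX		0x0100
#define LPA_PAUSE_CAP		0x0400
#define LPA_PAUSE_ASYM		0x0800

/* Hypothetical, reduced versions of the driver's status and registers. */
struct phy_status_sketch {
	int link, duplex, pause, asym_pause;
};
struct phy_regs_sketch {
	uint16_t bmsr, bmcr, lpa;
};

static struct phy_regs_sketch fixed_regs(const struct phy_status_sketch *st)
{
	struct phy_regs_sketch r = { 0, 0, 0 };

	/* Link down: skip duplex and pause, mirroring the "goto done". */
	if (!st->link)
		return r;

	r.bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;
	if (st->duplex)
		r.bmcr |= BMCR_FULLDPLX;
	if (st->pause)
		r.lpa |= LPA_PAUSE_CAP;
	if (st->asym_pause)
		r.lpa |= LPA_PAUSE_ASYM;
	return r;
}
```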

Signed-off-by: Stas Sergeev 

CC: Florian Fainelli 
CC: netdev@vger.kernel.org
CC: linux-ker...@vger.kernel.org
---
 drivers/net/phy/fixed_phy.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/phy/fixed_phy.c b/drivers/net/phy/fixed_phy.c
index 1960b46..479b93f 100644
--- a/drivers/net/phy/fixed_phy.c
+++ b/drivers/net/phy/fixed_phy.c
@@ -52,6 +52,10 @@ static int fixed_phy_update_regs(struct fixed_phy *fp)
u16 lpagb = 0;
u16 lpa = 0;

+   if (!fp->status.link)
+   goto done;
+   bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;
+
if (fp->status.duplex) {
bmcr |= BMCR_FULLDPLX;

@@ -96,15 +100,13 @@ static int fixed_phy_update_regs(struct fixed_phy *fp)
}
}

-   if (fp->status.link)
-   bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;
-
if (fp->status.pause)
lpa |= LPA_PAUSE_CAP;

if (fp->status.asym_pause)
lpa |= LPA_PAUSE_ASYM;

+done:
fp->regs[MII_PHYSID1] = 0;
fp->regs[MII_PHYSID2] = 0;

-- 
1.9.1
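
For readers following the hunk above, here is a minimal userspace sketch of the control flow it introduces: when the link is down, every status bit stays clear. The struct and function names are hypothetical stand-ins, and the real fixed_phy_update_regs() handles more fields (speed, capability bits) that are left out here. The bit values are the usual MII register definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as in the standard MII register definitions. */
#define BMSR_LSTATUS      0x0004
#define BMSR_ANEGCOMPLETE 0x0020
#define BMCR_FULLDPLX     0x0100
#define LPA_PAUSE_CAP     0x0400

/* Hypothetical, trimmed-down stand-ins for the driver's structures. */
struct sketch_status { int link, duplex, pause; };
struct sketch_regs { uint16_t bmsr, bmcr, lpa; };

static void sketch_update_regs(const struct sketch_status *st,
                               struct sketch_regs *regs)
{
    assert(st && regs);
    regs->bmsr = regs->bmcr = regs->lpa = 0;

    /* Link down: skip everything, mirroring the patch's "goto done". */
    if (!st->link)
        return;

    regs->bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;
    if (st->duplex)
        regs->bmcr |= BMCR_FULLDPLX;
    if (st->pause)
        regs->lpa |= LPA_PAUSE_CAP;
}
```

With link == 0 the duplex and pause settings are ignored, which is what lets a fixed PHY be registered in the link-down state without a speed.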

Re: [RFC net-next 22/22] openvswitch: Use regular GRE net_device instead of vport

2015-07-16 Thread Thomas Graf

On 07/16/15 at 05:59pm, Simon Horman wrote:
> On Fri, Jul 10, 2015 at 04:19:24PM +0200, Thomas Graf wrote:
> >  static void ipgre_tap_setup(struct net_device *dev)
> >  {
> > ether_setup(dev);
> > -   dev->netdev_ops = &gre_tap_netdev_ops;
> > dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
> > ip_tunnel_setup(dev, gre_tap_net_id);
> > +
> > +   if (!strcmp(dev->name, GRE_TAP_FB_NAME))
> > +   dev->netdev_ops = &gre_fb_netdev_ops;
> > +   else
> > +   dev->netdev_ops = &gre_tap_netdev_ops;
> >  }
> >  
> >  static int ipgre_newlink(struct net *src_net, struct net_device *dev,
> 
> [snip]
> 
> Is there a side-effect of the above that if a user creates a gretap device
> whose name is "gretap0" then the device will use gre_fb_netdev_ops instead
> of gre_tap_netdev_ops. If so, does that imply a change in behaviour for
> gretap devices created with that name?

I'm inclined to change this and use an in-kernel API as well to
create the net_device just like VXLAN does in patch 21.

Pravin, what do you think?
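
To make Simon's concern concrete, here is a small sketch, assuming GRE_TAP_FB_NAME expands to "gretap0" as his question implies. The enum and both functions are hypothetical: the first mirrors the name-based selection in the quoted hunk, the second shows the kind of explicit selection an in-kernel creation API could carry instead.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_GRE_TAP_FB_NAME "gretap0"   /* assumed value */

enum sketch_ops { SKETCH_TAP_OPS, SKETCH_FB_OPS };

/* Name-based selection, as in the quoted ipgre_tap_setup() hunk. */
static enum sketch_ops sketch_pick_by_name(const char *name)
{
    return strcmp(name, SKETCH_GRE_TAP_FB_NAME) == 0 ? SKETCH_FB_OPS
                                                     : SKETCH_TAP_OPS;
}

/* Hypothetical alternative: the creator states what it wants. */
static enum sketch_ops sketch_pick_by_creator(int created_in_kernel)
{
    return created_in_kernel ? SKETCH_FB_OPS : SKETCH_TAP_OPS;
}
```

With the name-based variant, a user-created device that happens to be called "gretap0" gets the fallback ops; the explicit variant does not depend on the name at all.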

[PATCH 2/3] of_mdio: add new DT property 'managed' to specify the PHY management type

2015-07-16 Thread Stas Sergeev


Currently the PHY management type is selected arbitrarily by the MAC driver.
The decision is based on the presence of the "fixed-link" node and on the
preferences of the driver's authors.
This caused a regression recently, when the mvneta driver suddenly started
to use the in-band status for auto-negotiation on fixed links.
It appears that auto-negotiation may not work when the MAC driver expects it to.
Sebastien Rannou explains:
<< Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's
a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with
inband status) is connected to the switch through QSGMII, and in this context
we are on the media side of the PHY. >>
https://lkml.org/lkml/2015/7/10/206

This patch introduces the new string property 'managed' that allows
the user to set the management type explicitly.
The supported values are:
"auto" - default. Uses either MDIO or nothing, depending on the presence
of the fixed-link node
"in-band-status" - use in-band status

Signed-off-by: Stas Sergeev 

CC: Rob Herring 
CC: Pawel Moll 
CC: Mark Rutland 
CC: Ian Campbell 
CC: Kumar Gala 
CC: Florian Fainelli 
CC: Grant Likely 
CC: devicet...@vger.kernel.org
CC: linux-ker...@vger.kernel.org
CC: netdev@vger.kernel.org
---
 Documentation/devicetree/bindings/net/ethernet.txt |  4 
 drivers/of/of_mdio.c   | 19 +--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/ethernet.txt b/Documentation/devicetree/bindings/net/ethernet.txt
index 3fc3605..cb115a3 100644
--- a/Documentation/devicetree/bindings/net/ethernet.txt
+++ b/Documentation/devicetree/bindings/net/ethernet.txt
@@ -19,7 +19,11 @@ The following properties are common to the Ethernet controllers:
 - phy: the same as "phy-handle" property, not recommended for new bindings.
 - phy-device: the same as "phy-handle" property, not recommended for new
   bindings.
+- managed: string, specifies the PHY management type. Supported values are:
+  "auto", "in-band-status". "auto" is the default, it uses MDIO for
+  management if fixed-link is not specified.

 Child nodes of the Ethernet controller are typically the individual PHY devices
 connected via the MDIO bus (sometimes the MDIO bus controller is separate).
 They are described in the phy.txt file in this same directory.
+For non-MDIO PHY management see fixed-link.txt.
diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index 1bd4305..5dc1ef95 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -262,7 +262,8 @@ EXPORT_SYMBOL(of_phy_attach);
 bool of_phy_is_fixed_link(struct device_node *np)
 {
struct device_node *dn;
-   int len;
+   int len, err;
+   const char *managed;

/* New binding */
dn = of_get_child_by_name(np, "fixed-link");
@@ -271,6 +272,10 @@ bool of_phy_is_fixed_link(struct device_node *np)
return true;
}

+   err = of_property_read_string(np, "managed", &managed);
+   if (err == 0 && strcmp(managed, "auto") != 0)
+   return true;
+
/* Old binding */
if (of_get_property(np, "fixed-link", &len) &&
len == (5 * sizeof(__be32)))
@@ -285,8 +290,18 @@ int of_phy_register_fixed_link(struct device_node *np)
struct fixed_phy_status status = {};
struct device_node *fixed_link_node;
const __be32 *fixed_link_prop;
-   int len;
+   int len, err;
struct phy_device *phy;
+   const char *managed;
+
+   err = of_property_read_string(np, "managed", &managed);
+   if (err == 0) {
+   if (strcmp(managed, "in-band-status") == 0) {
+   /* status is zeroed, namely its .link member */
+   phy = fixed_phy_register(PHY_POLL, &status, np);
+   return IS_ERR(phy) ? PTR_ERR(phy) : 0;
+   }
+   }

/* New binding */
fixed_link_node = of_get_child_by_name(np, "fixed-link");
-- 
1.9.1
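
The selection rule described in the commit message can be sketched in plain C as follows. This is only an illustration: the enum and function are hypothetical, and values other than "in-band-status" are treated like "auto" here, which is a simplification of what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sketch_mgmt { SKETCH_MDIO, SKETCH_FIXED_LINK, SKETCH_IN_BAND };

/*
 * managed: value of the "managed" property, or NULL when it is absent.
 * has_fixed_link: non-zero when a fixed-link node/property exists.
 */
static enum sketch_mgmt sketch_mgmt_type(const char *managed,
                                         int has_fixed_link)
{
    if (managed && strcmp(managed, "in-band-status") == 0)
        return SKETCH_IN_BAND;

    /* "auto" (the default): fixed-link if present, MDIO otherwise. */
    return has_fixed_link ? SKETCH_FIXED_LINK : SKETCH_MDIO;
}
```

The point of the property is that in-band status is only ever chosen when the DT asks for it, never inferred from fixed-link alone.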

[PATCH 3/3] mvneta: use inband status only when explicitly enabled

2015-07-16 Thread Stas Sergeev


Commit 898b2970e2c9 ("mvneta: implement SGMII-based in-band link state
signaling") implemented auto-negotiation of the link parameters unconditionally.
Unfortunately it appears that some HW that implements the SGMII protocol
doesn't generate the in-band status, so it is not possible to auto-negotiate
anything with such HW.

This patch enables the auto-negotiation only if explicitly requested with
the 'managed' DT property.

This patch fixes the following regression:
https://lkml.org/lkml/2015/7/8/865

Signed-off-by: Stas Sergeev 

CC: Thomas Petazzoni 
CC: netdev@vger.kernel.org
CC: linux-ker...@vger.kernel.org
---
 drivers/net/ethernet/marvell/mvneta.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 74176ec..7a1deee 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -3008,8 +3008,8 @@ static int mvneta_probe(struct platform_device *pdev)
const char *dt_mac_addr;
char hw_mac_addr[ETH_ALEN];
const char *mac_from;
+   const char *managed;
int phy_mode;
-   int fixed_phy = 0;
int err;

/* Our multiqueue support is not complete, so for now, only
@@ -3043,7 +3043,6 @@ static int mvneta_probe(struct platform_device *pdev)
dev_err(&pdev->dev, "cannot register fixed PHY\n");
goto err_free_irq;
}
-   fixed_phy = 1;

/* In the case of a fixed PHY, the DT node associated
 * to the PHY is the Ethernet MAC DT node.
@@ -3067,8 +3066,10 @@ static int mvneta_probe(struct platform_device *pdev)
pp = netdev_priv(dev);
pp->phy_node = phy_node;
pp->phy_interface = phy_mode;
-   pp->use_inband_status = (phy_mode == PHY_INTERFACE_MODE_SGMII) &&
-   fixed_phy;
+
+   err = of_property_read_string(dn, "managed", &managed);
+   pp->use_inband_status = (err == 0 &&
+strcmp(managed, "in-band-status") == 0);

pp->clk = devm_clk_get(&pdev->dev, NULL);
if (IS_ERR(pp->clk)) {
-- 
1.9.1
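
The behavioural change in the last hunk comes down to two predicates, sketched below with hypothetical helper names. The old one inferred in-band status from SGMII plus a fixed PHY; the new one looks only at the 'managed' property (NULL stands for "property absent").

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Before the patch: implied by the PHY mode and a fixed PHY. */
static int sketch_inband_old(int is_sgmii, int has_fixed_phy)
{
    return is_sgmii && has_fixed_phy;
}

/* After the patch: only when the DT explicitly requests it. */
static int sketch_inband_new(const char *managed)
{
    return managed != NULL && strcmp(managed, "in-band-status") == 0;
}
```

A board with SGMII and fixed-link but no 'managed' property therefore goes from in-band status enabled to disabled, which is the old pre-regression behaviour.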

Re: [PATCH v7 1/3] if_link: Add control trust VF

2015-07-16 Thread Or Gerlitz

On Thu, Jul 16, 2015 at 1:33 PM, Hiroshi Shimamoto
 wrote:
> From: Hiroshi Shimamoto 
>
> Add netlink directives and ndo entry to trust VF user.

You haven't posted a cover letter describing the changes from V6 to V7
and from the older versions to V6.

Or.


> This controls the special permission of VF user.
> The administrator will dedicatedly trust VF user to use some features
> which impacts security and/or performance.
>
> The administrator never turn it on unless VF user is fully trusted.
>
> Signed-off-by: Hiroshi Shimamoto 
> CC: Choi, Sy Jong 

Re: [PATCH] bnx2x: Update to FW version 7.12.30

2015-07-16 Thread Kyle McMartin

On Thu, Jul 16, 2015 at 10:10:43AM +0300, Yuval Mintz wrote:
> The new FW will allow us to utilize some new features in our driver,
> mainly adding vlan stripping offload and vxlan offload support.
> 
> In addition, this fixes several issues:
>  - Packets from a VF with pvid configured which were sent with a
>different vlan were transmitted instead of being discarded.
> 
>  - FCoE traffic might not recover after a failue while there's traffic
>to another function.
> 
> Signed-off-by: Yuval Mintz 
> ---
> As mentioned, this was previously sent only to Ben/David.
> Now designating it to the proper mailing list.

applied.

regards, kyle

Re: [PATCH v2] jhash: Deinline jhash, jhash2 and __jhash_nwords

2015-07-16 Thread Tom Herbert

On Thu, Jul 16, 2015 at 5:40 AM, Denys Vlasenko  wrote:
> This patch deinlines jhash, jhash2 and __jhash_nwords.
>
> It also removes rhashtable_jhash2(key, length, seed)
> because it was merely calling jhash2(key, length, seed).
>
> With this .config: http://busybox.net/~vda/kernel_config,
> after deinlining these functions have sizes and callsite counts
> as follows:
>
> __jhash_nwords: 72 bytes, 75 calls
> jhash: 297 bytes, 111 calls
> jhash2: 205 bytes, 136 calls
>
jhash is used in several places in the critical data path. Does the
decrease in text size justify performance impact of not inlining it?

Tom

> Total size decrease is about 38,000 bytes:
>
> text data  bss   dec hex filename
> 90663567 17221960 36659200 144544727 89d93d7 vmlinux5
> 90625577 17221864 36659200 144506641 89cff11 vmlinux.after
>
> Signed-off-by: Denys Vlasenko 
> CC: Thomas Graf 
> CC: Alexander Duyck 
> CC: Jozsef Kadlecsik 
> CC: Herbert Xu 
> CC: netdev@vger.kernel.org
> CC: linux-ker...@vger.kernel.org
> ---
> Changes in v2: created a new source file, jhash.c
>
>  include/linux/jhash.h | 123 +
>  lib/Makefile  |   2 +-
>  lib/jhash.c   | 149 
> ++
>  lib/rhashtable.c  |  13 +++--
>  4 files changed, 160 insertions(+), 127 deletions(-)
>  create mode 100644 lib/jhash.c
>
> diff --git a/include/linux/jhash.h b/include/linux/jhash.h
> index 348c6f4..0b3f55d 100644
> --- a/include/linux/jhash.h
> +++ b/include/linux/jhash.h
> @@ -31,131 +31,14 @@
>  /* Mask the hash value, i.e (value & jhash_mask(n)) instead of (value % n) */
>  #define jhash_mask(n)   (jhash_size(n)-1)
>
> -/* __jhash_mix -- mix 3 32-bit values reversibly. */
> -#define __jhash_mix(a, b, c)   \
> -{  \
> -   a -= c;  a ^= rol32(c, 4);  c += b; \
> -   b -= a;  b ^= rol32(a, 6);  a += c; \
> -   c -= b;  c ^= rol32(b, 8);  b += a; \
> -   a -= c;  a ^= rol32(c, 16); c += b; \
> -   b -= a;  b ^= rol32(a, 19); a += c; \
> -   c -= b;  c ^= rol32(b, 4);  b += a; \
> -}
> -
> -/* __jhash_final - final mixing of 3 32-bit values (a,b,c) into c */
> -#define __jhash_final(a, b, c) \
> -{  \
> -   c ^= b; c -= rol32(b, 14);  \
> -   a ^= c; a -= rol32(c, 11);  \
> -   b ^= a; b -= rol32(a, 25);  \
> -   c ^= b; c -= rol32(b, 16);  \
> -   a ^= c; a -= rol32(c, 4);   \
> -   b ^= a; b -= rol32(a, 14);  \
> -   c ^= b; c -= rol32(b, 24);  \
> -}
> -
>  /* An arbitrary initial parameter */
>  #define JHASH_INITVAL  0xdeadbeef
>
> -/* jhash - hash an arbitrary key
> - * @k: sequence of bytes as key
> - * @length: the length of the key
> - * @initval: the previous hash, or an arbitray value
> - *
> - * The generic version, hashes an arbitrary sequence of bytes.
> - * No alignment or length assumptions are made about the input key.
> - *
> - * Returns the hash value of the key. The result depends on endianness.
> - */
> -static inline u32 jhash(const void *key, u32 length, u32 initval)
> -{
> -   u32 a, b, c;
> -   const u8 *k = key;
> -
> -   /* Set up the internal state */
> -   a = b = c = JHASH_INITVAL + length + initval;
> -
> -   /* All but the last block: affect some 32 bits of (a,b,c) */
> -   while (length > 12) {
> -   a += __get_unaligned_cpu32(k);
> -   b += __get_unaligned_cpu32(k + 4);
> -   c += __get_unaligned_cpu32(k + 8);
> -   __jhash_mix(a, b, c);
> -   length -= 12;
> -   k += 12;
> -   }
> -   /* Last block: affect all 32 bits of (c) */
> -   /* All the case statements fall through */
> -   switch (length) {
> -   case 12: c += (u32)k[11]<<24;
> -   case 11: c += (u32)k[10]<<16;
> -   case 10: c += (u32)k[9]<<8;
> -   case 9:  c += k[8];
> -   case 8:  b += (u32)k[7]<<24;
> -   case 7:  b += (u32)k[6]<<16;
> -   case 6:  b += (u32)k[5]<<8;
> -   case 5:  b += k[4];
> -   case 4:  a += (u32)k[3]<<24;
> -   case 3:  a += (u32)k[2]<<16;
> -   case 2:  a += (u32)k[1]<<8;
> -   case 1:  a += k[0];
> -__jhash_final(a, b, c);
> -   case 0: /* Nothing left to add */
> -   break;
> -   }
> -
> -   return c;
> -}
> -
> -/* jhash2 - hash an array of u32's
> - * @k: the key which must be an array of u32's
> - * @length: the number of u32's in the key
> - * @initval: the previous hash, or an arbitray value
> - *
> - * Returns the hash value of the key.
> - */
> -static inline u32 jhash2(const u32 *k, u32 length, u32 initval)
> -{
> -   u32 a, b, c;
> -
> -   /* Set up the internal state */
> -   a = b = c = JHASH_INITVAL + (length<<2) + initval;
> -
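
As a rough illustration of what "deinlining" means for the quoted helpers, here is a userspace sketch modeled on the __jhash_final macro shown above, with the three-word hash written as an ordinary out-of-line function. The sketch_ names are hypothetical and this is not the proposed lib/jhash.c, only the same arithmetic in standalone form.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by s bits (s in 0..31). */
static inline uint32_t sketch_rol32(uint32_t w, unsigned int s)
{
    return (w << (s & 31)) | (w >> ((-s) & 31));
}

/* Function form of the quoted __jhash_final macro. */
static void sketch_jhash_final(uint32_t *a, uint32_t *b, uint32_t *c)
{
    *c ^= *b; *c -= sketch_rol32(*b, 14);
    *a ^= *c; *a -= sketch_rol32(*c, 11);
    *b ^= *a; *b -= sketch_rol32(*a, 25);
    *c ^= *b; *c -= sketch_rol32(*b, 16);
    *a ^= *c; *a -= sketch_rol32(*c, 4);
    *b ^= *a; *b -= sketch_rol32(*a, 14);
    *c ^= *b; *c -= sketch_rol32(*b, 24);
}

/* Not static and not inline: one copy in the image, callers pay a call. */
uint32_t sketch_hash_3words(uint32_t a, uint32_t b, uint32_t c,
                            uint32_t initval)
{
    a += initval;
    b += initval;
    c += initval;
    sketch_jhash_final(&a, &b, &c);
    return c;
}
```

The trade-off Tom raises is visible here: the body is a few dozen ALU operations, so the size saving comes from sharing one copy across call sites at the price of a function call on each hash.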

Re: [PATCH 0/7] introduce Hyper-V VM Sockets(hvsock)

2015-07-16 Thread Stefan Hajnoczi

On Mon, Jul 06, 2015 at 07:39:35AM -0700, Dexuan Cui wrote:
> Hyper-V VM Sockets (hvsock) is a byte-stream based communication mechanism
> between Windows 10 (or later) host and a guest. It's kind of TCP over
> VMBus, but the transportation layer (VMBus) is much simpler than IP.
> With Hyper-V VM Sockets, applications between the host and a guest can
> talk with each other directly by the traditional BSD-style socket APIs.
> 
> The patchset implements the necessary support in the guest side by adding
> the necessary new APIs in the vmbus driver, and introducing a new driver
> hv_sock.ko, which implements a new socket address family AF_HYPERV.
> 
> 
> I know the kernel has already had a VM Sockets driver (AF_VSOCK) based
> on VMware's VMCI (net/vmw_vsock/, drivers/misc/vmw_vmci), and KVM is
> proposing AF_VSOCK of virtio version:
> http://thread.gmane.org/gmane.linux.network/365205.
> 
> However, though Hyper-V VM Sockets may seem conceptually similar to
> AF_VSOCK, there are differences in the transportation layer, and IMO these
> make the direct code reusing impractical:
> 
> 1. In AF_VSOCK, the endpoint type is: , but in
> AF_HYPERV, the endpoint type is: . Here GUID
> is 128-bit.
> 
> 2. AF_VSOCK supports SOCK_DGRAM, while AF_HYPERV doesn't.
> 
> 3. AF_VSOCK supports some special sock opts, like SO_VM_SOCKETS_BUFFER_SIZE,
> SO_VM_SOCKETS_BUFFER_MIN/MAX_SIZE and SO_VM_SOCKETS_CONNECT_TIMEOUT.
> These are meaningless to AF_HYPERV.
> 
> 4. Some of AF_VSOCK's VMCI transportation ops are meaningless to AF_HYPERV/VMBus,
> like .notify_recv_init
> .notify_recv_pre_block
> .notify_recv_pre_dequeue
> .notify_recv_post_dequeue
> .notify_send_init
> .notify_send_pre_block
> .notify_send_pre_enqueue
> .notify_send_post_enqueue
> etc.
> 
> So I think we'd better introduce a new address family: AF_HYPERV.

Points 2-4 are not critical.  I think there are solutions to them.

Point 1 is the main issue: hvsock has  addresses instead of
vsock's  addresses.  Perhaps a mapping could be used but that
is pretty ugly.  One idea is something like a userspace  <->
 lookup function that applications can use if they want to
accept GUIDs.

I don't have a workable alternative to propose, so I agree that a new
address family is justified.



Re: [PATCH nf-next] netfilter: nf_ct_sctp: minimal multihoming support

2015-07-16 Thread Michal Kubecek

On Thu, Jul 16, 2015 at 10:50:59AM -0300, Marcelo Ricardo Leitner wrote:
> On Tue, Jul 14, 2015 at 02:23:11PM +0200, Michal Kubecek wrote:
> > @@ -278,6 +292,14 @@ static int sctp_new_state(enum ip_conntrack_dir dir,
> > pr_debug("SCTP_CID_SHUTDOWN_COMPLETE\n");
> > i = 8;
> > break;
> > +   case SCTP_CID_HEARTBEAT:
> > +   pr_debug("SCTP_CID_HEARTBEAT");
> > +   i = 9;
> > +   break;
> > +   case SCTP_CID_HEARTBEAT_ACK:
> > +   pr_debug("SCTP_CID_HEARTBEAT_ACK");
> > +   i = 10;
> > +   break;
> > default:
> > /* Other chunks like DATA, SACK, HEARTBEAT and
> > its ACK do not cause a change in state */
> 
> Would you update this comment on default case please? As with this
> patch, HB and its ACK may cause a change in state.

Thank you for catching this. I'll update the comment in v2 I'm going to
send tomorrow after some testing.

 Michal Kubecek


[PATCH 4/6] ARM: net: add support for BPF_ANC | SKF_AD_PKTTYPE in ARM JIT.

2015-07-16 Thread Nicolas Schichan

Signed-off-by: Nicolas Schichan 
---
 arch/arm/net/bpf_jit_32.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index c011e22..6ff248c 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -895,6 +895,17 @@ b_epilogue:
OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
}
break;
+   case BPF_ANC | SKF_AD_PKTTYPE:
+   ctx->seen |= SEEN_SKB;
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
+ __pkt_type_offset[0]) != 1);
+   off = PKT_TYPE_OFFSET();
+   emit(ARM_LDRB_I(r_A, r_skb, off), ctx);
+   emit(ARM_AND_I(r_A, r_A, PKT_TYPE_MAX), ctx);
+#ifdef __BIG_ENDIAN_BITFIELD
+   emit(ARM_LSR_I(r_A, r_A, 5), ctx);
+#endif
+   break;
case BPF_ANC | SKF_AD_QUEUE:
ctx->seen |= SEEN_SKB;
BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-- 
1.9.1
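
The load/mask/shift sequence emitted above can be sketched in plain C. This assumes, as I understand the kernel headers, that PKT_TYPE_MAX is 7 when pkt_type sits in the low three bits of the byte and 7 << 5 when the bitfield is laid out big-endian; the sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PKT_TYPE_MAX_LE 0x07   /* type in the low 3 bits */
#define SKETCH_PKT_TYPE_MAX_BE 0xe0   /* 7 << 5, type in the high 3 bits */

/* byte: the single byte loaded from the skb at PKT_TYPE_OFFSET(). */
static uint8_t sketch_pkt_type(uint8_t byte, int big_endian_bitfield)
{
    if (big_endian_bitfield)
        return (uint8_t)((byte & SKETCH_PKT_TYPE_MAX_BE) >> 5);
    return (uint8_t)(byte & SKETCH_PKT_TYPE_MAX_LE);
}
```

Either way the result is a value in 0..7, with the neighbouring flag bits in the same byte masked off.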


[PATCH 6/6] ARM: net: add support for BPF_ANC | SKF_AD_HATYPE in ARM JIT.

2015-07-16 Thread Nicolas Schichan

Signed-off-by: Nicolas Schichan 
---
 arch/arm/net/bpf_jit_32.c | 22 --
 arch/arm/net/bpf_jit_32.h |  3 +++
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index 3c73caf..876060b 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -857,7 +857,9 @@ b_epilogue:
emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
break;
case BPF_ANC | SKF_AD_IFINDEX:
+   case BPF_ANC | SKF_AD_HATYPE:
/* A = skb->dev->ifindex */
+   /* A = skb->dev->type */
ctx->seen |= SEEN_SKB;
off = offsetof(struct sk_buff, dev);
emit(ARM_LDR_I(r_scratch, r_skb, off), ctx);
@@ -867,8 +869,24 @@ b_epilogue:
 
BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
  ifindex) != 4);
-   off = offsetof(struct net_device, ifindex);
-   emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
+   BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
+ type) != 2);
+
+   if (code == (BPF_ANC | SKF_AD_IFINDEX)) {
+   off = offsetof(struct net_device, ifindex);
+   emit(ARM_LDR_I(r_A, r_scratch, off), ctx);
+   } else {
+   /*
+* offset of field "type" in "struct
+* net_device" is above what can be
+* used in the ldrh rd, [rn, #imm]
+* instruction, so load the offset in
+* a register and use ldrh rd, [rn, rm]
+*/
+   off = offsetof(struct net_device, type);
+   emit_mov_i(ARM_R3, off, ctx);
+   emit(ARM_LDRH_R(r_A, r_scratch, ARM_R3), ctx);
+   }
break;
case BPF_ANC | SKF_AD_MARK:
ctx->seen |= SEEN_SKB;
diff --git a/arch/arm/net/bpf_jit_32.h b/arch/arm/net/bpf_jit_32.h
index b2d7d92..4b17d5ab 100644
--- a/arch/arm/net/bpf_jit_32.h
+++ b/arch/arm/net/bpf_jit_32.h
@@ -74,6 +74,7 @@
 #define ARM_INST_LDRB_I0x05d0
 #define ARM_INST_LDRB_R0x07d0
 #define ARM_INST_LDRH_I0x01d000b0
+#define ARM_INST_LDRH_R0x019000b0
 #define ARM_INST_LDR_I 0x0590
 
 #define ARM_INST_LDM   0x0890
@@ -160,6 +161,8 @@
 | (rm))
 #define ARM_LDRH_I(rt, rn, off) (ARM_INST_LDRH_I | (rt) << 12 | (rn) << 16 \
 | (((off) & 0xf0) << 4) | ((off) & 0xf))
+#define ARM_LDRH_R(rt, rn, rm) (ARM_INST_LDRH_R | (rt) << 12 | (rn) << 16 \
+| (rm))
 
 #define ARM_LDM(rn, regs)  (ARM_INST_LDM | (rn) << 16 | (regs))
 
-- 
1.9.1
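
To show why the register form is needed, here is a sketch of the two LDRH encodings built from the opcode constants in the hunk above. The immediate form only has room for an 8-bit offset (split into two nibbles), so a field located beyond 255 bytes has to go through a register. The sketch_ helpers are hypothetical and leave out the condition bits that the JIT adds when emitting.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_INST_LDRH_I 0x01d000b0u
#define SKETCH_INST_LDRH_R 0x019000b0u

/* The immediate form can only express offsets 0..255. */
static int sketch_fits_ldrh_imm(unsigned int off)
{
    return off <= 0xff;
}

/* ldrh rt, [rn, #off] */
static uint32_t sketch_ldrh_i(unsigned int rt, unsigned int rn,
                              unsigned int off)
{
    assert(sketch_fits_ldrh_imm(off));
    return SKETCH_INST_LDRH_I | rt << 12 | rn << 16
           | ((off & 0xf0) << 4) | (off & 0xf);
}

/* ldrh rt, [rn, rm] */
static uint32_t sketch_ldrh_r(unsigned int rt, unsigned int rn,
                              unsigned int rm)
{
    return SKETCH_INST_LDRH_R | rt << 12 | rn << 16 | rm;
}
```

For example, sketch_ldrh_r(0, 1, 3) encodes "ldrh r0, [r1, r3]", the shape the patch uses after moving the offset of "type" into r3.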


[PATCH 5/6] ARM: net: add support for BPF_ANC | SKF_AD_PAY_OFFSET in ARM JIT.

2015-07-16 Thread Nicolas Schichan

Signed-off-by: Nicolas Schichan 
---
 arch/arm/net/bpf_jit_32.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index 6ff248c..3c73caf 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -915,6 +915,14 @@ b_epilogue:
off = offsetof(struct sk_buff, queue_mapping);
emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
break;
+   case BPF_ANC | SKF_AD_PAY_OFFSET:
+   ctx->seen |= SEEN_SKB | SEEN_CALL;
+
+   emit(ARM_MOV_R(ARM_R0, r_skb), ctx);
+   emit_mov_i(ARM_R3, (unsigned int)skb_get_poff, ctx);
+   emit_blx_r(ARM_R3, ctx);
+   emit(ARM_MOV_R(r_A, ARM_R0), ctx);
+   break;
case BPF_LDX | BPF_W | BPF_ABS:
/*
 * load a 32bit word from struct seccomp_data.
-- 
1.9.1


[PATCH 3/6] ARM: net: fix vlan access instructions in ARM JIT.

2015-07-16 Thread Nicolas Schichan

This makes BPF_ANC | SKF_AD_VLAN_TAG and BPF_ANC | SKF_AD_VLAN_TAG_PRESENT
have the same behaviour as the in kernel VM and makes the test_bpf LD_VLAN_TAG
and LD_VLAN_TAG_PRESENT tests pass.

Signed-off-by: Nicolas Schichan 
---
 arch/arm/net/bpf_jit_32.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index d9b2524..c011e22 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -889,9 +889,11 @@ b_epilogue:
off = offsetof(struct sk_buff, vlan_tci);
emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
if (code == (BPF_ANC | SKF_AD_VLAN_TAG))
-   OP_IMM3(ARM_AND, r_A, r_A, VLAN_VID_MASK, ctx);
-   else
-   OP_IMM3(ARM_AND, r_A, r_A, VLAN_TAG_PRESENT, ctx);
+   OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx);
+   else {
+   OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx);
+   OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
+   }
break;
case BPF_ANC | SKF_AD_QUEUE:
ctx->seen |= SEEN_SKB;
-- 
1.9.1
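
The difference between the old and new instruction sequences is easy to see on a sample vlan_tci value. The sketch below assumes VLAN_TAG_PRESENT is bit 12 (0x1000) and VLAN_VID_MASK is 0x0fff, as in the kernel of that time; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_VLAN_VID_MASK    0x0fff
#define SKETCH_VLAN_TAG_PRESENT 0x1000

/* Old JIT output: drops the priority bits along with the present bit. */
static uint32_t sketch_old_tag(uint16_t tci)
{
    return tci & SKETCH_VLAN_VID_MASK;
}

/* Old JIT output: returns 0x1000 rather than 1 when a tag is present. */
static uint32_t sketch_old_present(uint16_t tci)
{
    return tci & SKETCH_VLAN_TAG_PRESENT;
}

/* New JIT output: clear only the present bit. */
static uint32_t sketch_new_tag(uint16_t tci)
{
    return tci & ~SKETCH_VLAN_TAG_PRESENT & 0xffff;
}

/* New JIT output: shift right by 12 and keep one bit. */
static uint32_t sketch_new_present(uint16_t tci)
{
    return (tci >> 12) & 0x1;
}
```

For tci = 0xb064 (priority 5, present bit set, VID 0x64) the old code yields 0x0064 and 0x1000, while the new code yields 0xa064 and 1.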


[PATCH 0/6] BPF JIT fixes and features for ARM

2015-07-16 Thread Nicolas Schichan

Hello,

This series fixes issues with the ARM BPF JIT and adds support for more
instructions to the ARM BPF JIT.

The first three patches are fixing bugs in the ARM JIT and should
probably find their way to a stable kernel.

The last three patches add support to the ARM JIT for more BPF
instructions, namely skb netdevice type retrieval, skb payload offset
retrieval, and skb packet type retrieval.

With the first three patches, all 60 test_bpf tests in Linux 4.1 release
are now passing OK (was 54 out of 60 before).

The last three patches allow 35 tests to use the JIT instead of 29
before.

Like previous ARM JIT patches this should go via the net tree.

Regards,

Nicolas Schichan (6):
  ARM: net: fix condition for load_order > 0 when translating load
instructions.
  ARM: net: handle negative offsets in BPF JIT.
  ARM: net: fix vlan access instructions in ARM JIT.
  ARM: net: add support for BPF_ANC | SKF_AD_PKTTYPE in ARM JIT.
  ARM: net: add support for BPF_ANC | SKF_AD_PAY_OFFSET in ARM JIT.
  ARM: net: add support for BPF_ANC | SKF_AD_HATYPE in ARM JIT.

 arch/arm/net/bpf_jit_32.c | 98 +++
 arch/arm/net/bpf_jit_32.h |  3 ++
 2 files changed, 86 insertions(+), 15 deletions(-)

-- 
1.9.1


[PATCH 2/6] ARM: net: handle negative offsets in BPF JIT.

2015-07-16 Thread Nicolas Schichan

Previously, the JIT would reject negative offsets known during code
generation and mishandle negative offsets provided at runtime.

Fix that by calling bpf_internal_load_pointer_neg_helper()
appropriately in the jit_get_skb_{b,h,w} slow path helpers and by forcing
the execution flow to the slow path helpers when the offset is
negative.

Signed-off-by: Nicolas Schichan 
---
 arch/arm/net/bpf_jit_32.c | 47 ++-
 1 file changed, 38 insertions(+), 9 deletions(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index 21f5ace..d9b2524 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -74,32 +74,52 @@ struct jit_ctx {
 
 int bpf_jit_enable __read_mostly;
 
-static u64 jit_get_skb_b(struct sk_buff *skb, unsigned offset)
+static inline int call_neg_helper(struct sk_buff *skb, int offset, void *ret,
+ unsigned int size)
+{
+   void *ptr = bpf_internal_load_pointer_neg_helper(skb, offset, size);
+
+   if (!ptr)
+   return -EFAULT;
+   memcpy(ret, ptr, size);
+   return 0;
+}
+
+static u64 jit_get_skb_b(struct sk_buff *skb, int offset)
 {
u8 ret;
int err;
 
-   err = skb_copy_bits(skb, offset, &ret, 1);
+   if (offset < 0)
+   err = call_neg_helper(skb, offset, &ret, 1);
+   else
+   err = skb_copy_bits(skb, offset, &ret, 1);
 
return (u64)err << 32 | ret;
 }
 
-static u64 jit_get_skb_h(struct sk_buff *skb, unsigned offset)
+static u64 jit_get_skb_h(struct sk_buff *skb, int offset)
 {
u16 ret;
int err;
 
-   err = skb_copy_bits(skb, offset, &ret, 2);
+   if (offset < 0)
+   err = call_neg_helper(skb, offset, &ret, 2);
+   else
+   err = skb_copy_bits(skb, offset, &ret, 2);
 
return (u64)err << 32 | ntohs(ret);
 }
 
-static u64 jit_get_skb_w(struct sk_buff *skb, unsigned offset)
+static u64 jit_get_skb_w(struct sk_buff *skb, int offset)
 {
u32 ret;
int err;
 
-   err = skb_copy_bits(skb, offset, &ret, 4);
+   if (offset < 0)
+   err = call_neg_helper(skb, offset, &ret, 4);
+   else
+   err = skb_copy_bits(skb, offset, &ret, 4);
 
return (u64)err << 32 | ntohl(ret);
 }
@@ -536,9 +556,6 @@ static int build_body(struct jit_ctx *ctx)
case BPF_LD | BPF_B | BPF_ABS:
load_order = 0;
 load:
-   /* the interpreter will deal with the negative K */
-   if ((int)k < 0)
-   return -ENOTSUPP;
emit_mov_i(r_off, k, ctx);
 load_common:
ctx->seen |= SEEN_DATA | SEEN_CALL;
@@ -553,6 +570,18 @@ load_common:
condt = ARM_COND_HI;
}
 
+   /*
+* test for negative offset, only if we are
+* currently scheduled to take the fast
+* path. this will update the flags so that
+* the slowpath instruction are ignored if the
+* offset is negative.
+*
+* for load_order == 0 the HI condition will
+* make loads at offset 0 take the slow path too.
+*/
+   _emit(condt, ARM_CMP_I(r_off, 0), ctx);
+
_emit(condt, ARM_ADD_R(r_scratch, r_off, r_skb_data),
  ctx);
 
-- 
1.9.1
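
Two details of the slow-path helpers above can be sketched in userspace: the (err << 32 | value) return convention, and the choice of helper based on the sign of the offset (which is why the parameter type changes from unsigned to int). All sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_path { SKETCH_COPY_BITS, SKETCH_NEG_HELPER };

/* Negative offsets go to the neg helper, the rest to skb_copy_bits(). */
static enum sketch_path sketch_slow_path(int offset)
{
    return offset < 0 ? SKETCH_NEG_HELPER : SKETCH_COPY_BITS;
}

/* Same packing as "(u64)err << 32 | ret" in the helpers. */
static uint64_t sketch_pack(int err, uint32_t value)
{
    return (uint64_t)err << 32 | value;
}

/* A non-zero upper half means the load failed. */
static int sketch_failed(uint64_t packed)
{
    return (packed >> 32) != 0;
}

static uint32_t sketch_value(uint64_t packed)
{
    return (uint32_t)packed;
}
```

On 32-bit ARM a u64 return value comes back in a register pair, so the caller can test the error half and use the value half without extra memory traffic.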


[PATCH 1/6] ARM: net: fix condition for load_order > 0 when translating load instructions.

2015-07-16 Thread Nicolas Schichan

To check whether the load should take the fast path or not, the code
would check that (r_skb_hlen - load_order) is greater than or equal to
the offset of the access using an "Unsigned higher or same" condition.
For halfword accesses and an skb length of 1 at offset 0, that test
passes, as we end up comparing 0xffffffff (-1) and 0, so the fast path
is taken and the filter allows the load to wrongly succeed. A similar
issue exists for word loads at offset 0 and an skb length of less than
4.

Fix that by using the "Signed greater than or equal" condition for
the fast path code for load orders greater than 0.

Signed-off-by: Nicolas Schichan 
---
 arch/arm/net/bpf_jit_32.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index 4550d24..21f5ace 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -547,7 +547,7 @@ load_common:
emit(ARM_SUB_I(r_scratch, r_skb_hl,
   1 << load_order), ctx);
emit(ARM_CMP_R(r_scratch, r_off), ctx);
-   condt = ARM_COND_HS;
+   condt = ARM_COND_GE;
} else {
emit(ARM_CMP_R(r_skb_hl, r_off), ctx);
condt = ARM_COND_HI;
-- 
1.9.1
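
The failure case from the commit message can be reproduced with two comparison helpers, one unsigned (the old HS condition) and one signed (the new GE condition). This is a userspace sketch with hypothetical names; size stands for the access width in bytes, that is 1 << load_order.

```c
#include <assert.h>
#include <stdint.h>

/* Old check: unsigned "higher or same" on (hlen - size) versus off. */
static int sketch_fast_path_hs(uint32_t hlen, uint32_t size, uint32_t off)
{
    return (uint32_t)(hlen - size) >= off;
}

/* New check: signed "greater than or equal" on the same operands. */
static int sketch_fast_path_ge(uint32_t hlen, uint32_t size, uint32_t off)
{
    /* Reinterpreting as int32_t mirrors how GE reads the registers. */
    return (int32_t)(hlen - size) >= (int32_t)off;
}
```

With hlen = 1, size = 2 and off = 0 the subtraction wraps to 0xffffffff: the unsigned test passes and the fast path reads past the data, while the signed test sees -1 >= 0 and falls back to the slow path.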


Re: [PATCH net-next]r8169: set bits on Register Interrupt status on limit

2015-07-16 Thread Florian Fainelli

On 16/07/15 03:18, Corcodel Marian wrote:
> 
>  Set bits on register Interrupt status on limits by
> configuration(critical).
>  On chips not alls bits is in use and some is reserved this patch solve this 
> issue.
> 
>  Committer: Corcodel Marian 
>  Changes to be committed:
>   modified:   drivers/net/ethernet/realtek/r8169.c

Unfortunately, this still does not look like a proper patch submission.
Please take the time to read through Documentation/SubmittingPatches,
and also browse the netdev mailing list for examples of what other
patch submissions look like, e.g.:
http://marc.info/?l=linux-netdev&m=143701078222680&w=2

If you use git format-patch and git send-email, things are much
easier than trying to do this manually.

> 
> Signed-off-by: Corcodel Marian 
> ---
>  drivers/net/ethernet/realtek/r8169.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 


-- 
Florian

Re: [PATCH 0/6] BPF JIT fixes and features for ARM

2015-07-16 Thread Alexei Starovoitov


On 7/16/15 9:46 AM, Nicolas Schichan wrote:

> This series fixes issues with the ARM BPF JIT and adds support for more
> instructions to the ARM BPF JIT.
>
> The first three patches are fixing bugs in the ARM JIT and should
> probably find their way to a stable kernel.
>
> The last three patches add support to the ARM JIT for more BPF
> instructions, namely skb netdevice type retrieval, skb payload offset
> retrieval, and skb packet type retrieval.
>
> With the first three patches, all 60 test_bpf tests in Linux 4.1 release
> are now passing OK (was 54 out of 60 before).
>
> The last three patches allow 35 tests to use the JIT instead of 29
> before.


looks good to me.
For the series:
Acked-by: Alexei Starovoitov 

you might want to try the latest 4.2-rc, since it has 238 tests :)

[PATCH net-next 1/1] hv_netvsc: Wait for sub-channels to be processed during probe

2015-07-16 Thread K. Y. Srinivasan

The current code returns from probe without waiting for the proper handling
of subchannels that may be requested. If the netvsc driver were to be rapidly
loaded/unloaded, we can trigger a panic, as the unload will be tearing
down state that may not have been fully set up yet. We fix this issue by making
sure that we return from the probe call only after ensuring that the
sub-channel offers in flight are properly handled.

Signed-off-by: K. Y. Srinivasan 
Reviewed-and-tested-by: Haiyang Zhang
[...] offermsg.offer.sub_channel_index;
int ret;
+   unsigned long flags;
 
nvscdev = hv_get_drvdata(new_sc->primary_channel->device_obj);
 
+   spin_lock_irqsave(&nvscdev->sc_lock, flags);
+   nvscdev->num_sc_offered--;
+   spin_unlock_irqrestore(&nvscdev->sc_lock, flags);
+   if (nvscdev->num_sc_offered == 0)
+   complete(&nvscdev->channel_init_wait);
+
if (chn_index >= nvscdev->num_chn)
return;
 
@@ -1015,8 +1022,10 @@ int rndis_filter_device_add(struct hv_device *dev,
u32 rsscap_size = sizeof(struct ndis_recv_scale_cap);
u32 mtu, size;
u32 num_rss_qs;
+   u32 sc_delta;
const struct cpumask *node_cpu_mask;
u32 num_possible_rss_qs;
+   unsigned long flags;
 
rndis_device = get_rndis_device();
if (!rndis_device)
@@ -1039,6 +1048,8 @@ int rndis_filter_device_add(struct hv_device *dev,
net_device->max_chn = 1;
net_device->num_chn = 1;
 
+   spin_lock_init(&net_device->sc_lock);
+
net_device->extension = rndis_device;
rndis_device->net_dev = net_device;
 
@@ -1116,6 +1127,9 @@ int rndis_filter_device_add(struct hv_device *dev,
num_possible_rss_qs = cpumask_weight(node_cpu_mask);
net_device->num_chn = min(num_possible_rss_qs, num_rss_qs);
 
+   num_rss_qs = net_device->num_chn - 1;
+   net_device->num_sc_offered = num_rss_qs;
+
if (net_device->num_chn == 1)
goto out;
 
@@ -1157,11 +1171,22 @@ int rndis_filter_device_add(struct hv_device *dev,
 
ret = rndis_filter_set_rss_param(rndis_device, net_device->num_chn);
 
+   /*
+* Wait for the host to send us the sub-channel offers.
+*/
+   spin_lock_irqsave(&net_device->sc_lock, flags);
+   sc_delta = net_device->num_chn - 1 - num_rss_qs;
+   net_device->num_sc_offered -= sc_delta;
+   spin_unlock_irqrestore(&net_device->sc_lock, flags);
+
+   if (net_device->num_sc_offered != 0)
+   wait_for_completion(&net_device->channel_init_wait);
 out:
if (ret) {
net_device->max_chn = 1;
net_device->num_chn = 1;
}
+
return 0; /* return 0 because primary channel can be used alone */
 
 err_dev_remv:
-- 
1.7.4.1
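
The counting scheme described in this patch can be sketched in userspace C.
This is a minimal sketch under stated assumptions, not the driver code:
struct sc_tracker and its helpers are hypothetical names, and C11 atomics
stand in for the spinlock and the completion that the patch uses. It shows
one way to decide "was this the last outstanding offer?" from the value
returned by the decrement itself, so no second read of the counter is needed.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-device sub-channel bookkeeping. */
struct sc_tracker {
	atomic_int pending;	/* offers requested but not yet handled */
	atomic_bool done;	/* plays the role of the completion */
};

static void sc_tracker_init(struct sc_tracker *t, int expected)
{
	atomic_init(&t->pending, expected);
	/* Nothing to wait for when no sub-channels were requested. */
	atomic_init(&t->done, expected == 0);
}

/* Called once per handled offer; returns true for the last one. */
static bool sc_tracker_offer_handled(struct sc_tracker *t)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&t->pending, 1) == 1) {
		atomic_store(&t->done, true);
		return true;
	}
	return false;
}

/* A real driver would sleep on a completion; the sketch exposes a flag. */
static bool sc_tracker_is_done(struct sc_tracker *t)
{
	return atomic_load(&t->done);
}
```

In this model the probe path would call sc_tracker_init() with the number
of requested sub-channels and return only once sc_tracker_is_done() holds.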


Re: [PATCH v2] jhash: Deinline jhash, jhash2 and __jhash_nwords

2015-07-16 Thread David Miller

From: Tom Herbert 
Date: Thu, 16 Jul 2015 08:43:25 -0700

> On Thu, Jul 16, 2015 at 5:40 AM, Denys Vlasenko  wrote:
>> This patch deinlines jhash, jhash2 and __jhash_nwords.
>>
>> It also removes rhashtable_jhash2(key, length, seed)
>> because it was merely calling jhash2(key, length, seed).
>>
>> With this .config: http://busybox.net/~vda/kernel_config,
>> after deinlining these functions have sizes and callsite counts
>> as follows:
>>
>> __jhash_nwords: 72 bytes, 75 calls
>> jhash: 297 bytes, 111 calls
>> jhash2: 205 bytes, 136 calls
>>
> jhash is used in several places in the critical data path. Does the
> decrease in text size justify performance impact of not inlining it?

Tom took the words right out of my mouth.

Denys, you keep making deinlining changes like this all the time, like
a robot.  But I never see you make any effort to look into the performance
nor code generation ramifications of your changes.

And frankly that makes your patches quite tiring to deal with.

Your changes potentially have large performance implications, yet you
don't put any effort into considering that aspect at all.

Re: [PATCH v1 08/12] IB/cma: Add net_dev and private data checks to RDMA CM

2015-07-16 Thread Jason Gunthorpe

On Thu, Jul 16, 2015 at 12:01:55PM +, Liran Liss wrote:

> - Name space lookup is done based on BTH.pkey, private_data.IP, and
>   optionally GRH.DGID (if present, for extra validation)

Just changing the pkey to BTH.pkey would be fine by me.

Using GRH.DGID if available instead of the primary path hack is also
smart, I pointed that out at the beginning..

Jason

Re: [PATCH] tg3:Add error handling to the function tg3_test_loopback

2015-07-16 Thread Michael Chan

On Thu, 2015-07-16 at 14:51 -0400, Nicholas Krause wrote: 
> This adds proper error handling for if the calls to the function
> tg3_phy_lpbk_set fail by returning -EIO by assigning the return
> value to the variable err and if it equals anything other then
> zero jumps to the goto label done as no other work can be handled
> internally in the function tg3_test_loopback.
> 
> Signed-off-by: Nicholas Krause 

Acked-by: Michael Chan 

> ---
>  drivers/net/ethernet/broadcom/tg3.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/tg3.c 
> b/drivers/net/ethernet/broadcom/tg3.c
> index 73c934c..8584cb7 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -13625,8 +13625,10 @@ static int tg3_test_loopback(struct tg3 *tp, u64 
> *data, bool do_extlpbk)
>   !tg3_flag(tp, USE_PHYLIB)) {
>   int i;
>  
> - tg3_phy_lpbk_set(tp, 0, false);
> -
> + err = tg3_phy_lpbk_set(tp, 0, false);
> + if (err)
> + goto done;
> +
>   /* Wait for link */
>   for (i = 0; i < 100; i++) {
>   if (tr32(MAC_TX_STATUS) & TX_STATUS_LINK_UP)
> @@ -13644,8 +13646,10 @@ static int tg3_test_loopback(struct tg3 *tp, u64 
> *data, bool do_extlpbk)
>   data[TG3_PHY_LOOPB_TEST] |= TG3_JMB_LOOPBACK_FAILED;
>  
>   if (do_extlpbk) {
> - tg3_phy_lpbk_set(tp, 0, true);
> -
> + err = tg3_phy_lpbk_set(tp, 0, true);
> +
> + if (err)
> + goto done;
>   /* All link indications report up, but the hardware
>* isn't really ready for about 20 msec.  Double it
>* to be sure.
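
The error-handling shape discussed in this thread (store the return value
in err, and jump to a common exit label on failure) can be sketched in
isolation. This is a simplified illustration, not the tg3 code:
demo_phy_loopback_set() and demo_test_loopback() are hypothetical, and the
real function may do more work at its exit label than this sketch does.

```c
#include <errno.h>

/* Hypothetical helper: fails with -EIO when asked to. */
static int demo_phy_loopback_set(int should_fail)
{
	return should_fail ? -EIO : 0;
}

/*
 * Runs two loopback setup steps in order. On the first failure the
 * remaining steps are skipped and the error code reaches the caller.
 * *steps_run reports how many steps completed.
 */
static int demo_test_loopback(int fail_internal, int fail_external,
			      int *steps_run)
{
	int err;

	*steps_run = 0;

	err = demo_phy_loopback_set(fail_internal);
	if (err)
		goto done;
	(*steps_run)++;

	err = demo_phy_loopback_set(fail_external);
	if (err)
		goto done;
	(*steps_run)++;

done:
	return err;
}
```

Keeping a single exit label means any cleanup added later has one obvious
place to live, which is the usual argument for this pattern in kernel code.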



Re: [PATCH] tg3:Add error handling to the function tg3_test_loopback

2015-07-16 Thread Michael Chan

On Thu, 2015-07-16 at 12:18 -0700, Michael Chan wrote: 
> On Thu, 2015-07-16 at 14:51 -0400, Nicholas Krause wrote: 
> > This adds proper error handling for if the calls to the function
> > tg3_phy_lpbk_set fail by returning -EIO by assigning the return
> > value to the variable err and if it equals anything other then
> > zero jumps to the goto label done as no other work can be handled
> > internally in the function tg3_test_loopback.
> > 
> > Signed-off-by: Nicholas Krause 
> 
> Acked-by: Michael Chan 

Your indentation doesn't look right.  Other than that, it is ok.

> 
> > ---
> >  drivers/net/ethernet/broadcom/tg3.c | 12 
> >  1 file changed, 8 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/broadcom/tg3.c 
> > b/drivers/net/ethernet/broadcom/tg3.c
> > index 73c934c..8584cb7 100644
> > --- a/drivers/net/ethernet/broadcom/tg3.c
> > +++ b/drivers/net/ethernet/broadcom/tg3.c
> > @@ -13625,8 +13625,10 @@ static int tg3_test_loopback(struct tg3 *tp, u64 
> > *data, bool do_extlpbk)
> > !tg3_flag(tp, USE_PHYLIB)) {
> > int i;
> >  
> > -   tg3_phy_lpbk_set(tp, 0, false);
> > -
> > +   err = tg3_phy_lpbk_set(tp, 0, false);
> > +   if (err)
> > +   goto done;
> > +
> > /* Wait for link */
> > for (i = 0; i < 100; i++) {
> > if (tr32(MAC_TX_STATUS) & TX_STATUS_LINK_UP)
> > @@ -13644,8 +13646,10 @@ static int tg3_test_loopback(struct tg3 *tp, u64 
> > *data, bool do_extlpbk)
> > data[TG3_PHY_LOOPB_TEST] |= TG3_JMB_LOOPBACK_FAILED;
> >  
> > if (do_extlpbk) {
> > -   tg3_phy_lpbk_set(tp, 0, true);
> > -
> > +   err = tg3_phy_lpbk_set(tp, 0, true);
> > +
> > +   if (err)
> > +   goto done;
> > /* All link indications report up, but the hardware
> >  * isn't really ready for about 20 msec.  Double it
> >  * to be sure.
> 



Re: [PATCH v2] jhash: Deinline jhash, jhash2 and __jhash_nwords

2015-07-16 Thread Joe Perches

On Thu, 2015-07-16 at 11:17 -0700, David Miller wrote:
> From: Tom Herbert 
> Date: Thu, 16 Jul 2015 08:43:25 -0700
> 
> > On Thu, Jul 16, 2015 at 5:40 AM, Denys Vlasenko  wrote:
> >> This patch deinlines jhash, jhash2 and __jhash_nwords.
> >>
> >> It also removes rhashtable_jhash2(key, length, seed)
> >> because it was merely calling jhash2(key, length, seed).
> >>
> >> With this .config: http://busybox.net/~vda/kernel_config,
> >> after deinlining these functions have sizes and callsite counts
> >> as follows:
> >>
> >> __jhash_nwords: 72 bytes, 75 calls
> >> jhash: 297 bytes, 111 calls
> >> jhash2: 205 bytes, 136 calls
> >>
> > jhash is used in several places in the critical data path. Does the
> > decrease in text size justify performance impact of not inlining it?
> 
> Tom took the words right out of my mouth.
> 
> Denys, you keep making deinlining changes like this all the time, like
> a robot.  But I never see you make any effort to look into the performance
> nor code generation ramifications of your changes.
> 
> And frankly that makes your patches quite tiring to deal with.
> 
> Your changes potentially have large performance implications, yet you
> don't put any effort into considering that aspect at all.

It might be useful to have these performance impacting
changes guarded by something like CONFIG_CC_OPTIMIZE_FOR_SIZE
with another static __always_inline __ and a function &
EXPORT_SYMBOL or just a static inline so that where code size
is critical it's uninlined.

Though even for tiny embedded uses, the additional code
complexity might not be worth it.
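
One possible reading of the suggestion above, as a minimal userspace
sketch. Everything named demo_* is hypothetical, DEMO_OPTIMIZE_FOR_SIZE is
an assumed macro standing in for a Kconfig option such as
CONFIG_CC_OPTIMIZE_FOR_SIZE, and the hash body is a toy FNV-1a variant,
not jhash. The point is only the structure: one always-available inline
body, wrapped either by a single out-of-line function or by a static
inline, depending on the build.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy FNV-1a style hash; a stand-in, NOT the kernel's jhash. */
static inline uint32_t demo_hash_impl(const void *key, size_t len,
				      uint32_t seed)
{
	const unsigned char *p = key;
	uint32_t h = 2166136261u ^ seed;
	size_t i;

	for (i = 0; i < len; i++)
		h = (h ^ p[i]) * 16777619u;
	return h;
}

#ifdef DEMO_OPTIMIZE_FOR_SIZE
/*
 * Size-optimised build: one out-of-line copy shared by all callers.
 * In a real tree the definition would live in a single .c file.
 */
uint32_t demo_hash(const void *key, size_t len, uint32_t seed);

uint32_t demo_hash(const void *key, size_t len, uint32_t seed)
{
	return demo_hash_impl(key, len, seed);
}
#else
/* Default build: callers get the inlined body on hot paths. */
static inline uint32_t demo_hash(const void *key, size_t len, uint32_t seed)
{
	return demo_hash_impl(key, len, seed);
}
#endif
```

Both variants return the same values, so callers do not change; only the
text size and call overhead differ. Whether the extra #ifdef is worth it
is exactly the trade-off questioned in the paragraph above.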



Re: [PATCH] genet:Return the variable ret rather then zero for the function bcmgenet_power_down

2015-07-16 Thread Florian Fainelli

On 15/07/15 08:09, Nicholas Krause wrote:
> This makes the function bcmgenet_power_down return the variable ret
> rather then zero in order to make this function be able to signal its
> caller with a error code when a failure occurs internally rather then
> always appearing to run successfully to its caller.

Please adjust the subject to be "net: bcmgenet: blah blah", just to
conform to the style used for all other changes.

This seems fine otherwise...

> 
> Signed-off-by: Nicholas Krause 
> ---
>  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
> b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index 64c1e9d..129e5b5 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -877,7 +877,7 @@ static int bcmgenet_power_down(struct bcmgenet_priv *priv,
>   break;
>   }
>  
> - return 0;
> + return ret;
>  }
>  
>  static void bcmgenet_power_up(struct bcmgenet_priv *priv,
> 


-- 
Florian

Re: [PATCH net] net: bcmgenet: Remove excessive PHY reset

2015-07-16 Thread Florian Fainelli

On 14/07/15 12:37, Florian Fainelli wrote:
> We are currently issuing multiple PHY resets during a suspend/resume,
> first during bcmgenet_power_up() which does a hardware reset, then a
> software reset by calling bcmgenet_mii_reset(). This is both unnecessary
> and can take as long as 10ms per MDIO transactions while we re-apply
> workarounds because we do not yet have MDIO interrupts enabled.
> 
> phy_resume() takes care of re-apply our workarounds in case we need any,
> and bcmgenet_power_up() does a PHY hardware reset, all of this is more
> than enough to guarantee that the PHY operates correctly.

David, please discard this version, I will send one which actually
compiles, sorry about that.

> 
> Fixes: 1c1008c793fa4 ("net: bcmgenet: add main driver file")
> Signed-off-by: Florian Fainelli 
> ---
>  drivers/net/ethernet/broadcom/genet/bcmgenet.c |  3 ---
>  drivers/net/ethernet/broadcom/genet/bcmgenet.h |  1 -
>  drivers/net/ethernet/broadcom/genet/bcmmii.c   | 10 --
>  3 files changed, 14 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
> b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index 64c1e9db6b0b..674f374dceee 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -907,9 +907,6 @@ static void bcmgenet_power_up(struct bcmgenet_priv *priv,
>   }
>  
>   bcmgenet_ext_writel(priv, reg, EXT_EXT_PWR_MGMT);
> -
> - if (mode == GENET_POWER_PASSIVE)
> - bcmgenet_mii_reset(priv->dev);
>  }
>  
>  /* ioctl handle special commands that are not present in ethtool. */
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.h 
> b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
> index 6159deab8c98..9f9ac0089d4d 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.h
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
> @@ -672,7 +672,6 @@ GENET_IO_MACRO(rbuf, GENET_RBUF_OFF);
>  int bcmgenet_mii_init(struct net_device *dev);
>  int bcmgenet_mii_config(struct net_device *dev, bool init);
>  void bcmgenet_mii_exit(struct net_device *dev);
> -void bcmgenet_mii_reset(struct net_device *dev);
>  void bcmgenet_phy_power_set(struct net_device *dev, bool enable);
>  void bcmgenet_mii_setup(struct net_device *dev);
>  
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmmii.c 
> b/drivers/net/ethernet/broadcom/genet/bcmmii.c
> index adf23d2ac488..2a8f97299b13 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmmii.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmmii.c
> @@ -163,16 +163,6 @@ void bcmgenet_mii_setup(struct net_device *dev)
>   phy_print_status(phydev);
>  }
>  
> -void bcmgenet_mii_reset(struct net_device *dev)
> -{
> - struct bcmgenet_priv *priv = netdev_priv(dev);
> -
> - if (priv->phydev) {
> - phy_init_hw(priv->phydev);
> - phy_start_aneg(priv->phydev);
> - }
> -}
> -
>  void bcmgenet_phy_power_set(struct net_device *dev, bool enable)
>  {
>   struct bcmgenet_priv *priv = netdev_priv(dev);
> 


-- 
Florian

[PATCH] net: netcp: fix improper initialization in netcp_ndo_open()

2015-07-16 Thread Murali Karicheri

The keystone qmss will raise an interrupt when a packet arrives at the
receive queue. The only control available to prevent the interrupt is
to keep the free descriptor queue (FDQ) empty on the receive side.
So the FDQ must be filled with descriptors only after the request_irq()
call made as part of knav_queue_enable_notify(). Move the call to
netcp_rxpool_refill() after that call.

Signed-off-by: Murali Karicheri 
---
 drivers/net/ethernet/ti/netcp_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/netcp_core.c 
b/drivers/net/ethernet/ti/netcp_core.c
index 5ec4ed3..ec8ed30 100644
--- a/drivers/net/ethernet/ti/netcp_core.c
+++ b/drivers/net/ethernet/ti/netcp_core.c
@@ -1617,11 +1617,11 @@ static int netcp_ndo_open(struct net_device *ndev)
}
mutex_unlock(&netcp_modules_lock);
 
-   netcp_rxpool_refill(netcp);
napi_enable(&netcp->rx_napi);
napi_enable(&netcp->tx_napi);
knav_queue_enable_notify(netcp->tx_compl_q);
knav_queue_enable_notify(netcp->rx_queue);
+   netcp_rxpool_refill(netcp);
netif_tx_wake_all_queues(ndev);
dev_dbg(netcp->ndev_dev, "netcp device %s opened\n", ndev->name);
return 0;
-- 
1.9.1
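
The ordering argument in this patch can be illustrated with a toy model.
This is a sketch under simplifying assumptions, not the netcp or knav
code: struct demo_rx and the demo_* functions are hypothetical, and the
"hardware" is reduced to the single rule the commit message states, namely
that a packet can only be received (and an interrupt raised) while the
free descriptor queue is non-empty.

```c
#include <stdbool.h>

/* Hypothetical model of the receive side. */
struct demo_rx {
	bool notify_enabled;	/* handler registered and notify on */
	int fdq_depth;		/* free descriptors available to hardware */
	int unhandled_irqs;	/* interrupts raised with no handler ready */
};

/* Hardware side: needs a free descriptor to accept a packet. */
static void demo_packet_arrives(struct demo_rx *rx)
{
	if (rx->fdq_depth == 0)
		return;		/* FDQ empty: no reception, no interrupt */
	rx->fdq_depth--;
	if (!rx->notify_enabled)
		rx->unhandled_irqs++;
}

static void demo_enable_notify(struct demo_rx *rx)
{
	rx->notify_enabled = true;
}

static void demo_rxpool_refill(struct demo_rx *rx, int count)
{
	rx->fdq_depth += count;
}
```

Refilling before demo_enable_notify() leaves a window in which a packet
can raise an interrupt nobody handles; refilling afterwards closes it,
because the empty FDQ holds the hardware off until the handler is ready.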


Re: [PATCH] net:bcmgent:Return the variable ret rather then zero for the function bcmgenet_power_down

2015-07-16 Thread Florian Fainelli

On 16/07/15 12:34, Nicholas Krause wrote:
> This makes the function bcmgenet_power_down return the variable ret
> rather then zero in order to make this function be able to signal its
> caller with a error code when a failure occurs internally rather then
> always appearing to run successfully to its caller.

Still not quite the right subject (missing spaces, typos); copy/paste this:

net: bcmgenet: Return the variable ret rather than zero for the function
bcmgenet_power_down

> 
> Signed-off-by: Nicholas Krause 
> ---
>  drivers/net/ethernet/broadcom/genet/bcmgenet.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c 
> b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index 64c1e9d..129e5b5 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -877,7 +877,7 @@ static int bcmgenet_power_down(struct bcmgenet_priv *priv,
>   break;
>   }
>  
> - return 0;
> + return ret;
>  }
>  
>  static void bcmgenet_power_up(struct bcmgenet_priv *priv,
> 


-- 
Florian

[PATCH net-next] net: remove skb_frag_add_head

2015-07-16 Thread Jiri Benc

It's not used anywhere.

Signed-off-by: Jiri Benc 
---
 include/linux/skbuff.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d6cdd6e87d53..a5395be9fe7b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2671,12 +2671,6 @@ static inline void skb_frag_list_init(struct sk_buff 
*skb)
skb_shinfo(skb)->frag_list = NULL;
 }
 
-static inline void skb_frag_add_head(struct sk_buff *skb, struct sk_buff *frag)
-{
-   frag->next = skb_shinfo(skb)->frag_list;
-   skb_shinfo(skb)->frag_list = frag;
-}
-
 #define skb_walk_frags(skb, iter)  \
for (iter = skb_shinfo(skb)->frag_list; iter; iter = iter->next)
 
-- 
1.8.3.1


Re: [PATCH 2/8] cdc-ncm: use common CDC parser

2015-07-16 Thread Bjørn Mork

I'm on vacation without any usable Internet or other code access, so you'll 
have to excuse me if I'm missing something.  But this seems to leave union_desc 
uninitialized, doesn't it?

Bjørn

On July 16, 2015 9:24:34 PM CEST, Oliver Neukum  wrote:
>Switch to the common parser
>
>Signed-off-by: Oliver Neukum 
>---
>drivers/net/usb/cdc_ncm.c | 69
>+++
> 1 file changed, 9 insertions(+), 60 deletions(-)
>
>diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c
>index db40175..2cef13a 100644
>--- a/drivers/net/usb/cdc_ncm.c
>+++ b/drivers/net/usb/cdc_ncm.c
>@@ -698,6 +698,7 @@ int cdc_ncm_bind_common(struct usbnet *dev, struct
>usb_interface *intf, u8 data_
>   int len;
>   int temp;
>   u8 iface_no;
>+  struct usb_cdc_parsed_header hdr;
> 
>   ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>   if (!ctx)
>@@ -722,66 +723,14 @@ int cdc_ncm_bind_common(struct usbnet *dev,
>struct usb_interface *intf, u8 data_
>   len = intf->cur_altsetting->extralen;
> 
>   /* parse through descriptors associated with control interface */
>-  while ((len > 0) && (buf[0] > 2) && (buf[0] <= len)) {
>-
>-  if (buf[1] != USB_DT_CS_INTERFACE)
>-  goto advance;
>-
>-  switch (buf[2]) {
>-  case USB_CDC_UNION_TYPE:
>-  if (buf[0] < sizeof(*union_desc))
>-  break;
>-
>-  union_desc = (const struct usb_cdc_union_desc *)buf;
>-  /* the master must be the interface we are probing */
>-  if (intf->cur_altsetting->desc.bInterfaceNumber !=
>-  union_desc->bMasterInterface0) {
>-  dev_dbg(&intf->dev, "bogus CDC Union\n");
>-  goto error;
>-  }
>-  ctx->data = usb_ifnum_to_if(dev->udev,
>-  
>union_desc->bSlaveInterface0);
>-  break;
>-
>-  case USB_CDC_ETHERNET_TYPE:
>-  if (buf[0] < sizeof(*(ctx->ether_desc)))
>-  break;
>-
>-  ctx->ether_desc =
>-  (const struct usb_cdc_ether_desc *)buf;
>-  break;
>-
>-  case USB_CDC_NCM_TYPE:
>-  if (buf[0] < sizeof(*(ctx->func_desc)))
>-  break;
>-
>-  ctx->func_desc = (const struct usb_cdc_ncm_desc *)buf;
>-  break;
>-
>-  case USB_CDC_MBIM_TYPE:
>-  if (buf[0] < sizeof(*(ctx->mbim_desc)))
>-  break;
>-
>-  ctx->mbim_desc = (const struct usb_cdc_mbim_desc *)buf;
>-  break;
>-
>-  case USB_CDC_MBIM_EXTENDED_TYPE:
>-  if (buf[0] < sizeof(*(ctx->mbim_extended_desc)))
>-  break;
>-
>-  ctx->mbim_extended_desc =
>-  (const struct usb_cdc_mbim_extended_desc *)buf;
>-  break;
>-
>-  default:
>-  break;
>-  }
>-advance:
>-  /* advance to next descriptor */
>-  temp = buf[0];
>-  buf += temp;
>-  len -= temp;
>-  }
>+  cdc_parse_cdc_header(&hdr, intf, buf, len);
>+
>+  ctx->data = usb_ifnum_to_if(dev->udev, 
>+  hdr.usb_cdc_union_desc->bSlaveInterface0);
>+  ctx->ether_desc = hdr.usb_cdc_ether_desc;
>+  ctx->func_desc = hdr.usb_cdc_ncm_desc;
>+  ctx->mbim_desc = hdr.usb_cdc_mbim_desc;
>+  ctx->mbim_extended_desc = hdr.usb_cdc_mbim_extended_desc;
> 
>   /* some buggy devices have an IAD but no CDC Union */
>   if (!union_desc && intf->intf_assoc &&
>intf->intf_assoc->bInterfaceCount == 2) {
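
The concern raised at the top of this message can be shown with a small
sketch. It is an illustration only: the demo_* types are hypothetical and
much simpler than the USB CDC structures, and the sketch assumes a parser
that leaves a descriptor pointer NULL when the descriptor is absent. The
idea is that a local such as union_desc should be assigned from the parsed
result and checked before it is dereferenced or tested.

```c
#include <stddef.h>

/* Hypothetical, simplified descriptor types. */
struct demo_union_desc {
	unsigned char master_if;
	unsigned char slave_if;
};

struct demo_parsed_header {
	const struct demo_union_desc *union_desc; /* NULL if not found */
};

/*
 * Returns the slave interface number, or -1 when the device did not
 * provide a union descriptor.
 */
static int demo_slave_interface(const struct demo_parsed_header *hdr)
{
	/* Initialise the local from the parser result, never leave it unset. */
	const struct demo_union_desc *union_desc = hdr->union_desc;

	if (!union_desc)
		return -1;
	return union_desc->slave_if;
}
```

With the local always initialised, a later check of the form
"if (!union_desc ...)" has a defined meaning on every path.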


[PATCH net-next 06/13] tipc: make media xmit call outside node spinlock context

2015-07-16 Thread Jon Maloy

Currently, message sending is performed through a deep call chain,
where the node spinlock is grabbed and held during a significant
part of the transmission time. This is clearly detrimental to
overall throughput performance; it would be better if we could send
the message after the spinlock has been released.

In this commit, we instead let the call return up the stack once the
buffer chain has been added to the transmission queue; clones of the
buffers are then transmitted to the device layer outside the spinlock
scope.

As a further step in our effort to separate the roles of the node
and link entities we also move the function tipc_link_xmit() to
node.c, and rename it to tipc_node_xmit().
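
The locking pattern described above can be sketched in a few lines of
userspace C. This is a minimal single-threaded model, not the TIPC code:
all demo_* names are hypothetical, a boolean stands in for the node
spinlock, and plain integers stand in for sk_buffs and their clones. The
sketch queues packets while the lock is held, collects copies on a local
list, and hands them to the media layer only after the lock is dropped.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_QLEN 8

struct demo_queue {
	int pkts[DEMO_QLEN];
	size_t len;
};

struct demo_node {
	bool locked;			/* stands in for the node spinlock */
	struct demo_queue transmq;	/* packets kept for retransmission */
	int sent;			/* packets handed to the media layer */
	int sent_while_locked;		/* should stay 0 with this pattern */
};

static void demo_lock(struct demo_node *n)   { n->locked = true; }
static void demo_unlock(struct demo_node *n) { n->locked = false; }

static void demo_media_send(struct demo_node *n, int pkt)
{
	(void)pkt;
	if (n->locked)
		n->sent_while_locked++;
	n->sent++;
}

static void demo_node_xmit(struct demo_node *n, const int *pkts, size_t count)
{
	struct demo_queue xmitq = { .len = 0 };
	size_t i;

	demo_lock(n);
	for (i = 0; i < count && n->transmq.len < DEMO_QLEN; i++) {
		n->transmq.pkts[n->transmq.len++] = pkts[i];
		xmitq.pkts[xmitq.len++] = pkts[i];	/* the "clone" */
	}
	demo_unlock(n);

	/* Transmission happens outside the lock scope. */
	for (i = 0; i < xmitq.len; i++)
		demo_media_send(n, xmitq.pkts[i]);
}
```

The lock then covers only the queue manipulation, so its hold time no
longer depends on how long the device layer takes to accept each packet.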

Reviewed-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/bearer.c |  26 ++
 net/tipc/bearer.h |   3 ++
 net/tipc/link.c   | 132 +++---
 net/tipc/link.h   |   6 +--
 net/tipc/name_distr.c |   4 +-
 net/tipc/node.c   |  78 +
 net/tipc/node.h   |   4 ++
 net/tipc/socket.c |  22 -
 8 files changed, 198 insertions(+), 77 deletions(-)

diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
index 00bc0e6..eae58a6 100644
--- a/net/tipc/bearer.c
+++ b/net/tipc/bearer.c
@@ -470,6 +470,32 @@ void tipc_bearer_send(struct net *net, u32 bearer_id, 
struct sk_buff *buf,
rcu_read_unlock();
 }
 
+/* tipc_bearer_xmit() -send buffer to destination over bearer
+ */
+void tipc_bearer_xmit(struct net *net, u32 bearer_id,
+ struct sk_buff_head *xmitq,
+ struct tipc_media_addr *dst)
+{
+   struct tipc_net *tn = net_generic(net, tipc_net_id);
+   struct tipc_bearer *b;
+   struct sk_buff *skb, *tmp;
+
+   if (skb_queue_empty(xmitq))
+   return;
+
+   rcu_read_lock();
+   b = rcu_dereference_rtnl(tn->bearer_list[bearer_id]);
+   if (likely(b)) {
+   skb_queue_walk_safe(xmitq, skb, tmp) {
+   __skb_dequeue(xmitq);
+   b->media->send_msg(net, skb, b, dst);
+   /* Until we remove cloning in tipc_l2_send_msg(): */
+   kfree_skb(skb);
+   }
+   }
+   rcu_read_unlock();
+}
+
 /**
  * tipc_l2_rcv_msg - handle incoming TIPC message from an interface
  * @buf: the received packet
diff --git a/net/tipc/bearer.h b/net/tipc/bearer.h
index dc714d9..6426f24 100644
--- a/net/tipc/bearer.h
+++ b/net/tipc/bearer.h
@@ -217,5 +217,8 @@ void tipc_bearer_cleanup(void);
 void tipc_bearer_stop(struct net *net);
 void tipc_bearer_send(struct net *net, u32 bearer_id, struct sk_buff *buf,
  struct tipc_media_addr *dest);
+void tipc_bearer_xmit(struct net *net, u32 bearer_id,
+ struct sk_buff_head *xmitq,
+ struct tipc_media_addr *dst);
 
 #endif /* _TIPC_BEARER_H */
diff --git a/net/tipc/link.c b/net/tipc/link.c
index ea32679..c052437 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -353,7 +353,6 @@ static int link_schedule_user(struct tipc_link *link, 
struct sk_buff_head *list)
/* This really cannot happen...  */
if (unlikely(imp > TIPC_CRITICAL_IMPORTANCE)) {
pr_warn("%s<%s>, send queue full", link_rst_msg, link->name);
-   tipc_link_reset(link);
return -ENOBUFS;
}
/* Non-blocking sender: */
@@ -701,6 +700,78 @@ int __tipc_link_xmit(struct net *net, struct tipc_link 
*link,
return 0;
 }
 
+/**
+ * tipc_link_xmit(): enqueue buffer list according to queue situation
+ * @link: link to use
+ * @list: chain of buffers containing message
+ * @xmitq: returned list of packets to be sent by caller
+ *
+ * Consumes the buffer chain, except when returning -ELINKCONG,
+ * since the caller then may want to make more send attempts.
+ * Returns 0 if success, or errno: -ELINKCONG, -EMSGSIZE or -ENOBUFS
+ * Messages at TIPC_SYSTEM_IMPORTANCE are always accepted
+ */
+int tipc_link_xmit(struct tipc_link *l, struct sk_buff_head *list,
+  struct sk_buff_head *xmitq)
+{
+   struct tipc_msg *hdr = buf_msg(skb_peek(list));
+   unsigned int maxwin = l->window;
+   unsigned int i, imp = msg_importance(hdr);
+   unsigned int mtu = l->mtu;
+   u16 ack = l->rcv_nxt - 1;
+   u16 seqno = l->snd_nxt;
+   u16 bc_last_in = l->owner->bclink.last_in;
+   struct sk_buff_head *transmq = &l->transmq;
+   struct sk_buff_head *backlogq = &l->backlogq;
+   struct sk_buff *skb, *_skb, *bskb;
+
+   /* Match msg importance against this and all higher backlog limits: */
+   for (i = imp; i <= TIPC_SYSTEM_IMPORTANCE; i++) {
+   if (unlikely(l->backlog[i].len >= l->backlog[i].limit))
+   return link_schedule_user(l, list);
+   }
+   if (unlikely(msg_size(hdr) > mtu))
+   return -EMSGSIZE;
+
+   /* Prepare each packet for send

[PATCH net-next 00/13] tipc: separate link and link aggregation layer

2015-07-16 Thread Jon Maloy

This is the first batch of a longer series that has two main objectives:

o Finer lock granularity during message sending and reception,
  especially regarding usage of the node spinlock.

o Better separation between the link layer implementation and the link
  aggregation layer, represented by node.c::struct tipc_node.

Hopefully these changes also make this part of the code somewhat easier
to comprehend and maintain.


Jon Maloy (13):
  tipc: introduce link entry structure to struct tipc_node
  tipc: move link creation from neighbor discoverer to node
  tipc: move link input queue to tipc_node
  tipc: use bearer index when looking up active links
  tipc: change sk_buffer handling in tipc_link_xmit()
  tipc: make media xmit call outside node spinlock context
  tipc: clean up definitions and usage of link flags
  tipc: introduce new link protocol msg create function
  tipc: improve link FSM implementation
  tipc: simplify link timer implementation
  tipc: move link supervision timer to node level
  tipc: introduce node contact FSM
  tipc: reduce locking scope during packet reception

 net/tipc/bcast.c  |   31 +-
 net/tipc/bcast.h  |1 +
 net/tipc/bearer.c |   26 +
 net/tipc/bearer.h |3 +
 net/tipc/core.h   |5 +
 net/tipc/discover.c   |   20 +-
 net/tipc/link.c   | 1517 -
 net/tipc/link.h   |   74 +--
 net/tipc/msg.h|   53 +-
 net/tipc/name_distr.c |6 +-
 net/tipc/node.c   |  549 ++
 net/tipc/node.h   |   93 ++-
 net/tipc/socket.c |   71 +--
 13 files changed, 1431 insertions(+), 1018 deletions(-)

-- 
1.9.1


[PATCH net-next 08/13] tipc: introduce new link protocol msg create function

2015-07-16 Thread Jon Maloy

As a preparation for later changes, we introduce a new function
tipc_link_build_proto_msg(). Instead of actually sending the created
protocol message, it only creates it and adds it to the head of a
skb queue provided by the caller.

Since we still need the existing function tipc_link_protocol_xmit()
for a while, we redesign it to make use of the new function.

Reviewed-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/link.c | 144 ++--
 1 file changed, 77 insertions(+), 67 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index 35a2da6..657ba91 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -129,6 +129,9 @@ static void tipc_link_proto_rcv(struct tipc_link *link,
struct sk_buff *skb);
 static void link_set_supervision_props(struct tipc_link *l_ptr, u32 tol);
 static void link_state_event(struct tipc_link *l_ptr, u32 event);
+static void tipc_link_build_proto_msg(struct tipc_link *l, int mtyp, bool 
probe,
+ u16 rcvgap, int tolerance, int priority,
+ struct sk_buff_head *xmitq);
 static void link_reset_statistics(struct tipc_link *l_ptr);
 static void link_print(struct tipc_link *l_ptr, const char *str);
 static void tipc_link_sync_xmit(struct tipc_link *l);
@@ -1323,77 +1326,21 @@ static void link_handle_out_of_seq_msg(struct tipc_link 
*l_ptr,
 /*
  * Send protocol message to the other endpoint.
  */
-void tipc_link_proto_xmit(struct tipc_link *l_ptr, u32 msg_typ, int probe_msg,
+void tipc_link_proto_xmit(struct tipc_link *l, u32 msg_typ, int probe_msg,
  u32 gap, u32 tolerance, u32 priority)
 {
-   struct sk_buff *buf = NULL;
-   struct tipc_msg *msg = l_ptr->pmsg;
-   u32 msg_size = sizeof(l_ptr->proto_msg);
-   int r_flag;
-   u16 last_rcv;
+   struct sk_buff *skb = NULL;
+   struct sk_buff_head xmitq;
 
-   /* Don't send protocol message during link failover */
-   if (l_ptr->exec_mode == TIPC_LINK_BLOCKED)
-   return;
-
-   /* Abort non-RESET send if communication with node is prohibited */
-   if ((tipc_node_blocked(l_ptr->owner)) && (msg_typ != RESET_MSG))
-   return;
-
-   /* Create protocol message with "out-of-sequence" sequence number */
-   msg_set_type(msg, msg_typ);
-   msg_set_net_plane(msg, l_ptr->net_plane);
-   msg_set_bcast_ack(msg, l_ptr->owner->bclink.last_in);
-   msg_set_last_bcast(msg, tipc_bclink_get_last_sent(l_ptr->owner->net));
-
-   if (msg_typ == STATE_MSG) {
-   u16 next_sent = l_ptr->snd_nxt;
-
-   if (!tipc_link_is_up(l_ptr))
-   return;
-   msg_set_next_sent(msg, next_sent);
-   if (!skb_queue_empty(&l_ptr->deferdq)) {
-   last_rcv = buf_seqno(skb_peek(&l_ptr->deferdq));
-   gap = mod(last_rcv - l_ptr->rcv_nxt);
-   }
-   msg_set_seq_gap(msg, gap);
-   if (gap)
-   l_ptr->stats.sent_nacks++;
-   msg_set_link_tolerance(msg, tolerance);
-   msg_set_linkprio(msg, priority);
-   msg_set_max_pkt(msg, l_ptr->mtu);
-   msg_set_ack(msg, mod(l_ptr->rcv_nxt - 1));
-   msg_set_probe(msg, probe_msg != 0);
-   if (probe_msg)
-   l_ptr->stats.sent_probes++;
-   l_ptr->stats.sent_states++;
-   } else {/* RESET_MSG or ACTIVATE_MSG */
-   msg_set_ack(msg, mod(l_ptr->failover_checkpt - 1));
-   msg_set_seq_gap(msg, 0);
-   msg_set_next_sent(msg, 1);
-   msg_set_probe(msg, 0);
-   msg_set_link_tolerance(msg, l_ptr->tolerance);
-   msg_set_linkprio(msg, l_ptr->priority);
-   msg_set_max_pkt(msg, l_ptr->advertised_mtu);
-   }
-
-   r_flag = (l_ptr->owner->working_links > tipc_link_is_up(l_ptr));
-   msg_set_redundant_link(msg, r_flag);
-   msg_set_linkprio(msg, l_ptr->priority);
-   msg_set_size(msg, msg_size);
-
-   msg_set_seqno(msg, mod(l_ptr->snd_nxt + (0xffff / 2)));
-
-   buf = tipc_buf_acquire(msg_size);
-   if (!buf)
+   __skb_queue_head_init(&xmitq);
+   tipc_link_build_proto_msg(l, msg_typ, probe_msg, gap,
+ tolerance, priority, &xmitq);
+   skb = __skb_dequeue(&xmitq);
+   if (!skb)
return;
-
-   skb_copy_to_linear_data(buf, msg, sizeof(l_ptr->proto_msg));
-   buf->priority = TC_PRIO_CONTROL;
-   tipc_bearer_send(l_ptr->owner->net, l_ptr->bearer_id, buf,
-&l_ptr->media_addr);
-   l_ptr->rcv_unacked = 0;
-   kfree_skb(buf);
+   tipc_bearer_send(l->owner->net, l->bearer_id, skb, &l->media_addr);
+   l->rcv_unacked = 0;
+   kfree_skb(skb);
 }
 
 /*
@@ -1514,6 +1461,69 @@ exit:

[PATCH net-next 03/13] tipc: move link input queue to tipc_node

2015-07-16 Thread Jon Maloy

At present, the link input queue and the name distributor receive
queues are fields aggregated in struct tipc_link. This is a hazard,
because a link might be deleted while a receiving socket still holds a
reference to one of the queues.

This commit fixes this bug. However, rather than adding yet another
reference counter to the critical data path, we move the two queues
to safe ground inside struct tipc_node, which is already protected, and
let the link code only handle references to the queues. This is also
in line with planned later changes in this area.
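
The ownership change described above can be sketched in user-space C.
This is only an illustration under stated assumptions, not the patch
itself: struct queue is a toy stand-in for struct sk_buff_head, and
link_create(), link_input() and demo_link_deleted() are hypothetical
names. The node embeds the queues, the link merely borrows pointers to
them, so a reader holding a queue reference is unaffected when the link
goes away.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct sk_buff_head: just a counter. */
struct queue {
	size_t len;
};

/* The node outlives its links, so it owns the queue storage. */
struct node {
	struct queue inputq;
	struct queue namedq;
};

/* The link only borrows the queues; it neither embeds nor frees them. */
struct link {
	struct queue *inputq;
	struct queue *namedq;
};

static void link_create(struct link *l, struct queue *inputq,
			struct queue *namedq)
{
	assert(inputq && namedq);
	l->inputq = inputq;
	l->namedq = namedq;
}

static void link_input(struct link *l)
{
	l->inputq->len++;	/* enqueue through the borrowed pointer */
}

/* Returns the input queue length observed after the link is gone. */
size_t demo_link_deleted(void)
{
	struct node n = { { 0 }, { 0 } };
	struct queue *reader = &n.inputq;	/* what a socket would hold */

	{
		struct link l;			/* link lifetime ends with this scope */

		link_create(&l, &n.inputq, &n.namedq);
		link_input(&l);
		link_input(&l);
	}
	return reader->len;			/* still valid: the node owns it */
}
```

In this sketch the reader's pointer stays valid because it refers to
storage inside the node, which is the property the commit message relies
on.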

Reviewed-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/link.c | 27 +++
 net/tipc/link.h | 12 +++-
 net/tipc/node.c |  4 +++-
 net/tipc/node.h |  3 ++-
 4 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index 03372a7..f8e0e2c 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -227,7 +227,9 @@ static void link_set_timer(struct tipc_link *link, unsigned long time)
  */
 struct tipc_link *tipc_link_create(struct tipc_node *n_ptr,
   struct tipc_bearer *b_ptr,
-  const struct tipc_media_addr *media_addr)
+  const struct tipc_media_addr *media_addr,
+  struct sk_buff_head *inputq,
+  struct sk_buff_head *namedq)
 {
struct tipc_net *tn = net_generic(n_ptr->net, tipc_net_id);
struct tipc_link *l_ptr;
@@ -289,8 +291,9 @@ struct tipc_link *tipc_link_create(struct tipc_node *n_ptr,
__skb_queue_head_init(&l_ptr->backlogq);
__skb_queue_head_init(&l_ptr->deferdq);
skb_queue_head_init(&l_ptr->wakeupq);
-   skb_queue_head_init(&l_ptr->inputq);
-   skb_queue_head_init(&l_ptr->namedq);
+   l_ptr->inputq = inputq;
+   l_ptr->namedq = namedq;
+   skb_queue_head_init(l_ptr->inputq);
link_reset_statistics(l_ptr);
tipc_node_attach_link(n_ptr, l_ptr);
setup_timer(&l_ptr->timer, link_timeout, (unsigned long)l_ptr);
@@ -391,8 +394,8 @@ void link_prepare_wakeup(struct tipc_link *l)
if ((pnd[imp] + l->backlog[imp].len) >= lim)
break;
skb_unlink(skb, &l->wakeupq);
-   skb_queue_tail(&l->inputq, skb);
-   l->owner->inputq = &l->inputq;
+   skb_queue_tail(l->inputq, skb);
+   l->owner->inputq = l->inputq;
l->owner->action_flags |= TIPC_MSG_EVT;
}
 }
@@ -465,7 +468,7 @@ void tipc_link_reset(struct tipc_link *l_ptr)
__skb_queue_purge(&l_ptr->transmq);
__skb_queue_purge(&l_ptr->deferdq);
if (!owner->inputq)
-   owner->inputq = &l_ptr->inputq;
+   owner->inputq = l_ptr->inputq;
skb_queue_splice_init(&l_ptr->wakeupq, owner->inputq);
if (!skb_queue_empty(owner->inputq))
owner->action_flags |= TIPC_MSG_EVT;
@@ -962,7 +965,7 @@ static bool link_synch(struct tipc_link *l)
 
/* Is it still in the input queue ? */
post_synch = mod(pl->rcv_nxt - l->synch_point) - 1;
-   if (skb_queue_len(&pl->inputq) > post_synch)
+   if (skb_queue_len(pl->inputq) > post_synch)
return false;
 synched:
l->flags &= ~LINK_SYNCHING;
@@ -1141,16 +1144,16 @@ static bool tipc_data_input(struct tipc_link *link, struct sk_buff *skb)
case TIPC_HIGH_IMPORTANCE:
case TIPC_CRITICAL_IMPORTANCE:
case CONN_MANAGER:
-   if (tipc_skb_queue_tail(&link->inputq, skb, dport)) {
-   node->inputq = &link->inputq;
+   if (tipc_skb_queue_tail(link->inputq, skb, dport)) {
+   node->inputq = link->inputq;
node->action_flags |= TIPC_MSG_EVT;
}
return true;
case NAME_DISTRIBUTOR:
node->bclink.recv_permitted = true;
-   node->namedq = &link->namedq;
-   skb_queue_tail(&link->namedq, skb);
-   if (skb_queue_len(&link->namedq) == 1)
+   node->namedq = link->namedq;
+   skb_queue_tail(link->namedq, skb);
+   if (skb_queue_len(link->namedq) == 1)
node->action_flags |= TIPC_NAMED_MSG_EVT;
return true;
case MSG_BUNDLER:
diff --git a/net/tipc/link.h b/net/tipc/link.h
index ae0a0ea..9c71d9e 100644
--- a/net/tipc/link.h
+++ b/net/tipc/link.h
@@ -192,8 +192,8 @@ struct tipc_link {
u16 rcv_nxt;
u32 rcv_unacked;
struct sk_buff_head deferdq;
-   struct sk_buff_head inputq;
-   struct sk_buff_head namedq;
+   struct sk_buff_head *inputq;
+   struct sk_buff_head *namedq;
 
/* Congestion handling */
struct sk_buff_head wakeupq;
@@ -207,9 +207,11 @@ struct tipc_link {
 
 struct tipc_port;
 
-struct tipc_link *tipc_link_create(struct tipc_no

[PATCH net-next 12/13] tipc: introduce node contact FSM

2015-07-16 Thread Jon Maloy

The logic for determining when a node is permitted to establish
and maintain contact with its peer node becomes non-trivial in the
presence of multiple parallel links that may come and go independently.

A known failure scenario is that one endpoint registers both its links
to the peer as lost, cleans up its binding table, and prepares for a table
update once contact is re-established, while the other endpoint may
see its links reset and re-established one by one, hence seeing
no need to re-synchronize the binding table. To avoid this, a node
must not allow re-establishing contact until it has confirmation that
the peer, too, has lost both links.

Currently, the mechanism for handling this consists of setting and
resetting two state flags from different locations in the code. This
solution is hard to understand and maintain. A closer analysis even
reveals that it is not completely safe.

In this commit we instead introduce an FSM that keeps track of
the conditions for when the node can establish and maintain links.
It has six states and four events, and is strictly based on explicit
knowledge about the own node's and the peer node's contact states.
Only events leading to state change are shown as edges in the figure
below.

                             +--------------+
                             | SELF_UP/     |
           +---------------->| PEER_COMING  |------------------+
    SELF_  |                 +--------------+                  |PEER_
    ESTBL_ |                        |                          |ESTBL_
    CONTACT|      SELF_LOST_CONTACT |                          |CONTACT
           |                        v                          |
           |                 +--------------+                  |
           |      PEER_      | SELF_DOWN/   |     SELF_        |
           |      LOST_   +--| PEER_LEAVING |<--+ LOST_        v
+-------------+   CONTACT |  +--------------+   | CONTACT  +-----------+
| SELF_DOWN/  |<----------+                     +----------| SELF_UP/  |
| PEER_DOWN   |<----------+                     +----------| PEER_UP   |
+-------------+   SELF_   |  +--------------+   | PEER_    +-----------+
           |      LOST_   +--| SELF_LEAVING/|<--+ LOST_        A
           |      CONTACT    | PEER_DOWN    |     CONTACT      |
           |                 +--------------+                  |
           |                         A                         |
    PEER_  |       PEER_LOST_CONTACT |                         |SELF_
    ESTBL_ |                         |                         |ESTBL_
    CONTACT|                 +--------------+                  |CONTACT
           +---------------->| PEER_UP/     |------------------+
                             | SELF_COMING  |
                             +--------------+
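
For readers who prefer code to boxes, the edges of the figure can be
written as a small transition table. This is a sketch transcribed from
the figure only, not the code in this patch: the enum names are
hypothetical identifiers derived from the box and edge labels, and
fsm_next() is an assumed helper name. Any (state, event) pair that is not
an edge in the figure leaves the state unchanged.

```c
#include <assert.h>
#include <stddef.h>

enum state {
	SELF_DOWN_PEER_DOWN,
	SELF_UP_PEER_COMING,
	PEER_UP_SELF_COMING,
	SELF_UP_PEER_UP,
	SELF_DOWN_PEER_LEAVING,
	SELF_LEAVING_PEER_DOWN,
	NUM_STATES
};

enum event {
	SELF_ESTBL_CONTACT,
	SELF_LOST_CONTACT,
	PEER_ESTBL_CONTACT,
	PEER_LOST_CONTACT,
	NUM_EVENTS
};

/* One entry per edge drawn in the figure. */
static const struct {
	enum state from;
	enum event evt;
	enum state to;
} edges[] = {
	{ SELF_DOWN_PEER_DOWN,    SELF_ESTBL_CONTACT, SELF_UP_PEER_COMING },
	{ SELF_DOWN_PEER_DOWN,    PEER_ESTBL_CONTACT, PEER_UP_SELF_COMING },
	{ SELF_UP_PEER_COMING,    PEER_ESTBL_CONTACT, SELF_UP_PEER_UP },
	{ SELF_UP_PEER_COMING,    SELF_LOST_CONTACT,  SELF_DOWN_PEER_LEAVING },
	{ PEER_UP_SELF_COMING,    SELF_ESTBL_CONTACT, SELF_UP_PEER_UP },
	{ PEER_UP_SELF_COMING,    PEER_LOST_CONTACT,  SELF_LEAVING_PEER_DOWN },
	{ SELF_UP_PEER_UP,        SELF_LOST_CONTACT,  SELF_DOWN_PEER_LEAVING },
	{ SELF_UP_PEER_UP,        PEER_LOST_CONTACT,  SELF_LEAVING_PEER_DOWN },
	{ SELF_DOWN_PEER_LEAVING, PEER_LOST_CONTACT,  SELF_DOWN_PEER_DOWN },
	{ SELF_LEAVING_PEER_DOWN, SELF_LOST_CONTACT,  SELF_DOWN_PEER_DOWN },
};

/* Return the next state; events that are not edges keep the state. */
enum state fsm_next(enum state s, enum event e)
{
	size_t i;

	assert(s < NUM_STATES && e < NUM_EVENTS);
	for (i = 0; i < sizeof(edges) / sizeof(edges[0]); i++) {
		if (edges[i].from == s && edges[i].evt == e)
			return edges[i].to;
	}
	return s;
}
```

In this table a node in SELF_DOWN/PEER_LEAVING ignores
SELF_ESTBL_CONTACT and only moves on after PEER_LOST_CONTACT, which is
one way to read the rule that contact must not be re-established until
the peer has confirmed losing its links.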

Reviewed-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/link.c |  74 ++--
 net/tipc/msg.h  |   7 +++
 net/tipc/node.c | 130 +---
 net/tipc/node.h |  28 +---
 4 files changed, 185 insertions(+), 54 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index 5b4609b..eaccf45 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -911,9 +911,13 @@ static void link_retransmit_failure(struct tipc_link *l_ptr,
 
if (l_ptr->addr) {
/* Handle failure on standard link */
-   link_print(l_ptr, "Resetting link\n");
+   link_print(l_ptr, "Resetting link ");
+   pr_info("Failed msg: usr %u, typ %u, len %u, err %u\n",
+   msg_user(msg), msg_type(msg), msg_size(msg),
+   msg_errcode(msg));
+   pr_info("sqno %u, prev: %x, src: %x\n",
+   msg_seqno(msg), msg_prevnode(msg), msg_orignode(msg));
tipc_link_reset(l_ptr);
-
} else {
/* Handle failure on broadcast link */
struct tipc_node *n_ptr;
@@ -1067,15 +1071,8 @@ void tipc_rcv(struct net *net, struct sk_buff *skb, struct tipc_bearer *b_ptr)
if (unlikely(!l_ptr))
goto unlock;
 
-   /* Verify that communication with node is currently allowed */
-   if ((n_ptr->action_flags & TIPC_WAIT_PEER_LINKS_DOWN) &&
-   msg_user(msg) == LINK_PROTOCOL &&
-   (msg_type(msg) == RESET_MSG ||
-   msg_type(msg) == ACTIVATE_MSG) &&
-   !msg_redundant_link(msg))
-   n_ptr->action_flags &= ~TIPC_WAIT_PEER_LINKS_DOWN;
-
-   if (tipc_node_blocked(n_ptr))
+   /* Is reception of this pkt permitted at the moment ? */
+   if (!tipc_node_filter_skb(n_ptr, msg))
goto unlock;
 
/* Validate message sequence number info */
@@ -1371,15 +1368,6 @@ static void tipc_link_proto_rcv(struct tipc_link *l_ptr,
if (less_eq(msg_session(msg), l_ptr->peer_session))

[PATCH net-next 13/13] tipc: reduce locking scope during packet reception

2015-07-16 Thread Jon Maloy

We convert packet/message reception according to the same principle
we have been using for message sending and timeout handling:

We move the function tipc_rcv() to node.c, hence handling the initial
packet reception at the link aggregation level. The function grabs
the node lock, selects the receiving link, and accesses it via a new
call tipc_link_rcv(). This function appends buffers to the input
queue for delivery upwards, but it may also append outgoing packets
to the xmit queue, just as we do during regular message sending. The
latter will happen when buffers are forwarded from the link backlog,
or when retransmission is requested.

Upon return of this function, and after having released the node lock,
tipc_rcv() delivers/transmits the contents of those queues, but it may
also perform actions such as link activation or reset, as indicated by
the return flags from the link.

This reduces the number of CPU cycles spent inside the node spinlock,
and reduces contention on that lock.
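
The pattern described above (sort work into local queues under the lock,
act on them after releasing it) can be sketched as follows. This is a toy
user-space model under stated assumptions, not the kernel code: the
boolean "locked" stands in for the node spinlock, struct pktq stands in
for struct sk_buff_head, and RC_LINK_UP, link_rcv() and node_rcv() are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QMAX 8
#define RC_LINK_UP 0x1		/* hypothetical return flag from the link */

struct pktq {
	int pkt[QMAX];
	size_t len;
};

struct node {
	bool locked;		/* toy stand-in for the node spinlock */
	int backlog;		/* packets waiting to be forwarded */
	size_t delivered;
	size_t transmitted;
	bool link_up;
};

static void q_add(struct pktq *q, int pkt)
{
	assert(q->len < QMAX);
	q->pkt[q->len++] = pkt;
}

/* Runs under the lock: only sorts work into the caller's queues. */
static int link_rcv(struct node *n, int pkt, struct pktq *inputq,
		    struct pktq *xmitq)
{
	int rc = 0;

	assert(n->locked);
	q_add(inputq, pkt);		/* for delivery upwards */
	while (n->backlog > 0)		/* forward the backlog */
		q_add(xmitq, n->backlog--);
	if (!n->link_up)
		rc |= RC_LINK_UP;	/* ask the caller to activate the link */
	return rc;
}

void node_rcv(struct node *n, int pkt)
{
	struct pktq inputq = { { 0 }, 0 };
	struct pktq xmitq = { { 0 }, 0 };
	int rc;

	n->locked = true;		/* grab the node lock */
	rc = link_rcv(n, pkt, &inputq, &xmitq);
	n->locked = false;		/* release it before doing the real work */

	/* Delivery, transmission and link actions happen outside the lock. */
	assert(!n->locked);
	n->delivered += inputq.len;
	n->transmitted += xmitq.len;
	if (rc & RC_LINK_UP)
		n->link_up = true;
}
```

The point of the sketch is that link_rcv() never delivers or transmits
anything itself; it only fills queues and returns flags, so the time
spent with the lock held stays short.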

Reviewed-by: Ying Xue 
Signed-off-by: Jon Maloy 
---
 net/tipc/bcast.c |  23 ++
 net/tipc/bcast.h |   1 +
 net/tipc/core.h  |   5 +
 net/tipc/link.c  | 673 +--
 net/tipc/link.h  |   6 +-
 net/tipc/msg.h   |  50 -
 net/tipc/node.c  | 105 -
 net/tipc/node.h  |   4 -
 8 files changed, 478 insertions(+), 389 deletions(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index aab4e8d..8b010c9 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -316,6 +316,29 @@ void tipc_bclink_update_link_state(struct tipc_node *n_ptr,
}
 }
 
+void tipc_bclink_sync_state(struct tipc_node *n, struct tipc_msg *hdr)
+{
+   u16 last = msg_last_bcast(hdr);
+   int mtyp = msg_type(hdr);
+
+   if (unlikely(msg_user(hdr) != LINK_PROTOCOL))
+   return;
+   if (mtyp == STATE_MSG) {
+   tipc_bclink_update_link_state(n, last);
+   return;
+   }
+   /* Compatibility: older nodes don't know BCAST_PROTOCOL synchronization,
+* and transfer synch info in LINK_PROTOCOL messages.
+*/
+   if (tipc_node_is_up(n))
+   return;
+   if ((mtyp != RESET_MSG) && (mtyp != ACTIVATE_MSG))
+   return;
+   n->bclink.last_sent = last;
+   n->bclink.last_in = last;
+   n->bclink.oos_state = 0;
+}
+
 /**
  * bclink_peek_nack - monitor retransmission requests sent by other nodes
  *
diff --git a/net/tipc/bcast.h b/net/tipc/bcast.h
index 3c290a48..d74c69b 100644
--- a/net/tipc/bcast.h
+++ b/net/tipc/bcast.h
@@ -133,5 +133,6 @@ void tipc_bclink_wakeup_users(struct net *net);
 int tipc_nl_add_bc_link(struct net *net, struct tipc_nl_msg *msg);
 int tipc_nl_bc_link_set(struct net *net, struct nlattr *attrs[]);
 void tipc_bclink_input(struct net *net);
+void tipc_bclink_sync_state(struct tipc_node *n, struct tipc_msg *msg);
 
 #endif
diff --git a/net/tipc/core.h b/net/tipc/core.h
index 0fcf133..f4ed677 100644
--- a/net/tipc/core.h
+++ b/net/tipc/core.h
@@ -129,6 +129,11 @@ static inline int less(u16 left, u16 right)
return less_eq(left, right) && (mod(right) != mod(left));
 }
 
+static inline int in_range(u16 val, u16 min, u16 max)
+{
+   return !less(val, min) && !more(val, max);
+}
+
 #ifdef CONFIG_SYSCTL
 int tipc_register_sysctl(void);
 void tipc_unregister_sysctl(void);
diff --git a/net/tipc/link.c b/net/tipc/link.c
index eaccf45..55b675d 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -77,6 +77,10 @@ static const struct nla_policy tipc_nl_prop_policy[TIPC_NLA_PROP_MAX + 1] = {
 };
 
 /*
+ * Interval between NACKs when packets arrive out of order
+ */
+#define TIPC_NACK_INTV (TIPC_MIN_LINK_WIN * 2)
+/*
  * Out-of-range value for link session numbers
  */
 #define WILDCARD_SESSION 0x1
@@ -123,22 +127,19 @@ static int link_establishing(struct tipc_link *l)
return l->state == TIPC_LINK_ESTABLISHING;
 }
 
-static void link_handle_out_of_seq_msg(struct tipc_link *link,
-  struct sk_buff *skb);
-static void tipc_link_proto_rcv(struct tipc_link *link,
-   struct sk_buff *skb);
-static void link_state_event(struct tipc_link *l_ptr, u32 event);
+static int tipc_link_proto_rcv(struct tipc_link *l, struct sk_buff *skb,
+  struct sk_buff_head *xmitq);
 static void tipc_link_build_proto_msg(struct tipc_link *l, int mtyp, bool probe,
  u16 rcvgap, int tolerance, int priority,
  struct sk_buff_head *xmitq);
 static void link_reset_statistics(struct tipc_link *l_ptr);
 static void link_print(struct tipc_link *l_ptr, const char *str);
-static void tipc_link_sync_xmit(struct tipc_link *l);
+static void tipc_link_build_bcast_sync_msg(struct tipc_link *l,
+  struct sk_buff_head *xmitq);
 static void tipc_link_sync_rcv(struct tipc_node *n, struct sk_buff *buf);
 static void tipc_l
