RE: [PATCH 1/2] bnx2x:Fix error handling in the function bnxc2x_dbcx_set_params

2015-07-18 Thread Yuval Mintz
> This fixes the error handling in the function bnxc2x_dbcx_params for both 
> calls
> to bnx2x_dcbnl_update_applist by checking if they failed by returning a error
> code and if so return immediately to this function's caller due to this
> nonrecoverable internal failure.

Hi Nicholas,

Even assuming this is the reasonable thing to do [which I'm not fully 
convinced],
I don't think this solution is complete - you'd also need to break the 
iteration in
bnx2x_dcbnl_update_applist() [at the moment said function always returns 'rc'
from the last dcb_app weve tried setting].

Also, why do you send multiple patches for fixing the same issue in different
functions in the same file?

Thanks,
Yuval

> 
> Signed-off-by: Nicholas Krause 
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_dcb.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_dcb.c
> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_dcb.c
> index 6e4294e..e3cd0c6 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_dcb.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_dcb.c
> @@ -724,7 +724,8 @@ void bnx2x_dcbx_set_params(struct bnx2x *bp, u32
> state)
>* Delete app tlvs from dcbnl before reading new
>* negotiation results
>*/
> - bnx2x_dcbnl_update_applist(bp, true);
> + if (bnx2x_dcbnl_update_applist(bp, true))
> + return;
> 
>   /* Read remote mib if dcbx is in the FW */
>   if (bnx2x_dcbx_read_shmem_remote_mib(bp))
> @@ -748,7 +749,8 @@ void bnx2x_dcbx_set_params(struct bnx2x *bp, u32
> state)
>   /*
>* Add new app tlvs to dcbnl
>*/
> - bnx2x_dcbnl_update_applist(bp, false);
> + if (bnx2x_dcbnl_update_applist(bp, false))
> + return;
>  #endif
>   /*
>* reconfigure the netdevice with the results of the new
> --
> 2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 RFC net-next] net/vxlan: Fix kernel unaligned access in __vxlan_find_mac

2015-07-18 Thread David Miller
From: Joe Perches 
Date: Sat, 18 Jul 2015 11:06:26 -0700

> As it's not fatal, naybe the sparc64 message should be
> KERN_DEBUG/pr_debug.

It's not fatal insofar as it doesn't crash the system.

But performance wise is _IS_ fatal, and I want the kernel to bark
at the user so that they report the issue or fix it sooner rather
than later.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next]r8169:Disable advertise full duplex on chip id RTL_GIGA_MAC_VER_09. Disable advertise full duplex on chip id RTL_GIGA_MAC_VER_09 wich support random full duplex.(very sad) On branc

2015-07-18 Thread David Miller

This is still ver far from acceptable.  You aren't even talking to us when
we ask you questions and ask you to make changes, you just make another
improperly formatted submission of a bad patch all over again.

Your entire description is in your Subject line, and it insufficiently
describes your change.

Frankly, I am very sick and tired of these low quality submissions.  I want to
know where these changes are coming from, who they are important to, and most
importantly why it is so difficult for you to submit your changes properly.

Just simply _LOOK_ at postings by other developers who submit changes
that successfully get accepted by us.

_LOOK_ at how their Subject line looks, and mimick what they do _EXACTLY_

_LOOK_ at how their commit messages look, how detailed they are, how
descriptive they are, and how they are formatted and composed, and
mimick _EACTLY_ what they are doing.

This isn't rocket science, if you want us to take your changes seriously
you _MUST_ learn how to conform to our patch submission methods and
expectations.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next] inet: Always increment refcount in inet_twsk_schedule

2015-07-18 Thread subashab
I am seeing an issue with the reference count of time wait sockets which
leads to freeing of active timer object. This occurs in some data stress
test setups, so I am unable to determine the exact step when it occured.
However, I logged the refcount and was able to find out the code path
which leads to this problem.

//Initialize time wait socket and setup timer
inet_twsk_alloc() tw_refcnt = 0
__inet_twsk_hashdance() tw_refcnt = 3
inet_twsk_schedule() tw_refcnt = 4
inet_twsk_put() tw_refcnt = 3

//Receive packet 1 in timewait state
tcp_timewait_state_process() -> inet_twsk_schedule tw_refcnt = 3 (no change)
TCP: tcp_v4_timewait_ack() -> inet_twsk_put() tw_refcnt = 2

//Receive packet 2 in timewait state
tcp_timewait_state_process() -> inet_twsk_schedule tw_refcnt = 2 (no change)
TCP: tcp_v4_timewait_ack() -> inet_twsk_put() tw_refcnt = 1

//Receive packet 3 in timewait state
tcp_timewait_state_process() -> inet_twsk_schedule tw_refcnt = 1 (no change)
TCP: tcp_v4_timewait_ack() -> inet_twsk_put() tw_refcnt = 0

After this step, the time wait socket is destroyed along with the active
timer object. This leads to a warning being printed which eventually leads
to a crash.

ODEBUG: free active (active state 0) object type: timer_list hint:
tw_timer_handler+0x0/0x68

It appears that inet_twsk_schedule needs to increment the reference count
unconditionally, otherwise the socket will be destroyed since reference
count will be decremented each time an ack is sent out as a response for
an incoming packet.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 net/ipv4/inet_timewait_sock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index cbeb022..99c349a 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -246,9 +246,9 @@ void inet_twsk_schedule(struct inet_timewait_sock *tw,
const int timeo)

tw->tw_kill = timeo <= 4*HZ;
if (!mod_timer_pinned(&tw->tw_timer, jiffies + timeo)) {
-   atomic_inc(&tw->tw_refcnt);
atomic_inc(&tw->tw_dr->tw_count);
}
+   atomic_inc(&tw->tw_refcnt);
 }
 EXPORT_SYMBOL_GPL(inet_twsk_schedule);

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next v3 4/5] rocker: add offload_fwd_mark support

2015-07-18 Thread sfeldma
From: Scott Feldman 

If device flags ingress packet as "fwd offload", mark the
skb->offlaod_fwd_mark using the ingress port's dev->offlaod_fwd_mark.  This
will be the hint to the kernel that this packet has already been forwarded
by device to egress ports matching skb->offlaod_fwd_mark.

For rocker, derive port dev->offlaod_fwd_mark based on device switch ID and
port ifindex.  If port is bridged, use the bridge ifindex rather than the
port ifindex.

Signed-off-by: Scott Feldman 
Acked-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker.c |   11 +++
 drivers/net/ethernet/rocker/rocker.h |1 +
 2 files changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/rocker/rocker.c 
b/drivers/net/ethernet/rocker/rocker.c
index 9324283..0fdfa47 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -4800,6 +4800,7 @@ static int rocker_port_rx_proc(const struct rocker 
*rocker,
const struct rocker_tlv *attrs[ROCKER_TLV_RX_MAX + 1];
struct sk_buff *skb = rocker_desc_cookie_ptr_get(desc_info);
size_t rx_len;
+   u16 rx_flags = 0;
 
if (!skb)
return -ENOENT;
@@ -4807,6 +4808,8 @@ static int rocker_port_rx_proc(const struct rocker 
*rocker,
rocker_tlv_parse_desc(attrs, ROCKER_TLV_RX_MAX, desc_info);
if (!attrs[ROCKER_TLV_RX_FRAG_LEN])
return -EINVAL;
+   if (attrs[ROCKER_TLV_RX_FLAGS])
+   rx_flags = rocker_tlv_get_u16(attrs[ROCKER_TLV_RX_FLAGS]);
 
rocker_dma_rx_ring_skb_unmap(rocker, attrs);
 
@@ -4814,6 +4817,9 @@ static int rocker_port_rx_proc(const struct rocker 
*rocker,
skb_put(skb, rx_len);
skb->protocol = eth_type_trans(skb, rocker_port->dev);
 
+   if (rx_flags & ROCKER_RX_FLAGS_FWD_OFFLOAD)
+   skb->offload_fwd_mark = rocker_port->dev->offload_fwd_mark;
+
rocker_port->dev->stats.rx_packets++;
rocker_port->dev->stats.rx_bytes += skb->len;
 
@@ -4951,6 +4957,8 @@ static int rocker_probe_port(struct rocker *rocker, 
unsigned int port_number)
}
rocker->ports[port_number] = rocker_port;
 
+   switchdev_port_fwd_mark_set(rocker_port->dev, NULL, false);
+
rocker_port_set_learning(rocker_port, SWITCHDEV_TRANS_NONE);
 
err = rocker_port_ig_tbl(rocker_port, SWITCHDEV_TRANS_NONE, 0);
@@ -5230,6 +5238,7 @@ static int rocker_port_bridge_join(struct rocker_port 
*rocker_port,
rocker_port_internal_vlan_id_get(rocker_port, bridge->ifindex);
 
rocker_port->bridge_dev = bridge;
+   switchdev_port_fwd_mark_set(rocker_port->dev, bridge, true);
 
return rocker_port_vlan_add(rocker_port, SWITCHDEV_TRANS_NONE,
untagged_vid, 0);
@@ -5250,6 +5259,8 @@ static int rocker_port_bridge_leave(struct rocker_port 
*rocker_port)
rocker_port_internal_vlan_id_get(rocker_port,
 rocker_port->dev->ifindex);
 
+   switchdev_port_fwd_mark_set(rocker_port->dev, rocker_port->bridge_dev,
+   false);
rocker_port->bridge_dev = NULL;
 
err = rocker_port_vlan_add(rocker_port, SWITCHDEV_TRANS_NONE,
diff --git a/drivers/net/ethernet/rocker/rocker.h 
b/drivers/net/ethernet/rocker/rocker.h
index 08b2c3d..12490b2 100644
--- a/drivers/net/ethernet/rocker/rocker.h
+++ b/drivers/net/ethernet/rocker/rocker.h
@@ -246,6 +246,7 @@ enum {
 #define ROCKER_RX_FLAGS_TCPBIT(5)
 #define ROCKER_RX_FLAGS_UDPBIT(6)
 #define ROCKER_RX_FLAGS_TCP_UDP_CSUM_GOOD  BIT(7)
+#define ROCKER_RX_FLAGS_FWD_OFFLOADBIT(8)
 
 enum {
ROCKER_TLV_TX_UNSPEC,
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next v3 5/5] switchdev: update documentation for offload_fwd_mark

2015-07-18 Thread sfeldma
From: Scott Feldman 

Signed-off-by: Scott Feldman 
Acked-by: Jiri Pirko 
---
 Documentation/networking/switchdev.txt |   14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/switchdev.txt 
b/Documentation/networking/switchdev.txt
index c5d7ade..9825f32 100644
--- a/Documentation/networking/switchdev.txt
+++ b/Documentation/networking/switchdev.txt
@@ -279,8 +279,18 @@ and unknown unicast packets to all ports in domain, if 
allowed by port's
 current STP state.  The switch driver, knowing which ports are within which
 vlan L2 domain, can program the switch device for flooding.  The packet should
 also be sent to the port netdev for processing by the bridge driver.  The
-bridge should not reflood the packet to the same ports the device flooded.
-XXX: the mechanism to avoid duplicate flood packets is being discuseed.
+bridge should not reflood the packet to the same ports the device flooded,
+otherwise there will be duplicate packets on the wire.
+
+To avoid duplicate packets, the device/driver should mark a packet as already
+forwarded using skb->offload_fwd_mark.  The same mark is set on the device
+ports in the domain using dev->offload_fwd_mark.  If the skb->offload_fwd_mark
+is non-zero and matches the forwarding egress port's dev->skb_mark, the kernel
+will drop the skb right before transmit on the egress port, with the
+understanding that the device already forwarded the packet on same egress port.
+The driver can use switchdev_port_fwd_mark_set() to set a globally unique mark
+for port's dev->offload_fwd_mark, based on the port's parent ID (switch ID) and
+a group ifindex.
 
 It is possible for the switch device to not handle flooding and push the
 packets up to the bridge driver for flooding.  This is not ideal as the number
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next v3 2/5] net: add phys ID compare helper to test if two IDs are the same

2015-07-18 Thread sfeldma
From: Scott Feldman 

Signed-off-by: Scott Feldman 
Acked-by: Jiri Pirko 
---
 include/linux/netdevice.h |7 +++
 net/switchdev/switchdev.c |8 ++--
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8364f29..607b5f4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -766,6 +766,13 @@ struct netdev_phys_item_id {
unsigned char id_len;
 };
 
+static inline bool netdev_phys_item_id_same(struct netdev_phys_item_id *a,
+   struct netdev_phys_item_id *b)
+{
+   return a->id_len == b->id_len &&
+  memcmp(a->id, b->id, a->id_len) == 0;
+}
+
 typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
   struct sk_buff *skb);
 
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 9f2add3..4e5bba5 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -910,13 +910,9 @@ static struct net_device *switchdev_get_dev_by_nhs(struct 
fib_info *fi)
if (switchdev_port_attr_get(dev, &attr))
return NULL;
 
-   if (nhsel > 0) {
-   if (prev_attr.u.ppid.id_len != attr.u.ppid.id_len)
+   if (nhsel > 0 &&
+   !netdev_phys_item_id_same(&prev_attr.u.ppid, &attr.u.ppid))
return NULL;
-   if (memcmp(prev_attr.u.ppid.id, attr.u.ppid.id,
-  attr.u.ppid.id_len))
-   return NULL;
-   }
 
prev_attr = attr;
}
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next v3 0/5] switchdev: avoid duplicate packet forwarding

2015-07-18 Thread sfeldma
From: Scott Feldman 

v3:

 - Per Nicolas Dichtel review: remove errant empty union.

v2:

 - Per davem review: in sk_buff, union fwd_mark with secmark to save space
   since features appear to be mutually exclusive.
 - Per Simon Horman review:
   - fix grammar in switchdev.txt wrt fwd_mark
   - remove some unrelated changes that snuck in

v1:

This patchset was previously submitted as RFC.  No changes from the last
version (v2) sent under RFC.  Including RFC version history here for reference.

RFC v2:

 - s/fwd_mark/offload_fwd_mark
 - use consume_skb rather than kfree_skb when dropping pkt on egress.
 - Use Jiri's suggestion to use ifindex of one of the ports in a group
   as the mark for all the ports in the group.  This can be done with
   no additional storage (no hashtable from v1).  To pull it off, we
   need some simple recursive routines to walk the netdev tree ensuring
   all leaves in the tree (ports) in the same group (e.g. bridge)
   belonging to the same switch device will have the same offload fwd mark.
   Maybe someone sees a better design for the recusive routines?  They're
   not too bad, and should cover the stacked driver cases.

RFC v1:

With switchdev support for offloading L2/L3 forwarding data path to a
switch device, we have a general problem where both the device and the
kernel may forward the packet, resulting in duplicate packets on the wire.
Anytime a packet is forwarded by the device and a copy is sent to the CPU,
there is potential for duplicate forwarding, as the kernel may also do a
forwarding lookup and send the packet on the wire.

The specific problem this patch series is interested in solving is avoiding
duplicate packets on bridged ports.  There was a previous RFC from Roopa
(http://marc.info/?l=linux-netdev&m=142687073314252&w=2) to address this
problem, but didn't solve the problem of mixed ports in the bridge from
different devices; there was no way to exclude some ports from forwarding
and include others.  This RFC solves that problem by tagging the ingressing
packet with a unique mark, and then comparing the packet mark with the
egress port mark, and skip forwarding when there is a match.  For the mixed
ports bridge case, only those ports with matching marks are skipped.

The switchdev port driver must do two things:

1) Generate a fwd_mark for each switch port, using some unique key of the
   switch device (and optionally port).  This is done when the port netdev
   is registered or if the port's group membership changes (joins/leaves
   a bridge, for example).

2) On packet ingress from port, mark the skb with the ingress port's
   fwd_mark.  If the device supports it, it's useful to only mark skbs
   which were already forwarded by the device.  If the device does not
   support such indication, all skbs can be marked, even if they're
   local dst.

Two new 32-bit fields are added to struct sk_buff and struct netdevice to
hold the fwd_mark.  I've wrapped these with CONFIG_NET_SWITCHDEV for now. I
tried using skb->mark for this purpose, but ebtables can overwrite the
skb->mark before the bridge gets it, so that will not work.

In general, this fwd_mark can be used for any case where a packet is
forwarded by the device and a copy is sent to the CPU, to avoid the kernel
re-forwarding the packet.  sFlow is another use-case that comes to mind,
but I haven't explored the details.
Scott Feldman (5):
  net: don't reforward packets already forwarded by offload device
  net: add phys ID compare helper to test if two IDs are the same
  switchdev: add offload_fwd_mark generator helper
  rocker: add offload_fwd_mark support
  switchdev: update documentation for offload_fwd_mark

 Documentation/networking/switchdev.txt |   14 +++-
 drivers/net/ethernet/rocker/rocker.c   |   11 
 drivers/net/ethernet/rocker/rocker.h   |1 +
 include/linux/netdevice.h  |   13 
 include/linux/skbuff.h |9 ++-
 include/net/switchdev.h|9 +++
 net/core/dev.c |   10 +++
 net/switchdev/switchdev.c  |  111 ++--
 8 files changed, 169 insertions(+), 9 deletions(-)

-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next v3 1/5] net: don't reforward packets already forwarded by offload device

2015-07-18 Thread sfeldma
From: Scott Feldman 

Just before queuing skb for xmit on port, check if skb has been marked by
switchdev port driver as already fordwarded by device.  If so, drop skb.  A
non-zero skb->offload_fwd_mark field is set by the switchdev port
driver/device on ingress to indicate the skb has already been forwarded by
the device to egress ports with matching dev->skb_mark.  The switchdev port
driver would assign a non-zero dev->offload_skb_mark for each device port
netdev during registration, for example.

Signed-off-by: Scott Feldman 
Acked-by: Jiri Pirko 
Acked-by: Roopa Prabhu 
---
 include/linux/netdevice.h |6 ++
 include/linux/skbuff.h|9 -
 net/core/dev.c|   10 ++
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 45cfd79..8364f29 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1456,6 +1456,8 @@ enum netdev_priv_flags {
  *
  * @xps_maps:  XXX: need comments on this one
  *
+ * @offload_fwd_mark:  Offload device fwding mark
+ *
  * @trans_start:   Time (in jiffies) of last Tx
  * @watchdog_timeo:Represents the timeout that is used by
  * the watchdog ( see dev_watchdog() )
@@ -1697,6 +1699,10 @@ struct net_device {
struct xps_dev_maps __rcu *xps_maps;
 #endif
 
+#ifdef CONFIG_NET_SWITCHDEV
+   u32 offload_fwd_mark;
+#endif
+
/* These may be needed for future network-power-down code. */
 
/*
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d6cdd6e..af7a096 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -506,6 +506,7 @@ static inline u32 skb_mstamp_us_delta(const struct 
skb_mstamp *t1,
  * @no_fcs:  Request NIC to treat last 4 bytes as Ethernet FCS
   *@napi_id: id of the NAPI struct this skb came from
  * @secmark: security marking
+ * @offload_fwd_mark: fwding offload mark
  * @mark: Generic packet mark
  * @vlan_proto: vlan encapsulation protocol
  * @vlan_tci: vlan tag control information
@@ -650,9 +651,15 @@ struct sk_buff {
unsigned intsender_cpu;
};
 #endif
+   union {
 #ifdef CONFIG_NETWORK_SECMARK
-   __u32   secmark;
+   __u32   secmark;
+#endif
+#ifdef CONFIG_NET_SWITCHDEV
+   __u32   offload_fwd_mark;
 #endif
+   };
+
union {
__u32   mark;
__u32   reserved_tailroom;
diff --git a/net/core/dev.c b/net/core/dev.c
index 8810b6b..2ee15af 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3061,6 +3061,16 @@ static int __dev_queue_xmit(struct sk_buff *skb, void 
*accel_priv)
else
skb_dst_force(skb);
 
+#ifdef CONFIG_NET_SWITCHDEV
+   /* Don't forward if offload device already forwarded */
+   if (skb->offload_fwd_mark &&
+   skb->offload_fwd_mark == dev->offload_fwd_mark) {
+   consume_skb(skb);
+   rc = NET_XMIT_SUCCESS;
+   goto out;
+   }
+#endif
+
txq = netdev_pick_tx(dev, skb, accel_priv);
q = rcu_dereference_bh(txq->qdisc);
 
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next v3 3/5] switchdev: add offload_fwd_mark generator helper

2015-07-18 Thread sfeldma
From: Scott Feldman 

skb->offload_fwd_mark and dev->offload_fwd_mark are 32-bit and should be
unique for device and may even be unique for a sub-set of ports within
device, so add switchdev helper function to generate unique marks based on
port's switch ID and group_ifindex.  group_ifindex would typically be the
container dev's ifindex, such as the bridge's ifindex.

The generator uses a global hash table to store offload_fwd_marks hashed by
{switch ID, group_ifindex} key.

Signed-off-by: Scott Feldman 
Acked-by: Jiri Pirko 
---
 include/net/switchdev.h   |9 
 net/switchdev/switchdev.c |  103 +
 2 files changed, 112 insertions(+)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index d5671f1..89da893 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -157,6 +157,9 @@ int switchdev_port_fdb_del(struct ndmsg *ndm, struct nlattr 
*tb[],
 int switchdev_port_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
struct net_device *dev,
struct net_device *filter_dev, int idx);
+void switchdev_port_fwd_mark_set(struct net_device *dev,
+struct net_device *group_dev,
+bool joining);
 
 #else
 
@@ -271,6 +274,12 @@ static inline int switchdev_port_fdb_dump(struct sk_buff 
*skb,
return -EOPNOTSUPP;
 }
 
+static inline void switchdev_port_fwd_mark_set(struct net_device *dev,
+  struct net_device *group_dev,
+  bool joining)
+{
+}
+
 #endif
 
 #endif /* _LINUX_SWITCHDEV_H_ */
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 4e5bba5..33bafa2 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -1039,3 +1039,106 @@ void switchdev_fib_ipv4_abort(struct fib_info *fi)
fi->fib_net->ipv4.fib_offload_disabled = true;
 }
 EXPORT_SYMBOL_GPL(switchdev_fib_ipv4_abort);
+
+static bool switchdev_port_same_parent_id(struct net_device *a,
+ struct net_device *b)
+{
+   struct switchdev_attr a_attr = {
+   .id = SWITCHDEV_ATTR_PORT_PARENT_ID,
+   .flags = SWITCHDEV_F_NO_RECURSE,
+   };
+   struct switchdev_attr b_attr = {
+   .id = SWITCHDEV_ATTR_PORT_PARENT_ID,
+   .flags = SWITCHDEV_F_NO_RECURSE,
+   };
+
+   if (switchdev_port_attr_get(a, &a_attr) ||
+   switchdev_port_attr_get(b, &b_attr))
+   return false;
+
+   return netdev_phys_item_id_same(&a_attr.u.ppid, &b_attr.u.ppid);
+}
+
+static u32 switchdev_port_fwd_mark_get(struct net_device *dev,
+  struct net_device *group_dev)
+{
+   struct net_device *lower_dev;
+   struct list_head *iter;
+
+   netdev_for_each_lower_dev(group_dev, lower_dev, iter) {
+   if (lower_dev == dev)
+   continue;
+   if (switchdev_port_same_parent_id(dev, lower_dev))
+   return lower_dev->offload_fwd_mark;
+   return switchdev_port_fwd_mark_get(dev, lower_dev);
+   }
+
+   return dev->ifindex;
+}
+
+static void switchdev_port_fwd_mark_reset(struct net_device *group_dev,
+ u32 old_mark, u32 *reset_mark)
+{
+   struct net_device *lower_dev;
+   struct list_head *iter;
+
+   netdev_for_each_lower_dev(group_dev, lower_dev, iter) {
+   if (lower_dev->offload_fwd_mark == old_mark) {
+   if (!*reset_mark)
+   *reset_mark = lower_dev->ifindex;
+   lower_dev->offload_fwd_mark = *reset_mark;
+   }
+   switchdev_port_fwd_mark_reset(lower_dev, old_mark, reset_mark);
+   }
+}
+
+/**
+ * switchdev_port_fwd_mark_set - Set port offload forwarding mark
+ *
+ * @dev: port device
+ * @group_dev: containing device
+ * @joining: true if dev is joining group; false if leaving group
+ *
+ * An ungrouped port's offload mark is just its ifindex.  A grouped
+ * port's (member of a bridge, for example) offload mark is the ifindex
+ * of one of the ports in the group with the same parent (switch) ID.
+ * Ports on the same device in the same group will have the same mark.
+ *
+ * Example:
+ *
+ * br0 ifindex=9
+ *   sw1p1 ifindex=2   mark=2
+ *   sw1p2 ifindex=3   mark=2
+ *   sw2p1 ifindex=4   mark=5
+ *   sw2p2 ifindex=5   mark=5
+ *
+ * If sw2p2 leaves the bridge, we'll have:
+ *
+ * br0 ifindex=9
+ *   sw1p1 ifindex=2   mark=2
+ *   sw1p2 ifindex=3   mark=2
+ *   sw2p1 ifindex=4   mark=4
+ * sw2p2

Re: [PATCH 1/3] fixed_phy: handle link-down case

2015-07-18 Thread Stas Sergeev

18.07.2015 05:29, Florian Fainelli пишет:

Le 07/17/15 16:53, Stas Sergeev a écrit :

18.07.2015 02:35, Florian Fainelli пишет:

On 17/07/15 16:24, Stas Sergeev wrote:

18.07.2015 01:01, Florian Fainelli пишет:

On 17/07/15 13:03, Stas Sergeev wrote:

17.07.2015 21:50, Florian Fainelli пишет:

On 17/07/15 04:26, Stas Sergeev wrote:

17.07.2015 02:25, Florian Fainelli пишет:

On 16/07/15 07:50, Stas Sergeev wrote:

Currently fixed_phy driver recognizes only the link-up state.
This simple patch adds an implementation of link-down state.
It fixes the status registers when link is down, and also allows
to register the fixed-phy with link down without specifying the
speed.

This patch still breaks my setups here, e.g:
drivers/net/dsa/bcm_sf2.c,
but I will look into it.

Do we really need this for now for your two other patches to work
properly, or is it just nicer to have?

Yes, absolutely.
Otherwise registering fixed phy will return -EINVAL
because of the missing link speed (even though the link
is down).

Ok, I see the problem that you have now. Arguably you could say that
according to the fixed-link binding, speed needs to be specified and
the
code correctly errors out with such an error if you do not specify
it. I

Aren't you missing the fact that .link=0?
I think what you say is true only for the link-up case, no?
.speed==0 is valid for link-down IMHO: no link - zero speed.

Pardon me being very dense and stupid here, but your problem is that
the
"speed" parameter is not specified in your DT,

Not even a fixed-link at all, since the latest patches.
I removed fixed-link defs from my DT.

Hummm, okay, so you just have the inband-status property and that's it,
not even a fixed-link node anymore, right? How does
mvneta_fixed_link_update() work then since it needs a fixed PHY to be
registered?

You can see it from my patch:
---

+err = of_property_read_string(np, "managed", &managed);
+if (err == 0) {
+if (strcmp(managed, "in-band-status") == 0) {
+/* status is zeroed, namely its .link member */
+phy = fixed_phy_register(PHY_POLL, &status, np);
+return IS_ERR(phy) ? PTR_ERR(phy) : 0;
+}
+}

---
which is the hunk added to the of_phy_register_fixed_link().
So in that case we register fixed-phy, but do not parse the fixed-link.

Ok, I missed that part. Could not you just override everything that is
needed here to get past the point where you register your fixed PHY even
with link = 0, this will be discarded anyway once you start in-band
negotiation.

Maybe my English is bad, but I have problems understanding
some of your senteneces. What do you mean?
If you meant to re-use the existing registration code instead
of adding a new hunk, please note that there is no fixed-link
node at all, so we do not even enter the parsing code block.
As such, there is nothing to override.

I will work on something anyway. 

Thanks, hope to hear from you soon.
This stream of regressions is disturbing. :)
Should finally be fixed for real.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net 2/3] sched: cls_flower: fix panic on filter replace

2015-07-18 Thread Jiri Pirko
Fri, Jul 17, 2015 at 10:38:44PM CEST, dan...@iogearbox.net wrote:
>The following test case causes a NULL pointer dereference in cls_flower:
>
>  tc filter add dev foo parent 1: flower eth_type ipv4 action ok flowid 1:1
>  tc filter replace dev foo parent 1: pref 49152 handle 0x1 \
>flower eth_type ipv6 action ok flowid 1:1
>
>The problem is that commit 77b9900ef53a ("tc: introduce Flower classifier")
>accidentally swapped the arguments of list_replace_rcu(), the old
>element needs to be the first argument and the new element the second.
>
>Fixes: 77b9900ef53a ("tc: introduce Flower classifier")
>Signed-off-by: Daniel Borkmann 


Acked-by: Jiri Pirko 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mvneta.c, MVNETA_TXQ_UPDATE_REG, shut down issue

2015-07-18 Thread Thomas Petazzoni
Constantine,

On Sat, 18 Jul 2015 21:48:59 +0300, Constantine Shulyupin wrote:

> I try to run vanilla kernel 4.2.0 on custom Armada-370 board, based on
> DB-88F6710-BP.
> 
> The problem is that the board shuts down after writing to
> MVNETA_TXQ_UPDATE_REG in this code:

What do you mean by shut down? Physically turns off?

I hardly see how writing to this register can turn off the board. Can
you try on a non-custom A370 board?

Thanks,

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


mvneta.c, MVNETA_TXQ_UPDATE_REG, shut down issue

2015-07-18 Thread Constantine Shulyupin
Hi Rami and Thomas,

I try to run vanilla kernel 4.2.0 on custom Armada-370 board, based on
DB-88F6710-BP.

The problem is that the board shuts down after writing to
MVNETA_TXQ_UPDATE_REG in this code:

static void mvneta_txq_pend_desc_add(struct mvneta_port *pp,
 struct mvneta_tx_queue *txq,
 int pend_desc)
{
u32 val;

/* Only 255 descriptors can be added at once ; Assume caller
 * process TX desriptors in quanta less than 256
 */
val = pend_desc;
mvreg_write(pp, MVNETA_TXQ_UPDATE_REG(txq->id), val);
}

BTW, pend_desc == 1

What can you suggest?

Thanks
Constantine
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 RFC net-next] net/vxlan: Fix kernel unaligned access in __vxlan_find_mac

2015-07-18 Thread Joe Perches
On Fri, 2015-07-17 at 18:18 -0700, David Miller wrote:
> From: Joe Perches  Date: Fri, 17 Jul 2015 16:07:02 -0700
> > On Fri, 2015-07-17 at 22:00 +0200, Sowmini Varadhan wrote:
> >> __vxlan_find_mac invokes ether_addr_equal on the eth_addr field,
> >> which triggers unaligned access messages, so rearrange vxlan_fdb
> >> to avoid this in the most non-intrusive way.

> > What arch does this?

> Sparc, MIPS, etc.

It seems that this code has had unaligned accesses
on this field even before compare_ether_addr was
converted to ether_addr_equal.

Is sparc64 the only one that emits / ratelimits that
unaligned access message?  I looked a little, but I
didn't find a fixup message when MIPS does unaligned
accesses.  Are all the other arches silent when
fixing up unaligned accesses?  Maye adding a generic
debug only ratelimited message might help remove
more of these.  As it's not fatal, naybe the sparc64
message should be KERN_DEBUG/pr_debug.

---

 arch/sparc/kernel/unaligned_64.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/sparc/kernel/unaligned_64.c b/arch/sparc/kernel/unaligned_64.c
index 62098a8..6b7aeb7 100644
--- a/arch/sparc/kernel/unaligned_64.c
+++ b/arch/sparc/kernel/unaligned_64.c
@@ -294,13 +294,9 @@ static void kernel_mna_trap_fault(int fixup_tstate_asi)
 
 static void log_unaligned(struct pt_regs *regs)
 {
-   static DEFINE_RATELIMIT_STATE(ratelimit, 5 * HZ, 5);
+   pr_debug_ratelimited("Kernel unaligned access at TPC[%lx] %pS\n",
+regs->tpc, (void *)regs->tpc);
 
-   if (__ratelimit(&ratelimit)) {
-   printk("Kernel unaligned access at TPC[%lx] %pS\n",
-  regs->tpc, (void *) regs->tpc);
-   }
-}
 
 asmlinkage void kernel_unaligned_trap(struct pt_regs *regs, unsigned int insn)
 {


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 -next] net: fib: use fib result when zero-length prefix aliases exist

2015-07-18 Thread Florian Westphal
Julian Anastasov  wrote:

[ Dave, please toss my patch, its either v3 or something else entirely ]

>   In fact, TOS should be matched just like in
> fib_table_lookup but it is not.
> 
> > This changes fib_select_default to not change the FIB chosen result EXCEPT
> > if this nexthop appears to be unreachable.
> > 
> > fib_detect_death() logic is reversed -- we consider a nexthop 'dead' only
> > if it has a neigh entry in unreachable state.
> 
>   The only difference I see is that NUD_NODE is
> declared by fib_nud_is_unreach() as reachable. May be
> it is better, for example, new route in NUD_NONE state
> will be tried for a while, until its state is determined.

fib_detect_death considers neigh_lookup() returning NULL as unreach.

I don't think its good idea and should rather only skip if NUD is
failed or incomplete.

> > +   if (likely(!fib_nud_is_unreach(res->fi)))
> > +   return;
> 
>   here first is unreachable...
> > +   /* attempt to pick another nexthop */
> > hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
> > struct fib_info *next_fi = fa->fa_info;
> >  
> > @@ -1223,38 +1223,30 @@ void fib_select_default(struct fib_result *res)
> > next_fi->fib_nh[0].nh_scope != RT_SCOPE_LINK)
> > continue;
> >  
> > +   order++;
> > +
> > +   if (next_fi == res->fi) /* already tested, not reachable */
> > +   continue;
> > +
> > fib_alias_accessed(fa);
> >  
> > -   if (!fi) {
> > -   if (next_fi != res->fi)
> > +   if (fib_nud_is_unreach(next_fi))
> 
>   all others are unreachable too...
> 
> > +   continue;
> > +
> > +   /* try to round-robin among all fa_aliases in case
> > +* res->fi nexthop is unreachable.
> 
>   round-robin only among reachables? But original algorithm
> can round-robin among unreachables. State will not change
> without trying some packets to unreachable dests.

Hmm... thanks for pointing that out.

But thats confusing.  If thats true, why does fib_detect_death
even exist?  If we skip unreachables, they cannot become
reachable again...?

> > +   fi = next_fi;
> > +   fi_idx = order;
> > +   if (order > tb->tb_default)
> > break;
> 
>   1=unreac, 2=reach, 3=tb_default(reach), 4=reach.
> We like 2 (fi == NULL), why we continue after 2, skip
> default 3 (because order > tb->tb_default is false) and
> finally select 4 (order=4 > tb->tb_default=3)?

Hmm.  Let me ask different question.
What is the supposed behaviour in the case you describe?

Removing order > default
test means we'll always pick first reachable in the list, no?

I don't think thats bad, but, how/why is is that 'better'?

> > +   tb->tb_default = fi_idx;
> 
>   And we do not round-robin if all look unreachable?

Right.  One possible solution might be this patch, but alter
fib_nud_is_unreach() to also check timer_pending(n->timer), i.e.

if (state & (NUD_INCOMPLETE | NUD_FAILED))
return timer_was_pending ? UNREACHABLE : REACHABLE;

This way we would re-try gw not only if neigh doesn't exist but
also if no retry is pending.

What do you think?

> As for fib_select_default,
> may be only change like below is needed. It additionally
> restricts the alternative routes to same prefix, same TOS.
> The TOS restriction was missing from long time but the
> prefix check was lost recently when fa_head was changed
> to contain aliases from different prefixes.

>  /* Must be invoked inside of an RCU protected region.  */
> -void fib_select_default(struct fib_result *res)
> +void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
>  {
>   struct fib_info *fi = NULL, *last_resort = NULL;
>   struct hlist_head *fa_head = res->fa_head;
>   struct fib_table *tb = res->table;
> + u8 slen = 32 - res->prefixlen;
>   int order = -1, last_idx = -1;
>   struct fib_alias *fa;
>  
>   hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
>   struct fib_info *next_fi = fa->fa_info;
>  
> + if (fa->fa_slen != slen)
> + continue;
>   if (next_fi->fib_scope != res->scope ||
>   fa->fa_type != RTN_UNICAST)
>   continue;
>  
> + if (fa->tb_id != tb->tb_id)
> + continue;
> + if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos)
> + continue;

Hmm, this means we won't consider alternative if tos doesn't match,
except 0.

So with
ip route add tos 0x00 via 192.168.1.1
ip route add tos 0x04 via 192.168.1.2
ip route add tos 0x08 via 192.168.1.3
ip route add tos 0x0c via 192.168.1.4
ip route add tos 0x10 via 192.168.1.5

and ping -Q 1.2.3.4

if 1.5 is unreachable, we pick 1.1 (tos 0x0).

Others are not attempted.
Do you consider that 'correct'?  I don't see why
we can use tos 0x00 bu

Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait

2015-07-18 Thread Andrew Lunn
On Sat, Jul 18, 2015 at 11:23:19AM -0400, Vivien Didelot wrote:
> Hi all,
> 
> - On Jul 18, 2015, at 10:58 AM, Andrew Lunn and...@lunn.ch wrote:
> 
> >> Good point. The timeout is most definitely quite large and for sure on
> >> the safe side. It might make sense to add some statistics gathering to
> >> see how long the maximum observed delay actually is.
> > 
> > Hi All
> > 
> > Statistics are something which can be used a lot, i bursts and
> > interactivily. ATU, VTU etc, are much less often used. So different
> > delays might be justified.
> > 
> > I agree about doing some statistics gathering to determine actual
> > delays needed.
> > 
> >Andrew
> 
> What do you think about something like this?

Hi Vivien

Lets get some actually statistics first. I would suggest for testing
you make _mv88e6xxx_wait() a busy loop and time how long it actually
takes for different busy bits to clear. We should also test it on
different families.

  Andrew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: reproducable panic eviction work queue

2015-07-18 Thread Nikolay Aleksandrov
On 07/18/2015 05:28 PM, Johan Schuijt wrote:
> Thx for your looking into this!
> 
>>
>> Thank you for the report, I will try to reproduce this locally
>> Could you please post the full crash log ?
> 
> Of course, please see attached file.
> 
>> Also could you test
>> with a clean current kernel from Linus' tree or Dave's -net ?
> 
> Will do.
> 
>> These are available at:
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
>> respectively.
>>
>> One last question how many IRQs do you pin i.e. how many cores
>> do you actively use for receive ?
> 
> This varies a bit across our systems, but we’ve managed to reproduce this 
> with IRQs pinned on as many as 2,4,8 or 20 cores.
> 
> I won’t have access to our test-setup till Monday again, so I’ll be testing 3 
> scenario’s then:
> - Your patch
-
> - Linux tree
> - Dave’s -net tree
Just one of these two would be enough. I couldn't reproduce it here but
I don't have as many machines to test right now and had to improvise with VMs. 
:-)

> 
> I’ll make sure to keep you posted on all the results then. We have a kernel 
> dump of the panic, so if you need me to extract any data from there just let 
> me know! (Some instructions might be needed)
> 
> - Johan
> 
Great, thank you!

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: reproducable panic eviction work queue

2015-07-18 Thread Johan Schuijt
With attachment this time, also not sure wether this is what you were referring 
to, so let me know if anything else needed!

- Johan


> On 18 Jul 2015, at 17:28, Johan Schuijt-Li  wrote:
> 
> Thx for your looking into this!
> 
>> 
>> Thank you for the report, I will try to reproduce this locally
>> Could you please post the full crash log ?
> 
> Of course, please see attached file.
> 
>> Also could you test
>> with a clean current kernel from Linus' tree or Dave's -net ?
> 
> Will do.
> 
>> These are available at:
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
>> respectively.
>> 
>> One last question how many IRQs do you pin i.e. how many cores
>> do you actively use for receive ?
> 
> This varies a bit across our systems, but we’ve managed to reproduce this 
> with IRQs pinned on as many as 2,4,8 or 20 cores.
> 
> I won’t have access to our test-setup till Monday again, so I’ll be testing 3 
> scenario’s then:
> - Your patch
> - Linux tree
> - Dave’s -net tree
> 
> I’ll make sure to keep you posted on all the results then. We have a kernel 
> dump of the panic, so if you need me to extract any data from there just let 
> me know! (Some instructions might be needed)
> 
> - Johan

[28732.285611] general protection fault:  [#1] SMP 
[28732.285665] Modules linked in: vhost_net vhost macvtap macvlan act_police 
cls_u32 sch_ingress cls_fw sch_sfq sch_htb nf_conntrack_ipv6 nf_defrag_ipv6 
xt_conntrack xt_physdev br_netfilter ebt_arp ebt_ip6 ebt_ip ebtable_nat tun 
rpcsec_gss_krb5 nfsv4 dns_resolver ebtable_filter ebtables ip6table_raw 
ip6table_mangle ip6table_filter ip6_tables nfsd auth_rpcgss oid_registry 
nfs_acl nfs lockd grace fscache sunrpc bridge 8021q garp mrp stp llc bonding 
xt_CT xt_DSCP iptable_mangle ipt_REJECT nf_reject_ipv4 xt_pkttype xt_tcpudp 
nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_comment nf_conntrack_ipv4 
nf_defrag_ipv4 xt_state nf_conntrack xt_owner iptable_filter iptable_raw 
ip_tables x_tables loop joydev hid_generic usbhid hid x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel kvm ttm crct10dif_pclmul crc32_pclmul
[28732.286421]  ghash_clmulni_intel aesni_intel drm_kms_helper drm i2c_algo_bit 
aes_x86_64 lrw gf128mul dcdbas ipmi_si i2c_core evdev glue_helper ablk_helper 
tpm_tis mei_me tpm ehci_pci ehci_hcd mei cryptd usbcore iTCO_wdt 
iTCO_vendor_support ipmi_msghandler lpc_ich mfd_core wmi pcspkr usb_common 
shpchp sb_edac edac_core acpi_power_meter acpi_pad button processor thermal_sys 
ext4 crc16 mbcache jbd2 dm_mod sg sd_mod ahci libahci bnx2x libata ptp pps_core 
mdio crc32c_generic megaraid_sas crc32c_intel scsi_mod libcrc32c
[28732.286955] CPU: 9 PID: 56 Comm: kworker/9:0 Not tainted 3.18.7-transip-2.0 
#1
[28732.287023] Hardware name: Dell Inc. PowerEdge M620/0VHRN7, BIOS 2.5.2 
02/03/2015
[28732.287096] Workqueue: events inet_frag_worker
[28732.287139] task: 885f3d9fc210 ti: 885f3da0 task.ti: 
885f3da0
[28732.287205] RIP: 0010:[]  [] 
inet_evict_bucket+0x119/0x180
[28732.287278] RSP: 0018:885f3da03d58  EFLAGS: 00010292
[28732.287318] RAX: 885f3da03d08 RBX: dead001000a8 RCX: 885f3da03d08
[28732.287362] RDX: 0006 RSI: 885f3da03ce8 RDI: dead001000a8
[28732.287406] RBP: 0002 R08: 0286 R09: 88302f401640
[28732.287450] R10: 8000 R11: 88602ec0c138 R12: 81a8d8c0
[28732.287494] R13: 885f3da03d70 R14:  R15: 881d6efe1a00
[28732.287538] FS:  () GS:88602f28() 
knlGS:
[28732.287606] CS:  0010 DS:  ES:  CR0: 80050033
[28732.287647] CR2: 00b11000 CR3: 004f05b24000 CR4: 000427e0
[28732.287691] Stack:
[28732.287722]  81a905e0 81a905e8 814f4599 
881d6efe1a58
[28732.287807]  0246 002e 81a8d8c0 
81a918c0
[28732.287891]  02d3 0019 0240 
8148075a
[28732.287975] Call Trace:
[28732.288013]  [] ? _raw_spin_unlock_irqrestore+0x9/0x10
[28732.288056]  [] ? inet_frag_worker+0x5a/0x250
[28732.288103]  [] ? process_one_work+0x149/0x3f0
[28732.288146]  [] ? worker_thread+0x63/0x490
[28732.288187]  [] ? rescuer_thread+0x290/0x290
[28732.288229]  [] ? kthread+0xce/0xf0
[28732.288269]  [] ? kthread_create_on_node+0x180/0x180
[28732.288313]  [] ? ret_from_fork+0x7c/0xb0
[28732.288353]  [] ? kthread_create_on_node+0x180/0x180
[28732.288396] Code: 8b 04 24 66 83 40 08 01 48 8b 7c 24 18 48 85 ff 74 2a 48 
83 ef 58 75 13 eb 22 0f 1f 84 00 00 00 00 00 48 83 eb 58 48 89 df 74 11 <48> 8b 
5f 58 41 ff 94 24 70 40 00 00 48 85 db 75 e6 48 83 c4 28 
[28732.288827] RIP  [] inet_evict_bucket+0x119/0x180
[28732.288873]  RSP 


Re: reproducable panic eviction work queue

2015-07-18 Thread Johan Schuijt
Thx for your looking into this!

> 
> Thank you for the report, I will try to reproduce this locally
> Could you please post the full crash log ?

Of course, please see attached file.

> Also could you test
> with a clean current kernel from Linus' tree or Dave's -net ?

Will do.

> These are available at:
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> respectively.
> 
> One last question how many IRQs do you pin i.e. how many cores
> do you actively use for receive ?

This varies a bit across our systems, but we’ve managed to reproduce this with 
IRQs pinned on as many as 2,4,8 or 20 cores.

I won’t have access to our test-setup till Monday again, so I’ll be testing 3 
scenario’s then:
- Your patch
- Linux tree
- Dave’s -net tree

I’ll make sure to keep you posted on all the results then. We have a kernel 
dump of the panic, so if you need me to extract any data from there just let me 
know! (Some instructions might be needed)

- 
JohanN�r��yb�X��ǧv�^�)޺{.n�+���z�^�)w*jg����ݢj/���z�ޖ��2�ޙ&�)ߡ�a�����G���h��j:+v���w��٥

Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait

2015-07-18 Thread Vivien Didelot
Hi all,

- On Jul 18, 2015, at 10:58 AM, Andrew Lunn and...@lunn.ch wrote:

>> Good point. The timeout is most definitely quite large and for sure on
>> the safe side. It might make sense to add some statistics gathering to
>> see how long the maximum observed delay actually is.
> 
> Hi All
> 
> Statistics are something which can be used a lot, i bursts and
> interactivily. ATU, VTU etc, are much less often used. So different
> delays might be justified.
> 
> I agree about doing some statistics gathering to determine actual
> delays needed.
> 
>Andrew

What do you think about something like this?

Thanks,
-v

diff --git a/drivers/net/dsa/mv88e6xxx.c b/drivers/net/dsa/mv88e6xxx.c
index 4f3701f..6471807 100644
--- a/drivers/net/dsa/mv88e6xxx.c
+++ b/drivers/net/dsa/mv88e6xxx.c
@@ -563,9 +563,10 @@ static bool mv88e6xxx_6352_family(struct dsa_switch *ds)
 
 /* Must be called with SMI lock held */
 static int _mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset,
-  u16 mask)
+  u16 mask, unsigned int msecs)
 {
unsigned long timeout = jiffies + HZ / 10;
+   unsigned long usecs = msecs * 1000;
 
while (time_before(jiffies, timeout)) {
int ret;
@@ -576,7 +577,8 @@ static int _mv88e6xxx_wait(struct dsa_switch *ds, int reg, 
int offset,
if (!(ret & mask))
return 0;
 
-   usleep_range(1000, 2000);
+   if (usecs)
+   usleep_range(usecs, usecs + 1000);
}
return -ETIMEDOUT;
 }
@@ -585,7 +587,7 @@ static int _mv88e6xxx_wait(struct dsa_switch *ds, int reg, 
int offset,
 static int _mv88e6xxx_stats_wait(struct dsa_switch *ds)
 {
return _mv88e6xxx_wait(ds, REG_GLOBAL, GLOBAL_STATS_OP,
-  GLOBAL_STATS_OP_BUSY);
+  GLOBAL_STATS_OP_BUSY, 0);
 }
 
 /* Must be called with SMI mutex held */
@@ -872,13 +874,14 @@ error:
 }
 #endif /* CONFIG_NET_DSA_HWMON */
 
-static int mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset, u16 mask)
+static int mv88e6xxx_wait(struct dsa_switch *ds, int reg, int offset, u16 mask,
+ unsigned int msecs)
 {
struct mv88e6xxx_priv_state *ps = ds_to_priv(ds);
int ret;
 
mutex_lock(&ps->smi_mutex);
-   ret = _mv88e6xxx_wait(ds, reg, offset, mask);
+   ret = _mv88e6xxx_wait(ds, reg, offset, mask, msecs);
mutex_unlock(&ps->smi_mutex);
 
return ret;
@@ -887,33 +890,33 @@ static int mv88e6xxx_wait(struct dsa_switch *ds, int reg, 
int offset, u16 mask)
 static int _mv88e6xxx_phy_wait(struct dsa_switch *ds)
 {
return _mv88e6xxx_wait(ds, REG_GLOBAL2, GLOBAL2_SMI_OP,
-  GLOBAL2_SMI_OP_BUSY);
+  GLOBAL2_SMI_OP_BUSY, 1);
 }
 
 int mv88e6xxx_eeprom_load_wait(struct dsa_switch *ds)
 {
return mv88e6xxx_wait(ds, REG_GLOBAL2, GLOBAL2_EEPROM_OP,
- GLOBAL2_EEPROM_OP_LOAD);
+ GLOBAL2_EEPROM_OP_LOAD, 1);
 }
 
 int mv88e6xxx_eeprom_busy_wait(struct dsa_switch *ds)
 {
return mv88e6xxx_wait(ds, REG_GLOBAL2, GLOBAL2_EEPROM_OP,
- GLOBAL2_EEPROM_OP_BUSY);
+ GLOBAL2_EEPROM_OP_BUSY, 1);
 }
 
 /* Must be called with SMI lock held */
 static int _mv88e6xxx_atu_wait(struct dsa_switch *ds)
 {
return _mv88e6xxx_wait(ds, REG_GLOBAL, GLOBAL_ATU_OP,
-  GLOBAL_ATU_OP_BUSY);
+  GLOBAL_ATU_OP_BUSY, 1);
 }
 
 /* Must be called with SMI lock held */
 static int _mv88e6xxx_scratch_wait(struct dsa_switch *ds)
 {
return _mv88e6xxx_wait(ds, REG_GLOBAL2, GLOBAL2_SCRATCH_MISC,
-  GLOBAL2_SCRATCH_BUSY);
+  GLOBAL2_SCRATCH_BUSY, 1);
 }
 
 /* Must be called with SMI mutex held */
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] net: dsa: mv88e6xxx: sleep in _mv88e6xxx_stats_wait

2015-07-18 Thread Andrew Lunn
> Good point. The timeout is most definitely quite large and for sure on
> the safe side. It might make sense to add some statistics gathering to
> see how long the maximum observed delay actually is.

Hi All

Statistics are something which can be used a lot, i bursts and
interactivily. ATU, VTU etc, are much less often used. So different
delays might be justified.

I agree about doing some statistics gathering to determine actual
delays needed.

   Andrew
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] xfrm: Fix a typo

2015-07-18 Thread Jakub Wilk
Signed-off-by: Jakub Wilk 
---
 net/xfrm/xfrm_user.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index bd16c6c..0cebf1f 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -2048,7 +2048,7 @@ static int xfrm_add_pol_expire(struct sk_buff *skb, 
struct nlmsghdr *nlh,
xfrm_audit_policy_delete(xp, 1, true);
} else {
// reset the timers here?
-   WARN(1, "Dont know what to do with soft policy expire\n");
+   WARN(1, "Don't know what to do with soft policy expire\n");
}
km_policy_expired(xp, p->dir, up->hard, nlh->nlmsg_pid);
 
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: reproducable panic eviction work queue

2015-07-18 Thread Nikolay Aleksandrov
On 07/18/2015 12:02 PM, Nikolay Aleksandrov wrote:
> On 07/18/2015 11:01 AM, Johan Schuijt wrote:
>> Yes, we already found these and are included in our kernel, but even with 
>> these patches we still receive the panic.
>>
>> - Johan
>>
>>
>>> On 18 Jul 2015, at 10:56, Eric Dumazet  wrote:
>>>
>>> On Fri, 2015-07-17 at 21:18 +, Johan Schuijt wrote:
 Hey guys, 


 We’re currently running into a reproducible panic in the eviction work
 queue code when we pin al our eth* IRQ to different CPU cores (in
 order to scale our networking performance for our virtual servers).
 This only occurs in kernels >= 3.17 and is a result of the following
 change:
 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-3.18.y&id=b13d3cbfb8e8a8f53930af67d1ebf05149f32c24


 The race/panic we see seems to be the same as, or similar to:
 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-3.18.y&id=65ba1f1ec0eff1c25933468e1d238201c0c2cb29


 We can confirm that this is directly exposed by the IRQ pinning since
 disabling this stops us from being able to reproduce this case :)


 How te reproduce: in our test-setup we have 4 machines generating UDP
 packets which are send to the vulnerable host. These all have a MTU of
 100 (for test purposes) and send UDP packets of a size of 256 bytes.
 Within half an hour you will see the following panic:


 crash> bt
 PID: 56 TASK: 885f3d9fc210  CPU: 9   COMMAND: "kworker/9:0"
 #0 [885f3da03b60] machine_kexec at 8104a1f7
 #1 [885f3da03bb0] crash_kexec at 810db187
 #2 [885f3da03c80] oops_end at 81015140
 #3 [885f3da03ca0] general_protection at 814f6c88
[exception RIP: inet_evict_bucket+281]
RIP: 81480699  RSP: 885f3da03d58  RFLAGS: 00010292
RAX: 885f3da03d08  RBX: dead001000a8  RCX:
 885f3da03d08
RDX: 0006  RSI: 885f3da03ce8  RDI:
 dead001000a8
RBP: 0002   R8: 0286   R9:
 88302f401640
R10: 8000  R11: 88602ec0c138  R12:
 81a8d8c0
R13: 885f3da03d70  R14:   R15:
 881d6efe1a00
ORIG_RAX:   CS: 0010  SS: 0018
 #4 [885f3da03db0] inet_frag_worker at 8148075a
 #5 [885f3da03e10] process_one_work at 8107be19
 #6 [885f3da03e60] worker_thread at 8107c6e3
 #7 [885f3da03ed0] kthread at 8108103e
 #8 [885f3da03f50] ret_from_fork at 814f4d7c


 We would love to receive your input on this matter.


 Thx in advance,


 - Johan
>>>
>>> Check commits 65ba1f1ec0eff1c25933468e1d238201c0c2cb29 &
>>> d70127e8a942364de8dd140fe73893efda363293
>>>
>>> Also please send your mails in text format, not html, and CC netdev ( I
>>> did here)
>>>


>>>
>>>
>>
>> N�r��y���b�X��ǧv�^�)޺{.n�+���z�^�)���w*jg����ݢj/���z�ޖ��2�ޙ���&�)ߡ�a�����G���h��j:+v���w�٥
>>
> 
> Thank you for the report, I will try to reproduce this locally
> Could you please post the full crash log ? Also could you test
> with a clean current kernel from Linus' tree or Dave's -net ?
> These are available at:
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> respectively.
> 
> One last question how many IRQs do you pin i.e. how many cores
> do you actively use for receive ?
> 

Flags seems to be modified while still linked and we may get the
following (theoretical) situation:
CPU 1   CPU 2
inet_frag_evictor (wait for chainlock)  spin_lock(chainlock)
unlock(chainlock)
get lock, set EVICT flag, hlist_del etc.
change flags again while
qp is in the evict list

So could you please try the following patch which sets the flag while
holding the chain lock:


diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 5e346a082e5f..2521ed9c1b52 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -354,8 +354,8 @@ static struct inet_frag_queue *inet_frag_intern(struct 
netns_frags *nf,
hlist_for_each_entry(qp, &hb->chain, list) {
if (qp->net == nf && f->match(qp, arg)) {
atomic_inc(&qp->refcnt);
-   spin_unlock(&hb->chain_lock);
qp_in->flags |= INET_FRAG_COMPLETE;
+   spin_unlock(&hb->chain_lock);
inet_frag_put(qp_in, f);
return qp;
}
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of 

[net-next]r8169: Disable advertise full duplex on chip RTL_GIGA_MAC_VER_09. Disable advertise full duplex on chip id RTL_GIGA_MAC_VER_09 wich support random full duplex.(very sad) On branch master

2015-07-18 Thread Corcodel Marian

 Changes to be committed:
modified:   drivers/net/ethernet/realtek/r8169.c

Signed-off-by: Corcodel Marian 
---
 drivers/net/ethernet/realtek/r8169.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 714af89..a373679 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1918,12 +1918,23 @@ static int rtl8169_set_speed_xmi(struct net_device *dev,
 
 		if (adv & ADVERTISED_10baseT_Half)
 			auto_nego |= ADVERTISE_10HALF;
-		if (adv & ADVERTISED_10baseT_Full)
-			auto_nego |= ADVERTISE_10FULL;
+		if (!tp->mac_version == RTL_GIGA_MAC_VER_09) {
+		 if (adv & ADVERTISED_10baseT_Full)
+			 auto_nego |= ADVERTISE_10FULL;
+		} else {
+		   netif_info(tp, link, dev,			   	 
+		 "PHY does not support 10Mbps full duplex\n");
+		  } 
 		if (adv & ADVERTISED_100baseT_Half)
 			auto_nego |= ADVERTISE_100HALF;
-		if (adv & ADVERTISED_100baseT_Full)
-			auto_nego |= ADVERTISE_100FULL;
+		if (!tp->mac_version == RTL_GIGA_MAC_VER_09) {
+		 if (adv & ADVERTISED_100baseT_Full)
+			 auto_nego |= ADVERTISE_100FULL;
+	} else {
+		   netif_info(tp, link, dev,
+		 "PHY does not support 100Mbps full duplex\n");
+	  } 
+
 
 		auto_nego |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;
 


[PATCH net-next]r8169:Disable advertise full duplex on chip id RTL_GIGA_MAC_VER_09. Disable advertise full duplex on chip id RTL_GIGA_MAC_VER_09 wich support random full duplex.(very sad) On branch

2015-07-18 Thread Corcodel Marian

 Changes to be committed:
modified:   drivers/net/ethernet/realtek/r8169.c

Signed-off-by: Corcodel Marian 
---
 drivers/net/ethernet/realtek/r8169.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 714af89..a373679 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1918,12 +1918,23 @@ static int rtl8169_set_speed_xmii(struct net_device *dev,
 
 		if (adv & ADVERTISED_10baseT_Half)
 			auto_nego |= ADVERTISE_10HALF;
-		if (adv & ADVERTISED_10baseT_Full)
-			auto_nego |= ADVERTISE_10FULL;
+		if (!tp->mac_version == RTL_GIGA_MAC_VER_09) {
+		 if (adv & ADVERTISED_10baseT_Full)
+			 auto_nego |= ADVERTISE_10FULL;
+		} else {
+		   netif_info(tp, link, dev,			   	 
+		 "PHY does not support 10Mbps full duplex\n");
+		  } 
 		if (adv & ADVERTISED_100baseT_Half)
 			auto_nego |= ADVERTISE_100HALF;
-		if (adv & ADVERTISED_100baseT_Full)
-			auto_nego |= ADVERTISE_100FULL;
+		if (!tp->mac_version == RTL_GIGA_MAC_VER_09) {
+		 if (adv & ADVERTISED_100baseT_Full)
+			 auto_nego |= ADVERTISE_100FULL;
+	} else {
+		   netif_info(tp, link, dev,
+		 "PHY does not support 100Mbps full duplex\n");
+	  } 
+
 
 		auto_nego |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;
 


Re: reproducable panic eviction work queue

2015-07-18 Thread Nikolay Aleksandrov
On 07/18/2015 11:01 AM, Johan Schuijt wrote:
> Yes, we already found these and are included in our kernel, but even with 
> these patches we still receive the panic.
> 
> - Johan
> 
> 
>> On 18 Jul 2015, at 10:56, Eric Dumazet  wrote:
>>
>> On Fri, 2015-07-17 at 21:18 +, Johan Schuijt wrote:
>>> Hey guys, 
>>>
>>>
>>> We’re currently running into a reproducible panic in the eviction work
>>> queue code when we pin al our eth* IRQ to different CPU cores (in
>>> order to scale our networking performance for our virtual servers).
>>> This only occurs in kernels >= 3.17 and is a result of the following
>>> change:
>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-3.18.y&id=b13d3cbfb8e8a8f53930af67d1ebf05149f32c24
>>>
>>>
>>> The race/panic we see seems to be the same as, or similar to:
>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-3.18.y&id=65ba1f1ec0eff1c25933468e1d238201c0c2cb29
>>>
>>>
>>> We can confirm that this is directly exposed by the IRQ pinning since
>>> disabling this stops us from being able to reproduce this case :)
>>>
>>>
>>> How te reproduce: in our test-setup we have 4 machines generating UDP
>>> packets which are send to the vulnerable host. These all have a MTU of
>>> 100 (for test purposes) and send UDP packets of a size of 256 bytes.
>>> Within half an hour you will see the following panic:
>>>
>>>
>>> crash> bt
>>> PID: 56 TASK: 885f3d9fc210  CPU: 9   COMMAND: "kworker/9:0"
>>> #0 [885f3da03b60] machine_kexec at 8104a1f7
>>> #1 [885f3da03bb0] crash_kexec at 810db187
>>> #2 [885f3da03c80] oops_end at 81015140
>>> #3 [885f3da03ca0] general_protection at 814f6c88
>>>[exception RIP: inet_evict_bucket+281]
>>>RIP: 81480699  RSP: 885f3da03d58  RFLAGS: 00010292
>>>RAX: 885f3da03d08  RBX: dead001000a8  RCX:
>>> 885f3da03d08
>>>RDX: 0006  RSI: 885f3da03ce8  RDI:
>>> dead001000a8
>>>RBP: 0002   R8: 0286   R9:
>>> 88302f401640
>>>R10: 8000  R11: 88602ec0c138  R12:
>>> 81a8d8c0
>>>R13: 885f3da03d70  R14:   R15:
>>> 881d6efe1a00
>>>ORIG_RAX:   CS: 0010  SS: 0018
>>> #4 [885f3da03db0] inet_frag_worker at 8148075a
>>> #5 [885f3da03e10] process_one_work at 8107be19
>>> #6 [885f3da03e60] worker_thread at 8107c6e3
>>> #7 [885f3da03ed0] kthread at 8108103e
>>> #8 [885f3da03f50] ret_from_fork at 814f4d7c
>>>
>>>
>>> We would love to receive your input on this matter.
>>>
>>>
>>> Thx in advance,
>>>
>>>
>>> - Johan
>>
>> Check commits 65ba1f1ec0eff1c25933468e1d238201c0c2cb29 &
>> d70127e8a942364de8dd140fe73893efda363293
>>
>> Also please send your mails in text format, not html, and CC netdev ( I
>> did here)
>>
>>>
>>>
>>
>>
> 
> N�r��y���b�X��ǧv�^�)޺{.n�+���z�^�)���w*jg����ݢj/���z�ޖ��2�ޙ���&�)ߡ�a�����G���h��j:+v���w�٥
> 

Thank you for the report, I will try to reproduce this locally
Could you please post the full crash log ? Also could you test
with a clean current kernel from Linus' tree or Dave's -net ?
These are available at:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
respectively.

One last question how many IRQs do you pin i.e. how many cores
do you actively use for receive ?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: reproducable panic eviction work queue

2015-07-18 Thread Johan Schuijt
Yes, we already found these and are included in our kernel, but even with these 
patches we still receive the panic.

- Johan


> On 18 Jul 2015, at 10:56, Eric Dumazet  wrote:
> 
> On Fri, 2015-07-17 at 21:18 +, Johan Schuijt wrote:
>> Hey guys, 
>> 
>> 
>> We’re currently running into a reproducible panic in the eviction work
>> queue code when we pin al our eth* IRQ to different CPU cores (in
>> order to scale our networking performance for our virtual servers).
>> This only occurs in kernels >= 3.17 and is a result of the following
>> change:
>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-3.18.y&id=b13d3cbfb8e8a8f53930af67d1ebf05149f32c24
>> 
>> 
>> The race/panic we see seems to be the same as, or similar to:
>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-3.18.y&id=65ba1f1ec0eff1c25933468e1d238201c0c2cb29
>> 
>> 
>> We can confirm that this is directly exposed by the IRQ pinning since
>> disabling this stops us from being able to reproduce this case :)
>> 
>> 
>> How te reproduce: in our test-setup we have 4 machines generating UDP
>> packets which are send to the vulnerable host. These all have a MTU of
>> 100 (for test purposes) and send UDP packets of a size of 256 bytes.
>> Within half an hour you will see the following panic:
>> 
>> 
>> crash> bt
>> PID: 56 TASK: 885f3d9fc210  CPU: 9   COMMAND: "kworker/9:0"
>> #0 [885f3da03b60] machine_kexec at 8104a1f7
>> #1 [885f3da03bb0] crash_kexec at 810db187
>> #2 [885f3da03c80] oops_end at 81015140
>> #3 [885f3da03ca0] general_protection at 814f6c88
>>[exception RIP: inet_evict_bucket+281]
>>RIP: 81480699  RSP: 885f3da03d58  RFLAGS: 00010292
>>RAX: 885f3da03d08  RBX: dead001000a8  RCX:
>> 885f3da03d08
>>RDX: 0006  RSI: 885f3da03ce8  RDI:
>> dead001000a8
>>RBP: 0002   R8: 0286   R9:
>> 88302f401640
>>R10: 8000  R11: 88602ec0c138  R12:
>> 81a8d8c0
>>R13: 885f3da03d70  R14:   R15:
>> 881d6efe1a00
>>ORIG_RAX:   CS: 0010  SS: 0018
>> #4 [885f3da03db0] inet_frag_worker at 8148075a
>> #5 [885f3da03e10] process_one_work at 8107be19
>> #6 [885f3da03e60] worker_thread at 8107c6e3
>> #7 [885f3da03ed0] kthread at 8108103e
>> #8 [885f3da03f50] ret_from_fork at 814f4d7c
>> 
>> 
>> We would love to receive your input on this matter.
>> 
>> 
>> Thx in advance,
>> 
>> 
>> - Johan
> 
> Check commits 65ba1f1ec0eff1c25933468e1d238201c0c2cb29 &
> d70127e8a942364de8dd140fe73893efda363293
> 
> Also please send your mails in text format, not html, and CC netdev ( I
> did here)
> 
>> 
>> 
> 
> 



Re: reproducable panic eviction work queue

2015-07-18 Thread Eric Dumazet
On Fri, 2015-07-17 at 21:18 +, Johan Schuijt wrote:
> Hey guys, 
> 
> 
> We’re currently running into a reproducible panic in the eviction work
> queue code when we pin al our eth* IRQ to different CPU cores (in
> order to scale our networking performance for our virtual servers).
> This only occurs in kernels >= 3.17 and is a result of the following
> change:
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-3.18.y&id=b13d3cbfb8e8a8f53930af67d1ebf05149f32c24
> 
> 
> The race/panic we see seems to be the same as, or similar to:
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-3.18.y&id=65ba1f1ec0eff1c25933468e1d238201c0c2cb29
> 
> 
> We can confirm that this is directly exposed by the IRQ pinning since
> disabling this stops us from being able to reproduce this case :)
> 
> 
> How te reproduce: in our test-setup we have 4 machines generating UDP
> packets which are send to the vulnerable host. These all have a MTU of
> 100 (for test purposes) and send UDP packets of a size of 256 bytes.
> Within half an hour you will see the following panic:
> 
> 
> crash> bt
> PID: 56 TASK: 885f3d9fc210  CPU: 9   COMMAND: "kworker/9:0"
>  #0 [885f3da03b60] machine_kexec at 8104a1f7
>  #1 [885f3da03bb0] crash_kexec at 810db187
>  #2 [885f3da03c80] oops_end at 81015140
>  #3 [885f3da03ca0] general_protection at 814f6c88
> [exception RIP: inet_evict_bucket+281]
> RIP: 81480699  RSP: 885f3da03d58  RFLAGS: 00010292
> RAX: 885f3da03d08  RBX: dead001000a8  RCX:
> 885f3da03d08
> RDX: 0006  RSI: 885f3da03ce8  RDI:
> dead001000a8
> RBP: 0002   R8: 0286   R9:
> 88302f401640
> R10: 8000  R11: 88602ec0c138  R12:
> 81a8d8c0
> R13: 885f3da03d70  R14:   R15:
> 881d6efe1a00
> ORIG_RAX:   CS: 0010  SS: 0018
>  #4 [885f3da03db0] inet_frag_worker at 8148075a
>  #5 [885f3da03e10] process_one_work at 8107be19
>  #6 [885f3da03e60] worker_thread at 8107c6e3
>  #7 [885f3da03ed0] kthread at 8108103e
>  #8 [885f3da03f50] ret_from_fork at 814f4d7c
> 
> 
> We would love to receive your input on this matter.
> 
> 
> Thx in advance,
> 
> 
> - Johan

Check commits 65ba1f1ec0eff1c25933468e1d238201c0c2cb29 &
d70127e8a942364de8dd140fe73893efda363293

Also please send your mails in text format, not html, and CC netdev ( I
did here)

> 
> 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RGC PATCH v3.10] net: sched: validate that class is found in qdisc_tree_decrease_qlen

2015-07-18 Thread Eric Dumazet
On Fri, 2015-07-17 at 17:27 -0700, Alex Gartrell wrote:
> On Fri, Jul 17, 2015 at 5:10 PM, Cong Wang  wrote:
> > Hmm, but in htb_delete() we do reset the leaf qdisc before removing the
> > class from ha
> >
> > if (!cl->level) {
> > qlen = cl->un.leaf.q->q.qlen;
> > qdisc_reset(cl->un.leaf.q);
> > qdisc_tree_decrease_qlen(cl->un.leaf.q, qlen);
> > }
> >
> > therefore, the leaf fq_codel qdisc is supposed to have 0 skb after that,
> > that is, the second fq_codel_reset() is supposed to return NULL immediately?
> > Why is qdidsc_tree_decrease_qlen() called in the second fq_codel_reset()?
> >
> > Or there is a race condition between ->delete() and ->put()? In which new
> > skb can be enqueued?
> 
> This makes the most sense to me, but I have ~32 hours of experience
> with this subsystem :)
> 
> Taking that for granted, it seems like it'd be appropriate to note the
> invariant in the code i've changed with a WARN_ON and to skip it, and
> then to otherwise find a way to close the hole.  Do you agree?
> 


HTB + fq_codel ?

Make sure you applied Cong fix

commit c0afd9ce4d6a646fb6433536f95a418bb348fab1
Author: WANG Cong 
Date:   Tue Jul 14 11:21:58 2015 -0700

fq_codel: fix return value of fq_codel_drop()

The ->drop() is supposed to return the number of bytes it dropped,
however fq_codel_drop() returns the index of the flow where it drops
a packet from.

Fix this by introducing a helper to wrap fq_codel_drop().


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html