Re: [PATCH] [v2] Bluetooth: btrsi: rework dependencies

2018-03-15 Thread Kalle Valo
Arnd Bergmann  writes:

> On Thu, Mar 15, 2018 at 7:30 PM, Marcel Holtmann  wrote:
>> Hi Arnd,
>>
>>> The linkage between the bluetooth driver and the wireless
>>> driver is not defined properly, leading to build problems
>>> such as:
>>>
>>> warning: (BT_HCIRSI) selects RSI_COEX which has unmet direct
>>> dependencies (NETDEVICES && WLAN && WLAN_VENDOR_RSI && BT_HCIRSI &&
>>> RSI_91X)
>>> drivers/net/wireless/rsi/rsi_91x_main.o: In function `rsi_read_pkt':
>>> (.text+0x205): undefined reference to `rsi_bt_ops'
>>>
>>> As the dependency is actually the reverse (RSI_91X uses
>>> the BT_RSI driver, not the other way round), this changes
>>> the dependency to match, and enables the bluetooth driver
>>> from the RSI_COEX symbol.
>>>
>>> Fixes: 38aa4da50483 ("Bluetooth: btrsi: add new rsi bluetooth driver")
>>> Signed-off-by: Arnd Bergmann 
>>> ---
>>> v2: Pick a different from v1
>>> ---
>>> drivers/bluetooth/Kconfig| 4 +---
>>> drivers/net/wireless/rsi/Kconfig | 4 +++-
>>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>
>> Acked-by: Marcel Holtmann 
>>
>> Since I think Kalle still has to take it through his tree until the
>> btrsi driver makes it into net-next.

Yes, I have to take this as I haven't sent the original patch to Dave
yet.

> Kalle, please wait for v3 though, I just ran into another build
> failure caused by a typo in v2.

Ok, I saw it.

-- 
Kalle Valo


Re: [PATCH 00/16] remove eight obsolete architectures

2018-03-15 Thread afzal mohammed
Hi,

On Thu, Mar 15, 2018 at 10:56:48AM +0100, Arnd Bergmann wrote:
> On Thu, Mar 15, 2018 at 10:42 AM, David Howells  wrote:

> > Do we have anything left that still implements NOMMU?

Please don't kill !MMU.

> Yes, plenty.

> I've made an overview of the remaining architectures for my own reference[1].
> The remaining NOMMU architectures are:
> 
> - arch/arm has ARMv7-M (Cortex-M microcontroller), which is actually
> gaining traction

ARMv7-R as well. It also seems ARM is coming up with more !MMU cores: v8-M
and v8-R. In addition, though only of academic interest, ARM MMU-capable
platforms can run !MMU Linux.

afzal

> - arch/sh has an open-source J2 core that was added not that long ago,
> it seems to
>   be the only SH compatible core that anyone is working on.
> - arch/microblaze supports both MMU/NOMMU modes (most use an MMU)
> - arch/m68k supports several NOMMU targets, both the coldfire SoCs and the
>   classic processors
> - c6x has no MMU


Re: [Intel-wired-lan] [PATCH v2 12/15] ice: Add stats and ethtool support

2018-03-15 Thread Stephen Hemminger
On Thu, 15 Mar 2018 17:50:10 -0700
Alexander Duyck  wrote:

> On Thu, Mar 15, 2018 at 4:52 PM, Stephen Hemminger
>  wrote:
> > On Thu, 15 Mar 2018 16:47:59 -0700
> > Anirudh Venkataramanan  wrote:
> >  
> >> +
> >> +static const struct ice_stats ice_gstrings_vsi_stats[] = {
> >> + ICE_VSI_STAT("tx_unicast", eth_stats.tx_unicast),
> >> + ICE_VSI_STAT("rx_unicast", eth_stats.rx_unicast),
> >> + ICE_VSI_STAT("tx_multicast", eth_stats.tx_multicast),
> >> + ICE_VSI_STAT("rx_multicast", eth_stats.rx_multicast),
> >> + ICE_VSI_STAT("tx_broadcast", eth_stats.tx_broadcast),
> >> + ICE_VSI_STAT("rx_broadcast", eth_stats.rx_broadcast),
> >> + ICE_VSI_STAT("tx_bytes", eth_stats.tx_bytes),
> >> + ICE_VSI_STAT("rx_bytes", eth_stats.rx_bytes),
> >> + ICE_VSI_STAT("rx_discards", eth_stats.rx_discards),
> >> + ICE_VSI_STAT("tx_errors", eth_stats.tx_errors),
> >> + ICE_VSI_STAT("tx_linearize", tx_linearize),
> >> + ICE_VSI_STAT("rx_unknown_protocol", eth_stats.rx_unknown_protocol),
> >> + ICE_VSI_STAT("rx_alloc_fail", rx_buf_failed),
> >> + ICE_VSI_STAT("rx_pg_alloc_fail", rx_page_failed),
> >> +};
> >> +  
> >
> > Ignoring feedback from maintainers is unlikely to help get your driver 
> > adopted.  
> 
> Your feedback wasn't ignored, the netdev stats are gone. I double
> checked and there was this in addition to the netdev stats before so I
> think the suggestion to remove the netdev stats was just taken
> literally.
> 
> The VSI is a slightly different entity from the netdev itself. A
> netdev can be backed by a VSI in the case of the PF, but the VSI can
> be used in other ways such as what we did in i40e where we were using
> it to spawn queue groups to work with mqprio as a filter target and in
> that case the queue groups wouldn't have a netdev directly associated
> with them so in that case it might make sense to leave these as
> separate stats.
> 
> - Alex

Sorry, I was confused by eth_stats and thought it was a copy
of net_stats. This looks good.

After reading ice_stat_update40 I can see why you needed
to know the starting point.

Since ice_fetch_u64_stats_per_ring is called only by ice_update_vsi_ring_stats,
and the update is only done by the watchdog and when stats are fetched,
it doesn't look like you need to use the _irq variant of u64_stats_fetch_begin.
That would save having to disable local IRQs while handling stats.


Acked-by: Stephen Hemminger 



[PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-15 Thread Kees Cook
Patch 1 adds const_max_t(), patch 2 uses it in all the places max()
was used for stack arrays. Commit log from patch 1:

---snip---
kernel.h: Introduce const_max_t() for VLA removal

In the effort to remove all VLAs from the kernel[1], it is desirable to
build with -Wvla. However, this warning is overly pessimistic, in that
it is only happy with stack array sizes that are declared as constant
expressions, and not constant values. One case of this is the evaluation
of the max() macro which, due to its construction, ends up converting
constant expression arguments into a constant value result. Attempts
to adjust the behavior of max() ran afoul of version-dependent compiler
behavior[2].

To work around this and still gain -Wvla coverage, this patch introduces
a new macro, const_max_t(), for use in these cases of stack array size
declaration, where the constant expressions are retained. Since this means
losing the double-evaluation protections of the max() macro, this macro is
designed to explicitly fail if used on non-constant arguments.

Older compilers will fail with the unhelpful message:

error: first argument to ‘__builtin_choose_expr’ not a constant

Newer compilers will fail with a hopefully more helpful message:

error: call to ‘__error_non_const_arg’ declared with attribute error: 
const_max_t() used with non-constant expression

To gain the ability to compare differing types, the desired type must
be explicitly declared, as with the existing max_t() macro. This is
needed when comparing different enum types and to allow things like:

int foo[const_max_t(size_t, 6, sizeof(something))];

[1] https://lkml.org/lkml/2018/3/7/621
[2] https://lkml.org/lkml/2018/3/10/170
---eol---

Hopefully this reads well as a summary of all the things that got tried.
I've tested this on allmodconfig builds with gcc 4.4.4 and 6.3.0, with and
without -Wvla.

-Kees

v5: explicit type argument
v4: forced size_t type



[PATCH v5 2/2] Remove false-positive VLAs when using max()

2018-03-15 Thread Kees Cook
As part of removing VLAs from the kernel[1], we want to build with -Wvla,
but it is overly pessimistic and only accepts constant expressions for
stack array sizes, not constant values as well. The max() macro
triggers the warning, so this refactors these uses of max() to use the
new const_max_t() instead.

[1] https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Kees Cook 
---
 drivers/input/touchscreen/cyttsp4_core.c |  2 +-
 fs/btrfs/tree-checker.c  |  3 ++-
 lib/vsprintf.c   |  5 +++--
 net/ipv4/proc.c  |  8 
 net/ipv6/proc.c  | 11 +--
 5 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/input/touchscreen/cyttsp4_core.c 
b/drivers/input/touchscreen/cyttsp4_core.c
index 727c3232517c..7fb9bd48e41c 100644
--- a/drivers/input/touchscreen/cyttsp4_core.c
+++ b/drivers/input/touchscreen/cyttsp4_core.c
@@ -868,7 +868,7 @@ static void cyttsp4_get_mt_touches(struct cyttsp4_mt_data 
*md, int num_cur_tch)
struct cyttsp4_touch tch;
int sig;
int i, j, t = 0;
-   int ids[max(CY_TMA1036_MAX_TCH, CY_TMA4XX_MAX_TCH)];
+   int ids[const_max_t(size_t, CY_TMA1036_MAX_TCH, CY_TMA4XX_MAX_TCH)];
 
memset(ids, 0, si->si_ofs.tch_abs[CY_TCH_T].max * sizeof(int));
for (i = 0; i < num_cur_tch; i++) {
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index c3c8d48f6618..d83244e3821f 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -341,7 +341,8 @@ static int check_dir_item(struct btrfs_root *root,
 */
if (key->type == BTRFS_DIR_ITEM_KEY ||
key->type == BTRFS_XATTR_ITEM_KEY) {
-   char namebuf[max(BTRFS_NAME_LEN, XATTR_NAME_MAX)];
+   char namebuf[const_max_t(size_t, BTRFS_NAME_LEN,
+XATTR_NAME_MAX)];
 
read_extent_buffer(leaf, namebuf,
(unsigned long)(di + 1), name_len);
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index d7a708f82559..12ff57a36171 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -744,8 +744,9 @@ char *resource_string(char *buf, char *end, struct resource 
*res,
 #define FLAG_BUF_SIZE  (2 * sizeof(res->flags))
 #define DECODED_BUF_SIZE   sizeof("[mem - 64bit pref window disabled]")
 #define RAW_BUF_SIZE   sizeof("[mem - flags 0x]")
-   char sym[max(2*RSRC_BUF_SIZE + DECODED_BUF_SIZE,
-2*RSRC_BUF_SIZE + FLAG_BUF_SIZE + RAW_BUF_SIZE)];
+   char sym[const_max_t(size_t,
+2*RSRC_BUF_SIZE + DECODED_BUF_SIZE,
+2*RSRC_BUF_SIZE + FLAG_BUF_SIZE + RAW_BUF_SIZE)];
 
char *p = sym, *pend = sym + sizeof(sym);
int decode = (fmt[0] == 'R') ? 1 : 0;
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index dc5edc8f7564..7f5c3b40dac9 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -46,7 +46,7 @@
 #include 
 #include 
 
-#define TCPUDP_MIB_MAX max_t(u32, UDP_MIB_MAX, TCP_MIB_MAX)
+#define TCPUDP_MIB_MAX const_max_t(size_t, UDP_MIB_MAX, TCP_MIB_MAX)
 
 /*
  * Report socket allocation statistics [m...@utu.fi]
@@ -404,7 +404,7 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, void 
*v)
struct net *net = seq->private;
int i;
 
-   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
+   memset(buff, 0, sizeof(buff));
 
seq_puts(seq, "\nTcp:");
for (i = 0; snmp4_tcp_list[i].name; i++)
@@ -421,7 +421,7 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, void 
*v)
seq_printf(seq, " %lu", buff[i]);
}
 
-   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
+   memset(buff, 0, sizeof(buff));
 
snmp_get_cpu_field_batch(buff, snmp4_udp_list,
 net->mib.udp_statistics);
@@ -432,7 +432,7 @@ static int snmp_seq_show_tcp_udp(struct seq_file *seq, void 
*v)
for (i = 0; snmp4_udp_list[i].name; i++)
seq_printf(seq, " %lu", buff[i]);
 
-   memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
+   memset(buff, 0, sizeof(buff));
 
/* the UDP and UDP-Lite MIBs are the same */
seq_puts(seq, "\nUdpLite:");
diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index b67814242f78..b68c233de296 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -30,10 +30,9 @@
 #include 
 #include 
 
-#define MAX4(a, b, c, d) \
-   max_t(u32, max_t(u32, a, b), max_t(u32, c, d))
-#define SNMP_MIB_MAX MAX4(UDP_MIB_MAX, TCP_MIB_MAX, \
-   IPSTATS_MIB_MAX, ICMP_MIB_MAX)
+#define SNMP_MIB_MAX const_max_t(u32,  \
+   const_max_t(u32, UDP_MIB_MAX, TCP_MIB_MAX), \
+   const_max_t(u32, IPSTATS_MIB_MAX, ICMP_MIB_MAX))
 
 static int 

[PATCH v5 1/2] kernel.h: Introduce const_max_t() for VLA removal

2018-03-15 Thread Kees Cook
In the effort to remove all VLAs from the kernel[1], it is desirable to
build with -Wvla. However, this warning is overly pessimistic, in that
it is only happy with stack array sizes that are declared as constant
expressions, and not constant values. One case of this is the evaluation
of the max() macro which, due to its construction, ends up converting
constant expression arguments into a constant value result. Attempts
to adjust the behavior of max() ran afoul of version-dependent compiler
behavior[2].

To work around this and still gain -Wvla coverage, this patch introduces
a new macro, const_max_t(), for use in these cases of stack array size
declaration, where the constant expressions are retained. Since this means
losing the double-evaluation protections of the max() macro, this macro is
designed to explicitly fail if used on non-constant arguments.

Older compilers will fail with the unhelpful message:

error: first argument to ‘__builtin_choose_expr’ not a constant

Newer compilers will fail with a hopefully more helpful message:

error: call to ‘__error_non_const_arg’ declared with attribute error: 
const_max_t() used with non-constant expression

To gain the ability to compare differing types, the desired type must
be explicitly declared, as with the existing max_t() macro. This is
needed when comparing different enum types and to allow things like:

int foo[const_max_t(size_t, 6, sizeof(something))];

[1] https://lkml.org/lkml/2018/3/7/621
[2] https://lkml.org/lkml/2018/3/10/170

Signed-off-by: Kees Cook 
---
 include/linux/kernel.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3fd291503576..e14531781568 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -820,6 +820,25 @@ static inline void ftrace_dump(enum ftrace_dump_mode 
oops_dump_mode) { }
  x, y)
 
 /**
+ * const_max_t - return maximum of two compile-time constant expressions
+ * @type: type used for evaluation
+ * @x: first compile-time constant expression
+ * @y: second compile-time constant expression
+ *
+ * This has no multi-evaluation defenses, and must only ever be used with
+ * compile-time constant expressions (for example when calculating a stack
+ * array size).
+ */
+size_t __error_non_const_arg(void) \
+__compiletime_error("const_max_t() used with non-constant expression");
+#define const_max_t(type, x, y)\
+   __builtin_choose_expr(__builtin_constant_p(x) &&\
+ __builtin_constant_p(y),  \
+ (type)(x) > (type)(y) ?   \
+   (type)(x) : (type)(y),  \
+ __error_non_const_arg())
+
+/**
  * min3 - return minimum of three values
  * @x: first value
  * @y: second value
-- 
2.7.4



Re: [PATCH RFC 2/2] virtio_ring: support packed ring

2018-03-15 Thread Jason Wang



On 2018年02月23日 19:18, Tiwei Bie wrote:

Signed-off-by: Tiwei Bie 
---
  drivers/virtio/virtio_ring.c | 699 +--
  include/linux/virtio_ring.h  |   8 +-
  2 files changed, 618 insertions(+), 89 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index eb30f3e09a47..393778a2f809 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,14 +58,14 @@
  
  struct vring_desc_state {

void *data; /* Data for callback. */
-   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+   void *indir_desc;   /* Indirect descriptor, if any. */
+   int num;/* Descriptor list length. */
  };
  
  struct vring_virtqueue {

struct virtqueue vq;
  
-	/* Actual memory layout for this queue */

-   struct vring vring;
+   bool packed;
  
  	/* Can we use weak barriers? */

bool weak_barriers;
@@ -87,11 +87,28 @@ struct vring_virtqueue {
/* Last used index we've seen. */
u16 last_used_idx;
  
-	/* Last written value to avail->flags */

-   u16 avail_flags_shadow;
-
-   /* Last written value to avail->idx in guest byte order */
-   u16 avail_idx_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue */
+   struct vring vring;
+
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue */
+   struct vring_packed vring_packed;
+   u8 wrap_counter : 1;
+   bool chaining;
+   };
+   };
  
  	/* How to notify other side. FIXME: commonalize hcalls! */

bool (*notify)(struct virtqueue *vq);
@@ -201,26 +218,37 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
  cpu_addr, size, direction);
  }
  
-static void vring_unmap_one(const struct vring_virtqueue *vq,

-   struct vring_desc *desc)
+static void vring_unmap_one(const struct vring_virtqueue *vq, void *_desc)
  {


Let's split the helpers into packed/split versions like the other helpers?
(The caller already knows the type of vq.)



+   u64 addr;
+   u32 len;
u16 flags;
  
  	if (!vring_use_dma_api(vq->vq.vdev))

return;
  
-	flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);

+   if (vq->packed) {
+   struct vring_packed_desc *desc = _desc;
+
+   addr = virtio64_to_cpu(vq->vq.vdev, desc->addr);
+   len = virtio32_to_cpu(vq->vq.vdev, desc->len);
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+   } else {
+   struct vring_desc *desc = _desc;
+
+   addr = virtio64_to_cpu(vq->vq.vdev, desc->addr);
+   len = virtio32_to_cpu(vq->vq.vdev, desc->len);
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+   }
  
  	if (flags & VRING_DESC_F_INDIRECT) {

dma_unmap_single(vring_dma_dev(vq),
-virtio64_to_cpu(vq->vq.vdev, desc->addr),
-virtio32_to_cpu(vq->vq.vdev, desc->len),
+addr, len,
 (flags & VRING_DESC_F_WRITE) ?
 DMA_FROM_DEVICE : DMA_TO_DEVICE);
} else {
dma_unmap_page(vring_dma_dev(vq),
-  virtio64_to_cpu(vq->vq.vdev, desc->addr),
-  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  addr, len,
   (flags & VRING_DESC_F_WRITE) ?
   DMA_FROM_DEVICE : DMA_TO_DEVICE);
}
@@ -235,8 +263,9 @@ static int vring_mapping_error(const struct vring_virtqueue 
*vq,
return dma_mapping_error(vring_dma_dev(vq), addr);
  }
  
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,

-unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
  {
struct vring_desc *desc;
unsigned int i;
@@ -257,14 +286,32 @@ static struct vring_desc *alloc_indirect(struct virtqueue 
*_vq,
return desc;
  }
  
-static inline int virtqueue_add(struct virtqueue *_vq,

- 

Re: [RFC PATCH V1 01/12] audit: add container id

2018-03-15 Thread Richard Guy Briggs
On 2018-03-15 16:27, Stefan Berger wrote:
> On 03/01/2018 02:41 PM, Richard Guy Briggs wrote:
> > Implement the proc fs write to set the audit container ID of a process,
> > emitting an AUDIT_CONTAINER record to document the event.
> > 
> > This is a write from the container orchestrator task to a proc entry of
> > the form /proc/PID/containerid where PID is the process ID of the newly
> > created task that is to become the first task in a container, or an
> > additional task added to a container.
> > 
> > The write expects up to a u64 value (unset: 18446744073709551615).
> > 
> > This will produce a record such as this:
> > type=UNKNOWN[1333] msg=audit(1519903238.968:261): op=set pid=596 uid=0 
> > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 
> > ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
> > 
> > The "op" field indicates an initial set.  The "pid" to "ses" fields are
> > the orchestrator while the "opid" field is the object's PID, the process
> > being "contained".  Old and new container ID values are given in the
> > "contid" fields, while res indicates its success.
> > 
> > It is not permitted to self-set, unset or re-set the container ID.  A
> > child inherits its parent's container ID, but then can be set only once
> > after.
> > 
> > See: https://github.com/linux-audit/audit-kernel/issues/32
> > 
> > 
> >   /* audit_rule_data supports filter rules with both integer and string
> >* fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index 4e0a4ac..0ee1e59 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -2073,6 +2073,92 @@ int audit_set_loginuid(kuid_t loginuid)
> > return rc;
> >   }
> > 
> > +static int audit_set_containerid_perm(struct task_struct *task, u64 
> > containerid)
> > +{
> > +   struct task_struct *parent;
> > +   u64 pcontainerid, ccontainerid;
> > +   pid_t ppid;
> > +
> > +   /* Don't allow to set our own containerid */
> > +   if (current == task)
> > +   return -EPERM;
> > +   /* Don't allow the containerid to be unset */
> > +   if (!cid_valid(containerid))
> > +   return -EINVAL;
> > +   /* if we don't have caps, reject */
> > +   if (!capable(CAP_AUDIT_CONTROL))
> > +   return -EPERM;
> > +   /* if containerid is unset, allow */
> > +   if (!audit_containerid_set(task))
> > +   return 0;
> 
> I am wondering whether there should be a check for the target process that
> will receive the containerid to not have CAP_SYS_ADMIN that would otherwise
> allow it to arbitrarily unshare()/clone() and leave the set of namespaces
> that may make up the container whose containerid we assign here?

This is a reasonable question.  This has been debated, and I understood
the conclusion to be that, without a clear definition of a "container",
the task still remains in that container, which just now has more
sub-namespaces (in the case of hierarchical namespaces).  We don't want
to restrict it in such a way, and leaving it unrestricted allows it to
create nested containers.  I see setns being more problematic if it could
switch to another existing namespace that was set up by the orchestrator
for a different container.  The coming v2 patchset acknowledges this
situation with the network namespace being potentially shared by multiple
containers.

This is the motivation for the code below, which allows the containerid
to be set even if it is already inherited from the parent.
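
For readers following the thread, the order of the checks in the quoted audit_set_containerid_perm() can be modelled as a pure function. This is only a userspace sketch: containerid_perm() and its parameters are hypothetical stand-ins for kernel state, and it assumes cid_valid() simply means "not the unset value" given in the patch description.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* The patch description gives 18446744073709551615 as the unset value. */
#define CONTAINERID_UNSET UINT64_MAX

/*
 * Userspace model of the checks, in the same order as the quoted code.
 *   is_self       - the writer is the target task itself
 *   has_audit_ctl - the writer has CAP_AUDIT_CONTROL
 *   new_id        - the value being written
 *   task_id       - the target's current container ID
 *   parent_id     - the container ID of the target's parent
 */
int containerid_perm(bool is_self, bool has_audit_ctl, uint64_t new_id,
		     uint64_t task_id, uint64_t parent_id)
{
	if (is_self)				/* no self-set */
		return -EPERM;
	if (new_id == CONTAINERID_UNSET)	/* no unset */
		return -EINVAL;
	if (!has_audit_ctl)			/* writer lacks the capability */
		return -EPERM;
	if (task_id == CONTAINERID_UNSET)	/* first set: allow */
		return 0;
	if (task_id != parent_id)		/* already set explicitly */
		return -EPERM;
	return 0;				/* only inherited: allow once */
}
```

The last two branches are the point under discussion: a task whose ID merely matches its parent's is treated as "inherited" and may still be assigned its own ID.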

> > +   /* it is already set, and not inherited from the parent, reject */
> > +   ccontainerid = audit_get_containerid(task);
> > +   rcu_read_lock();
> > +   parent = rcu_dereference(task->real_parent);
> > +   rcu_read_unlock();
> > +   task_lock(parent);
> > +   pcontainerid = audit_get_containerid(parent);
> > +   ppid = task_tgid_nr(parent);
>
> ppid not needed...

Thanks for catching this.  It was the vestige of a failed devel
experiment that didn't flush that useless appendage.  :-)

> > +   task_unlock(parent);
> > +   if (ccontainerid != pcontainerid)
> > +   return -EPERM;
> > +   return 0;
> > +}
> > +
> > +static void audit_log_set_containerid(struct task_struct *task, u64 
> > oldcontainerid,
> > + u64 containerid, int rc)
> > +{
> > +   struct audit_buffer *ab;
> > +   uid_t uid;
> > +   struct tty_struct *tty;
> > +
> > +   if (!audit_enabled)
> > +   return;
> > +
> > +   ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONTAINER);
> > +   if (!ab)
> > +   return;
> > +
> > +   uid = from_kuid(&init_user_ns, task_uid(current));
> > +   tty = audit_get_tty(current);
> > +
> > +   audit_log_format(ab, "op=set pid=%d uid=%u", task_tgid_nr(current), 
> > uid);
> > +   audit_log_task_context(ab);
> > +   audit_log_format(ab, " auid=%u tty=%s ses=%u opid=%d old-contid=%llu 
> > contid=%llu res=%d",
> > +from_kuid(&init_user_ns, audit_get_loginuid(current)),
> > +tty ? tty_name(tty) : 

Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Miguel Ojeda
On Fri, Mar 16, 2018 at 12:49 AM, Kees Cook  wrote:
> On Thu, Mar 15, 2018 at 4:46 PM, Linus Torvalds
>  wrote:
>> What I'm *not* so much ok with is "const_max(5,sizeof(x))" erroring
>> out, or silently causing insane behavior due to hidden subtle type
>> casts..
>
> Yup! I like it as an explicit argument. Thanks!
>

What about something like this?

#define INTMAXT_MAX LLONG_MAX
typedef int64_t intmax_t;

#define const_max(x, y)   \
__builtin_choose_expr(\
!__builtin_constant_p(x) || !__builtin_constant_p(y), \
__error_not_const_arg(),  \
__builtin_choose_expr(\
(x) > INTMAXT_MAX || (y) > INTMAXT_MAX,   \
__error_too_big(),\
__builtin_choose_expr(\
(intmax_t)(x) >= (intmax_t)(y),   \
(x),  \
(y)   \
) \
) \
)

Works for different types, allows mixing negatives and positives, and
returns the original type, e.g.:

  const_max(-1, sizeof(char));

is of type 'long unsigned int', but:

  const_max(2, sizeof(char));

is of type 'int'. While I am not a fan of the return type depending on
the arguments, it is useful if you are going to use the expression in
something that expects a precise type (a printk() for instance?).

The check against the INTMAXT_MAX is there to avoid complexity (if we
do not handle those cases, it is safe to use intmax_t for the
comparison; otherwise you have to have another compile time branch for
the case positive-positive using uintmax_t) and also avoids odd
warnings for some cases above LLONG_MAX about comparisons with 0 for
unsigned expressions being always true. On the positive side, it
prevents using the macro for things like "(size_t)-1".

Cheers,
Miguel
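
The type behaviour described above can be checked in userspace with C11 _Generic. This is a sketch under stated assumptions: it uses intmax_t and INTMAX_MAX from <stdint.h> in place of the typedef and INTMAXT_MAX proposed in the mail, and const_max_not_const(), const_max_too_big(), kind_of() and the two helper functions are hypothetical names that appear nowhere in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholders for the error branches; never defined. */
int const_max_not_const(void);
int const_max_too_big(void);

/* Same nesting as the proposal above, with <stdint.h> names. */
#define const_max(x, y)							\
	__builtin_choose_expr(						\
		!__builtin_constant_p(x) || !__builtin_constant_p(y),	\
		const_max_not_const(),					\
		__builtin_choose_expr(					\
			(x) > INTMAX_MAX || (y) > INTMAX_MAX,		\
			const_max_too_big(),				\
			__builtin_choose_expr(				\
				(intmax_t)(x) >= (intmax_t)(y),		\
				(x),					\
				(y))))

enum kind { KIND_OTHER, KIND_INT, KIND_SIZE_T };

/* Report the static type of an expression without evaluating it. */
#define kind_of(e) \
	_Generic((e), int: KIND_INT, size_t: KIND_SIZE_T, default: KIND_OTHER)

/* 2 is the larger argument, so the result keeps the type of 2. */
enum kind kind_of_max_2_and_sizeof(void)
{
	return kind_of(const_max(2, sizeof(char)));
}

/* sizeof(char) is the larger argument, so the result is a size_t. */
enum kind kind_of_max_neg_and_sizeof(void)
{
	return kind_of(const_max(-1, sizeof(char)));
}
```

Unlike the ?: operator, __builtin_choose_expr() does not apply the usual arithmetic conversions across its two result operands, which is why the result type follows whichever argument is selected.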


[PATCH V2] xfrm: fix rcu_read_unlock usage in xfrm_local_error

2018-03-15 Thread Taehee Yoo
In xfrm_local_error(), rcu_read_unlock() should be called only when
afinfo is not NULL, because xfrm_state_get_afinfo() already calls
rcu_read_unlock() when it returns NULL.

Fixes: af5d27c4e12b ("xfrm: remove xfrm_state_put_afinfo")
Signed-off-by: Taehee Yoo 
---

V2 :
 - Add Fixes tag

V1 :
 - Initial patch

 net/xfrm/xfrm_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 2346867..89b178a7 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -285,8 +285,9 @@ void xfrm_local_error(struct sk_buff *skb, int mtu)
return;
 
afinfo = xfrm_state_get_afinfo(proto);
-   if (afinfo)
+   if (afinfo) {
afinfo->local_error(skb, mtu);
-   rcu_read_unlock();
+   rcu_read_unlock();
+   }
 }
 EXPORT_SYMBOL_GPL(xfrm_local_error);
-- 
2.9.3
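
The imbalance this patch fixes can be illustrated with a small userspace model. It is only a sketch: read_lock_depth stands in for RCU read-side nesting, get_afinfo() models the behaviour the commit message attributes to xfrm_state_get_afinfo(), and every name in it (including KNOWN_FAMILY) is hypothetical.

```c
#include <stddef.h>

/* Models RCU read-side nesting: +1 on lock, -1 on unlock. */
static int read_lock_depth;

struct afinfo {
	void (*local_error)(int mtu);
};

static void record_error(int mtu)
{
	(void)mtu;	/* nothing to do in this model */
}

static struct afinfo known_afinfo = { .local_error = record_error };

#define KNOWN_FAMILY 2	/* arbitrary value for this sketch */

/* Returns with the read lock held only when the lookup succeeds. */
static struct afinfo *get_afinfo(int family)
{
	struct afinfo *afinfo;

	read_lock_depth++;			/* models rcu_read_lock() */
	afinfo = (family == KNOWN_FAMILY) ? &known_afinfo : NULL;
	if (!afinfo)
		read_lock_depth--;		/* unlock on the failure path */
	return afinfo;
}

/* Pre-patch shape: unlocks even when the lookup already did. */
int local_error_unbalanced(int family, int mtu)
{
	struct afinfo *afinfo;

	read_lock_depth = 0;
	afinfo = get_afinfo(family);
	if (afinfo)
		afinfo->local_error(mtu);
	read_lock_depth--;
	return read_lock_depth;
}

/* Patched shape: unlocks only when the lookup left the lock held. */
int local_error_balanced(int family, int mtu)
{
	struct afinfo *afinfo;

	read_lock_depth = 0;
	afinfo = get_afinfo(family);
	if (afinfo) {
		afinfo->local_error(mtu);
		read_lock_depth--;
	}
	return read_lock_depth;
}
```

Both variants end at depth 0 for a known family; only the unbalanced one goes negative when the lookup fails.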



Re: [PATCH] xfrm: fix rcu_read_unlock usage in xfrm_local_error

2018-03-15 Thread Taehee Yoo
2018-03-15 19:48 GMT+09:00 Steffen Klassert :
> On Tue, Mar 13, 2018 at 05:26:07PM +0900, Taehee Yoo wrote:
>> In the xfrm_local_error, rcu_read_unlock should be called when afinfo
>> is not NULL. because xfrm_state_get_afinfo calls rcu_read_unlock
>> if afinfo is NULL.
>>
>> Signed-off-by: Taehee Yoo 
>
> Can you please add a 'Fixes:' tag, so that it can be
> correctly backported to the stable trees?
>
> Thanks!

Thank you for your review!
I will send V2 patch.


Re: linux-next: manual merge of the net-next tree with the rdma-fixes tree

2018-03-15 Thread Jason Gunthorpe
On Thu, Mar 15, 2018 at 09:18:02PM -0400, Doug Ledford wrote:
 
> Here's the commit (from the rdma git repo) with the proper merge fix
> (although it also has other minor merge stuff that needs to be ignored):
> 
> 2d873449a202 (Merge branch 'k.o/wip/dl-for-rc' into k.o/wip/dl-for-next)

Stephen,

If you merge the branches in the order:
  rdma for-next, rdma for-rc, then net-next
you should not see a merge conflict as rdma for-next already has the
correct resolution.

Jason


Re: [PATCH] mlx5: Remove call to ida_pre_get

2018-03-15 Thread Matthew Wilcox
On Thu, Mar 15, 2018 at 11:58:07PM +, Saeed Mahameed wrote:
> On Wed, 2018-03-14 at 19:57 -0700, Matthew Wilcox wrote:
> > From: Matthew Wilcox 
> > 
> > The mlx5 driver calls ida_pre_get() in a loop for no readily apparent
> > reason.  The driver uses ida_simple_get() which will call
> > ida_pre_get()
> > by itself and there's no need to use ida_pre_get() unless using
> > ida_get_new().
> > 
> 
> Hi Matthew,
> 
> Is this is causing any issues ? or just a simple cleanup ?

I'm removing the API.  At the end of this cleanup, there will be no more
preallocation; instead we will rely on the slab allocator not sucking.

> Adding Maor, the author of this change,
> 
> I believe the idea is to speed up insert_fte (which calls
> ida_simple_get) since insert_fte runs under the FTE write semaphore,
> in this case if ida_pre_get was successful before taking the semaphore
> for all the FTE nodes in the loop, this will be a huge win for
> ida_simple_get which will immediately return success without even
> trying to allocate.

I think that's misguided.  The IDA allocator is only going to allocate
memory once in every 1024 allocations.  Also, it does try to allocate,
even if there are preallocated nodes.  So you're just wasting time,
unfortunately.



[PATCH net-next v4 2/7] ibmvnic: Update and clean up reset TX pool routine

2018-03-15 Thread Thomas Falcon
Update TX pool reset routine to accommodate new TSO pool array. Introduce
a function that resets one TX pool, and use that function to initialize
each pool in both pool arrays.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 45 +-
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 9c7d19c926f9..4dc304422ece 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -557,36 +557,41 @@ static int init_rx_pools(struct net_device *netdev)
return 0;
 }
 
+static int reset_one_tx_pool(struct ibmvnic_adapter *adapter,
+struct ibmvnic_tx_pool *tx_pool)
+{
+   int rc, i;
+
+   rc = reset_long_term_buff(adapter, &tx_pool->long_term_buff);
+   if (rc)
+   return rc;
+
+   memset(tx_pool->tx_buff, 0,
+  tx_pool->num_buffers *
+  sizeof(struct ibmvnic_tx_buff));
+
+   for (i = 0; i < tx_pool->num_buffers; i++)
+   tx_pool->free_map[i] = i;
+
+   tx_pool->consumer_index = 0;
+   tx_pool->producer_index = 0;
+
+   return 0;
+}
+
 static int reset_tx_pools(struct ibmvnic_adapter *adapter)
 {
-   struct ibmvnic_tx_pool *tx_pool;
int tx_scrqs;
-   int i, j, rc;
+   int i, rc;
 
tx_scrqs = be32_to_cpu(adapter->login_rsp_buf->num_txsubm_subcrqs);
for (i = 0; i < tx_scrqs; i++) {
-   netdev_dbg(adapter->netdev, "Re-setting tx_pool[%d]\n", i);
-
-   tx_pool = &adapter->tx_pool[i];
-
-   rc = reset_long_term_buff(adapter, &tx_pool->long_term_buff);
+   rc = reset_one_tx_pool(adapter, &adapter->tso_pool[i]);
if (rc)
return rc;
-
-   rc = reset_long_term_buff(adapter, &tx_pool->tso_ltb);
+   rc = reset_one_tx_pool(adapter, &adapter->tx_pool[i]);
if (rc)
return rc;
-
-   memset(tx_pool->tx_buff, 0,
-  adapter->req_tx_entries_per_subcrq *
-  sizeof(struct ibmvnic_tx_buff));
-
-   for (j = 0; j < adapter->req_tx_entries_per_subcrq; j++)
-   tx_pool->free_map[j] = j;
-
-   tx_pool->consumer_index = 0;
-   tx_pool->producer_index = 0;
-   tx_pool->tso_index = 0;
}
 
return 0;
-- 
2.15.0



[PATCH net-next v4 6/7] ibmvnic: Improve TX buffer accounting

2018-03-15 Thread Thomas Falcon
Improve TX pool buffer accounting to prevent the producer
index from overrunning the consumer. First, set the next free
index to an invalid value if it is in use. If the next buffer
to be consumed is in use, drop the packet.

Finally, if the transmit fails for some other reason, roll
back the consumer index and set the free map entry to its original
value. This should also be done if the DMA map fails.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 672e9221d4a5..af6f8193cb67 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1426,6 +1426,16 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
 
index = tx_pool->free_map[tx_pool->consumer_index];
 
+   if (index == IBMVNIC_INVALID_MAP) {
+   dev_kfree_skb_any(skb);
+   tx_send_failed++;
+   tx_dropped++;
+   ret = NETDEV_TX_OK;
+   goto out;
+   }
+
+   tx_pool->free_map[tx_pool->consumer_index] = IBMVNIC_INVALID_MAP;
+
offset = index * tx_pool->buf_size;
dst = tx_pool->long_term_buff.buff + offset;
memset(dst, 0, tx_pool->buf_size);
@@ -1522,7 +1532,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
tx_map_failed++;
tx_dropped++;
ret = NETDEV_TX_OK;
-   goto out;
+   goto tx_err_out;
}
lpar_rc = send_subcrq_indirect(adapter, handle_array[queue_num],
   (u64)tx_buff->indir_dma,
@@ -1534,13 +1544,6 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
}
if (lpar_rc != H_SUCCESS) {
dev_err(dev, "tx failed with code %ld\n", lpar_rc);
-
-   if (tx_pool->consumer_index == 0)
-   tx_pool->consumer_index =
-   tx_pool->num_buffers - 1;
-   else
-   tx_pool->consumer_index--;
-
dev_kfree_skb_any(skb);
tx_buff->skb = NULL;
 
@@ -1556,7 +1559,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
tx_send_failed++;
tx_dropped++;
ret = NETDEV_TX_OK;
-   goto out;
+   goto tx_err_out;
}
 
if (atomic_add_return(num_entries, &tx_scrq->used)
@@ -1569,7 +1572,16 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
tx_bytes += skb->len;
txq->trans_start = jiffies;
ret = NETDEV_TX_OK;
+   goto out;
 
+tx_err_out:
+   /* roll back consumer index and map array*/
+   if (tx_pool->consumer_index == 0)
+   tx_pool->consumer_index =
+   tx_pool->num_buffers - 1;
+   else
+   tx_pool->consumer_index--;
+   tx_pool->free_map[tx_pool->consumer_index] = index;
 out:
netdev->stats.tx_dropped += tx_dropped;
netdev->stats.tx_bytes += tx_bytes;
-- 
2.15.0



[PATCH net-next v4 4/7] ibmvnic: Update TX pool initialization routine

2018-03-15 Thread Thomas Falcon
Introduce a function that initializes one TX pool. Use that to
create each pool entry in both the standard TX pool and TSO
pool arrays.
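
As a rough user-space illustration of the pattern (one helper that sets up a
single pool, called once per queue for each of the two arrays), consider the
sketch below. It is not the driver code: struct pool, init_one_pool(),
release_one_pool() and init_pools() are hypothetical names, calloc() stands in
for the kernel allocators, and a plain byte buffer stands in for the long term
buffer.

```c
#include <assert.h>
#include <stdlib.h>

struct pool {
	int *free_map;
	unsigned char *buff; /* stands in for the long term buffer */
	int num_buffers;
	int buf_size;
};

/* Free everything owned by one pool; safe on a partially built pool. */
static void release_one_pool(struct pool *p)
{
	free(p->free_map);
	free(p->buff);
	p->free_map = NULL;
	p->buff = NULL;
}

/* Set up one pool with num_entries buffers of buf_size bytes each. */
static int init_one_pool(struct pool *p, int num_entries, int buf_size)
{
	assert(num_entries > 0 && buf_size > 0);

	p->free_map = calloc((size_t)num_entries, sizeof(*p->free_map));
	p->buff = calloc((size_t)num_entries, (size_t)buf_size);
	if (!p->free_map || !p->buff) {
		release_one_pool(p);
		return -1;
	}

	for (int i = 0; i < num_entries; i++)
		p->free_map[i] = i;

	p->num_buffers = num_entries;
	p->buf_size = buf_size;
	return 0;
}

/* Build one standard pool and one TSO pool per queue with the same helper.
 * On failure, every pool set up so far is released again.
 */
static int init_pools(struct pool *std, struct pool *tso, int nqueues,
		      int std_entries, int std_size,
		      int tso_entries, int tso_size)
{
	for (int i = 0; i < nqueues; i++) {
		int rc = init_one_pool(&std[i], std_entries, std_size);

		if (rc == 0) {
			rc = init_one_pool(&tso[i], tso_entries, tso_size);
			if (rc != 0)
				release_one_pool(&std[i]);
		}
		if (rc != 0) {
			while (i-- > 0) {
				release_one_pool(&std[i]);
				release_one_pool(&tso[i]);
			}
			return rc;
		}
	}
	return 0;
}
```

Because the per-pool size and count live in the pool itself, the same helper
serves both arrays and only the arguments differ.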

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 90 --
 1 file changed, 48 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 258d54e3a616..2bb5d562dde1 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -635,13 +635,43 @@ static void release_tx_pools(struct ibmvnic_adapter 
*adapter)
adapter->num_active_tx_pools = 0;
 }
 
+static int init_one_tx_pool(struct net_device *netdev,
+   struct ibmvnic_tx_pool *tx_pool,
+   int num_entries, int buf_size)
+{
+   struct ibmvnic_adapter *adapter = netdev_priv(netdev);
+   int i;
+
+   tx_pool->tx_buff = kcalloc(num_entries,
+  sizeof(struct ibmvnic_tx_buff),
+  GFP_KERNEL);
+   if (!tx_pool->tx_buff)
+   return -1;
+
+   if (alloc_long_term_buff(adapter, &tx_pool->long_term_buff,
+num_entries * buf_size))
+   return -1;
+
+   tx_pool->free_map = kcalloc(num_entries, sizeof(int), GFP_KERNEL);
+   if (!tx_pool->free_map)
+   return -1;
+
+   for (i = 0; i < num_entries; i++)
+   tx_pool->free_map[i] = i;
+
+   tx_pool->consumer_index = 0;
+   tx_pool->producer_index = 0;
+   tx_pool->num_buffers = num_entries;
+   tx_pool->buf_size = buf_size;
+
+   return 0;
+}
+
 static int init_tx_pools(struct net_device *netdev)
 {
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
-   struct device *dev = &adapter->vdev->dev;
-   struct ibmvnic_tx_pool *tx_pool;
int tx_subcrqs;
-   int i, j;
+   int i, rc;
 
tx_subcrqs = be32_to_cpu(adapter->login_rsp_buf->num_txsubm_subcrqs);
adapter->tx_pool = kcalloc(tx_subcrqs,
@@ -649,53 +679,29 @@ static int init_tx_pools(struct net_device *netdev)
if (!adapter->tx_pool)
return -1;
 
+   adapter->tso_pool = kcalloc(tx_subcrqs,
+   sizeof(struct ibmvnic_tx_pool), GFP_KERNEL);
+   if (!adapter->tso_pool)
+   return -1;
+
adapter->num_active_tx_pools = tx_subcrqs;
 
for (i = 0; i < tx_subcrqs; i++) {
-   tx_pool = &adapter->tx_pool[i];
-
-   netdev_dbg(adapter->netdev,
-  "Initializing tx_pool[%d], %lld buffs\n",
-  i, adapter->req_tx_entries_per_subcrq);
-
-   tx_pool->tx_buff = kcalloc(adapter->req_tx_entries_per_subcrq,
-  sizeof(struct ibmvnic_tx_buff),
-  GFP_KERNEL);
-   if (!tx_pool->tx_buff) {
-   dev_err(dev, "tx pool buffer allocation failed\n");
-   release_tx_pools(adapter);
-   return -1;
-   }
-
-   if (alloc_long_term_buff(adapter, &tx_pool->long_term_buff,
-adapter->req_tx_entries_per_subcrq *
-(adapter->req_mtu + VLAN_HLEN))) {
-   release_tx_pools(adapter);
-   return -1;
-   }
-
-   /* alloc TSO ltb */
-   if (alloc_long_term_buff(adapter, &tx_pool->tso_ltb,
-IBMVNIC_TSO_BUFS *
-IBMVNIC_TSO_BUF_SZ)) {
+   rc = init_one_tx_pool(netdev, &adapter->tx_pool[i],
+ adapter->req_tx_entries_per_subcrq,
+ adapter->req_mtu + VLAN_HLEN);
+   if (rc) {
release_tx_pools(adapter);
-   return -1;
+   return rc;
}
 
-   tx_pool->tso_index = 0;
-
-   tx_pool->free_map = kcalloc(adapter->req_tx_entries_per_subcrq,
-   sizeof(int), GFP_KERNEL);
-   if (!tx_pool->free_map) {
+   init_one_tx_pool(netdev, &adapter->tso_pool[i],
+IBMVNIC_TSO_BUFS,
+IBMVNIC_TSO_BUF_SZ);
+   if (rc) {
release_tx_pools(adapter);
-   return -1;
+   return rc;
}
-
-   for (j = 0; j < adapter->req_tx_entries_per_subcrq; j++)
-   tx_pool->free_map[j] = j;
-
-   tx_pool->consumer_index = 0;
-   tx_pool->producer_index = 0;
}
 
return 0;
-- 
2.15.0



[PATCH net-next v4 5/7] ibmvnic: Update TX and TX completion routines

2018-03-15 Thread Thomas Falcon
Update the TX and TX completion routines to account for the TX pool
restructuring. The TX routine first chooses the pool depending
on whether a packet is GSO or not, then uses it accordingly.

For the completion routine to know which pool it needs to use,
set the most significant bit of the correlator index to one
if the packet uses the TSO pool. On completion, unset the bit
and use the correlator index to release the buffer pool entry.
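
The tagging scheme can be sketched as below. This is only an illustration
under stated assumptions: TSO_POOL_MASK is assumed to be the top bit of a
32-bit correlator (the actual value of IBMVNIC_TSO_POOL_MASK is defined in
ibmvnic.h), correlator_encode() and correlator_decode() are hypothetical
helpers, and the byte-order conversion done by the driver is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: most significant bit of a 32-bit correlator. */
#define TSO_POOL_MASK 0x80000000u

struct correlator {
	int is_tso;     /* nonzero if the buffer came from the TSO pool */
	uint32_t index; /* buffer index within the selected pool */
};

/* Transmit side: tag the index when the packet used the TSO pool. */
static uint32_t correlator_encode(uint32_t index, int is_tso)
{
	/* The index itself must not already use the tag bit. */
	assert((index & TSO_POOL_MASK) == 0);
	return is_tso ? (index | TSO_POOL_MASK) : index;
}

/* Completion side: pick the pool from the tag bit, then clear it. */
static struct correlator correlator_decode(uint32_t raw)
{
	struct correlator c;

	c.is_tso = (raw & TSO_POOL_MASK) != 0;
	c.index = raw & ~TSO_POOL_MASK;
	return c;
}
```

The scheme relies on no pool ever holding enough buffers for a valid index to
reach the tag bit, which the assertion in the encoder makes explicit.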

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 55 +++---
 drivers/net/ethernet/ibm/ibmvnic.h |  1 +
 2 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 2bb5d562dde1..672e9221d4a5 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1414,8 +1414,11 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
ret = NETDEV_TX_OK;
goto out;
}
+   if (skb_is_gso(skb))
+   tx_pool = &adapter->tso_pool[queue_num];
+   else
+   tx_pool = &adapter->tx_pool[queue_num];
 
-   tx_pool = &adapter->tx_pool[queue_num];
tx_scrq = adapter->tx_scrq[queue_num];
txq = netdev_get_tx_queue(netdev, skb_get_queue_mapping(skb));
handle_array = (u64 *)((u8 *)(adapter->login_rsp_buf) +
@@ -1423,20 +1426,10 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
 
index = tx_pool->free_map[tx_pool->consumer_index];
 
-   if (skb_is_gso(skb)) {
-   offset = tx_pool->tso_index * IBMVNIC_TSO_BUF_SZ;
-   dst = tx_pool->tso_ltb.buff + offset;
-   memset(dst, 0, IBMVNIC_TSO_BUF_SZ);
-   data_dma_addr = tx_pool->tso_ltb.addr + offset;
-   tx_pool->tso_index++;
-   if (tx_pool->tso_index == IBMVNIC_TSO_BUFS)
-   tx_pool->tso_index = 0;
-   } else {
-   offset = index * (adapter->req_mtu + VLAN_HLEN);
-   dst = tx_pool->long_term_buff.buff + offset;
-   memset(dst, 0, adapter->req_mtu + VLAN_HLEN);
-   data_dma_addr = tx_pool->long_term_buff.addr + offset;
-   }
+   offset = index * tx_pool->buf_size;
+   dst = tx_pool->long_term_buff.buff + offset;
+   memset(dst, 0, tx_pool->buf_size);
+   data_dma_addr = tx_pool->long_term_buff.addr + offset;
 
if (skb_shinfo(skb)->nr_frags) {
int cur, i;
@@ -1459,8 +1452,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
}
 
tx_pool->consumer_index =
-   (tx_pool->consumer_index + 1) %
-   adapter->req_tx_entries_per_subcrq;
+   (tx_pool->consumer_index + 1) % tx_pool->num_buffers;
 
tx_buff = &tx_pool->tx_buff[index];
tx_buff->skb = skb;
@@ -1476,11 +1468,13 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
tx_crq.v1.n_crq_elem = 1;
tx_crq.v1.n_sge = 1;
tx_crq.v1.flags1 = IBMVNIC_TX_COMP_NEEDED;
-   tx_crq.v1.correlator = cpu_to_be32(index);
+
if (skb_is_gso(skb))
-   tx_crq.v1.dma_reg = cpu_to_be16(tx_pool->tso_ltb.map_id);
+   tx_crq.v1.correlator =
+   cpu_to_be32(index | IBMVNIC_TSO_POOL_MASK);
else
-   tx_crq.v1.dma_reg = cpu_to_be16(tx_pool->long_term_buff.map_id);
+   tx_crq.v1.correlator = cpu_to_be32(index);
+   tx_crq.v1.dma_reg = cpu_to_be16(tx_pool->long_term_buff.map_id);
tx_crq.v1.sge_len = cpu_to_be32(skb->len);
tx_crq.v1.ioba = cpu_to_be64(data_dma_addr);
 
@@ -1543,7 +1537,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
 
if (tx_pool->consumer_index == 0)
tx_pool->consumer_index =
-   adapter->req_tx_entries_per_subcrq - 1;
+   tx_pool->num_buffers - 1;
else
tx_pool->consumer_index--;
 
@@ -2547,6 +2541,7 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter 
*adapter,
   struct ibmvnic_sub_crq_queue *scrq)
 {
struct device *dev = &adapter->vdev->dev;
+   struct ibmvnic_tx_pool *tx_pool;
struct ibmvnic_tx_buff *txbuff;
union sub_crq *next;
int index;
@@ -2566,7 +2561,14 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter 
*adapter,
continue;
}
index = be32_to_cpu(next->tx_comp.correlators[i]);
-   txbuff = &adapter->tx_pool[pool].tx_buff[index];
+   if (index & IBMVNIC_TSO_POOL_MASK) {
+   tx_pool = &adapter->tso_pool[pool];
+   index &= ~IBMVNIC_TSO_POOL_MASK;
+   } else {
+   

[PATCH net-next v4 7/7] ibmvnic: Update TX pool cleaning routine

2018-03-15 Thread Thomas Falcon
Update the routine that cleans up any outstanding transmits that
have not received completions when the device needs to close.
Introduce a helper function that cleans one TX pool to make the
code more readable.

Signed-off-by: Thomas Falcon 
---
v4: Update to use the number of buffers in the TX pool struct
instead of a fixed value saved in the adapter struct. Earlier
implementation resulted in a crash.
---
 drivers/net/ethernet/ibm/ibmvnic.c | 40 +++---
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index af6f8193cb67..5632c030811b 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1128,34 +1128,42 @@ static void clean_rx_pools(struct ibmvnic_adapter 
*adapter)
}
 }
 
-static void clean_tx_pools(struct ibmvnic_adapter *adapter)
+static void clean_one_tx_pool(struct ibmvnic_adapter *adapter,
+ struct ibmvnic_tx_pool *tx_pool)
 {
-   struct ibmvnic_tx_pool *tx_pool;
struct ibmvnic_tx_buff *tx_buff;
u64 tx_entries;
+   int i;
+
+   if (!tx_pool && !tx_pool->tx_buff)
+   return;
+
+   tx_entries = tx_pool->num_buffers;
+
+   for (i = 0; i < tx_entries; i++) {
+   tx_buff = &tx_pool->tx_buff[i];
+   if (tx_buff && tx_buff->skb) {
+   dev_kfree_skb_any(tx_buff->skb);
+   tx_buff->skb = NULL;
+   }
+   }
+}
+
+static void clean_tx_pools(struct ibmvnic_adapter *adapter)
+{
int tx_scrqs;
-   int i, j;
+   int i;
 
-   if (!adapter->tx_pool)
+   if (!adapter->tx_pool || !adapter->tso_pool)
return;
 
tx_scrqs = be32_to_cpu(adapter->login_rsp_buf->num_txsubm_subcrqs);
-   tx_entries = adapter->req_tx_entries_per_subcrq;
 
/* Free any remaining skbs in the tx buffer pools */
for (i = 0; i < tx_scrqs; i++) {
-   tx_pool = &adapter->tx_pool[i];
-   if (!tx_pool && !tx_pool->tx_buff)
-   continue;
-
netdev_dbg(adapter->netdev, "Cleaning tx_pool[%d]\n", i);
-   for (j = 0; j < tx_entries; j++) {
-   tx_buff = &tx_pool->tx_buff[j];
-   if (tx_buff && tx_buff->skb) {
-   dev_kfree_skb_any(tx_buff->skb);
-   tx_buff->skb = NULL;
-   }
-   }
+   clean_one_tx_pool(adapter, &adapter->tx_pool[i]);
+   clean_one_tx_pool(adapter, &adapter->tso_pool[i]);
}
 }
 
-- 
2.15.0



[PATCH net-next v4 0/7] ibmvnic: Update TX pool and TX routines

2018-03-15 Thread Thomas Falcon
This patch restructures the TX pool data structure and provides a
separate TX pool array for TSO transmissions. This is already used
in some way due to our unique DMA situation, namely that we cannot
use single DMA mappings for packet data. Previously, both buffer
arrays used the same pool entry. This restructuring allows for
some additional cleanup in the driver code, especially in some
places in the device transmit routine.

In addition, it allows us to more easily track the consumer
and producer indexes of a particular pool. This has been
further improved by better tracking of in-use buffers to
prevent possible data corruption in case an invalid buffer
entry is used.

v4: Fix an error in the 7th patch that caused an oops: it used
the older fixed value for the number of buffers instead
of the respective field in the TX pool data structure

v3: Forgot to update TX pool cleaning function to handle new data
structures. Included 7th patch for that.

v2: Fix typo in 3/6 commit subject line

Thomas Falcon (7):
  ibmvnic: Generalize TX pool structure
  ibmvnic: Update and clean up reset TX pool routine
  ibmvnic: Update release TX pool routine
  ibmvnic: Update TX pool initialization routine
  ibmvnic: Update TX and TX completion routines
  ibmvnic: Improve TX buffer accounting
  ibmvnic: Update TX pool cleaning routine

 drivers/net/ethernet/ibm/ibmvnic.c | 275 +
 drivers/net/ethernet/ibm/ibmvnic.h |   8 +-
 2 files changed, 160 insertions(+), 123 deletions(-)

-- 
2.15.0



[PATCH net-next v4 3/7] ibmvnic: Update release TX pool routine

2018-03-15 Thread Thomas Falcon
Introduce a function that frees one TX pool. Use that to release
each pool in both the standard TX pool and TSO pool arrays.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 4dc304422ece..258d54e3a616 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -608,25 +608,30 @@ static void release_vpd_data(struct ibmvnic_adapter 
*adapter)
adapter->vpd = NULL;
 }
 
+static void release_one_tx_pool(struct ibmvnic_adapter *adapter,
+   struct ibmvnic_tx_pool *tx_pool)
+{
+   kfree(tx_pool->tx_buff);
+   kfree(tx_pool->free_map);
+   free_long_term_buff(adapter, &tx_pool->long_term_buff);
+}
+
 static void release_tx_pools(struct ibmvnic_adapter *adapter)
 {
-   struct ibmvnic_tx_pool *tx_pool;
int i;
 
if (!adapter->tx_pool)
return;
 
for (i = 0; i < adapter->num_active_tx_pools; i++) {
-   netdev_dbg(adapter->netdev, "Releasing tx_pool[%d]\n", i);
-   tx_pool = &adapter->tx_pool[i];
-   kfree(tx_pool->tx_buff);
-   free_long_term_buff(adapter, &tx_pool->long_term_buff);
-   free_long_term_buff(adapter, &tx_pool->tso_ltb);
-   kfree(tx_pool->free_map);
+   release_one_tx_pool(adapter, &adapter->tx_pool[i]);
+   release_one_tx_pool(adapter, &adapter->tso_pool[i]);
}
 
kfree(adapter->tx_pool);
adapter->tx_pool = NULL;
+   kfree(adapter->tso_pool);
+   adapter->tso_pool = NULL;
adapter->num_active_tx_pools = 0;
 }
 
-- 
2.15.0



[PATCH net-next v4 1/7] ibmvnic: Generalize TX pool structure

2018-03-15 Thread Thomas Falcon
Remove some unused fields in the structure and include values
describing the individual buffer size and number of buffers in
a TX pool. This allows us to use these fields for TX pool buffer
accounting as opposed to using hard coded values. Finally, split
TSO buffers out and provide an additional TX pool array for TSO.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.h | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.h 
b/drivers/net/ethernet/ibm/ibmvnic.h
index 099c89d49945..a2e21b39074f 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.h
+++ b/drivers/net/ethernet/ibm/ibmvnic.h
@@ -917,11 +917,9 @@ struct ibmvnic_tx_pool {
int *free_map;
int consumer_index;
int producer_index;
-   wait_queue_head_t ibmvnic_tx_comp_q;
-   struct task_struct *work_thread;
struct ibmvnic_long_term_buff long_term_buff;
-   struct ibmvnic_long_term_buff tso_ltb;
-   int tso_index;
+   int num_buffers;
+   int buf_size;
 };
 
 struct ibmvnic_rx_buff {
@@ -1044,6 +1042,7 @@ struct ibmvnic_adapter {
u64 promisc;
 
struct ibmvnic_tx_pool *tx_pool;
+   struct ibmvnic_tx_pool *tso_pool;
struct completion init_done;
int init_done_rc;
 
-- 
2.15.0



Re: linux-next: manual merge of the net-next tree with the rdma-fixes tree

2018-03-15 Thread Doug Ledford
On Fri, 2018-03-16 at 11:56 +1100, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the net-next tree got a conflict in:
> 
>   drivers/infiniband/hw/mlx5/main.c
> 
> between commit:
> 
>   42cea83f9524 ("IB/mlx5: Fix cleanup order on unload")
> 
> from the rdma-fixes tree and commit:
> 
>   b5ca15ad7e61 ("IB/mlx5: Add proper representors support")
> 
> from the net-next tree.

We are aware of the merge conflict.  This is a result of the fact that
code had been submitted to the for-next area (the representors support)
and after that an issue was found by the syzkaller bot that deserved rc
fix status and which conflicted.  The fixup you list below is
insufficient to fix the merge conflict.  The full fixup can be found in
the rdma tree from where I merged the for-rc branch into the for-next
branch and created a complete fixup of the merge conflict.  The problem
is that one patch changes the device init stage flow, while the other
patch duplicates the normal device init stage flow to the representor
device stage flow.  To resolve the conflict, you not only have to resolve the
contextual diffs, but you have to duplicate the changes to the normal
device stage flow into the representor device stage flow.  It is very
far from a trivial merge.  We were planning on talking to Dave about
this issue tomorrow, but you beat us to raising the issue ;-).

Here's the commit (from the rdma git repo) with the proper merge fix
(although it also has other minor merge stuff that needs to be ignored):

2d873449a202 (Merge branch 'k.o/wip/dl-for-rc' into k.o/wip/dl-for-next)

> I fixed it up (see below and the merge fix patch as well) and can
> carry the fix as necessary. This is now fixed as far as linux-next is
> concerned, but any non trivial conflicts should be mentioned to your
> upstream maintainer when your tree is submitted for merging.  You may
> also want to consider cooperating with the maintainer of the conflicting
> tree to minimise any particularly complex conflicts.
> 
> From: Stephen Rothwell 
> Date: Fri, 16 Mar 2018 11:54:01 +1100
> Subject: [PATCH] IB/mlx5: merge fix for "Fix cleanup order on unload"
> 
> Signed-off-by: Stephen Rothwell 
> ---
>  drivers/infiniband/hw/mlx5/ib_rep.c  | 6 +++---
>  drivers/infiniband/hw/mlx5/mlx5_ib.h | 3 +--
>  2 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mlx5/ib_rep.c 
> b/drivers/infiniband/hw/mlx5/ib_rep.c
> index 61cc3d7db257..7fb997dadd80 100644
> --- a/drivers/infiniband/hw/mlx5/ib_rep.c
> +++ b/drivers/infiniband/hw/mlx5/ib_rep.c
> @@ -33,9 +33,9 @@ static const struct mlx5_ib_profile rep_profile = {
>   STAGE_CREATE(MLX5_IB_STAGE_IB_REG,
>mlx5_ib_stage_ib_reg_init,
>mlx5_ib_stage_ib_reg_cleanup),
> - STAGE_CREATE(MLX5_IB_STAGE_UMR_RESOURCES,
> -  mlx5_ib_stage_umr_res_init,
> -  mlx5_ib_stage_umr_res_cleanup),
> + STAGE_CREATE(MLX5_IB_STAGE_POST_IB_REG_UMR,
> +  mlx5_ib_stage_post_ib_reg_umr_init,
> +  NULL),
>   STAGE_CREATE(MLX5_IB_STAGE_CLASS_ATTR,
>mlx5_ib_stage_class_attr_init,
>NULL),
> diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
> b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> index 7ec753ec7962..c45a7abdbe3e 100644
> --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
> +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> @@ -1071,8 +1071,7 @@ int mlx5_ib_stage_bfrag_init(struct mlx5_ib_dev *dev);
>  void mlx5_ib_stage_bfrag_cleanup(struct mlx5_ib_dev *dev);
>  int mlx5_ib_stage_ib_reg_init(struct mlx5_ib_dev *dev);
>  void mlx5_ib_stage_ib_reg_cleanup(struct mlx5_ib_dev *dev);
> -int mlx5_ib_stage_umr_res_init(struct mlx5_ib_dev *dev);
> -void mlx5_ib_stage_umr_res_cleanup(struct mlx5_ib_dev *dev);
> +int mlx5_ib_stage_post_ib_reg_umr_init(struct mlx5_ib_dev *dev);
>  int mlx5_ib_stage_class_attr_init(struct mlx5_ib_dev *dev);
>  void __mlx5_ib_remove(struct mlx5_ib_dev *dev,
> const struct mlx5_ib_profile *profile,
> -- 
> 2.16.1
> 

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD



Re: [PATCH v2 5/6] ixgbevf: keep writel() closer to wmb()

2018-03-15 Thread Sinan Kaya
On 3/15/2018 9:04 PM, Sinan Kaya wrote:
>   /* notify HW of packet */
> - ixgbevf_write_tail(tx_ring, i);
> + writel(value, tx_ring->tail);
>  

Oops, copy-paste mistake.

I'll hold off on posting v3 until I hear more feedback.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH v2 1/6] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs

2018-03-15 Thread Sinan Kaya
The code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

As a result, the CPU observes two barriers back to back before executing the
register write.

Since the code already has an explicit barrier call, change writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 8 
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e554aa6cf..9455869 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -185,7 +185,7 @@ static int i40e_program_fdir_filter(struct i40e_fdir_filter 
*fdir_data,
/* Mark the data descriptor to be watched */
first->next_to_watch = tx_desc;
 
-   writel(tx_ring->next_to_use, tx_ring->tail);
+   writel_relaxed(tx_ring->next_to_use, tx_ring->tail);
return 0;
 
 dma_fail:
@@ -1375,7 +1375,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring 
*rx_ring, u32 val)
 * such as IA-64).
 */
wmb();
-   writel(val, rx_ring->tail);
+   writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2258,7 +2258,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, 
int budget)
 */
wmb();
 
-   writel(xdp_ring->next_to_use, xdp_ring->tail);
+   writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
}
 
rx_ring->skb = skb;
@@ -3286,7 +3286,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, 
struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 357d605..56eea20 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -667,7 +667,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring 
*rx_ring, u32 val)
 * such as IA-64).
 */
wmb();
-   writel(val, rx_ring->tail);
+   writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2243,7 +2243,7 @@ static inline void i40evf_tx_map(struct i40e_ring 
*tx_ring, struct sk_buff *skb,
 
/* notify HW of packet */
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4



[PATCH v2 3/6] igbvf: eliminate duplicate barriers on weakly-ordered archs

2018-03-15 Thread Sinan Kaya
The code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

As a result, the CPU observes two barriers back to back before executing the
register write.

Since the code already has an explicit barrier call, change writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/intel/igbvf/netdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c 
b/drivers/net/ethernet/intel/igbvf/netdev.c
index 4214c15..edb1c34 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -251,7 +251,7 @@ static void igbvf_alloc_rx_buffers(struct igbvf_ring 
*rx_ring,
 * such as IA-64).
*/
wmb();
-   writel(i, adapter->hw.hw_addr + rx_ring->tail);
+   writel_relaxed(i, adapter->hw.hw_addr + rx_ring->tail);
}
 }
 
@@ -2297,7 +2297,7 @@ static inline void igbvf_tx_queue_adv(struct 
igbvf_adapter *adapter,
 
tx_ring->buffer_info[first].next_to_watch = tx_desc;
tx_ring->next_to_use = i;
-   writel(i, adapter->hw.hw_addr + tx_ring->tail);
+   writel_relaxed(i, adapter->hw.hw_addr + tx_ring->tail);
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
 */
-- 
2.7.4



[PATCH v2 2/6] ixgbe: eliminate duplicate barriers on weakly-ordered archs

2018-03-15 Thread Sinan Kaya
The code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

As a result, the CPU observes two barriers back to back before executing the
register write.

Since the code already has an explicit barrier call, change writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0da5aa2..58ed70f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1692,7 +1692,7 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, 
u16 cleaned_count)
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
@@ -2453,7 +2453,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
 * know there are new descriptors to fetch.
 */
wmb();
-   writel(ring->next_to_use, ring->tail);
+   writel_relaxed(ring->next_to_use, ring->tail);
 
xdp_do_flush_map();
}
@@ -8078,7 +8078,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
ixgbe_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -10014,7 +10014,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 * are new descriptors to fetch.
 */
wmb();
-   writel(ring->next_to_use, ring->tail);
+   writel_relaxed(ring->next_to_use, ring->tail);
 
return;
 }
-- 
2.7.4



[PATCH v2 4/6] igb: eliminate duplicate barriers on weakly-ordered archs

2018-03-15 Thread Sinan Kaya
The code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

As a result, the CPU observes two barriers back to back before executing the
register write.

Since the code already has an explicit barrier call, change writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae7..82aea92 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5671,7 +5671,7 @@ static int igb_tx_map(struct igb_ring *tx_ring,
igb_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-   writel(i, tx_ring->tail);
+   writel_relaxed(i, tx_ring->tail);
 
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
@@ -8072,7 +8072,7 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 
cleaned_count)
 * such as IA-64).
 */
wmb();
-   writel(i, rx_ring->tail);
+   writel_relaxed(i, rx_ring->tail);
}
 }
 
-- 
2.7.4



[PATCH v2 5/6] ixgbevf: keep writel() closer to wmb()

2018-03-15 Thread Sinan Kaya
Remove ixgbevf_write_tail() in favor of moving writel() close to
wmb().

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index f695242..11e893e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -244,11 +244,6 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring 
*ring)
return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
 }
 
-static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value)
-{
-   writel(value, ring->tail);
-}
-
 #define IXGBEVF_RX_DESC(R, i)  \
(&(((union ixgbe_adv_rx_desc *)((R)->desc))[i]))
 #define IXGBEVF_TX_DESC(R, i)  \
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 9b3d43d..b65f691 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -659,7 +659,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring 
*rx_ring,
 * such as IA-64).
 */
wmb();
-   ixgbevf_write_tail(rx_ring, i);
+   writel(i, rx_ring->tail);
}
 }
 
@@ -3644,7 +3644,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   ixgbevf_write_tail(tx_ring, i);
+   writel(value, tx_ring->tail);
 
return;
 dma_error:
-- 
2.7.4



[PATCH v2 6/6] ixgbevf: eliminate duplicate barriers on weakly-ordered archs

2018-03-15 Thread Sinan Kaya
The code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

As a result, the CPU observes two barriers back to back before executing the
register write.

Since the code already has an explicit barrier call, change writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index b65f691..9e2e0fd 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3644,7 +3644,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
tx_ring->next_to_use = i;
 
/* notify HW of packet */
-   writel(value, tx_ring->tail);
+   writel_relaxed(value, tx_ring->tail);
 
return;
 dma_error:
-- 
2.7.4



[PATCH v2 0/6] Eliminate duplicate barriers on weakly-ordered archs

2018-03-15 Thread Sinan Kaya
The code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

As a result, the CPU observes two barriers back to back before executing the
register write.

Since the code already has an explicit barrier call, change writel() to
writel_relaxed().
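
For readers less familiar with the pattern, here is a loose user-space
analogy using C11 atomics. It is only an analogy: MMIO ordering in the kernel
follows different rules from the C11 memory model, struct ring and
ring_publish() are hypothetical, the release fence merely plays the role of
wmb(), and the relaxed store plays the role of writel_relaxed().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8

struct ring {
	uint32_t desc[RING_SIZE]; /* "descriptors", written with plain stores */
	_Atomic uint32_t tail;    /* stands in for the device tail register */
};

/* Publish one descriptor: a single explicit fence orders the descriptor
 * write before the tail update, so the tail store itself can be relaxed.
 * Using a release store here as well would order the same accesses twice.
 */
static void ring_publish(struct ring *r, uint32_t slot, uint32_t value)
{
	assert(slot < RING_SIZE);

	r->desc[slot] = value;

	/* Plays the role of wmb(). */
	atomic_thread_fence(memory_order_release);

	/* Plays the role of writel_relaxed(). */
	atomic_store_explicit(&r->tail, slot + 1, memory_order_relaxed);
}
```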

I did a regex search for wmb() followed by writel() in each driver
directory.
I scrubbed the ones I care about and posted this series. Note also that
I have one Infiniband patch in the series.
I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.

The relaxed API has been used heavily on ARM for a long time, but
it did not exist on other architectures. For this reason, weakly-ordered
architectures have been paying a double penalty in order to use the common
drivers.

Now that the relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.

We start with the most commonly used ones and hope to increase the coverage
over time. It will take a while to cover all drivers.

Changes since v1:

i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
missed writel calls in:
i40e:
  i40e_program_fdir_filter
  i40e_clean_rx_irq
  i40e_tx_map
i40evf:
  i40e_clean_rx_irq
  i40e_tx_map

ixgbe: eliminate duplicate barriers on weakly-ordered archs
missed the writel at the end of ixgbe_tx_map

RDMA/qedr: eliminate duplicate barriers on weakly-ordered archs
dropped since applied

igbvf: eliminate duplicate barriers on weakly-ordered archs
missed the writel at the end of igbvf_tx_queue_adv()

igb: eliminate duplicate barriers on weakly-ordered archs
missed the writel at the end of igb_tx_map()

e1000: eliminate duplicate barriers on weakly-ordered archs
dropped

ixgbevf: eliminate duplicate barriers on weakly-ordered archs
split into two and remove extra barrier.

Sinan Kaya (6):
  i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
  ixgbe: eliminate duplicate barriers on weakly-ordered archs
  igbvf: eliminate duplicate barriers on weakly-ordered archs
  igb: eliminate duplicate barriers on weakly-ordered archs
  ixgbevf: keep writel() closer to wmb()
  ixgbevf: eliminate duplicate barriers on weakly-ordered archs

 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 8 
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 4 ++--
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 drivers/net/ethernet/intel/igbvf/netdev.c | 4 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  | 5 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 7 files changed, 16 insertions(+), 21 deletions(-)

-- 
2.7.4



linux-next: manual merge of the net-next tree with the rdma-fixes tree

2018-03-15 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/infiniband/hw/mlx5/main.c

between commit:

  42cea83f9524 ("IB/mlx5: Fix cleanup order on unload")

from the rdma-fixes tree and commit:

  b5ca15ad7e61 ("IB/mlx5: Add proper representors support")

from the net-next tree.

I fixed it up (see below and the merge fix patch as well) and can
carry the fix as necessary. This is now fixed as far as linux-next is
concerned, but any non trivial conflicts should be mentioned to your
upstream maintainer when your tree is submitted for merging.  You may
also want to consider cooperating with the maintainer of the conflicting
tree to minimise any particularly complex conflicts.

From: Stephen Rothwell 
Date: Fri, 16 Mar 2018 11:54:01 +1100
Subject: [PATCH] IB/mlx5: merge fix for "Fix cleanup order on unload"

Signed-off-by: Stephen Rothwell 
---
 drivers/infiniband/hw/mlx5/ib_rep.c  | 6 +++---
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 3 +--
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/ib_rep.c b/drivers/infiniband/hw/mlx5/ib_rep.c
index 61cc3d7db257..7fb997dadd80 100644
--- a/drivers/infiniband/hw/mlx5/ib_rep.c
+++ b/drivers/infiniband/hw/mlx5/ib_rep.c
@@ -33,9 +33,9 @@ static const struct mlx5_ib_profile rep_profile = {
STAGE_CREATE(MLX5_IB_STAGE_IB_REG,
 mlx5_ib_stage_ib_reg_init,
 mlx5_ib_stage_ib_reg_cleanup),
-   STAGE_CREATE(MLX5_IB_STAGE_UMR_RESOURCES,
-mlx5_ib_stage_umr_res_init,
-mlx5_ib_stage_umr_res_cleanup),
+   STAGE_CREATE(MLX5_IB_STAGE_POST_IB_REG_UMR,
+mlx5_ib_stage_post_ib_reg_umr_init,
+NULL),
STAGE_CREATE(MLX5_IB_STAGE_CLASS_ATTR,
 mlx5_ib_stage_class_attr_init,
 NULL),
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 7ec753ec7962..c45a7abdbe3e 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1071,8 +1071,7 @@ int mlx5_ib_stage_bfrag_init(struct mlx5_ib_dev *dev);
 void mlx5_ib_stage_bfrag_cleanup(struct mlx5_ib_dev *dev);
 int mlx5_ib_stage_ib_reg_init(struct mlx5_ib_dev *dev);
 void mlx5_ib_stage_ib_reg_cleanup(struct mlx5_ib_dev *dev);
-int mlx5_ib_stage_umr_res_init(struct mlx5_ib_dev *dev);
-void mlx5_ib_stage_umr_res_cleanup(struct mlx5_ib_dev *dev);
+int mlx5_ib_stage_post_ib_reg_umr_init(struct mlx5_ib_dev *dev);
 int mlx5_ib_stage_class_attr_init(struct mlx5_ib_dev *dev);
 void __mlx5_ib_remove(struct mlx5_ib_dev *dev,
  const struct mlx5_ib_profile *profile,
-- 
2.16.1

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/infiniband/hw/mlx5/main.c
index da091de4e69d,d9474b95d8e5..
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@@ -4860,19 -4999,19 +4996,19 @@@ int mlx5_ib_stage_ib_reg_init(struct ml
return ib_register_device(&dev->ib_dev, NULL);
  }
  
 -void mlx5_ib_stage_ib_reg_cleanup(struct mlx5_ib_dev *dev)
 +static void mlx5_ib_stage_pre_ib_reg_umr_cleanup(struct mlx5_ib_dev *dev)
  {
 -  ib_unregister_device(&dev->ib_dev);
 +  destroy_umrc_res(dev);
  }
  
- static void mlx5_ib_stage_ib_reg_cleanup(struct mlx5_ib_dev *dev)
 -int mlx5_ib_stage_umr_res_init(struct mlx5_ib_dev *dev)
++void mlx5_ib_stage_ib_reg_cleanup(struct mlx5_ib_dev *dev)
  {
 -  return create_umr_res(dev);
 +  ib_unregister_device(&dev->ib_dev);
  }
  
- static int mlx5_ib_stage_post_ib_reg_umr_init(struct mlx5_ib_dev *dev)
 -void mlx5_ib_stage_umr_res_cleanup(struct mlx5_ib_dev *dev)
++int mlx5_ib_stage_post_ib_reg_umr_init(struct mlx5_ib_dev *dev)
  {
 -  destroy_umrc_res(dev);
 +  return create_umr_res(dev);
  }
  
  static int mlx5_ib_stage_delay_drop_init(struct mlx5_ib_dev *dev)
@@@ -4999,6 -5144,48 +5144,48 @@@ static const struct mlx5_ib_profile pf_
 NULL),
  };
  
+ static const struct mlx5_ib_profile nic_rep_profile = {
+   STAGE_CREATE(MLX5_IB_STAGE_INIT,
+mlx5_ib_stage_init_init,
+mlx5_ib_stage_init_cleanup),
+   STAGE_CREATE(MLX5_IB_STAGE_FLOW_DB,
+mlx5_ib_stage_flow_db_init,
+mlx5_ib_stage_flow_db_cleanup),
+   STAGE_CREATE(MLX5_IB_STAGE_CAPS,
+mlx5_ib_stage_caps_init,
+NULL),
+   STAGE_CREATE(MLX5_IB_STAGE_NON_DEFAULT_CB,
+mlx5_ib_stage_rep_non_default_cb,
+NULL),
+   STAGE_CREATE(MLX5_IB_STAGE_ROCE,
+mlx5_ib_stage_rep_roce_init,
+mlx5_ib_stage_rep_roce_cleanup),
+   STAGE_CREATE(MLX5_IB_STAGE_DEVICE_RESOURCES,
+mlx5_ib_stage_dev_res_init,
+mlx5_ib_stage_dev_res_cleanup),
+   STAGE_CREATE(MLX5_IB_STAGE_COUNTERS,
+

Re: [PATCH 6/7] e1000: eliminate duplicate barriers on weakly-ordered archs

2018-03-15 Thread Sinan Kaya
On 3/15/2018 8:25 PM, Alexander Duyck wrote:
> On Thu, Mar 15, 2018 at 4:30 PM, Sinan Kaya  wrote:
>> On 3/14/2018 9:41 PM, Alexander Duyck wrote:
>>>>  }
>>>>
>>> So you missed the writel in e1000_xmit_frame. You should probably get
>>> that one too while you are doing these updates. The wmb() is in
>>> e1000_tx_queue().
>>>
>>
>> I brought wmb() outside along with the next descriptor assignment to be
>> similar to the rest of the other code.
>>
>> if wmb() and writel() are not visible in the same function, let's not touch
>> the code.
> 
> Maybe for e1000 we should just skip the driver entirely. Odds are you
> aren't going to have any e1000 parts running on ARM anyway since most
> of them are legacy PCI or PCI-X parts that were made over 10 years
> ago. Most of your efforts would probably be best spent on igb, igbvf,
> ixgbe, ixgbevf, i40e, i40evf, and fm10k.
> 

Sure. I'll drop it.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [Intel-wired-lan] [PATCH v2 12/15] ice: Add stats and ethtool support

2018-03-15 Thread Alexander Duyck
On Thu, Mar 15, 2018 at 4:52 PM, Stephen Hemminger
 wrote:
> On Thu, 15 Mar 2018 16:47:59 -0700
> Anirudh Venkataramanan  wrote:
>
>> +
>> +static const struct ice_stats ice_gstrings_vsi_stats[] = {
>> + ICE_VSI_STAT("tx_unicast", eth_stats.tx_unicast),
>> + ICE_VSI_STAT("rx_unicast", eth_stats.rx_unicast),
>> + ICE_VSI_STAT("tx_multicast", eth_stats.tx_multicast),
>> + ICE_VSI_STAT("rx_multicast", eth_stats.rx_multicast),
>> + ICE_VSI_STAT("tx_broadcast", eth_stats.tx_broadcast),
>> + ICE_VSI_STAT("rx_broadcast", eth_stats.rx_broadcast),
>> + ICE_VSI_STAT("tx_bytes", eth_stats.tx_bytes),
>> + ICE_VSI_STAT("rx_bytes", eth_stats.rx_bytes),
>> + ICE_VSI_STAT("rx_discards", eth_stats.rx_discards),
>> + ICE_VSI_STAT("tx_errors", eth_stats.tx_errors),
>> + ICE_VSI_STAT("tx_linearize", tx_linearize),
>> + ICE_VSI_STAT("rx_unknown_protocol", eth_stats.rx_unknown_protocol),
>> + ICE_VSI_STAT("rx_alloc_fail", rx_buf_failed),
>> + ICE_VSI_STAT("rx_pg_alloc_fail", rx_page_failed),
>> +};
>> +
>
> Ignoring feedback from maintainers is unlikely to help get your driver 
> adopted.

Your feedback wasn't ignored, the netdev stats are gone. I double
checked and there was this in addition to the netdev stats before so I
think the suggestion to remove the netdev stats was just taken
literally.

The VSI is a slightly different entity from the netdev itself. A
netdev can be backed by a VSI in the case of the PF, but the VSI can
be used in other ways such as what we did in i40e where we were using
it to spawn queue groups to work with mqprio as a filter target and in
that case the queue groups wouldn't have a netdev directly associated
with them so in that case it might make sense to leave these as
separate stats.

- Alex


Re: [PATCH net-next v3 0/7] ibmvnic: Update TX pool and TX routines

2018-03-15 Thread Thomas Falcon
On 03/15/2018 11:02 AM, Thomas Falcon wrote:
> This patch restructures the TX pool data structure and provides a
> separate TX pool array for TSO transmissions. This is already used
> in some way due to our unique DMA situation, namely that we cannot
> use single DMA mappings for packet data. Previously, both buffer
> arrays used the same pool entry. This restructuring allows for
> some additional cleanup in the driver code, especially in some
> places in the device transmit routine.
>
> In addition, it allows us to more easily track the consumer
> and producer indexes of a particular pool. This has been
> further improved by better tracking of in-use buffers to
> prevent possible data corruption in case an invalid buffer
> entry is used.
>
> v3: Forgot to update TX pool cleaning function to handle new data
> structures. Included 7th patch for that.
>
> v2: Fix typo in 3/6 commit subject line
>
> Thomas Falcon (7):
>   ibmvnic: Generalize TX pool structure
>   ibmvnic: Update and clean up reset TX pool routine
>   ibmvnic: Update release TX pool routine
>   ibmvnic: Update TX pool initialization routine
>   ibmvnic: Update TX and TX completion routines
>   ibmvnic: Improve TX buffer accounting
>   ibmvnic: Update TX pool cleaning routine
>
>  drivers/net/ethernet/ibm/ibmvnic.c | 275 
> +
>  drivers/net/ethernet/ibm/ibmvnic.h |   8 +-
>  2 files changed, 160 insertions(+), 123 deletions(-)
>
Sorry again, I need to send another version because of a bug in the 7th patch.



Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]

2018-03-15 Thread Cong Wang
On Wed, Mar 14, 2018 at 1:10 AM, Marco Berizzi  wrote:
>> Il 9 marzo 2018 alle 0.14 Cong Wang  ha scritto:
>>
>>
>> On Thu, Mar 8, 2018 at 8:02 AM, Marco Berizzi  wrote:
>> >> Marco Berizzi wrote:
>> >>
>> >>
>> >> Hello everyone,
>> >>
>> >> Yesterday I got this error on a slackware linux 4.16-rc4 system
>> >> running as a traffic shaping gateway and netfilter nat.
>> >> The error arose after a partial ISP network outage,
>> >> so unfortunately it will not be trivial for me to reproduce it again.
>> >
>> > Hello everyone,
>> >
>> > I'm getting this error twice/day, so fortunately I'm able to
>> > reproduce it.
>>
>> IIRC, there was a patch for this, but it got lost...
>>
>> I will take a look anyway.
>
> ok, thanks for the response. Let me know when there will be a patch
> available to test.

It has been reported here:
https://bugzilla.kernel.org/show_bug.cgi?id=109581

And there is a workaround from Konstantin:
https://patchwork.ozlabs.org/patch/803885/

Unfortunately I don't think that is a real fix; we probably need to
fix HFSC itself rather than just work around the qlen==0. It is not
trivial since the HFSC implementation is not easy to understand.
Maybe Jamal knows better than me.


Thanks


Re: [PATCH net] net/sched: act_simple: don't leak 'index' in the error path

2018-03-15 Thread Cong Wang
On Wed, Mar 14, 2018 at 3:43 PM, Davide Caratti  wrote:
> hello Cong, thank you for reviewing this.
>
> On Wed, 2018-03-14 at 11:41 -0700, Cong Wang wrote:
>> On Tue, Mar 13, 2018 at 7:13 PM, Davide Caratti  wrote:
>>
>> Looks like we just need to replace the tcf_idr_cleanup() with
>> tcf_idr_release()? Which is also simpler.
>
> I just tried it on act_simple, and I can confirm: 'index' does not leak
> anymore if alloc_defdata() fails to kzalloc(), and then tcf_idr_release()
> is called in place of tcf_idr_cleanup().

Good.

>
>> Looks like all other callers of tcf_idr_cleanup() need to be replaced too,
>> but I don't audit all of them...
>
> no problem, I can try to do that, it's not going to be a big series
> anyway.


Please audit all of them.


>
> while at it, I will also fix other spots where the same bug can be
> reproduced, even if tcf_idr_cleanup() is not there: for example, when
> tcf_vlan_init() fails allocating struct tcf_vlan_params *p,
>
> ASSERT_RTNL();
> p = kzalloc(sizeof(*p), GFP_KERNEL);
> if (!p) {
> if (ovr)
> tcf_idr_release(*a, bind);
> return -ENOMEM;
> }
>
> the following behavior can be observed:
>
> # tc actions flush action vlan
> # tc actions add action vlan pop index 5
> RTNETLINK answers: Cannot allocate memory
> We have an error talking to the kernel
> # tc actions add action vlan pop index 5
> RTNETLINK answers: No space left on device
> We have an error talking to the kernel
> # tc actions add action vlan pop index 5
> RTNETLINK answers: No space left on device
> We have an error talking to the kernel
>
> Probably testing the value of 'ovr' here is wrong, or maybe it's
> not enough: I will also verify what happens using 'replace'
> keyword instead of 'add'.

Please fix it separately if really needed, and it would be nicer
if you can add your test cases to tools/testing/selftests/tc-testing/.

Thanks!
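
A minimal userspace sketch of the leak discussed in this thread, for
illustration only. The index table and action_init() below are hypothetical
and are not the tcf_idr API; they only model "reserve an index, then fail the
allocation". Without a release on the error path, the second attempt at the
same index fails with -ENOSPC, matching the tc output quoted above.

```c
#include <errno.h>
#include <stdbool.h>

#define MAX_IDX 8			/* hypothetical table size */

static bool idx_reserved[MAX_IDX];	/* true when the index is taken */

/* Reserve a specific index; fails if out of range or already taken. */
static int idx_reserve(int idx)
{
	if (idx < 0 || idx >= MAX_IDX || idx_reserved[idx])
		return -ENOSPC;
	idx_reserved[idx] = true;
	return 0;
}

/* Give the index back so that it can be reused. */
static void idx_release(int idx)
{
	idx_reserved[idx] = false;
}

/*
 * Model of an action init routine: the index is reserved first, then a
 * parameter allocation is attempted. fail_alloc simulates kzalloc()
 * returning NULL. release_on_error selects the fixed behavior.
 */
static int action_init(int idx, bool fail_alloc, bool release_on_error)
{
	int err = idx_reserve(idx);

	if (err)
		return err;

	if (fail_alloc) {
		if (release_on_error)
			idx_release(idx);	/* fixed: index is not leaked */
		return -ENOMEM;
	}

	return 0;	/* success: the index stays owned by the action */
}
```

With release_on_error set to false, a failed init followed by a retry on the
same index returns -ENOMEM and then -ENOSPC. With it set to true, the retry
succeeds.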


Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread Daniel Borkmann
On 03/16/2018 12:06 AM, Alexei Starovoitov wrote:
> On Thu, Mar 15, 2018 at 11:55:39PM +0100, Daniel Borkmann wrote:
>> On 03/15/2018 11:20 PM, Alexei Starovoitov wrote:
>>> On Thu, Mar 15, 2018 at 11:17:12PM +0100, Daniel Borkmann wrote:
>>>> On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
>>>>> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
>>>>>>  
>>>>>> +/* User return codes for SK_MSG prog type. */
>>>>>> +enum sk_msg_action {
>>>>>> +SK_MSG_DROP = 0,
>>>>>> +SK_MSG_PASS,
>>>>>> +};
>>>>>
>>>>> do we really need new enum here?
>>>>> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
>>>>> and there will be only drop/pass in both enums.
>>>>> Also I don't see where these two new SK_MSG_* are used...
>>>>>
>>>>>> +
>>>>>> +/* user accessible metadata for SK_MSG packet hook, new fields must
>>>>>> + * be added to the end of this structure
>>>>>> + */
>>>>>> +struct sk_msg_md {
>>>>>> +__u32 data;
>>>>>> +__u32 data_end;
>>>>>> +};
>>>>>
>>>>> I think it's time for me to ask for forgiveness :)
>>>>
>>>> :-)
>>>>
>>>>> I used __u32 for data and data_end only because all other fields
>>>>> in __sk_buff were __u32 at the time and I couldn't easily figure out
>>>>> how to teach verifier to recognize 8-byte rewrites.
>>>>> Unfortunately my mistake stuck and was copied over into xdp.
>>>>> Since this is new struct let's do it right and add
>>>>> 'void *data, *data_end' here,
>>>>> since bpf prog will use them as 'void *' pointers.
>>>>> There are no compat issues here, since bpf is always 64-bit.
>>>>
>>>> But at least offset-wise when you do the ctx rewrite this would then
>>>> be a bit more tricky when you have 64 bit kernel with 32 bit user
>>>> space since void * members are in each cases at different offset. So
>>>> unless I'm missing something, this still should either be __u32 or
>>>> __u64 instead of void *, no?
>>>
>>> there is no 32-bit user space. these structs are seen by bpf progs only
>>> and bpf is 64-bit only too.
>>> unless I'm missing your point.
>>
>> Ok, so lets say you have 32 bit LLVM binary and compile the prog where
>> you access md->data_end. Given the void * in the struct will that access
>> end up being BPF_W at ctx offset 4 or BPF_DW at ctx offset 8 from clang
>> perspective (iow, is the back end treating this special and always use
>> fixed BPF_DW in such case)? If not and it would be the first case with
>> offset 4, then we could have the case that underlying 64 bit kernel is
>> expecting ctx offset 8 for doing the md ctx conversion.
> 
> i'm still not quite following.
> Whether llvm itself is 32-bit binary or it's arm32 or sparc32 binary
> doesn't matter. It will produce the same 64-bit bpf code.
> It will see 'void *' deref from this struct and will emit DW.
> Maybe the confusion is from the newly added -mattr=+alu32 flag?
> That option doesn't change that sizeof(void*)==8.
> It only allows backend to emit 32-bit alu insns.

Ok, so the conclusion we reached is that while the BPF target is
unconditionally 64 bit, struct layout depends on which clang front end
you use for compilation. E.g. a 32 bit native (e.g. arm) clang front end
would compile the ctx void * pointers as 4 bytes, while clang -target bpf
would compile them as 8 bytes. The native clang front end is needed for
tracing, when accessing pt_regs for walking data structures, but not for
the networking use case, so always using -target bpf there is the proper
way. That means there would be no confusion about the void * since its
size will always be 8, regardless of the underlying arch being 32 or 64
bit, or the clang/llvm binary being 32 bit on a 64 bit kernel. Thus,
sticking to void * would be fine, but samples/sockmap/Makefile definitely
must be fixed as well, such that people don't copy it wrongly.

Cheers,
Daniel
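
A rough illustration of the layout difference discussed above, assuming
fixed-width integers as stand-ins for 4-byte and 8-byte pointers. The struct
names are hypothetical; the point is only that the offset of data_end moves
from 4 to 8 depending on the pointer size the front end assumes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Layout of 'void *data, *data_end' as a front end with 4-byte pointers
 * would see it (uint32_t stands in for a 4-byte pointer).
 */
struct sk_msg_md_ptr32 {
	uint32_t data;
	uint32_t data_end;
};

/*
 * Layout of the same members with 8-byte pointers, as with -target bpf
 * (uint64_t stands in for an 8-byte pointer).
 */
struct sk_msg_md_ptr64 {
	uint64_t data;
	uint64_t data_end;
};

/* Offset at which a program built with 4-byte pointers would read data_end. */
static size_t data_end_offset_ptr32(void)
{
	return offsetof(struct sk_msg_md_ptr32, data_end);
}

/* Offset that a 64-bit ctx conversion would expect for data_end. */
static size_t data_end_offset_ptr64(void)
{
	return offsetof(struct sk_msg_md_ptr64, data_end);
}
```

A program built against the first layout would access data_end at offset 4,
while the ctx rewrite described in the thread would expect offset 8, which is
the mismatch that always compiling with -target bpf avoids.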


Re: [PATCH net-next] net: ethernet: ti: cpsw: enable vlan rx vlan offload

2018-03-15 Thread Andrew Lunn
On Thu, Mar 15, 2018 at 03:15:50PM -0500, Grygorii Strashko wrote:
> In VLAN_AWARE mode CPSW can insert VLAN header encapsulation word on Host
> port 0 egress (RX) before the packet data if RX_VLAN_ENCAP bit is set in
> CPSW_CONTROL register. VLAN header encapsulation word has following format:
> 
>  HDR_PKT_Priority  bits 29-31 - Header Packet VLAN prio (Highest prio: 7)
>  HDR_PKT_CFI       bit  28    - Header Packet VLAN CFI bit.
>  HDR_PKT_Vid       bits 27-16 - Header Packet VLAN ID
>  PKT_Type          bits 8-9   - Packet Type. Indicates whether the packet is
>                                 VLAN-tagged, priority-tagged, or non-tagged.
>                                 00: VLAN-tagged packet
>                                 01: Reserved
>                                 10: Priority-tagged packet
>                                 11: Non-tagged packet
> 
> This feature can be used to implement TX VLAN offload in case of
> VLAN-tagged packets and to insert VLAN tag in case Non-tagged packet was
> received on port with PVID set. As per documentation, CPSW never modifies
> packet data on Host egress (RX) and, as a result, without this feature
> enabled, the Host port will not be able to properly receive packets which
> entered the switch non-tagged through an external Port with PVID set (when
> a non-tagged packet is forwarded from an external Port with PVID set to
> another external Port, the packet will be VLAN tagged properly).
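
The encapsulation word layout quoted above can be decoded with a few shifts
and masks. This is a minimal sketch based only on the bit positions given in
the commit message; the struct, enum and function names are hypothetical and
are not taken from the cpsw driver.

```c
#include <stdint.h>

/* Packet type values as listed in the commit message (bits 8-9). */
enum encap_pkt_type {
	ENCAP_PKT_VLAN_TAG = 0,	/* 00: VLAN-tagged packet */
	ENCAP_PKT_RESERVED = 1,	/* 01: Reserved */
	ENCAP_PKT_PRIO_TAG = 2,	/* 10: Priority-tagged packet */
	ENCAP_PKT_UNTAG    = 3,	/* 11: Non-tagged packet */
};

/* Decoded view of the VLAN header encapsulation word. */
struct vlan_encap {
	uint8_t  prio;		/* bits 29-31 */
	uint8_t  cfi;		/* bit 28 */
	uint16_t vid;		/* bits 16-27 */
	uint8_t  pkt_type;	/* bits 8-9 */
};

/* Split a 32-bit encapsulation word (in CPU byte order) into its fields. */
static struct vlan_encap decode_vlan_encap(uint32_t word)
{
	struct vlan_encap e;

	e.prio     = (word >> 29) & 0x7;
	e.cfi      = (word >> 28) & 0x1;
	e.vid      = (word >> 16) & 0xfff;
	e.pkt_type = (word >> 8) & 0x3;
	return e;
}
```

A receive path built on this would presumably skip reinserting a tag for
ENCAP_PKT_UNTAG only when no PVID applies; that policy is an assumption and
is not stated in the message.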

So, i think it is time to discuss the future of this driver. It should
really be replaced by a switchdev/DSA driver. There are plenty of
carrots for a new driver: Better statistics, working ethtool support
for all the PHYs, better user experience, etc. But maybe now it is
time for the stick. Should we Maintainers decide that no new features
should be added to the existing drivers, just bug fixes?

   Andrew


Re: [PATCH 6/7] e1000: eliminate duplicate barriers on weakly-ordered archs

2018-03-15 Thread Alexander Duyck
On Thu, Mar 15, 2018 at 4:30 PM, Sinan Kaya  wrote:
> On 3/14/2018 9:41 PM, Alexander Duyck wrote:
>>>  }
>>>
>> So you missed the writel in e1000_xmit_frame. You should probably get
>> that one too while you are doing these updates. The wmb() is in
>> e1000_tx_queue().
>>
>
> I brought wmb() outside along with the next descriptor assignment to be
> similar to the rest of the other code.
>
> if wmb() and writel() are not visible in the same function, let's not touch
> the code.

Maybe for e1000 we should just skip the driver entirely. Odds are you
aren't going to have any e1000 parts running on ARM anyway since most
of them are legacy PCI or PCI-X parts that were made over 10 years
ago. Most of your efforts would probably be best spent on igb, igbvf,
ixgbe, ixgbevf, i40e, i40evf, and fm10k.


Re: [PATCH v6 0/6] staging: Introduce DPAA2 Ethernet Switch driver

2018-03-15 Thread Andrew Lunn
On Thu, Mar 15, 2018 at 01:56:42PM +0300, Dan Carpenter wrote:
> On Thu, Mar 15, 2018 at 12:44:37AM +0100, Andrew Lunn wrote:
> > On Wed, Mar 14, 2018 at 10:55:52AM -0500, Razvan Stefanescu wrote:
> > > This patchset introduces the Ethernet Switch Driver for Freescale/NXP SoCs
> > > with DPAA2 (DataPath Acceleration Architecture v2). The driver manages
> > > switch objects discovered on the fsl-mc bus. A description of the driver
> > > can be found in the associated README file.
> > 
> > Hi Greg
> > 
> > This code has much better quality than the usual stuff in staging. I
> > see no reason not to merge it. 
> 
> Yeah.  It seems pretty decent.  Stuart, Laurentiu, care to comment?
> 
> Meanwhile, netdev and DaveM aren't even on the CC list and they're the
> ones to ultimately decide.

The patches are for staging, so it is GregKH who decides at this
point, not really DaveM.

   Andrew


Re: [PATCH] mlx5: Remove call to ida_pre_get

2018-03-15 Thread Saeed Mahameed
On Wed, 2018-03-14 at 19:57 -0700, Matthew Wilcox wrote:
> From: Matthew Wilcox 
> 
> The mlx5 driver calls ida_pre_get() in a loop for no readily apparent
> reason.  The driver uses ida_simple_get() which will call
> ida_pre_get()
> by itself and there's no need to use ida_pre_get() unless using
> ida_get_new().
> 

Hi Matthew,

Is this causing any issues, or is it just a simple cleanup?

Adding Maor, the author of this change,

I believe the idea is to speed up insert_fte (which calls
ida_simple_get) since insert_fte runs under the FTE write semaphore,
in this case if ida_pre_get was successful before taking the semaphore
for all the FTE nodes in the loop, this will be a huge win for
ida_simple_get which will immediately return success without even
trying to allocate.

so it is a best effort to speed up critical path.

Maor, if this is really the case and this is not causing any issues,
then we need to consider adding a comment.


> Signed-off-by: Matthew Wilcox 
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> index 10e16381f20a..3ba07c7096ef 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
> @@ -1647,7 +1647,6 @@ try_add_to_existing_fg(struct mlx5_flow_table
> *ft,
>  
>   list_for_each_entry(iter, match_head, list) {
>   nested_down_read_ref_node(&iter->g->node,
> FS_LOCK_PARENT);
> - ida_pre_get(&iter->g->fte_allocator, GFP_KERNEL);
>   }
>  
>  search_again_locked:
> 

Re: [PATCH v2 12/15] ice: Add stats and ethtool support

2018-03-15 Thread Stephen Hemminger
On Thu, 15 Mar 2018 16:47:59 -0700
Anirudh Venkataramanan  wrote:

> +
> +static const struct ice_stats ice_gstrings_vsi_stats[] = {
> + ICE_VSI_STAT("tx_unicast", eth_stats.tx_unicast),
> + ICE_VSI_STAT("rx_unicast", eth_stats.rx_unicast),
> + ICE_VSI_STAT("tx_multicast", eth_stats.tx_multicast),
> + ICE_VSI_STAT("rx_multicast", eth_stats.rx_multicast),
> + ICE_VSI_STAT("tx_broadcast", eth_stats.tx_broadcast),
> + ICE_VSI_STAT("rx_broadcast", eth_stats.rx_broadcast),
> + ICE_VSI_STAT("tx_bytes", eth_stats.tx_bytes),
> + ICE_VSI_STAT("rx_bytes", eth_stats.rx_bytes),
> + ICE_VSI_STAT("rx_discards", eth_stats.rx_discards),
> + ICE_VSI_STAT("tx_errors", eth_stats.tx_errors),
> + ICE_VSI_STAT("tx_linearize", tx_linearize),
> + ICE_VSI_STAT("rx_unknown_protocol", eth_stats.rx_unknown_protocol),
> + ICE_VSI_STAT("rx_alloc_fail", rx_buf_failed),
> + ICE_VSI_STAT("rx_pg_alloc_fail", rx_page_failed),
> +};
> +

Ignoring feedback from maintainers is unlikely to help get your driver adopted.


[PATCH v2 07/15] ice: Add support for VSI allocation and deallocation

2018-03-15 Thread Anirudh Venkataramanan
This patch introduces data structures and functions to alloc/free
VSIs. The driver represents a VSI using the ice_vsi structure.

Some noteworthy points about VSI allocation:

1) A VSI is allocated in the firmware using the "add VSI" admin queue
   command (implemented as ice_aq_add_vsi). The firmware returns an
   identifier for the allocated VSI. The VSI context is used to program
   certain aspects (loopback, queue map, etc.) of the VSI's configuration.

2) A VSI is deleted using the "free VSI" admin queue command (implemented
   as ice_aq_free_vsi).

3) The driver represents a VSI using struct ice_vsi. This is allocated
   and initialized as part of the ice_vsi_alloc flow, and deallocated
   as part of the ice_vsi_delete flow.

4) Once the VSI is created, a netdev is allocated and associated with it.
   The VSI's ring and vector related data structures are also allocated
   and initialized.

5) A VSI's queues can either be contiguous or scattered. To do this, the
   driver maintains a bitmap (vsi->avail_txqs) which is kept in sync with
   the firmware's VSI queue allocation imap. If the VSI can't get a
   contiguous queue allocation, it will fallback to scatter. This is
   implemented in ice_vsi_get_qs which is called as part of the VSI setup
   flow. In the release flow, the VSI's queues are released and the bitmap
   is updated to reflect this by ice_vsi_put_qs.
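
The contiguous-then-scatter policy in point 5 can be sketched as below. This
is a simplified userspace model under stated assumptions: a small boolean
array stands in for the availability bitmap, and get_qs()/put_qs() are
hypothetical names, not the driver's ice_vsi_get_qs/ice_vsi_put_qs.

```c
#include <stdbool.h>

#define NUM_QS 16	/* hypothetical queue pool size */

enum q_mapping { Q_MAP_CONTIG, Q_MAP_SCATTER, Q_MAP_FAIL };

/*
 * Claim 'count' queues. in_use[i] is true when queue i is taken.
 * The claimed queue indexes are written to out[0..count-1].
 */
static enum q_mapping get_qs(bool in_use[NUM_QS], int count, int out[])
{
	int start, run, q, i, found = 0;

	/* First choice: a contiguous run of 'count' free queues. */
	for (start = 0; start + count <= NUM_QS; start++) {
		run = 0;
		while (run < count && !in_use[start + run])
			run++;
		if (run == count) {
			for (i = 0; i < count; i++) {
				in_use[start + i] = true;
				out[i] = start + i;
			}
			return Q_MAP_CONTIG;
		}
	}

	/* Fallback: pick any free queues, wherever they are. */
	for (q = 0; q < NUM_QS && found < count; q++)
		if (!in_use[q])
			out[found++] = q;

	if (found < count)
		return Q_MAP_FAIL;	/* not enough free queues at all */

	for (i = 0; i < count; i++)
		in_use[out[i]] = true;
	return Q_MAP_SCATTER;
}

/* Release previously claimed queues and mark them free again. */
static void put_qs(bool in_use[NUM_QS], int count, const int qs[])
{
	int i;

	for (i = 0; i < count; i++)
		in_use[qs[i]] = false;
}
```

The model leaves out locking and the firmware-side queue map; it only shows
why a fragmented pool can still satisfy a request in scatter mode.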

CC: Shannon Nelson 
Signed-off-by: Anirudh Venkataramanan 
---
v2: Addressed Shannon Nelson's comments by
   1) using a new define ICE_NO_VSI instead of the magic number 0x.
   2) adding missing curly braces and break statements.

Also, ice_set_def_vsi_ctx was changed to ice_set_dflt_vsi_ctx for clarity.
---
 drivers/net/ethernet/intel/ice/ice.h|   72 ++
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h |  199 
 drivers/net/ethernet/intel/ice/ice_main.c   |  +++
 drivers/net/ethernet/intel/ice/ice_switch.c |  115 +++
 drivers/net/ethernet/intel/ice/ice_switch.h |   21 +
 drivers/net/ethernet/intel/ice/ice_txrx.h   |   26 +
 drivers/net/ethernet/intel/ice/ice_type.h   |4 +
 7 files changed, 1548 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index c8079c852a48..c9f59374daad 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -25,6 +25,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ice_devids.h"
 #include "ice_type.h"
@@ -41,17 +44,43 @@
 #include "ice_sched.h"
 
 #define ICE_BAR0   0
+#define ICE_DFLT_NUM_DESC  128
+#define ICE_REQ_DESC_MULTIPLE  32
 #define ICE_INT_NAME_STR_LEN   (IFNAMSIZ + 16)
 #define ICE_AQ_LEN 64
 #define ICE_MIN_MSIX   2
+#define ICE_NO_VSI 0x
 #define ICE_MAX_VSI_ALLOC  130
 #define ICE_MAX_TXQS   2048
 #define ICE_MAX_RXQS   2048
+#define ICE_VSI_MAP_CONTIG 0
+#define ICE_VSI_MAP_SCATTER1
+#define ICE_MAX_SCATTER_TXQS   16
+#define ICE_MAX_SCATTER_RXQS   16
 #define ICE_RES_VALID_BIT  0x8000
 #define ICE_RES_MISC_VEC_ID(ICE_RES_VALID_BIT - 1)
+#define ICE_INVAL_Q_INDEX  0x
 
 #define ICE_DFLT_NETIF_M (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK)
 
+#define ICE_MAX_MTU(ICE_AQ_SET_MAC_FRAME_SIZE_MAX - \
+ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN)
+
+#define ICE_UP_TABLE_TRANSLATE(val, i) \
+   (((val) << ICE_AQ_VSI_UP_TABLE_UP##i##_S) & \
+ ICE_AQ_VSI_UP_TABLE_UP##i##_M)
+
+struct ice_tc_info {
+   u16 qoffset;
+   u16 qcount;
+};
+
+struct ice_tc_cfg {
+   u8 numtc; /* Total number of enabled TCs */
+   u8 ena_tc; /* TX map */
+   struct ice_tc_info tc_info[ICE_MAX_TRAFFIC_CLASS];
+};
+
 struct ice_res_tracker {
u16 num_entries;
u16 search_hint;
@@ -75,8 +104,47 @@ enum ice_state {
 /* struct that defines a VSI, associated with a dev */
 struct ice_vsi {
struct net_device *netdev;
+   struct ice_sw *vsw;  /* switch this VSI is on */
+   struct ice_pf *back; /* back pointer to PF */
struct ice_port_info *port_info; /* back pointer to port_info */
+   struct ice_ring **rx_rings;  /* rx ring array */
+   struct ice_ring **tx_rings;  /* tx ring array */
+   struct ice_q_vector **q_vectors; /* q_vector array */
+   DECLARE_BITMAP(state, __ICE_STATE_NBITS);
+   int num_q_vectors;
+   int base_vector;
+   enum ice_vsi_type type;
u16 vsi_num; /* HW (absolute) index of this VSI */
+   u16 idx; /* software index in pf->vsi[] */
+
+   /* Interrupt thresholds */
+   u16 work_lmt;
+
+   struct ice_aqc_vsi_props info;   /* VSI properties */
+
+   

[PATCH v2 06/15] ice: Initialize PF and setup miscellaneous interrupt

2018-03-15 Thread Anirudh Venkataramanan
This patch continues the initialization flow as follows:

1) Allocate and initialize necessary fields (like vsi, num_alloc_vsi,
   irq_tracker, etc) in the ice_pf instance.

2) Set up the miscellaneous interrupt handler. This is also known as the
   "other interrupt causes" (OIC) handler and is used to handle non
   hotpath interrupts (like control queue events, link events,
   exceptions, etc.).

3) Implement a background task to process admin queue receive (ARQ)
   events received by the driver.

CC: Shannon Nelson 
Signed-off-by: Anirudh Venkataramanan 
---
v2: Removed reference to "lump" as suggested by Shannon Nelson.
---
 drivers/net/ethernet/intel/ice/ice.h|  84 +++
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h |   2 +
 drivers/net/ethernet/intel/ice/ice_common.c |   6 +
 drivers/net/ethernet/intel/ice/ice_common.h |   3 +
 drivers/net/ethernet/intel/ice/ice_controlq.c   | 101 
 drivers/net/ethernet/intel/ice/ice_controlq.h   |   8 +
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h |  63 +++
 drivers/net/ethernet/intel/ice/ice_main.c   | 719 +++-
 drivers/net/ethernet/intel/ice/ice_txrx.h   |  43 ++
 drivers/net/ethernet/intel/ice/ice_type.h   |  11 +
 10 files changed, 1039 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_txrx.h

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 9681e971bcab..c8079c852a48 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -26,29 +26,113 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
+#include 
 #include 
 #include 
+#include 
 #include "ice_devids.h"
 #include "ice_type.h"
+#include "ice_txrx.h"
 #include "ice_switch.h"
 #include "ice_common.h"
 #include "ice_sched.h"
 
 #define ICE_BAR0   0
+#define ICE_INT_NAME_STR_LEN   (IFNAMSIZ + 16)
 #define ICE_AQ_LEN 64
+#define ICE_MIN_MSIX   2
+#define ICE_MAX_VSI_ALLOC  130
+#define ICE_MAX_TXQS   2048
+#define ICE_MAX_RXQS   2048
+#define ICE_RES_VALID_BIT  0x8000
+#define ICE_RES_MISC_VEC_ID(ICE_RES_VALID_BIT - 1)
 
 #define ICE_DFLT_NETIF_M (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK)
 
+struct ice_res_tracker {
+   u16 num_entries;
+   u16 search_hint;
+   u16 list[1];
+};
+
+struct ice_sw {
+   struct ice_pf *pf;
+   u16 sw_id;  /* switch ID for this switch */
+   u16 bridge_mode;/* VEB/VEPA/Port Virtualizer */
+};
+
 enum ice_state {
__ICE_DOWN,
+   __ICE_PFR_REQ,  /* set by driver and peers */
+   __ICE_ADMINQ_EVENT_PENDING,
+   __ICE_SERVICE_SCHED,
__ICE_STATE_NBITS   /* must be last */
 };
 
+/* struct that defines a VSI, associated with a dev */
+struct ice_vsi {
+   struct net_device *netdev;
+   struct ice_port_info *port_info; /* back pointer to port_info */
+   u16 vsi_num; /* HW (absolute) index of this VSI */
+} cacheline_internodealigned_in_smp;
+
+enum ice_pf_flags {
+   ICE_FLAG_MSIX_ENA,
+   ICE_FLAG_FLTR_SYNC,
+   ICE_FLAG_RSS_ENA,
+   ICE_PF_FLAGS_NBITS  /* must be last */
+};
+
 struct ice_pf {
struct pci_dev *pdev;
+   struct msix_entry *msix_entries;
+   struct ice_res_tracker *irq_tracker;
+   struct ice_vsi **vsi;   /* VSIs created by the driver */
+   struct ice_sw *first_sw;/* first switch created by firmware */
DECLARE_BITMAP(state, __ICE_STATE_NBITS);
+   DECLARE_BITMAP(avail_txqs, ICE_MAX_TXQS);
+   DECLARE_BITMAP(avail_rxqs, ICE_MAX_RXQS);
+   DECLARE_BITMAP(flags, ICE_PF_FLAGS_NBITS);
+   unsigned long serv_tmr_period;
+   unsigned long serv_tmr_prev;
+   struct timer_list serv_tmr;
+   struct work_struct serv_task;
+   struct mutex avail_q_mutex; /* protects access to avail_[rx|tx]qs */
+   struct mutex sw_mutex;  /* lock for protecting VSI alloc flow */
u32 msg_enable;
+   u32 oicr_idx;   /* Other interrupt cause vector index */
+   u32 num_lan_msix;   /* Total MSIX vectors for base driver */
+   u32 num_avail_msix; /* remaining MSIX vectors left unclaimed */
+   u16 num_lan_tx; /* num lan tx queues setup */
+   u16 num_lan_rx; /* num lan rx queues setup */
+   u16 q_left_tx;  /* remaining num tx queues left unclaimed */
+   u16 q_left_rx;  /* remaining num rx queues left unclaimed */
+   u16 next_vsi;   /* Next free slot in pf->vsi[] - 0-based! */
+   u16 num_alloc_vsi;
+
struct ice_hw hw;
+   char int_name[ICE_INT_NAME_STR_LEN];
 };
+
+/**
+ * ice_irq_dynamic_ena - Enable default interrupt generation settings
+ * @hw: pointer to hw struct
+ */
+static inline void ice_irq_dynamic_ena(struct ice_hw 

[PATCH v2 09/15] ice: Configure VSIs for Tx/Rx

2018-03-15 Thread Anirudh Venkataramanan
This patch configures the VSIs to be able to send and receive
packets by doing the following:

1) Initialize flexible parser to extract and include certain
   fields in the Rx descriptor.

2) Add Tx queues by programming the Tx queue context (implemented in
   ice_vsi_cfg_txqs). Note that adding the queues also enables (starts)
   the queues.

3) Add Rx queues by programming Rx queue context (implemented in
   ice_vsi_cfg_rxqs). Note that this only adds queues but doesn't start
   them. The rings will be started by calling ice_vsi_start_rx_rings on
   interface up.

4) Configure interrupts for VSI queues.

5) Implement ice_open and ice_stop.
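
The asymmetry between steps 2 and 3 (Tx queues start when they are added, Rx queues only start on interface up) can be sketched in plain user-space C. This is only an illustration of the ordering: every `demo_*` name below is hypothetical and none of it is the driver's code; the real work is done by ice_vsi_cfg_txqs, ice_vsi_cfg_rxqs and ice_vsi_start_rx_rings.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_Q 16

/* Hypothetical per-queue state, not the driver's ring structure. */
struct demo_queue {
	bool configured; /* queue context has been programmed */
	bool enabled;    /* queue is started and may move packets */
};

struct demo_vsi {
	size_t num_txq;
	size_t num_rxq;
	struct demo_queue txq[DEMO_MAX_Q];
	struct demo_queue rxq[DEMO_MAX_Q];
};

/* Iterators modelled on the for-each macros this patch adds to ice.h. */
#define demo_for_each_txq(vsi, i) \
	for ((i) = 0; (i) < (vsi)->num_txq; (i)++)
#define demo_for_each_rxq(vsi, i) \
	for ((i) = 0; (i) < (vsi)->num_rxq; (i)++)

/* Step 2: adding a Tx queue also enables (starts) it. */
void demo_vsi_cfg_txqs(struct demo_vsi *vsi)
{
	size_t i;

	demo_for_each_txq(vsi, i) {
		vsi->txq[i].configured = true;
		vsi->txq[i].enabled = true;
	}
}

/* Step 3: adding an Rx queue programs its context but does not start it. */
void demo_vsi_cfg_rxqs(struct demo_vsi *vsi)
{
	size_t i;

	demo_for_each_rxq(vsi, i)
		vsi->rxq[i].configured = true;
}

/* Called on interface up: start every Rx ring that has been configured. */
void demo_vsi_start_rx_rings(struct demo_vsi *vsi)
{
	size_t i;

	demo_for_each_rxq(vsi, i)
		if (vsi->rxq[i].configured)
			vsi->rxq[i].enabled = true;
}
```

After demo_vsi_cfg_txqs and demo_vsi_cfg_rxqs, every Tx queue is enabled while the Rx queues are configured but idle until demo_vsi_start_rx_rings runs.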

Signed-off-by: Anirudh Venkataramanan 
---
 drivers/net/ethernet/intel/ice/Makefile |3 +-
 drivers/net/ethernet/intel/ice/ice.h|   36 +-
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h |   86 ++
 drivers/net/ethernet/intel/ice/ice_common.c |  602 
 drivers/net/ethernet/intel/ice/ice_common.h |   13 +
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h |   59 ++
 drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h  |  260 ++
 drivers/net/ethernet/intel/ice/ice_main.c   | 1140 ++-
 drivers/net/ethernet/intel/ice/ice_sched.c  |  105 +++
 drivers/net/ethernet/intel/ice/ice_sched.h  |5 +
 drivers/net/ethernet/intel/ice/ice_status.h |2 +
 drivers/net/ethernet/intel/ice/ice_txrx.c   |  375 
 drivers/net/ethernet/intel/ice/ice_txrx.h   |   75 ++
 drivers/net/ethernet/intel/ice/ice_type.h   |2 +
 14 files changed, 2757 insertions(+), 6 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_txrx.c

diff --git a/drivers/net/ethernet/intel/ice/Makefile b/drivers/net/ethernet/intel/ice/Makefile
index 809d85c04398..0abeb20c006d 100644
--- a/drivers/net/ethernet/intel/ice/Makefile
+++ b/drivers/net/ethernet/intel/ice/Makefile
@@ -29,4 +29,5 @@ ice-y := ice_main.o   \
 ice_common.o   \
 ice_nvm.o  \
 ice_switch.o   \
-ice_sched.o
+ice_sched.o    \
+ice_txrx.o
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index c9f59374daad..e3ec19099e37 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -25,8 +25,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -57,6 +59,8 @@
 #define ICE_VSI_MAP_SCATTER    1
 #define ICE_MAX_SCATTER_TXQS   16
 #define ICE_MAX_SCATTER_RXQS   16
+#define ICE_Q_WAIT_RETRY_LIMIT 10
+#define ICE_Q_WAIT_MAX_RETRY   (5 * ICE_Q_WAIT_RETRY_LIMIT)
 #define ICE_RES_VALID_BIT  0x8000
 #define ICE_RES_MISC_VEC_ID    (ICE_RES_VALID_BIT - 1)
 #define ICE_INVAL_Q_INDEX  0x
@@ -70,6 +74,14 @@
(((val) << ICE_AQ_VSI_UP_TABLE_UP##i##_S) & \
  ICE_AQ_VSI_UP_TABLE_UP##i##_M)
 
+#define ICE_RX_DESC(R, i) (&(((union ice_32b_rx_flex_desc *)((R)->desc))[i]))
+
+#define ice_for_each_txq(vsi, i) \
+   for ((i) = 0; (i) < (vsi)->num_txq; (i)++)
+
+#define ice_for_each_rxq(vsi, i) \
+   for ((i) = 0; (i) < (vsi)->num_rxq; (i)++)
+
 struct ice_tc_info {
u16 qoffset;
u16 qcount;
@@ -110,6 +122,9 @@ struct ice_vsi {
struct ice_ring **rx_rings;  /* rx ring array */
struct ice_ring **tx_rings;  /* tx ring array */
struct ice_q_vector **q_vectors; /* q_vector array */
+
+   irqreturn_t (*irq_handler)(int irq, void *data);
+
DECLARE_BITMAP(state, __ICE_STATE_NBITS);
int num_q_vectors;
int base_vector;
@@ -120,8 +135,14 @@ struct ice_vsi {
/* Interrupt thresholds */
u16 work_lmt;
 
+   u16 max_frame;
+   u16 rx_buf_len;
+
struct ice_aqc_vsi_props info;   /* VSI properties */
 
+   bool irqs_ready;
+   bool current_isup;   /* Sync 'link up' logging */
+
/* queue information */
u8 tx_mapping_mode;  /* ICE_MAP_MODE_[CONTIG|SCATTER] */
u8 rx_mapping_mode;  /* ICE_MAP_MODE_[CONTIG|SCATTER] */
@@ -142,9 +163,11 @@ struct ice_q_vector {
struct napi_struct napi;
struct ice_ring_container rx;
struct ice_ring_container tx;
+   struct irq_affinity_notify affinity_notify;
u16 v_idx;  /* index in the vsi->q_vector array. */
u8 num_ring_tx; /* total number of tx rings in vector */
u8 num_ring_rx; /* total number of rx rings in vector */
+   char name[ICE_INT_NAME_STR_LEN];
 } cacheline_internodealigned_in_smp;
 
 enum ice_pf_flags {
@@ -192,10 +215,14 @@ struct ice_netdev_priv {
 /**
  * ice_irq_dynamic_ena - Enable default interrupt generation settings
  * @hw: pointer to hw struct
+ * @vsi: pointer to vsi struct, can be NULL
+ * @q_vector: pointer to 

[PATCH v2 05/15] ice: Get MAC/PHY/link info and scheduler topology

2018-03-15 Thread Anirudh Venkataramanan
This patch adds code to continue the initialization flow as follows:

1) Get PHY/link information and store it
2) Get default scheduler tree topology and store it
3) Get the MAC address associated with the port and store it
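
As an illustration of step 3, here is a minimal user-space sketch of how a caller might pick the LAN address out of the response buffer. The entry layout mirrors struct ice_aqc_manage_mac_read_resp from this patch, but the `demo_*` names and the selection logic are assumptions made for the example, not the driver's implementation.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN          6
#define DEMO_MAC_ADDR_TYPE_LAN 0 /* same value as ICE_AQC_MAN_MAC_ADDR_TYPE_LAN */
#define DEMO_MAC_ADDR_TYPE_WOL 1 /* same value as ICE_AQC_MAN_MAC_ADDR_TYPE_WOL */

/* One entry of the response buffer (layout follows the patch). */
struct demo_mac_read_resp {
	uint8_t lport_num;
	uint8_t addr_type;
	uint8_t mac_addr[DEMO_ETH_ALEN];
};

/*
 * Walk the num_addr entries returned by the (hypothetical) read command
 * and copy out the first LAN-type address.
 * Returns 0 on success, -1 if no LAN address was reported.
 */
int demo_pick_lan_mac(const struct demo_mac_read_resp *resp, uint8_t num_addr,
		      uint8_t out[DEMO_ETH_ALEN])
{
	uint8_t i;

	for (i = 0; i < num_addr; i++) {
		if (resp[i].addr_type == DEMO_MAC_ADDR_TYPE_LAN) {
			memcpy(out, resp[i].mac_addr, DEMO_ETH_ALEN);
			return 0;
		}
	}
	return -1;
}
```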

Signed-off-by: Anirudh Venkataramanan 
---
 drivers/net/ethernet/intel/ice/ice.h|   1 +
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 261 +++
 drivers/net/ethernet/intel/ice/ice_common.c | 264 +++
 drivers/net/ethernet/intel/ice/ice_common.h |   3 +
 drivers/net/ethernet/intel/ice/ice_sched.c  | 328 
 drivers/net/ethernet/intel/ice/ice_sched.h  |   6 +
 drivers/net/ethernet/intel/ice/ice_status.h |   1 +
 drivers/net/ethernet/intel/ice/ice_type.h   |  65 +
 8 files changed, 929 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index f6e3339591bb..9681e971bcab 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 66a3f41df673..13e3b7f3e24d 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -118,6 +118,35 @@ struct ice_aqc_list_caps_elem {
__le64 rsvd2;
 };
 
+/* Manage MAC address, read command - indirect (0x0107)
+ * This struct is also used for the response
+ */
+struct ice_aqc_manage_mac_read {
+   __le16 flags; /* Zeroed by device driver */
+#define ICE_AQC_MAN_MAC_LAN_ADDR_VALID BIT(4)
+#define ICE_AQC_MAN_MAC_SAN_ADDR_VALID BIT(5)
+#define ICE_AQC_MAN_MAC_PORT_ADDR_VALID    BIT(6)
+#define ICE_AQC_MAN_MAC_WOL_ADDR_VALID BIT(7)
+#define ICE_AQC_MAN_MAC_READ_S 4
+#define ICE_AQC_MAN_MAC_READ_M (0xF << ICE_AQC_MAN_MAC_READ_S)
+   u8 lport_num;
+   u8 lport_num_valid;
+#define ICE_AQC_MAN_MAC_PORT_NUM_IS_VALID  BIT(0)
+   u8 num_addr; /* Used in response */
+   u8 reserved[3];
+   __le32 addr_high;
+   __le32 addr_low;
+};
+
+/* Response buffer format for manage MAC read command */
+struct ice_aqc_manage_mac_read_resp {
+   u8 lport_num;
+   u8 addr_type;
+#define ICE_AQC_MAN_MAC_ADDR_TYPE_LAN  0
+#define ICE_AQC_MAN_MAC_ADDR_TYPE_WOL  1
+   u8 mac_addr[ETH_ALEN];
+};
+
 /* Clear PXE Command and response (direct 0x0110) */
 struct ice_aqc_clear_pxe {
u8 rx_cnt;
@@ -175,6 +204,16 @@ struct ice_aqc_get_sw_cfg_resp {
struct ice_aqc_get_sw_cfg_resp_elem elements[1];
 };
 
+/* Get Default Topology (indirect 0x0400) */
+struct ice_aqc_get_topo {
+   u8 port_num;
+   u8 num_branches;
+   __le16 reserved1;
+   __le32 reserved2;
+   __le32 addr_high;
+   __le32 addr_low;
+};
+
 /* Add TSE (indirect 0x0401)
  * Delete TSE (indirect 0x040F)
  * Move TSE (indirect 0x0408)
@@ -235,6 +274,12 @@ struct ice_aqc_txsched_topo_grp_info_hdr {
__le16 reserved2;
 };
 
+struct ice_aqc_get_topo_elem {
+   struct ice_aqc_txsched_topo_grp_info_hdr hdr;
+   struct ice_aqc_txsched_elem_data
+   generic[ICE_AQC_TOPO_MAX_LEVEL_NUM];
+};
+
 struct ice_aqc_delete_elem {
struct ice_aqc_txsched_topo_grp_info_hdr hdr;
__le32 teid[1];
@@ -280,6 +325,210 @@ struct ice_aqc_query_txsched_res_resp {
struct ice_aqc_layer_props layer_props[ICE_AQC_TOPO_MAX_LEVEL_NUM];
 };
 
+/* Get PHY capabilities (indirect 0x0600) */
+struct ice_aqc_get_phy_caps {
+   u8 lport_num;
+   u8 reserved;
+   __le16 param0;
+   /* 18.0 - Report qualified modules */
+#define ICE_AQC_GET_PHY_RQM    BIT(0)
+   /* 18.1 - 18.2 : Report mode
+* 00b - Report NVM capabilities
+* 01b - Report topology capabilities
+* 10b - Report SW configured
+*/
+#define ICE_AQC_REPORT_MODE_S  1
+#define ICE_AQC_REPORT_MODE_M  (3 << ICE_AQC_REPORT_MODE_S)
+#define ICE_AQC_REPORT_NVM_CAP 0
+#define ICE_AQC_REPORT_TOPO_CAP    BIT(1)
+#define ICE_AQC_REPORT_SW_CFG  BIT(2)
+   __le32 reserved1;
+   __le32 addr_high;
+   __le32 addr_low;
+};
+
+/* This is #define of PHY type (Extended):
+ * The first set of defines is for phy_type_low.
+ */
+#define ICE_PHY_TYPE_LOW_100BASE_TX    BIT_ULL(0)
+#define ICE_PHY_TYPE_LOW_100M_SGMII    BIT_ULL(1)
+#define ICE_PHY_TYPE_LOW_1000BASE_T    BIT_ULL(2)
+#define ICE_PHY_TYPE_LOW_1000BASE_SX   BIT_ULL(3)
+#define ICE_PHY_TYPE_LOW_1000BASE_LX   BIT_ULL(4)
+#define ICE_PHY_TYPE_LOW_1000BASE_KX   BIT_ULL(5)
+#define ICE_PHY_TYPE_LOW_1G_SGMII  BIT_ULL(6)
+#define ICE_PHY_TYPE_LOW_2500BASE_T    BIT_ULL(7)
+#define 

Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Kees Cook
On Thu, Mar 15, 2018 at 4:46 PM, Linus Torvalds
 wrote:
> What I'm *not* so much ok with is "const_max(5,sizeof(x))" erroring
> out, or silently causing insane behavior due to hidden subtle type
> casts..

Yup! I like it as an explicit argument. Thanks!

-Kees

-- 
Kees Cook
Pixel Security


[PATCH v2 08/15] ice: Add support for switch filter programming

2018-03-15 Thread Anirudh Venkataramanan
A VSI needs traffic directed towards it. This is done by programming
filter rules on the switch (embedded vSwitch) element in the hardware,
which connects the VSI to the ingress/egress port.

This patch introduces data structures and functions necessary to add,
remove or update switch rules on the switch element. This is a pretty low
level function that is generic enough to add a whole range of filters.

This patch also introduces two top level functions, ice_add_mac and
ice_remove_mac, which through a series of intermediate helper functions
eventually call ice_aq_sw_rules to add/delete simple MAC based filters.
It's worth noting that one invocation of ice_add_mac/ice_remove_mac
is capable of adding/deleting multiple MAC filters.

Also worth noting is the fact that the driver maintains a list of currently
active filters, so every filter addition/removal causes an update to this
list. This is done for a couple of reasons:

1) If two VSIs try to add the same filters, we need to detect it and do
   things a little differently (i.e. use VSI lists, described below) as
   the same filter can't be added more than once.

2) In the event of a hardware reset we can simply walk through this list
   and restore the filters.

VSI Lists:
In a multi-VSI situation, it's possible that multiple VSIs want to add the
same filter rule. For example, two VSIs that want to receive broadcast
traffic would both add a filter for destination MAC ff:ff:ff:ff:ff:ff.
This can become cumbersome to maintain and so this is handled using a
VSI list.

A VSI list is a resource that can be allocated in the hardware using the
ice_aq_alloc_free_res admin queue command. Simply put, a VSI list can
be thought of as a subscription list containing a set of VSIs to which
the packet should be forwarded, should the filter match.

For example, if VSI-0 has already added a broadcast filter, and VSI-1
wants to do the same thing, the filter creation flow will detect this,
allocate a VSI list and update the switch rule so that broadcast traffic
will now be forwarded to the VSI list which contains VSI-0 and VSI-1.
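
The broadcast example can be modelled with a small software-only filter table. This is a sketch under stated assumptions: the `demo_*` names, the fixed-size table and the 64-bit VSI bitmap are invented for illustration, and no admin queue command is issued, whereas the real flow allocates the list with ice_aq_alloc_free_res and updates the rule through ice_aq_sw_rules.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN    6
#define DEMO_MAX_FILTERS 8
#define DEMO_MAX_VSI     64

enum demo_fwd {
	DEMO_FWD_TO_VSI,      /* rule forwards to exactly one VSI */
	DEMO_FWD_TO_VSI_LIST, /* rule forwards to a list of subscribed VSIs */
};

/* Hypothetical bookkeeping entry for one active filter. */
struct demo_filter {
	bool in_use;
	uint8_t mac[DEMO_ETH_ALEN];
	enum demo_fwd action;
	uint64_t vsi_map; /* bit n set means VSI n receives matching packets */
};

struct demo_switch {
	struct demo_filter filters[DEMO_MAX_FILTERS];
};

/*
 * Add a MAC filter for one VSI.
 * Returns 0 on success, 1 if this VSI already has the filter, and -1 on
 * error (bad VSI number or table full).
 */
int demo_add_mac(struct demo_switch *sw, const uint8_t mac[DEMO_ETH_ALEN],
		 unsigned int vsi)
{
	struct demo_filter *free_slot = NULL;
	size_t i;

	if (vsi >= DEMO_MAX_VSI)
		return -1;

	for (i = 0; i < DEMO_MAX_FILTERS; i++) {
		struct demo_filter *f = &sw->filters[i];

		if (!f->in_use) {
			if (!free_slot)
				free_slot = f;
			continue;
		}
		if (memcmp(f->mac, mac, DEMO_ETH_ALEN) != 0)
			continue;
		if (f->vsi_map & (1ULL << vsi))
			return 1; /* same filter, same VSI: nothing to do */

		/* Same filter from a second VSI: switch to a VSI list. */
		f->vsi_map |= 1ULL << vsi;
		f->action = DEMO_FWD_TO_VSI_LIST;
		return 0;
	}

	if (!free_slot)
		return -1;

	/* First user of this MAC: plain forward-to-VSI rule. */
	free_slot->in_use = true;
	memcpy(free_slot->mac, mac, DEMO_ETH_ALEN);
	free_slot->action = DEMO_FWD_TO_VSI;
	free_slot->vsi_map = 1ULL << vsi;
	return 0;
}
```

Because the table doubles as the list of currently active filters, walking it after a reset would be enough to replay every rule, which is the second reason given above for keeping it.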

Signed-off-by: Anirudh Venkataramanan 
---
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h |  249 
 drivers/net/ethernet/intel/ice/ice_common.c |   74 +-
 drivers/net/ethernet/intel/ice/ice_main.c   |   92 ++
 drivers/net/ethernet/intel/ice/ice_status.h |3 +
 drivers/net/ethernet/intel/ice/ice_switch.c | 1378 +++
 drivers/net/ethernet/intel/ice/ice_switch.h |  120 ++
 drivers/net/ethernet/intel/ice/ice_type.h   |   21 +
 7 files changed, 1935 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 570169c99786..c834ed38602b 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -22,6 +22,7 @@
  * descriptor format.  It is shared between Firmware and Software.
  */
 
+#define ICE_MAX_VSI    768
 #define ICE_AQC_TOPO_MAX_LEVEL_NUM 0x9
 #define ICE_AQ_SET_MAC_FRAME_SIZE_MAX  9728
 
@@ -205,6 +206,46 @@ struct ice_aqc_get_sw_cfg_resp {
struct ice_aqc_get_sw_cfg_resp_elem elements[1];
 };
 
+/* These resource type defines are used for all switch resource
+ * commands where a resource type is required, such as:
+ * Get Resource Allocation command (indirect 0x0204)
+ * Allocate Resources command (indirect 0x0208)
+ * Free Resources command (indirect 0x0209)
+ * Get Allocated Resource Descriptors Command (indirect 0x020A)
+ */
+#define ICE_AQC_RES_TYPE_VSI_LIST_REP  0x03
+#define ICE_AQC_RES_TYPE_VSI_LIST_PRUNE    0x04
+
+/* Allocate Resources command (indirect 0x0208)
+ * Free Resources command (indirect 0x0209)
+ */
+struct ice_aqc_alloc_free_res_cmd {
+   __le16 num_entries; /* Number of Resource entries */
+   u8 reserved[6];
+   __le32 addr_high;
+   __le32 addr_low;
+};
+
+/* Resource descriptor */
+struct ice_aqc_res_elem {
+   union {
+   __le16 sw_resp;
+   __le16 flu_resp;
+   } e;
+};
+
+/* Buffer for Allocate/Free Resources commands */
+struct ice_aqc_alloc_free_res_elem {
+   __le16 res_type; /* Types defined above cmd 0x0204 */
+#define ICE_AQC_RES_TYPE_SHARED_S  7
+#define ICE_AQC_RES_TYPE_SHARED_M  (0x1 << ICE_AQC_RES_TYPE_SHARED_S)
+#define ICE_AQC_RES_TYPE_VSI_PRUNE_LIST_S  8
+#define ICE_AQC_RES_TYPE_VSI_PRUNE_LIST_M  \
+   (0xF << ICE_AQC_RES_TYPE_VSI_PRUNE_LIST_S)
+   __le16 num_elems;
+   struct ice_aqc_res_elem elem[1];
+};
+
 /* Add VSI (indirect 0x0210)
  * Update VSI (indirect 0x0211)
  * Get VSI (indirect 0x0212)
@@ -398,6 +439,202 @@ struct ice_aqc_vsi_props {
u8 reserved[24];
 };
 
+/* Add/Update/Remove/Get switch rules (indirect 0x02A0, 0x02A1, 0x02A2, 0x02A3)
+ */
+struct ice_aqc_sw_rules {
+   /* ops: add switch 

[PATCH v2 02/15] ice: Add support for control queues

2018-03-15 Thread Anirudh Venkataramanan
A control queue is a hardware interface which is used by the driver
to interact with other subsystems (like firmware, PHY, etc.). It is
implemented as a producer-consumer ring. More specifically, an
"admin queue" is a type of control queue used to interact with the
firmware.

This patch introduces data structures and functions to initialize
and tear down control/admin queues. Once the admin queue is initialized,
the driver uses it to get the firmware version.
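
For readers unfamiliar with the producer-consumer ring idea, a minimal user-space sketch follows. It is not the ice_controlq.c implementation: the `demo_*` names, the four-entry ring and the two-field descriptor are assumptions chosen to keep the example short. The driver plays the producer and the firmware the consumer.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_RING_LEN 4

/* Hypothetical, heavily reduced descriptor. */
struct demo_desc {
	uint16_t opcode; /* command opcode written by the producer */
	uint16_t retval; /* completion status written by the consumer */
};

struct demo_ctl_ring {
	struct demo_desc desc[DEMO_RING_LEN];
	uint16_t next_to_use;   /* producer index: next free slot */
	uint16_t next_to_clean; /* consumer index: oldest pending slot */
};

/* Producer side: post one command. Returns false when the ring is full. */
bool demo_ring_post(struct demo_ctl_ring *r, uint16_t opcode)
{
	uint16_t next = (uint16_t)((r->next_to_use + 1) % DEMO_RING_LEN);

	/* One slot stays empty so that full and empty can be told apart. */
	if (next == r->next_to_clean)
		return false;

	r->desc[r->next_to_use].opcode = opcode;
	r->desc[r->next_to_use].retval = 0;
	r->next_to_use = next;
	return true;
}

/* Consumer side: take the oldest command. Returns false when empty. */
bool demo_ring_consume(struct demo_ctl_ring *r, struct demo_desc *out)
{
	if (r->next_to_clean == r->next_to_use)
		return false;

	*out = r->desc[r->next_to_clean];
	r->next_to_clean = (uint16_t)((r->next_to_clean + 1) % DEMO_RING_LEN);
	return true;
}
```

Keeping one slot unused is a common way to distinguish a full ring from an empty one without a separate counter; whether the hardware uses the same convention is not stated in this patch.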

Signed-off-by: Anirudh Venkataramanan 
---
 drivers/net/ethernet/intel/ice/Makefile |   4 +-
 drivers/net/ethernet/intel/ice/ice.h|   1 +
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 108 +++
 drivers/net/ethernet/intel/ice/ice_common.c | 144 
 drivers/net/ethernet/intel/ice/ice_common.h |  39 +
 drivers/net/ethernet/intel/ice/ice_controlq.c   | 979 
 drivers/net/ethernet/intel/ice/ice_controlq.h   |  97 +++
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h |  46 ++
 drivers/net/ethernet/intel/ice/ice_main.c   |  11 +-
 drivers/net/ethernet/intel/ice/ice_osdep.h  |  86 +++
 drivers/net/ethernet/intel/ice/ice_status.h |  35 +
 drivers/net/ethernet/intel/ice/ice_type.h   |  22 +
 12 files changed, 1570 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_common.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_common.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_controlq.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_controlq.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_hw_autogen.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_osdep.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_status.h

diff --git a/drivers/net/ethernet/intel/ice/Makefile b/drivers/net/ethernet/intel/ice/Makefile
index 2a177ea21b74..eebf619e84a8 100644
--- a/drivers/net/ethernet/intel/ice/Makefile
+++ b/drivers/net/ethernet/intel/ice/Makefile
@@ -24,4 +24,6 @@
 
 obj-$(CONFIG_ICE) += ice.o
 
-ice-y := ice_main.o
+ice-y := ice_main.o    \
+ice_controlq.o \
+ice_common.o
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index d781027330cc..ea2fb63bb095 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ice_devids.h"
 #include "ice_type.h"
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
new file mode 100644
index ..885fa3c6fec4
--- /dev/null
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -0,0 +1,108 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Intel(R) Ethernet Connection E800 Series Linux Driver
+ * Copyright (c) 2018, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * The full GNU General Public License is included in this distribution in
+ * the file called "COPYING".
+ */
+
+#ifndef _ICE_ADMINQ_CMD_H_
+#define _ICE_ADMINQ_CMD_H_
+
+/* This header file defines the Admin Queue commands, error codes and
+ * descriptor format.  It is shared between Firmware and Software.
+ */
+
+struct ice_aqc_generic {
+   __le32 param0;
+   __le32 param1;
+   __le32 addr_high;
+   __le32 addr_low;
+};
+
+/* Get version (direct 0x0001) */
+struct ice_aqc_get_ver {
+   __le32 rom_ver;
+   __le32 fw_build;
+   u8 fw_branch;
+   u8 fw_major;
+   u8 fw_minor;
+   u8 fw_patch;
+   u8 api_branch;
+   u8 api_major;
+   u8 api_minor;
+   u8 api_patch;
+};
+
+/* Queue Shutdown (direct 0x0003) */
+struct ice_aqc_q_shutdown {
+#define ICE_AQC_DRIVER_UNLOADING   BIT(0)
+   __le32 driver_unloading;
+   u8 reserved[12];
+};
+
+/**
+ * struct ice_aq_desc - Admin Queue (AQ) descriptor
+ * @flags: ICE_AQ_FLAG_* flags
+ * @opcode: AQ command opcode
+ * @datalen: length in bytes of indirect/external data buffer
+ * @retval: return value from firmware
+ * @cookie_h: opaque data high-half
+ * @cookie_l: opaque data low-half
+ * @params: command-specific parameters
+ *
+ * Descriptor format for commands the driver posts on the Admin Transmit Queue
+ * (ATQ).  The firmware writes back onto the command descriptor and returns
+ * the result of the command.  Asynchronous events that are not an immediate
+ * result of the command are written to the Admin Receive Queue (ARQ) using
+ * the same 

[PATCH v2 13/15] ice: Update Tx scheduler tree for VSI multi-Tx queue support

2018-03-15 Thread Anirudh Venkataramanan
This patch adds the ability for a VSI to use multiple Tx queues. More
specifically, the patch:
1) Provides the ability to update the Tx scheduler tree in the
   firmware. The driver can configure the Tx scheduler tree by
   adding/removing multiple Tx queues per TC per VSI.

2) Allows a VSI to reconfigure its Tx queues during runtime.

3) Synchronizes the Tx scheduler update operations using locks.
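
To make "multiple Tx queues per TC per VSI" concrete, here is a small user-space sketch of walking a TC bitmap and recording the queue budget for each enabled traffic class. The parameter names echo ice_cfg_vsi_qs (tc_bitmap, maxqs), but the body, the `demo_*` names and the limit of eight traffic classes are assumptions for illustration; the locking from point 3 is deliberately left out.

```c
#include <stdint.h>

#define DEMO_MAX_TC             8
#define DEMO_DFLT_TRAFFIC_CLASS (1u << 0) /* same value as ICE_DFLT_TRAFFIC_CLASS */

/* Hypothetical per-VSI scheduler state: Tx queues granted to each TC. */
struct demo_vsi_sched {
	uint16_t txqs_per_tc[DEMO_MAX_TC];
};

/*
 * For every TC set in tc_bitmap, record maxqs[tc] queues; disabled TCs
 * get none. Returns the total number of Tx queues now configured.
 */
unsigned int demo_cfg_vsi_qs(struct demo_vsi_sched *vsi, uint8_t tc_bitmap,
			     const uint16_t *maxqs)
{
	unsigned int tc, total = 0;

	for (tc = 0; tc < DEMO_MAX_TC; tc++) {
		if (tc_bitmap & (1u << tc))
			vsi->txqs_per_tc[tc] = maxqs[tc];
		else
			vsi->txqs_per_tc[tc] = 0;
		total += vsi->txqs_per_tc[tc];
	}
	return total;
}
```

With the default configuration set by ice_vsi_set_tc_cfg (only TC 0 enabled, numtc = 1), just the first entry of maxqs would be used.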

Signed-off-by: Anirudh Venkataramanan 
---
 drivers/net/ethernet/intel/ice/ice.h|   7 +
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h |  28 +
 drivers/net/ethernet/intel/ice/ice_common.c |  54 ++
 drivers/net/ethernet/intel/ice/ice_common.h |   3 +
 drivers/net/ethernet/intel/ice/ice_main.c   |  20 +-
 drivers/net/ethernet/intel/ice/ice_sched.c  | 886 
 drivers/net/ethernet/intel/ice/ice_sched.h  |   4 +
 drivers/net/ethernet/intel/ice/ice_type.h   |   7 +
 8 files changed, 1006 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 6014ef9c36e1..cb1e8a127af1 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -56,6 +56,7 @@ extern const char ice_drv_ver[];
 #define ICE_MIN_NUM_DESC   8
 #define ICE_MAX_NUM_DESC   8160
 #define ICE_REQ_DESC_MULTIPLE  32
+#define ICE_DFLT_TRAFFIC_CLASS BIT(0)
 #define ICE_INT_NAME_STR_LEN   (IFNAMSIZ + 16)
 #define ICE_ETHTOOL_FWVER_LEN  32
 #define ICE_AQ_LEN 64
@@ -275,6 +276,12 @@ static inline void ice_irq_dynamic_ena(struct ice_hw *hw, struct ice_vsi *vsi,
wr32(hw, GLINT_DYN_CTL(vector), val);
 }
 
+static inline void ice_vsi_set_tc_cfg(struct ice_vsi *vsi)
+{
+   vsi->tc_cfg.ena_tc =  ICE_DFLT_TRAFFIC_CLASS;
+   vsi->tc_cfg.numtc = 1;
+}
+
 void ice_set_ethtool_ops(struct net_device *netdev);
 int ice_up(struct ice_vsi *vsi);
 int ice_down(struct ice_vsi *vsi);
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 2c8d8533f87d..62509635fc5e 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -645,6 +645,25 @@ struct ice_aqc_get_topo {
__le32 addr_low;
 };
 
+/* Update TSE (indirect 0x0403)
+ * Get TSE (indirect 0x0404)
+ */
+struct ice_aqc_get_cfg_elem {
+   __le16 num_elem_req;/* Used by commands */
+   __le16 num_elem_resp;   /* Used by responses */
+   __le32 reserved;
+   __le32 addr_high;
+   __le32 addr_low;
+};
+
+/* This is the buffer for:
+ * Suspend Nodes (indirect 0x0409)
+ * Resume Nodes (indirect 0x040A)
+ */
+struct ice_aqc_suspend_resume_elem {
+   __le32 teid[1];
+};
+
 /* Add TSE (indirect 0x0401)
  * Delete TSE (indirect 0x040F)
  * Move TSE (indirect 0x0408)
@@ -705,6 +724,11 @@ struct ice_aqc_txsched_topo_grp_info_hdr {
__le16 reserved2;
 };
 
+struct ice_aqc_add_elem {
+   struct ice_aqc_txsched_topo_grp_info_hdr hdr;
+   struct ice_aqc_txsched_elem_data generic[1];
+};
+
 struct ice_aqc_get_topo_elem {
struct ice_aqc_txsched_topo_grp_info_hdr hdr;
struct ice_aqc_txsched_elem_data
@@ -1195,6 +1219,7 @@ struct ice_aq_desc {
struct ice_aqc_get_sw_cfg get_sw_conf;
struct ice_aqc_sw_rules sw_rules;
struct ice_aqc_get_topo get_topo;
+   struct ice_aqc_get_cfg_elem get_update_elem;
struct ice_aqc_query_txsched_res query_sched_res;
struct ice_aqc_add_move_delete_elem add_move_delete_elem;
struct ice_aqc_nvm nvm;
@@ -1272,6 +1297,9 @@ enum ice_adminq_opc {
 
/* transmit scheduler commands */
ice_aqc_opc_get_dflt_topo   = 0x0400,
+   ice_aqc_opc_add_sched_elems = 0x0401,
+   ice_aqc_opc_suspend_sched_elems = 0x0409,
+   ice_aqc_opc_resume_sched_elems  = 0x040A,
ice_aqc_opc_delete_sched_elems  = 0x040F,
ice_aqc_opc_query_sched_res = 0x0412,
 
diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index 43cca9370444..958161a21115 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -2103,3 +2103,57 @@ ice_dis_vsi_txq(struct ice_port_info *pi, u8 num_queues, u16 *q_ids,
mutex_unlock(&pi->sched_lock);
return status;
 }
+
+/**
+ * ice_cfg_vsi_qs - configure the new/existing VSI queues
+ * @pi: port information structure
+ * @vsi_id: VSI Id
+ * @tc_bitmap: TC bitmap
+ * @maxqs: max queues array per TC
+ * @owner: lan or rdma
+ *
+ * This function adds/updates the VSI queues per TC.
+ */
+static enum ice_status
+ice_cfg_vsi_qs(struct ice_port_info *pi, u16 vsi_id, u8 tc_bitmap,
+  u16 *maxqs, u8 owner)
+{
+   enum 

[PATCH v2 04/15] ice: Get switch config, scheduler config and device capabilities

2018-03-15 Thread Anirudh Venkataramanan
This patch adds to the initialization flow by getting switch
configuration, scheduler configuration and device capabilities.

Switch configuration:
On boot, an L2 switch element is created in the firmware per physical
function. Each physical function is also mapped to a port, to which its
switch element is connected. In other words, this switch can be visualized
as an embedded vSwitch that can connect a physical function's virtual
station interfaces (VSIs) to the egress/ingress port. Egress/ingress
filters will be eventually created and applied on this switch element.
As part of the initialization flow, the driver gets configuration data
from this switch element and stores it.

Scheduler configuration:
The Tx scheduler is a subsystem responsible for setting and enforcing QoS.
As part of the initialization flow, the driver queries and stores the
default scheduler configuration for the given physical function.

Device capabilities:
As part of initialization, the driver has to determine what the device is
capable of (e.g. max queues, VSIs, etc). This information is obtained from
the firmware and stored by the driver.
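
A rough user-space sketch of how a capability list might be turned into stored limits is shown below. The capability IDs are the ICE_AQC_CAPS_* values defined in this patch; the `demo_*` structures are hypothetical, the element is reduced to two fields, and the values are assumed to be in CPU byte order already (the real buffer is little-endian).

```c
#include <stddef.h>
#include <stdint.h>

/* Capability IDs, same values as the ICE_AQC_CAPS_* defines in the patch. */
#define DEMO_CAPS_VSI     0x0017
#define DEMO_CAPS_RXQS    0x0041
#define DEMO_CAPS_TXQS    0x0042
#define DEMO_CAPS_MSIX    0x0043
#define DEMO_CAPS_MAX_MTU 0x0047

/* Reduced capability element: ID plus resource count only. */
struct demo_caps_elem {
	uint16_t cap;
	uint32_t number;
};

/* What the (hypothetical) driver remembers about the device. */
struct demo_caps {
	uint32_t num_vsi;
	uint32_t num_rxq;
	uint32_t num_txq;
	uint32_t num_msix;
	uint32_t max_mtu;
};

/* Copy each recognised capability into caps; unknown IDs are skipped. */
void demo_parse_caps(const struct demo_caps_elem *elems, size_t count,
		     struct demo_caps *caps)
{
	size_t i;

	for (i = 0; i < count; i++) {
		switch (elems[i].cap) {
		case DEMO_CAPS_VSI:
			caps->num_vsi = elems[i].number;
			break;
		case DEMO_CAPS_RXQS:
			caps->num_rxq = elems[i].number;
			break;
		case DEMO_CAPS_TXQS:
			caps->num_txq = elems[i].number;
			break;
		case DEMO_CAPS_MSIX:
			caps->num_msix = elems[i].number;
			break;
		case DEMO_CAPS_MAX_MTU:
			caps->max_mtu = elems[i].number;
			break;
		default:
			break; /* capability not tracked by this sketch */
		}
	}
}
```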

CC: Shannon Nelson 
Signed-off-by: Anirudh Venkataramanan 
---
v2: Addressed Shannon Nelson's review comment by changing retry count value
to 2.
---
 drivers/net/ethernet/intel/ice/Makefile |   4 +-
 drivers/net/ethernet/intel/ice/ice.h|   2 +
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 209 ++
 drivers/net/ethernet/intel/ice/ice_common.c | 231 
 drivers/net/ethernet/intel/ice/ice_common.h |   2 +
 drivers/net/ethernet/intel/ice/ice_sched.c  | 354 
 drivers/net/ethernet/intel/ice/ice_sched.h  |  42 +++
 drivers/net/ethernet/intel/ice/ice_switch.c | 158 +++
 drivers/net/ethernet/intel/ice/ice_switch.h |  28 ++
 drivers/net/ethernet/intel/ice/ice_type.h   | 109 
 10 files changed, 1138 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_sched.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_sched.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_switch.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_switch.h

diff --git a/drivers/net/ethernet/intel/ice/Makefile b/drivers/net/ethernet/intel/ice/Makefile
index 373d481dbb25..809d85c04398 100644
--- a/drivers/net/ethernet/intel/ice/Makefile
+++ b/drivers/net/ethernet/intel/ice/Makefile
@@ -27,4 +27,6 @@ obj-$(CONFIG_ICE) += ice.o
 ice-y := ice_main.o\
 ice_controlq.o \
 ice_common.o   \
-ice_nvm.o
+ice_nvm.o  \
+ice_switch.o   \
+ice_sched.o
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index ab2800c31906..f6e3339591bb 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -30,7 +30,9 @@
 #include 
 #include "ice_devids.h"
 #include "ice_type.h"
+#include "ice_switch.h"
 #include "ice_common.h"
+#include "ice_sched.h"
 
 #define ICE_BAR0   0
 #define ICE_AQ_LEN 64
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 05b22a1ffd70..66a3f41df673 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -22,6 +22,8 @@
  * descriptor format.  It is shared between Firmware and Software.
  */
 
+#define ICE_AQC_TOPO_MAX_LEVEL_NUM 0x9
+
 struct ice_aqc_generic {
__le32 param0;
__le32 param1;
@@ -82,6 +84,40 @@ struct ice_aqc_req_res {
u8 reserved[2];
 };
 
+/* Get function capabilities (indirect 0x000A)
+ * Get device capabilities (indirect 0x000B)
+ */
+struct ice_aqc_list_caps {
+   u8 cmd_flags;
+   u8 pf_index;
+   u8 reserved[2];
+   __le32 count;
+   __le32 addr_high;
+   __le32 addr_low;
+};
+
+/* Device/Function buffer entry, repeated per reported capability */
+struct ice_aqc_list_caps_elem {
+   __le16 cap;
+#define ICE_AQC_CAPS_VSI   0x0017
+#define ICE_AQC_CAPS_RSS   0x0040
+#define ICE_AQC_CAPS_RXQS  0x0041
+#define ICE_AQC_CAPS_TXQS  0x0042
+#define ICE_AQC_CAPS_MSIX  0x0043
+#define ICE_AQC_CAPS_MAX_MTU   0x0047
+
+   u8 major_ver;
+   u8 minor_ver;
+   /* Number of resources described by this capability */
+   __le32 number;
+   /* Only meaningful for some types of resources */
+   __le32 logical_id;
+   /* Only meaningful for some types of resources */
+   __le32 phys_id;
+   __le64 rsvd1;
+   __le64 rsvd2;
+};
+
 /* Clear PXE Command and response (direct 0x0110) */
 struct ice_aqc_clear_pxe {
u8 rx_cnt;
@@ -89,6 +125,161 @@ struct ice_aqc_clear_pxe {

[PATCH v2 12/15] ice: Add stats and ethtool support

2018-03-15 Thread Anirudh Venkataramanan
This patch implements a watchdog task to get packet statistics from
the device.
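
One common way for a periodic task to turn free-running hardware counters into driver statistics is to latch an offset on the first read and report the difference afterwards. The sketch below assumes that scheme, based only on the stat_offsets_loaded and eth_stats_prev fields this patch adds to struct ice_vsi; the function name, the 32-bit counter width and the wrap handling are assumptions, not the driver's code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Update one statistic from a 32-bit free-running hardware counter.
 * hw_val:        value just read from the (hypothetical) register
 * offset_loaded: false on the first read after load or reset
 * offset:        counter value latched on that first read
 * stat:          resulting statistic, counted from the latch point
 */
void demo_stat_update32(uint32_t hw_val, bool offset_loaded,
			uint64_t *offset, uint64_t *stat)
{
	if (!offset_loaded)
		*offset = hw_val;

	if (hw_val >= *offset)
		*stat = hw_val - *offset;
	else	/* the counter wrapped past 2^32 since the offset was taken */
		*stat = (hw_val + (1ULL << 32)) - *offset;
}
```

The wrap branch only covers a single wrap between the latch point and the current read, so the polling period has to be short enough for that to hold.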

This patch also adds support for the following ethtool operations:

ethtool devname
ethtool -s devname [msglvl N] [msglevel type on|off]
ethtool -g|--show-ring devname
ethtool -G|--set-ring devname [rx N] [tx N]
ethtool -i|--driver devname
ethtool -d|--register-dump devname [raw on|off] [hex on|off] [file name]
ethtool -k|--show-features|--show-offload devname
ethtool -K|--features|--offload devname feature on|off
ethtool -P|--show-permaddr devname
ethtool -S|--statistics devname
ethtool -a|--show-pause devname
ethtool -A|--pause devname [autoneg on|off] [rx on|off] [tx on|off]
ethtool -r|--negotiate devname

CC: Andrew Lunn 
CC: Jakub Kicinski 
CC: Stephen Hemminger 
Signed-off-by: Anirudh Venkataramanan 
---
v2: Addressed multiple review comments. Specifically,
1) Andrew Lunn's comment on PHY statistics.
2) Jakub Kicinski's comment on netdev stats.
3) Stephen Hemminger's comment on the net_stats_prev field.

Additionally, the code around stats collection was reworked a bit:
1) A new function ice_update_vsi_ring_stats was added to update ring
   stats. ice_get_stats64 which also reports ring stats was re-written
   to use this function.
2) Calls to ice_update_vsi_stats and ice_update_pf_stats in
   ice_get_ethtool_stats were removed, as this is done by the
   watchdog task anyway.
---
 drivers/net/ethernet/intel/ice/Makefile |   3 +-
 drivers/net/ethernet/intel/ice/ice.h|  28 +-
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h |  43 ++
 drivers/net/ethernet/intel/ice/ice_common.c | 195 +
 drivers/net/ethernet/intel/ice/ice_common.h |   5 +
 drivers/net/ethernet/intel/ice/ice_ethtool.c| 954 
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h |  80 ++
 drivers/net/ethernet/intel/ice/ice_main.c   | 469 +++-
 drivers/net/ethernet/intel/ice/ice_type.h   |  70 ++
 9 files changed, 1842 insertions(+), 5 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_ethtool.c

diff --git a/drivers/net/ethernet/intel/ice/Makefile b/drivers/net/ethernet/intel/ice/Makefile
index 0abeb20c006d..643d63016624 100644
--- a/drivers/net/ethernet/intel/ice/Makefile
+++ b/drivers/net/ethernet/intel/ice/Makefile
@@ -30,4 +30,5 @@ ice-y := ice_main.o   \
 ice_nvm.o  \
 ice_switch.o   \
 ice_sched.o    \
-ice_txrx.o
+ice_txrx.o \
+ice_ethtool.o
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index f10ae53cc4ac..6014ef9c36e1 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -27,12 +27,14 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -48,10 +50,14 @@
 #include "ice_common.h"
 #include "ice_sched.h"
 
+extern const char ice_drv_ver[];
 #define ICE_BAR0   0
 #define ICE_DFLT_NUM_DESC  128
+#define ICE_MIN_NUM_DESC   8
+#define ICE_MAX_NUM_DESC   8160
 #define ICE_REQ_DESC_MULTIPLE  32
 #define ICE_INT_NAME_STR_LEN   (IFNAMSIZ + 16)
+#define ICE_ETHTOOL_FWVER_LEN  32
 #define ICE_AQ_LEN 64
 #define ICE_MIN_MSIX   2
 #define ICE_NO_VSI 0x
@@ -70,6 +76,8 @@
 #define ICE_RES_MISC_VEC_ID(ICE_RES_VALID_BIT - 1)
 #define ICE_INVAL_Q_INDEX  0x
 
+#define ICE_VSIQF_HKEY_ARRAY_SIZE  ((VSIQF_HKEY_MAX_INDEX + 1) *   4)
+
 #define ICE_DFLT_NETIF_M (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK)
 
 #define ICE_MAX_MTU(ICE_AQ_SET_MAC_FRAME_SIZE_MAX - \
@@ -116,6 +124,7 @@ enum ice_state {
__ICE_DOWN,
__ICE_PFR_REQ,  /* set by driver and peers */
__ICE_ADMINQ_EVENT_PENDING,
+   __ICE_CFG_BUSY,
__ICE_SERVICE_SCHED,
__ICE_STATE_NBITS   /* must be last */
 };
@@ -132,8 +141,13 @@ struct ice_vsi {
 
irqreturn_t (*irq_handler)(int irq, void *data);
 
+   u64 tx_linearize;
DECLARE_BITMAP(state, __ICE_STATE_NBITS);
unsigned long active_vlans[BITS_TO_LONGS(VLAN_N_VID)];
+   u32 tx_restart;
+   u32 tx_busy;
+   u32 rx_buf_failed;
+   u32 rx_page_failed;
int num_q_vectors;
int base_vector;
enum ice_vsi_type type;
@@ -155,8 +169,14 @@ struct ice_vsi {
 
struct ice_aqc_vsi_props info;   /* VSI properties */
 
+   /* VSI stats */
+   struct rtnl_link_stats64 net_stats;
+   struct ice_eth_stats eth_stats;
+   struct ice_eth_stats eth_stats_prev;
+
bool irqs_ready;
bool current_isup;   /* Sync 'link up' logging */
+   bool stat_offsets_loaded;
 
/* queue information */
u8 tx_mapping_mode;  /* ICE_MAP_MODE_[CONTIG|SCATTER] 

[PATCH v2 03/15] ice: Start hardware initialization

2018-03-15 Thread Anirudh Venkataramanan
This patch implements multiple pieces of the initialization flow
as follows:

1) A reset is issued to ensure a clean device state, followed
   by initialization of admin queue interface.

2) Once the admin queue interface is up, clear the PF config
   and transition the device to non-PXE mode.

3) Get the NVM configuration stored in the device's non-volatile
   memory (NVM) using ice_init_nvm.

CC: Shannon Nelson 
Signed-off-by: Anirudh Venkataramanan 
---
v2: Addressed Shannon Nelson's review comments by
1) removing an unnecessary register write in ice_aq_clear_pxe_mode.
2) adding a comment explaining the need to convert word sized values
   to byte sized values.
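
The word-to-byte conversion mentioned in item 2 can be illustrated with a small sketch. The assumptions here are mine and are not confirmed by this excerpt: that an NVM word is 16 bits wide and that the read command expects offset and length in bytes. The names `NVM_WORD_BYTES`, `nvm_byte_range` and `nvm_words_to_bytes` are hypothetical.

```c
#include <stdint.h>

/* Assumption: one NVM word is 16 bits, i.e. 2 bytes. */
#define NVM_WORD_BYTES 2u

/* Hypothetical holder for a byte-addressed range. */
struct nvm_byte_range {
	uint32_t offset;	/* start of the range, in bytes */
	uint32_t length;	/* size of the range, in bytes */
};

/* Convert a word-addressed range into the byte-addressed equivalent. */
static struct nvm_byte_range nvm_words_to_bytes(uint16_t word_offset,
						uint16_t word_count)
{
	struct nvm_byte_range r;

	/* Widen before multiplying so large word values cannot overflow. */
	r.offset = (uint32_t)word_offset * NVM_WORD_BYTES;
	r.length = (uint32_t)word_count * NVM_WORD_BYTES;
	return r;
}
```

Keeping the conversion in one commented place makes it harder to mix the two units when callers think in words and the command interface thinks in bytes.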
---
 drivers/net/ethernet/intel/ice/Makefile |   3 +-
 drivers/net/ethernet/intel/ice/ice.h|   2 +
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h |  79 +
 drivers/net/ethernet/intel/ice/ice_common.c | 405 
 drivers/net/ethernet/intel/ice/ice_common.h |  11 +
 drivers/net/ethernet/intel/ice/ice_controlq.h   |   3 +
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h |  30 ++
 drivers/net/ethernet/intel/ice/ice_main.c   |  31 ++
 drivers/net/ethernet/intel/ice/ice_nvm.c| 250 +++
 drivers/net/ethernet/intel/ice/ice_osdep.h  |   1 +
 drivers/net/ethernet/intel/ice/ice_status.h |   5 +
 drivers/net/ethernet/intel/ice/ice_type.h   |  49 +++
 12 files changed, 868 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_nvm.c

diff --git a/drivers/net/ethernet/intel/ice/Makefile b/drivers/net/ethernet/intel/ice/Makefile
index eebf619e84a8..373d481dbb25 100644
--- a/drivers/net/ethernet/intel/ice/Makefile
+++ b/drivers/net/ethernet/intel/ice/Makefile
@@ -26,4 +26,5 @@ obj-$(CONFIG_ICE) += ice.o
 
 ice-y := ice_main.o\
 ice_controlq.o \
-ice_common.o
+ice_common.o   \
+ice_nvm.o
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index ea2fb63bb095..ab2800c31906 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -30,8 +30,10 @@
 #include 
 #include "ice_devids.h"
 #include "ice_type.h"
+#include "ice_common.h"
 
 #define ICE_BAR0   0
+#define ICE_AQ_LEN 64
 
 #define ICE_DFLT_NETIF_M (NETIF_MSG_DRV | NETIF_MSG_PROBE | NETIF_MSG_LINK)
 
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 885fa3c6fec4..05b22a1ffd70 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -50,6 +50,67 @@ struct ice_aqc_q_shutdown {
u8 reserved[12];
 };
 
+/* Request resource ownership (direct 0x0008)
+ * Release resource ownership (direct 0x0009)
+ */
+struct ice_aqc_req_res {
+   __le16 res_id;
+#define ICE_AQC_RES_ID_NVM 1
+#define ICE_AQC_RES_ID_SDP 2
+#define ICE_AQC_RES_ID_CHNG_LOCK   3
+#define ICE_AQC_RES_ID_GLBL_LOCK   4
+   __le16 access_type;
+#define ICE_AQC_RES_ACCESS_READ1
+#define ICE_AQC_RES_ACCESS_WRITE   2
+
+   /* Upon successful completion, FW writes this value and driver is
+* expected to release resource before timeout. This value is provided
+* in milliseconds.
+*/
+   __le32 timeout;
+#define ICE_AQ_RES_NVM_READ_DFLT_TIMEOUT_MS3000
+#define ICE_AQ_RES_NVM_WRITE_DFLT_TIMEOUT_MS   18
+#define ICE_AQ_RES_CHNG_LOCK_DFLT_TIMEOUT_MS   1000
+#define ICE_AQ_RES_GLBL_LOCK_DFLT_TIMEOUT_MS   3000
+   /* For SDP: pin id of the SDP */
+   __le32 res_number;
+   /* Status is only used for ICE_AQC_RES_ID_GLBL_LOCK */
+   __le16 status;
+#define ICE_AQ_RES_GLBL_SUCCESS0
+#define ICE_AQ_RES_GLBL_IN_PROG1
+#define ICE_AQ_RES_GLBL_DONE   2
+   u8 reserved[2];
+};
+
+/* Clear PXE Command and response (direct 0x0110) */
+struct ice_aqc_clear_pxe {
+   u8 rx_cnt;
+#define ICE_AQC_CLEAR_PXE_RX_CNT   0x2
+   u8 reserved[15];
+};
+
+/* NVM Read command (indirect 0x0701)
+ * NVM Erase commands (direct 0x0702)
+ * NVM Update commands (indirect 0x0703)
+ */
+struct ice_aqc_nvm {
+   u8  cmd_flags;
+#define ICE_AQC_NVM_LAST_CMD   BIT(0)
+#define ICE_AQC_NVM_PCIR_REQ   BIT(0)  /* Used by NVM Update reply */
+#define ICE_AQC_NVM_PRESERVATION_S 1
+#define ICE_AQC_NVM_PRESERVATION_M (3 << CSR_AQ_NVM_PRESERVATION_S)
+#define ICE_AQC_NVM_NO_PRESERVATION(0 << CSR_AQ_NVM_PRESERVATION_S)
+#define ICE_AQC_NVM_PRESERVE_ALL   BIT(1)
+#define ICE_AQC_NVM_PRESERVE_SELECTED  (3 << CSR_AQ_NVM_PRESERVATION_S)
+#define ICE_AQC_NVM_FLASH_ONLY BIT(7)
+   u8  module_typeid;
+   __le16  length;
+#define ICE_AQC_NVM_ERASE_LEN  0x
+   __le32  offset;
+   __le32  addr_high;
+   __le32  

[PATCH v2 14/15] ice: Support link events, reset and rebuild

2018-03-15 Thread Anirudh Venkataramanan
Link events are posted to a PF's admin receive queue (ARQ). This patch
adds the ability to detect and process link events.

This patch also adds the ability to process resets.

The driver can process the following resets:
1) EMP Reset (EMPR)
2) Global Reset (GLOBR)
3) Core Reset (CORER)
4) Physical Function Reset (PFR)

EMPR is the largest level of reset that the driver can handle. An EMPR
resets the manageability block and also the data path, including PHY and
link for all the PFs. The affected PFs are notified of this event through
a miscellaneous interrupt.

GLOBR is a subset of EMPR. It does everything EMPR does except that it
doesn't reset the manageability block.

CORER is a subset of GLOBR. It does everything GLOBR does but doesn't
reset PHY and link.

PFR is a subset of CORER and affects only the given physical function.
In other words, PFR can be thought of as a CORER for a single PF. Since
only the issuing PF is affected, a PFR doesn't result in the miscellaneous
interrupt being triggered.

All the resets have the following in common:
1) Tx/Rx is halted and all queues are stopped.
2) All the VSIs and filters programmed for the PF are lost and have to be
   reprogrammed.
3) Control queue interfaces are reset and have to be reprogrammed.

In the rebuild flow, control queues are reinitialized, VSIs are reallocated
and filters are restored.
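
The nesting of the four reset levels described above can be summarized in a small sketch. This is only my reading of the description, written as standalone C. The names `reset_level`, `reset_scope` and `reset_scope_of` are hypothetical and do not appear in the patch.

```c
#include <stdbool.h>

/* Reset levels ordered from narrowest to widest scope. */
enum reset_level {
	RESET_PFR,	/* Physical Function Reset */
	RESET_CORER,	/* Core Reset */
	RESET_GLOBR,	/* Global Reset */
	RESET_EMPR	/* EMP Reset */
};

/* What a given reset level touches, per the description above. */
struct reset_scope {
	bool all_pfs;		/* affects every PF, not only the issuer */
	bool phy_and_link;	/* PHY and link are reset */
	bool manageability;	/* manageability block is reset */
	bool misc_interrupt;	/* affected PFs see the miscellaneous interrupt */
};

/* Each level is a superset of the one below it, so comparisons suffice. */
static struct reset_scope reset_scope_of(enum reset_level lvl)
{
	struct reset_scope s;

	s.all_pfs = lvl >= RESET_CORER;
	s.phy_and_link = lvl >= RESET_GLOBR;
	s.manageability = lvl == RESET_EMPR;
	s.misc_interrupt = lvl >= RESET_CORER;
	return s;
}
```

Whatever the level, the three common effects still apply: queues stop, VSIs and filters are lost, and the control queues must be reinitialized during rebuild.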

Signed-off-by: Anirudh Venkataramanan 
---
 drivers/net/ethernet/intel/ice/ice.h|  19 +
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h |  19 +
 drivers/net/ethernet/intel/ice/ice_common.c |  60 +++
 drivers/net/ethernet/intel/ice/ice_common.h |   5 +
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h |   2 +
 drivers/net/ethernet/intel/ice/ice_main.c   | 581 +++-
 drivers/net/ethernet/intel/ice/ice_type.h   |   1 +
 7 files changed, 681 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index cb1e8a127af1..6d7d03b80dbf 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -92,6 +92,11 @@ extern const char ice_drv_ver[];
 #define ICE_RX_DESC(R, i) (&(((union ice_32b_rx_flex_desc *)((R)->desc))[i]))
 #define ICE_TX_CTX_DESC(R, i) (&(((struct ice_tx_ctx_desc *)((R)->desc))[i]))
 
+/* Macro for each VSI in a PF */
+#define ice_for_each_vsi(pf, i) \
+   for ((i) = 0; (i) < (pf)->num_alloc_vsi; (i)++)
+
+/* Macros for each tx/rx ring in a VSI */
 #define ice_for_each_txq(vsi, i) \
for ((i) = 0; (i) < (vsi)->num_txq; (i)++)
 
@@ -123,7 +128,16 @@ struct ice_sw {
 
 enum ice_state {
__ICE_DOWN,
+   __ICE_NEEDS_RESTART,
+   __ICE_RESET_RECOVERY_PENDING,   /* set by driver when reset starts */
__ICE_PFR_REQ,  /* set by driver and peers */
+   __ICE_CORER_REQ,/* set by driver and peers */
+   __ICE_GLOBR_REQ,/* set by driver and peers */
+   __ICE_CORER_RECV,   /* set by OICR handler */
+   __ICE_GLOBR_RECV,   /* set by OICR handler */
+   __ICE_EMPR_RECV,/* set by OICR handler */
+   __ICE_SUSPENDED,/* set on module remove path */
+   __ICE_RESET_FAILED, /* set by reset/rebuild */
__ICE_ADMINQ_EVENT_PENDING,
__ICE_CFG_BUSY,
__ICE_SERVICE_SCHED,
@@ -240,6 +254,11 @@ struct ice_pf {
u16 q_left_rx;  /* remaining num rx queues left unclaimed */
u16 next_vsi;   /* Next free slot in pf->vsi[] - 0-based! */
u16 num_alloc_vsi;
+   u16 corer_count;/* Core reset count */
+   u16 globr_count;/* Global reset count */
+   u16 empr_count; /* EMP reset count */
+   u16 pfr_count;  /* PF reset count */
+
struct ice_hw_port_stats stats;
struct ice_hw_port_stats stats_prev;
struct ice_hw hw;
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 62509635fc5e..8cade22c1cf6 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -1023,6 +1023,23 @@ struct ice_aqc_get_link_status_data {
__le64 reserved4;
 };
 
+/* Set event mask command (direct 0x0613) */
+struct ice_aqc_set_event_mask {
+   u8  lport_num;
+   u8  reserved[7];
+   __le16  event_mask;
+#define ICE_AQ_LINK_EVENT_UPDOWN   BIT(1)
+#define ICE_AQ_LINK_EVENT_MEDIA_NA BIT(2)
+#define ICE_AQ_LINK_EVENT_LINK_FAULT   BIT(3)
+#define ICE_AQ_LINK_EVENT_PHY_TEMP_ALARM   BIT(4)
+#define ICE_AQ_LINK_EVENT_EXCESSIVE_ERRORS BIT(5)
+#define ICE_AQ_LINK_EVENT_SIGNAL_DETECTBIT(6)
+#define ICE_AQ_LINK_EVENT_AN_COMPLETED BIT(7)
+#define ICE_AQ_LINK_EVENT_MODULE_QUAL_FAIL BIT(8)
+#define 

[PATCH v2 11/15] ice: Add support for VLANs and offloads

2018-03-15 Thread Anirudh Venkataramanan
This patch adds support for VLANs. When a VLAN is created a switch filter
is added to direct the VLAN traffic to the corresponding VSI. When a VLAN
is deleted, the filter is deleted as well.

This patch also adds support for the following hardware offloads.
1) VLAN tag insertion/stripping
2) Receive Side Scaling (RSS)
3) Tx checksum and TCP segmentation
4) Rx checksum
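
For the RSS offload in item 2, one common way to populate a lookup table (LUT) is to spread the table entries round-robin over the allocated queues. The sketch below shows that technique in standalone C. Whether the patch fills its LUT exactly this way is not visible in this excerpt, and `fill_rss_lut_sketch` is a hypothetical name.

```c
#include <stdint.h>

/*
 * Fill an RSS lookup table so that entry i maps to queue (i % rss_size).
 * table_size is the number of LUT entries, rss_size the number of queues.
 */
static void fill_rss_lut_sketch(uint8_t *lut, uint16_t table_size,
				uint16_t rss_size)
{
	uint16_t i;

	/* Nothing sensible to do without queues; also avoids modulo by zero. */
	if (!rss_size)
		return;

	for (i = 0; i < table_size; i++)
		lut[i] = (uint8_t)(i % rss_size);
}
```

With a table of 8 entries and 3 queues this yields 0, 1, 2, 0, 1, 2, 0, 1, so hashed flows are distributed as evenly as the table size allows.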

Signed-off-by: Anirudh Venkataramanan 
---
 drivers/net/ethernet/intel/ice/ice.h|  19 +
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h |  62 +++
 drivers/net/ethernet/intel/ice/ice_common.c | 188 
 drivers/net/ethernet/intel/ice/ice_common.h |  13 +
 drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h  | 169 +++
 drivers/net/ethernet/intel/ice/ice_main.c   | 601 +++-
 drivers/net/ethernet/intel/ice/ice_switch.c | 169 +++
 drivers/net/ethernet/intel/ice/ice_switch.h |   4 +
 drivers/net/ethernet/intel/ice/ice_txrx.c   | 405 +++-
 drivers/net/ethernet/intel/ice/ice_txrx.h   |  17 +
 10 files changed, 1631 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 7998e57994bf..f10ae53cc4ac 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -37,7 +37,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
+#include 
 #include "ice_devids.h"
 #include "ice_type.h"
 #include "ice_txrx.h"
@@ -61,6 +64,8 @@
 #define ICE_MAX_SCATTER_RXQS   16
 #define ICE_Q_WAIT_RETRY_LIMIT 10
 #define ICE_Q_WAIT_MAX_RETRY   (5 * ICE_Q_WAIT_RETRY_LIMIT)
+#define ICE_MAX_LG_RSS_QS  256
+#define ICE_MAX_SMALL_RSS_QS   8
 #define ICE_RES_VALID_BIT  0x8000
 #define ICE_RES_MISC_VEC_ID(ICE_RES_VALID_BIT - 1)
 #define ICE_INVAL_Q_INDEX  0x
@@ -76,6 +81,7 @@
 
 #define ICE_TX_DESC(R, i) (&(((struct ice_tx_desc *)((R)->desc))[i]))
 #define ICE_RX_DESC(R, i) (&(((union ice_32b_rx_flex_desc *)((R)->desc))[i]))
+#define ICE_TX_CTX_DESC(R, i) (&(((struct ice_tx_ctx_desc *)((R)->desc))[i]))
 
 #define ice_for_each_txq(vsi, i) \
for ((i) = 0; (i) < (vsi)->num_txq; (i)++)
@@ -127,6 +133,7 @@ struct ice_vsi {
irqreturn_t (*irq_handler)(int irq, void *data);
 
DECLARE_BITMAP(state, __ICE_STATE_NBITS);
+   unsigned long active_vlans[BITS_TO_LONGS(VLAN_N_VID)];
int num_q_vectors;
int base_vector;
enum ice_vsi_type type;
@@ -136,6 +143,13 @@ struct ice_vsi {
/* Interrupt thresholds */
u16 work_lmt;
 
+   /* RSS config */
+   u16 rss_table_size; /* HW RSS table size */
+   u16 rss_size;   /* Allocated RSS queues */
+   u8 *rss_hkey_user;  /* User configured hash keys */
+   u8 *rss_lut_user;   /* User configured lookup table entries */
+   u8 rss_lut_type;/* used to configure Get/Set RSS LUT AQ call */
+
u16 max_frame;
u16 rx_buf_len;
 
@@ -195,6 +209,7 @@ struct ice_pf {
struct mutex avail_q_mutex; /* protects access to avail_[rx|tx]qs */
struct mutex sw_mutex;  /* lock for protecting VSI alloc flow */
u32 msg_enable;
+   u32 hw_csum_rx_error;
u32 oicr_idx;   /* Other interrupt cause vector index */
u32 num_lan_msix;   /* Total MSIX vectors for base driver */
u32 num_avail_msix; /* remaining MSIX vectors left unclaimed */
@@ -238,4 +253,8 @@ static inline void ice_irq_dynamic_ena(struct ice_hw *hw, struct ice_vsi *vsi,
wr32(hw, GLINT_DYN_CTL(vector), val);
 }
 
+int ice_set_rss(struct ice_vsi *vsi, u8 *seed, u8 *lut, u16 lut_size);
+int ice_get_rss(struct ice_vsi *vsi, u8 *seed, u8 *lut, u16 lut_size);
+void ice_fill_rss_lut(u8 *lut, u16 rss_table_size, u16 rss_size);
+
 #endif /* _ICE_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 358a482630db..49102817f0a9 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -982,6 +982,60 @@ struct ice_aqc_nvm {
__le32  addr_low;
 };
 
+/* Get/Set RSS key (indirect 0x0B04/0x0B02) */
+struct ice_aqc_get_set_rss_key {
+#define ICE_AQC_GSET_RSS_KEY_VSI_VALID BIT(15)
+#define ICE_AQC_GSET_RSS_KEY_VSI_ID_S  0
+#define ICE_AQC_GSET_RSS_KEY_VSI_ID_M  (0x3FF << ICE_AQC_GSET_RSS_KEY_VSI_ID_S)
+   __le16 vsi_id;
+   u8 reserved[6];
+   __le32 addr_high;
+   __le32 addr_low;
+};
+
+#define ICE_AQC_GET_SET_RSS_KEY_DATA_RSS_KEY_SIZE  0x28
+#define ICE_AQC_GET_SET_RSS_KEY_DATA_HASH_KEY_SIZE 0xC
+
+struct ice_aqc_get_set_rss_keys {
+   u8 standard_rss_key[ICE_AQC_GET_SET_RSS_KEY_DATA_RSS_KEY_SIZE];
+   u8 extended_hash_key[ICE_AQC_GET_SET_RSS_KEY_DATA_HASH_KEY_SIZE];
+};
+
+/* Get/Set RSS LUT (indirect 0x0B05/0x0B03) */
+struct  ice_aqc_get_set_rss_lut {
+#define 

[PATCH v2 15/15] ice: Implement filter sync, NDO operations and bump version

2018-03-15 Thread Anirudh Venkataramanan
This patch implements multiple pieces of functionality:

1. Add ice_vsi_sync_filters, which is called through the service task
   to push filter updates to the hardware.

2. Add support to enable/disable promiscuous mode on an interface.
   Enabling/disabling promiscuous mode on an interface results in
   addition/removal of a promisc filter rule through ice_vsi_sync_filters.

3. Implement handlers for ndo_set_mac_address, ndo_change_mtu,
   ndo_poll_controller and ndo_set_rx_mode.
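
The promiscuous-mode handling in item 2 depends on noticing when the interface flags change between two rx-mode updates. A minimal standalone sketch of that detection follows. The field name `current_netdev_flags` mirrors the one added to struct ice_vsi in this patch, but `vsi_sketch`, `SKETCH_IFF_PROMISC` and `note_rx_mode_change` are hypothetical, and the flag value is an assumption made so the example compiles without kernel headers.

```c
#include <stdbool.h>

/* Assumption: stand-in for the kernel's IFF_PROMISC flag bit. */
#define SKETCH_IFF_PROMISC 0x100u

/* Hypothetical, trimmed-down view of the per-VSI state. */
struct vsi_sketch {
	unsigned int current_netdev_flags;	/* flags seen at the last update */
	bool promisc_changed;			/* pending work for the sync step */
};

/*
 * Record the new interface flags and mark a pending promiscuous update if
 * the promiscuous bit differs from what was last seen. Returns whether a
 * promiscuous update is pending after this call.
 */
static bool note_rx_mode_change(struct vsi_sketch *vsi, unsigned int new_flags)
{
	unsigned int changed = vsi->current_netdev_flags ^ new_flags;

	vsi->current_netdev_flags = new_flags;
	if (changed & SKETCH_IFF_PROMISC)
		vsi->promisc_changed = true;

	return vsi->promisc_changed;
}
```

The rx-mode callback itself stays cheap this way: it only records that something changed, and the deferred sync step does the actual filter programming.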

This patch also marks the end of the driver addition by bumping up the
driver version.

Signed-off-by: Anirudh Venkataramanan 
---
 drivers/net/ethernet/intel/ice/ice.h|  14 +
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h |  21 +
 drivers/net/ethernet/intel/ice/ice_common.c |  28 ++
 drivers/net/ethernet/intel/ice/ice_common.h |   3 +
 drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h  |  12 +
 drivers/net/ethernet/intel/ice/ice_main.c   | 567 +++-
 drivers/net/ethernet/intel/ice/ice_switch.c |  77 
 drivers/net/ethernet/intel/ice/ice_switch.h |   2 +
 drivers/net/ethernet/intel/ice/ice_type.h   |   5 +
 9 files changed, 728 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 6d7d03b80dbf..9bb8a99b929e 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -139,11 +139,20 @@ enum ice_state {
__ICE_SUSPENDED,/* set on module remove path */
__ICE_RESET_FAILED, /* set by reset/rebuild */
__ICE_ADMINQ_EVENT_PENDING,
+   __ICE_FLTR_OVERFLOW_PROMISC,
__ICE_CFG_BUSY,
__ICE_SERVICE_SCHED,
__ICE_STATE_NBITS   /* must be last */
 };
 
+enum ice_vsi_flags {
+   ICE_VSI_FLAG_UMAC_FLTR_CHANGED,
+   ICE_VSI_FLAG_MMAC_FLTR_CHANGED,
+   ICE_VSI_FLAG_VLAN_FLTR_CHANGED,
+   ICE_VSI_FLAG_PROMISC_CHANGED,
+   ICE_VSI_FLAG_NBITS  /* must be last */
+};
+
 /* struct that defines a VSI, associated with a dev */
 struct ice_vsi {
struct net_device *netdev;
@@ -158,7 +167,9 @@ struct ice_vsi {
 
u64 tx_linearize;
DECLARE_BITMAP(state, __ICE_STATE_NBITS);
+   DECLARE_BITMAP(flags, ICE_VSI_FLAG_NBITS);
unsigned long active_vlans[BITS_TO_LONGS(VLAN_N_VID)];
+   unsigned int current_netdev_flags;
u32 tx_restart;
u32 tx_busy;
u32 rx_buf_failed;
@@ -189,6 +200,9 @@ struct ice_vsi {
struct ice_eth_stats eth_stats;
struct ice_eth_stats eth_stats_prev;
 
+   struct list_head tmp_sync_list; /* MAC filters to be synced */
+   struct list_head tmp_unsync_list;   /* MAC filters to be unsynced */
+
bool irqs_ready;
bool current_isup;   /* Sync 'link up' logging */
bool stat_offsets_loaded;
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 8cade22c1cf6..fc19c287ebc5 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -149,6 +149,24 @@ struct ice_aqc_manage_mac_read_resp {
u8 mac_addr[ETH_ALEN];
 };
 
+/* Manage MAC address, write command - direct (0x0108) */
+struct ice_aqc_manage_mac_write {
+   u8 port_num;
+   u8 flags;
+#define ICE_AQC_MAN_MAC_WR_MC_MAG_EN   BIT(0)
+#define ICE_AQC_MAN_MAC_WR_WOL_LAA_PFR_KEEPBIT(1)
+#define ICE_AQC_MAN_MAC_WR_S   6
+#define ICE_AQC_MAN_MAC_WR_M   (3 << ICE_AQC_MAN_MAC_WR_S)
+#define ICE_AQC_MAN_MAC_UPDATE_LAA 0
+#define ICE_AQC_MAN_MAC_UPDATE_LAA_WOL (BIT(0) << ICE_AQC_MAN_MAC_WR_S)
+   /* High 16 bits of MAC address in big endian order */
+   __be16 sah;
+   /* Low 32 bits of MAC address in big endian order */
+   __be32 sal;
+   __le32 addr_high;
+   __le32 addr_low;
+};
+
 /* Clear PXE Command and response (direct 0x0110) */
 struct ice_aqc_clear_pxe {
u8 rx_cnt;
@@ -1228,6 +1246,7 @@ struct ice_aq_desc {
struct ice_aqc_q_shutdown q_shutdown;
struct ice_aqc_req_res res_owner;
struct ice_aqc_manage_mac_read mac_read;
+   struct ice_aqc_manage_mac_write mac_write;
struct ice_aqc_clear_pxe clear_pxe;
struct ice_aqc_list_caps get_cap;
struct ice_aqc_get_phy_caps get_phy;
@@ -1272,6 +1291,7 @@ enum ice_aq_err {
ICE_AQ_RC_ENOMEM= 9,  /* Out of memory */
ICE_AQ_RC_EBUSY = 12, /* Device or resource busy */
ICE_AQ_RC_EEXIST= 13, /* object already exists */
+   ICE_AQ_RC_ENOSPC= 16, /* No space left or allocation failure */
 };
 
 /* Admin Queue command opcodes */
@@ -1290,6 +1310,7 @@ enum ice_adminq_opc {
 
/* manage MAC address */
ice_aqc_opc_manage_mac_read = 

[PATCH v2 01/15] ice: Add basic driver framework for Intel(R) E800 Series

2018-03-15 Thread Anirudh Venkataramanan
This patch adds a basic driver framework for the Intel(R) E800 Ethernet
Series of network devices. There is no functionality right now other than
the ability to load.

Signed-off-by: Anirudh Venkataramanan 
---
 Documentation/networking/ice.txt|  39 +++
 MAINTAINERS |   1 +
 drivers/net/ethernet/intel/Kconfig  |  14 +++
 drivers/net/ethernet/intel/Makefile |   1 +
 drivers/net/ethernet/intel/ice/Makefile |  27 +
 drivers/net/ethernet/intel/ice/ice.h|  48 
 drivers/net/ethernet/intel/ice/ice_devids.h |  33 ++
 drivers/net/ethernet/intel/ice/ice_main.c   | 172 
 drivers/net/ethernet/intel/ice/ice_type.h   |  42 +++
 9 files changed, 377 insertions(+)
 create mode 100644 Documentation/networking/ice.txt
 create mode 100644 drivers/net/ethernet/intel/ice/Makefile
 create mode 100644 drivers/net/ethernet/intel/ice/ice.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_devids.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_main.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_type.h

diff --git a/Documentation/networking/ice.txt b/Documentation/networking/ice.txt
new file mode 100644
index ..6261c46378e1
--- /dev/null
+++ b/Documentation/networking/ice.txt
@@ -0,0 +1,39 @@
+Intel(R) Ethernet Connection E800 Series Linux Driver
+=====================================================
+
+Intel ice Linux driver.
+Copyright(c) 2018 Intel Corporation.
+
+Contents
+========
+- Enabling the driver
+- Support
+
+The driver in this release supports Intel's E800 Series of products. For
+more information, visit Intel's support page at http://support.intel.com.
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command:
+
+ Make oldconfig/silentoldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+   -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+   -> Ethernet driver support
+ -> Intel devices
+   -> Intel(R) Ethernet Connection E800 Series Support
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+http://support.intel.com
+
+If an issue is identified with the released source code, please email
+the maintainer listed in the MAINTAINERS file.
diff --git a/MAINTAINERS b/MAINTAINERS
index 079af8b7ae8e..f1fe3bbec595 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7019,6 +7019,7 @@ F:Documentation/networking/ixgbe.txt
 F: Documentation/networking/ixgbevf.txt
 F: Documentation/networking/i40e.txt
 F: Documentation/networking/i40evf.txt
+F: Documentation/networking/ice.txt
 F: drivers/net/ethernet/intel/
 F: drivers/net/ethernet/intel/*/
 F: include/linux/avf/virtchnl.h
diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index 1feb54b6d92e..14d287bed33c 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -251,6 +251,20 @@ config I40EVF
  will be called i40evf.  MSI-X interrupt support is required
  for this driver to work correctly.
 
+config ICE
+   tristate "Intel(R) Ethernet Connection E800 Series Support"
+   default n
+   depends on PCI_MSI
+   ---help---
+ This driver supports Intel(R) Ethernet Connection E800 Series of
+ devices.  For more information on how to identify your adapter, go
+ to the Adapter & Driver ID Guide that can be located at:
+
+ 
+
+ To compile this driver as a module, choose M here. The module
+ will be called ice.
+
 config FM10K
tristate "Intel(R) FM1 Ethernet Switch Host Interface Support"
default n
diff --git a/drivers/net/ethernet/intel/Makefile b/drivers/net/ethernet/intel/Makefile
index 90af7757a885..807a4f8c7e4e 100644
--- a/drivers/net/ethernet/intel/Makefile
+++ b/drivers/net/ethernet/intel/Makefile
@@ -14,3 +14,4 @@ obj-$(CONFIG_I40E) += i40e/
 obj-$(CONFIG_IXGB) += ixgb/
 obj-$(CONFIG_I40EVF) += i40evf/
 obj-$(CONFIG_FM10K) += fm10k/
+obj-$(CONFIG_ICE) += ice/
diff --git a/drivers/net/ethernet/intel/ice/Makefile b/drivers/net/ethernet/intel/ice/Makefile
new file mode 100644
index ..2a177ea21b74
--- /dev/null
+++ b/drivers/net/ethernet/intel/ice/Makefile
@@ -0,0 +1,27 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+#
+# Intel(R) Ethernet Connection E800 Series Linux Driver
+# Copyright (c) 2018, Intel Corporation.
+#
+# This program is free software; you can redistribute it and/or modify it
+# under the terms and conditions of the GNU General Public License,
+# version 2, as published by the Free Software Foundation.
+#
+# This program is distributed in the 

[PATCH v2 10/15] ice: Implement transmit and NAPI support

2018-03-15 Thread Anirudh Venkataramanan
This patch implements ice_start_xmit (the handler for ndo_start_xmit) and
related functions. ice_start_xmit ultimately calls ice_tx_map, where the
Tx descriptor is built and posted to the hardware by bumping the ring tail.

This patch also implements ice_napi_poll, which is invoked when there's an
interrupt on the VSI's queues. The interrupt can be due to either a
completed Tx or an Rx event. In case of a completed Tx/Rx event, resources
are reclaimed. Additionally, in case of an Rx event, the skb is fetched
and passed up to the network stack.
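
Building a Tx data descriptor mostly comes down to packing several fields into one 64-bit word. The sketch below shows that packing in standalone C, using the shift values this patch defines for ICE_TXD_QW1_CMD_S, ICE_TXD_QW1_OFFSET_S, ICE_TXD_QW1_TX_BUF_SZ_S and ICE_TXD_QW1_L2TAG1_S. The `SK_` names and `build_tx_qword1` are hypothetical, and the result is left in host byte order, whereas the real descriptor field is little-endian.

```c
#include <stdint.h>

/* Local copies of the field positions, so the sketch is self-contained. */
#define SK_TXD_DTYPE_DATA	0x0ULL	/* descriptor type: data */
#define SK_TXD_CMD_S		4	/* command bits */
#define SK_TXD_OFFSET_S		16	/* header offsets */
#define SK_TXD_BUF_SZ_S		34	/* buffer size in bytes */
#define SK_TXD_L2TAG1_S		48	/* L2 tag, e.g. a VLAN tag */

#define SK_TXD_CMD_EOP		0x1ULL	/* end of packet */
#define SK_TXD_CMD_RS		0x2ULL	/* report status */

/* Pack command, offset, buffer size and tag into one descriptor qword. */
static uint64_t build_tx_qword1(uint64_t cmd, uint64_t offset,
				uint64_t size, uint64_t tag)
{
	return SK_TXD_DTYPE_DATA |
	       (cmd << SK_TXD_CMD_S) |
	       (offset << SK_TXD_OFFSET_S) |
	       (size << SK_TXD_BUF_SZ_S) |
	       (tag << SK_TXD_L2TAG1_S);
}
```

For the last buffer of a packet, a caller would typically pass both the end-of-packet and report-status command bits so that the hardware writes back completion status for that descriptor.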

Signed-off-by: Anirudh Venkataramanan 
---
 drivers/net/ethernet/intel/ice/ice.h   |1 +
 drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h |   46 ++
 drivers/net/ethernet/intel/ice/ice_main.c  |   55 ++
 drivers/net/ethernet/intel/ice/ice_txrx.c  | 1026 +++-
 drivers/net/ethernet/intel/ice/ice_txrx.h  |   45 ++
 5 files changed, 1171 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index e3ec19099e37..7998e57994bf 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -74,6 +74,7 @@
(((val) << ICE_AQ_VSI_UP_TABLE_UP##i##_S) & \
  ICE_AQ_VSI_UP_TABLE_UP##i##_M)
 
+#define ICE_TX_DESC(R, i) (&(((struct ice_tx_desc *)((R)->desc))[i]))
 #define ICE_RX_DESC(R, i) (&(((union ice_32b_rx_flex_desc *)((R)->desc))[i]))
 
 #define ice_for_each_txq(vsi, i) \
diff --git a/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h b/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
index 0cdf1ae480cf..c930f3e06ecc 100644
--- a/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
+++ b/drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
@@ -145,6 +145,33 @@ enum ice_rx_flg64_bits {
ICE_RXFLG_RSVD  = 63
 };
 
+/* for ice_32byte_rx_flex_desc.ptype_flexi_flags0 member */
+#define ICE_RX_FLEX_DESC_PTYPE_M   (0x3FF) /* 10-bits */
+
+/* for ice_32byte_rx_flex_desc.pkt_length member */
+#define ICE_RX_FLX_DESC_PKT_LEN_M  (0x3FFF) /* 14-bits */
+
+enum ice_rx_flex_desc_status_error_0_bits {
+   /* Note: These are predefined bit offsets */
+   ICE_RX_FLEX_DESC_STATUS0_DD_S = 0,
+   ICE_RX_FLEX_DESC_STATUS0_EOF_S,
+   ICE_RX_FLEX_DESC_STATUS0_HBO_S,
+   ICE_RX_FLEX_DESC_STATUS0_L3L4P_S,
+   ICE_RX_FLEX_DESC_STATUS0_XSUM_IPE_S,
+   ICE_RX_FLEX_DESC_STATUS0_XSUM_L4E_S,
+   ICE_RX_FLEX_DESC_STATUS0_XSUM_EIPE_S,
+   ICE_RX_FLEX_DESC_STATUS0_XSUM_EUDPE_S,
+   ICE_RX_FLEX_DESC_STATUS0_LPBK_S,
+   ICE_RX_FLEX_DESC_STATUS0_IPV6EXADD_S,
+   ICE_RX_FLEX_DESC_STATUS0_RXE_S,
+   ICE_RX_FLEX_DESC_STATUS0_CRCP_S,
+   ICE_RX_FLEX_DESC_STATUS0_RSS_VALID_S,
+   ICE_RX_FLEX_DESC_STATUS0_L2TAG1P_S,
+   ICE_RX_FLEX_DESC_STATUS0_XTRMD0_VALID_S,
+   ICE_RX_FLEX_DESC_STATUS0_XTRMD1_VALID_S,
+   ICE_RX_FLEX_DESC_STATUS0_LAST /* this entry must be last!!! */
+};
+
 #define ICE_RXQ_CTX_SIZE_DWORDS8
 #define ICE_RXQ_CTX_SZ (ICE_RXQ_CTX_SIZE_DWORDS * sizeof(u32))
 
@@ -215,6 +242,25 @@ struct ice_tx_desc {
__le64 cmd_type_offset_bsz;
 };
 
+enum ice_tx_desc_dtype_value {
+   ICE_TX_DESC_DTYPE_DATA  = 0x0,
+   ICE_TX_DESC_DTYPE_CTX   = 0x1,
+   /* DESC_DONE - HW has completed write-back of descriptor */
+   ICE_TX_DESC_DTYPE_DESC_DONE = 0xF,
+};
+
+#define ICE_TXD_QW1_CMD_S  4
+#define ICE_TXD_QW1_CMD_M  (0xFFFUL << ICE_TXD_QW1_CMD_S)
+
+enum ice_tx_desc_cmd_bits {
+   ICE_TX_DESC_CMD_EOP = 0x0001,
+   ICE_TX_DESC_CMD_RS  = 0x0002,
+};
+
+#define ICE_TXD_QW1_OFFSET_S   16
+#define ICE_TXD_QW1_TX_BUF_SZ_S34
+#define ICE_TXD_QW1_L2TAG1_S   48
+
 #define ICE_LAN_TXQ_MAX_QGRPS  127
 #define ICE_LAN_TXQ_MAX_QDIS   1023
 
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index afb400a1f1d2..b802cac8376c 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -1272,6 +1272,23 @@ static int ice_vsi_alloc_arrays(struct ice_vsi *vsi, bool alloc_qvectors)
return -ENOMEM;
 }
 
+/**
+ * ice_msix_clean_rings - MSIX mode Interrupt Handler
+ * @irq: interrupt number
+ * @data: pointer to a q_vector
+ */
+static irqreturn_t ice_msix_clean_rings(int __always_unused irq, void *data)
+{
+   struct ice_q_vector *q_vector = (struct ice_q_vector *)data;
+
+   if (!q_vector->tx.ring && !q_vector->rx.ring)
+   return IRQ_HANDLED;
+
+   napi_schedule(&q_vector->napi);
+
+   return IRQ_HANDLED;
+}
+
 /**
  * ice_vsi_alloc - Allocates the next available struct vsi in the PF
  * @pf: board private structure
@@ -1312,6 +1329,8 @@ static struct ice_vsi *ice_vsi_alloc(struct ice_pf *pf, enum ice_vsi_type type)
if (ice_vsi_alloc_arrays(vsi, true))
goto 

[PATCH v2 00/15] Add ice driver

2018-03-15 Thread Anirudh Venkataramanan
This patch series adds the ice driver, which will support the Intel(R)
E800 Series of network devices.

This is the first phase in the release of this driver where we implement
basic transmit and receive. The idea behind the multi-phase release is to
aid in code review as well as testing. Subsequent phases will implement
advanced features (like SR-IOV, tunnelling, flow director, QoS, etc.) that
build upon the previous phase(s). Each phase will be submitted as a patch
series.

I cc'd netdev for review since this is a new driver, even though this is
targeted to go through Jeff Kirsher's Intel Wired LAN git tree(s).

v2: Addressed community feedback
  patch #3 : Removed register write based on Shannon's comments
  patch #4 : Change retries value based on Shannon's comments
  patch #6 : Remove reference to "lump" as Shannon suggested
  patch #7 : Add define for magic number as Shannon suggested
  patch #12: Reworked based on multiple comments (Jakub, Stephen, et al.)

Anirudh Venkataramanan (15):
  ice: Add basic driver framework for Intel(R) E800 Series
  ice: Add support for control queues
  ice: Start hardware initialization
  ice: Get switch config, scheduler config and device capabilities
  ice: Get MAC/PHY/link info and scheduler topology
  ice: Initialize PF and setup miscellaneous interrupt
  ice: Add support for VSI allocation and deallocation
  ice: Add support for switch filter programming
  ice: Configure VSIs for Tx/Rx
  ice: Implement transmit and NAPI support
  ice: Add support for VLANs and offloads
  ice: Add stats and ethtool support
  ice: Update Tx scheduler tree for VSI multi-Tx queue support
  ice: Support link events, reset and rebuild
  ice: Implement filter sync, NDO operations and bump version

 Documentation/networking/ice.txt                |   39 +
 MAINTAINERS                                     |    1 +
 drivers/net/ethernet/intel/Kconfig              |   14 +
 drivers/net/ethernet/intel/Makefile             |    1 +
 drivers/net/ethernet/intel/ice/Makefile         |   34 +
 drivers/net/ethernet/intel/ice/ice.h            |  326 ++
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 1366 ++
 drivers/net/ethernet/intel/ice/ice_common.c     | 2247 +
 drivers/net/ethernet/intel/ice/ice_common.h     |  100 +
 drivers/net/ethernet/intel/ice/ice_controlq.c   | 1080 +
 drivers/net/ethernet/intel/ice/ice_controlq.h   |  108 +
 drivers/net/ethernet/intel/ice/ice_devids.h     |   33 +
 drivers/net/ethernet/intel/ice/ice_ethtool.c    |  954
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h |  280 ++
 drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h  |  487 ++
 drivers/net/ethernet/intel/ice/ice_main.c       | 5509 +++
 drivers/net/ethernet/intel/ice/ice_nvm.c        |  250 +
 drivers/net/ethernet/intel/ice/ice_osdep.h      |   87 +
 drivers/net/ethernet/intel/ice/ice_sched.c      | 1673 +++
 drivers/net/ethernet/intel/ice/ice_sched.h      |   57 +
 drivers/net/ethernet/intel/ice/ice_status.h     |   46 +
 drivers/net/ethernet/intel/ice/ice_switch.c     | 1897
 drivers/net/ethernet/intel/ice/ice_switch.h     |  175 +
 drivers/net/ethernet/intel/ice/ice_txrx.c       | 1796
 drivers/net/ethernet/intel/ice/ice_txrx.h       |  206 +
 drivers/net/ethernet/intel/ice/ice_type.h       |  408 ++
 26 files changed, 19174 insertions(+)
 create mode 100644 Documentation/networking/ice.txt
 create mode 100644 drivers/net/ethernet/intel/ice/Makefile
 create mode 100644 drivers/net/ethernet/intel/ice/ice.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_common.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_common.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_controlq.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_controlq.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_devids.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_ethtool.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_hw_autogen.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_main.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_nvm.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_osdep.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_sched.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_sched.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_status.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_switch.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_switch.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_txrx.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_txrx.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_type.h

-- 
2.14.3



Re: netns: send uevent messages

2018-03-15 Thread Christian Brauner
On Thu, Mar 15, 2018 at 05:14:13PM +0300, Kirill Tkhai wrote:
> On 15.03.2018 16:39, Christian Brauner wrote:
> > On Thu, Mar 15, 2018 at 12:47:30PM +0300, Kirill Tkhai wrote:
> >> CC Andrey Vagin
> > 
> > Hey Kirill,
> > 
> > Thanks for CCing Andrey.
> > 
> >>
> >> On 15.03.2018 03:12, Christian Brauner wrote:
> >>> This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
> >>> to allow sending uevent messages into the network namespace the socket
> >>> belongs to.
> >>>
> >>> Currently non-initial network namespaces are already isolated and don't
> >>> receive uevents. There are a number of cases where it is beneficial for a
> >>> sufficiently privileged userspace process to send a uevent into a network
> >>> namespace.
> >>>
> >>> One such use case would be debugging and fuzzing of a piece of software
> >>> which listens and reacts to uevents. By running a copy of that software
> >>> inside a network namespace, specific uevents could then be presented to it.
> >>> More concretely, this would allow for easy testing of udevd/ueventd.
> >>>
> >>> This will also allow some piece of software to run components inside a
> >>> separate network namespace and then effectively filter what that software
> >>> can receive. Some examples of software that do directly listen to uevents
> >>> and that we have in the past attempted to run inside a network namespace
> >>> are rbd (CEPH client) or the X server.
> >>>
> >>> Implementation:
> >>> The implementation has been kept as simple as possible from the kernel's
> >>> perspective. Specifically, a simple input method uevent_net_rcv() is added
> >>> to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
> >>> af_netlink infrastructure and neither adds an additional netlink family
> >>> nor requires any user-visible changes.
> >>>
> >>> For example, by using netlink_rcv_skb() we can make use of existing netlink
> >>> infrastructure to report back informative error messages to userspace.
> >>>
> >>> Furthermore, this implementation does not introduce any overhead for
> >>> existing uevent generating codepaths. The struct net gets a new uevent
> >>> socket member that records the uevent socket associated with that network
> >>> namespace. Since we record the uevent socket for each network namespace in
> >>> struct net we don't have to walk the whole uevent socket list.
> >>> Instead we can directly retrieve the relevant uevent socket and send the
> >>> message. This keeps the codepath very performant without introducing
> >>> needless overhead.
> >>>
> >>> Uevent sequence numbers are kept global. When a uevent message is sent to
> >>> another network namespace the implementation will simply increment the
> >>> global uevent sequence number and append it to the received uevent. This
> >>> has the advantage that the kernel will never need to parse the received
> >>> uevent message to replace any existing uevent sequence numbers. Instead it
> >>> is up to the userspace process to remove any existing uevent sequence
> >>> numbers in case the uevent message to be sent contains any.
> >>>
> >>> Security:
> >>> In order for a caller to send uevent messages to a target network namespace
> >>> the caller must have CAP_SYS_ADMIN in the owning user namespace of the
> >>> target network namespace. Additionally, any received uevent message is
> >>> verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
> >>> needed to append the uevent sequence number.
> >>>
> >>> Testing:
> >>> This patch has been tested and verified to work with the following udev
> >>> implementations:
> >>> 1. CentOS 6 with udevd version 147
> >>> 2. Debian Sid with systemd-udevd version 237
> >>> 3. Android 7.1.1 with ueventd
> >>>
> >>> Signed-off-by: Christian Brauner 
> >>> ---
> >>>  include/net/net_namespace.h |  1 +
> >>>  lib/kobject_uevent.c        | 88 -
> >>>  2 files changed, 88 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> >>> index f306b2aa15a4..467bde763a9b 100644
> >>> --- a/include/net/net_namespace.h
> >>> +++ b/include/net/net_namespace.h
> >>> @@ -78,6 +78,7 @@ struct net {
> >>>  
> >>>   struct sock *rtnl;  /* rtnetlink socket */
> >>>   struct sock *genl_sock;
> >>> + struct sock *uevent_sock;   /* uevent socket */
> >>
> >> Since you add this per-net uevent_sock pointer, the currently existing
> >> iterations in uevent_net_exit() start to look confusing. There are:
> >>
> >> mutex_lock(&uevent_sock_mutex);
> >> list_for_each_entry(ue_sk, &uevent_sock_list, list) {
> >> if (sock_net(ue_sk->sk) == net)
> >> goto found;
> >> }
> >>
> >> Can't we make a small cleanup in lib/kobject_uevent.c after this change
> >> and before the main part of the patch goes?
> 

Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 4:46 PM, Linus Torvalds
 wrote:
>
> Well, the explicit typing allows that mixing, in that you can just
> have "const_max_t(5,sizeof(x))"

I obviously meant "const_max_t(size_t,5,sizeof(x))". Heh.

Linus


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 4:41 PM, Kees Cook  wrote:
>
> I much prefer explicit typing, but both you and Rasmus mentioned
> wanting the int/size_t mixing.

Well, the explicit typing allows that mixing, in that you can just
have "const_max_t(5,sizeof(x))"

So I'm ok with that.

What I'm *not* so much ok with is "const_max(5,sizeof(x))" erroring
out, or silently causing insane behavior due to hidden subtle type
casts..

Linus


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Kees Cook
On Thu, Mar 15, 2018 at 4:34 PM, Linus Torvalds
 wrote:
> On Thu, Mar 15, 2018 at 3:46 PM, Kees Cook  wrote:
>>
>> So, AIUI, I can either get strict type checking, in which case, this
>> is rejected (which I assume there is still a desire to have):
>>
>> int foo[const_max(6, sizeof(whatever))];
>
> Ehh, yes, that looks fairly sane, and erroring out would be annoying.
>
> But maybe we should just make the type explicit, and make it "const_max_t()"?
>
> I think all the existing users are of type "max_t()" anyway due to the
> very same issue, no?

All but one are using max()[1]. One case uses max_t() to get u32.

> At least if there's an explicit type like 'size_t', then passing in
> "-1" becoming a large unsigned integer is understandable and clear,
> not just some odd silent behavior.
>
> Put another way: I think it's unacceptable that
>
>  const_max(-1,6)
>
> magically becomes a huge positive number like in that patch of yours, but
>
>  const_max_t(size_t, -1, 6)
>
> *obviously* is a huge positive number.
>
> The two things would *do* the same thing, but in the second case the
> type is explicit and visible.
>
>> due to __builtin_types_compatible_p() rejecting it, or I can construct
>> a "positive arguments only" test, in which the above is accepted, but
>> this is rejected:
>
> That sounds acceptable too, although the "const_max_t()" thing is
> presumably going to be simpler?

I much prefer explicit typing, but both you and Rasmus mentioned
wanting the int/size_t mixing. I'm totally happy with const_max_t()
-- even if it makes my line-wrapping harder due to the longer name. ;)

I'll resend in a moment...

-Kees

[1] https://patchwork.kernel.org/patch/10285709/

-- 
Kees Cook
Pixel Security


[PATCH v1] netns: send uevent messages

2018-03-15 Thread Christian Brauner
This patch adds a receive method to NETLINK_KOBJECT_UEVENT netlink sockets
to allow sending uevent messages into the network namespace the socket
belongs to.

Currently non-initial network namespaces are already isolated and don't
receive uevents. There are a number of cases where it is beneficial for a
sufficiently privileged userspace process to send a uevent into a network
namespace.

One such use case would be debugging and fuzzing of a piece of software
which listens and reacts to uevents. By running a copy of that software
inside a network namespace, specific uevents could then be presented to it.
More concretely, this would allow for easy testing of udevd/ueventd.

This will also allow some piece of software to run components inside a
separate network namespace and then effectively filter what that software
can receive. Some examples of software that do directly listen to uevents
and that we have in the past attempted to run inside a network namespace
are rbd (CEPH client) or the X server.

Implementation:
The implementation has been kept as simple as possible from the kernel's
perspective. Specifically, a simple input method uevent_net_rcv() is added
to NETLINK_KOBJECT_UEVENT sockets which completely reuses existing
af_netlink infrastructure and neither adds an additional netlink family
nor requires any user-visible changes.

For example, by using netlink_rcv_skb() we can make use of existing netlink
infrastructure to report back informative error messages to userspace.

Furthermore, this implementation does not introduce any overhead for
existing uevent generating codepaths. The struct net gets a new uevent
socket member that records the uevent socket associated with that network
namespace. Since we record the uevent socket for each network namespace in
struct net we don't have to walk the whole uevent socket list.
Instead we can directly retrieve the relevant uevent socket and send the
message. This keeps the codepath very performant without introducing
needless overhead.

Uevent sequence numbers are kept global. When a uevent message is sent to
another network namespace the implementation will simply increment the
global uevent sequence number and append it to the received uevent. This
has the advantage that the kernel will never need to parse the received
uevent message to replace any existing uevent sequence numbers. Instead it
is up to the userspace process to remove any existing uevent sequence
numbers in case the uevent message to be sent contains any.

Security:
In order for a caller to send uevent messages to a target network namespace
the caller must have CAP_SYS_ADMIN in the owning user namespace of the
target network namespace. Additionally, any received uevent message is
verified to not exceed size UEVENT_BUFFER_SIZE. This includes the space
needed to append the uevent sequence number.
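
For illustration only (this is not the code from the patch): the size check
described above can be sketched in userspace C. append_seqnum() is a
hypothetical helper, and UEVENT_BUFFER_SIZE is assumed here to be 2048 to
match the kernel's definition. The helper appends the SEQNUM= entry to a
received payload and rejects the message when the result would not fit.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed to match the kernel's UEVENT_BUFFER_SIZE. */
#define UEVENT_BUFFER_SIZE 2048

/*
 * Hypothetical helper: append "SEQNUM=<seqnum>" plus its NUL terminator to a
 * payload of msg_len bytes held in buf (capacity UEVENT_BUFFER_SIZE).
 * Returns the new payload length, or -1 if the result would not fit.
 */
static int append_seqnum(char *buf, size_t msg_len, uint64_t seqnum)
{
	/*
	 * sizeof("SEQNUM=") counts the NUL; 21 more bytes cover the
	 * 20 digits of 2^64 - 1 with one byte to spare.
	 */
	char tmp[sizeof("SEQNUM=") + 21];
	int n;

	assert(buf != NULL);

	n = snprintf(tmp, sizeof(tmp), "SEQNUM=%" PRIu64, seqnum);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;
	n += 1; /* the terminating NUL is part of the uevent entry */

	/* Reject anything that would exceed the buffer once SEQNUM is added. */
	if (msg_len > UEVENT_BUFFER_SIZE ||
	    (size_t)n > UEVENT_BUFFER_SIZE - msg_len)
		return -1;

	memcpy(buf + msg_len, tmp, (size_t)n);
	return (int)(msg_len + (size_t)n);
}
```

Checking the remaining space with a subtraction rather than adding the two
lengths avoids any overflow in the comparison itself.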

Testing:
This patch has been tested and verified to work with the following udev
implementations:
1. CentOS 6 with udevd version 147
2. Debian Sid with systemd-udevd version 237
3. Android 7.1.1 with ueventd

Signed-off-by: Christian Brauner 
---
Changelog v0->v1:
* Hold mutex_lock() until uevent is sent to preserve uevent message
  ordering. See udev and commit for reference:

  commit 7b60a18da393ed70db043a777fd9e6d5363077c4
  Author: Andrew Vagin 
  Date:   Wed Mar 7 14:49:56 2012 +0400

  uevent: send events in correct order according to seqnum (v3)

  The queue handling in the udev daemon assumes that the events are
  ordered.
---
 include/net/net_namespace.h |  1 +
 lib/kobject_uevent.c        | 82 -
 2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index f306b2aa15a4..467bde763a9b 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -78,6 +78,7 @@ struct net {
 
struct sock *rtnl;  /* rtnetlink socket */
struct sock *genl_sock;
+   struct sock *uevent_sock;   /* uevent socket */
 
struct list_headdev_base_head;
struct hlist_head   *dev_name_head;
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 9fe6ec8fda28..5a07798359a1 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 
@@ -602,12 +603,88 @@ int add_uevent_var(struct kobj_uevent_env *env, const char *format, ...)
 EXPORT_SYMBOL_GPL(add_uevent_var);
 
 #if defined(CONFIG_NET)
+static int uevent_net_send(struct sock *usk, struct sk_buff *skb,
+  struct netlink_ext_ack *extack)
+{
+   int ret;
+   /* u64 to chars: 2^64 - 1 = 21 chars */
+   char buf[sizeof("SEQNUM=") + 21];
+   struct sk_buff *skbc;
+
+   /* bump and prepare sequence number */
+   ret = snprintf(buf, sizeof(buf), 

Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 3:46 PM, Kees Cook  wrote:
>
> So, AIUI, I can either get strict type checking, in which case, this
> is rejected (which I assume there is still a desire to have):
>
> int foo[const_max(6, sizeof(whatever))];

Ehh, yes, that looks fairly sane, and erroring out would be annoying.

But maybe we should just make the type explicit, and make it "const_max_t()"?

I think all the existing users are of type "max_t()" anyway due to the
very same issue, no?

At least if there's an explicit type like 'size_t', then passing in
"-1" becoming a large unsigned integer is understandable and clear,
not just some odd silent behavior.

Put another way: I think it's unacceptable that

 const_max(-1,6)

magically becomes a huge positive number like in that patch of yours, but

 const_max_t(size_t, -1, 6)

*obviously* is a huge positive number.

The two things would *do* the same thing, but in the second case the
type is explicit and visible.
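
For illustration only: a minimal sketch of the explicitly typed variant
discussed here. The const_max_t() definition below is a hypothetical
stand-in, not the macro from the patch, and struct whatever is a made-up
type. It shows that the result stays a constant expression (it sizes a
file-scope array) and that -1 becomes a huge value only because the caller
asked for size_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical definition for illustration; not the macro from the patch. */
#define const_max_t(type, x, y) \
	((type)(x) > (type)(y) ? (type)(x) : (type)(y))

struct whatever { char bytes[24]; };

/* File scope forces a true constant expression: a VLA is not allowed here. */
static int foo[const_max_t(size_t, 6, sizeof(struct whatever))];

/* The wraparound of -1 is visible at the call site through the type. */
static_assert(const_max_t(size_t, -1, 6) == SIZE_MAX,
	      "-1 converted to size_t is SIZE_MAX");

/* Number of elements the compiler picked for foo. */
static size_t foo_len(void)
{
	return sizeof(foo) / sizeof(foo[0]);
}

/* Same comparison evaluated in a function body. */
static size_t max_with_negative(void)
{
	return const_max_t(size_t, -1, 6);
}
```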

> due to __builtin_types_compatible_p() rejecting it, or I can construct
> a "positive arguments only" test, in which the above is accepted, but
> this is rejected:

That sounds acceptable too, although the "const_max_t()" thing is
presumably going to be simpler?

 Linus


Re: [Intel-wired-lan] [next-queue 4/4] ixgbe: enable tso with ipsec offload

2018-03-15 Thread Shannon Nelson

On 3/15/2018 3:03 PM, Alexander Duyck wrote:

On Thu, Mar 15, 2018 at 2:23 PM, Shannon Nelson
 wrote:

Fix things up to support TSO offload in conjunction
with IPsec hw offload.  This raises throughput with
IPsec offload on to nearly line rate.

Signed-off-by: Shannon Nelson 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c |  7 +--
  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 25 +++--
  2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 5ddea43..bfbcfc2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -896,6 +896,7 @@ void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
  void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter)
  {
 struct ixgbe_ipsec *ipsec;
+   netdev_features_t features;
 size_t size;

 if (adapter->hw.mac.type == ixgbe_mac_82598EB)
@@ -929,8 +930,10 @@ void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter)
 ixgbe_ipsec_clear_hw_tables(adapter);

 adapter->netdev->xfrmdev_ops = &ixgbe_xfrmdev_ops;
-   adapter->netdev->features |= NETIF_F_HW_ESP;
-   adapter->netdev->hw_enc_features |= NETIF_F_HW_ESP;
+
+   features = NETIF_F_HW_ESP | NETIF_F_HW_ESP_TX_CSUM | NETIF_F_GSO_ESP;
+   adapter->netdev->features |= features;
+   adapter->netdev->hw_enc_features |= features;


Instead of adding the local variable you might just create a new
define that includes these 3 feature flags and then use that here. You
could use the way I did IXGBE_GSO_PARTIAL_FEATURES as an example.
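
For illustration only: a userspace sketch of the combined define suggested
above. The name IXGBE_ESP_FEATURES, the mock_netdev struct and the flag bit
values are all assumptions made so the snippet builds outside the kernel;
the real flag values come from the kernel's netdev feature definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock flag bits for a userspace build; not the kernel's real values. */
typedef uint64_t netdev_features_t;
#define NETIF_F_HW_ESP         ((netdev_features_t)1 << 0)
#define NETIF_F_HW_ESP_TX_CSUM ((netdev_features_t)1 << 1)
#define NETIF_F_GSO_ESP        ((netdev_features_t)1 << 2)

/* Hypothetical name for the combined define suggested in the review. */
#define IXGBE_ESP_FEATURES (NETIF_F_HW_ESP | \
			    NETIF_F_HW_ESP_TX_CSUM | \
			    NETIF_F_GSO_ESP)

/* Stand-in for the two feature fields the patch touches. */
struct mock_netdev {
	netdev_features_t features;
	netdev_features_t hw_enc_features;
};

/* With the define in place no local 'features' variable is needed. */
static void mock_init_ipsec_offload(struct mock_netdev *dev)
{
	assert(dev != NULL);
	dev->features |= IXGBE_ESP_FEATURES;
	dev->hw_enc_features |= IXGBE_ESP_FEATURES;
}
```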


 return;

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index a54f3d8..6022666 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7721,9 +7721,11 @@ static void ixgbe_service_task(struct work_struct *work)

  static int ixgbe_tso(struct ixgbe_ring *tx_ring,
  struct ixgbe_tx_buffer *first,
-u8 *hdr_len)
+u8 *hdr_len,
+struct ixgbe_ipsec_tx_data *itd)
  {
 u32 vlan_macip_lens, type_tucmd, mss_l4len_idx;
+   u32 fceof_saidx = 0;
 struct sk_buff *skb = first->skb;


Reverse xmas tree this. It should probably be moved down to just past
the declaration of paylen and l4_offset.


 union {
 struct iphdr *v4;
@@ -7762,9 +7764,13 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 unsigned char *trans_start = ip.hdr + (ip.v4->ihl * 4);

 /* IP header will have to cancel out any data that
-* is not a part of the outer IP header
+* is not a part of the outer IP header, except for
+* IPsec where we want the IP+ESP header.
  */
-   ip.v4->check = csum_fold(csum_partial(trans_start,
+   if (first->tx_flags & IXGBE_TX_FLAGS_IPSEC)
+   ip.v4->check = 0;
+   else
+   ip.v4->check = csum_fold(csum_partial(trans_start,
   csum_start - trans_start,
   0));
 type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;


I would say this should be flipped like so:
ip.v4->check = (skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) ?
  csum_fold(csum_partial(trans_start,
csum_start - trans_start, 0)) : 0;


@@ -7797,12 +7803,15 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 mss_l4len_idx = (*hdr_len - l4_offset) << IXGBE_ADVTXD_L4LEN_SHIFT;
 mss_l4len_idx |= skb_shinfo(skb)->gso_size << IXGBE_ADVTXD_MSS_SHIFT;

+   fceof_saidx |= itd->sa_idx;
+   type_tucmd |= itd->flags | itd->trailer_len;
+
 /* vlan_macip_lens: HEADLEN, MACLEN, VLAN tag */
 vlan_macip_lens = l4.hdr - ip.hdr;
 vlan_macip_lens |= (ip.hdr - skb->data) << IXGBE_ADVTXD_MACLEN_SHIFT;
 vlan_macip_lens |= first->tx_flags & IXGBE_TX_FLAGS_VLAN_MASK;

-   ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, 0, type_tucmd,
+   ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, fceof_saidx, type_tucmd,
   mss_l4len_idx);

 return 1;
@@ -8493,7 +8502,8 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
 if (skb->sp && !ixgbe_ipsec_tx(tx_ring, first, &ipsec_tx))
 goto out_drop;
  #endif
-   tso = ixgbe_tso(tx_ring, first, &hdr_len);
+
+   tso = ixgbe_tso(tx_ring, first, &hdr_len, &ipsec_tx);
 if (tso < 0)
 goto out_drop;
 else if (!tso)


No need for the extra blank line. I would say just leave it as is and
add your extra argument.


Yep, you're right on all counts.  That SKB_GSO_PARTIAL 

Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Kees Cook
On Thu, Mar 15, 2018 at 4:17 PM, Miguel Ojeda
 wrote:
>> The full one, using your naming convention:
>>
>> #define const_max(x, y)  \
>> ({   \
>> if (!__builtin_constant_p(x))\
>> __error_not_const_arg(); \
>> if (!__builtin_constant_p(y))\
>> __error_not_const_arg(); \
>> if (!__builtin_types_compatible_p(typeof(x), typeof(y))) \
>> __error_incompatible_types();\
>> if ((x) < 0) \
>> __error_not_positive_arg();  \
>> if ((y) < 0) \
>> __error_not_positive_arg();  \
>> __builtin_choose_expr((x) > (y), (x), (y));  \
>> })
>>
>
> Nevermind... gcc doesn't take that as a constant expr, even if it
> compiles as one at -O0.

Yeah, unfortunately. :(

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 6/7] e1000: eliminate duplicate barriers on weakly-ordered archs

2018-03-15 Thread Sinan Kaya
On 3/14/2018 9:41 PM, Alexander Duyck wrote:
>>  }
>>
> So you missed the writel in e1000_xmit_frame. You should probably get
> that one too while you are doing these updates. The wmb() is in
> e1000_tx_queue().
> 

I brought wmb() outside along with the next descriptor assignment to be
similar to the rest of the other code.

If wmb() and writel() are not visible in the same function, let's not touch
the code.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Miguel Ojeda
On Fri, Mar 16, 2018 at 12:08 AM, Miguel Ojeda
 wrote:
> On Thu, Mar 15, 2018 at 11:58 PM, Miguel Ojeda
>  wrote:
>> On Thu, Mar 15, 2018 at 11:46 PM, Kees Cook  wrote:
>>>
>>> By using this eye-bleed:
>>>
>>> size_t __error_not_const_arg(void) \
>>> __compiletime_error("const_max() used with non-compile-time constant arg");
>>> size_t __error_not_positive_arg(void) \
>>> __compiletime_error("const_max() used with negative arg");
>>> #define const_max(x, y) \
>>> __builtin_choose_expr(__builtin_constant_p(x) &&\
>>>   __builtin_constant_p(y),  \
>>> __builtin_choose_expr((x) >= 0 && (y) >= 0, \
>>>   (typeof(x))(x) > (typeof(y))(y) ? \
>>> (x) : (y),  \
>>>   __error_not_positive_arg()),  \
>>> __error_not_const_arg())
>>>
>>
>> I was writing it like this:
>>
>> #define const_max(a, b) \
>> ({ \
>> if ((a) < 0) \
>> __const_max_called_with_negative_value(); \
>> if ((b) < 0) \
>> __const_max_called_with_negative_value(); \
>> if (!__builtin_types_compatible_p(typeof(a), typeof(b))) \
>> __const_max_called_with_incompatible_types(); \
>> __builtin_choose_expr((a) > (b), (a), (b)); \
>> })
>
> The full one, using your naming convention:
>
> #define const_max(x, y)  \
> ({   \
> if (!__builtin_constant_p(x))\
> __error_not_const_arg(); \
> if (!__builtin_constant_p(y))\
> __error_not_const_arg(); \
> if (!__builtin_types_compatible_p(typeof(x), typeof(y))) \
> __error_incompatible_types();\
> if ((x) < 0) \
> __error_not_positive_arg();  \
> if ((y) < 0) \
> __error_not_positive_arg();  \
> __builtin_choose_expr((x) > (y), (x), (y));  \
> })
>

Nevermind... gcc doesn't take that as a constant expr, even if it
compiles as one at -O0.
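
For illustration only: the problem with the ({ ... }) form is that a
statement expression is not accepted where a constant expression is
required. An expression-only macro built on __builtin_choose_expr() (a
GCC/Clang builtin) keeps the result constant. The sketch below is a
simplified assumption, not the macro from the patch, and omits the type and
sign checks discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Expression-only sketch (GCC/Clang builtin): no ({ ... }) wrapper, so the
 * result can be used where a constant expression is required.
 */
#define const_max(x, y) __builtin_choose_expr((x) > (y), (x), (y))

/* File-scope array: the size must be a constant, a VLA is not permitted. */
static char buf[const_max(3, 7)];

/* The macro result is also usable in a compile-time assertion. */
static_assert(const_max(3, 7) == 7, "const_max() picks the larger value");

/* Size in bytes the compiler picked for buf. */
static size_t buf_len(void)
{
	return sizeof(buf);
}
```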


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Miguel Ojeda
On Thu, Mar 15, 2018 at 11:58 PM, Miguel Ojeda
 wrote:
> On Thu, Mar 15, 2018 at 11:46 PM, Kees Cook  wrote:
>>
>> By using this eye-bleed:
>>
>> size_t __error_not_const_arg(void) \
>> __compiletime_error("const_max() used with non-compile-time constant arg");
>> size_t __error_not_positive_arg(void) \
>> __compiletime_error("const_max() used with negative arg");
>> #define const_max(x, y) \
>> __builtin_choose_expr(__builtin_constant_p(x) &&\
>>   __builtin_constant_p(y),  \
>> __builtin_choose_expr((x) >= 0 && (y) >= 0, \
>>   (typeof(x))(x) > (typeof(y))(y) ? \
>> (x) : (y),  \
>>   __error_not_positive_arg()),  \
>> __error_not_const_arg())
>>
>
> I was writing it like this:
>
> #define const_max(a, b) \
> ({ \
> if ((a) < 0) \
> __const_max_called_with_negative_value(); \
> if ((b) < 0) \
> __const_max_called_with_negative_value(); \
> if (!__builtin_types_compatible_p(typeof(a), typeof(b))) \
> __const_max_called_with_incompatible_types(); \
> __builtin_choose_expr((a) > (b), (a), (b)); \
> })

The full one, using your naming convention:

#define const_max(x, y)  \
({   \
if (!__builtin_constant_p(x))\
__error_not_const_arg(); \
if (!__builtin_constant_p(y))\
__error_not_const_arg(); \
if (!__builtin_types_compatible_p(typeof(x), typeof(y))) \
__error_incompatible_types();\
if ((x) < 0) \
__error_not_positive_arg();  \
if ((y) < 0) \
__error_not_positive_arg();  \
__builtin_choose_expr((x) > (y), (x), (y));  \
})

Miguel


Announce: Netdev 0x12 Conference

2018-03-15 Thread Jamal Hadi Salim

The NetDev Society is pleased to announce that Netdev 0x12
will take place July 11-13, 2018 in Montreal, Canada.

More details here:
https://www.netdevconf.org/0x12

For regular updates, please subscribe to peo...@lists.netdevconf.org
(more info at: https://lists.netdevconf.org/cgi-bin/mailman/listinfo/people)

If twitter is your thing then follow us: @netdev01
and use hashtag #netdevconf

cheers,
jamal(on behalf of NetDev Society)


Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread Alexei Starovoitov
On Thu, Mar 15, 2018 at 11:55:39PM +0100, Daniel Borkmann wrote:
> On 03/15/2018 11:20 PM, Alexei Starovoitov wrote:
> > On Thu, Mar 15, 2018 at 11:17:12PM +0100, Daniel Borkmann wrote:
> >> On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
> >>> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
> >>>>
> >>>> +/* User return codes for SK_MSG prog type. */
> >>>> +enum sk_msg_action {
> >>>> +	SK_MSG_DROP = 0,
> >>>> +	SK_MSG_PASS,
> >>>> +};
> >>>
> >>> do we really need new enum here?
> >>> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
> >>> and there will be only drop/pass in both enums.
> >>> Also I don't see where these two new SK_MSG_* are used...
> >>>
> >>>> +
> >>>> +/* user accessible metadata for SK_MSG packet hook, new fields must
> >>>> + * be added to the end of this structure
> >>>> + */
> >>>> +struct sk_msg_md {
> >>>> +	__u32 data;
> >>>> +	__u32 data_end;
> >>>> +};
> >>>
> >>> I think it's time for me to ask for forgiveness :)
> >>
> >> :-)
> >>
> >>> I used __u32 for data and data_end only because all other fields
> >>> in __sk_buff were __u32 at the time and I couldn't easily figure out
> >>> how to teach verifier to recognize 8-byte rewrites.
> >>> Unfortunately my mistake stuck and was copied over into xdp.
> >>> Since this is new struct let's do it right and add
> >>> 'void *data, *data_end' here,
> >>> since bpf prog will use them as 'void *' pointers.
> >>> There are no compat issues here, since bpf is always 64-bit.
> >>
> >> But at least offset-wise when you do the ctx rewrite this would then
> >> be a bit more tricky when you have 64 bit kernel with 32 bit user
> >> space since void * members are in each cases at different offset. So
> >> unless I'm missing something, this still should either be __u32 or
> >> __u64 instead of void *, no?
> > 
> > there is no 32-bit user space. these structs are seen by bpf progs only
> > and bpf is 64-bit only too.
> > unless I'm missing your point.
> 
> Ok, so let's say you have 32 bit LLVM binary and compile the prog where
> you access md->data_end. Given the void * in the struct will that access
> end up being BPF_W at ctx offset 4 or BPF_DW at ctx offset 8 from clang
> perspective (iow, is the back end treating this special and always use
> fixed BPF_DW in such case)? If not and it would be the first case with
> offset 4, then we could have the case that underlying 64 bit kernel is
> expecting ctx offset 8 for doing the md ctx conversion.

I'm still not quite following.
Whether llvm itself is a 32-bit binary, or an arm32 or sparc32 binary,
doesn't matter. It will produce the same 64-bit bpf code.
It will see the 'void *' deref from this struct and will emit DW.
Maybe the confusion comes from the newly added -mattr=+alu32 flag?
That option doesn't change the fact that sizeof(void*)==8.
It only allows the backend to emit 32-bit alu insns.
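
For illustration only: the offsets under discussion can be checked with
offsetof(). In this sketch uint32_t stands in for __u32 and the two struct
names are made up. With 32-bit members data_end sits at offset 4; with
pointer members it sits at sizeof(void *), which is 8 for the bpf target as
described above. Compiled for a host, the second value follows the host ABI
instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as posted in the patch, with uint32_t standing in for __u32. */
struct sk_msg_md_u32 {
	uint32_t data;
	uint32_t data_end;
};

/* Layout suggested in the review: pointer-typed members. */
struct sk_msg_md_ptr {
	void *data;
	void *data_end;
};

/* Two 32-bit members never need padding between them. */
static_assert(offsetof(struct sk_msg_md_u32, data_end) == 4,
	      "32-bit members put data_end at offset 4");

/* Offset of data_end when the members are 32-bit integers. */
static size_t data_end_offset_u32(void)
{
	return offsetof(struct sk_msg_md_u32, data_end);
}

/* Offset of data_end when the members are pointers on this target. */
static size_t data_end_offset_ptr(void)
{
	return offsetof(struct sk_msg_md_ptr, data_end);
}
```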



[PATCH net 5/5] net/sched: fix NULL dereference on the error path of tcf_skbmod_init()

2018-03-15 Thread Davide Caratti
when the following command

 # tc action replace action skbmod swap mac index 100

is run for the first time, and tcf_skbmod_init() fails to allocate struct
tcf_skbmod_params, tcf_skbmod_cleanup() calls kfree_rcu(NULL), thus
causing the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0008
 IP: __call_rcu+0x23/0x2b0
 PGD 800034057067 P4D 800034057067 PUD 74937067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_skbmod(E) psample ip6table_filter ip6_tables 
iptable_filter binfmt_misc ext4 snd_hda_codec_generic snd_hda_intel 
snd_hda_codec crct10dif_pclmul mbcache jbd2 crc32_pclmul snd_hda_core 
ghash_clmulni_intel snd_hwdep pcbc snd_seq snd_seq_device snd_pcm aesni_intel 
snd_timer crypto_simd glue_helper snd cryptd virtio_balloon joydev soundcore 
pcspkr i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs 
libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect 
sysimgblt fb_sys_fops ttm drm virtio_console virtio_net virtio_blk ata_piix 
libata crc32c_intel virtio_pci serio_raw virtio_ring virtio i2c_core floppy 
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_skbmod]
 CPU: 3 PID: 3144 Comm: tc Tainted: GE4.16.0-rc4.act_vlan.orig+ 
#403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__call_rcu+0x23/0x2b0
 RSP: 0018:bd2e403e7798 EFLAGS: 00010246
 RAX: c0872080 RBX: 981d34bff780 RCX: 
 RDX: 922a5f00 RSI:  RDI: 
 RBP:  R08: 0001 R09: 021f
 R10: 3d003000 R11: 00aa R12: 
 R13: 922a5f00 R14: 0001 R15: 981d3b698c2c
 FS:  7f3678292740() GS:981d3fd8() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0008 CR3: 7c57a006 CR4: 001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_skbmod_init+0x1d1/0x210 [act_skbmod]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7f36776a3ba0
 RSP: 002b:7fff4703b618 EFLAGS: 0246 ORIG_RAX: 002e
 RAX: ffda RBX: 7fff4703b740 RCX: 7f36776a3ba0
 RDX:  RSI: 7fff4703b690 RDI: 0003
 RBP: 5aaaba36 R08: 0002 R09: 
 R10: 7fff4703b0a0 R11: 0246 R12: 
 R13: 7fff4703b754 R14: 0001 R15: 00669f60
 Code: 5d e9 42 da ff ff 66 90 0f 1f 44 00 00 41 57 41 56 41 55 49 89 d5 41 54 
55 48 89 fd 53 48 83 ec 08 40 f6 c7 07 0f 85 19 02 00 00 <48> 89 75 08 48 c7 45 
00 00 00 00 00 9c 58 0f 1f 44 00 00 49 89
 RIP: __call_rcu+0x23/0x2b0 RSP: bd2e403e7798
 CR2: 0008

Fix it in tcf_skbmod_cleanup(), ensuring that kfree_rcu(p, ...) is called
only when p is not NULL.
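
For illustration only: a userspace analogy of the guard, with made-up names
(cb_head, mock_params, mock_defer_free, mock_cleanup) standing in for the
kernel structures and for kfree_rcu(). The deferred-free helper writes
through the embedded head, so it must never be handed a pointer derived
from NULL; the cleanup path therefore checks p first.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; names are hypothetical, not kernel APIs. */
struct cb_head {
	struct cb_head *next;
};

struct mock_params {
	struct cb_head rcu;
	unsigned long flags;
};

static struct cb_head *pending;     /* objects queued for deferred free */
static unsigned int pending_count;  /* how many objects were queued */

/* Writes through head, so head must point into a real object. */
static void mock_defer_free(struct cb_head *head)
{
	assert(head != NULL);
	head->next = pending;
	pending = head;
	pending_count++;
}

/* Same shape as the fix: only queue when parameters were allocated. */
static void mock_cleanup(struct mock_params *p)
{
	if (p)
		mock_defer_free(&p->rcu);
}
```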

Fixes: 86da71b57383 ("net_sched: Introduce skbmod action")
Signed-off-by: Davide Caratti 
---
 net/sched/act_skbmod.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_skbmod.c b/net/sched/act_skbmod.c
index fa975262dbac..d09565d6433e 100644
--- a/net/sched/act_skbmod.c
+++ b/net/sched/act_skbmod.c
@@ -190,7 +190,8 @@ static void tcf_skbmod_cleanup(struct tc_action *a)
struct tcf_skbmod_params  *p;
 
p = rcu_dereference_protected(d->skbmod_p, 1);
-   kfree_rcu(p, rcu);
+   if (p)
+   kfree_rcu(p, rcu);
 }
 
 static int tcf_skbmod_dump(struct sk_buff *skb, struct tc_action *a,
-- 
2.14.3



[PATCH net 1/5] net/sched: fix NULL dereference in the error path of tcf_vlan_init()

2018-03-15 Thread Davide Caratti
when the following command

 # tc actions replace action vlan pop index 100

is run for the first time, and tcf_vlan_init() fails allocating struct
tcf_vlan_params, tcf_vlan_cleanup() calls kfree_rcu(NULL, ...). This causes
the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0018
 IP: __call_rcu+0x23/0x2b0
 PGD 8000760a2067 P4D 8000760a2067 PUD 742c1067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_vlan(E) ip6table_filter ip6_tables iptable_filter 
binfmt_misc ext4 snd_hda_codec_generic snd_hda_intel mbcache snd_hda_codec jbd2 
snd_hda_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc snd_hwdep 
snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd snd_timer glue_helper 
snd cryptd joydev soundcore virtio_balloon pcspkr i2c_piix4 nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi qxl 
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm 
virtio_console virtio_blk virtio_net ata_piix crc32c_intel libata virtio_pci 
i2c_core virtio_ring serio_raw virtio floppy dm_mirror dm_region_hash dm_log 
dm_mod [last unloaded: act_vlan]
 CPU: 3 PID: 3119 Comm: tc Tainted: G E 4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__call_rcu+0x23/0x2b0
 RSP: 0018:aac3005fb798 EFLAGS: 00010246
 RAX: c0704080 RBX: 97f2b4bbe900 RCX: 
 RDX: abca5f00 RSI: 0010 RDI: 0010
 RBP: 0010 R08: 0001 R09: 0044
 R10: fd003000 R11: 97f2faab5b91 R12: 
 R13: abca5f00 R14: 97f2fb80202c R15: fff4
 FS:  7f68f75b4740() GS:97f2ffd8() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0018 CR3: 72b52001 CR4: 001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_vlan_init+0x168/0x270 [act_vlan]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7f68f69c5ba0
 RSP: 002b:7fffd79c1118 EFLAGS: 0246 ORIG_RAX: 002e
 RAX: ffda RBX: 7fffd79c1240 RCX: 7f68f69c5ba0
 RDX:  RSI: 7fffd79c1190 RDI: 0003
 RBP: 5aaa708e R08: 0002 R09: 
 R10: 7fffd79c0ba0 R11: 0246 R12: 
 R13: 7fffd79c1254 R14: 0001 R15: 00669f60
 Code: 5d e9 42 da ff ff 66 90 0f 1f 44 00 00 41 57 41 56 41 55 49 89 d5 41 54 
55 48 89 fd 53 48 83 ec 08 40 f6 c7 07 0f 85 19 02 00 00 <48> 89 75 08 48 c7 45 
00 00 00 00 00 9c 58 0f 1f 44 00 00 49 89
 RIP: __call_rcu+0x23/0x2b0 RSP: aac3005fb798
 CR2: 0018

fix this in tcf_vlan_cleanup(), ensuring that kfree_rcu(p, ...) is called
only when p is not NULL.
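
A minimal user-space sketch of why kfree_rcu(NULL, ...) faults at a small
non-zero address rather than at 0. The structs below (fake_rcu_head,
fake_vlan_params) are hypothetical stand-ins with an assumed field layout,
not the kernel definitions. The assumption is that kfree_rcu() hands
&p->rcu to __call_rcu(), which then stores into the callback field of that
head, so with p == NULL the store lands at the sum of two offsets:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rcu_head: a link pointer and a callback. */
struct fake_rcu_head {
	struct fake_rcu_head *next;
	void (*func)(struct fake_rcu_head *head);
};

/* Hypothetical parameter block; this field layout is an assumption. */
struct fake_vlan_params {
	int action;
	uint16_t push_vid;
	uint16_t push_proto;
	uint8_t push_prio;
	struct fake_rcu_head rcu;
};

/*
 * Address a store to head->func would touch if the enclosing object
 * pointer were NULL: offset of the embedded head plus offset of func.
 */
uintptr_t null_fault_address(void)
{
	return offsetof(struct fake_vlan_params, rcu) +
	       offsetof(struct fake_rcu_head, func);
}
```

With this assumed layout on a typical LP64 build the two offsets are 0x10
and 0x08, which is consistent with the small fault address in the trace.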

Fixes: 4c5b9d9642c8 ("act_vlan: VLAN action rewrite to use RCU lock/unlock and update")
Acked-by: Jiri Pirko 
Acked-by: Manish Kurup 
Signed-off-by: Davide Caratti 
---
 net/sched/act_vlan.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index e1a1b3f3983a..c2914e9a4a6f 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -225,7 +225,8 @@ static void tcf_vlan_cleanup(struct tc_action *a)
struct tcf_vlan_params *p;
 
p = rcu_dereference_protected(v->vlan_p, 1);
-   kfree_rcu(p, rcu);
+   if (p)
+   kfree_rcu(p, rcu);
 }
 
 static int tcf_vlan_dump(struct sk_buff *skb, struct tc_action *a,
-- 
2.14.3



[PATCH net 3/5] net/sched: fix NULL dereference in the error path of tunnel_key_init()

2018-03-15 Thread Davide Caratti
when the following command

 # tc action add action tunnel_key unset index 100

is run for the first time, and tunnel_key_init() fails to allocate struct
tcf_tunnel_key_params, tunnel_key_release() dereferences NULL pointers.
This causes the following error:

 BUG: unable to handle kernel NULL pointer dereference at 0010
 IP: tunnel_key_release+0xd/0x40 [act_tunnel_key]
 PGD 800033787067 P4D 800033787067 PUD 74646067 PMD 0
 Oops:  [#1] SMP PTI
 Modules linked in: act_tunnel_key(E) act_csum ip6table_filter ip6_tables 
iptable_filter binfmt_misc ext4 mbcache jbd2 crct10dif_pclmul crc32_pclmul 
snd_hda_codec_generic ghash_clmulni_intel snd_hda_intel pcbc snd_hda_codec 
snd_hda_core snd_hwdep snd_seq aesni_intel snd_seq_device crypto_simd 
glue_helper snd_pcm cryptd joydev snd_timer pcspkr virtio_balloon snd i2c_piix4 
soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c 
ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt 
fb_sys_fops ttm virtio_net virtio_blk drm virtio_console crc32c_intel ata_piix 
serio_raw i2c_core virtio_pci libata virtio_ring virtio floppy dm_mirror 
dm_region_hash dm_log dm_mod
 CPU: 2 PID: 3101 Comm: tc Tainted: G E 4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:tunnel_key_release+0xd/0x40 [act_tunnel_key]
 RSP: 0018:ba46803b7768 EFLAGS: 00010286
 RAX: c09010a0 RBX:  RCX: 0024
 RDX:  RSI:  RDI: 99ee336d7480
 RBP:  R08: 0001 R09: 0044
 R10: 0220 R11: 99ee79d73131 R12: 
 R13: 99ee32d67610 R14: 99ee7671dc38 R15: fff4
 FS:  7febcb2cd740() GS:99ee7fd0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0010 CR3: 7c8e4005 CR4: 001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tunnel_key_init+0xd9/0x460 [act_tunnel_key]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7febca6deba0
 RSP: 002b:7ffe7b0dd128 EFLAGS: 0246 ORIG_RAX: 002e
 RAX: ffda RBX: 7ffe7b0dd250 RCX: 7febca6deba0
 RDX:  RSI: 7ffe7b0dd1a0 RDI: 0003
 RBP: 5aaa90cb R08: 0002 R09: 
 R10: 7ffe7b0dcba0 R11: 0246 R12: 
 R13: 7ffe7b0dd264 R14: 0001 R15: 00669f60
 Code: 44 00 00 8b 0d b5 23 00 00 48 8b 87 48 10 00 00 48 8b 3c c8 e9 a5 e5 d8 
c3 0f 1f 44 00 00 0f 1f 44 00 00 53 48 8b 9f b0 00 00 00 <83> 7b 10 01 74 0b 48 
89 df 31 f6 5b e9 f2 fa 7f c3 48 8b 7b 18
 RIP: tunnel_key_release+0xd/0x40 [act_tunnel_key] RSP: ba46803b7768
 CR2: 0010

Fix this in tunnel_key_release(), ensuring 'param' is not NULL before
dereferencing it.

Fixes: d0f6dd8a914f ("net/sched: Introduce act_tunnel_key")
Signed-off-by: Davide Caratti 
---
 net/sched/act_tunnel_key.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 0e23aac09ad6..5dd819840feb 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -207,11 +207,12 @@ static void tunnel_key_release(struct tc_action *a)
struct tcf_tunnel_key_params *params;
 
params = rcu_dereference_protected(t->params, 1);
+   if (params) {
+   if (params->tcft_action == TCA_TUNNEL_KEY_ACT_SET)
+   dst_release(&params->tcft_enc_metadata->dst);
 
-   if (params->tcft_action == TCA_TUNNEL_KEY_ACT_SET)
-   dst_release(&params->tcft_enc_metadata->dst);
-
-   kfree_rcu(params, rcu);
+   kfree_rcu(params, rcu);
+   }
 }
 
 static int tunnel_key_dump_addresses(struct sk_buff *skb,
-- 
2.14.3



[PATCH net 0/5] net/sched: fix NULL dereference in the error path of .init()

2018-03-15 Thread Davide Caratti
With several TC actions it's possible to see a NULL pointer dereference
when the .init() function calls tcf_idr_alloc(), fails at some point, and
then calls tcf_idr_release(). This series fixes all of them by introducing
non-NULL tests in the .cleanup() function.
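
The failure pattern described above can be sketched in user space as
follows. This is only an illustration under stated assumptions: demo_action,
demo_params, demo_init() and demo_cleanup() are hypothetical names, and
malloc()/free() stand in for the kernel allocators. The point is that the
release path can run before the parameter block exists, so the cleanup has
to tolerate a NULL pointer:

```c
#include <errno.h>
#include <stdlib.h>

struct demo_params {
	int in_use;
};

struct demo_action {
	struct demo_params *params;	/* stays NULL until init succeeds */
};

/* Mirrors the fixed .cleanup(): touch params only when they exist. */
int demo_cleanup(struct demo_action *a)
{
	struct demo_params *p = a->params;

	if (!p)
		return 0;	/* nothing was allocated, nothing to release */
	p->in_use = 0;		/* would fault here without the check above */
	free(p);
	a->params = NULL;
	return 1;
}

/* Mirrors .init(): the action exists first, params are allocated later. */
int demo_init(struct demo_action *a, int simulate_alloc_failure)
{
	struct demo_params *p;

	a->params = NULL;
	p = simulate_alloc_failure ? NULL : malloc(sizeof(*p));
	if (!p) {
		demo_cleanup(a);	/* error path runs with params == NULL */
		return -ENOMEM;
	}
	p->in_use = 1;
	a->params = p;
	return 0;
}
```

Calling demo_init() with the simulated failure returns -ENOMEM without
touching a NULL parameter block, which is the behaviour the series aims for.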

Davide Caratti (5):
  net/sched: fix NULL dereference in the error path of tcf_vlan_init()
  net/sched: fix NULL dereference in the error path of tcf_csum_init()
  net/sched: fix NULL dereference in the error path of tunnel_key_init()
  net/sched: fix NULL dereference in the error path of tcf_sample_init()
  net/sched: fix NULL dereference on the error path of tcf_skbmod_init()

 net/sched/act_csum.c   | 3 ++-
 net/sched/act_sample.c | 3 ++-
 net/sched/act_skbmod.c | 3 ++-
 net/sched/act_tunnel_key.c | 9 +++++----
 net/sched/act_vlan.c   | 3 ++-
 5 files changed, 13 insertions(+), 8 deletions(-)

-- 
2.14.3



[PATCH net 4/5] net/sched: fix NULL dereference in the error path of tcf_sample_init()

2018-03-15 Thread Davide Caratti
when the following command

 # tc action add action sample rate 100 group 100 index 100

is run for the first time, and psample_group_get(100) fails to create a
new group, tcf_sample_cleanup() calls psample_group_put(NULL), thus
causing the following error:

 BUG: unable to handle kernel NULL pointer dereference at 001c
 IP: psample_group_put+0x15/0x71 [psample]
 PGD 800075775067 P4D 800075775067 PUD 7453c067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_sample(E) psample ip6table_filter ip6_tables 
iptable_filter binfmt_misc ext4 snd_hda_codec_generic snd_hda_intel 
snd_hda_codec snd_hda_core mbcache jbd2 crct10dif_pclmul snd_hwdep crc32_pclmul 
snd_seq ghash_clmulni_intel pcbc snd_seq_device snd_pcm aesni_intel crypto_simd 
snd_timer glue_helper snd cryptd joydev pcspkr i2c_piix4 soundcore 
virtio_balloon nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs 
libcrc32c ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect 
sysimgblt fb_sys_fops ttm drm virtio_net ata_piix virtio_console virtio_blk 
libata serio_raw crc32c_intel virtio_pci i2c_core virtio_ring virtio floppy 
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_tunnel_key]
 CPU: 2 PID: 5740 Comm: tc Tainted: G E 4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:psample_group_put+0x15/0x71 [psample]
 RSP: 0018:b8a80032f7d0 EFLAGS: 00010246
 RAX:  RBX:  RCX: 0024
 RDX: 0001 RSI:  RDI: c06d93c0
 RBP:  R08: 0001 R09: 0044
 R10: bd003000 R11: 979fba04aa59 R12: 
 R13:  R14:  R15: 979fbba3f22c
 FS:  7f7638112740() GS:979fbfd0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 001c CR3: 734ea001 CR4: 001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_sample_init+0x125/0x1d0 [act_sample]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7f7637523ba0
 RSP: 002b:7fff0473ef58 EFLAGS: 0246 ORIG_RAX: 002e
 RAX: ffda RBX: 7fff0473f080 RCX: 7f7637523ba0
 RDX:  RSI: 7fff0473efd0 RDI: 0003
 RBP: 5c80 R08: 0002 R09: 
 R10: 7fff0473e9e0 R11: 0246 R12: 
 R13: 7fff0473f094 R14: 0001 R15: 00669f60
 Code: be 02 00 00 00 48 89 df e8 a9 fe ff ff e9 7c ff ff ff 0f 1f 40 00 0f 1f 
44 00 00 53 48 89 fb 48 c7 c7 c0 93 6d c0 e8 db 20 8c ef <83> 6b 1c 01 74 10 48 
c7 c7 c0 93 6d c0 ff 14 25 e8 83 83 b0 5b
 RIP: psample_group_put+0x15/0x71 [psample] RSP: b8a80032f7d0
 CR2: 001c

Fix it in tcf_sample_cleanup(), ensuring that calls to psample_group_put(p)
are done only when p is not NULL.

Fixes: cadb9c9fdbc6 ("net/sched: act_sample: Fix error path in init")
Signed-off-by: Davide Caratti 
---
 net/sched/act_sample.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index 1ba0df238756..74c5d7e6a0fa 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -103,7 +103,8 @@ static void tcf_sample_cleanup(struct tc_action *a)
 
psample_group = rtnl_dereference(s->psample_group);
RCU_INIT_POINTER(s->psample_group, NULL);
-   psample_group_put(psample_group);
+   if (psample_group)
+   psample_group_put(psample_group);
 }
 
 static bool tcf_sample_dev_ok_push(struct net_device *dev)
-- 
2.14.3



[PATCH net 2/5] net/sched: fix NULL dereference in the error path of tcf_csum_init()

2018-03-15 Thread Davide Caratti
when the following command

 # tc action add action csum udp continue index 100

is run for the first time, and tcf_csum_init() fails allocating struct
tcf_csum, tcf_csum_cleanup() calls kfree_rcu(NULL,...). This causes the
following error:

 BUG: unable to handle kernel NULL pointer dereference at 0010
 IP: __call_rcu+0x23/0x2b0
 PGD 8000740b4067 P4D 8000740b4067 PUD 32e7f067 PMD 0
 Oops: 0002 [#1] SMP PTI
 Modules linked in: act_csum(E) act_vlan ip6table_filter ip6_tables 
iptable_filter binfmt_misc ext4 mbcache jbd2 crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel snd_hda_codec_generic pcbc snd_hda_intel snd_hda_codec 
snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer aesni_intel 
crypto_simd glue_helper cryptd snd joydev pcspkr virtio_balloon i2c_piix4 
soundcore nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c 
ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt 
fb_sys_fops ttm virtio_blk drm virtio_net virtio_console ata_piix crc32c_intel 
libata virtio_pci serio_raw i2c_core virtio_ring virtio floppy dm_mirror 
dm_region_hash dm_log dm_mod [last unloaded: act_vlan]
 CPU: 2 PID: 5763 Comm: tc Tainted: G E 4.16.0-rc4.act_vlan.orig+ #403
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:__call_rcu+0x23/0x2b0
 RSP: 0018:b275803e77c0 EFLAGS: 00010246
 RAX: c057b080 RBX: 9674bc6f5240 RCX: 
 RDX: 928a5f00 RSI: 0008 RDI: 0008
 RBP: 0008 R08: 0001 R09: 0044
 R10: 0220 R11: 9674b9ab4821 R12: 
 R13: 928a5f00 R14:  R15: 0001
 FS:  7fa6368d8740() GS:9674bfd0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0010 CR3: 73dec001 CR4: 001606e0
 Call Trace:
  __tcf_idr_release+0x79/0xf0
  tcf_csum_init+0xfb/0x180 [act_csum]
  tcf_action_init_1+0x2cc/0x430
  tcf_action_init+0xd3/0x1b0
  tc_ctl_action+0x18b/0x240
  rtnetlink_rcv_msg+0x29c/0x310
  ? _cond_resched+0x15/0x30
  ? __kmalloc_node_track_caller+0x1b9/0x270
  ? rtnl_calcit.isra.28+0x100/0x100
  netlink_rcv_skb+0xd2/0x110
  netlink_unicast+0x17c/0x230
  netlink_sendmsg+0x2cd/0x3c0
  sock_sendmsg+0x30/0x40
  ___sys_sendmsg+0x27a/0x290
  ? filemap_map_pages+0x34a/0x3a0
  ? __handle_mm_fault+0xbfd/0xe20
  __sys_sendmsg+0x51/0x90
  do_syscall_64+0x6e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
 RIP: 0033:0x7fa635ce9ba0
 RSP: 002b:7ffc185b0fc8 EFLAGS: 0246 ORIG_RAX: 002e
 RAX: ffda RBX: 7ffc185b10f0 RCX: 7fa635ce9ba0
 RDX:  RSI: 7ffc185b1040 RDI: 0003
 RBP: 5aaa85e0 R08: 0002 R09: 
 R10: 7ffc185b0a20 R11: 0246 R12: 
 R13: 7ffc185b1104 R14: 0001 R15: 00669f60
 Code: 5d e9 42 da ff ff 66 90 0f 1f 44 00 00 41 57 41 56 41 55 49 89 d5 41 54 
55 48 89 fd 53 48 83 ec 08 40 f6 c7 07 0f 85 19 02 00 00 <48> 89 75 08 48 c7 45 
00 00 00 00 00 9c 58 0f 1f 44 00 00 49 89
 RIP: __call_rcu+0x23/0x2b0 RSP: b275803e77c0
 CR2: 0010

fix this in tcf_csum_cleanup(), ensuring that kfree_rcu(param, ...) is
called only when param is not NULL.

Fixes: 9c5f69bbd75a ("net/sched: act_csum: don't use spinlock in the fast path")
Signed-off-by: Davide Caratti 
---
 net/sched/act_csum.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 24b2e8e681cf..2a5c8fd860cf 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -626,7 +626,8 @@ static void tcf_csum_cleanup(struct tc_action *a)
struct tcf_csum_params *params;
 
params = rcu_dereference_protected(p->params, 1);
-   kfree_rcu(params, rcu);
+   if (params)
+   kfree_rcu(params, rcu);
 }
 
 static int tcf_csum_walker(struct net *net, struct sk_buff *skb,
-- 
2.14.3



Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Miguel Ojeda
On Thu, Mar 15, 2018 at 11:46 PM, Kees Cook  wrote:
> On Thu, Mar 15, 2018 at 3:23 PM, Linus Torvalds
>  wrote:
>> On Thu, Mar 15, 2018 at 3:16 PM, Kees Cook  wrote:
>>>
>>> size_t __error_not_const_arg(void) \
>>> __compiletime_error("const_max() used with non-compile-time constant arg");
>>> #define const_max(x, y) \
>>> __builtin_choose_expr(__builtin_constant_p(x) &&\
>>>   __builtin_constant_p(y),  \
>>>   (typeof(x))(x) > (typeof(y))(y) ? \
>>> (x) : (y),  \
>>>   __error_not_const_arg())
>>>
>>> Is typeof() forcing enums to int? Regardless, I'll put this through
>>> larger testing. How does that look?
>>
>> Ok, that alleviates my worry about one class of insane behavior, but
>> it does raise a few other questions:
>>
>>  - what drugs is gcc on where (typeof(x)(x)) makes a difference? Funky.
>
> Yeah, that's why I didn't even try that originally. But in looking
> back at max() again, it seemed to be the only thing missing that would
> handle the enum evaluation, which turned out to be true.
>
>>  - this does have the usual "what happen if you do
>>
>>  const_max(-1,sizeof(x))
>>
>> where the comparison will now be done in 'size_t', and -1 ends up
>> being a very very big unsigned integer.
>>
>> Is there no way to get that type checking inserted? Maybe now is a
>> good point for that __builtin_types_compatible(), and add it to the
>> constness checking (and change the name of that error case function)?
>
> So, AIUI, I can either get strict type checking, in which case, this
> is rejected (which I assume there is still a desire to have):
>
> int foo[const_max(6, sizeof(whatever))];

Is it that bad to just call it with (size_t)6?
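
A small sketch of what the strict type check sees for these two spellings.
It relies on the GCC/Clang extensions __builtin_types_compatible_p() and
__typeof__; the same_type() macro and the two helper functions are
hypothetical names used only for this illustration:

```c
#include <stddef.h>

/* 1 when both expressions have compatible types (GCC/Clang extension). */
#define same_type(a, b) \
	__builtin_types_compatible_p(__typeof__(a), __typeof__(b))

/* A bare 6 has type int, while sizeof yields size_t: not compatible. */
int plain_six_matches_sizeof(void)
{
	return same_type(6, sizeof(int));
}

/* With the cast, both operands are size_t and the check passes. */
int cast_six_matches_sizeof(void)
{
	return same_type((size_t)6, sizeof(int));
}
```

So a strict check would reject const_max(6, sizeof(whatever)) and accept
the (size_t)6 form, at the cost of a cast at each such call site.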

>
> due to __builtin_types_compatible_p() rejecting it, or I can construct
> a "positive arguments only" test, in which the above is accepted, but
> this is rejected:
>
> int foo[const_max(-1, sizeof(whatever))];

Do we need this case?

>
> By using this eye-bleed:
>
> size_t __error_not_const_arg(void) \
> __compiletime_error("const_max() used with non-compile-time constant arg");
> size_t __error_not_positive_arg(void) \
> __compiletime_error("const_max() used with negative arg");
> #define const_max(x, y) \
> __builtin_choose_expr(__builtin_constant_p(x) &&\
>   __builtin_constant_p(y),  \
> __builtin_choose_expr((x) >= 0 && (y) >= 0, \
>   (typeof(x))(x) > (typeof(y))(y) ? \
> (x) : (y),  \
>   __error_not_positive_arg()),  \
> __error_not_const_arg())
>

I was writing it like this:

#define const_max(a, b) \
    ({ \
        if ((a) < 0) \
            __const_max_called_with_negative_value(); \
        if ((b) < 0) \
            __const_max_called_with_negative_value(); \
        if (!__builtin_types_compatible_p(typeof(a), typeof(b))) \
            __const_max_called_with_incompatible_types(); \
        __builtin_choose_expr((a) > (b), (a), (b)); \
    })

Cheers,
Miguel


> -Kees
>
> --
> Kees Cook
> Pixel Security


Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread Daniel Borkmann
On 03/15/2018 11:20 PM, Alexei Starovoitov wrote:
> On Thu, Mar 15, 2018 at 11:17:12PM +0100, Daniel Borkmann wrote:
>> On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
>>> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
  
>>>> +/* User return codes for SK_MSG prog type. */
>>>> +enum sk_msg_action {
>>>> +  SK_MSG_DROP = 0,
>>>> +  SK_MSG_PASS,
>>>> +};
>>>
>>> do we really need new enum here?
>>> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
>>> and there will be only drop/pass in both enums.
>>> Also I don't see where these two new SK_MSG_* are used...
>>>
>>>> +
>>>> +/* user accessible metadata for SK_MSG packet hook, new fields must
>>>> + * be added to the end of this structure
>>>> + */
>>>> +struct sk_msg_md {
>>>> +  __u32 data;
>>>> +  __u32 data_end;
>>>> +};
>>>
>>> I think it's time for me to ask for forgiveness :)
>>
>> :-)
>>
>>> I used __u32 for data and data_end only because all other fields
>>> in __sk_buff were __u32 at the time and I couldn't easily figure out
>>> how to teach verifier to recognize 8-byte rewrites.
>>> Unfortunately my mistake stuck and was copied over into xdp.
>>> Since this is new struct let's do it right and add
>>> 'void *data, *data_end' here,
>>> since bpf prog will use them as 'void *' pointers.
>>> There are no compat issues here, since bpf is always 64-bit.
>>
>> But at least offset-wise when you do the ctx rewrite this would then
>> be a bit more tricky when you have 64 bit kernel with 32 bit user
>> space since void * members are in each cases at different offset. So
>> unless I'm missing something, this still should either be __u32 or
>> __u64 instead of void *, no?
> 
> there is no 32-bit user space. these structs are seen by bpf progs only
> and bpf is 64-bit only too.
> unless I'm missing your point.

Ok, so let's say you have a 32 bit LLVM binary and compile the prog where
you access md->data_end. Given the void * in the struct, will that access
end up being BPF_W at ctx offset 4 or BPF_DW at ctx offset 8 from clang's
perspective (iow, does the back end treat this specially and always use a
fixed BPF_DW in such a case)? If not, and it would be the first case with
offset 4, then we could have the case that the underlying 64 bit kernel is
expecting ctx offset 8 for doing the md ctx conversion.
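
To make the offset question concrete, here is a sketch that only shows how
a host C compiler lays out the two candidate structs. The struct and
function names are hypothetical, and it says nothing about what the BPF
back end itself emits, which is the open question:

```c
#include <stddef.h>
#include <stdint.h>

/* Pointer members: the second offset follows the target's pointer size. */
struct md_with_pointers {
	void *data;
	void *data_end;
};

/* Fixed-width members: the second offset is 8 on every target. */
struct md_with_u64 {
	uint64_t data;
	uint64_t data_end;
};

size_t data_end_offset_pointers(void)
{
	return offsetof(struct md_with_pointers, data_end);
}

size_t data_end_offset_u64(void)
{
	return offsetof(struct md_with_u64, data_end);
}
```

Built for a 32 bit target the pointer variant puts data_end at offset 4,
for a 64 bit target at offset 8, while the fixed-width variant stays at 8.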


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Kees Cook
On Thu, Mar 15, 2018 at 3:23 PM, Linus Torvalds
 wrote:
> On Thu, Mar 15, 2018 at 3:16 PM, Kees Cook  wrote:
>>
>> size_t __error_not_const_arg(void) \
>> __compiletime_error("const_max() used with non-compile-time constant arg");
>> #define const_max(x, y) \
>> __builtin_choose_expr(__builtin_constant_p(x) &&\
>>   __builtin_constant_p(y),  \
>>   (typeof(x))(x) > (typeof(y))(y) ? \
>> (x) : (y),  \
>>   __error_not_const_arg())
>>
>> Is typeof() forcing enums to int? Regardless, I'll put this through
>> larger testing. How does that look?
>
> Ok, that alleviates my worry about one class of insane behavior, but
> it does raise a few other questions:
>
>  - what drugs is gcc on where (typeof(x)(x)) makes a difference? Funky.

Yeah, that's why I didn't even try that originally. But in looking
back at max() again, it seemed to be the only thing missing that would
handle the enum evaluation, which turned out to be true.

>  - this does have the usual "what happen if you do
>
>  const_max(-1,sizeof(x))
>
> where the comparison will now be done in 'size_t', and -1 ends up
> being a very very big unsigned integer.
>
> Is there no way to get that type checking inserted? Maybe now is a
> good point for that __builtin_types_compatible(), and add it to the
> constness checking (and change the name of that error case function)?

So, AIUI, I can either get strict type checking, in which case, this
is rejected (which I assume there is still a desire to have):

int foo[const_max(6, sizeof(whatever))];

due to __builtin_types_compatible_p() rejecting it, or I can construct
a "positive arguments only" test, in which the above is accepted, but
this is rejected:

int foo[const_max(-1, sizeof(whatever))];

By using this eye-bleed:

size_t __error_not_const_arg(void) \
__compiletime_error("const_max() used with non-compile-time constant arg");
size_t __error_not_positive_arg(void) \
__compiletime_error("const_max() used with negative arg");
#define const_max(x, y)                                         \
        __builtin_choose_expr(__builtin_constant_p(x) &&        \
                              __builtin_constant_p(y),          \
                __builtin_choose_expr((x) >= 0 && (y) >= 0,     \
                              (typeof(x))(x) > (typeof(y))(y) ? \
                                        (x) : (y),              \
                              __error_not_positive_arg()),      \
                __error_not_const_arg())

-Kees

-- 
Kees Cook
Pixel Security


Re: [bug, bisected] pfifo_fast causes packet reordering

2018-03-15 Thread John Fastabend
On 03/15/2018 11:08 AM, Jakob Unterwurzacher wrote:
> On 14.03.18 05:03, John Fastabend wrote:
>> On 03/13/2018 11:35 AM, Dave Taht wrote:
>>> On Tue, Mar 13, 2018 at 11:24 AM, Jakob Unterwurzacher
>>>  wrote:
>>>> During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on Linux
>>>> v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of packets
>>>> are delivered out-of-order.

>>
>> Is the stress-testing tool available somewhere? What type of packets
>> are being sent?
> 
> 
> I have reproduced it using two USB network cards connected to each other. The 
> test tool sends UDP packets containing a counter and listens on the other 
> interface, it is available at
> https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py
> 

Great thanks, can you also run this with taskset to bind to
a single CPU,

 # taskset 0x1 ./pfifo_stress.py

And let me know if you still see the OOO.

> Here is what I get:
> 
> root@rk3399-q7:~# ./pfifo_stress.py
> [...]
> expected ctr 0xcdc, received 0xcdd
> expected ctr 0xcde, received 0xcdc
> expected ctr 0xcdd, received 0xcde
> expected ctr 0xe3c, received 0xe3d
> expected ctr 0xe3e, received 0xe3c
> expected ctr 0xe3d, received 0xe3e
> expected ctr 0x1097, received 0x1098
> expected ctr 0x1099, received 0x1097
> expected ctr 0x1098, received 0x1099
> expected ctr 0x17c0, received 0x17c1
> expected ctr 0x17c2, received 0x17c0
> [...]
> 
> Best regards,
> Jakob
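
For reading the log above: a sketch of the kind of check that would print
such lines. This is an assumption inferred from the output, not taken from
pfifo_stress.py itself. It assumes the receiver expects "last received
counter + 1" and resynchronises on every packet, and count_mismatches() is
a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

/* Count packets whose counter differs from "previous counter + 1". */
size_t count_mismatches(const uint32_t *ctr, size_t n)
{
	size_t mismatches = 0;
	uint32_t expected;
	size_t i;

	if (n == 0)
		return 0;

	expected = ctr[0];
	for (i = 0; i < n; i++) {
		if (ctr[i] != expected)
			mismatches++;
		expected = ctr[i] + 1;	/* resynchronise on what arrived */
	}
	return mismatches;
}
```

Under that assumption a single swap of two adjacent packets yields three
mismatch lines, which matches the groups of three seen in the log.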



Re: [PATCH 3/7] RDMA/qedr: eliminate duplicate barriers on weakly-ordered archs

2018-03-15 Thread Jason Gunthorpe
On Tue, Mar 13, 2018 at 11:20:24PM -0400, Sinan Kaya wrote:
> Code includes wmb() followed by writel() in multiple places. writel()
> already has a barrier on some architectures like arm64.
> 
> This ends up CPU observing two barriers back to back before executing the
> register write.
> 
> Since code already has an explicit barrier call, changing writel() to
> writel_relaxed().
> 
> Signed-off-by: Sinan Kaya 
> Acked-by: Jason Gunthorpe 
>  drivers/infiniband/hw/qedr/verbs.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Applied to RDMA for-next

Thanks,
Jason


Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 3:16 PM, Kees Cook  wrote:
>
> size_t __error_not_const_arg(void) \
> __compiletime_error("const_max() used with non-compile-time constant arg");
> #define const_max(x, y) \
> __builtin_choose_expr(__builtin_constant_p(x) &&\
>   __builtin_constant_p(y),  \
>   (typeof(x))(x) > (typeof(y))(y) ? \
> (x) : (y),  \
>   __error_not_const_arg())
>
> Is typeof() forcing enums to int? Regardless, I'll put this through
> larger testing. How does that look?

Ok, that alleviates my worry about one class of insane behavior, but
it does raise a few other questions:

 - what drugs is gcc on where (typeof(x)(x)) makes a difference? Funky.

 - this does have the usual "what happens if you do

 const_max(-1,sizeof(x))

where the comparison will now be done in 'size_t', and -1 ends up
being a very very big unsigned integer.

Is there no way to get that type checking inserted? Maybe now is a
good point for that __builtin_types_compatible_p(), and add it to the
constness checking (and change the name of that error case function)?
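
The wrap-around in question, as a standalone sketch. The helper names are
hypothetical; the explicit casts spell out what the usual arithmetic
conversions do implicitly when one operand is a sizeof:

```c
#include <stddef.h>
#include <stdint.h>

/* What the comparison sees once the signed value is converted to size_t. */
size_t as_size_t(long long v)
{
	return (size_t)v;
}

/* Comparison carried out in size_t: a negative value wraps and "wins". */
int greater_in_size_t(long long a, size_t b)
{
	return (size_t)a > b;
}

/* Comparison that checks the sign first, so a negative value never wins. */
int greater_sign_aware(long long a, size_t b)
{
	return a >= 0 && (unsigned long long)a > (unsigned long long)b;
}
```

Here as_size_t(-1) is SIZE_MAX, so a max computed in size_t picks the -1
operand over any sizeof.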

  Linus


Re: [PATCH net] net/sched: fix NULL dereference in the error path of tcf_vlan_init()

2018-03-15 Thread Davide Caratti
On Thu, 2018-03-15 at 15:29 +0100, Davide Caratti wrote:
> On Thu, 2018-03-15 at 15:21 +0100, Jiri Pirko wrote:
> ...
> 
> > Acked-by: Jiri Pirko 
> 
> thank you for reviewing!
> 
> apparently, act_tunnel_key and act_csum also seem to have a similar problem.
> I will check and eventually do a followup series this afternoon.
> 
> thank you,
> regards

hello David,

please drop this patch: after some tests, the following TC actions are
affected by the same problem:

act_vlan
act_csum
act_tunnel_key
act_skbmod
act_sample

so, I'm posting right now a series that fixes all of them.

In act_ife and act_bpf, the problem is potentially there, but we don't see
it crashing yet because we don't call tcf_idr_release() on the error
path. 
This is causing the leak of 'index', and will be fixed in another series
tomorrow.

thank you in advance,
regards
-- 
davide


Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread Alexei Starovoitov
On Thu, Mar 15, 2018 at 11:17:12PM +0100, Daniel Borkmann wrote:
> On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
> >>  
> >> +/* User return codes for SK_MSG prog type. */
> >> +enum sk_msg_action {
> >> +  SK_MSG_DROP = 0,
> >> +  SK_MSG_PASS,
> >> +};
> > 
> > do we really need new enum here?
> > It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
> > and there will be only drop/pass in both enums.
> > Also I don't see where these two new SK_MSG_* are used...
> > 
> >> +
> >> +/* user accessible metadata for SK_MSG packet hook, new fields must
> >> + * be added to the end of this structure
> >> + */
> >> +struct sk_msg_md {
> >> +  __u32 data;
> >> +  __u32 data_end;
> >> +};
> > 
> > I think it's time for me to ask for forgiveness :)
> 
> :-)
> 
> > I used __u32 for data and data_end only because all other fields
> > in __sk_buff were __u32 at the time and I couldn't easily figure out
> > how to teach verifier to recognize 8-byte rewrites.
> > Unfortunately my mistake stuck and was copied over into xdp.
> > Since this is new struct let's do it right and add
> > 'void *data, *data_end' here,
> > since bpf prog will use them as 'void *' pointers.
> > There are no compat issues here, since bpf is always 64-bit.
> 
> But at least offset-wise when you do the ctx rewrite this would then
> be a bit more tricky when you have 64 bit kernel with 32 bit user
> space since void * members are in each cases at different offset. So
> unless I'm missing something, this still should either be __u32 or
> __u64 instead of void *, no?

there is no 32-bit user space. these structs are seen by bpf progs only
and bpf is 64-bit only too.
unless I'm missing your point.



Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Kees Cook
On Thu, Mar 15, 2018 at 2:42 PM, Linus Torvalds
 wrote:
> On Thu, Mar 15, 2018 at 12:47 PM, Kees Cook  wrote:
>>
>> To gain the ability to compare differing types, the arguments are
>> explicitly cast to size_t.
>
> Ugh, I really hate this.
>
> It silently does insane things if you do
>
>const_max(-1,6)
>
> and there is nothing in the name that implies that you can't use
> negative constants.

Yeah, I didn't like that effect either. I was seeing this:

./include/linux/kernel.h:836:14: warning: comparison between ‘enum
<anonymous>’ and ‘enum <anonymous>’ [-Wenum-compare]
  (x) > (y) ? \
  ^
./include/linux/kernel.h:838:7: note: in definition of macro ‘const_max’
  (y),  \
   ^
net/ipv6/proc.c:34:11: note: in expansion of macro ‘const_max’
   const_max(IPSTATS_MIB_MAX, ICMP_MIB_MAX))
   ^

But it turns out that just doing a typeof() fixes this, and there's no
need for the hard cast to size_t:

size_t __error_not_const_arg(void) \
__compiletime_error("const_max() used with non-compile-time constant arg");
#define const_max(x, y)                                         \
        __builtin_choose_expr(__builtin_constant_p(x) &&        \
                              __builtin_constant_p(y),          \
                              (typeof(x))(x) > (typeof(y))(y) ? \
                                        (x) : (y),              \
                              __error_not_const_arg())

Is typeof() forcing enums to int? Regardless, I'll put this through
larger testing. How does that look?

-Kees

-- 
Kees Cook
Pixel Security


Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread Daniel Borkmann
On 03/15/2018 10:59 PM, Alexei Starovoitov wrote:
> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
>>  
>> +/* User return codes for SK_MSG prog type. */
>> +enum sk_msg_action {
>> +SK_MSG_DROP = 0,
>> +SK_MSG_PASS,
>> +};
> 
> do we really need new enum here?
> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
> and there will be only drop/pass in both enums.
> Also I don't see where these two new SK_MSG_* are used...
> 
>> +
>> +/* user accessible metadata for SK_MSG packet hook, new fields must
>> + * be added to the end of this structure
>> + */
>> +struct sk_msg_md {
>> +__u32 data;
>> +__u32 data_end;
>> +};
> 
> I think it's time for me to ask for forgiveness :)

:-)

> I used __u32 for data and data_end only because all other fields
> in __sk_buff were __u32 at the time and I couldn't easily figure out
> how to teach verifier to recognize 8-byte rewrites.
> Unfortunately my mistake stuck and was copied over into xdp.
> Since this is new struct let's do it right and add
> 'void *data, *data_end' here,
> since bpf prog will use them as 'void *' pointers.
> There are no compat issues here, since bpf is always 64-bit.

But at least offset-wise when you do the ctx rewrite this would then
be a bit more tricky when you have 64 bit kernel with 32 bit user
space, since void * members are in each case at a different offset. So
unless I'm missing something, this still should either be __u32 or
__u64 instead of void *, no?
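
A minimal sketch of the offset concern, assuming nothing beyond standard C. The two structs are hypothetical stand-ins: one models how a 32-bit user space would lay out two pointer members, the other how a 64-bit kernel would. Fixed-width integers are used so the example behaves the same on any host.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout seen by a 32-bit user space if the members were pointers
 * (pointers modelled as 32-bit integers; hypothetical name). */
struct demo_md_ptr32 {
	uint32_t data;
	uint32_t data_end;
};

/* Layout seen by a 64-bit kernel for the same two pointer members. */
struct demo_md_ptr64 {
	uint64_t data;
	uint64_t data_end;
};

/* Returns nonzero when data_end sits at the same offset in both views. */
int demo_offsets_agree(void)
{
	return offsetof(struct demo_md_ptr32, data_end) ==
	       offsetof(struct demo_md_ptr64, data_end);
}
```

The second member lands at offset 4 in one view and 8 in the other, which is the mismatch a fixed-width type (__u32 or __u64) in the uapi struct would avoid.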

>> +static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
>> +{
>> +return ((_rc == SK_PASS) ?
>> +   (md->map ? __SK_REDIRECT : __SK_PASS) :
>> +   __SK_DROP);
> 
> you're using old SK_PASS here too ;)
> that's to my point of not adding SK_MSG_PASS...
> 
> Overall the patch set looks absolutely great.
> Thank you for working on it.

+1


Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread John Fastabend
On 03/15/2018 02:59 PM, Alexei Starovoitov wrote:
> On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
>>  
>> +/* User return codes for SK_MSG prog type. */
>> +enum sk_msg_action {
>> +SK_MSG_DROP = 0,
>> +SK_MSG_PASS,
>> +};
> 
> do we really need new enum here?

Nope and as you noticed the actual code uses the
SK_{DROP|PASS} enum. Will remove this.

> It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
> and there will be only drop/pass in both enums.
> Also I don't see where these two new SK_MSG_* are used...
> 
>> +
>> +/* user accessible metadata for SK_MSG packet hook, new fields must
>> + * be added to the end of this structure
>> + */
>> +struct sk_msg_md {
>> +__u32 data;
>> +__u32 data_end;
>> +};
> 
> I think it's time for me to ask for forgiveness :)
> I used __u32 for data and data_end only because all other fields
> in __sk_buff were __u32 at the time and I couldn't easily figure out
> how to teach verifier to recognize 8-byte rewrites.
> Unfortunately my mistake stuck and was copied over into xdp.
> Since this is new struct let's do it right and add
> 'void *data, *data_end' here,
> since bpf prog will use them as 'void *' pointers.
> There are no compat issues here, since bpf is always 64-bit.
> 

Aha, nice catch. Yep, let's use 'void *' here. I had forgotten about
that discussion and copied them here as well.

>> +static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
>> +{
>> +return ((_rc == SK_PASS) ?
>> +   (md->map ? __SK_REDIRECT : __SK_PASS) :
>> +   __SK_DROP);
> 
> you're using old SK_PASS here too ;)
> that's to my point of not adding SK_MSG_PASS...
> 

+1

> Overall the patch set looks absolutely great.
> Thank you for working on it.
> 

I'll fixup a few of these small things now and should have
a v3 shortly.


Re: [bpf-next PATCH v2 15/18] bpf: sockmap sample support for bpf_msg_cork_bytes()

2018-03-15 Thread John Fastabend
On 03/15/2018 01:15 PM, Alexei Starovoitov wrote:
> On Mon, Mar 12, 2018 at 12:24:21PM -0700, John Fastabend wrote:
>> Add sample application support for the bpf_msg_cork_bytes helper. This
>> lets the user specify how many bytes each verdict should apply to.
>>
>> Similar to apply_bytes() tests these can be run as a stand-alone test
>> when used without other options or inline with other tests by using
>> the txmsg_cork option along with any of the basic tests txmsg,
>> txmsg_redir, txmsg_drop.
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  include/uapi/linux/bpf_common.h   |7 ++--
>>  samples/sockmap/sockmap_kern.c|   53 
>> +
>>  samples/sockmap/sockmap_user.c|   19 ++
>>  tools/include/uapi/linux/bpf.h|3 +-
>>  tools/testing/selftests/bpf/bpf_helpers.h |2 +
>>  5 files changed, 71 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/uapi/linux/bpf_common.h 
>> b/include/uapi/linux/bpf_common.h
>> index ee97668..18be907 100644
>> --- a/include/uapi/linux/bpf_common.h
>> +++ b/include/uapi/linux/bpf_common.h
>> @@ -15,10 +15,9 @@
>>  
>>  /* ld/ldx fields */
>>  #define BPF_SIZE(code)  ((code) & 0x18)
>> -#define BPF_W   0x00 /* 32-bit */
>> -#define BPF_H   0x08 /* 16-bit */
>> -#define BPF_B   0x10 /*  8-bit */
>> -/* eBPF BPF_DW  0x18    64-bit */
>> +#define BPF_W   0x00
>> +#define BPF_H   0x08
>> +#define BPF_B   0x10
> 
> this hunk seems wrong here. Botched rebase?
> 

Yep this hunk has nothing to do with my work so will
remove this hunk.


Re: [Intel-wired-lan] [next-queue 4/4] ixgbe: enable tso with ipsec offload

2018-03-15 Thread Alexander Duyck
On Thu, Mar 15, 2018 at 2:23 PM, Shannon Nelson
 wrote:
> Fix things up to support TSO offload in conjunction
> with IPsec hw offload.  This raises throughput with
> IPsec offload on to nearly line rate.
>
> Signed-off-by: Shannon Nelson 
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c |  7 +--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 25 +++--
>  2 files changed, 24 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> index 5ddea43..bfbcfc2 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
> @@ -896,6 +896,7 @@ void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
>  void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter)
>  {
> struct ixgbe_ipsec *ipsec;
> +   netdev_features_t features;
> size_t size;
>
> if (adapter->hw.mac.type == ixgbe_mac_82598EB)
> @@ -929,8 +930,10 @@ void ixgbe_init_ipsec_offload(struct ixgbe_adapter 
> *adapter)
> ixgbe_ipsec_clear_hw_tables(adapter);
>
> adapter->netdev->xfrmdev_ops = &ixgbe_xfrmdev_ops;
> -   adapter->netdev->features |= NETIF_F_HW_ESP;
> -   adapter->netdev->hw_enc_features |= NETIF_F_HW_ESP;
> +
> +   features = NETIF_F_HW_ESP | NETIF_F_HW_ESP_TX_CSUM | NETIF_F_GSO_ESP;
> +   adapter->netdev->features |= features;
> +   adapter->netdev->hw_enc_features |= features;

Instead of adding the local variable you might just create a new
define that includes these 3 feature flags and then use that here. You
could use the way I did IXGBE_GSO_PARTIAL_FEATURES as an example.
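
A rough sketch of what such a define could look like, written so it compiles on its own. DEMO_ESP_FEATURES, the DEMO_F_* bit values, demo_features_t and struct demo_netdev are all placeholders invented for this example; the real flags are the NETIF_F_* ones from the patch and have different values.

```c
#include <stdint.h>

typedef uint64_t demo_features_t;	/* stand-in for netdev_features_t */

/* Placeholder bits, NOT the real NETIF_F_* values. */
#define DEMO_F_HW_ESP		((demo_features_t)1 << 0)
#define DEMO_F_HW_ESP_TX_CSUM	((demo_features_t)1 << 1)
#define DEMO_F_GSO_ESP		((demo_features_t)1 << 2)

/* One define for the three ESP-related flags, so no local is needed. */
#define DEMO_ESP_FEATURES	(DEMO_F_HW_ESP | \
				 DEMO_F_HW_ESP_TX_CSUM | \
				 DEMO_F_GSO_ESP)

struct demo_netdev {
	demo_features_t features;
	demo_features_t hw_enc_features;
};

/* Mirrors the tail of the init function with the local variable gone. */
void demo_init_ipsec_offload(struct demo_netdev *dev)
{
	dev->features |= DEMO_ESP_FEATURES;
	dev->hw_enc_features |= DEMO_ESP_FEATURES;
}
```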

> return;
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index a54f3d8..6022666 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -7721,9 +7721,11 @@ static void ixgbe_service_task(struct work_struct 
> *work)
>
>  static int ixgbe_tso(struct ixgbe_ring *tx_ring,
>  struct ixgbe_tx_buffer *first,
> -u8 *hdr_len)
> +u8 *hdr_len,
> +struct ixgbe_ipsec_tx_data *itd)
>  {
> u32 vlan_macip_lens, type_tucmd, mss_l4len_idx;
> +   u32 fceof_saidx = 0;
> struct sk_buff *skb = first->skb;

Reverse xmas tree this. It should probably be moved down to just past
the declaration of paylen and l4_offset.

> union {
> struct iphdr *v4;
> @@ -7762,9 +7764,13 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
> unsigned char *trans_start = ip.hdr + (ip.v4->ihl * 4);
>
> /* IP header will have to cancel out any data that
> -* is not a part of the outer IP header
> +* is not a part of the outer IP header, except for
> +* IPsec where we want the IP+ESP header.
>  */
> -   ip.v4->check = csum_fold(csum_partial(trans_start,
> +   if (first->tx_flags & IXGBE_TX_FLAGS_IPSEC)
> +   ip.v4->check = 0;
> +   else
> +   ip.v4->check = csum_fold(csum_partial(trans_start,
>   csum_start - 
> trans_start,
>   0));
> type_tucmd |= IXGBE_ADVTXD_TUCMD_IPV4;

I would say this should be flipped like so:
ip.v4->check = (skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL) ?
               csum_fold(csum_partial(trans_start,
                                      csum_start - trans_start, 0)) : 0;

> @@ -7797,12 +7803,15 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
> mss_l4len_idx = (*hdr_len - l4_offset) << IXGBE_ADVTXD_L4LEN_SHIFT;
> mss_l4len_idx |= skb_shinfo(skb)->gso_size << IXGBE_ADVTXD_MSS_SHIFT;
>
> +   fceof_saidx |= itd->sa_idx;
> +   type_tucmd |= itd->flags | itd->trailer_len;
> +
> /* vlan_macip_lens: HEADLEN, MACLEN, VLAN tag */
> vlan_macip_lens = l4.hdr - ip.hdr;
> vlan_macip_lens |= (ip.hdr - skb->data) << IXGBE_ADVTXD_MACLEN_SHIFT;
> vlan_macip_lens |= first->tx_flags & IXGBE_TX_FLAGS_VLAN_MASK;
>
> -   ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, 0, type_tucmd,
> +   ixgbe_tx_ctxtdesc(tx_ring, vlan_macip_lens, fceof_saidx, type_tucmd,
>   mss_l4len_idx);
>
> return 1;
> @@ -8493,7 +8502,8 @@ netdev_tx_t ixgbe_xmit_frame_ring(struct sk_buff *skb,
> if (skb->sp && !ixgbe_ipsec_tx(tx_ring, first, &ipsec_tx))
> goto out_drop;
>  #endif
> -   tso = ixgbe_tso(tx_ring, first, &hdr_len);
> +
> +   tso = ixgbe_tso(tx_ring, first, &hdr_len, &ipsec_tx);
> if (tso < 0)
> goto out_drop;
> else if (!tso)

No need for the extra blank line. I would say just leave 

Re: [bpf-next PATCH v2 06/18] bpf: sockmap, add bpf_msg_apply_bytes() helper

2018-03-15 Thread John Fastabend
On 03/15/2018 01:32 PM, Daniel Borkmann wrote:
> On 03/12/2018 08:23 PM, John Fastabend wrote:
>> A single sendmsg or sendfile system call can contain multiple logical
>> messages that a BPF program may want to read and apply a verdict. But,
>> without an apply_bytes helper any verdict on the data applies to all
>> bytes in the sendmsg/sendfile. Alternatively, a BPF program may only
>> care to read the first N bytes of a msg. If the payload is large say
>> MB or even GB setting up and calling the BPF program repeatedly for
>> all bytes, even though the verdict is already known, creates
>> unnecessary overhead.
>>
>> To allow BPF programs to control how many bytes a given verdict
>> applies to we implement a bpf_msg_apply_bytes() helper. When called
>> from within a BPF program this sets a counter, internal to the
>> BPF infrastructure, that applies the last verdict to the next N
>> bytes. If the N is smaller than the current data being processed
>> from a sendmsg/sendfile call, the first N bytes will be sent and
>> the BPF program will be re-run with start_data pointing to the N+1
>> byte. If N is larger than the current data being processed the
>> BPF verdict will be applied to multiple sendmsg/sendfile calls
>> until N bytes are consumed.
>>
>> Note1 if a socket closes with apply_bytes counter non-zero this
>> is not a problem because data is not being buffered for N bytes
>> and is sent as its received.
>>
>> Note2 if this is operating in the sendpage context the data
>> pointers may be zeroed after this call if the apply walks beyond
>> a msg_pull_data() call specified data range. (helper implemented
>> shortly in this series).
>>
>> Signed-off-by: John Fastabend 
>> ---
>>  include/uapi/linux/bpf.h |3 ++-
>>  net/core/filter.c|   16 
>>  2 files changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index b8275f0..e50c61f 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -769,7 +769,8 @@ enum bpf_attach_type {
>>  FN(getsockopt), \
>>  FN(override_return),\
>>  FN(sock_ops_cb_flags_set),  \
>> -FN(msg_redirect_map),
>> +FN(msg_redirect_map),   \
>> +FN(msg_apply_bytes),
>>  
>>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>   * function eBPF program intends to call
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 314c311..df2a8f4 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -1928,6 +1928,20 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff 
>> *msg)
>>  .arg4_type  = ARG_ANYTHING,
>>  };
>>  
>> +BPF_CALL_2(bpf_msg_apply_bytes, struct sk_msg_buff *, msg, u64, bytes)
>> +{
>> +msg->apply_bytes = bytes;
> 
> Here in bpf_msg_apply_bytes() but also in bpf_msg_cork_bytes() the signature
> is u64, but in struct sk_msg_buff and struct smap_psock it's type int, so
> user provided u64 will make these negative. Is there a reason to have this
> allow a negative value and not being of type u32 everywhere?
> 

Nope no reason for negative values, we can make it consistently
u32.
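
To make the type mismatch concrete, here is a small sketch with hypothetical stand-in structs (demo_msg_int and demo_msg_u32 are not the real sk_msg_buff). Converting an out-of-range u64 to int is implementation-defined in C; with GCC and Clang on the usual targets the value wraps, so a large request shows up as a negative count.

```c
#include <stdint.h>

/* Stand-in for a struct that keeps the counter as int. */
struct demo_msg_int {
	int apply_bytes;
};

/* Stand-in for the same struct with the counter as u32. */
struct demo_msg_u32 {
	uint32_t apply_bytes;
};

/* Stores a helper-style u64 argument into the int field. */
int demo_store_as_int(uint64_t bytes)
{
	struct demo_msg_int msg;

	msg.apply_bytes = (int)bytes;	/* may wrap to a negative value */
	return msg.apply_bytes;
}

/* Stores the same argument into the u32 field. */
uint32_t demo_store_as_u32(uint64_t bytes)
{
	struct demo_msg_u32 msg;

	msg.apply_bytes = (uint32_t)bytes;	/* truncates, never negative */
	return msg.apply_bytes;
}
```

Passing 0x80000000 illustrates the difference: the int variant comes back negative on such compilers, while the u32 variant keeps the value.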


Re: [bpf-next PATCH v2 05/18] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

2018-03-15 Thread Alexei Starovoitov
On Mon, Mar 12, 2018 at 12:23:29PM -0700, John Fastabend wrote:
>  
> +/* User return codes for SK_MSG prog type. */
> +enum sk_msg_action {
> + SK_MSG_DROP = 0,
> + SK_MSG_PASS,
> +};

do we really need new enum here?
It's the same as 'enum sk_action' and SK_DROP == SK_MSG_DROP
and there will be only drop/pass in both enums.
Also I don't see where these two new SK_MSG_* are used...

> +
> +/* user accessible metadata for SK_MSG packet hook, new fields must
> + * be added to the end of this structure
> + */
> +struct sk_msg_md {
> + __u32 data;
> + __u32 data_end;
> +};

I think it's time for me to ask for forgiveness :)
I used __u32 for data and data_end only because all other fields
in __sk_buff were __u32 at the time and I couldn't easily figure out
how to teach verifier to recognize 8-byte rewrites.
Unfortunately my mistake stuck and was copied over into xdp.
Since this is new struct let's do it right and add
'void *data, *data_end' here,
since bpf prog will use them as 'void *' pointers.
There are no compat issues here, since bpf is always 64-bit.

> +static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
> +{
> + return ((_rc == SK_PASS) ?
> +(md->map ? __SK_REDIRECT : __SK_PASS) :
> +__SK_DROP);

you're using old SK_PASS here too ;)
that's to my point of not adding SK_MSG_PASS...

Overall the patch set looks absolutely great.
Thank you for working on it.



Re: [bpf-next PATCH v2 06/18] bpf: sockmap, add bpf_msg_apply_bytes() helper

2018-03-15 Thread John Fastabend
On 03/15/2018 02:45 PM, Alexei Starovoitov wrote:
> On Mon, Mar 12, 2018 at 12:23:34PM -0700, John Fastabend wrote:
>> A single sendmsg or sendfile system call can contain multiple logical
>> messages that a BPF program may want to read and apply a verdict. But,
>> without an apply_bytes helper any verdict on the data applies to all
>> bytes in the sendmsg/sendfile. Alternatively, a BPF program may only
>> care to read the first N bytes of a msg. If the payload is large say
>> MB or even GB setting up and calling the BPF program repeatedly for
>> all bytes, even though the verdict is already known, creates
>> unnecessary overhead.
>>
>> To allow BPF programs to control how many bytes a given verdict
>> applies to we implement a bpf_msg_apply_bytes() helper. When called
>> from within a BPF program this sets a counter, internal to the
>> BPF infrastructure, that applies the last verdict to the next N
>> bytes. If the N is smaller than the current data being processed
>> from a sendmsg/sendfile call, the first N bytes will be sent and
>> the BPF program will be re-run with start_data pointing to the N+1
>> byte. If N is larger than the current data being processed the
>> BPF verdict will be applied to multiple sendmsg/sendfile calls
>> until N bytes are consumed.
>>
>> Note1 if a socket closes with apply_bytes counter non-zero this
>> is not a problem because data is not being buffered for N bytes
>> and is sent as its received.
>>
>> Note2 if this is operating in the sendpage context the data
>> pointers may be zeroed after this call if the apply walks beyond
>> a msg_pull_data() call specified data range. (helper implemented
>> shortly in this series).
> 
> instead of 'shortly in this series' you meant 'implemented earlier'?
> patch 5 handles it, but it's set here, right?
> 

Yep just a hold-over from an earlier patch description. I'll remove
that entire note2 and fixup a couple small things Daniel noticed
with a v3.

> The semantics of the helper looks great.
> 

Great!


Re: [bpf-next PATCH v2 06/18] bpf: sockmap, add bpf_msg_apply_bytes() helper

2018-03-15 Thread Alexei Starovoitov
On Mon, Mar 12, 2018 at 12:23:34PM -0700, John Fastabend wrote:
> A single sendmsg or sendfile system call can contain multiple logical
> messages that a BPF program may want to read and apply a verdict. But,
> without an apply_bytes helper any verdict on the data applies to all
> bytes in the sendmsg/sendfile. Alternatively, a BPF program may only
> care to read the first N bytes of a msg. If the payload is large say
> MB or even GB setting up and calling the BPF program repeatedly for
> all bytes, even though the verdict is already known, creates
> unnecessary overhead.
> 
> To allow BPF programs to control how many bytes a given verdict
> applies to we implement a bpf_msg_apply_bytes() helper. When called
> from within a BPF program this sets a counter, internal to the
> BPF infrastructure, that applies the last verdict to the next N
> bytes. If the N is smaller than the current data being processed
> from a sendmsg/sendfile call, the first N bytes will be sent and
> the BPF program will be re-run with start_data pointing to the N+1
> byte. If N is larger than the current data being processed the
> BPF verdict will be applied to multiple sendmsg/sendfile calls
> until N bytes are consumed.
> 
> Note1 if a socket closes with apply_bytes counter non-zero this
> is not a problem because data is not being buffered for N bytes
> and is sent as its received.
> 
> Note2 if this is operating in the sendpage context the data
> pointers may be zeroed after this call if the apply walks beyond
> a msg_pull_data() call specified data range. (helper implemented
> shortly in this series).

instead of 'shortly in this series' you meant 'implemented earlier'?
patch 5 handles it, but it's set here, right?

The semantics of the helper looks great.
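
As a way to check one's reading of the description, here is a toy user-space model, not the kernel implementation. It assumes the program asks for the same N on every run and only counts how often the program would be invoked for a series of send sizes. demo_count_prog_runs() and its parameters are invented for this sketch.

```c
#include <stddef.h>

/* Counts program runs for nsends send calls of the given sizes when
 * every run applies its verdict to the next n bytes. */
unsigned int demo_count_prog_runs(const size_t *sends, size_t nsends,
				  size_t n)
{
	size_t apply = 0;	/* bytes still covered by the last verdict */
	unsigned int runs = 0;

	if (n == 0)		/* avoid looping forever in the model */
		return 0;

	for (size_t i = 0; i < nsends; i++) {
		size_t left = sends[i];

		while (left > 0) {
			size_t take;

			if (apply == 0) {
				/* No verdict pending: run the program. */
				runs++;
				apply = n;
			}
			take = left < apply ? left : apply;
			left -= take;
			apply -= take;
		}
	}
	return runs;
}
```

In this model a single 10-byte send with N = 4 runs the program three times (at bytes 0, 4 and 8), while two 3-byte sends with N = 10 run it once, with the verdict carried across both calls.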



Re: [PATCH v4 1/2] kernel.h: Introduce const_max() for VLA removal

2018-03-15 Thread Linus Torvalds
On Thu, Mar 15, 2018 at 12:47 PM, Kees Cook  wrote:
>
> To gain the ability to compare differing types, the arguments are
> explicitly cast to size_t.

Ugh, I really hate this.

It silently does insane things if you do

   const_max(-1,6)

and there is nothing in the name that implies that you can't use
negative constants.
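
A short illustration of the failure mode, using a hypothetical macro that hard-casts both arguments to size_t (demo_size_t_max is a stand-in written for this example, not the macro from the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a max that compares and returns its arguments as size_t. */
#define demo_size_t_max(x, y) \
	((size_t)(x) > (size_t)(y) ? (size_t)(x) : (size_t)(y))

/* Returns 1 when the "maximum" of -1 and 6 is the wrapped -1. */
int demo_negative_wins(void)
{
	size_t m = demo_size_t_max(-1, 6);

	assert(m != 6);			/* the intuitive answer is lost */
	return m == (size_t)-1;		/* (size_t)-1 is SIZE_MAX */
}
```

Because -1 converts to the largest size_t value, it silently beats 6.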

   Linus


Re: [RFC net-next 2/6] driver: net: bonding: allow registration of tc offload callbacks in bond

2018-03-15 Thread Or Gerlitz
On Wed, Mar 14, 2018 at 5:56 PM, Jiri Pirko  wrote:
> Wed, Mar 14, 2018 at 12:23:59PM CET, gerlitz...@gmail.com wrote:
>>On Wed, Mar 14, 2018 at 11:50 AM, Jiri Pirko  wrote:
>>> Tue, Mar 13, 2018 at 04:51:02PM CET, gerlitz...@gmail.com wrote:
On Wed, Mar 7, 2018 at 12:57 PM, Jiri Pirko  wrote:
>>
This sounds nice for the case where one install ingress tc rules on
the bond (lets
call them type 1, see next)

One obstacle pointed by my colleague, Rabie, is that when the upper layer
issues stat call on the filter, they will get two replies, this can confuse 
them
and lead to wrong decisions (aging). I wonder if/how we can set a knob
>>>
>>> The bonding itself would not do anything on stats update
>>> command (TC_CLSFLOWER_STATS for example). Only the slaves would do
>>> update. So there will be only reply from slaves.
>>>
>>> Bond/team is just going to propagate block bind/unbind down. Nothing else.
>>
>>Do we agree that user space will get the replies of all lower (slave) devices,
>>or I am missing something here?
>
> "user space will get the replies" - not sure what exactly do you mean by
> this. The stats would be accumulated over all devices/drivers who
> registered block callback.

OK, this is probably something I have to check, thanks


2. bond being egress port of a rule
2.1 VF rep --> uplink 0
2.2 VF rep --> uplink 1

and we do that in the driver (add/del two HW rules, combine the stat
results, etc)
>>>
>>> That is up to the driver. If the driver can share block between 2
>>> devices, he can do that. If he cannot share, it will just report stats
>>> for every device separatelly (2 block cbs registered) and tc will see
>>> them both together. No need to do anything in driver.
>>
>>right
>>
3. ingress rule on VF rep port with shared tunnel device being the
egress (encap)
and where the routing of the underlay (tunnel) goes through LAG.
>>
>>> Same as "2."
>>
>>ok
>>
4. ingress rule shared tunnel device being the ingress and VF rep port 
being the egress (decap)

>>> I don't follow :(

>> the way tunneling is handled in tc classifier/action is

>> encap:  ingress: net port, action1: tunnel key set action2: mirred to
>> shared-tunnel device

>> decap: ingress: shared tunnel device, action1: tunnel key unset
>> action2: mirred to net port

>> type 4 are the decap rules, when we offload it to as HW ACL we stretch
>> the line and the ingress in a HW port too (e.g uplink port in NICs)

> Okay, I see. But where's the bond here? Is it the one I mentioned as
> "mirred redirect to lag"?

since the ingress port is not HW port, we will use the egdev approach
and offload the rule as the uplink of this VF rep port being the ingress.

Since we will see that this uplink is into LAG, we will offload another rule
which the 2nd uplink being the ingress

>>> I see another thing we need to sanitize: vxlan rule ingress match action
>>> mirred redirect to lag
>>right, we don't have  for NIC but for switch ASIC, I guess it is applicable
> Yes, it is. For future NICs I guess it is going to be as well.

might


Re: [PATCH RFC 3/7] net: phy: resume PHY only if needed in, mdio_bus_phy_suspend

2018-03-15 Thread Heiner Kallweit
Am 15.03.2018 um 00:50 schrieb Florian Fainelli:
> On 03/14/2018 01:16 PM, Heiner Kallweit wrote:
>> Currently the PHY is unconditionally resumed in mdio_bus_phy_suspend().
>> In cases where the PHY was sleeping before suspending or if somebody else
>> takes care of resuming later, this is not needed and wastes energy.
>>
>> Also start the state machine only if it's used by the driver (indicated
>> by the adjust_link callback being defined).
>>
>> Signed-off-by: Heiner Kallweit 
>> ---
>>  drivers/net/phy/phy_device.c | 33 +++--
>>  1 file changed, 23 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>> index a5691536f..c6fd79758 100644
>> --- a/drivers/net/phy/phy_device.c
>> +++ b/drivers/net/phy/phy_device.c
>> @@ -124,6 +124,18 @@ static bool phy_may_suspend(struct phy_device *phydev)
>>  }
>>  
>>  #ifdef CONFIG_PM
>> +
>> +static bool mdio_bus_phy_needs_start(struct phy_device *phydev)
>> +{
>> +bool start;
> 
> How about needs_start? This is uber nitpicking but it seems to be more
> in line with what is being tested for here.
> 
Agree ..

>> +
>> +mutex_lock(&phydev->lock);
>> +start = phydev->state == PHY_UP && phydev->adjust_link;
> 
> You probably need to add an || phydev->phy_link_change here because that
> is what PHYLINK uses, it does not register an adjust_link callback, but
> would likely expect similar semantics. Even better, introduce a helper
> function that tests for both to avoid possible issues...
> 

phydev->phy_link_change is set in phy_attach_direct(). Therefore it's
always set if the device is attached. And mdio_bus_phy_needs_start()
is only used after we have verified that the device is attached.
Having said that I don't see when phydev->phy_link_change could
be NULL.

When talking about phydev->phy_link_change, why does it exist at all?
I found no driver setting an own callback, replacing the default
phy_link_change(). So we could use the default directly.
Or in which use case would a driver set an own callback?

>> +mutex_unlock(&phydev->lock);
>> +
>> +return start;
>> +}
>> +
>>  static int mdio_bus_phy_suspend(struct device *dev)
>>  {
>>  struct phy_device *phydev = to_phy_device(dev);
>> @@ -142,25 +154,25 @@ static int mdio_bus_phy_suspend(struct device *dev)
>>  static int mdio_bus_phy_resume(struct device *dev)
>>  {
>>  struct phy_device *phydev = to_phy_device(dev);
>> -int ret;
>> +int ret = 0;
>>  
>> -ret = phy_resume(phydev);
>> -if (ret < 0)
>> -return ret;
>> +if (!phydev->attached_dev)
>> +return 0;
>>  
>> -if (phydev->attached_dev && phydev->adjust_link)
>> -phy_start_machine(phydev);
>> +if (mdio_bus_phy_needs_start(phydev))
>> +phy_start(phydev);
>> +else if (!phydev->adjust_link)
>> +ret = phy_resume(phydev);
> 
> Humm, under which conditions can you not have phydev->attached_dev and
> also not phydev->adjust_link being set? As mentioned earlier, you would
> likely need to test for phy_link_change too here.
> 
We come here only if phydev->attached_dev is set. If this is the case
and phydev->adjust_link is not set this indicates that the driver
doesn't use the phylib state machine.
And in this case I'd prefer to just call phy_resume().
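
The decision described here can be written down as a small stand-alone model. This is only a sketch of the branch structure in the patch: struct demo_phy, its boolean fields and the demo_* enums are invented for illustration and do not exist in phylib.

```c
#include <stdbool.h>

enum demo_phy_state { DEMO_PHY_DOWN, DEMO_PHY_UP };
enum demo_resume_action { DEMO_DO_NOTHING, DEMO_PHY_START, DEMO_PHY_RESUME };

struct demo_phy {
	bool attached;		/* models phydev->attached_dev != NULL */
	bool has_adjust_link;	/* models phydev->adjust_link != NULL */
	enum demo_phy_state state;
};

/* Models mdio_bus_phy_needs_start(): state machine in use and PHY up. */
static bool demo_needs_start(const struct demo_phy *phy)
{
	return phy->state == DEMO_PHY_UP && phy->has_adjust_link;
}

/* Models which call mdio_bus_phy_resume() would end up making. */
enum demo_resume_action demo_resume_decision(const struct demo_phy *phy)
{
	if (!phy->attached)
		return DEMO_DO_NOTHING;
	if (demo_needs_start(phy))
		return DEMO_PHY_START;
	if (!phy->has_adjust_link)
		return DEMO_PHY_RESUME;	/* driver does not use the state machine */
	return DEMO_DO_NOTHING;		/* PHY was not up before suspend */
}
```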

>>  
>> -return 0;
>> +return ret;
>>  }
>>  
>>  static int mdio_bus_phy_restore(struct device *dev)
>>  {
>>  struct phy_device *phydev = to_phy_device(dev);
>> -struct net_device *netdev = phydev->attached_dev;
>>  int ret;
>>  
>> -if (!netdev)
>> +if (!phydev->attached_dev)
>>  return 0;
> 
> That does not seem to be making any functional difference, so I would
> just drop this for now.
> 
>>  
>>  ret = phy_init_hw(phydev);
>> @@ -171,7 +183,8 @@ static int mdio_bus_phy_restore(struct device *dev)
>>  phydev->link = 0;
>>  phydev->state = PHY_UP;
>>  
>> -phy_start_machine(phydev);
>> +if (mdio_bus_phy_needs_start(phydev))
>> +phy_start(phydev);
>>  
>>  return 0;
>>  }
>>
> 
> 



Re: [PATCH RFC 0/7] net: phy: patch series aiming to improve few aspects of phylib

2018-03-15 Thread Heiner Kallweit
Am 15.03.2018 um 00:53 schrieb Florian Fainelli:
> On 03/14/2018 01:10 PM, Heiner Kallweit wrote:
>> This patch series aims to tackle few issues with phylib:
>>  
>> - address issues with patch series [1] (smsc911x + phylib changes)
>> - make phy_stop synchronous
>> - get rid of phy_start/stop_machine and handle it in phy_start/phy_stop
>> - in mdio_suspend consider runtime pm state of mdio bus parent
>> - consider more WOL conditions when deciding whether PHY is allowed to
>>   suspend
>> - only resume phy after system suspend if needed
>>
>> [1] https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
>>
>> It works fine here but other NIC drivers may use phylib differently. 
>> Therefore I'd appreciate feedback and more testing.
>>
>> I could think of some subsequent patches, e.g. phy_error() could be
>> reduced to calling phy_stop() and printing an error message
>> (today it silently sets the PHY state to PHY_HALTED).
> 
> Thanks for the patch series, I will give it a spin on a number of
> devices using different PHYLIB integration and see if something breaks.
> 
Great, and thanks for the immediate feedback.
I'll prepare a v2 based on it, also considering Geert's feedback.

>>
>> Heiner Kallweit (7):
>>   net: phy: unconditionally resume and re-enable interrupts in phy_start
>>   net: phy: improve checking for when PHY is allowed to suspend
>>   net: phy: resume PHY only if needed in mdio_bus_phy_suspend
>>   net: phy: remove phy_start_machine
>>   net: phy: make phy_stop synchronous
>>   net: phy: use new function phy_stop_suspending in mdio_bus_phy_suspend
>>   net: phy: remove phy_stop_machine
>>
>>  drivers/net/phy/phy.c| 102 
>> +--
>>  drivers/net/phy/phy_device.c |  80 -
>>  drivers/net/phy/phylink.c|   1 -
>>  include/linux/phy.h  |  14 --
>>  4 files changed, 100 insertions(+), 97 deletions(-)
>>
> 
> 



Re: [PATCH RFC 0/7] net: phy: patch series aiming to improve few aspects of phylib

2018-03-15 Thread Heiner Kallweit
Am 15.03.2018 um 11:07 schrieb Geert Uytterhoeven:
> Hi Heiner,
> 
> On Wed, Mar 14, 2018 at 9:10 PM, Heiner Kallweit  wrote:
>> This patch series aims to tackle few issues with phylib:
>>
>> - address issues with patch series [1] (smsc911x + phylib changes)
>> - make phy_stop synchronous
>> - get rid of phy_start/stop_machine and handle it in phy_start/phy_stop
>> - in mdio_suspend consider runtime pm state of mdio bus parent
>> - consider more WOL conditions when deciding whether PHY is allowed to
>>   suspend
>> - only resume phy after system suspend if needed
>>
>> [1] https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
>>
>> It works fine here but other NIC drivers may use phylib differently.
>> Therefore I'd appreciate feedback and more testing.
>>
>> I could think of some subsequent patches, e.g. phy_error() could be
>> reduced to calling phy_stop() and printing an error message
>> (today it silently sets the PHY state to PHY_HALTED).
>>
>> Heiner Kallweit (7):
>>   net: phy: unconditionally resume and re-enable interrupts in phy_start
>>   net: phy: improve checking for when PHY is allowed to suspend
>>   net: phy: resume PHY only if needed in mdio_bus_phy_suspend
>>   net: phy: remove phy_start_machine
>>   net: phy: make phy_stop synchronous
>>   net: phy: use new function phy_stop_suspending in mdio_bus_phy_suspend
>>   net: phy: remove phy_stop_machine
> 
> Thanks for your series!
> 
> I gave this a try on a few machines, incl. r8a73a4/ape6evm and
> sh73a0/kzm9g, which have smsc911x Ethernet chips on a power-managed bus.
> 
> On both machines it crashes during system suspend, which means the smsc911x's
> registers are accessed while the device is suspended:
> 
> PM: suspend entry (deep)
> PM: Syncing filesystems ... done.
> Freezing user space processes ... (elapsed 0.001 seconds) done.
> OOM killer disabled.
> Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> PM: suspend devices took 0.130 seconds
> Disabling non-boot CPUs ...
> Unhandled fault: imprecise external abort (0x1406) at 0x000ce408
> pgd = f4465d7b
> [000ce408] *pgd=
> Internal error: : 1406 [#1] SMP ARM
> Modules linked in:
> CPU: 1 PID: 20 Comm: kworker/1:1 Not tainted
> 4.16.0-rc5-kzm9g-00470-g319cfb3643965f46-dirty #1030
> Hardware name: Generic SH73A0 (Flattened Device Tree)
> Workqueue: events linkwatch_event
> PC is at __smsc911x_reg_read+0x1c/0x60
> LR is at smsc911x_tx_get_txstatus+0x2c/0x7c
> pc : []lr : []psr: 20010093
> sp : df51bd38  ip : df51bce0  fp : 
> r10:   r9 :   r8 : c0909b58
> r7 : a0010013  r6 : df636e08  r5 : df636dc0  r4 : df636800
> r3 : e0903000  r2 : 0001  r1 : e0903080  r0 : 
> Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
> Control: 10c5387d  Table: 5ee4004a  DAC: 0051
> Process kworker/1:1 (pid: 20, stack limit = 0x1e2af6bb)
> Stack: (0xdf51bd38 to 0xdf51c000)
> bd20:   c03efb14 df636800
> bd40: df636dc0 c063c198 df51bdb0 c03efa80 c03efb14 df636800 df636800 c03efb20
> bd60: c03efb14 dec5e8f4 df636800 c063c198 df51bdb0 c04b4494 dec5e8f0 dec4ea80
> bd80: df636800 c04d7c28 dec4ea80 df636800 dec5e800 c04d3d68 002a 
> bda0: c04d3990 c020af0c df400a80     
> bdc0:   0050  df51be03 c04a5828 0580 c04a5758
> bde0: dec4ea80 04db 014000c0 c0908448 0001 c04a58a0 df51be03 c04d14e0
> be00:  3cef0b86 c04d13bc dec4ea80 df636800 0010  
> be20: df636800  c0931b44 c04d73c0    
> be40:     014000c0 df636800 c0931ad8 df51bed4
> be60: c0931ad8 c04d7468 014000c0   c014404c c0908448 c0908448
> be80: df636800 c04d7534 014000c0   014000c0 c0908448 c04b9d8c
> bea0: df636800   3cef0b86 c0931b44 df636800 c0931b44 c04d8854
> bec0: df636aac c04d8b10 df51bf2c c0908448  df51bed4 df51bed4 3cef0b86
> bee0: df51bf2c df50dc80 c0931ad8 dfbdaac0 df51bf2c dfbddd00  0001
> bf00:  c04d8b98 c04d8b74 c013cc8c 0001  c013cc14 c013d214
> bf20: c0908448  0004 c0931ad8   c075f7d9 3cef0b86
> bf40: c0905900 df50dc80 dfbdaac0 dfbdaac0 df51a000 dfbdaaf4 c0905900 df50dc98
> bf60: 0008 c013d4b0 df518540 df50de80 df5110c0  df491eb0 df50dc80
> bf80: c013d1e4 df50deb8  c014293c df5110c0 c014281c  
> bfa0:    c01010b4    
> bfc0:        
> bfe0:     0013  7fdf fff7fdff
> [] (__smsc911x_reg_read) from []
> (smsc911x_tx_get_txstatus+0x2c/0x7c)
> [] (smsc911x_tx_get_txstatus) from []
> (smsc911x_tx_update_txcounters+0x14/0xa8)
> [] (smsc911x_tx_update_txcounters) from []
> 
