Re: [PATCH] IB/mlx4: avoid a -Wmaybe-uninitialized warning

2016-10-25 Thread Yishai Hadas

On 10/25/2016 7:16 PM, Arnd Bergmann wrote:

There is an old warning about mlx4_SW2HW_EQ_wrapper on x86:

ethernet/mellanox/mlx4/resource_tracker.c: In function ‘mlx4_SW2HW_EQ_wrapper’:
ethernet/mellanox/mlx4/resource_tracker.c:3071:10: error: ‘eq’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized]

The problem here is that gcc won't track the state of the variable
across a spin_unlock. Moving the assignment out of the lock is
safe here and avoids the warning.

Signed-off-by: Arnd Bergmann 


Reviewed-by: Yishai Hadas 


---
 drivers/net/ethernet/mellanox/mlx4/resource_tracker.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c 
b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
index 84d7857ccc27..c548beaaf910 100644
--- a/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
+++ b/drivers/net/ethernet/mellanox/mlx4/resource_tracker.c
@@ -1605,13 +1605,14 @@ static int eq_res_start_move_to(struct mlx4_dev *dev, 
int slave, int index,
r->com.from_state = r->com.state;
r->com.to_state = state;
r->com.state = RES_EQ_BUSY;
-   if (eq)
-   *eq = r;
}
}

spin_unlock_irq(mlx4_tlock(dev));

+   if (!err && eq)
+   *eq = r;
+
return err;
 }






Re: [PATCH 19/28] brcmfmac: avoid maybe-uninitialized warning in brcmf_cfg80211_start_ap

2016-10-25 Thread Kalle Valo
Arnd Bergmann  writes:

> A bugfix added a sanity check around the assignment and use of the
> 'is_11d' variable, which looks correct to me, but as the function is
> rather complex already, this confuses the compiler to the point where
> it can no longer figure out if the variable is always initialized
> correctly:
>
> brcm80211/brcmfmac/cfg80211.c: In function ‘brcmf_cfg80211_start_ap’:
> brcm80211/brcmfmac/cfg80211.c:4586:10: error: ‘is_11d’ may be used 
> uninitialized in this function [-Werror=maybe-uninitialized]
>
> This adds an initialization for the newly introduced case in which
> the variable should not really be used, in order to make the warning
> go away.
>
> Fixes: b3589dfe0212 ("brcmfmac: ignore 11d configuration errors")
> Cc: Hante Meuleman 
> Cc: Arend van Spriel 
> Cc: Kalle Valo 
> Signed-off-by: Arnd Bergmann 

Via which tree are you planning to submit this? Should I take it?

-- 
Kalle Valo


[PATCH net-next] net: core: Traverse the adjacency list from first entry

2016-10-25 Thread idosch
From: Ido Schimmel 

netdev_next_lower_dev() returns NULL when we have finished traversing the
adjacency list ('iter' points to the list's head). Therefore, we must
start traversing the list from the first entry and not from its head.
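
For reference, a simplified sketch of the iterator in question (it may
not match net/core/dev.c exactly), which shows why a walk that starts
at the list head terminates immediately:

static struct net_device *netdev_next_lower_dev(struct net_device *dev,
                                                struct list_head **iter)
{
        struct netdev_adjacent *lower;

        lower = list_entry(*iter, struct netdev_adjacent, list);

        /* 'iter' pointing at the list head means the walk is done; if
         * the walk also *starts* at the head, this returns NULL right
         * away and no lower device is ever visited.
         */
        if (&lower->list == &dev->adj_list.lower)
                return NULL;

        *iter = lower->list.next;

        return lower->dev;
}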

Fixes: 1a3f060c1a47 ("net: Introduce new api for walking upper and lower 
devices")
Signed-off-by: Ido Schimmel 
---
 net/core/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index f55fb45..d9c937f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5419,7 +5419,7 @@ int netdev_walk_all_lower_dev(struct net_device *dev,
struct list_head *iter;
int ret;
 
-   for (iter = &dev->adj_list.lower,
+   for (iter = dev->adj_list.lower.next,
 ldev = netdev_next_lower_dev(dev, &iter);
 ldev;
 ldev = netdev_next_lower_dev(dev, &iter)) {
-- 
2.7.4



Re: [PATCH (net.git)] net: phy: at803x: disable by default the hibernation feature

2016-10-25 Thread Giuseppe CAVALLARO

Hello Andrew.

On 10/25/2016 11:00 AM, Andrew Lunn wrote:

For example, while booting a kernel the SYNP MAC (stmmac) fails
to initialize its own DMA engine if the PHY entered hibernation
before.


Have you tried fixing stmmac instead?


Let me describe better what happens. To be honest, this is a
marginal use case, but maybe it makes sense to share this patch in
case somebody meets the same issue.

When performing "ifconfig eth0 up", if this phy is not in hibernation,
the iface comes up w/o any issues.
If the PHY is in hibernation, because the cable is unplugged (and
this is a default for these transceivers), the phy clock does down and
the MAC cannot init own DMA. The stmmac is designed to fail the open in
this case.
If I plug the cable the next ifconfig up is ok.

The intent of the patch I proposed is to disable this hibernation
feature at the PHY level by default: for me it should be an option,
not a default. For example, I have used other HW where some
power-state features could be enabled but were turned off by default.
Also, these transceivers support EEE, so I guess there is all the
technology needed to manage power consumption on new setups.

Concerning the stmmac, how could the driver fix this situation?
The PHY does not provide the clock required for the GMAC, so the stmmac
cannot reset its own DMA. I had thought of delaying the reset until the
link is up, but I don't like that approach: open() would return as if
the device were in a sane state when it is not, and we would have to
wait for the ACK from the PHY before resetting the MAC DMA.

Anyway, as said, the patch covers a marginal use case, so feel free to
consider it or not. For sure, I am open to changing something at the
MAC level if you have a better idea.

Regards
Peppe



 Andrew





Re: [PATCH net] sctp: validate chunk len before actually using it

2016-10-25 Thread Xin Long
On Wed, Oct 26, 2016 at 12:27 AM, Marcelo Ricardo Leitner
 wrote:
> Andrey Konovalov reported that KASAN detected that SCTP was using a slab
> beyond the boundaries. It was caused because when handling out of the
> blue packets in function sctp_sf_ootb() it was checking the chunk len
> only after already processing the first chunk, validating only for the
> 2nd and subsequent ones.
>
> The fix is to just move the check upwards so it's also validated for the
> 1st chunk.
>
> Reported-by: Andrey Konovalov 
> Tested-by: Andrey Konovalov 
> Signed-off-by: Marcelo Ricardo Leitner 

Reviewed-by: Xin Long 


Re: [PATCH net] packet: on direct_xmit, limit tso and csum to supported devices

2016-10-25 Thread Willem de Bruijn
On Tue, Oct 25, 2016 at 8:57 PM, Eric Dumazet  wrote:
> On Tue, 2016-10-25 at 20:28 -0400, Willem de Bruijn wrote:
>> From: Willem de Bruijn 
>>
>> When transmitting on a packet socket with PACKET_VNET_HDR and
>> PACKET_QDISC_BYPASS, validate device support for features requested
>> in vnet_hdr.
>
>
> You probably need to add an EXPORT_SYMBOL(validate_xmit_skb_list)
> because af_packet might be modular.

Thanks, Eric. I'll send a v2.


Re: [PATCH net-next 2/3] bpf: Add new cgroups prog type to enable sock modifications

2016-10-25 Thread David Ahern
On 10/25/16 8:48 PM, Eric Dumazet wrote:
> Maybe I do not understand how you plan to use this.
> 
> Let say you want a filter to force a BIND_TO_DEVICE xxx because a task
> runs in a cgroup yyy
> 
> Then a program doing a socket() + connect (127.0.0.1)  will fail ?

Maybe. VRF devices can have the 127.0.0.1 address, in which case the connect 
would succeed. ntpq uses 127.0.0.1 to talk to ntpd, for example. If ntpd is bound 
to a management VRF, then you need this context for ntpq to talk to it.

> 
> I do not see how a BPF filter at socket() time can be selective.

Here's my use case - and this is what we are doing today with the l3mdev cgroup 
(a patch which has not been accepted upstream):

1. create VRF device

2. create cgroup and configure it

   in this case it means load the bpf program that sets the sk_bound_dev_if to 
the vrf device that was just created

3. Add shell to cgroup

   For Management VRF this can be done automatically at login so a user does 
not need to take any action.

At this point any command run in the shell runs in the VRF context (PS1 for 
bash can show the VRF, to keep you from going crazy wondering why a connect 
fails :-)), so any ipv4/ipv6 sockets have that VRF scope.

For example, if the VRF is a management VRF, sockets opened by apt-get are 
automatically bound to the VRF at create time, so requests go out the eth0 
(management) interface.

In a similar fashion, using a cgroup and assigning tasks to it allows automated 
launch of systemd services with instances running in a VRF context - one 
dhcrelay in vrf red, one in vrf blue with both using a parameterized instance 
file.
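
For illustration, a minimal sketch of the step-2 program, written in the
restricted C used for BPF programs (samples/bpf style). The section name,
helper header and struct bpf_sock layout follow this patch set and are
assumptions, and the ifindex is just a made-up example:

/* bind_to_vrf_kern.c - hypothetical sample, assumes this series */
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

#define VRF_RED_IFINDEX 10      /* assumption: ifindex of the VRF device */

SEC("cgroup/sock")
int bind_to_vrf(struct bpf_sock *sk)
{
        /* every AF_INET/AF_INET6 socket opened by a task in this cgroup
         * gets bound to the VRF before it is ever used
         */
        sk->bound_dev_if = VRF_RED_IFINDEX;
        return 1;       /* allow the socket */
}

char _license[] SEC("license") = "GPL";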




Re: [PATCH net-next 2/3] bpf: Add new cgroups prog type to enable sock modifications

2016-10-25 Thread Eric Dumazet
On Tue, 2016-10-25 at 20:21 -0600, David Ahern wrote:
> On 10/25/16 5:39 PM, Eric Dumazet wrote:
> > On Tue, 2016-10-25 at 15:30 -0700, David Ahern wrote:
> >> Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
> >> BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
> >> any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
> >> Currently only sk_bound_dev_if is exported to userspace for modification
> >> by a bpf program.
> >>
> >> This allows a cgroup to be configured such that AF_INET{6} sockets opened
> >> by processes are automatically bound to a specific device. In turn, this
> >> enables the running of programs that do not support SO_BINDTODEVICE in a
> >> specific VRF context / L3 domain.
> > 
> > Does this mean that these programs no longer can use loopback ?
> 
> I am probably misunderstanding your question, so I'll ramble a bit and
> see if I cover it.
> 
> This patch set generically allows sk_bound_dev_if to be set to any
> value. It does not check that an index corresponds to a device at that
> moment (either bpf prog install or execution of the filter), and even
> if it did the device can be deleted at any moment. That seems to be
> standard operating procedure with bpf filters (user mistakes mean
> packets go no where and in this case a socket is bound to a
> non-existent device).
> 
> The index can be any interface (e.g., eth0) or an L3 device (e.g., a
> VRF). Loopback and index=1 is allowed.
> 
> The VRF device is the loopback device for the domain, so binding to it
> covers addresses on the VRF device as well as interfaces enslaved to
> it.
> 
> Did you mean something else?

Maybe I do not understand how you plan to use this.

Let say you want a filter to force a BIND_TO_DEVICE xxx because a task
runs in a cgroup yyy

Then a program doing a socket() + connect (127.0.0.1)  will fail ?

I do not see how a BPF filter at socket() time can be selective.





[PATCH v2 3/5] kconfig: regenerate *.c_shipped files after previous changes

2016-10-25 Thread Nicolas Pitre
Signed-off-by: Nicolas Pitre 
---
 scripts/kconfig/zconf.hash.c_shipped |  228 ++---
 scripts/kconfig/zconf.tab.c_shipped  | 1631 --
 2 files changed, 888 insertions(+), 971 deletions(-)

diff --git a/scripts/kconfig/zconf.hash.c_shipped 
b/scripts/kconfig/zconf.hash.c_shipped
index 360a62df2b..bf7f1378b3 100644
--- a/scripts/kconfig/zconf.hash.c_shipped
+++ b/scripts/kconfig/zconf.hash.c_shipped
@@ -32,7 +32,7 @@
 struct kconf_id;
 
 static const struct kconf_id *kconf_id_lookup(register const char *str, 
register unsigned int len);
-/* maximum key range = 71, duplicates = 0 */
+/* maximum key range = 72, duplicates = 0 */
 
 #ifdef __GNUC__
 __inline
@@ -46,32 +46,32 @@ kconf_id_hash (register const char *str, register unsigned 
int len)
 {
   static const unsigned char asso_values[] =
 {
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73,  0, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73,  5, 25, 25,
-   0,  0,  0,  5,  0,  0, 73, 73,  5,  0,
-  10,  5, 45, 73, 20, 20,  0, 15, 15, 73,
-  20,  5, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73, 73, 73, 73, 73,
-  73, 73, 73, 73, 73, 73
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74,  0, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74,  0, 20, 10,
+   0,  0,  0, 30,  0,  0, 74, 74,  5, 15,
+   0, 25, 40, 74, 15,  0,  0, 10, 35, 74,
+  10,  0, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74, 74, 74, 74, 74,
+  74, 74, 74, 74, 74, 74
 };
   register int hval = len;
 
@@ -97,33 +97,35 @@ struct kconf_id_strings_t
 char kconf_id_strings_str8[sizeof("tristate")];
 char kconf_id_strings_str9[sizeof("endchoice")];
 char kconf_id_strings_str10[sizeof("---help---")];
+char kconf_id_strings_str11[sizeof("select")];
 char kconf_id_strings_str12[sizeof("def_tristate")];
 char kconf_id_strings_str13[sizeof("def_bool")];
 char kconf_id_strings_str14[sizeof("defconfig_list")];
-char kconf_id_strings_str17[sizeof("on")];
-char kconf_id_strings_str18[sizeof("optional")];
-char kconf_id_strings_str21[sizeof("option")];
-char kconf_id_strings_str22[sizeof("endmenu")];
-char kconf_id_strings_str23[sizeof("mainmenu")];
-char kconf_id_strings_str25[sizeof("menuconfig")];
-char kconf_id_strings_str27[sizeof("modules")];
-char kconf_id_strings_str28[sizeof("allnoconfig_y")];
+char kconf_id_strings_str16[sizeof("source")];
+char kconf_id_strings_str17[sizeof("endmenu")];
+char kconf_id_strings_str18[sizeof("allnoconfig_y")];
+char kconf_id_strings_str20[sizeof("range")];
+char kconf_id_strings_str22[sizeof("modules")];
+char kconf_id_strings_str23[sizeof("hex")];
+char kconf_id_strings_str27[sizeof("on")];
 char kconf_id_strings_str29[sizeof("menu")];
-char kconf_id_strings_str31[sizeof("select")];
+char kconf_id_strings_str31[sizeof("option")];
 char kconf_id_strings_str32[sizeof("comment")];
-char kconf_id_strings_str33[sizeof("env")];
-char kconf_id_strings_str35[sizeof("range")];
-char kconf_id_strings_str36[sizeof("choice")];
-char kconf_id_strings_str39[sizeof("bool")];
-char kconf_id_strings_str41[sizeof("source")];
+char kconf_id_string

[PATCH v2 5/5] posix-timers: make it configurable

2016-10-25 Thread Nicolas Pitre
Some embedded systems have no use for them.  This removes about
22KB from the kernel binary size when configured out.

Corresponding syscalls are routed to a stub logging the attempt to
use those syscalls which should be enough of a clue if they were
disabled without proper consideration. They are: timer_create,
timer_gettime, timer_getoverrun, timer_settime, timer_delete and
clock_adjtime.

The clock_settime, clock_gettime, clock_getres and clock_nanosleep
syscalls are replaced by simple wrappers compatible with CLOCK_REALTIME,
CLOCK_MONOTONIC and CLOCK_BOOTTIME only which should cover the vast
majority of use cases with very little code.
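
As a rough illustration of the stub idea (not the exact code in
kernel/time/posix-stubs.c; names and messages here are only a sketch):

static int warn_no_posix_timers(void)
{
        pr_err_once("process %d (%s) attempted a POSIX timer syscall "
                    "while CONFIG_POSIX_TIMERS is not set\n",
                    task_pid_nr(current), current->comm);
        return -ENOSYS;
}

SYSCALL_DEFINE2(timer_gettime, timer_t, timer_id,
                struct itimerspec __user *, setting)
{
        return warn_no_posix_timers();
}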

Signed-off-by: Nicolas Pitre 
Reviewed-by: Josh Triplett 
---
 drivers/ptp/Kconfig  |   2 +-
 include/linux/posix-timers.h |  28 +-
 include/linux/sched.h|  10 
 init/Kconfig |  17 +++
 kernel/signal.c  |   4 ++
 kernel/time/Makefile |  10 +++-
 kernel/time/posix-stubs.c| 118 +++
 7 files changed, 184 insertions(+), 5 deletions(-)
 create mode 100644 kernel/time/posix-stubs.c

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 0f7492f8ea..bdce332911 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -6,7 +6,7 @@ menu "PTP clock support"
 
 config PTP_1588_CLOCK
tristate "PTP clock support"
-   depends on NET
+   depends on NET && POSIX_TIMERS
select PPS
select NET_PTP_CLASSIFY
help
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index 62d44c1760..2288c5c557 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -118,6 +118,8 @@ struct k_clock {
 extern struct k_clock clock_posix_cpu;
 extern struct k_clock clock_posix_dynamic;
 
+#ifdef CONFIG_POSIX_TIMERS
+
 void posix_timers_register_clock(const clockid_t clock_id, struct k_clock 
*new_clock);
 
 /* function to call to trigger timer event */
@@ -131,8 +133,30 @@ void posix_cpu_timers_exit_group(struct task_struct *task);
 void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,
   cputime_t *newval, cputime_t *oldval);
 
-long clock_nanosleep_restart(struct restart_block *restart_block);
-
 void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
 
+#else
+
+#include 
+
+static inline void posix_timers_register_clock(const clockid_t clock_id,
+  struct k_clock *new_clock) {}
+static inline int posix_timer_event(struct k_itimer *timr, int si_private)
+{ return 0; }
+static inline void run_posix_cpu_timers(struct task_struct *task) {}
+static inline void posix_cpu_timers_exit(struct task_struct *task)
+{
+   add_device_randomness((const void*) &task->se.sum_exec_runtime,
+ sizeof(unsigned long long));
+}
+static inline void posix_cpu_timers_exit_group(struct task_struct *task) {}
+static inline void set_process_cpu_timer(struct task_struct *task,
+   unsigned int clock_idx, cputime_t *newval, cputime_t *oldval) {}
+static inline void update_rlimit_cpu(struct task_struct *task,
+unsigned long rlim_new) {}
+
+#endif
+
+long clock_nanosleep_restart(struct restart_block *restart_block);
+
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b0ec..ad716d5559 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2946,8 +2946,13 @@ static inline void exit_thread(struct task_struct *tsk)
 extern void exit_files(struct task_struct *);
 extern void __cleanup_sighand(struct sighand_struct *);
 
+#ifdef CONFIG_POSIX_TIMERS
 extern void exit_itimers(struct signal_struct *);
 extern void flush_itimer_signals(void);
+#else
+static inline void exit_itimers(struct signal_struct *s) {}
+static inline void flush_itimer_signals(void) {}
+#endif
 
 extern void do_group_exit(int);
 
@@ -3450,7 +3455,12 @@ static __always_inline bool need_resched(void)
  * Thread group CPU time accounting.
  */
 void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times);
+#ifdef CONFIG_POSIX_TIMERS
 void thread_group_cputimer(struct task_struct *tsk, struct task_cputime 
*times);
+#else
+static inline void thread_group_cputimer(struct task_struct *tsk,
+struct task_cputime *times) {}
+#endif
 
 /*
  * Reevaluate whether the task has signals pending delivery.
diff --git a/init/Kconfig b/init/Kconfig
index 34407f15e6..351d422252 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1445,6 +1445,23 @@ config SYSCTL_SYSCALL
 
  If unsure say N here.
 
+config POSIX_TIMERS
+   bool "Posix Clocks & timers" if EXPERT
+   default y
+   help
+ This includes native support for POSIX timers to the kernel.
+ Most embedded systems may have no use for them and therefore they
+ can be configured out to reduce the size of the kernel image.
+

[PATCH v2 1/5] kconfig: introduce the "imply" keyword

2016-10-25 Thread Nicolas Pitre
The "imply" keyword is a weak version of "select" where the target
config symbol can still be turned off, avoiding those pitfalls that come
with the "select" keyword.

This is useful e.g. with multiple drivers that want to indicate their
ability to hook into a given subsystem while still being able to
configure that subsystem out and keep those drivers selected.

Currently, the same effect can almost be achieved with:

config DRIVER_A
tristate

config DRIVER_B
tristate

config DRIVER_C
tristate

config DRIVER_D
tristate

[...]

config SUBSYSTEM_X
tristate
default DRIVER_A || DRIVER_B || DRIVER_C || DRIVER_D || [...]

This is unwieldy to maintain, especially with a large number of drivers.
Furthermore, there is no easy way to restrict the choice for SUBSYSTEM_X
to y or n, excluding m, when some drivers are built-in. The "select"
keyword allows for excluding m, but it excludes n as well. Hence
this "imply" keyword.  The above becomes:

config DRIVER_A
tristate
imply SUBSYSTEM_X

config DRIVER_B
tristate
imply SUBSYSTEM_X

[...]

config SUBSYSTEM_X
tristate

This is much cleaner, and way more flexible than "select". SUBSYSTEM_X
can still be configured out, and it can be set as a module when none of
the drivers are selected or all of them are also modular.
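
On the C side, code that calls into such an "implied" subsystem typically
guards the calls so the caller still links when SUBSYSTEM_X ends up =m
while the driver is =y, or when it is =n entirely. A generic sketch
(CONFIG_SUBSYSTEM_X and subsystem_x_register() are placeholders; patch
4/5 uses this pattern for PTP):

        if (IS_REACHABLE(CONFIG_SUBSYSTEM_X))
                subsystem_x_register(priv);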

Signed-off-by: Nicolas Pitre 
Reviewed-by: Josh Triplett 
---
 Documentation/kbuild/kconfig-language.txt | 28 
 scripts/kconfig/expr.h|  2 ++
 scripts/kconfig/menu.c| 55 ++-
 scripts/kconfig/symbol.c  | 24 +-
 scripts/kconfig/zconf.gperf   |  1 +
 scripts/kconfig/zconf.y   | 16 +++--
 6 files changed, 107 insertions(+), 19 deletions(-)

diff --git a/Documentation/kbuild/kconfig-language.txt 
b/Documentation/kbuild/kconfig-language.txt
index 069fcb3eef..5ee0dd3c85 100644
--- a/Documentation/kbuild/kconfig-language.txt
+++ b/Documentation/kbuild/kconfig-language.txt
@@ -113,6 +113,33 @@ applicable everywhere (see syntax).
That will limit the usefulness but on the other hand avoid
the illegal configurations all over.
 
+- weak reverse dependencies: "imply"  ["if" ]
+  This is similar to "select" as it enforces a lower limit on another
+  symbol except that the "implied" config symbol's value may still be
+  set to n from a direct dependency or with a visible prompt.
+  Given the following example:
+
+  config FOO
+   tristate
+   imply BAZ
+
+  config BAZ
+   tristate
+   depends on BAR
+
+  The following values are possible:
+
+   FOO   BAR   BAZ's default   choice for BAZ
+   ---   ---   -------------   --------------
+   n     y     n               N/m/y
+   m     y     m               M/y/n
+   y     y     y               Y/n
+   y     n     *               N
+
+  This is useful e.g. with multiple drivers that want to indicate their
+  ability to hook into a given subsystem while still being able to
+  configure that subsystem out and keep those drivers selected.
+
 - limiting menu display: "visible if" 
   This attribute is only applicable to menu blocks, if the condition is
   false, the menu block is not displayed to the user (the symbols
@@ -481,6 +508,7 @@ historical issues resolved through these different 
solutions.
   b) Match dependency semantics:
b1) Swap all "select FOO" to "depends on FOO" or,
b2) Swap all "depends on FOO" to "select FOO"
+  c) Consider the use of "imply" instead of "select"
 
 The resolution to a) can be tested with the sample Kconfig file
 Documentation/kbuild/Kconfig.recursion-issue-01 through the removal
diff --git a/scripts/kconfig/expr.h b/scripts/kconfig/expr.h
index 973b6f7333..a73f762c48 100644
--- a/scripts/kconfig/expr.h
+++ b/scripts/kconfig/expr.h
@@ -85,6 +85,7 @@ struct symbol {
struct property *prop;
struct expr_value dir_dep;
struct expr_value rev_dep;
+   struct expr_value implied;
 };
 
 #define for_all_symbols(i, sym) for (i = 0; i < SYMBOL_HASHSIZE; i++) for (sym 
= symbol_hash[i]; sym; sym = sym->next) if (sym->type != S_OTHER)
@@ -136,6 +137,7 @@ enum prop_type {
P_DEFAULT,  /* default y */
P_CHOICE,   /* choice value */
P_SELECT,   /* select BAR */
+   P_IMPLY,/* imply BAR */
P_RANGE,/* range 7..100 (for a symbol) */
P_ENV,  /* value from environment variable */
P_SYMBOL,   /* where a symbol is defined */
diff --git a/scripts/kconfig/menu.c b/scripts/kconfig/menu.c
index aed678e8a7..e9357931b4 100644
--- a/scripts/kconfig/menu.c
+++ b/scripts/kconfig/menu.c
@@ -233,6 +233,8 @@ static void sym_check_prop(struct symbol *sym)
 {
struct property *prop;
struct symbol *sym2;
+   char 

Re: [PATCH net-next 2/3] bpf: Add new cgroups prog type to enable sock modifications

2016-10-25 Thread David Ahern
On 10/25/16 7:55 PM, Alexei Starovoitov wrote:
> Same question as Daniel... why extra helper?

It can be dropped. Wrong path while learning this code.

> If program overwrites bpf_sock->sk_bound_dev_if can we use that
> after program returns?
> Also do you think it's possible to extend this patch to prototype
> the port bind restrictions that were proposed few month back using
> the same bpf_sock input structure?
> Probably the check would need to be moved into different
> place instead of sk_alloc(), but then we'll have more
> opportunities to overwrite bound_dev_if, look at ports and so on ?
> 

I think the sk_bound_dev_if should be set when the socket is created versus 
waiting until it is used (bind, connect, sendmsg, recvmsg). That said, the 
filter could (should?) be run in the protocol family create function 
(inet_create and inet6_create) versus sk_alloc. That would allow the filter to 
allocate a local port based on its logic. I'd prefer interested parties to look 
into the details of that use case.

I'll move the running of the filter to the end of the create functions for v2.


[PATCH v2 4/5] ptp_clock: allow for it to be optional

2016-10-25 Thread Nicolas Pitre
In order to break the hard dependency between the PTP clock subsystem and
ethernet drivers capable of being clock providers, this patch provides
simple PTP stub functions to allow linkage of those drivers into the
kernel even when the PTP subsystem is configured out. Drivers must be
ready to accept NULL from ptp_clock_register() in that case.

And to make it possible for PTP to be configured out, the select statement
in those drivers' Kconfig menu entries is converted to the new "imply"
statement. This way the PTP subsystem may have Kconfig dependencies of
its own, such as POSIX_TIMERS, without having to make those ethernet
drivers unavailable if POSIX timers are configured out. And when support
for POSIX timers is selected again, the default config option for PTP
clock support will automatically be adjusted accordingly.

The pch_gbe driver is a bit special as it relies on extra code in
drivers/ptp/ptp_pch.c. Therefore we let the make process descend into
drivers/ptp/ even if PTP_1588_CLOCK is unselected.
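
A sketch of what "ready to accept NULL" means for a driver (hypothetical
driver code, not taken from any of the files touched below):

        priv->ptp_clock = ptp_clock_register(&priv->ptp_info, &pdev->dev);
        if (IS_ERR(priv->ptp_clock)) {
                dev_err(&pdev->dev, "ptp_clock_register failed\n");
                priv->ptp_clock = NULL;
        } else if (!priv->ptp_clock) {
                /* PTP_1588_CLOCK is configured out: the stub returned
                 * NULL, so carry on without hardware timestamping
                 * instead of failing the probe.
                 */
                dev_info(&pdev->dev, "PTP clock support not available\n");
        }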

Signed-off-by: Nicolas Pitre 
Reviewed-by: Josh Triplett 
---
 drivers/Makefile|  2 +-
 drivers/net/ethernet/adi/Kconfig|  2 +-
 drivers/net/ethernet/amd/Kconfig|  2 +-
 drivers/net/ethernet/amd/xgbe/xgbe-main.c   |  6 ++-
 drivers/net/ethernet/broadcom/Kconfig   |  4 +-
 drivers/net/ethernet/cavium/Kconfig |  2 +-
 drivers/net/ethernet/freescale/Kconfig  |  2 +-
 drivers/net/ethernet/intel/Kconfig  | 10 ++--
 drivers/net/ethernet/mellanox/mlx4/Kconfig  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig |  2 +-
 drivers/net/ethernet/renesas/Kconfig|  2 +-
 drivers/net/ethernet/samsung/Kconfig|  2 +-
 drivers/net/ethernet/sfc/Kconfig|  2 +-
 drivers/net/ethernet/stmicro/stmmac/Kconfig |  2 +-
 drivers/net/ethernet/ti/Kconfig |  2 +-
 drivers/net/ethernet/tile/Kconfig   |  2 +-
 drivers/ptp/Kconfig |  8 +--
 include/linux/ptp_clock_kernel.h| 65 -
 18 files changed, 69 insertions(+), 50 deletions(-)

diff --git a/drivers/Makefile b/drivers/Makefile
index f0afdfb3c7..8cfa1ff8f6 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -107,7 +107,7 @@ obj-$(CONFIG_INPUT) += input/
 obj-$(CONFIG_RTC_LIB)  += rtc/
 obj-y  += i2c/ media/
 obj-$(CONFIG_PPS)  += pps/
-obj-$(CONFIG_PTP_1588_CLOCK)   += ptp/
+obj-y  += ptp/
 obj-$(CONFIG_W1)   += w1/
 obj-y  += power/
 obj-$(CONFIG_HWMON)+= hwmon/
diff --git a/drivers/net/ethernet/adi/Kconfig b/drivers/net/ethernet/adi/Kconfig
index 6b94ba6103..98cc8f5350 100644
--- a/drivers/net/ethernet/adi/Kconfig
+++ b/drivers/net/ethernet/adi/Kconfig
@@ -58,7 +58,7 @@ config BFIN_RX_DESC_NUM
 config BFIN_MAC_USE_HWSTAMP
bool "Use IEEE 1588 hwstamp"
depends on BFIN_MAC && BF518
-   select PTP_1588_CLOCK
+   imply PTP_1588_CLOCK
default y
---help---
  To support the IEEE 1588 Precision Time Protocol (PTP), select y here
diff --git a/drivers/net/ethernet/amd/Kconfig b/drivers/net/ethernet/amd/Kconfig
index 0038709fd3..713ea7ad22 100644
--- a/drivers/net/ethernet/amd/Kconfig
+++ b/drivers/net/ethernet/amd/Kconfig
@@ -177,7 +177,7 @@ config AMD_XGBE
depends on ARM64 || COMPILE_TEST
select BITREVERSE
select CRC32
-   select PTP_1588_CLOCK
+   imply PTP_1588_CLOCK
---help---
  This driver supports the AMD 10GbE Ethernet device found on an
  AMD SoC.
diff --git a/drivers/net/ethernet/amd/xgbe/xgbe-main.c 
b/drivers/net/ethernet/amd/xgbe/xgbe-main.c
index 9de078819a..e10e569c0d 100644
--- a/drivers/net/ethernet/amd/xgbe/xgbe-main.c
+++ b/drivers/net/ethernet/amd/xgbe/xgbe-main.c
@@ -773,7 +773,8 @@ static int xgbe_probe(struct platform_device *pdev)
goto err_wq;
}
 
-   xgbe_ptp_register(pdata);
+   if (IS_REACHABLE(CONFIG_PTP_1588_CLOCK))
+   xgbe_ptp_register(pdata);
 
xgbe_debugfs_init(pdata);
 
@@ -812,7 +813,8 @@ static int xgbe_remove(struct platform_device *pdev)
 
xgbe_debugfs_exit(pdata);
 
-   xgbe_ptp_unregister(pdata);
+   if (IS_REACHABLE(CONFIG_PTP_1588_CLOCK))
+   xgbe_ptp_unregister(pdata);
 
flush_workqueue(pdata->an_workqueue);
destroy_workqueue(pdata->an_workqueue);
diff --git a/drivers/net/ethernet/broadcom/Kconfig 
b/drivers/net/ethernet/broadcom/Kconfig
index bd8c80c0b7..6a8d74aeb6 100644
--- a/drivers/net/ethernet/broadcom/Kconfig
+++ b/drivers/net/ethernet/broadcom/Kconfig
@@ -110,7 +110,7 @@ config TIGON3
depends on PCI
select PHYLIB
select HWMON
-   select PTP_1588_CLOCK
+   imply PTP_1588_CLOCK
---help---
  This driver supports Broadcom T

[PATCH v2 2/5] kconfig: introduce the "suggest" keyword

2016-10-25 Thread Nicolas Pitre
Similar to "imply" but with no added restrictions on the target symbol's
value. Useful for providing a default value to another symbol.

Suggested by Edward Cree.

Signed-off-by: Nicolas Pitre 
---
 Documentation/kbuild/kconfig-language.txt |  6 ++
 scripts/kconfig/expr.h|  2 ++
 scripts/kconfig/menu.c| 15 ++-
 scripts/kconfig/symbol.c  | 20 +++-
 scripts/kconfig/zconf.gperf   |  1 +
 scripts/kconfig/zconf.y   | 16 ++--
 6 files changed, 56 insertions(+), 4 deletions(-)

diff --git a/Documentation/kbuild/kconfig-language.txt 
b/Documentation/kbuild/kconfig-language.txt
index 5ee0dd3c85..b7f4f0ca1d 100644
--- a/Documentation/kbuild/kconfig-language.txt
+++ b/Documentation/kbuild/kconfig-language.txt
@@ -140,6 +140,12 @@ applicable everywhere (see syntax).
   ability to hook into a given subsystem while still being able to
   configure that subsystem out and keep those drivers selected.
 
+- even weaker reverse dependencies: "suggest"  ["if" ]
+  This is similar to "imply" except that this doesn't add any restrictions
+  on the value the suggested symbol may use. In other words this only
+  provides a default for the specified symbol based on the value for the
+  config entry where this is used.
+
 - limiting menu display: "visible if" 
   This attribute is only applicable to menu blocks, if the condition is
   false, the menu block is not displayed to the user (the symbols
diff --git a/scripts/kconfig/expr.h b/scripts/kconfig/expr.h
index a73f762c48..eea3aa3c7a 100644
--- a/scripts/kconfig/expr.h
+++ b/scripts/kconfig/expr.h
@@ -86,6 +86,7 @@ struct symbol {
struct expr_value dir_dep;
struct expr_value rev_dep;
struct expr_value implied;
+   struct expr_value suggested;
 };
 
 #define for_all_symbols(i, sym) for (i = 0; i < SYMBOL_HASHSIZE; i++) for (sym 
= symbol_hash[i]; sym; sym = sym->next) if (sym->type != S_OTHER)
@@ -138,6 +139,7 @@ enum prop_type {
P_CHOICE,   /* choice value */
P_SELECT,   /* select BAR */
P_IMPLY,/* imply BAR */
+   P_SUGGEST,  /* suggest BAR */
P_RANGE,/* range 7..100 (for a symbol) */
P_ENV,  /* value from environment variable */
P_SYMBOL,   /* where a symbol is defined */
diff --git a/scripts/kconfig/menu.c b/scripts/kconfig/menu.c
index e9357931b4..3abc5c85ac 100644
--- a/scripts/kconfig/menu.c
+++ b/scripts/kconfig/menu.c
@@ -255,7 +255,9 @@ static void sym_check_prop(struct symbol *sym)
break;
case P_SELECT:
case P_IMPLY:
-   use = prop->type == P_SELECT ? "select" : "imply";
+   case P_SUGGEST:
+   use = prop->type == P_SELECT ? "select" :
+ prop->type == P_IMPLY ? "imply" : "suggest";
sym2 = prop_get_symbol(prop);
if (sym->type != S_BOOLEAN && sym->type != S_TRISTATE)
prop_warn(prop,
@@ -341,6 +343,10 @@ void menu_finalize(struct menu *parent)
struct symbol *es = 
prop_get_symbol(prop);
es->implied.expr = 
expr_alloc_or(es->implied.expr,

expr_alloc_and(expr_alloc_symbol(menu->sym), expr_copy(dep)));
+   } else if (prop->type == P_SUGGEST) {
+   struct symbol *es = 
prop_get_symbol(prop);
+   es->suggested.expr = 
expr_alloc_or(es->suggested.expr,
+   
expr_alloc_and(expr_alloc_symbol(menu->sym), expr_copy(dep)));
}
}
}
@@ -687,6 +693,13 @@ static void get_symbol_str(struct gstr *r, struct symbol 
*sym,
str_append(r, "\n");
}
 
+   get_symbol_props_str(r, sym, P_SUGGEST, _("  Suggests: "));
+   if (sym->suggested.expr) {
+   str_append(r, _("  Suggested by: "));
+   expr_gstr_print(sym->suggested.expr, r);
+   str_append(r, "\n");
+   }
+
str_append(r, "\n\n");
 }
 
diff --git a/scripts/kconfig/symbol.c b/scripts/kconfig/symbol.c
index 20136ffefb..4a8094a63c 100644
--- a/scripts/kconfig/symbol.c
+++ b/scripts/kconfig/symbol.c
@@ -267,6 +267,16 @@ static void sym_calc_visibility(struct symbol *sym)
sym->implied.tri = tri;
sym_set_changed(sym);
}
+   tri = no;
+   if (sym->suggested.expr)
+   tri = expr_calc_value(sym->suggested.expr);
+   tri = EXPR_AND(tri, sym->visible);
+   if (tri == mod && sym_get_type(sym) == S_BOOLEAN)
+   tri = yes;
+   if (sym->suggested.tri != tri) {
+   sym->suggested.tri = tri;
+ 

[no subject]

2016-10-25 Thread Nicolas Pitre
From: Nicolas Pitre 
Subject: [PATCH v2 0/5] make POSIX timers optional with some Kconfig help

Many embedded systems don't need the full POSIX timer support.
Configuring them out provides a nice kernel image size reduction.

When POSIX timers are configured out, the PTP clock subsystem should be
left out as well. However, a bunch of ethernet drivers currently *select*
the latter in their Kconfig entries. Therefore some more work was needed
to break that hard dependency from those drivers without preventing their
usage altogether.

Therefore this series also includes kconfig changes to implement a new
keyword, named "imply", that expresses reverse dependencies the way
"select" does while still allowing the target config symbol to be disabled
if the user or a direct dependency says so. The "suggest" keyword is
also provided to complement "imply" but without the restrictions of
"imply" or "select".

At this point I'd like to gather ACKs especially from people in the "To"
field. Ideally this would need to go upstream as a single series to avoid
cross subsystem dependency issues, and we should decide which maintainer
tree to use.  Suggestions welcome.

Changes from v1:

- added "suggest" to kconfig for completeness
- various typo fixes
- small "imply" effect visibility fix

The bulk of the diffstat comes from the kconfig lex parser regeneration.

Diffstat:

 Documentation/kbuild/kconfig-language.txt   |   34 +
 drivers/Makefile|2 +-
 drivers/net/ethernet/adi/Kconfig|2 +-
 drivers/net/ethernet/amd/Kconfig|2 +-
 drivers/net/ethernet/amd/xgbe/xgbe-main.c   |6 +-
 drivers/net/ethernet/broadcom/Kconfig   |4 +-
 drivers/net/ethernet/cavium/Kconfig |2 +-
 drivers/net/ethernet/freescale/Kconfig  |2 +-
 drivers/net/ethernet/intel/Kconfig  |   10 +-
 drivers/net/ethernet/mellanox/mlx4/Kconfig  |2 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig |2 +-
 drivers/net/ethernet/renesas/Kconfig|2 +-
 drivers/net/ethernet/samsung/Kconfig|2 +-
 drivers/net/ethernet/sfc/Kconfig|2 +-
 drivers/net/ethernet/stmicro/stmmac/Kconfig |2 +-
 drivers/net/ethernet/ti/Kconfig |2 +-
 drivers/net/ethernet/tile/Kconfig   |2 +-
 drivers/ptp/Kconfig |   10 +-
 include/linux/posix-timers.h|   28 +-
 include/linux/ptp_clock_kernel.h|   65 +-
 include/linux/sched.h   |   10 +
 init/Kconfig|   17 +
 kernel/signal.c |4 +
 kernel/time/Makefile|   10 +-
 kernel/time/posix-stubs.c   |  118 ++
 scripts/kconfig/expr.h  |4 +
 scripts/kconfig/menu.c  |   68 +-
 scripts/kconfig/symbol.c|   42 +-
 scripts/kconfig/zconf.gperf |2 +
 scripts/kconfig/zconf.hash.c_shipped|  228 +--
 scripts/kconfig/zconf.tab.c_shipped | 1631 -
 scripts/kconfig/zconf.y |   28 +-
 32 files changed, 1300 insertions(+), 1045 deletions(-)


Re: [PATCH net-next 2/3] bpf: Add new cgroups prog type to enable sock modifications

2016-10-25 Thread David Ahern
On 10/25/16 5:39 PM, Eric Dumazet wrote:
> On Tue, 2016-10-25 at 15:30 -0700, David Ahern wrote:
>> Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
>> BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
>> any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
>> Currently only sk_bound_dev_if is exported to userspace for modification
>> by a bpf program.
>>
>> This allows a cgroup to be configured such that AF_INET{6} sockets opened
>> by processes are automatically bound to a specific device. In turn, this
>> enables the running of programs that do not support SO_BINDTODEVICE in a
>> specific VRF context / L3 domain.
> 
> Does this mean that these programs no longer can use loopback ?

I am probably misunderstanding your question, so I'll ramble a bit and see if I 
cover it.

This patch set generically allows sk_bound_dev_if to be set to any value. It 
does not check that an index corresponds to a device at that moment (either bpf 
prog install or execution of the filter), and even if it did the device can be 
deleted at any moment. That seems to be standard operating procedure with bpf 
filters (user mistakes mean packets go nowhere and in this case a socket is 
bound to a non-existent device).

The index can be any interface (e.g., eth0) or an L3 device (e.g., a VRF). 
Loopback and index=1 is allowed.

The VRF device is the loopback device for the domain, so binding to it covers 
addresses on the VRF device as well as interfaces enslaved to it.

Did you mean something else?


[PATCH] ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()

2016-10-25 Thread Eli Cooper
This patch updates skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit() when an
IPv6 header is installed to a socket buffer.

This is not a cosmetic change.  Without updating this value, GSO packets
transmitted through an ipip6 tunnel have the protocol of ETH_P_IP and
skb_mac_gso_segment() will attempt to call gso_segment() for IPv4,
which results in the packets being dropped.
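
For context, the dispatch that goes wrong looks roughly like this
(simplified from skb_mac_gso_segment() in net/core/dev.c; not an exact
copy):

        struct sk_buff *segs = ERR_PTR(-EPROTONOSUPPORT);
        struct packet_offload *ptype;
        int vlan_depth = skb->mac_len;
        __be16 type = skb_network_protocol(skb, &vlan_depth);

        rcu_read_lock();
        list_for_each_entry_rcu(ptype, &offload_base, list) {
                if (ptype->type == type && ptype->callbacks.gso_segment) {
                        /* a stale ETH_P_IP here picks the IPv4
                         * gso_segment() for what is now an IPv6 packet
                         */
                        segs = ptype->callbacks.gso_segment(skb, features);
                        break;
                }
        }
        rcu_read_unlock();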

Fixes: b8921ca83eed ("ip4ip6: Support for GSO/GRO")
Signed-off-by: Eli Cooper 
---
 net/ipv6/ip6_tunnel.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 202d16a..03e050d 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1172,6 +1172,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
if (err)
return err;
 
+   skb->protocol = htons(ETH_P_IPV6);
skb_push(skb, sizeof(struct ipv6hdr));
skb_reset_network_header(skb);
ipv6h = ipv6_hdr(skb);
-- 
2.10.1



Re: [PATCH net-next 2/3] bpf: Add new cgroups prog type to enable sock modifications

2016-10-25 Thread David Ahern
On 10/25/16 5:28 PM, Daniel Borkmann wrote:
>> +BPF_CALL_3(bpf_sock_store_u32, struct sock *, sk, u32, offset, u32, val)
>> +{
>> +u8 *ptr = (u8 *)sk;
>> +
>> +if (unlikely(offset > sizeof(*sk)))
>> +return -EFAULT;
>> +
>> +*((u32 *)ptr) = val;
>> +
>> +return 0;
>> +}
> 
> Seems strange to me. So, this helper allows to overwrite arbitrary memory
> of a struct sock instance. Potentially we could crash the kernel.
> 
> And in your sock_filter_convert_ctx_access(), you already implement inline
> read/write for the context ...
> 
> Your demo code does in pseudocode:
> 
>   r1 = sk
>   r2 = offsetof(struct bpf_sock, bound_dev_if)
>   r3 = idx
>   r1->sk_bound_dev_if = idx
>   sock_store_u32(r1, r2, r3) // updates sk_bound_dev_if again to idx
>   return 1
> 
> Dropping that helper from the patch, the only thing a program can do here
> is to read/write the sk_bound_dev_if helper per cgroup. Hmm ... dunno. So
> this really has to be for cgroups v2, right?

Showing my inexperience with the bpf code. The helper can be dropped. I'll do 
that for v2.

Yes, Daniel's patch set provides the infra for this one and it has a cgroups v2 
limitation.


Re: [PATCH net-next 2/3] bpf: Add new cgroups prog type to enable sock modifications

2016-10-25 Thread Alexei Starovoitov
On Wed, Oct 26, 2016 at 01:28:24AM +0200, Daniel Borkmann wrote:
> On 10/26/2016 12:30 AM, David Ahern wrote:
> >Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
> >BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
> >any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
> >Currently only sk_bound_dev_if is exported to userspace for modification
> >by a bpf program.
> >
> >This allows a cgroup to be configured such that AF_INET{6} sockets opened
> >by processes are automatically bound to a specific device. In turn, this
> >enables the running of programs that do not support SO_BINDTODEVICE in a
> >specific VRF context / L3 domain.
> >
> >Signed-off-by: David Ahern 
> [...]
> >@@ -524,6 +535,10 @@ struct bpf_tunnel_key {
> > __u32 tunnel_label;
> >  };
> >
> >+struct bpf_sock {
> >+__u32 bound_dev_if;
> >+};
> >+
> >  /* User return codes for XDP prog type.
> >   * A valid XDP program must return one of these defined values. All other
> >   * return codes are reserved for future use. Unknown return codes will 
> > result
> [...]
> >diff --git a/net/core/filter.c b/net/core/filter.c
> >index 4552b8c93b99..775802881b01 100644
> >--- a/net/core/filter.c
> >+++ b/net/core/filter.c
> >@@ -2482,6 +2482,27 @@ static const struct bpf_func_proto 
> >bpf_xdp_event_output_proto = {
> > .arg5_type  = ARG_CONST_STACK_SIZE,
> >  };
> >
> >+BPF_CALL_3(bpf_sock_store_u32, struct sock *, sk, u32, offset, u32, val)
> >+{
> >+u8 *ptr = (u8 *)sk;
> >+
> >+if (unlikely(offset > sizeof(*sk)))
> >+return -EFAULT;
> >+
> >+*((u32 *)ptr) = val;
> >+
> >+return 0;
> >+}
> 
> Seems strange to me. So, this helper allows to overwrite arbitrary memory
> of a struct sock instance. Potentially we could crash the kernel.
> 
> And in your sock_filter_convert_ctx_access(), you already implement inline
> read/write for the context ...
> 
> Your demo code does in pseudocode:
> 
>   r1 = sk
>   r2 = offsetof(struct bpf_sock, bound_dev_if)
>   r3 = idx
>   r1->sk_bound_dev_if = idx
>   sock_store_u32(r1, r2, r3) // updates sk_bound_dev_if again to idx
>   return 1
> 
> Dropping that helper from the patch, the only thing a program can do here
> is to read/write the sk_bound_dev_if helper per cgroup. Hmm ... dunno. So
> this really has to be for cgroups v2, right?

Looks pretty cool.
Same question as Daniel... why extra helper?
If program overwrites bpf_sock->sk_bound_dev_if can we use that
after program returns?
Also do you think it's possible to extend this patch to prototype
the port bind restrictions that were proposed few month back using
the same bpf_sock input structure?
Probably the check would need to be moved into different
place instead of sk_alloc(), but then we'll have more
opportunities to overwrite bound_dev_if, look at ports and so on ?



Re: [PATCH net] bpf: fix samples to add fake KBUILD_MODNAME

2016-10-25 Thread Alexei Starovoitov
On Wed, Oct 26, 2016 at 12:37:53AM +0200, Daniel Borkmann wrote:
> Some of the sample files are causing issues when they are loaded with tc
> and cls_bpf, meaning tc bails out while trying to parse the resulting ELF
> file as program/map/etc sections are not present, which can be easily
> spotted with readelf(1).
> 
> Currently, BPF samples are including some of the kernel headers and mid
> term we should change them to refrain from this, really. When dynamic
> debugging is enabled, we bail out due to undeclared KBUILD_MODNAME, which
> is easily overlooked in the build as clang spills this along with other
> noisy warnings from various header includes, and llc still generates an
> ELF file with mentioned characteristics. For just playing around with BPF
> examples, this can be a bit of a hurdle to take.
> 
> Just add a fake KBUILD_MODNAME as a band-aid to fix the issue, same is
> done in xdp*_kern samples already.
> 
> Fixes: 65d472fb007d ("samples/bpf: add 'pointer to packet' tests")
> Fixes: 6afb1e28b859 ("samples/bpf: Add tunnel set/get tests.")
> Fixes: a3f74617340b ("cgroup: bpf: Add an example to do cgroup checking in 
> BPF")
> Reported-by: Chandrasekar Kannan 
> Signed-off-by: Daniel Borkmann 
> ---
>  samples/bpf/parse_ldabs.c| 1 +
>  samples/bpf/parse_simple.c   | 1 +
>  samples/bpf/parse_varlen.c   | 1 +
>  samples/bpf/tcbpf1_kern.c| 1 +
>  samples/bpf/tcbpf2_kern.c| 1 +
>  samples/bpf/test_cgrp2_tc_kern.c | 1 +
>  6 files changed, 6 insertions(+)

It's also needed for all of tracex*_kern.c, right?

For networking samples we probably should get rid of kernel headers.
I guess they were there by copy-paste mistake from tracing; tracing
samples actually need to include them, since they do bpf_probe_read
into kernel data structures.
For this patch in the mean time:
Acked-by: Alexei Starovoitov 



[PATCH net-next V3 8/9] liquidio CN23XX: copyrights changes and alignment

2016-10-25 Thread Raghu Vatsavayi
Updated copyright comments and also changed the alignment of some
other comments.

Signed-off-by: Raghu Vatsavayi 
Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
---
 .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 53 ++
 .../ethernet/cavium/liquidio/cn23xx_pf_device.h| 39 +++-
 .../net/ethernet/cavium/liquidio/cn23xx_pf_regs.h  | 39 +++-
 .../net/ethernet/cavium/liquidio/cn66xx_device.c   | 36 +++
 .../net/ethernet/cavium/liquidio/cn66xx_device.h   | 37 +++
 drivers/net/ethernet/cavium/liquidio/cn66xx_regs.h | 37 +++
 .../net/ethernet/cavium/liquidio/cn68xx_device.c   | 36 +++
 .../net/ethernet/cavium/liquidio/cn68xx_device.h   | 37 +++
 drivers/net/ethernet/cavium/liquidio/cn68xx_regs.h | 37 +++
 drivers/net/ethernet/cavium/liquidio/lio_core.c| 36 +++
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c | 42 -
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 36 +++
 .../net/ethernet/cavium/liquidio/liquidio_common.h | 37 +++
 .../net/ethernet/cavium/liquidio/liquidio_image.h  | 36 +++
 .../net/ethernet/cavium/liquidio/octeon_config.h   | 37 +++
 .../net/ethernet/cavium/liquidio/octeon_console.c  | 43 --
 .../net/ethernet/cavium/liquidio/octeon_device.c   | 36 +++
 .../net/ethernet/cavium/liquidio/octeon_device.h   | 45 --
 drivers/net/ethernet/cavium/liquidio/octeon_droq.c | 36 +++
 drivers/net/ethernet/cavium/liquidio/octeon_droq.h | 17 +++
 drivers/net/ethernet/cavium/liquidio/octeon_iq.h   | 21 -
 .../net/ethernet/cavium/liquidio/octeon_mailbox.c  |  3 --
 .../net/ethernet/cavium/liquidio/octeon_mailbox.h  |  3 --
 drivers/net/ethernet/cavium/liquidio/octeon_main.h | 19 +++-
 .../net/ethernet/cavium/liquidio/octeon_mem_ops.c  |  5 +-
 .../net/ethernet/cavium/liquidio/octeon_mem_ops.h  |  5 +-
 .../net/ethernet/cavium/liquidio/octeon_network.h  |  5 +-
 drivers/net/ethernet/cavium/liquidio/octeon_nic.c  |  5 +-
 drivers/net/ethernet/cavium/liquidio/octeon_nic.h  |  5 +-
 .../net/ethernet/cavium/liquidio/request_manager.c |  5 +-
 .../ethernet/cavium/liquidio/response_manager.c|  5 +-
 .../ethernet/cavium/liquidio/response_manager.h|  5 +-
 32 files changed, 352 insertions(+), 486 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c 
b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index d6bbccd..c9a706d 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -1,27 +1,21 @@
 /**
-* Author: Cavium, Inc.
-*
-* Contact: supp...@cavium.com
-*  Please include "LiquidIO" in the subject.
-*
-* Copyright (c) 2003-2015 Cavium, Inc.
-*
-* This file is free software; you can redistribute it and/or modify
-* it under the terms of the GNU General Public License, Version 2, as
-* published by the Free Software Foundation.
-*
-* This file is distributed in the hope that it will be useful, but
-* AS-IS and WITHOUT ANY WARRANTY; without even the implied warranty
-* of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, TITLE, or
-* NONINFRINGEMENT.  See the GNU General Public License for more
-* details.
-*
-* This file may also be available under a different license from Cavium.
-* Contact Cavium, Inc. for more information
-**/
-
+ * Author: Cavium, Inc.
+ *
+ * Contact: supp...@cavium.com
+ *  Please include "LiquidIO" in the subject.
+ *
+ * Copyright (c) 2003-2016 Cavium, Inc.
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, Version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This file is distributed in the hope that it will be useful, but
+ * AS-IS and WITHOUT ANY WARRANTY; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, TITLE, or
+ * NONINFRINGEMENT.  See the GNU General Public License for more details.
+ ***/
 #include 
-#include 
 #include 
 #include 
 #include "liquidio_common.h"
@@ -421,10 +415,10 @@ static int cn23xx_pf_setup_global_input_regs(struct 
octeon_device *oct)
return -1;
 
/** Set the MAC_NUM and PVF_NUM in IQ_PKT_CONTROL reg
-   * for all queues.Only PF can set these bits.
-   * bits 29:30 indicate the MAC num.
-   * bits 32:47 indicate the PVF num.
-   */
+* for all queues.Only PF can set these bits.
+* bits 29:30 indicate the MAC num.
+* bits 32:47 indicate the PVF num.
+*/
for (q_no = 0; q_no < ern; q_no++) {

[PATCH net-next V3 4/9] liquidio CN23XX: mailbox interrupt processing

2016-10-25 Thread Raghu Vatsavayi
Adds support for mailbox interrupt processing of various
commands.

Signed-off-by: Raghu Vatsavayi 
Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
---
 .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 157 +
 drivers/net/ethernet/cavium/liquidio/lio_main.c|  12 ++
 .../net/ethernet/cavium/liquidio/octeon_device.c   |   1 +
 .../net/ethernet/cavium/liquidio/octeon_device.h   |  21 ++-
 4 files changed, 184 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c 
b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index 2c7cf89..37d1a4e 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -30,6 +30,7 @@
 #include "octeon_device.h"
 #include "cn23xx_pf_device.h"
 #include "octeon_main.h"
+#include "octeon_mailbox.h"
 
 #define RESET_NOTDONE 0
 #define RESET_DONE 1
@@ -677,6 +678,118 @@ static void cn23xx_setup_oq_regs(struct octeon_device 
*oct, u32 oq_no)
}
 }
 
+static void cn23xx_pf_mbox_thread(struct work_struct *work)
+{
+   struct cavium_wk *wk = (struct cavium_wk *)work;
+   struct octeon_mbox *mbox = (struct octeon_mbox *)wk->ctxptr;
+   struct octeon_device *oct = mbox->oct_dev;
+   u64 mbox_int_val, val64;
+   u32 q_no, i;
+
+   if (oct->rev_id < OCTEON_CN23XX_REV_1_1) {
+   /*read and clear by writing 1*/
+   mbox_int_val = readq(mbox->mbox_int_reg);
+   writeq(mbox_int_val, mbox->mbox_int_reg);
+
+   for (i = 0; i < oct->sriov_info.num_vfs_alloced; i++) {
+   q_no = i * oct->sriov_info.rings_per_vf;
+
+   val64 = readq(oct->mbox[q_no]->mbox_write_reg);
+
+   if (val64 && (val64 != OCTEON_PFVFACK)) {
+   if (octeon_mbox_read(oct->mbox[q_no]))
+   octeon_mbox_process_message(
+   oct->mbox[q_no]);
+   }
+   }
+
+   schedule_delayed_work(&wk->work, msecs_to_jiffies(10));
+   } else {
+   octeon_mbox_process_message(mbox);
+   }
+}
+
+static int cn23xx_setup_pf_mbox(struct octeon_device *oct)
+{
+   struct octeon_mbox *mbox = NULL;
+   u16 mac_no = oct->pcie_port;
+   u16 pf_num = oct->pf_num;
+   u32 q_no, i;
+
+   if (!oct->sriov_info.max_vfs)
+   return 0;
+
+   for (i = 0; i < oct->sriov_info.max_vfs; i++) {
+   q_no = i * oct->sriov_info.rings_per_vf;
+
+   mbox = vmalloc(sizeof(*mbox));
+   if (!mbox)
+   goto free_mbox;
+
+   memset(mbox, 0, sizeof(struct octeon_mbox));
+
+   spin_lock_init(&mbox->lock);
+
+   mbox->oct_dev = oct;
+
+   mbox->q_no = q_no;
+
+   mbox->state = OCTEON_MBOX_STATE_IDLE;
+
+   /* PF mbox interrupt reg */
+   mbox->mbox_int_reg = (u8 *)oct->mmio[0].hw_addr +
+CN23XX_SLI_MAC_PF_MBOX_INT(mac_no, pf_num);
+
+   /* PF writes into SIG0 reg */
+   mbox->mbox_write_reg = (u8 *)oct->mmio[0].hw_addr +
+  CN23XX_SLI_PKT_PF_VF_MBOX_SIG(q_no, 0);
+
+   /* PF reads from SIG1 reg */
+   mbox->mbox_read_reg = (u8 *)oct->mmio[0].hw_addr +
+ CN23XX_SLI_PKT_PF_VF_MBOX_SIG(q_no, 1);
+
+   /*Mail Box Thread creation*/
+   INIT_DELAYED_WORK(&mbox->mbox_poll_wk.work,
+ cn23xx_pf_mbox_thread);
+   mbox->mbox_poll_wk.ctxptr = (void *)mbox;
+
+   oct->mbox[q_no] = mbox;
+
+   writeq(OCTEON_PFVFSIG, mbox->mbox_read_reg);
+   }
+
+   if (oct->rev_id < OCTEON_CN23XX_REV_1_1)
+   schedule_delayed_work(&oct->mbox[0]->mbox_poll_wk.work,
+ msecs_to_jiffies(0));
+
+   return 0;
+
+free_mbox:
+   while (i) {
+   i--;
+   vfree(oct->mbox[i]);
+   }
+
+   return 1;
+}
+
+static int cn23xx_free_pf_mbox(struct octeon_device *oct)
+{
+   u32 q_no, i;
+
+   if (!oct->sriov_info.max_vfs)
+   return 0;
+
+   for (i = 0; i < oct->sriov_info.max_vfs; i++) {
+   q_no = i * oct->sriov_info.rings_per_vf;
+   cancel_delayed_work_sync(
+   &oct->mbox[q_no]->mbox_poll_wk.work);
+   vfree(oct->mbox[q_no]);
+   }
+
+   return 0;
+}
+
 static int cn23xx_enable_io_queues(struct octeon_device *oct)
 {
u64 reg_val;
@@ -871,6 +984,29 @@ static u64 cn23xx_pf_msix_interrupt_handler(void *dev)
return ret;
 }
 
+static void cn23xx_handle_pf_mbox_intr(struct octeon_device *oct)
+{
+   struct delayed_work *work;
+  

[PATCH net-next V3 9/9] liquidio CN23XX: fix for new check patch errors

2016-10-25 Thread Raghu Vatsavayi
The new checkpatch script shows some errors in the pre-existing
driver. This patch provides fixes for those errors.

Signed-off-by: Raghu Vatsavayi 
Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
---
 .../net/ethernet/cavium/liquidio/cn23xx_pf_regs.h  |  12 +--
 drivers/net/ethernet/cavium/liquidio/cn66xx_regs.h |  12 +--
 .../net/ethernet/cavium/liquidio/cn68xx_device.c   |   2 +-
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c |   9 +-
 drivers/net/ethernet/cavium/liquidio/lio_main.c|   9 +-
 .../net/ethernet/cavium/liquidio/liquidio_common.h |  50 -
 .../net/ethernet/cavium/liquidio/octeon_console.c  | 113 ++---
 .../net/ethernet/cavium/liquidio/octeon_device.c   |  23 ++---
 .../net/ethernet/cavium/liquidio/octeon_device.h   |  20 ++--
 drivers/net/ethernet/cavium/liquidio/octeon_droq.c |  40 
 drivers/net/ethernet/cavium/liquidio/octeon_iq.h   |   3 +
 .../net/ethernet/cavium/liquidio/octeon_mem_ops.c  |   2 +-
 .../net/ethernet/cavium/liquidio/octeon_network.h  |   6 +-
 drivers/net/ethernet/cavium/liquidio/octeon_nic.h  |   2 +-
 .../net/ethernet/cavium/liquidio/request_manager.c |  16 ++-
 15 files changed, 149 insertions(+), 170 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_regs.h 
b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_regs.h
index 680a405..e6d4ad9 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_regs.h
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_regs.h
@@ -58,7 +58,7 @@
 
 #define CN23XX_CONFIG_SRIOV_BAR_START 0x19C
 #define CN23XX_CONFIG_SRIOV_BARX(i)\
-   (CN23XX_CONFIG_SRIOV_BAR_START + (i * 4))
+   (CN23XX_CONFIG_SRIOV_BAR_START + ((i) * 4))
 #define CN23XX_CONFIG_SRIOV_BAR_PF0x08
 #define CN23XX_CONFIG_SRIOV_BAR_64BIT 0x04
 #define CN23XX_CONFIG_SRIOV_BAR_IO0x01
@@ -508,7 +508,7 @@
 /* 4 Registers (64 - bit) */
 #defineCN23XX_SLI_S2M_PORT_CTL_START 0x23D80
 #defineCN23XX_SLI_S2M_PORTX_CTL(port)  \
-   (CN23XX_SLI_S2M_PORT_CTL_START + (port * 0x10))
+   (CN23XX_SLI_S2M_PORT_CTL_START + ((port) * 0x10))
 
 #defineCN23XX_SLI_MAC_NUMBER 0x20050
 
@@ -549,26 +549,26 @@
  * Provides DMA Engine Queue Enable
  */
 #defineCN23XX_DPI_DMA_ENG0_ENB0x0001df80ULL
-#defineCN23XX_DPI_DMA_ENG_ENB(eng) (CN23XX_DPI_DMA_ENG0_ENB + (eng * 8))
+#defineCN23XX_DPI_DMA_ENG_ENB(eng) (CN23XX_DPI_DMA_ENG0_ENB + ((eng) * 8))
 
 /* 8 register (64-bit) - DPI_DMA(0..7)_REQQ_CTL
  * Provides control bits for transaction on 8 Queues
  */
 #defineCN23XX_DPI_DMA_REQQ0_CTL   0x0001df000180ULL
 #defineCN23XX_DPI_DMA_REQQ_CTL(q_no)   \
-   (CN23XX_DPI_DMA_REQQ0_CTL + (q_no * 8))
+   (CN23XX_DPI_DMA_REQQ0_CTL + ((q_no) * 8))
 
 /* 6 register (64-bit) - DPI_ENG(0..5)_BUF
  * Provides DMA Engine FIFO (Queue) Size
  */
 #defineCN23XX_DPI_DMA_ENG0_BUF0x0001df000880ULL
 #defineCN23XX_DPI_DMA_ENG_BUF(eng)   \
-   (CN23XX_DPI_DMA_ENG0_BUF + (eng * 8))
+   (CN23XX_DPI_DMA_ENG0_BUF + ((eng) * 8))
 
 /* 4 Registers (64-bit) */
 #defineCN23XX_DPI_SLI_PRT_CFG_START   0x0001df000900ULL
 #defineCN23XX_DPI_SLI_PRTX_CFG(port)\
-   (CN23XX_DPI_SLI_PRT_CFG_START + (port * 0x8))
+   (CN23XX_DPI_SLI_PRT_CFG_START + ((port) * 0x8))
 
 /* Masks for DPI_DMA_CONTROL Register */
 #defineCN23XX_DPI_DMA_COMMIT_MODE BIT_ULL(58)
diff --git a/drivers/net/ethernet/cavium/liquidio/cn66xx_regs.h 
b/drivers/net/ethernet/cavium/liquidio/cn66xx_regs.h
index 23152c0..b248966 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn66xx_regs.h
+++ b/drivers/net/ethernet/cavium/liquidio/cn66xx_regs.h
@@ -438,10 +438,10 @@
 #defineCN6XXX_SLI_S2M_PORT0_CTL  0x3D80
 #defineCN6XXX_SLI_S2M_PORT1_CTL  0x3D90
 #defineCN6XXX_SLI_S2M_PORTX_CTL(port)\
-   (CN6XXX_SLI_S2M_PORT0_CTL + (port * 0x10))
+   (CN6XXX_SLI_S2M_PORT0_CTL + ((port) * 0x10))
 
 #defineCN6XXX_SLI_INT_ENB64(port)\
-   (CN6XXX_SLI_INT_ENB64_PORT0 + (port * 0x10))
+   (CN6XXX_SLI_INT_ENB64_PORT0 + ((port) * 0x10))
 
 #defineCN6XXX_SLI_MAC_NUMBER 0x3E00
 
@@ -453,7 +453,7 @@
 #defineCN6XXX_PCI_BAR1_OFFSET  0x8
 
 #defineCN6XXX_BAR1_REG(idx, port) \
-   (CN6XXX_BAR1_INDEX_START + (port * CN6XXX_PEM_OFFSET) + \
+   (CN6XXX_BAR1_INDEX_START + ((port) * CN6XXX_PEM_OFFSET) + \
(CN6XXX_PCI_BAR1_OFFSET * (idx)))
 
 /* DPI #*/
@@ -471,17 +471,17 @@
 #defineCN6XXX_DPI_DMA_ENG0_ENB0x0001df80ULL
 
 #defineCN6XXX_DPI_DMA_ENG_ENB(q_no)   \
-   (CN6XXX_DPI_DMA_ENG0_ENB + (q_no * 8))
+   (CN6XXX_DPI_DMA_ENG0_ENB + ((q_no) * 8))
 
 #d

[PATCH net-next V3 6/9] liquidio CN23XX: device states

2016-10-25 Thread Raghu Vatsavayi
Cleaned up resource leaks during resource teardown by
introducing more device states.

Signed-off-by: Raghu Vatsavayi 
Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 33 --
 .../net/ethernet/cavium/liquidio/octeon_device.c   |  6 +++-
 .../net/ethernet/cavium/liquidio/octeon_device.h   | 29 ++-
 drivers/net/ethernet/cavium/liquidio/octeon_droq.c | 13 +
 drivers/net/ethernet/cavium/liquidio/octeon_main.h |  8 --
 .../net/ethernet/cavium/liquidio/request_manager.c |  6 +++-
 6 files changed, 64 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index b31ab7e..fcf38ab 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -780,6 +780,7 @@ static void delete_glists(struct lio *lio)
}
 
kfree((void *)lio->glist);
+   kfree((void *)lio->glist_lock);
 }
 
 /**
@@ -1339,6 +1340,7 @@ static int liquidio_watchdog(void *param)
complete(&first_stage);
 
if (octeon_device_init(oct_dev)) {
+   complete(&hs->init);
liquidio_remove(pdev);
return -ENOMEM;
}
@@ -1363,7 +1365,15 @@ static int liquidio_watchdog(void *param)
oct_dev->watchdog_task = kthread_create(
liquidio_watchdog, oct_dev,
"liowd/%02hhx:%02hhx.%hhx", bus, device, function);
-   wake_up_process(oct_dev->watchdog_task);
+   if (!IS_ERR(oct_dev->watchdog_task)) {
+   wake_up_process(oct_dev->watchdog_task);
+   } else {
+   oct_dev->watchdog_task = NULL;
+   dev_err(&oct_dev->pci_dev->dev,
+   "failed to create kernel_thread\n");
+   liquidio_remove(pdev);
+   return -1;
+   }
}
}
 
@@ -1427,6 +1437,8 @@ static void octeon_destroy_resources(struct octeon_device 
*oct)
if (lio_wait_for_oq_pkts(oct))
dev_err(&oct->pci_dev->dev, "OQ had pending packets\n");
 
+   /* fallthrough */
+   case OCT_DEV_INTR_SET_DONE:
/* Disable interrupts  */
oct->fn_list.disable_interrupt(oct, OCTEON_ALL_INTR);
 
@@ -1453,6 +1465,8 @@ static void octeon_destroy_resources(struct octeon_device 
*oct)
pci_disable_msi(oct->pci_dev);
}
 
+   /* fallthrough */
+   case OCT_DEV_MSIX_ALLOC_VECTOR_DONE:
if (OCTEON_CN23XX_PF(oct))
octeon_free_ioq_vector(oct);
 
@@ -1516,10 +1530,13 @@ static void octeon_destroy_resources(struct 
octeon_device *oct)
octeon_unmap_pci_barx(oct, 1);
 
/* fallthrough */
-   case OCT_DEV_BEGIN_STATE:
+   case OCT_DEV_PCI_ENABLE_DONE:
+   pci_clear_master(oct->pci_dev);
/* Disable the device, releasing the PCI INT */
pci_disable_device(oct->pci_dev);
 
+   /* fallthrough */
+   case OCT_DEV_BEGIN_STATE:
/* Nothing to be done here either */
break;
}   /* end switch (oct->status) */
@@ -1798,6 +1815,7 @@ static int octeon_pci_os_setup(struct octeon_device *oct)
 
if (dma_set_mask_and_coherent(&oct->pci_dev->dev, DMA_BIT_MASK(64))) {
dev_err(&oct->pci_dev->dev, "Unexpected DMA device 
capability\n");
+   pci_disable_device(oct->pci_dev);
return 1;
}
 
@@ -4452,6 +4470,8 @@ static int octeon_device_init(struct octeon_device 
*octeon_dev)
if (octeon_pci_os_setup(octeon_dev))
return 1;
 
+   atomic_set(&octeon_dev->status, OCT_DEV_PCI_ENABLE_DONE);
+
/* Identify the Octeon type and map the BAR address space. */
if (octeon_chip_specific_setup(octeon_dev)) {
dev_err(&octeon_dev->pci_dev->dev, "Chip specific setup 
failed\n");
@@ -4523,9 +4543,6 @@ static int octeon_device_init(struct octeon_device 
*octeon_dev)
if (octeon_setup_instr_queues(octeon_dev)) {
dev_err(&octeon_dev->pci_dev->dev,
"instruction queue initialization failed\n");
-   /* On error, release any previously allocated queues */
-   for (j = 0; j < octeon_dev->num_iqs; j++)
-   octeon_delete_instr_queue(octeon_dev, j);
return 1;
}
atomic_set(&octeon_dev->status, OCT_DEV_INSTR_QUEUE_INIT_DONE);
@@ -4541,9 +4558,6 @@ static int octeon_device_init(struct octeon_device 
*octeon_dev)
 
if (

[PATCH net-next V3 5/9] liquidio CN23XX: VF related operations

2016-10-25 Thread Raghu Vatsavayi
Adds support for VF-related operations such as MAC address, VLAN
and link changes.

Signed-off-by: Raghu Vatsavayi 
Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
---
 .../ethernet/cavium/liquidio/cn23xx_pf_device.c|  22 +++
 .../ethernet/cavium/liquidio/cn23xx_pf_device.h|   3 +
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 214 +
 .../net/ethernet/cavium/liquidio/liquidio_common.h |   5 +
 .../net/ethernet/cavium/liquidio/octeon_device.h   |   8 +
 5 files changed, 252 insertions(+)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c 
b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index 37d1a4e..d6bbccd 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "liquidio_common.h"
 #include "octeon_droq.h"
 #include "octeon_iq.h"
@@ -1457,3 +1458,24 @@ int cn23xx_fw_loaded(struct octeon_device *oct)
val = octeon_read_csr64(oct, CN23XX_SLI_SCRATCH1);
return (val >> 1) & 1ULL;
 }
+
+void cn23xx_tell_vf_its_macaddr_changed(struct octeon_device *oct, int vfidx,
+   u8 *mac)
+{
+   if (oct->sriov_info.vf_drv_loaded_mask & BIT_ULL(vfidx)) {
+   struct octeon_mbox_cmd mbox_cmd;
+
+   mbox_cmd.msg.u64 = 0;
+   mbox_cmd.msg.s.type = OCTEON_MBOX_REQUEST;
+   mbox_cmd.msg.s.resp_needed = 0;
+   mbox_cmd.msg.s.cmd = OCTEON_PF_CHANGED_VF_MACADDR;
+   mbox_cmd.msg.s.len = 1;
+   mbox_cmd.recv_len = 0;
+   mbox_cmd.recv_status = 0;
+   mbox_cmd.fn = NULL;
+   mbox_cmd.fn_arg = 0;
+   ether_addr_copy(mbox_cmd.msg.s.params, mac);
+   mbox_cmd.q_no = vfidx * oct->sriov_info.rings_per_vf;
+   octeon_mbox_write(oct, &mbox_cmd);
+   }
+}
diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h 
b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h
index 21b5c90..20a9dc5 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.h
@@ -56,4 +56,7 @@ int validate_cn23xx_pf_config_info(struct octeon_device *oct,
 void cn23xx_dump_pf_initialized_regs(struct octeon_device *oct);
 
 int cn23xx_fw_loaded(struct octeon_device *oct);
+
+void cn23xx_tell_vf_its_macaddr_changed(struct octeon_device *oct, int vfidx,
+   u8 *mac);
 #endif
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 0fc6257..b31ab7e 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -3590,6 +3590,151 @@ static void liquidio_del_vxlan_port(struct net_device 
*netdev,
OCTNET_CMD_VXLAN_PORT_DEL);
 }
 
+static int __liquidio_set_vf_mac(struct net_device *netdev, int vfidx,
+u8 *mac, bool is_admin_assigned)
+{
+   struct lio *lio = GET_LIO(netdev);
+   struct octeon_device *oct = lio->oct_dev;
+   struct octnic_ctrl_pkt nctrl;
+
+   if (!is_valid_ether_addr(mac))
+   return -EINVAL;
+
+   if (vfidx < 0 || vfidx >= oct->sriov_info.max_vfs)
+   return -EINVAL;
+
+   memset(&nctrl, 0, sizeof(struct octnic_ctrl_pkt));
+
+   nctrl.ncmd.u64 = 0;
+   nctrl.ncmd.s.cmd = OCTNET_CMD_CHANGE_MACADDR;
+   /* vfidx is 0 based, but vf_num (param1) is 1 based */
+   nctrl.ncmd.s.param1 = vfidx + 1;
+   nctrl.ncmd.s.param2 = (is_admin_assigned ? 1 : 0);
+   nctrl.ncmd.s.more = 1;
+   nctrl.iq_no = lio->linfo.txpciq[0].s.q_no;
+   nctrl.cb_fn = 0;
+   nctrl.wait_time = 100;
+
+   nctrl.udd[0] = 0;
+   /* The MAC Address is presented in network byte order. */
+   ether_addr_copy((u8 *)&nctrl.udd[0] + 2, mac);
+
+   oct->sriov_info.vf_macaddr[vfidx] = nctrl.udd[0];
+
+   octnet_send_nic_ctrl_pkt(oct, &nctrl);
+
+   return 0;
+}
+
+static int liquidio_set_vf_mac(struct net_device *netdev, int vfidx, u8 *mac)
+{
+   struct lio *lio = GET_LIO(netdev);
+   struct octeon_device *oct = lio->oct_dev;
+   int retval;
+
+   retval = __liquidio_set_vf_mac(netdev, vfidx, mac, true);
+   if (!retval)
+   cn23xx_tell_vf_its_macaddr_changed(oct, vfidx, mac);
+
+   return retval;
+}
+
+static int liquidio_set_vf_vlan(struct net_device *netdev, int vfidx,
+   u16 vlan, u8 qos, __be16 vlan_proto)
+{
+   struct lio *lio = GET_LIO(netdev);
+   struct octeon_device *oct = lio->oct_dev;
+   struct octnic_ctrl_pkt nctrl;
+   u16 vlantci;
+
+   if (vfidx < 0 || vfidx >= oct->sriov_info.num_vfs_alloced)
+   return -EINVAL;
+
+   if (vl

[PATCH net-next V3 0/9] liquidio CN23XX VF support

2016-10-25 Thread Raghu Vatsavayi
Dave,

Following is the V3 patch series for adding VF support on
CN23XX devices. This version addresses:
1) Your concern about ordering local variable declarations
   from longest to shortest line.
2) Removal of the custom module parameter max_vfs, as you recommended.
3) Minor fixes for new checkpatch script errors in the
   pre-existing driver.

I will post the remaining VF patches soon after this patch series is
applied. Please apply the patches in the following order, as some of
them depend on earlier patches.

Thanks.


Raghu Vatsavayi (9):
  liquidio CN23XX: HW config for VF support
  liquidio CN23XX: sysfs VF config support
  liquidio CN23XX: Mailbox support
  liquidio CN23XX: mailbox interrupt processing
  liquidio CN23XX: VF related operations
  liquidio CN23XX: device states
  liquidio CN23XX: code cleanup
  liquidio CN23XX: copyrights changes and alignment
  liquidio CN23XX: fix for new check patch errors

 drivers/net/ethernet/cavium/liquidio/Makefile  |   1 +
 .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 357 ++---
 .../ethernet/cavium/liquidio/cn23xx_pf_device.h|  42 +-
 .../net/ethernet/cavium/liquidio/cn23xx_pf_regs.h  |  51 ++-
 .../net/ethernet/cavium/liquidio/cn66xx_device.c   |  49 +--
 .../net/ethernet/cavium/liquidio/cn66xx_device.h   |  41 +-
 drivers/net/ethernet/cavium/liquidio/cn66xx_regs.h |  49 +--
 .../net/ethernet/cavium/liquidio/cn68xx_device.c   |  38 +-
 .../net/ethernet/cavium/liquidio/cn68xx_device.h   |  37 +-
 drivers/net/ethernet/cavium/liquidio/cn68xx_regs.h |  37 +-
 drivers/net/ethernet/cavium/liquidio/lio_core.c|  68 +++-
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c |  65 ++-
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 442 ++---
 .../net/ethernet/cavium/liquidio/liquidio_common.h | 100 +++--
 .../net/ethernet/cavium/liquidio/liquidio_image.h  |  36 +-
 .../net/ethernet/cavium/liquidio/octeon_config.h   |  46 ++-
 .../net/ethernet/cavium/liquidio/octeon_console.c  | 156 
 .../net/ethernet/cavium/liquidio/octeon_device.c   |  74 ++--
 .../net/ethernet/cavium/liquidio/octeon_device.h   | 133 ---
 drivers/net/ethernet/cavium/liquidio/octeon_droq.c |  91 +++--
 drivers/net/ethernet/cavium/liquidio/octeon_droq.h |  18 +-
 drivers/net/ethernet/cavium/liquidio/octeon_iq.h   |  25 +-
 .../net/ethernet/cavium/liquidio/octeon_mailbox.c  | 318 +++
 .../net/ethernet/cavium/liquidio/octeon_mailbox.h  | 112 ++
 drivers/net/ethernet/cavium/liquidio/octeon_main.h |  47 +--
 .../net/ethernet/cavium/liquidio/octeon_mem_ops.c  |   7 +-
 .../net/ethernet/cavium/liquidio/octeon_mem_ops.h  |   5 +-
 .../net/ethernet/cavium/liquidio/octeon_network.h  |  11 +-
 drivers/net/ethernet/cavium/liquidio/octeon_nic.c  |   5 +-
 drivers/net/ethernet/cavium/liquidio/octeon_nic.h  |   7 +-
 .../net/ethernet/cavium/liquidio/request_manager.c |  34 +-
 .../ethernet/cavium/liquidio/response_manager.c|  11 +-
 .../ethernet/cavium/liquidio/response_manager.h|   6 +-
 33 files changed, 1733 insertions(+), 786 deletions(-)
 create mode 100644 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
 create mode 100644 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.h

-- 
1.8.3.1



[PATCH net-next V3 1/9] liquidio CN23XX: HW config for VF support

2016-10-25 Thread Raghu Vatsavayi
Adds support for configuring HW for creating VFs.

Signed-off-by: Raghu Vatsavayi 
Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
---
 .../ethernet/cavium/liquidio/cn23xx_pf_device.c| 125 -
 drivers/net/ethernet/cavium/liquidio/lio_main.c|  23 
 .../net/ethernet/cavium/liquidio/octeon_config.h   |   6 +
 .../net/ethernet/cavium/liquidio/octeon_device.h   |  12 +-
 4 files changed, 135 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c 
b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
index 380a641..2c7cf89 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
@@ -40,11 +40,6 @@
  */
 #define CN23XX_INPUT_JABBER 64600
 
-#define LIOLUT_RING_DISTRIBUTION 9
-const int liolut_num_vfs_to_rings_per_vf[LIOLUT_RING_DISTRIBUTION] = {
-   0, 8, 4, 2, 2, 2, 1, 1, 1
-};
-
 void cn23xx_dump_pf_initialized_regs(struct octeon_device *oct)
 {
int i = 0;
@@ -309,9 +304,10 @@ u32 cn23xx_pf_get_oq_ticks(struct octeon_device *oct, u32 
time_intr_in_us)
 
 static void cn23xx_setup_global_mac_regs(struct octeon_device *oct)
 {
-   u64 reg_val;
u16 mac_no = oct->pcie_port;
u16 pf_num = oct->pf_num;
+   u64 reg_val;
+   u64 temp;
 
/* programming SRN and TRS for each MAC(0..3)  */
 
@@ -333,6 +329,14 @@ static void cn23xx_setup_global_mac_regs(struct 
octeon_device *oct)
/* setting TRS <23:16> */
reg_val = reg_val |
  (oct->sriov_info.trs << CN23XX_PKT_MAC_CTL_RINFO_TRS_BIT_POS);
+   /* setting RPVF <39:32> */
+   temp = oct->sriov_info.rings_per_vf & 0xff;
+   reg_val |= (temp << CN23XX_PKT_MAC_CTL_RINFO_RPVF_BIT_POS);
+
+   /* setting NVFS <55:48> */
+   temp = oct->sriov_info.max_vfs & 0xff;
+   reg_val |= (temp << CN23XX_PKT_MAC_CTL_RINFO_NVFS_BIT_POS);
+
/* write these settings to MAC register */
octeon_write_csr64(oct, CN23XX_SLI_PKT_MAC_RINFO64(mac_no, pf_num),
   reg_val);
@@ -399,11 +403,12 @@ static int cn23xx_reset_io_queues(struct octeon_device 
*oct)
 
 static int cn23xx_pf_setup_global_input_regs(struct octeon_device *oct)
 {
+   struct octeon_cn23xx_pf *cn23xx = (struct octeon_cn23xx_pf *)oct->chip;
+   struct octeon_instr_queue *iq;
+   u64 intr_threshold, reg_val;
u32 q_no, ern, srn;
u64 pf_num;
-   u64 intr_threshold, reg_val;
-   struct octeon_instr_queue *iq;
-   struct octeon_cn23xx_pf *cn23xx = (struct octeon_cn23xx_pf *)oct->chip;
+   u64 vf_num;
 
pf_num = oct->pf_num;
 
@@ -420,6 +425,16 @@ static int cn23xx_pf_setup_global_input_regs(struct 
octeon_device *oct)
*/
for (q_no = 0; q_no < ern; q_no++) {
reg_val = oct->pcie_port << CN23XX_PKT_INPUT_CTL_MAC_NUM_POS;
+
+   /* for VF assigned queues. */
+   if (q_no < oct->sriov_info.pf_srn) {
+   vf_num = q_no / oct->sriov_info.rings_per_vf;
+   vf_num += 1; /* VF1, VF2, */
+   } else {
+   vf_num = 0;
+   }
+
+   reg_val |= vf_num << CN23XX_PKT_INPUT_CTL_VF_NUM_POS;
reg_val |= pf_num << CN23XX_PKT_INPUT_CTL_PF_NUM_POS;
 
octeon_write_csr64(oct, CN23XX_SLI_IQ_PKT_CONTROL64(q_no),
@@ -1048,50 +1063,100 @@ static void cn23xx_setup_reg_address(struct 
octeon_device *oct)
 
 static int cn23xx_sriov_config(struct octeon_device *oct)
 {
-   u32 total_rings;
struct octeon_cn23xx_pf *cn23xx = (struct octeon_cn23xx_pf *)oct->chip;
-   /* num_vfs is already filled for us */
+   u32 max_rings, total_rings, max_vfs;
u32 pf_srn, num_pf_rings;
+   u32 max_possible_vfs;
+   u32 rings_per_vf = 0;
 
cn23xx->conf =
-   (struct octeon_config *)oct_get_config_info(oct, LIO_23XX);
+   (struct octeon_config *)oct_get_config_info(oct, LIO_23XX);
switch (oct->rev_id) {
case OCTEON_CN23XX_REV_1_0:
-   total_rings = CN23XX_MAX_RINGS_PER_PF_PASS_1_0;
+   max_rings = CN23XX_MAX_RINGS_PER_PF_PASS_1_0;
+   max_possible_vfs = CN23XX_MAX_VFS_PER_PF_PASS_1_0;
break;
case OCTEON_CN23XX_REV_1_1:
-   total_rings = CN23XX_MAX_RINGS_PER_PF_PASS_1_1;
+   max_rings = CN23XX_MAX_RINGS_PER_PF_PASS_1_1;
+   max_possible_vfs = CN23XX_MAX_VFS_PER_PF_PASS_1_1;
break;
default:
-   total_rings = CN23XX_MAX_RINGS_PER_PF;
+   max_rings = CN23XX_MAX_RINGS_PER_PF;
+   max_possible_vfs = CN23XX_MAX_VFS_PER_PF;
break;
}
-   if (!oct->sriov_info.num_pf_rings) {
-   if (total_rings > num_present_cpus())
-   num_pf_rings = num_present_

[PATCH net-next V3 7/9] liquidio CN23XX: code cleanup

2016-10-25 Thread Raghu Vatsavayi
Cleaned up unnecessary comments and added some minor macros.

Signed-off-by: Raghu Vatsavayi 
Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/cn66xx_device.c   | 13 -
 drivers/net/ethernet/cavium/liquidio/cn66xx_device.h   |  4 ++--
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c | 14 --
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 17 +
 drivers/net/ethernet/cavium/liquidio/liquidio_common.h |  2 --
 drivers/net/ethernet/cavium/liquidio/octeon_device.c   |  8 
 drivers/net/ethernet/cavium/liquidio/octeon_droq.c |  2 +-
 drivers/net/ethernet/cavium/liquidio/octeon_droq.h |  1 -
 drivers/net/ethernet/cavium/liquidio/octeon_iq.h   |  1 -
 drivers/net/ethernet/cavium/liquidio/octeon_main.h | 18 --
 drivers/net/ethernet/cavium/liquidio/request_manager.c |  7 ++-
 .../net/ethernet/cavium/liquidio/response_manager.c|  6 +-
 .../net/ethernet/cavium/liquidio/response_manager.h|  1 -
 13 files changed, 23 insertions(+), 71 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn66xx_device.c 
b/drivers/net/ethernet/cavium/liquidio/cn66xx_device.c
index e779af8..1ebc225 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn66xx_device.c
+++ b/drivers/net/ethernet/cavium/liquidio/cn66xx_device.c
@@ -275,7 +275,6 @@ void lio_cn6xxx_setup_iq_regs(struct octeon_device *oct, 
u32 iq_no)
 {
struct octeon_instr_queue *iq = oct->instr_queue[iq_no];
 
-   /* Disable Packet-by-Packet mode; No Parse Mode or Skip length */
octeon_write_csr64(oct, CN6XXX_SLI_IQ_PKT_INSTR_HDR64(iq_no), 0);
 
/* Write the start of the input queue's ring and its size  */
@@ -378,7 +377,7 @@ void lio_cn6xxx_disable_io_queues(struct octeon_device *oct)
 
/* Reset the doorbell register for each Input queue. */
for (i = 0; i < MAX_OCTEON_INSTR_QUEUES(oct); i++) {
-   if (!(oct->io_qmask.iq & (1ULL << i)))
+   if (!(oct->io_qmask.iq & BIT_ULL(i)))
continue;
octeon_write_csr(oct, CN6XXX_SLI_IQ_DOORBELL(i), 0x);
d32 = octeon_read_csr(oct, CN6XXX_SLI_IQ_DOORBELL(i));
@@ -400,9 +399,8 @@ void lio_cn6xxx_disable_io_queues(struct octeon_device *oct)
;
 
/* Reset the doorbell register for each Output queue. */
-   /* for (i = 0; i < oct->num_oqs; i++) { */
for (i = 0; i < MAX_OCTEON_OUTPUT_QUEUES(oct); i++) {
-   if (!(oct->io_qmask.oq & (1ULL << i)))
+   if (!(oct->io_qmask.oq & BIT_ULL(i)))
continue;
octeon_write_csr(oct, CN6XXX_SLI_OQ_PKTS_CREDIT(i), 0x);
d32 = octeon_read_csr(oct, CN6XXX_SLI_OQ_PKTS_CREDIT(i));
@@ -537,15 +535,14 @@ static int lio_cn6xxx_process_droq_intr_regs(struct 
octeon_device *oct)
 
oct->droq_intr = 0;
 
-   /* for (oq_no = 0; oq_no < oct->num_oqs; oq_no++) { */
for (oq_no = 0; oq_no < MAX_OCTEON_OUTPUT_QUEUES(oct); oq_no++) {
-   if (!(droq_mask & (1ULL << oq_no)))
+   if (!(droq_mask & BIT_ULL(oq_no)))
continue;
 
droq = oct->droq[oq_no];
pkt_count = octeon_droq_check_hw_for_pkts(droq);
if (pkt_count) {
-   oct->droq_intr |= (1ULL << oq_no);
+   oct->droq_intr |= BIT_ULL(oq_no);
if (droq->ops.poll_mode) {
u32 value;
u32 reg;
@@ -721,8 +718,6 @@ int lio_setup_cn66xx_octeon_device(struct octeon_device 
*oct)
 int lio_validate_cn6xxx_config_info(struct octeon_device *oct,
struct octeon_config *conf6xxx)
 {
-   /* int total_instrs = 0; */
-
if (CFG_GET_IQ_MAX_Q(conf6xxx) > CN6XXX_MAX_INPUT_QUEUES) {
dev_err(&oct->pci_dev->dev, "%s: Num IQ (%d) exceeds Max 
(%d)\n",
__func__, CFG_GET_IQ_MAX_Q(conf6xxx),
diff --git a/drivers/net/ethernet/cavium/liquidio/cn66xx_device.h 
b/drivers/net/ethernet/cavium/liquidio/cn66xx_device.h
index a40a913..32fbbb2 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn66xx_device.h
+++ b/drivers/net/ethernet/cavium/liquidio/cn66xx_device.h
@@ -96,8 +96,8 @@ void lio_cn6xxx_setup_reg_address(struct octeon_device *oct, 
void *chip,
  struct octeon_reg_list *reg_list);
 u32 lio_cn6xxx_coprocessor_clock(struct octeon_device *oct);
 u32 lio_cn6xxx_get_oq_ticks(struct octeon_device *oct, u32 time_intr_in_us);
-int lio_setup_cn66xx_octeon_device(struct octeon_device *);
+int lio_setup_cn66xx_octeon_device(struct octeon_device *oct);
 int lio_validate_cn6xxx_config_info(struct octeon_device *oct,
-   struct octeon_config *);
+   struct octeon_co

[PATCH net-next V3 3/9] liquidio CN23XX: Mailbox support

2016-10-25 Thread Raghu Vatsavayi
Adds support for mailbox communication between PF and VF.

Signed-off-by: Raghu Vatsavayi 
Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/Makefile  |   1 +
 drivers/net/ethernet/cavium/liquidio/lio_core.c|  32 ++
 .../net/ethernet/cavium/liquidio/liquidio_common.h |   6 +-
 .../net/ethernet/cavium/liquidio/octeon_device.h   |   4 +
 .../net/ethernet/cavium/liquidio/octeon_mailbox.c  | 321 +
 .../net/ethernet/cavium/liquidio/octeon_mailbox.h  | 115 
 drivers/net/ethernet/cavium/liquidio/octeon_main.h |   2 +-
 7 files changed, 478 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
 create mode 100644 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.h

diff --git a/drivers/net/ethernet/cavium/liquidio/Makefile 
b/drivers/net/ethernet/cavium/liquidio/Makefile
index 5a27b2a..14958de 100644
--- a/drivers/net/ethernet/cavium/liquidio/Makefile
+++ b/drivers/net/ethernet/cavium/liquidio/Makefile
@@ -11,6 +11,7 @@ liquidio-$(CONFIG_LIQUIDIO) += lio_ethtool.o \
cn66xx_device.o\
cn68xx_device.o\
cn23xx_pf_device.o \
+   octeon_mailbox.o   \
octeon_mem_ops.o   \
octeon_droq.o  \
octeon_nic.o
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c 
b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index 201eddb..e6026df 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -264,3 +264,35 @@ void liquidio_link_ctrl_cmd_completion(void *nctrl_ptr)
nctrl->ncmd.s.cmd);
}
 }
+
+void octeon_pf_changed_vf_macaddr(struct octeon_device *oct, u8 *mac)
+{
+   bool macaddr_changed = false;
+   struct net_device *netdev;
+   struct lio *lio;
+
+   rtnl_lock();
+
+   netdev = oct->props[0].netdev;
+   lio = GET_LIO(netdev);
+
+   lio->linfo.macaddr_is_admin_asgnd = true;
+
+   if (!ether_addr_equal(netdev->dev_addr, mac)) {
+   macaddr_changed = true;
+   ether_addr_copy(netdev->dev_addr, mac);
+   ether_addr_copy(((u8 *)&lio->linfo.hw_addr) + 2, mac);
+   call_netdevice_notifiers(NETDEV_CHANGEADDR, netdev);
+   }
+
+   rtnl_unlock();
+
+   if (macaddr_changed)
+   dev_info(&oct->pci_dev->dev,
+"PF changed VF's MAC address to 
%02hhx:%02hhx:%02hhx:%02hhx:%02hhx:%02hhx\n",
+mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
+
+   /* no need to notify the firmware of the macaddr change because
+* the PF did that already
+*/
+}
diff --git a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h 
b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
index 0d990ac..caeff9a 100644
--- a/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
+++ b/drivers/net/ethernet/cavium/liquidio/liquidio_common.h
@@ -731,13 +731,15 @@ struct oct_link_info {
 
 #ifdef __BIG_ENDIAN_BITFIELD
u64 gmxport:16;
-   u64 rsvd:32;
+   u64 macaddr_is_admin_asgnd:1;
+   u64 rsvd:31;
u64 num_txpciq:8;
u64 num_rxpciq:8;
 #else
u64 num_rxpciq:8;
u64 num_txpciq:8;
-   u64 rsvd:32;
+   u64 rsvd:31;
+   u64 macaddr_is_admin_asgnd:1;
u64 gmxport:16;
 #endif
 
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_device.h 
b/drivers/net/ethernet/cavium/liquidio/octeon_device.h
index cfd12ec..77a6eb7 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_device.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_device.h
@@ -492,6 +492,9 @@ struct octeon_device {
 
int msix_on;
 
+   /** Mail Box details of each octeon queue. */
+   struct octeon_mbox  *mbox[MAX_POSSIBLE_VFS];
+
/** IOq information of it's corresponding MSI-X interrupt. */
struct octeon_ioq_vector*ioq_vector;
 
@@ -511,6 +514,7 @@ struct octeon_device {
 #define  OCTEON_CN6XXX(oct)   ((oct->chip_id == OCTEON_CN66XX) || \
   (oct->chip_id == OCTEON_CN68XX))
 #define  OCTEON_CN23XX_PF(oct)(oct->chip_id == OCTEON_CN23XX_PF_VID)
+#define  OCTEON_CN23XX_VF(oct)((oct)->chip_id == OCTEON_CN23XX_VF_VID)
 #define CHIP_FIELD(oct, TYPE, field) \
(((struct octeon_ ## TYPE  *)(oct->chip))->field)
 
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c 
b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
new file mode 100644
index 000..3a2f6c1
--- /dev/null
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
@@ -0,0 +1,321 @@
+/**
+ * Author: Cavium, Inc.
+ *
+ * Contact: supp...@cavium.com
+ *  Please includ

[PATCH net-next V3 2/9] liquidio CN23XX: sysfs VF config support

2016-10-25 Thread Raghu Vatsavayi
Adds sysfs based support for enabling or disabling VFs.
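With the standard PCI SR-IOV sysfs interface this is driven by writing the
desired VF count to the device's sriov_numvfs attribute, which invokes the
liquidio_enable_sriov() callback registered below.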

Signed-off-by: Raghu Vatsavayi 
Signed-off-by: Derek Chickles 
Signed-off-by: Satanand Burla 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 98 ++
 .../net/ethernet/cavium/liquidio/octeon_config.h   |  3 +
 .../net/ethernet/cavium/liquidio/octeon_device.h   |  8 ++
 3 files changed, 109 insertions(+)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index d25746f..51ed875 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -194,6 +194,8 @@ struct octeon_device_priv {
unsigned long napi_mask;
 };
 
+static int liquidio_enable_sriov(struct pci_dev *dev, int num_vfs);
+
 static int octeon_device_init(struct octeon_device *);
 static int liquidio_stop(struct net_device *netdev);
 static void liquidio_remove(struct pci_dev *pdev);
@@ -532,6 +534,7 @@ static int liquidio_resume(struct pci_dev *pdev 
__attribute__((unused)))
.suspend= liquidio_suspend,
.resume = liquidio_resume,
 #endif
+   .sriov_configure = liquidio_enable_sriov,
 };
 
 /**
@@ -1486,6 +1489,8 @@ static void octeon_destroy_resources(struct octeon_device 
*oct)
continue;
octeon_delete_instr_queue(oct, i);
}
+   if (oct->sriov_info.sriov_enabled)
+   pci_disable_sriov(oct->pci_dev);
/* fallthrough */
case OCT_DEV_SC_BUFF_POOL_INIT_DONE:
octeon_free_sc_buffer_pool(oct);
@@ -4013,6 +4018,99 @@ static int setup_nic_devices(struct octeon_device 
*octeon_dev)
return -ENODEV;
 }
 
+static int octeon_enable_sriov(struct octeon_device *oct)
+{
+   unsigned int num_vfs_alloced = oct->sriov_info.num_vfs_alloced;
+   struct pci_dev *vfdev;
+   int err;
+   u32 u;
+
+   if (OCTEON_CN23XX_PF(oct) && num_vfs_alloced) {
+   err = pci_enable_sriov(oct->pci_dev,
+  oct->sriov_info.num_vfs_alloced);
+   if (err) {
+   dev_err(&oct->pci_dev->dev,
+   "OCTEON: Failed to enable PCI sriov: %d\n",
+   err);
+   oct->sriov_info.num_vfs_alloced = 0;
+   return err;
+   }
+   oct->sriov_info.sriov_enabled = 1;
+
+   /* init lookup table that maps DPI ring number to VF pci_dev
+* struct pointer
+*/
+   u = 0;
+   vfdev = pci_get_device(PCI_VENDOR_ID_CAVIUM,
+  OCTEON_CN23XX_VF_VID, NULL);
+   while (vfdev) {
+   if (vfdev->is_virtfn &&
+   (vfdev->physfn == oct->pci_dev)) {
+   oct->sriov_info.dpiring_to_vfpcidev_lut[u] =
+   vfdev;
+   u += oct->sriov_info.rings_per_vf;
+   }
+   vfdev = pci_get_device(PCI_VENDOR_ID_CAVIUM,
+  OCTEON_CN23XX_VF_VID, vfdev);
+   }
+   }
+
+   return num_vfs_alloced;
+}
+
+static int lio_pci_sriov_disable(struct octeon_device *oct)
+{
+   int u;
+
+   if (pci_vfs_assigned(oct->pci_dev)) {
+   dev_err(&oct->pci_dev->dev, "VFs are still assigned to VMs.\n");
+   return -EPERM;
+   }
+
+   pci_disable_sriov(oct->pci_dev);
+
+   u = 0;
+   while (u < MAX_POSSIBLE_VFS) {
+   oct->sriov_info.dpiring_to_vfpcidev_lut[u] = NULL;
+   u += oct->sriov_info.rings_per_vf;
+   }
+
+   oct->sriov_info.num_vfs_alloced = 0;
+   dev_info(&oct->pci_dev->dev, "oct->pf_num:%d disabled VFs\n",
+oct->pf_num);
+
+   return 0;
+}
+
+static int liquidio_enable_sriov(struct pci_dev *dev, int num_vfs)
+{
+   struct octeon_device *oct = pci_get_drvdata(dev);
+   int ret = 0;
+
+   if ((num_vfs == oct->sriov_info.num_vfs_alloced) &&
+   (oct->sriov_info.sriov_enabled)) {
+   dev_info(&oct->pci_dev->dev, "oct->pf_num:%d already enabled 
num_vfs:%d\n",
+oct->pf_num, num_vfs);
+   return 0;
+   }
+
+   if (!num_vfs) {
+   ret = lio_pci_sriov_disable(oct);
+   } else if (num_vfs > oct->sriov_info.max_vfs) {
+   dev_err(&oct->pci_dev->dev,
+   "OCTEON: Max allowed VFs:%d user requested:%d",
+   oct->sriov_info.max_vfs, num_vfs);
+   ret = -EPERM;
+   } else {
+   oct->sriov_info.num_vfs_alloced = num_vfs;
+   ret = octeon_enable_sriov(oct);
+   dev_info(&oct->pci_dev->dev, "oct->pf_n

Re: [PATCH net] packet: on direct_xmit, limit tso and csum to supported devices

2016-10-25 Thread Eric Dumazet
On Tue, 2016-10-25 at 20:28 -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> When transmitting on a packet socket with PACKET_VNET_HDR and
> PACKET_QDISC_BYPASS, validate device support for features requested
> in vnet_hdr.


You probably need to add an EXPORT_SYMBOL(validate_xmit_skb_list)
because af_packet might be modular.

Sorry for not catching this earlier.
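
For reference, a minimal sketch of that change (assuming the definition of
validate_xmit_skb_list() stays in net/core/dev.c; whether EXPORT_SYMBOL or
EXPORT_SYMBOL_GPL is preferred is a maintainer call):

    /* net/core/dev.c, directly after the validate_xmit_skb_list() definition */
    EXPORT_SYMBOL(validate_xmit_skb_list);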




[PATCH net] packet: on direct_xmit, limit tso and csum to supported devices

2016-10-25 Thread Willem de Bruijn
From: Willem de Bruijn 

When transmitting on a packet socket with PACKET_VNET_HDR and
PACKET_QDISC_BYPASS, validate device support for features requested
in vnet_hdr.

Drop TSO packets sent to devices that do not support TSO or have the
feature disabled. Note that the latter currently do process those
packets correctly, regardless of not advertising the feature.

Because of SKB_GSO_DODGY, it is not sufficient to test device features
with netif_needs_gso. Full validate_xmit_skb is needed.

Switch to software checksum for non-TSO packets that request checksum
offload if that device feature is unsupported or disabled. Note that
similar to the TSO case, device drivers may perform checksum offload
correctly even when not advertising it.

When switching to software checksum, packets hit skb_checksum_help,
which has two BUG_ONs that trigger if the checksum fields are not in
the linear segment. Packet sockets always allocate at least up to
csum_start + csum_off + 2 as linear data.

Tested by running github.com/wdebruij/kerneltools/psock_txring_vnet.c

  ethtool -K eth0 tso off tx on
  psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v
  psock_txring_vnet -d $dst -s $src -i eth0 -l 2000 -n 1 -q -v -N

  ethtool -K eth0 tx off
  psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G
  psock_txring_vnet -d $dst -s $src -i eth0 -l 1000 -n 1 -q -v -G -N

Fixes: d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket option")
Signed-off-by: Willem de Bruijn 
---
 net/packet/af_packet.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 11db0d6..d2238b2 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -250,7 +250,7 @@ static void __fanout_link(struct sock *sk, struct 
packet_sock *po);
 static int packet_direct_xmit(struct sk_buff *skb)
 {
struct net_device *dev = skb->dev;
-   netdev_features_t features;
+   struct sk_buff *orig_skb = skb;
struct netdev_queue *txq;
int ret = NETDEV_TX_BUSY;
 
@@ -258,9 +258,8 @@ static int packet_direct_xmit(struct sk_buff *skb)
 !netif_carrier_ok(dev)))
goto drop;
 
-   features = netif_skb_features(skb);
-   if (skb_needs_linearize(skb, features) &&
-   __skb_linearize(skb))
+   skb = validate_xmit_skb_list(skb, dev);
+   if (skb != orig_skb)
goto drop;
 
txq = skb_get_tx_queue(dev, skb);
@@ -280,7 +279,7 @@ static int packet_direct_xmit(struct sk_buff *skb)
return ret;
 drop:
atomic_long_inc(&dev->tx_dropped);
-   kfree_skb(skb);
+   kfree_skb_list(skb);
return NET_XMIT_DROP;
 }
 
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH] net: Reset skb to network header in neigh_hh_output

2016-10-25 Thread Eric Dumazet
On Wed, 2016-10-26 at 01:57 +0200, Abdelrhman Ahmed wrote:
>  > What is the issue you want to fix exactly ? 
>  > Please describe the use case. 
> 
> When netfilter hook uses skb_push to add a specific header between network
> header and hardware header.
> For the first time(s) before caching hardware header, this header will be
> removed / overwritten by hardware header due to resetting to network header.
> After using the cached hardware header, this header will be kept as we do not
> reset. I think this behavior is inconsistent, so we need to reset in both 
> cases.
> 
>  > Otherwise, your fix is in fact adding a critical bug. 
> 
> Could you explain more as it's not clear to me?
> 

Maybe my wording was not good here.

What I intended to say is that the 
__skb_pull(skb, skb_network_offset(skb)) might not be at the right
place.

Look at commit e1f165032c8bade3a6bdf546f8faf61fda4dd01c to find the
reason.


> 
> 
>   On Fri, 07 Oct 2016 23:10:56 +0200 Eric Dumazet 
>  wrote  
>  > On Fri, 2016-10-07 at 16:14 +0200, Abdelrhman Ahmed wrote: 
>  > > When hardware header is added without using cached one, 
> neigh_resolve_output 
>  > > and neigh_connected_output reset skb to network header before adding it. 
>  > > When cached one is used, neigh_hh_output does not reset the skb to 
> network 
>  > > header. 
>  > >  
>  > > The fix is to reset skb to network header before adding cached hardware 
> header 
>  > > to keep the behavior consistent in all cases. 
>  >  
>  > What is the issue you want to fix exactly ? 
>  >  
>  > Please describe the use case. 
>  >  
>  > I highly suggest you take a look at commit 
>  >  
>  > e1f165032c8bade3a6bdf546f8faf61fda4dd01c 
>  > ("net: Fix skb_under_panic oops in neigh_resolve_output") 
>  >  
>  > Otherwise, your fix is in fact adding a critical bug. 
>  >  
>  >  
>  > 
> 




[PATCH net-next] ibmveth: v1 calculate correct gso_size and set gso_type

2016-10-25 Thread Jon Maxwell
We recently encountered a bug where a few customers using ibmveth on the
same LPAR hit an issue where a TCP session hung when large receive was
enabled. Closer analysis revealed that the session was stuck because one
side was repeatedly advertising a zero window.

We narrowed this down to the fact that the ibmveth driver did not set gso_size,
which is translated by TCP into the MSS further up the stack. The MSS is
used to calculate the TCP window size, and as it was abnormally large, a
zero window was being calculated even though the socket's receive buffer
was completely empty.
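
To put numbers on it (taken from the patch below, with the default 1500-byte
MTU): for an IPv4/TCP frame the driver now reports
gso_size = 1500 - (14 + 20 + 20) = 1446, so the stack derives a sane MSS from
the aggregated frame instead of treating its full length as the MSS, and the
advertised window no longer collapses to zero.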

We were able to reproduce this and worked with IBM to fix this. Thanks Tom 
and Marcelo for all your help and review on this.

The patch fixes both our internal reproduction tests and our customers' tests.

Signed-off-by: Jon Maxwell 
---
 drivers/net/ethernet/ibm/ibmveth.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c 
b/drivers/net/ethernet/ibm/ibmveth.c
index 29c05d0..c51717e 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int 
budget)
int frames_processed = 0;
unsigned long lpar_rc;
struct iphdr *iph;
+   bool large_packet = 0;
+   u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
 
 restart_poll:
while (frames_processed < budget) {
@@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int 
budget)
iph->check = 0;
iph->check = 
ip_fast_csum((unsigned char *)iph, iph->ihl);
adapter->rx_large_packets++;
+   large_packet = 1;
}
}
}
 
+   if (skb->len > netdev->mtu) {
+   iph = (struct iphdr *)skb->data;
+   if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
+   iph->protocol == IPPROTO_TCP) {
+   hdr_len += sizeof(struct iphdr);
+   skb_shinfo(skb)->gso_type = 
SKB_GSO_TCPV4;
+   skb_shinfo(skb)->gso_size = netdev->mtu 
- hdr_len;
+   } else if (be16_to_cpu(skb->protocol) == 
ETH_P_IPV6 &&
+  iph->protocol == IPPROTO_TCP) {
+   hdr_len += sizeof(struct ipv6hdr);
+   skb_shinfo(skb)->gso_type = 
SKB_GSO_TCPV6;
+   skb_shinfo(skb)->gso_size = netdev->mtu 
- hdr_len;
+   }
+   if (!large_packet)
+   adapter->rx_large_packets++;
+   }
+
napi_gro_receive(napi, skb);/* send it up */
 
netdev->stats.rx_packets++;
-- 
1.8.3.1



Re: [PATCH] net: Reset skb to network header in neigh_hh_output

2016-10-25 Thread Abdelrhman Ahmed
 > What is the issue you want to fix exactly ? 
 > Please describe the use case. 

The use case is a netfilter hook that uses skb_push to add a specific header
between the network header and the hardware header.
The first time(s), before the hardware header is cached, this header will be
removed / overwritten by the hardware header because the skb is reset to the
network header. Once the cached hardware header is used, this header will be
kept as we do not reset. I think this behavior is inconsistent, so we need to
reset in both cases.
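
For illustration, a minimal sketch of such a hook (the shim format, names and
hook point are assumptions for the example, not taken from the real setup):

    #include <linux/netfilter.h>
    #include <linux/netfilter_ipv4.h>
    #include <linux/skbuff.h>

    struct shim_hdr {
            __be16 id;
            __be16 flags;
    };

    static unsigned int shim_out(void *priv, struct sk_buff *skb,
                                 const struct nf_hook_state *state)
    {
            struct shim_hdr *shim;

            if (skb_cow_head(skb, sizeof(*shim)))
                    return NF_DROP;

            shim = (struct shim_hdr *)skb_push(skb, sizeof(*shim));
            shim->id = cpu_to_be16(1);
            shim->flags = 0;

            /* neigh output later prepends the hardware header; whether the
             * shim survives depends on whether the cached or uncached path
             * resets the skb to the network header -- the inconsistency
             * described above.
             */
            return NF_ACCEPT;
    }

    /* registered from module init with nf_register_net_hook(&init_net, &shim_ops) */
    static struct nf_hook_ops shim_ops = {
            .hook     = shim_out,
            .pf       = NFPROTO_IPV4,
            .hooknum  = NF_INET_POST_ROUTING,
            .priority = NF_IP_PRI_LAST,
    };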

 > Otherwise, your fix is in fact adding a critical bug. 

Could you explain more as it's not clear to me?



  On Fri, 07 Oct 2016 23:10:56 +0200 Eric Dumazet  
wrote  
 > On Fri, 2016-10-07 at 16:14 +0200, Abdelrhman Ahmed wrote: 
 > > When hardware header is added without using cached one, 
 > > neigh_resolve_output 
 > > and neigh_connected_output reset skb to network header before adding it. 
 > > When cached one is used, neigh_hh_output does not reset the skb to network 
 > > header. 
 > >  
 > > The fix is to reset skb to network header before adding cached hardware 
 > > header 
 > > to keep the behavior consistent in all cases. 
 >  
 > What is the issue you want to fix exactly ? 
 >  
 > Please describe the use case. 
 >  
 > I highly suggest you take a look at commit 
 >  
 > e1f165032c8bade3a6bdf546f8faf61fda4dd01c 
 > ("net: Fix skb_under_panic oops in neigh_resolve_output") 
 >  
 > Otherwise, your fix is in fact adding a critical bug. 
 >  
 >  
 > 



Re: [PATCH net-next 2/3] bpf: Add new cgroups prog type to enable sock modifications

2016-10-25 Thread Eric Dumazet
On Tue, 2016-10-25 at 15:30 -0700, David Ahern wrote:
> Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
> BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
> any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
> Currently only sk_bound_dev_if is exported to userspace for modification
> by a bpf program.
> 
> This allows a cgroup to be configured such that AF_INET{6} sockets opened
> by processes are automatically bound to a specific device. In turn, this
> enables the running of programs that do not support SO_BINDTODEVICE in a
> specific VRF context / L3 domain.

Does this mean that these programs no longer can use loopback ?




Re: [PATCH net-next 2/3] bpf: Add new cgroups prog type to enable sock modifications

2016-10-25 Thread Daniel Borkmann

On 10/26/2016 12:30 AM, David Ahern wrote:

Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
Currently only sk_bound_dev_if is exported to userspace for modification
by a bpf program.

This allows a cgroup to be configured such that AF_INET{6} sockets opened
by processes are automatically bound to a specific device. In turn, this
enables the running of programs that do not support SO_BINDTODEVICE in a
specific VRF context / L3 domain.

Signed-off-by: David Ahern 

[...]

@@ -524,6 +535,10 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
  };

+struct bpf_sock {
+   __u32 bound_dev_if;
+};
+
  /* User return codes for XDP prog type.
   * A valid XDP program must return one of these defined values. All other
   * return codes are reserved for future use. Unknown return codes will result

[...]

diff --git a/net/core/filter.c b/net/core/filter.c
index 4552b8c93b99..775802881b01 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2482,6 +2482,27 @@ static const struct bpf_func_proto 
bpf_xdp_event_output_proto = {
.arg5_type  = ARG_CONST_STACK_SIZE,
  };

+BPF_CALL_3(bpf_sock_store_u32, struct sock *, sk, u32, offset, u32, val)
+{
+   u8 *ptr = (u8 *)sk;
+
+   if (unlikely(offset > sizeof(*sk)))
+   return -EFAULT;
+
+   *((u32 *)ptr) = val;
+
+   return 0;
+}


Seems strange to me. So, this helper allows to overwrite arbitrary memory
of a struct sock instance. Potentially we could crash the kernel.

And in your sock_filter_convert_ctx_access(), you already implement inline
read/write for the context ...

Your demo code does in pseudocode:

  r1 = sk
  r2 = offsetof(struct bpf_sock, bound_dev_if)
  r3 = idx
  r1->sk_bound_dev_if = idx
  sock_store_u32(r1, r2, r3) // updates sk_bound_dev_if again to idx
  return 1

Dropping that helper from the patch, the only thing a program can do here
is to read/write the sk_bound_dev_if helper per cgroup. Hmm ... dunno. So
this really has to be for cgroups v2, right?


Re: [PATCH net-next 1/3] bpf: Refactor cgroups code in prep for new type

2016-10-25 Thread David Ahern
On 10/25/16 5:01 PM, Daniel Borkmann wrote:
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index a0ab43f264b0..918c01a6f129 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -117,6 +117,19 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
>>   }
>>   }
>>
>> +static int __cgroup_bpf_run_filter_skb(struct sk_buff *skb,
>> +   struct bpf_prog *prog)
>> +{
>> +unsigned int offset = skb->data - skb_network_header(skb);
>> +int ret;
>> +
>> +__skb_push(skb, offset);
>> +ret = bpf_prog_run_clear_cb(prog, skb) == 1 ? 0 : -EPERM;
> 
> Original code save skb->cb[], this one clears it.
> 

ah, it changed in Daniel's v6 to v7 code and I missed it. Will fix. Thanks for 
pointing it out.



[PATCH] uapi: Fix userspace compilation of ip_tables.h/ip6_tables.h in C++ mode

2016-10-25 Thread Jason Gunthorpe
The implicit cast from void * is not allowed for C++ compilers, and the
arithmetic on void * generates warnings if a C++ application tries to include
these UAPI headers.

$ g++ -c t.cc
ip_tables.h:221:24: warning: pointer of type 'void *' used in arithmetic
ip_tables.h:221:24: error: invalid conversion from 'void*' to 'xt_entry_target*'
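
For reference, a t.cc along these lines is enough to reproduce it (the file
contents are an assumption; only the compiler invocation above comes from the
actual report):

    /* t.cc - merely including the UAPI header from C++ triggers the error,
     * because g++ fully checks the body of the static inline ipt_get_target()
     * defined in the header even when it is never called. */
    #include <linux/netfilter_ipv4/ip_tables.h>

    int main(void)
    {
            return 0;
    }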

Signed-off-by: Jason Gunthorpe 
---
 include/uapi/linux/netfilter_ipv4/ip_tables.h  | 2 +-
 include/uapi/linux/netfilter_ipv6/ip6_tables.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/netfilter_ipv4/ip_tables.h 
b/include/uapi/linux/netfilter_ipv4/ip_tables.h
index d0da53d96d93..4682b18f3f44 100644
--- a/include/uapi/linux/netfilter_ipv4/ip_tables.h
+++ b/include/uapi/linux/netfilter_ipv4/ip_tables.h
@@ -221,7 +221,7 @@ struct ipt_get_entries {
 static __inline__ struct xt_entry_target *
 ipt_get_target(struct ipt_entry *e)
 {
-   return (void *)e + e->target_offset;
+   return (struct xt_entry_target *)((__u8 *)e + e->target_offset);
 }
 
 /*
diff --git a/include/uapi/linux/netfilter_ipv6/ip6_tables.h 
b/include/uapi/linux/netfilter_ipv6/ip6_tables.h
index d1b22653daf2..05e0631a6d12 100644
--- a/include/uapi/linux/netfilter_ipv6/ip6_tables.h
+++ b/include/uapi/linux/netfilter_ipv6/ip6_tables.h
@@ -261,7 +261,7 @@ struct ip6t_get_entries {
 static __inline__ struct xt_entry_target *
 ip6t_get_target(struct ip6t_entry *e)
 {
-   return (void *)e + e->target_offset;
+   return (struct xt_entry_target *)((__u8 *)e + e->target_offset);
 }
 
 /*
-- 
2.1.4



Re: [PATCH net-next 1/3] bpf: Refactor cgroups code in prep for new type

2016-10-25 Thread Daniel Borkmann

On 10/26/2016 12:30 AM, David Ahern wrote:

Code move only; no functional change intended.


Not quite, see below.


Signed-off-by: David Ahern 
---
  kernel/bpf/cgroup.c  | 27 ++-
  kernel/bpf/syscall.c | 28 +++-
  2 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index a0ab43f264b0..918c01a6f129 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -117,6 +117,19 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
}
  }

+static int __cgroup_bpf_run_filter_skb(struct sk_buff *skb,
+  struct bpf_prog *prog)
+{
+   unsigned int offset = skb->data - skb_network_header(skb);
+   int ret;
+
+   __skb_push(skb, offset);
+   ret = bpf_prog_run_clear_cb(prog, skb) == 1 ? 0 : -EPERM;


Original code saves skb->cb[], this one clears it.


+   __skb_pull(skb, offset);
+
+   return ret;
+}
+
  /**
   * __cgroup_bpf_run_filter() - Run a program for packet filtering
   * @sk: The socken sending or receiving traffic
@@ -153,11 +166,15 @@ int __cgroup_bpf_run_filter(struct sock *sk,

prog = rcu_dereference(cgrp->bpf.effective[type]);
if (prog) {
-   unsigned int offset = skb->data - skb_network_header(skb);
-
-   __skb_push(skb, offset);
-   ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
-   __skb_pull(skb, offset);
+   switch (type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   ret = __cgroup_bpf_run_filter_skb(skb, prog);
+   break;
+   /* make gcc happy else complains about missing enum value */
+   default:
+   return 0;
+   }
}


Re: [PATCH net] inet: Fix missing return value in inet6_hash

2016-10-25 Thread Soheil Hassas Yeganeh
On Tue, Oct 25, 2016 at 6:08 PM, Craig Gallek  wrote:
> From: Craig Gallek 
>
> As part of a series to implement faster SO_REUSEPORT lookups,
> commit 086c653f5862 ("sock: struct proto hash function may error")
> added return values to protocol hash functions and
> commit 496611d7b5ea ("inet: create IPv6-equivalent inet_hash function")
> implemented a new hash function for IPv6.  However, the latter does
> not respect the former's convention.
>
> This properly propagates the hash errors in the IPv6 case.
>
> Fixes: 496611d7b5ea ("inet: create IPv6-equivalent inet_hash function")
> Reported-by: Soheil Hassas Yeganeh 
> Signed-off-by: Craig Gallek 
Acked-by: Soheil Hassas Yeganeh 

> ---
>  net/ipv6/inet6_hashtables.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
> index 2fd0374a35b1..02761c9fe43e 100644
> --- a/net/ipv6/inet6_hashtables.c
> +++ b/net/ipv6/inet6_hashtables.c
> @@ -264,13 +264,15 @@ EXPORT_SYMBOL_GPL(inet6_hash_connect);
>
>  int inet6_hash(struct sock *sk)
>  {
> +   int err = 0;
> +
> if (sk->sk_state != TCP_CLOSE) {
> local_bh_disable();
> -   __inet_hash(sk, NULL, ipv6_rcv_saddr_equal);
> +   err = __inet_hash(sk, NULL, ipv6_rcv_saddr_equal);
> local_bh_enable();
> }
>
> -   return 0;
> +   return err;
>  }
>  EXPORT_SYMBOL_GPL(inet6_hash);

Thanks for the fix!

> --
> 2.8.0.rc3.226.g39d4020
>


[PATCH net] bpf: fix samples to add fake KBUILD_MODNAME

2016-10-25 Thread Daniel Borkmann
Some of the sample files are causing issues when they are loaded with tc
and cls_bpf, meaning tc bails out while trying to parse the resulting ELF
file as program/map/etc sections are not present, which can be easily
spotted with readelf(1).

Currently, BPF samples are including some of the kernel headers and mid
term we should change them to refrain from this, really. When dynamic
debugging is enabled, we bail out due to undeclared KBUILD_MODNAME, which
is easily overlooked in the build as clang spills this along with other
noisy warnings from various header includes, and llc still generates an
ELF file with mentioned characteristics. For just playing around with BPF
examples, this can be a bit of a hurdle to take.

Just add a fake KBUILD_MODNAME as a band-aid to fix the issue, same is
done in xdp*_kern samples already.

Fixes: 65d472fb007d ("samples/bpf: add 'pointer to packet' tests")
Fixes: 6afb1e28b859 ("samples/bpf: Add tunnel set/get tests.")
Fixes: a3f74617340b ("cgroup: bpf: Add an example to do cgroup checking in BPF")
Reported-by: Chandrasekar Kannan 
Signed-off-by: Daniel Borkmann 
---
 samples/bpf/parse_ldabs.c| 1 +
 samples/bpf/parse_simple.c   | 1 +
 samples/bpf/parse_varlen.c   | 1 +
 samples/bpf/tcbpf1_kern.c| 1 +
 samples/bpf/tcbpf2_kern.c| 1 +
 samples/bpf/test_cgrp2_tc_kern.c | 1 +
 6 files changed, 6 insertions(+)

diff --git a/samples/bpf/parse_ldabs.c b/samples/bpf/parse_ldabs.c
index d175501..6db6b21 100644
--- a/samples/bpf/parse_ldabs.c
+++ b/samples/bpf/parse_ldabs.c
@@ -4,6 +4,7 @@
  * modify it under the terms of version 2 of the GNU General Public
  * License as published by the Free Software Foundation.
  */
+#define KBUILD_MODNAME "foo"
 #include 
 #include 
 #include 
diff --git a/samples/bpf/parse_simple.c b/samples/bpf/parse_simple.c
index cf2511c..10af53d 100644
--- a/samples/bpf/parse_simple.c
+++ b/samples/bpf/parse_simple.c
@@ -4,6 +4,7 @@
  * modify it under the terms of version 2 of the GNU General Public
  * License as published by the Free Software Foundation.
  */
+#define KBUILD_MODNAME "foo"
 #include 
 #include 
 #include 
diff --git a/samples/bpf/parse_varlen.c b/samples/bpf/parse_varlen.c
index edab34d..95c1632 100644
--- a/samples/bpf/parse_varlen.c
+++ b/samples/bpf/parse_varlen.c
@@ -4,6 +4,7 @@
  * modify it under the terms of version 2 of the GNU General Public
  * License as published by the Free Software Foundation.
  */
+#define KBUILD_MODNAME "foo"
 #include 
 #include 
 #include 
diff --git a/samples/bpf/tcbpf1_kern.c b/samples/bpf/tcbpf1_kern.c
index fa051b3..274c884 100644
--- a/samples/bpf/tcbpf1_kern.c
+++ b/samples/bpf/tcbpf1_kern.c
@@ -1,3 +1,4 @@
+#define KBUILD_MODNAME "foo"
 #include 
 #include 
 #include 
diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index 3303bb8..9c823a6 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -5,6 +5,7 @@
  * modify it under the terms of version 2 of the GNU General Public
  * License as published by the Free Software Foundation.
  */
+#define KBUILD_MODNAME "foo"
 #include 
 #include 
 #include 
diff --git a/samples/bpf/test_cgrp2_tc_kern.c b/samples/bpf/test_cgrp2_tc_kern.c
index 10ff734..1547b36 100644
--- a/samples/bpf/test_cgrp2_tc_kern.c
+++ b/samples/bpf/test_cgrp2_tc_kern.c
@@ -4,6 +4,7 @@
  * modify it under the terms of version 2 of the GNU General Public
  * License as published by the Free Software Foundation.
  */
+#define KBUILD_MODNAME "foo"
 #include 
 #include 
 #include 
-- 
1.9.3



[PATCH net-next 2/3] bpf: Add new cgroups prog type to enable sock modifications

2016-10-25 Thread David Ahern
Add new cgroup based program type, BPF_PROG_TYPE_CGROUP_SOCK. Similar to
BPF_PROG_TYPE_CGROUP_SKB programs can be attached to a cgroup and run
any time a process in the cgroup opens an AF_INET or AF_INET6 socket.
Currently only sk_bound_dev_if is exported to userspace for modification
by a bpf program.

This allows a cgroup to be configured such that AF_INET{6} sockets opened
by processes are automatically bound to a specific device. In turn, this
enables the running of programs that do not support SO_BINDTODEVICE in a
specific VRF context / L3 domain.
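
As a rough illustration, a program for the new type could look like the sketch
below (the section name, SEC() macro and ifindex value are assumptions for the
example; the struct bpf_sock context and the "return 1 to allow" convention are
from this patch):

    #include <uapi/linux/bpf.h>

    #define SEC(NAME) __attribute__((section(NAME), used))

    SEC("cgroup/sock")
    int bind_new_socks(struct bpf_sock *sk)
    {
            /* bind every new AF_INET/AF_INET6 socket in the cgroup to a
             * given device, e.g. the VRF / L3 master device */
            sk->bound_dev_if = 3;
            return 1;       /* allow the socket to be created */
    }

    char _license[] SEC("license") = "GPL";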

Signed-off-by: David Ahern 
---
 include/linux/filter.h   |  2 +-
 include/uapi/linux/bpf.h | 15 
 kernel/bpf/cgroup.c  |  9 +
 kernel/bpf/syscall.c |  4 +++
 net/core/filter.c| 92 
 net/core/sock.c  |  7 
 6 files changed, 128 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1f09c521adfe..808e158742a2 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -408,7 +408,7 @@ struct bpf_prog {
enum bpf_prog_type  type;   /* Type of BPF program */
struct bpf_prog_aux *aux;   /* Auxiliary fields */
struct sock_fprog_kern  *orig_prog; /* Original BPF program */
-   unsigned int(*bpf_func)(const struct sk_buff *skb,
+   unsigned int(*bpf_func)(const void *ctx,
const struct bpf_insn *filter);
/* Instructions for interpreter */
union {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6b62ee9a2f78..ce5283f221e7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -99,11 +99,13 @@ enum bpf_prog_type {
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
BPF_PROG_TYPE_CGROUP_SKB,
+   BPF_PROG_TYPE_CGROUP_SOCK,
 };
 
 enum bpf_attach_type {
BPF_CGROUP_INET_INGRESS,
BPF_CGROUP_INET_EGRESS,
+   BPF_CGROUP_INET_SOCK_CREATE,
__MAX_BPF_ATTACH_TYPE
 };
 
@@ -449,6 +451,15 @@ enum bpf_func_id {
 */
BPF_FUNC_get_numa_node_id,
 
+   /**
+* sock_store_u32(sk, offset, val) - store bytes into sock
+* @sk: pointer to sock
+* @offset: offset within sock
+* @val: value to write
+* Return: 0 on success
+*/
+   BPF_FUNC_sock_store_u32,
+
__BPF_FUNC_MAX_ID,
 };
 
@@ -524,6 +535,10 @@ struct bpf_tunnel_key {
__u32 tunnel_label;
 };
 
+struct bpf_sock {
+   __u32 bound_dev_if;
+};
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 918c01a6f129..4fcb58013a3a 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -117,6 +117,12 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
}
 }
 
+static int __cgroup_bpf_run_filter_sk_create(struct sock *sk,
+struct bpf_prog *prog)
+{
+   return prog->bpf_func(sk, prog->insnsi) == 1 ? 0 : -EPERM;
+}
+
 static int __cgroup_bpf_run_filter_skb(struct sk_buff *skb,
   struct bpf_prog *prog)
 {
@@ -171,6 +177,9 @@ int __cgroup_bpf_run_filter(struct sock *sk,
case BPF_CGROUP_INET_EGRESS:
ret = __cgroup_bpf_run_filter_skb(skb, prog);
break;
+   case BPF_CGROUP_INET_SOCK_CREATE:
+   ret = __cgroup_bpf_run_filter_sk_create(sk, prog);
+   break;
/* make gcc happy else complains about missing enum value */
default:
return 0;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9abc88deabbc..3b7e30e28cd3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -844,6 +844,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
ptype = BPF_PROG_TYPE_CGROUP_SKB;
break;
 
+   case BPF_CGROUP_INET_SOCK_CREATE:
+   ptype = BPF_PROG_TYPE_CGROUP_SOCK;
+   break;
default:
return -EINVAL;
}
@@ -879,6 +882,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
switch (attr->attach_type) {
case BPF_CGROUP_INET_INGRESS:
case BPF_CGROUP_INET_EGRESS:
+   case BPF_CGROUP_INET_SOCK_CREATE:
cgrp = cgroup_get_from_fd(attr->target_fd);
if (IS_ERR(cgrp))
return PTR_ERR(cgrp);
diff --git a/net/core/filter.c b/net/core/filter.c
index 4552b8c93b99..775802881b01 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2482,6 +2482,27 @@ static const struct bpf_func_proto 
bpf_xdp_event_output_proto = {
.arg5_type  = ARG_CONST_STACK_SIZE,
 };
 
+

[PATCH net-next 1/3] bpf: Refactor cgroups code in prep for new type

2016-10-25 Thread David Ahern
Code move only; no functional change intended.

Signed-off-by: David Ahern 
---
 kernel/bpf/cgroup.c  | 27 ++-
 kernel/bpf/syscall.c | 28 +++-
 2 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index a0ab43f264b0..918c01a6f129 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -117,6 +117,19 @@ void __cgroup_bpf_update(struct cgroup *cgrp,
}
 }
 
+static int __cgroup_bpf_run_filter_skb(struct sk_buff *skb,
+  struct bpf_prog *prog)
+{
+   unsigned int offset = skb->data - skb_network_header(skb);
+   int ret;
+
+   __skb_push(skb, offset);
+   ret = bpf_prog_run_clear_cb(prog, skb) == 1 ? 0 : -EPERM;
+   __skb_pull(skb, offset);
+
+   return ret;
+}
+
 /**
  * __cgroup_bpf_run_filter() - Run a program for packet filtering
  * @sk: The socken sending or receiving traffic
@@ -153,11 +166,15 @@ int __cgroup_bpf_run_filter(struct sock *sk,
 
prog = rcu_dereference(cgrp->bpf.effective[type]);
if (prog) {
-   unsigned int offset = skb->data - skb_network_header(skb);
-
-   __skb_push(skb, offset);
-   ret = bpf_prog_run_save_cb(prog, skb) == 1 ? 0 : -EPERM;
-   __skb_pull(skb, offset);
+   switch (type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   ret = __cgroup_bpf_run_filter_skb(skb, prog);
+   break;
+   /* make gcc happy else complains about missing enum value */
+   default:
+   return 0;
+   }
}
 
rcu_read_unlock();
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1814c010ace6..9abc88deabbc 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -828,6 +828,7 @@ static int bpf_obj_get(const union bpf_attr *attr)
 
 static int bpf_prog_attach(const union bpf_attr *attr)
 {
+   enum bpf_prog_type ptype = BPF_PROG_TYPE_UNSPEC;
struct bpf_prog *prog;
struct cgroup *cgrp;
 
@@ -840,25 +841,26 @@ static int bpf_prog_attach(const union bpf_attr *attr)
switch (attr->attach_type) {
case BPF_CGROUP_INET_INGRESS:
case BPF_CGROUP_INET_EGRESS:
-   prog = bpf_prog_get_type(attr->attach_bpf_fd,
-BPF_PROG_TYPE_CGROUP_SKB);
-   if (IS_ERR(prog))
-   return PTR_ERR(prog);
-
-   cgrp = cgroup_get_from_fd(attr->target_fd);
-   if (IS_ERR(cgrp)) {
-   bpf_prog_put(prog);
-   return PTR_ERR(cgrp);
-   }
-
-   cgroup_bpf_update(cgrp, prog, attr->attach_type);
-   cgroup_put(cgrp);
+   ptype = BPF_PROG_TYPE_CGROUP_SKB;
break;
 
default:
return -EINVAL;
}
 
+   prog = bpf_prog_get_type(attr->attach_bpf_fd, ptype);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp)) {
+   bpf_prog_put(prog);
+   return PTR_ERR(cgrp);
+   }
+
+   cgroup_bpf_update(cgrp, prog, attr->attach_type);
+   cgroup_put(cgrp);
+
return 0;
 }
 
-- 
2.1.4



[PATCH net-next 3/3] samples: bpf: add userspace example for modifying sk_bound_dev_if

2016-10-25 Thread David Ahern
Add a simple program to demonstrate the ability to attach a bpf program
to a cgroup that sets sk_bound_dev_if for AF_INET{6} sockets when they
are created.

Signed-off-by: David Ahern 
---
 samples/bpf/Makefile  |  2 ++
 samples/bpf/bpf_helpers.h |  2 ++
 samples/bpf/test_cgrp2_sock.c | 84 +++
 3 files changed, 88 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_sock.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 2624d5d7ce8b..ec4ef37a2dbc 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += test_cgrp2_attach
+hostprogs-y += test_cgrp2_sock
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -48,6 +49,7 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
 test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
+test_cgrp2_sock-objs := libbpf.o test_cgrp2_sock.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 90f44bd2045e..7d95c9af3681 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -88,6 +88,8 @@ static int (*bpf_l4_csum_replace)(void *ctx, int off, int 
from, int to, int flag
(void *) BPF_FUNC_l4_csum_replace;
 static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
(void *) BPF_FUNC_skb_under_cgroup;
+static int (*bpf_sock_store_u32)(void *ctx, __u32 off, __u32 val) =
+   (void *) BPF_FUNC_sock_store_u32;
 
 #if defined(__x86_64__)
 
diff --git a/samples/bpf/test_cgrp2_sock.c b/samples/bpf/test_cgrp2_sock.c
new file mode 100644
index ..1fab10a08846
--- /dev/null
+++ b/samples/bpf/test_cgrp2_sock.c
@@ -0,0 +1,84 @@
+/* eBPF example program:
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program sets the sk_bound_dev_if index in new AF_INET{6}
+ *   sockets opened by processes in the cgroup.
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "libbpf.h"
+
+static int prog_load(int idx)
+{
+   struct bpf_insn prog[] = {
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+   BPF_MOV64_IMM(BPF_REG_3, idx),
+   BPF_MOV64_IMM(BPF_REG_2, offsetof(struct bpf_sock, 
bound_dev_if)),
+   BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_3, offsetof(struct 
bpf_sock, bound_dev_if)),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_sock_store_u32),
+   BPF_MOV64_IMM(BPF_REG_0, 1), /* r0 = verdict */
+   BPF_EXIT_INSN(),
+   };
+
+   return bpf_prog_load(BPF_PROG_TYPE_CGROUP_SOCK,
+prog, sizeof(prog), "GPL", 0);
+}
+
+static int usage(const char *argv0)
+{
+   printf("Usage: %s  device-index\n", argv0);
+   return EXIT_FAILURE;
+}
+
+int main(int argc, char **argv)
+{
+   int cg_fd, prog_fd, ret;
+   int idx = 0;
+
+   if (argc < 2)
+   return usage(argv[0]);
+
+   idx = atoi(argv[2]);
+   if (!idx) {
+   printf("Invalid device index\n");
+   return EXIT_FAILURE;
+   }
+
+   cg_fd = open(argv[1], O_DIRECTORY | O_RDONLY);
+   if (cg_fd < 0) {
+   printf("Failed to open cgroup path: '%s'\n", strerror(errno));
+   return EXIT_FAILURE;
+   }
+
+   prog_fd = prog_load(idx);
+   printf("Output from kernel verifier:\n%s\n---\n", bpf_log_buf);
+
+   if (prog_fd < 0) {
+   printf("Failed to load prog: '%s'\n", strerror(errno));
+   return EXIT_FAILURE;
+   }
+
+   ret = bpf_prog_detach(cg_fd, BPF_CGROUP_INET_SOCK_CREATE);
+   ret = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_SOCK_CREATE);
+   if (ret < 0) {
+   printf("Failed to attach prog to cgroup: '%s'\n",
+  strerror(errno));
+   return EXIT_FAILURE;
+   }
+
+   return EXIT_SUCCESS;
+}
-- 
2.1.4
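
For reference, a rough usage sketch for the sample (not part of the patch;
the mount point, cgroup name and ifindex below are made up):

  # mount -t cgroup2 none /tmp/cgroupv2
  # mkdir /tmp/cgroupv2/foo
  # test_cgrp2_sock /tmp/cgroupv2/foo 3
  # echo $$ > /tmp/cgroupv2/foo/cgroup.procs

Any AF_INET/AF_INET6 socket created by a process in the 'foo' cgroup after
this point has sk_bound_dev_if set to ifindex 3 at socket creation time.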



[PATCH net-next 0/3] Add bpf support to set sk_bound_dev_if

2016-10-25 Thread David Ahern
The recently added VRF support in Linux leverages the bind-to-device
API for programs to specify an L3 domain for a socket. While
SO_BINDTODEVICE has been around for ages, not every ipv4/ipv6 capable
program has support for it. Even for those programs that do support it,
the API requires processes to be started as root (CAP_NET_RAW) which
is not desirable from a general security perspective.

This patch set leverages Daniel Mack's work to attach bpf programs to
a cgroup:

https://www.mail-archive.com/netdev@vger.kernel.org/msg134028.html

to provide a capability to set sk_bound_dev_if for all AF_INET{6}
sockets opened by a process in a cgroup when the sockets are allocated.

This capability enables running any program in a VRF context and is key
to deploying Management VRF, a fundamental configuration for networking
gear, with any Linux OS installation.

David Ahern (3):
  bpf: Refactor cgroups code in prep for new type
  bpf: Add new cgroups prog type to enable sock modifications
  samples: bpf: add userspace example for modifying sk_bound_dev_if

 include/linux/filter.h|  2 +-
 include/uapi/linux/bpf.h  | 15 +++
 kernel/bpf/cgroup.c   | 36 ++---
 kernel/bpf/syscall.c  | 32 +--
 net/core/filter.c | 92 +++
 net/core/sock.c   |  7 
 samples/bpf/Makefile  |  2 +
 samples/bpf/bpf_helpers.h |  2 +
 samples/bpf/test_cgrp2_sock.c | 84 +++
 9 files changed, 253 insertions(+), 19 deletions(-)
 create mode 100644 samples/bpf/test_cgrp2_sock.c

-- 
2.1.4



[PATCH net] inet: Fix missing return value in inet6_hash

2016-10-25 Thread Craig Gallek
From: Craig Gallek 

As part of a series to implement faster SO_REUSEPORT lookups,
commit 086c653f5862 ("sock: struct proto hash function may error")
added return values to protocol hash functions and
commit 496611d7b5ea ("inet: create IPv6-equivalent inet_hash function")
implemented a new hash function for IPv6.  However, the latter does
not respect the former's convention.

This properly propagates the hash errors in the IPv6 case.

Fixes: 496611d7b5ea ("inet: create IPv6-equivalent inet_hash function")
Reported-by: Soheil Hassas Yeganeh 
Signed-off-by: Craig Gallek 
---
 net/ipv6/inet6_hashtables.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 2fd0374a35b1..02761c9fe43e 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -264,13 +264,15 @@ EXPORT_SYMBOL_GPL(inet6_hash_connect);
 
 int inet6_hash(struct sock *sk)
 {
+   int err = 0;
+
if (sk->sk_state != TCP_CLOSE) {
local_bh_disable();
-   __inet_hash(sk, NULL, ipv6_rcv_saddr_equal);
+   err = __inet_hash(sk, NULL, ipv6_rcv_saddr_equal);
local_bh_enable();
}
 
-   return 0;
+   return err;
 }
 EXPORT_SYMBOL_GPL(inet6_hash);
 
-- 
2.8.0.rc3.226.g39d4020



Re: [net-next PATCH 04/27] arch/arc: Add option to skip sync on DMA mapping

2016-10-25 Thread Vineet Gupta
On 10/25/2016 02:38 PM, Alexander Duyck wrote:
> This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
> avoid invoking cache line invalidation if the driver will just handle it
> later via a sync_for_cpu or sync_for_device call.
>
> Cc: Vineet Gupta 
> Cc: linux-snps-...@lists.infradead.org
> Signed-off-by: Alexander Duyck 
> ---
>  arch/arc/mm/dma.c |5 -

Acked-by: Vineet Gupta 

>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
> index 20afc65..6303c34 100644
> --- a/arch/arc/mm/dma.c
> +++ b/arch/arc/mm/dma.c
> @@ -133,7 +133,10 @@ static dma_addr_t arc_dma_map_page(struct device *dev, 
> struct page *page,
>   unsigned long attrs)
>  {
>   phys_addr_t paddr = page_to_phys(page) + offset;
> - _dma_cache_sync(paddr, size, dir);
> +
> + if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> + _dma_cache_sync(paddr, size, dir);
> +
>   return plat_phys_to_dma(dev, paddr);
>  }
>  
>
>



[net-next PATCH 05/27] arch/arm: Add option to skip sync on DMA map and unmap

2016-10-25 Thread Alexander Duyck
The use of DMA_ATTR_SKIP_CPU_SYNC was not consistent across all of the DMA
APIs in the arch/arm folder.  This change is meant to correct that so that
we get consistent behavior.

Cc: Russell King 
Signed-off-by: Alexander Duyck 
---
 arch/arm/common/dmabounce.c |   16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/arm/common/dmabounce.c b/arch/arm/common/dmabounce.c
index 3012816..75055df 100644
--- a/arch/arm/common/dmabounce.c
+++ b/arch/arm/common/dmabounce.c
@@ -243,7 +243,8 @@ static int needs_bounce(struct device *dev, dma_addr_t 
dma_addr, size_t size)
 }
 
 static inline dma_addr_t map_single(struct device *dev, void *ptr, size_t size,
-   enum dma_data_direction dir)
+   enum dma_data_direction dir,
+   unsigned long attrs)
 {
struct dmabounce_device_info *device_info = dev->archdata.dmabounce;
struct safe_buffer *buf;
@@ -262,7 +263,8 @@ static inline dma_addr_t map_single(struct device *dev, 
void *ptr, size_t size,
__func__, buf->ptr, virt_to_dma(dev, buf->ptr),
buf->safe, buf->safe_dma_addr);
 
-   if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL) {
+   if ((dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL) &&
+   !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
dev_dbg(dev, "%s: copy unsafe %p to safe %p, size %d\n",
__func__, ptr, buf->safe, size);
memcpy(buf->safe, ptr, size);
@@ -272,7 +274,8 @@ static inline dma_addr_t map_single(struct device *dev, 
void *ptr, size_t size,
 }
 
 static inline void unmap_single(struct device *dev, struct safe_buffer *buf,
-   size_t size, enum dma_data_direction dir)
+   size_t size, enum dma_data_direction dir,
+   unsigned long attrs)
 {
BUG_ON(buf->size != size);
BUG_ON(buf->direction != dir);
@@ -283,7 +286,8 @@ static inline void unmap_single(struct device *dev, struct 
safe_buffer *buf,
 
DO_STATS(dev->archdata.dmabounce->bounce_count++);
 
-   if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL) {
+   if ((dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL) &&
+   !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
void *ptr = buf->ptr;
 
dev_dbg(dev, "%s: copy back safe %p to unsafe %p size %d\n",
@@ -334,7 +338,7 @@ static dma_addr_t dmabounce_map_page(struct device *dev, 
struct page *page,
return DMA_ERROR_CODE;
}
 
-   return map_single(dev, page_address(page) + offset, size, dir);
+   return map_single(dev, page_address(page) + offset, size, dir, attrs);
 }
 
 /*
@@ -357,7 +361,7 @@ static void dmabounce_unmap_page(struct device *dev, 
dma_addr_t dma_addr, size_t
return;
}
 
-   unmap_single(dev, buf, size, dir);
+   unmap_single(dev, buf, size, dir, attrs);
 }
 
 static int __dmabounce_sync_for_cpu(struct device *dev, dma_addr_t addr,



[net-next PATCH 04/27] arch/arc: Add option to skip sync on DMA mapping

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
later via a sync_for_cpu or sync_for_device call.

Cc: Vineet Gupta 
Cc: linux-snps-...@lists.infradead.org
Signed-off-by: Alexander Duyck 
---
 arch/arc/mm/dma.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
index 20afc65..6303c34 100644
--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -133,7 +133,10 @@ static dma_addr_t arc_dma_map_page(struct device *dev, 
struct page *page,
unsigned long attrs)
 {
phys_addr_t paddr = page_to_phys(page) + offset;
-   _dma_cache_sync(paddr, size, dir);
+
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   _dma_cache_sync(paddr, size, dir);
+
return plat_phys_to_dma(dev, paddr);
 }
 



[net-next PATCH 07/27] arch/blackfin: Add option to skip sync on DMA map

2016-10-25 Thread Alexander Duyck
The use of DMA_ATTR_SKIP_CPU_SYNC was not consistent across all of the DMA
APIs in the arch/blackfin folder.  This change is meant to correct that so that
we get consistent behavior.

Cc: Steven Miao 
Signed-off-by: Alexander Duyck 
---
 arch/blackfin/kernel/dma-mapping.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/blackfin/kernel/dma-mapping.c 
b/arch/blackfin/kernel/dma-mapping.c
index 53fbbb6..a27a74a 100644
--- a/arch/blackfin/kernel/dma-mapping.c
+++ b/arch/blackfin/kernel/dma-mapping.c
@@ -118,6 +118,10 @@ static int bfin_dma_map_sg(struct device *dev, struct 
scatterlist *sg_list,
 
for_each_sg(sg_list, sg, nents, i) {
sg->dma_address = (dma_addr_t) sg_virt(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
__dma_sync(sg_dma_address(sg), sg_dma_len(sg), direction);
}
 
@@ -143,7 +147,9 @@ static dma_addr_t bfin_dma_map_page(struct device *dev, 
struct page *page,
 {
dma_addr_t handle = (dma_addr_t)(page_address(page) + offset);
 
-   _dma_sync(handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   _dma_sync(handle, size, dir);
+
return handle;
 }
 



[net-next PATCH 22/27] arch/xtensa: Add option to skip DMA sync as a part of mapping

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Max Filippov 
Signed-off-by: Alexander Duyck 
---
 arch/xtensa/kernel/pci-dma.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c
index 1e68806..6a16dec 100644
--- a/arch/xtensa/kernel/pci-dma.c
+++ b/arch/xtensa/kernel/pci-dma.c
@@ -189,7 +189,9 @@ static dma_addr_t xtensa_map_page(struct device *dev, 
struct page *page,
 {
dma_addr_t dma_handle = page_to_phys(page) + offset;
 
-   xtensa_sync_single_for_device(dev, dma_handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   xtensa_sync_single_for_device(dev, dma_handle, size, dir);
+
return dma_handle;
 }
 
@@ -197,7 +199,8 @@ static void xtensa_unmap_page(struct device *dev, 
dma_addr_t dma_handle,
  size_t size, enum dma_data_direction dir,
  unsigned long attrs)
 {
-   xtensa_sync_single_for_cpu(dev, dma_handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   xtensa_sync_single_for_cpu(dev, dma_handle, size, dir);
 }
 
 static int xtensa_map_sg(struct device *dev, struct scatterlist *sg,



[net-next PATCH 01/27] swiotlb: Drop unused function swiotlb_map_sg

2016-10-25 Thread Alexander Duyck
There are no users for swiotlb_map_sg so we might as well just drop it.

Acked-by: Konrad Rzeszutek Wilk 
Signed-off-by: Alexander Duyck 
---
 include/linux/swiotlb.h |4 
 lib/swiotlb.c   |8 
 2 files changed, 12 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 5f81f8a..e237b6f 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -72,10 +72,6 @@ extern void swiotlb_unmap_page(struct device *hwdev, 
dma_addr_t dev_addr,
   size_t size, enum dma_data_direction dir,
   unsigned long attrs);
 
-extern int
-swiotlb_map_sg(struct device *hwdev, struct scatterlist *sg, int nents,
-  enum dma_data_direction dir);
-
 extern void
 swiotlb_unmap_sg(struct device *hwdev, struct scatterlist *sg, int nents,
 enum dma_data_direction dir);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 22e13a0..47aad37 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -910,14 +910,6 @@ void swiotlb_unmap_page(struct device *hwdev, dma_addr_t 
dev_addr,
 }
 EXPORT_SYMBOL(swiotlb_map_sg_attrs);
 
-int
-swiotlb_map_sg(struct device *hwdev, struct scatterlist *sgl, int nelems,
-  enum dma_data_direction dir)
-{
-   return swiotlb_map_sg_attrs(hwdev, sgl, nelems, dir, 0);
-}
-EXPORT_SYMBOL(swiotlb_map_sg);
-
 /*
  * Unmap a set of streaming mode DMA translations.  Again, cpu read rules
  * concerning calls here are the same as for swiotlb_unmap_page() above.



[net-next PATCH 03/27] swiotlb: Add support for DMA_ATTR_SKIP_CPU_SYNC

2016-10-25 Thread Alexander Duyck
As a first step to making DMA_ATTR_SKIP_CPU_SYNC apply to architectures
beyond just ARM, I need to make the swiotlb respect the flag.  In order to
do that I also need to update swiotlb-xen, since it makes heavy use of this
functionality.

Cc: Konrad Rzeszutek Wilk 
Signed-off-by: Alexander Duyck 
---
 drivers/xen/swiotlb-xen.c |   11 +++---
 include/linux/swiotlb.h   |6 --
 lib/swiotlb.c |   48 +++--
 3 files changed, 40 insertions(+), 25 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index b8014bf..3d048af 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -405,7 +405,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct 
page *page,
 */
trace_swiotlb_bounced(dev, dev_addr, size, swiotlb_force);
 
-   map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
+   map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir,
+attrs);
if (map == SWIOTLB_MAP_ERROR)
return DMA_ERROR_CODE;
 
@@ -419,7 +420,8 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, struct 
page *page,
if (dma_capable(dev, dev_addr, size))
return dev_addr;
 
-   swiotlb_tbl_unmap_single(dev, map, size, dir);
+   swiotlb_tbl_unmap_single(dev, map, size, dir,
+attrs | DMA_ATTR_SKIP_CPU_SYNC);
 
return DMA_ERROR_CODE;
 }
@@ -445,7 +447,7 @@ static void xen_unmap_single(struct device *hwdev, 
dma_addr_t dev_addr,
 
/* NOTE: We use dev_addr here, not paddr! */
if (is_xen_swiotlb_buffer(dev_addr)) {
-   swiotlb_tbl_unmap_single(hwdev, paddr, size, dir);
+   swiotlb_tbl_unmap_single(hwdev, paddr, size, dir, attrs);
return;
}
 
@@ -558,11 +560,12 @@ void xen_swiotlb_unmap_page(struct device *hwdev, 
dma_addr_t dev_addr,
 start_dma_addr,
 sg_phys(sg),
 sg->length,
-dir);
+dir, attrs);
if (map == SWIOTLB_MAP_ERROR) {
dev_warn(hwdev, "swiotlb buffer is full\n");
/* Don't panic here, we expect map_sg users
   to do proper error handling. */
+   attrs |= DMA_ATTR_SKIP_CPU_SYNC;
xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
   attrs);
sg_dma_len(sgl) = 0;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e237b6f..4517be9 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -44,11 +44,13 @@ enum dma_sync_target {
 extern phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
  dma_addr_t tbl_dma_addr,
  phys_addr_t phys, size_t size,
- enum dma_data_direction dir);
+ enum dma_data_direction dir,
+ unsigned long attrs);
 
 extern void swiotlb_tbl_unmap_single(struct device *hwdev,
 phys_addr_t tlb_addr,
-size_t size, enum dma_data_direction dir);
+size_t size, enum dma_data_direction dir,
+unsigned long attrs);
 
 extern void swiotlb_tbl_sync_single(struct device *hwdev,
phys_addr_t tlb_addr,
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 47aad37..b538d39 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -425,7 +425,8 @@ static void swiotlb_bounce(phys_addr_t orig_addr, 
phys_addr_t tlb_addr,
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
   dma_addr_t tbl_dma_addr,
   phys_addr_t orig_addr, size_t size,
-  enum dma_data_direction dir)
+  enum dma_data_direction dir,
+  unsigned long attrs)
 {
unsigned long flags;
phys_addr_t tlb_addr;
@@ -526,7 +527,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
 */
for (i = 0; i < nslots; i++)
io_tlb_orig_addr[index+i] = orig_addr + (i << IO_TLB_SHIFT);
-   if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
+   (dir == DMA_TO_DEVICE || dir == DMA_BIDIRE

[net-next PATCH 23/27] dma: Add calls for dma_map_page_attrs and dma_unmap_page_attrs

2016-10-25 Thread Alexander Duyck
Add support for mapping and unmapping a page with attributes.  The primary
use for this is currently to allow us to pass the DMA_ATTR_SKIP_CPU_SYNC
attribute when mapping and unmapping a page.  On some architectures such as
ARM the synchronization has significant overhead, and if we are already
taking care of the sync_for_cpu and sync_for_device calls from the driver
there isn't much need to handle this in the map/unmap calls as well.

Signed-off-by: Alexander Duyck 
---
 include/linux/dma-mapping.h |   20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 08528af..10c5a17 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -243,29 +243,33 @@ static inline void dma_unmap_sg_attrs(struct device *dev, 
struct scatterlist *sg
ops->unmap_sg(dev, sg, nents, dir, attrs);
 }
 
-static inline dma_addr_t dma_map_page(struct device *dev, struct page *page,
- size_t offset, size_t size,
- enum dma_data_direction dir)
+static inline dma_addr_t dma_map_page_attrs(struct device *dev,
+   struct page *page,
+   size_t offset, size_t size,
+   enum dma_data_direction dir,
+   unsigned long attrs)
 {
struct dma_map_ops *ops = get_dma_ops(dev);
dma_addr_t addr;
 
kmemcheck_mark_initialized(page_address(page) + offset, size);
BUG_ON(!valid_dma_direction(dir));
-   addr = ops->map_page(dev, page, offset, size, dir, 0);
+   addr = ops->map_page(dev, page, offset, size, dir, attrs);
debug_dma_map_page(dev, page, offset, size, dir, addr, false);
 
return addr;
 }
 
-static inline void dma_unmap_page(struct device *dev, dma_addr_t addr,
- size_t size, enum dma_data_direction dir)
+static inline void dma_unmap_page_attrs(struct device *dev,
+   dma_addr_t addr, size_t size,
+   enum dma_data_direction dir,
+   unsigned long attrs)
 {
struct dma_map_ops *ops = get_dma_ops(dev);
 
BUG_ON(!valid_dma_direction(dir));
if (ops->unmap_page)
-   ops->unmap_page(dev, addr, size, dir, 0);
+   ops->unmap_page(dev, addr, size, dir, attrs);
debug_dma_unmap_page(dev, addr, size, dir, false);
 }
 
@@ -385,6 +389,8 @@ static inline void dma_sync_single_range_for_device(struct 
device *dev,
 #define dma_unmap_single(d, a, s, r) dma_unmap_single_attrs(d, a, s, r, 0)
 #define dma_map_sg(d, s, n, r) dma_map_sg_attrs(d, s, n, r, 0)
 #define dma_unmap_sg(d, s, n, r) dma_unmap_sg_attrs(d, s, n, r, 0)
+#define dma_map_page(d, p, o, s, r) dma_map_page_attrs(d, p, o, s, r, 0)
+#define dma_unmap_page(d, a, s, r) dma_unmap_page_attrs(d, a, s, r, 0)
 
 extern int dma_common_mmap(struct device *dev, struct vm_area_struct *vma,
   void *cpu_addr, dma_addr_t dma_addr, size_t size);
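
For illustration, a rough sketch of the driver-side pattern these helpers
enable (everything below is made up apart from the DMA API calls themselves;
it is a sketch, not code from any in-tree driver):

#include <linux/dma-mapping.h>
#include <linux/mm.h>

struct example_rx_buf {
	struct page *page;
	dma_addr_t dma;
};

/* Map the receive page without any CPU sync; the sync is deferred until
 * we know how many bytes the device actually wrote.
 */
static int example_rx_map(struct device *dev, struct example_rx_buf *buf)
{
	buf->dma = dma_map_page_attrs(dev, buf->page, 0, PAGE_SIZE,
				      DMA_FROM_DEVICE,
				      DMA_ATTR_SKIP_CPU_SYNC);
	return dma_mapping_error(dev, buf->dma) ? -ENOMEM : 0;
}

/* On completion, sync only the bytes that were used before the CPU reads
 * them, then hand the buffer back to the device.
 */
static void example_rx_complete(struct device *dev,
				struct example_rx_buf *buf, unsigned int len)
{
	dma_sync_single_range_for_cpu(dev, buf->dma, 0, len, DMA_FROM_DEVICE);
	/* ... build an skb or copy the payload here ... */
	dma_sync_single_range_for_device(dev, buf->dma, 0, len,
					 DMA_FROM_DEVICE);
}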



[net-next PATCH 24/27] mm: Add support for releasing multiple instances of a page

2016-10-25 Thread Alexander Duyck
This patch adds a function that allows us to batch free a page that has
multiple references outstanding.  Specifically, this function can be used to
drop a page being used in the page frag alloc cache.  With this, drivers can
make use of functionality similar to the page frag alloc cache without
having to work around the fact that there is no function that frees
multiple references at once.

Cc: linux...@kvack.org
Signed-off-by: Alexander Duyck 
---
 include/linux/gfp.h |2 ++
 mm/page_alloc.c |   14 ++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f8041f9de..4175dca 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -506,6 +506,8 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int 
order,
 extern void free_hot_cold_page_list(struct list_head *list, bool cold);
 
 struct page_frag_cache;
+extern void __page_frag_drain(struct page *page, unsigned int order,
+ unsigned int count);
 extern void *__alloc_page_frag(struct page_frag_cache *nc,
   unsigned int fragsz, gfp_t gfp_mask);
 extern void __free_page_frag(void *addr);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ca423cc..253046a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3883,6 +3883,20 @@ static struct page *__page_frag_refill(struct 
page_frag_cache *nc,
return page;
 }
 
+void __page_frag_drain(struct page *page, unsigned int order,
+  unsigned int count)
+{
+   VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+
+   if (page_ref_sub_and_test(page, count)) {
+   if (order == 0)
+   free_hot_cold_page(page, false);
+   else
+   __free_pages_ok(page, order);
+   }
+}
+EXPORT_SYMBOL(__page_frag_drain);
+
 void *__alloc_page_frag(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask)
 {
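
For illustration, a rough sketch of the kind of caller this helper targets
(all structure and function names below are made up; only alloc_pages(),
page_ref_add() and __page_frag_drain() are real interfaces):

#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/page_ref.h>

struct example_rx_page {
	struct page *page;
	unsigned int order;
	unsigned int pagecnt_bias;	/* references the driver still owns */
};

static bool example_rx_page_init(struct example_rx_page *rxp,
				 unsigned int order)
{
	rxp->page = alloc_pages(GFP_ATOMIC | __GFP_COMP, order);
	if (!rxp->page)
		return false;

	/* Pre-charge the page with a large bias so that handing a fragment
	 * to the stack never needs an extra atomic reference bump; the bias
	 * is decremented each time a fragment is given away.
	 */
	rxp->order = order;
	rxp->pagecnt_bias = USHRT_MAX;
	page_ref_add(rxp->page, USHRT_MAX - 1);
	return true;
}

static void example_rx_page_free(struct example_rx_page *rxp)
{
	/* Release every reference we still own in a single call instead of
	 * looping over put_page().
	 */
	__page_frag_drain(rxp->page, rxp->order, rxp->pagecnt_bias);
}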



[RFC PATCH ethtool 1/2] ethtool-copy.h: sync with net

2016-10-25 Thread Vidya Sagar Ravipati
From: Vidya Sagar Ravipati 

Sending this out as an RFC to get early feedback on the FEC options and
output changes, as the changes to the ethtool kernel uapi are currently
under review on netdev and might change based on that review:
http://patchwork.ozlabs.org/patch/686293/

Signed-off-by: Vidya Sagar Ravipati 
---
 ethtool-copy.h | 53 +++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 70748f5..ff3f4f0 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -1222,6 +1222,51 @@ struct ethtool_per_queue_op {
	char	data[];
 };
 
+/**
+ * struct ethtool_fecparam - Ethernet forward error correction(fec) parameters
+ * @cmd: Command number = %ETHTOOL_GFECPARAM or %ETHTOOL_SFECPARAM
+ * @autoneg: Flag to enable autonegotiation of fec modes(rs,baser)
+ *  (D44:47 of base link code word)
+ * @fec: Bitmask of supported FEC modes
+ * @rsvd: Reserved for future extensions. i.e FEC bypass feature.
+ *
+ * Drivers should reject a non-zero setting of @autoneg when
+ * autoneogotiation is disabled (or not supported) for the link.
+ *
+ * If @autoneg is non-zero, the MAC is configured to enable one of
+ * the supported FEC modes according to the result of autonegotiation.
+ * Otherwise, it is configured directly based on the @fec parameter
+ */
+struct ethtool_fecparam {
+   __u32   cmd;
+   __u32   autoneg;
+   /* bitmask of FEC modes */
+   __u32   fec;
+   __u32   reserved;
+};
+
+/**
+ * enum ethtool_fec_config_bits - flags definition of ethtool_fec_configuration
+ * @ETHTOOL_FEC_NONE: FEC mode configuration is not supported
+ * @ETHTOOL_FEC_AUTO: Default/Best FEC mode provided by driver
+ * @ETHTOOL_FEC_OFF: No FEC Mode
+ * @ETHTOOL_FEC_RS: Reed-Solomon Forward Error Detection mode
+ * @ETHTOOL_FEC_BASER: Base-R/Reed-Solomon Forward Error Detection mode
+ */
+enum ethtool_fec_config_bits {
+   ETHTOOL_FEC_NONE_BIT,
+   ETHTOOL_FEC_AUTO_BIT,
+   ETHTOOL_FEC_OFF_BIT,
+   ETHTOOL_FEC_RS_BIT,
+   ETHTOOL_FEC_BASER_BIT,
+};
+
+#define ETHTOOL_FEC_NONE   (1 << ETHTOOL_FEC_NONE_BIT)
+#define ETHTOOL_FEC_AUTO   (1 << ETHTOOL_FEC_AUTO_BIT)
+#define ETHTOOL_FEC_OFF(1 << ETHTOOL_FEC_OFF_BIT)
+#define ETHTOOL_FEC_RS (1 << ETHTOOL_FEC_RS_BIT)
+#define ETHTOOL_FEC_BASER  (1 << ETHTOOL_FEC_BASER_BIT)
+
 /* CMDs currently supported */
 #define ETHTOOL_GSET   0x0001 /* DEPRECATED, Get settings.
* Please use ETHTOOL_GLINKSETTINGS
@@ -1313,6 +1358,8 @@ struct ethtool_per_queue_op {
 #define ETHTOOL_GLINKSETTINGS  0x004c /* Get ethtool_link_settings */
 #define ETHTOOL_SLINKSETTINGS  0x004d /* Set ethtool_link_settings */
 
+#define ETHTOOL_GFECPARAM  0x004e /* Get FEC settings */
+#define ETHTOOL_SFECPARAM  0x004f /* Set FEC settings */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
@@ -1367,7 +1414,9 @@ enum ethtool_link_mode_bit_indices {
	ETHTOOL_LINK_MODE_10000baseLR_Full_BIT  = 44,
	ETHTOOL_LINK_MODE_10000baseLRM_Full_BIT = 45,
	ETHTOOL_LINK_MODE_10000baseER_Full_BIT  = 46,
-
+   ETHTOOL_LINK_MODE_FEC_NONE_BIT  = 47,
+   ETHTOOL_LINK_MODE_FEC_RS_BIT= 48,
+   ETHTOOL_LINK_MODE_FEC_BASER_BIT = 49,
 
/* Last allowed bit for __ETHTOOL_LINK_MODE_LEGACY_MASK is bit
 * 31. Please do NOT define any SUPPORTED_* or ADVERTISED_*
@@ -1376,7 +1425,7 @@ enum ethtool_link_mode_bit_indices {
 */
 
__ETHTOOL_LINK_MODE_LAST
- = ETHTOOL_LINK_MODE_10000baseER_Full_BIT,
+ = ETHTOOL_LINK_MODE_FEC_BASER_BIT,
 };
 
 #define __ETHTOOL_LINK_MODE_LEGACY_MASK(base_name) \
-- 
2.1.4
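
For illustration, a rough user-space sketch of reading the proposed FEC
parameters through the regular SIOCETHTOOL ioctl (this assumes the kernel
side of the RFC is present; the interface name "swp1" is just an example):

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/sockios.h>
#include "ethtool-copy.h"	/* struct ethtool_fecparam, ETHTOOL_GFECPARAM */

int main(void)
{
	struct ethtool_fecparam fec = { .cmd = ETHTOOL_GFECPARAM };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return 1;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "swp1", IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&fec;

	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
		perror("ETHTOOL_GFECPARAM");
		return 1;
	}

	printf("autoneg: %u, fec bitmask: 0x%x\n", fec.autoneg, fec.fec);
	return 0;
}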



[net-next PATCH 08/27] arch/c6x: Add option to skip sync on DMA map and unmap

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
later via a sync_for_cpu or sync_for_device call.

Cc: Mark Salter 
Cc: Aurelien Jacquiot 
Signed-off-by: Alexander Duyck 
---
 arch/c6x/kernel/dma.c |   14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/c6x/kernel/dma.c b/arch/c6x/kernel/dma.c
index db4a6a3..6752df3 100644
--- a/arch/c6x/kernel/dma.c
+++ b/arch/c6x/kernel/dma.c
@@ -42,14 +42,17 @@ static dma_addr_t c6x_dma_map_page(struct device *dev, 
struct page *page,
 {
dma_addr_t handle = virt_to_phys(page_address(page) + offset);
 
-   c6x_dma_sync(handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   c6x_dma_sync(handle, size, dir);
+
return handle;
 }
 
 static void c6x_dma_unmap_page(struct device *dev, dma_addr_t handle,
size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
-   c6x_dma_sync(handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   c6x_dma_sync(handle, size, dir);
 }
 
 static int c6x_dma_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -60,7 +63,8 @@ static int c6x_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
 
for_each_sg(sglist, sg, nents, i) {
sg->dma_address = sg_phys(sg);
-   c6x_dma_sync(sg->dma_address, sg->length, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   c6x_dma_sync(sg->dma_address, sg->length, dir);
}
 
return nents;
@@ -72,9 +76,11 @@ static void c6x_dma_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
struct scatterlist *sg;
int i;
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return;
+
for_each_sg(sglist, sg, nents, i)
c6x_dma_sync(sg_dma_address(sg), sg->length, dir);
-
 }
 
 static void c6x_dma_sync_single_for_cpu(struct device *dev, dma_addr_t handle,



[net-next PATCH 18/27] arch/powerpc: Add option to skip DMA sync as a part of mapping

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-...@lists.ozlabs.org
Signed-off-by: Alexander Duyck 
---
 arch/powerpc/kernel/dma.c |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index e64a601..6877e3f 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -203,6 +203,10 @@ static int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl,
for_each_sg(sgl, sg, nents, i) {
sg->dma_address = sg_phys(sg) + get_dma_offset(dev);
sg->dma_length = sg->length;
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
__dma_sync_page(sg_page(sg), sg->offset, sg->length, direction);
}
 
@@ -235,7 +239,10 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
 unsigned long attrs)
 {
BUG_ON(dir == DMA_NONE);
-   __dma_sync_page(page, offset, size, dir);
+
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync_page(page, offset, size, dir);
+
return page_to_phys(page) + offset + get_dma_offset(dev);
 }
 



[net-next PATCH 20/27] arch/sparc: Add option to skip DMA sync as a part of map and unmap

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: "David S. Miller" 
Cc: sparcli...@vger.kernel.org
Signed-off-by: Alexander Duyck 
---
 arch/sparc/kernel/iommu.c  |4 ++--
 arch/sparc/kernel/ioport.c |4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index 5c615ab..8fda4e4 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -415,7 +415,7 @@ static void dma_4u_unmap_page(struct device *dev, 
dma_addr_t bus_addr,
ctx = (iopte_val(*base) & IOPTE_CONTEXT) >> 47UL;
 
/* Step 1: Kick data out of streaming buffers if necessary. */
-   if (strbuf->strbuf_enabled)
+   if (strbuf->strbuf_enabled && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
strbuf_flush(strbuf, iommu, bus_addr, ctx,
 npages, direction);
 
@@ -640,7 +640,7 @@ static void dma_4u_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
base = iommu->page_table + entry;
 
dma_handle &= IO_PAGE_MASK;
-   if (strbuf->strbuf_enabled)
+   if (strbuf->strbuf_enabled && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
strbuf_flush(strbuf, iommu, dma_handle, ctx,
 npages, direction);
 
diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
index 2344103..6ffaec4 100644
--- a/arch/sparc/kernel/ioport.c
+++ b/arch/sparc/kernel/ioport.c
@@ -527,7 +527,7 @@ static dma_addr_t pci32_map_page(struct device *dev, struct 
page *page,
 static void pci32_unmap_page(struct device *dev, dma_addr_t ba, size_t size,
 enum dma_data_direction dir, unsigned long attrs)
 {
-   if (dir != PCI_DMA_TODEVICE)
+   if (dir != PCI_DMA_TODEVICE && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
dma_make_coherent(ba, PAGE_ALIGN(size));
 }
 
@@ -572,7 +572,7 @@ static void pci32_unmap_sg(struct device *dev, struct 
scatterlist *sgl,
struct scatterlist *sg;
int n;
 
-   if (dir != PCI_DMA_TODEVICE) {
+   if (dir != PCI_DMA_TODEVICE && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
for_each_sg(sgl, sg, nents, n) {
dma_make_coherent(sg_phys(sg), PAGE_ALIGN(sg->length));
}



[net-next PATCH 21/27] arch/tile: Add option to skip DMA sync as a part of map and unmap

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Chris Metcalf 
Signed-off-by: Alexander Duyck 
---
 arch/tile/kernel/pci-dma.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/tile/kernel/pci-dma.c b/arch/tile/kernel/pci-dma.c
index 09bb774..24e0f8c 100644
--- a/arch/tile/kernel/pci-dma.c
+++ b/arch/tile/kernel/pci-dma.c
@@ -213,10 +213,12 @@ static int tile_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
 
for_each_sg(sglist, sg, nents, i) {
sg->dma_address = sg_phys(sg);
-   __dma_prep_pa_range(sg->dma_address, sg->length, direction);
 #ifdef CONFIG_NEED_SG_DMA_LENGTH
sg->dma_length = sg->length;
 #endif
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+   __dma_prep_pa_range(sg->dma_address, sg->length, direction);
}
 
return nents;
@@ -232,6 +234,8 @@ static void tile_dma_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
BUG_ON(!valid_dma_direction(direction));
for_each_sg(sglist, sg, nents, i) {
sg->dma_address = sg_phys(sg);
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
__dma_complete_pa_range(sg->dma_address, sg->length,
direction);
}
@@ -245,7 +249,8 @@ static dma_addr_t tile_dma_map_page(struct device *dev, 
struct page *page,
BUG_ON(!valid_dma_direction(direction));
 
BUG_ON(offset + size > PAGE_SIZE);
-   __dma_prep_page(page, offset, size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_prep_page(page, offset, size, direction);
 
return page_to_pa(page) + offset;
 }
@@ -256,6 +261,9 @@ static void tile_dma_unmap_page(struct device *dev, 
dma_addr_t dma_address,
 {
BUG_ON(!valid_dma_direction(direction));
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return;
+
__dma_complete_page(pfn_to_page(PFN_DOWN(dma_address)),
dma_address & (PAGE_SIZE - 1), size, direction);
 }



[RFC PATCH ethtool 0/2] ethtool: Add support for FEC encoding configuration

2016-10-25 Thread Vidya Sagar Ravipati
From: Vidya Sagar Ravipati 

Forward Error Correction (FEC) modes, i.e. Base-R and Reed-Solomon, are
introduced in the 25G/40G/100G standards to provide a good BER at high
speeds.  Various networking devices which support 25G/40G/100G provide the
ability to manage the supported FEC modes, and the lack of FEC encoding
control and reporting today is a source of interoperability issues for many
vendors.  FEC capability, as well as a specific FEC mode (i.e. Base-R or RS),
can be requested or advertised through bits D44:47 of the base link codeword.

This patch set intends to provide an option under ethtool to manage and
report the FEC encoding settings for networking devices, as per the
IEEE 802.3bj, 802.3bm and 802.3by specs.

The set-fec/show-fec option(s) are designed to control and report the FEC
encoding on the link.

SET FEC option:
root@tor: ethtool --set-fec  swp1 encoding [off | RS | BaseR | auto] autoneg 
[off | on]

Encoding: Types of encoding
Off:  Turning off any encoding
RS :  enforcing RS-FEC encoding on supported speeds
BaseR  :  enforcing Base R encoding on supported speeds
Auto   :  Default FEC settings for drivers, and would represent
  asking the hardware to essentially go into a best effort mode.

Here are a few examples of what we would expect if encoding=auto:
- if autoneg is on, we are expecting FEC to be negotiated as on or off
  as long as the protocol supports it
- if the hardware is capable of detecting the FEC encoding on its
  receiver, it will reconfigure its encoder to match
- in the absence of the above, the configuration would be set to IEEE
  defaults.

From our understanding, this is essentially what most hardware/driver
combinations are doing today in the absence of a way for users to
control the behavior.

SHOW FEC option:
root@tor: ethtool --show-fec  swp1
FEC parameters for swp1:
Autonegotiate:  off
FEC encodings:  RS

ETHTOOL DEVNAME output modification:

ethtool devname output:
root@tor:~# ethtool swp1
Settings for swp1:
root@hpe-7712-03:~# ethtool swp18
Settings for swp18:
Supported ports: [ FIBRE ]
Supported link modes:   40000baseCR4/Full
                        40000baseSR4/Full
                        40000baseLR4/Full
                        100000baseSR4/Full
                        100000baseCR4/Full
                        100000baseLR4_ER4/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Supported FEC modes: [RS | BaseR | None | Not reported]
Advertised link modes:  Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: [RS | BaseR | None | Not reported]
 (one or more FEC modes)
Speed: 100000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 106
Transceiver: internal
Auto-negotiation: off
Link detected: yes

Vidya Sagar Ravipati (2):
  ethtool-copy.h: sync with net
  ethtool: Support for FEC encoding control

 ethtool-copy.h |  53 +++-
 ethtool.c  | 152 +
 2 files changed, 203 insertions(+), 2 deletions(-)

-- 
2.1.4



[RFC PATCH ethtool 2/2] ethtool: Support for FEC encoding control

2016-10-25 Thread Vidya Sagar Ravipati
From: Vidya Sagar Ravipati 

As FEC settings and different FEC modes are mandatory and configurable
across various 25G/40G/50G/100G interfaces, the lack of FEC encoding
control and reporting today is a source of interoperability issues for
many vendors.

The set-fec/show-fec option(s) are designed to control and report the FEC
encoding on the link.

root@tor: ethtool --set-fec  swp1 encoding [off | RS | BaseR | auto] autoneg 
[off | on]

Encoding: Types of encoding
Off:  Turning off any encoding
RS :  enforcing RS-FEC encoding on supported speeds
BaseR  :  enforcing Base R encoding on supported speeds
Auto   :  Default FEC settings for drivers, and would represent
  asking the hardware to essentially go into a best effort mode.

Here are a few examples of what we would expect if encoding=auto:
- if autoneg is on, we are expecting FEC to be negotiated as on or off
  as long as the protocol supports it
- if the hardware is capable of detecting the FEC encoding on its
  receiver, it will reconfigure its encoder to match
- in the absence of the above, the configuration would be set to IEEE
  defaults.

From our understanding, this is essentially what most hardware/driver
combinations are doing today in the absence of a way for users to
control the behavior.

root@tor: ethtool --show-fec  swp1
FEC parameters for swp1:
Autonegotiate:  off
FEC encodings:  RS

ethtool devname output:
root@tor:~# ethtool swp1
Settings for swp1:
root@hpe-7712-03:~# ethtool swp18
Settings for swp18:
Supported ports: [ FIBRE ]
Supported link modes:   40000baseCR4/Full
                        40000baseSR4/Full
                        40000baseLR4/Full
                        100000baseSR4/Full
                        100000baseCR4/Full
                        100000baseLR4_ER4/Full
Supported pause frame use: No
Supports auto-negotiation: Yes
Supported FEC modes: [RS | BaseR | None | Not reported]
Advertised link modes:  Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: [RS | BaseR | None | Not reported]
Speed: 100000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 106
Transceiver: internal
Auto-negotiation: off
Link detected: yes

Signed-off-by: Vidya Sagar Ravipati 
---
 ethtool.c | 152 ++
 1 file changed, 152 insertions(+)

diff --git a/ethtool.c b/ethtool.c
index 49ac94e..7fa058c 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -684,6 +684,7 @@ static void dump_link_caps(const char *prefix, const char 
*an_prefix,
};
int indent;
int did1, new_line_pend, i;
+   int fecreported = 0;
 
/* Indent just like the separate functions used to */
indent = strlen(prefix) + 14;
@@ -735,6 +736,26 @@ static void dump_link_caps(const char *prefix, const char 
*an_prefix,
fprintf(stdout, "Yes\n");
else
fprintf(stdout, "No\n");
+
+   fprintf(stdout, "   %s FEC modes: ", prefix);
+   if (ethtool_link_mode_test_bit(
+   ETHTOOL_LINK_MODE_FEC_NONE_BIT, mask)) {
+   fprintf(stdout, "None\n");
+   fecreported = 1;
+   }
+   if (ethtool_link_mode_test_bit(
+   ETHTOOL_LINK_MODE_FEC_BASER_BIT, mask)) {
+   fprintf(stdout, "BaseR\n");
+   fecreported = 1;
+   }
+   if (ethtool_link_mode_test_bit(
+   ETHTOOL_LINK_MODE_FEC_RS_BIT, mask)) {
+   fprintf(stdout, "RS\n");
+   fecreported = 1;
+   }
+   if (!fecreported) {
+   fprintf(stdout, "Not reported\n");
+   }
}
 }
 
@@ -1562,6 +1583,42 @@ static void dump_eeecmd(struct ethtool_eee *ep)
dump_link_caps("Link partner advertised EEE", "", link_mode, 1);
 }
 
+static void dump_feccmd(struct ethtool_fecparam *ep)
+{
+   static char buf[300];
+
+   memset(buf, 0, sizeof(buf));
+
+   bool first = true;
+
+   fprintf(stdout,
+   "Auto-negotiation: %s\n",
+   ep->autoneg ? "on" : "off");
+   fprintf(stdout, "FEC encodings   :");
+
+   if(ep->fec & ETHTOOL_FEC_NONE) {
+   strcat(buf, "NotSupported");
+   first = false;
+   }
+   if(ep->fec & ETHTOOL_FEC_OFF) {
+   strcat(buf, "None");
+   first = false;
+   }
+   if(ep->fec & ETHTOOL_FEC_BASER) {
+   if (!first)
+   strcat(buf, " | ");
+   strcat(buf, "BaseR");
+   first = false;
+   }
+   if(ep->fec & ETHTOOL_FEC_RS) {
+   if (!first)
+   strcat(buf, " | ");
+   strcat(buf, "RS");
+   first = false

[net-next PATCH 15/27] arch/nios2: Add option to skip DMA sync as a part of map and unmap

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Ley Foon Tan 
Signed-off-by: Alexander Duyck 
---
 arch/nios2/mm/dma-mapping.c |   26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/arch/nios2/mm/dma-mapping.c b/arch/nios2/mm/dma-mapping.c
index d800fad..f6a5dcf 100644
--- a/arch/nios2/mm/dma-mapping.c
+++ b/arch/nios2/mm/dma-mapping.c
@@ -98,13 +98,17 @@ static int nios2_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
int i;
 
for_each_sg(sg, sg, nents, i) {
-   void *addr;
+   void *addr = sg_virt(sg);
 
-   addr = sg_virt(sg);
-   if (addr) {
-   __dma_sync_for_device(addr, sg->length, direction);
-   sg->dma_address = sg_phys(sg);
-   }
+   if (!addr)
+   continue;
+
+   sg->dma_address = sg_phys(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
+   __dma_sync_for_device(addr, sg->length, direction);
}
 
return nents;
@@ -117,7 +121,9 @@ static dma_addr_t nios2_dma_map_page(struct device *dev, 
struct page *page,
 {
void *addr = page_address(page) + offset;
 
-   __dma_sync_for_device(addr, size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync_for_device(addr, size, direction);
+
return page_to_phys(page) + offset;
 }
 
@@ -125,7 +131,8 @@ static void nios2_dma_unmap_page(struct device *dev, 
dma_addr_t dma_address,
size_t size, enum dma_data_direction direction,
unsigned long attrs)
 {
-   __dma_sync_for_cpu(phys_to_virt(dma_address), size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync_for_cpu(phys_to_virt(dma_address), size, direction);
 }
 
 static void nios2_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
@@ -138,6 +145,9 @@ static void nios2_dma_unmap_sg(struct device *dev, struct 
scatterlist *sg,
if (direction == DMA_TO_DEVICE)
return;
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return;
+
for_each_sg(sg, sg, nhwentries, i) {
addr = sg_virt(sg);
if (addr)



[net-next PATCH 19/27] arch/sh: Add option to skip DMA sync as a part of mapping

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: linux...@vger.kernel.org
Signed-off-by: Alexander Duyck 
---
 arch/sh/kernel/dma-nommu.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/sh/kernel/dma-nommu.c b/arch/sh/kernel/dma-nommu.c
index eadb669..47fee3b 100644
--- a/arch/sh/kernel/dma-nommu.c
+++ b/arch/sh/kernel/dma-nommu.c
@@ -18,7 +18,9 @@ static dma_addr_t nommu_map_page(struct device *dev, struct 
page *page,
dma_addr_t addr = page_to_phys(page) + offset;
 
WARN_ON(size == 0);
-   dma_cache_sync(dev, page_address(page) + offset, size, dir);
+
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_cache_sync(dev, page_address(page) + offset, size, dir);
 
return addr;
 }
@@ -35,7 +37,8 @@ static int nommu_map_sg(struct device *dev, struct 
scatterlist *sg,
for_each_sg(sg, s, nents, i) {
BUG_ON(!sg_page(s));
 
-   dma_cache_sync(dev, sg_virt(s), s->length, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_cache_sync(dev, sg_virt(s), s->length, dir);
 
s->dma_address = sg_phys(s);
s->dma_length = s->length;



[net-next PATCH 13/27] arch/microblaze: Add option to skip DMA sync as a part of map and unmap

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Michal Simek 
Signed-off-by: Alexander Duyck 
---
 arch/microblaze/kernel/dma.c |   10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index ec04dc1..818daf2 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -61,6 +61,10 @@ static int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl,
/* FIXME this part of code is untested */
for_each_sg(sgl, sg, nents, i) {
sg->dma_address = sg_phys(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
__dma_sync(page_to_phys(sg_page(sg)) + sg->offset,
sg->length, direction);
}
@@ -80,7 +84,8 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
 enum dma_data_direction direction,
 unsigned long attrs)
 {
-   __dma_sync(page_to_phys(page) + offset, size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync(page_to_phys(page) + offset, size, direction);
return page_to_phys(page) + offset;
 }
 
@@ -95,7 +100,8 @@ static inline void dma_direct_unmap_page(struct device *dev,
  * phys_to_virt is here because in __dma_sync_page is __virt_to_phys and
  * dma_address is physical address
  */
-   __dma_sync(dma_address, size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   __dma_sync(dma_address, size, direction);
 }
 
 static inline void



[net-next PATCH 10/27] arch/hexagon: Add option to skip DMA sync as a part of mapping

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
later via a sync_for_cpu or sync_for_device call.

Cc: Richard Kuo 
Cc: linux-hexa...@vger.kernel.org
Signed-off-by: Alexander Duyck 
---
 arch/hexagon/kernel/dma.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/hexagon/kernel/dma.c b/arch/hexagon/kernel/dma.c
index b901778..dbc4f10 100644
--- a/arch/hexagon/kernel/dma.c
+++ b/arch/hexagon/kernel/dma.c
@@ -119,6 +119,9 @@ static int hexagon_map_sg(struct device *hwdev, struct 
scatterlist *sg,
 
s->dma_length = s->length;
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
flush_dcache_range(dma_addr_to_virt(s->dma_address),
   dma_addr_to_virt(s->dma_address + 
s->length));
}
@@ -180,7 +183,8 @@ static dma_addr_t hexagon_map_page(struct device *dev, 
struct page *page,
if (!check_addr("map_single", dev, bus, size))
return bad_dma_address;
 
-   dma_sync(dma_addr_to_virt(bus), size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_sync(dma_addr_to_virt(bus), size, dir);
 
return bus;
 }



[net-next PATCH 09/27] arch/frv: Add option to skip sync on DMA map

2016-10-25 Thread Alexander Duyck
The use of DMA_ATTR_SKIP_CPU_SYNC was not consistent across all of the DMA
APIs in the arch/frv folder.  This change is meant to correct that so that
we get consistent behavior.

Signed-off-by: Alexander Duyck 
---
 arch/frv/mb93090-mb00/pci-dma-nommu.c |   14 ++
 arch/frv/mb93090-mb00/pci-dma.c   |9 +++--
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/frv/mb93090-mb00/pci-dma-nommu.c 
b/arch/frv/mb93090-mb00/pci-dma-nommu.c
index 90f2e4c..1876881 100644
--- a/arch/frv/mb93090-mb00/pci-dma-nommu.c
+++ b/arch/frv/mb93090-mb00/pci-dma-nommu.c
@@ -109,16 +109,19 @@ static int frv_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
int nents, enum dma_data_direction direction,
unsigned long attrs)
 {
-   int i;
struct scatterlist *sg;
+   int i;
+
+   BUG_ON(direction == DMA_NONE);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return nents;
 
for_each_sg(sglist, sg, nents, i) {
frv_cache_wback_inv(sg_dma_address(sg),
sg_dma_address(sg) + sg_dma_len(sg));
}
 
-   BUG_ON(direction == DMA_NONE);
-
return nents;
 }
 
@@ -127,7 +130,10 @@ static dma_addr_t frv_dma_map_page(struct device *dev, 
struct page *page,
enum dma_data_direction direction, unsigned long attrs)
 {
BUG_ON(direction == DMA_NONE);
-   flush_dcache_page(page);
+
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   flush_dcache_page(page);
+
return (dma_addr_t) page_to_phys(page) + offset;
 }
 
diff --git a/arch/frv/mb93090-mb00/pci-dma.c b/arch/frv/mb93090-mb00/pci-dma.c
index f585745..dba7df9 100644
--- a/arch/frv/mb93090-mb00/pci-dma.c
+++ b/arch/frv/mb93090-mb00/pci-dma.c
@@ -40,13 +40,16 @@ static int frv_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
int nents, enum dma_data_direction direction,
unsigned long attrs)
 {
+   struct scatterlist *sg;
unsigned long dampr2;
void *vaddr;
int i;
-   struct scatterlist *sg;
 
BUG_ON(direction == DMA_NONE);
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return nents;
+
dampr2 = __get_DAMPR(2);
 
for_each_sg(sglist, sg, nents, i) {
@@ -70,7 +73,9 @@ static dma_addr_t frv_dma_map_page(struct device *dev, struct 
page *page,
unsigned long offset, size_t size,
enum dma_data_direction direction, unsigned long attrs)
 {
-   flush_dcache_page(page);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   flush_dcache_page(page);
+
return (dma_addr_t) page_to_phys(page) + offset;
 }
 



[net-next PATCH 11/27] arch/m68k: Add option to skip DMA sync as a part of mapping

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
later via a sync_for_cpu or sync_for_device call.

Cc: Geert Uytterhoeven 
Cc: linux-m...@lists.linux-m68k.org
Signed-off-by: Alexander Duyck 
---
 arch/m68k/kernel/dma.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/m68k/kernel/dma.c b/arch/m68k/kernel/dma.c
index 8cf97cb..0707006 100644
--- a/arch/m68k/kernel/dma.c
+++ b/arch/m68k/kernel/dma.c
@@ -134,7 +134,9 @@ static dma_addr_t m68k_dma_map_page(struct device *dev, 
struct page *page,
 {
dma_addr_t handle = page_to_phys(page) + offset;
 
-   dma_sync_single_for_device(dev, handle, size, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_sync_single_for_device(dev, handle, size, dir);
+
return handle;
 }
 
@@ -146,6 +148,10 @@ static int m68k_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
 
for_each_sg(sglist, sg, nents, i) {
sg->dma_address = sg_phys(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
dma_sync_single_for_device(dev, sg->dma_address, sg->length,
   dir);
}



[net-next PATCH 17/27] arch/parisc: Add option to skip DMA sync as a part of map and unmap

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: linux-par...@vger.kernel.org
Signed-off-by: Alexander Duyck 
---
 arch/parisc/kernel/pci-dma.c |   20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index 02d9ed0..be55ede 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -459,7 +459,9 @@ static dma_addr_t pa11_dma_map_page(struct device *dev, 
struct page *page,
void *addr = page_address(page) + offset;
BUG_ON(direction == DMA_NONE);
 
-   flush_kernel_dcache_range((unsigned long) addr, size);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   flush_kernel_dcache_range((unsigned long) addr, size);
+
return virt_to_phys(addr);
 }
 
@@ -469,8 +471,11 @@ static void pa11_dma_unmap_page(struct device *dev, 
dma_addr_t dma_handle,
 {
BUG_ON(direction == DMA_NONE);
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return;
+
if (direction == DMA_TO_DEVICE)
-   return;
+   return;
 
/*
 * For PCI_DMA_FROMDEVICE this flush is not necessary for the
@@ -479,7 +484,6 @@ static void pa11_dma_unmap_page(struct device *dev, 
dma_addr_t dma_handle,
 */
 
flush_kernel_dcache_range((unsigned long) phys_to_virt(dma_handle), 
size);
-   return;
 }
 
 static int pa11_dma_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -496,6 +500,10 @@ static int pa11_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
 
sg_dma_address(sg) = (dma_addr_t) virt_to_phys(vaddr);
sg_dma_len(sg) = sg->length;
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
flush_kernel_dcache_range(vaddr, sg->length);
}
return nents;
@@ -510,14 +518,16 @@ static void pa11_dma_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
 
BUG_ON(direction == DMA_NONE);
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return;
+
if (direction == DMA_TO_DEVICE)
-   return;
+   return;
 
/* once we do combining we'll need to use 
phys_to_virt(sg_dma_address(sglist)) */
 
for_each_sg(sglist, sg, nents, i)
flush_kernel_vmap_range(sg_virt(sg), sg->length);
-   return;
 }
 
 static void pa11_dma_sync_single_for_cpu(struct device *dev,



[net-next PATCH 12/27] arch/metag: Add option to skip DMA sync as a part of map and unmap

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: James Hogan 
Cc: linux-me...@vger.kernel.org
Signed-off-by: Alexander Duyck 
---
 arch/metag/kernel/dma.c |   16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/metag/kernel/dma.c b/arch/metag/kernel/dma.c
index 0db31e2..91968d9 100644
--- a/arch/metag/kernel/dma.c
+++ b/arch/metag/kernel/dma.c
@@ -484,8 +484,9 @@ static dma_addr_t metag_dma_map_page(struct device *dev, 
struct page *page,
unsigned long offset, size_t size,
enum dma_data_direction direction, unsigned long attrs)
 {
-   dma_sync_for_device((void *)(page_to_phys(page) + offset), size,
-   direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_sync_for_device((void *)(page_to_phys(page) + offset),
+   size, direction);
return page_to_phys(page) + offset;
 }
 
@@ -493,7 +494,8 @@ static void metag_dma_unmap_page(struct device *dev, 
dma_addr_t dma_address,
size_t size, enum dma_data_direction direction,
unsigned long attrs)
 {
-   dma_sync_for_cpu(phys_to_virt(dma_address), size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_sync_for_cpu(phys_to_virt(dma_address), size, direction);
 }
 
 static int metag_dma_map_sg(struct device *dev, struct scatterlist *sglist,
@@ -507,6 +509,10 @@ static int metag_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
BUG_ON(!sg_page(sg));
 
sg->dma_address = sg_phys(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
dma_sync_for_device(sg_virt(sg), sg->length, direction);
}
 
@@ -525,6 +531,10 @@ static void metag_dma_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
BUG_ON(!sg_page(sg));
 
sg->dma_address = sg_phys(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
dma_sync_for_cpu(sg_virt(sg), sg->length, direction);
}
 }



[net-next PATCH 14/27] arch/mips: Add option to skip DMA sync as a part of map and unmap

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Ralf Baechle 
Cc: Keguang Zhang 
Cc: linux-m...@linux-mips.org
Signed-off-by: Alexander Duyck 
---
 arch/mips/loongson64/common/dma-swiotlb.c |2 +-
 arch/mips/mm/dma-default.c|8 +---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/mips/loongson64/common/dma-swiotlb.c 
b/arch/mips/loongson64/common/dma-swiotlb.c
index 1a80b6f..aab4fd6 100644
--- a/arch/mips/loongson64/common/dma-swiotlb.c
+++ b/arch/mips/loongson64/common/dma-swiotlb.c
@@ -61,7 +61,7 @@ static int loongson_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
int nents, enum dma_data_direction dir,
unsigned long attrs)
 {
-   int r = swiotlb_map_sg_attrs(dev, sg, nents, dir, 0);
+   int r = swiotlb_map_sg_attrs(dev, sg, nents, dir, attrs);
mb();
 
return r;
diff --git a/arch/mips/mm/dma-default.c b/arch/mips/mm/dma-default.c
index b2eadd6..dd998d7 100644
--- a/arch/mips/mm/dma-default.c
+++ b/arch/mips/mm/dma-default.c
@@ -293,7 +293,7 @@ static inline void __dma_sync(struct page *page,
 static void mips_dma_unmap_page(struct device *dev, dma_addr_t dma_addr,
size_t size, enum dma_data_direction direction, unsigned long attrs)
 {
-   if (cpu_needs_post_dma_flush(dev))
+   if (cpu_needs_post_dma_flush(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
__dma_sync(dma_addr_to_page(dev, dma_addr),
   dma_addr & ~PAGE_MASK, size, direction);
plat_post_dma_flush(dev);
@@ -307,7 +307,8 @@ static int mips_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
struct scatterlist *sg;
 
for_each_sg(sglist, sg, nents, i) {
-   if (!plat_device_is_coherent(dev))
+   if (!plat_device_is_coherent(dev) &&
+   !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
__dma_sync(sg_page(sg), sg->offset, sg->length,
   direction);
 #ifdef CONFIG_NEED_SG_DMA_LENGTH
@@ -324,7 +325,7 @@ static dma_addr_t mips_dma_map_page(struct device *dev, 
struct page *page,
unsigned long offset, size_t size, enum dma_data_direction direction,
unsigned long attrs)
 {
-   if (!plat_device_is_coherent(dev))
+   if (!plat_device_is_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
__dma_sync(page, offset, size, direction);
 
return plat_map_dma_mem_page(dev, page) + offset;
@@ -339,6 +340,7 @@ static void mips_dma_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
 
for_each_sg(sglist, sg, nhwentries, i) {
if (!plat_device_is_coherent(dev) &&
+   !(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
direction != DMA_TO_DEVICE)
__dma_sync(sg_page(sg), sg->offset, sg->length,
   direction);



[net-next PATCH 00/27] Add support for DMA writable pages being writable by the network stack

2016-10-25 Thread Alexander Duyck
The first 22 patches in the set add support for the DMA attribute
DMA_ATTR_SKIP_CPU_SYNC on multiple platforms/architectures.  This is needed
so that we can flag the calls to dma_map/unmap_page so that we do not
invalidate cache lines that do not currently belong to the device.  Instead
we have to take care of this in the driver via a call to
sync_single_range_for_cpu prior to freeing the Rx page.

Patch 23 adds support for dma_map_page_attrs and dma_unmap_page_attrs so
that we can unmap and map a page using the DMA_ATTR_SKIP_CPU_SYNC
attribute.
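
To make the intended driver-side pattern concrete, here is a rough sketch
(illustrative only, not taken from any patch in this series; dev, page,
offset and len are placeholders):

        /* map once and tell the DMA layer not to touch CPU cache lines */
        dma = dma_map_page_attrs(dev, page, 0, PAGE_SIZE,
                                 DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);

        /* before the CPU reads the received data, sync just that region */
        dma_sync_single_range_for_cpu(dev, dma, offset, len, DMA_FROM_DEVICE);

        /* at unmap time skip the sync again, the driver already did it */
        dma_unmap_page_attrs(dev, dma, PAGE_SIZE, DMA_FROM_DEVICE,
                             DMA_ATTR_SKIP_CPU_SYNC);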

Patch 24 adds support for freeing a page that has multiple references being
held by a single caller.  This way we can free page fragments that were
allocated by a given driver.

The last 3 patches use these updates in the igb driver to allow for us to
reimplement the use of build_skb.

My hope is to get the series accepted into the net-next tree as I have a
number of other Intel drivers I could then begin updating once these
patches are accepted.

v1: Split out the DMA_ERROR_CODE fix for swiotlb-xen into its own patch
Minor fixes based on issues found by kernel build bot
Few minor changes for issues found on code review
Added Acked-by for patches that were acked and not changed

---

Alexander Duyck (27):
  swiotlb: Drop unused function swiotlb_map_sg
  swiotlb-xen: Enforce return of DMA_ERROR_CODE in mapping function
  swiotlb: Add support for DMA_ATTR_SKIP_CPU_SYNC
  arch/arc: Add option to skip sync on DMA mapping
  arch/arm: Add option to skip sync on DMA map and unmap
  arch/avr32: Add option to skip sync on DMA map
  arch/blackfin: Add option to skip sync on DMA map
  arch/c6x: Add option to skip sync on DMA map and unmap
  arch/frv: Add option to skip sync on DMA map
  arch/hexagon: Add option to skip DMA sync as a part of mapping
  arch/m68k: Add option to skip DMA sync as a part of mapping
  arch/metag: Add option to skip DMA sync as a part of map and unmap
  arch/microblaze: Add option to skip DMA sync as a part of map and unmap
  arch/mips: Add option to skip DMA sync as a part of map and unmap
  arch/nios2: Add option to skip DMA sync as a part of map and unmap
  arch/openrisc: Add option to skip DMA sync as a part of mapping
  arch/parisc: Add option to skip DMA sync as a part of map and unmap
  arch/powerpc: Add option to skip DMA sync as a part of mapping
  arch/sh: Add option to skip DMA sync as a part of mapping
  arch/sparc: Add option to skip DMA sync as a part of map and unmap
  arch/tile: Add option to skip DMA sync as a part of map and unmap
  arch/xtensa: Add option to skip DMA sync as a part of mapping
  dma: Add calls for dma_map_page_attrs and dma_unmap_page_attrs
  mm: Add support for releasing multiple instances of a page
  igb: Update driver to make use of DMA_ATTR_SKIP_CPU_SYNC
  igb: Update code to better handle incrementing page count
  igb: Revert "igb: Revert support for build_skb in igb"


 arch/arc/mm/dma.c |5 +
 arch/arm/common/dmabounce.c   |   16 +-
 arch/arm/xen/mm.c |1 
 arch/avr32/mm/dma-coherent.c  |7 +
 arch/blackfin/kernel/dma-mapping.c|8 +
 arch/c6x/kernel/dma.c |   14 +-
 arch/frv/mb93090-mb00/pci-dma-nommu.c |   14 +-
 arch/frv/mb93090-mb00/pci-dma.c   |9 +
 arch/hexagon/kernel/dma.c |6 +
 arch/m68k/kernel/dma.c|8 +
 arch/metag/kernel/dma.c   |   16 ++
 arch/microblaze/kernel/dma.c  |   10 +
 arch/mips/loongson64/common/dma-swiotlb.c |2 
 arch/mips/mm/dma-default.c|8 +
 arch/nios2/mm/dma-mapping.c   |   26 +++-
 arch/openrisc/kernel/dma.c|3 
 arch/parisc/kernel/pci-dma.c  |   20 ++-
 arch/powerpc/kernel/dma.c |9 +
 arch/sh/kernel/dma-nommu.c|7 +
 arch/sparc/kernel/iommu.c |4 -
 arch/sparc/kernel/ioport.c|4 -
 arch/tile/kernel/pci-dma.c|   12 +-
 arch/x86/xen/pci-swiotlb-xen.c|1 
 arch/xtensa/kernel/pci-dma.c  |7 +
 drivers/net/ethernet/intel/igb/igb.h  |   36 -
 drivers/net/ethernet/intel/igb/igb_main.c |  207 +++--
 drivers/xen/swiotlb-xen.c |   27 ++--
 include/linux/dma-mapping.h   |   20 ++-
 include/linux/gfp.h   |2 
 include/linux/swiotlb.h   |   10 +
 include/xen/swiotlb-xen.h |3 
 lib/swiotlb.c |   56 
 mm/page_alloc.c   |   14 ++
 33 files changed, 433 insertions(+), 159 deletions(-)

--
Signature


[net-next PATCH 06/27] arch/avr32: Add option to skip sync on DMA map

2016-10-25 Thread Alexander Duyck
The use of DMA_ATTR_SKIP_CPU_SYNC was not consistent across all of the DMA
APIs in the arch/avr32 folder.  This change is meant to correct that so that
we get consistent behavior.

Acked-by: Hans-Christian Noren Egtvedt 
Signed-off-by: Alexander Duyck 
---
 arch/avr32/mm/dma-coherent.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/avr32/mm/dma-coherent.c b/arch/avr32/mm/dma-coherent.c
index 58610d0..54534e5 100644
--- a/arch/avr32/mm/dma-coherent.c
+++ b/arch/avr32/mm/dma-coherent.c
@@ -146,7 +146,8 @@ static dma_addr_t avr32_dma_map_page(struct device *dev, 
struct page *page,
 {
void *cpu_addr = page_address(page) + offset;
 
-   dma_cache_sync(dev, cpu_addr, size, direction);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   dma_cache_sync(dev, cpu_addr, size, direction);
return virt_to_bus(cpu_addr);
 }
 
@@ -162,6 +163,10 @@ static int avr32_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
 
sg->dma_address = page_to_bus(sg_page(sg)) + sg->offset;
virt = sg_virt(sg);
+
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   continue;
+
dma_cache_sync(dev, virt, sg->length, direction);
}
 



[net-next PATCH 26/27] igb: Update code to better handle incrementing page count

2016-10-25 Thread Alexander Duyck
This patch updates the driver code so that we do bulk updates of the page
reference count instead of just incrementing it by one reference at a time.
The advantage to doing this is that we cut down on atomic operations and
this in turn should give us a slight improvement in cycles per packet.  In
addition if we eventually move this over to using build_skb the gains will
be more noticeable.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/igb/igb.h  |7 ++-
 drivers/net/ethernet/intel/igb/igb_main.c |   24 +---
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index d11093d..acbc3ab 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -210,7 +210,12 @@ struct igb_tx_buffer {
 struct igb_rx_buffer {
dma_addr_t dma;
struct page *page;
-   unsigned int page_offset;
+#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
+   __u32 page_offset;
+#else
+   __u16 page_offset;
+#endif
+   __u16 pagecnt_bias;
 };
 
 struct igb_tx_queue_stats {
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index c8c458c..5e66cde 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3962,7 +3962,8 @@ static void igb_clean_rx_ring(struct igb_ring *rx_ring)
 PAGE_SIZE,
 DMA_FROM_DEVICE,
 DMA_ATTR_SKIP_CPU_SYNC);
-   __free_page(buffer_info->page);
+   __page_frag_drain(buffer_info->page, 0,
+ buffer_info->pagecnt_bias);
 
buffer_info->page = NULL;
}
@@ -6830,13 +6831,15 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer 
*rx_buffer,
  struct page *page,
  unsigned int truesize)
 {
+   unsigned int pagecnt_bias = rx_buffer->pagecnt_bias--;
+
/* avoid re-using remote pages */
if (unlikely(igb_page_is_reserved(page)))
return false;
 
 #if (PAGE_SIZE < 8192)
/* if we are only owner of page we can reuse it */
-   if (unlikely(page_count(page) != 1))
+   if (unlikely(page_ref_count(page) != pagecnt_bias))
return false;
 
/* flip page offset to other buffer */
@@ -6849,10 +6852,14 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer 
*rx_buffer,
return false;
 #endif
 
-   /* Even if we own the page, we are not allowed to use atomic_set()
-* This would break get_page_unless_zero() users.
+   /* If we have drained the page fragment pool we need to update
+* the pagecnt_bias and page count so that we fully restock the
+* number of references the driver holds.
 */
-   page_ref_inc(page);
+   if (unlikely(pagecnt_bias == 1)) {
+   page_ref_add(page, USHRT_MAX);
+   rx_buffer->pagecnt_bias = USHRT_MAX;
+   }
 
return true;
 }
@@ -6904,7 +6911,6 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
return true;
 
/* this page cannot be reused so discard it */
-   __free_page(page);
return false;
}
 
@@ -6975,10 +6981,13 @@ static struct sk_buff *igb_fetch_rx_buffer(struct 
igb_ring *rx_ring,
/* hand second half of page back to the ring */
igb_reuse_rx_page(rx_ring, rx_buffer);
} else {
-   /* we are not reusing the buffer so unmap it */
+   /* We are not reusing the buffer so unmap it and free
+* any references we are holding to it
+*/
dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
 PAGE_SIZE, DMA_FROM_DEVICE,
 DMA_ATTR_SKIP_CPU_SYNC);
+   __page_frag_drain(page, 0, rx_buffer->pagecnt_bias);
}
 
/* clear contents of rx_buffer */
@@ -7252,6 +7261,7 @@ static bool igb_alloc_mapped_page(struct igb_ring 
*rx_ring,
bi->dma = dma;
bi->page = page;
bi->page_offset = 0;
+   bi->pagecnt_bias = 1;
 
return true;
 }



[net-next PATCH 27/27] igb: Revert "igb: Revert support for build_skb in igb"

2016-10-25 Thread Alexander Duyck
This reverts commit f9d40f6a9921 ("igb: Revert support for build_skb in
igb") and adds a few changes to update it to work with the latest version
of igb. We are now able to revert the removal of this due to the fact
that with the recent changes to the page count and the use of
DMA_ATTR_SKIP_CPU_SYNC we can make the pages writable so we should not be
invalidating the additional data added when we call build_skb.

The biggest risk with this change is that we are now not able to support
full jumbo frames when using build_skb.  Instead we can only support up to
2K minus the skb overhead and padding offset.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/igb/igb.h  |   29 ++
 drivers/net/ethernet/intel/igb/igb_main.c |  130 ++---
 2 files changed, 142 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index acbc3ab..c3420f3 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -145,6 +145,10 @@ struct vf_data_storage {
 #define IGB_RX_HDR_LEN IGB_RXBUFFER_256
 #define IGB_RX_BUFSZ   IGB_RXBUFFER_2048
 
+#define IGB_SKB_PAD(NET_SKB_PAD + NET_IP_ALIGN)
+#define IGB_MAX_BUILD_SKB_SIZE \
+   (SKB_WITH_OVERHEAD(IGB_RX_BUFSZ) - (IGB_SKB_PAD + IGB_TS_HDR_LEN))
+
 /* How many Rx Buffers do we bundle into one write to the hardware ? */
 #define IGB_RX_BUFFER_WRITE16 /* Must be power of 2 */
 
@@ -301,12 +305,29 @@ struct igb_q_vector {
 };
 
 enum e1000_ring_flags_t {
-   IGB_RING_FLAG_RX_SCTP_CSUM,
-   IGB_RING_FLAG_RX_LB_VLAN_BSWAP,
-   IGB_RING_FLAG_TX_CTX_IDX,
-   IGB_RING_FLAG_TX_DETECT_HANG
+   IGB_RING_FLAG_RX_SCTP_CSUM = 0,
+#if (NET_IP_ALIGN != 0)
+   IGB_RING_FLAG_RX_BUILD_SKB_ENABLED = 1,
+#endif
+   IGB_RING_FLAG_RX_LB_VLAN_BSWAP = 2,
+   IGB_RING_FLAG_TX_CTX_IDX = 3,
+   IGB_RING_FLAG_TX_DETECT_HANG = 4,
+#if (NET_IP_ALIGN == 0)
+#if (L1_CACHE_SHIFT < 5)
+   IGB_RING_FLAG_RX_BUILD_SKB_ENABLED = 5,
+#else
+   IGB_RING_FLAG_RX_BUILD_SKB_ENABLED = L1_CACHE_SHIFT,
+#endif
+#endif
 };
 
+#define ring_uses_build_skb(ring) \
+   test_bit(IGB_RING_FLAG_RX_BUILD_SKB_ENABLED, &(ring)->flags)
+#define set_ring_build_skb_enabled(ring) \
+   set_bit(IGB_RING_FLAG_RX_BUILD_SKB_ENABLED, &(ring)->flags)
+#define clear_ring_build_skb_enabled(ring) \
+   clear_bit(IGB_RING_FLAG_RX_BUILD_SKB_ENABLED, &(ring)->flags)
+
 #define IGB_TXD_DCMD (E1000_ADVTXD_DCMD_EOP | E1000_ADVTXD_DCMD_RS)
 
 #define IGB_RX_DESC(R, i)  \
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 5e66cde..e55407a 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3761,6 +3761,16 @@ void igb_configure_rx_ring(struct igb_adapter *adapter,
wr32(E1000_RXDCTL(reg_idx), rxdctl);
 }
 
+static void igb_set_rx_buffer_len(struct igb_adapter *adapter,
+ struct igb_ring *rx_ring)
+{
+   /* set build_skb flag */
+   if (adapter->max_frame_size <= IGB_MAX_BUILD_SKB_SIZE)
+   set_ring_build_skb_enabled(rx_ring);
+   else
+   clear_ring_build_skb_enabled(rx_ring);
+}
+
 /**
  *  igb_configure_rx - Configure receive Unit after Reset
  *  @adapter: board private structure
@@ -3778,8 +3788,12 @@ static void igb_configure_rx(struct igb_adapter *adapter)
/* Setup the HW Rx Head and Tail Descriptor Pointers and
 * the Base and Length of the Rx Descriptor Ring
 */
-   for (i = 0; i < adapter->num_rx_queues; i++)
-   igb_configure_rx_ring(adapter, adapter->rx_ring[i]);
+   for (i = 0; i < adapter->num_rx_queues; i++) {
+   struct igb_ring *rx_ring = adapter->rx_ring[i];
+
+   igb_set_rx_buffer_len(adapter, rx_ring);
+   igb_configure_rx_ring(adapter, rx_ring);
+   }
 }
 
 /**
@@ -4238,7 +4252,7 @@ static void igb_set_rx_mode(struct net_device *netdev)
struct igb_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
unsigned int vfn = adapter->vfs_allocated_count;
-   u32 rctl = 0, vmolr = 0;
+   u32 rctl = 0, vmolr = 0, rlpml = MAX_JUMBO_FRAME_SIZE;
int count;
 
/* Check for Promiscuous and All Multicast modes */
@@ -4310,12 +4324,18 @@ static void igb_set_rx_mode(struct net_device *netdev)
vmolr |= rd32(E1000_VMOLR(vfn)) &
 ~(E1000_VMOLR_ROPE | E1000_VMOLR_MPME | E1000_VMOLR_ROMPE);
 
-   /* enable Rx jumbo frames, no need for restriction */
+   /* enable Rx jumbo frames, restrict as needed to support build_skb */
vmolr &= ~E1000_VMOLR_RLPML_MASK;
-   vmolr |= MAX_JUMBO_FRAME_SIZE | E1000_VMOLR_LPE;
+   vmolr |= E1000_VMOLR_LPE;
+   vmolr |= (adapter->max_frame_size <= IGB_MAX_BUILD_SKB_SIZE) ?
+IGB_MAX_BUILD_SKB_SIZE : MAX

[net-next PATCH 16/27] arch/openrisc: Add option to skip DMA sync as a part of mapping

2016-10-25 Thread Alexander Duyck
This change allows us to pass DMA_ATTR_SKIP_CPU_SYNC which allows us to
avoid invoking cache line invalidation if the driver will just handle it
via a sync_for_cpu or sync_for_device call.

Cc: Jonas Bonn 
Signed-off-by: Alexander Duyck 
---
 arch/openrisc/kernel/dma.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
index 140c991..906998b 100644
--- a/arch/openrisc/kernel/dma.c
+++ b/arch/openrisc/kernel/dma.c
@@ -141,6 +141,9 @@
unsigned long cl;
dma_addr_t addr = page_to_phys(page) + offset;
 
+   if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
+   return addr;
+
switch (dir) {
case DMA_TO_DEVICE:
/* Flush the dcache for the requested range */



[net-next PATCH 25/27] igb: Update driver to make use of DMA_ATTR_SKIP_CPU_SYNC

2016-10-25 Thread Alexander Duyck
The ARM architecture provides a mechanism for deferring cache line
invalidation in the case of map/unmap.  This patch makes use of this
mechanism to avoid unnecessary synchronization.

A secondary effect of this change is that the portion of the page that has
been synchronized for use by the CPU should be writable and could be passed
up the stack (at least on ARM).

The last bit that occurred to me is that on architectures where the
sync_for_cpu call invalidates cache lines we were prefetching and then
invalidating the first 128 bytes of the packet.  To avoid that I have moved
the sync up to before we perform the prefetch and allocate the skbuff so
that we can actually make use of it.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/igb/igb_main.c |   53 ++---
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 4feca69..c8c458c 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3947,10 +3947,21 @@ static void igb_clean_rx_ring(struct igb_ring *rx_ring)
if (!buffer_info->page)
continue;
 
-   dma_unmap_page(rx_ring->dev,
-  buffer_info->dma,
-  PAGE_SIZE,
-  DMA_FROM_DEVICE);
+   /* Invalidate cache lines that may have been written to by
+* device so that we avoid corrupting memory.
+*/
+   dma_sync_single_range_for_cpu(rx_ring->dev,
+ buffer_info->dma,
+ buffer_info->page_offset,
+ IGB_RX_BUFSZ,
+ DMA_FROM_DEVICE);
+
+   /* free resources associated with mapping */
+   dma_unmap_page_attrs(rx_ring->dev,
+buffer_info->dma,
+PAGE_SIZE,
+DMA_FROM_DEVICE,
+DMA_ATTR_SKIP_CPU_SYNC);
__free_page(buffer_info->page);
 
buffer_info->page = NULL;
@@ -6808,12 +6819,6 @@ static void igb_reuse_rx_page(struct igb_ring *rx_ring,
 
/* transfer page from old buffer to new buffer */
*new_buff = *old_buff;
-
-   /* sync the buffer for use by the device */
-   dma_sync_single_range_for_device(rx_ring->dev, old_buff->dma,
-old_buff->page_offset,
-IGB_RX_BUFSZ,
-DMA_FROM_DEVICE);
 }
 
 static inline bool igb_page_is_reserved(struct page *page)
@@ -6934,6 +6939,13 @@ static struct sk_buff *igb_fetch_rx_buffer(struct 
igb_ring *rx_ring,
page = rx_buffer->page;
prefetchw(page);
 
+   /* we are reusing so sync this buffer for CPU use */
+   dma_sync_single_range_for_cpu(rx_ring->dev,
+ rx_buffer->dma,
+ rx_buffer->page_offset,
+ size,
+ DMA_FROM_DEVICE);
+
if (likely(!skb)) {
void *page_addr = page_address(page) +
  rx_buffer->page_offset;
@@ -6958,21 +6970,15 @@ static struct sk_buff *igb_fetch_rx_buffer(struct 
igb_ring *rx_ring,
prefetchw(skb->data);
}
 
-   /* we are reusing so sync this buffer for CPU use */
-   dma_sync_single_range_for_cpu(rx_ring->dev,
- rx_buffer->dma,
- rx_buffer->page_offset,
- size,
- DMA_FROM_DEVICE);
-
/* pull page into skb */
if (igb_add_rx_frag(rx_ring, rx_buffer, size, rx_desc, skb)) {
/* hand second half of page back to the ring */
igb_reuse_rx_page(rx_ring, rx_buffer);
} else {
/* we are not reusing the buffer so unmap it */
-   dma_unmap_page(rx_ring->dev, rx_buffer->dma,
-  PAGE_SIZE, DMA_FROM_DEVICE);
+   dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
+PAGE_SIZE, DMA_FROM_DEVICE,
+DMA_ATTR_SKIP_CPU_SYNC);
}
 
/* clear contents of rx_buffer */
@@ -7230,7 +7236,8 @@ static bool igb_alloc_mapped_page(struct igb_ring 
*rx_ring,
}
 
/* map page for use */
-   dma = dma_map_page(rx_ring->dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
+   dma = dma_map_page_attrs(rx_ring->dev, page, 0, PAGE_SIZE,
+DMA_FROM_DEVICE, DMA_ATTR_SKIP_CPU_SYNC);
 
/

[net-next PATCH 02/27] swiotlb-xen: Enforce return of DMA_ERROR_CODE in mapping function

2016-10-25 Thread Alexander Duyck
The mapping function should always return DMA_ERROR_CODE when a mapping has
failed as this is what the DMA API expects when a DMA error has occurred.
The current function for mapping a page in Xen was returning either
DMA_ERROR_CODE or 0 depending on where it failed.

On x86 DMA_ERROR_CODE is 0, but on other architectures such as ARM it is
~0. We need to make sure we return the same error value if either the
mapping failed or the device is not capable of accessing the mapping.

If we are returning DMA_ERROR_CODE as our error value we can drop the
function for checking the error code as the default is to compare the
return value against DMA_ERROR_CODE if no function is defined.
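
For reference, the driver-side check then reduces to the usual pattern
(sketch, not part of this patch):

        dma_addr_t dma = dma_map_page(dev, page, 0, size, DMA_FROM_DEVICE);

        if (dma_mapping_error(dev, dma)) {
                /* with no ->mapping_error hook this is dma == DMA_ERROR_CODE */
                return -ENOMEM;
        }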

Cc: Konrad Rzeszutek Wilk 
Signed-off-by: Alexander Duyck 
---
 arch/arm/xen/mm.c  |1 -
 arch/x86/xen/pci-swiotlb-xen.c |1 -
 drivers/xen/swiotlb-xen.c  |   18 ++
 include/xen/swiotlb-xen.h  |3 ---
 4 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
index d062f08..bd62d94 100644
--- a/arch/arm/xen/mm.c
+++ b/arch/arm/xen/mm.c
@@ -186,7 +186,6 @@ void xen_destroy_contiguous_region(phys_addr_t pstart, 
unsigned int order)
 EXPORT_SYMBOL(xen_dma_ops);
 
 static struct dma_map_ops xen_swiotlb_dma_ops = {
-   .mapping_error = xen_swiotlb_dma_mapping_error,
.alloc = xen_swiotlb_alloc_coherent,
.free = xen_swiotlb_free_coherent,
.sync_single_for_cpu = xen_swiotlb_sync_single_for_cpu,
diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
index 0e98e5d..a9fafb5 100644
--- a/arch/x86/xen/pci-swiotlb-xen.c
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -19,7 +19,6 @@
 int xen_swiotlb __read_mostly;
 
 static struct dma_map_ops xen_swiotlb_dma_ops = {
-   .mapping_error = xen_swiotlb_dma_mapping_error,
.alloc = xen_swiotlb_alloc_coherent,
.free = xen_swiotlb_free_coherent,
.sync_single_for_cpu = xen_swiotlb_sync_single_for_cpu,
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 87e6035..b8014bf 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -416,11 +416,12 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, 
struct page *page,
/*
 * Ensure that the address returned is DMA'ble
 */
-   if (!dma_capable(dev, dev_addr, size)) {
-   swiotlb_tbl_unmap_single(dev, map, size, dir);
-   dev_addr = 0;
-   }
-   return dev_addr;
+   if (dma_capable(dev, dev_addr, size))
+   return dev_addr;
+
+   swiotlb_tbl_unmap_single(dev, map, size, dir);
+
+   return DMA_ERROR_CODE;
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_map_page);
 
@@ -648,13 +649,6 @@ void xen_swiotlb_unmap_page(struct device *hwdev, 
dma_addr_t dev_addr,
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_sync_sg_for_device);
 
-int
-xen_swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr)
-{
-   return !dma_addr;
-}
-EXPORT_SYMBOL_GPL(xen_swiotlb_dma_mapping_error);
-
 /*
  * Return whether the given device DMA address mask can be supported
  * properly.  For example, if your device can only drive the low 24-bits
diff --git a/include/xen/swiotlb-xen.h b/include/xen/swiotlb-xen.h
index 7c35e27..a0083be 100644
--- a/include/xen/swiotlb-xen.h
+++ b/include/xen/swiotlb-xen.h
@@ -51,9 +51,6 @@ extern void xen_swiotlb_unmap_page(struct device *hwdev, 
dma_addr_t dev_addr,
   int nelems, enum dma_data_direction dir);
 
 extern int
-xen_swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr);
-
-extern int
 xen_swiotlb_dma_supported(struct device *hwdev, u64 mask);
 
 extern int



Re: [PATCH net-next] ibmveth: calculate correct gso_size and set gso_type

2016-10-25 Thread Jonathan Maxwell
>> + u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);

> Compiler may optimize this, but maybe move hdr_len to [*] ?

There are other places in the stack where a u16 is used for the
same purpose. So I'd rather stick to that convention.

I'll make the other formatting changes you suggested and
resubmit as v1.

Thanks

Jon

On Tue, Oct 25, 2016 at 9:31 PM, Marcelo Ricardo Leitner
 wrote:
> On Tue, Oct 25, 2016 at 04:13:41PM +1100, Jon Maxwell wrote:
>> We recently encountered a bug where a few customers using ibmveth on the
>> same LPAR hit an issue where a TCP session hung when large receive was
>> enabled. Closer analysis revealed that the session was stuck because the
>> one side was advertising a zero window repeatedly.
>>
>> We narrowed this down to the fact the ibmveth driver did not set gso_size
>> which is translated by TCP into the MSS later up the stack. The MSS is
>> used to calculate the TCP window size and as that was abnormally large,
>> it was calculating a zero window, even though the socket's receive buffer
>> was completely empty.
>>
>> We were able to reproduce this and worked with IBM to fix this. Thanks Tom
>> and Marcelo for all your help and review on this.
>>
>> The patch fixes both our internal reproduction tests and our customers tests.
>>
>> Signed-off-by: Jon Maxwell 
>> ---
>>  drivers/net/ethernet/ibm/ibmveth.c | 19 +++
>>  1 file changed, 19 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c 
>> b/drivers/net/ethernet/ibm/ibmveth.c
>> index 29c05d0..3028c33 100644
>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int 
>> budget)
>>   int frames_processed = 0;
>>   unsigned long lpar_rc;
>>   struct iphdr *iph;
>> + bool large_packet = 0;
>> + u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>
> Compiler may optimize this, but maybe move hdr_len to [*] ?
>
>>
>>  restart_poll:
>>   while (frames_processed < budget) {
>> @@ -1236,10 +1238,27 @@ static int ibmveth_poll(struct napi_struct *napi, 
>> int budget)
>>   iph->check = 0;
>>   iph->check = 
>> ip_fast_csum((unsigned char *)iph, iph->ihl);
>>   adapter->rx_large_packets++;
>> + large_packet = 1;
>>   }
>>   }
>>   }
>>
>> + if (skb->len > netdev->mtu) {
>
> [*]
>
>> + iph = (struct iphdr *)skb->data;
>> + if (be16_to_cpu(skb->protocol) == ETH_P_IP && 
>> iph->protocol == IPPROTO_TCP) {
>
> The if line above is too long, should be broken in two.
>
>> + hdr_len += sizeof(struct iphdr);
>> + skb_shinfo(skb)->gso_type = 
>> SKB_GSO_TCPV4;
>> + skb_shinfo(skb)->gso_size = 
>> netdev->mtu - hdr_len;
>> + } else if (be16_to_cpu(skb->protocol) == 
>> ETH_P_IPV6 &&
>> + iph->protocol == IPPROTO_TCP) {
> ^
> And this one should start 3 spaces later, right below be16_
>
>   Marcelo
>
>> + hdr_len += sizeof(struct ipv6hdr);
>> + skb_shinfo(skb)->gso_type = 
>> SKB_GSO_TCPV6;
>> + skb_shinfo(skb)->gso_size = 
>> netdev->mtu - hdr_len;
>> + }
>> + if (!large_packet)
>> + adapter->rx_large_packets++;
>> + }
>> +
>>   napi_gro_receive(napi, skb);/* send it up */
>>
>>   netdev->stats.rx_packets++;
>> --
>> 1.8.3.1
>>


[PATCH v2] cw1200: fix bogus maybe-uninitialized warning

2016-10-25 Thread Arnd Bergmann
On x86, the cw1200 driver produces a rather silly warning about the
possible use of the 'ret' variable without an initialization
presumably after being confused by the architecture specific definition
of WARN_ON:

drivers/net/wireless/st/cw1200/wsm.c: In function ‘wsm_handle_rx’:
drivers/net/wireless/st/cw1200/wsm.c:1457:9: error: ‘ret’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized]

We have already checked that 'count' is larger than 0 here, so
we know that 'ret' is initialized. Changing the 'for' loop
into do/while also makes this clear to the compiler.

Suggested-by: David Laight 
Signed-off-by: Arnd Bergmann 
---
 drivers/net/wireless/st/cw1200/wsm.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

v2: rewrite based on David Laight's suggestion, the first version
was completely wrong.

diff --git a/drivers/net/wireless/st/cw1200/wsm.c 
b/drivers/net/wireless/st/cw1200/wsm.c
index 680d60eabc75..ed93bf3474ec 100644
--- a/drivers/net/wireless/st/cw1200/wsm.c
+++ b/drivers/net/wireless/st/cw1200/wsm.c
@@ -379,7 +379,6 @@ static int wsm_multi_tx_confirm(struct cw1200_common *priv,
 {
int ret;
int count;
-   int i;
 
count = WSM_GET32(buf);
if (WARN_ON(count <= 0))
@@ -395,11 +394,10 @@ static int wsm_multi_tx_confirm(struct cw1200_common 
*priv,
}
 
cw1200_debug_txed_multi(priv, count);
-   for (i = 0; i < count; ++i) {
+   do {
ret = wsm_tx_confirm(priv, buf, link_id);
-   if (ret)
-   return ret;
-   }
+   } while (!ret && --count);
+
return ret;
 
 underflow:
-- 
2.9.0



Re: [PATCH] cw1200: fix bogus maybe-uninitialized warning

2016-10-25 Thread Arnd Bergmann
On Tuesday, October 25, 2016 1:24:55 PM CEST David Laight wrote:
> > diff --git a/drivers/net/wireless/st/cw1200/wsm.c 
> > b/drivers/net/wireless/st/cw1200/wsm.c
> > index 680d60eabc75..094e6637ade2 100644
> > --- a/drivers/net/wireless/st/cw1200/wsm.c
> > +++ b/drivers/net/wireless/st/cw1200/wsm.c
> > @@ -385,14 +385,13 @@ static int wsm_multi_tx_confirm(struct cw1200_common 
> > *priv,
> >   if (WARN_ON(count <= 0))
> >   return -EINVAL;
> > 
> > - if (count > 1) {
> > - /* We already released one buffer, now for the rest */
> > - ret = wsm_release_tx_buffer(priv, count - 1);
> > - if (ret < 0)
> > - return ret;
> > - else if (ret > 0)
> > - cw1200_bh_wakeup(priv);
> > - }
> > + /* We already released one buffer, now for the rest */
> > + ret = wsm_release_tx_buffer(priv, count - 1);
> > + if (ret < 0)
> > + return ret;
> > +
> > + if (ret > 0)
> > + cw1200_bh_wakeup(priv);
> 
> That doesn't look equivalent to me (when count == 1).

Ah, that's what I missed, thanks for pointing that out!

> > 
> >   cw1200_debug_txed_multi(priv, count);
> >   for (i = 0; i < count; ++i) {
> 
> Convert this loop into a do ... while so the body executes at least once.

Good idea. Version 2 coming now.

Arnd


Re: [PATCH] virtio-net: Update the mtu code to match virtio spec

2016-10-25 Thread Aaron Conole
Aaron Conole  writes:

>> From: Aaron Conole 
>>
>> The virtio committee recently ratified a change, VIRTIO-152, which
>> defines the mtu field to be 'max' MTU, not simply desired MTU.
>>
>> This commit brings the virtio-net device in compliance with VIRTIO-152.
>>
>> Additionally, drop the max_mtu branch - it cannot be taken since the u16
>> returned by virtio_cread16 will never exceed the initial value of
>> max_mtu.
>>
>> Cc: "Michael S. Tsirkin" 
>> Cc: Jarod Wilson 
>> Signed-off-by: Aaron Conole 
>> ---
>
> Sorry about the subject line, David.  This is targeted at net-next, and
> it appears my from was mangled.  Would you like me to resubmit with
> these details corrected?

I answered my own question.  Sorry for the noise.


[PATCH v2 net-next] virtio-net: Update the mtu code to match virtio spec

2016-10-25 Thread Aaron Conole
The virtio committee recently ratified a change, VIRTIO-152, which
defines the mtu field to be 'max' MTU, not simply desired MTU.

This commit brings the virtio-net device in compliance with VIRTIO-152.

Additionally, drop the max_mtu branch - it cannot be taken since the u16
returned by virtio_cread16 will never exceed the initial value of
max_mtu.

Signed-off-by: Aaron Conole 
Acked-by: "Michael S. Tsirkin" 
Acked-by: Jarod Wilson 
---
Nothing code-wise has changed, but I've included the ACKs and fixed up the
subject line.

 drivers/net/virtio_net.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 720809f..2cafd12 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1870,10 +1870,12 @@ static int virtnet_probe(struct virtio_device *vdev)
mtu = virtio_cread16(vdev,
 offsetof(struct virtio_net_config,
  mtu));
-   if (mtu < dev->min_mtu || mtu > dev->max_mtu)
+   if (mtu < dev->min_mtu) {
__virtio_clear_bit(vdev, VIRTIO_NET_F_MTU);
-   else
+   } else {
dev->mtu = mtu;
+   dev->max_mtu = mtu;
+   }
}
 
if (vi->any_header_sg)
-- 
2.7.4



Re: [PATCH] virtio-net: Update the mtu code to match virtio spec

2016-10-25 Thread Aaron Conole
> From: Aaron Conole 
>
> The virtio committee recently ratified a change, VIRTIO-152, which
> defines the mtu field to be 'max' MTU, not simply desired MTU.
>
> This commit brings the virtio-net device in compliance with VIRTIO-152.
>
> Additionally, drop the max_mtu branch - it cannot be taken since the u16
> returned by virtio_cread16 will never exceed the initial value of
> max_mtu.
>
> Cc: "Michael S. Tsirkin" 
> Cc: Jarod Wilson 
> Signed-off-by: Aaron Conole 
> ---

Sorry about the subject line, David.  This is targeted at net-next, and
it appears my from was mangled.  Would you like me to resubmit with
these details corrected?

-Aaron


Re: [PATCH net] udp: fix IP_CHECKSUM handling

2016-10-25 Thread Eric Dumazet
On Tue, 2016-10-25 at 15:43 -0400, Willem de Bruijn wrote:
> On Sun, Oct 23, 2016 at 9:03 PM, Eric Dumazet  wrote:
> > From: Eric Dumazet 
> >
> > First bug was added in commit ad6f939ab193 ("ip: Add offset parameter to
> > ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
> > AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
> > ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
> >
> > Then commit e6afc8ace6dd ("udp: remove headers from UDP packets before
> > queueing") forgot to adjust the offsets now UDP headers are pulled
> > before skb are put in receive queue.
> >
> > Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv")
> > Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
> > Signed-off-by: Eric Dumazet 
> > Cc: Sam Kumar 
> > Cc: Willem de Bruijn 
> > ---
> > Tom, I would appreciate your feedback on this patch, I presume
> > you have tests to verify IP_CHECKSUM feature ? Thanks !
> >
> 
> Tested-by: Willem de Bruijn 
> 
> Thanks for fixing, Eric.
> 
> Tested with 
> https://github.com/wdebruij/kerneltools/blob/master/tests/recv_cmsg_ipchecksum.c

Thanks a lot Willem for cooking this test !




[PATCH v2] netfilter: fix type mismatch with error return from nft_parse_u32_check

2016-10-25 Thread John W. Linville
Commit 36b701fae12ac ("netfilter: nf_tables: validate maximum value of
u32 netlink attributes") introduced nft_parse_u32_check with a return
value of "unsigned int", yet on error it returns "-ERANGE".

This patch corrects the mismatch by changing the return value to "int",
which happens to match the actual users of nft_parse_u32_check already.
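
To illustrate the hazard the prototype invites (contrived sketch, not code
from the tree): a caller that stored the result in the declared type could
never see the error:

        unsigned int err = nft_parse_u32_check(attr, U8_MAX, &value);

        if (err < 0)    /* always false: err is unsigned, -ERANGE has wrapped */
                return err;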

Found by Coverity, CID 1373930.

Note that commit 21a9e0f1568ea ("netfilter: nft_exthdr: fix error
handling in nft_exthdr_init()") attempted to address the issue, but
did not address the return type of nft_parse_u32_check.

Signed-off-by: John W. Linville 
Cc: Laura Garcia Liebana 
Cc: Pablo Neira Ayuso 
Cc: Dan Carpenter 
Fixes: 36b701fae12ac ("netfilter: nf_tables: validate maximum value...")
---
 include/net/netfilter/nf_tables.h | 2 +-
 net/netfilter/nf_tables_api.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h 
b/include/net/netfilter/nf_tables.h
index 5031e072567b..da43f50b39c6 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -145,7 +145,7 @@ static inline enum nft_registers nft_type_to_reg(enum 
nft_data_types type)
return type == NFT_DATA_VERDICT ? NFT_REG_VERDICT : NFT_REG_1 * 
NFT_REG_SIZE / NFT_REG32_SIZE;
 }
 
-unsigned int nft_parse_u32_check(const struct nlattr *attr, int max, u32 
*dest);
+int nft_parse_u32_check(const struct nlattr *attr, int max, u32 *dest);
 unsigned int nft_parse_register(const struct nlattr *attr);
 int nft_dump_register(struct sk_buff *skb, unsigned int attr, unsigned int 
reg);
 
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 24db22257586..32fa4f08444a 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -4421,7 +4421,7 @@ static int nf_tables_check_loops(const struct nft_ctx 
*ctx,
  * Otherwise a 0 is returned and the attribute value is stored in the
  * destination variable.
  */
-unsigned int nft_parse_u32_check(const struct nlattr *attr, int max, u32 *dest)
+int nft_parse_u32_check(const struct nlattr *attr, int max, u32 *dest)
 {
u32 val;
 
-- 
2.7.4



Re: [PATCH] netfilter: fix type mismatch with error return from nft_parse_u32_check

2016-10-25 Thread John W. Linville
On Tue, Oct 25, 2016 at 03:08:04PM -0400, John W. Linville wrote:
> Commit 36b701fae12ac ("netfilter: nf_tables: validate maximum value of
> u32 netlink attributes") introduced nft_parse_u32_check with a return
> value of "unsigned int", yet on error it returns "-ERANGE".
> 
> This patch corrects the mismatch by changing the return value to "int",
> which happens to match the actual users of nft_parse_u32_check already.
> 
> Found by Coverity, CID 1373930.
> 
> Note that commit 21a9e0f1568ea ("netfilter: nft_exthdr: fix error
> handling in nft_exthdr_init()") attempted to address the issue, but
> did not address the return type of nft_parse_u32_check.
> 
> Signed-off-by: John W. Linville 
> Cc: Laura Garcia Liebana 
> Cc: Pablo Neira Ayuso 
> Cc: Dan Carpenter 
> Fixes: 0eadf37afc250 ("netfilter: nf_tables: validate maximum value...")

The Fixes line is incorrect -- corrected patch to follow!

John
-- 
John W. LinvilleSomeday the world will need a hero, and you
linvi...@tuxdriver.com  might be all we have.  Be ready.


Re: [PATCH net] udp: fix IP_CHECKSUM handling

2016-10-25 Thread Willem de Bruijn
On Sun, Oct 23, 2016 at 9:03 PM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> First bug was added in commit ad6f939ab193 ("ip: Add offset parameter to
> ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
> AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
> ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
>
> Then commit e6afc8ace6dd ("udp: remove headers from UDP packets before
> queueing") forgot to adjust the offsets now UDP headers are pulled
> before skb are put in receive queue.
>
> Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv")
> Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
> Signed-off-by: Eric Dumazet 
> Cc: Sam Kumar 
> Cc: Willem de Bruijn 
> ---
> Tom, I would appreciate your feedback on this patch, I presume
> you have tests to verify IP_CHECKSUM feature ? Thanks !
>

Tested-by: Willem de Bruijn 

Thanks for fixing, Eric.

Tested with 
https://github.com/wdebruij/kerneltools/blob/master/tests/recv_cmsg_ipchecksum.c


[PATCH] netfilter: fix type mismatch with error return from nft_parse_u32_check

2016-10-25 Thread John W. Linville
Commit 36b701fae12ac ("netfilter: nf_tables: validate maximum value of
u32 netlink attributes") introduced nft_parse_u32_check with a return
value of "unsigned int", yet on error it returns "-ERANGE".

This patch corrects the mismatch by changing the return value to "int",
which happens to match the actual users of nft_parse_u32_check already.

Found by Coverity, CID 1373930.

Note that commit 21a9e0f1568ea ("netfilter: nft_exthdr: fix error
handling in nft_exthdr_init()") attempted to address the issue, but
did not address the return type of nft_parse_u32_check.

Signed-off-by: John W. Linville 
Cc: Laura Garcia Liebana 
Cc: Pablo Neira Ayuso 
Cc: Dan Carpenter 
Fixes: 0eadf37afc250 ("netfilter: nf_tables: validate maximum value...")
---
 include/net/netfilter/nf_tables.h | 2 +-
 net/netfilter/nf_tables_api.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h 
b/include/net/netfilter/nf_tables.h
index 5031e072567b..da43f50b39c6 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -145,7 +145,7 @@ static inline enum nft_registers nft_type_to_reg(enum 
nft_data_types type)
return type == NFT_DATA_VERDICT ? NFT_REG_VERDICT : NFT_REG_1 * 
NFT_REG_SIZE / NFT_REG32_SIZE;
 }
 
-unsigned int nft_parse_u32_check(const struct nlattr *attr, int max, u32 
*dest);
+int nft_parse_u32_check(const struct nlattr *attr, int max, u32 *dest);
 unsigned int nft_parse_register(const struct nlattr *attr);
 int nft_dump_register(struct sk_buff *skb, unsigned int attr, unsigned int 
reg);
 
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 24db22257586..32fa4f08444a 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -4421,7 +4421,7 @@ static int nf_tables_check_loops(const struct nft_ctx 
*ctx,
  * Otherwise a 0 is returned and the attribute value is stored in the
  * destination variable.
  */
-unsigned int nft_parse_u32_check(const struct nlattr *attr, int max, u32 *dest)
+int nft_parse_u32_check(const struct nlattr *attr, int max, u32 *dest)
 {
u32 val;
 
-- 
2.7.4



Re: [PATCH] cw1200: fix bogus maybe-uninitialized warning

2016-10-25 Thread Solomon Peachy
On Tue, Oct 25, 2016 at 01:24:55PM +, David Laight wrote:
> > -   if (count > 1) {
> > -   /* We already released one buffer, now for the rest */
> > -   ret = wsm_release_tx_buffer(priv, count - 1);
> > -   if (ret < 0)
> > -   return ret;
> > -   else if (ret > 0)
> > -   cw1200_bh_wakeup(priv);
> > -   }
> > +   /* We already released one buffer, now for the rest */
> > +   ret = wsm_release_tx_buffer(priv, count - 1);
> > +   if (ret < 0)
> > +   return ret;
> > +
> > +   if (ret > 0)
> > +   cw1200_bh_wakeup(priv);
> 
> That doesn't look equivalent to me (when count == 1).

I concur, this patch should not be applied in its current form.

 - Solomon
-- 
Solomon Peachy pizza at shaftnet dot org
Delray Beach, FL  ^^ (email/xmpp) ^^
Quidquid latine dictum sit, altum viditur.




RE: nfs NULL-dereferencing in net-next

2016-10-25 Thread Yotam Gigi

>-Original Message-
>From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
>Behalf Of Jakub Kicinski
>Sent: Monday, October 17, 2016 10:20 PM
>To: Andy Adamson ; Anna Schumaker
>; linux-...@vger.kernel.org
>Cc: netdev@vger.kernel.org; Trond Myklebust 
>Subject: nfs NULL-dereferencing in net-next
>
>Hi!
>
>I'm hitting this reliably on net-next, HEAD at 3f3177bb680f
>("fsl/fman: fix error return code in mac_probe()").


I see the same thing. It happens constantly on some of my machines, making them
completely unusable.

I bisected it and got to the commit:

commit 04ea1b3e6d8ed4978bb608c1748530af3de8c274
Author: Andy Adamson 
Date:   Fri Sep 9 09:22:27 2016 -0400

NFS add xprt switch addrs test to match client

Signed-off-by: Andy Adamson 
Signed-off-by: Anna Schumaker 


>
>[   23.409633] BUG: unable to handle kernel NULL pointer dereference at
>0172
>[   23.418716] IP: [] rpc_clnt_xprt_switch_has_addr+0xc/0x40
>[sunrpc]
>[   23.427574] PGD 859020067 [   23.430472] PUD 858f2d067
>PMD 0 [   23.434311]
>[   23.436133] Oops:  [#1] PREEMPT SMP
>[   23.440506] Modules linked in: nfsv4 ip6table_filter ip6_tables 
>iptable_filter
>ip_tables ebtable_nat ebtables x_tables intel_ri
>[   23.505915] CPU: 1 PID: 1067 Comm: mount.nfs Not tainted 4.8.0-perf-13951-
>g3f3177bb680f #51
>[   23.515363] Hardware name: Dell Inc. PowerEdge T630/0W9WXC, BIOS 1.2.10
>03/10/2015
>[   23.523937] task: 983e9086ea00 task.stack: ac6c0a57c000
>[   23.530641] RIP: 0010:[]  []
>rpc_clnt_xprt_switch_has_addr+0xc/0x40 [sunrpc]
>[   23.542229] RSP: 0018:ac6c0a57fb28  EFLAGS: 00010a97
>[   23.548255] RAX: c80214ac RBX: 983e97c7b000 RCX: 
>983e9b3bc180
>[   23.556320] RDX: 0001 RSI: 983e9928ed28 RDI: 
>ffea
>[   23.564386] RBP: ac6c0a57fb38 R08: 983e97090630 R09: 
>983e9928ed30
>[   23.572452] R10: ac6c0a57fba0 R11: 0010 R12: 
>ac6c0a57fba0
>[   23.580517] R13: 983e9928ed28 R14:  R15: 
>983e91360560
>[   23.588585] FS:  7f4c348aa880() GS:983e9f24()
>knlGS:
>[   23.597742] CS:  0010 DS:  ES:  CR0: 80050033
>[   23.604251] CR2: 0172 CR3: 000850a5f000 CR4:
>001406e0
>[   23.612316] Stack:
>[   23.614648]  983e97c7b000 ac6c0a57fba0 ac6c0a57fb90 
>c04d38c3
>[   23.623331]  983e91360500 983e9928ed30 c0b9e560
>983e913605b8
>[   23.632016]  983e9882e800 983e9882e800 ac6c0a57fc30 
>ac6c0a57fdb8
>[   23.640706] Call Trace:
>[   23.643535]  [] nfs_get_client+0x123/0x340 [nfs]
>[   23.650542]  [] nfs4_set_client+0x80/0xb0 [nfsv4]
>[   23.657642]  [] nfs4_create_server+0x115/0x2a0 [nfsv4]
>[   23.665230]  [] nfs4_remote_mount+0x2e/0x60 [nfsv4]
>[   23.672519]  [] mount_fs+0x3a/0x160
>[   23.678254]  [] ? alloc_vfsmnt+0x19e/0x230
>[   23.684669]  [] vfs_kern_mount+0x67/0x110
>[   23.690990]  [] nfs_do_root_mount+0x84/0xc0 [nfsv4]
>[   23.698284]  [] nfs4_try_mount+0x37/0x50 [nfsv4]
>[   23.705287]  [] nfs_fs_mount+0x2d1/0xa70 [nfs]
>[   23.712092]  [] ? find_next_bit+0x18/0x20
>[   23.718413]  [] ? nfs_remount+0x3c0/0x3c0 [nfs]
>[   23.725316]  [] ? nfs_clone_super+0x130/0x130 [nfs]
>[   23.732606]  [] mount_fs+0x3a/0x160
>[   23.738340]  [] ? alloc_vfsmnt+0x19e/0x230
>[   23.744755]  [] vfs_kern_mount+0x67/0x110
>[   23.751071]  [] do_mount+0x1bf/0xc70
>[   23.756904]  [] ? copy_mount_options+0xbb/0x220
>[   23.763803]  [] SyS_mount+0x83/0xd0
>[   23.769538]  [] entry_SYSCALL_64_fastpath+0x17/0x98
>[   23.776817] Code: 01 00 48 8b 93 f8 04 00 00 44 89 e6 48 c7 c7 98 b2 43 c0 
>e8 9f 0d d4
>f9 eb c0 0f 1f 44 00 00 0f 1f 44 00 00
>[   23.802909] RIP  [] rpc_clnt_xprt_switch_has_addr+0xc/0x40
>[sunrpc]
>[   23.811857]  RSP 
>[   23.815839] CR2: 0172
>[   23.819629] ---[ end trace 9958eca92c9eeafe ]---
>[   23.827345] note: mount.nfs[1067] exited with preempt_count 1


Re: [PATCH net-next RFC WIP] Patch for XDP support for virtio_net

2016-10-25 Thread Jakub Kicinski
On Sat, 22 Oct 2016 04:07:23 +, Shrijeet Mukherjee wrote:
> + act = bpf_prog_run_xdp(xdp_prog, &xdp);
> + switch (act) {
> + case XDP_PASS:
> + return XDP_PASS;
> + case XDP_TX:
> + case XDP_ABORTED:
> + case XDP_DROP:
> + return XDP_DROP;
> + default:
> + bpf_warn_invalid_xdp_action(act);
> + }
> + }
> + return XDP_PASS;

FWIW you may want to move the default label before XDP_TX/XDP_ABORT,
to get the behaviour to be drop on unknown ret code.
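
I.e. something along these lines (untested sketch):

        switch (act) {
        case XDP_PASS:
                return XDP_PASS;
        default:
                bpf_warn_invalid_xdp_action(act);
                /* fall through */
        case XDP_TX:
        case XDP_ABORTED:
        case XDP_DROP:
                return XDP_DROP;
        }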


Re: [PATCH 0/2] at803x: don't power-down SGMII link

2016-10-25 Thread Timur Tabi

Zefir Kurtisi wrote:

In a device where the ar8031 operates in SGMII mode, we
observed that after a suspend-resume cycle in very rare
cases the copper side autonegotiation succeeds but the
SGMII side fails to come up.

As a work-around, a patch was provided that on suspend and
resume powers the SGMII link down and up along with the
copper side. This fixed the observed failure, but
introduced a regression Timur Tabi observed: once the SGMII
is powered down, the PHY is inaccessible to the CPU and,
as a consequence, can no longer be re-initialized after suspend.

Since the original issue could not be reproduced by others,
this series provides an alternative handling:
* the first patch reverts the previous fix that powers down
   SGMII
* the second patch adds double-checking for the observed
   failure condition

Zefir Kurtisi (2):
   Revert "at803x: fix suspend/resume for SGMII link"
   at803x: double check SGMII side autoneg


Tested-by: Timur Tabi 

With these patches, the problem I was seeing no longer occurs, and the 
new code does not appear to break anything.  As before, I still have 
never seen the original problem, but this patchset seems to work for 
both of us.


--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.


Re: [PATCH] virtio-net: Update the mtu code to match virtio spec

2016-10-25 Thread Jarod Wilson
On Tue, Oct 25, 2016 at 12:35:35PM -0400, Aaron Conole wrote:
> From: Aaron Conole 
> 
> The virtio committee recently ratified a change, VIRTIO-152, which
> defines the mtu field to be 'max' MTU, not simply desired MTU.
> 
> This commit brings the virtio-net device in compliance with VIRTIO-152.
> 
> Additionally, drop the max_mtu branch - it cannot be taken since the u16
> returned by virtio_cread16 will never exceed the initial value of
> max_mtu.
> 
> Cc: "Michael S. Tsirkin" 
> Cc: Jarod Wilson 
> Signed-off-by: Aaron Conole 

Worksforme.

Acked-by: Jarod Wilson 

-- 
Jarod Wilson
ja...@redhat.com



[PATCH] net: bonding: use new api ethtool_{get|set}_link_ksettings

2016-10-25 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to the new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/bonding/bond_main.c |   16 
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c9944d8..5708f17 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4080,16 +4080,16 @@ static netdev_tx_t bond_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
return ret;
 }
 
-static int bond_ethtool_get_settings(struct net_device *bond_dev,
-struct ethtool_cmd *ecmd)
+static int bond_ethtool_get_link_ksettings(struct net_device *bond_dev,
+  struct ethtool_link_ksettings *cmd)
 {
struct bonding *bond = netdev_priv(bond_dev);
unsigned long speed = 0;
struct list_head *iter;
struct slave *slave;
 
-   ecmd->duplex = DUPLEX_UNKNOWN;
-   ecmd->port = PORT_OTHER;
+   cmd->base.duplex = DUPLEX_UNKNOWN;
+   cmd->base.port = PORT_OTHER;
 
/* Since bond_slave_can_tx returns false for all inactive or down 
slaves, we
 * do not need to check mode.  Though link speed might not represent
@@ -4100,12 +4100,12 @@ static int bond_ethtool_get_settings(struct net_device 
*bond_dev,
if (bond_slave_can_tx(slave)) {
if (slave->speed != SPEED_UNKNOWN)
speed += slave->speed;
-   if (ecmd->duplex == DUPLEX_UNKNOWN &&
+   if (cmd->base.duplex == DUPLEX_UNKNOWN &&
slave->duplex != DUPLEX_UNKNOWN)
-   ecmd->duplex = slave->duplex;
+   cmd->base.duplex = slave->duplex;
}
}
-   ethtool_cmd_speed_set(ecmd, speed ? : SPEED_UNKNOWN);
+   cmd->base.speed = speed ? : SPEED_UNKNOWN;
 
return 0;
 }
@@ -4121,8 +4121,8 @@ static void bond_ethtool_get_drvinfo(struct net_device 
*bond_dev,
 
 static const struct ethtool_ops bond_ethtool_ops = {
.get_drvinfo= bond_ethtool_get_drvinfo,
-   .get_settings   = bond_ethtool_get_settings,
.get_link   = ethtool_op_get_link,
+   .get_link_ksettings = bond_ethtool_get_link_ksettings,
 };
 
 static const struct net_device_ops bond_netdev_ops = {
-- 
1.7.4.4



Re: [PATCH] virtio-net: Update the mtu code to match virtio spec

2016-10-25 Thread Michael S. Tsirkin
On Tue, Oct 25, 2016 at 12:35:35PM -0400, Aaron Conole wrote:
> From: Aaron Conole 
> 
> The virtio committee recently ratified a change, VIRTIO-152, which
> defines the mtu field to be 'max' MTU, not simply desired MTU.
> 
> This commit brings the virtio-net device in compliance with VIRTIO-152.
> 
> Additionally, drop the max_mtu branch - it cannot be taken since the u16
> returned by virtio_cread16 will never exceed the initial value of
> max_mtu.
> 
> Cc: "Michael S. Tsirkin" 
> Cc: Jarod Wilson 
> Signed-off-by: Aaron Conole 

Acked-by: Michael S. Tsirkin 

> ---
>  drivers/net/virtio_net.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 720809f..2cafd12 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1870,10 +1870,12 @@ static int virtnet_probe(struct virtio_device *vdev)
>   mtu = virtio_cread16(vdev,
>offsetof(struct virtio_net_config,
> mtu));
> - if (mtu < dev->min_mtu || mtu > dev->max_mtu)
> + if (mtu < dev->min_mtu) {
>   __virtio_clear_bit(vdev, VIRTIO_NET_F_MTU);
> - else
> + } else {
>   dev->mtu = mtu;
> + dev->max_mtu = mtu;
> + }
>   }
>  
>   if (vi->any_header_sg)
> -- 
> 2.7.4


[PATCH] virtio-net: Update the mtu code to match virtio spec

2016-10-25 Thread Aaron Conole
From: Aaron Conole 

The virtio committee recently ratified a change, VIRTIO-152, which
defines the mtu field to be 'max' MTU, not simply desired MTU.

This commit brings the virtio-net device in compliance with VIRTIO-152.

Additionally, drop the max_mtu branch - it cannot be taken since the u16
returned by virtio_cread16 will never exceed the initial value of
max_mtu.

Cc: "Michael S. Tsirkin" 
Cc: Jarod Wilson 
Signed-off-by: Aaron Conole 
---
 drivers/net/virtio_net.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 720809f..2cafd12 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1870,10 +1870,12 @@ static int virtnet_probe(struct virtio_device *vdev)
mtu = virtio_cread16(vdev,
 offsetof(struct virtio_net_config,
  mtu));
-   if (mtu < dev->min_mtu || mtu > dev->max_mtu)
+   if (mtu < dev->min_mtu) {
__virtio_clear_bit(vdev, VIRTIO_NET_F_MTU);
-   else
+   } else {
dev->mtu = mtu;
+   dev->max_mtu = mtu;
+   }
}
 
if (vi->any_header_sg)
-- 
2.7.4



[PATCH net] sctp: validate chunk len before actually using it

2016-10-25 Thread Marcelo Ricardo Leitner
Andrey Konovalov reported that KASAN detected that SCTP was using a slab
beyond its boundaries. This happened because, when handling out of the
blue packets in sctp_sf_ootb(), the chunk length was only checked after
the first chunk had already been processed, so only the 2nd and
subsequent chunks were validated.

The fix is to just move the check upwards so it's also validated for the
1st chunk.

Reported-by: Andrey Konovalov 
Tested-by: Andrey Konovalov 
Signed-off-by: Marcelo Ricardo Leitner 
---

Hi. Please consider this for -stable too. Thanks

 net/sctp/sm_statefuns.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 
026e3bca4a94bd34b418d5e6947f7182c1512358..8ec20a64a3f8055a0c3576627c5ec5dad7e99ca8
 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -3422,6 +3422,12 @@ sctp_disposition_t sctp_sf_ootb(struct net *net,
return sctp_sf_violation_chunklen(net, ep, asoc, type, 
arg,
  commands);
 
+   /* Report violation if chunk len overflows */
+   ch_end = ((__u8 *)ch) + SCTP_PAD4(ntohs(ch->length));
+   if (ch_end > skb_tail_pointer(skb))
+   return sctp_sf_violation_chunklen(net, ep, asoc, type, 
arg,
+ commands);
+
/* Now that we know we at least have a chunk header,
 * do things that are type appropriate.
 */
@@ -3453,12 +3459,6 @@ sctp_disposition_t sctp_sf_ootb(struct net *net,
}
}
 
-   /* Report violation if chunk len overflows */
-   ch_end = ((__u8 *)ch) + SCTP_PAD4(ntohs(ch->length));
-   if (ch_end > skb_tail_pointer(skb))
-   return sctp_sf_violation_chunklen(net, ep, asoc, type, 
arg,
- commands);
-
ch = (sctp_chunkhdr_t *) ch_end;
} while (ch_end < skb_tail_pointer(skb));
 
-- 
2.7.4



RE: [PATCH] netfilter: ip_vs_sync: fix bogus maybe-uninitialized warning

2016-10-25 Thread David Laight
From: Arnd Bergmann
> Sent: 24 October 2016 21:22
> On Monday, October 24, 2016 10:47:54 PM CEST Julian Anastasov wrote:
> > > diff --git a/net/netfilter/ipvs/ip_vs_sync.c 
> > > b/net/netfilter/ipvs/ip_vs_sync.c
> > > index 1b07578bedf3..9350530c16c1 100644
> > > --- a/net/netfilter/ipvs/ip_vs_sync.c
> > > +++ b/net/netfilter/ipvs/ip_vs_sync.c
> > > @@ -283,6 +283,7 @@ struct ip_vs_sync_buff {
> > >   */
> > >  static void ntoh_seq(struct ip_vs_seq *no, struct ip_vs_seq *ho)
> > >  {
> > > + memset(ho, 0, sizeof(*ho));
> > >   ho->init_seq   = get_unaligned_be32(&no->init_seq);
> > >   ho->delta  = get_unaligned_be32(&no->delta);
> > >   ho->previous_delta = get_unaligned_be32(&no->previous_delta);
> >
> > So, now there is a double write here?
> 
> Correct. I would hope that a sane version of gcc would just not
> perform the first write. What happens instead is that the version
> that produces the warning here moves the initialization to the
> top of the calling function.

Maybe doing the 3 get_unaligned_be32() before the memset will stop
the double-writes.
The problem is that the compiler doesn't know that the two structures
don't alias each other.
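
Something like this (untested, just to show the reordering) should let
gcc drop the extra stores:

	u32 init_seq = get_unaligned_be32(&no->init_seq);
	u32 delta = get_unaligned_be32(&no->delta);
	u32 previous_delta = get_unaligned_be32(&no->previous_delta);

	memset(ho, 0, sizeof(*ho));
	ho->init_seq = init_seq;
	ho->delta = delta;
	ho->previous_delta = previous_delta;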

David


