Re: [PATCH net-next 1/4] net: qdisc: add op to run filters/actions before enqueue

2015-09-02 Thread Cong Wang
(Why not Cc'ing Jamal for net_sched patches?)

On Tue, Sep 1, 2015 at 9:34 AM, Daniel Borkmann  wrote:
> From: John Fastabend 
>
> Add a new ->preclassify() op to allow multiqueue queuing disciplines
> to call tc_classify() or perform other work before dev_pick_tx().
>
> This helps, for example, with mqprio queueing discipline that has
> offload support by most popular 10G NICs, where the txq effectively
> picks the qdisc.
>
> Once traffic is being directed to a specific queue then hardware TX
> rings may be tuned to support this traffic type. mqprio already
> gives the ability to do this via skb->priority where the ->preclassify()
> provides more control over packet steering, it can classify the skb
> and set the priority, for example, from an eBPF classifier (or action).
>
> Also this allows traffic classifiers to be run without holding the
> qdisc lock and gives one place to attach filters when mqprio is
> in use. ->preclassify() could also be added to other mq qdiscs later
> on: f.e. most classful qdiscs first check major/minor numbers of
> skb->priority before actually consulting a more complex classifier.
>
> For mqprio case today, a filter has to be attached to each txq qdisc
> to have all traffic hit the filter. Since ->preclassify() is currently
> only used by mqprio, the __dev_queue_xmit() fast path is guarded by
> a generic, hidden Kconfig option (NET_CLS_PRECLASSIFY) that is only
> selected by mqprio, otherwise it defaults to off. Also, the Qdisc
> structure size will stay the same, we move __parent, used by cbq only
> into a write-mostly hole. If actions are enabled, __parent is written
> on every enqueue, and only read, rewritten in reshape_fail() phase.
> Therefore, this place in the read-mostly cacheline could be used by
> preclassify, which is written only once.
>

I don't like this approach. Ideally, the qdisc layer should sit entirely
on top of the tx queues, which means tx queue selection should
happen after dequeue. I looked at this before; the change is not
trivial at all, given that the qdisc layer is tied too closely to the
tx queues, probably for historical reasons, especially the tx softirq part.
But that is really a long-term solution for me.

I have no big objection to this as a short-term solution; however,
once we add these filters before enqueue, we can't remove them
any more. We really need to think twice about it.

Jamal, do you have any better idea?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next PATCH] drivers: net: cpsw: Add support to make gpio drive which slave connected to phy

2015-09-02 Thread Mugunthan V N
On Tuesday 01 September 2015 09:06 PM, Tony Lindgren wrote:
> * Mugunthan V N  [150901 04:28]:
>> --- a/Documentation/devicetree/bindings/net/cpsw.txt
>> +++ b/Documentation/devicetree/bindings/net/cpsw.txt
>> @@ -26,6 +26,9 @@ Optional properties:
>>  - dual_emac : Specifies Switch to act as Dual EMAC
>>  - syscon: Phandle to the system control device node, which is
>>the control module device of the am33x
>> +- select-slave-gpio : Should be added if a gpio line is required to
>> +  select which slave is connected to phy
>> +
> 
> How about using something more generic here for the name?
> Something like mode-gpios?
> 

Yeah, agreed. On DRA72x the GPIO is used to select which slave is
connected to the phy; if some other board wants to drive a GPIO for
something else, it can use this node as well.

Will submit a v2 with the name change.

Regards
Mugunthan V N



[net-next:master 300/401] nf_defrag_ipv6_hooks.c:undefined reference to `nf_ct_zone_dflt'

2015-09-02 Thread kbuild test robot
Hi Jiri,

It's probably a bug fix that unveils the link errors.

tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   20a17bf6c04e3eca8824c930ecc55ab832558e3b
commit: 751a587ac9f9a8bf314590fbac32d9e418060c5a [300/401] route: fix breakage after moving lwtunnel state
config: i386-randconfig-h0-09021408 (attached as .config)
reproduce:
  git checkout 751a587ac9f9a8bf314590fbac32d9e418060c5a
  # save the attached .config to linux build tree
  make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   net/built-in.o: In function `ipv4_conntrack_defrag':
   nf_defrag_ipv4.c:(.text+0x8e209): undefined reference to `nf_ct_zone_dflt'
   nf_defrag_ipv4.c:(.text+0x8e21c): undefined reference to `nf_ct_zone_dflt'
   nf_defrag_ipv4.c:(.text+0x8e2cb): undefined reference to `nf_ct_zone_dflt'
   net/built-in.o: In function `ipv6_defrag':
>> nf_defrag_ipv6_hooks.c:(.text+0xdb825): undefined reference to `nf_ct_zone_dflt'
   nf_defrag_ipv6_hooks.c:(.text+0xdb846): undefined reference to `nf_ct_zone_dflt'
   net/built-in.o:nf_defrag_ipv6_hooks.c:(.text+0xdb8ff): more undefined references to `nf_ct_zone_dflt' follow

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
#
# Automatically generated file; DO NOT EDIT.
# Linux/i386 4.2.0-rc7 Kernel Configuration
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_OUTPUT_FORMAT="elf32-i386"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_32_LAZY_GS=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-ecx -fcall-saved-edx"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=3
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
CONFIG_KERNEL_LZ4=y
CONFIG_DEFAULT_HOSTNAME="(none)"
# CONFIG_SYSVIPC is not set
# CONFIG_POSIX_MQUEUE is not set
# CONFIG_CROSS_MEMORY_ATTACH is not set
CONFIG_FHANDLE=y
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_CHIP=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_DEBUG=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set

#
# RCU Subsystem
#
CONFIG_TINY_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
# CONFIG_TASKS_RCU is not set
# CONFIG_RCU_STALL_COMMON is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_KTHREAD_PRIO=0
# CONFIG_RCU_EXPEDITE_BOOT is not set
CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
# CONFIG_IKCONFIG_PROC is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_FREEZER is not set
CONFIG_CGROUP_DEVICE=y
# CONFIG_CPUSETS is not set
# CONFIG_CGROUP_CPUACCT is not set
CONFIG_PAGE_COUNTER=y
# CONFIG_MEMCG is not set
CONFIG_CGROUP_HUGETLB=y
CONFIG_CGROUP_PERF=y

Re: [net-next:master 300/401] nf_defrag_ipv6_hooks.c:undefined reference to `nf_ct_zone_dflt'

2015-09-02 Thread Daniel Borkmann

On 09/02/2015 09:26 AM, kbuild test robot wrote:
> Hi Jiri,
>
> It's probably a bug fix that unveils the link errors.

I'll have a look, thanks!

> tree:   git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
> head:   20a17bf6c04e3eca8824c930ecc55ab832558e3b
> commit: 751a587ac9f9a8bf314590fbac32d9e418060c5a [300/401] route: fix breakage after moving lwtunnel state
> config: i386-randconfig-h0-09021408 (attached as .config)
> reproduce:
>    git checkout 751a587ac9f9a8bf314590fbac32d9e418060c5a
>    # save the attached .config to linux build tree
>    make ARCH=i386
>
> All error/warnings (new ones prefixed by >>):
>
>    net/built-in.o: In function `ipv4_conntrack_defrag':
>    nf_defrag_ipv4.c:(.text+0x8e209): undefined reference to `nf_ct_zone_dflt'
>    nf_defrag_ipv4.c:(.text+0x8e21c): undefined reference to `nf_ct_zone_dflt'
>    nf_defrag_ipv4.c:(.text+0x8e2cb): undefined reference to `nf_ct_zone_dflt'
>    net/built-in.o: In function `ipv6_defrag':
> >> nf_defrag_ipv6_hooks.c:(.text+0xdb825): undefined reference to `nf_ct_zone_dflt'
>    nf_defrag_ipv6_hooks.c:(.text+0xdb846): undefined reference to `nf_ct_zone_dflt'
>    net/built-in.o:nf_defrag_ipv6_hooks.c:(.text+0xdb8ff): more undefined references to `nf_ct_zone_dflt' follow
>
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation





[PATCH] net: eth: altera: fix napi poll_list corruption

2015-09-02 Thread Atsushi Nemoto
tse_poll() calls __napi_complete() with irq enabled.  This leads to napi
poll_list corruption and may stop all napi drivers from working.
Use napi_complete() instead of __napi_complete().

Signed-off-by: Atsushi Nemoto 
---
 drivers/net/ethernet/altera/altera_tse_main.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/altera/altera_tse_main.c b/drivers/net/ethernet/altera/altera_tse_main.c
index da48e66..8207877 100644
--- a/drivers/net/ethernet/altera/altera_tse_main.c
+++ b/drivers/net/ethernet/altera/altera_tse_main.c
@@ -511,8 +511,7 @@ static int tse_poll(struct napi_struct *napi, int budget)
 
if (rxcomplete < budget) {
 
-   napi_gro_flush(napi, false);
-   __napi_complete(napi);
+   napi_complete(napi);
 
netdev_dbg(priv->dev,
   "NAPI Complete, did %d packets with budget %d\n",
-- 


[PATCH] [PATCH v3] add stealth mode

2015-09-02 Thread Matteo Croce
Add option to disable any reply not related to a listening socket,
like RST/ACK for TCP and ICMP Port-Unreachable for UDP.
Also disables ICMP replies to echo request and timestamp.
The stealth mode can be enabled selectively for a single interface.

Signed-off-by: Matteo Croce 
---
 Documentation/networking/ip-sysctl.txt | 14 ++
 include/linux/inetdevice.h |  1 +
 include/linux/ipv6.h   |  1 +
 include/uapi/linux/ip.h|  1 +
 net/ipv4/devinet.c |  1 +
 net/ipv4/icmp.c|  6 ++
 net/ipv4/ip_input.c|  5 +++--
 net/ipv4/tcp_ipv4.c|  3 ++-
 net/ipv4/udp.c |  4 +++-
 net/ipv6/addrconf.c|  7 +++
 net/ipv6/icmp.c|  3 ++-
 net/ipv6/ip6_input.c   |  5 +++--
 net/ipv6/tcp_ipv6.c|  2 +-
 net/ipv6/udp.c |  3 ++-
 14 files changed, 47 insertions(+), 9 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 5fae770..50fe7df 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1181,6 +1181,13 @@ tag - INTEGER
Allows you to write a number, which can be used as required.
Default value is 0.
 
+stealth - BOOLEAN
+   Disable any reply not related to a listening socket,
+   like RST/ACK for TCP and ICMP Port-Unreachable for UDP.
+   Also disables ICMP replies to echo requests and timestamp
+   and ICMP errors for unknown protocols.
+   Default value is 0.
+
 Alexey Kuznetsov.
 kuz...@ms2.inr.ac.ru
 
@@ -1584,6 +1591,13 @@ stable_secret - IPv6 address
 
By default the stable secret is unset.
 
+stealth - BOOLEAN
+   Disable any reply not related to a listening socket,
+   like RST/ACK for TCP and ICMP Port-Unreachable for UDP.
+   Also disables ICMPv6 replies to echo requests
+   and ICMP errors for unknown protocols.
+   Default value is 0.
+
 icmp/*:
 ratelimit - INTEGER
Limit the maximal rates for sending ICMPv6 packets.
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index a4328ce..a64c01e 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -128,6 +128,7 @@ static inline void ipv4_devconf_setall(struct in_device *in_dev)
 #define IN_DEV_ARP_ANNOUNCE(in_dev)IN_DEV_MAXCONF((in_dev), ARP_ANNOUNCE)
 #define IN_DEV_ARP_IGNORE(in_dev)  IN_DEV_MAXCONF((in_dev), ARP_IGNORE)
 #define IN_DEV_ARP_NOTIFY(in_dev)  IN_DEV_MAXCONF((in_dev), ARP_NOTIFY)
+#define IN_DEV_STEALTH(in_dev) IN_DEV_MAXCONF((in_dev), STEALTH)
 
 struct in_ifaddr {
struct hlist_node   hash;
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 82806c6..49494ec 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -53,6 +53,7 @@ struct ipv6_devconf {
__s32   ndisc_notify;
__s32   suppress_frag_ndisc;
__s32   accept_ra_mtu;
+   __s32   stealth;
struct ipv6_stable_secret {
bool initialized;
struct in6_addr secret;
diff --git a/include/uapi/linux/ip.h b/include/uapi/linux/ip.h
index 08f894d..4acbf99 100644
--- a/include/uapi/linux/ip.h
+++ b/include/uapi/linux/ip.h
@@ -165,6 +165,7 @@ enum
IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL,
IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL,
IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN,
+   IPV4_DEVCONF_STEALTH,
__IPV4_DEVCONF_MAX
 };
 
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 2d9cb17..6d9c080 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -2190,6 +2190,7 @@ static struct devinet_sysctl_table {
  "promote_secondaries"),
DEVINET_SYSCTL_FLUSHING_ENTRY(ROUTE_LOCALNET,
  "route_localnet"),
+   DEVINET_SYSCTL_RW_ENTRY(STEALTH, "stealth"),
},
 };
 
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index f5203fb..e8e71fb 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -882,6 +882,9 @@ static bool icmp_echo(struct sk_buff *skb)
 {
struct net *net;
 
+   if (IN_DEV_STEALTH(skb->dev->ip_ptr))
+   return true;
+
net = dev_net(skb_dst(skb)->dev);
if (!net->ipv4.sysctl_icmp_echo_ignore_all) {
struct icmp_bxm icmp_param;
@@ -915,6 +918,9 @@ static bool icmp_timestamp(struct sk_buff *skb)
if (skb->len < 4)
goto out_err;
 
+   if (IN_DEV_STEALTH(skb->dev->ip_ptr))
+   return true;
+
/*
 *  Fill in the current time as ms since midnight UT:
 */
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 2db4c87..c8e0c5b 100644
--- a/net/ipv4/ip_input.c
+++ 

Re: Unexpected loss recovery in TLP

2015-09-02 Thread Mohammad Rajiullah
Hi Eric!

Thanks for the direction. I tried packetdrill locally (with the same kernel,
Linux 3.18.5, to start with) using the following script, and it doesn't show
the problem I mentioned: the fast retransmit happens after getting the dupack.
It would be good to get some information from the calls in the TCP stack
(I have some printk there), but I don't yet know how to get that when using
packetdrill.

\
Mohammad


// Establish a connection.
0   socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0  setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0  setsockopt(3, SOL_SOCKET, TCP_NODELAY, [1], 4) = 0

+0  bind(3, ..., ...) = 0
+0  listen(3, 1) = 0

+0  < S 0:0(0) win 32792 
+0  > S. 0:0(0) ack 1 <...>

+.03 < . 1:1(0) ack 1 win 257
+0  accept(3, ..., ...) = 4

// Send 1 data segment and get an ACK with DATA
+0  write(4, ..., 1000) = 1000
+0  > P. 1:1001(1000) ack 1

+.03 < P. 1:11(10) ack 1001 win 257
+0  > . 1001:1001(0) ack 11
//+0.1 read(3,...,1000)=10
+0  write(4, ..., 1000) = 1000
+0  > P. 1001:2001(1000) ack 11 

+.03 < P. 11:21(10) ack 2001 win 257
//+0  > . 2001:2001(0) ack 21
+0  write(4, ..., 1000) = 1000
+0  > P. 2001:3001(1000) ack 21 

+.03 < P. 21:31(10) ack 3001 win 257
+0  write(4, ..., 1000) = 1000
+0  > P. 3001:4001(1000) ack 31 

+0.2  write(4, ..., 1000) = 1000
+0  > P. 4001:5001(1000) ack 31 
+0.03  > P. 4001:5001(1000) ack 31 
+0.04  < P. 21:31(10) ack 3001 win 257 
+0  > . 5001:5001(0) ack 31 
+0.03  < . 31:31(0) ack 3001 win 257 
+0.003  < . 31:31(0) ack 3001 win 257 
+0.006  > P. 3001:4001(1000) ack 31  

  1   0.00      192.0.2.1   -> 192.168.0.1 TCP 68 60262 > http-alt [SYN] Seq=0 Win=32792 Len=0 MSS=1000 SACK_PERM=1 WS=128
  2   0.68      192.168.0.1 -> 192.0.2.1   TCP 68 http-alt > 60262 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1 WS=512
  3   0.030294  192.0.2.1   -> 192.168.0.1 TCP 56 60262 > http-alt [ACK] Seq=1 Ack=1 Win=32896 Len=0
  4   0.030370  192.168.0.1 -> 192.0.2.1   TCP 1056 [TCP segment of a reassembled PDU]
  5   0.060474  192.0.2.1   -> 192.168.0.1 TCP 66 [TCP segment of a reassembled PDU]
  6   0.060507  192.168.0.1 -> 192.0.2.1   TCP 56 http-alt > 60262 [ACK] Seq=1001 Ack=1 Win=29696 Len=0
  7   0.060670  192.168.0.1 -> 192.0.2.1   TCP 1056 [TCP segment of a reassembled PDU]
  8   0.090766  192.0.2.1   -> 192.168.0.1 TCP 66 60262 > http-alt [PSH, ACK] Seq=1 Ack=2001 Win=32896 Len=10
  9   0.090809  192.168.0.1 -> 192.0.2.1   TCP 1056 [TCP segment of a reassembled PDU]
 10   0.120984  192.0.2.1   -> 192.168.0.1 TCP 66 [TCP segment of a reassembled PDU]
 11   0.121026  192.168.0.1 -> 192.0.2.1   TCP 1056 [TCP segment of a reassembled PDU]
 12   0.32      192.168.0.1 -> 192.0.2.1   TCP 1056 [TCP segment of a reassembled PDU]
 13   0.351588  192.168.0.1 -> 192.0.2.1   TCP 1056 [TCP Retransmission] [TCP segment of a reassembled PDU]
 14   0.391668  192.0.2.1   -> 192.168.0.1 TCP 66 [TCP Retransmission] [TCP segment of a reassembled PDU]
 15   0.391699  192.168.0.1 -> 192.0.2.1   TCP 68 [TCP Dup ACK 13#1] http-alt > 60262 [ACK] Seq=5001 Ack=21 Win=29696 Len=0 SLE=11 SRE=21
 16   0.421888  192.0.2.1   -> 192.168.0.1 TCP 68 [TCP Dup ACK 14#1] 60262 > http-alt [ACK] Seq=21 Ack=3001 Win=32896 Len=0 SLE=4001 SRE=5001
 17   0.424964  192.0.2.1   -> 192.168.0.1 TCP 68 [TCP Dup ACK 14#2] 60262 > http-alt [ACK] Seq=21 Ack=3001 Win=32896 Len=0 SLE=4001 SRE=5001
 18   0.431597  192.168.0.1 -> 192.0.2.1   TCP 1056 [TCP Fast Retransmission] [TCP segment of a reassembled PDU]

> On 01 Sep 2015, at 14:31, Eric Dumazet  wrote:
> 
> On Tue, 2015-09-01 at 11:36 +0200, Mohammad Rajiullah wrote:
>> Hi!
>> 
>> While measuring TLP’s performance for an online gaming scenario,  where both 
>> the client and the server send data, TLP 
>> shows unexpected loss recovery in Linux 3.18.5 kernel. Early retransmit 
>> fails in response 
>> to the dupack which is later resolved using RTO.  I found the behaviour 
>> consistent during the whole measurement period.
>> Following is an excerpt from the tcpdump traces (taken at the server) 
>> showing the behaviour:
>> 
>> 0.733965Client -> Server HTTP 431 POST /Scores HTTP/1.1 
>> 0.738355 Server -> Client HTTP 407 HTTP/1.1 200 OK 
>> 0.985346 Server -> Client TCP 68 [TCP segment of a reassembled PDU]
>> 0.993322 Client -> Server HTTP 431 [TCP Retransmission] POST /Scores 
>> HTTP/1.1 
>> 0.993352 Server -> Client TCP 78 [TCP Dup ACK 2339#1] 8081→45451 [ACK] 
>> Seq=186995 Ack=230031  Len=0   SLE=229666 SRE=230031
>> 1.089327 Server -> Client TCP 68 [TCP Retransmission] 8081→45451 [PSH, 
>> ACK] Seq=186993 Ack=230031  Len=2  
>> 1.294816 Client -> Server TCP 78 [TCP Dup ACK 2340#1] 45451→8081 
>> [ACK] Seq=230031 Ack=186652  Len=0   SLE=186993 SRE=186995
>> 1.295018 Client -> Server TCP 86 [TCP Dup ACK 2340#2] 45451→8081 
>> [ACK] Seq=230031 



Re: [Patch net-next 4/5] net_sched: forbid setting default qdisc to inappropriate ones

2015-09-02 Thread Cong Wang
On Sun, Aug 30, 2015 at 12:07 PM, Jamal Hadi Salim  wrote:
> On 08/28/15 19:20, David Miller wrote:
>
>> But HTB definitely should be allowed.
>
>
> Problem with most non-work conserving schedulers is what the meaning
> of default resources means; example, for HTB:
> What is the default bandwidth you allocate to a class of users?
>

Exactly, that is why it has to need at least one parameter for bandwidth,
while default qdisc requires no parameter.


Re: [Patch net-next 4/5] net_sched: forbid setting default qdisc to inappropriate ones

2015-09-02 Thread David Miller
From: Cong Wang 
Date: Tue, 1 Sep 2015 23:05:45 -0700

> On Sun, Aug 30, 2015 at 12:07 PM, Jamal Hadi Salim  wrote:
>> On 08/28/15 19:20, David Miller wrote:
>>
>>> But HTB definitely should be allowed.
>>
>>
>> Problem with most non-work conserving schedulers is what the meaning
>> of default resources means; example, for HTB:
>> What is the default bandwidth you allocate to a class of users?
> 
> Exactly, that is why it has to need at least one parameter for bandwidth,
> while default qdisc requires no parameter.

Ok I'm convinced.


Re: [Patch net-next 4/5] net_sched: forbid setting default qdisc to inappropriate ones

2015-09-02 Thread Cong Wang
On Tue, Sep 1, 2015 at 11:19 PM, David Miller  wrote:
> From: Cong Wang 
> Date: Tue, 1 Sep 2015 23:05:45 -0700
>
>> On Sun, Aug 30, 2015 at 12:07 PM, Jamal Hadi Salim  wrote:
>>> On 08/28/15 19:20, David Miller wrote:
>>>
>>>> But HTB definitely should be allowed.
>>>
>>>
>>> Problem with most non-work conserving schedulers is what the meaning
>>> of default resources means; example, for HTB:
>>> What is the default bandwidth you allocate to a class of users?
>>
>> Exactly, that is why it has to need at least one parameter for bandwidth,
>> while default qdisc requires no parameter.
>
> Ok I'm convinced.

Ok, I will update the changelog to clarify this and resend.

Thanks.


Re: [PATCH 2/3] rhashtable-test: retry insert operations in threads

2015-09-02 Thread Thomas Graf
On 09/02/15 at 10:00am, Herbert Xu wrote:
> On Tue, Sep 01, 2015 at 04:51:24PM +0200, Thomas Graf wrote:
> >
> > 1. The current in-kernel self-test
> > 2. bind_netlink.c: https://github.com/tgraf/rhashtable
> 
> Thanks, I will try to reproduce this.

The path in question is:

int rhashtable_insert_rehash(struct rhashtable *ht)
{
[...]

old_tbl = rht_dereference_rcu(ht->tbl, ht);
tbl = rhashtable_last_table(ht, old_tbl);

size = tbl->size;

if (rht_grow_above_75(ht, tbl))
size *= 2;
/* Do not schedule more than one rehash */
else if (old_tbl != tbl)
return -EBUSY;

The behaviour in question is the immediate rehash during
insertion which we want to fail.

Commits:
ccd57b1bd32460d27bbb9c599e795628a3c66983
a87b9ebf1709687ff213091d0fdb4254b1564803
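
For readers following along, the decision in the snippet above can be modelled in a few lines of userspace C. This is only a rough sketch under stated assumptions: `struct toy_table`, `toy_grow_above_75()` and `toy_insert_rehash()` are hypothetical names made up for illustration, not kernel symbols, and the 75% check only mirrors the idea behind rht_grow_above_75() without the locking, RCU or max_size handling of the real code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bucket table; not the kernel's struct. */
struct toy_table {
	size_t size;	/* number of buckets */
	size_t nelems;	/* number of stored elements */
};

/* True when the table is more than 75% full. */
static bool toy_grow_above_75(const struct toy_table *tbl)
{
	return tbl->nelems > tbl->size / 4 * 3;
}

/*
 * Pick the size of the next table, or refuse with -EBUSY.
 * old_tbl is the table readers currently see; last_tbl is the newest
 * table in the rehash chain (equal to old_tbl when no rehash is pending).
 */
static int toy_insert_rehash(const struct toy_table *old_tbl,
			     const struct toy_table *last_tbl,
			     size_t *new_size)
{
	size_t size;

	assert(old_tbl && last_tbl && new_size);
	size = last_tbl->size;

	if (toy_grow_above_75(last_tbl))
		size *= 2;
	else if (old_tbl != last_tbl)
		return -EBUSY;	/* do not schedule more than one rehash */

	*new_size = size;
	return 0;
}
```

In this model, an insert that arrives while a rehash is already pending and the newest table is not yet above 75% gets -EBUSY, which is the case the test threads have to retry.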


Re: [PATCH v2 net-next] xen-netback: add support for multicast control

2015-09-02 Thread Wei Liu
On Wed, Sep 02, 2015 at 05:58:36PM +0100, Paul Durrant wrote:
> Xen's PV network protocol includes messages to add/remove ethernet
> multicast addresses to/from a filter list in the backend. This allows
> the frontend to request the backend only forward multicast packets
> which are of interest thus preventing unnecessary noise on the shared
> ring.
> 
> The canonical netif header in git://xenbits.xen.org/xen.git specifies
> the message format (two more XEN_NETIF_EXTRA_TYPEs) so the minimal
> necessary changes have been pulled into include/xen/interface/io/netif.h.
> 
> To prevent the frontend from extending the multicast filter list
> arbitrarily a limit (XEN_NETBK_MCAST_MAX) has been set to 64 entries.
> This limit is not specified by the protocol and so may change in future.
> If the limit is reached then the next XEN_NETIF_EXTRA_TYPE_MCAST_ADD
> sent by the frontend will be failed with NETIF_RSP_ERROR.
> 
> Signed-off-by: Paul Durrant 
> Cc: Ian Campbell 
> Cc: Wei Liu 

Acked-by: Wei Liu 
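
As an aside for anyone skimming the thread, the bounded filter list described in the quoted commit message can be sketched in plain userspace C roughly as follows. This is an illustration under assumptions, not the patch's code: `struct toy_mcast_filter`, `toy_mcast_add()` and `toy_mcast_match()` are hypothetical names, and a flat array stands in for the RCU-protected list the patch uses. Only the limit of 64 entries and the -ENOSPC result when it is reached are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MCAST_MAX 64	/* same value as XEN_NETBK_MCAST_MAX */
#define TOY_ETH_ALEN  6

/* Hypothetical flat-array filter; the patch uses an RCU list instead. */
struct toy_mcast_filter {
	unsigned char addr[TOY_MCAST_MAX][TOY_ETH_ALEN];
	unsigned int count;
};

/* Return true if addr is present in the filter. */
static bool toy_mcast_match(const struct toy_mcast_filter *f,
			    const unsigned char *addr)
{
	unsigned int i;

	for (i = 0; i < f->count; i++)
		if (memcmp(f->addr[i], addr, TOY_ETH_ALEN) == 0)
			return true;
	return false;
}

/* Append addr, or fail with -ENOSPC once the limit is reached. */
static int toy_mcast_add(struct toy_mcast_filter *f, const unsigned char *addr)
{
	assert(f && addr);

	if (f->count == TOY_MCAST_MAX)
		return -ENOSPC;

	memcpy(f->addr[f->count++], addr, TOY_ETH_ALEN);
	return 0;
}
```

A transmit path built on this model would drop a multicast frame when `toy_mcast_match()` returns false for its destination address, which corresponds to the `goto drop` in the xenvif_start_xmit() hunk of the patch.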


[PATCH v2 net-next] xen-netback: add support for multicast control

2015-09-02 Thread Paul Durrant
Xen's PV network protocol includes messages to add/remove ethernet
multicast addresses to/from a filter list in the backend. This allows
the frontend to request the backend only forward multicast packets
which are of interest thus preventing unnecessary noise on the shared
ring.

The canonical netif header in git://xenbits.xen.org/xen.git specifies
the message format (two more XEN_NETIF_EXTRA_TYPEs) so the minimal
necessary changes have been pulled into include/xen/interface/io/netif.h.

To prevent the frontend from extending the multicast filter list
arbitrarily a limit (XEN_NETBK_MCAST_MAX) has been set to 64 entries.
This limit is not specified by the protocol and so may change in future.
If the limit is reached then the next XEN_NETIF_EXTRA_TYPE_MCAST_ADD
sent by the frontend will be failed with NETIF_RSP_ERROR.

Signed-off-by: Paul Durrant 
Cc: Ian Campbell 
Cc: Wei Liu 
---

v2:
 - Fix commit comment
 - Cosmetic change requested by Wei
---
 drivers/net/xen-netback/common.h|   15 ++
 drivers/net/xen-netback/interface.c |   10 
 drivers/net/xen-netback/netback.c   |   99 +++
 drivers/net/xen-netback/xenbus.c|   13 +
 include/xen/interface/io/netif.h|8 ++-
 5 files changed, 144 insertions(+), 1 deletion(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index c6cb85a..6dc76c1 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -210,12 +210,22 @@ enum state_bit_shift {
VIF_STATUS_CONNECTED,
 };
 
+struct xenvif_mcast_addr {
+   struct list_head entry;
+   struct rcu_head rcu;
+   u8 addr[6];
+};
+
+#define XEN_NETBK_MCAST_MAX 64
+
 struct xenvif {
/* Unique identifier for this interface. */
domid_t  domid;
unsigned int handle;
 
u8   fe_dev_addr[6];
+   struct list_head fe_mcast_addr;
+   unsigned int fe_mcast_count;
 
/* Frontend feature information. */
int gso_mask;
@@ -224,6 +234,7 @@ struct xenvif {
u8 can_sg:1;
u8 ip_csum:1;
u8 ipv6_csum:1;
+   u8 multicast_control:1;
 
/* Is this interface disabled? True when backend discovers
 * frontend is rogue.
@@ -341,4 +352,8 @@ void xenvif_skb_zerocopy_prepare(struct xenvif_queue *queue,
 struct sk_buff *skb);
 void xenvif_skb_zerocopy_complete(struct xenvif_queue *queue);
 
+/* Multicast control */
+bool xenvif_mcast_match(struct xenvif *vif, const u8 *addr);
+void xenvif_mcast_addr_list_free(struct xenvif *vif);
+
 #endif /* __XEN_NETBACK__COMMON_H__ */
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 28577a3..e7bd63e 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -171,6 +171,13 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
!xenvif_schedulable(vif))
goto drop;
 
+   if (vif->multicast_control && skb->pkt_type == PACKET_MULTICAST) {
+   struct ethhdr *eth = (struct ethhdr *)skb->data;
+
+   if (!xenvif_mcast_match(vif, eth->h_dest))
+   goto drop;
+   }
+
cb = XENVIF_RX_CB(skb);
cb->expires = jiffies + vif->drain_timeout;
 
@@ -427,6 +434,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
vif->num_queues = 0;
 
spin_lock_init(&vif->lock);
+   INIT_LIST_HEAD(&vif->fe_mcast_addr);
 
dev->netdev_ops = &xenvif_netdev_ops;
dev->hw_features = NETIF_F_SG |
@@ -661,6 +669,8 @@ void xenvif_disconnect(struct xenvif *vif)
 
xenvif_unmap_frontend_rings(queue);
}
+
+   xenvif_mcast_addr_list_free(vif);
 }
 
 /* Reverse the relevant parts of xenvif_init_queue().
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 3f44b52..42569b9 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1157,6 +1157,80 @@ static bool tx_credit_exceeded(struct xenvif_queue *queue, unsigned size)
return false;
 }
 
+/* No locking is required in xenvif_mcast_add/del() as they are
+ * only ever invoked from NAPI poll. An RCU list is used because
+ * xenvif_mcast_match() is called asynchronously, during start_xmit.
+ */
+
+static int xenvif_mcast_add(struct xenvif *vif, const u8 *addr)
+{
+   struct xenvif_mcast_addr *mcast;
+
+   if (vif->fe_mcast_count == XEN_NETBK_MCAST_MAX) {
+   if (net_ratelimit())
+   netdev_err(vif->dev,
+  "Too many multicast addresses\n");
+   return -ENOSPC;
+   }
+
+   mcast = kzalloc(sizeof(*mcast), GFP_ATOMIC);
+   if (!mcast)
+   return -ENOMEM;
+
+   ether_addr_copy(mcast->addr, addr);
+   

Re: [PATCH] flow_dissector: Use 'const' where possible.

2015-09-02 Thread Tom Herbert
On Wed, Sep 2, 2015 at 9:39 AM, Jiri Pirko  wrote:
> Wed, Sep 02, 2015 at 06:33:34AM CEST, t...@herbertland.com wrote:
>>On Tue, Sep 1, 2015 at 9:19 PM, David Miller  wrote:
>>>
>>> Signed-off-by: David S. Miller 
>>> ---
>>>  include/linux/skbuff.h|  8 ++---
>>>  include/net/flow.h|  8 ++---
>>>  net/core/flow_dissector.c | 79 
>>> ---
>>>  3 files changed, 49 insertions(+), 46 deletions(-)
>>>
>
> 
>
>
>>> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>>> index 345a040..d79699c 100644
>>> --- a/net/core/flow_dissector.c
>>> +++ b/net/core/flow_dissector.c
>>> @@ -19,14 +19,14 @@
>>>  #include 
>>>  #include 
>>>
>>> -static bool skb_flow_dissector_uses_key(struct flow_dissector 
>>> *flow_dissector,
>>> -   enum flow_dissector_key_id key_id)
>>> +static bool dissector_uses_key(const struct flow_dissector *flow_dissector,
>>> +  enum flow_dissector_key_id key_id)
>>>  {
>>> return flow_dissector->used_keys & (1 << key_id);
>>>  }
>>>
>>> -static void skb_flow_dissector_set_key(struct flow_dissector 
>>> *flow_dissector,
>>> -  enum flow_dissector_key_id key_id)
>>> +static void dissector_set_key(struct flow_dissector *flow_dissector,
>>> + enum flow_dissector_key_id key_id)
>>>  {
>>> flow_dissector->used_keys |= (1 << key_id);
>>>  }
>>> @@ -51,20 +51,20 @@ void skb_flow_dissector_init(struct flow_dissector 
>>> *flow_dissector,
>>
>>I suppose we should drop skb_ from skb_flow_dissector_init and
>>skb_flow_dissector_target as well.
>
> I like to have "namespaces" by function prefixes. Code is easier to read
> then...

Right, these functions now are independent of sk_buff. Conceptually
someone could use these for a non-skbuff application-- so it's good
design!
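
To make the point about sk_buff independence concrete, here is a minimal
standalone sketch of the same bitmask pattern. The demo_* names and the key
list are made up for illustration and are not the kernel's definitions; the
only thing carried over from the patch is the "used_keys & (1 << key_id)"
idea and the const-qualified reader.

```c
#include <stdbool.h>

/* Hypothetical key ids, standing in for enum flow_dissector_key_id. */
enum demo_key_id {
	DEMO_KEY_CONTROL,
	DEMO_KEY_BASIC,
	DEMO_KEY_PORTS,
	DEMO_KEY_MAX
};

/* Hypothetical dissector state: one bit per key in use. */
struct demo_dissector {
	unsigned int used_keys;
};

/* Read-only query, so the dissector can be passed as const. */
static bool demo_uses_key(const struct demo_dissector *d, enum demo_key_id id)
{
	return d->used_keys & (1U << id);
}

/* Mutating helper, so no const here. */
static void demo_set_key(struct demo_dissector *d, enum demo_key_id id)
{
	d->used_keys |= 1U << id;
}
```

Nothing in it refers to a packet buffer, which is the property being praised
above.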
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Issue with /etc/netns/${nsname}/hosts

2015-09-02 Thread Nicolas Dichtel

On 02/09/2015 01:23, James Loosli wrote:

I seem to have an issue with using namespace-specific hosts files.
Here's an example.

I have different entries for foo.com in my hosts file for the
namespace and the system-wide hosts file;

root@server-01 Tue Sep 01 04:15:02pm

cat /etc/netns/nsXX-XXX-240-3/hosts | grep foo

1.2.3.4 foo.com
root@server-01 Tue Sep 01 04:15:15pm

ip netns exec nsXX-XXX-240-3 cat /etc/hosts | grep foo

1.2.3.4 foo.com
root@server-01 Tue Sep 01 04:15:19pm

cat /etc/hosts | grep foo

0.0.0.0 foo.com

But when I try to get curl, ping or other utilities to use that hosts
file entry, they ignore the namespace-specific file.

root@server-01 Tue Sep 01 04:16:02pm

ip netns exec ns91-227-240-3 curl -vv foo.com

Probably a copy and paste error, but the netns name was nsXX-XXX-240-3 in your
example above.
Can you confirm?


Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload

2015-09-02 Thread Rustad, Mark D
> On Sep 1, 2015, at 8:17 PM, Tom Herbert  wrote:
> 
> I suspect this is not UDP-encapsulation specific, will it work with
> TCP/IP/IP, TCP/IP/GRE etc.?

It could do more, but this is what has been tested up to this point.

> Isn't there anyway the ixgbe could just be made to NETIF_HW_CSUM? That
> would be so much more straightforward and support nearly all use cases
> without needing to jump through all these hoops.

Well, the description says:

 ---
Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM.
It means that device can fill TCP/UDP-like checksum anywhere in the packets
whatever headers there might be.
 ---

The device can't do whatever, wherever. There is always a limit to the offset 
to the inner headers that can be handled, for instance.

--
Mark Rustad, Networking Division, Intel Corporation





[PATCH] ip route: Print table id for 'ip route get'

2015-09-02 Thread David Ahern
Table id is not dumped for 'ip route get' requests because the RTM_F_CLONED
flag is set in rt_fill_info. Move it out from the check and show user the
table id any time it is not MAIN.

Example:
$ ip ru ls
0:  from all lookup local
32765:  from all to 10.2.1.0/24 lookup 10
32766:  from all lookup main
32767:  from all lookup default

$ ip route ls
default via 10.0.0.254 dev eth0
10.0.0.0/24 dev eth0  proto kernel  scope link  src 10.0.0.2
10.2.2.0/24 dev eth2  proto kernel  scope link  src 10.2.2.2
10.2.3.0/24 dev eth3  proto kernel  scope link  src 10.2.3.2
10.2.4.0/24 dev eth4  proto kernel  scope link  src 10.2.4.2

$ ip route ls table 10
10.2.1.0/24 dev eth1  scope link

Currently:
$ ip route get 10.2.1.240
10.2.1.240 dev eth1  src 10.2.1.2
cache

With this patch:
$ ip route get 10.2.1.240
10.2.1.240 dev eth1  table 10  src 10.2.1.2
cache
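
The change in behaviour can be modelled in a few lines of plain C. This is
only a sketch of the decision, not the iproute2 code: demo_show_table() and
its boolean parameters are hypothetical, and DEMO_TABLE_MAIN is assumed to
match the value of RT_TABLE_MAIN (254).

```c
#include <stdbool.h>

#define DEMO_TABLE_MAIN 254	/* assumed to equal RT_TABLE_MAIN */

/*
 * Decide whether "table <id>" is printed for a route.
 * cloned models the RTM_F_CLONED flag; patched selects old or new
 * behaviour. filter_tb models a table filter given on the command line.
 */
static bool demo_show_table(unsigned int table, int show_details,
			    bool filter_tb, bool cloned, bool patched)
{
	bool want = (table != DEMO_TABLE_MAIN || show_details > 0) &&
		    !filter_tb;

	/* Old code: the check lived inside the "not cloned" branch. */
	if (!patched && cloned)
		return false;

	return want;
}
```

With a cloned route in table 10, the old behaviour returns false and the
patched behaviour returns true, which matches the two outputs shown above.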

Signed-off-by: David Ahern 
---
 ip/iproute.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 8f49e6289003..9e6148d68e6c 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -421,9 +421,10 @@ int print_route(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
if (tb[RTA_OIF] && filter.oifmask != -1)
fprintf(fp, "dev %s ", 
ll_index_to_name(*(int*)RTA_DATA(tb[RTA_OIF])));
 
+   if ((table != RT_TABLE_MAIN || show_details > 0) && !filter.tb)
+   fprintf(fp, " table %s ", rtnl_rttable_n2a(table, b1, 
sizeof(b1)));
+
if (!(r->rtm_flags & RTM_F_CLONED)) {
-   if ((table != RT_TABLE_MAIN || show_details > 0) && !filter.tb)
-   fprintf(fp, " table %s ", rtnl_rttable_n2a(table, b1, 
sizeof(b1)));
if ((r->rtm_protocol != RTPROT_BOOT || show_details > 0) && 
filter.protocolmask != -1)
fprintf(fp, " proto %s ", 
rtnl_rtprot_n2a(r->rtm_protocol, b1, sizeof(b1)));
if ((r->rtm_scope != RT_SCOPE_UNIVERSE || show_details > 0) && 
filter.scopemask != -1)
-- 
1.9.1



Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload

2015-09-02 Thread Tom Herbert
On Wed, Sep 2, 2015 at 9:46 AM, Rustad, Mark D  wrote:
>> On Sep 1, 2015, at 8:17 PM, Tom Herbert  wrote:
>>
>> I suspect this is not UDP-encapsulation specific, will it work with
>> TCP/IP/IP, TCP/IP/GRE etc.?
>
> It could do more, but this is what has been tested up to this point.
>
Well, please test those other encapsulations too! It's nice and
all if they get the benefit, but it's really bad news if these changes
were to screw them up (i.e. you don't want users of the GRE, IPIP to
find out that they're now broken).

>> Isn't there anyway the ixgbe could just be made to NETIF_HW_CSUM? That
>> would be so much more straightforward and support nearly all use cases
>> without needing to jump through all these hoops.
>
> Well, the description says:
>
>  ---
> Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM.
> It means that device can fill TCP/UDP-like checksum anywhere in the packets
> whatever headers there might be.
>  ---
>
> The device can't do whatever, wherever. There is always a limit to the offset 
> to the inner headers that can be handled, for instance.
>
If the device does NETIF_F_HW_CSUM then inner/outer headers are
irrelevant at least in the non-GSO case. All the device needs to do is
compute the checksum from start and write the answer at the given
offset. No protocol awareness needed in the device, no need to parse
headers on transmit.
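
As a rough illustration of how little the device has to know, here is a
sketch in plain C of such a protocol-agnostic engine. It is a model, not
driver or hardware code: hw_csum_fill() and csum_fold16() are hypothetical
names, the packet is a flat byte array, and the checksum field is assumed to
already hold whatever seed the stack put there (for example a pseudo-header
sum). The kernel's special handling of an all-zero result is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones'-complement sum. */
static uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Sum pkt[csum_start .. len) as big-endian 16-bit words and store the
 * complemented result at pkt[csum_start + csum_offset]. No header
 * parsing takes place. The 32-bit accumulator is sufficient for packets
 * below 128 KiB. Returns 0 on success, -1 if the offsets do not fit.
 */
static int hw_csum_fill(uint8_t *pkt, size_t len,
			size_t csum_start, size_t csum_offset)
{
	uint32_t sum = 0;
	uint16_t res;
	size_t i;

	if (csum_start > len || csum_offset + 2 > len - csum_start)
		return -1;

	for (i = csum_start; i + 1 < len; i += 2)
		sum += ((uint32_t)pkt[i] << 8) | pkt[i + 1];
	if (i < len)		/* odd trailing byte, padded with zero */
		sum += (uint32_t)pkt[i] << 8;

	res = (uint16_t)~csum_fold16(sum);
	pkt[csum_start + csum_offset] = res >> 8;
	pkt[csum_start + csum_offset + 1] = res & 0xff;
	return 0;
}
```

The same routine serves TCP, UDP, or anything nested inside GRE or IPIP,
because the only inputs are a start position and a field offset.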

I have the same complaint that ixgbe requires a bunch of driver logic
to offload VXLAN checksum unnecessary instead of just providing
CHECKSUM_COMPLETE which would work with any encapsulation protocol,
require no encapsulation awareness in the device, and should be a much
simpler driver implementation.

So my input to NIC vendors will continue to be that they provide general,
protocol-agnostic solutions and *stop* perpetuating these narrow,
protocol-specific and unnecessarily complicated solutions. If you
don't believe me, see the similar longstanding comments in skbuff.h
about NIC capabilities and checksums and what choices vendors make.

Tom

> --
> Mark Rustad, Networking Division, Intel Corporation
>


[PATCH] tipc: fix stall during bclink wakeup procedure

2015-09-02 Thread Kolmakov Dmitriy
From: Dmitry S Kolmakov 

If an attempt to wake up users of the broadcast link is made when there is not
enough room in the send queue, it may hang inside the tipc_sk_rcv()
function, since the loop breaks only after the wakeup queue becomes empty. This
can lead to a complete CPU stall with the following message generated by RCU:

Aug 17 15:11:28 [kernel] INFO: rcu_sched self-detected stall on CPU { 0}  
(t=2101 jiffies g=54225 c=54224 q=11465)
Aug 17 15:11:28 [kernel] Task dump for CPU 0:
Aug 17 15:11:28 [kernel] tpchR  running task0 39949  39948 
0x000a
Aug 17 15:11:28 [kernel]  818536c0 88181fa037a0 8106a4be 

Aug 17 15:11:28 [kernel]  818536c0 88181fa037c0 8106d8a8 
88181fa03800
Aug 17 15:11:28 [kernel]  0001 88181fa037f0 81094a50 
88181fa15680
Aug 17 15:11:28 [kernel] Call Trace:
Aug 17 15:11:28 [kernel][] sched_show_task+0xae/0x120
Aug 17 15:11:28 [kernel]  [] dump_cpu_task+0x38/0x40
Aug 17 15:11:28 [kernel]  [] rcu_dump_cpu_stacks+0x90/0xd0
Aug 17 15:11:28 [kernel]  [] rcu_check_callbacks+0x3eb/0x6e0
Aug 17 15:11:28 [kernel]  [] ? account_system_time+0x7f/0x170
Aug 17 15:11:28 [kernel]  [] update_process_times+0x34/0x60
Aug 17 15:11:28 [kernel]  [] 
tick_sched_handle.isra.18+0x31/0x40
Aug 17 15:11:28 [kernel]  [] tick_sched_timer+0x3c/0x70
Aug 17 15:11:28 [kernel]  [] __run_hrtimer.isra.34+0x3d/0xc0
Aug 17 15:11:28 [kernel]  [] hrtimer_interrupt+0xc5/0x1e0
Aug 17 15:11:28 [kernel]  [] ? 
native_smp_send_reschedule+0x42/0x60
Aug 17 15:11:28 [kernel]  [] 
local_apic_timer_interrupt+0x34/0x60
Aug 17 15:11:28 [kernel]  [] 
smp_apic_timer_interrupt+0x3c/0x60
Aug 17 15:11:28 [kernel]  [] apic_timer_interrupt+0x6b/0x70
Aug 17 15:11:28 [kernel]  [] ? 
_raw_spin_unlock_irqrestore+0x9/0x10
Aug 17 15:11:28 [kernel]  [] __wake_up_sync_key+0x4f/0x60
Aug 17 15:11:28 [kernel]  [] tipc_write_space+0x31/0x40 [tipc]
Aug 17 15:11:28 [kernel]  [] filter_rcv+0x31f/0x520 [tipc]
Aug 17 15:11:28 [kernel]  [] ? tipc_sk_lookup+0xc9/0x110 
[tipc]
Aug 17 15:11:28 [kernel]  [] ? _raw_spin_lock_bh+0x19/0x30
Aug 17 15:11:28 [kernel]  [] tipc_sk_rcv+0x2dc/0x3e0 [tipc]
Aug 17 15:11:28 [kernel]  [] 
tipc_bclink_wakeup_users+0x2f/0x40 [tipc]
Aug 17 15:11:28 [kernel]  [] tipc_node_unlock+0x186/0x190 
[tipc]
Aug 17 15:11:28 [kernel]  [] ? kfree_skb+0x2c/0x40
Aug 17 15:11:28 [kernel]  [] tipc_rcv+0x2ac/0x8c0 [tipc]
Aug 17 15:11:28 [kernel]  [] tipc_l2_rcv_msg+0x38/0x50 [tipc]
Aug 17 15:11:28 [kernel]  [] 
__netif_receive_skb_core+0x5a3/0x950
Aug 17 15:11:28 [kernel]  [] __netif_receive_skb+0x13/0x60
Aug 17 15:11:28 [kernel]  [] 
netif_receive_skb_internal+0x1e/0x90
Aug 17 15:11:28 [kernel]  [] napi_gro_receive+0x78/0xa0
Aug 17 15:11:28 [kernel]  [] tg3_poll_work+0xc54/0xf40 [tg3]
Aug 17 15:11:28 [kernel]  [] ? consume_skb+0x2c/0x40
Aug 17 15:11:28 [kernel]  [] tg3_poll_msix+0x41/0x160 [tg3]
Aug 17 15:11:28 [kernel]  [] net_rx_action+0xe2/0x290
Aug 17 15:11:28 [kernel]  [] __do_softirq+0xda/0x1f0
Aug 17 15:11:28 [kernel]  [] irq_exit+0x76/0xa0
Aug 17 15:11:28 [kernel]  [] do_IRQ+0x55/0xf0
Aug 17 15:11:28 [kernel]  [] common_interrupt+0x6b/0x6b
Aug 17 15:12:31 [kernel]  

This issue occurred on fairly large networks of 32-64 sockets which send
several multicast messages all-to-all at the same time. The patch fixes the
issue by reusing the link_prepare_wakeup() procedure, which moves users, as
permitted by the space available in the send queue, to a separate queue which
in turn is passed to tipc_sk_rcv().
The link_prepare_wakeup() procedure was also modified a bit:
1. To enable its reuse, some actions related to the unicast link
were moved out of the function.
2. The internal loop no longer breaks when only one send
queue is exhausted; it continues to the end of the wakeup queue so that all
send queues can be refilled.
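
The effect of the second change can be sketched outside the kernel as
follows. This is a simplified model under stated assumptions, not the TIPC
code: the wakeup queue is an array, "moving to the result queue" is a flag,
and the demo_* names, DEMO_LEVELS and the limit arrays are hypothetical.

```c
#include <stddef.h>

#define DEMO_LEVELS 4	/* hypothetical number of importance levels */

struct demo_waiter {
	int imp;	/* importance level, 0 .. DEMO_LEVELS - 1 */
	int chain_sz;	/* number of buffers the user wants to send */
	int woken;	/* set to 1 when moved to the result queue */
};

/*
 * Mark every waiter whose pending chain still fits into the send queue
 * of its importance level. A level that has run out of room is skipped
 * with "continue" rather than "break", so waiters queued behind it on
 * other levels are still considered. Returns the number woken.
 */
static size_t demo_prepare_wakeup(struct demo_waiter *w, size_t n,
				  const int *backlog_len, const int *limit)
{
	int pending[DEMO_LEVELS] = { 0 };
	size_t i, woken = 0;

	for (i = 0; i < n; i++) {
		int imp = w[i].imp;

		if (imp < 0 || imp >= DEMO_LEVELS)
			continue;
		pending[imp] += w[i].chain_sz;
		if (pending[imp] + backlog_len[imp] >= limit[imp])
			continue;	/* this level is full, try the rest */
		w[i].woken = 1;
		woken++;
	}
	return woken;
}
```

Only the waiters flagged here would be handed on, so the consumer works on a
bounded list instead of spinning on a queue that cannot drain.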

Signed-off-by: Dmitry S Kolmakov 
---
diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index c5cbdcb..b56f74a 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -176,8 +176,12 @@ static void bclink_retransmit_pkt(struct tipc_net *tn, u32 
after, u32 to)
 void tipc_bclink_wakeup_users(struct net *net)
 {
struct tipc_net *tn = net_generic(net, tipc_net_id);
+   struct tipc_link *bcl = tn->bcl;
+   struct sk_buff_head resultq;

-   tipc_sk_rcv(net, &tn->bclink->link.wakeupq);
+   skb_queue_head_init(&resultq);
+   link_prepare_wakeup(bcl, &resultq);
+   tipc_sk_rcv(net, &resultq);
 }

 /**
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 43a515d..467edbc 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -372,10 +372,11 @@ err:
 /**
  * link_prepare_wakeup - prepare users for wakeup after congestion
  * @link: congested link
+ * @resultq: queue for users which can be woken up
  * Move a number of waiting users, as permitted by available space in
- * the send queue, from link wait queue to node wait queue 

[PATCH net-next 3/3] net: Add table id from route lookup to route response

2015-09-02 Thread David Ahern
rt_fill_info which is called for 'route get' requests hardcodes the
table id as RT_TABLE_MAIN which is not correct when multiple tables
are used. Use the newly added table id in the rtable to send back
the correct table.

Signed-off-by: David Ahern 
---
 net/ipv4/route.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 92acc95b7578..2738bf4132db 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2325,8 +2325,8 @@ static int rt_fill_info(struct net *net,  __be32 dst, 
__be32 src,
r->rtm_dst_len  = 32;
r->rtm_src_len  = 0;
r->rtm_tos  = fl4->flowi4_tos;
-   r->rtm_table= RT_TABLE_MAIN;
-   if (nla_put_u32(skb, RTA_TABLE, RT_TABLE_MAIN))
+   r->rtm_table= rt->rt_table_id;
+   if (nla_put_u32(skb, RTA_TABLE, rt->rt_table_id))
goto nla_put_failure;
r->rtm_type = rt->rt_type;
r->rtm_scope= RT_SCOPE_UNIVERSE;
-- 
1.9.1



Re: ip_rcv_finish() NULL pointer and possibly related Oopses

2015-09-02 Thread Shaun Crampton
> Make sure you backported commit
> 10e2eb878f3ca07ac2f05fa5ca5e6c4c9174a27a
> ("udp: fix dst races with multicast early demux")


I just tried the latest CoreOS alpha, which had that patch.  Sadly, I saw
just as many reboots.  Here's a sample of the different types of Oopses I
see (I've put the rest up in a gist:
https://gist.github.com/fasaxc/d801ced5608f2657abd8):

[ 4024.564479] BUG: unable to handle kernel NULL pointer dereference at
   (null)
[ 4024.565452] IP: [<  (null)>]   (null)
[ 4024.565452] PGD 2297067 PUD 2296067 PMD 0
[ 4024.565452] Oops: 0010 [#1] SMP
[ 4024.565452] Modules linked in: xt_mac xt_mark veth ip_set_hash_net
nf_conntrack_ipv6 nf_defrag_ipv6 xt_comment xt_set ip_set_hash_ip ip_set
nfnetlink ipip tunnel4 ip_tunnel ip6table_filter ip6_tables xt_conntrack
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter br_netfilter nf_nat
nf_conntrack bridge stp llc overlay nls_ascii nls_cp437 vfat fat ext4
crc16 mbcache jbd2 sd_mod crc32c_intel virtio_scsi scsi_mod aesni_intel
virtio_net mousedev aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd
microcode firmware_class virtio_pci virtio_ring psmouse virtio i2c_piix4
i2c_core acpi_cpufreq button evdev sch_fq_codel ip_tables autofs4
[ 4024.565452] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.6-coreos-r1 #2
[ 4024.565452] Hardware name: Google Google, BIOS Google 01/01/2011
[ 4024.565452] task: 81a154c0 ti: 81a0 task.ti:
81a0
[ 4024.565452] RIP: 0010:[<>]  [<  (null)>]
   (null)
[ 4024.565452] RSP: 0018:88021fc03c00  EFLAGS: 00010246
[ 4024.565452] RAX: 880003375d00 RBX: 880003375d00 RCX:
0001
[ 4024.565452] RDX: 88000306c000 RSI:  RDI:
880003375d00
[ 4024.565452] RBP: 88021fc03c28 R08: 5608 R09:
bb84
[ 4024.565452] R10: 0003 R11: 880215a30dc0 R12:
880214bfb000
[ 4024.565452] R13: 88000306c000 R14: 88000306c000 R15:
0008
[ 4024.565452] FS:  () GS:88021fc0()
knlGS:
[ 4024.565452] CS:  0010 DS:  ES:  CR0: 80050033
[ 4024.565452] CR2:  CR3: 01d92000 CR4:
001406f0
[ 4024.600761] Stack:
[ 4024.601081]  814ac9dc 8802 88000306c000
880003375d00
[ 4024.601081]  88008cbba84e 88021fc03c58 81486628
88021690a000
[ 4024.601081]  88008cbba84e 880003375d00 88000306c000
88021fc03cb8
[ 4024.601081] Call Trace:
[ 4024.601081]  
[ 4024.601081]  [] ? tcp_v4_early_demux+0x11c/0x160
[ 4024.601081]  [] ip_rcv_finish+0xb8/0x360
[ 4024.601081]  [] ip_rcv+0x2a4/0x400
[ 4024.601081]  [] ? inet_del_offload+0x40/0x40
[ 4024.601081]  [] __netif_receive_skb_core+0x6c3/0x9a0
[ 4024.601081]  [] ? build_skb+0x17/0x90
[ 4024.601081]  [] __netif_receive_skb+0x18/0x60
[ 4024.601081]  [] netif_receive_skb_internal+0x33/0xa0
[ 4024.601081]  [] netif_receive_skb_sk+0x1c/0x70
[ 4024.601081]  [] 0xa008772b
[ 4024.601081]  [] ? check_preempt_curr+0x80/0xa0
[ 4024.601081]  [] 0xa0087d81
[ 4024.601081]  [] net_rx_action+0x159/0x340
[ 4024.601081]  [] __do_softirq+0xf4/0x290
[ 4024.601081]  [] irq_exit+0xad/0xc0
[ 4024.601081]  [] do_IRQ+0x5a/0xf0
[ 4024.601081]  [] common_interrupt+0x6e/0x6e
[ 4024.601081]  
[ 4024.601081]  [] ? native_safe_halt+0x6/0x10
[ 4024.601081]  [] default_idle+0x1e/0xc0
[ 4024.601081]  [] arch_cpu_idle+0xf/0x20
[ 4024.601081]  [] cpu_startup_entry+0x314/0x3e0
[ 4024.601081]  [] rest_init+0x7c/0x80
[ 4024.601081]  [] start_kernel+0x483/0x490
[ 4024.601081]  [] ? set_init_arg+0x55/0x55
[ 4024.601081]  [] ? early_idt_handler_array+0x120/0x120
[ 4024.601081]  [] x86_64_start_reservations+0x2a/0x2c
[ 4024.601081]  [] x86_64_start_kernel+0x138/0x147
[ 4024.601081] Code:  Bad RIP value.
[ 4024.601081] RIP  [<  (null)>]   (null)
[ 4024.601081]  RSP 
[ 4024.601081] CR2: 
[ 4024.601081] ---[ end trace cdabfe9d7380aaab ]---
[ 4024.601081] Kernel panic - not syncing: Fatal exception in interrupt
[ 4024.601081] Kernel Offset: disabled
[ 4024.601081] Rebooting in 60 seconds..
[ 4024.601081] ACPI MEMORY or I/O RESET_REG.




[ 4811.261621] NULL pointer dereference at 0020
[ 4811.261621] IP: [] tcp_current_mss+0x2a/0x80
[ 4811.261621] PGD 214af5067 PUD 210de8067 PMD 0
[ 4811.261621] Oops:  [#2] SMP
[ 4811.261621] Modules linked in: xt_mac xt_mark veth ip_set_hash_net
nf_conntrack_ipv6 nf_defrag_ipv6 xt_comment xt_set ip_set_hash_ip ip_set
nfnetlink ipip tunnel4 ip_tunnel ip6table_filter ip6_tables xt_conntrack
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter br_netfilter nf_nat
nf_conntrack bridge stp llc overlay nls_ascii nls_cp437 vfat fat ext4
crc16 mbcache jbd2 sd_mod virtio_scsi scsi_mod virtio_net mousedev
crc32c_intel 

[PATCH net-next 1/3] net: Refactor rtable initialization

2015-09-02 Thread David Ahern
All callers to rt_dst_alloc have nearly the same initialization following
a successful allocation. Consolidate it into rt_dst_alloc.

Signed-off-by: David Ahern 
---
 net/ipv4/route.c | 85 ++--
 1 file changed, 33 insertions(+), 52 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 5f4a5565ad8b..eaefeadce07c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1438,12 +1438,33 @@ static void rt_set_nexthop(struct rtable *rt, __be32 
daddr,
 }
 
 static struct rtable *rt_dst_alloc(struct net_device *dev,
+  unsigned int flags, u16 type,
   bool nopolicy, bool noxfrm, bool will_cache)
 {
-   return dst_alloc(_dst_ops, dev, 1, DST_OBSOLETE_FORCE_CHK,
-(will_cache ? 0 : (DST_HOST | DST_NOCACHE)) |
-(nopolicy ? DST_NOPOLICY : 0) |
-(noxfrm ? DST_NOXFRM : 0));
+   struct rtable *rt;
+
+   rt = dst_alloc(_dst_ops, dev, 1, DST_OBSOLETE_FORCE_CHK,
+  (will_cache ? 0 : (DST_HOST | DST_NOCACHE)) |
+  (nopolicy ? DST_NOPOLICY : 0) |
+  (noxfrm ? DST_NOXFRM : 0));
+
+   if (rt) {
+   rt->rt_genid = rt_genid_ipv4(dev_net(dev));
+   rt->rt_flags = flags;
+   rt->rt_type = type;
+   rt->rt_is_input = 0;
+   rt->rt_iif = 0;
+   rt->rt_pmtu = 0;
+   rt->rt_gateway = 0;
+   rt->rt_uses_gateway = 0;
+   INIT_LIST_HEAD(&rt->rt_uncached);
+
+   rt->dst.output = ip_output;
+   if (flags & RTCF_LOCAL)
+   rt->dst.input = ip_local_deliver;
+   }
+
+   return rt;
 }
 
 /* called in rcu_read_lock() section */
@@ -1452,6 +1473,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 
daddr, __be32 saddr,
 {
struct rtable *rth;
struct in_device *in_dev = __in_dev_get_rcu(dev);
+   unsigned int flags = RTCF_MULTICAST;
u32 itag = 0;
int err;
 
@@ -1477,7 +1499,10 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 
daddr, __be32 saddr,
if (err < 0)
goto e_err;
}
-   rth = rt_dst_alloc(dev_net(dev)->loopback_dev,
+   if (our)
+   flags |= RTCF_LOCAL;
+
+   rth = rt_dst_alloc(dev_net(dev)->loopback_dev, flags, RTN_MULTICAST,
   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, false);
if (!rth)
goto e_nobufs;
@@ -1486,20 +1511,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 
daddr, __be32 saddr,
rth->dst.tclassid = itag;
 #endif
rth->dst.output = ip_rt_bug;
-
-   rth->rt_genid   = rt_genid_ipv4(dev_net(dev));
-   rth->rt_flags   = RTCF_MULTICAST;
-   rth->rt_type= RTN_MULTICAST;
rth->rt_is_input= 1;
-   rth->rt_iif = 0;
-   rth->rt_pmtu= 0;
-   rth->rt_gateway = 0;
-   rth->rt_uses_gateway = 0;
-   INIT_LIST_HEAD(&rth->rt_uncached);
-   if (our) {
-   rth->dst.input= ip_local_deliver;
-   rth->rt_flags |= RTCF_LOCAL;
-   }
 
 #ifdef CONFIG_IP_MROUTE
if (!ipv4_is_local_multicast(daddr) && IN_DEV_MFORWARD(in_dev))
@@ -1608,7 +1620,7 @@ static int __mkroute_input(struct sk_buff *skb,
}
}
 
-   rth = rt_dst_alloc(out_dev->dev,
+   rth = rt_dst_alloc(out_dev->dev, 0, res->type,
   IN_DEV_CONF_GET(in_dev, NOPOLICY),
   IN_DEV_CONF_GET(out_dev, NOXFRM), do_cache);
if (!rth) {
@@ -1616,19 +1628,10 @@ static int __mkroute_input(struct sk_buff *skb,
goto cleanup;
}
 
-   rth->rt_genid = rt_genid_ipv4(dev_net(rth->dst.dev));
-   rth->rt_flags = 0;
-   rth->rt_type = res->type;
rth->rt_is_input = 1;
-   rth->rt_iif = 0;
-   rth->rt_pmtu= 0;
-   rth->rt_gateway = 0;
-   rth->rt_uses_gateway = 0;
-   INIT_LIST_HEAD(&rth->rt_uncached);
RT_CACHE_STAT_INC(in_slow_tot);
 
rth->dst.input = ip_forward;
-   rth->dst.output = ip_output;
 
rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
@@ -1795,26 +1798,16 @@ out:return err;
}
}
 
-   rth = rt_dst_alloc(net->loopback_dev,
+   rth = rt_dst_alloc(net->loopback_dev, flags | RTCF_LOCAL, res.type,
   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache);
if (!rth)
goto e_nobufs;
 
-   rth->dst.input= ip_local_deliver;
rth->dst.output= ip_rt_bug;
 #ifdef CONFIG_IP_ROUTE_CLASSID
rth->dst.tclassid = itag;
 #endif
-
-   rth->rt_genid = rt_genid_ipv4(net);
-   rth->rt_flags   = flags|RTCF_LOCAL;
-   rth->rt_type= 

[PATCH net-next 2/3] net: Add FIB table id to rtable

2015-09-02 Thread David Ahern
Add the FIB table id to rtable to make the information available for
IPv4 as it is for IPv6.

Signed-off-by: David Ahern 
---
 drivers/net/vrf.c   | 2 ++
 include/net/route.h | 2 ++
 net/ipv4/route.c| 8 
 net/ipv4/xfrm4_policy.c | 1 +
 4 files changed, 13 insertions(+)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index e7094fbd7568..8c9ab5ebea23 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -320,6 +320,7 @@ static void vrf_rtable_destroy(struct net_vrf *vrf)
 
 static struct rtable *vrf_rtable_create(struct net_device *dev)
 {
+   struct net_vrf *vrf = netdev_priv(dev);
struct rtable *rth;
 
rth = dst_alloc(_dst_ops, dev, 2,
@@ -335,6 +336,7 @@ static struct rtable *vrf_rtable_create(struct net_device 
*dev)
rth->rt_pmtu= 0;
rth->rt_gateway = 0;
rth->rt_uses_gateway = 0;
+   rth->rt_table_id = vrf->tb_id;
INIT_LIST_HEAD(&rth->rt_uncached);
rth->rt_uncached_list = NULL;
}
diff --git a/include/net/route.h b/include/net/route.h
index cc61cb95f059..10a7d21a211c 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -64,6 +64,8 @@ struct rtable {
/* Miscellaneous cached information */
u32 rt_pmtu;
 
+   u32 rt_table_id;
+
struct list_headrt_uncached;
struct uncached_list*rt_uncached_list;
 };
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index eaefeadce07c..92acc95b7578 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1457,6 +1457,7 @@ static struct rtable *rt_dst_alloc(struct net_device *dev,
rt->rt_pmtu = 0;
rt->rt_gateway = 0;
rt->rt_uses_gateway = 0;
+   rt->rt_table_id = 0;
INIT_LIST_HEAD(&rt->rt_uncached);
 
rt->dst.output = ip_output;
@@ -1629,6 +1630,8 @@ static int __mkroute_input(struct sk_buff *skb,
}
 
rth->rt_is_input = 1;
+   if (res->table)
+   rth->rt_table_id = res->table->tb_id;
RT_CACHE_STAT_INC(in_slow_tot);
 
rth->dst.input = ip_forward;
@@ -1808,6 +1811,8 @@ out:  return err;
rth->dst.tclassid = itag;
 #endif
rth->rt_is_input = 1;
+   if (res.table)
+   rth->rt_table_id = res.table->tb_id;
 
RT_CACHE_STAT_INC(in_slow_tot);
if (res.type == RTN_UNREACHABLE) {
@@ -1988,6 +1993,9 @@ static struct rtable *__mkroute_output(const struct 
fib_result *res,
return ERR_PTR(-ENOBUFS);
 
rth->rt_iif = orig_oif ? : 0;
+   if (res->table)
+   rth->rt_table_id = res->table->tb_id;
+
RT_CACHE_STAT_INC(out_slow_tot);
 
if (flags & (RTCF_BROADCAST | RTCF_MULTICAST)) {
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index bb919b28619f..671011055ad5 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -95,6 +95,7 @@ static int xfrm4_fill_dst(struct xfrm_dst *xdst, struct 
net_device *dev,
xdst->u.rt.rt_gateway = rt->rt_gateway;
xdst->u.rt.rt_uses_gateway = rt->rt_uses_gateway;
xdst->u.rt.rt_pmtu = rt->rt_pmtu;
+   xdst->u.rt.rt_table_id = rt->rt_table_id;
INIT_LIST_HEAD(&xdst->u.rt.rt_uncached);
 
return 0;
-- 
1.9.1



Re: [PATCHv1 net-next 0/5] netlink: mmap: kernel panic and some issues

2015-09-02 Thread Daniel Borkmann

On 09/02/2015 01:35 PM, Ken-ichirou MATSUZAWA wrote:

Thank you for the reply.

On Wed, Sep 02, 2015 at 11:47:26AM +0200, Daniel Borkmann wrote:

On 09/02/2015 02:04 AM, Ken-ichirou MATSUZAWA wrote:

Talking about skb_copy path, original skb's shared info is accessed
only in copy_skb_header, to get gso related field. As a result of


It's still not correct. The thing is you can neither call skb_copy() nor
skb_clone() on netlink mmaped skbs. For example, skb_copy_bits() would


I am sorry for the lack of explanation.
And I am afraid I misunderstand...

The only pointer into the data area that is updated in a mmaped netlink skb
is its tail. Head, data and end are not updated. skb_copy() calls

 int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)

with "offset" always 0 and "len" equal to skb->len. In
skb_copy_bits() both "start" and "copy" are skb->len, which means
"len - copy" is always 0, so it returns 0 before accessing the shared
info.

I don't know whether this is intended or not, but it seems that
skb_copy() for a mmaped skb will not access its shared info.
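
The early return described here can be modelled in userspace. The sketch
below is a simplification written for this discussion, not the kernel
function: struct demo_skb, demo_copy_bits() and the shinfo_reads counter are
hypothetical, and the paged part of the copy is not implemented at all.

```c
#include <string.h>

/* Hypothetical stand-in for the parts of an skb that matter here. */
struct demo_skb {
	const unsigned char *data;
	int len;		/* total length */
	int headlen;		/* bytes held in the linear area */
	int shinfo_reads;	/* counts touches of the modelled shared info */
};

/* Copy len bytes starting at offset; 0 on success, -1 otherwise. */
static int demo_copy_bits(struct demo_skb *skb, int offset, void *to, int len)
{
	int start = skb->headlen;
	int copy;

	if (offset < 0 || len < 0 || offset > skb->len - len)
		return -1;

	copy = start - offset;
	if (copy > 0) {
		if (copy > len)
			copy = len;
		memcpy(to, skb->data + offset, copy);
		len -= copy;
		if (len == 0)
			return 0;	/* fully linear: done before frags */
	}

	/* The real function would walk the fragments at this point. */
	skb->shinfo_reads++;
	return -1;			/* paged data is not modelled */
}
```

For a fully linear buffer copied from offset 0, the function returns before
the counter is touched. That says nothing about other callers such as the
header copy, which is a separate question.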


Okay, right, since it's all linear, but ...


After that, copy_skb_header() will set the newly allocated skb's (wrong)
gso fields; I asked whether we should clear them or not.


... here we still access skb_shinfo() from the mmap'ed skb, which we
are simply not allowed to do (regardless of whether the fields are reset
later on, as you suggest), for two reasons: I think (will start experimenting
more with it tomorrow) you would get an out-of-bounds access here in
case the skb->data is the last slot in the ring buffer and reaches
exactly to the ring buffer end. And (apart from that), it's also hard
to maintain - the next person adding a new shared info member will very
likely overlook this special case in netlink here, thus the issue would
then simply be reintroduced over and over.

Thanks,
Daniel


Re: [PATCH net-next] ipv6: fix multipath route replace error recovery

2015-09-02 Thread roopa

On 9/1/15, 10:54 PM, Roopa Prabhu wrote:

From: Roopa Prabhu 

Problem:
The ecmp route replace support for ipv6 in the kernel deletes the
existing ecmp route too early, i.e. when it installs the first nexthop.
If there is an error in installing the subsequent nexthops, it's too late
to recover the already deleted existing route.

This patch fixes the problem with the following:
a) Changes the existing multipath route add code to a two stage process:
   build rt6_infos + insert them
ip6_route_add rt6_info creation code is moved into
ip6_route_info_create.
b) This ensures that all errors are caught during building rt6_infos
   and we fail early
c) Separates multipath add and del code. Because add needs the special
   two stage mode in a) and delete essentially does not care.
d) In any event if the code fails during inserting a route again, a
   warning is printed (This should be unlikely)
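
The two-stage idea in a) and b) can be sketched in standalone C as below.
This is a toy model, not the route.c code: struct demo_route,
demo_route_replace() and DEMO_MAX_NH are hypothetical, nexthops are plain
integers, and the duplicate check merely stands in for whatever can fail
while building the rt6_infos. It is not meant to mirror where the kernel
actually detects that error.

```c
#include <stddef.h>

#define DEMO_MAX_NH 8	/* hypothetical capacity of the toy route */

struct demo_route {
	int nh[DEMO_MAX_NH];	/* nexthop ids */
	size_t n;		/* number of nexthops in use */
};

/*
 * Stage 1 validates the whole new nexthop list. Stage 2 installs it and
 * runs only if stage 1 succeeded, so a bad request leaves the currently
 * installed route untouched. Returns 0 on success, -1 on error.
 */
static int demo_route_replace(struct demo_route *rt, const int *nh, size_t n)
{
	size_t i, j;

	/* Stage 1: build and check everything, fail early. */
	if (n == 0 || n > DEMO_MAX_NH)
		return -1;
	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			if (nh[i] == nh[j])
				return -1;	/* duplicate nexthop */

	/* Stage 2: nothing can fail any more, so replace. */
	for (i = 0; i < n; i++)
		rt->nh[i] = nh[i];
	rt->n = n;
	return 0;
}
```

A request with a repeated nexthop is rejected and the old three-way route
survives, which is the behaviour shown in the "After the patch" output.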

Before the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
/* previously added ecmp route 3000:1000:1000:1000::2 disappears from
  * kernel */

After the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

Fixes: 4a287eba2de3 ("IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags 
support, warn about missing CREATE flag")
Signed-off-by: Roopa Prabhu 
---
This bug is present in 4.1 kernel and 4.2 too.
Since 4.2 is out or almost out, I am submitting the patch against net-next.
I can respin against net if needed. The part of the patch that I would
appreciate more eyes on is the cleanup of the rt6_infos in
ip_route_multipath_add. And
I have tried to keep the changes local to route.c closer to the netlink
message handling. Most of the code changes are moving code into separate
functions.

  net/ipv6/route.c | 205 ---
  1 file changed, 179 insertions(+), 26 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f45cac6..b1b8c96 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1748,7 +1748,7 @@ static int ip6_convert_metrics(struct mx6_config *mxc,
return -EINVAL;
  }
  
-int ip6_route_add(struct fib6_config *cfg)

+int ip6_route_info_create(struct fib6_config *cfg, struct rt6_info **rt_ret)
  {
int err;
struct net *net = cfg->fc_nlinfo.nl_net;
@@ -1756,7 +1756,6 @@ int ip6_route_add(struct fib6_config *cfg)
struct net_device *dev = NULL;
struct inet6_dev *idev = NULL;
struct fib6_table *table;
-   struct mx6_config mxc = { .mx = NULL, };
int addr_type;
  
  	if (cfg->fc_dst_len > 128 || cfg->fc_src_len > 128)

@@ -1981,6 +1980,32 @@ install_route:
  
  	cfg->fc_nlinfo.nl_net = dev_net(dev);
  
+	*rt_ret = rt;

+
+   return 0;
+out:
+   if (dev)
+   dev_put(dev);
+   if (idev)
+   in6_dev_put(idev);
+   if (rt)
+   dst_free(&rt->dst);
+
+   *rt_ret = NULL;
+
+   return err;
+}
+
+int ip6_route_add(struct fib6_config *cfg)
+{
+   struct mx6_config mxc = { .mx = NULL, };
+   struct rt6_info *rt = NULL;
+   int err;
+
+   err = ip6_route_info_create(cfg, );
+   if (err)
+   goto out;
+
err = ip6_convert_metrics(, cfg);
if (err)
goto out;
@@ -1988,14 +2013,12 @@ install_route:
err = __ip6_ins_rt(rt, >fc_nlinfo, );
  
  	kfree(mxc.mx);

+
return err;
  out:
-   if (dev)
-   dev_put(dev);
-   if (idev)
-   in6_dev_put(idev);
if (rt)
dst_free(>dst);
+
return err;
  }
  
@@ -2776,19 +2799,79 @@ errout:

return err;
  }
  
-static int ip6_route_multipath(struct fib6_config *cfg, int add)

+struct rt6_nh {
+   struct rt6_info *rt6_info;
+   struct fib6_config r_cfg;
+   struct 

Re: [PATCH] flow_dissector: Use 'const' where possible.

2015-09-02 Thread Jiri Pirko
Wed, Sep 02, 2015 at 06:33:34AM CEST, t...@herbertland.com wrote:
>On Tue, Sep 1, 2015 at 9:19 PM, David Miller  wrote:
>>
>> Signed-off-by: David S. Miller 
>> ---
>>  include/linux/skbuff.h|  8 ++---
>>  include/net/flow.h|  8 ++---
>>  net/core/flow_dissector.c | 79 
>> ---
>>  3 files changed, 49 insertions(+), 46 deletions(-)
>>




>> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>> index 345a040..d79699c 100644
>> --- a/net/core/flow_dissector.c
>> +++ b/net/core/flow_dissector.c
>> @@ -19,14 +19,14 @@
>>  #include 
>>  #include 
>>
>> -static bool skb_flow_dissector_uses_key(struct flow_dissector 
>> *flow_dissector,
>> -   enum flow_dissector_key_id key_id)
>> +static bool dissector_uses_key(const struct flow_dissector *flow_dissector,
>> +  enum flow_dissector_key_id key_id)
>>  {
>> return flow_dissector->used_keys & (1 << key_id);
>>  }
>>
>> -static void skb_flow_dissector_set_key(struct flow_dissector 
>> *flow_dissector,
>> -  enum flow_dissector_key_id key_id)
>> +static void dissector_set_key(struct flow_dissector *flow_dissector,
>> + enum flow_dissector_key_id key_id)
>>  {
>> flow_dissector->used_keys |= (1 << key_id);
>>  }
>> @@ -51,20 +51,20 @@ void skb_flow_dissector_init(struct flow_dissector 
>> *flow_dissector,
>
>I suppose we should drop skb_ from skb_flow_dissector_init and
>skb_flow_dissector_target as well.

I like to have "namespaces" by function prefixes. Code is easier to read
then...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] cfg80211: regulatory: restore proper user alpha2

2015-09-02 Thread Maciej S. Szmigiero
restore_regulatory_settings() should restore alpha2
as computed in restore_alpha2(), not the raw user_alpha2, to
behave as described in the comment just above that code.

This fixes an endless loop of CRDA calls for the "00" and "97"
countries after resume from suspend on my laptop.

Looks like others had the same problem, too:
http://ath9k-devel.ath9k.narkive.com/knY5W6St/ath9k-and-crda-messages-in-logs
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/899335
https://forum.porteus.org/viewtopic.php?t=4975=36436
https://forums.opensuse.org/showthread.php/483356-Authentication-Regulatory-Domain-issues-ath5k-12-2

Signed-off-by: Maciej Szmigiero 
---
 net/wireless/reg.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/wireless/reg.c b/net/wireless/reg.c
index 70aef72..7258246 100644
--- a/net/wireless/reg.c
+++ b/net/wireless/reg.c
@@ -2625,7 +2625,7 @@ static void restore_regulatory_settings(bool reset_user)
 	 * settings, user regulatory settings takes precedence.
 	 */
 	if (is_an_alpha2(alpha2))
-		regulatory_hint_user(user_alpha2, NL80211_USER_REG_HINT_USER);
+		regulatory_hint_user(alpha2, NL80211_USER_REG_HINT_USER);
 
 	spin_lock(&reg_requests_lock);
 	list_splice_tail_init(&tmp_reg_req_list, &reg_requests_list);


Re: [PATCH] bgmac: Update fixed_phy_register()

2015-09-02 Thread David Miller
From: Fabio Estevam 
Date: Wed,  2 Sep 2015 13:25:59 -0300

> From: Fabio Estevam 
> 
> Commit a5597008dbc2 ("phy: fixed_phy: Add gpio to determine link up/down.")
> added a new argument to fixed_phy_register(), but missed to update bgmac
> driver, causing the following build failure:
> 
> drivers/net/ethernet/broadcom/bgmac.c:1450:2: error: too few arguments to 
> function 'fixed_phy_register'
> 
> Add the missing argument.
> 
> Reported-by: Mark Brown 
> Signed-off-by: Fabio Estevam 

Applied, thanks.


Re: [PATCH v2 net-next] xen-netback: add support for multicast control

2015-09-02 Thread David Miller
From: Paul Durrant 
Date: Wed, 2 Sep 2015 17:58:36 +0100

> Xen's PV network protocol includes messages to add/remove ethernet
> multicast addresses to/from a filter list in the backend. This allows
> the frontend to request the backend only forward multicast packets
> which are of interest thus preventing unnecessary noise on the shared
> ring.
> 
> The canonical netif header in git://xenbits.xen.org/xen.git specifies
> the message format (two more XEN_NETIF_EXTRA_TYPEs) so the minimal
> necessary changes have been pulled into include/xen/interface/io/netif.h.
> 
> To prevent the frontend from extending the multicast filter list
> arbitrarily a limit (XEN_NETBK_MCAST_MAX) has been set to 64 entries.
> This limit is not specified by the protocol and so may change in future.
> If the limit is reached then the next XEN_NETIF_EXTRA_TYPE_MCAST_ADD
> sent by the frontend will be failed with NETIF_RSP_ERROR.
> 
> Signed-off-by: Paul Durrant 

Applied.
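
To make the quoted limit concrete, here is a minimal sketch of a bounded
multicast filter list in plain C. It is not the xen-netback code: the
names mcast_filter, mcast_add, mcast_del and mcast_match are hypothetical,
and the only detail taken from the description above is the 64-entry cap
and the fact that an add beyond the cap is refused.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MCAST_MAX 64 /* same value as the limit quoted above */
#define MAC_LEN   6

/* Hypothetical stand-in for a per-interface multicast filter list. */
struct mcast_filter {
	uint8_t addr[MCAST_MAX][MAC_LEN];
	unsigned int count;
};

/* Returns true if addr is present in the filter. */
bool mcast_match(const struct mcast_filter *f, const uint8_t *addr)
{
	for (unsigned int i = 0; i < f->count; i++)
		if (memcmp(f->addr[i], addr, MAC_LEN) == 0)
			return true;
	return false;
}

/* Returns 0 on success, -1 once the list is full. The -1 case is where
 * a backend would answer the frontend's request with an error response.
 */
int mcast_add(struct mcast_filter *f, const uint8_t *addr)
{
	if (f->count >= MCAST_MAX)
		return -1;
	memcpy(f->addr[f->count++], addr, MAC_LEN);
	return 0;
}

/* Removes the first matching entry, if any. */
void mcast_del(struct mcast_filter *f, const uint8_t *addr)
{
	for (unsigned int i = 0; i < f->count; i++) {
		if (memcmp(f->addr[i], addr, MAC_LEN) == 0) {
			/* Fill the hole with the last entry; order is irrelevant. */
			memmove(f->addr[i], f->addr[f->count - 1], MAC_LEN);
			f->count--;
			return;
		}
	}
}
```

A fixed-size array keeps the worst-case memory per interface bounded,
which is the point of capping what the frontend can request.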


Re: [net-next PATCH] net: ipv6: use common fib_default_rule_pref

2015-09-02 Thread Thomas Graf
On 09/02/15 at 11:34am, David Miller wrote:
> From: Phil Sutter 
> Date: Wed,  2 Sep 2015 15:03:12 +0200
> 
> > This switches IPv6 policy routing to use the shared
> > fib_default_rule_pref() function of IPv4 and DECnet. It is also used in
> > multicast routing for IPv4 as well as IPv6.
> > 
> > The motivation for this patch is a complaint about iproute2 behaving
> > inconsistent between IPv4 and IPv6 when adding policy rules: Formerly,
> > IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the
> > assigned priority value was decreased with each rule added.
> > 
> > Signed-off-by: Phil Sutter 
> 
> All ->default_pref() methods are therefore going to be set to the
> default, so just kill off the method entirely and call
> fib_default_rule_pref() directly.

How strict are we with regard to compatibility here? New IPv6 rules
with no pref specified currently get appended at the end of the list
whereas this would start inserting at the head.

I'm absolutely in favour of the new behaviour but this could break
scripts which do not have proper prefs specified.
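
The ordering difference can be illustrated with a small model. This is
only a sketch under assumptions: the rule list is modeled as an ascending
array of priorities, default_pref_fixed() stands for the old fixed IPv6
value of 0x3FFF mentioned above, and default_pref_decreasing() assumes
the shared helper picks "priority of the second rule minus one", which
is my reading of it and not confirmed in this thread. All function names
here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Old IPv6 behaviour as described above: every rule added without an
 * explicit pref gets the same fixed priority.
 */
uint32_t default_pref_fixed(void)
{
	return 0x3FFF;
}

/* Assumed model of the shared helper: one less than the priority of the
 * second rule in the list, or 0 if that cannot be computed.
 */
uint32_t default_pref_decreasing(const uint32_t *prefs, size_t n)
{
	if (n >= 2 && prefs[1] != 0)
		return prefs[1] - 1;
	return 0;
}

/* Insert pref into the ascending array; a rule with an equal priority
 * goes after the existing ones. Returns the index used. The caller must
 * provide room for one more element.
 */
size_t rule_insert(uint32_t *prefs, size_t *n, uint32_t pref)
{
	size_t pos = 0;

	while (pos < *n && prefs[pos] <= pref)
		pos++;
	memmove(&prefs[pos + 1], &prefs[pos], (*n - pos) * sizeof(*prefs));
	prefs[pos] = pref;
	(*n)++;
	return pos;
}
```

With the fixed value, two rules added in a row end up in the order they
were added. With the decreasing value, each new rule lands in front of
the previously added one, which is the behaviour change that could
surprise scripts relying on insertion order.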


[PATCH net-next] net: Support ip route get via given table

2015-09-02 Thread David Ahern
Add support for 'ip [-6] route get table X' where the user wants to
force the FIB lookup from a given table.

Signed-off-by: David Ahern 
---
 include/net/flow.h  |  4 
 include/net/ip_fib.h| 15 +++
 net/ipv4/fib_frontend.c |  2 ++
 net/ipv4/route.c|  2 ++
 net/ipv6/route.c| 24 
 5 files changed, 47 insertions(+)

diff --git a/include/net/flow.h b/include/net/flow.h
index acd6a096250e..910f2dcaab78 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -36,6 +36,7 @@ struct flowi_common {
 #define FLOWI_FLAG_KNOWN_NH0x02
 #define FLOWI_FLAG_VRFSRC  0x04
__u32   flowic_secid;
+   __u32   flowic_table_id;
struct flowi_tunnel flowic_tun_key;
 };
 
@@ -74,6 +75,7 @@ struct flowi4 {
 #define flowi4_flags   __fl_common.flowic_flags
 #define flowi4_secid   __fl_common.flowic_secid
 #define flowi4_tun_key __fl_common.flowic_tun_key
+#define flowi4_table_id__fl_common.flowic_table_id
 
/* (saddr,daddr) must be grouped, same order as in IP header */
__be32  saddr;
@@ -103,6 +105,7 @@ static inline void flowi4_init_output(struct flowi4 *fl4, 
int oif,
fl4->flowi4_proto = proto;
fl4->flowi4_flags = flags;
fl4->flowi4_secid = 0;
+   fl4->flowi4_table_id = 0;
fl4->flowi4_tun_key.tun_id = 0;
fl4->daddr = daddr;
fl4->saddr = saddr;
@@ -132,6 +135,7 @@ struct flowi6 {
 #define flowi6_flags   __fl_common.flowic_flags
 #define flowi6_secid   __fl_common.flowic_secid
 #define flowi6_tun_key __fl_common.flowic_tun_key
+#define flowi6_table_id__fl_common.flowic_table_id
struct in6_addr daddr;
struct in6_addr saddr;
__be32  flowlabel;
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index a37d0432bebd..c7024094726d 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -233,6 +233,9 @@ static inline int fib_lookup(struct net *net, const struct 
flowi4 *flp,
struct fib_table *tb;
int err = -ENETUNREACH;
 
+   if (flp->flowi4_table_id && flp->flowi4_table_id != RT_TABLE_MAIN)
+   return -ENETUNREACH;
+
rcu_read_lock();
 
tb = fib_get_table(net, RT_TABLE_MAIN);
@@ -261,6 +264,18 @@ static inline int fib_lookup(struct net *net, struct 
flowi4 *flp,
int err;
 
flags |= FIB_LOOKUP_NOREF;
+   if (flp->flowi4_table_id) {
+   err = -ENETUNREACH;
+
+   rcu_read_lock();
+   tb = fib_get_table(net, flp->flowi4_table_id);
+   if (tb)
+   err = fib_table_lookup(tb, flp, res, flags);
+   rcu_read_unlock();
+
+   return err;
+   }
+
if (net->ipv4.fib_has_custom_rules)
return __fib_lookup(net, flp, res, flags);
 
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 6fcbd215cdbc..65519445ca0d 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -129,6 +129,7 @@ struct fib_table *fib_get_table(struct net *net, u32 id)
}
return NULL;
 }
+EXPORT_SYMBOL(fib_get_table);
 #endif /* CONFIG_IP_MULTIPLE_TABLES */
 
 static void fib_replace_table(struct net *net, struct fib_table *old,
@@ -339,6 +340,7 @@ static int __fib_validate_source(struct sk_buff *skb, 
__be32 src, __be32 dst,
fl4.saddr = dst;
fl4.flowi4_tos = tos;
fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
+   fl4.flowi4_table_id = 0;
fl4.flowi4_tun_key.tun_id = 0;
 
no_addr = idev->ifa_list == NULL;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 5f4a5565ad8b..b3e5ee821450 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2476,6 +2476,8 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, 
struct nlmsghdr *nlh)
fl4.flowi4_tos = rtm->rtm_tos;
fl4.flowi4_oif = tb[RTA_OIF] ? nla_get_u32(tb[RTA_OIF]) : 0;
fl4.flowi4_mark = mark;
+   if (tb[RTA_TABLE])
+   fl4.flowi4_table_id = nla_get_u32(tb[RTA_TABLE]);
 
if (iif) {
struct net_device *dev;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f45cac6f8356..f605c8ea5a16 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -61,6 +61,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -1142,6 +1143,20 @@ static struct rt6_info *ip6_pol_route(struct net *net, 
struct fib6_table *table,
}
 }
 
+static struct dst_entry *ip6_route_table(struct net *net, int flags,
+struct flowi6 *fl6,
+pol_lookup_t lookup)
+{
+   struct rt6_info *rt = NULL;
+   struct fib6_table *table;
+
+   table = fib6_get_table(net, fl6->flowi6_table_id);
+   if (table)
+   rt = lookup(net, table, fl6, FIB_LOOKUP_NOREF | flags);
+
+

[PATCH net-next v2] net: Add table id from route lookup to route response

2015-09-02 Thread David Ahern
IPv4 ABI has the table hardcoded as RT_TABLE_MAIN regardless of the table
hit for the route lookup. Add the table using a new attribute,
RTA_TABLE_LOOKUP, to maintain the ABI yet return the right table id.

Signed-off-by: David Ahern 
---

Thomas: Something like this?

The current ABI is returning wrong data in some cases; that seems worse
to me than breaking the ABI.
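
The compatibility idea described above (keep reporting RT_TABLE_MAIN in
the existing field, add the real table only as an extra attribute) can be
sketched outside the kernel as follows. The struct msg container, the
ATTR_* ids and put_u32() are hypothetical stand-ins for the netlink
message and nla_put_u32(); only the value 254 for RT_TABLE_MAIN comes
from the kernel UAPI headers.

```c
#include <stddef.h>
#include <stdint.h>

#define RT_TABLE_MAIN 254 /* value from the kernel UAPI headers */

/* Hypothetical attribute ids standing in for RTA_TABLE and the proposed
 * RTA_TABLE_LOOKUP.
 */
enum { ATTR_TABLE = 1, ATTR_TABLE_LOOKUP = 2 };

struct attr {
	int type;
	uint32_t value;
};

/* Hypothetical, tiny stand-in for a netlink message under construction. */
struct msg {
	struct attr attrs[4];
	size_t n;
	uint8_t rtm_table;
};

/* Append one u32 attribute; returns -1 when the message is full. */
static int put_u32(struct msg *m, int type, uint32_t value)
{
	if (m->n >= sizeof(m->attrs) / sizeof(m->attrs[0]))
		return -1;
	m->attrs[m->n].type = type;
	m->attrs[m->n].value = value;
	m->n++;
	return 0;
}

/* Old consumers keep seeing RT_TABLE_MAIN; new consumers can look for
 * the extra attribute, which is emitted only when the lookup hit a
 * table other than main.
 */
int fill_table_info(struct msg *m, uint32_t table_id)
{
	m->rtm_table = RT_TABLE_MAIN;
	if (put_u32(m, ATTR_TABLE, RT_TABLE_MAIN))
		return -1;

	if (table_id && table_id != RT_TABLE_MAIN &&
	    put_u32(m, ATTR_TABLE_LOOKUP, table_id))
		return -1;

	return 0;
}
```

A reader that does not know the new attribute simply skips it, so the
existing fields keep their old meaning.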

 include/uapi/linux/rtnetlink.h | 1 +
 net/ipv4/route.c   | 5 +
 net/ipv6/route.c   | 4 
 3 files changed, 10 insertions(+)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 702024769c74..5add1468350a 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -310,6 +310,7 @@ enum rtattr_type_t {
RTA_PREF,
RTA_ENCAP_TYPE,
RTA_ENCAP,
+   RTA_TABLE_LOOKUP,  /* table hit for fib lookup */
__RTA_MAX
 };
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 92acc95b7578..95454c368e66 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2328,6 +2328,11 @@ static int rt_fill_info(struct net *net,  __be32 dst, 
__be32 src,
r->rtm_table= RT_TABLE_MAIN;
if (nla_put_u32(skb, RTA_TABLE, RT_TABLE_MAIN))
goto nla_put_failure;
+
+   if (rt->rt_table_id && rt->rt_table_id != RT_TABLE_MAIN &&
+   nla_put_u32(skb, RTA_TABLE_LOOKUP, rt->rt_table_id))
+   goto nla_put_failure;
+
r->rtm_type = rt->rt_type;
r->rtm_scope= RT_SCOPE_UNIVERSE;
r->rtm_protocol = RTPROT_UNSPEC;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f45cac6f8356..3c5d3a50bb7b 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2922,6 +2922,10 @@ static int rt6_fill_node(struct net *net,
rtm->rtm_table = table;
if (nla_put_u32(skb, RTA_TABLE, table))
goto nla_put_failure;
+
+   if (table && nla_put_u32(skb, RTA_TABLE_LOOKUP, table))
+   goto nla_put_failure;
+
if (rt->rt6i_flags & RTF_REJECT) {
switch (rt->dst.error) {
case -EINVAL:
-- 
1.9.1



Re: [PATCH net-next v2] net: Add table id from route lookup to route response

2015-09-02 Thread Thomas Graf
On 09/02/15 at 01:16pm, David Ahern wrote:
> IPv4 ABI has the table hardcoded as RT_TABLE_MAIN regardless of the table
> hit for the route lookup. Add the table using a new attribute,
> RTA_TABLE_LOOKUP, to maintain the ABI yet return the right table id.
> 
> Signed-off-by: David Ahern 
> ---
> 
> Thomas: Something like this?
> 
> The current ABI is returning wrong data in some cases; that seems worse
> to me than breaking the ABI.

Another option is to introduce a new flag bundled with RTM_GETROUTE
which fixes RTM_GETROUTE altogether and makes it return the actual
route instead of a simulated cache entry.


Re: [PATCH net-next 1/4] net: qdisc: add op to run filters/actions before enqueue

2015-09-02 Thread Jamal Hadi Salim

On 09/02/15 02:22, Cong Wang wrote:
> (Why not Cc'ing Jamal for net_sched pathes?)
>
> On Tue, Sep 1, 2015 at 9:34 AM, Daniel Borkmann  wrote:
> > From: John Fastabend 
> >
> > Add a new ->preclassify() op to allow multiqueue queuing disciplines
> > to call tc_classify() or perform other work before dev_pick_tx().
> >
> > This helps, for example, with mqprio queueing discipline that has
> > offload support by most popular 10G NICs, where the txq effectively
> > picks the qdisc.
> >
> > Once traffic is being directed to a specific queue then hardware TX
> > rings may be tuned to support this traffic type. mqprio already
> > gives the ability to do this via skb->priority where the ->preclassify()
> > provides more control over packet steering, it can classify the skb
> > and set the priority, for example, from an eBPF classifier (or action).
> >
> > Also this allows traffic classifiers to be run without holding the
> > qdisc lock and gives one place to attach filters when mqprio is
> > in use. ->preclassify() could also be added to other mq qdiscs later
> > on: f.e. most classful qdiscs first check major/minor numbers of
> > skb->priority before actually consulting a more complex classifier.
> >
> > For mqprio case today, a filter has to be attached to each txq qdisc
> > to have all traffic hit the filter. Since ->preclassify() is currently
> > only used by mqprio, the __dev_queue_xmit() fast path is guarded by
> > a generic, hidden Kconfig option (NET_CLS_PRECLASSIFY) that is only
> > selected by mqprio, otherwise it defaults to off. Also, the Qdisc
> > structure size will stay the same, we move __parent, used by cbq only
> > into a write-mostly hole. If actions are enabled, __parent is written
> > on every enqueue, and only read, rewritten in reshape_fail() phase.
> > Therefore, this place in the read-mostly cacheline could be used by
> > preclassify, which is written only once.
>
> I don't like this approach. Ideally, qdisc layer should be totally
> on top of tx queues, which means tx queue selection should
> happen after dequeue. I looked at this before, the change is not
> trivial at all given the fact that qdisc ties too much with tx queue
> probably due to historical reasons, especially the tx softirq part.
> But that is really a long-term solution for me.
>
> I have no big objection for this as a short-term solution, however,
> once we add these filters before enqueue, we can't remove them
> any more. We really need to think twice about it.
>
> Jamal, do you have any better idea?

Sorry for the top quote:
Given the RCU-ification of classifiers, I believe the idea will
mostly work; expect users to go nuts attaching all kinds of
classifiers and actions that won't work (for example, I don't think
the connmark action would work nicely here).
Could we strive to do proper offload à la switchdev?

The comment on the patch on reshape_fail + __parent: for the
record, that is an extremely useful feature (allows an inner qdisc
to provide an opportunity for a classful parent qdisc to
reclassify and therefore reschedule).
Yes, CBQ is the only user - but maybe if it was properly documented
more schedulers could put it to good use.

cheers,
jamal



Re: [PATCH net-next v2] net: Add table id from route lookup to route response

2015-09-02 Thread Stephen Hemminger
On Wed,  2 Sep 2015 13:16:20 -0700
David Ahern  wrote:

> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 702024769c74..5add1468350a 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -310,6 +310,7 @@ enum rtattr_type_t {
>   RTA_PREF,
>   RTA_ENCAP_TYPE,
>   RTA_ENCAP,
> + RTA_TABLE_LOOKUP,  /* table hit for fib lookup */

Why add a comment here? There is nothing special that needs a comment.


Re: [PATCH nf-next] netfilter: nf_dup{4,6}: fix build error when nf_conntrack disabled

2015-09-02 Thread David Miller
From: Daniel Borkmann 
Date: Wed,  2 Sep 2015 20:54:02 +0200

> While testing various Kconfig options on another issue, I found that
> the following one triggers as well on allmodconfig and nf_conntrack
> disabled:
> 
>   net/ipv4/netfilter/nf_dup_ipv4.c: In function ‘nf_dup_ipv4’:
>   net/ipv4/netfilter/nf_dup_ipv4.c:72:20: error: ‘nf_skb_duplicated’ 
> undeclared (first use in this function)
> if (this_cpu_read(nf_skb_duplicated))
>   [...]
>   net/ipv6/netfilter/nf_dup_ipv6.c: In function ‘nf_dup_ipv6’:
>   net/ipv6/netfilter/nf_dup_ipv6.c:66:20: error: ‘nf_skb_duplicated’ 
> undeclared (first use in this function)
> if (this_cpu_read(nf_skb_duplicated))
> 
> Fix it by including directly the header where it is defined.
> 
> Fixes: bbde9fc1824a ("netfilter: factor out packet duplication for IPv4/IPv6")
> Signed-off-by: Daniel Borkmann 

I'll take this directly to simplify things.

Thanks Daniel.


Re: [net-next PATCH] net: ipv6: use common fib_default_rule_pref

2015-09-02 Thread David Miller
From: Phil Sutter 
Date: Wed,  2 Sep 2015 15:03:12 +0200

> This switches IPv6 policy routing to use the shared
> fib_default_rule_pref() function of IPv4 and DECnet. It is also used in
> multicast routing for IPv4 as well as IPv6.
> 
> The motivation for this patch is a complaint about iproute2 behaving
> inconsistent between IPv4 and IPv6 when adding policy rules: Formerly,
> IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the
> assigned priority value was decreased with each rule added.
> 
> Signed-off-by: Phil Sutter 

All ->default_pref() methods are therefore going to be set to the
default, so just kill off the method entirely and call
fib_default_rule_pref() directly.


[PATCH net 2/2] sctp: add routing output fallback

2015-09-02 Thread Marcelo Ricardo Leitner
Commit 0ca50d12fe46 added a restriction that the address must belong to
the output interface, so that sctp will use the right interface even
when using secondary addresses.

But it breaks IPVS setups, in which people are used to attaching VIP
addresses to the loopback interface on real servers. It's preferred to
attach them to the interface actually in use, but this is a very common
setup and it used to work.

This patch therefore saves the first good routing result, even if it would
go out through an interface that doesn't have that address. If no
better hit is found, that result is used. This effectively restores the original
behavior if no better interface could be found.
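
The selection logic described above can be sketched in isolation as
follows. This is a simplified model, not the sctp code: struct route,
route_put() and pick_route() are hypothetical, a "valid" candidate stands
for a successful route lookup that took a reference, and src_on_oif
stands for the check that the source address belongs to the output
interface.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a looked-up route with a reference count. */
struct route {
	int refcnt;
	bool valid;      /* the lookup succeeded */
	bool src_on_oif; /* source address lives on the output interface */
};

/* Drop one reference (stand-in for dst_release()). */
static void route_put(struct route *rt)
{
	rt->refcnt--;
}

/* Prefer the first candidate whose source address is on the output
 * interface; otherwise fall back to the first candidate that routed at
 * all. Exactly one reference is held on the returned route and none on
 * the others.
 */
struct route *pick_route(struct route *cand, size_t n)
{
	struct route *dst = NULL;

	for (size_t i = 0; i < n; i++) {
		struct route *rt = &cand[i];

		if (!rt->valid)
			continue;
		rt->refcnt++; /* a successful lookup takes a reference */

		if (!dst)
			dst = rt; /* remember the first good result */

		if (!rt->src_on_oif) {
			if (rt != dst)
				route_put(rt); /* not the fallback: release it */
			continue;
		}

		if (dst != rt)
			route_put(dst); /* better hit found: drop the fallback */
		dst = rt;
		break;
	}
	return dst;
}
```

The two conditional releases are what keep the reference counts
balanced: every skipped candidate is released unless it is the one kept
as the fallback, and the fallback itself is released when a better match
replaces it.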

Fixes: 0ca50d12fe46 ("sctp: fix src address selection if using secondary addresses")
Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/protocol.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 4abf94d4cce769371260b42d13c38dbe5776c809..b7143337e4fa025fdb473732fdc064503e731dd4 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -506,16 +506,22 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 		if (IS_ERR(rt))
 			continue;
 
+		if (!dst)
+			dst = &rt->dst;
+
 		/* Ensure the src address belongs to the output
 		 * interface.
 		 */
 		odev = __ip_dev_find(sock_net(sk), laddr->a.v4.sin_addr.s_addr,
 				     false);
 		if (!odev || odev->ifindex != fl4->flowi4_oif) {
-			dst_release(&rt->dst);
+			if (&rt->dst != dst)
+				dst_release(&rt->dst);
 			continue;
 		}
 
+		if (dst != &rt->dst)
+			dst_release(dst);
 		dst = &rt->dst;
 		break;
 	}
-- 
2.4.3



[PATCH net 1/2] sctp: fix dst leak

2015-09-02 Thread Marcelo Ricardo Leitner
Commit 0ca50d12fe46 failed to release the reference to dst entries that
it decided to skip.

Fixes: 0ca50d12fe46 ("sctp: fix src address selection if using secondary addresses")
Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/protocol.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 4345790ad3266c353eeac5398593c2a9ce4effda..4abf94d4cce769371260b42d13c38dbe5776c809 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -511,8 +511,10 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr,
 		 */
 		odev = __ip_dev_find(sock_net(sk), laddr->a.v4.sin_addr.s_addr,
 				     false);
-		if (!odev || odev->ifindex != fl4->flowi4_oif)
+		if (!odev || odev->ifindex != fl4->flowi4_oif) {
+			dst_release(&rt->dst);
 			continue;
+		}
 
 		dst = &rt->dst;
 		break;
-- 
2.4.3



[PATCH net 0/2] couple of sctp fixes for 0ca50d12fe46

2015-09-02 Thread Marcelo Ricardo Leitner
These are two fixes for sctp after my patch 0ca50d12fe46 ("sctp: fix
src address selection if using secondary addresses").

The first fixes a dst leak on the entries it decided to skip.

The second adds the fallback on src selection that Vlad had asked
about. Unfortunately, a lot of IPVS setups rely on the old behavior
and I don't see a better fix for it.

Please consider both for the -stable tree.

Thanks!

Marcelo Ricardo Leitner (2):
  sctp: fix dst leak
  sctp: add routing output fallback

 net/sctp/protocol.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

-- 
2.4.3



Re: [PATCH] tipc: fix stall during bclink wakeup procedure

2015-09-02 Thread David Miller
From: Kolmakov Dmitriy 
Date: Wed, 2 Sep 2015 15:33:00 +

> If an attempt to wake up users of broadcast link is made when there
> is no enough place in send queue than it may hang up inside the
> tipc_sk_rcv() function since the loop breaks only after the wake up
> queue becomes empty. This can lead to complete CPU stall with the
> following message generated by RCU:

I don't understand how it can loop forever.

It should either successfully deliver each packet to the socket,
or respond with a TIPC_ERR_OVERLOAD.

In both cases, the SKB is dequeued from the queue and forward
progress is made.

If there really is a problem somewhere in here, then two things:

1) You need to describe exactly the sequence of tests and conditions
   that lead to the endless loop in this code, because I cannot see
   it.

2) I suspect the fix is more likely to be appropriate in tipc_sk_rcv()
   or similar, rather than creating a dummy queue to work around its
   behavior.

Thanks.


[PATCH 1/1] net/ipv6: Correct PIM6 mrt_lock handling

2015-09-02 Thread Richard Laing
In the IPv6 multicast routing code the mrt_lock was not being released
correctly in the MFC iterator; as a result, adding or deleting a MIF would
cause a hang because the mrt_lock could not be acquired.

This fix is a copy of the code for the IPv4 case and ensures that the lock
is released correctly.
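
The kind of mismatch being fixed can be shown with a small standalone
sketch. It assumes, as the one-line change in this patch suggests, that
the iterator stores a pointer to the bucket it is currently walking
together with that bucket's index. The types and function names below
are hypothetical and much simpler than the kernel's.

```c
#include <stdbool.h>
#include <stddef.h>

#define NBUCKETS 4 /* hypothetical size, chosen only for the example */

struct bucket {
	int unused;
};

struct table {
	struct bucket cache_array[NBUCKETS];
};

/* Iterator state: the bucket being walked and its index. */
struct iter {
	const struct bucket *cache;
	size_t ct;
};

/* Old test: compares against the start of the array, so it is only
 * true while the iterator sits on bucket 0.
 */
bool walking_cache_old(const struct table *t, const struct iter *it)
{
	return it->cache == t->cache_array;
}

/* Fixed test: compares against the bucket the iterator is actually on. */
bool walking_cache_new(const struct table *t, const struct iter *it)
{
	return it->cache == &t->cache_array[it->ct];
}
```

If the unlock in the stop handler is guarded by the old test, any walk
that ends on a bucket other than the first one leaves the lock held,
which matches the hang described above.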

Signed-off-by: Richard Laing 
---

 net/ipv6/ip6mr.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 74ceb73..5f36266 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -550,7 +550,7 @@ static void ipmr_mfc_seq_stop(struct seq_file *seq, void *v)
 
 	if (it->cache == &mrt->mfc6_unres_queue)
 		spin_unlock_bh(&mfc_unres_lock);
-	else if (it->cache == mrt->mfc6_cache_array)
+	else if (it->cache == &mrt->mfc6_cache_array[it->ct])
 		read_unlock(&mrt_lock);
 }
 


[PATCH nf-next] netfilter: nf_conntrack: make nf_ct_zone_dflt built-in

2015-09-02 Thread Daniel Borkmann
Fengguang reported that some randconfig generated the following linker
issue involving the nf_ct_zone_dflt object:

  [...]
  CC  init/version.o
  LD  init/built-in.o
  net/built-in.o: In function `ipv4_conntrack_defrag':
  nf_defrag_ipv4.c:(.text+0x93e95): undefined reference to `nf_ct_zone_dflt'
  net/built-in.o: In function `ipv6_defrag':
  nf_defrag_ipv6_hooks.c:(.text+0xe3ffe): undefined reference to 
`nf_ct_zone_dflt'
  make: *** [vmlinux] Error 1

Given that configurations exist where we have a built-in part, which is
accessing nf_ct_zone_dflt such as the two handlers nf_ct_defrag_user()
and nf_ct6_defrag_user(), and a part that configures nf_conntrack as a
module, we must move nf_ct_zone_dflt into a fixed, guaranteed built-in
area when netfilter is configured in general.

Therefore, split the more generic parts into a common header under
include/linux/netfilter/ and move nf_ct_zone_dflt into the built-in
section that already holds parts related to CONFIG_NF_CONNTRACK in the
netfilter core. This fixes the issue on my side.

Fixes: 308ac9143ee2 ("netfilter: nf_conntrack: push zone object into functions")
Reported-by: Fengguang Wu 
Signed-off-by: Daniel Borkmann 
---
 [ Here's the 2nd one for either nf-next or net-next. I've tried various
   Kconfig combinations including the one Fengguang reported, seems to be
   okay from my side. ]

 include/linux/netfilter.h  |  2 ++
 .../linux/netfilter/nf_conntrack_zones_common.h| 23 ++
 include/net/netfilter/nf_conntrack_zones.h | 19 +-
 net/netfilter/core.c   |  6 ++
 net/netfilter/nf_conntrack_core.c  |  7 ---
 5 files changed, 32 insertions(+), 25 deletions(-)
 create mode 100644 include/linux/netfilter/nf_conntrack_zones_common.h

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index d788ce6..36a6525 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -368,6 +368,8 @@ nf_nat_decode_session(struct sk_buff *skb, struct flowi 
*fl, u_int8_t family)
 #endif /*CONFIG_NETFILTER*/
 
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
+#include 
+
 extern void (*ip_ct_attach)(struct sk_buff *, const struct sk_buff *) __rcu;
 void nf_ct_attach(struct sk_buff *, const struct sk_buff *);
 extern void (*nf_ct_destroy)(struct nf_conntrack *) __rcu;
diff --git a/include/linux/netfilter/nf_conntrack_zones_common.h 
b/include/linux/netfilter/nf_conntrack_zones_common.h
new file mode 100644
index 000..5d7cf36
--- /dev/null
+++ b/include/linux/netfilter/nf_conntrack_zones_common.h
@@ -0,0 +1,23 @@
+#ifndef _NF_CONNTRACK_ZONES_COMMON_H
+#define _NF_CONNTRACK_ZONES_COMMON_H
+
+#include 
+
+#define NF_CT_DEFAULT_ZONE_ID  0
+
+#define NF_CT_ZONE_DIR_ORIG(1 << IP_CT_DIR_ORIGINAL)
+#define NF_CT_ZONE_DIR_REPL(1 << IP_CT_DIR_REPLY)
+
+#define NF_CT_DEFAULT_ZONE_DIR (NF_CT_ZONE_DIR_ORIG | NF_CT_ZONE_DIR_REPL)
+
+#define NF_CT_FLAG_MARK1
+
+struct nf_conntrack_zone {
+   u16 id;
+   u8  flags;
+   u8  dir;
+};
+
+extern const struct nf_conntrack_zone nf_ct_zone_dflt;
+
+#endif /* _NF_CONNTRACK_ZONES_COMMON_H */
diff --git a/include/net/netfilter/nf_conntrack_zones.h 
b/include/net/netfilter/nf_conntrack_zones.h
index 5316c7b..4e32512 100644
--- a/include/net/netfilter/nf_conntrack_zones.h
+++ b/include/net/netfilter/nf_conntrack_zones.h
@@ -1,24 +1,7 @@
 #ifndef _NF_CONNTRACK_ZONES_H
 #define _NF_CONNTRACK_ZONES_H
 
-#include 
-
-#define NF_CT_DEFAULT_ZONE_ID  0
-
-#define NF_CT_ZONE_DIR_ORIG(1 << IP_CT_DIR_ORIGINAL)
-#define NF_CT_ZONE_DIR_REPL(1 << IP_CT_DIR_REPLY)
-
-#define NF_CT_DEFAULT_ZONE_DIR (NF_CT_ZONE_DIR_ORIG | NF_CT_ZONE_DIR_REPL)
-
-#define NF_CT_FLAG_MARK1
-
-struct nf_conntrack_zone {
-   u16 id;
-   u8  flags;
-   u8  dir;
-};
-
-extern const struct nf_conntrack_zone nf_ct_zone_dflt;
+#include 
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
 #include 
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 0b939b7..8e47f81 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -388,6 +388,12 @@ EXPORT_SYMBOL(nf_conntrack_destroy);
 struct nfq_ct_hook __rcu *nfq_ct_hook __read_mostly;
 EXPORT_SYMBOL_GPL(nfq_ct_hook);
 
+/* Built-in default zone used e.g. by modules. */
+const struct nf_conntrack_zone nf_ct_zone_dflt = {
+   .id = NF_CT_DEFAULT_ZONE_ID,
+   .dir= NF_CT_DEFAULT_ZONE_DIR,
+};
+EXPORT_SYMBOL_GPL(nf_ct_zone_dflt);
 #endif /* CONFIG_NF_CONNTRACK */
 
 #ifdef CONFIG_NF_NAT_NEEDED
diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index ac3be9b..eedf049 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1286,13 +1286,6 @@ bool __nf_ct_kill_acct(struct nf_conn *ct,
 }
 EXPORT_SYMBOL_GPL(__nf_ct_kill_acct);
 
-/* Built-in default 

Re: [PATCH nf-next] netfilter: nf_conntrack: make nf_ct_zone_dflt built-in

2015-09-02 Thread David Miller
From: Daniel Borkmann 
Date: Thu,  3 Sep 2015 01:26:07 +0200

> Fengguang reported, that some randconfig generated the following linker
> issue with nf_ct_zone_dflt object involved:
> 
>   [...]
>   CC  init/version.o
>   LD  init/built-in.o
>   net/built-in.o: In function `ipv4_conntrack_defrag':
>   nf_defrag_ipv4.c:(.text+0x93e95): undefined reference to `nf_ct_zone_dflt'
>   net/built-in.o: In function `ipv6_defrag':
>   nf_defrag_ipv6_hooks.c:(.text+0xe3ffe): undefined reference to `nf_ct_zone_dflt'
>   make: *** [vmlinux] Error 1
> 
> Given that configurations exist where we have a built-in part, which is
> accessing nf_ct_zone_dflt such as the two handlers nf_ct_defrag_user()
> and nf_ct6_defrag_user(), and a part that configures nf_conntrack as a
> module, we must move nf_ct_zone_dflt into a fixed, guaranteed built-in
> area when netfilter is configured in general.
> 
> Therefore, split the more generic parts into a common header under
> include/linux/netfilter/ and move nf_ct_zone_dflt into the built-in
> section that already holds parts related to CONFIG_NF_CONNTRACK in the
> netfilter core. This fixes the issue on my side.
> 
> Fixes: 308ac9143ee2 ("netfilter: nf_conntrack: push zone object into functions")
> Reported-by: Fengguang Wu 
> Signed-off-by: Daniel Borkmann 
> ---
>  [ Here's the 2nd one for either nf-next or net-next. I've tried various
>Kconfig combinations including the one Fengguang reported, seems to be
>okay from my side. ]

Ok I'll apply this directly too, thanks Daniel.

If Pablo and others want to fix this another way, they can send me
a relative patch.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-next v2] ipv6: fix multipath route replace error recovery

2015-09-02 Thread Roopa Prabhu
From: Roopa Prabhu 

Problem:
The ECMP route replace support for IPv6 in the kernel deletes the
existing ECMP route too early, i.e. when it installs the first nexthop.
If there is an error in installing the subsequent nexthops, it's too late
to recover the already deleted existing route.

This patch fixes the problem with the following:
a) Changes the existing multipath route add code to a two stage process:
  build rt6_infos + insert them
ip6_route_add rt6_info creation code is moved into
ip6_route_info_create.
b) This ensures that all errors are caught during building rt6_infos
  and we fail early
c) Separates multipath add and del code. Because add needs the special
  two stage mode in a) and delete essentially does not care.
d) In any event if the code fails during inserting a route again, a
  warning is printed (This should be unlikely)
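
Editorial sketch, not part of the patch: the two-stage idea in a) and b)
can be modelled in plain userspace C. Every name below (struct nexthop,
struct mp_route, mp_build, mp_replace) is hypothetical and only mirrors
the shape of the approach: build and validate the whole replacement
first, and touch the installed route only once nothing can fail any more.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_NH 8

/* Hypothetical stand-in for one nexthop of an ECMP route. */
struct nexthop {
	char gw[40];
	char dev[16];
};

/* Hypothetical stand-in for the installed multipath route. */
struct mp_route {
	struct nexthop nh[MAX_NH];
	size_t n;
};

static int nh_equal(const struct nexthop *a, const struct nexthop *b)
{
	return strcmp(a->gw, b->gw) == 0 && strcmp(a->dev, b->dev) == 0;
}

/*
 * Stage 1: build and validate the complete replacement without touching
 * the installed route. Any error is reported before anything is deleted.
 */
static int mp_build(struct mp_route *staged, const struct nexthop *req,
		    size_t n)
{
	size_t i, j;

	if (n == 0 || n > MAX_NH)
		return -EINVAL;

	staged->n = 0;
	for (i = 0; i < n; i++) {
		for (j = 0; j < staged->n; j++)
			if (nh_equal(&staged->nh[j], &req[i]))
				return -EEXIST;	/* duplicate nexthop */
		staged->nh[staged->n++] = req[i];
	}
	return 0;
}

/* Stage 2 runs only if stage 1 succeeded, so a failure leaves *cur intact. */
static int mp_replace(struct mp_route *cur, const struct nexthop *req,
		      size_t n)
{
	struct mp_route staged = { .n = 0 };
	int err = mp_build(&staged, req, n);

	if (err)
		return err;
	*cur = staged;
	return 0;
}
```

With this shape, a request that repeats a nexthop fails with -EEXIST
while the previously installed nexthops stay in place, which is the
behaviour shown in the "After the patch" transcript below.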

Before the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
/* previously added ecmp route 3000:1000:1000:1000::2 disappears from
 * kernel */

After the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via
fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev
swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

Fixes: 4a287eba2de3 ("IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags support, warn about missing CREATE flag")
Signed-off-by: Roopa Prabhu 
---
v2 - fix a rt6_info leak in cleanup on error

This bug is present in 4.1 kernel and 4.2 too.
Since 4.2 is out or almost out, I am submitting the patch against net-next.
I can respin against net if needed. I have tried to keep the changes local
to route.c closer to the netlink message handling. Most of the changes move
code into separate functions.

 net/ipv6/route.c | 209 ---
 1 file changed, 183 insertions(+), 26 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f45cac6..ecbb974 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1748,7 +1748,7 @@ static int ip6_convert_metrics(struct mx6_config *mxc,
return -EINVAL;
 }
 
-int ip6_route_add(struct fib6_config *cfg)
+int ip6_route_info_create(struct fib6_config *cfg, struct rt6_info **rt_ret)
 {
int err;
struct net *net = cfg->fc_nlinfo.nl_net;
@@ -1756,7 +1756,6 @@ int ip6_route_add(struct fib6_config *cfg)
struct net_device *dev = NULL;
struct inet6_dev *idev = NULL;
struct fib6_table *table;
-   struct mx6_config mxc = { .mx = NULL, };
int addr_type;
 
if (cfg->fc_dst_len > 128 || cfg->fc_src_len > 128)
@@ -1981,6 +1980,32 @@ install_route:
 
cfg->fc_nlinfo.nl_net = dev_net(dev);
 
+   *rt_ret = rt;
+
+   return 0;
+out:
+   if (dev)
+   dev_put(dev);
+   if (idev)
+   in6_dev_put(idev);
+   if (rt)
+   dst_free(&rt->dst);
+
+   *rt_ret = NULL;
+
+   return err;
+}
+
+int ip6_route_add(struct fib6_config *cfg)
+{
+   struct mx6_config mxc = { .mx = NULL, };
+   struct rt6_info *rt = NULL;
+   int err;
+
+   err = ip6_route_info_create(cfg, &rt);
+   if (err)
+   goto out;
+
err = ip6_convert_metrics(&mxc, cfg);
if (err)
goto out;
@@ -1988,14 +2013,12 @@ install_route:
err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, &mxc);
 
kfree(mxc.mx);
+
return err;
 out:
-   if (dev)
-   dev_put(dev);
-   if (idev)
-   in6_dev_put(idev);
if (rt)
dst_free(&rt->dst);
+
return err;
 }
 
@@ -2776,19 +2799,79 @@ errout:
return err;
 }
 
-static int ip6_route_multipath(struct fib6_config *cfg, int add)
+struct rt6_nh {
+   struct rt6_info *rt6_info;
+   struct fib6_config r_cfg;
+   struct mx6_config mxc;
+   struct list_head next;
+};
+
+static void ip6_print_replace_route_err(struct list_head *rt6_nh_list)
+{
+   

Re: [PATCH] net: eth: altera: fix napi poll_list corruption

2015-09-02 Thread Atsushi Nemoto
On Wed, 2 Sep 2015 11:25:00 -0700, David Miller  wrote:
> Two lines below this change you are disabling interrupts anyways,
> so I would suggest just moving the spin_lock_irqsave() before the
> napi_gro_flush() to fix this.
> 
> Many of the checks done by napi_complete_done() (invoked by
> napi_complete()) are completely redundant in this context.  For
> example, the direct __napi_complete() call is a really nice
> optimization because we know we are on the poll list and therefore
> it is not empty.

Thank you for your suggestion.

I think napi_gro_flush() can be called with irq enabled, so moving the
spin_lock_irqsave() just before the __napi_complete() (or moving the
__napi_complete() just after the spin_lock_irqsave()) would be better,
right?


Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel

2015-09-02 Thread Eric Dumazet
On Wed, 2015-09-02 at 16:10 -0700, Martin KaFai Lau wrote:
> On Wed, Sep 02, 2015 at 03:48:57PM -0700, Eric Dumazet wrote:
> > On Wed, 2015-09-02 at 14:52 -0700, Martin KaFai Lau wrote:
> > > On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
> > > > Object cannot be freed until all cpus have exited their RCU sections.
> > > You meant the dst_destroy() here will wait for all cpus exited their RCU 
> > > sections?
> > >
> > > static inline void dst_free(struct dst_entry *dst)
> > > {
> > >   if (dst->obsolete > 0)
> > >   return;
> > >   if (!atomic_read(>__refcnt)) {
> > >   dst = dst_destroy(dst);
> > >   if (!dst)
> > >   return;
> > >   }
> > >   __dst_free(dst);
> > > }
> >
> > dst_free() is called after RCU grace period, in the case you are
> > interested in.
> >
> > Look at dst_rcu_free() and rt_free()
> Yes for IPv4 FIB
> 
> Not for IPv6 FIB. F.e. rt6_release()
> The IPv6 FIB is protected by rwlock now.

Oh well. I gave you a hint. I was not saying that it was currently used
in IPv6.

Are you telling me that IPv6 needs to continue to use techniques from
1990 ?

Surely we can use modern stuff, like proper RCU and/or seqlocks.

Since you are fixing a day-0 bug, I do not believe there is a particular
hurry to be conservative.






Re: ip_rcv_finish() NULL pointer and possibly related Oopses

2015-09-02 Thread Daniel Borkmann

On 09/02/2015 06:39 PM, Shaun Crampton wrote:

Make sure you backported commit
10e2eb878f3ca07ac2f05fa5ca5e6c4c9174a27a
("udp: fix dst races with multicast early demux")


I just tried the latest CoreOS alpha, which had that patch.  Sadly, I saw
just as many reboots.  Here's a sample of the different types of Oopses I
see (I've put the rest up in a gist:
https://gist.github.com/fasaxc/d801ced5608f2657abd8):

[ 4024.564479] BUG: unable to handle kernel NULL pointer dereference at
(null)
[ 4024.565452] IP: [<  (null)>]   (null)
[ 4024.565452] PGD 2297067 PUD 2296067 PMD 0
[ 4024.565452] Oops: 0010 [#1] SMP
[ 4024.565452] Modules linked in: xt_mac xt_mark veth ip_set_hash_net
nf_conntrack_ipv6 nf_defrag_ipv6 xt_comment xt_set ip_set_hash_ip ip_set
nfnetlink ipip tunnel4 ip_tunnel ip6table_filter ip6_tables xt_conntrack
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter br_netfilter nf_nat
nf_conntrack bridge stp llc overlay nls_ascii nls_cp437 vfat fat ext4
crc16 mbcache jbd2 sd_mod crc32c_intel virtio_scsi scsi_mod aesni_intel
virtio_net mousedev aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd
microcode firmware_class virtio_pci virtio_ring psmouse virtio i2c_piix4
i2c_core acpi_cpufreq button evdev sch_fq_codel ip_tables autofs4
[ 4024.565452] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.6-coreos-r1 #2
[ 4024.565452] Hardware name: Google Google, BIOS Google 01/01/2011
[ 4024.565452] task: 81a154c0 ti: 81a0 task.ti:
81a0
[ 4024.565452] RIP: 0010:[<>]  [<  (null)>]
(null)
[ 4024.565452] RSP: 0018:88021fc03c00  EFLAGS: 00010246
[ 4024.565452] RAX: 880003375d00 RBX: 880003375d00 RCX:
0001
[ 4024.565452] RDX: 88000306c000 RSI:  RDI:
880003375d00
[ 4024.565452] RBP: 88021fc03c28 R08: 5608 R09:
bb84
[ 4024.565452] R10: 0003 R11: 880215a30dc0 R12:
880214bfb000
[ 4024.565452] R13: 88000306c000 R14: 88000306c000 R15:
0008
[ 4024.565452] FS:  () GS:88021fc0()
knlGS:
[ 4024.565452] CS:  0010 DS:  ES:  CR0: 80050033
[ 4024.565452] CR2:  CR3: 01d92000 CR4:
001406f0
[ 4024.600761] Stack:
[ 4024.601081]  814ac9dc 8802 88000306c000
880003375d00
[ 4024.601081]  88008cbba84e 88021fc03c58 81486628
88021690a000
[ 4024.601081]  88008cbba84e 880003375d00 88000306c000
88021fc03cb8
[ 4024.601081] Call Trace:
[ 4024.601081]  
[ 4024.601081]  [] ? tcp_v4_early_demux+0x11c/0x160
[ 4024.601081]  [] ip_rcv_finish+0xb8/0x360
[ 4024.601081]  [] ip_rcv+0x2a4/0x400
[ 4024.601081]  [] ? inet_del_offload+0x40/0x40
[ 4024.601081]  [] __netif_receive_skb_core+0x6c3/0x9a0
[ 4024.601081]  [] ? build_skb+0x17/0x90
[ 4024.601081]  [] __netif_receive_skb+0x18/0x60
[ 4024.601081]  [] netif_receive_skb_internal+0x33/0xa0
[ 4024.601081]  [] netif_receive_skb_sk+0x1c/0x70
[ 4024.601081]  [] 0xa008772b
[ 4024.601081]  [] ? check_preempt_curr+0x80/0xa0
[ 4024.601081]  [] 0xa0087d81


Looking at this one, I am still puzzled where 0xa008772b and
0xa008772b come from ... some driver, bridge ...? Also the call
to inet_del_offload() seems a bit odd. Even in 4.1, there's only one (buggy)
instance that calls inet_del_offload(), which is ipv6_exthdrs_offload_init(),
but IPPROTO_ROUTING shouldn't have much of an effect on the v4 table as
far as I can see. Maybe rather a false positive that address, hmm? Perhaps
some callback/infrastructure vanished underneath us as ip/rip is both null
... maybe due to that also 0xa008772b / 0xa008772b don't
resolve?


[ 4024.601081]  [] net_rx_action+0x159/0x340
[ 4024.601081]  [] __do_softirq+0xf4/0x290
[ 4024.601081]  [] irq_exit+0xad/0xc0
[ 4024.601081]  [] do_IRQ+0x5a/0xf0
[ 4024.601081]  [] common_interrupt+0x6e/0x6e
[ 4024.601081]  
[ 4024.601081]  [] ? native_safe_halt+0x6/0x10
[ 4024.601081]  [] default_idle+0x1e/0xc0
[ 4024.601081]  [] arch_cpu_idle+0xf/0x20
[ 4024.601081]  [] cpu_startup_entry+0x314/0x3e0
[ 4024.601081]  [] rest_init+0x7c/0x80
[ 4024.601081]  [] start_kernel+0x483/0x490
[ 4024.601081]  [] ? set_init_arg+0x55/0x55
[ 4024.601081]  [] ? early_idt_handler_array+0x120/0x120
[ 4024.601081]  [] x86_64_start_reservations+0x2a/0x2c
[ 4024.601081]  [] x86_64_start_kernel+0x138/0x147
[ 4024.601081] Code:  Bad RIP value.
[ 4024.601081] RIP  [<  (null)>]   (null)
[ 4024.601081]  RSP 
[ 4024.601081] CR2: 
[ 4024.601081] ---[ end trace cdabfe9d7380aaab ]---
[ 4024.601081] Kernel panic - not syncing: Fatal exception in interrupt
[ 4024.601081] Kernel Offset: disabled
[ 4024.601081] Rebooting in 60 seconds..
[ 4024.601081] ACPI MEMORY or I/O RESET_REG.


Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload

2015-09-02 Thread Rustad, Mark D
> On Sep 2, 2015, at 4:21 PM, Tom Herbert  wrote:
> 
> Mark, another question in this area of code. Looking at ixgbe_tx_csum,
> I'm wondering what happens with those default cases for the switch
> statements. If those are hit for whatever reason does that mean the
> checksum is never resolved? It seems like if the device couldn't
> handle these cases then skb_checksum_help should be called to set the
> checksum. In particular I am wondering what happens in the case that a
> TCP or UDP packet is sent in IPv6 with an extension header present (so
> default is taken in switch (l4_hdr)). Would the checksum be properly
> set in this case?

I will look further into this, but in a first look it appears that you are 
right and that it has been this way for some time.
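
Editorial sketch, not from the thread: the fallback Tom asks about can be
modelled in userspace C. The names enum l4_kind, inet_csum() and tx_csum()
are hypothetical and are not ixgbe code; L4_OTHER stands for any packet the
switch does not recognise (for example TCP or UDP behind an IPv6 extension
header). The point is only that the default branch resolves the checksum in
software, which is the job skb_checksum_help() has in the kernel, instead
of leaving it unset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical L4 classes; the names are illustrative, not ixgbe's. */
enum l4_kind { L4_TCP, L4_UDP, L4_OTHER };

/* RFC 1071 Internet checksum over a byte buffer. */
static uint16_t inet_csum(const uint8_t *data, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	/* Sum the buffer as big-endian 16-bit words. */
	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)data[i] << 8 | data[i + 1];
	/* An odd trailing byte is padded with a zero byte. */
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;
	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Returns 1 if the checksum is left to the (mock) hardware, 0 if it was
 * resolved in software and written to *csum. No path leaves it unresolved.
 */
static int tx_csum(enum l4_kind kind, const uint8_t *pkt, size_t len,
		   uint16_t *csum)
{
	switch (kind) {
	case L4_TCP:
	case L4_UDP:
		return 1;			/* mock device offloads these */
	default:
		*csum = inet_csum(pkt, len);	/* software fallback */
		return 0;
	}
}
```

A real driver would also have to cover the pseudo-header and place the
result at the right offset in the packet; the sketch leaves that out.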

--
Mark Rustad, Networking Division, Intel Corporation





net-next closure?

2015-09-02 Thread Jeff Kirsher
I was just about to send out my last series of patches and noticed you
sent Linus your pull request.  So I am guessing that your net-next tree
is now closed, correct?  Just want to make sure before sending anything
out and did not want to dump patches on you right before the closure of
your net-next.

Cheers,
Jeff




Re: [PATCH 1/1] net/ipv6: Correct PIM6 mrt_lock handling

2015-09-02 Thread Cong Wang
On Wed, Sep 2, 2015 at 6:52 PM, Richard Laing
 wrote:
> In the IPv6 multicast routing code the mrt_lock was not being released
> correctly in the MFC iterator, as a result adding or deleting a MIF would
> cause a hang because the mrt_lock could not be acquired.
>
> This fix is a copy of the code for the IPv4 case and ensures that the lock
> is released correctly.
>
> Signed-off-by: Richard Laing 

Good catch!

Acked-by: Cong Wang 

Needs to go to -stable too.


Re: [PATCH] net: eth: altera: fix napi poll_list corruption

2015-09-02 Thread Eric Dumazet
On Thu, 2015-09-03 at 09:52 +0900, Atsushi Nemoto wrote:
> On Wed, 2 Sep 2015 11:25:00 -0700, David Miller  wrote:
> > Two lines below this change you are disabling interrupts anyways,
> > so I would suggest just moving the spin_lock_irqsave() before the
> > napi_gro_flush() to fix this.
> > 
> > Many of the checks done by napi_complete_done() (invoked by
> > napi_complete()) are completely redundant in this context.  For
> > example, the direct __napi_complete() call is a really nice
> > optimization because we know we are on the poll list and therefore
> > it is not empty.
> 
> Thank you for your suggestion.
> 
> I think napi_gro_flush() can be called with irq enabled, so moving the
> spin_lock_irqsave() just before the __napi_complete() (or moving the
> __napi_complete() just after the spin_lock_irqsave()) would be better,
> right?

Unless masking irqs is damn slow on hosts supporting this NIC,
I would rather use napi_complete_done() and add the possibility of
aggregating more frames per GRO packet, by setting a non-zero
gro_flush_timeout.

Calling napi_gro_flush() and __napi_complete() looks error prone.






Re: [PATCH] net: eth: altera: fix napi poll_list corruption

2015-09-02 Thread David Miller
From: Atsushi Nemoto 
Date: Thu, 3 Sep 2015 09:52:57 +0900

> On Wed, 2 Sep 2015 11:25:00 -0700, David Miller  wrote:
>> Two lines below this change you are disabling interrupts anyways,
>> so I would suggest just moving the spin_lock_irqsave() before the
>> napi_gro_flush() to fix this.
>> 
>> Many of the checks done by napi_complete_done() (invoked by
>> napi_complete()) are completely redundant in this context.  For
>> example, the direct __napi_complete() call is a really nice
>> optimization because we know we are on the poll list and therefore
>> it is not empty.
> 
> Thank you for your suggestion.
> 
> I think napi_gro_flush() can be called with irq enabled, so moving the
> spin_lock_irqsave() just before the __napi_complete() (or moving the
> __napi_complete() just after the spin_lock_irqsave()) would be better,
> right?

It should work, yes.
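
Editorial sketch, not from the thread: the ordering agreed on above can be
modelled in userspace C. All mock_* names and the three state variables are
hypothetical stand-ins, not the kernel API. The model only records whether
the poll list is touched while interrupts are still enabled, which is the
condition the original report describes as corrupting the list.

```c
#include <stdbool.h>

/* Mock state; these stand in for kernel facilities, NOT the kernel API. */
static bool irqs_disabled;		/* models local interrupt state */
static bool on_poll_list = true;	/* models poll-list membership */
static int unsafe_completes;		/* list touched with irqs enabled */

static void mock_irq_save(void)    { irqs_disabled = true; }
static void mock_irq_restore(void) { irqs_disabled = false; }

/* Flushing GRO does not need interrupts off, so it stays outside the lock. */
static void mock_gro_flush(void) { }

/* Models __napi_complete(): the caller must already have irqs disabled. */
static void mock_napi_complete_nolock(void)
{
	if (!irqs_disabled)
		unsafe_completes++;
	on_poll_list = false;
}

static void mock_enable_device_irq(void) { }

/* End-of-poll path when the rx budget was not exhausted. */
static void mock_poll_done(void)
{
	mock_gro_flush();		/* irqs still enabled here */
	mock_irq_save();		/* lock taken just before completing */
	mock_napi_complete_nolock();
	mock_enable_device_irq();
	mock_irq_restore();
}
```

Swapping the mock_irq_save() and mock_napi_complete_nolock() calls in
mock_poll_done() reproduces the reported bug in the model: the counter
unsafe_completes becomes non-zero.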


Re: [PATCH] net: eth: altera: fix napi poll_list corruption

2015-09-02 Thread David Miller
From: Atsushi Nemoto 
Date: Wed, 2 Sep 2015 17:49:29 +0900

> tse_poll() calls __napi_complete() with irq enabled.  This leads to napi
> poll_list corruption and may stop all napi drivers working.
> Use napi_complete() instead of __napi_complete().
> 
> Signed-off-by: Atsushi Nemoto 

Two lines below this change you are disabling interrupts anyways,
so I would suggest just moving the spin_lock_irqsave() before the
napi_gro_flush() to fix this.

Many of the checks done by napi_complete_done() (invoked by
napi_complete()) are completely redundant in this context.  For
example, the direct __napi_complete() call is a really nice
optimization because we know we are on the poll list and therefore
it is not empty.


Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response

2015-09-02 Thread Thomas Graf
On 09/02/15 at 09:40am, David Ahern wrote:
> rt_fill_info which is called for 'route get' requests hardcodes the
> table id as RT_TABLE_MAIN which is not correct when multiple tables
> are used. Use the newly added table id in the rtable to send back
> the correct table.
> 
> Signed-off-by: David Ahern 

What RTM_GETROUTE returns is not the actual route but a description
of the routing decision which is why table id, scope, protocol, and
prefix length are hardcoded. This is indicated by the RTM_F_CLONED
flag. What you propose would break userspace ABI.


[PATCH nf-next] netfilter: nf_dup{4,6}: fix build error when nf_conntrack disabled

2015-09-02 Thread Daniel Borkmann
While testing various Kconfig options on another issue, I found that
the following one triggers as well on allmodconfig and nf_conntrack
disabled:

  net/ipv4/netfilter/nf_dup_ipv4.c: In function ‘nf_dup_ipv4’:
  net/ipv4/netfilter/nf_dup_ipv4.c:72:20: error: ‘nf_skb_duplicated’ undeclared (first use in this function)
if (this_cpu_read(nf_skb_duplicated))
  [...]
  net/ipv6/netfilter/nf_dup_ipv6.c: In function ‘nf_dup_ipv6’:
  net/ipv6/netfilter/nf_dup_ipv6.c:66:20: error: ‘nf_skb_duplicated’ undeclared (first use in this function)
if (this_cpu_read(nf_skb_duplicated))

Fix it by including directly the header where it is defined.

Fixes: bbde9fc1824a ("netfilter: factor out packet duplication for IPv4/IPv6")
Signed-off-by: Daniel Borkmann 
---
 [ Don't know whether Dave wants to take it directly, or if it should
   go via nf-next. I have one more build fix coming later tonight. Also
   applies to net-next. ]

 net/ipv4/netfilter/nf_dup_ipv4.c | 1 +
 net/ipv6/netfilter/nf_dup_ipv6.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c
index b5bb375..2d79e6e 100644
--- a/net/ipv4/netfilter/nf_dup_ipv4.c
+++ b/net/ipv4/netfilter/nf_dup_ipv4.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/net/ipv6/netfilter/nf_dup_ipv6.c b/net/ipv6/netfilter/nf_dup_ipv6.c
index c5c87e9..c8ab626 100644
--- a/net/ipv6/netfilter/nf_dup_ipv6.c
+++ b/net/ipv6/netfilter/nf_dup_ipv6.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
-- 
1.9.3



Re: [PATCH net-next] net: Support ip route get via given table

2015-09-02 Thread David Ahern

On 9/2/15 1:12 PM, Thomas Graf wrote:
> On 09/02/15 at 12:03pm, David Ahern wrote:
> > Add support for 'ip [-6] route get table X' where the user wants to
> > force the FIB lookup from a given table.
> >
> > Signed-off-by: David Ahern 
>
> Will you use this outside of 'ip route get' as well? If so, how? I'm
> asking because you propose to add the check and new behaviour to bypass
> the routing rules to the routing fastpath, wouldn't it be better to
> handle this in inet_rtm_getroute()?



The way IPv6 code is structured it seemed more appropriate to pass in a 
table id as part of the flow. I made IPv4 consistent with that approach.



Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response

2015-09-02 Thread Andy Gospodarek
On Wed, Sep 02, 2015 at 09:08:36PM +0200, Thomas Graf wrote:
> On 09/02/15 at 12:51pm, David Ahern wrote:
> > On 9/2/15 12:49 PM, David Miller wrote:
> > >From: Thomas Graf 
> > >Date: Wed, 2 Sep 2015 20:43:46 +0200
> > >
> > >>On 09/02/15 at 09:40am, David Ahern wrote:
> > >>>rt_fill_info which is called for 'route get' requests hardcodes the
> > >>>table id as RT_TABLE_MAIN which is not correct when multiple tables
> > >>>are used. Use the newly added table id in the rtable to send back
> > >>>the correct table.
> > >>>
> > >>>Signed-off-by: David Ahern 
> > >>
> > >>What RTM_GETROUTE returns is not the actual route but a description
> > >>of the routing decision which is why table id, scope, protocol, and
> > >>prefix length are hardcoded. This is indicated by the RTM_F_CLONED
> > >>flag. What you propose would break userspace ABI.
> > >
> > >Agreed, I don't think we can do this.
> > >
> > 
> > Doesn't the table used to come up with the decision matter for IPv4? ie.,
> > hardcoding to MAIN is misleading when there is absolutely no way the
> > decision comes from that table. IPv6 already returns the table id.
> > 
> > Or is your response that it breaks ABI and hence not going to fix.
> 
> This behaviour comes back from when we still had the IPv4 routing cache
> which was flat.

So before the routing cache was removed, was the response always
RTA_TABLE_MAIN since there was no way to indicate which table may have
route if it came from the cache?


Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response

2015-09-02 Thread Thomas Graf
On 09/02/15 at 03:43pm, Andy Gospodarek wrote:
> On Wed, Sep 02, 2015 at 09:08:36PM +0200, Thomas Graf wrote:
> > This behaviour comes back from when we still had the IPv4 routing cache
> > which was flat.
> 
> So before the routing cache was removed, was the response always
> RTA_TABLE_MAIN since there was no way to indicate which table may have
> route if it came from the cache?

Yes, from that perspective, get and list are very different in
behaviour. Again, I'm not against including this information
but we can't break compatibility.


Re: [PATCH net-next v2] net: Add table id from route lookup to route response

2015-09-02 Thread David Ahern

On 9/2/15 2:23 PM, Thomas Graf wrote:
> On 09/02/15 at 01:16pm, David Ahern wrote:
> > IPv4 ABI has the table hardcoded as RT_TABLE_MAIN regardless of the table
> > hit for the route lookup. Add the table using a new attribute,
> > RTA_TABLE_LOOKUP, to maintain the ABI yet return the right table id.
> >
> > Signed-off-by: David Ahern 
> > ---
> >
> > Thomas: Something like this?
> >
> > The current ABI is returning wrong data in some cases; that seems worse
> > to me than breaking the ABI.
>
> Another option is to introduce a new flag bundled with RTM_GETROUTE
> which fixes RTM_GETROUTE altogether and makes it return the actual
> route instead of a simulated cache entry.



I like that better; at least then information is not duplicated. Thanks 
for the suggestion.



Re: [PATCH net-next v2] net: Add table id from route lookup to route response

2015-09-02 Thread David Ahern

On 9/2/15 2:41 PM, Alexander Duyck wrote:
> Why not implement this the same for IPv4 and IPv6?  It looks like it is
> only included if it is non-zero and not MAIN in the above case, and then
> below as long as a table ID is non-zero you are setting the value.  Why
> not just include the value in all cases where it is defined just like
> for IPv6?



I like Thomas' suggestion to add an rtm_flag better. We only need to fix 
IPv4 which hardcodes the tableid. Adding a flag, e.g.,


+#define RTM_F_LOOKUP_TABLE 0x1000  /* set rtm_table to FIB lookup result */


signifies the caller wants the real table. When set rt_fill_info sets 
rtm_table to the actual table id. This allows updated tools to work 
properly for both ipv4 and ipv6 and without breaking existing userspace.
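
Editorial sketch, not the eventual patch: the opt-in behaviour described
above, modelled in userspace C. struct mock_req, struct mock_route and
reply_table() are hypothetical. RTM_F_LOOKUP_TABLE uses the value proposed
in this message, and RT_TABLE_MAIN is redefined locally with its uapi
value (254) so the block stands alone.

```c
#include <stdint.h>

#define RT_TABLE_MAIN      254	   /* same value as the uapi definition */
#define RTM_F_LOOKUP_TABLE 0x1000  /* value proposed in this message */

/* Hypothetical, trimmed stand-ins for the request and the lookup result. */
struct mock_req   { uint32_t rtm_flags; };
struct mock_route { uint32_t table_id; };  /* table the FIB lookup hit */

/* Table id to report in the 'route get' reply. */
static uint32_t reply_table(const struct mock_req *req,
			    const struct mock_route *rt)
{
	if (req->rtm_flags & RTM_F_LOOKUP_TABLE)
		return rt->table_id;	/* caller opted in to the real table */
	return RT_TABLE_MAIN;		/* legacy reply, unchanged */
}
```

Callers that do not set the flag keep getting RT_TABLE_MAIN, so existing
userspace sees no change; only updated tools see the real table id.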



[PATCH net-next 2/3] net: Add FIB table id to rtable

2015-09-02 Thread David Ahern
Add the FIB table id to rtable to make the information available for
IPv4 as it is for IPv6.

Signed-off-by: David Ahern 
---
 drivers/net/vrf.c   | 2 ++
 include/net/route.h | 2 ++
 net/ipv4/route.c| 8 
 net/ipv4/xfrm4_policy.c | 1 +
 4 files changed, 13 insertions(+)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index e7094fbd7568..8c9ab5ebea23 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -320,6 +320,7 @@ static void vrf_rtable_destroy(struct net_vrf *vrf)
 
 static struct rtable *vrf_rtable_create(struct net_device *dev)
 {
+   struct net_vrf *vrf = netdev_priv(dev);
struct rtable *rth;
 
rth = dst_alloc(&vrf_dst_ops, dev, 2,
@@ -335,6 +336,7 @@ static struct rtable *vrf_rtable_create(struct net_device *dev)
rth->rt_pmtu= 0;
rth->rt_gateway = 0;
rth->rt_uses_gateway = 0;
+   rth->rt_table_id = vrf->tb_id;
INIT_LIST_HEAD(&rth->rt_uncached);
rth->rt_uncached_list = NULL;
}
diff --git a/include/net/route.h b/include/net/route.h
index cc61cb95f059..10a7d21a211c 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -64,6 +64,8 @@ struct rtable {
/* Miscellaneous cached information */
u32 rt_pmtu;
 
+   u32 rt_table_id;
+
struct list_headrt_uncached;
struct uncached_list*rt_uncached_list;
 };
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index eaefeadce07c..92acc95b7578 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1457,6 +1457,7 @@ static struct rtable *rt_dst_alloc(struct net_device *dev,
rt->rt_pmtu = 0;
rt->rt_gateway = 0;
rt->rt_uses_gateway = 0;
+   rt->rt_table_id = 0;
INIT_LIST_HEAD(&rt->rt_uncached);
 
rt->dst.output = ip_output;
@@ -1629,6 +1630,8 @@ static int __mkroute_input(struct sk_buff *skb,
}
 
rth->rt_is_input = 1;
+   if (res->table)
+   rth->rt_table_id = res->table->tb_id;
RT_CACHE_STAT_INC(in_slow_tot);
 
rth->dst.input = ip_forward;
@@ -1808,6 +1811,8 @@ out:  return err;
rth->dst.tclassid = itag;
 #endif
rth->rt_is_input = 1;
+   if (res.table)
+   rth->rt_table_id = res.table->tb_id;
 
RT_CACHE_STAT_INC(in_slow_tot);
if (res.type == RTN_UNREACHABLE) {
@@ -1988,6 +1993,9 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
return ERR_PTR(-ENOBUFS);
 
rth->rt_iif = orig_oif ? : 0;
+   if (res->table)
+   rth->rt_table_id = res->table->tb_id;
+
RT_CACHE_STAT_INC(out_slow_tot);
 
if (flags & (RTCF_BROADCAST | RTCF_MULTICAST)) {
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index bb919b28619f..671011055ad5 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -95,6 +95,7 @@ static int xfrm4_fill_dst(struct xfrm_dst *xdst, struct net_device *dev,
xdst->u.rt.rt_gateway = rt->rt_gateway;
xdst->u.rt.rt_uses_gateway = rt->rt_uses_gateway;
xdst->u.rt.rt_pmtu = rt->rt_pmtu;
+   xdst->u.rt.rt_table_id = rt->rt_table_id;
INIT_LIST_HEAD(&xdst->u.rt.rt_uncached);
 
return 0;
-- 
1.9.1



[PATCH net-next 1/3] net: Refactor rtable initialization

2015-09-02 Thread David Ahern
All callers to rt_dst_alloc have nearly the same initialization following
a successful allocation. Consolidate it into rt_dst_alloc.

Signed-off-by: David Ahern 
---
 net/ipv4/route.c | 85 ++--
 1 file changed, 33 insertions(+), 52 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 5f4a5565ad8b..eaefeadce07c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1438,12 +1438,33 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 }
 
 static struct rtable *rt_dst_alloc(struct net_device *dev,
+  unsigned int flags, u16 type,
   bool nopolicy, bool noxfrm, bool will_cache)
 {
-   return dst_alloc(&ipv4_dst_ops, dev, 1, DST_OBSOLETE_FORCE_CHK,
-(will_cache ? 0 : (DST_HOST | DST_NOCACHE)) |
-(nopolicy ? DST_NOPOLICY : 0) |
-(noxfrm ? DST_NOXFRM : 0));
+   struct rtable *rt;
+
+   rt = dst_alloc(&ipv4_dst_ops, dev, 1, DST_OBSOLETE_FORCE_CHK,
+  (will_cache ? 0 : (DST_HOST | DST_NOCACHE)) |
+  (nopolicy ? DST_NOPOLICY : 0) |
+  (noxfrm ? DST_NOXFRM : 0));
+
+   if (rt) {
+   rt->rt_genid = rt_genid_ipv4(dev_net(dev));
+   rt->rt_flags = flags;
+   rt->rt_type = type;
+   rt->rt_is_input = 0;
+   rt->rt_iif = 0;
+   rt->rt_pmtu = 0;
+   rt->rt_gateway = 0;
+   rt->rt_uses_gateway = 0;
+   INIT_LIST_HEAD(&rt->rt_uncached);
+
+   rt->dst.output = ip_output;
+   if (flags & RTCF_LOCAL)
+   rt->dst.input = ip_local_deliver;
+   }
+
+   return rt;
 }
 
 /* called in rcu_read_lock() section */
@@ -1452,6 +1473,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 
daddr, __be32 saddr,
 {
struct rtable *rth;
struct in_device *in_dev = __in_dev_get_rcu(dev);
+   unsigned int flags = RTCF_MULTICAST;
u32 itag = 0;
int err;
 
@@ -1477,7 +1499,10 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 
daddr, __be32 saddr,
if (err < 0)
goto e_err;
}
-   rth = rt_dst_alloc(dev_net(dev)->loopback_dev,
+   if (our)
+   flags |= RTCF_LOCAL;
+
+   rth = rt_dst_alloc(dev_net(dev)->loopback_dev, flags, RTN_MULTICAST,
   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, false);
if (!rth)
goto e_nobufs;
@@ -1486,20 +1511,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 
daddr, __be32 saddr,
rth->dst.tclassid = itag;
 #endif
rth->dst.output = ip_rt_bug;
-
-   rth->rt_genid   = rt_genid_ipv4(dev_net(dev));
-   rth->rt_flags   = RTCF_MULTICAST;
-   rth->rt_type= RTN_MULTICAST;
rth->rt_is_input= 1;
-   rth->rt_iif = 0;
-   rth->rt_pmtu= 0;
-   rth->rt_gateway = 0;
-   rth->rt_uses_gateway = 0;
-   INIT_LIST_HEAD(&rth->rt_uncached);
-   if (our) {
-   rth->dst.input= ip_local_deliver;
-   rth->rt_flags |= RTCF_LOCAL;
-   }
 
 #ifdef CONFIG_IP_MROUTE
if (!ipv4_is_local_multicast(daddr) && IN_DEV_MFORWARD(in_dev))
@@ -1608,7 +1620,7 @@ static int __mkroute_input(struct sk_buff *skb,
}
}
 
-   rth = rt_dst_alloc(out_dev->dev,
+   rth = rt_dst_alloc(out_dev->dev, 0, res->type,
   IN_DEV_CONF_GET(in_dev, NOPOLICY),
   IN_DEV_CONF_GET(out_dev, NOXFRM), do_cache);
if (!rth) {
@@ -1616,19 +1628,10 @@ static int __mkroute_input(struct sk_buff *skb,
goto cleanup;
}
 
-   rth->rt_genid = rt_genid_ipv4(dev_net(rth->dst.dev));
-   rth->rt_flags = 0;
-   rth->rt_type = res->type;
rth->rt_is_input = 1;
-   rth->rt_iif = 0;
-   rth->rt_pmtu= 0;
-   rth->rt_gateway = 0;
-   rth->rt_uses_gateway = 0;
-   INIT_LIST_HEAD(&rth->rt_uncached);
RT_CACHE_STAT_INC(in_slow_tot);
 
rth->dst.input = ip_forward;
-   rth->dst.output = ip_output;
 
rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
@@ -1795,26 +1798,16 @@ out:return err;
}
}
 
-   rth = rt_dst_alloc(net->loopback_dev,
+   rth = rt_dst_alloc(net->loopback_dev, flags | RTCF_LOCAL, res.type,
   IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache);
if (!rth)
goto e_nobufs;
 
-   rth->dst.input= ip_local_deliver;
rth->dst.output= ip_rt_bug;
 #ifdef CONFIG_IP_ROUTE_CLASSID
rth->dst.tclassid = itag;
 #endif
-
-   rth->rt_genid = rt_genid_ipv4(net);
-   rth->rt_flags   = flags|RTCF_LOCAL;
-   rth->rt_type= 

[PATCH net-next 3/3 v2] net: Allow user to get table id from route lookup

2015-09-02 Thread David Ahern
rt_fill_info which is called for 'route get' requests hardcodes the
table id as RT_TABLE_MAIN which is not correct when multiple tables
are used. Use the newly added table id in the rtable to send back
the correct table similar to what is done for IPv6.

To maintain current ABI a new request flag, RTM_F_LOOKUP_TABLE, is
added to indicate the actual table is wanted versus the hardcoded
response.

Signed-off-by: David Ahern 
---
v2
- use a new request flag to indicate the real table id is wanted
  (suggested by Thomas)
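
The opt-in behaviour can be sketched in plain C as below. This is only an
illustration under stated assumptions: reported_table_id() is a made-up
helper, not a function in the patch; RTM_F_LOOKUP_TABLE uses the value this
patch proposes, and RT_TABLE_MAIN is redefined locally with its uapi value
so the snippet builds outside the kernel tree.

```c
#include <stdint.h>

#define RT_TABLE_MAIN		254	/* value from the uapi rtnetlink header */
#define RTM_F_LOOKUP_TABLE	0x1000	/* request flag proposed by this patch */

/*
 * Hypothetical helper: pick the table id to report for a 'route get'
 * request. Without the flag, the historical hardcoded answer is kept,
 * so existing userspace sees no change.
 */
static uint32_t reported_table_id(unsigned int rtm_flags,
				  uint32_t lookup_table_id)
{
	if (rtm_flags & RTM_F_LOOKUP_TABLE)
		return lookup_table_id;

	return RT_TABLE_MAIN;
}
```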

 include/uapi/linux/rtnetlink.h |  1 +
 net/ipv4/route.c   | 12 
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 702024769c74..06625b401422 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -270,6 +270,7 @@ enum rt_scope_t {
 #define RTM_F_CLONED   0x200   /* This route is cloned */
 #define RTM_F_EQUALIZE 0x400   /* Multipath equalizer: NI  */
 #define RTM_F_PREFIX   0x800   /* Prefix addresses */
+#define RTM_F_LOOKUP_TABLE 0x1000  /* set rtm_table to FIB lookup result */
 
 /* Reserved table identifiers */
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 92acc95b7578..da427a4a33fe 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2305,7 +2305,7 @@ struct rtable *ip_route_output_flow(struct net *net, 
struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);
 
-static int rt_fill_info(struct net *net,  __be32 dst, __be32 src,
+static int rt_fill_info(struct net *net,  __be32 dst, __be32 src, u32 table_id,
struct flowi4 *fl4, struct sk_buff *skb, u32 portid,
u32 seq, int event, int nowait, unsigned int flags)
 {
@@ -2325,8 +2325,8 @@ static int rt_fill_info(struct net *net,  __be32 dst, 
__be32 src,
r->rtm_dst_len  = 32;
r->rtm_src_len  = 0;
r->rtm_tos  = fl4->flowi4_tos;
-   r->rtm_table= RT_TABLE_MAIN;
-   if (nla_put_u32(skb, RTA_TABLE, RT_TABLE_MAIN))
+   r->rtm_table= table_id;
+   if (nla_put_u32(skb, RTA_TABLE, table_id))
goto nla_put_failure;
r->rtm_type = rt->rt_type;
r->rtm_scope= RT_SCOPE_UNIVERSE;
@@ -2431,6 +2431,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, 
struct nlmsghdr *nlh)
int err;
int mark;
struct sk_buff *skb;
+   u32 table_id = RT_TABLE_MAIN;
 
err = nlmsg_parse(nlh, sizeof(*rtm), tb, RTA_MAX, rtm_ipv4_policy);
if (err < 0)
@@ -2500,7 +2501,10 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, 
struct nlmsghdr *nlh)
if (rtm->rtm_flags & RTM_F_NOTIFY)
rt->rt_flags |= RTCF_NOTIFY;
 
-   err = rt_fill_info(net, dst, src, &fl4, skb,
+   if (rtm->rtm_flags & RTM_F_LOOKUP_TABLE)
+   table_id = rt->rt_table_id;
+
+   err = rt_fill_info(net, dst, src, table_id, &fl4, skb,
   NETLINK_CB(in_skb).portid, nlh->nlmsg_seq,
   RTM_NEWROUTE, 0, 0);
if (err < 0)
-- 
1.9.1



Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response

2015-09-02 Thread David Ahern

On 9/2/15 12:49 PM, David Miller wrote:
> From: Thomas Graf
> Date: Wed, 2 Sep 2015 20:43:46 +0200
>
>> On 09/02/15 at 09:40am, David Ahern wrote:
>>> rt_fill_info which is called for 'route get' requests hardcodes the
>>> table id as RT_TABLE_MAIN which is not correct when multiple tables
>>> are used. Use the newly added table id in the rtable to send back
>>> the correct table.
>>>
>>> Signed-off-by: David Ahern
>>
>> What RTM_GETROUTE returns is not the actual route but a description
>> of the routing decision which is why table id, scope, protocol, and
>> prefix length are hardcoded. This is indicated by the RTM_F_CLONED
>> flag. What you propose would break userspace ABI.
>
> Agreed, I don't think we can do this.

Doesn't the table used to come up with the decision matter for IPv4?
I.e., hardcoding to MAIN is misleading when there is absolutely no way
the decision comes from that table. IPv6 already returns the table id.

Or is your response that it breaks the ABI and hence is not going to be
fixed?


Re: [PATCH net-next] net: Support ip route get via given table

2015-09-02 Thread Thomas Graf
On 09/02/15 at 01:22pm, David Ahern wrote:
> On 9/2/15 1:12 PM, Thomas Graf wrote:
> >On 09/02/15 at 12:03pm, David Ahern wrote:
> >>Add support for 'ip [-6] route get table X' where the user wants to
> >>force the FIB lookup from a given table.
> >>
> >>Signed-off-by: David Ahern 
> >
> >Will you use this outside of 'ip route get' as well? If so, how? I'm
> >asking because you propose to add the check and new behaviour to bypass
> >the routing rules to the routing fastpath, wouldn't it be better to
> >handle this in inet_rtm_getroute()?
> >
> 
> The way IPv6 code is structured it seemed more appropriate to pass in a
> table id as part of the flow. I made IPv4 consistent with that approach.

The question is: Are you planning to use the new table_id in flowi
in the actual datapath as well? It seems entirely wrong to add
weight to the fast path for a control plane feature.


Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel

2015-09-02 Thread Martin KaFai Lau
On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> > 2. Use a spinlock to protect the dst_cache operations
>
> Well, a seqlock would be better : No need for an atomic operation in
> fast path.
>
seqlock can ensure consistency between idst->dst and idst->cookie.
However, IPv6 dst destruction is not protected by rcu.  dst_free() is
directly called, like in ip6_fib.c and a few other places.
Hence, atomic_inc_not_zero() cannot be used here because the dst may
have already been kmem_cache_free() when refcnt is 0.  A spinlock is
needed to stop the ip6_tnl_dst_set() side from removing the refcnt.
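
For reference, the "increment unless zero" primitive under discussion can
be sketched with C11 atomics as below. This is a userspace approximation,
not the kernel's atomic_inc_not_zero() implementation, and
refcount_inc_not_zero() here is a name chosen for the example. It also
shows why the primitive alone is not enough: the function must read the
counter before it can decide anything, so the memory holding the counter
has to remain valid at that point, which is exactly what a deferred (RCU)
free would guarantee and a direct free does not.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the object is still live (count != 0).
 * Safe only while the memory behind 'ref' cannot be freed concurrently.
 */
static bool refcount_inc_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}

	return false;
}
```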


Re: [PATCH net] sock, diag: fix panic in sock_diag_put_filterinfo

2015-09-02 Thread David Miller
From: Daniel Borkmann 
Date: Wed,  2 Sep 2015 14:00:36 +0200

> diag socket's sock_diag_put_filterinfo() dumps classic BPF programs
> upon request to user space (ss -0 -b). However, native eBPF programs
> attached to sockets (SO_ATTACH_BPF) cannot be dumped with this method:
> 
> Their orig_prog is always NULL. However, sock_diag_put_filterinfo()
> unconditionally tries to access its filter length resp. wants to copy
> the filter insns from there. Internal cBPF to eBPF transformations
> attached to sockets don't have this issue, as orig_prog state is kept.
> 
> It's currently only used by packet sockets. If we would want to add
> native eBPF support in the future, this needs to be done through
> a different attribute than PACKET_DIAG_FILTER to not confuse possible
> user space disassemblers that work on diag data.
> 
> Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to 
> sockets")
> Signed-off-by: Daniel Borkmann 

Applied and queued up for -stable, thanks.


Re: [PATCH net-next] net: Support ip route get via given table

2015-09-02 Thread David Ahern

On 9/2/15 1:38 PM, Thomas Graf wrote:
> On 09/02/15 at 01:22pm, David Ahern wrote:
>> On 9/2/15 1:12 PM, Thomas Graf wrote:
>>> On 09/02/15 at 12:03pm, David Ahern wrote:
>>>> Add support for 'ip [-6] route get table X' where the user wants to
>>>> force the FIB lookup from a given table.
>>>>
>>>> Signed-off-by: David Ahern
>>>
>>> Will you use this outside of 'ip route get' as well? If so, how? I'm
>>> asking because you propose to add the check and new behaviour to bypass
>>> the routing rules to the routing fastpath, wouldn't it be better to
>>> handle this in inet_rtm_getroute()?
>>
>> The way IPv6 code is structured it seemed more appropriate to pass in a
>> table id as part of the flow. I made IPv4 consistent with that approach.
>
> The question is: Are you planning to use the new table_id in flowi
> in the actual datapath as well? It seems entirely wrong to add
> weight to the fast path for a control plane feature.

No plans at the moment.


Re: [PATCH] x86: Wire up 32-bit direct socket calls

2015-09-02 Thread H. Peter Anvin
On 09/02/2015 02:48 AM, Geert Uytterhoeven wrote:
> 
> Should all other architectures follow suit?
> Or should we follow the s390 approach:
> 

It is up to the maintainer(s), largely dependent on how likely you are
going to want to support this in your libc, but in general, socketcall
is an abomination which there is no reason not to bypass.

So follow suit unless you have a strong reason not to.

-hpa




Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response

2015-09-02 Thread Thomas Graf
On 09/02/15 at 12:51pm, David Ahern wrote:
> On 9/2/15 12:49 PM, David Miller wrote:
> >From: Thomas Graf 
> >Date: Wed, 2 Sep 2015 20:43:46 +0200
> >
> >>On 09/02/15 at 09:40am, David Ahern wrote:
> >>>rt_fill_info which is called for 'route get' requests hardcodes the
> >>>table id as RT_TABLE_MAIN which is not correct when multiple tables
> >>>are used. Use the newly added table id in the rtable to send back
> >>>the correct table.
> >>>
> >>>Signed-off-by: David Ahern 
> >>
> >>What RTM_GETROUTE returns is not the actual route but a description
> >>of the routing decision which is why table id, scope, protocol, and
> >>prefix length are hardcoded. This is indicated by the RTM_F_CLONED
> >>flag. What you propose would break userspace ABI.
> >
> >Agreed, I don't think we can do this.
> >
> 
> Doesn't the table used to come up with the decision matter for IPv4? ie.,
> hardcoding to MAIN is misleading when there is absolutely no way the
> decision comes from that table. IPv6 already returns the table id.
> 
> Or is your response that it breaks ABI and hence not going to fix.

This behaviour comes back from when we still had the IPv4 routing cache
which was flat. I'm not against exposing the table id but you have
to use a new attribute for it.
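
A small sketch of why a new attribute keeps the ABI intact, assuming the
usual netlink convention that a receiver skips attribute types it does not
know. Everything here is illustrative: struct attr, parse_reply() and the
ATTR_* type numbers are invented for the example and do not match the real
RTA_* values or the nlattr wire format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified attribute: real netlink uses a TLV encoding. */
struct attr {
	uint16_t type;
	uint32_t value;
};

enum { ATTR_TABLE = 1, ATTR_TABLE_LOOKUP = 2 };	/* made-up type numbers */

struct route_reply {
	uint32_t table;		/* legacy field, still hardcoded by the sender */
	uint32_t lookup_table;	/* only filled in by parsers that know it */
	int has_lookup_table;
};

/* knows_lookup_attr == 0 models an old binary built before the attribute. */
static struct route_reply parse_reply(const struct attr *attrs, size_t n,
				      int knows_lookup_attr)
{
	struct route_reply r = { 0, 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		switch (attrs[i].type) {
		case ATTR_TABLE:
			r.table = attrs[i].value;
			break;
		case ATTR_TABLE_LOOKUP:
			if (knows_lookup_attr) {
				r.lookup_table = attrs[i].value;
				r.has_lookup_table = 1;
			}
			break;
		default:
			break;	/* unknown attribute: skipped, not an error */
		}
	}

	return r;
}
```

An old parser sees the same legacy table value as before, while a new one
can additionally read the table that the lookup actually hit.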


Re: [PATCH net-next v2] net: Add table id from route lookup to route response

2015-09-02 Thread Alexander Duyck

On 09/02/2015 01:16 PM, David Ahern wrote:

> IPv4 ABI has the table hardcoded as RT_TABLE_MAIN regardless of the table
> hit for the route lookup. Add the table using a new attribute,
> RTA_TABLE_LOOKUP, to maintain the ABI yet return the right table id.
>
> Signed-off-by: David Ahern
> ---
>
> Thomas: Something like this?
>
> The current ABI is returning wrong data in some cases; that seems worse
> to me than breaking the ABI.
>
>   include/uapi/linux/rtnetlink.h | 1 +
>   net/ipv4/route.c   | 5 +
>   net/ipv6/route.c   | 4 
>   3 files changed, 10 insertions(+)
>
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 702024769c74..5add1468350a 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -310,6 +310,7 @@ enum rtattr_type_t {
> RTA_PREF,
> RTA_ENCAP_TYPE,
> RTA_ENCAP,
> +   RTA_TABLE_LOOKUP,  /* table hit for fib lookup */
> __RTA_MAX
>   };
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 92acc95b7578..95454c368e66 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2328,6 +2328,11 @@ static int rt_fill_info(struct net *net,  __be32 dst, __be32 src,
> r->rtm_table = RT_TABLE_MAIN;
> if (nla_put_u32(skb, RTA_TABLE, RT_TABLE_MAIN))
> goto nla_put_failure;
> +
> +   if (rt->rt_table_id && rt->rt_table_id != RT_TABLE_MAIN &&
> +   nla_put_u32(skb, RTA_TABLE_LOOKUP, rt->rt_table_id))
> +   goto nla_put_failure;
> +
> r->rtm_type  = rt->rt_type;
> r->rtm_scope = RT_SCOPE_UNIVERSE;
> r->rtm_protocol = RTPROT_UNSPEC;


Why not implement this the same way for IPv4 and IPv6?  It looks like it is
only included if it is non-zero and not MAIN in the above case, and then
below as long as a table ID is non-zero you are setting the value.  Why
not just include the value in all cases where it is defined, just like
for IPv6?



> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index f45cac6f8356..3c5d3a50bb7b 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -2922,6 +2922,10 @@ static int rt6_fill_node(struct net *net,
> rtm->rtm_table = table;
> if (nla_put_u32(skb, RTA_TABLE, table))
> goto nla_put_failure;
> +
> +   if (table && nla_put_u32(skb, RTA_TABLE_LOOKUP, table))
> +   goto nla_put_failure;
> +
> if (rt->rt6i_flags & RTF_REJECT) {
> switch (rt->dst.error) {
> case -EINVAL:





Re: [PATCH] flow_dissector: Use 'const' where possible.

2015-09-02 Thread David Miller
From: Jiri Pirko 
Date: Wed, 2 Sep 2015 18:39:34 +0200

> Wed, Sep 02, 2015 at 06:33:34AM CEST, t...@herbertland.com wrote:
>>> @@ -19,14 +19,14 @@
>>>  #include 
>>>  #include 
>>>
>>> -static bool skb_flow_dissector_uses_key(struct flow_dissector 
>>> *flow_dissector,
>>> -   enum flow_dissector_key_id key_id)
>>> +static bool dissector_uses_key(const struct flow_dissector *flow_dissector,
>>> +  enum flow_dissector_key_id key_id)
>>>  {
>>> return flow_dissector->used_keys & (1 << key_id);
>>>  }
>>>
>>> -static void skb_flow_dissector_set_key(struct flow_dissector 
>>> *flow_dissector,
>>> -  enum flow_dissector_key_id key_id)
>>> +static void dissector_set_key(struct flow_dissector *flow_dissector,
>>> + enum flow_dissector_key_id key_id)
>>>  {
>>> flow_dissector->used_keys |= (1 << key_id);
>>>  }
>>> @@ -51,20 +51,20 @@ void skb_flow_dissector_init(struct flow_dissector 
>>> *flow_dissector,
>>
>>I suppose we should drop skb_ from skb_flow_dissector_init and
>>skb_flow_dissector_target as well.
> 
> I like to have "namespaces" by function prefixes. Code is easier to read
> then...

I completely disagree.

These are static, local functions, they can use whatever names they want
and the shorter the better.

Long function names drive me absolutely insane and make keeping the
argument lists under ~80 columns a royal pain in the ass.

So I will continue to trim function names down to something more
reasonable when they are static and local to a source file.

And I encourage you to do so as well.


Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response

2015-09-02 Thread David Miller
From: Thomas Graf 
Date: Wed, 2 Sep 2015 20:43:46 +0200

> On 09/02/15 at 09:40am, David Ahern wrote:
>> rt_fill_info which is called for 'route get' requests hardcodes the
>> table id as RT_TABLE_MAIN which is not correct when multiple tables
>> are used. Use the newly added table id in the rtable to send back
>> the correct table.
>> 
>> Signed-off-by: David Ahern 
> 
> What RTM_GETROUTE returns is not the actual route but a description
> of the routing decision which is why table id, scope, protocol, and
> prefix length are hardcoded. This is indicated by the RTM_F_CLONED
> flag. What you propose would break userspace ABI.

Agreed, I don't think we can do this.


Re: [PATCH net-next] net: Support ip route get via given table

2015-09-02 Thread Thomas Graf
On 09/02/15 at 12:03pm, David Ahern wrote:
> Add support for 'ip [-6] route get table X' where the user wants to
> force the FIB lookup from a given table.
> 
> Signed-off-by: David Ahern 

Will you use this outside of 'ip route get' as well? If so, how? I'm
asking because you propose to add the check and new behaviour to bypass
the routing rules to the routing fastpath, wouldn't it be better to
handle this in inet_rtm_getroute()?


Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload

2015-09-02 Thread Or Gerlitz
On Wed, Sep 2, 2015 at 8:38 PM, Tom Herbert  wrote:
> On Wed, Sep 2, 2015 at 9:46 AM, Rustad, Mark D  
> wrote:
>>> On Sep 1, 2015, at 8:17 PM, Tom Herbert  wrote:
>>>
>>> I suspect this is not UDP-encapsulation specific, will it work with
>>> TCP/IP/IP, TCP/IP/GRE etc.?
>>
>> It could do more, but this is what has been tested up to this point.
>>
> Well, please test the those other encapsulations too! It's nice and
> all if they get the benefit, but it's really bad news if these changes
> were to screw them up (i.e. you don't want users of the GRE, IPIP to
> find out that they're now broken).
>
>>> Isn't there anyway the ixgbe could just be made to NETIF_HW_CSUM? That
>>> would be so much more straightforward and support nearly all use cases
>>> without needing to jump through all these hoops.
>>
>> Well, the description says:
>>
>>  ---
>> Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM.
>> It means that device can fill TCP/UDP-like checksum anywhere in the packets
>> whatever headers there might be.
>>  ---
>>
>> The device can't do whatever, wherever. There is always a limit to the 
>> offset to the inner headers that can be handled, for instance.
>>
> If the device does NETIF_F_HW_CSUM then inner/outer headers are
> irrelevant at least in the non-GSO case. All the device needs to do is
> compute the checksum from start and write the answer at the given
> offset. No protocol awareness needed in the device, no need to parse
> headers on transmit.

Tom, could you elaborate a little further on the
semantics/requirements for devices supporting NETIF_F_HW_CSUM, clearly
(as mentioned in




> I have the same complaint that ixgbe requires a bunch of driver logic
> to offload VXLAN checksum unnecessary instead of just providing
> CHECKSUM_COMPLETE which would work with any encapsulation protocol,
> require no encapsulation awareness in the device, and should be a much
> simpler driver implementation.


Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload

2015-09-02 Thread Or Gerlitz
On Wed, Sep 2, 2015 at 8:38 PM, Tom Herbert  wrote:
> On Wed, Sep 2, 2015 at 9:46 AM, Rustad, Mark D  
> wrote:

>> Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM.
>> It means that device can fill TCP/UDP-like checksum anywhere in the packets
>> whatever headers there might be.

>> The device can't do whatever, wherever. There is always a limit to the 
>> offset to the inner headers that can be handled, for instance.

> If the device does NETIF_F_HW_CSUM then inner/outer headers are
> irrelevant at least in the non-GSO case. All the device needs to do is
> compute the checksum from start and write the answer at the given
> offset. No protocol awareness needed in the device, no need to parse
> headers on transmit.
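
A minimal sketch of the operation described in the quoted text, assuming
the stack hands the device a start offset and a field offset (named
csum_start and csum_offset here after the skb fields) and has already
seeded the checksum field with whatever pseudo-header sum the protocol
needs. hw_csum_fill() and csum_fold_bytes() are names invented for this
example; this is not driver or device code.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum over big-endian 16-bit words, folded to 16 bits. */
static uint16_t csum_fold_bytes(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;	/* pad odd tail with zero */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}

/*
 * Sum everything from csum_start to the end of the packet and store the
 * complemented result at csum_start + csum_offset. No header parsing and
 * no knowledge of the encapsulation is involved.
 */
static void hw_csum_fill(uint8_t *pkt, size_t len, size_t csum_start,
			 size_t csum_offset)
{
	uint16_t csum = (uint16_t)~csum_fold_bytes(pkt + csum_start,
						    len - csum_start);

	pkt[csum_start + csum_offset] = (uint8_t)(csum >> 8);
	pkt[csum_start + csum_offset + 1] = (uint8_t)(csum & 0xff);
}
```

Bytes before csum_start, such as outer headers, are neither read nor
modified, which is why inner versus outer headers do not matter in this
model.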

Tom, could you elaborate a little further on the
semantics/requirements for devices supporting NETIF_F_HW_CSUM,
specifically, AFAIU this isn't a TX equivalent of supporting checksum
complete on RX, right? when you say "write the answer at the given
offset" what non-common answers are you expecting devices to produce?
how the kernel is hinting to the device on the nature on the expected
answer beyond the offset?

Or.


Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel

2015-09-02 Thread David Miller
From: Eric Dumazet 
Date: Wed, 02 Sep 2015 15:48:57 -0700

> On Wed, 2015-09-02 at 14:52 -0700, Martin KaFai Lau wrote:
>> On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
>> > Object cannot be freed until all cpus have exited their RCU sections.
>> You meant the dst_destroy() here will wait for all cpus exited their RCU 
>> sections?
>> 
>> static inline void dst_free(struct dst_entry *dst)
>> {
>>  if (dst->obsolete > 0)
>>  return;
>>  if (!atomic_read(&dst->__refcnt)) {
>>  dst = dst_destroy(dst);
>>  if (!dst)
>>  return;
>>  }
>>  __dst_free(dst);
>> }
> 
> dst_free() is called after RCU grace period, in the case you are
> interested in.
> 
> Look at dst_rcu_free() and rt_free()

For ipv4, this is true, but in ipv6, it is not necessarily done in
this way.  And I think that is the point Martin is trying to make.

If you look, the dst_free() calls in ipv6 are basically synchronous,
it does not use dst_rcu_free().

And thus, the fix is to make ipv6 properly RCU free route entries.


Re: [PATCH net-next 1/1] net: fec: clear receive interrupts before processing a packet

2015-09-02 Thread David Miller
From: Fugang Duan 
Date: Wed, 2 Sep 2015 17:24:14 +0800

> From: Russell King 
> 
> The patch just to re-submit the patch "db3421c114cfa6326" because the
> patch "4d494cdc92b3b9a0" remove the change.
> 
> Clear any pending receive interrupt before we process a pending packet.
> This helps to avoid any spurious interrupts being raised after we have
> fully cleaned the receive ring, while still allowing an interrupt to be
> raised if we receive another packet.
> 
> The position of this is critical: we must do this prior to reading the
> next packet status to avoid potentially dropping an interrupt when a
> packet is still pending.
> 
> Acked-by: Fugang Duan 
> Signed-off-by: Russell King 

Applied and queued up for -stable, thanks.


Re: [PATCHv1 net-next 0/5] netlink: mmap: kernel panic and some issues

2015-09-02 Thread Ken-ichirou MATSUZAWA
On Wed, Sep 02, 2015 at 05:56:36PM +0200, Daniel Borkmann wrote:
> you suggest or not), for two reasons: I think (will start experimenting
> more with it tomorrow), you would get an out of bounds access here in
> case the skb->data is the last slot in the ring buffer and reaches
> exactly to the ring buffer end. And (despite that), it's also hard

I thought accessing it as a value, not a pointer, in that wrong shared
info would not be a big problem, but

> to maintain - the next one adding a new shared info member will very
> likely oversee this special case in netlink here, thus the issue would
> then simply be reintroduced over and over.

I agree with you.
Thank you for taking your time. I think I have learned a lot.

Thanks,


Re: [PATCH net] ipv6: fix exthdrs offload registration in out_rt path

2015-09-02 Thread David Miller
From: Daniel Borkmann 
Date: Thu,  3 Sep 2015 00:29:07 +0200

> We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
> but in error path, we try to unregister it with inet_del_offload(). This
> doesn't seem correct, it should actually be inet6_del_offload(), also
> ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
> also uses rthdr_offload twice), but it got removed entirely later on.
> 
> Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
> Signed-off-by: Daniel Borkmann 

Applied and queued up for -stable, thanks Daniel.


Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel

2015-09-02 Thread Martin KaFai Lau
On Wed, Sep 02, 2015 at 03:48:57PM -0700, Eric Dumazet wrote:
> On Wed, 2015-09-02 at 14:52 -0700, Martin KaFai Lau wrote:
> > On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
> > > Object cannot be freed until all cpus have exited their RCU sections.
> > You meant the dst_destroy() here will wait for all cpus exited their RCU 
> > sections?
> >
> > static inline void dst_free(struct dst_entry *dst)
> > {
> > if (dst->obsolete > 0)
> > return;
> > if (!atomic_read(&dst->__refcnt)) {
> > dst = dst_destroy(dst);
> > if (!dst)
> > return;
> > }
> > __dst_free(dst);
> > }
>
> dst_free() is called after RCU grace period, in the case you are
> interested in.
>
> Look at dst_rcu_free() and rt_free()
Yes for IPv4 FIB

Not for IPv6 FIB. F.e. rt6_release()
The IPv6 FIB is protected by rwlock now.


Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel

2015-09-02 Thread David Miller
From: Martin KaFai Lau 
Date: Wed, 2 Sep 2015 16:10:31 -0700

> On Wed, Sep 02, 2015 at 03:48:57PM -0700, Eric Dumazet wrote:
>> dst_free() is called after RCU grace period, in the case you are
>> interested in.
>>
>> Look at dst_rcu_free() and rt_free()
> Yes for IPv4 FIB
> 
> Not for IPv6 FIB. F.e. rt6_release()
> The IPv6 FIB is protected by rwlock now.

The FIB tree can use whatever locking scheme it wants, but the
actual route objects need to be released via RCU to fix the
problems you are seeing.

Converting the entire ipv6 FIB tree handling to RCU is not a
prerequisite for this.


Re: [PATCH net-next] Revert "net/ipv6: add sysctl option accept_ra_min_hop_limit"

2015-09-02 Thread David Miller
From: Sabrina Dubroca 
Date: Wed, 2 Sep 2015 11:43:01 +0200

> This reverts commit 8013d1d7eafb0589ca766db6b74026f76b7f5cb4.
> 
> There are several issues with this patch.
> It completely cancels the security changes introduced by 6fd99094de2b
> ("ipv6: Don't reduce hop limit for an interface").
> The current default value (min hop limit = 1) can result in the same
> denial of service that 6fd99094de2b prevents, but it is hard to define
> a correct and sane default value.
> More generally, it is yet another IPv6 sysctl, and we already have too
> many.
> 
> This was introduced to satisfy a TAHI test case which, in my opinion, is
> too strict, turning the RFC's "SHOULD" into a "MUST":
> 
> If the received Cur Hop Limit value is non-zero, the host
> SHOULD set its CurHopLimit variable to the received value.
> 
> The behavior of this sysctl is wrong in multiple ways.  Some are
> fixable, but let's not rush this commit into mainline, and revert this
> while we still can, then we can come up with a better solution.
> 
> Signed-off-by: Sabrina Dubroca 

I don't agree with this revert.

If you look at the original commit, the quoted RFC recommends adding
a configurable method to protect against this.

And that's exactly what the commit you are trying to revert is doing.

The only thing I would entertain is potentially an adjustment of the
default, working in concert with the TAHI folks to make sure their
tests still pass with any new default.

Thanks.


Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel

2015-09-02 Thread Martin KaFai Lau
On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
> Object cannot be freed until all cpus have exited their RCU sections.
Do you mean that the dst_destroy() here will wait until all CPUs have
exited their RCU sections?

static inline void dst_free(struct dst_entry *dst)
{
	if (dst->obsolete > 0)
		return;
	if (!atomic_read(&dst->__refcnt)) {
		dst = dst_destroy(dst);
		if (!dst)
			return;
	}
	__dst_free(dst);
}


[PATCH net] ipv6: fix exthdrs offload registration in out_rt path

2015-09-02 Thread Daniel Borkmann
We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
but in error path, we try to unregister it with inet_del_offload(). This
doesn't seem correct, it should actually be inet6_del_offload(), also
ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
also uses rthdr_offload twice), but it got removed entirely later on.

Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
Signed-off-by: Daniel Borkmann 
---
 (Found during code review.)

 net/ipv6/exthdrs_offload.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/exthdrs_offload.c b/net/ipv6/exthdrs_offload.c
index 447a7fb..f5e2ba1 100644
--- a/net/ipv6/exthdrs_offload.c
+++ b/net/ipv6/exthdrs_offload.c
@@ -36,6 +36,6 @@ out:
 	return ret;
 
 out_rt:
-	inet_del_offload(&rthdr_offload, IPPROTO_ROUTING);
+	inet6_del_offload(&rthdr_offload, IPPROTO_ROUTING);
 	goto out;
 }
-- 
1.9.3



Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload

2015-09-02 Thread Tom Herbert
On Wed, Sep 2, 2015 at 9:46 AM, Rustad, Mark D  wrote:
>> On Sep 1, 2015, at 8:17 PM, Tom Herbert  wrote:
>>
>> I suspect this is not UDP-encapsulation specific, will it work with
>> TCP/IP/IP, TCP/IP/GRE etc.?
>
Mark, another question in this area of code. Looking at ixgbe_tx_csum,
I'm wondering what happens with those default cases for the switch
statements. If those are hit for whatever reason does that mean the
checksum is never resolved? It seems like if the device couldn't
handle these cases then skb_checksum_help should be called to set the
checksum. In particular I am wondering what happens in the case that a
TCP or UDP packet is sent in IPv6 with an extension header present (so
default is taken in switch (l4_hdr)). Would the checksum be properly
set in this case?

Thanks,
Tom

> It could do more, but this is what has been tested up to this point.
>
>> Isn't there anyway the ixgbe could just be made to NETIF_HW_CSUM? That
>> would be so much more straightforward and support nearly all use cases
>> without needing to jump through all these hoops.
>
> Well, the description says:
>
>  ---
> Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM.
> It means that device can fill TCP/UDP-like checksum anywhere in the packets
> whatever headers there might be.
>  ---
>
> The device can't do whatever, wherever. There is always a limit to the offset 
> to the inner headers that can be handled, for instance.
>
> --
> Mark Rustad, Networking Division, Intel Corporation
>


Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response

2015-09-02 Thread David Miller
From: Andy Gospodarek 
Date: Wed, 2 Sep 2015 15:43:27 -0400

> On Wed, Sep 02, 2015 at 09:08:36PM +0200, Thomas Graf wrote:
>> On 09/02/15 at 12:51pm, David Ahern wrote:
>> > On 9/2/15 12:49 PM, David Miller wrote:
>> > >From: Thomas Graf 
>> > >Date: Wed, 2 Sep 2015 20:43:46 +0200
>> > >
>> > >>On 09/02/15 at 09:40am, David Ahern wrote:
>> > >>>rt_fill_info which is called for 'route get' requests hardcodes the
>> > >>>table id as RT_TABLE_MAIN which is not correct when multiple tables
>> > >>>are used. Use the newly added table id in the rtable to send back
>> > >>>the correct table.
>> > >>>
>> > >>>Signed-off-by: David Ahern 
>> > >>
>> > >>What RTM_GETROUTE returns is not the actual route but a description
>> > >>of the routing decision which is why table id, scope, protocol, and
>> > >>prefix length are hardcoded. This is indicated by the RTM_F_CLONED
>> > >>flag. What you propose would break userspace ABI.
>> > >
>> > >Agreed, I don't think we can do this.
>> > >
>> > 
>> > Doesn't the table used to come up with the decision matter for IPv4? ie.,
>> > hardcoding to MAIN is misleading when there is absolutely no way the
>> > decision comes from that table. IPv6 already returns the table id.
>> > 
>> > Or is your response that it breaks ABI and hence not going to fix.
>> 
>> This behaviour comes back from when we still had the IPv4 routing cache
>> which was flat.
> 
> So before the routing cache was removed, was the response always
> RTA_TABLE_MAIN since there was no way to indicate which table may have
> route if it came from the cache?

Right.  In fact, it was possible for routes from multiple tables to
end up evaluating to the same routing cache entry.  So there could be
a many to one relationship back then.


Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel

2015-09-02 Thread Eric Dumazet
On Wed, 2015-09-02 at 13:58 -0700, Martin KaFai Lau wrote:
> On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> > > 2. Use a spinlock to protect the dst_cache operations
> >
> > Well, a seqlock would be better : No need for an atomic operation in
> > fast path.
> >
> seqlock can ensure consistency between idst->dst and idst->cookie.
> However, IPv6 dst destruction is not protected by rcu.  dst_free() is
> directly called, like in ip6_fib.c and a few other places.
> Hence, atomic_inc_not_zero() cannot be used here because the dst may
> have already been kmem_cache_free() when refcnt is 0.

Really ? What about basic rcu rules ?

Object cannot be freed until all cpus have exited their RCU sections.

>   A spinlock is
> needed to stop the ip6_tnl_dst_set() side from removing the refcnt.

Are you telling me RCU should be banished from the kernel ? ;)
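
The atomic_inc_not_zero() pattern under discussion can be sketched in
user space with C11 atomics. This is only an illustration, not the
kernel implementation: struct obj and obj_hold_not_zero() are
hypothetical names, and the sketch assumes the caller already
guarantees that the object's memory stays valid while it runs (in the
kernel, that is the job of the RCU read-side section).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted object such as a dst entry. */
struct obj {
	atomic_int refcnt;
};

/*
 * Take a reference only if the count has not already dropped to zero.
 * Returns true if a reference was taken, false if the object is dying.
 * The caller must guarantee that *o has not been freed yet.
 */
static bool obj_hold_not_zero(struct obj *o)
{
	int old;

	assert(o != NULL);
	old = atomic_load(&o->refcnt);
	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}
```

A reader that gets false back would treat the cached pointer as stale
and fall back to a fresh lookup instead of using the object.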




RE: [Intel-wired-lan] [PATCH 3/6] ethernet/ixgbe: advertise LRO support in vlan_features

2015-09-02 Thread Singh, Krishneil K

-----Original Message-----
From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On 
Behalf Of Jarod Wilson
Sent: Thursday, August 13, 2015 11:03 AM
To: linux-ker...@vger.kernel.org
Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; Jarod Wilson 

Subject: [Intel-wired-lan] [PATCH 3/6] ethernet/ixgbe: advertise LRO support in 
vlan_features

Without this, the presence of a ixgbe device in a bond will not trigger LRO 
support to be enabled at the bond level, even while it is enabled on the slave 
itself.

This change becomes necessary when NETIF_F_LRO is added to netdev_features.h's 
NETIF_F_ONE_FOR_ALL.

CC: Jeff Kirsher 
CC: intel-wired-...@lists.osuosl.org
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 3e6a931..0a6e4e1 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8659,8 +8659,10 @@ skip_sriov:
 
if (adapter->flags2 & IXGBE_FLAG2_RSC_CAPABLE)
netdev->hw_features |= NETIF_F_LRO;
-   if (adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED)
+   if (adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED) {
netdev->features |= NETIF_F_LRO;
+   netdev->vlan_features |= NETIF_F_LRO;
+   }
 
/* make sure the EEPROM is good */
if (hw->eeprom.ops.validate_checksum(hw, NULL) < 0) {
--
1.8.3.1

___
Intel-wired-lan mailing list
intel-wired-...@lists.osuosl.org
http://lists.osuosl.org/mailman/listinfo/intel-wired-lan

While validating this patch we ran into a call trace when forwarding
(net.ipv4.ip_forward = 1) and LRO are enabled on the interface before the
VLAN interface is created. With the patch reverted we don't see this failure.

Validation setup:

sysctl net.ipv4.ip_forward=1
ethtool -K ethX lro on
ip link set ethX up
ip link add link ethX name ethX.10 type vlan id 10

CALL TRACE:

[582992.985245] ixgbe :83:00.0 eth6: NIC Link is Up 10 Gbps, Flow Control: 
RX/TX
[582992.985400] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
[582995.764828] ixgbe :83:00.1 eth7: NIC Link is Up 10 Gbps, Flow Control: 
RX/TX
[582995.764964] IPv6: ADDRCONF(NETDEV_CHANGE): eth7: link becomes ready
[583027.588991] ixgbe :04:00.0 eth2: NIC Link is Up 10 Gbps, Flow Control: 
RX/TX
[583044.365523] ------------[ cut here ]------------
[583044.366181] WARNING: CPU: 20 PID: 56879 at net/core/dev.c:1472 
dev_disable_lro+0x95/0xa0()
[583044.366711] netdevice: eth2.10
failed to disable LRO!
[583044.367876] Modules linked in: ixgbe ixgb igb e100 mii e1000 e1000e 8021q 
garp mrp tcp_lp bnep bluetooth rfkill fuse btrfs xor raid6_pq vfat msdos fat 
ext4 mbcache jbd2 binfmt_misc xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 
iptable_filter ip_tables tun bridge stp llc x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul 
crc32c_intel nfsd ghash_clmulni_intel mei_me aesni_intel mei lrw auth_rpcgss 
gf128mul shpchp iTCO_wdt ioatdma iTCO_vendor_support glue_helper nfs_acl 
ablk_helper lockd cryptd i2c_i801 lpc_ich mfd_core ipmi_si sb_edac edac_core 
grace dm_mirror dm_region_hash ipmi_msghandler pcspkr wmi dm_log dm_mod sunrpc 
uinput
[583044.371142]  xfs libcrc32c sr_mod cdrom sd_mod mgag200 syscopyarea 
sysfillrect sysimgblt drm_kms_helper ttm drm ahci libahci libata mdio vxlan 
firewire_ohci ip6_udp_tunnel firewire_core udp_tunnel ptp i2c_algo_bit pps_core 
i2c_core crc_itu_t dca [last unloaded: ixgbe]
[583044.372915] CPU: 20 PID: 56879 Comm: ip Tainted: GW IOE   
4.2.0-rc7-Ustream-8-26-15+ #1
[583044.373511] Hardware name: Intel Corporation S2600CO/S2600CO, BIOS 
SE5C600.86B.01.06.0001.090720121056 09/07/2012
[583044.374126]   e9a2d4dc 8803ce16b5b8 
8166b4e9
[583044.374752]   8803ce16b610 8803ce16b5f8 
8107b06a
[583044.375380]  8803ce16b608 880428041000 818dc1eb 
0005
[583044.376033] Call Trace:
[583044.376662]  [] dump_stack+0x45/0x57
[583044.377290]  [] warn_slowpath_common+0x8a/0xc0
[583044.377950]  [] warn_slowpath_fmt+0x55/0x70
[583044.378570]  [] ? netdev_update_features+0x25/0x60
[583044.379218]  [] dev_disable_lro+0x95/0xa0
[583044.379841]  [] inetdev_init+0x17d/0x230
[583044.380458]  [] inetdev_event+0x37f/0x4f0
[583044.381079]  [] notifier_call_chain+0x4d/0x80
[583044.381697]  [] raw_notifier_call_chain+0x16/0x20
[583044.382343]  [] call_netdevice_notifiers_info+0x39/0x70
[583044.382971]  [] register_netdevice+0x2ae/0x430
[583044.383595]  [] ? 

Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload

2015-09-02 Thread Tom Herbert
On Wed, Sep 2, 2015 at 2:07 PM, Or Gerlitz  wrote:
> On Wed, Sep 2, 2015 at 8:38 PM, Tom Herbert  wrote:
>> On Wed, Sep 2, 2015 at 9:46 AM, Rustad, Mark D  
>> wrote:
>
>>> Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM.
>>> It means that device can fill TCP/UDP-like checksum anywhere in the packets
>>> whatever headers there might be.
>
>>> The device can't do whatever, wherever. There is always a limit to the 
>>> offset to the inner headers that can be handled, for instance.
>
>> If the device does NETIF_F_HW_CSUM then inner/outer headers are
>> irrelevant at least in the non-GSO case. All the device needs to do is
>> compute the checksum from start and write the answer at the given
>> offset. No protocol awareness needed in the device, no need to parse
>> headers on transmit.
>
> Tom, could you elaborate a little further on the
> semantics/requirements for devices supporting NETIF_F_HW_CSUM,
> specifically, AFAIU this isn't a TX equivalent of supporting checksum
> complete on RX, right? when you say "write the answer at the given
> offset" what non-common answers are you expecting devices to produce?
> how the kernel is hinting to the device on the nature on the expected
> answer beyond the offset?
>
NETIF_F_HW_CSUM indicates that the device/driver is willing to implement
CHECKSUM_PARTIAL on output for the general case. CHECKSUM_PARTIAL is
described in skbuff.h as:

The device is required to checksum the packet as seen by hard_start_xmit()
from skb->csum_start up to the end, and to record/write the checksum at
offset skb->csum_start + skb->csum_offset.

For instance, if we want to offload an inner checksum the stack would
set csum_start to the offset of the inner transport packet and
csum_offset to the relative offset of the checksum field. The stack
takes care of priming the checksum field with the NOT of the pseudo
header checksum if the transport protocol needs that.
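
A minimal user-space sketch of the CHECKSUM_PARTIAL contract described
above, not driver or kernel code: it models a device that sums the
packet from csum_start to the end and writes the result at
csum_start + csum_offset. The names model_hw_csum(), csum_sum() and
csum_fold() are made up for illustration, and the model assumes the
stack has already seeded the checksum field and that csum_offset is
even.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold an accumulator down to a 16-bit ones'-complement sum. */
static uint16_t csum_fold(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones'-complement sum of buf[0..len), read as big-endian 16-bit words. */
static uint16_t csum_sum(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint64_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)
		sum += (uint64_t)buf[len - 1] << 8;	/* pad odd tail with 0 */
	return csum_fold(sum);
}

/*
 * Hypothetical model of the device side: no header parsing, just
 * "sum from csum_start to the end, store the complement at
 * csum_start + csum_offset". Whatever the stack seeded into the
 * checksum field is included in the sum.
 */
static void model_hw_csum(uint8_t *pkt, size_t len,
			  size_t csum_start, size_t csum_offset)
{
	size_t pos = csum_start + csum_offset;
	uint16_t csum;

	assert(csum_start <= len && pos + 2 <= len);
	csum = (uint16_t)~csum_sum(pkt + csum_start, len - csum_start);
	pkt[pos] = (uint8_t)(csum >> 8);
	pkt[pos + 1] = (uint8_t)(csum & 0xff);
}
```

The bytes before csum_start (outer headers, in the encapsulation case)
are never touched, which is why inner versus outer headers do not
matter to such a device.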

Tom


> Or.


Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel

2015-09-02 Thread Eric Dumazet
On Wed, 2015-09-02 at 14:52 -0700, Martin KaFai Lau wrote:
> On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
> > Object cannot be freed until all cpus have exited their RCU sections.
> Do you mean that the dst_destroy() here will wait until all CPUs have
> exited their RCU sections?
> 
> static inline void dst_free(struct dst_entry *dst)
> {
> 	if (dst->obsolete > 0)
> 		return;
> 	if (!atomic_read(&dst->__refcnt)) {
> 		dst = dst_destroy(dst);
> 		if (!dst)
> 			return;
> 	}
> 	__dst_free(dst);
> }

dst_free() is called after RCU grace period, in the case you are
interested in.

Look at dst_rcu_free() and rt_free()







Re: [PATCHv1 net-next 0/5] netlink: mmap: kernel panic and some issues

2015-09-02 Thread Ken-ichirou MATSUZAWA
Thank you for the reply.

On Wed, Sep 02, 2015 at 11:47:26AM +0200, Daniel Borkmann wrote:
> On 09/02/2015 02:04 AM, Ken-ichirou MATSUZAWA wrote:
> >Talking about skb_copy path, original skb's shared info is accessed
> >only in copy_skb_header, to get gso related field. As a result of
> 
> It's still not correct. The thing is you can neither call skb_copy() nor
> skb_clone() on netlink mmaped skbs. For example, skb_copy_bits() would

I am sorry for the lack of explanation.
And I am afraid I misunderstand...

The only pointer into the data area that gets updated in a mmaped
netlink skb is its tail. Head, data and end are not updated.
skb_copy() calls

int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)

and, as its arguments, "offset" is always 0 and "len" is skb->len. In
skb_copy_bits() both "start" and "copy" are skb->len, which means
"len - copy" is always 0, so that it returns 0 before accessing the
shared info.
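
The argument above can be illustrated with a simplified user-space
model of the first (linear-area) step of such a copy. This is a sketch
under assumptions, not the kernel's skb_copy_bits(): struct lin_buf and
copy_linear_part() are hypothetical names, and fragment handling is
reduced to reporting how many bytes would still be outstanding.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the linear part of an skb. */
struct lin_buf {
	const unsigned char *data;
	int headlen;		/* bytes available in the linear area */
};

/*
 * Copy as much of [offset, offset + len) as the linear area holds.
 * Returns the number of bytes still missing; 0 means the request was
 * satisfied without ever looking at fragments (the shared info).
 */
static int copy_linear_part(const struct lin_buf *b, int offset,
			    void *to, int len)
{
	int start = b->headlen;
	int copy = start - offset;

	assert(offset >= 0 && len >= 0);
	if (copy > 0) {
		if (copy > len)
			copy = len;
		memcpy(to, b->data + offset, (size_t)copy);
		len -= copy;
	}
	return len;
}
```

With offset 0 and len equal to headlen, the function returns 0 right
after the memcpy(), which is the case described for skb_copy() above.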

I don't know whether this is intended or not, but it seems that
skb_copy() for a mmaped skb will not access its shared info.

After that, copy_skb_header() will set the newly allocated skb's
(wrong) gso fields, so I asked whether we should clear them or not.

> special case. We need an own netlink_mmap_to_full_skb() handler for this,
> that copies/transforms this into a "normal" skb. I'll have a look it this

If the above situation is unintentional, we need it to avoid
future confusion.

Thanks,


[PATCH net] sock, diag: fix panic in sock_diag_put_filterinfo

2015-09-02 Thread Daniel Borkmann
diag socket's sock_diag_put_filterinfo() dumps classic BPF programs
upon request to user space (ss -0 -b). However, native eBPF programs
attached to sockets (SO_ATTACH_BPF) cannot be dumped with this method:

Their orig_prog is always NULL. However, sock_diag_put_filterinfo()
unconditionally tries to access its filter length resp. wants to copy
the filter insns from there. Internal cBPF to eBPF transformations
attached to sockets don't have this issue, as orig_prog state is kept.
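
The guard this patch adds can be sketched outside the kernel as
follows. All names here (struct classic_insn, struct classic_prog,
struct prog, filterinfo_len()) are hypothetical stand-ins, and
returning a length of 0 for the native eBPF case is an assumption of
the sketch rather than a statement about what the diag code reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classic BPF instruction and saved original program. */
struct classic_insn {
	uint16_t code;
	uint8_t jt, jf;
	uint32_t k;
};

struct classic_prog {
	unsigned short len;	/* number of instructions */
	const struct classic_insn *insns;
};

/* Hypothetical attached program; orig_prog is NULL for native eBPF. */
struct prog {
	const struct classic_prog *orig_prog;
};

/* Bytes of classic filter that could be dumped, or 0 if there is none. */
static size_t filterinfo_len(const struct prog *p)
{
	assert(p != NULL);
	if (!p->orig_prog)	/* native eBPF: nothing to dereference */
		return 0;
	return (size_t)p->orig_prog->len * sizeof(struct classic_insn);
}
```

The point is only the ordering: the NULL test has to come before any
access to the length or instructions of the original program.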

It's currently only used by packet sockets. If we would want to add
native eBPF support in the future, this needs to be done through
a different attribute than PACKET_DIAG_FILTER to not confuse possible
user space disassemblers that work on diag data.

Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
Signed-off-by: Daniel Borkmann 
---
 net/core/sock_diag.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index d79866c..817622f 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -90,6 +90,9 @@ int sock_diag_put_filterinfo(bool may_report_filterinfo, 
struct sock *sk,
goto out;
 
fprog = filter->prog->orig_prog;
+   if (!fprog)
+   goto out;
+
flen = bpf_classic_proglen(fprog);
 
attr = nla_reserve(skb, attrtype, flen);
-- 
1.9.3



[PATCH net-next 1/1] net: fec: clear receive interrupts before processing a packet

2015-09-02 Thread Fugang Duan
From: Russell King 

This patch just re-submits commit "db3421c114cfa6326", because
commit "4d494cdc92b3b9a0" removed the change.

Clear any pending receive interrupt before we process a pending packet.
This helps to avoid any spurious interrupts being raised after we have
fully cleaned the receive ring, while still allowing an interrupt to be
raised if we receive another packet.

The position of this is critical: we must do this prior to reading the
next packet status to avoid potentially dropping an interrupt when a
packet is still pending.

Acked-by: Fugang Duan 
Signed-off-by: Russell King 
---
 drivers/net/ethernet/freescale/fec_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/freescale/fec_main.c 
b/drivers/net/ethernet/freescale/fec_main.c
index 1f89c59..6bed0ff 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1400,6 +1400,7 @@ fec_enet_rx_queue(struct net_device *ndev, int budget, 
u16 queue_id)
if ((status & BD_ENET_RX_LAST) == 0)
netdev_err(ndev, "rcv is not +last\n");
 
+   writel(FEC_ENET_RXF, fep->hwp + FEC_IEVENT);
 
/* Check for errors. */
if (status & (BD_ENET_RX_LG | BD_ENET_RX_SH | BD_ENET_RX_NO |
-- 
1.9.1



Re: [PATCH net-next] xen-netback: add support for multicast control

2015-09-02 Thread Wei Liu
On Wed, Sep 02, 2015 at 01:19:53PM +0100, Paul Durrant wrote:
> Xen's PV network protocol includes messages to add/remove ethernet
> multicast addresses to/from a filter list in the backend. This allows
> the frontend to request the backend only forward multicast packets
> which are off interest thus preventing unnecessary noise on the shared

"of interest"

> ring.
> 
[...]
> +
> +void xenvif_mcast_flush(struct xenvif *vif)

Only one cosmetic comment.

My first impression of this function by looking at the name is that it
flushes queued multicast packets. Maybe we can rename it to
xenvif_mcast_addr_list_free ?

Wei.


Re: Unexpected loss recovery in TLP

2015-09-02 Thread Eric Dumazet
On Wed, 2015-09-02 at 10:54 +0200, Mohammad Rajiullah wrote:
> Hi Eric!
> 
> Thanks for the direction. I tried packet drill locally (with the same kernel 
> Linux 3.18.5 to start with)
> with the following script. And it doesn’t show the problem I mentioned. 
> So the fast retransmit happens after getting the dupack.
> It would be good if I could get some information from the calls 
> from the TCP stack (I have some printk there), but using packet drill I don’t 
> know at the moment,
> how to get that. 
> 

Please do not top post on netdev mailing list.

You could try nstat before/after the failure and report anomalies here.

> \
> Mohammad
> 
> 
> // Establish a connection.
> 0   socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0  setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0  setsockopt(3, SOL_SOCKET, TCP_NODELAY, [1], 4) = 0
> 
> +0  bind(3, ..., ...) = 0
> +0  listen(3, 1) = 0
> 
> +0  < S 0:0(0) win 32792 
> +0  > S. 0:0(0) ack 1 <...>
> 
> +.03 < . 1:1(0) ack 1 win 257
> +0  accept(3, ..., ...) = 4
> 
> // Send 1 data segment and get an ACK with DATA
> +0  write(4, ..., 1000) = 1000

Note the original tcpdump you gave seemed to use len=250, could you try
the exact same lengths ?

Thanks !




RE: [PATCH net-next] xen-netback: add support for multicast control

2015-09-02 Thread Paul Durrant
> -----Original Message-----
> From: Wei Liu [mailto:wei.l...@citrix.com]
> Sent: 02 September 2015 15:01
> To: Paul Durrant
> Cc: netdev@vger.kernel.org; xen-de...@lists.xenproject.org; Ian Campbell;
> Wei Liu
> Subject: Re: [PATCH net-next] xen-netback: add support for multicast control
> 
> On Wed, Sep 02, 2015 at 01:19:53PM +0100, Paul Durrant wrote:
> > Xen's PV network protocol includes messages to add/remove ethernet
> > multicast addresses to/from a filter list in the backend. This allows
> > the frontend to request the backend only forward multicast packets
> > which are off interest thus preventing unnecessary noise on the shared
> 
> "of interest"

Ah yes :-)

> 
> > ring.
> >
> [...]
> > +
> > +void xenvif_mcast_flush(struct xenvif *vif)
> 
> Only one cosmetic comment.
> 
> My first impression of this function by looking at the name is that it
> flushes queued multicast packets. Maybe we can rename it to
> xenvif_mcast_addr_list_free ?
> 

Sure.

  Paul

> Wei.


[PATCH net-next] Revert "net/ipv6: add sysctl option accept_ra_min_hop_limit"

2015-09-02 Thread Sabrina Dubroca
This reverts commit 8013d1d7eafb0589ca766db6b74026f76b7f5cb4.

There are several issues with this patch.
It completely cancels the security changes introduced by 6fd99094de2b
("ipv6: Don't reduce hop limit for an interface").
The current default value (min hop limit = 1) can result in the same
denial of service that 6fd99094de2b prevents, but it is hard to define
a correct and sane default value.
More generally, it is yet another IPv6 sysctl, and we already have too
many.

This was introduced to satisfy a TAHI test case which, in my opinion, is
too strict, turning the RFC's "SHOULD" into a "MUST":

If the received Cur Hop Limit value is non-zero, the host
SHOULD set its CurHopLimit variable to the received value.

The behavior of this sysctl is wrong in multiple ways.  Some are
fixable, but let's not rush this commit into mainline, and revert this
while we still can, then we can come up with a better solution.

Signed-off-by: Sabrina Dubroca 
---
 Documentation/networking/ip-sysctl.txt |  8 
 include/linux/ipv6.h   |  1 -
 include/uapi/linux/ipv6.h  |  1 -
 net/ipv6/addrconf.c| 10 --
 net/ipv6/ndisc.c   | 16 +---
 5 files changed, 9 insertions(+), 27 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index ebe94f2cab98..c845aca413c8 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1371,14 +1371,6 @@ accept_ra_from_local - BOOLEAN
   disabled if accept_ra_from_local is disabled
on a specific interface.
 
-accept_ra_min_hop_limit - INTEGER
-   Minimum hop limit Information in Router Advertisement.
-
-   Hop limit Information in Router Advertisement less than this
-   variable shall be ignored.
-
-   Default: 1
-
 accept_ra_pinfo - BOOLEAN
Learn Prefix Information in Router Advertisement.
 
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index f1f32af6d9b9..5e6d3b8f5a42 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -29,7 +29,6 @@ struct ipv6_devconf {
__s32   max_desync_factor;
__s32   max_addresses;
__s32   accept_ra_defrtr;
-   __s32   accept_ra_min_hop_limit;
__s32   accept_ra_pinfo;
__s32   ignore_routes_with_linkdown;
 #ifdef CONFIG_IPV6_ROUTER_PREF
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index 38b4fef20219..44d13121de52 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -172,7 +172,6 @@ enum {
DEVCONF_ACCEPT_RA_MTU,
DEVCONF_STABLE_SECRET,
DEVCONF_USE_OIF_ADDRS_ONLY,
-   DEVCONF_ACCEPT_RA_MIN_HOP_LIMIT,
DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN,
DEVCONF_MAX
 };
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 99c0f2b843f0..c8829ce0bbe7 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -195,7 +195,6 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = {
.max_addresses  = IPV6_MAX_ADDRESSES,
.accept_ra_defrtr   = 1,
.accept_ra_from_local   = 0,
-   .accept_ra_min_hop_limit= 1,
.accept_ra_pinfo= 1,
 #ifdef CONFIG_IPV6_ROUTER_PREF
.accept_ra_rtr_pref = 1,
@@ -239,7 +238,6 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly 
= {
.max_addresses  = IPV6_MAX_ADDRESSES,
.accept_ra_defrtr   = 1,
.accept_ra_from_local   = 0,
-   .accept_ra_min_hop_limit= 1,
.accept_ra_pinfo= 1,
 #ifdef CONFIG_IPV6_ROUTER_PREF
.accept_ra_rtr_pref = 1,
@@ -4658,7 +4656,6 @@ static inline void ipv6_store_devconf(struct ipv6_devconf 
*cnf,
array[DEVCONF_MAX_DESYNC_FACTOR] = cnf->max_desync_factor;
array[DEVCONF_MAX_ADDRESSES] = cnf->max_addresses;
array[DEVCONF_ACCEPT_RA_DEFRTR] = cnf->accept_ra_defrtr;
-   array[DEVCONF_ACCEPT_RA_MIN_HOP_LIMIT] = cnf->accept_ra_min_hop_limit;
array[DEVCONF_ACCEPT_RA_PINFO] = cnf->accept_ra_pinfo;
 #ifdef CONFIG_IPV6_ROUTER_PREF
array[DEVCONF_ACCEPT_RA_RTR_PREF] = cnf->accept_ra_rtr_pref;
@@ -5594,13 +5591,6 @@ static struct addrconf_sysctl_table
.proc_handler   = proc_dointvec,
},
{
-   .procname   = "accept_ra_min_hop_limit",
-   .data   = &ipv6_devconf.accept_ra_min_hop_limit,
-   .maxlen = sizeof(int),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec,
-   },
-   {
.procname   = "accept_ra_pinfo",
.data   = &ipv6_devconf.accept_ra_pinfo,
.maxlen = sizeof(int),
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
