Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
On Sat, Apr 02, 2016 at 08:40:55PM +0200, Johannes Berg wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> > +static int mlx4_bpf_set(struct net_device *dev, int fd)
> > +{
> [...]
> > +		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
> > +			bpf_prog_put(prog);
> > +			return -EINVAL;
> > +		}
> > +	}
>
> Why wouldn't this check be done in the generic code that calls
> ndo_bpf_set()?

Having a common check makes sense. The tricky thing is that the type can
only be checked after taking the reference, and I wanted to keep the
scope of the prog brief in the case of errors. I would have to move the
bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
ndo instead. Would that API look fine to you?

A possible extension of this is just to keep the bpf_prog * in the
netdev itself and expose a feature flag from the driver rather than an
ndo. But that would mean another 8 bytes in the netdev.

> johannes
Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
[...]
>
> 1) mlx4 can use multiple fragments (priv->num_frags) to hold an
> Ethernet frame.
>
> Still you pass a single fragment but total 'length' here : BPF program
> can read past the end of this first fragment and panic the box.
>
> Please take a look at mlx4_en_complete_rx_desc() and you'll see what I
> mean.

Sure, I will do some reading. Jesper also raised the issue after you
did. Please let me know what you think of the proposals.

> 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> sharing.
>
> This is probably the right time to add a rx_dropped field in struct
> mlx4_en_rx_ring, since you guys want to drop 14 Mpps, and 50 Mpps on
> higher speed links.

This sounds reasonable! Will look into it for the next spin.
Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote:
[...]
>
> I think you need to DMA sync RX-page before you can safely access
> packet data in page (on all arch's).

Thanks, I will give that a try in the next spin.

> > +		ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > +					 frags[0].page_offset);
> > +		if (mlx4_call_bpf(prog, ethh, length)) {
>
> AFAIK length here covers all the frags[n].page, thus potentially
> causing the BPF program to access memory out of bounds (crash).
>
> Having several page fragments is AFAIK an optimization for
> jumbo-frames on PowerPC (which is a bit annoying for your use case ;-)).

Yeah, this needs some more work. I can think of some options:

1. Limit the pseudo skb.len to the first frag's length only, and signal
   to the program that the packet is incomplete.
2. For nfrags > 1, skip bpf processing, but this could be functionally
   incorrect for some use cases.
3. Run the program for each frag.
4. Reject ndo_bpf_set when frags are possible (large MTU?).

My preference is to go with 1. Thoughts?

> [...]
Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
On Sat, Apr 02, 2016 at 12:47:16PM -0400, Tom Herbert wrote:
> Very nice! Do you think this hook will be sufficient to implement a
> fast forward patch also?

That is the goal, but more work needs to be done of course. It won't be
possible with just a single pseudo skb; the driver will need a fast way
to get batches of pseudo skbs (per core?) through from rx to tx. In
mlx4, for instance, either the skb needs to be much more complete to be
handled from the start of mlx4_en_xmit(), or that function would need
to be split so that the fast tx path could start midway through. Or,
skb allocation just gets much faster. Then it should be pretty
straightforward.

> Tom
[PATCH v3 net-next 6/8] ipv6: process socket-level control messages in IPv6
From: Soheil Hassas Yeganeh Process socket-level control messages by invoking __sock_cmsg_send in ip6_datagram_send_ctl for control messages on the SOL_SOCKET layer. This makes sure whenever ip6_datagram_send_ctl is called for udp and raw, we also process socket-level control messages. This is a bit uglier than IPv4, since IPv6 does not have something like ipcm_cookie. Perhaps we can later create a control message cookie for IPv6? Note that this commit interprets new control messages that were ignored before. As such, this commit does not change the behavior of IPv6 control messages. Signed-off-by: Soheil Hassas Yeganeh Acked-by: Willem de Bruijn --- include/net/transp_v6.h | 3 ++- net/ipv6/datagram.c | 9 - net/ipv6/ip6_flowlabel.c | 3 ++- net/ipv6/ipv6_sockglue.c | 3 ++- net/ipv6/raw.c | 6 +- net/ipv6/udp.c | 5 - net/l2tp/l2tp_ip6.c | 8 +--- 7 files changed, 28 insertions(+), 9 deletions(-) diff --git a/include/net/transp_v6.h b/include/net/transp_v6.h index b927413..2b1c345 100644 --- a/include/net/transp_v6.h +++ b/include/net/transp_v6.h @@ -42,7 +42,8 @@ void ip6_datagram_recv_specific_ctl(struct sock *sk, struct msghdr *msg, int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg, struct flowi6 *fl6, struct ipv6_txoptions *opt, - int *hlimit, int *tclass, int *dontfrag); + int *hlimit, int *tclass, int *dontfrag, + struct sockcm_cookie *sockc); void ip6_dgram_sock_seq_show(struct seq_file *seq, struct sock *sp, __u16 srcp, __u16 destp, int bucket); diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c index 4281621..a73d701 100644 --- a/net/ipv6/datagram.c +++ b/net/ipv6/datagram.c @@ -685,7 +685,8 @@ EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl); int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg, struct flowi6 *fl6, struct ipv6_txoptions *opt, - int *hlimit, int *tclass, int *dontfrag) + int *hlimit, int *tclass, int *dontfrag, + struct sockcm_cookie *sockc) { struct in6_pktinfo *src_info; struct cmsghdr *cmsg; 
@@ -702,6 +703,12 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk, goto exit_f; } + if (cmsg->cmsg_level == SOL_SOCKET) { + if (__sock_cmsg_send(sk, msg, cmsg, sockc)) + return -EINVAL; + continue; + } + if (cmsg->cmsg_level != SOL_IPV6) continue; diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c index dc2db4f..35d3ddc 100644 --- a/net/ipv6/ip6_flowlabel.c +++ b/net/ipv6/ip6_flowlabel.c @@ -372,6 +372,7 @@ fl_create(struct net *net, struct sock *sk, struct in6_flowlabel_req *freq, if (olen > 0) { struct msghdr msg; struct flowi6 flowi6; + struct sockcm_cookie sockc_junk; int junk; err = -ENOMEM; @@ -390,7 +391,7 @@ fl_create(struct net *net, struct sock *sk, struct in6_flowlabel_req *freq, memset(&flowi6, 0, sizeof(flowi6)); err = ip6_datagram_send_ctl(net, sk, &msg, &flowi6, fl->opt, - &junk, &junk, &junk); + &junk, &junk, &junk, &sockc_junk); if (err) goto done; err = -EINVAL; diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c index 4449ad1..a5557d2 100644 --- a/net/ipv6/ipv6_sockglue.c +++ b/net/ipv6/ipv6_sockglue.c @@ -471,6 +471,7 @@ sticky_done: struct ipv6_txoptions *opt = NULL; struct msghdr msg; struct flowi6 fl6; + struct sockcm_cookie sockc_junk; int junk; memset(&fl6, 0, sizeof(fl6)); @@ -503,7 +504,7 @@ sticky_done: msg.msg_control = (void *)(opt+1); retv = ip6_datagram_send_ctl(net, sk, &msg, &fl6, opt, &junk, -&junk, &junk); +&junk, &junk, &sockc_junk); if (retv) goto done; update: diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c index fa59dd7..f175ec0 100644 --- a/net/ipv6/raw.c +++ b/net/ipv6/raw.c @@ -745,6 +745,7 @@ static int rawv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) struct dst_entry *dst = NULL; struct raw6_frag_vec rfv; struct flowi6 fl6; + struct sockcm_cookie sockc; int addr_len = msg->msg_namelen; int hlimit = -1; int tclass = -1; @@ -821,13 +822,16 @@ static int rawv6_sendms
[PATCH v3 net-next 8/8] sock: document timestamping via cmsg in Documentation
From: Soheil Hassas Yeganeh Update docs and add code snippet for using cmsg for timestamping. Signed-off-by: Soheil Hassas Yeganeh Acked-by: Willem de Bruijn --- Documentation/networking/timestamping.txt | 48 +-- 1 file changed, 45 insertions(+), 3 deletions(-) diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt index a977339..671cccf 100644 --- a/Documentation/networking/timestamping.txt +++ b/Documentation/networking/timestamping.txt @@ -44,11 +44,17 @@ timeval of SO_TIMESTAMP (ms). Supports multiple types of timestamp requests. As a result, this socket option takes a bitmap of flags, not a boolean. In - err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, &val); + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, + sizeof(val)); val is an integer with any of the following bits set. Setting other bit returns EINVAL and does not change the current state. +The socket option configures timestamp generation for individual +sk_buffs (1.3.1), timestamp reporting to the socket's error +queue (1.3.2) and options (1.3.3). Timestamp generation can also +be enabled for individual sendmsg calls using cmsg (1.3.4). + 1.3.1 Timestamp Generation @@ -71,13 +77,16 @@ SOF_TIMESTAMPING_RX_SOFTWARE: kernel receive stack. SOF_TIMESTAMPING_TX_HARDWARE: - Request tx timestamps generated by the network adapter. + Request tx timestamps generated by the network adapter. This flag + can be enabled via both socket options and control messages. SOF_TIMESTAMPING_TX_SOFTWARE: Request tx timestamps when data leaves the kernel. These timestamps are generated in the device driver as close as possible, but always prior to, passing the packet to the network interface. Hence, they require driver support and may not be available for all devices. + This flag can be enabled via both socket options and control messages. + SOF_TIMESTAMPING_TX_SCHED: Request tx timestamps prior to entering the packet scheduler. 
Kernel @@ -90,7 +99,8 @@ SOF_TIMESTAMPING_TX_SCHED: machines with virtual devices where a transmitted packet travels through multiple devices and, hence, multiple packet schedulers, a timestamp is generated at each layer. This allows for fine - grained measurement of queuing delay. + grained measurement of queuing delay. This flag can be enabled + via both socket options and control messages. SOF_TIMESTAMPING_TX_ACK: Request tx timestamps when all data in the send buffer has been @@ -99,6 +109,7 @@ SOF_TIMESTAMPING_TX_ACK: over-report measurement, because the timestamp is generated when all data up to and including the buffer at send() was acknowledged: the cumulative acknowledgment. The mechanism ignores SACK and FACK. + This flag can be enabled via both socket options and control messages. 1.3.2 Timestamp Reporting @@ -183,6 +194,37 @@ having access to the contents of the original packet, so cannot be combined with SOF_TIMESTAMPING_OPT_TSONLY. +1.3.4. Enabling timestamps via control messages + +In addition to socket options, timestamp generation can be requested +per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1). +Using this feature, applications can sample timestamps per sendmsg() +without paying the overhead of enabling and disabling timestamps via +setsockopt: + + struct msghdr *msg; + ... + cmsg= CMSG_FIRSTHDR(msg); + cmsg->cmsg_level= SOL_SOCKET; + cmsg->cmsg_type = SO_TIMESTAMPING; + cmsg->cmsg_len = CMSG_LEN(sizeof(__u32)); + *((__u32 *) CMSG_DATA(cmsg)) = SOF_TIMESTAMPING_TX_SCHED | +SOF_TIMESTAMPING_TX_SOFTWARE | +SOF_TIMESTAMPING_TX_ACK; + err = sendmsg(fd, msg, 0); + +The SOF_TIMESTAMPING_TX_* flags set via cmsg will override +the SOF_TIMESTAMPING_TX_* flags set via setsockopt. 
+ +Moreover, applications must still enable timestamp reporting via +setsockopt to receive timestamps: + + __u32 val = SOF_TIMESTAMPING_SOFTWARE | + SOF_TIMESTAMPING_OPT_ID /* or any other flag */; + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, + sizeof(val)); + + 1.4 Bytestream Timestamps The SO_TIMESTAMPING interface supports timestamping of bytes in a -- 2.8.0.rc3.226.g39d4020
[PATCH v3 net-next 4/8] sock: accept SO_TIMESTAMPING flags in socket cmsg
From: Soheil Hassas Yeganeh Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level as a basis to accept timestamping requests per write. This implementation only accepts TX recording flags (i.e., SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE, SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in control messages. Users need to set reporting flags (e.g., SOF_TIMESTAMPING_OPT_ID) per socket via socket options. This commit adds a tsflags field in sockcm_cookie which is set in __sock_cmsg_send. It only override the SOF_TIMESTAMPING_TX_* bits in sockcm_cookie.tsflags allowing the control message to override the recording behavior per write, yet maintaining the value of other flags. This patch implements validating the control message and setting tsflags in struct sockcm_cookie. Next commits in this series will actually implement timestamping per write for different protocols. Signed-off-by: Soheil Hassas Yeganeh Acked-by: Willem de Bruijn --- include/net/sock.h | 1 + include/uapi/linux/net_tstamp.h | 10 ++ net/core/sock.c | 13 + 3 files changed, 24 insertions(+) diff --git a/include/net/sock.h b/include/net/sock.h index 03772d4..af012da 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1418,6 +1418,7 @@ void sk_send_sigurg(struct sock *sk); struct sockcm_cookie { u32 mark; + u16 tsflags; }; int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg, diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h index 6d1abea..264e515 100644 --- a/include/uapi/linux/net_tstamp.h +++ b/include/uapi/linux/net_tstamp.h @@ -31,6 +31,16 @@ enum { SOF_TIMESTAMPING_LAST }; +/* + * SO_TIMESTAMPING flags are either for recording a packet timestamp or for + * reporting the timestamp to user space. + * Recording flags can be set both via socket options and control messages. 
+ */ +#define SOF_TIMESTAMPING_TX_RECORD_MASK(SOF_TIMESTAMPING_TX_HARDWARE | \ +SOF_TIMESTAMPING_TX_SOFTWARE | \ +SOF_TIMESTAMPING_TX_SCHED | \ +SOF_TIMESTAMPING_TX_ACK) + /** * struct hwtstamp_config - %SIOCGHWTSTAMP and %SIOCSHWTSTAMP parameter * diff --git a/net/core/sock.c b/net/core/sock.c index 0a64fe2..315f5e5 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1870,6 +1870,8 @@ EXPORT_SYMBOL(sock_alloc_send_skb); int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg, struct sockcm_cookie *sockc) { + u32 tsflags; + switch (cmsg->cmsg_type) { case SO_MARK: if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) @@ -1878,6 +1880,17 @@ int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg, return -EINVAL; sockc->mark = *(u32 *)CMSG_DATA(cmsg); break; + case SO_TIMESTAMPING: + if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32))) + return -EINVAL; + + tsflags = *(u32 *)CMSG_DATA(cmsg); + if (tsflags & ~SOF_TIMESTAMPING_TX_RECORD_MASK) + return -EINVAL; + + sockc->tsflags &= ~SOF_TIMESTAMPING_TX_RECORD_MASK; + sockc->tsflags |= tsflags; + break; default: return -EINVAL; } -- 2.8.0.rc3.226.g39d4020
[PATCH v3 net-next 2/8] tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO
From: Soheil Hassas Yeganeh

SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs to associate
timestamps with send calls. For TCP connections, tp->snd_una is used as
the starting point to calculate relative IDs.

This socket option will fail if set before the handshake on a passive
TCP fast open connection with data in SYN or SYN/ACK, since setsockopt
requires the connection to be in the ESTABLISHED state.

To address this, instead of limiting the option to the ESTABLISHED
state, accept the SOF_TIMESTAMPING_OPT_ID option as long as the
connection is not in the LISTEN or CLOSE states.

Signed-off-by: Soheil Hassas Yeganeh
Acked-by: Willem de Bruijn
Acked-by: Yuchung Cheng
Acked-by: Eric Dumazet
---
 net/core/sock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 66976f8..0a64fe2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -832,7 +832,8 @@ set_rcvbuf:
 		    !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
 			if (sk->sk_protocol == IPPROTO_TCP &&
 			    sk->sk_type == SOCK_STREAM) {
-				if (sk->sk_state != TCP_ESTABLISHED) {
+				if ((1 << sk->sk_state) &
+				    (TCPF_CLOSE | TCPF_LISTEN)) {
 					ret = -EINVAL;
 					break;
 				}
-- 
2.8.0.rc3.226.g39d4020
[PATCH v3 net-next 5/8] ipv4: process socket-level control messages in IPv4
From: Soheil Hassas Yeganeh Process socket-level control messages by invoking __sock_cmsg_send in ip_cmsg_send for control messages on the SOL_SOCKET layer. This makes sure whenever ip_cmsg_send is called in udp, icmp, and raw, we also process socket-level control messages. Note that this commit interprets new control messages that were ignored before. As such, this commit does not change the behavior of IPv4 control messages. Signed-off-by: Soheil Hassas Yeganeh Acked-by: Willem de Bruijn --- include/net/ip.h | 3 ++- net/ipv4/ip_sockglue.c | 9 - net/ipv4/ping.c| 2 +- net/ipv4/raw.c | 2 +- net/ipv4/udp.c | 3 +-- 5 files changed, 13 insertions(+), 6 deletions(-) diff --git a/include/net/ip.h b/include/net/ip.h index fad74d3..93725e5 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -56,6 +56,7 @@ static inline unsigned int ip_hdrlen(const struct sk_buff *skb) } struct ipcm_cookie { + struct sockcm_cookiesockc; __be32 addr; int oif; struct ip_options_rcu *opt; @@ -550,7 +551,7 @@ int ip_options_rcv_srr(struct sk_buff *skb); void ipv4_pktinfo_prepare(const struct sock *sk, struct sk_buff *skb); void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb, int offset); -int ip_cmsg_send(struct net *net, struct msghdr *msg, +int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc, bool allow_ipv6); int ip_setsockopt(struct sock *sk, int level, int optname, char __user *optval, unsigned int optlen); diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c index 035ad64..1b7c077 100644 --- a/net/ipv4/ip_sockglue.c +++ b/net/ipv4/ip_sockglue.c @@ -219,11 +219,12 @@ void ip_cmsg_recv_offset(struct msghdr *msg, struct sk_buff *skb, } EXPORT_SYMBOL(ip_cmsg_recv_offset); -int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc, +int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc, bool allow_ipv6) { int err, val; struct cmsghdr *cmsg; + struct net *net = sock_net(sk); for_each_cmsghdr(cmsg, 
msg) { if (!CMSG_OK(msg, cmsg)) @@ -244,6 +245,12 @@ int ip_cmsg_send(struct net *net, struct msghdr *msg, struct ipcm_cookie *ipc, continue; } #endif + if (cmsg->cmsg_level == SOL_SOCKET) { + if (__sock_cmsg_send(sk, msg, cmsg, &ipc->sockc)) + return -EINVAL; + continue; + } + if (cmsg->cmsg_level != SOL_IP) continue; switch (cmsg->cmsg_type) { diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c index cf9700b..670639b 100644 --- a/net/ipv4/ping.c +++ b/net/ipv4/ping.c @@ -747,7 +747,7 @@ static int ping_v4_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) sock_tx_timestamp(sk, &ipc.tx_flags); if (msg->msg_controllen) { - err = ip_cmsg_send(sock_net(sk), msg, &ipc, false); + err = ip_cmsg_send(sk, msg, &ipc, false); if (unlikely(err)) { kfree(ipc.opt); return err; diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 8d22de7..088ce66 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -548,7 +548,7 @@ static int raw_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) ipc.oif = sk->sk_bound_dev_if; if (msg->msg_controllen) { - err = ip_cmsg_send(net, msg, &ipc, false); + err = ip_cmsg_send(sk, msg, &ipc, false); if (unlikely(err)) { kfree(ipc.opt); goto out; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 08eed5e..bccb4e1 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1034,8 +1034,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) sock_tx_timestamp(sk, &ipc.tx_flags); if (msg->msg_controllen) { - err = ip_cmsg_send(sock_net(sk), msg, &ipc, - sk->sk_family == AF_INET6); + err = ip_cmsg_send(sk, msg, &ipc, sk->sk_family == AF_INET6); if (unlikely(err)) { kfree(ipc.opt); return err; -- 2.8.0.rc3.226.g39d4020
[PATCH v3 net-next 7/8] sock: enable timestamping using control messages
From: Soheil Hassas Yeganeh Currently, SOL_TIMESTAMPING can only be enabled using setsockopt. This is very costly when users want to sample writes to gather tx timestamps. Add support for enabling SO_TIMESTAMPING via control messages by using tsflags added in `struct sockcm_cookie` (added in the previous patches in this series) to set the tx_flags of the last skb created in a sendmsg. With this patch, the timestamp recording bits in tx_flags of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg. Please note that this is only effective for overriding the recording timestamps flags. Users should enable timestamp reporting (e.g., SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using socket options and then should ask for SOF_TIMESTAMPING_TX_* using control messages per sendmsg to sample timestamps for each write. Signed-off-by: Soheil Hassas Yeganeh Acked-by: Willem de Bruijn --- drivers/net/tun.c | 3 ++- include/net/ipv6.h | 6 -- include/net/sock.h | 10 ++ net/can/raw.c | 2 +- net/ipv4/ping.c| 5 +++-- net/ipv4/raw.c | 11 ++- net/ipv4/tcp.c | 20 +++- net/ipv4/udp.c | 7 --- net/ipv6/icmp.c| 6 -- net/ipv6/ip6_output.c | 15 +-- net/ipv6/ping.c| 3 ++- net/ipv6/raw.c | 5 ++--- net/ipv6/udp.c | 7 --- net/l2tp/l2tp_ip6.c| 2 +- net/packet/af_packet.c | 30 +- net/socket.c | 10 +- 16 files changed, 93 insertions(+), 49 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index afdf950..6d2fcd0 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -860,7 +860,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) goto drop; if (skb->sk && sk_fullsock(skb->sk)) { - sock_tx_timestamp(skb->sk, &skb_shinfo(skb)->tx_flags); + sock_tx_timestamp(skb->sk, skb->sk->sk_tsflags, + &skb_shinfo(skb)->tx_flags); sw_tx_timestamp(skb); } diff --git a/include/net/ipv6.h b/include/net/ipv6.h index d0aeb97..55ee1eb 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -867,7 +867,8 @@ int ip6_append_data(struct sock *sk, int odd, 
struct sk_buff *skb), void *from, int length, int transhdrlen, int hlimit, int tclass, struct ipv6_txoptions *opt, struct flowi6 *fl6, - struct rt6_info *rt, unsigned int flags, int dontfrag); + struct rt6_info *rt, unsigned int flags, int dontfrag, + const struct sockcm_cookie *sockc); int ip6_push_pending_frames(struct sock *sk); @@ -884,7 +885,8 @@ struct sk_buff *ip6_make_skb(struct sock *sk, void *from, int length, int transhdrlen, int hlimit, int tclass, struct ipv6_txoptions *opt, struct flowi6 *fl6, struct rt6_info *rt, -unsigned int flags, int dontfrag); +unsigned int flags, int dontfrag, +const struct sockcm_cookie *sockc); static inline struct sk_buff *ip6_finish_skb(struct sock *sk) { diff --git a/include/net/sock.h b/include/net/sock.h index af012da..e91b87f 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -2057,19 +2057,21 @@ static inline void sock_recv_ts_and_drops(struct msghdr *msg, struct sock *sk, sk->sk_stamp = skb->tstamp; } -void __sock_tx_timestamp(const struct sock *sk, __u8 *tx_flags); +void __sock_tx_timestamp(__u16 tsflags, __u8 *tx_flags); /** * sock_tx_timestamp - checks whether the outgoing packet is to be time stamped * @sk:socket sending this packet + * @tsflags: timestamping flags to use * @tx_flags: completed with instructions for time stamping * * Note : callers should take care of initial *tx_flags value (usually 0) */ -static inline void sock_tx_timestamp(const struct sock *sk, __u8 *tx_flags) +static inline void sock_tx_timestamp(const struct sock *sk, __u16 tsflags, +__u8 *tx_flags) { - if (unlikely(sk->sk_tsflags)) - __sock_tx_timestamp(sk, tx_flags); + if (unlikely(tsflags)) + __sock_tx_timestamp(tsflags, tx_flags); if (unlikely(sock_flag(sk, SOCK_WIFI_STATUS))) *tx_flags |= SKBTX_WIFI_STATUS; } diff --git a/net/can/raw.c b/net/can/raw.c index 2e67b14..972c187 100644 --- a/net/can/raw.c +++ b/net/can/raw.c @@ -755,7 +755,7 @@ static int raw_sendmsg(struct socket *sock, struct msghdr *msg, size_t size) if (err < 
0) goto free_skb; - sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags); + sock_tx_timestamp(sk, sk->sk_tsflags, &skb_shinfo(skb)->tx_flags); skb->dev
[PATCH v3 net-next 3/8] tcp: use one bit in TCP_SKB_CB to mark ACK timestamps
From: Soheil Hassas Yeganeh

Currently, to avoid a cache line miss for accessing skb_shinfo,
tcp_ack_tstamp skips sockets that do not have the
SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is implemented
based on the implicit assumption that SOF_TIMESTAMPING_TX_ACK is set
via socket options for the duration that ACK timestamps are needed.

To implement per-write timestamps, this check should be removed and
replaced with a per-packet alternative that quickly skips packets
missing ACK timestamp marks without a cache-line miss.

To enable per-packet marking without a cache line miss, use one bit in
TCP_SKB_CB to mark whether an SKB might need an ack tx timestamp or
not. Further checks in tcp_ack_tstamp are not modified and work as
before.

Signed-off-by: Soheil Hassas Yeganeh
Acked-by: Willem de Bruijn
Acked-by: Eric Dumazet
---
 include/net/tcp.h    | 3 ++-
 net/ipv4/tcp.c       | 2 ++
 net/ipv4/tcp_input.c | 2 +-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b91370f..f3a80ec 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -754,7 +754,8 @@ struct tcp_skb_cb {
 				TCPCB_REPAIRED)
 
 	__u8		ip_dsfield;	/* IPv4 tos or IPv6 dsfield	*/
-	/* 1 byte hole */
+	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
+			unused:7;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
 	union {
 		struct inet_skb_parm	h4;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 08b8b96..ce3c9eb 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -432,10 +432,12 @@ static void tcp_tx_timestamp(struct sock *sk, struct sk_buff *skb)
 {
 	if (sk->sk_tsflags) {
 		struct skb_shared_info *shinfo = skb_shinfo(skb);
+		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
 		sock_tx_timestamp(sk, &shinfo->tx_flags);
 		if (shinfo->tx_flags & SKBTX_ANY_TSTAMP)
 			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+		tcb->txstamp_ack = !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
 	}
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e6e65f7..2d5fee4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3093,7 +3093,7 @@ static void tcp_ack_tstamp(struct sock *sk, struct sk_buff *skb,
 	const struct skb_shared_info *shinfo;
 
 	/* Avoid cache line misses to get skb_shinfo() and shinfo->tx_flags */
-	if (likely(!(sk->sk_tsflags & SOF_TIMESTAMPING_TX_ACK)))
+	if (likely(!TCP_SKB_CB(skb)->txstamp_ack))
 		return;
 
 	shinfo = skb_shinfo(skb);
-- 
2.8.0.rc3.226.g39d4020
[PATCH v3 net-next 0/8] add TX timestamping via cmsg
From: Soheil Hassas Yeganeh

This patch series aims at enabling TX timestamping via cmsg.

Currently, to occasionally sample TX timestamps on a socket,
applications need to call setsockopt twice: first for enabling
timestamps and then for disabling them. This is an unnecessary
overhead. With cmsg, in contrast, applications can sample TX
timestamps per sendmsg().

This patch series adds the code for processing SO_TIMESTAMPING for
cmsgs of the SOL_SOCKET level, and adds the glue code for TCP, UDP, and
RAW for both IPv4 and IPv6. This implementation supports overriding
timestamp generation flags (i.e., SOF_TIMESTAMPING_TX_*) but not
timestamp reporting flags. Applications must still enable timestamp
reporting via setsockopt to receive timestamps.

This series does not change existing timestamping behavior for
applications that are using socket options.

I will follow up with another patch to enable timestamping for active
TFO (client-side TCP Fast Open) and also setting the packet mark via
cmsgs. Thanks!

Changes in v2:
- Replace u32 with __u32 in the documentation.

Changes in v3:
- Fix the broken build for L2TP (due to changes in IPv6).
Soheil Hassas Yeganeh (7): tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO tcp: use one bit in TCP_SKB_CB to mark ACK timestamps sock: accept SO_TIMESTAMPING flags in socket cmsg ipv4: process socket-level control messages in IPv4 ipv6: process socket-level control messages in IPv6 sock: enable timestamping using control messages sock: document timestamping via cmsg in Documentation Willem de Bruijn (1): sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop Documentation/networking/timestamping.txt | 48 -- drivers/net/tun.c | 3 +- include/net/ip.h | 3 +- include/net/ipv6.h| 6 ++-- include/net/sock.h| 13 +--- include/net/tcp.h | 3 +- include/net/transp_v6.h | 3 +- include/uapi/linux/net_tstamp.h | 10 +++ net/can/raw.c | 2 +- net/core/sock.c | 49 +++ net/ipv4/ip_sockglue.c| 9 +- net/ipv4/ping.c | 7 +++-- net/ipv4/raw.c| 13 net/ipv4/tcp.c| 22 ++ net/ipv4/tcp_input.c | 2 +- net/ipv4/udp.c| 10 +++ net/ipv6/datagram.c | 9 +- net/ipv6/icmp.c | 6 ++-- net/ipv6/ip6_flowlabel.c | 3 +- net/ipv6/ip6_output.c | 15 ++ net/ipv6/ipv6_sockglue.c | 3 +- net/ipv6/ping.c | 3 +- net/ipv6/raw.c| 7 +++-- net/ipv6/udp.c| 10 +-- net/l2tp/l2tp_ip6.c | 10 --- net/packet/af_packet.c| 30 +++ net/socket.c | 10 +++ 27 files changed, 231 insertions(+), 78 deletions(-) -- 2.8.0.rc3.226.g39d4020
[PATCH v3 net-next 1/8] sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop
From: Willem de Bruijn To process cmsg's of the SOL_SOCKET level in addition to cmsgs of another level, protocols can call sock_cmsg_send(). This causes a double walk on the cmsghdr list, one for SOL_SOCKET and one for the other level. Extract the inner demultiplex logic from the loop that walks the list, to allow having this called directly from a walker in the protocol specific code. Signed-off-by: Willem de Bruijn Signed-off-by: Soheil Hassas Yeganeh --- include/net/sock.h | 2 ++ net/core/sock.c| 33 ++--- 2 files changed, 24 insertions(+), 11 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 255d3e0..03772d4 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -1420,6 +1420,8 @@ struct sockcm_cookie { u32 mark; }; +int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg, +struct sockcm_cookie *sockc); int sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct sockcm_cookie *sockc); diff --git a/net/core/sock.c b/net/core/sock.c index b67b9ae..66976f8 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1866,27 +1866,38 @@ struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size, } EXPORT_SYMBOL(sock_alloc_send_skb); +int __sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct cmsghdr *cmsg, +struct sockcm_cookie *sockc) +{ + switch (cmsg->cmsg_type) { + case SO_MARK: + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) + return -EPERM; + if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32))) + return -EINVAL; + sockc->mark = *(u32 *)CMSG_DATA(cmsg); + break; + default: + return -EINVAL; + } + return 0; +} +EXPORT_SYMBOL(__sock_cmsg_send); + int sock_cmsg_send(struct sock *sk, struct msghdr *msg, struct sockcm_cookie *sockc) { struct cmsghdr *cmsg; + int ret; for_each_cmsghdr(cmsg, msg) { if (!CMSG_OK(msg, cmsg)) return -EINVAL; if (cmsg->cmsg_level != SOL_SOCKET) continue; - switch (cmsg->cmsg_type) { - case SO_MARK: - if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) - return -EPERM; - 
if (cmsg->cmsg_len != CMSG_LEN(sizeof(u32))) - return -EINVAL; - sockc->mark = *(u32 *)CMSG_DATA(cmsg); - break; - default: - return -EINVAL; - } + ret = __sock_cmsg_send(sk, msg, cmsg, sockc); + if (ret) + return ret; } return 0; } -- 2.8.0.rc3.226.g39d4020
Re: [RFC PATCH net 3/4] ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update
On Fri, Apr 01, 2016 at 04:13:41PM -0700, Cong Wang wrote:
> On Fri, Apr 1, 2016 at 3:56 PM, Martin KaFai Lau wrote:
> > +	bh_lock_sock(sk);
> > +	if (!sock_owned_by_user(sk))
> > +		ip6_datagram_dst_update(sk, false);
> > +	bh_unlock_sock(sk);
>
> My discussion with Eric shows that we probably don't need to hold
> this sock lock here, and you are Cc'ed in that thread, so
>
> 1) why do you still take the lock here?
> 2) why didn't you involve in our discussion if you disagree?

It is because I agree with the last thread's discussion that updating
sk->sk_dst_cache does not need a sk lock. I also don't see that a lock
is needed for the other operations in that thread. I am thinking of
another case that does need a lock, which is why I started another RFC
thread. A quick recall from this commit message:

>> It is done under the '!sock_owned_by_user(sk)' condition because
>> the user may make another ip6_datagram_connect() while
>> dst lookup and update are happening.

If that could not happen, then the lock is not needed.

One thing to note is that this patch uses the addresses from the sk
instead of the iph when updating sk->sk_dst_cache. It is basically the
same logic that __ip6_datagram_connect() is doing, hence some
refactoring work in the first two patches.

AFAIK, a UDP socket can become connected after sending out some
datagrams in the un-connected state, or it can be connected multiple
times to different destinations. I did some quick tests, but I could be
wrong. I am wondering whether there could be a chance that the
skb->data, which has the original outgoing iph, is not related to the
currently connected address. If that is possible, we have to
specifically use the addresses in the sk instead of skb->data (i.e. the
iph) when updating sk->sk_dst_cache.

If we need to use the sk addresses (and other info) to find a new dst
for a connected udp socket, it is better not to do it while userland is
connecting to somewhere else.
If the above case is impossible, we can keep using the info from iph to do the dst update for a connected-udp sk without taking the lock. >> diff --git a/net/ipv6/route.c b/net/ipv6/route.c >> index ed44663..f7e6a6d 100644 >> --- a/net/ipv6/route.c >> +++ b/net/ipv6/route.c >> @@ -1417,8 +1417,19 @@ EXPORT_SYMBOL_GPL(ip6_update_pmtu); >> >> void ip6_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, __be32 mtu) >> { >> +struct dst_entry *dst; >> + >> ip6_update_pmtu(skb, sock_net(sk), mtu, >> sk->sk_bound_dev_if, sk->sk_mark); iph's addresses are used to update the pmtu. It is fine because it does not update the sk->sk_dst_cache. >>> + >> +dst = __sk_dst_get(sk); >> +if (!dst || dst->ops->check(dst, inet6_sk(sk)->dst_cookie)) >> +return; >> + >> +bh_lock_sock(sk); >> +if (!sock_owned_by_user(sk)) sk is not connecting to another address. Find a new dst for the connected address. >> +ip6_datagram_dst_update(sk, false); >> +bh_unlock_sock(sk); >> }
Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
On Sun, Apr 3, 2016 at 7:57 AM, Tom Herbert wrote: > I am curious though, how do you think this would specifically help > Android with power? Seems like the receiver still needs to be powered > to receive packets to filter them anyway... The receiver is powered up, but its wake/sleep cycles are much shorter than the main CPU's. On a phone, leaving the CPU asleep with wifi on might consume ~5mA average, but getting the CPU out of suspend might average ~200mA for ~300ms as the system comes out of sleep, initializes other hardware, wakes up userspace processes whose timeouts have fired, freezes, and suspends again. Receiving one such superfluous packet every 3 seconds (e.g., on networks that send identical IPv6 RAs once every 3 seconds) works out to ~25mA, which is 5x the cost of idle. Pushing down filters to the hardware so it can drop the packet without waking up the CPU thus saves a lot of idle power. That said, getting BPF to the driver is part of the picture. On the chipsets we're targeting for APF, we're only seeing 2k-4k of memory (that's 256-512 BPF instructions) available for filtering code, which means that BPF might be too large.
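The power arithmetic above reduces to a simple duty-cycle average. A quick sketch, where the current and duration figures are the rough estimates quoted in the mail, not measurements:

```c
/* Duty-cycle average current over one wake period, in mA. */
double avg_current_ma(double idle_ma, double wake_ma,
                      double wake_s, double period_s)
{
        double awake = wake_ma * wake_s;               /* mA*s while awake  */
        double asleep = idle_ma * (period_s - wake_s); /* mA*s while asleep */
        return (awake + asleep) / period_s;
}
```

With the figures above (5 mA asleep, 200 mA for 0.3 s out of every 3 s) this gives 24.5 mA, i.e. the ~25mA / "5x the cost of idle" numbers in the mail.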
Re: [PATCH v2 net-next 0/8] add TX timestamping via cmsg
On Sat, Apr 2, 2016 at 9:27 PM, David Miller wrote: > From: David Miller > Date: Sat, 02 Apr 2016 21:19:42 -0400 (EDT) > >> Series applied, thanks. > > I had to revert, this breaks the build: > > net/l2tp/l2tp_ip6.c: In function ‘l2tp_ip6_sendmsg’: > net/l2tp/l2tp_ip6.c:565:9: error: too few arguments to function > ‘ip6_datagram_send_ctl’ >err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, opt, > ^ > In file included from net/l2tp/l2tp_ip6.c:33:0: > include/net/transp_v6.h:43:5: note: declared here > int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr > *msg, > ^ > net/l2tp/l2tp_ip6.c:625:8: error: too few arguments to function > ‘ip6_append_data’ > err = ip6_append_data(sk, ip_generic_getfrag, msg, > ^ > In file included from include/net/inetpeer.h:15:0, > from include/net/route.h:28, > from include/net/ip.h:31, > from net/l2tp/l2tp_ip6.c:23: > include/net/ipv6.h:865:5: note: declared here > int ip6_append_data(struct sock *sk, > ^ I'm really sorry about this. CONFIG_L2TP was not enabled in my config. I'll fix the patch and mail v3. Thanks, Soheil
Re: [PATCH v2 net-next 0/8] add TX timestamping via cmsg
From: David Miller Date: Sat, 02 Apr 2016 21:19:42 -0400 (EDT) > Series applied, thanks. I had to revert, this breaks the build: net/l2tp/l2tp_ip6.c: In function ‘l2tp_ip6_sendmsg’: net/l2tp/l2tp_ip6.c:565:9: error: too few arguments to function ‘ip6_datagram_send_ctl’ err = ip6_datagram_send_ctl(sock_net(sk), sk, msg, &fl6, opt, ^ In file included from net/l2tp/l2tp_ip6.c:33:0: include/net/transp_v6.h:43:5: note: declared here int ip6_datagram_send_ctl(struct net *net, struct sock *sk, struct msghdr *msg, ^ net/l2tp/l2tp_ip6.c:625:8: error: too few arguments to function ‘ip6_append_data’ err = ip6_append_data(sk, ip_generic_getfrag, msg, ^ In file included from include/net/inetpeer.h:15:0, from include/net/route.h:28, from include/net/ip.h:31, from net/l2tp/l2tp_ip6.c:23: include/net/ipv6.h:865:5: note: declared here int ip6_append_data(struct sock *sk, ^
Re: [PATCH v2 net-next 0/8] add TX timestamping via cmsg
From: Soheil Hassas Yeganeh Date: Fri, 1 Apr 2016 11:04:32 -0400 > From: Soheil Hassas Yeganeh > > This patch series aims at enabling TX timestamping via cmsg. > > Currently, to occasionally sample TX timestamping on a socket, > applications need to call setsockopt twice: first for enabling > timestamps and then for disabling them. This is an unnecessary > overhead. With cmsg, in contrast, applications can sample TX > timestamps per sendmsg(). > > This patch series adds the code for processing SO_TIMESTAMPING > for cmsg's of the SOL_SOCKET level, and adds the glue code for > TCP, UDP, and RAW for both IPv4 and IPv6. This implementation > supports overriding timestamp generation flags (i.e., > SOF_TIMESTAMPING_TX_*) but not timestamp reporting flags. > Applications must still enable timestamp reporting via > setsockopt to receive timestamps. > > This series does not change existing timestamping behavior for > applications that are using socket options. > > I will follow up with another patch to enable timestamping for > active TFO (client-side TCP Fast Open) and also setting packet > mark via cmsgs. ... > Changes in v2: > - Replace u32 with __u32 in the documentation. Series applied, thanks.
Re: [RESEND PATCH net-next 00/13] Enhance stmmac driver to support GMAC4.x IP
From: Alexandre TORGUE Date: Fri, 1 Apr 2016 11:37:24 +0200 > This is a subset of patches to enhance the current stmmac driver to support > the new GMAC4.x chips. A new set of callbacks is defined to support this new > family: descriptors, dma, core. Series applied, thanks.
Re: [PATCH v2 net-next] net: hns: add support of pause frame ctrl for HNS V2
From: Yisen Zhuang Date: Thu, 31 Mar 2016 21:00:09 +0800 > From: Lisheng > > The patch adds support for pause frame ctrl for HNS V2, a feature that is > missing from HNS V1: >1) service ports can disable rx pause frames, >2) debug ports can enable tx/rx pause frames. > > And this patch updates the REGs related to pause ctrl when the > update-status function is called by the upper-layer routine. > > Signed-off-by: Lisheng > Signed-off-by: Yisen Zhuang > Reviewed-by: Andy Shevchenko Applied.
Re: [PATCH] netlink: use nla_get_in_addr and nla_put_in_addr for ipv4 address
From: Haishuang Yan Date: Thu, 31 Mar 2016 18:21:38 +0800 > Since nla_get_in_addr and nla_put_in_addr have been implemented, > use them where appropriate. > > Signed-off-by: Haishuang Yan Applied, thank you.
Re: [PATCH v2 net-next] tcp: remove cwnd moderation after recovery
From: Yuchung Cheng Date: Wed, 30 Mar 2016 14:54:20 -0700 > For non-SACK connections, cwnd is lowered to inflight plus 3 packets > when the recovery ends. This is an optional feature in the NewReno > RFC 2582 to reduce the potential burst when cwnd is "re-opened" > after recovery and inflight is low. > > This feature is questionably effective because of PRR: when > the recovery ends (i.e., snd_una == high_seq) NewReno holds the > CA_Recovery state for another round trip to prevent false fast > retransmits. But if the inflight is low, PRR will overwrite the > moderated cwnd in tcp_cwnd_reduction() later regardless. So if a > receiver responds with bogus ACKs (i.e., acking future data) to speed up > transfer after recovery, it can only induce a burst of up to a window's > worth of data packets by acking up to SND.NXT. A restart from (short) > idle or receiving stretched ACKs can both cause such bursts as well. > > On the other hand, if the recovery ends because the sender > detects that the losses were spurious (e.g., reordering), this feature > unconditionally lowers a reverted cwnd even though nothing > was lost. > > On principle, the loss recovery module should not update cwnd. Furthermore, > pacing is much more effective at reducing bursts. Hence this patch > removes the cwnd moderation feature. > > v2 changes: revised commit message on bogus ACKs and burst, and > missing signature > > Signed-off-by: Matt Mathis > Signed-off-by: Neal Cardwell > Signed-off-by: Soheil Hassas Yeganeh > Signed-off-by: Yuchung Cheng Applied, thanks.
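For reference, the moderation this patch removes is just a cap of cwnd at the in-flight data plus a small burst allowance when recovery ends. A sketch of the RFC 2582 idea (not the kernel's exact code):

```c
/* NewReno (RFC 2582) optional cwnd moderation on recovery exit:
 * bound cwnd to what is actually in flight plus a 3-packet burst
 * allowance, so "re-opening" cwnd cannot dump a large burst on the
 * wire when inflight is low. */
unsigned int moderate_cwnd(unsigned int cwnd, unsigned int inflight)
{
        unsigned int cap = inflight + 3;

        return cwnd < cap ? cwnd : cap;
}
```

As the commit message argues, under PRR this cap is either overwritten later anyway or, when the recovery is undone as spurious, needlessly shrinks a reverted cwnd.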
Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
On Sat, Apr 2, 2016 at 2:41 PM, Johannes Berg wrote: > On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote: >> This patch set introduces new infrastructure for programmatically >> processing packets in the earliest stages of rx, as part of an effort >> others are calling Express Data Path (XDP) [1]. Start this effort by >> introducing a new bpf program type for early packet filtering, before >> even >> an skb has been allocated. >> >> With this, hope to enable line rate filtering, with this initial >> implementation providing drop/allow action only. > > Since this is handed to the driver in some way, I assume the API would > also allow offloading the program to the NIC itself, and as such be > useful for what Android wants to do to save power in wireless? > Conceptually, yes. There is some ongoing work to offload BPF and one goal is that BPF programs (like for XDP) could be portable between userspace, kernel (maybe even other OSes), and devices. I am curious though, how do you think this would specifically help Android with power? Seems like the receiver still needs to be powered to receive packets to filter them anyway... Thanks, Tom > johannes
Re: bridge/brctl/ip
On 04/02/2016 09:26 PM, Bert Vermeulen wrote: > Hi all, > > I'm wondering about the current userspace toolset to control bridging in > the Linux kernel. As far as I can determine, functionality is a bit > scattered right now between the iproute2 (ip, bridge) and bridge-utils > (brctl) tools: > > - creating/deleting bridges: ip or brctl > - adding/deleting ports to/from bridge: brctl only

ip link set dev ethX master bridgeY
ip link set dev ethX nomaster

> - showing bridge fdb: brctl (in-kernel fdb), bridge (hardware offloaded > fdb) (!)

bridge fdb show - shows all fdb entries, offloaded or not.

> ...and no doubt a few other things. > > Also the brctl tool seems not to be getting updates, whereas the > iproute2 tools are of course updated regularly. Is brctl considered > obsolete?

iproute2 supports (almost, user-space stp?) everything now, there have been many recent additions to the options that can be manipulated.

$ ip link set dev bridge0 type bridge help
Usage: ... bridge [ forward_delay FORWARD_DELAY ]
                  [ hello_time HELLO_TIME ]
                  [ max_age MAX_AGE ]
                  [ ageing_time AGEING_TIME ]
                  [ stp_state STP_STATE ]
                  [ priority PRIORITY ]
                  [ group_fwd_mask MASK ]
                  [ group_address ADDRESS ]
                  [ vlan_filtering VLAN_FILTERING ]
                  [ vlan_protocol VLAN_PROTOCOL ]
                  [ vlan_default_pvid VLAN_DEFAULT_PVID ]
                  [ mcast_snooping MULTICAST_SNOOPING ]
                  [ mcast_router MULTICAST_ROUTER ]
                  [ mcast_query_use_ifaddr MCAST_QUERY_USE_IFADDR ]
                  [ mcast_querier MULTICAST_QUERIER ]
                  [ mcast_hash_elasticity HASH_ELASTICITY ]
                  [ mcast_hash_max HASH_MAX ]
                  [ mcast_last_member_count LAST_MEMBER_COUNT ]
                  [ mcast_startup_query_count STARTUP_QUERY_COUNT ]
                  [ mcast_last_member_interval LAST_MEMBER_INTERVAL ]
                  [ mcast_membership_interval MEMBERSHIP_INTERVAL ]
                  [ mcast_querier_interval QUERIER_INTERVAL ]
                  [ mcast_query_interval QUERY_INTERVAL ]
                  [ mcast_query_response_interval QUERY_RESPONSE_INTERVAL ]
                  [ mcast_startup_query_interval STARTUP_QUERY_INTERVAL ]
                  [ nf_call_iptables NF_CALL_IPTABLES ]
                  [ nf_call_ip6tables NF_CALL_IP6TABLES ]
                  [ nf_call_arptables NF_CALL_ARPTABLES ]

Where: VLAN_PROTOCOL := { 802.1Q | 802.1ad }

> > If that is the case, would patches to add the missing functionality into > the bridge tool be welcome? I'm thinking primarily of creating/deleting > bridges, and adding/deleting ports in bridges.
Re: bridge/brctl/ip
On Sat, Apr 02, 2016 at 09:26:55PM +0200, Bert Vermeulen wrote: > Hi all, > > I'm wondering about the current userspace toolset to control bridging in > the Linux kernel. As far as I can determine, functionality is a bit > scattered right now between the iproute2 (ip, bridge) and bridge-utils > (brctl) tools: > > - adding/deleting ports to/from bridge: brctl only ip link set lan0 master br0 I think most of the normal operations can be done with iproute2. What might be missing is things like setting the forwarding delay, hello time, etc. Andrew
Re: net: memory leak due to CLONE_NEWNET
On Sat, Apr 2, 2016 at 6:55 AM, Dmitry Vyukov wrote:
> Hello,
>
> The following program leads to memory leaks in:
>
> unreferenced object 0x88005c10d208 (size 96):
>   comm "a.out", pid 10753, jiffies 4296778619 (age 43.118s)
>   hex dump (first 32 bytes):
>     e8 31 85 2d 00 88 ff ff 0f 00 00 00 00 00 00 00  .1.-
>     00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
>   backtrace:
>     [] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:915
>     [< inline >] kmemleak_alloc_recursive include/linux/kmemleak.h:47
>     [< inline >] slab_post_alloc_hook mm/slab.h:406
>     [< inline >] slab_alloc_node mm/slub.c:2602
>     [< inline >] slab_alloc mm/slub.c:2610
>     [] kmem_cache_alloc_trace+0x160/0x3d0 mm/slub.c:2627
>     [< inline >] kmalloc include/linux/slab.h:478
>     [< inline >] tc_action_net_init include/net/act_api.h:122
>     [] csum_init_net+0x15e/0x450 net/sched/act_csum.c:593
>     [] ops_init+0xa9/0x3a0 net/core/net_namespace.c:109
>     [] setup_net+0x1b4/0x3e0 net/core/net_namespace.c:287
>     [] copy_net_ns+0xd6/0x1a0 net/core/net_namespace.c:367
>     [] create_new_namespaces+0x37f/0x740 kernel/nsproxy.c:106
>     [] unshare_nsproxy_namespaces+0xa9/0x1d0

The following patch should fix it.

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 2a19fe1..03e322b 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -135,6 +135,7 @@ void tcf_hashinfo_destroy(const struct tc_action_ops *ops,
 static inline void tc_action_net_exit(struct tc_action_net *tn)
 {
 	tcf_hashinfo_destroy(tn->ops, tn->hinfo);
+	kfree(tn->hinfo);
 }
 
 int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb,
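The bug and fix follow the usual pernet init/exit pairing rule: whatever tc_action_net_init() allocates, tc_action_net_exit() must free. A minimal user-space analogue of the pattern (toy names, not the kernel API):

```c
#include <stdlib.h>

struct toy_action_net { void *hinfo; };

/* Analogue of tc_action_net_init(): kmalloc()s the hash table. */
int toy_net_init(struct toy_action_net *tn)
{
        tn->hinfo = calloc(16, sizeof(long));
        return tn->hinfo ? 0 : -1;
}

/* Analogue of the fixed tc_action_net_exit(): tearing down the table's
 * entries is not enough; the table allocation itself must be freed,
 * otherwise every net-namespace teardown leaks it, which is exactly
 * what the kmemleak report above shows. */
void toy_net_exit(struct toy_action_net *tn)
{
        /* ... tcf_hashinfo_destroy() equivalent would run here ... */
        free(tn->hinfo);   /* the line the patch adds */
        tn->hinfo = NULL;
}
```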
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On 04/01/2016 07:21 PM, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 22:16 -0400, David Miller wrote:
>> From: Alexander Duyck
>> Date: Fri, 1 Apr 2016 12:58:41 -0700
>>> RFC 6864 is pretty explicit about this, IPv4 ID used only for fragmentation. https://tools.ietf.org/html/rfc6864#section-4.1 The goal with this change is to try and keep most of the existing behavior intact without violating this rule. I would think the sequence number should give you the ability to infer a drop in the case of TCP. In the case of UDP tunnels we are now getting a bit more data since we were ignoring the outer IP header ID before.
>> When retransmits happen, the sequence numbers are the same. But you can then use the IP ID to see exactly what happened. You can even tell if multiple retransmits got reordered. Eric's use case is extremely useful, and flat out eliminates ambiguity when analyzing TCP traces.
> Yes, our team (including Van Jacobson ;) ) would be sad to not have sequential IP ID (but then we don't have them for IPv6 ;) )

Your team would not be the only one sad to see that go away.

rick jones

> Since the cost of generating them is pretty small (inet->inet_id counter), we probably should keep them in linux. Their usage will phase out as IPv6 wins the Internet war...
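The trace-analysis trick David and Eric describe (equal TCP sequence numbers mark retransmits; sequential IPv4 IDs then reveal their true send order) can be sketched as follows. This is only an illustration of the idea, not code from any real analysis tool:

```c
#include <stdint.h>
#include <stdbool.h>

struct seg {
        uint32_t seq;   /* TCP sequence number */
        uint16_t ip_id; /* IPv4 identification field */
};

/* Equal seq but distinct IP IDs: two transmissions of the same data. */
bool is_retransmit(struct seg a, struct seg b)
{
        return a.seq == b.seq && a.ip_id != b.ip_id;
}

/* With a sequential per-socket inet_id counter, serial-number
 * arithmetic on the 16-bit ID tells which copy left the sender later,
 * even across counter wrap; comparing that against capture order is
 * what exposes reordered retransmits in a trace. */
bool sent_after(struct seg a, struct seg b)
{
        return (int16_t)(b.ip_id - a.ip_id) > 0;
}
```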
bridge/brctl/ip
Hi all, I'm wondering about the current userspace toolset to control bridging in the Linux kernel. As far as I can determine, functionality is a bit scattered right now between the iproute2 (ip, bridge) and bridge-utils (brctl) tools: - creating/deleting bridges: ip or brctl - adding/deleting ports to/from bridge: brctl only - showing bridge fdb: brctl (in-kernel fdb), bridge (hardware offloaded fdb) (!) ...and no doubt a few other things. Also the brctl tool seems not to be getting updates, whereas the iproute2 tools are of course updated regularly. Is brctl considered obsolete? If that is the case, would patches to add the missing functionality into the bridge tool be welcome? I'm thinking primarily of creating/deleting bridges, and adding/deleting ports in bridges. -- Bert Vermeulen b...@biot.com
[PATCH v3 -next] net/core/dev: Warn on a too-short GRO frame
From: Aaron Conole

When signaling that a GRO frame is ready to be processed, the network stack correctly checks length and aborts processing when a frame is less than 14 bytes. However, such a condition is really indicative of a broken driver, and should be loudly signaled, rather than silently dropped as is the case today.

Convert the condition to use net_warn_ratelimited() to ensure the stack loudly complains about such broken drivers.

Signed-off-by: Aaron Conole
---
v2:
* Switched from WARN_ON to net_warn_ratelimited
v3:
* Amend the string to include device name as a hint

 net/core/dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe7..273f10d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4663,6 +4663,8 @@ static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
 	if (unlikely(skb_gro_header_hard(skb, hlen))) {
 		eth = skb_gro_header_slow(skb, hlen, 0);
 		if (unlikely(!eth)) {
+			net_warn_ratelimited("%s: dropping impossible skb from %s\n",
+					     __func__, napi->dev->name);
 			napi_reuse_skb(napi, skb);
 			return NULL;
 		}
-- 
2.5.5
Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote: > +static int mlx4_bpf_set(struct net_device *dev, int fd) > +{ [...] > + if (prog->type != BPF_PROG_TYPE_PHYS_DEV) { > + bpf_prog_put(prog); > + return -EINVAL; > + } > + } Why wouldn't this check be done in the generic code that calls ndo_bpf_set()? johannes
Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote: > This patch set introduces new infrastructure for programmatically > processing packets in the earliest stages of rx, as part of an effort > others are calling Express Data Path (XDP) [1]. Start this effort by > introducing a new bpf program type for early packet filtering, before > even > an skb has been allocated. > > With this, hope to enable line rate filtering, with this initial > implementation providing drop/allow action only. Since this is handed to the driver in some way, I assume the API would also allow offloading the program to the NIC itself, and as such be useful for what Android wants to do to save power in wireless? johannes
Re: Question on rhashtable in worst-case scenario.
On Sat, 2016-04-02 at 09:46 +0800, Herbert Xu wrote: > On Fri, Apr 01, 2016 at 11:34:10PM +0200, Johannes Berg wrote: > > > > > > I was thinking about that one - it's not obvious to me from the > > code > > how this "explicitly checking for dups" would be done or let's say > > how > > rhashtable differentiates. But since it seems to work for Ben until > > hitting a certain number of identical keys, surely that's just me > > not > > understanding the code rather than anything else :) > It's really simple, rhashtable_insert_fast does not check for dups > while rhashtable_lookup_insert_* do. Oh, ok, thanks :) johannes
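Herbert's distinction can be made concrete with a toy table: one insert that never searches for the key, and one that does. This shows only the behavioural contrast, not the kernel's rhashtable API:

```c
#include <errno.h>

#define TOY_SLOTS 8

struct toy_table {
        int key[TOY_SLOTS];
        int used[TOY_SLOTS];
};

/* Like rhashtable_insert_fast(): no duplicate check, so inserting the
 * same key twice leaves two entries behind. */
int toy_insert_fast(struct toy_table *t, int key)
{
        for (int i = 0; i < TOY_SLOTS; i++) {
                if (!t->used[i]) {
                        t->key[i] = key;
                        t->used[i] = 1;
                        return 0;
                }
        }
        return -ENOMEM;
}

/* Like the rhashtable_lookup_insert_*() helpers: look the key up first
 * and refuse to insert a duplicate. */
int toy_lookup_insert(struct toy_table *t, int key)
{
        for (int i = 0; i < TOY_SLOTS; i++)
                if (t->used[i] && t->key[i] == key)
                        return -EEXIST;
        return toy_insert_fast(t, key);
}
```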
[PATCH v2] net: remove unimplemented RTNH_F_PERVASIVE
Linux 2.1.68 introduced RTNH_F_PERVASIVE, but it had no implementation and couldn't be enabled since the required config parameter wasn't in any Kconfig file (see commit d088dde7b196 ("ipv4: obsolete config in kernel source (IP_ROUTE_PERVASIVE)")).

This commit removes all remaining references to RTNH_F_PERVASIVE. Although this will cause userspace applications that were using the flag to fail to build, they will be alerted to the fact that using RTNH_F_PERVASIVE was not achieving anything.

Signed-off-by: Quentin Armitage
---
 include/uapi/linux/rtnetlink.h |    2 +-
 net/decnet/dn_fib.c            |    2 +-
 net/ipv4/fib_semantics.c       |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index ca764b5..58e6ba0 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -339,7 +339,7 @@ struct rtnexthop {
 /* rtnh_flags */
 #define RTNH_F_DEAD		1	/* Nexthop is dead (used by multipath) */
-#define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup	*/
+					/* 2 was RTNH_F_PERVASIVE (never implemented) */
 #define RTNH_F_ONLINK		4	/* Gateway is forced on link	*/
 #define RTNH_F_OFFLOAD		8	/* offloaded route */
 #define RTNH_F_LINKDOWN	16	/* carrier-down on nexthop */

diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index df48034..c53aa74 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -243,7 +243,7 @@ out:
 	} else {
 		struct net_device *dev;
 
-		if (nh->nh_flags&(RTNH_F_PERVASIVE|RTNH_F_ONLINK))
+		if (nh->nh_flags & RTNH_F_ONLINK)
 			return -EINVAL;
 
 		dev = __dev_get_by_index(&init_net, nh->nh_oif);

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d97268e..3883860 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -803,7 +803,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 	} else {
 		struct in_device *in_dev;
 
-		if (nh->nh_flags & (RTNH_F_PERVASIVE | RTNH_F_ONLINK))
+		if (nh->nh_flags & RTNH_F_ONLINK)
 			return -EINVAL;
 
 		rcu_read_lock();
-- 
1.7.7.6
Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
On Fri, Apr 1, 2016 at 9:21 PM, Brenden Blanco wrote: > This patch set introduces new infrastructure for programmatically > processing packets in the earliest stages of rx, as part of an effort > others are calling Express Data Path (XDP) [1]. Start this effort by > introducing a new bpf program type for early packet filtering, before even > an skb has been allocated. > > With this, hope to enable line rate filtering, with this initial > implementation providing drop/allow action only. > > Patch 1 introduces the new prog type and helpers for validating the bpf > program. A new userspace struct is defined containing only len as a field, > with others to follow in the future. > In patch 2, create a new ndo to pass the fd to support drivers. > In patch 3, expose a new rtnl option to userspace. > In patch 4, enable support in mlx4 driver. No skb allocation is required, > instead a static percpu skb is kept in the driver and minimally initialized > for each driver frag. > In patch 5, create a sample drop and count program. With single core, > achieved ~14.5 Mpps drop rate on a 40G mlx4. This includes packet data > access, bpf array lookup, and increment. > Very nice! Do you think this hook will be sufficient to implement a fast forward patch also? Tom > Interestingly, accessing packet data from the program did not have a > noticeable impact on performance. Even so, future enhancements to > prefetching / batching / page-allocs should hopefully improve the > performance in this path. 
> > [1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf > > Brenden Blanco (5): > bpf: add PHYS_DEV prog type for early driver filter > net: add ndo to set bpf prog in adapter rx > rtnl: add option for setting link bpf prog > mlx4: add support for fast rx drop bpf program > Add sample for adding simple drop program to link > > drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 61 ++ > drivers/net/ethernet/mellanox/mlx4/en_rx.c | 18 +++ > drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 2 + > include/linux/netdevice.h | 8 ++ > include/uapi/linux/bpf.h | 5 + > include/uapi/linux/if_link.h | 1 + > kernel/bpf/verifier.c | 1 + > net/core/dev.c | 12 ++ > net/core/filter.c | 68 +++ > net/core/rtnetlink.c | 10 ++ > samples/bpf/Makefile | 4 + > samples/bpf/bpf_load.c | 8 ++ > samples/bpf/netdrvx1_kern.c| 26 + > samples/bpf/netdrvx1_user.c| 155 > + > 14 files changed, 379 insertions(+) > create mode 100644 samples/bpf/netdrvx1_kern.c > create mode 100644 samples/bpf/netdrvx1_user.c > > -- > 2.8.0 >
Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
On Fri, Apr 1, 2016 at 9:21 PM, Brenden Blanco wrote: > Add a new bpf prog type that is intended to run in early stages of the > packet rx path. Only minimal packet metadata will be available, hence a new > context type, struct xdp_metadata, is exposed to userspace. So far only > expose the readable packet length, and only in read mode. > This would eventually be a generic abstraction of receive descriptors? > The PHYS_DEV name is chosen to represent that the program is meant only > for physical adapters, rather than all netdevs. > Is there a hard restriction that this could only work with physical devices? > While the user visible struct is new, the underlying context must be > implemented as a minimal skb in order for the packet load_* instructions > to work. The skb filled in by the driver must have skb->len, skb->head, > and skb->data set, and skb->data_len == 0. > > Signed-off-by: Brenden Blanco > --- > include/uapi/linux/bpf.h | 5 > kernel/bpf/verifier.c| 1 + > net/core/filter.c| 68 > > 3 files changed, 74 insertions(+) > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 924f537..b8a4ef2 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -92,6 +92,7 @@ enum bpf_prog_type { > BPF_PROG_TYPE_KPROBE, > BPF_PROG_TYPE_SCHED_CLS, > BPF_PROG_TYPE_SCHED_ACT, > + BPF_PROG_TYPE_PHYS_DEV, > }; > > #define BPF_PSEUDO_MAP_FD 1 > @@ -367,6 +368,10 @@ struct __sk_buff { > __u32 tc_classid; > }; > > +struct xdp_metadata { > + __u32 len; > +}; > + > struct bpf_tunnel_key { > __u32 tunnel_id; > union { > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 2e08f8e..804ca70 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -1340,6 +1340,7 @@ static bool may_access_skb(enum bpf_prog_type type) > case BPF_PROG_TYPE_SOCKET_FILTER: > case BPF_PROG_TYPE_SCHED_CLS: > case BPF_PROG_TYPE_SCHED_ACT: > + case BPF_PROG_TYPE_PHYS_DEV: > return true; > default: > return false; > diff --git 
a/net/core/filter.c b/net/core/filter.c > index b7177d0..c417db6 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -2018,6 +2018,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id) > } > } > > +static const struct bpf_func_proto * > +phys_dev_func_proto(enum bpf_func_id func_id) > +{ > + return sk_filter_func_proto(func_id); > +} > + > static bool __is_valid_access(int off, int size, enum bpf_access_type type) > { > /* check bounds */ > @@ -2073,6 +2079,36 @@ static bool tc_cls_act_is_valid_access(int off, int > size, > return __is_valid_access(off, size, type); > } > > +static bool __is_valid_xdp_access(int off, int size, > + enum bpf_access_type type) > +{ > + if (off < 0 || off >= sizeof(struct xdp_metadata)) > + return false; > + > + if (off % size != 0) > + return false; > + > + if (size != 4) > + return false; > + > + return true; > +} > + > +static bool phys_dev_is_valid_access(int off, int size, > +enum bpf_access_type type) > +{ > + if (type == BPF_WRITE) > + return false; > + > + switch (off) { > + case offsetof(struct xdp_metadata, len): > + break; > + default: > + return false; > + } > + return __is_valid_xdp_access(off, size, type); > +} > + > static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg, > int src_reg, int ctx_off, > struct bpf_insn *insn_buf, > @@ -2210,6 +2246,26 @@ static u32 bpf_net_convert_ctx_access(enum > bpf_access_type type, int dst_reg, > return insn - insn_buf; > } > > +static u32 bpf_phys_dev_convert_ctx_access(enum bpf_access_type type, > + int dst_reg, int src_reg, > + int ctx_off, > + struct bpf_insn *insn_buf, > + struct bpf_prog *prog) > +{ > + struct bpf_insn *insn = insn_buf; > + > + switch (ctx_off) { > + case offsetof(struct xdp_metadata, len): > + BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4); > + > + *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg, > + offsetof(struct sk_buff, len)); > + break; > + } > + > + return insn - insn_buf; > +} > + > static const struct 
bpf_verifier_ops sk_filter_ops = { > .get_func_proto = sk_filter_func_proto, > .is_valid_access = sk_filter_is_valid_access, > @@ -,6 +2278,12 @@ static const struct bpf_verifier_ops tc_c
Re: [PATCH v2 net-next 01/11] net: add SOCK_RCU_FREE socket flag
On Fri, Apr 1, 2016 at 11:52 AM, Eric Dumazet wrote: > We want a generic way to insert an RCU grace period before socket > freeing for cases where RCU_SLAB_DESTROY_BY_RCU is adding too > much overhead. > > SLAB_DESTROY_BY_RCU strict rules force us to take a reference > on the socket sk_refcnt, and it is a performance problem for UDP > encapsulation, or TCP synflood behavior, as many CPUs might > attempt the atomic operations on a shared sk_refcnt > > UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their > lookup can use traditional RCU rules, without refcount changes. > They can set the flag only once hashed and visible by other cpus. > > Signed-off-by: Eric Dumazet > Cc: Tom Herbert > --- > include/net/sock.h | 2 ++ > net/core/sock.c| 14 +- > 2 files changed, 15 insertions(+), 1 deletion(-) > > diff --git a/include/net/sock.h b/include/net/sock.h > index 255d3e03727b..c88785a3e76c 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -438,6 +438,7 @@ struct sock { > struct sk_buff *skb); > void(*sk_destruct)(struct sock *sk); > struct sock_reuseport __rcu *sk_reuseport_cb; > + struct rcu_head sk_rcu; > }; > > #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data))) > @@ -720,6 +721,7 @@ enum sock_flags { > */ > SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */ > SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */ > + SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */ > }; > > #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << > SOCK_TIMESTAMPING_RX_SOFTWARE)) > diff --git a/net/core/sock.c b/net/core/sock.c > index b67b9aedb230..238a94f879ca 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -1418,8 +1418,12 @@ struct sock *sk_alloc(struct net *net, int family, > gfp_t priority, > } > EXPORT_SYMBOL(sk_alloc); > > -void sk_destruct(struct sock *sk) > +/* Sockets having SOCK_RCU_FREE will call this function after one RCU > + * grace period. 
This is the case for UDP sockets and TCP listeners. > + */ > +static void __sk_destruct(struct rcu_head *head) > { > + struct sock *sk = container_of(head, struct sock, sk_rcu); > struct sk_filter *filter; > > if (sk->sk_destruct) > @@ -1448,6 +1452,14 @@ void sk_destruct(struct sock *sk) > sk_prot_free(sk->sk_prot_creator, sk); > } > > +void sk_destruct(struct sock *sk) > +{ > + if (sock_flag(sk, SOCK_RCU_FREE)) > + call_rcu(&sk->sk_rcu, __sk_destruct); > + else > + __sk_destruct(&sk->sk_rcu); > +} > + > static void __sk_free(struct sock *sk) > { > if (unlikely(sock_diag_has_destroy_listeners(sk) && > sk->sk_net_refcnt)) > -- > 2.8.0.rc3.226.g39d4020 > Tested-by: Tom Herbert
Re: [PATCH v2 net-next 02/11] udp: no longer use SLAB_DESTROY_BY_RCU
On Fri, Apr 1, 2016 at 11:52 AM, Eric Dumazet wrote: > Tom Herbert would like not touching UDP socket refcnt for encapsulated > traffic. For this to happen, we need to use normal RCU rules, with a grace > period before freeing a socket. UDP sockets are not short lived in the > high usage case, so the added cost of call_rcu() should not be a concern. > > This actually removes a lot of complexity in UDP stack. > > Multicast receives no longer need to hold a bucket spinlock. > > Note that ip early demux still needs to take a reference on the socket. > > Same remark for functions used by xt_socket and xt_PROXY netfilter modules, > but this might be changed later. > > Performance for a single UDP socket receiving flood traffic from > many RX queues/cpus. > > Simple udp_rx using simple recvfrom() loop : > 438 kpps instead of 374 kpps : 17 % increase of the peak rate. > > v2: Addressed Willem de Bruijn feedback in multicast handling > - keep early demux break in __udp4_lib_demux_lookup() > Works fine with UDP encapsulation also. 
Tested-by: Tom Herbert > Signed-off-by: Eric Dumazet > Cc: Tom Herbert > Cc: Willem de Bruijn > --- > include/linux/udp.h | 8 +- > include/net/sock.h | 12 +-- > include/net/udp.h | 2 +- > net/ipv4/udp.c | 293 > > net/ipv4/udp_diag.c | 18 ++-- > net/ipv6/udp.c | 196 --- > 6 files changed, 171 insertions(+), 358 deletions(-) > > diff --git a/include/linux/udp.h b/include/linux/udp.h > index 87c094961bd5..32342754643a 100644 > --- a/include/linux/udp.h > +++ b/include/linux/udp.h > @@ -98,11 +98,11 @@ static inline bool udp_get_no_check6_rx(struct sock *sk) > return udp_sk(sk)->no_check6_rx; > } > > -#define udp_portaddr_for_each_entry(__sk, node, list) \ > - hlist_nulls_for_each_entry(__sk, node, list, > __sk_common.skc_portaddr_node) > +#define udp_portaddr_for_each_entry(__sk, list) \ > + hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node) > > -#define udp_portaddr_for_each_entry_rcu(__sk, node, list) \ > - hlist_nulls_for_each_entry_rcu(__sk, node, list, > __sk_common.skc_portaddr_node) > +#define udp_portaddr_for_each_entry_rcu(__sk, list) \ > + hlist_for_each_entry_rcu(__sk, list, __sk_common.skc_portaddr_node) > > #define IS_UDPLITE(__sk) (udp_sk(__sk)->pcflag) > > diff --git a/include/net/sock.h b/include/net/sock.h > index c88785a3e76c..c3a707d1cee8 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -178,7 +178,7 @@ struct sock_common { > int skc_bound_dev_if; > union { > struct hlist_node skc_bind_node; > - struct hlist_nulls_node skc_portaddr_node; > + struct hlist_node skc_portaddr_node; > }; > struct proto*skc_prot; > possible_net_t skc_net; > @@ -670,18 +670,18 @@ static inline void sk_add_bind_node(struct sock *sk, > hlist_for_each_entry(__sk, list, sk_bind_node) > > /** > - * sk_nulls_for_each_entry_offset - iterate over a list at a given struct > offset > + * sk_for_each_entry_offset_rcu - iterate over a list at a given struct > offset > * @tpos: the type * to use as a loop cursor. 
>   * @pos: the &struct hlist_node to use as a loop cursor.
>   * @head: the head for your list.
>   * @offset: offset of hlist_node within the struct.
>   *
>   */
> -#define sk_nulls_for_each_entry_offset(tpos, pos, head, offset)	\
> -	for (pos = (head)->first;					\
> -	     (!is_a_nulls(pos)) &&					\
> +#define sk_for_each_entry_offset_rcu(tpos, pos, head, offset)		\
> +	for (pos = rcu_dereference((head)->first);			\
> +	     pos != NULL &&						\
>  	     ({ tpos = (typeof(*tpos) *)((void *)pos - offset); 1;});	\
> -	     pos = pos->next)
> +	     pos = rcu_dereference(pos->next))
>
>  static inline struct user_namespace *sk_user_ns(struct sock *sk)
>  {
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 92927f729ac8..d870ec1611c4 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -59,7 +59,7 @@ struct udp_skb_cb {
>   * @lock: spinlock protecting changes to head/count
>   */
>  struct udp_hslot {
> -	struct hlist_nulls_head	head;
> +	struct hlist_head	head;
>  	int			count;
>  	spinlock_t		lock;
>  } __attribute__((aligned(2 * sizeof(long))));
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 08eed5e16df0..0475aaf95040 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -143,10 +143,9 @@ static int udp_lib_lport_inuse(struct net *net, __u16 num,
>  				unsigned i
Re: [PATCH] net: remove unimplemented RTNH_F_PERVASIVE
Hello.

On 4/2/2016 11:43 AM, Quentin Armitage wrote:
> Linux 2.1.68 introduced RTNH_F_PERVASIVE, but it had no implementation
> and couldn't be enabled since the required config parameter wasn't in
> any Kconfig file (see commit d088dde7b).

scripts/checkpatch.pl now enforces a certain commit citing format, and yours doesn't match it, i.e. you need a 12-digit SHA1 and ("<summary line>").

> This commit removes all remaining references to RTNH_F_PERVASIVE.
> Although this will cause userspace applications that were using the
> flag to fail to build, they will be alerted to the fact that using
> RTNH_F_PERVASIVE was not achieving anything.
>
> Signed-off-by: Quentin Armitage
[...]
> diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
> index df48034..f5660c6 100644
> --- a/net/decnet/dn_fib.c
> +++ b/net/decnet/dn_fib.c
> @@ -243,7 +243,7 @@ out:
>  	} else {
>  		struct net_device *dev;
>
> -		if (nh->nh_flags&(RTNH_F_PERVASIVE|RTNH_F_ONLINK))
> +		if (nh->nh_flags&RTNH_F_ONLINK)

Please surround & with spaces, like below.

> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index d97268e..3883860 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -803,7 +803,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
>  	} else {
>  		struct in_device *in_dev;
>
> -		if (nh->nh_flags & (RTNH_F_PERVASIVE | RTNH_F_ONLINK))
> +		if (nh->nh_flags & RTNH_F_ONLINK)
>  			return -EINVAL;
>
>  		rcu_read_lock();

MBR, Sergei
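Sergei's remark refers to the citation style checkpatch.pl expects, `commit <12+ chars of SHA1> ("<summary line>")`. Assuming a git checkout of the tree is at hand, one way to generate it for the commit the patch cites is:

```shell
# Print a checkpatch-friendly citation: a 12-character abbreviated
# SHA1 followed by the quoted subject line of the commit.
git log -1 --abbrev=12 --format='commit %h ("%s")' d088dde7b
```

`%h` honours `--abbrev`, so the output is exactly the `commit 123456789abc ("subject")` form checkpatch asks for.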
net: memory leak due to CLONE_NEWNET
Hello,

The following program leads to memory leaks in:

unreferenced object 0x88005c10d208 (size 96):
  comm "a.out", pid 10753, jiffies 4296778619 (age 43.118s)
  hex dump (first 32 bytes):
    e8 31 85 2d 00 88 ff ff 0f 00 00 00 00 00 00 00  .1.-
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
  backtrace:
    [] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:915
    [< inline >] kmemleak_alloc_recursive include/linux/kmemleak.h:47
    [< inline >] slab_post_alloc_hook mm/slab.h:406
    [< inline >] slab_alloc_node mm/slub.c:2602
    [< inline >] slab_alloc mm/slub.c:2610
    [] kmem_cache_alloc_trace+0x160/0x3d0 mm/slub.c:2627
    [< inline >] kmalloc include/linux/slab.h:478
    [< inline >] tc_action_net_init include/net/act_api.h:122
    [] csum_init_net+0x15e/0x450 net/sched/act_csum.c:593
    [] ops_init+0xa9/0x3a0 net/core/net_namespace.c:109
    [] setup_net+0x1b4/0x3e0 net/core/net_namespace.c:287
    [] copy_net_ns+0xd6/0x1a0 net/core/net_namespace.c:367
    [] create_new_namespaces+0x37f/0x740 kernel/nsproxy.c:106
    [] unshare_nsproxy_namespaces+0xa9/0x1d0 kernel/nsproxy.c:205
    [< inline >] SYSC_unshare kernel/fork.c:2019
    [] SyS_unshare+0x3b3/0x800 kernel/fork.c:1969
    [] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
    [] 0x

unreferenced object 0x88005c10e1c8 (size 96):
  comm "a.out", pid 10753, jiffies 4296778620 (age 43.117s)
  hex dump (first 32 bytes):
    e8 0b 85 2d 00 88 ff ff 0f 00 00 00 00 00 ad de  ...-
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
  backtrace:
    [] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:915
    [< inline >] kmemleak_alloc_recursive include/linux/kmemleak.h:47
    [< inline >] slab_post_alloc_hook mm/slab.h:406
    [< inline >] slab_alloc_node mm/slub.c:2602
    [< inline >] slab_alloc mm/slub.c:2610
    [] kmem_cache_alloc_trace+0x160/0x3d0 mm/slub.c:2627
    [< inline >] kmalloc include/linux/slab.h:478
    [< inline >] tc_action_net_init include/net/act_api.h:122
    [] ife_init_net+0x15e/0x450 net/sched/act_ife.c:838
    [] ops_init+0xa9/0x3a0 net/core/net_namespace.c:109
    [] setup_net+0x1b4/0x3e0 net/core/net_namespace.c:287
    [] copy_net_ns+0xd6/0x1a0 net/core/net_namespace.c:367
    [] create_new_namespaces+0x37f/0x740 kernel/nsproxy.c:106
    [] unshare_nsproxy_namespaces+0xa9/0x1d0 kernel/nsproxy.c:205
    [< inline >] SYSC_unshare kernel/fork.c:2019
    [] SyS_unshare+0x3b3/0x800 kernel/fork.c:1969
    [] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
    [] 0x

unreferenced object 0x880025a55b08 (size 96):
  comm "a.out", pid 10702, jiffies 4296768144 (age 61.526s)
  hex dump (first 32 bytes):
    28 ed 55 2b 00 88 ff ff 0f 00 00 00 00 00 00 00  (.U+
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
  backtrace:
    [] kmemleak_alloc+0x63/0xa0 mm/kmemleak.c:915
    [< inline >] kmemleak_alloc_recursive include/linux/kmemleak.h:47
    [< inline >] slab_post_alloc_hook mm/slab.h:406
    [< inline >] slab_alloc_node mm/slub.c:2602
    [< inline >] slab_alloc mm/slub.c:2610
    [] kmem_cache_alloc_trace+0x160/0x3d0 mm/slub.c:2627
    [< inline >] kmalloc include/linux/slab.h:478
    [< inline >] tc_action_net_init include/net/act_api.h:122
    [] nat_init_net+0x15e/0x450 net/sched/act_nat.c:311
    [] ops_init+0xa9/0x3a0 net/core/net_namespace.c:109
    [] setup_net+0x1b4/0x3e0 net/core/net_namespace.c:287
    [] copy_net_ns+0xd6/0x1a0 net/core/net_namespace.c:367
    [] create_new_namespaces+0x37f/0x740 kernel/nsproxy.c:106
    [] unshare_nsproxy_namespaces+0xa9/0x1d0 kernel/nsproxy.c:205
    [< inline >] SYSC_unshare kernel/fork.c:2019
    [] SyS_unshare+0x3b3/0x800 kernel/fork.c:1969
    [] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207
    [] 0x

The reproducer (header names lost in the archive were filled back in with the headers the program needs):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main()
{
	int pid, status;

	pid = fork();
	if (pid == 0) {
		unshare(CLONE_NEWNET);
		exit(0);
	}
	while (waitpid(pid, &status, 0) != pid) {
	}
	return 0;
}

grep "kmalloc-96" /proc/slabinfo confirms the leak.

I am on commit 05cf8077e54b20dddb756eaa26f3aeb5c38dd3cf (Apr 1).
[PATCH] net: remove unimplemented RTNH_F_PERVASIVE
Linux 2.1.68 introduced RTNH_F_PERVASIVE, but it had no implementation
and couldn't be enabled since the required config parameter wasn't in
any Kconfig file (see commit d088dde7b).

This commit removes all remaining references to RTNH_F_PERVASIVE.
Although this will cause userspace applications that were using the
flag to fail to build, they will be alerted to the fact that using
RTNH_F_PERVASIVE was not achieving anything.

Signed-off-by: Quentin Armitage
---
 include/uapi/linux/rtnetlink.h |    2 +-
 net/decnet/dn_fib.c            |    2 +-
 net/ipv4/fib_semantics.c       |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index ca764b5..58e6ba0 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -339,7 +339,7 @@ struct rtnexthop {
 /* rtnh_flags */
 #define RTNH_F_DEAD		1	/* Nexthop is dead (used by multipath) */
-#define RTNH_F_PERVASIVE	2	/* Do recursive gateway lookup */
+					/* 2 was RTNH_F_PERVASIVE (never implemented) */
 #define RTNH_F_ONLINK		4	/* Gateway is forced on link */
 #define RTNH_F_OFFLOAD		8	/* offloaded route */
 #define RTNH_F_LINKDOWN		16	/* carrier-down on nexthop */
diff --git a/net/decnet/dn_fib.c b/net/decnet/dn_fib.c
index df48034..f5660c6 100644
--- a/net/decnet/dn_fib.c
+++ b/net/decnet/dn_fib.c
@@ -243,7 +243,7 @@ out:
 	} else {
 		struct net_device *dev;
-		if (nh->nh_flags&(RTNH_F_PERVASIVE|RTNH_F_ONLINK))
+		if (nh->nh_flags&RTNH_F_ONLINK)
 			return -EINVAL;
 		dev = __dev_get_by_index(&init_net, nh->nh_oif);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d97268e..3883860 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -803,7 +803,7 @@ static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
 	} else {
 		struct in_device *in_dev;
-		if (nh->nh_flags & (RTNH_F_PERVASIVE | RTNH_F_ONLINK))
+		if (nh->nh_flags & RTNH_F_ONLINK)
 			return -EINVAL;
 		rcu_read_lock();
--
1.7.7.6
Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
First of all, I'm very happy to see people start working on this! Thank you, Brenden!

On Fri, 1 Apr 2016 18:21:57 -0700 Brenden Blanco wrote:

> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver. Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
>
> Since mlx4 is so far the only user of this pseudo skb, the build
> function is defined locally.
>
> Signed-off-by: Brenden Blanco
> ---
[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 86bcfe5..03fe005 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
[...]
> @@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	if (budget <= 0)
>  		return polled;
>
> +	prog = READ_ONCE(priv->prog);
> +
>  	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>  	 * descriptor offset can be deduced from the CQE index instead of
>  	 * reading 'cqe->index' */
> @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>  			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +

I think you need to DMA sync RX-page before you can safely access
packet data in page (on all arch's).

> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {

AFAIK length here covers all the frags[n].page, thus potentially
causing the BPF program to access memory out of bound (crash).

Having several page fragments is AFAIK an optimization for jumbo-frames
on PowerPC (which is a bit annoying for your use-case ;-)).

> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer