Re: [PATCH bpf-next v3 1/2] bpf: Fix bpf_get/setsockopt to tos not take effect when TCP over IPv4 via INET6 API

2024-09-14 Thread Eric Dumazet
On Sat, Sep 14, 2024 at 12:32 PM Feng zhou  wrote:
>
> From: Feng Zhou 
>
> When TCP runs over IPv4 via the INET6 API, bpf_get/setsockopt with IPv4
> options will fail because sk->sk_family is AF_INET6. With IPv6 options
> the calls succeed but have no effect, because inet_csk(sk)->icsk_af_ops
> is ipv6_mapped, which uses ip_queue_xmit and inet_sk(sk)->tos.
>
> Make bpf_get/setsockopt use the sk_is_inet() helper to fix this case.
>
> Signed-off-by: Feng Zhou 

Reviewed-by: Eric Dumazet 
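
For illustration, a minimal sketch of the distinction the fix relies on (not the exact diff; sk_is_inet() is assumed to be the helper from include/net/sock.h that accepts both address families):

    /* A TCP-over-IPv4 socket created through the INET6 API reports
     * sk_family == AF_INET6 yet transmits through ipv6_mapped /
     * ip_queue_xmit, so gating IPv4 options on sk_family alone
     * rejects exactly the sockets that need them. */
    if (!sk_is_inet(sk))    /* true for both AF_INET and AF_INET6 */
        return -EOPNOTSUPP;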



Re: [RFC PATCH 2/3] ipv6: Run a reverse sk_lookup on sendmsg.

2024-09-14 Thread Eric Dumazet
On Fri, Sep 13, 2024 at 11:39 AM Tiago Lam  wrote:
>
> This follows the same rationale provided for the ipv4 counterpart, where
> it now runs a reverse socket lookup when source addresses and/or ports
> are changed, on sendmsg, to check whether egress traffic should be
> allowed to go through or not.
>
> As with ipv4, the ipv6 sendmsg path is also extended here to support the
> IPV6_ORIGDSTADDR ancillary message, to be able to specify a source
> address/port.
>
> Suggested-by: Jakub Sitnicki 
> Signed-off-by: Tiago Lam 
> ---
>  net/ipv6/datagram.c | 76 +
>  net/ipv6/udp.c  |  8 --
>  2 files changed, 82 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
> index fff78496803d..4214dda1c320 100644
> --- a/net/ipv6/datagram.c
> +++ b/net/ipv6/datagram.c
> @@ -756,6 +756,27 @@ void ip6_datagram_recv_ctl(struct sock *sk, struct msghdr *msg,
>  }
>  EXPORT_SYMBOL_GPL(ip6_datagram_recv_ctl);
>
> +static inline bool reverse_sk_lookup(struct flowi6 *fl6, struct sock *sk,
> +                                    struct in6_addr *saddr, __be16 sport)
> +{
> +       if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
> +           (saddr && sport) &&
> +           (ipv6_addr_cmp(&sk->sk_v6_rcv_saddr, saddr) || inet_sk(sk)->inet_sport != sport)) {
> +               struct sock *sk_egress;
> +
> +               bpf_sk_lookup_run_v6(sock_net(sk), IPPROTO_UDP, &fl6->daddr, fl6->fl6_dport,
> +                                    saddr, ntohs(sport), 0, &sk_egress);
> +               if (!IS_ERR_OR_NULL(sk_egress) &&
> +                   atomic64_read(&sk_egress->sk_cookie) == atomic64_read(&sk->sk_cookie))

I do not understand this.

1) sk_cookie is not always initialized. It is done on demand, when/if
__sock_gen_cookie() was called.

2) if sk1 and sk2 share the same sk_cookie, then sk1 == sk2 ???

So why not simply test sk_egress == sk ?
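
A minimal sketch of the suggested simplification, assuming sk_egress is the socket returned by the lookup:

    /* sk_cookie is assigned lazily by __sock_gen_cookie(); two sockets
     * that never had a cookie generated both read 0 and would compare
     * equal.  Pointer identity avoids both pitfalls. */
    if (!IS_ERR_OR_NULL(sk_egress) && sk_egress == sk)
        return true;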

> +   return true;
> +
> +               net_info_ratelimited("No reverse socket lookup match for local addr %pI6:%d remote addr %pI6:%d\n",
> +                                    &saddr, ntohs(sport), &fl6->daddr, ntohs(fl6->fl6_dport));
> +   }
> +
> +   return false;
> +}
> +
>  int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
>   struct msghdr *msg, struct flowi6 *fl6,
>   struct ipcm6_cookie *ipc6)
> @@ -844,7 +865,62 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
>
> break;
> }
> +           case IPV6_ORIGDSTADDR:
> +           {
> +                   struct sockaddr_in6 *sockaddr_in;
> +                   struct net_device *dev = NULL;
> +
> +                   if (cmsg->cmsg_len < CMSG_LEN(sizeof(struct sockaddr_in6))) {
> +                           err = -EINVAL;
> +                           goto exit_f;
> +                   }
> +
> +                   sockaddr_in = (struct sockaddr_in6 *)CMSG_DATA(cmsg);
> +
> +                   addr_type = __ipv6_addr_type(&sockaddr_in->sin6_addr);
> +
> +                   if (addr_type & IPV6_ADDR_LINKLOCAL)
> +                           return -EINVAL;
> +
> +                   /* If we're egressing with a different source address and/or port,
> +                    * we perform a reverse socket lookup.  The rationale behind this
> +                    * is that we can allow return UDP traffic that has ingressed
> +                    * through sk_lookup to also egress correctly. In case the reverse
> +                    * lookup fails, we continue with the normal path.
> +                    *
> +                    * The lookup is performed if either source address and/or port
> +                    * changed, and neither is "0".
> +                    */
> +                   if (reverse_sk_lookup(fl6, sk, &sockaddr_in->sin6_addr,
> +                                         sockaddr_in->sin6_port)) {
> +                           /* Override the source port and address to use with the
> +                            * one we got in cmsg and bail early.
> +                            */
> +                           fl6->saddr = sockaddr_in->sin6_addr;
> +                           fl6->fl6_sport = sockaddr_in->sin6_port;
> +                           break;
> +                   }
>
> +                   if (addr_type != IPV6_ADDR_ANY) {
> +                           int strict = __ipv6_addr_src_scope(addr_type) <= IPV6_ADDR_SCOPE_LINKLOCAL;
> +
> +                           if (!ipv6_can_nonlocal_bind(net, inet_sk(sk)) &&
> +                               !ipv6_chk_addr_and_flags(net, &sockaddr_in->sin6_addr,
> +

Re: [PATCH net] selftests: net: csum: Fix checksums for packets with non-zero padding

2024-09-09 Thread Eric Dumazet
On Mon, Sep 9, 2024 at 5:02 PM Sean Anderson  wrote:
>
> On 9/6/24 22:05, Willem de Bruijn wrote:
> > Sean Anderson wrote:
> >> Padding is not included in UDP and TCP checksums. Therefore, reduce the
> >> length of the checksummed data to include only the data in the IP
> >> payload. This fixes spurious reported checksum failures like
> >>
> >> rx: pkt: sport=33000 len=26 csum=0xc850 verify=0xf9fe
> >> pkt: bad csum
> >
> > Are you using this test as receiver for other input?
> >
> > The packet builder in the test doesn't generate these, does it?
>
> It's added by the MAC before transmission.
>
> This is permitted by the standard, but in this case it actually appears
> to be due to the MAC using 32-bit reads for the data and not masking off
> the end. Not sure whether this is a bug in the driver/device, since
> technically we may leak up to 3 bytes of memory.

This seems to be a bug in the driver.

A call to skb_put_padto(skb, ETH_ZLEN) should be added.
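
A sketch of the usual placement in a driver transmit path, with the driver name purely illustrative:

    static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
        /* Zero-pad short frames to the 60-byte minimum so the MAC never
         * transmits stale memory; on failure the skb is already freed. */
        if (skb_put_padto(skb, ETH_ZLEN))
            return NETDEV_TX_OK;

        /* ... hand the now-padded frame to the hardware ... */
        return NETDEV_TX_OK;
    }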

>
> That said, it would be a nice enhancement to generate packets with
> non-zero padding as well, since they are an interesting edge case.
>
> > Just trying to understand if this is a bug fix or a new use for
> > csum.c as receiver.
>
> Bug fix.
>
> >> Technically it is possible for there to be trailing bytes after the UDP
> >> data but before the Ethernet padding (e.g. if sizeof(ip) + sizeof(udp) +
> >> udp.len < ip.len). However, we don't generate such packets.
> >
> > More likely is that L3 and L4 length fields agree, as both are
> > generated at the sender, but that some trailer is attached in the
> > network. Such as a timestamp trailer.
>
> Yes, as noted above we don't generate packets with differing L3 and L4
> lengths.
>
> >> Fixes: 91a7de85600d ("selftests/net: add csum offload test")
> >> Signed-off-by: Sean Anderson 
> >> ---
> >> Found while testing for this very bug in hardware checksum offloads.
> >>
> >>  tools/testing/selftests/net/lib/csum.c | 16 ++--
> >>  1 file changed, 14 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/tools/testing/selftests/net/lib/csum.c b/tools/testing/selftests/net/lib/csum.c
> >> index b9f3fc3c3426..e0a34e5e8dd5 100644
> >> --- a/tools/testing/selftests/net/lib/csum.c
> >> +++ b/tools/testing/selftests/net/lib/csum.c
> >> @@ -654,10 +654,16 @@ static int recv_verify_packet_ipv4(void *nh, int len)
> >>  {
> >>  struct iphdr *iph = nh;
> >>  uint16_t proto = cfg_encap ? IPPROTO_UDP : cfg_proto;
> >> +uint16_t ip_len;
> >>
> >>  if (len < sizeof(*iph) || iph->protocol != proto)
> >>  return -1;
> >>
> >> +ip_len = ntohs(iph->tot_len);
> >> +if (ip_len > len || ip_len < sizeof(*iph))
> >> +return -1;
> >> +
> >> +len = ip_len;
> >>  iph_addr_p = &iph->saddr;
> >>  if (proto == IPPROTO_TCP)
> >>  return recv_verify_packet_tcp(iph + 1, len - sizeof(*iph));
> >> @@ -669,16 +675,22 @@ static int recv_verify_packet_ipv6(void *nh, int len)
> >>  {
> >>  struct ipv6hdr *ip6h = nh;
> >>  uint16_t proto = cfg_encap ? IPPROTO_UDP : cfg_proto;
> >> +uint16_t ip_len;
> >
> > nit: payload_len, as it never includes sizeof ipv6hdr
>
> OK
>
> --Sean
>
> >>  if (len < sizeof(*ip6h) || ip6h->nexthdr != proto)
> >>  return -1;
> >>
> >> +ip_len = ntohs(ip6h->payload_len);
> >> +if (ip_len > len - sizeof(*ip6h))
> >> +return -1;
> >> +
> >> +len = ip_len;
> >>  iph_addr_p = &ip6h->saddr;
> >>
> >>  if (proto == IPPROTO_TCP)
> >> -return recv_verify_packet_tcp(ip6h + 1, len - sizeof(*ip6h));
> >> +return recv_verify_packet_tcp(ip6h + 1, len);
> >>  else
> >> -return recv_verify_packet_udp(ip6h + 1, len - sizeof(*ip6h));
> >> +return recv_verify_packet_udp(ip6h + 1, len);
> >>  }
> >>
> >>  /* return whether auxdata includes TP_STATUS_CSUM_VALID */
> >> --
> >> 2.35.1.1320.gc452695387.dirty
> >>
> >
> >



Re: [PATCH net] selftests: net: enable bind tests

2024-09-04 Thread Eric Dumazet
On Wed, Sep 4, 2024 at 8:12 AM Jamie Bainbridge
 wrote:
>
> bind_wildcard is compiled but not run, bind_timewait is not compiled.
>
> These two tests complete in a very short time, use the test harness
> properly, and seem reasonable to enable.
>
> The author of the tests confirmed via email that these were
> intended to be run.
>
> Enable these two tests.
>
> Fixes: 13715acf8ab5 ("selftest: Add test for bind() conflicts.")
> Fixes: 2c042e8e54ef ("tcp: Add selftest for bind() and TIME_WAIT.")
> Signed-off-by: Jamie Bainbridge 

Reviewed-by: Eric Dumazet 



Re: [PATCH] net: missing check

2024-06-06 Thread Eric Dumazet
On Thu, Jun 6, 2024 at 4:14 PM Denis Arefev  wrote:
>
> Two missing checks in virtio_net_hdr_to_skb() allowed syzbot
> to crash kernels again.
>
> 1. After the skb_segment function the buffer may become non-linear
> (nr_frags != 0), but since the SKBTX_SHARED_FRAG flag is not set anywhere,
> the __skb_linearize function will not be executed, so the buffer
> remains non-linear. The condition (offset >= skb_headlen(skb)) then
> becomes true, which triggers the WARN_ON_ONCE in skb_checksum_help.
>
> 2. The struct sk_buff and struct virtio_net_hdr members must be
> mathematically related.
> (gso_size) must be greater than (needed) otherwise WARN_ON_ONCE.
> (remainder) must be greater than (needed) otherwise WARN_ON_ONCE.
> (remainder) may be 0 if division is without remainder.
>
> offset+2 (4191) > skb_headlen() (1116)
> WARNING: CPU: 1 PID: 5084 at net/core/dev.c:3303 skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303
> Modules linked in:
> CPU: 1 PID: 5084 Comm: syz-executor336 Not tainted 6.7.0-rc3-syzkaller-00014-gdf60cee26a2e #0
> Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 11/10/2023
> RIP: 0010:skb_checksum_help+0x5e2/0x740 net/core/dev.c:3303
> Code: 89 e8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 52 01 00 00 44 89 e2 2b 53 74 4c 89 ee 48 c7 c7 40 57 e9 8b e8 af 8f dd f8 90 <0f> 0b 90 90 e9 87 fe ff ff e8 40 0f 6e f9 e9 4b fa ff ff 48 89 ef
> RSP: 0018:c90003a9f338 EFLAGS: 00010286
> RAX:  RBX: 888025125780 RCX: 814db209
> RDX: 888015393b80 RSI: 814db216 RDI: 0001
> RBP: 8880251257f4 R08: 0001 R09: 
> R10:  R11: 0001 R12: 045c
> R13: 105f R14: 8880251257f0 R15: 105d
> FS:  55c24380() GS:8880b990() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 2000f000 CR3: 23151000 CR4: 003506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  
>  ip_do_fragment+0xa1b/0x18b0 net/ipv4/ip_output.c:777
>  ip_fragment.constprop.0+0x161/0x230 net/ipv4/ip_output.c:584
>  ip_finish_output_gso net/ipv4/ip_output.c:286 [inline]
>  __ip_finish_output net/ipv4/ip_output.c:308 [inline]
>  __ip_finish_output+0x49c/0x650 net/ipv4/ip_output.c:295
>  ip_finish_output+0x31/0x310 net/ipv4/ip_output.c:323
>  NF_HOOK_COND include/linux/netfilter.h:303 [inline]
>  ip_output+0x13b/0x2a0 net/ipv4/ip_output.c:433
>  dst_output include/net/dst.h:451 [inline]
>  ip_local_out+0xaf/0x1a0 net/ipv4/ip_output.c:129
>  iptunnel_xmit+0x5b4/0x9b0 net/ipv4/ip_tunnel_core.c:82
>  ipip6_tunnel_xmit net/ipv6/sit.c:1034 [inline]
>  sit_tunnel_xmit+0xed2/0x28f0 net/ipv6/sit.c:1076
>  __netdev_start_xmit include/linux/netdevice.h:4940 [inline]
>  netdev_start_xmit include/linux/netdevice.h:4954 [inline]
>  xmit_one net/core/dev.c:3545 [inline]
>  dev_hard_start_xmit+0x13d/0x6d0 net/core/dev.c:3561
>  __dev_queue_xmit+0x7c1/0x3d60 net/core/dev.c:4346
>  dev_queue_xmit include/linux/netdevice.h:3134 [inline]
>  packet_xmit+0x257/0x380 net/packet/af_packet.c:276
>  packet_snd net/packet/af_packet.c:3087 [inline]
>  packet_sendmsg+0x24ca/0x5240 net/packet/af_packet.c:3119
>  sock_sendmsg_nosec net/socket.c:730 [inline]
>  __sock_sendmsg+0xd5/0x180 net/socket.c:745
>  __sys_sendto+0x255/0x340 net/socket.c:2190
>  __do_sys_sendto net/socket.c:2202 [inline]
>  __se_sys_sendto net/socket.c:2198 [inline]
>  __x64_sys_sendto+0xe0/0x1b0 net/socket.c:2198
>  do_syscall_x64 arch/x86/entry/common.c:51 [inline]
>  do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82
>  entry_SYSCALL_64_after_hwframe+0x63/0x6b
>
> Signed-off-by: Denis Arefev 
> ---
>  include/linux/virtio_net.h | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 4dfa9b69ca8d..77ebe908d746 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -56,6 +56,7 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
> unsigned int thlen = 0;
> unsigned int p_off = 0;
> unsigned int ip_proto;
> +   u64 ret, remainder;
>
> if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> switch (hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
> @@ -98,6 +99,15 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
> u32 off = __virtio16_to_cpu(little_endian, hdr->csum_offset);
> u32 needed = start + max_t(u32, thlen, off + sizeof(__sum16));
>
> +       if (hdr->gso_size) {
> +               ret = div64_u64_rem(skb->len, hdr->gso_size, &remainder);
> +               if (!(ret && (hdr->gso_size > needed) &&
> +                     ((remainder > needed) || (remainder == 0)))) {
> +  
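
The quoted check is cut off above; a hedged restatement of the invariants from point 2 of the changelog, reusing the patch's own variable names:

    /* Every full GSO segment carries gso_size payload bytes, so the
     * checksummed region ('needed') must fit in each segment, including
     * a short trailing one ('remainder' = skb->len % gso_size, which
     * may legitimately be 0). */
    ret = div64_u64_rem(skb->len, hdr->gso_size, &remainder);
    if (!(ret && hdr->gso_size > needed &&
          (remainder == 0 || remainder > needed)))
        return -EINVAL; /* inconsistent header/payload combination */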

Re: [PATCH net-next v7] net/ipv4: add tracepoint for icmp_send

2024-04-26 Thread Eric Dumazet
On Fri, Apr 26, 2024 at 10:47 AM  wrote:
>
> From: Peilin He 
>
> Introduce a tracepoint for icmp_send, which can help users to get more
> detail information conveniently when icmp abnormal events happen.
>
> 1. A use-case example:
> =
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain detailed information about the UDP
> packet loss, including the type, code, source address, destination address,
> source port, and destination port. This facilitates the trouble-shooting
> of UDP packet loss issues especially for those network-service
> applications.
>
> 2. Operation Instructions:
> ==
> Switch to the tracing directory.
> cd /sys/kernel/tracing
> Filter for destination port unreachable.
> echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
> echo 1 > events/icmp/icmp_send/enable
>
> 3. Result View:
> 
>  udp_client_erro-11370   [002] ...s.12   124.728002:
>  icmp_send: icmp_send: type=3, code=3.
>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>  skbaddr=589b167a
>
> Change log:
> ===
> v6->v7:
> Some fixes according to
> https://lore.kernel.org/all/20240425081210.720a4...@kernel.org/
> 1. Fix patch format issues.
>
> v5->v6:
> Some fixes according to
> https://lore.kernel.org/all/20240413161319.ga853...@kernel.org/
> 1.Resubmit patches based on the latest net-next code.
>
> v4->v5:
> Some fixes according to
> https://lore.kernel.org/all/CAL+tcoDeXXh+zcRk4PHnUk8ELnx=CE2pccqs7sfm0y9ak-e...@mail.gmail.com/
> 1.Adjust the position of trace_icmp_send() to before icmp_push_reply().
>
> v3->v4:
> Some fixes according to
> https://lore.kernel.org/all/CANn89i+EFEr7VHXNdOi59Ba_R1nFKSBJzBzkJFVgCTdXBx=y...@mail.gmail.com/
> 1.Add legality check for UDP header in SKB.
> 2.Target this patch for net-next.
>
> v2->v3:
> Some fixes according to
> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
> 1. Change the tracing directory to /sys/kernel/tracing.
> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
>
> v1->v2:
> Some fixes according to
> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHufqtjvjsv-nz1x...@mail.gmail.com/
> 1. adjust the trace_icmp_send() to more protocols than UDP.
> 2. move the calling of trace_icmp_send after sanity checks
> in __icmp_send().
>
> Signed-off-by: Peilin He 
> Signed-off-by: xu xin 
> Reviewed-by: Yunkai Zhang 
> Cc: Yang Yang 
> Cc: Liu Chun 
> Cc: Xuexin Jiang 
> ---
>  include/trace/events/icmp.h | 68 +
>  net/ipv4/icmp.c |  4 +++
>  2 files changed, 72 insertions(+)
>  create mode 100644 include/trace/events/icmp.h
>
> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> new file mode 100644
> index ..cff33aaee44e
> --- /dev/null
> +++ b/include/trace/events/icmp.h
> @@ -0,0 +1,68 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM icmp
> +
> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_ICMP_H
> +
> +#include 
> +#include 
> +
> +TRACE_EVENT(icmp_send,
> +
> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> +
> +   TP_ARGS(skb, type, code),
> +
> +   TP_STRUCT__entry(
> +   __field(const void *, skbaddr)
> +   __field(int, type)
> +   __field(int, code)
> +   __array(__u8, saddr, 4)
> +   __array(__u8, daddr, 4)
> +   __field(__u16, sport)
> +   __field(__u16, dport)
> +   __field(unsigned short, ulen)
> +   ),
> +
> +   TP_fast_assign(
> +   struct iphdr *iph = ip_hdr(skb);
> +   int proto_4 = iph->protocol;
> +   __be32 *p32;
> +
> +   __entry->skbaddr = skb;
> +   __entry->type = type;
> +   __entry->code = code;
> +
> +   struct udphdr *uh = udp_hdr(skb);

Please move this line up, perhaps after the "struct iphdr *iph = ip_hdr(skb);" one.

We group all variable definitions together, we do not mix code and variables.



> +
> +               if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
> +                   (u8 *)uh + sizeof(struct udphdr) > skb_tail_pointer(skb)) {
> +                       __entry->sport = 0;
> +                       __entry->dport = 0;
> +                       __entry->ulen = 0;
> +               } else {
> +                       __entr
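
The quote is truncated above; a sketch of the declaration grouping being requested:

    TP_fast_assign(
        struct iphdr *iph = ip_hdr(skb);
        struct udphdr *uh = udp_hdr(skb); /* moved up with the declarations */
        int proto_4 = iph->protocol;
        __be32 *p32;

        __entry->skbaddr = skb;
        __entry->type = type;
        __entry->code = code;
        /* ... header validation and remaining assignments follow ... */
    ),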

Re: [PATCH v3 resend] net/ipv4: add tracepoint for icmp_send

2024-03-21 Thread Eric Dumazet
On Thu, Mar 21, 2024 at 4:09 AM  wrote:
>
> From: he peilin 
>
> Introduce a tracepoint for icmp_send, which can help users to get more
> detail information conveniently when icmp abnormal events happen.
>
> 1. A use-case example:
> =
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain detailed information about the UDP
> packet loss, including the type, code, source address, destination address,
> source port, and destination port. This facilitates the trouble-shooting
> of UDP packet loss issues especially for those network-service
> applications.
>
> 2. Operation Instructions:
> ==
> Switch to the tracing directory.
> cd /sys/kernel/tracing
> Filter for destination port unreachable.
> echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
> echo 1 > events/icmp/icmp_send/enable
>
> 3. Result View:
> 
>  udp_client_erro-11370   [002] ...s.12   124.728002:
>  icmp_send: icmp_send: type=3, code=3.
>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>  skbaddr=589b167a
>
> Changelog
> -
> v2->v3:
> Some fixes according to
> https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
> 1. Change the tracing directory to /sys/kernel/tracing.
> 2. Adjust the layout of the TP_STRUCT__entry parameter structure.
>
> v1->v2:
> Some fixes according to
> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
> 1. adjust the trace_icmp_send() to more protocols than UDP.
> 2. move the calling of trace_icmp_send after sanity checks
> in __icmp_send().
>
> Signed-off-by: Peilin He
> Reviewed-by: xu xin 
> Reviewed-by: Yunkai Zhang 
> Cc: Yang Yang 
> Cc: Liu Chun 
> Cc: Xuexin Jiang 
> ---
>  include/trace/events/icmp.h | 64 +
>  net/ipv4/icmp.c |  4 +++
>  2 files changed, 68 insertions(+)
>  create mode 100644 include/trace/events/icmp.h
>
> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> new file mode 100644
> index ..2098d4b1b12e
> --- /dev/null
> +++ b/include/trace/events/icmp.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM icmp
> +
> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_ICMP_H
> +
> +#include 
> +#include 
> +
> +TRACE_EVENT(icmp_send,
> +
> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> +
> +   TP_ARGS(skb, type, code),
> +
> +   TP_STRUCT__entry(
> +   __field(const void *, skbaddr)
> +   __field(int, type)
> +   __field(int, code)
> +   __array(__u8, saddr, 4)
> +   __array(__u8, daddr, 4)
> +   __field(__u16, sport)
> +   __field(__u16, dport)
> +   __field(unsigned short, ulen)
> +   ),
> +
> +   TP_fast_assign(
> +   struct iphdr *iph = ip_hdr(skb);
> +   int proto_4 = iph->protocol;
> +   __be32 *p32;
> +
> +   __entry->skbaddr = skb;
> +   __entry->type = type;
> +   __entry->code = code;
> +
> +   if (proto_4 == IPPROTO_UDP) {
> +   struct udphdr *uh = udp_hdr(skb);
> +   __entry->sport = ntohs(uh->source);
> +   __entry->dport = ntohs(uh->dest);
> +   __entry->ulen = ntohs(uh->len);

This is completely bogus.

Adding tracepoints is ok if there are no side effects like bugs :/

At this point there is no guarantee the UDP header is complete/present
in skb->head

Look at the existing checks between lines 619 and 623

Then audit all icmp_send() callers, and ask yourself if UDP packets
can not be malicious (like with a truncated UDP header)
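
For reference, a sketch from memory of the style of bounds check __icmp_send() applies to the inner IP header (illustrative, not a verbatim quote of net/ipv4/icmp.c):

    /* Make sure the header about to be read lies fully inside the
     * skb's data before dereferencing any of its fields. */
    if ((u8 *)iph < skb_in->head ||
        (u8 *)(iph + 1) > skb_tail_pointer(skb_in))
        goto out;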



Re: [PATCH net-next v2] tcp: Add skb addr and sock addr to arguments of tracepoint tcp_probe.

2024-03-04 Thread Eric Dumazet
On Mon, Mar 4, 2024 at 4:46 AM fuyuanli  wrote:
>
> It is useful to expose skb addr and sock addr to user in tracepoint
> tcp_probe, so that we can get more information while monitoring
> receiving of tcp data, by ebpf or other ways.
>
> For example, we need to identify a packet by seq and end_seq when
> calculate transmit latency between lay 2 and lay 4 by ebpf, but which is

Please use "layer 2 and layer 4".

> not available in tcp_probe, so we can only use kprobe hooking
> tcp_rcv_esatblised to get them. But we can use tcp_probe directly if skb
> addr and sock addr are available, which is more efficient.

Okay, but please fix the typo. Correct function name is tcp_rcv_established

>
> Signed-off-by: fuyuanli 
> Link: 
> https://lore.kernel.org/netdev/20240229052813.GA23899@didi-ThinkCentre-M920t-N000/
>



Re: [PATCH] net/ipv4: add tracepoint for icmp_send

2024-02-26 Thread Eric Dumazet
On Tue, Feb 27, 2024 at 3:50 AM  wrote:
>
> From: xu xin 
>
> Introduce a tracepoint for icmp_send, which can help users to get more
> detail information conveniently when icmp abnormal events happen.
>
> 1. A use-case example:
> =
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain the detailed information easily about the
> UDP packet loss, including the type, code, source address, destination
> address, source port, and destination port. This facilitates the
> trouble-shooting of packet loss issues especially for those complicated
> network-service applications.
>
> 2. Operation Instructions:
> ==
> Switch to the tracing directory.
> cd /sys/kernel/debug/tracing
> Filter for destination port unreachable.
> echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
> echo 1 > events/icmp/icmp_send/enable
>
> 3. Result View:
> 
>  udp_client_erro-11370   [002] ...s.12   124.728002:  icmp_send:
> icmp_send: type=3, code=3.From 127.0.0.1:41895 to 127.0.0.1: ulen=23
> skbaddr=589b167a
>
> Signed-off-by: He Peilin 
> Reviewed-by: xu xin 
> Reviewed-by: Yunkai Zhang 
> Cc: Yang Yang 
> Cc: Liu Chun 
> Cc: Xuexin Jiang 
> ---
>  include/trace/events/icmp.h | 57 +
>  net/ipv4/icmp.c |  4 
>  2 files changed, 61 insertions(+)
>  create mode 100644 include/trace/events/icmp.h
>
> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> new file mode 100644
> index ..3d9af5769bc3
> --- /dev/null
> +++ b/include/trace/events/icmp.h
> @@ -0,0 +1,57 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM icmp
> +
> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_ICMP_H
> +
> +#include 
> +#include 
> +
> +TRACE_EVENT(icmp_send,
> +
> +   TP_PROTO(const struct sk_buff *skb, int type, int code),
> +
> +   TP_ARGS(skb, type, code),
> +
> +   TP_STRUCT__entry(
> +   __field(__u16, sport)
> +   __field(__u16, dport)
> +   __field(unsigned short, ulen)
> +   __field(const void *, skbaddr)
> +   __field(int, type)
> +   __field(int, code)
> +   __array(__u8, saddr, 4)
> +   __array(__u8, daddr, 4)
> +   ),
> +
> +   TP_fast_assign(
> +   // Get UDP header
> +   struct udphdr *uh = udp_hdr(skb);
> +   struct iphdr *iph = ip_hdr(skb);
> +   __be32 *p32;
> +
> +   __entry->sport = ntohs(uh->source);
> +   __entry->dport = ntohs(uh->dest);
> +   __entry->ulen = ntohs(uh->len);
> +   __entry->skbaddr = skb;
> +   __entry->type = type;
> +   __entry->code = code;
> +
> +   p32 = (__be32 *) __entry->saddr;
> +   *p32 = iph->saddr;
> +
> +   p32 = (__be32 *) __entry->daddr;
> +   *p32 =  iph->daddr;
> +   ),
> +

FYI, ICMP can be generated for many other protocols than UDP.

> +   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u ulen=%d skbaddr=%p",
> +   __entry->type, __entry->code,
> +   __entry->saddr, __entry->sport, __entry->daddr,
> +   __entry->dport, __entry->ulen, __entry->skbaddr)
> +);
> +
> +#endif /* _TRACE_ICMP_H */
> +
> +/* This part must be outside protection */
> +#include 
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index e63a3bf99617..437bdb7e2650 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -92,6 +92,8 @@
>  #include 
>  #include 
>  #include 
> +#define CREATE_TRACE_POINTS
> +#include 
>
>  /*
>   * Build xmit assembly blocks
> @@ -599,6 +601,8 @@ void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
> struct net *net;
> struct sock *sk;
>
> +   trace_icmp_send(skb_in, type, code);

I think you missed many sanity checks between lines 622 and 676

Honestly, a kprobe BPF based solution would be less risky, and less
maintenance for us.



Re: [PATCH net-next 1/2] net/vsockmon: Leverage core stats allocator

2024-02-23 Thread Eric Dumazet
On Fri, Feb 23, 2024 at 12:58 PM Breno Leitao  wrote:
>
> With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
> convert veth & vrf"), stats allocation can be done by the net core
> instead of by this driver.
>
> With this new approach, the driver doesn't have to bother with error
> handling (allocation failure checking, making sure free happens in the
> right spot, etc). This is core responsibility now.
>
> Remove the allocation in the vsockmon driver and leverage the network
> core allocation instead.
>
> Signed-off-by: Breno Leitao 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next 2/2] net/vsockmon: Do not set zeroed statistics

2024-02-23 Thread Eric Dumazet
On Fri, Feb 23, 2024 at 12:58 PM Breno Leitao  wrote:
>
> Do not set rtnl_link_stats64 fields to zero, since they are zeroed
> before ops->ndo_get_stats64 is called in core dev_get_stats() function.
>
> Signed-off-by: Breno Leitao 

Reviewed-by: Eric Dumazet 



Re: [PATCH net-next v3] net: dqs: add NIC stall detector based on BQL

2024-02-14 Thread Eric Dumazet
On Wed, Feb 14, 2024 at 5:49 PM Breno Leitao  wrote:
>
> On Wed, Feb 14, 2024 at 04:41:36PM +0100, Eric Dumazet wrote:
> > On Wed, Feb 14, 2024 at 3:45 PM Breno Leitao  wrote:
> > >
> > > On Tue, Feb 13, 2024 at 10:04:57AM -0800, Jakub Kicinski wrote:
> > > > On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> > > > > Please note that adding other sysfs entries is expensive for workloads
> > > > > creating/deleting netdev and netns often.
> > > > >
> > > > > I _think_ we should find a way for not creating
> > > > > /sys/class/net//queues/tx-{Q}/byte_queue_limits  directory
> > > > > and files
> > > > > for non BQL enabled devices (like loopback !)
> > > >
> > > > We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
> > > > NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL
> > > > would be pointless"? Obviously better to annotate the drivers which
> > > > do have BQL support, but there's >50 of them on a quick count..
> > >
> > > Let me make sure I understand the suggestion above. We want to disable
> > > BQL completely for devices that have dev->features & NETIF_F_LLTX or
> > > dev->priv_flags & IFF_NO_QUEUE, right?
> > >
> > > Maybe we can add an ->enabled field in struct dql, and set it according
> > > to the features above. Then we can create the sysfs entries and process the dql
> > > operations based on that field. This should also avoid some unnecessary calls
> > > if we are not displaying sysfs.
> > >
> > > Here is a very simple PoC to represent what I had in mind. Am I in the
> > > right direction?
> >
> > No, this was really about sysfs entries (aka dql_group)
> >
> > Partial patch would be:
>
> That is simpler than what I imagined. Thanks!
>

>
> for netdev_uses_bql(), would it be similar to what I proposed in the
> previous message? Let me copy it here.
>
> static bool netdev_uses_bql(struct net_device *dev)
> {
>if (dev->features & NETIF_F_LLTX ||
>dev->priv_flags & IFF_NO_QUEUE)
>return false;
>
>return true;
> }

I think this should be fine, yes.



Re: [PATCH net-next v3] net: dqs: add NIC stall detector based on BQL

2024-02-14 Thread Eric Dumazet
On Wed, Feb 14, 2024 at 3:45 PM Breno Leitao  wrote:
>
> On Tue, Feb 13, 2024 at 10:04:57AM -0800, Jakub Kicinski wrote:
> > On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> > > Please note that adding other sysfs entries is expensive for workloads
> > > creating/deleting netdev and netns often.
> > >
> > > I _think_ we should find a way for not creating
> > > /sys/class/net//queues/tx-{Q}/byte_queue_limits  directory
> > > and files
> > > for non BQL enabled devices (like loopback !)
> >
> > We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
> > NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL
> > would be pointless"? Obviously better to annotate the drivers which
> > do have BQL support, but there's >50 of them on a quick count..
>
> Let me make sure I understand the suggestion above. We want to disable
> BQL completely for devices that have dev->features & NETIF_F_LLTX or
> dev->priv_flags & IFF_NO_QUEUE, right?
>
> Maybe we can add an ->enabled field in struct dql, and set it according
> to the features above. Then we can create the sysfs entries and process the dql
> operations based on that field. This should also avoid some unnecessary calls
> if we are not displaying sysfs.
>
> Here is a very simple PoC to represent what I had in mind. Am I in the
> right direction?

No, this was really about sysfs entries (aka dql_group)

Partial patch would be:

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index a09d507c5b03d24a829bf7af0b7cf1e6a0bdb65a..094e3b2d78cca40d810b2fa3bd4393d22b30e6ad 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1709,9 +1709,11 @@ static int netdev_queue_add_kobject(struct net_device *dev, int index)
goto err;

 #ifdef CONFIG_BQL
-   error = sysfs_create_group(kobj, &dql_group);
-   if (error)
-   goto err;
+   if (netdev_uses_bql(dev)) {
+   error = sysfs_create_group(kobj, &dql_group);
+   if (error)
+   goto err;
+   }
 #endif

kobject_uevent(kobj, KOBJ_ADD);
@@ -1734,7 +1736,8 @@ static int tx_queue_change_owner(struct net_device *ndev, int index,
return error;

 #ifdef CONFIG_BQL
-   error = sysfs_group_change_owner(kobj, &dql_group, kuid, kgid);
+   if (netdev_uses_bql(ndev))
+   error = sysfs_group_change_owner(kobj, &dql_group, kuid, kgid);
 #endif
return error;
 }
@@ -1768,7 +1771,8 @@ netdev_queue_update_kobjects(struct net_device *dev, int old_num, int new_num)
if (!refcount_read(&dev_net(dev)->ns.count))
queue->kobj.uevent_suppress = 1;
 #ifdef CONFIG_BQL
-   sysfs_remove_group(&queue->kobj, &dql_group);
+   if (netdev_uses_bql(dev))
+   sysfs_remove_group(&queue->kobj, &dql_group);
 #endif
kobject_put(&queue->kobj);
}



Re: [PATCH net-next v3] net: dqs: add NIC stall detector based on BQL

2024-02-13 Thread Eric Dumazet
On Fri, Feb 2, 2024 at 5:55 PM Breno Leitao  wrote:
>
> From: Jakub Kicinski 
>
> softnet_data->time_squeeze is sometimes used as a proxy for
> host overload or indication of scheduling problems. In practice
> this statistic is very noisy and has hard to grasp units -
> e.g. is 10 squeezes a second to be expected, or high?
>
> Delaying network (NAPI) processing leads to drops on NIC queues
> but also RTT bloat, impacting pacing and CA decisions.
> Stalls are a little hard to detect on the Rx side, because
> there may simply have not been any packets received in given
> period of time. Packet timestamps help a little bit, but
> again we don't know if packets are stale because we're
> not keeping up or because someone (*cough* cgroups)
> disabled IRQs for a long time.

Please note that adding other sysfs entries is expensive for workloads
creating/deleting netdev and netns often.

I _think_ we should find a way for not creating
/sys/class/net//queues/tx-{Q}/byte_queue_limits  directory
and files
for non BQL enabled devices (like loopback !)



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Eric Dumazet
On Mon, Oct 9, 2023 at 11:43 AM Yajun Deng  wrote:
>
>
> On 2023/10/9 17:30, Eric Dumazet wrote:
> > On Mon, Oct 9, 2023 at 10:36 AM Yajun Deng  wrote:
> >>
> >> On 2023/10/9 16:20, Eric Dumazet wrote:
> >>> On Mon, Oct 9, 2023 at 10:14 AM Yajun Deng  wrote:
> >>>> On 2023/10/9 15:53, Eric Dumazet wrote:
> >>>>> On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:
> >>>>>
> >>>>>> 'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
> >>>>>> the trace work well.
> >>>>>>
> >>>>>> They all have 'pop' instructions in them. This may be the key to making
> >>>>>> the trace work well.
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I need your help on percpu and ftrace.
> >>>>>>
> >>>>> I do not think you made sure netdev_core_stats_inc() was never inlined.
> >>>>>
> >>>>> Adding more code in it is simply changing how the compiler decides to
> >>>>> inline or not.
> >>>> Yes, you are right. It needs to add the 'noinline' prefix. The
> >>>> disassembly code will have 'pop'
> >>>>
> >>>> instruction.
> >>>>
> >>> The function was fine, you do not need anything like push or pop.
> >>>
> >>> The only needed stuff was the call __fentry__.
> >>>
> >>> The fact that the function was inlined for some invocations was the
> >>> issue, because the trace point
> >>> is only planted in the out of line function.
> >>
> >> But somehow the following code isn't inline? They didn't need to add the
> >> 'noinline' prefix.
> >>
> >> +   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
> >> +   WRITE_ONCE(*field, READ_ONCE(*field) + 1);
> >>
> >> Or
> >> +   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;
> >>
> > I think you are very confused.
> >
> > You only want to trace netdev_core_stats_inc() entry point, not
> > arbitrary pieces of it.
>
>
> Yes, I will trace netdev_core_stats_inc() entry point. I mean to replace
>
> +   field = (__force unsigned long __percpu *)((__force void *)p + offset);
> +   this_cpu_inc(*field);
>
> with
>
> +   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
> +   WRITE_ONCE(*field, READ_ONCE(*field) + 1);
>
> Or
> +   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;
>
> The netdev_core_stats_inc() entry point will work fine even if it doesn't
> have the 'noinline' prefix.
>
> I don't know why this code needs to add the 'noinline' prefix.
> +   field = (__force unsigned long __percpu *)((__force void *)p + offset);
> +   this_cpu_inc(*field);
>

C compiler decides to inline or not, depending on various factors.

The most efficient (and small) code is generated by this_cpu_inc()
version, allowing the compiler to inline it.

If you copy/paste this_cpu_inc()  twenty times, then the compiler
would not inline the function anymore.
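
A side-by-side of the two variants discussed in this thread (snippets taken from the quotes above); on x86 the first can compile to a single gs-prefixed increment, while the second materializes a per-cpu pointer first:

    /* Compact form: the compiler may emit one per-cpu increment. */
    field = (__force unsigned long __percpu *)((__force void *)p + offset);
    this_cpu_inc(*field);

    /* Experimental form from the thread: per-cpu pointer, then an
     * explicit read-modify-write through it. */
    field2 = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
    WRITE_ONCE(*field2, READ_ONCE(*field2) + 1);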



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Eric Dumazet
On Mon, Oct 9, 2023 at 10:36 AM Yajun Deng  wrote:
>
>
> On 2023/10/9 16:20, Eric Dumazet wrote:
> > On Mon, Oct 9, 2023 at 10:14 AM Yajun Deng  wrote:
> >>
> >> On 2023/10/9 15:53, Eric Dumazet wrote:
> >>> On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:
> >>>
> >>>> 'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
> >>>> the trace work well.
> >>>>
> >>>> They all have 'pop' instructions in them. This may be the key to making
> >>>> the trace work well.
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I need your help on percpu and ftrace.
> >>>>
> >>> I do not think you made sure netdev_core_stats_inc() was never inlined.
> >>>
> >>> Adding more code in it is simply changing how the compiler decides to
> >>> inline or not.
> >>
> >> Yes, you are right. It needs to add the 'noinline' prefix. The
> >> disassembly code will have 'pop'
> >>
> >> instruction.
> >>
> > The function was fine, you do not need anything like push or pop.
> >
> > The only needed stuff was the call __fentry__.
> >
> > The fact that the function was inlined for some invocations was the
> > issue, because the trace point
> > is only planted in the out of line function.
>
>
> But somehow the following code isn't inline? They didn't need to add the
> 'noinline' prefix.
>
> +   field = (unsigned long *)((void *)this_cpu_ptr(p) + offset);
> +   WRITE_ONCE(*field, READ_ONCE(*field) + 1);
>
> Or
> +   (*(unsigned long *)((void *)this_cpu_ptr(p) + offset))++;
>

I think you are very confused.

You only want to trace netdev_core_stats_inc() entry point, not
arbitrary pieces of it.



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Eric Dumazet
On Mon, Oct 9, 2023 at 10:14 AM Yajun Deng  wrote:
>
>
> On 2023/10/9 15:53, Eric Dumazet wrote:
> > On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:
> >
> >> 'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
> >> the trace work well.
> >>
> >> They all have 'pop' instructions in them. This may be the key to making
> >> the trace work well.
> >>
> >> Hi all,
> >>
> >> I need your help on percpu and ftrace.
> >>
> > I do not think you made sure netdev_core_stats_inc() was never inlined.
> >
> > Adding more code in it is simply changing how the compiler decides to
> > inline or not.
>
>
> Yes, you are right. It needs to add the 'noinline' prefix. The
> disassembly code will have 'pop'
>
> instruction.
>

The function was fine, you do not need anything like push or pop.

The only needed stuff was the call __fentry__.

The fact that the function was inlined for some invocations was the
issue, because the trace point
is only planted in the out of line function.
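
A hedged sketch of the conclusion, with the body abridged from the thread's own snippets: keeping the function out of line guarantees every caller passes through the one traced entry point, since __fentry__ is only emitted in the out-of-line instance.

    noinline void netdev_core_stats_inc(struct net_device *dev, u32 offset)
    {
        struct net_device_core_stats __percpu *p = READ_ONCE(dev->core_stats);
        unsigned long __percpu *field;

        if (p) {
            field = (__force unsigned long __percpu *)((__force void *)p + offset);
            this_cpu_inc(*field);
        }
    }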



Re: [PATCH net-next v7] net/core: Introduce netdev_core_stats_inc()

2023-10-09 Thread Eric Dumazet
On Mon, Oct 9, 2023 at 5:07 AM Yajun Deng  wrote:

> 'this_cpu_read + this_cpu_write' and 'pr_info + this_cpu_inc' will make
> the trace work well.
>
> They all have 'pop' instructions in them. This may be the key to making
> the trace work well.
>
> Hi all,
>
> I need your help on percpu and ftrace.
>

I do not think you made sure netdev_core_stats_inc() was never inlined.

Adding more code in it is simply changing how the compiler decides to
inline or not.



Re: [PATCH v3 bpf-next 00/11] Socket migration for SO_REUSEPORT.

2021-04-20 Thread Eric Dumazet



On 4/20/21 5:41 PM, Kuniyuki Iwashima wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation [1]. When a SYN packet is received, the connection is tied
> to a listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners on the same port could accept
> such connections.
> 
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in the
> in-flight ACK of 3WHS is responded by RST.
> 
> The SO_REUSEPORT option is excellent to improve scalability.

This was before the SYN processing was made lockless.

I really wonder if we still need SO_REUSEPORT for TCP ?

Eventually a new accept() system call where different threads
can express how they want to choose the children sockets would
be less invasive.

Instead of having many listeners, have one listener and eventually multiple
accept queues to improve scalability of accept() phase.



Re: BUG: KASAN: use-after-free in page_to_skb.isra.0+0x300/0x418

2021-04-20 Thread Eric Dumazet
On Tue, Apr 20, 2021 at 3:45 PM Naresh Kamboju
 wrote:
>
> Following kernel BUG reported on qemu-arm64 running linux next 20210420
> the config is enabled with KASAN.
>
> steps to reproduce:
> 
> - Build the arm64 kernel with KASAN enabled.
> - boot it with below command and you will notice
>  /usr/bin/qemu-system-aarch64 -cpu host -machine virt,accel=kvm
> -nographic -net nic,model=virtio,macaddr=BA:DD:AD:CC:09:10 -net tap -m
> 1024 -monitor none -kernel kernel/Image.gz --append "console=ttyAMA0
> root=/dev/vda rw" -hda
> rootfs/rpb-console-image-lkft-juno-20210414125244-133.rootfs.ext4 -m
> 4096 -smp 4 -nographic
>
>
> crash log:
> -
>

This is the fifth report, have you tried the proposed fix ?

https://patchwork.kernel.org/project/netdevbpf/patch/20210420094341.3259328-1-eric.duma...@gmail.com/


Re: [PATCH] tcp: prevent loss of bytes when using syncookie

2021-04-20 Thread Eric Dumazet
On Tue, Apr 20, 2021 at 2:00 PM zhaoya  wrote:
>
> When using syncookies, since $MSSID is spliced into the cookie and
> the legal index of msstab is 0,1,2,3, this gives the client 3 bytes
> of freedom, resulting in at most 3 bytes of silent loss.
>
> C ---seq=12345---------------> S
> C <--seq=cookie/ack=12346----- S    S generated the cookie
>                                     [RFC4987 Appendix A]
> C ---seq=12346/ack=cookie+1-->X     S The first byte was lost.
> C ---seq=12347/ack=cookie+1--> S    S The second byte was received,
>                                     cookie-check was still okay and
>                                     the handshake was finished.
> C <--seq=.../ack=12348-------- S    S acknowledged the second byte.
> Here is a POC:
>
> $ cat poc.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <pthread.h>
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <netinet/in.h>
> #include <arpa/inet.h>
>
> void *serverfunc(void *arg)
> {
> int sd = -1;
> int csd = -1;
> struct sockaddr_in servaddr, cliaddr;
> int len = sizeof(cliaddr);
>
> sd = socket(AF_INET, SOCK_STREAM, 0);
> servaddr.sin_family = AF_INET;
> servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
> servaddr.sin_port = htons(1234);
> bind(sd, (struct sockaddr *)&servaddr, sizeof(servaddr));
> listen(sd, 1);
>
> while (1) {
> char buf[2];
> int ret;
> csd = accept(sd, (struct sockaddr *)&cliaddr, &len);
> memset(buf, 0, 2);
> ret = recv(csd, buf, 1, 0);
> // but unexpected char is 'b'
> if (ret && strncmp(buf, "a", 1)) {
> printf("unexpected:%s\n", buf);
> close(csd);
> exit(0);
> }
> close(csd);
> }
> }
>
> void *connectfunc(void *arg)
> {
> struct sockaddr_in addr;
> int sd;
> int i;
>
> for (i = 0; i < 500; i++) {
> sd = socket(AF_INET, SOCK_STREAM, 0);
> addr.sin_family = AF_INET;
> addr.sin_addr.s_addr = inet_addr("127.0.0.1");
> addr.sin_port = htons(1234);
>
> connect(sd, (struct sockaddr *)&addr, sizeof(addr));
>
> send(sd, "a", 1, 0); // expected char is 'a'
> send(sd, "b", 1, 0);
> close(sd);
> }
> return NULL;
> }
>
> int main(int argc, char *argv[])
> {
> int i;
> pthread_t id;
>
> pthread_create(&id, NULL, serverfunc, NULL);
> sleep(1);
> for (i = 0; i < 500; i++) {
> pthread_create(&id, NULL, connectfunc, NULL);
> }
> sleep(5);
> }
>
> $ sudo gcc poc.c -lpthread
> $ sudo sysctl -w net.ipv4.tcp_syncookies=1
> $ sudo ./a.out # please try as many times.
>

POC is distracting, you could instead give a link to
https://kognitio.com/blog/syn-cookies-ate-my-dog-breaking-tcp-on-linux/
where all the details can be found.

Also it gives credits to past work.

If you feel a program needs to be written and recorded for posterity,
add a selftest. (in tools/testing/selftests/net )

> This problem can be fixed by having $SSEQ itself participate in the
> hash operation.
>
> The feature is protected by a sysctl so that admins can choose
> the preferred behavior.
>
> Signed-off-by: zhaoya 
> ---
>  Documentation/networking/ip-sysctl.rst |  9 
>  include/linux/netfilter_ipv6.h | 24 +---
>  include/net/netns/ipv4.h   |  1 +
>  include/net/tcp.h  | 20 ++---
>  net/core/filter.c  |  6 +++--
>  net/ipv4/syncookies.c  | 31 --
>  net/ipv4/sysctl_net_ipv4.c |  7 ++
>  net/ipv4/tcp_ipv4.c|  4 +++-
>  net/ipv6/syncookies.c  | 29 +++-
>  net/ipv6/tcp_ipv6.c|  3 ++-
>  net/netfilter/nf_synproxy_core.c   | 10 +
>  11 files changed, 97 insertions(+), 47 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst 
> b/Documentation/networking/ip-sysctl.rst
> index c7952ac5bd2f..a430e8736c2b 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -732,6 +732,15 @@ tcp_syncookies - INTEGER
> network connections you can set this knob to 2 to enable
> unconditionally generation of syncookies.
>
> +tcp_syncookies_enorder - INTEGER

enorder ? What does it mean ?

> +   Only valid when the kernel was compiled with CONFIG_SYN_COOKIES.
> +   Prevent the silent loss of up to 3 bytes sent by the client when
> +   syncookies are generated to complete the TCP handshake.
> +   Default: 0
> +
> +   Note that if it was enabled, any out-of-order bytes sent by client
> +   to complete TCP-Handshake could get the sessio

Re: [PATCH] net: called rtnl_unlock() before runpm resumes devices

2021-04-20 Thread Eric Dumazet
On Tue, Apr 20, 2021 at 9:54 AM AceLan Kao  wrote:
>
> From: "Chia-Lin Kao (AceLan)" 
>
> The rtnl_lock() has been called in rtnetlink_rcv_msg(), and then in
> __dev_open() it calls pm_runtime_resume() to resume devices, and in
> some devices' resume functions (igb_resume, igc_resume) they call rtnl_lock()
> again. That leads to a recursive lock.
>
> It should leave the devices' resume function to decide if they need to
> call rtnl_lock()/rtnl_unlock(), so call rtnl_unlock() before calling
> pm_runtime_resume() and then call rtnl_lock() after it in __dev_open().
>
>

Hi Acelan

When was the bug added?
Please add a Fixes: tag.

By doing so, you give more chances for reviewers to understand why the
fix is not risky,
and help stable teams work.

Thanks.

> Signed-off-by: Chia-Lin Kao (AceLan) 
> ---
>  net/core/dev.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1f79b9aa9a3f..427cbc80d1e5 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1537,8 +1537,11 @@ static int __dev_open(struct net_device *dev, struct netlink_ext_ack *extack)
>
> if (!netif_device_present(dev)) {
> /* may be detached because parent is runtime-suspended */
> -   if (dev->dev.parent)
> +   if (dev->dev.parent) {
> +   rtnl_unlock();
> pm_runtime_resume(dev->dev.parent);
> +   rtnl_lock();
> +   }
> if (!netif_device_present(dev))
> return -ENODEV;
> }
> --
> 2.25.1
>


Re: [PATCH net] gro: fix napi_gro_frags() Fast GRO breakage due to IP alignment check

2021-04-19 Thread Eric Dumazet
On Sun, Apr 18, 2021 at 1:43 PM Alexander Lobakin  wrote:
>
> Commit 38ec4944b593 ("gro: ensure frag0 meets IP header alignment")
> did the right thing, but missed the fact that the napi_gro_frags() logic
> calls skb_gro_reset_offset() *before* pulling the Ethernet header
> into the skb linear space.
> Given that, the introduced check for the frag0 address being aligned to 4
> always fails there, as the Ethernet header is 14 bytes long,
> and with NET_IP_ALIGN its start is not aligned to 4.
>
> Fix this by adding @nhoff argument to skb_gro_reset_offset() which
> tells if an IP header is placed right at the start of frag0 or not.
> This restores Fast GRO for napi_gro_frags() that became very slow
> after the mentioned commit, and preserves the introduced check to
> avoid silent unaligned accesses.
>
> Fixes: 38ec4944b593 ("gro: ensure frag0 meets IP header alignment")
> Signed-off-by: Alexander Lobakin 
> ---
>  net/core/dev.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1f79b9aa9a3f..965d5f9b6fee 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5914,7 +5914,7 @@ static struct list_head *gro_list_prepare(struct napi_struct *napi,
> return head;
>  }
>
> -static void skb_gro_reset_offset(struct sk_buff *skb)
> +static void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff)
>  {
> const struct skb_shared_info *pinfo = skb_shinfo(skb);
> const skb_frag_t *frag0 = &pinfo->frags[0];
> @@ -5925,7 +5925,7 @@ static void skb_gro_reset_offset(struct sk_buff *skb)
>
> if (!skb_headlen(skb) && pinfo->nr_frags &&
> !PageHighMem(skb_frag_page(frag0)) &&
> -   (!NET_IP_ALIGN || !(skb_frag_off(frag0) & 3))) {
> +   (!NET_IP_ALIGN || !((skb_frag_off(frag0) + nhoff) & 3))) {
> NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
> NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
> skb_frag_size(frag0),
> @@ -6143,7 +6143,7 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
> skb_mark_napi_id(skb, napi);
> trace_napi_gro_receive_entry(skb);
>
> -   skb_gro_reset_offset(skb);
> +   skb_gro_reset_offset(skb, 0);
>
> ret = napi_skb_finish(napi, skb, dev_gro_receive(napi, skb));
> trace_napi_gro_receive_exit(ret);
> @@ -6232,7 +6232,7 @@ static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
> napi->skb = NULL;
>
> skb_reset_mac_header(skb);
> -   skb_gro_reset_offset(skb);
> +   skb_gro_reset_offset(skb, hlen);
>
> if (unlikely(skb_gro_header_hard(skb, hlen))) {
> eth = skb_gro_header_slow(skb, hlen, 0);
> --
> 2.31.1
>
>

Good catch, thanks.

This has the unfortunate effect of increasing code length on x86,
maybe we should have an exception to
normal rules so that skb_gro_reset_offset() is inlined.

Reviewed-by: Eric Dumazet 


Re: PROBLEM: DoS Attack on Fragment Cache

2021-04-19 Thread Eric Dumazet
On Sun, Apr 18, 2021 at 4:31 PM Matt Corallo
 wrote:
>
> Should the default, though, be so low? If someone is still using a old modem 
> they can crank up the sysctl, it does seem
> like such things are pretty rare these days :). Its rather trivial to, 
> without any kind of attack, hit 1Mbps of lost
> fragments in today's networks, at which point all fragments are dropped. 
> After all, I submitted the patch to "scratch my
> own itch" :).

Again, even if you increase the values by 1000x, it is trivial for an
attacker to use all the memory you allowed.

And allowing a significant portion of memory to be eaten like that
might cause OOM on hosts where jobs are consuming all physical memory.

It is a sysctl, I changed things so that one could really reserve/use
16GB of memory if she/he is desperate about frags.

>
> Matt
>
> On 4/18/21 00:39, Willy Tarreau wrote:
> > I do agree that we shouldn't keep them that long nowadays, we can't go
> > too low without risking to break some slow transmission stacks (SLIP/PPP
> > over modems for example).


Re: [PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-17 Thread Eric Dumazet
On Sat, Apr 17, 2021 at 6:03 PM Linus Torvalds
 wrote:
>
> On Fri, Apr 16, 2021 at 12:24 PM Eric Dumazet  wrote:
> >
> > From: Eric Dumazet 
> >
> > We have to loop only to copy u64 values.
> > After this first loop, we copy at most one u32, one u16 and one byte.
>
> As Al mentioned, at least in trivial cases the compiler understands
> that the subsequent loops can only be executed once, because earlier
> loops made sure that 'len' is always smaller than 2*size.
>
> But apparently the problem is the slightly more complex cases where
> the compiler just messes up and loses sight of that. Oh well.
>
> So the patch looks fine to me.
>
> HOWEVER.
>
> Looking at the put_cmsg() case in net-next, I'm very *VERY* unhappy
> about how you use those "unsafe" user access functions.
>
> Why? Because the point of the "unsafe" is to be a big red flag and
> make sure you are very VERY careful with it.
>
> And that code isn't.
>
> In particular, what if the "int len" argument is negative? Maybe it
> cannot happen, but when it comes to things like those unsafe user
> accessors, I really really want to see that all the arguments are
> *checked*.
>
> And you don't.
>
> You do
>
> if (!user_write_access_begin(cm, cmlen))
>
> ahead of time, and that will do basic range checking, but "cmlen" is
>
> sizeof(struct cmsghdr) + (len))
>
> so it's entirely possible that "cmlen" has a valid value, but "len"
> (and thus "cmlen - sizeof(*cm)", which is the length argument to the
> unsafe user copy) might be negative and that is never checked.
>
> End result: I want people to be a LOT more careful when they use those
> unsafe user space accessors. You need to make it REALLY REALLY obvious
> that everything you do is safe. And it's not at all obvious in the
> context of put_cmsg() - the safety currently relies on every single
> caller getting it right.

I thought put_cmsg() callers were from the kernel, with no possibility
for a user to abuse this interface by trying to push GBs of data.

Which would be terrible even if we 'fix' possible overflows.

Maybe I was wrong.


>
> So either fix "len" to be some restricted type (ie "unsigned short"),
> or make really really sure that "len" is valid (ie never negative, and
> the cmlen addition didn't overflow.
>
> Really. The "unsafe" user accesses are named that way very explicitly,
> and for a very very good reason: the safety needs to be guaranteed and
> obvious within the context of those accesses. Not within some "oh,
> nobody will ever call this with a negative argument" garbage bullshit.
>
>   Linus
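
A sketch of the validation being asked for (illustrative names; not the actual net-next change):

    /* Reject a negative user-influenced length, and an addition that
     * wraps, before any unsafe user access consumes the result. */
    if (len < 0)
        return -EINVAL;
    if (check_add_overflow(sizeof(struct cmsghdr), (size_t)len, &cmlen))
        return -EINVAL;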


Re: [External] Re: [PATCH] tcp: fix silent loss when syncookie is trigered

2021-04-16 Thread Eric Dumazet
On Sat, Apr 17, 2021 at 12:45 AM 赵亚  wrote:
>
> On Fri, Apr 16, 2021 at 7:52 PM Eric Dumazet  wrote:
> >
> > On Fri, Apr 16, 2021 at 12:52 PM zhaoya  wrote:
> > >
> > > When syncookie is triggered, since $MSSID is spliced into cookie and
> > > the legal index of msstab  is 0,1,2,3, this gives client 3 bytes
> > > of freedom, resulting in at most 3 bytes of silent loss.
> > >
> > > C ---seq=12345---------------> S
> > > C <--seq=cookie/ack=12346----- S    S generated the cookie
> > >                                     [RFC4987 Appendix A]
> > > C ---seq=12346/ack=cookie+1-->X     S The first byte was lost.
> > > C ---seq=12347/ack=cookie+1--> S    S The second byte was received and
> > >                                     cookie-check was still okay and
> > >                                     handshake was finished.
> > > C <--seq=.../ack=12348-------- S    S acknowledged the second byte.
> >
> >
> > I think this has been discussed in the past :
> > https://kognitio.com/blog/syn-cookies-ate-my-dog-breaking-tcp-on-linux/
> >
> > If I remember well, this can not be fixed "easily"
> >
> > I suspect you are trading one minor issue with another (which is
> > considered more practical these days)
> > Have you tried what happens if the server receives an out-of-order
> > packet after the SYN & SYN-ACK ?
> > The answer is : RST packet is sent, killing the session.
> >
> > That is the reason why sseq is not part of the hash key.
>
> Yes, I've tested this scenario. More sessions do get reset.
>
> If a client got an RST, it knew the session failed, which was clear. However,
> if the client sent a character and it was acknowledged, but the server did not
> receive it, this could cause confusion.
> >
> > In practice, secure connexions are using a setup phase where more than
> > 3 bytes are sent in the first packet.
> > We recommend using secure protocols over TCP. (prefer HTTPS over HTTP,
> > SSL over plaintext)
>
> Yes, i agree with you. But the basis of practice is principle.
> Syncookie breaks the
> semantics of TCP.
> >
> > Your change would severely impair servers under DDOS ability to really
> > establish flows.
>
> Could you share more details?
> >
> > Now, if your patch is protected by a sysctl so that admins can choose
> > the preferred behavior, then why not...
>
> The sysctl in the POC is just for triggering problems easily.
>
> So the question is, when syncookie is triggered, which is more important,
> the practice or the principle?

SYNCOOKIES have lots of known limitations.

You can disable them if you need.

Or you can add a sysctl or socket options so that each listener can
decide what they want.
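
(For reference, syncookies are already controlled by the
net.ipv4.tcp_syncookies sysctl: 0 disables them, 1 enables them under
SYN-flood pressure, 2 forces them unconditionally.)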

I gave feedback of why your initial patch was _not_ good.

I think it can render a server under DDOS absolutely unusable:
exactly the same situation as _without_ syncookies being used.
We do not want to go back to the situation we had before SYNCOOKIES
were invented.

I think you should have put a big warning in the changelog to explain
that you fully understood the risks.

We prefer having servers that can still be useful, especially ones
serving 100% HTTPS traffic.

Thank you.


Re: [PATCH net] net/core/dev.c: Ensure pfmemalloc skbs are correctly handled when receiving

2021-04-16 Thread Eric Dumazet
On Sat, Apr 17, 2021 at 2:08 AM Xie He  wrote:
>
> When an skb is allocated by "__netdev_alloc_skb" in "net/core/skbuff.c",
> if "sk_memalloc_socks()" is true, and if there's not sufficient memory,
> the skb would be allocated using emergency memory reserves. This kind of
> skbs are called pfmemalloc skbs.
>
> pfmemalloc skbs must be specially handled in "net/core/dev.c" when
> receiving. They must NOT be delivered to the target protocol if
> "skb_pfmemalloc_protocol(skb)" is false.
>
> However, if, after a pfmemalloc skb is allocated and before it reaches
> the code in "__netif_receive_skb", "sk_memalloc_socks()" becomes false,
> then the skb will be handled by "__netif_receive_skb" as a normal skb.
> This causes the skb to be delivered to the target protocol even if
> "skb_pfmemalloc_protocol(skb)" is false.
>
> This patch fixes this problem by ensuring all pfmemalloc skbs are handled
> by "__netif_receive_skb" as pfmemalloc skbs.
>
> "__netif_receive_skb_list" has the same problem as "__netif_receive_skb".
> This patch also fixes it.
>
> Fixes: b4b9e3558508 ("netvm: set PF_MEMALLOC as appropriate during SKB 
> processing")
> Cc: Mel Gorman 
> Cc: David S. Miller 
> Cc: Neil Brown 
> Cc: Peter Zijlstra 
> Cc: Jiri Slaby 
> Cc: Mike Christie 
> Cc: Eric B Munson 
> Cc: Eric Dumazet 
> Cc: Sebastian Andrzej Siewior 
> Cc: Christoph Lameter 
> Cc: Andrew Morton 
> Signed-off-by: Xie He 
> ---
>  net/core/dev.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1f79b9aa9a3f..3e6b7879daef 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5479,7 +5479,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>  {
> int ret;
>
> -   if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
> +   if (skb_pfmemalloc(skb)) {
> unsigned int noreclaim_flag;
>
> /*
> @@ -5507,7 +5507,7 @@ static void __netif_receive_skb_list(struct list_head 
> *head)
> bool pfmemalloc = false; /* Is current sublist PF_MEMALLOC? */
>
> list_for_each_entry_safe(skb, next, head, list) {
> -   if ((sk_memalloc_socks() && skb_pfmemalloc(skb)) != 
> pfmemalloc) {
> +   if (skb_pfmemalloc(skb) != pfmemalloc) {
> struct list_head sublist;
>
> /* Handle the previous sublist */
> --
> 2.27.0
>

The race window has been considered small enough that we prefer the
code as it is.

The reason we prefer the current code is that we use a static key in
the implementation of sk_memalloc_socks().

Trading a minor condition (race) for extra cycles on each received
packet is a serious concern.

What matters is a persistent condition that would _deplete_ memory,
not for a dozen packets but for thousands. Can you demonstrate such
an issue ?
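
For reference, a minimal sketch of the static-key pattern behind
sk_memalloc_socks() (modern API names; illustrative, not the verbatim
net/core/sock.c source):

DEFINE_STATIC_KEY_FALSE(memalloc_socks_key);

static inline bool sk_memalloc_socks(void)
{
	/* Compiles down to a patched NOP/JMP, so the common case
	 * (no SOCK_MEMALLOC socket exists) costs no memory load and
	 * no conditional branch on the packet fast path. */
	return static_branch_unlikely(&memalloc_socks_key);
}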


Re: PROBLEM: DoS Attack on Fragment Cache

2021-04-16 Thread Eric Dumazet
On Sat, Apr 17, 2021 at 2:31 AM David Ahern  wrote:
>
> [ cc author of 648700f76b03b7e8149d13cc2bdb3355035258a9 ]



I think this has been discussed already. There is no strategy that
makes IP reassembly units immune to DDOS attacks.

We added rb-tree and sysctls to let admins choose to use GB of RAM if
they really care.
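
For context, a simplified sketch of the check being described
(condensed from net/ipv4/inet_fragment.c; not the verbatim source):

struct inet_frag_queue *inet_frag_find(struct fqdir *fqdir, void *key)
{
	/* Once the memory charged to this netns' reassembly unit
	 * crosses high_thresh, every new fragment is refused, whether
	 * or not it belongs to an existing queue. */
	if (!fqdir->high_thresh || frag_mem_limit(fqdir) > fqdir->high_thresh)
		return NULL;

	return inet_frag_create(fqdir, key);
}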



>
> On 4/16/21 3:58 PM, Keyu Man wrote:
> > Hi,
> >
> >
> >
> > My name is Keyu Man. We are a group of researchers from University
> > of California, Riverside. Zhiyun Qian is my advisor. We found that the code
> > processing IPv4/IPv6 fragments will potentially lead to DoS attacks.
> > Specifically, after the latest kernel receives an IPv4 fragment, it will
> > try to fit it into a queue by calling function
> >
> >
> >
> > struct inet_frag_queue *inet_frag_find(struct fqdir *fqdir, void
> > *key) in net/ipv4/inet_fragment.c.
> >
> >
> >
> > However, this function will first check if the existing fragment
> > memory exceeds the fqdir->high_thresh. If it exceeds, then drop the
> > fragment regardless of whether it belongs to a new queue or an existing queue.
> >
> > Chances are that an attacker can fill the cache with fragments that
> > will never be assembled (i.e., only sends the first fragment with new
> > IPIDs every time) to exceed the threshold so that all future incoming
> > fragmented IPv4 traffic would be blocked and dropped. Since there is no
> > GC mechanism, the victim host has to wait for 30s, until the fragments
> > expire, to continue receiving incoming fragments normally.
> >
> > In practice, given the 4MB fragment cache, the attacker only needs
> > to send 1766 fragments to exhaust the cache and DoS the victim for 30s,
> > whose cost is pretty low. Besides, IPv6 would also be affected since the
> > issue resides in inet part.
> >
> > This issue is introduced in commit
> > 648700f76b03b7e8149d13cc2bdb3355035258a9 (inet: frags: use rhashtables
> > for reassembly units) which removes fqdir->low_thresh, and GC worker as
> > well. We would gently request to bring GC worker back to the kernel to
> > prevent the DoS attacks.
> >
> > Looking forward to hearing from you
> >
> >
> >
> > Thanks,
> >
> > Keyu Man
> >
>


Re: [PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-16 Thread Eric Dumazet
On Fri, Apr 16, 2021 at 10:11 PM Eric Dumazet  wrote:
>
> On Fri, Apr 16, 2021 at 9:44 PM Al Viro  wrote:
> >
> > On Fri, Apr 16, 2021 at 12:24:13PM -0700, Eric Dumazet wrote:
> > > From: Eric Dumazet 
> > >
> > > We have to loop only to copy u64 values.
> > > After this first loop, we copy at most one u32, one u16 and one byte.
> >
> > Does it actually yield a better code?
> >
>
> Yes, my patch gives better code, on an actual kernel use-case
>
> (net-next tree, look at put_cmsg())
>
>  5ca: 48 89 0f             mov    %rcx,(%rdi)
>  5cd: 89 77 08             mov    %esi,0x8(%rdi)
>  5d0: 89 57 0c             mov    %edx,0xc(%rdi)
>  5d3: 48 83 c7 10          add    $0x10,%rdi
>  5d7: 48 83 c1 f0          add    $0xfffffffffffffff0,%rcx
>  5db: 48 83 f9 07          cmp    $0x7,%rcx
>  5df: 76 40                jbe    621
>  5e1: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 nopw
> %cs:0x0(%rax,%rax,1)
>  5e8: 0f 1f 84 00 00 00 00
>  5ef: 00
>  5f0: 49 8b 10             mov    (%r8),%rdx
>  5f3: 48 89 17             mov    %rdx,(%rdi)
>  5f6: 48 83 c7 08          add    $0x8,%rdi
>  5fa: 49 83 c0 08          add    $0x8,%r8
>  5fe: 48 83 c1 f8          add    $0xfffffffffffffff8,%rcx
>  602: 48 83 f9 07          cmp    $0x7,%rcx
>  606: 77 e8                ja     5f0
>  608: eb 17                jmp    621
>  60a: 66 0f 1f 44 00 00    nopw   0x0(%rax,%rax,1)
>  610: 41 8b 10             mov    (%r8),%edx
>  613: 89 17                mov    %edx,(%rdi)
>  615: 48 83 c7 04          add    $0x4,%rdi
>  619: 49 83 c0 04          add    $0x4,%r8
>  61d: 48 83 c1 fc          add    $0xfffffffffffffffc,%rcx
>  621: 48 83 f9 03          cmp    $0x3,%rcx
>  625: 77 e9                ja     610
>  627: eb 1a                jmp    643
>  629: 0f 1f 80 00 00 00 00 nopl   0x0(%rax)
>  630: 41 0f b7 10          movzwl (%r8),%edx
>  634: 66 89 17             mov    %dx,(%rdi)
>  637: 48 83 c7 02          add    $0x2,%rdi
>  63b: 49 83 c0 02          add    $0x2,%r8
>  63f: 48 83 c1 fe          add    $0xfffffffffffffffe,%rcx
>  643: 48 83 f9 01          cmp    $0x1,%rcx
>  647: 77 e7                ja     630
>  649: eb 15                jmp    660
>  64b: 0f 1f 44 00 00       nopl   0x0(%rax,%rax,1)
>  650: 41 0f b6 08          movzbl (%r8),%ecx
>  654: 88 0f                mov    %cl,(%rdi)
>  656: 48 83 c7 01          add    $0x1,%rdi
>  65a: 49 83 c0 01          add    $0x1,%r8
>  65e: 31 c9                xor    %ecx,%ecx
>  660: 48 85 c9             test   %rcx,%rcx
>  663: 75 eb                jne    650

After the change, the code is what we would expect (no jmp around):
 5db: 48 83 f9 08          cmp    $0x8,%rcx
 5df: 72 27                jb     608
 5e1: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 nopw
%cs:0x0(%rax,%rax,1)
 5e8: 0f 1f 84 00 00 00 00
 5ef: 00
 5f0: 49 8b 10             mov    (%r8),%rdx
 5f3: 48 89 17             mov    %rdx,(%rdi)
 5f6: 48 83 c7 08          add    $0x8,%rdi
 5fa: 49 83 c0 08          add    $0x8,%r8
 5fe: 48 83 c1 f8          add    $0xfffffffffffffff8,%rcx
 602: 48 83 f9 08          cmp    $0x8,%rcx
 606: 73 e8                jae    5f0
 608: 48 83 f9 04          cmp    $0x4,%rcx
 60c: 72 11                jb     61f
 60e: 41 8b 10             mov    (%r8),%edx
 611: 89 17                mov    %edx,(%rdi)
 613: 48 83 c7 04          add    $0x4,%rdi
 617: 49 83 c0 04          add    $0x4,%r8
 61b: 48 83 c1 fc          add    $0xfffffffffffffffc,%rcx
 61f: 48 83 f9 02          cmp    $0x2,%rcx
 623: 72 13                jb     638
 625: 41 0f b7 10          movzwl (%r8),%edx
 629: 66 89 17             mov    %dx,(%rdi)
 62c: 48 83 c7 02          add    $0x2,%rdi
 630: 49 83 c0 02          add    $0x2,%r8
 634: 48 83 c1 fe          add    $0xfffffffffffffffe,%rcx
 638: 48 85 c9             test   %rcx,%rcx
 63b: 74 05                je     642
 63d: 41 8a 08             mov    (%r8),%cl
 640: 88 0f                mov    %cl,(%rdi)

As I said, it's minor; I am sure you can come up with something much better!

Thanks !

>

>
> > FWIW, this
> > void bar(unsigned);
> > void foo(unsigned n)
> > {
> > while (n >= 8) {
> > bar(n);
> > n -= 8;
> > }
> > while (n >= 4) {
> > bar(n);
> > n -= 4;
> > }
> > while (n >= 2) {
> > bar(n);
> > n -= 2;
> > }
> > while (n >= 1) {
> > bar(n);
> > n -= 1;
> > }
> > }
> >
> > will compile (with -O2) to
> > pushq   %rbp

Re: [PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-16 Thread Eric Dumazet
On Fri, Apr 16, 2021 at 9:44 PM Al Viro  wrote:
>
> On Fri, Apr 16, 2021 at 12:24:13PM -0700, Eric Dumazet wrote:
> > From: Eric Dumazet 
> >
> > We have to loop only to copy u64 values.
> > After this first loop, we copy at most one u32, one u16 and one byte.
>
> Does it actually yield a better code?
>

Yes, my patch gives better code, on an actual kernel use-case

(net-next tree, look at put_cmsg())

 5ca: 48 89 0f             mov    %rcx,(%rdi)
 5cd: 89 77 08             mov    %esi,0x8(%rdi)
 5d0: 89 57 0c             mov    %edx,0xc(%rdi)
 5d3: 48 83 c7 10          add    $0x10,%rdi
 5d7: 48 83 c1 f0          add    $0xfffffffffffffff0,%rcx
 5db: 48 83 f9 07          cmp    $0x7,%rcx
 5df: 76 40                jbe    621
 5e1: 66 66 66 66 66 66 2e data16 data16 data16 data16 data16 nopw
%cs:0x0(%rax,%rax,1)
 5e8: 0f 1f 84 00 00 00 00
 5ef: 00
 5f0: 49 8b 10             mov    (%r8),%rdx
 5f3: 48 89 17             mov    %rdx,(%rdi)
 5f6: 48 83 c7 08          add    $0x8,%rdi
 5fa: 49 83 c0 08          add    $0x8,%r8
 5fe: 48 83 c1 f8          add    $0xfffffffffffffff8,%rcx
 602: 48 83 f9 07          cmp    $0x7,%rcx
 606: 77 e8                ja     5f0
 608: eb 17                jmp    621
 60a: 66 0f 1f 44 00 00    nopw   0x0(%rax,%rax,1)
 610: 41 8b 10             mov    (%r8),%edx
 613: 89 17                mov    %edx,(%rdi)
 615: 48 83 c7 04          add    $0x4,%rdi
 619: 49 83 c0 04          add    $0x4,%r8
 61d: 48 83 c1 fc          add    $0xfffffffffffffffc,%rcx
 621: 48 83 f9 03          cmp    $0x3,%rcx
 625: 77 e9                ja     610
 627: eb 1a                jmp    643
 629: 0f 1f 80 00 00 00 00 nopl   0x0(%rax)
 630: 41 0f b7 10          movzwl (%r8),%edx
 634: 66 89 17             mov    %dx,(%rdi)
 637: 48 83 c7 02          add    $0x2,%rdi
 63b: 49 83 c0 02          add    $0x2,%r8
 63f: 48 83 c1 fe          add    $0xfffffffffffffffe,%rcx
 643: 48 83 f9 01          cmp    $0x1,%rcx
 647: 77 e7                ja     630
 649: eb 15                jmp    660
 64b: 0f 1f 44 00 00       nopl   0x0(%rax,%rax,1)
 650: 41 0f b6 08          movzbl (%r8),%ecx
 654: 88 0f                mov    %cl,(%rdi)
 656: 48 83 c7 01          add    $0x1,%rdi
 65a: 49 83 c0 01          add    $0x1,%r8
 65e: 31 c9                xor    %ecx,%ecx
 660: 48 85 c9             test   %rcx,%rcx
 663: 75 eb                jne    650


> FWIW, this
> void bar(unsigned);
> void foo(unsigned n)
> {
> while (n >= 8) {
> bar(n);
> n -= 8;
> }
> while (n >= 4) {
> bar(n);
> n -= 4;
> }
> while (n >= 2) {
> bar(n);
> n -= 2;
> }
> while (n >= 1) {
> bar(n);
> n -= 1;
> }
> }
>
> will compile (with -O2) to
> pushq   %rbp
> pushq   %rbx
> movl%edi, %ebx
> subq$8, %rsp
> cmpl$7, %edi
> jbe .L2
> movl%edi, %ebp
> .L3:
> movl%ebp, %edi
> subl$8, %ebp
> callbar@PLT
> cmpl$7, %ebp
> ja  .L3
> andl$7, %ebx
> .L2:
> cmpl$3, %ebx
> jbe .L4
> movl%ebx, %edi
> andl$3, %ebx
> callbar@PLT
> .L4:
> cmpl$1, %ebx
> jbe .L5
> movl%ebx, %edi
> andl$1, %ebx
> callbar@PLT
> .L5:
> testl   %ebx, %ebx
> je  .L1
> addq$8, %rsp
> movl$1, %edi
> popq%rbx
> popq%rbp
> jmp bar@PLT
> .L1:
> addq$8, %rsp
> popq%rbx
> popq%rbp
> ret
>
> i.e. loop + if + if + if...


[PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-16 Thread Eric Dumazet
From: Eric Dumazet 

We have to loop only to copy u64 values.
After this first loop, we copy at most one u32, one u16 and one byte.

Signed-off-by: Eric Dumazet 
---
 arch/x86/include/asm/uaccess.h | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 
c9fa7be3df82ddb9495961b3e2f22b1ac07edafa..ddb19bb8c86786d78407dcfb59623943ccbce8a8
 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -517,15 +517,23 @@ do {                                                   \
                len -= sizeof(type);                                    \
        }
 
+#define unsafe_copy_elem(dst, src, len, type, label)                   \
+       if (len >= sizeof(type)) {                                      \
+               unsafe_put_user(*(type *)(src),(type __user *)(dst),label); \
+               dst += sizeof(type);                                    \
+               src += sizeof(type);                                    \
+               len -= sizeof(type);                                    \
+       }
+
 #define unsafe_copy_to_user(_dst,_src,_len,label)  \
 do {   \
char __user *__ucu_dst = (_dst);\
const char *__ucu_src = (_src); \
size_t __ucu_len = (_len);  \
unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u64, label);  \
-   unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u32, label);  \
-   unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u16, label);  \
-   unsafe_copy_loop(__ucu_dst, __ucu_src, __ucu_len, u8, label);   \
+   unsafe_copy_elem(__ucu_dst, __ucu_src, __ucu_len, u32, label);  \
+   unsafe_copy_elem(__ucu_dst, __ucu_src, __ucu_len, u16, label);  \
+   unsafe_copy_elem(__ucu_dst, __ucu_src, __ucu_len, u8, label);   \
 } while (0)
 
 #define HAVE_GET_KERNEL_NOFAULT
-- 
2.31.1.368.gbe11c130af-goog



Re: [PATCH] tcp: fix silent loss when syncookie is trigered

2021-04-16 Thread Eric Dumazet
On Fri, Apr 16, 2021 at 12:52 PM zhaoya  wrote:
>
> When syncookie is triggered, since $MSSID is spliced into the cookie and
> the legal index of msstab is 0,1,2,3, this gives the client 3 bytes
> of freedom, resulting in at most 3 bytes of silent loss.
>
> C --- seq=12345 ----------------> S
> C <-- seq=cookie/ack=12346 ------ S   S generated the cookie
>                                       [RFC4987 Appendix A]
> C --- seq=12346/ack=cookie+1 --X  S   The first byte was lost.
> C --- seq=12347/ack=cookie+1 ---> S   The second byte was received,
>                                       cookie-check was still okay and
>                                       handshake was finished.
> C 

Re: [PATCH 4.14 16/68] net: ensure mac header is set in virtio_net_hdr_to_skb()

2021-04-16 Thread Eric Dumazet
On Fri, Apr 16, 2021 at 10:49 AM Balazs Nemeth  wrote:
>
> On Thu, 2021-04-15 at 16:46 +0200, Greg Kroah-Hartman wrote:
> > From: Eric Dumazet 
> >
> > commit 61431a5907fc36d0738e9a547c7e1556349a03e9 upstream.
> >
> > Commit 924a9bc362a5 ("net: check if protocol extracted by
> > virtio_net_hdr_set_proto is correct")
> > added a call to dev_parse_header_protocol() but mac_header is not yet
> > set.
> >
> > This means that eth_hdr() reads complete garbage, and syzbot
> > complained about it [1]
> >
> > This patch resets mac_header earlier, to get more coverage about this
> > change.
> >
> > Audit of virtio_net_hdr_to_skb() callers shows that this change
> > should be safe.
> >
> > [1]
> >
> > BUG: KASAN: use-after-free in eth_header_parse_protocol+0xdc/0xe0
> > net/ethernet/eth.c:282
> > Read of size 2 at addr 888017a6200b by task syz-executor313/8409
> >
> > CPU: 1 PID: 8409 Comm: syz-executor313 Not tainted 5.12.0-rc2-
> > syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > BIOS Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:79 [inline]
> >  dump_stack+0x141/0x1d7 lib/dump_stack.c:120
> >  print_address_description.constprop.0.cold+0x5b/0x2f8
> > mm/kasan/report.c:232
> >  __kasan_report mm/kasan/report.c:399 [inline]
> >  kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
> >  eth_header_parse_protocol+0xdc/0xe0 net/ethernet/eth.c:282
> >  dev_parse_header_protocol include/linux/netdevice.h:3177 [inline]
> >  virtio_net_hdr_to_skb.constprop.0+0x99d/0xcd0
> > include/linux/virtio_net.h:83
> >  packet_snd net/packet/af_packet.c:2994 [inline]
> >  packet_sendmsg+0x2325/0x52b0 net/packet/af_packet.c:3031
> >  sock_sendmsg_nosec net/socket.c:654 [inline]
> >  sock_sendmsg+0xcf/0x120 net/socket.c:674
> >  sock_no_sendpage+0xf3/0x130 net/core/sock.c:2860
> >  kernel_sendpage.part.0+0x1ab/0x350 net/socket.c:3631
> >  kernel_sendpage net/socket.c:3628 [inline]
> >  sock_sendpage+0xe5/0x140 net/socket.c:947
> >  pipe_to_sendpage+0x2ad/0x380 fs/splice.c:364
> >  splice_from_pipe_feed fs/splice.c:418 [inline]
> >  __splice_from_pipe+0x43e/0x8a0 fs/splice.c:562
> >  splice_from_pipe fs/splice.c:597 [inline]
> >  generic_splice_sendpage+0xd4/0x140 fs/splice.c:746
> >  do_splice_from fs/splice.c:767 [inline]
> >  do_splice+0xb7e/0x1940 fs/splice.c:1079
> >  __do_splice+0x134/0x250 fs/splice.c:1144
> >  __do_sys_splice fs/splice.c:1350 [inline]
> >  __se_sys_splice fs/splice.c:1332 [inline]
> >  __x64_sys_splice+0x198/0x250 fs/splice.c:1332
> >  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> >
> > Fixes: 924a9bc362a5 ("net: check if protocol extracted by
> > virtio_net_hdr_set_proto is correct")
> > Signed-off-by: Eric Dumazet 
> > Cc: Balazs Nemeth 
> > Cc: Willem de Bruijn 
> > Reported-by: syzbot 
> > Signed-off-by: David S. Miller 
> > Signed-off-by: Greg Kroah-Hartman 
> > ---
> >  include/linux/virtio_net.h |2 ++
> >  1 file changed, 2 insertions(+)
> >
> > --- a/include/linux/virtio_net.h
> > +++ b/include/linux/virtio_net.h
> > @@ -62,6 +62,8 @@ static inline int virtio_net_hdr_to_skb(
> > return -EINVAL;
> > }
> >
> > +   skb_reset_mac_header(skb);
> > +
> > if (hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> > u16 start = __virtio16_to_cpu(little_endian, hdr-
> > >csum_start);
> > u16 off = __virtio16_to_cpu(little_endian, hdr-
> > >csum_offset);
> >
> >
>
> Hi,
>
> Since the call to dev_parse_header_protocol is only made for gso
> packets where skb->protocol is not set, we could move
> skb_reset_mac_header down closer to that call. Is there another reason
> to reset mac_header earlier (and affect handling of other packets as
> well)? In any case, thanks for spotting this!
>

The answer to your question was in the changelog

"This patch resets mac_header earlier, to get more coverage about this change."

We want to detect if such a reset is going to hurt in general, not
only for GSO packets.


Re: [PROBLEM] a data race between tcp_set_default_congestion_control() and tcp_set_congestion_control()

2021-04-15 Thread Eric Dumazet
On Thu, Apr 15, 2021 at 5:47 PM Gong, Sishuai  wrote:
>
> Hi,
>
> We found a data race between tcp_set_default_congestion_control() and 
> tcp_set_congestion_control() in linux-5.12-rc3.
> In general, when tcp_set_congestion_control() is reading ca->flags with a 
> lock grabbed, tcp_set_default_congestion_control()
> may be updating ca->flags at the same time, as shown below.
>
> When the writer and reader are running in parallel, 
> tcp_set_congestion_control()’s control flow
> might be non-deterministic, either returning a -EPERM or calling 
> tcp_reinit_congestion_control().
>
> We also notice in tcp_set_allowed_congestion_control(), the write to 
> ca->flags is protected by tcp_cong_list_lock,
> so we want to point it out in case the data race is unexpected.
>
> Thread 1                                 Thread 2
> //tcp_set_default_congestion_control()   //tcp_set_congestion_control()
>                                          // lock_sock() grabbed
>                                          if (!((ca->flags & TCP_CONG_NON_RESTRICTED) || cap_net_admin))
>                                                  err = -EPERM;
>                                          else if (!bpf_try_module_get(ca, ca->owner))
>                                                  err = -EBUSY;
>                                          else
>                                                  tcp_reinit_congestion_control(sk, ca);
> ca->flags |= TCP_CONG_NON_RESTRICTED;
>
>
>
> Thanks,
> Sishuai
>

Yes, obviously reading ca->flags while another thread might set the bit is racy.

This is of no consequence; if you want to silence KCSAN, please send a patch.
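
Such a patch would typically just annotate the racy accesses, along
these lines (an untested sketch):

	/* reader side, in tcp_set_congestion_control(): */
	if (!((READ_ONCE(ca->flags) & TCP_CONG_NON_RESTRICTED) || cap_net_admin))
		err = -EPERM;

	/* writer side, in tcp_set_default_congestion_control(): */
	WRITE_ONCE(ca->flags, ca->flags | TCP_CONG_NON_RESTRICTED);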


[tip: sched/core] rseq: Optimize rseq_update_cpu_id()

2021-04-15 Thread tip-bot2 for Eric Dumazet
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 60af388d23889636011488c42763876bcdda3eab
Gitweb:
https://git.kernel.org/tip/60af388d23889636011488c42763876bcdda3eab
Author:Eric Dumazet 
AuthorDate:Tue, 13 Apr 2021 13:33:50 -07:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 14 Apr 2021 18:04:09 +02:00

rseq: Optimize rseq_update_cpu_id()

Two put_user() in rseq_update_cpu_id() are replaced
by a pair of unsafe_put_user() with appropriate surroundings.

This removes one stac/clac pair on x86 in fast path.

Signed-off-by: Eric Dumazet 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Link: https://lkml.kernel.org/r/20210413203352.71350-2-eric.duma...@gmail.com
---
 kernel/rseq.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index a4f86a9..f020f18 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,13 +84,20 @@
 static int rseq_update_cpu_id(struct task_struct *t)
 {
u32 cpu_id = raw_smp_processor_id();
+   struct rseq __user *rseq = t->rseq;
 
-   if (put_user(cpu_id, &t->rseq->cpu_id_start))
-   return -EFAULT;
-   if (put_user(cpu_id, &t->rseq->cpu_id))
-   return -EFAULT;
+   if (!user_write_access_begin(rseq, sizeof(*rseq)))
+   goto efault;
+   unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
+   unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+   user_write_access_end();
trace_rseq_update(t);
return 0;
+
+efault_end:
+   user_write_access_end();
+efault:
+   return -EFAULT;
 }
 
 static int rseq_reset_rseq_cpu_id(struct task_struct *t)


[tip: sched/core] rseq: Remove redundant access_ok()

2021-04-15 Thread tip-bot2 for Eric Dumazet
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 0ed96051531ecc6965f6456d25b19b9b6bdb5c28
Gitweb:
https://git.kernel.org/tip/0ed96051531ecc6965f6456d25b19b9b6bdb5c28
Author:Eric Dumazet 
AuthorDate:Tue, 13 Apr 2021 13:33:51 -07:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 14 Apr 2021 18:04:09 +02:00

rseq: Remove redundant access_ok()

After commit 8f2817701492 ("rseq: Use get_user/put_user rather
than __get_user/__put_user") we no longer need
an access_ok() call from __rseq_handle_notify_resume()

Mathieu pointed out the same cleanup can be done
in rseq_syscall().

Signed-off-by: Eric Dumazet 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Link: https://lkml.kernel.org/r/20210413203352.71350-3-eric.duma...@gmail.com
---
 kernel/rseq.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index f020f18..cfe01ab 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, 
struct pt_regs *regs)
 
if (unlikely(t->flags & PF_EXITING))
return;
-   if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
-   goto error;
ret = rseq_ip_fixup(regs);
if (unlikely(ret < 0))
goto error;
@@ -301,8 +299,7 @@ void rseq_syscall(struct pt_regs *regs)
 
if (!t->rseq)
return;
-   if (!access_ok(t->rseq, sizeof(*t->rseq)) ||
-   rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
+   if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
force_sig(SIGSEGV);
 }
 


[tip: sched/core] rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-15 Thread tip-bot2 for Eric Dumazet
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 5e0ccd4a3b01c5a71732a13186ca110a138516ea
Gitweb:
https://git.kernel.org/tip/5e0ccd4a3b01c5a71732a13186ca110a138516ea
Author:Eric Dumazet 
AuthorDate:Tue, 13 Apr 2021 13:33:52 -07:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 14 Apr 2021 18:04:09 +02:00

rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()

Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
update includes") added regressions for our servers.

Using copy_from_user() and clear_user() for 64bit values
is suboptimal.

We can use faster put_user() and get_user() on 64bit arches.

Signed-off-by: Eric Dumazet 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Mathieu Desnoyers 
Link: https://lkml.kernel.org/r/20210413203352.71350-4-eric.duma...@gmail.com
---
 kernel/rseq.c |  9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index cfe01ab..35f7bd0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -127,8 +127,13 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct 
rseq_cs *rseq_cs)
u32 sig;
int ret;
 
+#ifdef CONFIG_64BIT
+   if (get_user(ptr, &t->rseq->rseq_cs.ptr64))
+   return -EFAULT;
+#else
if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
return -EFAULT;
+#endif
if (!ptr) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
@@ -211,9 +216,13 @@ static int clear_rseq_cs(struct task_struct *t)
 *
 * Set rseq_cs to NULL.
 */
+#ifdef CONFIG_64BIT
+   return put_user(0UL, &t->rseq->rseq_cs.ptr64);
+#else
if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
return -EFAULT;
return 0;
+#endif
 }
 
 /*


Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-14 Thread Eric Dumazet
On Wed, Apr 14, 2021 at 10:09 PM Shakeel Butt  wrote:

>
> I will let TCP RX zerocopy experts respond to this but from my high
> level code inspection, I didn't see page->private usage.

Indeed, we do not use page->private, since we do not own the page(s).


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-14 Thread Eric Dumazet
On Wed, Apr 14, 2021 at 10:15 PM Arjun Roy  wrote:
>
> On Wed, Apr 14, 2021 at 10:35 AM Eric Dumazet  wrote:
> >
> > On Wed, Apr 14, 2021 at 7:15 PM Arjun Roy  wrote:
> > >
> > > On Wed, Apr 14, 2021 at 9:10 AM Eric Dumazet  wrote:
> > > >
> > > > On Wed, Apr 14, 2021 at 6:08 PM David Laight  
> > > > wrote:
> > > > >
> > > > > From: Eric Dumazet
> > > > > > Sent: 14 April 2021 17:00
> > > > > ...
> > > > > > > Repeated unsafe_get_user() calls are crying out for an 
> > > > > > > optimisation.
> > > > > > > You get something like:
> > > > > > > failed = 0;
> > > > > > > copy();
> > > > > > > if (failed) goto error;
> > > > > > > copy();
> > > > > > > if (failed) goto error;
> > > > > > > Where 'failed' is set by the fault handler.
> > > > > > >
> > > > > > > This could be optimised to:
> > > > > > > failed = 0;
> > > > > > > copy();
> > > > > > > copy();
> > > > > > > if (failed) goto error;
> > > > > > > Even if it faults on every invalid address it probably
> > > > > > > doesn't matter - no one cares about that path.
> > > > > >
> > > > > >
> > > > > > On which arch are you looking at ?
> > > > > >
> > > > > > On x86_64 at least, code generation is just perfect.
> > > > > > Not even a conditional jmp, it is all handled by exceptions (if any)
> > > > > >
> > > > > > stac
> > > > > > copy();
> > > > > > copy();
> > > > > > clac
> > > > > >
> > > > > >
> > > > > > 
> > > > > > efault_end: do error recovery.
> > > > >
> > > > > It will be x86_64.
> > > > > I'm definitely seeing repeated tests of (IIRC) %rdx.
> > > > >
> > > > > It may well be because the compiler isn't very new.
> > > > > Will be an Ubuntu build of 9.3.0.
> > > > > Does that support 'asm goto with outputs' - which
> > > > > may be the difference.
> > > > >
> > > >
> > > > Yep, probably. I am using some recent clang version.
> > > >
> > >
> > > On x86-64 I can confirm, for me it (4 x unsafe_get_user()) compiles
> > > down to stac + lfence + 8 x mov + clac as straight line code. And
> > > results in roughly a 5%-10% speedup over copy_from_user().
> > >
> >
> > But rseq_get_rseq_cs() would still need three different copies,
> > with 3 stac+lfence+clac sequences.
> >
> > Maybe we need to enclose all __rseq_handle_notify_resume() operations
> > in a single section.
> >
> >
>
> To provide a bit of further exposition on this point, if you do 4x
> unsafe_get_user() recall I mentioned a 5-10% improvement. On the other
> hand, 4x normal get_user() I saw something like a 100% (ie. doubling
> of sys time measured) regression.
>
> I assume that's the fault of multiple stac+clac.


I was suggesting only using unsafe_get_user() and unsafe_put_user(),
and one surrounding stac/clac

Basically what we had (partially) in our old Google kernels, before
commit 8f2817701492 ("rseq: Use get_user/put_user rather than
__get_user/__put_user")
but with all the needed modern stuff.
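
Roughly this shape (a sketch of the single-section idea, assuming a
64-bit arch; error labels as in kernel/rseq.c):

	u64 ptr;

	if (!user_read_access_begin(t->rseq, sizeof(*t->rseq)))
		return -EFAULT;
	unsafe_get_user(ptr, &t->rseq->rseq_cs.ptr64, efault_end);
	/* ... the remaining reads, still under the same stac/clac ... */
	user_read_access_end();
	return 0;

efault_end:
	user_read_access_end();
	return -EFAULT;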


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-14 Thread Eric Dumazet
On Wed, Apr 14, 2021 at 7:15 PM Arjun Roy  wrote:
>
> On Wed, Apr 14, 2021 at 9:10 AM Eric Dumazet  wrote:
> >
> > On Wed, Apr 14, 2021 at 6:08 PM David Laight  
> > wrote:
> > >
> > > From: Eric Dumazet
> > > > Sent: 14 April 2021 17:00
> > > ...
> > > > > Repeated unsafe_get_user() calls are crying out for an optimisation.
> > > > > You get something like:
> > > > > failed = 0;
> > > > > copy();
> > > > > if (failed) goto error;
> > > > > copy();
> > > > > if (failed) goto error;
> > > > > Where 'failed' is set by the fault handler.
> > > > >
> > > > > This could be optimised to:
> > > > > failed = 0;
> > > > > copy();
> > > > > copy();
> > > > > if (failed) goto error;
> > > > > Even if it faults on every invalid address it probably
> > > > > doesn't matter - no one cares about that path.
> > > >
> > > >
> > > > On which arch are you looking at ?
> > > >
> > > > On x86_64 at least, code generation is just perfect.
> > > > Not even a conditional jmp, it is all handled by exceptions (if any)
> > > >
> > > > stac
> > > > copy();
> > > > copy();
> > > > clac
> > > >
> > > >
> > > > 
> > > > efault_end: do error recovery.
> > >
> > > It will be x86_64.
> > > I'm definitely seeing repeated tests of (IIRC) %rdx.
> > >
> > > It may well be because the compiler isn't very new.
> > > Will be an Ubuntu build of 9.3.0.
> > > Does that support 'asm goto with outputs' - which
> > > may be the difference.
> > >
> >
> > Yep, probably. I am using some recent clang version.
> >
>
> On x86-64 I can confirm, for me it (4 x unsafe_get_user()) compiles
> down to stac + lfence + 8 x mov + clac as straight line code. And
> results in roughly a 5%-10% speedup over copy_from_user().
>

But rseq_get_rseq_cs() would still need three different copies,
with 3 stac+lfence+clac sequences.

Maybe we need to enclose all __rseq_handle_notify_resume() operations
in a single section.







> -Arjun
>
>
> > > David
> > >
> > > -
> > > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> > > 1PT, UK
> > > Registration No: 1397386 (Wales)


[PATCH] sh: remove unused variable

2021-04-14 Thread Eric Dumazet
From: Eric Dumazet 

Removes this annoying warning:

arch/sh/kernel/traps.c: In function ‘nmi_trap_handler’:
arch/sh/kernel/traps.c:183:15: warning: unused variable ‘cpu’ 
[-Wunused-variable]
  183 |  unsigned int cpu = smp_processor_id();

Fixes: fe3f1d5d7cd3 ("sh: Get rid of nmi_count()")
Signed-off-by: Eric Dumazet 
Cc: Thomas Gleixner 
Cc: Frederic Weisbecker 
Cc: Yoshinori Sato 
Cc: Rich Felker 
---
 arch/sh/kernel/traps.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/sh/kernel/traps.c b/arch/sh/kernel/traps.c
index 
f5beecdac69382f2d719fa33d50b9d58e22f6ff8..e76b221570999776e3bc9276d6b2fd60b9132e94
 100644
--- a/arch/sh/kernel/traps.c
+++ b/arch/sh/kernel/traps.c
@@ -180,7 +180,6 @@ static inline void arch_ftrace_nmi_exit(void) { }
 
 BUILD_TRAP_HANDLER(nmi)
 {
-   unsigned int cpu = smp_processor_id();
TRAP_HANDLER_DECL;
 
arch_ftrace_nmi_enter();
-- 
2.31.1.295.g9ea45b61b8-goog



Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-14 Thread Eric Dumazet
On Wed, Apr 14, 2021 at 6:08 PM David Laight  wrote:
>
> From: Eric Dumazet
> > Sent: 14 April 2021 17:00
> ...
> > > Repeated unsafe_get_user() calls are crying out for an optimisation.
> > > You get something like:
> > > failed = 0;
> > > copy();
> > > if (failed) goto error;
> > > copy();
> > > if (failed) goto error;
> > > Where 'failed' is set by the fault handler.
> > >
> > > This could be optimised to:
> > > failed = 0;
> > > copy();
> > > copy();
> > > if (failed) goto error;
> > > Even if it faults on every invalid address it probably
> > > doesn't matter - no one cares about that path.
> >
> >
> > On which arch are you looking at ?
> >
> > On x86_64 at least, code generation is just perfect.
> > Not even a conditional jmp, it is all handled by exceptions (if any)
> >
> > stac
> > copy();
> > copy();
> > clac
> >
> >
> > 
> > efault_end: do error recovery.
>
> It will be x86_64.
> I'm definitely seeing repeated tests of (IIRC) %rdx.
>
> It may well be because the compiler isn't very new.
> Will be an Ubuntu build of 9.3.0.
> Does that support 'asm goto with outputs' - which
> may be the difference.
>

Yep, probably. I am using some recent clang version.

> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-14 Thread Eric Dumazet
On Wed, Apr 14, 2021 at 9:55 AM David Laight  wrote:
>
> From: Arjun Roy
> > Sent: 13 April 2021 23:04
> >
> > On Tue, Apr 13, 2021 at 2:19 PM David Laight  
> > wrote:
> > >
> > > > If we're special-casing 64-bit architectures anyways - unrolling the
> > > > 32B copy_from_user() for struct rseq_cs appears to be roughly 5-10%
> > > > savings on x86-64 when I measured it (well, in a microbenchmark, not
> > > > in rseq_get_rseq_cs() directly). Perhaps that could be an additional
> > > > avenue for improvement here.
> > >
> > > The killer is usually 'user copy hardening'.
> > > It significantly slows down sendmsg() and recvmsg().
> > > I've got measurable performance improvements by
> > > using __copy_from_user() when the buffer size has
> > > already been checked - but isn't a compile-time constant.
> > >
> > > There is also scope for using _get_user() when reading
> > > iovec[] (instead of copy_from_user()) and doing all the
> > > bound checks (etc) in the loop.
> > > That gives a measurable improvement for writev("/dev/null").
> > > I must sort those patches out again.
> > >
> > > David
> > >
> >
> > In this case I mean replacing copy_from_user(rseq_cs, urseq_cs,
> > sizeof(*rseq_cs)) with  4 (x8B=32 total) unsafe_get_user() (wrapped in
> > user_read_access_begin/end()) which I think would just bypass user
> > copy hardening (as far as I can tell).
>
> Yes that is one advantage over any of the get_user() calls.
> You also lose all the 'how shall we optimise this' checks
> in copy_from_user().
>
> Repeated unsafe_get_user() calls are crying out for an optimisation.
> You get something like:
> failed = 0;
> copy();
> if (failed) goto error;
> copy();
> if (failed) goto error;
> Where 'failed' is set by the fault handler.
>
> This could be optimised to:
> failed = 0;
> copy();
> copy();
> if (failed) goto error;
> Even if it faults on every invalid address it probably
> doesn't matter - no one cares about that path.


On which arch are you looking at ?

On x86_64 at least, code generation is just perfect.
Not even a conditional jmp, it is all handled by exceptions (if any)

stac
copy();
copy();
clac



efault_end: do error recovery.



>
> I've not really looked at how it could be achieved though.
>
> It might be that the 'asm goto with outputs' variant
> manages to avoid the compare and jump.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)


[PATCH v3 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
update includes") added regressions for our servers.

Using copy_from_user() and clear_user() for 64bit values
is suboptimal.

We can use faster put_user() and get_user() on 64bit arches.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6..35f7bd0fced0e2dd8aed819e054dac03f024388a
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -127,8 +127,13 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct 
rseq_cs *rseq_cs)
u32 sig;
int ret;
 
+#ifdef CONFIG_64BIT
+   if (get_user(ptr, &t->rseq->rseq_cs.ptr64))
+   return -EFAULT;
+#else
if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
return -EFAULT;
+#endif
if (!ptr) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
@@ -211,9 +216,13 @@ static int clear_rseq_cs(struct task_struct *t)
 *
 * Set rseq_cs to NULL.
 */
+#ifdef CONFIG_64BIT
+   return put_user(0UL, &t->rseq->rseq_cs.ptr64);
+#else
if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
return -EFAULT;
return 0;
+#endif
 }
 
 /*
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v3 0/3] rseq: minor optimizations

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

rseq is a heavy user of copy to/from user data in fast paths.
This series tries to reduce the cost.

v3: Third patch going back to v1 (only deal with 64bit arches)
v2: Addressed Peter and Mathieu feedbacks, thanks !

Eric Dumazet (3):
  rseq: optimize rseq_update_cpu_id()
  rseq: remove redundant access_ok()
  rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

 kernel/rseq.c | 29 +
 1 file changed, 21 insertions(+), 8 deletions(-)

-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v3 1/3] rseq: optimize rseq_update_cpu_id()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Two put_user() in rseq_update_cpu_id() are replaced
by a pair of unsafe_put_user() with appropriate surroundings.

This removes one stac/clac pair on x86 in fast path.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
a4f86a9d6937cdfa2f13d1dcc9be863c1943d06f..f020f18f512a3f6241c3c9b104ce50e4d2c6188c
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,13 +84,20 @@
 static int rseq_update_cpu_id(struct task_struct *t)
 {
u32 cpu_id = raw_smp_processor_id();
+   struct rseq __user *rseq = t->rseq;
 
-   if (put_user(cpu_id, &t->rseq->cpu_id_start))
-   return -EFAULT;
-   if (put_user(cpu_id, &t->rseq->cpu_id))
-   return -EFAULT;
+   if (!user_write_access_begin(rseq, sizeof(*rseq)))
+   goto efault;
+   unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
+   unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+   user_write_access_end();
trace_rseq_update(t);
return 0;
+
+efault_end:
+   user_write_access_end();
+efault:
+   return -EFAULT;
 }
 
 static int rseq_reset_rseq_cpu_id(struct task_struct *t)
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v3 2/3] rseq: remove redundant access_ok()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

After commit 8f2817701492 ("rseq: Use get_user/put_user rather
than __get_user/__put_user") we no longer need
an access_ok() call from __rseq_handle_notify_resume()

Mathieu pointed out the same cleanup can be done
in rseq_syscall().

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
f020f18f512a3f6241c3c9b104ce50e4d2c6188c..cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, 
struct pt_regs *regs)
 
if (unlikely(t->flags & PF_EXITING))
return;
-   if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
-   goto error;
ret = rseq_ip_fixup(regs);
if (unlikely(ret < 0))
goto error;
@@ -301,8 +299,7 @@ void rseq_syscall(struct pt_regs *regs)
 
if (!t->rseq)
return;
-   if (!access_ok(t->rseq, sizeof(*t->rseq)) ||
-   rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
+   if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
force_sig(SIGSEGV);
 }
 
-- 
2.31.1.295.g9ea45b61b8-goog



Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 8:00 PM Mathieu Desnoyers
 wrote:
>

> As long as the ifdefs are localized within clearly identified wrappers in the
> rseq code I don't mind doing the special-casing there.
>
> The point which remains is that I don't think we want to optimize for speed
> on 32-bit architectures when it adds special-casing and complexity to the 
> 32-bit
> build. I suspect there is less and less testing performed on 32-bit 
> architectures
> nowadays, and it's good that as much code as possible is shared between 
> 32-bit and
> 64-bit builds to share the test coverage.
>

Quite frankly V1 was fine, I can't really make it look better.

> Thanks,
>
> Mathieu
>
> >
> >
> >>
> >> Thanks,
> >>
> >> Mathieu
> >>
> >> >
> >> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> >> > index
> >> > f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011
> >> > 100644
> >> > --- a/kernel/rseq.c
> >> > +++ b/kernel/rseq.c
> >> > @@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user 
> >> > **uptrp,
> >> > {
> >> >u32 ptr;
> >> >
> >> > +   if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
> >> > +   return -EFAULT;
> >> > +   if (ptr)
> >> > +   return -EINVAL;
> >> >if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
> >> >return -EFAULT;
> >> >*uptrp = (struct rseq_cs __user *)ptr;
> >> > @@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t,
> >> > struct rseq_cs *rseq_cs)
> >> >u32 sig;
> >> >int ret;
> >> >
> >> > -   if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
> >> > -   return -EFAULT;
> >> > +   ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
> >> > +   if (ret)
> >> > +   return ret;
> >> >if (!urseq_cs) {
> >> >memset(rseq_cs, 0, sizeof(*rseq_cs));
> >> >return 0;
> >> > @@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
> >> > #ifdef CONFIG_64BIT
> >> >return put_user(0UL, &t->rseq->rseq_cs.ptr64);
> >> > #else
> >> > -   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
> >> > +   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
> >> > +  put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
> >> > #endif
> >> >  }
> >>
> >> --
> >> Mathieu Desnoyers
> >> EfficiOS Inc.
> > > http://www.efficios.com
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 7:20 PM Mathieu Desnoyers
 wrote:
>
> - On Apr 13, 2021, at 1:07 PM, Eric Dumazet eduma...@google.com wrote:
>
> > On Tue, Apr 13, 2021 at 7:01 PM Eric Dumazet  wrote:
> >>
> >> On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet  wrote:
> >> >
> >> > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
> >> >  wrote:
> >> > >
> >> > > - On Apr 13, 2021, at 12:22 PM, Eric Dumazet 
> >> > > eric.duma...@gmail.com wrote:
> >> > >
> >> > > > From: Eric Dumazet 
> >> > > >
> >> > > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> >> > > > update includes") added regressions for our servers.
> >> > > >
> >> > > > Using copy_from_user() and clear_user() for 64bit values
> >> > > > is suboptimal.
> >> > > >
> >> > > > We can use faster put_user() and get_user().
> >> > > >
> >> > > > 32bit arches can be changed to use the ptr32 field,
> >> > > > since the padding field must always be zero.
> >> > > >
> >> > > > v2: added ideas from Peter and Mathieu about making this
> >> > > >generic, since my initial patch was only dealing with
> >> > > >64bit arches.
> >> > >
> >> > > Ah, now I remember the reason why reading and clearing the entire 
> >> > > 64-bit
> >> > > is important: it's because we don't want to allow user-space processes 
> >> > > to
> >> > > use this change in behavior to figure out whether they are running on a
> >> > > 32-bit or in a 32-bit compat mode on a 64-bit kernel.
> >> > >
> >> > > So although I'm fine with making 64-bit kernels faster, we'll want to 
> >> > > keep
> >> > > updating the entire 64-bit ptr field on 32-bit kernels as well.
> >> > >
> >> > > Thanks,
> >> > >
> >> >
> >> > So... back to V1 then ?
> >>
> >> Or add more stuff as in :
> >
> > diff against v2, WDYT ?
>
> I like this approach slightly better, because it moves the preprocessor 
> ifdefs into
> rseq_get_rseq_cs and clear_rseq_cs, while keeping the same behavior for a 
> 32-bit
> process running on native 32-bit kernel and as compat task on a 64-bit kernel.
>
> That being said, I don't expect anyone to care much about performance of 
> 32-bit
> kernels, so we could use copy_from_user() on 32-bit kernels to remove 
> special-cases
> in 32-bit specific code. This would eliminate the 32-bit specific "padding" 
> read, and
> let the TASK_SIZE comparison handle the check for both 32-bit and 64-bit 
> kernels.
>
> As for clear_user(), I wonder whether we could simply keep using it, but 
> change the
> clear_user() macro to figure out that it can use a faster 8-byte put_user ? I 
> find it
> odd that performance optimizations which would be relevant elsewhere creep 
> into the
> rseq code.


clear_user() is a maze of arch-dependent macros/functions/assembly.

I guess the same could be said of copy_in_user(), but apparently we removed
special-casing, as in commit a41e0d754240fe8ca9c4f2070bf67e3b0228aa22

Definitely it seems odd having to carefully choose between multiple methods.


>
> Thanks,
>
> Mathieu
>
> >
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index
> > f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011
> > 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user 
> > **uptrp,
> > {
> >u32 ptr;
> >
> > +   if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
> > +   return -EFAULT;
> > +   if (ptr)
> > +   return -EINVAL;
> >if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
> >return -EFAULT;
> >*uptrp = (struct rseq_cs __user *)ptr;
> > @@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t,
> > struct rseq_cs *rseq_cs)
> >u32 sig;
> >int ret;
> >
> > -   if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
> > -   return -EFAULT;
> > +   ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
> > +   if (ret)
> > +   return ret;
> >if (!urseq_cs) {
> >memset(rseq_cs, 0, sizeof(*rseq_cs));
> >return 0;
> > @@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
> > #ifdef CONFIG_64BIT
> >return put_user(0UL, &t->rseq->rseq_cs.ptr64);
> > #else
> > -   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
> > +   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
> > +  put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
> > #endif
> >  }
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 7:01 PM Eric Dumazet  wrote:
>
> On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet  wrote:
> >
> > On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
> >  wrote:
> > >
> > > - On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.duma...@gmail.com 
> > > wrote:
> > >
> > > > From: Eric Dumazet 
> > > >
> > > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> > > > update includes") added regressions for our servers.
> > > >
> > > > Using copy_from_user() and clear_user() for 64bit values
> > > > is suboptimal.
> > > >
> > > > We can use faster put_user() and get_user().
> > > >
> > > > 32bit arches can be changed to use the ptr32 field,
> > > > since the padding field must always be zero.
> > > >
> > > > v2: added ideas from Peter and Mathieu about making this
> > > >generic, since my initial patch was only dealing with
> > > >64bit arches.
> > >
> > > Ah, now I remember the reason why reading and clearing the entire 64-bit
> > > is important: it's because we don't want to allow user-space processes to
> > > use this change in behavior to figure out whether they are running on a
> > > 32-bit or in a 32-bit compat mode on a 64-bit kernel.
> > >
> > > So although I'm fine with making 64-bit kernels faster, we'll want to keep
> > > updating the entire 64-bit ptr field on 32-bit kernels as well.
> > >
> > > Thanks,
> > >
> >
> > So... back to V1 then ?
>
> Or add more stuff as in :

diff against v2, WDYT ?

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
f2eee3f7f5d330688c81cb2e57d47ca6b843873e..537b1f684efa11069990018ffa3642c209993011
100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -136,6 +136,10 @@ static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
 {
u32 ptr;

+   if (get_user(ptr, &rseq->rseq_cs.ptr.padding))
+   return -EFAULT;
+   if (ptr)
+   return -EINVAL;
if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
return -EFAULT;
*uptrp = (struct rseq_cs __user *)ptr;
@@ -150,8 +154,9 @@ static int rseq_get_rseq_cs(struct task_struct *t,
struct rseq_cs *rseq_cs)
u32 sig;
int ret;

-   if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
-   return -EFAULT;
+   ret = rseq_get_cs_ptr(&urseq_cs, t->rseq);
+   if (ret)
+   return ret;
if (!urseq_cs) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
@@ -237,7 +242,8 @@ static int clear_rseq_cs(struct task_struct *t)
 #ifdef CONFIG_64BIT
return put_user(0UL, &t->rseq->rseq_cs.ptr64);
 #else
-   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
+   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
+  put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
 #endif
 }


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 6:57 PM Eric Dumazet  wrote:
>
> On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
>  wrote:
> >
> > - On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.duma...@gmail.com 
> > wrote:
> >
> > > From: Eric Dumazet 
> > >
> > > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> > > update includes") added regressions for our servers.
> > >
> > > Using copy_from_user() and clear_user() for 64bit values
> > > is suboptimal.
> > >
> > > We can use faster put_user() and get_user().
> > >
> > > 32bit arches can be changed to use the ptr32 field,
> > > since the padding field must always be zero.
> > >
> > > v2: added ideas from Peter and Mathieu about making this
> > >generic, since my initial patch was only dealing with
> > >64bit arches.
> >
> > Ah, now I remember the reason why reading and clearing the entire 64-bit
> > is important: it's because we don't want to allow user-space processes to
> > use this change in behavior to figure out whether they are running on a
> > 32-bit or in a 32-bit compat mode on a 64-bit kernel.
> >
> > So although I'm fine with making 64-bit kernels faster, we'll want to keep
> > updating the entire 64-bit ptr field on 32-bit kernels as well.
> >
> > Thanks,
> >
>
> So... back to V1 then ?

Or add more stuff as in :

#ifdef CONFIG_64BIT
+   return put_user(0UL, &t->rseq->rseq_cs.ptr64);
+#else
+   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32) |
+  put_user(0UL, &t->rseq->rseq_cs.ptr.padding);
+#endif
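
For readers following along, the union being manipulated looks
roughly like this (abridged from include/uapi/linux/rseq.h; the exact
endian guards in the header are more elaborate than shown here):

	union {
		__u64 ptr64;
#ifdef __LP64__
		__u64 ptr;
#else
		struct {
#if __BYTE_ORDER == __BIG_ENDIAN
			__u32 padding;	/* Initialized to zero. */
			__u32 ptr32;
#else
			__u32 ptr32;
			__u32 padding;	/* Initialized to zero. */
#endif
		} ptr;
#endif
	} rseq_cs;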


Re: [PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 6:54 PM Mathieu Desnoyers
 wrote:
>
> - On Apr 13, 2021, at 12:22 PM, Eric Dumazet eric.duma...@gmail.com wrote:
>
> > From: Eric Dumazet 
> >
> > Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
> > update includes") added regressions for our servers.
> >
> > Using copy_from_user() and clear_user() for 64bit values
> > is suboptimal.
> >
> > We can use faster put_user() and get_user().
> >
> > 32bit arches can be changed to use the ptr32 field,
> > since the padding field must always be zero.
> >
> > v2: added ideas from Peter and Mathieu about making this
> >generic, since my initial patch was only dealing with
> >64bit arches.
>
> Ah, now I remember the reason why reading and clearing the entire 64-bit
> is important: it's because we don't want to allow user-space processes to
> use this change in behavior to figure out whether they are running on a
> 32-bit or in a 32-bit compat mode on a 64-bit kernel.
>
> So although I'm fine with making 64-bit kernels faster, we'll want to keep
> updating the entire 64-bit ptr field on 32-bit kernels as well.
>
> Thanks,
>

So... back to V1 then ?


[PATCH v2 3/3] rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
update includes") added regressions for our servers.

Using copy_from_user() and clear_user() for 64bit values
is suboptimal.

We can use faster put_user() and get_user().

32bit arches can be changed to use the ptr32 field,
since the padding field must always be zero.

v2: added ideas from Peter and Mathieu about making this
generic, since my initial patch was only dealing with
64bit arches.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 41 +
 1 file changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6..f2eee3f7f5d330688c81cb2e57d47ca6b843873e
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -119,23 +119,46 @@ static int rseq_reset_rseq_cpu_id(struct task_struct *t)
return 0;
 }
 
+#ifdef CONFIG_64BIT
+static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
+  const struct rseq __user *rseq)
+{
+   u64 ptr;
+
+   if (get_user(ptr, &rseq->rseq_cs.ptr64))
+   return -EFAULT;
+   *uptrp = (struct rseq_cs __user *)ptr;
+   return 0;
+}
+#else
+static int rseq_get_cs_ptr(struct rseq_cs __user **uptrp,
+  const struct rseq __user *rseq)
+{
+   u32 ptr;
+
+   if (get_user(ptr, &rseq->rseq_cs.ptr.ptr32))
+   return -EFAULT;
+   *uptrp = (struct rseq_cs __user *)ptr;
+   return 0;
+}
+#endif
+
 static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
 {
struct rseq_cs __user *urseq_cs;
-   u64 ptr;
u32 __user *usig;
u32 sig;
int ret;
 
-   if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
+   if (rseq_get_cs_ptr(&urseq_cs, t->rseq))
return -EFAULT;
-   if (!ptr) {
+   if (!urseq_cs) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
}
-   if (ptr >= TASK_SIZE)
+   if ((unsigned long)urseq_cs >= TASK_SIZE)
return -EINVAL;
-   urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
+
if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
return -EFAULT;
 
@@ -211,9 +234,11 @@ static int clear_rseq_cs(struct task_struct *t)
 *
 * Set rseq_cs to NULL.
 */
-   if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
-   return -EFAULT;
-   return 0;
+#ifdef CONFIG_64BIT
+   return put_user(0UL, &t->rseq->rseq_cs.ptr64);
+#else
+   return put_user(0UL, &t->rseq->rseq_cs.ptr.ptr32);
+#endif
 }
 
 /*
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 2/3] rseq: remove redundant access_ok()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

After commit 8f2817701492 ("rseq: Use get_user/put_user rather
than __get_user/__put_user") we no longer need
an access_ok() call from __rseq_handle_notify_resume()

Mathieu pointed out the same cleanup can be done
in rseq_syscall().

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
f020f18f512a3f6241c3c9b104ce50e4d2c6188c..cfe01ab5253c1c424c0e8b25acbb6a8e1b41a5b6
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, 
struct pt_regs *regs)
 
if (unlikely(t->flags & PF_EXITING))
return;
-   if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
-   goto error;
ret = rseq_ip_fixup(regs);
if (unlikely(ret < 0))
goto error;
@@ -301,8 +299,7 @@ void rseq_syscall(struct pt_regs *regs)
 
if (!t->rseq)
return;
-   if (!access_ok(t->rseq, sizeof(*t->rseq)) ||
-   rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
+   if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
force_sig(SIGSEGV);
 }
 
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 1/3] rseq: optimize rseq_update_cpu_id()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Two put_user() in rseq_update_cpu_id() are replaced
by a pair of unsafe_put_user() with appropriate surroundings.

This removes one stac/clac pair on x86 in fast path.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
a4f86a9d6937cdfa2f13d1dcc9be863c1943d06f..f020f18f512a3f6241c3c9b104ce50e4d2c6188c
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,13 +84,20 @@
 static int rseq_update_cpu_id(struct task_struct *t)
 {
u32 cpu_id = raw_smp_processor_id();
+   struct rseq __user *rseq = t->rseq;
 
-   if (put_user(cpu_id, &t->rseq->cpu_id_start))
-   return -EFAULT;
-   if (put_user(cpu_id, &t->rseq->cpu_id))
-   return -EFAULT;
+   if (!user_write_access_begin(rseq, sizeof(*rseq)))
+   goto efault;
+   unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
+   unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+   user_write_access_end();
trace_rseq_update(t);
return 0;
+
+efault_end:
+   user_write_access_end();
+efault:
+   return -EFAULT;
 }
 
 static int rseq_reset_rseq_cpu_id(struct task_struct *t)
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH v2 0/3] rseq: minor optimizations

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

rseq is a heavy user of copy to/from user data in fast paths.
This series tries to reduce the cost.

v2: Addressed Peter and Mathieu feedbacks, thanks !

Eric Dumazet (3):
  rseq: optimize rseq_update_cpu_id()
  rseq: remove redundant access_ok()
  rseq: optimise rseq_get_rseq_cs() and clear_rseq_cs()

 kernel/rseq.c | 61 +--
 1 file changed, 45 insertions(+), 16 deletions(-)

-- 
2.31.1.295.g9ea45b61b8-goog



Re: [PATCH 1/3] rseq: optimize rseq_update_cpu_id()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 4:29 PM Mathieu Desnoyers
 wrote:
>
> - On Apr 13, 2021, at 3:36 AM, Eric Dumazet eric.duma...@gmail.com wrote:
>
> > From: Eric Dumazet 
> >
> > Two put_user() in rseq_update_cpu_id() are replaced
> > by a pair of unsafe_put_user() with appropriate surroundings.
> >
> > This removes one stac/clac pair on x86 in fast path.
> >
> > Signed-off-by: Eric Dumazet 
> > Cc: Mathieu Desnoyers 
> > Cc: Peter Zijlstra 
> > Cc: "Paul E. McKenney" 
> > Cc: Boqun Feng 
> > Cc: Arjun Roy 
> > Cc: Ingo Molnar 
> > ---
> > kernel/rseq.c | 15 +++
> > 1 file changed, 11 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index
> > a4f86a9d6937cdfa2f13d1dcc9be863c1943d06f..d2689ccbb132c0fc8ec0924008771e5ee1ca855e
> > 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -84,13 +84,20 @@
> > static int rseq_update_cpu_id(struct task_struct *t)
> > {
> >   u32 cpu_id = raw_smp_processor_id();
> > + struct rseq *r = t->rseq;
>
> AFAIU the variable above should be a struct rseq __user *.
>
> Elsewhere in the file we use "rseq" rather than "r" for struct rseq __user *
> variable name, it would be better to keep the naming consistent across the 
> file
> if possible.

Absolutely, thanks for the feedback.


Re: [PATCH 2/3] rseq: remove redundant access_ok()

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 4:34 PM Mathieu Desnoyers
 wrote:
>
> - On Apr 13, 2021, at 3:36 AM, Eric Dumazet eric.duma...@gmail.com wrote:
>
> > From: Eric Dumazet 
> >
> > After commit 8f2817701492 ("rseq: Use get_user/put_user rather
> > than __get_user/__put_user") we no longer need
> > an access_ok() call from __rseq_handle_notify_resume()
>
> While we are doing that, should we also remove the access_ok() check in
> rseq_syscall() ? It look like it can also be removed for the exact same
> reason outlined here.

Yes, good idea.

 I was focusing on the __rseq_handle_notify_resume() paths, but
rseq_syscall() can get the same treatment.
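
(For readers skimming the archive: after that commit, every rseq access
goes through get_user()/put_user(), and those helpers validate the user
pointer themselves. A minimal sketch, with an illustrative uptr:)

	u32 val;

	/* get_user() range-checks uptr and faults safely on its own,
	 * returning -EFAULT; no separate access_ok() is required.
	 */
	if (get_user(val, uptr))
		return -EFAULT;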

> Thanks,
>
> Mathieu
>
> >
> > Signed-off-by: Eric Dumazet 
> > Cc: Mathieu Desnoyers 
> > Cc: Peter Zijlstra 
> > Cc: "Paul E. McKenney" 
> > Cc: Boqun Feng 
> > Cc: Arjun Roy 
> > Cc: Ingo Molnar 
> > ---
> > kernel/rseq.c | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/kernel/rseq.c b/kernel/rseq.c
> > index
> > d2689ccbb132c0fc8ec0924008771e5ee1ca855e..57344f9abb43905c7dd2b6081205ff508d963e1e
> > 100644
> > --- a/kernel/rseq.c
> > +++ b/kernel/rseq.c
> > @@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig,
> > struct pt_regs *regs)
> >
> >   if (unlikely(t->flags & PF_EXITING))
> >   return;
> > - if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
> > - goto error;
> >   ret = rseq_ip_fixup(regs);
> >   if (unlikely(ret < 0))
> >   goto error;
> > --
> > 2.31.1.295.g9ea45b61b8-goog
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 3:38 PM Michael S. Tsirkin  wrote:
>
> On Tue, Apr 13, 2021 at 03:33:40PM +0200, Eric Dumazet wrote:
> > On Tue, Apr 13, 2021 at 3:27 PM Eric Dumazet  wrote:
> > >
> > > On Tue, Apr 13, 2021 at 2:57 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Mon, Apr 12, 2021 at 06:47:07PM +0200, Eric Dumazet wrote:
> > > > > On Mon, Apr 12, 2021 at 6:31 PM Eric Dumazet  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Apr 12, 2021 at 6:28 PM Linus Torvalds
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Qemu test results:
> > > > > > > > total: 460 pass: 459 fail: 1
> > > > > > > > Failed tests:
> > > > > > > > sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
> > > > > > > >
> > > > > > > > The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not 
> > > > > > > > pull payload in
> > > > > > > > skb->head"). It is a spurious problem - the test passes roughly 
> > > > > > > > every other
> > > > > > > > time. When the failure is seen, udhcpc fails to get an IP 
> > > > > > > > address and aborts
> > > > > > > > with SIGTERM. So far I have only seen this with the "sh" 
> > > > > > > > architecture.
> > > > > > >
> > > > > > > Hmm. Let's add in some more of the people involved in that 
> > > > > > > commit, and
> > > > > > > also netdev.
> > > > > > >
> > > > > > > Nothing in there looks like it should have any interaction with
> > > > > > > architecture, so that "it happens on sh" sounds odd, but maybe 
> > > > > > > it's
> > > > > > > some particular interaction with the qemu environment.
> > > > > >
> > > > > > Yes, maybe.
> > > > > >
> > > > > > I spent few hours on this, and suspect a buggy memcpy() 
> > > > > > implementation
> > > > > > on SH, but this was not conclusive.
> > > > > >
> > > > > > By pulling one extra byte, the problem goes away.
> > > > > >
> > > > > > Strange thing is that the udhcpc process does not go past sendto().
> > > > >
> > > > > This is the patch working around the issue. Unfortunately I was not
> > > > > able to root-cause it (I really suspect something on SH)
> > > > >
> > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > index 
> > > > > 0824e6999e49957f7aaf7c990f6259792d42f32b..fd890a951beea03bdf24406809042666eb972655
> > > > > 100644
> > > > > --- a/drivers/net/virtio_net.c
> > > > > +++ b/drivers/net/virtio_net.c
> > > > > @@ -408,11 +408,17 @@ static struct sk_buff *page_to_skb(struct
> > > > > virtnet_info *vi,
> > > > >
> > > > > /* Copy all frame if it fits skb->head, otherwise
> > > > >  * we let virtio_net_hdr_to_skb() and GRO pull headers as 
> > > > > needed.
> > > > > +*
> > > > > +* Apparently, pulling only the Ethernet Header triggers a bug
> > > > > on qemu-system-sh4.
> > > > > +* Since GRO aggregation really cares of IPv4/IPv6, pull 20 
> > > > > bytes
> > > > > +* more to work around this bug : These 20 bytes can not 
> > > > > belong
> > > > > +* to UDP/TCP payload.
> > > > > +* As a bonus, this makes GRO slightly faster for IPv4 (one 
> > > > > less copy).
> > > > >  */
> > > >
> > > > Question: do we still want to do this for performance reasons?
> > > > We also have the hdr_len coming from the device which is
> > > > just skb_headlen on the host.
> > >
> > > Well, putting 20 bytes in skb->head will disable frag0 optimization.
> > >
> > The change would only benefit to sh architecture :)

Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 3:27 PM Eric Dumazet  wrote:
>
> On Tue, Apr 13, 2021 at 2:57 PM Michael S. Tsirkin  wrote:
> >
> > On Mon, Apr 12, 2021 at 06:47:07PM +0200, Eric Dumazet wrote:
> > > On Mon, Apr 12, 2021 at 6:31 PM Eric Dumazet  wrote:
> > > >
> > > > On Mon, Apr 12, 2021 at 6:28 PM Linus Torvalds
> > > >  wrote:
> > > > >
> > > > > On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck  
> > > > > wrote:
> > > > > >
> > > > > > Qemu test results:
> > > > > > total: 460 pass: 459 fail: 1
> > > > > > Failed tests:
> > > > > > sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
> > > > > >
> > > > > > The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not 
> > > > > > pull payload in
> > > > > > skb->head"). It is a spurious problem - the test passes roughly 
> > > > > > every other
> > > > > > time. When the failure is seen, udhcpc fails to get an IP address 
> > > > > > and aborts
> > > > > > with SIGTERM. So far I have only seen this with the "sh" 
> > > > > > architecture.
> > > > >
> > > > > Hmm. Let's add in some more of the people involved in that commit, and
> > > > > also netdev.
> > > > >
> > > > > Nothing in there looks like it should have any interaction with
> > > > > architecture, so that "it happens on sh" sounds odd, but maybe it's
> > > > > some particular interaction with the qemu environment.
> > > >
> > > > Yes, maybe.
> > > >
> > > > I spent few hours on this, and suspect a buggy memcpy() implementation
> > > > on SH, but this was not conclusive.
> > > >
> > > > By pulling one extra byte, the problem goes away.
> > > >
> > > > Strange thing is that the udhcpc process does not go past sendto().
> > >
> > > This is the patch working around the issue. Unfortunately I was not
> > > able to root-cause it (I really suspect something on SH)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 
> > > 0824e6999e49957f7aaf7c990f6259792d42f32b..fd890a951beea03bdf24406809042666eb972655
> > > 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -408,11 +408,17 @@ static struct sk_buff *page_to_skb(struct
> > > virtnet_info *vi,
> > >
> > > /* Copy all frame if it fits skb->head, otherwise
> > >  * we let virtio_net_hdr_to_skb() and GRO pull headers as needed.
> > > +*
> > > +* Apparently, pulling only the Ethernet Header triggers a bug
> > > on qemu-system-sh4.
> > > +* Since GRO aggregation really cares of IPv4/IPv6, pull 20 bytes
> > > +* more to work around this bug : These 20 bytes can not belong
> > > +* to UDP/TCP payload.
> > > +* As a bonus, this makes GRO slightly faster for IPv4 (one less 
> > > copy).
> > >  */
> >
> > Question: do we still want to do this for performance reasons?
> > We also have the hdr_len coming from the device which is
> > just skb_headlen on the host.
>
> Well, putting 20 bytes in skb->head will disable frag0 optimization.
>
> The change would only benefit to sh architecture :)
>
> About hdr_len, I suppose we could try it, with appropriate safety checks.

I have added traces, hdr_len seems to be 0 with the qemu-system-sh4 I am using.

Have I understood you correctly ?

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 
0824e6999e49957f7aaf7c990f6259792d42f32b..f024860f7dc260d4efbc35a3b8ffd358bd0da894
100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -399,9 +399,10 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
hdr_padded_len = sizeof(struct padded_vnet_hdr);

/* hdr_valid means no XDP, so we can copy the vnet header */
-   if (hdr_valid)
+   if (hdr_valid) {
memcpy(hdr, p, hdr_len);
-
+   pr_err("hdr->hdr_len=%u\n", hdr->hdr.hdr_len);
+   }
len -= hdr_len;
offset += hdr_padded_len;
p += hdr_padded_len;


Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 2:57 PM Michael S. Tsirkin  wrote:
>
> On Mon, Apr 12, 2021 at 06:47:07PM +0200, Eric Dumazet wrote:
> > On Mon, Apr 12, 2021 at 6:31 PM Eric Dumazet  wrote:
> > >
> > > On Mon, Apr 12, 2021 at 6:28 PM Linus Torvalds
> > >  wrote:
> > > >
> > > > On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck  
> > > > wrote:
> > > > >
> > > > > Qemu test results:
> > > > > total: 460 pass: 459 fail: 1
> > > > > Failed tests:
> > > > > sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
> > > > >
> > > > > The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not pull 
> > > > > payload in
> > > > > skb->head"). It is a spurious problem - the test passes roughly every 
> > > > > other
> > > > > time. When the failure is seen, udhcpc fails to get an IP address and 
> > > > > aborts
> > > > > with SIGTERM. So far I have only seen this with the "sh" architecture.
> > > >
> > > > Hmm. Let's add in some more of the people involved in that commit, and
> > > > also netdev.
> > > >
> > > > Nothing in there looks like it should have any interaction with
> > > > architecture, so that "it happens on sh" sounds odd, but maybe it's
> > > > some particular interaction with the qemu environment.
> > >
> > > Yes, maybe.
> > >
> > > I spent few hours on this, and suspect a buggy memcpy() implementation
> > > on SH, but this was not conclusive.
> > >
> > > By pulling one extra byte, the problem goes away.
> > >
> > > Strange thing is that the udhcpc process does not go past sendto().
> >
> > This is the patch working around the issue. Unfortunately I was not
> > able to root-cause it (I really suspect something on SH)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 
> > 0824e6999e49957f7aaf7c990f6259792d42f32b..fd890a951beea03bdf24406809042666eb972655
> > 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -408,11 +408,17 @@ static struct sk_buff *page_to_skb(struct
> > virtnet_info *vi,
> >
> > /* Copy all frame if it fits skb->head, otherwise
> >  * we let virtio_net_hdr_to_skb() and GRO pull headers as needed.
> > +*
> > +* Apparently, pulling only the Ethernet Header triggers a bug
> > on qemu-system-sh4.
> > +* Since GRO aggregation really cares of IPv4/IPv6, pull 20 bytes
> > +* more to work around this bug : These 20 bytes can not belong
> > +* to UDP/TCP payload.
> > +* As a bonus, this makes GRO slightly faster for IPv4 (one less 
> > copy).
> >  */
>
> Question: do we still want to do this for performance reasons?
> We also have the hdr_len coming from the device which is
> just skb_headlen on the host.

Well, putting 20 bytes in skb->head will disable frag0 optimization.

The change would only benefit to sh architecture :)

About hdr_len, I suppose we could try it, with appropriate safety checks.

>
> > if (len <= skb_tailroom(skb))
> > copy = len;
> > else
> > -   copy = ETH_HLEN + metasize;
> > +   copy = ETH_HLEN + sizeof(struct iphdr) + metasize;
> > skb_put_data(skb, p, copy);
> >
> > if (metasize) {
>


Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 12:43 PM Eric Dumazet  wrote:
>
> On Tue, Apr 13, 2021 at 11:24 AM Eric Dumazet  wrote:
> >
> > On Mon, Apr 12, 2021 at 10:05 PM Guenter Roeck  wrote:
> > >
> > > On 4/12/21 10:38 AM, Eric Dumazet wrote:
> > > [ ... ]
> > >
> > > > Yes, I think this is the real issue here. This smells like some memory
> > > > corruption.
> > > >
> > > > In my traces, packet is correctly received in AF_PACKET queue.
> > > >
> > > > I have checked the skb is well formed.
> > > >
> > > > But the user space seems to never call poll() and recvmsg() on this
> > > > af_packet socket.
> > > >
> > >
> > > After sprinkling the kernel with debug messages:
> > >
> > > 424   00:01:33.674181 sendto(6, 
> > > "E\0\1H\0\0\0\0@\21y\246\0\0\0\0\377\377\377\377\0D\0C\00148\346\1\1\6\0\246\336\333\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0RT\0\
> > > 424   00:01:33.693873 close(6)  = 0
> > > 424   00:01:33.694652 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
> > > 424   00:01:33.695213 clock_gettime64(CLOCK_MONOTONIC, 0x7be18a18) = -1 
> > > EFAULT (Bad address)
> > > 424   00:01:33.695889 write(2, "udhcpc: clock_gettime(MONOTONIC) 
> > > failed\n", 40) = -1 EFAULT (Bad address)
> > > 424   00:01:33.697311 exit_group(1) = ?
> > > 424   00:01:33.698346 +++ exited with 1 +++
> > >
> > > I only see that after adding debug messages in the kernel, so I guess 
> > > there must be
> > > a heisenbug somehere.
> > >
> > > Anyway, indeed, I see (another kernel debug message):
> > >
> > > __do_sys_clock_gettime: Returning -EFAULT on address 0x7bacc9a8
> > >
> > > So udhcpc doesn't even try to read the reply because it crashes after 
> > > sendto()
> > > when trying to read the current time. Unless I am missing something, that 
> > > means
> > > that the problem happens somewhere on the send side.
> > >
> > > To make things even more interesting, it looks like the failing system 
> > > call
> > > isn't always clock_gettime().
> > >
> > > Guenter
> >
> >
> > I think GRO fast path has never worked on SUPERH. Probably SUPERH has
> > never used a fast NIC (10Gbit+)
> >
> > The following hack fixes the issue.
> >
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 
> > af8c1ea040b9364b076e2d72f04dc3de2d7e2f11..91ba89a645ff91d4cd4f3d8dc8a009bcb67da344
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -5916,13 +5916,16 @@ static struct list_head
> > *gro_list_prepare(struct napi_struct *napi,
> >
> >  static void skb_gro_reset_offset(struct sk_buff *skb)
> >  {
> > +#if !defined(CONFIG_SUPERH)
> > const struct skb_shared_info *pinfo = skb_shinfo(skb);
> > const skb_frag_t *frag0 = &pinfo->frags[0];
> > +#endif
> >
> > NAPI_GRO_CB(skb)->data_offset = 0;
> > NAPI_GRO_CB(skb)->frag0 = NULL;
> > NAPI_GRO_CB(skb)->frag0_len = 0;
> >
> > +#if !defined(CONFIG_SUPERH)
> > if (!skb_headlen(skb) && pinfo->nr_frags &&
> > !PageHighMem(skb_frag_page(frag0))) {
> > NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
> > @@ -5930,6 +5933,7 @@ static void skb_gro_reset_offset(struct sk_buff *skb)
> > skb_frag_size(frag0),
> > skb->end - skb->tail);
> > }
> > +#endif
> >  }
> >
> >  static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
>
> OK ... more sh debugging :
>
> diff --git a/arch/sh/mm/alignment.c b/arch/sh/mm/alignment.c
> index 
> fb517b82a87b1065cf38c06cb3c178ce86587b00..5d18f9f792991105a8aa05cc6231b7d4532d72c9
> 100644
> --- a/arch/sh/mm/alignment.c
> +++ b/arch/sh/mm/alignment.c
> @@ -27,7 +27,7 @@ static unsigned long se_multi;
> valid! */
>  static int se_usermode = UM_WARN | UM_FIXUP;
>  /* 0: no warning 1: print a warning message, disabled by default */
> -static int se_kernmode_warn;
> +static int se_kernmode_warn = 1;
>
>  core_param(alignment, se_usermode, int, 0600);
>
> @@ -103,7 +103,7 @@ void unaligned_fixups_notify(struct task_struct
> *tsk, insn_size_t insn,
>   (void *)instruction_pointer(regs), insn);
> else if (se_kernmode_warn)
>   pr_notice_ratelimited("Fixing up unaligned kernel access "
> - "in \"%s\" pid=%d pc=0x%p ins=0x%04hx\n",
> + "in \"%s\" pid=%d pc=%px ins=0x%04hx\n",
>   tsk->comm, task_pid_nr(tsk),
>   (void *)instruction_pointer(regs), insn);

Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Tue, Apr 13, 2021 at 11:24 AM Eric Dumazet  wrote:
>
> On Mon, Apr 12, 2021 at 10:05 PM Guenter Roeck  wrote:
> >
> > On 4/12/21 10:38 AM, Eric Dumazet wrote:
> > [ ... ]
> >
> > > Yes, I think this is the real issue here. This smells like some memory
> > > corruption.
> > >
> > > In my traces, packet is correctly received in AF_PACKET queue.
> > >
> > > I have checked the skb is well formed.
> > >
> > > But the user space seems to never call poll() and recvmsg() on this
> > > af_packet socket.
> > >
> >
> > After sprinkling the kernel with debug messages:
> >
> > 424   00:01:33.674181 sendto(6, 
> > "E\0\1H\0\0\0\0@\21y\246\0\0\0\0\377\377\377\377\0D\0C\00148\346\1\1\6\0\246\336\333\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0RT\0\
> > 424   00:01:33.693873 close(6)  = 0
> > 424   00:01:33.694652 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
> > 424   00:01:33.695213 clock_gettime64(CLOCK_MONOTONIC, 0x7be18a18) = -1 
> > EFAULT (Bad address)
> > 424   00:01:33.695889 write(2, "udhcpc: clock_gettime(MONOTONIC) failed\n", 
> > 40) = -1 EFAULT (Bad address)
> > 424   00:01:33.697311 exit_group(1) = ?
> > 424   00:01:33.698346 +++ exited with 1 +++
> >
> > I only see that after adding debug messages in the kernel, so I guess there 
> > must be
> > a heisenbug somehere.
> >
> > Anyway, indeed, I see (another kernel debug message):
> >
> > __do_sys_clock_gettime: Returning -EFAULT on address 0x7bacc9a8
> >
> > So udhcpc doesn't even try to read the reply because it crashes after 
> > sendto()
> > when trying to read the current time. Unless I am missing something, that 
> > means
> > that the problem happens somewhere on the send side.
> >
> > To make things even more interesting, it looks like the failing system call
> > isn't always clock_gettime().
> >
> > Guenter
>
>
> I think GRO fast path has never worked on SUPERH. Probably SUPERH has
> never used a fast NIC (10Gbit+)
>
> The following hack fixes the issue.
>
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 
> af8c1ea040b9364b076e2d72f04dc3de2d7e2f11..91ba89a645ff91d4cd4f3d8dc8a009bcb67da344
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5916,13 +5916,16 @@ static struct list_head
> *gro_list_prepare(struct napi_struct *napi,
>
>  static void skb_gro_reset_offset(struct sk_buff *skb)
>  {
> +#if !defined(CONFIG_SUPERH)
> const struct skb_shared_info *pinfo = skb_shinfo(skb);
> const skb_frag_t *frag0 = &pinfo->frags[0];
> +#endif
>
> NAPI_GRO_CB(skb)->data_offset = 0;
> NAPI_GRO_CB(skb)->frag0 = NULL;
> NAPI_GRO_CB(skb)->frag0_len = 0;
>
> +#if !defined(CONFIG_SUPERH)
> if (!skb_headlen(skb) && pinfo->nr_frags &&
> !PageHighMem(skb_frag_page(frag0))) {
> NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
> @@ -5930,6 +5933,7 @@ static void skb_gro_reset_offset(struct sk_buff *skb)
> skb_frag_size(frag0),
> skb->end - skb->tail);
> }
> +#endif
>  }
>
>  static void gro_pull_from_frag0(struct sk_buff *skb, int grow)

OK ... more sh debugging :

diff --git a/arch/sh/mm/alignment.c b/arch/sh/mm/alignment.c
index 
fb517b82a87b1065cf38c06cb3c178ce86587b00..5d18f9f792991105a8aa05cc6231b7d4532d72c9
100644
--- a/arch/sh/mm/alignment.c
+++ b/arch/sh/mm/alignment.c
@@ -27,7 +27,7 @@ static unsigned long se_multi;
valid! */
 static int se_usermode = UM_WARN | UM_FIXUP;
 /* 0: no warning 1: print a warning message, disabled by default */
-static int se_kernmode_warn;
+static int se_kernmode_warn = 1;

 core_param(alignment, se_usermode, int, 0600);

@@ -103,7 +103,7 @@ void unaligned_fixups_notify(struct task_struct
*tsk, insn_size_t insn,
  (void *)instruction_pointer(regs), insn);
else if (se_kernmode_warn)
pr_notice_ratelimited("Fixing up unaligned kernel access "
- "in \"%s\" pid=%d pc=0x%p ins=0x%04hx\n",
+ "in \"%s\" pid=%d pc=%px ins=0x%04hx\n",
  tsk->comm, task_pid_nr(tsk),
  (void *)instruction_pointer(regs), insn);
 }

I now see something of interest :
Fixing up unaligned kernel access in "udhcpc" pid=91 pc=8c43fc2e ins=0x6236
Fixing up unaligned kernel access in "udhcpc" pid=91 pc=8c43fc2e ins=0x6236
Fixing up unaligned kernel access in 
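
(The traces above point at unaligned loads through the GRO frag0 fast
path: frag0 can start 2 bytes off a 4-byte boundary because of the
Ethernet header, which a strict-alignment architecture like sh cannot
tolerate. A sketch of an alignment guard in skb_gro_reset_offset(),
reusing the pinfo/frag0 locals quoted earlier; the exact condition is an
assumption for illustration, not necessarily what went upstream:)

	/* Sketch: only enable frag0 when the fragment starts on a 4-byte
	 * boundary, so later IP header loads are naturally aligned.
	 */
	if (!skb_headlen(skb) && pinfo->nr_frags &&
	    !PageHighMem(skb_frag_page(frag0)) &&
	    (!NET_IP_ALIGN || !(skb_frag_off(frag0) & 3))) {
		NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
		NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
						    skb_frag_size(frag0),
						    skb->end - skb->tail);
	}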

Re: Linux 5.12-rc7

2021-04-13 Thread Eric Dumazet
On Mon, Apr 12, 2021 at 10:05 PM Guenter Roeck  wrote:
>
> On 4/12/21 10:38 AM, Eric Dumazet wrote:
> [ ... ]
>
> > Yes, I think this is the real issue here. This smells like some memory
> > corruption.
> >
> > In my traces, packet is correctly received in AF_PACKET queue.
> >
> > I have checked the skb is well formed.
> >
> > But the user space seems to never call poll() and recvmsg() on this
> > af_packet socket.
> >
>
> After sprinkling the kernel with debug messages:
>
> 424   00:01:33.674181 sendto(6, 
> "E\0\1H\0\0\0\0@\21y\246\0\0\0\0\377\377\377\377\0D\0C\00148\346\1\1\6\0\246\336\333\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0RT\0\
> 424   00:01:33.693873 close(6)  = 0
> 424   00:01:33.694652 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
> 424   00:01:33.695213 clock_gettime64(CLOCK_MONOTONIC, 0x7be18a18) = -1 
> EFAULT (Bad address)
> 424   00:01:33.695889 write(2, "udhcpc: clock_gettime(MONOTONIC) failed\n", 
> 40) = -1 EFAULT (Bad address)
> 424   00:01:33.697311 exit_group(1) = ?
> 424   00:01:33.698346 +++ exited with 1 +++
>
> I only see that after adding debug messages in the kernel, so I guess there 
> must be
> a heisenbug somehere.
>
> Anyway, indeed, I see (another kernel debug message):
>
> __do_sys_clock_gettime: Returning -EFAULT on address 0x7bacc9a8
>
> So udhcpc doesn't even try to read the reply because it crashes after sendto()
> when trying to read the current time. Unless I am missing something, that 
> means
> that the problem happens somewhere on the send side.
>
> To make things even more interesting, it looks like the failing system call
> isn't always clock_gettime().
>
> Guenter


I think GRO fast path has never worked on SUPERH. Probably SUPERH has
never used a fast NIC (10Gbit+)

The following hack fixes the issue.


diff --git a/net/core/dev.c b/net/core/dev.c
index 
af8c1ea040b9364b076e2d72f04dc3de2d7e2f11..91ba89a645ff91d4cd4f3d8dc8a009bcb67da344
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5916,13 +5916,16 @@ static struct list_head
*gro_list_prepare(struct napi_struct *napi,

 static void skb_gro_reset_offset(struct sk_buff *skb)
 {
+#if !defined(CONFIG_SUPERH)
const struct skb_shared_info *pinfo = skb_shinfo(skb);
const skb_frag_t *frag0 = &pinfo->frags[0];
+#endif

NAPI_GRO_CB(skb)->data_offset = 0;
NAPI_GRO_CB(skb)->frag0 = NULL;
NAPI_GRO_CB(skb)->frag0_len = 0;

+#if !defined(CONFIG_SUPERH)
if (!skb_headlen(skb) && pinfo->nr_frags &&
!PageHighMem(skb_frag_page(frag0))) {
NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
@@ -5930,6 +5933,7 @@ static void skb_gro_reset_offset(struct sk_buff *skb)
skb_frag_size(frag0),
skb->end - skb->tail);
}
+#endif
 }

 static void gro_pull_from_frag0(struct sk_buff *skb, int grow)


[PATCH 2/3] rseq: remove redundant access_ok()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

After commit 8f2817701492 ("rseq: Use get_user/put_user rather
than __get_user/__put_user") we no longer need
an access_ok() call from __rseq_handle_notify_resume()

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
d2689ccbb132c0fc8ec0924008771e5ee1ca855e..57344f9abb43905c7dd2b6081205ff508d963e1e
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -273,8 +273,6 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, 
struct pt_regs *regs)
 
if (unlikely(t->flags & PF_EXITING))
return;
-   if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
-   goto error;
ret = rseq_ip_fixup(regs);
if (unlikely(ret < 0))
goto error;
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH 3/3] rseq: optimise for 64bit arches

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union,
update includes") added regressions for our servers.

Using copy_from_user() and clear_user() for 64bit values
on 64bit arches is suboptimal.

We might revisit this patch once all 32bit arches support
get_user() and/or put_user() for 8 bytes values.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
57344f9abb43905c7dd2b6081205ff508d963e1e..18a75a804008d2f564d1f7789f09216f1a8760bd
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -127,8 +127,13 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct 
rseq_cs *rseq_cs)
u32 sig;
int ret;
 
+#ifdef CONFIG_64BIT
+   if (get_user(ptr, &t->rseq->rseq_cs.ptr64))
+   return -EFAULT;
+#else
if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
return -EFAULT;
+#endif
if (!ptr) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
@@ -211,9 +216,13 @@ static int clear_rseq_cs(struct task_struct *t)
 *
 * Set rseq_cs to NULL.
 */
+#ifdef CONFIG_64BIT
+   return put_user(0ULL, &t->rseq->rseq_cs.ptr64);
+#else
if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
return -EFAULT;
return 0;
+#endif
 }
 
 /*
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH 0/3] rseq: minor optimizations

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

rseq is a heavy user of copy to/from user data in fast paths.
This series tries to reduce the cost.

Eric Dumazet (3):
  rseq: optimize rseq_update_cpu_id()
  rseq: remove redundant access_ok()
  rseq: optimise for 64bit arches

 kernel/rseq.c | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH 1/3] rseq: optimize rseq_update_cpu_id()

2021-04-13 Thread Eric Dumazet
From: Eric Dumazet 

Two put_user() in rseq_update_cpu_id() are replaced
by a pair of unsafe_put_user() with appropriate surroundings.

This removes one stac/clac pair on x86 in fast path.

Signed-off-by: Eric Dumazet 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Cc: "Paul E. McKenney" 
Cc: Boqun Feng 
Cc: Arjun Roy 
Cc: Ingo Molnar 
---
 kernel/rseq.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 
a4f86a9d6937cdfa2f13d1dcc9be863c1943d06f..d2689ccbb132c0fc8ec0924008771e5ee1ca855e
 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,13 +84,20 @@
 static int rseq_update_cpu_id(struct task_struct *t)
 {
u32 cpu_id = raw_smp_processor_id();
+   struct rseq *r = t->rseq;
 
-   if (put_user(cpu_id, &t->rseq->cpu_id_start))
-   return -EFAULT;
-   if (put_user(cpu_id, &t->rseq->cpu_id))
-   return -EFAULT;
+   if (!user_write_access_begin(r, sizeof(*r)))
+   goto efault;
+   unsafe_put_user(cpu_id, &r->cpu_id_start, efault_end);
+   unsafe_put_user(cpu_id, &r->cpu_id, efault_end);
+   user_write_access_end();
trace_rseq_update(t);
return 0;
+
+efault_end:
+   user_write_access_end();
+efault:
+   return -EFAULT;
 }
 
 static int rseq_reset_rseq_cpu_id(struct task_struct *t)
-- 
2.31.1.295.g9ea45b61b8-goog



Re: Linux 5.12-rc7

2021-04-12 Thread Eric Dumazet
On Mon, Apr 12, 2021 at 7:31 PM Guenter Roeck  wrote:
>
> On 4/12/21 9:31 AM, Eric Dumazet wrote:
> > On Mon, Apr 12, 2021 at 6:28 PM Linus Torvalds
> >  wrote:
> >>
> >> On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck  wrote:
> >>>
> >>> Qemu test results:
> >>> total: 460 pass: 459 fail: 1
> >>> Failed tests:
> >>> sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
> >>>
> >>> The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not pull 
> >>> payload in
> >>> skb->head"). It is a spurious problem - the test passes roughly every 
> >>> other
> >>> time. When the failure is seen, udhcpc fails to get an IP address and 
> >>> aborts
> >>> with SIGTERM. So far I have only seen this with the "sh" architecture.
> >>
> >> Hmm. Let's add in some more of the people involved in that commit, and
> >> also netdev.
> >>
> >> Nothing in there looks like it should have any interaction with
> >> architecture, so that "it happens on sh" sounds odd, but maybe it's
> >> some particular interaction with the qemu environment.
> >
> > Yes, maybe.
> >
> > I spent few hours on this, and suspect a buggy memcpy() implementation
> > on SH, but this was not conclusive.
> >
>
> I replaced all memcpy() calls in skbuff.h with calls to
>
> static inline void __my_memcpy(unsigned char *to, const unsigned char *from,
>unsigned int len)
> {
>while (len--)
>*to++ = *from++;
> }
>
> That made no difference, so unless you have some other memcpy() in mind that
> seems to be unlikely.


Sure, note I also had :

diff --git a/net/core/dev.c b/net/core/dev.c
index 
af8c1ea040b9364b076e2d72f04dc3de2d7e2f11..4e05a32542dd606ee8038017fea949939c0e
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5938,7 +5938,7 @@ static void gro_pull_from_frag0(struct sk_buff
*skb, int grow)

BUG_ON(skb->end - skb->tail < grow);

-   memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
+   memmove(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);

skb->data_len -= grow;
skb->tail += grow;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 
c421c8f809256f7b13a8b5a1331108449353ee2d..41796dedfa9034f2333cf249a0d81b7250e14d1f
100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2278,7 +2278,7 @@ int skb_copy_bits(const struct sk_buff *skb, int
offset, void *to, int len)
  skb_frag_off(f) + offset - start,
  copy, p, p_off, p_len, copied) {
vaddr = kmap_atomic(p);
-   memcpy(to + copied, vaddr + p_off, p_len);
+   memmove(to + copied, vaddr + p_off, p_len);
kunmap_atomic(vaddr);
}


>
> > By pulling one extra byte, the problem goes away.
> >
> > Strange thing is that the udhcpc process does not go past sendto().
> >
>
> I have been trying to debug that one. Unfortunately gdb doesn't work with sh,
> so I can't use it to debug the problem. I'll spend some more time on this 
> today.

Yes, I think this is the real issue here. This smells like some memory
corruption.

In my traces, packet is correctly received in AF_PACKET queue.

I have checked the skb is well formed.

But the user space seems to never call poll() and recvmsg() on this
af_packet socket.


Re: Linux 5.12-rc7

2021-04-12 Thread Eric Dumazet
On Mon, Apr 12, 2021 at 6:31 PM Eric Dumazet  wrote:
>
> On Mon, Apr 12, 2021 at 6:28 PM Linus Torvalds
>  wrote:
> >
> > On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck  wrote:
> > >
> > > Qemu test results:
> > > total: 460 pass: 459 fail: 1
> > > Failed tests:
> > > sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
> > >
> > > The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not pull 
> > > payload in
> > > skb->head"). It is a spurious problem - the test passes roughly every 
> > > other
> > > time. When the failure is seen, udhcpc fails to get an IP address and 
> > > aborts
> > > with SIGTERM. So far I have only seen this with the "sh" architecture.
> >
> > Hmm. Let's add in some more of the people involved in that commit, and
> > also netdev.
> >
> > Nothing in there looks like it should have any interaction with
> > architecture, so that "it happens on sh" sounds odd, but maybe it's
> > some particular interaction with the qemu environment.
>
> Yes, maybe.
>
> I spent few hours on this, and suspect a buggy memcpy() implementation
> on SH, but this was not conclusive.
>
> By pulling one extra byte, the problem goes away.
>
> Strange thing is that the udhcpc process does not go past sendto().

This is the patch working around the issue. Unfortunately I was not
able to root-cause it (I really suspect something on SH)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 
0824e6999e49957f7aaf7c990f6259792d42f32b..fd890a951beea03bdf24406809042666eb972655
100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -408,11 +408,17 @@ static struct sk_buff *page_to_skb(struct
virtnet_info *vi,

/* Copy all frame if it fits skb->head, otherwise
 * we let virtio_net_hdr_to_skb() and GRO pull headers as needed.
+*
+* Apparently, pulling only the Ethernet Header triggers a bug
on qemu-system-sh4.
+* Since GRO aggregation really cares of IPv4/IPv6, pull 20 bytes
+* more to work around this bug : These 20 bytes can not belong
+* to UDP/TCP payload.
+* As a bonus, this makes GRO slightly faster for IPv4 (one less copy).
 */
if (len <= skb_tailroom(skb))
copy = len;
else
-   copy = ETH_HLEN + metasize;
+   copy = ETH_HLEN + sizeof(struct iphdr) + metasize;
skb_put_data(skb, p, copy);

if (metasize) {


Re: Linux 5.12-rc7

2021-04-12 Thread Eric Dumazet
On Mon, Apr 12, 2021 at 6:28 PM Linus Torvalds
 wrote:
>
> On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck  wrote:
> >
> > Qemu test results:
> > total: 460 pass: 459 fail: 1
> > Failed tests:
> > sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
> >
> > The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not pull 
> > payload in
> > skb->head"). It is a spurious problem - the test passes roughly every other
> > time. When the failure is seen, udhcpc fails to get an IP address and aborts
> > with SIGTERM. So far I have only seen this with the "sh" architecture.
>
> Hmm. Let's add in some more of the people involved in that commit, and
> also netdev.
>
> Nothing in there looks like it should have any interaction with
> architecture, so that "it happens on sh" sounds odd, but maybe it's
> some particular interaction with the qemu environment.

Yes, maybe.

I spent a few hours on this, and suspect a buggy memcpy() implementation
on SH, but this was not conclusive.

By pulling one extra byte, the problem goes away.

Strange thing is that the udhcpc process does not go past sendto().


Re: [PATCH net-next v2 2/3] net: use skb_for_each_frag() helper where possible

2021-04-12 Thread Eric Dumazet



On 4/12/21 2:38 AM, Matteo Croce wrote:
> From: Matteo Croce 
> 
> use the new helper macro skb_for_each_frag() which allows to iterate
> through all the SKB fragments.
> 
> The patch was created with Coccinelle, this was the semantic patch:
> 
> @@
> struct sk_buff *skb;
> identifier i;
> statement S;
> iterator name skb_for_each_frag;
> @@
> -for (i = 0; i < skb_shinfo(skb)->nr_frags; \(++i\|i++\))
> +skb_for_each_frag(skb, i)
>  S
> @@
> struct skb_shared_info *sinfo;
> struct sk_buff *skb;
> identifier i;
> statement S;
> iterator name skb_for_each_frag;
> @@


I disagree with this part :

>  sinfo = skb_shinfo(skb)
>  ...
> -for (i = 0; i < sinfo->nr_frags; \(++i\|i++\))
> +skb_for_each_frag(skb, i)
>  S
>


> index bde781f46b41..5de00477eaf9 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1644,7 +1644,7 @@ static int __pskb_trim_head(struct sk_buff *skb, int 
> len)
>   eat = len;
>   k = 0;
>   shinfo = skb_shinfo(skb);
> - for (i = 0; i < shinfo->nr_frags; i++) {
> + skb_for_each_frag(skb, i) {
>   int size = skb_frag_size(&shinfo->frags[i]);
>  
>   if (size <= eat) {

This will force the compiler to re-evaluate skb_shinfo(skb)->nr_frags in the 
loop,
since atomic operations like skb_frag_unref() have a memory clobber.

Evaluating skb_shinfo(skb)->nr_frags has to reload three variables.

The macro should only be used when the code had

for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
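
(That is, the conversion is only free where the bound was already re-read
on each pass. A short sketch contrasting the two shapes, with a
hypothetical frob() standing in for the loop body:)

	/* Fine to convert: the bound was re-evaluated each pass anyway. */
	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
		frob(&skb_shinfo(skb)->frags[i]);

	/* Not a free conversion: shinfo was hoisted on purpose, and
	 * skb_for_each_frag() would reintroduce the per-pass reloads.
	 */
	shinfo = skb_shinfo(skb);
	for (i = 0; i < shinfo->nr_frags; i++)
		frob(&shinfo->frags[i]);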




Re: [syzbot] KMSAN: uninit-value in INET_ECN_decapsulate (2)

2021-04-12 Thread Eric Dumazet



On 3/30/21 3:26 PM, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:29ad81a1 arch/x86: add missing include to sparsemem.h
> git tree:   https://github.com/google/kmsan.git master
> console output: https://syzkaller.appspot.com/x/log.txt?x=166fe481d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b573c14b733efb1c
> dashboard link: https://syzkaller.appspot.com/bug?extid=5e9c61d74e52a82b8ace
> compiler:   Debian clang version 11.0.1-2
> userspace arch: i386
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+5e9c61d74e52a82b8...@syzkaller.appspotmail.com
> 
> =
> BUG: KMSAN: uninit-value in __INET_ECN_decapsulate include/net/inet_ecn.h:236 
> [inline]
> BUG: KMSAN: uninit-value in INET_ECN_decapsulate+0x329/0x1db0 
> include/net/inet_ecn.h:258
> CPU: 1 PID: 9058 Comm: syz-executor.1 Not tainted 5.11.0-rc7-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x21c/0x280 lib/dump_stack.c:120
>  kmsan_report+0xfb/0x1e0 mm/kmsan/kmsan_report.c:118
>  __msan_warning+0x5f/0xa0 mm/kmsan/kmsan_instr.c:197
>  __INET_ECN_decapsulate include/net/inet_ecn.h:236 [inline]
>  INET_ECN_decapsulate+0x329/0x1db0 include/net/inet_ecn.h:258
>  geneve_rx+0x216e/0x29d0 include/net/inet_ecn.h:304
>  geneve_udp_encap_recv+0x1055/0x1340 drivers/net/geneve.c:377
>  udp_queue_rcv_one_skb+0x1943/0x1af0 net/ipv4/udp.c:2095
>  udp_queue_rcv_skb+0x286/0x1040 net/ipv4/udp.c:2169
>  udp_unicast_rcv_skb net/ipv4/udp.c:2327 [inline]
>  __udp4_lib_rcv+0x3a1f/0x58f0 net/ipv4/udp.c:2396
>  udp_rcv+0x5c/0x70 net/ipv4/udp.c:2567
>  ip_protocol_deliver_rcu+0x572/0xc50 net/ipv4/ip_input.c:204
>  ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline]
>  NF_HOOK include/linux/netfilter.h:301 [inline]
>  ip_local_deliver+0x585/0x8d0 net/ipv4/ip_input.c:252
>  dst_input include/net/dst.h:447 [inline]
>  ip_rcv_finish net/ipv4/ip_input.c:428 [inline]
>  NF_HOOK include/linux/netfilter.h:301 [inline]
>  ip_rcv+0x599/0x820 net/ipv4/ip_input.c:539
>  __netif_receive_skb_one_core net/core/dev.c:5323 [inline]
>  __netif_receive_skb+0x1ec/0x640 net/core/dev.c:5437
>  process_backlog+0x517/0xbd0 net/core/dev.c:6328
>  napi_poll+0x428/0x15c0 net/core/dev.c:6806
>  net_rx_action+0x34c/0xd30 net/core/dev.c:6889
>  __do_softirq+0x1b9/0x715 kernel/softirq.c:343
>  asm_call_irq_on_stack+0xf/0x20
>  
>  __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
>  run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
>  do_softirq_own_stack+0x6e/0x90 arch/x86/kernel/irq_64.c:77
>  do_softirq kernel/softirq.c:246 [inline]
>  __local_bh_enable_ip+0x184/0x1d0 kernel/softirq.c:196
>  local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
>  rcu_read_unlock_bh include/linux/rcupdate.h:737 [inline]
>  __dev_queue_xmit+0x3b3e/0x45c0 net/core/dev.c:4178
>  dev_queue_xmit+0x4b/0x60 net/core/dev.c:4184
>  packet_snd net/packet/af_packet.c:3006 [inline]
>  packet_sendmsg+0x8778/0x9a60 net/packet/af_packet.c:3031
>  sock_sendmsg_nosec net/socket.c:652 [inline]
>  sock_sendmsg net/socket.c:672 [inline]
>  __sys_sendto+0x9ea/0xc60 net/socket.c:1975
>  __do_sys_sendto net/socket.c:1987 [inline]
>  __se_sys_sendto+0x107/0x130 net/socket.c:1983
>  __ia32_sys_sendto+0x6e/0x90 net/socket.c:1983
>  do_syscall_32_irqs_on arch/x86/entry/common.c:79 [inline]
>  __do_fast_syscall_32+0x102/0x160 arch/x86/entry/common.c:141
>  do_fast_syscall_32+0x6a/0xc0 arch/x86/entry/common.c:166
>  do_SYSENTER_32+0x73/0x90 arch/x86/entry/common.c:209
>  entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
> RIP: 0023:0xf7f84549
> Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 
> 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 
> 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
> RSP: 002b:f557e5fc EFLAGS: 0296 ORIG_RAX: 0171
> RAX: ffda RBX: 0003 RCX: 2980
> RDX: 000e RSI:  RDI: 22c0
> RBP: 0014 R08:  R09: 
> R10:  R11:  R12: 
> R13:  R14:  R15: 
> 
> Uninit was created at:
>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
>  kmsan_internal_poison_shadow+0x5c/0xf0 mm/kmsan/kmsan.c:104
>  kmsan_slab_alloc+0x8d/0xe0 mm/kmsan/kmsan_hooks.c:76
>  slab_alloc_node mm/slub.c:2907 [inline]
>  __kmalloc_node_track_caller+0xa37/0x1430 mm/slub.c:4527
>  __kmalloc_reserve net/core/skbuff.c:142 [inline]
>  __alloc_skb+0x2f8/0xb30 net/core/skbuff.c:210
>  alloc_skb include/linux/skbuff.h:1099 [inline]
>  alloc_skb_with_frags+0x1f3/0xc10 net/co

Re: [PATCH] net: geneve: check skb is large enough for IPv4/IPv6 header

2021-04-11 Thread Eric Dumazet
On Sun, Apr 11, 2021 at 1:28 PM Phillip Potter  wrote:
>
> Check within geneve_xmit_skb/geneve6_xmit_skb that sk_buff structure
> is large enough to include IPv4 or IPv6 header, and reject if not. The
> geneve_xmit_skb portion and overall idea was contributed by Eric Dumazet.
> Fixes a KMSAN-found uninit-value bug reported by syzbot at:
> https://syzkaller.appspot.com/bug?id=abe95dc3e3e9667fc23b8d81f29ecad95c6f106f
>
> Suggested-by: Eric Dumazet 
> Reported-by: syzbot+2e406a9ac75bb71d4...@syzkaller.appspotmail.com
> Signed-off-by: Phillip Potter 

Signed-off-by: Eric Dumazet 

Thanks !


Re: [PATCH] net: core: sk_buff: zero-fill skb->data in __alloc_skb function

2021-04-10 Thread Eric Dumazet
On Sat, Apr 10, 2021 at 12:12 PM Eric Dumazet  wrote:
>
> On Sat, Apr 10, 2021 at 11:51 AM Phillip Potter  wrote:
> >
> > Zero-fill skb->data in __alloc_skb function of net/core/skbuff.c,
> > up to start of struct skb_shared_info bytes. Fixes a KMSAN-found
> > uninit-value bug reported by syzbot at:
> > https://syzkaller.appspot.com/bug?id=abe95dc3e3e9667fc23b8d81f29ecad95c6f106f
> >
> > Reported-by: syzbot+2e406a9ac75bb71d4...@syzkaller.appspotmail.com
> > Signed-off-by: Phillip Potter 
> > ---
> >  net/core/skbuff.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 785daff48030..9ac26cdb5417 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -215,6 +215,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
> > gfp_mask,
> >  * to allow max possible filling before reallocation.
> >  */
> > size = SKB_WITH_OVERHEAD(ksize(data));
> > +   memset(data, 0, size);
> > prefetchw(data + size);
>
>
> Certainly not.
>
> There is a difference between kmalloc() and kzalloc()
>
> Here you are basically silencing KMSAN and make it useless.
>
> Please fix the real issue, or stop using KMSAN if it bothers you.

My understanding of the KMSAN bug (when I released it months ago) was
that it was triggered by some invalid assumptions in geneve_xmit()

The syzbot repro sends a packet with a very small size (Ethernet
header only) and no IP/IPv6 header

Fix for ipv4 part (sorry, not much time during week end to test all this)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 
e3b2375ac5eb55f544bbc1f309886cc9be189fd1..0a72779bc74bc50c20c34c05b2c525cca829f33c
100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -892,6 +892,9 @@ static int geneve_xmit_skb(struct sk_buff *skb,
struct net_device *dev,
__be16 sport;
int err;

+   if (!pskb_network_may_pull(skb, sizeof(struct iphdr)))
+   return -EINVAL;
+
sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
rt = geneve_get_v4_rt(skb, dev, gs4, &fl4, info,
  geneve->cfg.info.key.tp_dst, sport);
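
(The IPv6 path presumably needs the same guard before geneve6_xmit_skb()
reads the IPv6 header; a matching sketch, untested:)

	/* Sketch: reject packets too short to hold an IPv6 header. */
	if (!pskb_network_may_pull(skb, sizeof(struct ipv6hdr)))
		return -EINVAL;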


Re: [PATCH] net: core: sk_buff: zero-fill skb->data in __alloc_skb function

2021-04-10 Thread Eric Dumazet
On Sat, Apr 10, 2021 at 11:51 AM Phillip Potter  wrote:
>
> Zero-fill skb->data in __alloc_skb function of net/core/skbuff.c,
> up to start of struct skb_shared_info bytes. Fixes a KMSAN-found
> uninit-value bug reported by syzbot at:
> https://syzkaller.appspot.com/bug?id=abe95dc3e3e9667fc23b8d81f29ecad95c6f106f
>
> Reported-by: syzbot+2e406a9ac75bb71d4...@syzkaller.appspotmail.com
> Signed-off-by: Phillip Potter 
> ---
>  net/core/skbuff.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 785daff48030..9ac26cdb5417 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -215,6 +215,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
> gfp_mask,
>  * to allow max possible filling before reallocation.
>  */
> size = SKB_WITH_OVERHEAD(ksize(data));
> +   memset(data, 0, size);
> prefetchw(data + size);


Certainly not.

There is a difference between kmalloc() and kzalloc()

Here you are basically silencing KMSAN and make it useless.

Please fix the real issue, or stop using KMSAN if it bothers you.


Re: Problem in pfmemalloc skb handling in net/core/dev.c

2021-04-09 Thread Eric Dumazet



On 4/9/21 12:14 PM, Xie He wrote:
> On Fri, Apr 9, 2021 at 3:04 AM Eric Dumazet  wrote:
>>
>> Note that pfmemalloc skbs are normally dropped in sk_filter_trim_cap()
>>
>> Simply make sure your protocol use it.
> 
> It seems "sk_filter_trim_cap" needs an "struct sock" argument. Some of
> my protocols act like a middle layer to another protocol and don't
> have any "struct sock".

Then simply copy the needed logic.

> 
> Also, I think this is a problem in net/core/dev.c, there are a lot of
> old protocols that are not aware of pfmemalloc skbs. I don't think
> it's a good idea to fix them one by one.
> 

I think you are mistaken.

There is no problem in net/core/dev.c really, it uses
skb_pfmemalloc_protocol()

pfmemalloc is best effort really.

If a layer stores packets in many long-lived queues, it has to drop pfmemalloc
packets, unless these packets are used for swapping.
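
(For a receive path that has no struct sock, the part of
sk_filter_trim_cap() worth copying reduces to a sketch like this:)

	/* Sketch: a memalloc-backed skb reaching a protocol that never
	 * set SOCK_MEMALLOC must be dropped, not queued.
	 */
	if (unlikely(skb_pfmemalloc(skb))) {
		kfree_skb(skb);
		return NET_RX_DROP;
	}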






Re: [PATCH] net/rds: Avoid potential use after free in rds_send_remove_from_sock

2021-04-09 Thread Eric Dumazet



On 4/7/21 2:09 AM, Aditya Pakki wrote:
> In case of rs failure in rds_send_remove_from_sock(), the 'rm' resource
> is freed and later under spinlock, causing potential use-after-free.
> Set the free pointer to NULL to avoid undefined behavior.
> 
> Signed-off-by: Aditya Pakki 
> ---
>  net/rds/message.c | 1 +
>  net/rds/send.c| 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/rds/message.c b/net/rds/message.c
> index 071a261fdaab..90ebcfe5fe3b 100644
> --- a/net/rds/message.c
> +++ b/net/rds/message.c
> @@ -180,6 +180,7 @@ void rds_message_put(struct rds_message *rm)
>   rds_message_purge(rm);
>  
>   kfree(rm);
> + rm = NULL;

This is a nop really.

This does not clear @rm variable in the caller.



>   }
>  }
>  EXPORT_SYMBOL_GPL(rds_message_put);
> diff --git a/net/rds/send.c b/net/rds/send.c
> index 985d0b7713ac..fe5264b9d4b3 100644
> --- a/net/rds/send.c
> +++ b/net/rds/send.c
> @@ -665,7 +665,7 @@ static void rds_send_remove_from_sock(struct list_head 
> *messages, int status)
>  unlock_and_drop:
>   spin_unlock_irqrestore(&rm->m_rs_lock, flags);
>   rds_message_put(rm);
> - if (was_on_sock)
> + if (was_on_sock && rm)
>   rds_message_put(rm);

Maybe the bug is that the refcount has not been elevated when was_on_sock
has been set.

>   }
>  
> 
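
(To spell out why the assignment is a nop: C passes the pointer by
value, so the callee only ever clears its own copy. A two-line
illustration, with a hypothetical struct foo:)

	static void put_and_null(struct foo *p)
	{
		kfree(p);
		p = NULL;	/* clears the local copy only; the caller
				 * still holds the now-dangling pointer */
	}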


Re: Problem in pfmemalloc skb handling in net/core/dev.c

2021-04-09 Thread Eric Dumazet



On 4/9/21 11:14 AM, Xie He wrote:
> On Fri, Apr 9, 2021 at 1:44 AM Mel Gorman  wrote:
>>
>> That would imply that the tap was communicating with a swap device to
>> allocate a pfmemalloc skb which shouldn't happen. Furthermore, it would
>> require the swap device to be deactivated while pfmemalloc skbs still
>> existed. Have you encountered this problem?
> 
> I'm not a user of swap devices or pfmemalloc skbs. I just want to make
> sure the protocols that I'm developing (not IP or IPv6) won't get
> pfmemalloc skbs when receiving, because those protocols cannot handle
> them.
> 
> According to the code, it seems always possible to get a pfmemalloc
> skb when a network driver calls "__netdev_alloc_skb". The skb will
> then be queued in per-CPU backlog queues when the driver calls
> "netif_rx". There seems to be nothing preventing "sk_memalloc_socks()"
> from becoming "false" after the skb is allocated and before it is
> handled by "__netif_receive_skb".
> 
> Do you mean that at the time "sk_memalloc_socks()" changes from "true"
> to "false", there would be no in-flight skbs currently being received,
> and all network communications have been paused?
> 


Note that pfmemalloc skbs are normally dropped in sk_filter_trim_cap()

Simply make sure your protocol use it.


Re: [PATCH] net: sched: sch_teql: fix null-pointer dereference

2021-04-08 Thread Eric Dumazet



On 4/8/21 5:14 PM, Pavel Tikhomirov wrote:
> Reproduce:
> 
>   modprobe sch_teql
>   tc qdisc add dev teql0 root teql0
> 
> This leads to (for instance in a CentOS 7 VM) an OOPS:
> 
>
> 
> A null-pointer dereference happens on the master->slaves dereference in
> teql_destroy(), as master is a null pointer.
> 
> When qdisc_create() calls teql_qdisc_init() it immediately fails after
> the "if (m->dev == dev)" check because both devices are teql0, and it does
> not set qdisc_priv(sch)->m, leaving it zero on the error path; then
> qdisc_create() immediately calls teql_destroy(), which does not expect
> a zero master pointer, and we get the OOPS.
> 
> Signed-off-by: Pavel Tikhomirov 
> ---

This makes sense, thanks !

Reviewed-by: Eric Dumazet 

I would think bug origin is 

Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")

Can you confirm you have this backported to 3.10.0-1062.7.1.el7.x86_64 ?
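
(For readers without the patch body in front of them, the fix amounts to
letting teql_destroy() tolerate the half-initialized qdisc; a sketch,
with the normal slave unlinking elided:)

	static void teql_destroy(struct Qdisc *sch)
	{
		struct teql_sched_data *dat = qdisc_priv(sch);
		struct teql_master *master = dat->m;

		/* teql_qdisc_init() failed before setting dat->m, so
		 * there is nothing to unlink.
		 */
		if (!master)
			return;
		/* ... unlink sch from master->slaves as before ... */
	}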




Re: [PATCH net v4] atl1c: move tx cleanup processing out of interrupt

2021-04-07 Thread Eric Dumazet



On 4/6/21 4:49 PM, Gatis Peisenieks wrote:
> Tx queue cleanup happens in interrupt handler on same core as rx queue
> processing. Both can take considerable amount of processing in high
> packet-per-second scenarios.
> 
> Sending big amounts of packets can stall the rx processing which is unfair
> and also can lead to out-of-memory condition since __dev_kfree_skb_irq
> queues the skbs for later kfree in softirq which is not allowed to happen
> with heavy load in interrupt handler.
> 

[ ... ]

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0f72ff5d34ba..489ac60b530c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6789,6 +6789,7 @@ int dev_set_threaded(struct net_device *dev, bool 
> threaded)
> 
>  return err;
>  }
> +EXPORT_SYMBOL(dev_set_threaded);
> 
>  void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
>  int (*poll)(struct napi_struct *, int), int weight)

This has already been done in net-next

Please base your patch on top of net-next, this can not be backported to old
versions anyway, without some amount of pain.





Re: [PATCH v3] net: tun: set tun->dev->addr_len during TUNSETLINK processing

2021-04-06 Thread Eric Dumazet



On 4/6/21 7:45 PM, Phillip Potter wrote:
> When changing type with TUNSETLINK ioctl command, set tun->dev->addr_len
> to match the appropriate type, using new tun_get_addr_len utility function
> which returns appropriate address length for given type. Fixes a
> KMSAN-found uninit-value bug reported by syzbot at:
> https://syzkaller.appspot.com/bug?id=0766d38c656abeace60621896d705743aeefed51
> 
> Reported-by: syzbot+001516d86dbe88862...@syzkaller.appspotmail.com
> Diagnosed-by: Eric Dumazet 
> Signed-off-by: Phillip Potter 
> ---

SGTM, thanks a lot.

Reviewed-by: Eric Dumazet 



Re: [PATCH] net: tun: set tun->dev->addr_len during TUNSETLINK processing

2021-04-06 Thread Eric Dumazet



On 4/5/21 1:35 PM, Phillip Potter wrote:
> When changing type with TUNSETLINK ioctl command, set tun->dev->addr_len
> to match the appropriate type, using new tun_get_addr_len utility function
> which returns appropriate address length for given type. Fixes a
> KMSAN-found uninit-value bug reported by syzbot at:
> https://syzkaller.appspot.com/bug?id=0766d38c656abeace60621896d705743aeefed51
> 
> Reported-by: syzbot+001516d86dbe88862...@syzkaller.appspotmail.com
> Signed-off-by: Phillip Potter 
> ---

Please give credits to people who helped.

You could have :

Suggested-by: Eric Dumazet 

Or
Diagnosed-by: Eric Dumazet 

Or at least CCed me :/



Re: [tcp] 4ecc1baf36: ltp.proc01.fail

2021-04-06 Thread Eric Dumazet
On Tue, Apr 6, 2021 at 8:41 AM kernel test robot  wrote:
>
>
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-9):
>
> commit: 4ecc1baf362c5df2dcabe242511e38ee28486545 ("tcp: convert elligible 
> sysctls to u8")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>

I think this has been fixed with commit
d24f511b04b8b159b705ec32a3b8782667d1b06a ("tcp: fix tcp_min_tso_segs sysctl")

Thanks

>
> proc01  0  TINFO  :  /proc/sys/fs/binfmt_misc/register: is write-only.
> proc01  1  TFAIL  :  proc01.c:402: read failed: 
> /proc/sys/net/ipv4/tcp_min_tso_segs: errno=EINVAL(22): Invalid argument

...

> proc01  2  TFAIL  :  proc01.c:471: readproc() failed with 1 errors.
> <<>>
> initiation_status="ok"
> duration=2 termination_type=exited termination_id=1 corefile=no
> cutime=13 cstime=233
> Oliver Sang
>


Re: [PATCH] net: initialize local variables in net/ipv6/mcast.c and net/ipv4/igmp.c

2021-04-02 Thread Eric Dumazet



On 4/2/21 10:53 PM, Eric Dumazet wrote:
> 
> 
> On 4/2/21 8:10 PM, Phillip Potter wrote:
>> On Fri, Apr 02, 2021 at 07:49:44PM +0200, Eric Dumazet wrote:
>>>
>>>
>>> On 4/2/21 7:36 PM, Phillip Potter wrote:
>>>> Use memset to initialize two local buffers in net/ipv6/mcast.c,
>>>> and another in net/ipv4/igmp.c. Fixes a KMSAN found uninit-value
>>>> bug reported by syzbot at:
>>>> https://syzkaller.appspot.com/bug?id=0766d38c656abeace60621896d705743aeefed51
>>>
>>>
>>> According to this link, the bug no longer triggers.
>>>
>>> Please explain why you think it is still there.
>>>
>>
>> Dear Eric,
>>
>> It definitely still triggers, tested it on the master branch of
>> https://github.com/google/kmsan last night. The patch which fixes the
>> crash on that page is the same patch I've sent in.
> 
> Please send the full report (stack trace)

I think your patch just silences the real problem.

The issue at hand is that TUNSETLINK changes dev->type without making
any change to dev->addr_len

This is the real issue.

If you care about this, please fix the tun driver.
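
(The direction such a fix can take: derive the link-layer address length
from the new type while handling TUNSETLINK. A trimmed sketch of that
helper; the ARPHRD cases shown are an illustrative subset, not the full
mapping:)

	/* Sketch: keep dev->addr_len consistent with dev->type. */
	static unsigned short tun_get_addr_len(unsigned short type)
	{
		switch (type) {
		case ARPHRD_TUNNEL6:
			return sizeof(struct in6_addr);
		case ARPHRD_IPGRE:
		case ARPHRD_TUNNEL:
		case ARPHRD_SIT:
			return 4;	/* IPv4 address */
		case ARPHRD_ETHER:
			return ETH_ALEN;
		default:
			return 0;
		}
	}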



Re: [PATCH] net: initialize local variables in net/ipv6/mcast.c and net/ipv4/igmp.c

2021-04-02 Thread Eric Dumazet



On 4/2/21 8:10 PM, Phillip Potter wrote:
> On Fri, Apr 02, 2021 at 07:49:44PM +0200, Eric Dumazet wrote:
>>
>>
>> On 4/2/21 7:36 PM, Phillip Potter wrote:
>>> Use memset to initialize two local buffers in net/ipv6/mcast.c,
>>> and another in net/ipv4/igmp.c. Fixes a KMSAN found uninit-value
>>> bug reported by syzbot at:
>>> https://syzkaller.appspot.com/bug?id=0766d38c656abeace60621896d705743aeefed51
>>
>>
>> According to this link, the bug no longer triggers.
>>
>> Please explain why you think it is still there.
>>
> 
> Dear Eric,
> 
> It definitely still triggers, tested it on the master branch of
> https://github.com/google/kmsan last night. The patch which fixes the
> crash on that page is the same patch I've sent in.

Please send the full report (stack trace)

Thanks.





Re: [PATCH net v2] atl1c: move tx cleanup processing out of interrupt

2021-04-02 Thread Eric Dumazet



On 4/2/21 7:20 PM, Gatis Peisenieks wrote:
> Tx queue cleanup happens in the interrupt handler on the same core as rx
> queue processing.
> Both can take a considerable amount of processing in high packet-per-second
> scenarios.
> 

...

> @@ -2504,6 +2537,7 @@ static int atl1c_init_netdev(struct net_device *netdev, 
> struct pci_dev *pdev)
>  NETIF_F_TSO6;
>  netdev->features =    netdev->hw_features    |
>  NETIF_F_HW_VLAN_CTAG_TX;
> +    netdev->threaded = true;

Shouldn't it use dev_set_threaded() ?

>  return 0;
>  }
> 
> @@ -2588,6 +2622,7 @@ static int atl1c_probe(struct pci_dev *pdev, const 
> struct pci_device_id *ent)
>  adapter->mii.phy_id_mask = 0x1f;
>  adapter->mii.reg_num_mask = MDIO_CTRL_REG_MASK;
>  netif_napi_add(netdev, &adapter->napi, atl1c_clean, 64);
> +    netif_napi_add(netdev, &adapter->tx_napi, atl1c_clean_tx, 64);
>  timer_setup(&adapter->phy_config_timer, atl1c_phy_config, 0);
>  /* setup the private structure */
>  err = atl1c_sw_init(adapter);
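The helper Eric points at does more than set the flag: it also creates and
starts the per-NAPI kthreads, which a bare netdev->threaded write does not.
A sketch of calling it from probe; the error handling is an assumption, not
taken from the driver:

	err = dev_set_threaded(netdev, true);
	if (err)
		netdev_warn(netdev,
			    "failed to enable threaded NAPI: %d\n", err);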


Re: [PATCH net v2] atl1c: move tx cleanup processing out of interrupt

2021-04-02 Thread Eric Dumazet



On 4/2/21 7:20 PM, Gatis Peisenieks wrote:
> Tx queue cleanup happens in the interrupt handler on the same core as rx
> queue processing.
> Both can take a considerable amount of processing in high packet-per-second
> scenarios.
> 
> Sending big amounts of packets can stall the rx processing, which is unfair
> and can also lead to an out-of-memory condition, since __dev_kfree_skb_irq
> queues the skbs for later kfree in softirq, which is not allowed to happen
> with heavy load in the interrupt handler.
> 
> This puts tx cleanup in its own napi and enables threaded napi to allow the
> rx/tx queue processing to happen on different cores.
> 
> The ability to sustain equal amounts of tx/rx traffic increased:
> from 280Kpps to 1130Kpps on Threadripper 3960X with upcoming Mikrotik 10/25G NIC,
> from 520Kpps to 850Kpps on Intel i3-3320 with Mikrotik RB44Ge adapter.
> 
> Signed-off-by: Gatis Peisenieks 
> ---
>  drivers/net/ethernet/atheros/atl1c/atl1c.h    |  2 +
>  .../net/ethernet/atheros/atl1c/atl1c_main.c   | 43 +--
>  2 files changed, 41 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c.h 
> b/drivers/net/ethernet/atheros/atl1c/atl1c.h
> index a0562a90fb6d..4404fa44d719 100644
> --- a/drivers/net/ethernet/atheros/atl1c/atl1c.h
> +++ b/drivers/net/ethernet/atheros/atl1c/atl1c.h
> @@ -506,6 +506,7 @@ struct atl1c_adapter {
>  struct net_device   *netdev;
>  struct pci_dev  *pdev;
>  struct napi_struct  napi;
> +    struct napi_struct  tx_napi;
>  struct page *rx_page;
>  unsigned int    rx_page_offset;
>  unsigned int    rx_frag_size;
> @@ -529,6 +530,7 @@ struct atl1c_adapter {
>  u16 link_duplex;
> 
>  spinlock_t mdio_lock;
> +    spinlock_t irq_mask_lock;
>  atomic_t irq_sem;
> 
>  struct work_struct common_task;
> diff --git a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c 
> b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> index 3f65f2b370c5..f51b28e8b6dc 100644
> --- a/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> +++ b/drivers/net/ethernet/atheros/atl1c/atl1c_main.c
> @@ -813,6 +813,7 @@ static int atl1c_sw_init(struct atl1c_adapter *adapter)
>  atl1c_set_rxbufsize(adapter, adapter->netdev);
>  atomic_set(&adapter->irq_sem, 1);
>  spin_lock_init(&adapter->mdio_lock);
> +    spin_lock_init(&adapter->irq_mask_lock);
>  set_bit(__AT_DOWN, &adapter->flags);
> 
>  return 0;
> @@ -1530,7 +1531,7 @@ static inline void atl1c_clear_phy_int(struct 
> atl1c_adapter *adapter)
>  spin_unlock(&adapter->mdio_lock);
>  }
> 
> -static bool atl1c_clean_tx_irq(struct atl1c_adapter *adapter,
> +static unsigned atl1c_clean_tx_irq(struct atl1c_adapter *adapter,
>  enum atl1c_trans_queue type)


This v2 is much better, thanks.

You might rename atl1c_clean_tx_irq(), because it is no longer run under
hard irqs?

Maybe merge atl1c_clean_tx_irq() and atl1c_clean_tx() into a single function ?
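A rough shape of that merge: one NAPI poll function that reclaims
descriptors and re-arms the TX interrupt itself. The names follow the
quoted driver code, but the reclaim helper and the locking details below
are assumptions, not the code as merged:

static int atl1c_clean_tx(struct napi_struct *napi, int budget)
{
	struct atl1c_adapter *adapter =
		container_of(napi, struct atl1c_adapter, tx_napi);
	unsigned int work = atl1c_tx_reclaim(adapter, atl1c_trans_normal);

	if (work < budget) {
		napi_complete_done(napi, work);
		/* re-enable the TX interrupt under irq_mask_lock */
		spin_lock_irq(&adapter->irq_mask_lock);
		adapter->hw.intr_mask |= ISR_TX_PKT;
		AT_WRITE_REG(&adapter->hw, REG_IMR, adapter->hw.intr_mask);
		spin_unlock_irq(&adapter->irq_mask_lock);
	}
	return min_t(int, work, budget);
}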




Re: [PATCH] net: initialize local variables in net/ipv6/mcast.c and net/ipv4/igmp.c

2021-04-02 Thread Eric Dumazet



On 4/2/21 7:36 PM, Phillip Potter wrote:
> Use memset to initialize two local buffers in net/ipv6/mcast.c,
> and another in net/ipv4/igmp.c. Fixes a KMSAN found uninit-value
> bug reported by syzbot at:
> https://syzkaller.appspot.com/bug?id=0766d38c656abeace60621896d705743aeefed51


According to this link, the bug no longer triggers.

Please explain why you think it is still there.

> 
> Reported-by: syzbot+001516d86dbe88862...@syzkaller.appspotmail.com
> Signed-off-by: Phillip Potter 
> ---
>  net/ipv4/igmp.c  | 2 ++
>  net/ipv6/mcast.c | 4 
>  2 files changed, 6 insertions(+)
> 
> diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
> index 7b272bbed2b4..bc8e358a9a2a 100644
> --- a/net/ipv4/igmp.c
> +++ b/net/ipv4/igmp.c
> @@ -1131,6 +1131,8 @@ static void ip_mc_filter_add(struct in_device *in_dev, 
> __be32 addr)
>   char buf[MAX_ADDR_LEN];
>   struct net_device *dev = in_dev->dev;
>  
> + memset(buf, 0, sizeof(buf));
> +
>   /* Checking for IFF_MULTICAST here is WRONG-WRONG-WRONG.
>  We will get multicast token leakage, when IFF_MULTICAST
>  is changed. This check should be done in ndo_set_rx_mode
> diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
> index 6c8604390266..ad90dc28f318 100644
> --- a/net/ipv6/mcast.c
> +++ b/net/ipv6/mcast.c
> @@ -658,6 +658,8 @@ static void igmp6_group_added(struct ifmcaddr6 *mc)
>   struct net_device *dev = mc->idev->dev;
>   char buf[MAX_ADDR_LEN];
>  
> + memset(buf, 0, sizeof(buf));
> +
>   if (IPV6_ADDR_MC_SCOPE(&mc->mca_addr) <
>   IPV6_ADDR_SCOPE_LINKLOCAL)
>   return;
> @@ -694,6 +696,8 @@ static void igmp6_group_dropped(struct ifmcaddr6 *mc)
>   struct net_device *dev = mc->idev->dev;
>   char buf[MAX_ADDR_LEN];
>  
> + memset(buf, 0, sizeof(buf));
> +
>   if (IPV6_ADDR_MC_SCOPE(&mc->mca_addr) <
>   IPV6_ADDR_SCOPE_LINKLOCAL)
>   return;
> 


Re: [PATCH net] atl1c: move tx cleanup processing out of interrupt

2021-04-01 Thread Eric Dumazet



On 4/1/21 7:32 PM, Gatis Peisenieks wrote:
> Tx queue cleanup happens in the interrupt handler on the same core as rx
> queue processing.
> Both can take a considerable amount of processing in high packet-per-second
> scenarios.
> 
> Sending big amounts of packets can stall the rx processing, which is unfair
> and can also lead to an out-of-memory condition, since __dev_kfree_skb_irq
> queues the skbs for later kfree in softirq, which is not allowed to happen
> with heavy load in the interrupt handler.
> 
> This puts tx cleanup in its own napi to be executed on a different core than
> rx cleanup, which solves the mentioned problems and increases the throughput.
> 
> The ability to sustain equal amounts of tx/rx traffic increased:
> from 280Kpps to 1130Kpps on Threadripper 3960X with upcoming Mikrotik 10/25G NIC,
> from 520Kpps to 850Kpps on Intel i3-3320 with Mikrotik RB44Ge adapter.
> 
> Signed-off-by: Gatis Peisenieks 



>  }
>  }
> -    if (status & ISR_TX_PKT)
> -    atl1c_clean_tx_irq(adapter, atl1c_trans_normal);
> +    if (status & ISR_TX_PKT) {
> +    if (napi_schedule_prep(&adapter->tx_napi)) {
> +    int tx_cpu = (smp_processor_id() + 1) %
> +    num_online_cpus();

Ouch. Please do not bury such a hard-coded facility in a driver.

There is no guarantee tx_cpu is an online cpu.

> +    spin_lock(&adapter->irq_mask_lock);
> +    hw->intr_mask &= ~ISR_TX_PKT;
> +    AT_WRITE_REG(hw, REG_IMR, hw->intr_mask);
> +    spin_unlock(&adapter->irq_mask_lock);
> +    smp_call_function_single_async(tx_cpu,
> +   &adapter->csd);
> +    }
> +    }
>
Have you tried using a second NAPI, but no csd ?

We now have kthread napi [1]

By simply enabling kthread NAPI on your NIC, you should get the same nice behavior.

As a bonus, on moderate load you would use a single cpu instead of two.

[1]
cb038357937ee4f589aab2469ec3896dce90f317 net: fix race between napi kthread mode and busy poll
5fdd2f0e5c64846bf3066689b73fc3b81c74 net: add sysfs attribute to control napi threaded mode
29863d41bb6e1d969c62fdb15b0961806942960e net: implement threaded-able napi poll loop support
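Once those three commits are in a kernel, threaded NAPI can be flipped on
from userspace through the sysfs attribute named in the second commit
title. A minimal C sketch; the sysfs path and the eth0 interface name are
assumptions:

#include <stdio.h>

int main(void)
{
	/* path follows the "sysfs attribute to control napi threaded
	 * mode" commit; eth0 is a placeholder interface name */
	FILE *f = fopen("/sys/class/net/eth0/threaded", "w");

	if (!f) {
		perror("open threaded attribute");
		return 1;
	}
	fputs("1", f);
	fclose(f);
	return 0;
}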


Re: [PATCH net-next] net: document a side effect of ip_local_reserved_ports

2021-04-01 Thread Eric Dumazet
On Thu, Apr 1, 2021 at 5:58 PM Otto Hollmann  wrote:
>
> If there is overlap between ip_local_port_range and
> ip_local_reserved_ports with a huge reserved block, it will affect the
> probability of selecting ephemeral ports, see file
> net/ipv4/inet_hashtables.c:723
>
> int __inet_hash_connect(
> ...
> 	for (i = 0; i < remaining; i += 2, port += 2) {
> 		if (unlikely(port >= high))
> 			port -= remaining;
> 		if (inet_is_local_reserved_port(net, port))
> 			continue;
>
> E.g. if there is a reserved block of 10000 ports, the two ports right after
> this block will be about 5000 times more likely to be selected than others.
> If this was intended, we can/should add a note to the documentation as
> proposed in this commit, otherwise we should think about a different solution.
> One option could be a mapping table of contiguous port ranges. A second option
> could be letting the user modify the step (port += 2) in the above loop, e.g.
> using a new sysctl parameter.
>
> Signed-off-by: Otto Hollmann 

I think we can view this as a security bug that needs a fix.
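A toy model of the quoted loop makes the skew easy to see. The range and
reserved-block sizes below are invented for illustration; only the
walk-and-skip logic mirrors __inet_hash_connect():

#include <stdio.h>

#define LOW    32768
#define HIGH   61000
#define RES_LO 40000
#define RES_HI 50000	/* a 10000-port reserved block */

static int reserved(int port)
{
	return port >= RES_LO && port < RES_HI;
}

int main(void)
{
	static int hits[HIGH];
	int remaining = HIGH - LOW;

	/* try every even starting offset the hash could pick */
	for (int start = LOW; start < HIGH; start += 2) {
		int port = start;

		for (int i = 0; i < remaining; i += 2, port += 2) {
			if (port >= HIGH)
				port -= remaining;
			if (reserved(port))
				continue;
			hits[port]++;	/* first acceptable port wins */
			break;
		}
	}
	printf("port %d: picked %d times; port %d: picked %d times\n",
	       RES_HI, hits[RES_HI], RES_HI + 4, hits[RES_HI + 4]);
	return 0;
}

Running it prints roughly "port 50000: picked 5001 times; port 50004:
picked 1 times", i.e. the ~5000x bias described above.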


Re: [PATCH AUTOSEL 5.11 10/38] net: correct sk_acceptq_is_full()

2021-03-31 Thread Eric Dumazet



On 3/30/21 12:21 AM, Sasha Levin wrote:
> From: liuyacan 
> 
> [ Upstream commit f211ac154577ec9ccf07c15f18a6abf0d9bdb4ab ]
> 
> The "backlog" argument in listen() specifies
> the maximum length of pending connections,
> so the accept queue should be considered full
> if there are exactly "backlog" elements.
> 
> Signed-off-by: liuyacan 
> Signed-off-by: David S. Miller 
> Signed-off-by: Sasha Levin 
> ---
>  include/net/sock.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 129d200bccb4..a95f38a4b8c6 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -936,7 +936,7 @@ static inline void sk_acceptq_added(struct sock *sk)
>  
>  static inline bool sk_acceptq_is_full(const struct sock *sk)
>  {
> - return READ_ONCE(sk->sk_ack_backlog) > READ_ONCE(sk->sk_max_ack_backlog);
> + return READ_ONCE(sk->sk_ack_backlog) >= READ_ONCE(sk->sk_max_ack_backlog);
>  }
>  
>  /*
> 




I have not seen this patch going in our trees.

First, there was no Fixes: tag, so this is quite unfortunate.

Second, we already had such wrong patches in the past.

Please look at commits
64a146513f8f12ba204b7bf5cb7e9505594ead42 [NET]: Revert incorrect accept queue backlog changes.
8488df894d05d6fa41c2bd298c335f944bb0e401 [NET]: Fix bugs in "Whether sock accept queue is full" checking

Please revert  this patch, thanks !
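For readers wondering why the comparison is '>' in the first place: the
historical semantics deliberately admit one more pending connection than
the listen() backlog, and the two reverts above exist because changing
that silently shrinks every accept queue by one. A sketch of the
distinction, with simplified types:

/* With listen(fd, 2), the queue counts as full only once a third
 * pending connection exists; the '>=' variant would reject the third
 * connection that applications have always been able to get.
 */
static bool acceptq_is_full(unsigned int ack_backlog,
			    unsigned int max_ack_backlog)
{
	return ack_backlog > max_ack_backlog;	/* not >= */
}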



Re: BUG: use-after-free in macvlan_broadcast

2021-03-30 Thread Eric Dumazet



On 3/30/21 12:11 PM, Hao Sun wrote:
> Hi
> 
> When using Healer(https://github.com/SunHao-0/healer/tree/dev) to fuzz
> the Linux kernel, I found a use-after-free vulnerability in
> macvlan_broadcast.
> Hope the report can help you locate the problem.
> 
> Details:
> commit:   5695e5161 Linux 5.11
> git tree:   upstream
> kernel config and reproducing program can be found in the attachment.
> report:
> ==
> BUG: KASAN: use-after-free in macvlan_broadcast+0x595/0x6e0
> Read of size 4 at addr 88806ae70801 by task syz-executor392/8448
> 
> CPU: 0 PID: 8448 Comm: syz-executor392 Not tainted 5.4.0 #14

Why is 5.4.0  printed if your tree is linux-5.11 ?


You describe something that has been fixed already in linux-5.5


> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.13.0-1ubuntu1.1 04/01/2014
> Call Trace:
>  dump_stack+0xf3/0x16a
>  print_address_description.constprop.0.cold+0xd3/0x372
>  __kasan_report.cold+0x75/0x9b
>  kasan_report+0x12/0x20
>  macvlan_broadcast+0x595/0x6e0
>  macvlan_start_xmit+0x40e/0x630
>  dev_direct_xmit+0x3f6/0x5f0
>  packet_sendmsg+0x233b/0x5a60
>  sock_sendmsg+0xd7/0x130
>  __sys_sendto+0x21e/0x330
>  __x64_sys_sendto+0xe1/0x1b0
>  do_syscall_64+0xbf/0x640
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x453d6d
> Code: c3 e8 77 2c 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48
> 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
> 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7fff89ba8378 EFLAGS: 0246 ORIG_RAX: 002c
> RAX: ffda RBX: 0003 RCX: 00453d6d
> RDX:  RSI:  RDI: 0003
> RBP: 0040 R08: 2280 R09: 0014
> R10:  R11: 0246 R12: 0003
> R13: 7fff89ba83e8 R14: 7fff89ba83f0 R15: 
> 
> Allocated by task 8448:
>  save_stack+0x1b/0x80
>  __kasan_kmalloc.constprop.0+0xc2/0xd0
>  __kmalloc_node_track_caller+0x116/0x410
>  __kmalloc_reserve.isra.0+0x35/0xd0
>  __alloc_skb+0xf3/0x5c0
>  rtmsg_fib+0x16f/0x11a0
>  fib_table_insert+0x66e/0x1560
>  fib_magic+0x406/0x570
>  fib_add_ifaddr+0x39a/0x520
>  fib_netdev_event+0x3d0/0x560
>  notifier_call_chain+0xc0/0x230
>  __dev_notify_flags+0x125/0x2d0
>  dev_change_flags+0x104/0x160
>  do_setlink+0x999/0x32a0
>  __rtnl_newlink+0xac8/0x14c0
>  rtnl_newlink+0x68/0xa0
>  rtnetlink_rcv_msg+0x4a4/0xb70
>  netlink_rcv_skb+0x15e/0x410
>  netlink_unicast+0x4d4/0x690
>  netlink_sendmsg+0x8ae/0xd00
>  sock_sendmsg+0xd7/0x130
>  __sys_sendto+0x21e/0x330
>  __x64_sys_sendto+0xe1/0x1b0
>  do_syscall_64+0xbf/0x640
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Freed by task 8448:
>  save_stack+0x1b/0x80
>  __kasan_slab_free+0x126/0x170
>  kfree+0xfa/0x460
>  skb_free_head+0x8b/0xa0
>  pskb_expand_head+0x2bd/0xe80
>  netlink_trim+0x203/0x260
>  netlink_broadcast_filtered+0x61/0xbd0
>  nlmsg_notify+0x15b/0x1a0
>  rtmsg_fib+0x2eb/0x11a0
>  fib_table_insert+0x66e/0x1560
>  fib_magic+0x406/0x570
>  fib_add_ifaddr+0x39a/0x520
>  fib_netdev_event+0x3d0/0x560
>  notifier_call_chain+0xc0/0x230
>  __dev_notify_flags+0x125/0x2d0
>  dev_change_flags+0x104/0x160
>  do_setlink+0x999/0x32a0
>  __rtnl_newlink+0xac8/0x14c0
>  rtnl_newlink+0x68/0xa0
>  rtnetlink_rcv_msg+0x4a4/0xb70
>  netlink_rcv_skb+0x15e/0x410
>  netlink_unicast+0x4d4/0x690
>  netlink_sendmsg+0x8ae/0xd00
>  sock_sendmsg+0xd7/0x130
>  __sys_sendto+0x21e/0x330
>  __x64_sys_sendto+0xe1/0x1b0
>  do_syscall_64+0xbf/0x640
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> The buggy address belongs to the object at 88806ae70800
>  which belongs to the cache kmalloc-1k of size 1024
> The buggy address is located 1 bytes inside of
>  1024-byte region [88806ae70800, 88806ae70c00)
> The buggy address belongs to the page:
> page:ea0001ab9c00 refcount:1 mapcount:0 mapping:88806b802280
> index:0x0 compound_mapcount: 0
> raw: 00fff0010200 dead0100 dead0122 88806b802280
> raw:  00080008 0001 
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
>  88806ae70700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  88806ae70780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>> 88806ae70800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>^
>  88806ae70880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  88806ae70900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==
> 


Re: [PATCH v2] wireless/nl80211.c: fix uninitialized variable

2021-03-30 Thread Eric Dumazet



On 3/30/21 7:22 PM, Alaa Emad wrote:
> This change fixes a KMSAN uninit-value report in net/wireless/nl80211.c:225,
> caused by the `fixedlen` variable being used uninitialized, so I initialized
> it to zero.
> 
> Reported-by: syzbot+72b99dcf4607e8c77...@syzkaller.appspotmail.com
> Signed-off-by: Alaa Emad 
> ---
> Changes in v2:
>   - Make the commit message more clearer.
> ---
>  net/wireless/nl80211.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
> index 775d0c4d86c3..b87ab67ad33d 100644
> --- a/net/wireless/nl80211.c
> +++ b/net/wireless/nl80211.c
> @@ -210,7 +210,7 @@ static int validate_beacon_head(const struct nlattr *attr,
>   const struct element *elem;
>   const struct ieee80211_mgmt *mgmt = (void *)data;
>   bool s1g_bcn = ieee80211_is_s1g_beacon(mgmt->frame_control);
> - unsigned int fixedlen, hdrlen;
> + unsigned int fixedlen = 0, hdrlen;
>  
>   if (s1g_bcn) {
>   fixedlen = offsetof(struct ieee80211_ext,
> 

What was the report exactly ?

Current code does :

unsigned int fixedlen;

if (s1g_bcn) {
	fixedlen = something1;
	...
} else {
	fixedlen = something2;
	...
}

So your patch does nothing.

Initial value of @fixedlen is not relevant.

Reading this code (without access to KMSAN report) I suspect the issue
is more like the following :

diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index 775d0c4d86c3..d815261917ff 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -209,9 +209,12 @@ static int validate_beacon_head(const struct nlattr *attr,
unsigned int len = nla_len(attr);
const struct element *elem;
const struct ieee80211_mgmt *mgmt = (void *)data;
-   bool s1g_bcn = ieee80211_is_s1g_beacon(mgmt->frame_control);
unsigned int fixedlen, hdrlen;
+   bool s1g_bcn;
 
+   if (len < offsetofend(typeof(*mgmt), frame_control))
+   goto err;
+   s1g_bcn = ieee80211_is_s1g_beacon(mgmt->frame_control);
if (s1g_bcn) {
fixedlen = offsetof(struct ieee80211_ext,
u.s1g_beacon.variable);
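For reference, offsetofend() as used in the added check is defined in
include/linux/stddef.h as:

#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER))

so the new length test guarantees frame_control lies entirely inside the
attribute before ieee80211_is_s1g_beacon() reads it.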



Re: [syzbot] WARNING in xfrm_alloc_compat (2)

2021-03-29 Thread Eric Dumazet



On 3/29/21 9:57 PM, Dmitry Safonov wrote:
> Hi,
> 
> On 3/29/21 8:04 PM, syzbot wrote:
>> Hello,
>>
>> syzbot found the following issue on:
>>
>> HEAD commit:6c996e19 net: change netdev_unregister_timeout_secs min va..
>> git tree:   net-next
>> console output: https://syzkaller.appspot.com/x/log.txt?x=102e5926d0
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=c0ac79756537274e
>> dashboard link: https://syzkaller.appspot.com/bug?extid=834ffd1afc7212eb8147
>> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=10a7b1aad0
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17ae6b7cd0
>>
>> The issue was bisected to:
>>
>> commit 5f3eea6b7e8f58cf5c8a9d4b9679dc19e9e67ba3
>> Author: Dmitry Safonov 
>> Date:   Mon Sep 21 14:36:53 2020 +
>>
>> xfrm/compat: Attach xfrm dumps to 64=>32 bit translator
>>
>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10694b3ad0
>> final oops: https://syzkaller.appspot.com/x/report.txt?x=12694b3ad0
>> console output: https://syzkaller.appspot.com/x/log.txt?x=14694b3ad0
>>
>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>> Reported-by: syzbot+834ffd1afc7212eb8...@syzkaller.appspotmail.com
>> Fixes: 5f3eea6b7e8f ("xfrm/compat: Attach xfrm dumps to 64=>32 bit 
>> translator")
>>
>> netlink: 208 bytes leftover after parsing attributes in process 
>> `syz-executor193'.
>> [ cut here ]
>> unsupported nla_type 356
> 
> This doesn't seem to be an issue.
> Userspace sent message with nla_type 356, which is > __XFRM_MSG_MAX, so
> this warning is expected. I've added it so when a new XFRM_MSG_* will be
> added, to make sure that there will be translations to such messages in
> xfrm_compat.ko (if the translation needed).
> This is WARN_ON_ONCE(), so it can't be used as DOS.
> 
> Ping me if you'd expect something other than a once-per-boot warning for
> this. Not sure how to close the syzkaller bug; the docs have only an
> `invalid' tag, which isn't a `not-a-bug'/`expected' tag:
> https://github.com/google/syzkaller/blob/master/docs/syzbot.md
> 

You should not use WARN_ON_ONCE() for this case (if user space can trigger it)

https://lwn.net/Articles/769365/


Greg Kroah-Hartman raised the problem of core kernel API code that will use 
WARN_ON_ONCE() to complain about bad usage; that will not generate the desired 
result if WARN_ON_ONCE() is configured to crash the machine. He was told that 
the code should just call pr_warn() instead, and that the called function 
should return an error in such situations. It was generally agreed that any 
WARN_ON() or WARN_ON_ONCE() calls that can be triggered from user space need to 
be fixed. 
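Applied to the quoted xfrm warning, that guidance would look roughly like
the following; the surrounding context is assumed, not the actual
xfrm_compat.c code:

	/* user-triggerable: log ratelimited and fail instead of WARN */
	if (type > __XFRM_MSG_MAX) {
		pr_warn_ratelimited("unsupported nla_type %d\n", type);
		return -EOPNOTSUPP;
	}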





Re: [PATCH net-next v2] net: change netdev_unregister_timeout_secs min value to 1

2021-03-25 Thread Eric Dumazet



On 3/25/21 3:52 PM, Dmitry Vyukov wrote:
> netdev_unregister_timeout_secs=0 can lead to printing the
> "waiting for dev to become free" message every jiffy.
> This is too frequent and unnecessary.
> Set the min value to 1 second.
> 
> Also fix the merge issue introduced by
> "net: make unregister netdev warning timeout configurable":
> it changed "refcnt != 1" to "refcnt".
> 
> Signed-off-by: Dmitry Vyukov 
> Suggested-by: Eric Dumazet 
> Fixes: 5aa3afe107d9 ("net: make unregister netdev warning timeout 
> configurable")
> Cc: net...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
> Changes since v1:
>  - fix merge issue related to refcnt check

Reviewed-by: Eric Dumazet 

Thanks !



Re: [PATCH] net: change netdev_unregister_timeout_secs min value to 1

2021-03-25 Thread Eric Dumazet



On 3/25/21 3:38 PM, Dmitry Vyukov wrote:
> On Thu, Mar 25, 2021 at 3:34 PM Eric Dumazet  wrote:
>> On 3/25/21 11:31 AM, Dmitry Vyukov wrote:
>>> netdev_unregister_timeout_secs=0 can lead to printing the
>>> "waiting for dev to become free" message every jiffy.
>>> This is too frequent and unnecessary.
>>> Set the min value to 1 second.
>>>
>>> Signed-off-by: Dmitry Vyukov 
>>> Suggested-by: Eric Dumazet 
>>> Fixes: 5aa3afe107d9 ("net: make unregister netdev warning timeout 
>>> configurable")
>>> Cc: net...@vger.kernel.org
>>> Cc: linux-kernel@vger.kernel.org
>>> ---
>>
>> Please respin your patch, and fix the merge issue [1]
> 
> Is net-next rebuilt and rebased? Do I send v4 of the whole change?
> I cannot base it on net-next now, because net-next already includes
> most of it... so what should I use as base then?
> 
>> For networking patches it is customary to tell if its for net or net-next 
>> tree.
>>
>> [1]
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 
>> 4bb6dcdbed8b856c03dc4af8b7fafe08984e803f..7bb00b8b86c6494c033cf57460f96ff3adebe081
>>  100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -10431,7 +10431,7 @@ static void netdev_wait_allrefs(struct net_device 
>> *dev)
>>
>> refcnt = netdev_refcnt_read(dev);
>>
>> -   if (refcnt &&
>> +   if (refcnt != 1 &&
>> time_after(jiffies, warning_time +
>>netdev_unregister_timeout_secs * HZ)) {
>> pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",

Please include my fix into your patch.

Send a V2, based on current net-next.

net-next is never rebased, we have to fix the bug by adding a fix on top of it.




Re: [PATCH] net: change netdev_unregister_timeout_secs min value to 1

2021-03-25 Thread Eric Dumazet



On 3/25/21 11:31 AM, Dmitry Vyukov wrote:
> netdev_unregister_timeout_secs=0 can lead to printing the
> "waiting for dev to become free" message every jiffy.
> This is too frequent and unnecessary.
> Set the min value to 1 second.
> 
> Signed-off-by: Dmitry Vyukov 
> Suggested-by: Eric Dumazet 
> Fixes: 5aa3afe107d9 ("net: make unregister netdev warning timeout 
> configurable")
> Cc: net...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---

Please respin your patch, and fix the merge issue [1]

For networking patches it is customary to tell if its for net or net-next tree.

[1]
diff --git a/net/core/dev.c b/net/core/dev.c
index 
4bb6dcdbed8b856c03dc4af8b7fafe08984e803f..7bb00b8b86c6494c033cf57460f96ff3adebe081
 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10431,7 +10431,7 @@ static void netdev_wait_allrefs(struct net_device *dev)
 
refcnt = netdev_refcnt_read(dev);
 
-   if (refcnt &&
+   if (refcnt != 1 &&
time_after(jiffies, warning_time +
   netdev_unregister_timeout_secs * HZ)) {
pr_emerg("unregister_netdevice: waiting for %s to 
become free. Usage count = %d\n",


Re: [PATCH v2] net: make unregister netdev warning timeout configurable

2021-03-25 Thread Eric Dumazet
On Thu, Mar 25, 2021 at 8:39 AM Dmitry Vyukov  wrote:
>
> On Wed, Mar 24, 2021 at 10:40 AM Eric Dumazet  wrote:
> >
> > On Tue, Mar 23, 2021 at 7:49 AM Dmitry Vyukov  wrote:
> > >
> > > netdev_wait_allrefs() issues a warning if refcount does not drop to 0
> > > after 10 seconds. While a 10 second wait generally should not happen
> > > under a normal workload in a normal environment, it seems to fire falsely
> > > very often during fuzzing and/or in qemu emulation (~10x slower).
> > > At least it's not possible to understand if it's really a false
> > > positive or not. Automated testing generally bumps all timeouts
> > > to very high values to avoid flake failures.
> > > Add net.core.netdev_unregister_timeout_secs sysctl to make
> > > the timeout configurable for automated testing systems.
> > > Lowering the timeout may also be useful for e.g. manual bisection.
> > > The default value matches the current behavior.
> > >
> > > Signed-off-by: Dmitry Vyukov 
> > > Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
> > > Cc: net...@vger.kernel.org
> > > Cc: linux-kernel@vger.kernel.org
> > >
> > > ---
> > > Changes since v1:
> > >  - use sysctl instead of a config
> > > ---
> >
> > > },
> > > +   {
> > > +   .procname   = "netdev_unregister_timeout_secs",
> > > +   .data   = &netdev_unregister_timeout_secs,
> > > +   .maxlen = sizeof(unsigned int),
> > > +   .mode   = 0644,
> > > +   .proc_handler   = proc_dointvec_minmax,
> > > +   .extra1 = SYSCTL_ZERO,
> > > +   .extra2 = &int_3600,
> > > +   },
> > > { }
> > >  };
> > >
> >
> > If we allow the sysctl to be 0, then we risk a flood of pr_emerg()
> > (one per jiffy ?)
>
> My reasoning was that it's up to the user. Some spammy output on the
> console for rare events is probably not the worst way root can
> misconfigure the kernel :)
> It allows one to check (more or less) if we are reaching
> unregister_netdevice with non-zero refcount, which may be useful for
> some debugging maybe.
> But I don't mind changing it to 1 (or 5) if you prefer. On syzbot we
> only want to increase it.
>

Yes, please use a lower limit of one to avoid spurious reports.


Re: [PATCH v2] net: make unregister netdev warning timeout configurable

2021-03-24 Thread Eric Dumazet
On Tue, Mar 23, 2021 at 7:49 AM Dmitry Vyukov  wrote:
>
> netdev_wait_allrefs() issues a warning if refcount does not drop to 0
> after 10 seconds. While a 10 second wait generally should not happen
> under a normal workload in a normal environment, it seems to fire falsely
> very often during fuzzing and/or in qemu emulation (~10x slower).
> At least it's not possible to understand if it's really a false
> positive or not. Automated testing generally bumps all timeouts
> to very high values to avoid flake failures.
> Add net.core.netdev_unregister_timeout_secs sysctl to make
> the timeout configurable for automated testing systems.
> Lowering the timeout may also be useful for e.g. manual bisection.
> The default value matches the current behavior.
>
> Signed-off-by: Dmitry Vyukov 
> Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
> Cc: net...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
>
> ---
> Changes since v1:
>  - use sysctl instead of a config
> ---

> },
> +   {
> +   .procname   = "netdev_unregister_timeout_secs",
> +   .data   = &netdev_unregister_timeout_secs,
> +   .maxlen = sizeof(unsigned int),
> +   .mode   = 0644,
> +   .proc_handler   = proc_dointvec_minmax,
> +   .extra1 = SYSCTL_ZERO,
> +   .extra2 = &int_3600,
> +   },
> { }
>  };
>

If we allow the sysctl to be 0, then we risk a flood of pr_emerg()
(one per jiffy ?)

If you really want the zero value, you need to change pr_emerg() to
pr_emerg_ratelimited()

Also, please base your patch on net-next, to avoid future merge conflicts
with my prior patch add2d7363107 "net: set initial device refcount to 1".
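The v2 that followed clamps the minimum to one second, which in the quoted
sysctl table amounts to swapping the lower bound; a sketch, assuming the
rest of the entry stays unchanged:

	{
		.procname	= "netdev_unregister_timeout_secs",
		.data		= &netdev_unregister_timeout_secs,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ONE,	/* was SYSCTL_ZERO */
		.extra2		= &int_3600,
	},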

