Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
2018-04-24 1:22 GMT+02:00 Michael S. Tsirkin : > On Mon, Apr 23, 2018 at 03:56:04PM +0200, Björn Töpel wrote: >> From: Björn Töpel >> >> This RFC introduces a new address family called AF_XDP that is >> optimized for high performance packet processing and, in upcoming >> patch sets, zero-copy semantics. In this v2 version, we have removed >> all zero-copy related code in order to make it smaller, simpler and >> hopefully more review friendly. This RFC only supports copy-mode for >> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX >> using the XDP_DRV path. Zero-copy support requires XDP and driver >> changes that Jesper Dangaard Brouer is working on. Some of his work >> has already been accepted. We will publish our zero-copy support for >> RX and TX on top of his patch sets at a later point in time. >> >> An AF_XDP socket (XSK) is created with the normal socket() >> syscall. Associated with each XSK are two queues: the RX queue and the >> TX queue. A socket can receive packets on the RX queue and it can send >> packets on the TX queue. These queues are registered and sized with >> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is >> mandatory to have at least one of these queues for each socket. In >> contrast to AF_PACKET V2/V3 these descriptor queues are separated from >> packet buffers. An RX or TX descriptor points to a data buffer in a >> memory area called a UMEM. RX and TX can share the same UMEM so that a >> packet does not have to be copied between RX and TX. Moreover, if a >> packet needs to be kept for a while due to a possible retransmit, the >> descriptor that points to that packet can be changed to point to >> another and reused right away. This again avoids copying data. >> >> This new dedicated packet buffer area is called a UMEM. It consists of a >> number of equally sized frames and each frame has a unique frame id. A >> descriptor in one of the queues references a frame by referencing its >> frame id. 
The user space allocates memory for this UMEM using whatever >> means it feels is most appropriate (malloc, mmap, huge pages, >> etc). This memory area is then registered with the kernel using the new >> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue >> and the COMPLETION queue. The fill queue is used by the application to >> send down frame ids for the kernel to fill in with RX packet >> data. References to these frames will then appear in the RX queue of >> the XSK once they have been received. The completion queue, on the >> other hand, contains frame ids that the kernel has transmitted >> completely and can now be used again by user space, for either TX or >> RX. Thus, the frame ids appearing in the completion queue are ids that >> were previously transmitted using the TX queue. In summary, the RX and >> FILL queues are used for the RX path and the TX and COMPLETION queues >> are used for the TX path. >> >> The socket is then finally bound with a bind() call to a device and a >> specific queue id on that device, and it is not until bind is >> completed that traffic starts to flow. Note that in this RFC, all >> packet data is copied out to user-space. >> >> A new feature in this RFC is that the UMEM can be shared between >> processes, if desired. If a process wants to do this, it simply skips >> the registration of the UMEM and its corresponding two queues, sets a >> flag in the bind call and submits the XSK of the process it would like >> to share UMEM with as well as its own newly created XSK socket. The >> new process will then receive frame id references in its own RX queue >> that point to this shared UMEM. Note that since the queue structures >> are single-consumer / single-producer (for performance reasons), the >> new process has to create its own socket with associated RX and TX >> queues, since it cannot share this with the other process. 
This is >> also the reason that there is only one set of FILL and COMPLETION >> queues per UMEM. It is the responsibility of a single process to >> handle the UMEM. If multiple-producer / multiple-consumer queues are >> implemented in the future, this requirement could be relaxed. >> >> How are packets then distributed between these two XSKs? We have >> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in >> full). The user-space application can place an XSK at an arbitrary >> place in this map. The XDP program can then redirect a packet to a >> specific index in this map and at this point XDP validates that the >> XSK in that map was indeed bound to that device and queue number. If >> not, the packet is dropped. If the map is empty at that index, the >> packet is also dropped. This also means that it is currently mandatory >> to have an XDP program loaded (and one XSK in the XSKMAP) to be able >> to get any traffic to user space through the XSK. >> >> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the >> driver does not have support for XDP, or XDP_SKB i
Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit
On Mon, Apr 23, 2018 at 2:19 PM, Yi-Hung Wei wrote: > On Mon, Apr 23, 2018 at 1:10 PM, Pravin Shelar wrote: >> On Mon, Apr 23, 2018 at 6:39 AM, David Miller wrote: >>> From: Yi-Hung Wei >>> Date: Tue, 17 Apr 2018 17:30:27 -0700 >>> Currently, nf_conntrack_max is used to limit the maximum number of conntrack entries in the conntrack table for every network namespace. The VMs and containers that reside in the same namespace share the same conntrack table, and the total # of conntrack entries for all the VMs and containers is limited by nf_conntrack_max. In this case, if one of the VMs/containers abuses the usage of the conntrack entries, it blocks the others from committing valid conntrack entries into the conntrack table. Even if we could put the VMs in different network namespaces, the current nf_conntrack_max configuration is kind of rigid in that we cannot limit different VMs/containers to have different # of conntrack entries. >> >> Hi >> This looks like a general problem related to nf zone usage limits. Did >> you consider changing nf-conntrack to have a per-zone limit, so that >> all users of nf-filter can use it? I prefer this to adding a wrapper >> in the OVS nf-filter layer. >> >> Thanks, >> Pravin. >> > > Hi Pravin, > > Thanks for your comment. Originally, I was thinking of adding this > feature to nf_conntrack and had some discussion with Florian. It > turns out that iptables and nft have their own ways of keeping track of > connection limits, and it sounds reasonable to share the backend > that counts the number of connections while each module enforces the > connection limit in its own way. Therefore, Florian helped to pull > out the common backend into nf_conncount in the following commit. The > nf_conncount backend can then be used by xtables, nft, and ovs. 
> > commit 625c556118f3c2fd28bb8ef6da18c53bd4037be4 > Author: Florian Westphal > Date: Sat Dec 9 21:01:08 2017 +0100 > > netfilter: connlimit: split xt_connlimit into front and backend > > This allows to reuse xt_connlimit infrastructure from nf_tables. > The upcoming nf_tables frontend can just pass in an nftables register > as input key, this allows limiting by any nft-supported key, including > concatenations. For xt_connlimit, pass in the zone and the ip/ipv6 address. > > > > Basically, to achieve a conntrack zone limit in OVS, we need the > following 3 parts: > 1. Count the number of connections (this is provided by netfilter's > nf_conncount backend) > 2. Keep track of the connection limits of zones, and check if a zone > exceeds its limit. > 3. An API for userspace to set/delete/get the conntrack zone limit. > > This patch series implements items 2 and 3, and it reuses > nf_conncount from netfilter for the first part. > OK. Thanks for the info.
[PATCHv2 net] team: fix netconsole setup over team
The same fix as in commit dbe173079ab5 ("bridge: fix netconsole setup over bridge") is also needed for the team driver. While at it, remove the unnecessary parameter *team from team_port_enable_netpoll().

v1->v2:
- fix it in a better way, as bridge does.

Fixes: 0fb52a27a04a ("team: cleanup netpoll clode")
Reported-by: João Avelino Bellomo Filho
Signed-off-by: Xin Long
---
 drivers/net/team/team.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index acbe849..ddb6bf8 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1072,14 +1072,11 @@ static void team_port_leave(struct team *team, struct team_port *port)
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
-static int team_port_enable_netpoll(struct team *team, struct team_port *port)
+static int __team_port_enable_netpoll(struct team_port *port)
 {
 	struct netpoll *np;
 	int err;
 
-	if (!team->dev->npinfo)
-		return 0;
-
 	np = kzalloc(sizeof(*np), GFP_KERNEL);
 	if (!np)
 		return -ENOMEM;
@@ -1093,6 +1090,14 @@ static int team_port_enable_netpoll(struct team *team, struct team_port *port)
 	return err;
 }
 
+static int team_port_enable_netpoll(struct team_port *port)
+{
+	if (!port->team->dev->npinfo)
+		return 0;
+
+	return __team_port_enable_netpoll(port);
+}
+
 static void team_port_disable_netpoll(struct team_port *port)
 {
 	struct netpoll *np = port->np;
@@ -1107,7 +1112,7 @@ static void team_port_disable_netpoll(struct team_port *port)
 	kfree(np);
 }
 #else
-static int team_port_enable_netpoll(struct team *team, struct team_port *port)
+static int team_port_enable_netpoll(struct team_port *port)
 {
 	return 0;
 }
@@ -1221,7 +1226,7 @@ static int team_port_add(struct team *team, struct net_device *port_dev,
 		goto err_vids_add;
 	}
 
-	err = team_port_enable_netpoll(team, port);
+	err = team_port_enable_netpoll(port);
 	if (err) {
 		netdev_err(dev, "Failed to enable netpoll on device %s\n",
 			   portname);
@@ -1918,7 +1923,7 @@ static int team_netpoll_setup(struct net_device *dev,
 	mutex_lock(&team->lock);
 	list_for_each_entry(port, &team->port_list, list) {
-		err = team_port_enable_netpoll(team, port);
+		err = __team_port_enable_netpoll(port);
 		if (err) {
 			__team_netpoll_cleanup(team);
 			break;
-- 
2.1.0
Re: [PATCH net-next v2 2/2] openvswitch: Support conntrack zone limit
On Tue, Apr 17, 2018 at 5:30 PM, Yi-Hung Wei wrote: > Currently, nf_conntrack_max is used to limit the maximum number of > conntrack entries in the conntrack table for every network namespace. > The VMs and containers that reside in the same namespace > share the same conntrack table, and the total # of conntrack entries > for all the VMs and containers is limited by nf_conntrack_max. In this > case, if one of the VMs/containers abuses the usage of the conntrack entries, > it blocks the others from committing valid conntrack entries into the > conntrack table. Even if we could put the VMs in different network > namespaces, the current nf_conntrack_max configuration is kind of rigid > in that we cannot limit different VMs/containers to have different # of > conntrack entries. > > To address the aforementioned issue, this patch proposes a > fine-grained mechanism that can further limit the # of conntrack entries > per zone. For example, we can designate a different zone to each VM, > and set a conntrack limit on each zone. By providing this isolation, a > misbehaved VM only consumes the conntrack entries in its own zone, and > it will not influence other well-behaved VMs. Moreover, users can > set various conntrack limits on different zones based on their preference. > > The proposed implementation utilizes netfilter's nf_conncount backend > to count the number of connections in a particular zone. If the number of > connections is above the configured limit, ovs will return ENOMEM to > userspace. If userspace does not configure the zone limit, the limit > defaults to zero, that is, no limitation, which is backward compatible > with the behavior without this patch. 
> > The following high level APIs are provided to the userspace: > - OVS_CT_LIMIT_CMD_SET: > * set default connection limit for all zones > * set the connection limit for a particular zone > - OVS_CT_LIMIT_CMD_DEL: > * remove the connection limit for a particular zone > - OVS_CT_LIMIT_CMD_GET: > * get the default connection limit for all zones > * get the connection limit for a particular zone > > Signed-off-by: Yi-Hung Wei > --- > net/openvswitch/Kconfig | 3 +- > net/openvswitch/conntrack.c | 498 > +++- > net/openvswitch/conntrack.h | 9 +- > net/openvswitch/datapath.c | 7 +- > net/openvswitch/datapath.h | 1 + > 5 files changed, 512 insertions(+), 6 deletions(-) > > diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig > index 2650205cdaf9..89da9512ec1e 100644 > --- a/net/openvswitch/Kconfig > +++ b/net/openvswitch/Kconfig > @@ -9,7 +9,8 @@ config OPENVSWITCH >(NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \ > (!NF_NAT || NF_NAT) && \ > (!NF_NAT_IPV4 || NF_NAT_IPV4) && \ > -(!NF_NAT_IPV6 || NF_NAT_IPV6))) > +(!NF_NAT_IPV6 || NF_NAT_IPV6) && \ > +(!NETFILTER_CONNCOUNT || > NETFILTER_CONNCOUNT))) > select LIBCRC32C > select MPLS > select NET_MPLS_GSO > diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c > index c5904f629091..d09b572f72b4 100644 > --- a/net/openvswitch/conntrack.c > +++ b/net/openvswitch/conntrack.c > @@ -17,7 +17,9 @@ > #include > #include > #include > +#include > #include > +#include > #include > #include > #include > @@ -76,6 +78,38 @@ struct ovs_conntrack_info { > #endif > }; > > +#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT) > +#define OVS_CT_LIMIT_UNLIMITED 0 > +#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED > +#define CT_LIMIT_HASH_BUCKETS 512 > + Can you use a static key when the limit is not set? This would avoid overhead in the datapath when these limits are not used. 
> +struct ovs_ct_limit { > + /* Elements in ovs_ct_limit_info->limits hash table */ > + struct hlist_node hlist_node; > + struct rcu_head rcu; > + u16 zone; > + u32 limit; > +}; > + ... > +#endif > + > /* Lookup connection and confirm if unconfirmed. */ > static int ovs_ct_commit(struct net *net, struct sw_flow_key *key, > const struct ovs_conntrack_info *info, > @@ -1054,6 +1176,13 @@ static int ovs_ct_commit(struct net *net, struct > sw_flow_key *key, > if (!ct) > return 0; > > +#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT) > + err = ovs_ct_check_limit(net, info, > +&ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple); > + if (err) > + return err; > +#endif > + This could be checked at flow install time, so that only permitted flows would have the 'ct commit' action; that way we can avoid the per-packet cost of checking the limit. The error code returned from ovs_ct_commit() is lost in the datapath, and it would be hard to debug packet loss in case the limit is
VRF: Ingress IPv6 Linklocal/Multicast destined pkt from slave VRF device does not map to Master device socket
VRF: Ingress IPv6 link-local/multicast packets from a slave VRF device do not map to the master device socket. KERNEL VERSION: 4.14.28 BUG REPORT: https://bugzilla.kernel.org/show_bug.cgi?id=199409 CONFIGURATION AND PROBLEM ROOT CAUSE: 1) Created a VRF device (vrf_258) and enslaved a network device (v1_F4246) to this VRF. /exos/bin # ip link show v1_F4246 54: v1_F4246: mtu 1500 qdisc noqueue master vrf_258 state UNKNOWN mode DEFAULT group default qlen 1000 link/ether 00:04:96:98:c9:18 brd ff:ff:ff:ff:ff:ff /exos/bin # ip link show vrf_258 14: vrf_258: mtu 65536 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 00:04:96:98:c9:18 brd ff:ff:ff:ff:ff:ff 2) Opened a PIM protocol raw socket for the AF_INET6 family: pim_socket = socket(AF_INET6, SOCK_RAW, IPPROTO_PIM) 3) The PIM user daemon runs one process per VRF, so the RX socket is bound with SO_BINDTODEVICE to the vrf_258 netdevice. PIM control packets ingressing any slave device belonging to this master VRF device should be delivered to this socket. 4) Ingressing PIM hello control packets with SrcIP = fe80::204:96ff:fe98:c918 (IPv6 link-local) and DestIP = ff02::0d (multicast) are not mapped to the vrf_258-bound socket and get dropped in the socket lookup function. 5) inet6_iif() returns v1_F4246's ifindex 54 and inet6_sdif() returns zero. __raw_v6_lookup(net, sk, nexthdr, daddr, saddr, inet6_iif(skb), inet6_sdif(skb)); sk->sk_bound_dev_if holds vrf_258's ifindex (14), but neither dif (54) nor sdif (0) matches, hence the socket is not found. struct sock *__raw_v6_lookup(struct net *net, struct sock *sk, unsigned short num, const struct in6_addr *loc_addr, const struct in6_addr *rmt_addr, int dif, int sdif) { .. if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif && sk->sk_bound_dev_if != sdif) .. } 6) This problem is seen in the raw, UDP and TCP socket lookup functions for IPv6 packets destined to link-local or multicast addresses. 
7) This issue does not occur for any type of IPv4 address, nor for IPv6 unicast global addresses. TEMP FIX: = Get the master device ifindex from skb->dev and pass it to the socket lookup function for IPv6 link-local/multicast addresses. ipv6_raw_deliver() { int mdif; .. .. mdif = (((nexthdr == IPPROTO_PIM || nexthdr == 89 /* IPPROTO_OSPF */ || nexthdr == IPPROTO_ICMPV6 || nexthdr == 112 /*IPPROTO_VRRP*/) && (ipv6_addr_type(daddr) & (IPV6_ADDR_MULTICAST | IPV6_ADDR_LINKLOCAL))) ? l3mdev_master_ifindex_rcu(skb->dev) : inet6_iif(skb)); sk = __raw_v6_lookup(net, sk, nexthdr, daddr, saddr, mdif, inet6_sdif(skb)); ... .. } Regards, Sukumar
Re: [PATCH net] sfc: ARFS filter IDs
Hi Edward, I love your patch! Perhaps something to improve: [auto build test WARNING on net/master] url: https://github.com/0day-ci/linux/commits/Edward-Cree/sfc-ARFS-filter-IDs/20180424-080737 config: i386-allmodconfig (attached as .config) compiler: gcc-7 (Debian 7.3.0-16) 7.3.0 reproduce: # save the attached .config to linux build tree make ARCH=i386 Note: it may well be a FALSE warning. FWIW you are at least aware of it now. http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings All warnings (new ones prefixed by >>): drivers/net/ethernet/sfc/farch.c: In function 'efx_farch_filter_rfs_expire_one': >> drivers/net/ethernet/sfc/farch.c:2938:7: warning: 'rule' may be used >> uninitialized in this function [-Wmaybe-uninitialized] if (rule) ^ coccinelle warnings: (new ones prefixed by >>) >> drivers/net/ethernet/sfc/efx.c:3032:1-20: alloc with no test, possible model >> on line 3041 drivers/net/ethernet/sfc/efx.c:3032:1-20: alloc with no test, possible model on line 3062 vim +/rule +2938 drivers/net/ethernet/sfc/farch.c 2902 2903 bool efx_farch_filter_rfs_expire_one(struct efx_nic *efx, u32 flow_id, 2904 unsigned int index) 2905 { 2906 struct efx_farch_filter_state *state = efx->filter_state; 2907 struct efx_farch_filter_table *table; 2908 bool ret = false, force = false; 2909 u16 arfs_id; 2910 2911 down_write(&state->lock); 2912 spin_lock_bh(&efx->rps_hash_lock); 2913 table = &state->table[EFX_FARCH_FILTER_TABLE_RX_IP]; 2914 if (test_bit(index, table->used_bitmap) && 2915 table->spec[index].priority == EFX_FILTER_PRI_HINT) { 2916 struct efx_filter_spec spec; 2917 struct efx_arfs_rule *rule; 2918 2919 efx_farch_filter_to_gen_spec(&spec, &table->spec[index]); 2920 if (!efx->rps_hash_table) { 2921 /* In the absence of the table, we always returned 0 to 2922 * ARFS, so use the same to query it. 
2923 */ 2924 arfs_id = 0; 2925 } else { 2926 rule = efx_rps_hash_find(efx, &spec); 2927 if (!rule) { 2928 /* ARFS table doesn't know of this filter, remove it */ 2929 force = true; 2930 } else { 2931 arfs_id = rule->arfs_id; 2932 if (!efx_rps_check_rule(rule, index, &force)) 2933 goto out_unlock; 2934 } 2935 } 2936 if (force || rps_may_expire_flow(efx->net_dev, spec.dmaq_id, 2937 flow_id, arfs_id)) { > 2938 if (rule) 2939 rule->filter_id = EFX_ARFS_FILTER_ID_REMOVING; 2940 efx_rps_hash_del(efx, &spec); 2941 efx_farch_filter_table_clear_entry(efx, table, index); 2942 ret = true; 2943 } 2944 } 2945 out_unlock: 2946 spin_unlock_bh(&efx->rps_hash_lock); 2947 up_write(&state->lock); 2948 return ret; 2949 } 2950 --- 0-DAY kernel test infrastructure / Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
Re: [Patch nf] ipvs: initialize tbl->entries in ip_vs_lblc_init_svc()
Hello, On Mon, 23 Apr 2018, Cong Wang wrote: > Similarly, tbl->entries is not initialized after kmalloc(), > therefore causes an uninit-value warning in ip_vs_lblc_check_expire(), > as reported by syzbot. > > Reported-by: > Cc: Simon Horman > Cc: Julian Anastasov > Cc: Pablo Neira Ayuso > Signed-off-by: Cong Wang Thanks! Acked-by: Julian Anastasov > --- > net/netfilter/ipvs/ip_vs_lblc.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c > index 3057e453bf31..83918119ceb8 100644 > --- a/net/netfilter/ipvs/ip_vs_lblc.c > +++ b/net/netfilter/ipvs/ip_vs_lblc.c > @@ -371,6 +371,7 @@ static int ip_vs_lblc_init_svc(struct ip_vs_service *svc) > tbl->counter = 1; > tbl->dead = false; > tbl->svc = svc; > + atomic_set(&tbl->entries, 0); > > /* >*Hook periodic timer for garbage collection > -- > 2.13.0 Regards -- Julian Anastasov
Re: [Patch nf] ipvs: initialize tbl->entries after allocation
Hello, On Mon, 23 Apr 2018, Cong Wang wrote: > tbl->entries is not initialized after kmalloc(), therefore > causes an uninit-value warning in ip_vs_lblc_check_expire() > as reported by syzbot. > > Reported-by: > Cc: Simon Horman > Cc: Julian Anastasov > Cc: Pablo Neira Ayuso > Signed-off-by: Cong Wang Thanks! Acked-by: Julian Anastasov > --- > net/netfilter/ipvs/ip_vs_lblcr.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c > b/net/netfilter/ipvs/ip_vs_lblcr.c > index 92adc04557ed..bc2bc5eebcb8 100644 > --- a/net/netfilter/ipvs/ip_vs_lblcr.c > +++ b/net/netfilter/ipvs/ip_vs_lblcr.c > @@ -534,6 +534,7 @@ static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc) > tbl->counter = 1; > tbl->dead = false; > tbl->svc = svc; > + atomic_set(&tbl->entries, 0); > > /* >*Hook periodic timer for garbage collection > -- > 2.13.0 Regards -- Julian Anastasov
Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
On Tue, 24 Apr 2018 04:42:22 +0300 "Michael S. Tsirkin" wrote: > On Mon, Apr 23, 2018 at 06:25:03PM -0700, Stephen Hemminger wrote: > > On Mon, 23 Apr 2018 12:44:39 -0700 > > Siwei Liu wrote: > > > > > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin > > > wrote: > > > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote: > > > >> On Mon, 23 Apr 2018 20:24:56 +0300 > > > >> "Michael S. Tsirkin" wrote: > > > >> > > > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote: > > > >> > > > > >> > > > > > > > >> > > > >I will NAK patches to change to common code for netvsc > > > >> > > > >especially the > > > >> > > > >three device model. MS worked hard with distro vendors to > > > >> > > > >support transparent > > > >> > > > >mode, and we really can't have a new model or do backports. > > > >> > > > > > > > >> > > > >Plus, DPDK is now dependent on the existing model. > > > >> > > > > > > >> > > > Sorry, but nobody here cares about dpdk or other similar > > > >> > > > oddities. > > > >> > > > > > >> > > The network device model is a userspace API, and DPDK is a > > > >> > > userspace application. > > > >> > > > > >> > It is userspace but are you sure dpdk is actually poking at netdevs? > > > >> > AFAIK it's normally banging device registers directly. > > > >> > > > > >> > > You can't go breaking userspace even if you don't like the > > > >> > > application. > > > >> > > > > >> > Could you please explain how is the proposed patchset breaking > > > >> > userspace? Ignoring DPDK for now, I don't think it changes the > > > >> > userspace > > > >> > API at all. > > > >> > > > > >> > > > >> The DPDK has a device driver vdev_netvsc which scans the Linux network > > > >> devices > > > >> to look for a Linux netvsc device and the paired VF device, and sets up the > > > >> DPDK environment. 
This setup creates a DPDK failsafe (bondingish) > > > >> instance > > > >> and sets up TAP support over the Linux netvsc device as well as the > > > >> Mellanox > > > >> VF device. > > > >> > > > >> So it depends on existing 2 device model. You can't go to a 3 device > > > >> model > > > >> or start hiding devices from userspace. > > > > > > > > Okay so how does the existing patch break that? IIUC does not go to > > > > a 3 device model since netvsc calls failover_register directly. > > > > > > > >> Also, I am working on associating netvsc and VF device based on serial > > > >> number > > > >> rather than MAC address. The serial number is how Windows works now, > > > >> and it makes > > > >> sense for Linux and Windows to use the same mechanism if possible. > > > > > > > > Maybe we should support same for virtio ... > > > > Which serial do you mean? From vpd? > > > > > > > > I guess you will want to keep supporting MAC for old hypervisors? > > > > The serial number has always been in the hypervisor since original support > > of SR-IOV > > in WS2008. So no backward compatibility special cases would be needed. > > Is that a serial from real hardware or a hypervisor thing? > > It is a hypervisor thing in the PCI hyperv code and the hyperv Netvsc interface. It might also be in the PCI spec, but the value in Hyper-V is being generated by the host.
Re: [PATCH net v2] net: ethtool: Add missing kernel doc for FEC parameters
On Mon, Apr 23, 2018 at 3:51 PM, Florian Fainelli wrote: > While adding support for ethtool::get_fecparam and set_fecparam, kernel > doc for these functions was missed, add those. > > Fixes: 1a5f3da20bd9 ("net: ethtool: add support for forward error correction > modes") > Signed-off-by: Florian Fainelli Acked-by: Roopa Prabhu Thanks Florian.
Re: [PATCH net-next] net: init sk_cookie for inet socket
On Tue, Apr 24, 2018 at 12:09 AM, Eric Dumazet wrote: > > > On 04/23/2018 08:58 AM, David Miller wrote: >> From: Yafang Shao >> Date: Sun, 22 Apr 2018 21:50:04 +0800 >> >>> With sk_cookie we can identify a socket, which is very helpful for >>> tracing and statistics, e.g. the tcp tracepoint and ebpf. >>> So we'd better init it by default for inet sockets. >>> When using it, we just need to call atomic64_read(&sk->sk_cookie). >>> >>> Signed-off-by: Yafang Shao >> >> Applied, thank you. >> > > This is adding yet another atomic_inc on a global cache line. > That's a trade-off. > Most applications do not need the cookie being ever set. > > The existing mechanism was fine. Set it on demand. There are some drawbacks in the existing mechanism. - We have to set the net->cookie_gen and then sk->sk_cookie when we want to get the sk_cookie; that's a little expensive as well. After this change, sock_gen_cookie() could be replaced by atomic64_read(&sk->sk_cookie) in most places. - If the application wants to get the sk_cookie, it must set it first. What if the application doesn't have the permission to write? Furthermore, maybe it is a security concern? Thanks Yafang
Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
On 04/23/2018 07:04 PM, Andy Lutomirski wrote: > On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet wrote: >> Hi Andy >> >> On 04/23/2018 02:14 PM, Andy Lutomirski wrote: > >>> I would suggest that you rework the interface a bit. First a user would >>> call mmap() on a TCP socket, which would create an empty VMA. (It would >>> set vm_ops to point to tcp_vm_ops or similar so that the TCP code could >>> recognize it, but it would have no effect whatsoever on the TCP state >>> machine. Reading the VMA would get SIGBUS.) Then a user would call a new >>> ioctl() or setsockopt() function and pass something like: >> >> >>> >>> struct tcp_zerocopy_receive { >>> void *address; >>> size_t length; >>> }; >>> >>> The kernel would verify that [address, address+length) is entirely inside a >>> single TCP VMA and then would do the vm_insert_range magic. >> >> I have no idea what is the proper API for that. >> Where the TCP VMA(s) would be stored ? >> In TCP socket, or MM layer ? > > MM layer. I haven't tested this at all, and the error handling is > totally wrong, but I think you'd do something like: > > len = get_user(...); > > down_read(¤t->mm->mmap_sem); > > vma = find_vma(mm, start); > if (!vma || vma->vm_start > start) > return -EFAULT; > > /* This is buggy. You also need to check that the file is a socket. > This is probably trivial. */ > if (vma->vm_file->private_data != sock) > return -EINVAL; > > if (len > vma->vm_end - start) > return -EFAULT; /* too big a request. */ > > and now you'd do the vm_insert_page() dance, except that you don't > have to abort the whole procedure if you discover that something isn't > aligned right. Instead you'd just stop and tell the caller that you > didn't map the full requested size. You might also need to add some > code to charge the caller for the pages that get pinned, but that's an > orthogonal issue. > > You also need to provide some way for user programs to signal that > they're done with the page in question. 
MADV_DONTNEED might be > sufficient. > > In the mmap() helper, you might want to restrict the mapped size to > something reasonable. And it might be nice to hook mremap() to > prevent user code from causing too much trouble. > > With my x86-writer-of-TLB-code hat on, I expect the major performance > costs to be the generic costs of mmap() and munmap() (which only > happen once per socket instead of once per read if you like my idea), > the cost of a TLB miss when the data gets read (really not so bad on > modern hardware), and the cost of the TLB invalidation when user code > is done with the buffers. The latter is awful, especially in > multithreaded programs. In fact, it's so bad that it might be worth > mentioning in the documentation for this code that it just shouldn't > be used in multithreaded processes. (Also, on non-PCID hardware, > there's an annoying situation in which a recently-migrated thread that > removes a mapping sends an IPI to the CPU that the thread used to be > on. I thought I had a clever idea to get rid of that IPI once, but it > turned out to be wrong.) > > Architectures like ARM that have superior TLB handling primitives will > not be hurt as badly if this is used by a multithreaded program. > >> >> >> And I am not sure why the error handling would be better (point 4), unless >> we can return smaller @length than requested maybe ? > > Exactly. If I request 10MB mapped and only the first 9MB are aligned > right, I still want the first 9 MB. > >> >> Also how would the VMA space be accounted (point 3) when creating an empty >> VMA (no pages in there yet)? > > There's nothing to account. It's the same as mapping /dev/null or > similar -- the mm core should take care of it for you. > Thanks Andy, I am working on all this, and the initial patch looks sane enough. include/uapi/linux/tcp.h |7 + net/ipv4/tcp.c | 175 +++ 2 files changed, 93 insertions(+), 89 deletions(-) I will test all this before sending for review asap. 
( I have not done the compat code yet, this can be done later I guess)
Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote: > Some bugs (such as buffer overflows) are better detected > with kmalloc code, so we must test the kmalloc path too. Well now, this brings up another item for the collective TODO list -- implement redzone checks for vmalloc. Unless this is something already taken care of by kasan or similar.
Re: [PATCH 2/2] alx: add disable_wol parameter
Hi, May I know the final decision on this patch? Thanks. Best regards, AceLan Kao. 2018-04-10 10:40 GMT+08:00 AceLan Kao : > The problem is I don't have a machine with that wakeup issue, and I > need the WoL feature. > Instead of spreading an "alx with WoL" dkms package everywhere, I would > like to see it supported in the driver and disabled by default. > > Moreover, the wakeup issue may come from old Atheros chips, or result > from a buggy BIOS. > With WoL removed from the driver, no one will report > issues about it, and we have no chance to find a fix for it. > > Adding this feature back does not paper over the issue; it > gives people a chance to examine this feature. > > 2018-04-09 22:50 GMT+08:00 David Miller : >> From: Andrew Lunn >> Date: Mon, 9 Apr 2018 14:39:10 +0200 >> >>> On Mon, Apr 09, 2018 at 07:35:14PM +0800, AceLan Kao wrote: The WoL feature was reported broken and leads to the system resuming immediately after suspending. This symptom does not happen on every system, so add a disable_wol option and disable WoL by default to prevent the issue from happening again. >>> const char alx_drv_name[] = "alx"; +/* disable WoL by default */ +bool disable_wol = 1; +module_param(disable_wol, bool, 0); +MODULE_PARM_DESC(disable_wol, "Disable Wake on Lan feature"); + >>> >>> Hi AceLan >>> >>> This seems like you are papering over the cracks. And module >>> parameters are not liked. >>> >>> Please try to find the real problem. >> >> Agreed.
Re: [PATCH net-next] net: fib_rules: fix l3mdev netlink attr processing
On 4/23/18 9:21 PM, David Miller wrote: > From: Roopa Prabhu > Date: Mon, 23 Apr 2018 20:08:41 -0700 > >> From: Roopa Prabhu >> >> Fixes: b16fb418b1bf ("net: fib_rules: add extack support") >> Signed-off-by: Roopa Prabhu > > Applied. > > It would be nice to get rid of these if() conditionals dangling > around ifdef blocks. They are quite error prone. > I'll send a patch. I'd prefer a different message when NET_L3_MASTER_DEV is not enabled.
Re: [PATCH 1/1] Revert "rds: ib: add error handle"
On 4/23/18 6:39 PM, Zhu Yanjun wrote: This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc. After a long discussion and investigation, it seems that there is no memory leak. So this patch is reverted. Signed-off-by: Zhu Yanjun --- Well, your fix was not for any leaks, but just proper labels for graceful exits. I don't know which long discussion you are referring to, but there is no need to revert this change unless you see an issue with it. Regards, Santosh
Re: [PATCH net-next] net: fib_rules: fix l3mdev netlink attr processing
From: Roopa Prabhu Date: Mon, 23 Apr 2018 20:08:41 -0700 > From: Roopa Prabhu > > Fixes: b16fb418b1bf ("net: fib_rules: add extack support") > Signed-off-by: Roopa Prabhu Applied. It would be nice to get rid of these if() conditionals dangling around ifdef blocks. They are quite error prone.
Re: [PATCH net-next v2 0/2] fib rules extack support
On Mon, Apr 23, 2018 at 7:21 AM, David Miller wrote: > From: Roopa Prabhu > Date: Sat, 21 Apr 2018 09:41:29 -0700 > >> From: Roopa Prabhu >> >> First patch refactors code to move fib rule netlink handling >> into a common function. This became obvious when adding >> duplicate extack msgs in add and del paths. Second patch >> adds extack msgs. >> >> v2 - Dropped the ip route get support and selftests from >> the series to look at the input path some more (as pointed >> out by ido). Will come back to that next week when i have >> some time. resending just the extack part for now. > > Series applied, but I was really looking forward to having those > nice test cases in the tree. yes, am on it. (and also excited about my first selftest!) thanks (I just sent a follow-up bug fix for vrf).
[PATCH] mac80211_hwsim: fix a possible memory leak in hwsim_new_radio_nl()
'hwname' should be freed before leaving from the error handling cases; otherwise it will cause a memory leak.

Fixes: cb1a5bae5684 ("mac80211_hwsim: add permanent mac address option for new radios") Signed-off-by: YueHaibing --- drivers/net/wireless/mac80211_hwsim.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/wireless/mac80211_hwsim.c b/drivers/net/wireless/mac80211_hwsim.c index 96d26cf..4a017a0 100644 --- a/drivers/net/wireless/mac80211_hwsim.c +++ b/drivers/net/wireless/mac80211_hwsim.c @@ -3236,6 +3236,7 @@ static int hwsim_new_radio_nl(struct sk_buff *msg, struct genl_info *info) GENL_SET_ERR_MSG(info, "MAC is no valid source addr"); NL_SET_BAD_ATTR(info->extack, info->attrs[HWSIM_ATTR_PERM_ADDR]); + kfree(hwname); return -EINVAL; } -- 2.7.0
[PATCH net-next] net: fib_rules: fix l3mdev netlink attr processing
From: Roopa Prabhu Fixes: b16fb418b1bf ("net: fib_rules: add extack support") Signed-off-by: Roopa Prabhu --- Looks like I broke this when I split extack changes into a separate patch :( net/core/fib_rules.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index ebd9351..2271c80 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -541,8 +541,10 @@ static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh, nlrule->l3mdev = nla_get_u8(tb[FRA_L3MDEV]); if (nlrule->l3mdev != 1) #endif + { NL_SET_ERR_MSG(extack, "Invalid l3mdev"); goto errout_free; + } } nlrule->action = frh->action; -- 2.1.4
[PATCH v2 7/8] ipconfig: Create /proc/net/ipconfig directory
To allow ipconfig to report IP configuration details to user space processes without cluttering /proc/net, create a new subdirectory /proc/net/ipconfig. All files containing IP configuration details should be written to this directory. Signed-off-by: Chris Novakovic --- net/ipv4/ipconfig.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c index e11dfd29a929..9abf833f3a99 100644 --- a/net/ipv4/ipconfig.c +++ b/net/ipv4/ipconfig.c @@ -158,6 +158,9 @@ static u8 ic_domain[64];/* DNS (not NIS) domain name */ * Private state. */ +/* proc_dir_entry for /proc/net/ipconfig */ +static struct proc_dir_entry *ipconfig_dir; + /* Name of user-selected boot device */ static char user_dev_name[IFNAMSIZ] __initdata = { 0, }; @@ -1301,6 +1304,16 @@ static const struct file_operations pnp_seq_fops = { .llseek = seq_lseek, .release= single_release, }; + +/* Create the /proc/net/ipconfig directory */ +static int ipconfig_proc_net_init(void) +{ + ipconfig_dir = proc_net_mkdir(&init_net, "ipconfig", init_net.proc_net); + if (!ipconfig_dir) + return -ENOMEM; + + return 0; +} #endif /* CONFIG_PROC_FS */ /* @@ -1384,6 +1397,8 @@ static int __init ip_auto_config(void) #ifdef CONFIG_PROC_FS proc_create("pnp", 0444, init_net.proc_net, &pnp_seq_fops); + + ipconfig_proc_net_init(); #endif /* CONFIG_PROC_FS */ if (!ic_enable) -- 2.14.1
[PATCH v2 6/8] ipconfig: Correctly initialise ic_nameservers
ic_nameservers, which stores the list of name servers discovered by ipconfig, is initialised (i.e. has all of its elements set to NONE, or 0xffffffff) by ic_nameservers_predef() in the following scenarios:

- before the "ip=" and "nfsaddrs=" kernel command line parameters are parsed (in ip_auto_config_setup());
- before autoconfiguring via DHCP or BOOTP (in ic_bootp_init()), in order to clear any values that may have been set after parsing "ip=" or "nfsaddrs=" and are no longer needed.

This means that ic_nameservers_predef() is not called when neither "ip=" nor "nfsaddrs=" is specified on the kernel command line. In this scenario, every element in ic_nameservers remains set to 0x00000000, which is indistinguishable from ANY and causes pnp_seq_show() to write the following (bogus) information to /proc/net/pnp:

#MANUAL
nameserver 0.0.0.0
nameserver 0.0.0.0
nameserver 0.0.0.0

This is potentially problematic for systems that blindly link /etc/resolv.conf to /proc/net/pnp. Ensure that ic_nameservers is also initialised when neither "ip=" nor "nfsaddrs=" is specified by calling ic_nameservers_predef() in ip_auto_config(), but only when ip_auto_config_setup() was not called earlier.
This causes the following to be written to /proc/net/pnp, and is consistent with what gets written when ipconfig is configured manually but no name servers are specified on the kernel command line: #MANUAL Signed-off-by: Chris Novakovic --- net/ipv4/ipconfig.c | 13 + 1 file changed, 13 insertions(+) diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c index 0f460d6d3cce..e11dfd29a929 100644 --- a/net/ipv4/ipconfig.c +++ b/net/ipv4/ipconfig.c @@ -750,6 +750,11 @@ static void __init ic_bootp_init_ext(u8 *e) */ static inline void __init ic_bootp_init(void) { + /* Re-initialise all name servers to NONE, in case any were set via the +* "ip=" or "nfsaddrs=" kernel command line parameters: any IP addresses +* specified there will already have been decoded but are no longer +* needed +*/ ic_nameservers_predef(); dev_add_pack(&bootp_packet_type); @@ -1370,6 +1375,13 @@ static int __init ip_auto_config(void) int err; unsigned int i; + /* Initialise all name servers to NONE (but only if the "ip=" or +* "nfsaddrs=" kernel command line parameters weren't decoded, otherwise +* we'll overwrite the IP addresses specified there) +*/ + if (ic_set_manually == 0) + ic_nameservers_predef(); + #ifdef CONFIG_PROC_FS proc_create("pnp", 0444, init_net.proc_net, &pnp_seq_fops); #endif /* CONFIG_PROC_FS */ @@ -1593,6 +1605,7 @@ static int __init ip_auto_config_setup(char *addrs) return 1; } + /* Initialise all name servers to NONE */ ic_nameservers_predef(); /* Parse string for static IP assignment. */ -- 2.14.1
[PATCH v2 4/8] ipconfig: BOOTP: Request CONF_NAMESERVERS_MAX name servers
When ipconfig is autoconfigured via BOOTP, the request packet initialised by ic_bootp_init_ext() always allocates 8 bytes for the name server option, limiting the BOOTP server to responding with at most 2 name servers even though ipconfig in fact supports an arbitrary number of name servers (as defined by CONF_NAMESERVERS_MAX, which is currently 3). Only request name servers in the request packet if CONF_NAMESERVERS_MAX is positive (to comply with [1, §3.8]), and allocate enough space in the packet for CONF_NAMESERVERS_MAX name servers to indicate the maximum number we can accept in response. [1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions": https://tools.ietf.org/rfc/rfc2132.txt Signed-off-by: Chris Novakovic --- net/ipv4/ipconfig.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c index bcf3c4f9882d..0f460d6d3cce 100644 --- a/net/ipv4/ipconfig.c +++ b/net/ipv4/ipconfig.c @@ -721,9 +721,11 @@ static void __init ic_bootp_init_ext(u8 *e) *e++ = 3; /* Default gateway request */ *e++ = 4; e += 4; +#if CONF_NAMESERVERS_MAX > 0 *e++ = 6; /* (DNS) name server request */ - *e++ = 8; - e += 8; + *e++ = 4 * CONF_NAMESERVERS_MAX; + e += 4 * CONF_NAMESERVERS_MAX; +#endif *e++ = 12; /* Host name request */ *e++ = 32; e += 32; -- 2.14.1
[PATCH v2 8/8] ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers
Distributed filesystems are most effective when the server and client clocks are synchronised. Embedded devices often use NFS for their root filesystem but typically do not contain an RTC, so the clocks of the NFS server and the embedded device will be out-of-sync when the root filesystem is mounted (and may not be synchronised until late in the boot process). Extend ipconfig with the ability to export IP addresses of NTP servers it discovers to /proc/net/ipconfig/ntp_servers. They can be supplied as follows:

- If ipconfig is configured manually via the "ip=" or "nfsaddrs=" kernel command line parameters, one NTP server can be specified in the new "<ntp0-ip>" parameter.
- If ipconfig is autoconfigured via DHCP, request DHCP option 42 in the DHCPDISCOVER message, and record the IP addresses of up to three NTP servers sent by the responding DHCP server in the subsequent DHCPOFFER message.

ipconfig will only write the NTP server IP addresses it discovers to /proc/net/ipconfig/ntp_servers, one per line (in the order received from the DHCP server, if DHCP autoconfiguration is used); making use of these NTP servers is the responsibility of a user space process (e.g. an initrd/initramfs script that invokes an NTP client before mounting an NFS root filesystem). 
Signed-off-by: Chris Novakovic --- Documentation/filesystems/nfs/nfsroot.txt | 35 +++-- net/ipv4/ipconfig.c | 118 +++--- 2 files changed, 136 insertions(+), 17 deletions(-) diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt index a1030bea60d3..d2963123eb1c 100644 --- a/Documentation/filesystems/nfs/nfsroot.txt +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -5,6 +5,7 @@ Written 1996 by Gero Kuhlmann Updated 1997 by Martin Mares Updated 2006 by Nico Schottelius Updated 2006 by Horms +Updated 2018 by Chris Novakovic @@ -79,7 +80,7 @@ nfsroot=[:][,] ip=::: - : + :: This parameter tells the kernel how to configure IP addresses of devices and also how to set up the IP routing table. It was originally called @@ -178,9 +179,18 @@ ip=::: IP address of secondary nameserver. See . - After configuration (whether manual or automatic) is complete, a file is - created at /proc/net/pnp in the following format; lines are omitted if - their respective value is empty following configuration. + IP address of a Network Time Protocol (NTP) server. + Value is exported to /proc/net/ipconfig/ntp_servers, but is + otherwise unused (see below). + + Default: None if not using autoconfiguration; determined + automatically if using autoconfiguration. + + After configuration (whether manual or automatic) is complete, two files + are created in the following format; lines are omitted if their respective + value is empty following configuration: + + - /proc/net/pnp: #PROTO: (depending on configuration method) domain (if autoconfigured, the DNS domain) @@ -189,13 +199,26 @@ ip=::: nameserver (tertiary name server IP) bootserver (NFS server IP) - and are requested during autoconfiguration; they - cannot be specified as part of the "ip=" kernel command line parameter. 
+ - /proc/net/ipconfig/ntp_servers: + + (NTP server IP) + (NTP server IP) + (NTP server IP) + + and (in /proc/net/pnp) and and + (in /proc/net/ipconfig/ntp_servers) are requested during autoconfiguration; + they cannot be specified as part of the "ip=" kernel command line parameter. Because the "domain" and "nameserver" options are recognised by DNS resolvers, /etc/resolv.conf is often linked to /proc/net/pnp on systems that use an NFS root filesystem. + Note that the kernel will not synchronise the system time with any NTP + servers it discovers; this is the responsibility of a user space process + (e.g. an initrd/initramfs script that passes the IP addresses listed in + /proc/net/ipconfig/ntp_servers to an NTP client before mounting the real + root filesystem if it is on NFS). + nfsrootdebug diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c index 9abf833f3a99..d839d74853fc 100644 --- a/net/ipv4/ipconfig.c +++ b/net/ipv4/ipconfig.c @@ -28,6 +28,9 @@ * * Multiple Nameservers in /proc/net/pnp * -- Josef Siemes , Aug 2002 + * + * NTP servers in /proc/net/ipconfig/ntp_servers + * -- Chris Novakovic , April 2018 */ #include @@ -93,6 +96,7 @@ #define CONF_TIMEOUT_MAX (HZ*30) /* Maximum allowed timeout */ #define CONF_NAMESERVERS_MAX 3 /* Maximum number of nameservers - '3' from resolv.h */ +#define CONF_NTP_SERVERS_MAX 3 /*
[PATCH v2 0/8] ipconfig: NTP server support, bug fixes, documentation improvements
This series (against net-next) makes various improvements to ipconfig:

- Patch #1 correctly documents the behaviour of parameter 4 in the "ip=" and "nfsaddrs=" command line parameters.
- Patch #2 tidies up the printk()s for reporting configured name servers.
- Patch #3 fixes a bug in autoconfiguration via BOOTP whereby the IP addresses of IEN-116 name servers are requested from the BOOTP server, rather than those of DNS name servers.
- Patch #4 requests the number of DNS servers specified by CONF_NAMESERVERS_MAX when autoconfiguring via BOOTP, rather than hardcoding it to 2.
- Patch #5 fully documents the contents and format of /proc/net/pnp in Documentation/filesystems/nfs/nfsroot.txt.
- Patch #6 fixes a bug whereby bogus information is written to /proc/net/pnp when ipconfig is not used.
- Patch #7 creates a new procfs directory for ipconfig-related configuration reports at /proc/net/ipconfig.
- Patch #8 allows for NTP servers to be configured (manually on the kernel command line or automatically via DHCP), enabling systems with an NFS root filesystem to synchronise their clock before mounting their root filesystem. NTP server IP addresses are written to /proc/net/ipconfig/ntp_servers.

Changes from v1:

- David requested that a new directory /proc/net/ipconfig be created to contain ipconfig-related configuration reports, which is implemented in the new patch #7. NTP server IPs are now written to this directory instead of /proc/net/ntp in the new patch #8.
- Cong and David both requested that the modification to CREDITS be dropped. This patch has been removed from the series. 
Chris Novakovic (8): ipconfig: Document setting of NIS domain name ipconfig: Tidy up reporting of name servers ipconfig: BOOTP: Don't request IEN-116 name servers ipconfig: BOOTP: Request CONF_NAMESERVERS_MAX name servers ipconfig: Document /proc/net/pnp ipconfig: Correctly initialise ic_nameservers ipconfig: Create /proc/net/ipconfig directory ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers Documentation/filesystems/nfs/nfsroot.txt | 70 -- net/ipv4/ipconfig.c | 151 +++--- 2 files changed, 200 insertions(+), 21 deletions(-) -- 2.14.1
[PATCH v2 2/8] ipconfig: Tidy up reporting of name servers
Commit 5e953778a2aab04929a5e7b69f53dc26e39b079e ("ipconfig: add nameserver IPs to kernel-parameter ip=") adds the IP addresses of discovered name servers to the summary printed by ipconfig when configuration is complete. It appears the intention in ip_auto_config() was to print the name servers on a new line (especially given the spacing and lack of comma before "nameserver0="), but they're actually printed on the same line as the NFS root filesystem configuration summary:

[0.686186] IP-Config: Complete:
[0.686226] device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1
[0.686328] host=test, domain=example.com, nis-domain=(none)
[0.686386] bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath= nameserver0=10.0.0.1

This makes it harder to read and parse ipconfig's output. Instead, print the name servers on a separate line:

[0.791250] IP-Config: Complete:
[0.791289] device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1
[0.791407] host=test, domain=example.com, nis-domain=(none)
[0.791475] bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=
[0.791476] nameserver0=10.0.0.1

Signed-off-by: Chris Novakovic
---
 net/ipv4/ipconfig.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 43f620feb1c4..d0ea0ecc9008 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -1481,16 +1481,19 @@ static int __init ip_auto_config(void)
 		      &ic_servaddr, &root_server_addr, root_server_path);
 	if (ic_dev_mtu)
 		pr_cont(", mtu=%d", ic_dev_mtu);
-	for (i = 0; i < CONF_NAMESERVERS_MAX; i++)
+	/* Name servers (if any): */
+	for (i = 0; i < CONF_NAMESERVERS_MAX; i++) {
 		if (ic_nameservers[i] != NONE) {
-			pr_cont(" nameserver%u=%pI4",
-				i, &ic_nameservers[i]);
-			break;
+			if (i == 0)
+				pr_info("     nameserver%u=%pI4",
+					i, &ic_nameservers[i]);
+			else
+				pr_cont(", nameserver%u=%pI4",
+					i, &ic_nameservers[i]);
 		}
-	for (i++; i < CONF_NAMESERVERS_MAX; i++)
-		if (ic_nameservers[i] != NONE)
-			pr_cont(", nameserver%u=%pI4", i, &ic_nameservers[i]);
-	pr_cont("\n");
+		if (i + 1 == CONF_NAMESERVERS_MAX)
+			pr_cont("\n");
+	}
 #endif /* !SILENT */

 	/*
-- 
2.14.1
[PATCH v2 1/8] ipconfig: Document setting of NIS domain name
ic_do_bootp_ext() is responsible for parsing the "ip=" and "nfsaddrs=" kernel parameters. If a "." character is found in parameter 4 (the client's hostname), everything before the first "." is used as the hostname, and everything after it is used as the NIS domain name (but not necessarily the DNS domain name). Document this behaviour in Documentation/filesystems/nfs/nfsroot.txt, as it is not made explicit. Signed-off-by: Chris Novakovic --- Documentation/filesystems/nfs/nfsroot.txt | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt index 5efae00f6c7f..1513e5d663fd 100644 --- a/Documentation/filesystems/nfs/nfsroot.txt +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -123,10 +123,13 @@ ip=::: Default: Determined using autoconfiguration. - Name of the client. May be supplied by autoconfiguration, - but its absence will not trigger autoconfiguration. - If specified and DHCP is used, the user provided hostname will - be carried in the DHCP request to hopefully update DNS record. + Name of the client. If a '.' character is present, anything + before the first '.' is used as the client's hostname, and anything + after it is used as its NIS domain name. May be supplied by + autoconfiguration, but its absence will not trigger autoconfiguration. + If specified and DHCP is used, the user-provided hostname (and NIS + domain name, if present) will be carried in the DHCP request; this + may cause a DNS record to be created or updated for the client. Default: Client IP address is used in ASCII notation. -- 2.14.1
[PATCH v2 5/8] ipconfig: Document /proc/net/pnp
Fully document the format used by the /proc/net/pnp file written by ipconfig, explain where its values originate from, and clarify that the tertiary name server IP and DNS domain name are only written to the file when autoconfiguration is used. Signed-off-by: Chris Novakovic --- Documentation/filesystems/nfs/nfsroot.txt | 34 ++- 1 file changed, 29 insertions(+), 5 deletions(-) diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt index 1513e5d663fd..a1030bea60d3 100644 --- a/Documentation/filesystems/nfs/nfsroot.txt +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -110,6 +110,9 @@ ip=::: will not be triggered if it is missing and NFS root is not in operation. + Value is exported to /proc/net/pnp with the prefix "bootserver " + (see below). + Default: Determined using autoconfiguration. The address of the autoconfiguration server is used. @@ -165,12 +168,33 @@ ip=::: Default: any - IP address of first nameserver. - Value gets exported by /proc/net/pnp which is often linked - on embedded systems by /etc/resolv.conf. + IP address of primary nameserver. + Value is exported to /proc/net/pnp with the prefix "nameserver " + (see below). + + Default: None if not using autoconfiguration; determined + automatically if using autoconfiguration. + + IP address of secondary nameserver. + See . + + After configuration (whether manual or automatic) is complete, a file is + created at /proc/net/pnp in the following format; lines are omitted if + their respective value is empty following configuration. + + #PROTO: (depending on configuration method) + domain (if autoconfigured, the DNS domain) + nameserver (primary name server IP) + nameserver (secondary name server IP) + nameserver (tertiary name server IP) + bootserver (NFS server IP) + + and are requested during autoconfiguration; they + cannot be specified as part of the "ip=" kernel command line parameter. - IP address of second nameserver. - Same as above. 
+ Because the "domain" and "nameserver" options are recognised by DNS + resolvers, /etc/resolv.conf is often linked to /proc/net/pnp on systems + that use an NFS root filesystem. nfsrootdebug -- 2.14.1
[PATCH v2 3/8] ipconfig: BOOTP: Don't request IEN-116 name servers
When ipconfig is autoconfigured via BOOTP, the request packet initialised by ic_bootp_init_ext() allocates 8 bytes for tag 5 ("Name Server" [1, §3.7]), but tag 5 in the response isn't processed by ic_do_bootp_ext(). Instead, allocate the 8 bytes to tag 6 ("Domain Name Server" [1, §3.8]), which is processed by ic_do_bootp_ext(), and appears to have been the intended tag to request. This won't cause any breakage for existing users, as tag 5 responses provided by BOOTP servers weren't being processed anyway. [1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions": https://tools.ietf.org/rfc/rfc2132.txt Signed-off-by: Chris Novakovic --- net/ipv4/ipconfig.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c index d0ea0ecc9008..bcf3c4f9882d 100644 --- a/net/ipv4/ipconfig.c +++ b/net/ipv4/ipconfig.c @@ -721,7 +721,7 @@ static void __init ic_bootp_init_ext(u8 *e) *e++ = 3; /* Default gateway request */ *e++ = 4; e += 4; - *e++ = 5; /* Name server request */ + *e++ = 6; /* (DNS) name server request */ *e++ = 8; e += 8; *e++ = 12; /* Host name request */ -- 2.14.1
Re: [PATCH iproute2-next v2 2/2] gre/gre6: allow clearing {,i,o}{key,seq,csum} flags
On 4/20/18 2:32 AM, Sabrina Dubroca wrote: > Currently, iproute allows setting those flags, but it's impossible to > clear them, since their current value is fetched from the kernel and > then we OR in the additional flags passed on the command line. > > Add no* variants to allow clearing them. > > Signed-off-by: Sabrina Dubroca > --- > v2: fixed up okey flag clearing > also reset the actual value of the key, not just the flag both applied to iproute2-next. Thanks,
Re: [PATCH iproute2 net-next] vxlan: add ttl auto in help message
On 4/23/18 8:40 PM, Hangbin Liu wrote: > Signed-off-by: Hangbin Liu > --- > ip/iplink_vxlan.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > applied to iproute2-next. Thanks,
Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
On Mon, 23 Apr 2018, Mikulas Patocka wrote: > The kvmalloc function tries to use kmalloc and falls back to vmalloc if > kmalloc fails. > > Unfortunatelly, some kernel code has bugs - it uses kvmalloc and then > uses DMA-API on the returned memory or frees it with kfree. Such bugs were > found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific > code. > > These bugs are hard to reproduce because kvmalloc falls back to vmalloc > only if memory is fragmented. > > In order to detect these bugs reliably I submit this patch that changes > kvmalloc to fall back to vmalloc with 1/2 probability if CONFIG_DEBUG_SG > is turned on. CONFIG_DEBUG_SG is used, because it makes the DMA API layer > verify the addresses passed to it, and so the user will get a reliable > stacktrace. > Why not just do it unconditionally? Sounds better than "50% of the time this will catch bugs". > Some bugs (such as buffer overflows) are better detected > with kmalloc code, so we must test the kmalloc path too. > > Signed-off-by: Mikulas Patocka > > --- > mm/util.c | 10 ++ > 1 file changed, 10 insertions(+) > > Index: linux-2.6/mm/util.c > === > --- linux-2.6.orig/mm/util.c 2018-04-23 00:12:05.0 +0200 > +++ linux-2.6/mm/util.c 2018-04-23 17:57:02.0 +0200 > @@ -14,6 +14,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f >*/ > WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL); > > +#ifdef CONFIG_DEBUG_SG > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. 
*/ > + if (!(prandom_u32_max(2) & 1)) > + goto do_vmalloc; > +#endif > + > /* >* We want to attempt a large physically contiguous block first because >* it is less likely to fragment multiple larger blocks and therefore > @@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f > if (ret || size <= PAGE_SIZE) > return ret; > > +#ifdef CONFIG_DEBUG_SG > +do_vmalloc: > +#endif You can just do do_vmalloc: __maybe_unused > return __vmalloc_node_flags_caller(size, node, flags, > __builtin_return_address(0)); > } > >
[PATCH iproute2 net-next] vxlan: add ttl auto in help message
Signed-off-by: Hangbin Liu --- ip/iplink_vxlan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c index 661eaa7..d89b68b 100644 --- a/ip/iplink_vxlan.c +++ b/ip/iplink_vxlan.c @@ -51,7 +51,7 @@ static void print_explain(FILE *f) "Where: VNI := 0-16777215\n" " ADDR := { IP_ADDRESS | any }\n" " TOS := { NUMBER | inherit }\n" - " TTL := { 1..255 | inherit }\n" + " TTL := { 1..255 | auto | inherit }\n" " LABEL := 0-1048575\n" ); } -- 2.5.5
Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
On 23 April 2018 21:56, Björn Töpel wrote: From: Björn Töpel This RFC introduces a new address family called AF_XDP that is optimized for high performance packet processing and, in upcoming patch sets, zero-copy semantics. In this v2 version, we have removed all zero-copy related code in order to make it smaller, simpler and hopefully more review friendly. This RFC only supports copy-mode for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX using the XDP_DRV path. Zero-copy support requires XDP and driver changes that Jesper Dangaard Brouer is working on. Some of his work has already been accepted. We will publish our zero-copy support for RX and TX on top of his patch sets at a later point in time. An AF_XDP socket (XSK) is created with the normal socket() syscall. Associated with each XSK are two queues: the RX queue and the TX queue. A socket can receive packets on the RX queue and it can send packets on the TX queue. These queues are registered and sized with the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory to have at least one of these queues for each socket. In contrast to AF_PACKET V2/V3 these descriptor queues are separated from packet buffers. An RX or TX descriptor points to a data buffer in a memory area called a UMEM. RX and TX can share the same UMEM so that a packet does not have to be copied between RX and TX. Moreover, if a packet needs to be kept for a while due to a possible retransmit, the descriptor that points to that packet can be changed to point to another and reused right away. This again avoids copying data. This new dedicated packet buffer area is called a UMEM. It consists of a number of equally sized frames and each frame has a unique frame id. A descriptor in one of the queues references a frame by referencing its frame id. The user space allocates memory for this UMEM using whatever means it feels is most appropriate (malloc, mmap, huge pages, etc). 
This memory area is then registered with the kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue and the COMPLETION queue. The fill queue is used by the application to send down frame ids for the kernel to fill in with RX packet data. References to these frames will then appear in the RX queue of the XSK once they have been received. The completion queue, on the other hand, contains frame ids that the kernel has transmitted completely and can now be used again by user space, for either TX or RX. Thus, the frame ids appearing in the completion queue are ids that were previously transmitted using the TX queue. In summary, the RX and FILL queues are used for the RX path and the TX and COMPLETION queues are used for the TX path. The socket is then finally bound with a bind() call to a device and a specific queue id on that device, and it is not until bind is completed that traffic starts to flow. Note that in this RFC, all packet data is copied out to user-space. A new feature in this RFC is that the UMEM can be shared between processes, if desired. If a process wants to do this, it simply skips the registration of the UMEM and its corresponding two queues, sets a flag in the bind call and submits the XSK of the process it would like to share UMEM with as well as its own newly created XSK socket. The new process will then receive frame id references in its own RX queue that point to this shared UMEM. Note that since the queue structures are single-consumer / single-producer (for performance reasons), the new process has to create its own socket with associated RX and TX queues, since it cannot share this with the other process. This is also the reason that there is only one set of FILL and COMPLETION queues per UMEM. It is the responsibility of a single process to handle the UMEM. If multiple-producer / multiple-consumer queues are implemented in the future, this requirement could be relaxed. 
How are packets then distributed between these two XSKs? We have introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The user-space application can place an XSK at an arbitrary place in this map. The XDP program can then redirect a packet to a specific index in this map and at this point XDP validates that the XSK in that map was indeed bound to that device and queue number. If not, the packet is dropped. If the map is empty at that index, the packet is also dropped. This also means that it is currently mandatory to have an XDP program loaded (and one XSK in the XSKMAP) to be able to get any traffic to user space through the XSK. AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the driver does not have support for XDP, or XDP_SKB is explicitly chosen when loading the XDP program, XDP_SKB mode is employed that uses SKBs together with the generic XDP support and copies out the data to user space. This is a fallback mode that works for any network device. On the other hand, if the driver has support for XDP, it will be used by t
Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
On 2018-04-23 19:15, Paul Moore wrote: > On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs wrote: > > On 2018-04-18 19:47, Paul Moore wrote: > >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs > >> wrote: > >> > Implement the proc fs write to set the audit container ID of a process, > >> > emitting an AUDIT_CONTAINER record to document the event. > >> > > >> > This is a write from the container orchestrator task to a proc entry of > >> > the form /proc/PID/containerid where PID is the process ID of the newly > >> > created task that is to become the first task in a container, or an > >> > additional task added to a container. > >> > > >> > The write expects up to a u64 value (unset: 18446744073709551615). > >> > > >> > This will produce a record such as this: > >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 > >> > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 > >> > tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 > >> > res=0 > >> > > >> > The "op" field indicates an initial set. The "pid" to "ses" fields are > >> > the orchestrator while the "opid" field is the object's PID, the process > >> > being "contained". Old and new container ID values are given in the > >> > "contid" fields, while res indicates its success. > >> > > >> > It is not permitted to self-set, unset or re-set the container ID. A > >> > child inherits its parent's container ID, but then can be set only once > >> > after. > >> > > >> > See: https://github.com/linux-audit/audit-kernel/issues/32 > >> > > >> > Signed-off-by: Richard Guy Briggs > >> > --- > >> > fs/proc/base.c | 37 > >> > include/linux/audit.h | 16 + > >> > include/linux/init_task.h | 4 ++- > >> > include/linux/sched.h | 1 + > >> > include/uapi/linux/audit.h | 2 ++ > >> > kernel/auditsc.c | 84 > >> > ++ > >> > 6 files changed, 143 insertions(+), 1 deletion(-) > > ... 
> > >> > diff --git a/include/linux/sched.h b/include/linux/sched.h > >> > index d258826..1b82191 100644 > >> > --- a/include/linux/sched.h > >> > +++ b/include/linux/sched.h > >> > @@ -796,6 +796,7 @@ struct task_struct { > >> > #ifdef CONFIG_AUDITSYSCALL > >> > kuid_t loginuid; > >> > unsigned int sessionid; > >> > + u64 containerid; > >> > >> This one line addition to the task_struct scares me the most of > >> anything in this patchset. Why? It's a field named "containerid" in > >> a perhaps one of the most widely used core kernel structures; the > >> possibilities for abuse are endless, and it's foolish to think we > >> would ever be able to adequately police this. > > > > Fair enough. > > > >> Unfortunately, we can't add the field to audit_context as things > >> currently stand because we don't always allocate an audit_context, > >> it's dependent on the system's configuration, and we need to track the > >> audit container ID for a given process, regardless of the audit > >> configuration. Pretty much the same reason why loginuid and sessionid > >> are located directly in task_struct now. As I stressed during the > >> design phase, I really want to keep this as an *audit* container ID > >> and not a general purpose kernel wide container ID. If the kernel > >> ever grows a general purpose container ID token, I'll be the first in > >> line to convert the audit code, but I don't want audit to be that > >> general purpose mechanism ... audit is hated enough as-is ;) > > > > When would we need an audit container ID when audit is not enabled > > enough to have an audit_context? > > I'm thinking of the audit_alloc() case where audit_filter_task() > returns AUDIT_DISABLED. Ok, so a task could be marked as filtered but its children would still be auditable and inherit its parent's containerid (as well as its loginuid and sessionid)... 
> I believe this is the same reason why loginuid and sessionid live > directly in the task_struct and not in the audit_context; they need to > persist for the lifetime of the task. Yes, probably. > > If it is only used for audit, and audit is the only consumer, and audit > > can only use it when it is enabled, then we can just return success to > > any write to the proc filehandle, or not even present it. Nothing will > > be able to know that value wasn't used. > > > > When are loginuid and sessionid used now when audit is not enabled (or > > should I say, explicitly disabled)? > > See above. I think that should answer these questions. Ok. > >> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h > >> > index 4e61a9e..921a71f 100644 > >> > --- a/include/uapi/linux/audit.h > >> > +++ b/include/uapi/linux/audit.h > >> > @@ -71,6 +71,7 @@ > >> > #define AUDIT_TTY_SET 1017/* Set TTY auditing status */ > >> > #define AUDIT_SET_FEATURE 1018/*
Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet wrote:
> Hi Andy
>
> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
>> I would suggest that you rework the interface a bit. First a user would
>> call mmap() on a TCP socket, which would create an empty VMA. (It would set
>> vm_ops to point to tcp_vm_ops or similar so that the TCP code could
>> recognize it, but it would have no effect whatsoever on the TCP state
>> machine. Reading the VMA would get SIGBUS.) Then a user would call a new
>> ioctl() or setsockopt() function and pass something like:
>>
>> struct tcp_zerocopy_receive {
>>         void *address;
>>         size_t length;
>> };
>>
>> The kernel would verify that [address, address+length) is entirely inside a
>> single TCP VMA and then would do the vm_insert_range magic.
>
> I have no idea what is the proper API for that.
> Where the TCP VMA(s) would be stored ?
> In TCP socket, or MM layer ?

MM layer. I haven't tested this at all, and the error handling is totally wrong, but I think you'd do something like:

	len = get_user(...);

	down_read(&current->mm->mmap_sem);
	vma = find_vma(mm, start);
	if (!vma || vma->vm_start > start)
		return -EFAULT;
	/* This is buggy.  You also need to check that the file is a
	 * socket.  This is probably trivial. */
	if (vma->vm_file->private_data != sock)
		return -EINVAL;
	if (len > vma->vm_end - start)
		return -EFAULT;	/* too big a request. */

and now you'd do the vm_insert_page() dance, except that you don't have to abort the whole procedure if you discover that something isn't aligned right. Instead you'd just stop and tell the caller that you didn't map the full requested size. You might also need to add some code to charge the caller for the pages that get pinned, but that's an orthogonal issue.

You also need to provide some way for user programs to signal that they're done with the page in question. MADV_DONTNEED might be sufficient.

In the mmap() helper, you might want to restrict the mapped size to something reasonable. 
And it might be nice to hook mremap() to prevent user code from causing too much trouble.

With my x86-writer-of-TLB-code hat on, I expect the major performance costs to be the generic costs of mmap() and munmap() (which only happen once per socket instead of once per read if you like my idea), the cost of a TLB miss when the data gets read (really not so bad on modern hardware), and the cost of the TLB invalidation when user code is done with the buffers. The latter is awful, especially in multithreaded programs. In fact, it's so bad that it might be worth mentioning in the documentation for this code that it just shouldn't be used in multithreaded processes. (Also, on non-PCID hardware, there's an annoying situation in which a recently-migrated thread that removes a mapping sends an IPI to the CPU that the thread used to be on. I thought I had a clever idea to get rid of that IPI once, but it turned out to be wrong.) Architectures like ARM that have superior TLB handling primitives will not be hurt as badly if this is used by a multithreaded program.

>
>
> And I am not sure why the error handling would be better (point 4), unless we
> can return smaller @length than requested maybe ?

Exactly. If I request 10MB mapped and only the first 9MB are aligned right, I still want the first 9 MB.

>
> Also how the VMA space would be accounted (point 3) when creating an empty
> VMA (no pages in there yet)

There's nothing to account. It's the same as mapping /dev/null or similar -- the mm core should take care of it for you.
Re: [RFC v2] virtio: support packed ring
On Tue, Apr 24, 2018 at 04:43:22AM +0300, Michael S. Tsirkin wrote: > On Tue, Apr 24, 2018 at 09:37:47AM +0800, Tiwei Bie wrote: > > On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote: > > > On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote: > > > > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote: > > > > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote: > > > > > > > > > > > > > > > > > > On 2018年04月23日 17:29, Tiwei Bie wrote: > > > > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote: > > > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote: > > > > > > > > > Hello everyone, > > > > > > > > > > > > > > > > > > This RFC implements packed ring support for virtio driver. > > > > > > > > > > > > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) > > > > > > > > > implemented > > > > > > > > > by Jens at > > > > > > > > > http://dpdk.org/ml/archives/dev/2018-January/089417.html > > > > > > > > > Minor changes are needed for the vhost code, e.g. to kick the > > > > > > > > > guest. 
> > > > > > > > > > > > > > > > > > TODO: > > > > > > > > > - Refinements and bug fixes; > > > > > > > > > - Split into small patches; > > > > > > > > > - Test indirect descriptor support; > > > > > > > > > - Test/fix event suppression support; > > > > > > > > > - Test devices other than net; > > > > > > > > > > > > > > > > > > RFC v1 -> RFC v2: > > > > > > > > > - Add indirect descriptor support - compile test only; > > > > > > > > > - Add event suppression supprt - compile test only; > > > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST); > > > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason); > > > > > > > > > - Split vring_unmap_one() for packed ring and split ring > > > > > > > > > (Jason); > > > > > > > > > - Avoid using '%' operator (Jason); > > > > > > > > > - Rename free_head -> next_avail_idx (Jason); > > > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() > > > > > > > > > (Jason); > > > > > > > > > - Some other refinements and bug fixes; > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > > > Signed-off-by: Tiwei Bie > > > > > > > > > --- > > > > > > > > >drivers/virtio/virtio_ring.c | 1094 > > > > > > > > > +--- > > > > > > > > >include/linux/virtio_ring.h|8 +- > > > > > > > > >include/uapi/linux/virtio_config.h | 12 +- > > > > > > > > >include/uapi/linux/virtio_ring.h | 61 ++ > > > > > > > > >4 files changed, 980 insertions(+), 195 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c > > > > > > > > > b/drivers/virtio/virtio_ring.c > > > > > > > > > index 71458f493cf8..0515dca34d77 100644 > > > > > > > > > --- a/drivers/virtio/virtio_ring.c > > > > > > > > > +++ b/drivers/virtio/virtio_ring.c > > > > > > > > > @@ -58,14 +58,15 @@ > > > > > > > > [...] 
> > > > > > > > > > > > > > > > > + > > > > > > > > > + if (vq->indirect) { > > > > > > > > > + u32 len; > > > > > > > > > + > > > > > > > > > + desc = vq->desc_state[head].indir_desc; > > > > > > > > > + /* Free the indirect table, if any, now that > > > > > > > > > it's unmapped. */ > > > > > > > > > + if (!desc) > > > > > > > > > + goto out; > > > > > > > > > + > > > > > > > > > + len = virtio32_to_cpu(vq->vq.vdev, > > > > > > > > > + > > > > > > > > > vq->vring_packed.desc[head].len); > > > > > > > > > + > > > > > > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags & > > > > > > > > > + cpu_to_virtio16(vq->vq.vdev, > > > > > > > > > VRING_DESC_F_INDIRECT))); > > > > > > > > It looks to me spec does not force to keep > > > > > > > > VRING_DESC_F_INDIRECT here. So we > > > > > > > > can safely remove this BUG_ON() here. > > > > > > > > > > > > > > > > > + BUG_ON(len == 0 || len % sizeof(struct > > > > > > > > > vring_packed_desc)); > > > > > > > > Len could be ignored for used descriptor according to the spec, > > > > > > > > so we need > > > > > > > > remove this BUG_ON() too. > > > > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it. > > > > > > > And I think something related to this in the spec isn't very > > > > > > > clear currently. > > > > > > > > > > > > > > In the spec, there are below words: > > > > > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272 > > > > > > > """ > > > > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE > > > > > > > is reserved and is ignored by the device. > > > > > > > """ > > > > > > > > > > > > > > So when device writes back an used descriptor in this case, > > > > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag > > > > > > > is reserved and should be ignored. > > > >
Re: [RFC v2] virtio: support packed ring
On Tue, Apr 24, 2018 at 09:37:47AM +0800, Tiwei Bie wrote: > On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote: > > On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote: > > > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote: > > > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote: > > > > > > > > > > > > > > > On 2018年04月23日 17:29, Tiwei Bie wrote: > > > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote: > > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote: > > > > > > > > Hello everyone, > > > > > > > > > > > > > > > > This RFC implements packed ring support for virtio driver. > > > > > > > > > > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) > > > > > > > > implemented > > > > > > > > by Jens at > > > > > > > > http://dpdk.org/ml/archives/dev/2018-January/089417.html > > > > > > > > Minor changes are needed for the vhost code, e.g. to kick the > > > > > > > > guest. > > > > > > > > > > > > > > > > TODO: > > > > > > > > - Refinements and bug fixes; > > > > > > > > - Split into small patches; > > > > > > > > - Test indirect descriptor support; > > > > > > > > - Test/fix event suppression support; > > > > > > > > - Test devices other than net; > > > > > > > > > > > > > > > > RFC v1 -> RFC v2: > > > > > > > > - Add indirect descriptor support - compile test only; > > > > > > > > - Add event suppression supprt - compile test only; > > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST); > > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason); > > > > > > > > - Split vring_unmap_one() for packed ring and split ring > > > > > > > > (Jason); > > > > > > > > - Avoid using '%' operator (Jason); > > > > > > > > - Rename free_head -> next_avail_idx (Jason); > > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() > > > > > > > > (Jason); > > > > > > > > - Some other refinements and bug fixes; > > > > > > > > > > > > > > > > Thanks! 
> > > > > > > > > > > > > > > > Signed-off-by: Tiwei Bie > > > > > > > > --- > > > > > > > >drivers/virtio/virtio_ring.c | 1094 > > > > > > > > +--- > > > > > > > >include/linux/virtio_ring.h|8 +- > > > > > > > >include/uapi/linux/virtio_config.h | 12 +- > > > > > > > >include/uapi/linux/virtio_ring.h | 61 ++ > > > > > > > >4 files changed, 980 insertions(+), 195 deletions(-) > > > > > > > > > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c > > > > > > > > b/drivers/virtio/virtio_ring.c > > > > > > > > index 71458f493cf8..0515dca34d77 100644 > > > > > > > > --- a/drivers/virtio/virtio_ring.c > > > > > > > > +++ b/drivers/virtio/virtio_ring.c > > > > > > > > @@ -58,14 +58,15 @@ > > > > > > > [...] > > > > > > > > > > > > > > > + > > > > > > > > + if (vq->indirect) { > > > > > > > > + u32 len; > > > > > > > > + > > > > > > > > + desc = vq->desc_state[head].indir_desc; > > > > > > > > + /* Free the indirect table, if any, now that > > > > > > > > it's unmapped. */ > > > > > > > > + if (!desc) > > > > > > > > + goto out; > > > > > > > > + > > > > > > > > + len = virtio32_to_cpu(vq->vq.vdev, > > > > > > > > + > > > > > > > > vq->vring_packed.desc[head].len); > > > > > > > > + > > > > > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags & > > > > > > > > +cpu_to_virtio16(vq->vq.vdev, > > > > > > > > VRING_DESC_F_INDIRECT))); > > > > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT > > > > > > > here. So we > > > > > > > can safely remove this BUG_ON() here. > > > > > > > > > > > > > > > + BUG_ON(len == 0 || len % sizeof(struct > > > > > > > > vring_packed_desc)); > > > > > > > Len could be ignored for used descriptor according to the spec, > > > > > > > so we need > > > > > > > remove this BUG_ON() too. > > > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it. > > > > > > And I think something related to this in the spec isn't very > > > > > > clear currently. 
> > > > > > > > > > > > In the spec, there are below words: > > > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272 > > > > > > """ > > > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE > > > > > > is reserved and is ignored by the device. > > > > > > """ > > > > > > > > > > > > So when device writes back an used descriptor in this case, > > > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag > > > > > > is reserved and should be ignored. > > > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170 > > > > > > """ > > > > > > Element Length is reserved for used descriptors without the > > > > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers.
Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
On Mon, Apr 23, 2018 at 06:25:03PM -0700, Stephen Hemminger wrote: > On Mon, 23 Apr 2018 12:44:39 -0700 > Siwei Liu wrote: > > > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin > > wrote: > > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote: > > >> On Mon, 23 Apr 2018 20:24:56 +0300 > > >> "Michael S. Tsirkin" wrote: > > >> > > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote: > > >> > > > > > > >> > > > >I will NAK patches to change to common code for netvsc especially > > >> > > > >the > > >> > > > >three device model. MS worked hard with distro vendors to > > >> > > > >support transparent > > >> > > > >mode, ans we really can't have a new model; or do backport. > > >> > > > > > > >> > > > >Plus, DPDK is now dependent on existing model. > > >> > > > > > >> > > > Sorry, but nobody here cares about dpdk or other similar oddities. > > >> > > > > > >> > > > > >> > > The network device model is a userspace API, and DPDK is a userspace > > >> > > application. > > >> > > > >> > It is userspace but are you sure dpdk is actually poking at netdevs? > > >> > AFAIK it's normally banging device registers directly. > > >> > > > >> > > You can't go breaking userspace even if you don't like the > > >> > > application. > > >> > > > >> > Could you please explain how is the proposed patchset breaking > > >> > userspace? Ignoring DPDK for now, I don't think it changes the > > >> > userspace > > >> > API at all. > > >> > > > >> > > >> The DPDK has a device driver vdev_netvsc which scans the Linux network > > >> devices > > >> to look for Linux netvsc device and the paired VF device and setup the > > >> DPDK environment. This setup creates a DPDK failsafe (bondingish) > > >> instance > > >> and sets up TAP support over the Linux netvsc device as well as the > > >> Mellanox > > >> VF device. > > >> > > >> So it depends on existing 2 device model. You can't go to a 3 device > > >> model > > >> or start hiding devices from userspace. 
> > > > > > Okay so how does the existing patch break that? IIUC does not go to > > > a 3 device model since netvsc calls failover_register directly. > > > > > >> Also, I am working on associating netvsc and VF device based on serial > > >> number > > >> rather than MAC address. The serial number is how Windows works now, and > > >> it makes > > >> sense for Linux and Windows to use the same mechanism if possible. > > > > > > Maybe we should support same for virtio ... > > > Which serial do you mean? From vpd? > > > > > > I guess you will want to keep supporting MAC for old hypervisors? > > The serial number has always been in the hypervisor since original support of > SR-IOV > in WS2008. So no backward compatibility special cases would be needed. Is that a serial from real hardware or a hypervisor thing? -- MST
[PATCH 1/1] Revert "rds: ib: add error handle"
This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc.

After a long discussion and investigation, it seems that there is no
memory leak. So this patch is reverted.

Signed-off-by: Zhu Yanjun
---
 net/rds/ib_cm.c | 47 +++
 1 file changed, 11 insertions(+), 36 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index eea1d86..d64bfaf 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -443,7 +443,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 		ic->i_send_cq = NULL;
 		ibdev_put_vector(rds_ibdev, ic->i_scq_vector);
 		rdsdebug("ib_create_cq send failed: %d\n", ret);
-		goto rds_ibdev_out;
+		goto out;
 	}

 	ic->i_rcq_vector = ibdev_get_unused_vector(rds_ibdev);
@@ -457,19 +457,19 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 		ic->i_recv_cq = NULL;
 		ibdev_put_vector(rds_ibdev, ic->i_rcq_vector);
 		rdsdebug("ib_create_cq recv failed: %d\n", ret);
-		goto send_cq_out;
+		goto out;
 	}

 	ret = ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
 	if (ret) {
 		rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
-		goto recv_cq_out;
+		goto out;
 	}

 	ret = ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
 	if (ret) {
 		rdsdebug("ib_req_notify_cq recv failed: %d\n", ret);
-		goto recv_cq_out;
+		goto out;
 	}

 	/* XXX negotiate max send/recv with remote? */
@@ -495,7 +495,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	ret = rdma_create_qp(ic->i_cm_id, ic->i_pd, &attr);
 	if (ret) {
 		rdsdebug("rdma_create_qp failed: %d\n", ret);
-		goto recv_cq_out;
+		goto out;
 	}

 	ic->i_send_hdrs = ib_dma_alloc_coherent(dev,
@@ -505,7 +505,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	if (!ic->i_send_hdrs) {
 		ret = -ENOMEM;
 		rdsdebug("ib_dma_alloc_coherent send failed\n");
-		goto qp_out;
+		goto out;
 	}

 	ic->i_recv_hdrs = ib_dma_alloc_coherent(dev,
@@ -515,7 +515,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	if (!ic->i_recv_hdrs) {
 		ret = -ENOMEM;
 		rdsdebug("ib_dma_alloc_coherent recv failed\n");
-		goto send_hdrs_dma_out;
+		goto out;
 	}

 	ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header),
@@ -523,7 +523,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	if (!ic->i_ack) {
 		ret = -ENOMEM;
 		rdsdebug("ib_dma_alloc_coherent ack failed\n");
-		goto recv_hdrs_dma_out;
+		goto out;
 	}

 	ic->i_sends = vzalloc_node(ic->i_send_ring.w_nr * sizeof(struct rds_ib_send_work),
@@ -531,7 +531,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	if (!ic->i_sends) {
 		ret = -ENOMEM;
 		rdsdebug("send allocation failed\n");
-		goto ack_dma_out;
+		goto out;
 	}

 	ic->i_recvs = vzalloc_node(ic->i_recv_ring.w_nr * sizeof(struct rds_ib_recv_work),
@@ -539,7 +539,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	if (!ic->i_recvs) {
 		ret = -ENOMEM;
 		rdsdebug("recv allocation failed\n");
-		goto sends_out;
+		goto out;
 	}

 	rds_ib_recv_init_ack(ic);
@@ -547,33 +547,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	rdsdebug("conn %p pd %p cq %p %p\n", conn, ic->i_pd,
 		 ic->i_send_cq, ic->i_recv_cq);

-	return ret;
-
-sends_out:
-	vfree(ic->i_sends);
-ack_dma_out:
-	ib_dma_free_coherent(dev, sizeof(struct rds_header),
-			     ic->i_ack, ic->i_ack_dma);
-recv_hdrs_dma_out:
-	ib_dma_free_coherent(dev, ic->i_recv_ring.w_nr *
-			     sizeof(struct rds_header),
-			     ic->i_recv_hdrs, ic->i_recv_hdrs_dma);
-send_hdrs_dma_out:
-	ib_dma_free_coherent(dev, ic->i_send_ring.w_nr *
-			     sizeof(struct rds_header),
-			     ic->i_send_hdrs, ic->i_send_hdrs_dma);
-qp_out:
-	rdma_destroy_qp(ic->i_cm_id);
-recv_cq_out:
-	if (!ib_destroy_cq(ic->i_recv_cq))
-		ic->i_recv_cq = NULL;
-send_cq_out:
-	if (!ib_destroy_cq(ic->i_send_cq))
-		ic->i_send_cq = NULL;
-rds_ibdev_out:
-	rds_ib_remove_conn(rds_ibdev, conn);
+out:
 	rds_ib_dev_put(rds_ibdev);
-
 	return ret;
 }
--
2.7.4
Re: [RFC v2] virtio: support packed ring
On Tue, Apr 24, 2018 at 04:29:51AM +0300, Michael S. Tsirkin wrote: > On Tue, Apr 24, 2018 at 09:16:38AM +0800, Tiwei Bie wrote: > > On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote: > > > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote: > > > > > > > > > > > > On 2018年04月23日 17:29, Tiwei Bie wrote: > > > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote: > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote: > > > > > > > Hello everyone, > > > > > > > > > > > > > > This RFC implements packed ring support for virtio driver. > > > > > > > > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) > > > > > > > implemented > > > > > > > by Jens at > > > > > > > http://dpdk.org/ml/archives/dev/2018-January/089417.html > > > > > > > Minor changes are needed for the vhost code, e.g. to kick the > > > > > > > guest. > > > > > > > > > > > > > > TODO: > > > > > > > - Refinements and bug fixes; > > > > > > > - Split into small patches; > > > > > > > - Test indirect descriptor support; > > > > > > > - Test/fix event suppression support; > > > > > > > - Test devices other than net; > > > > > > > > > > > > > > RFC v1 -> RFC v2: > > > > > > > - Add indirect descriptor support - compile test only; > > > > > > > - Add event suppression supprt - compile test only; > > > > > > > - Move vring_packed_init() out of uapi (Jason, MST); > > > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason); > > > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason); > > > > > > > - Avoid using '%' operator (Jason); > > > > > > > - Rename free_head -> next_avail_idx (Jason); > > > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason); > > > > > > > - Some other refinements and bug fixes; > > > > > > > > > > > > > > Thanks! 
> > > > > > > > > > > > > > Signed-off-by: Tiwei Bie > > > > > > > --- > > > > > > >drivers/virtio/virtio_ring.c | 1094 > > > > > > > +--- > > > > > > >include/linux/virtio_ring.h|8 +- > > > > > > >include/uapi/linux/virtio_config.h | 12 +- > > > > > > >include/uapi/linux/virtio_ring.h | 61 ++ > > > > > > >4 files changed, 980 insertions(+), 195 deletions(-) > > > > > > > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c > > > > > > > b/drivers/virtio/virtio_ring.c > > > > > > > index 71458f493cf8..0515dca34d77 100644 > > > > > > > --- a/drivers/virtio/virtio_ring.c > > > > > > > +++ b/drivers/virtio/virtio_ring.c > > > > > > > @@ -58,14 +58,15 @@ > > > > > > [...] > > > > > > > > > > > > > + > > > > > > > + if (vq->indirect) { > > > > > > > + u32 len; > > > > > > > + > > > > > > > + desc = vq->desc_state[head].indir_desc; > > > > > > > + /* Free the indirect table, if any, now that it's > > > > > > > unmapped. */ > > > > > > > + if (!desc) > > > > > > > + goto out; > > > > > > > + > > > > > > > + len = virtio32_to_cpu(vq->vq.vdev, > > > > > > > + vq->vring_packed.desc[head].len); > > > > > > > + > > > > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags & > > > > > > > + cpu_to_virtio16(vq->vq.vdev, > > > > > > > VRING_DESC_F_INDIRECT))); > > > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT > > > > > > here. So we > > > > > > can safely remove this BUG_ON() here. > > > > > > > > > > > > > + BUG_ON(len == 0 || len % sizeof(struct > > > > > > > vring_packed_desc)); > > > > > > Len could be ignored for used descriptor according to the spec, so > > > > > > we need > > > > > > remove this BUG_ON() too. > > > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it. > > > > > And I think something related to this in the spec isn't very > > > > > clear currently. 
> > > > > > > > > > In the spec, there are below words: > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272 > > > > > """ > > > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE > > > > > is reserved and is ignored by the device. > > > > > """ > > > > > > > > > > So when device writes back an used descriptor in this case, > > > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag > > > > > is reserved and should be ignored. > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170 > > > > > """ > > > > > Element Length is reserved for used descriptors without the > > > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers. > > > > > """ > > > > > > > > > > And this is the way how driver ignores the `len` in an used > > > > > descriptor. > > > > > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241 > > > > > """ > > > > > To increase ring capacity the driver can store a (read-only > > > > > by the device) table of indirect descriptors anywhere in memory, >
Re: [RFC v2] virtio: support packed ring
Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
On Mon, 23 Apr 2018 23:06:55 +0300 "Michael S. Tsirkin" wrote: > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote: > > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin > > wrote: > > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote: > > >> On Mon, 23 Apr 2018 20:24:56 +0300 > > >> "Michael S. Tsirkin" wrote: > > >> > > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote: > > >> > > > > > > >> > > > >I will NAK patches to change to common code for netvsc especially > > >> > > > >the > > >> > > > >three device model. MS worked hard with distro vendors to > > >> > > > >support transparent > > >> > > > >mode, ans we really can't have a new model; or do backport. > > >> > > > > > > >> > > > >Plus, DPDK is now dependent on existing model. > > >> > > > > > >> > > > Sorry, but nobody here cares about dpdk or other similar oddities. > > >> > > > > > >> > > > > >> > > The network device model is a userspace API, and DPDK is a userspace > > >> > > application. > > >> > > > >> > It is userspace but are you sure dpdk is actually poking at netdevs? > > >> > AFAIK it's normally banging device registers directly. > > >> > > > >> > > You can't go breaking userspace even if you don't like the > > >> > > application. > > >> > > > >> > Could you please explain how is the proposed patchset breaking > > >> > userspace? Ignoring DPDK for now, I don't think it changes the > > >> > userspace > > >> > API at all. > > >> > > > >> > > >> The DPDK has a device driver vdev_netvsc which scans the Linux network > > >> devices > > >> to look for Linux netvsc device and the paired VF device and setup the > > >> DPDK environment. This setup creates a DPDK failsafe (bondingish) > > >> instance > > >> and sets up TAP support over the Linux netvsc device as well as the > > >> Mellanox > > >> VF device. > > >> > > >> So it depends on existing 2 device model. You can't go to a 3 device > > >> model > > >> or start hiding devices from userspace. 
> > > > > > Okay so how does the existing patch break that? IIUC does not go to > > > a 3 device model since netvsc calls failover_register directly. > > > > > >> Also, I am working on associating netvsc and VF device based on serial > > >> number > > >> rather than MAC address. The serial number is how Windows works now, and > > >> it makes > > >> sense for Linux and Windows to use the same mechanism if possible. > > > > > > Maybe we should support same for virtio ... > > > Which serial do you mean? From vpd? > > > > > > I guess you will want to keep supporting MAC for old hypervisors? > > > > > > It all seems like a reasonable thing to support in the generic core. > > > > That's the reason why I chose explicit identifier rather than rely on > > MAC address to bind/pair a device. MAC address can change. Even if it > > can't, malicious guest user can fake MAC address to skip binding. > > > > -Siwei > > Address should be sampled at device creation to prevent this > kind of hack. Not that it buys the malicious user much: > if you can poke at MAC addresses you probably already can > break networking. On Hyper-V guest can't really change MAC address if SR-IOV is enabled. Also, Linux has permanent ether address in netdev which is what should be used to avoid user messing with device.
Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
On Mon, 23 Apr 2018 12:44:39 -0700 Siwei Liu wrote: > On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin wrote: > > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote: > >> On Mon, 23 Apr 2018 20:24:56 +0300 > >> "Michael S. Tsirkin" wrote: > >> > >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote: > >> > > > > > >> > > > >I will NAK patches to change to common code for netvsc especially > >> > > > >the > >> > > > >three device model. MS worked hard with distro vendors to support > >> > > > >transparent > >> > > > >mode, ans we really can't have a new model; or do backport. > >> > > > > > >> > > > >Plus, DPDK is now dependent on existing model. > >> > > > > >> > > > Sorry, but nobody here cares about dpdk or other similar oddities. > >> > > > >> > > The network device model is a userspace API, and DPDK is a userspace > >> > > application. > >> > > >> > It is userspace but are you sure dpdk is actually poking at netdevs? > >> > AFAIK it's normally banging device registers directly. > >> > > >> > > You can't go breaking userspace even if you don't like the > >> > > application. > >> > > >> > Could you please explain how is the proposed patchset breaking > >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace > >> > API at all. > >> > > >> > >> The DPDK has a device driver vdev_netvsc which scans the Linux network > >> devices > >> to look for Linux netvsc device and the paired VF device and setup the > >> DPDK environment. This setup creates a DPDK failsafe (bondingish) instance > >> and sets up TAP support over the Linux netvsc device as well as the > >> Mellanox > >> VF device. > >> > >> So it depends on existing 2 device model. You can't go to a 3 device model > >> or start hiding devices from userspace. > > > > Okay so how does the existing patch break that? IIUC does not go to > > a 3 device model since netvsc calls failover_register directly. 
> > > >> Also, I am working on associating netvsc and VF device based on serial > >> number > >> rather than MAC address. The serial number is how Windows works now, and > >> it makes > >> sense for Linux and Windows to use the same mechanism if possible. > > > > Maybe we should support same for virtio ... > > Which serial do you mean? From vpd? > > > > I guess you will want to keep supporting MAC for old hypervisors? The serial number has always been in the hypervisor since original support of SR-IOV in WS2008. So no backward compatibility special cases would be needed.
Re: [PATCH net 0/3] amd-xgbe: AMD XGBE driver fixes 2018-04-23
From: Tom Lendacky Date: Mon, 23 Apr 2018 11:42:58 -0500 > This patch series addresses some issues in the AMD XGBE driver. > > The following fixes are included in this driver update series: > > - Improve KR auto-negotiation and training (2 patches) > - Add pre and post auto-negotiation hooks > - Use the pre and post auto-negotiation hooks to disable CDR tracking > during auto-negotiation page exchange in KR mode > - Check for SFP transceiver signal support and only use the signal if the > SFP indicates that it is supported > > This patch series is based on net. > > --- > > Please queue this patch series up to stable releases 4.14 and above. Series applied and queued up for -stable, thanks.
Re: [RFC v2] virtio: support packed ring
On Tue, Apr 24, 2018 at 04:05:07AM +0300, Michael S. Tsirkin wrote: > On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote: > > > > > > On 2018年04月23日 17:29, Tiwei Bie wrote: > > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote: > > > > On 2018年04月01日 22:12, Tiwei Bie wrote: > > > > > Hello everyone, > > > > > > > > > > This RFC implements packed ring support for virtio driver. > > > > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented > > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html > > > > > Minor changes are needed for the vhost code, e.g. to kick the guest. > > > > > > > > > > TODO: > > > > > - Refinements and bug fixes; > > > > > - Split into small patches; > > > > > - Test indirect descriptor support; > > > > > - Test/fix event suppression support; > > > > > - Test devices other than net; > > > > > > > > > > RFC v1 -> RFC v2: > > > > > - Add indirect descriptor support - compile test only; > > > > > - Add event suppression supprt - compile test only; > > > > > - Move vring_packed_init() out of uapi (Jason, MST); > > > > > - Merge two loops into one in virtqueue_add_packed() (Jason); > > > > > - Split vring_unmap_one() for packed ring and split ring (Jason); > > > > > - Avoid using '%' operator (Jason); > > > > > - Rename free_head -> next_avail_idx (Jason); > > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason); > > > > > - Some other refinements and bug fixes; > > > > > > > > > > Thanks! 
> > > > > > > > > > Signed-off-by: Tiwei Bie > > > > > --- > > > > >drivers/virtio/virtio_ring.c | 1094 > > > > > +--- > > > > >include/linux/virtio_ring.h|8 +- > > > > >include/uapi/linux/virtio_config.h | 12 +- > > > > >include/uapi/linux/virtio_ring.h | 61 ++ > > > > >4 files changed, 980 insertions(+), 195 deletions(-) > > > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c > > > > > b/drivers/virtio/virtio_ring.c > > > > > index 71458f493cf8..0515dca34d77 100644 > > > > > --- a/drivers/virtio/virtio_ring.c > > > > > +++ b/drivers/virtio/virtio_ring.c > > > > > @@ -58,14 +58,15 @@ > > > > [...] > > > > > > > > > + > > > > > + if (vq->indirect) { > > > > > + u32 len; > > > > > + > > > > > + desc = vq->desc_state[head].indir_desc; > > > > > + /* Free the indirect table, if any, now that it's > > > > > unmapped. */ > > > > > + if (!desc) > > > > > + goto out; > > > > > + > > > > > + len = virtio32_to_cpu(vq->vq.vdev, > > > > > + vq->vring_packed.desc[head].len); > > > > > + > > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags & > > > > > + cpu_to_virtio16(vq->vq.vdev, > > > > > VRING_DESC_F_INDIRECT))); > > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. > > > > So we > > > > can safely remove this BUG_ON() here. > > > > > > > > > + BUG_ON(len == 0 || len % sizeof(struct > > > > > vring_packed_desc)); > > > > Len could be ignored for used descriptor according to the spec, so we > > > > need > > > > remove this BUG_ON() too. > > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it. > > > And I think something related to this in the spec isn't very > > > clear currently. > > > > > > In the spec, there are below words: > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272 > > > """ > > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE > > > is reserved and is ignored by the device. 
> > > """ > > > > > > So when device writes back an used descriptor in this case, > > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag > > > is reserved and should be ignored. > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170 > > > """ > > > Element Length is reserved for used descriptors without the > > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers. > > > """ > > > > > > And this is the way how driver ignores the `len` in an used > > > descriptor. > > > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241 > > > """ > > > To increase ring capacity the driver can store a (read-only > > > by the device) table of indirect descriptors anywhere in memory, > > > and insert a descriptor in the main virtqueue (with \field{Flags} > > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element > > > containing this indirect descriptor table; > > > """ > > > > > > So the indirect descriptors in the table are read-only by > > > the device. And the only descriptor which is writeable by > > > the device is the descriptor in the main virtqueue (with > > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the > > > `len` in this descriptor, we won't be able to get the > > > length of the data written by
Re: [RFC v2] virtio: support packed ring
On 2018年04月24日 09:05, Michael S. Tsirkin wrote: + if (vq->indirect) { + u32 len; + + desc = vq->desc_state[head].indir_desc; + /* Free the indirect table, if any, now that it's unmapped. */ + if (!desc) + goto out; + + len = virtio32_to_cpu(vq->vq.vdev, + vq->vring_packed.desc[head].len); + + BUG_ON(!(vq->vring_packed.desc[head].flags & +cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT))); It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So we can safely remove this BUG_ON() here. + BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc)); Len could be ignored for used descriptor according to the spec, so we need remove this BUG_ON() too. Yeah, you're right! The BUG_ON() isn't right. I'll remove it. And I think something related to this in the spec isn't very clear currently. In the spec, there are below words: https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272 """ In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE is reserved and is ignored by the device. """ So when device writes back an used descriptor in this case, device may not set the VIRTQ_DESC_F_WRITE flag as the flag is reserved and should be ignored. https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170 """ Element Length is reserved for used descriptors without the VIRTQ_DESC_F_WRITE flag, and is ignored by drivers. """ And this is the way how driver ignores the `len` in an used descriptor. https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241 """ To increase ring capacity the driver can store a (read-only by the device) table of indirect descriptors anywhere in memory, and insert a descriptor in the main virtqueue (with \field{Flags} bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element containing this indirect descriptor table; """ So the indirect descriptors in the table are read-only by the device. 
And the only descriptor which is writeable by the device is the descriptor in the main virtqueue (with Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the `len` in this descriptor, we won't be able to get the length of the data written by the device. So I think the `len` in this descriptor will carry the length of the data written by the device (if the buffers are writable to the device) even if the VIRTQ_DESC_F_WRITE isn't set by the device. What do you think? Yes I think so. But we'd better get clarification from Michael. I think if you use a descriptor, and you want to supply len to guest, you set VIRTQ_DESC_F_WRITE in the used descriptor. Spec also says you must not set VIRTQ_DESC_F_INDIRECT then. If that's a problem we could look at relaxing that last requirement - does the driver want INDIRECT in the used descriptor to match the value in the avail descriptor for some reason? Looks not, so what I get is: - the device can set the VIRTQ_DESC_F_WRITE flag for a used descriptor when needed - there is no need to keep the INDIRECT flag in a used descriptor So for the above case, we can just have a used descriptor with _F_WRITE but without the INDIRECT flag. Thanks
Re: [PATCH net] pppoe: check sockaddr length in pppoe_connect()
From: Guillaume Nault Date: Mon, 23 Apr 2018 16:38:27 +0200 > We must validate sockaddr_len, otherwise userspace can pass fewer data > than we expect and we end up accessing invalid data. > > Fixes: 224cf5ad14c0 ("ppp: Move the PPP drivers") > Reported-by: syzbot+4f03bdf92fdf9ef5d...@syzkaller.appspotmail.com > Signed-off-by: Guillaume Nault Applied and queued up for -stable, thank you.
Re: [PATCH net] l2tp: check sockaddr length in pppol2tp_connect()
From: Guillaume Nault Date: Mon, 23 Apr 2018 16:15:14 +0200 > Check sockaddr_len before dereferencing sp->sa_protocol, to ensure that > it actually points to valid data. > > Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp > parts") > Reported-by: syzbot+a70ac890b23b1bf29...@syzkaller.appspotmail.com > Signed-off-by: Guillaume Nault Applied and queued up for -stable. I guess you can completely remove the "bad socket address" -EINVAL else clause later in the function as a cleanup in net-next.
Re: [PATCH] selftests: net: update .gitignore with missing test
From: Anders Roxell Date: Mon, 23 Apr 2018 16:00:50 +0200 > Fixes: 192dc405f308 ("selftests: net: add tcp_mmap program") > Signed-off-by: Anders Roxell Applied, thanks.
Re: [RFC v2] virtio: support packed ring
On Tue, Apr 24, 2018 at 08:54:52AM +0800, Jason Wang wrote: > > > On 2018年04月23日 17:29, Tiwei Bie wrote: > > On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote: > > > On 2018年04月01日 22:12, Tiwei Bie wrote: > > > > Hello everyone, > > > > > > > > This RFC implements packed ring support for virtio driver. > > > > > > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented > > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html > > > > Minor changes are needed for the vhost code, e.g. to kick the guest. > > > > > > > > TODO: > > > > - Refinements and bug fixes; > > > > - Split into small patches; > > > > - Test indirect descriptor support; > > > > - Test/fix event suppression support; > > > > - Test devices other than net; > > > > > > > > RFC v1 -> RFC v2: > > > > - Add indirect descriptor support - compile test only; > > > > - Add event suppression supprt - compile test only; > > > > - Move vring_packed_init() out of uapi (Jason, MST); > > > > - Merge two loops into one in virtqueue_add_packed() (Jason); > > > > - Split vring_unmap_one() for packed ring and split ring (Jason); > > > > - Avoid using '%' operator (Jason); > > > > - Rename free_head -> next_avail_idx (Jason); > > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason); > > > > - Some other refinements and bug fixes; > > > > > > > > Thanks! > > > > > > > > Signed-off-by: Tiwei Bie > > > > --- > > > >drivers/virtio/virtio_ring.c | 1094 > > > > +--- > > > >include/linux/virtio_ring.h|8 +- > > > >include/uapi/linux/virtio_config.h | 12 +- > > > >include/uapi/linux/virtio_ring.h | 61 ++ > > > >4 files changed, 980 insertions(+), 195 deletions(-) > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > > > > index 71458f493cf8..0515dca34d77 100644 > > > > --- a/drivers/virtio/virtio_ring.c > > > > +++ b/drivers/virtio/virtio_ring.c > > > > @@ -58,14 +58,15 @@ > > > [...] 
> > > > > > > + > > > > + if (vq->indirect) { > > > > + u32 len; > > > > + > > > > + desc = vq->desc_state[head].indir_desc; > > > > + /* Free the indirect table, if any, now that it's > > > > unmapped. */ > > > > + if (!desc) > > > > + goto out; > > > > + > > > > + len = virtio32_to_cpu(vq->vq.vdev, > > > > + vq->vring_packed.desc[head].len); > > > > + > > > > + BUG_ON(!(vq->vring_packed.desc[head].flags & > > > > +cpu_to_virtio16(vq->vq.vdev, > > > > VRING_DESC_F_INDIRECT))); > > > It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. So > > > we > > > can safely remove this BUG_ON() here. > > > > > > > + BUG_ON(len == 0 || len % sizeof(struct > > > > vring_packed_desc)); > > > Len could be ignored for used descriptor according to the spec, so we need > > > remove this BUG_ON() too. > > Yeah, you're right! The BUG_ON() isn't right. I'll remove it. > > And I think something related to this in the spec isn't very > > clear currently. > > > > In the spec, there are below words: > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272 > > """ > > In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE > > is reserved and is ignored by the device. > > """ > > > > So when device writes back an used descriptor in this case, > > device may not set the VIRTQ_DESC_F_WRITE flag as the flag > > is reserved and should be ignored. > > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170 > > """ > > Element Length is reserved for used descriptors without the > > VIRTQ_DESC_F_WRITE flag, and is ignored by drivers. > > """ > > > > And this is the way how driver ignores the `len` in an used > > descriptor. 
> > > > https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241 > > """ > > To increase ring capacity the driver can store a (read-only > > by the device) table of indirect descriptors anywhere in memory, > > and insert a descriptor in the main virtqueue (with \field{Flags} > > bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element > > containing this indirect descriptor table; > > """ > > > > So the indirect descriptors in the table are read-only by > > the device. And the only descriptor which is writeable by > > the device is the descriptor in the main virtqueue (with > > Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the > > `len` in this descriptor, we won't be able to get the > > length of the data written by the device. > > > > So I think the `len` in this descriptor will carry the > > length of the data written by the device (if the buffers > > are writable to the device) even if the VIRTQ_DESC_F_WRITE > > isn't set by the device. How do you think? > > Yes I think so. But we'd better need clari
Re: [RFC V3 PATCH 0/8] Packed ring for vhost
On April 24, 2018 04:11, Konrad Rzeszutek Wilk wrote: On Mon, Apr 23, 2018 at 10:59:43PM +0300, Michael S. Tsirkin wrote: On Mon, Apr 23, 2018 at 03:31:20PM -0400, Konrad Rzeszutek Wilk wrote: On Mon, Apr 23, 2018 at 01:34:52PM +0800, Jason Wang wrote: Hi all: This RFC implements packed ring layout. The code was tested with Tiwei's RFC V2 at https://lkml.org/lkml/2018/4/1/48. Some fixups and tweaks were needed on top of Tiwei's code to make it run. TCP stream and pktgen do not show an obvious difference compared with split ring. I have to ask then - what is the benefit of this? You can use this with dpdk within guest. The DPDK version did see a performance improvement so hopefully with Is there a link to this performance improvement document? You probably want this: https://www.mail-archive.com/dev@dpdk.org/msg97735.html Thanks
Re: [PATCH] dca: make function dca_common_get_tag static
From: Colin King Date: Mon, 23 Apr 2018 13:49:38 +0100 > From: Colin Ian King > > Function dca_common_get_tag is local to the source and does not need to be > in global scope, so make it static. > > Cleans up sparse warning: > drivers/dca/dca-core.c:273:4: warning: symbol 'dca_common_get_tag' was > not declared. Should it be static? > > Signed-off-by: Colin Ian King Applied to net-next, thanks.
Re: [PATCH V6 net-next 08/15] net/tls: Support TLS device offload with IPv6
From: Boris Pismenny Date: Sun, 22 Apr 2018 18:19:50 +0300 > @@ -97,13 +102,57 @@ static void tls_device_queue_ctx_destruction(struct > tls_context *ctx) > spin_unlock_irqrestore(&tls_device_lock, flags); > } > > +#if IS_ENABLED(CONFIG_IPV6) > +static struct net_device *ipv6_get_netdev(struct sock *sk) > +{ > + struct net_device *dev = NULL; > + struct inet_sock *inet = inet_sk(sk); > + struct ipv6_pinfo *np = inet6_sk(sk); > + struct flowi6 _fl6, *fl6 = &_fl6; > + struct dst_entry *dst; Ugh, please use sk->sk_dst_cache->dev and avoid all of the unnecessary work. Thank you.
Re: [RFC v2] virtio: support packed ring
On 2018年04月23日 17:29, Tiwei Bie wrote: On Mon, Apr 23, 2018 at 01:42:14PM +0800, Jason Wang wrote: On 2018年04月01日 22:12, Tiwei Bie wrote: Hello everyone, This RFC implements packed ring support for virtio driver. The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html Minor changes are needed for the vhost code, e.g. to kick the guest. TODO: - Refinements and bug fixes; - Split into small patches; - Test indirect descriptor support; - Test/fix event suppression support; - Test devices other than net; RFC v1 -> RFC v2: - Add indirect descriptor support - compile test only; - Add event suppression supprt - compile test only; - Move vring_packed_init() out of uapi (Jason, MST); - Merge two loops into one in virtqueue_add_packed() (Jason); - Split vring_unmap_one() for packed ring and split ring (Jason); - Avoid using '%' operator (Jason); - Rename free_head -> next_avail_idx (Jason); - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason); - Some other refinements and bug fixes; Thanks! Signed-off-by: Tiwei Bie --- drivers/virtio/virtio_ring.c | 1094 +--- include/linux/virtio_ring.h|8 +- include/uapi/linux/virtio_config.h | 12 +- include/uapi/linux/virtio_ring.h | 61 ++ 4 files changed, 980 insertions(+), 195 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 71458f493cf8..0515dca34d77 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -58,14 +58,15 @@ [...] + + if (vq->indirect) { + u32 len; + + desc = vq->desc_state[head].indir_desc; + /* Free the indirect table, if any, now that it's unmapped. */ + if (!desc) + goto out; + + len = virtio32_to_cpu(vq->vq.vdev, + vq->vring_packed.desc[head].len); + + BUG_ON(!(vq->vring_packed.desc[head].flags & +cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_INDIRECT))); It looks to me spec does not force to keep VRING_DESC_F_INDIRECT here. 
So we can safely remove this BUG_ON() here. + BUG_ON(len == 0 || len % sizeof(struct vring_packed_desc)); Len could be ignored for used descriptor according to the spec, so we need remove this BUG_ON() too. Yeah, you're right! The BUG_ON() isn't right. I'll remove it. And I think something related to this in the spec isn't very clear currently. In the spec, there are below words: https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L272 """ In descriptors with VIRTQ_DESC_F_INDIRECT set VIRTQ_DESC_F_WRITE is reserved and is ignored by the device. """ So when device writes back an used descriptor in this case, device may not set the VIRTQ_DESC_F_WRITE flag as the flag is reserved and should be ignored. https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L170 """ Element Length is reserved for used descriptors without the VIRTQ_DESC_F_WRITE flag, and is ignored by drivers. """ And this is the way how driver ignores the `len` in an used descriptor. https://github.com/oasis-tcs/virtio-spec/blob/d4fec517dfcf/packed-ring.tex#L241 """ To increase ring capacity the driver can store a (read-only by the device) table of indirect descriptors anywhere in memory, and insert a descriptor in the main virtqueue (with \field{Flags} bit VIRTQ_DESC_F_INDIRECT on) that refers to a buffer element containing this indirect descriptor table; """ So the indirect descriptors in the table are read-only by the device. And the only descriptor which is writeable by the device is the descriptor in the main virtqueue (with Flags bit VIRTQ_DESC_F_INDIRECT on). So if we ignore the `len` in this descriptor, we won't be able to get the length of the data written by the device. So I think the `len` in this descriptor will carry the length of the data written by the device (if the buffers are writable to the device) even if the VIRTQ_DESC_F_WRITE isn't set by the device. How do you think? Yes I think so. But we'd better need clarification from Michael. 
The reason is we don't touch descriptor ring in the case of split, so BUG_ON()s may help there. + + for (j = 0; j < len / sizeof(struct vring_packed_desc); j++) + vring_unmap_one_packed(vq, &desc[j]); + + kfree(desc); + vq->desc_state[head].indir_desc = NULL; + } else if (ctx) { + *ctx = vq->desc_state[head].indir_desc; + } + +out: + return vq->desc_state[head].num; +} + +static inline bool more_used_split(const struct vring_virtqueue *vq) { return vq->last_used_idx != virtio16_to_cpu(vq->vq.vdev, vq->vring.used->idx); } +static inline bool more_used_packed(const struct vring_virtqueue *vq) +{ + u16 last_used, flags; + b
Re: [PATCH V6 net-next 07/15] net/tls: Add generic NIC offload infrastructure
From: Boris Pismenny Date: Sun, 22 Apr 2018 18:19:49 +0300 > +/* We assume that the socket is already connected */ > +static struct net_device *get_netdev_for_sock(struct sock *sk) > +{ > + struct inet_sock *inet = inet_sk(sk); > + struct net_device *netdev = NULL; > + > + netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif); > + > + return netdev; > +} Please do this more directly by looking at sk->sk_dst_cache and taking the device from dst->dev if non-NULL. That avoids the dev_get_by_index() demux work, and also if sk->sk_dst_cache is non-NULL then the socket is indeed connected. Thanks.
Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
On Mon, 23 Apr 2018, Michal Hocko wrote: > On Mon 23-04-18 10:06:08, Mikulas Patocka wrote: > > > > > He didn't want to fix vmalloc(GFP_NOIO) > > > > > > I don't remember that conversation, so I don't know whether I agree with > > > his reasoning or not. But we are supposed to be moving away from GFP_NOIO > > > towards marking regions with memalloc_noio_save() / restore. If you do > > > that, you won't need vmalloc(GFP_NOIO). > > > > He said the same thing a year ago. And there was small progress. 6 out of > > 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in > > infiniband and 1 in btrfs. (the whole discussion is here > > http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html ) > > Well this is not that easy. It requires a cooperation from maintainers. > I can only do as much. I've posted patches in the past and actively > bringing up this topic at LSFMM last two years... You're right - but you have chosen the uneasy path. Fixing __vmalloc code is easy and it doesn't require cooperation with maintainers. > > He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4 > > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why > > does he have veto over this part of the code? I'd much rather argue with > > people who have constructive comments about fixing bugs than with him. > > I didn't NACK the patch AFAIR. I've said it is not a good idea longterm. > I would be much more willing to change my mind if you would back your > patch by a real bug report. Hacks are acceptable when we have a real > issue in hands. But if we want to fix potential issue then better make > it properly. Developers should fix bugs in advance, not wait until a crash happens, is analyzed and reported. What's the problem with a 15-line hack? Is the problem that kernel developers would feel depressed when looking at the source code? Other than harming developers' feelings, I don't see what kind of damage that piece of code could do.
Mikulas
Re: [PATCH bpf-next v4 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
Hi, Peter, I have a question regarding one of your comments below. On 3/12/18 3:01 PM, Peter Zijlstra wrote: On Mon, Mar 12, 2018 at 01:39:56PM -0700, Song Liu wrote: +static void stack_map_get_build_id_offset(struct bpf_map *map, + struct stack_map_bucket *bucket, + u64 *ips, u32 trace_nr) +{ + int i; + struct vm_area_struct *vma; + struct bpf_stack_build_id *id_offs; + + bucket->nr = trace_nr; + id_offs = (struct bpf_stack_build_id *)bucket->data; + + if (!current || !current->mm || + down_read_trylock(&current->mm->mmap_sem) == 0) { You probably want an in_nmi() before the down_read_trylock(). Doing up_read() is an absolute no-no from NMI context. Below is the final code from Song: /* * We cannot do up_read() in nmi context, so build_id lookup is * only supported for non-nmi events. If at some point, it is * possible to run find_vma() without taking the semaphore, we * would like to allow build_id lookup in nmi context. * * Same fallback is used for kernel stack (!user) on a stackmap * with build_id. */ if (!user || !current || !current->mm || in_nmi() || down_read_trylock(&current->mm->mmap_sem) == 0) { /* cannot access current->mm, fall back to ips */ for (i = 0; i < trace_nr; i++) { id_offs[i].status = BPF_STACK_BUILD_ID_IP; id_offs[i].ip = ips[i]; } return; } And IIUC it's 'trivial' to use this stuff with hardware counters. Here, you mentioned that it was 'trivial' to use the buildid thing with hardware counters, if I interpreted correctly. However, the hardware counter overflow will trigger NMI. Based on the above logic, it will default to the old IP-only behavior. Could you explain a little more how to get buildid with hardware counter overflow events? Thanks!
+ /* cannot access current->mm, fall back to ips */ + for (i = 0; i < trace_nr; i++) { + id_offs[i].status = BPF_STACK_BUILD_ID_IP; + id_offs[i].ip = ips[i]; + } + return; + } + + for (i = 0; i < trace_nr; i++) { + vma = find_vma(current->mm, ips[i]); + if (!vma || stack_map_get_build_id(vma, id_offs[i].build_id)) { + /* per entry fall back to ips */ + id_offs[i].status = BPF_STACK_BUILD_ID_IP; + id_offs[i].ip = ips[i]; + continue; + } + id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i] + - vma->vm_start; + id_offs[i].status = BPF_STACK_BUILD_ID_VALID; + } + up_read(&current->mm->mmap_sem); +}
[PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG
The kvmalloc function tries to use kmalloc and falls back to vmalloc if kmalloc fails. Unfortunately, some kernel code has bugs - it uses kvmalloc and then uses DMA-API on the returned memory or frees it with kfree. Such bugs were found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific code. These bugs are hard to reproduce because kvmalloc falls back to vmalloc only if memory is fragmented. In order to detect these bugs reliably, I submit this patch that changes kvmalloc to fall back to vmalloc with 1/2 probability if CONFIG_DEBUG_SG is turned on. CONFIG_DEBUG_SG is used, because it makes the DMA API layer verify the addresses passed to it, and so the user will get a reliable stacktrace. Some bugs (such as buffer overflows) are better detected with kmalloc code, so we must test the kmalloc path too. Signed-off-by: Mikulas Patocka --- mm/util.c | 10 ++ 1 file changed, 10 insertions(+) Index: linux-2.6/mm/util.c === --- linux-2.6.orig/mm/util.c2018-04-23 00:12:05.0 +0200 +++ linux-2.6/mm/util.c 2018-04-23 17:57:02.0 +0200 @@ -14,6 +14,7 @@ #include #include #include +#include #include #include @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f */ WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL); +#ifdef CONFIG_DEBUG_SG + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */ + if (!(prandom_u32_max(2) & 1)) + goto do_vmalloc; +#endif + /* * We want to attempt a large physically contiguous block first because * it is less likely to fragment multiple larger blocks and therefore @@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f if (ret || size <= PAGE_SIZE) return ret; +#ifdef CONFIG_DEBUG_SG +do_vmalloc: +#endif return __vmalloc_node_flags_caller(size, node, flags, __builtin_return_address(0)); }
Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
On Mon, Apr 23, 2018 at 7:21 PM, Michael S. Tsirkin wrote: > On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote: >> From: Magnus Karlsson >> >> Here, we add another setsockopt for registered user memory (umem) >> called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can >> ask the kernel to allocate a queue (ring buffer) and also mmap it >> (XDP_UMEM_PGOFF_FILL_QUEUE) into the process. >> >> The queue is used to explicitly pass ownership of umem frames from the >> user process to the kernel. These frames will in a later patch be >> filled in with Rx packet data by the kernel. >> >> Signed-off-by: Magnus Karlsson >> --- >> include/uapi/linux/if_xdp.h | 15 +++ >> net/xdp/Makefile| 2 +- >> net/xdp/xdp_umem.c | 5 >> net/xdp/xdp_umem.h | 2 ++ >> net/xdp/xsk.c | 62 >> - >> net/xdp/xsk_queue.c | 58 ++ >> net/xdp/xsk_queue.h | 38 +++ >> 7 files changed, 180 insertions(+), 2 deletions(-) >> create mode 100644 net/xdp/xsk_queue.c >> create mode 100644 net/xdp/xsk_queue.h >> >> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h >> index 41252135a0fe..975661e1baca 100644 >> --- a/include/uapi/linux/if_xdp.h >> +++ b/include/uapi/linux/if_xdp.h >> @@ -23,6 +23,7 @@ >> >> /* XDP socket options */ >> #define XDP_UMEM_REG 3 >> +#define XDP_UMEM_FILL_RING 4 >> >> struct xdp_umem_reg { >> __u64 addr; /* Start of packet data area */ >> @@ -31,4 +32,18 @@ struct xdp_umem_reg { >> __u32 frame_headroom; /* Frame head room */ >> }; >> >> +/* Pgoff for mmaping the rings */ >> +#define XDP_UMEM_PGOFF_FILL_RING 0x1 >> + >> +struct xdp_ring { >> + __u32 producer __attribute__((aligned(64))); >> + __u32 consumer __attribute__((aligned(64))); >> +}; > > Why 64? And do you still need these guys in uapi? I was just about to ask the same. You mean cacheline_aligned? 
>> +static int xsk_mmap(struct file *file, struct socket *sock, >> + struct vm_area_struct *vma) >> +{ >> + unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; >> + unsigned long size = vma->vm_end - vma->vm_start; >> + struct xdp_sock *xs = xdp_sk(sock->sk); >> + struct xsk_queue *q; >> + unsigned long pfn; >> + struct page *qpg; >> + >> + if (!xs->umem) >> + return -EINVAL; >> + >> + if (offset == XDP_UMEM_PGOFF_FILL_RING) >> + q = xs->umem->fq; >> + else >> + return -EINVAL; >> + >> + qpg = virt_to_head_page(q->ring); Is it assured that q is initialized with a call to setsockopt XDP_UMEM_FILL_RING before the call to mmap? In general, with such an extensive new API, it might be worthwhile to run syzkaller locally on a kernel with these patches. It is pretty easy to set up (https://github.com/google/syzkaller/blob/master/docs/linux/setup.md), though it also needs to be taught about any new APIs.
Re: [PATCH bpf-next 15/15] samples/bpf: sample application for AF_XDP sockets
On Mon, Apr 23, 2018 at 03:56:19PM +0200, Björn Töpel wrote: > From: Magnus Karlsson > > This is a sample application for AF_XDP sockets. The application > supports three different modes of operation: rxdrop, txonly and l2fwd. > > To show-case a simple round-robin load-balancing between a set of > sockets in an xskmap, set the RR_LB compile time define option to 1 in > "xdpsock.h". > > Co-authored-by: Björn Töpel > Signed-off-by: Björn Töpel > Signed-off-by: Magnus Karlsson > --- > samples/bpf/Makefile | 4 + > samples/bpf/xdpsock.h | 11 + > samples/bpf/xdpsock_kern.c | 56 +++ > samples/bpf/xdpsock_user.c | 947 > + > 4 files changed, 1018 insertions(+) > create mode 100644 samples/bpf/xdpsock.h > create mode 100644 samples/bpf/xdpsock_kern.c > create mode 100644 samples/bpf/xdpsock_user.c > > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile > index aa8c392e2e52..d0ddc1abf20d 100644 > --- a/samples/bpf/Makefile > +++ b/samples/bpf/Makefile > @@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info > hostprogs-y += syscall_tp > hostprogs-y += cpustat > hostprogs-y += xdp_adjust_tail > +hostprogs-y += xdpsock > > # Libbpf dependencies > LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o > @@ -97,6 +98,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) > xdp_rxq_info_user.o > syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o > cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o > xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o > +xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o > > # Tell kbuild to always build the programs > always := $(hostprogs-y) > @@ -151,6 +153,7 @@ always += xdp2skb_meta_kern.o > always += syscall_tp_kern.o > always += cpustat_kern.o > always += xdp_adjust_tail_kern.o > +always += xdpsock_kern.o > > HOSTCFLAGS += -I$(objtree)/usr/include > HOSTCFLAGS += -I$(srctree)/tools/lib/ > @@ -197,6 +200,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf > HOSTLOADLIBES_syscall_tp += -lelf > HOSTLOADLIBES_cpustat += -lelf > 
HOSTLOADLIBES_xdp_adjust_tail += -lelf > +HOSTLOADLIBES_xdpsock += -lelf -pthread > > # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on > cmdline: > # make samples/bpf/ LLC=~/git/llvm/build/bin/llc > CLANG=~/git/llvm/build/bin/clang > diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h > new file mode 100644 > index ..533ab81adfa1 > --- /dev/null > +++ b/samples/bpf/xdpsock.h > @@ -0,0 +1,11 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef XDPSOCK_H_ > +#define XDPSOCK_H_ > + > +/* Power-of-2 number of sockets */ > +#define MAX_SOCKS 4 > + > +/* Round-robin receive */ > +#define RR_LB 0 > + > +#endif /* XDPSOCK_H_ */ > diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c > new file mode 100644 > index ..d8806c41362e > --- /dev/null > +++ b/samples/bpf/xdpsock_kern.c > @@ -0,0 +1,56 @@ > +// SPDX-License-Identifier: GPL-2.0 > +#define KBUILD_MODNAME "foo" > +#include > +#include "bpf_helpers.h" > + > +#include "xdpsock.h" > + > +struct bpf_map_def SEC("maps") qidconf_map = { > + .type = BPF_MAP_TYPE_ARRAY, > + .key_size = sizeof(int), > + .value_size = sizeof(int), > + .max_entries= 1, > +}; > + > +struct bpf_map_def SEC("maps") xsks_map = { > + .type = BPF_MAP_TYPE_XSKMAP, > + .key_size = sizeof(int), > + .value_size = sizeof(int), > + .max_entries = 4, > +}; > + > +struct bpf_map_def SEC("maps") rr_map = { > + .type = BPF_MAP_TYPE_PERCPU_ARRAY, > + .key_size = sizeof(int), > + .value_size = sizeof(unsigned int), > + .max_entries = 1, > +}; > + > +SEC("xdp_sock") > +int xdp_sock_prog(struct xdp_md *ctx) > +{ > + int *qidconf, key = 0, idx; > + unsigned int *rr; > + > + qidconf = bpf_map_lookup_elem(&qidconf_map, &key); > + if (!qidconf) > + return XDP_ABORTED; > + > + if (*qidconf != ctx->rx_queue_index) > + return XDP_PASS; > + > +#if RR_LB /* NB! 
RR_LB is configured in xdpsock.h */ > + rr = bpf_map_lookup_elem(&rr_map, &key); > + if (!rr) > + return XDP_ABORTED; > + > + *rr = (*rr + 1) & (MAX_SOCKS - 1); > + idx = *rr; > +#else > + idx = 0; > +#endif > + > + return bpf_redirect_map(&xsks_map, idx, 0); > +} > + > +char _license[] SEC("license") = "GPL"; > diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c > new file mode 100644 > index ..690bac1a0ab7 > --- /dev/null > +++ b/samples/bpf/xdpsock_user.c > @@ -0,0 +1,947 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* Copyright(c) 2017 - 2018 Intel Corporation. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms and conditions of the GNU General Public License, > + * version 2, as published by the Free Software Foundation. > + * > + * This program is distributed in the hope it will be us
Re: [PATCH bpf-next 00/15] Introducing AF_XDP support
On Mon, Apr 23, 2018 at 03:56:04PM +0200, Björn Töpel wrote: > From: Björn Töpel > > This RFC introduces a new address family called AF_XDP that is > optimized for high performance packet processing and, in upcoming > patch sets, zero-copy semantics. In this v2 version, we have removed > all zero-copy related code in order to make it smaller, simpler and > hopefully more review friendly. This RFC only supports copy-mode for > the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX > using the XDP_DRV path. Zero-copy support requires XDP and driver > changes that Jesper Dangaard Brouer is working on. Some of his work > has already been accepted. We will publish our zero-copy support for > RX and TX on top of his patch sets at a later point in time. > > An AF_XDP socket (XSK) is created with the normal socket() > syscall. Associated with each XSK are two queues: the RX queue and the > TX queue. A socket can receive packets on the RX queue and it can send > packets on the TX queue. These queues are registered and sized with > the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is > mandatory to have at least one of these queues for each socket. In > contrast to AF_PACKET V2/V3 these descriptor queues are separated from > packet buffers. An RX or TX descriptor points to a data buffer in a > memory area called a UMEM. RX and TX can share the same UMEM so that a > packet does not have to be copied between RX and TX. Moreover, if a > packet needs to be kept for a while due to a possible retransmit, the > descriptor that points to that packet can be changed to point to > another and reused right away. This again avoids copying data. > > This new dedicated packet buffer area is called a UMEM. It consists of a > number of equally sized frames and each frame has a unique frame id. A > descriptor in one of the queues references a frame by referencing its > frame id. 
The user space allocates memory for this UMEM using whatever > means it feels is most appropriate (malloc, mmap, huge pages, > etc). This memory area is then registered with the kernel using the new > setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue > and the COMPLETION queue. The fill queue is used by the application to > send down frame ids for the kernel to fill in with RX packet > data. References to these frames will then appear in the RX queue of > the XSK once they have been received. The completion queue, on the > other hand, contains frame ids that the kernel has transmitted > completely and can now be used again by user space, for either TX or > RX. Thus, the frame ids appearing in the completion queue are ids that > were previously transmitted using the TX queue. In summary, the RX and > FILL queues are used for the RX path and the TX and COMPLETION queues > are used for the TX path. > > The socket is then finally bound with a bind() call to a device and a > specific queue id on that device, and it is not until bind is > completed that traffic starts to flow. Note that in this RFC, all > packet data is copied out to user-space. > > A new feature in this RFC is that the UMEM can be shared between > processes, if desired. If a process wants to do this, it simply skips > the registration of the UMEM and its corresponding two queues, sets a > flag in the bind call and submits the XSK of the process it would like > to share UMEM with as well as its own newly created XSK socket. The > new process will then receive frame id references in its own RX queue > that point to this shared UMEM. Note that since the queue structures > are single-consumer / single-producer (for performance reasons), the > new process has to create its own socket with associated RX and TX > queues, since it cannot share this with the other process. This is > also the reason that there is only one set of FILL and COMPLETION > queues per UMEM. 
It is the responsibility of a single process to > handle the UMEM. If multiple-producer / multiple-consumer queues are > implemented in the future, this requirement could be relaxed. > > How are packets then distributed between these two XSKs? We have > introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in > full). The user-space application can place an XSK at an arbitrary > place in this map. The XDP program can then redirect a packet to a > specific index in this map and at this point XDP validates that the > XSK in that map was indeed bound to that device and queue number. If > not, the packet is dropped. If the map is empty at that index, the > packet is also dropped. This also means that it is currently mandatory > to have an XDP program loaded (and one XSK in the XSKMAP) to be able > to get any traffic to user space through the XSK. > > AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the > driver does not have support for XDP, or XDP_SKB is explicitly chosen > when loading the XDP program, XDP_SKB mode is employed that uses SKBs > together with the generic
Re: [PATCH bpf-next 03/15] xsk: add umem fill queue support and mmap
On Mon, Apr 23, 2018 at 03:56:07PM +0200, Björn Töpel wrote: > From: Magnus Karlsson > > Here, we add another setsockopt for registered user memory (umem) > called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can > ask the kernel to allocate a queue (ring buffer) and also mmap it > (XDP_UMEM_PGOFF_FILL_QUEUE) into the process. > > The queue is used to explicitly pass ownership of umem frames from the > user process to the kernel. These frames will in a later patch be > filled in with Rx packet data by the kernel. > > Signed-off-by: Magnus Karlsson > --- > include/uapi/linux/if_xdp.h | 15 +++ > net/xdp/Makefile| 2 +- > net/xdp/xdp_umem.c | 5 > net/xdp/xdp_umem.h | 2 ++ > net/xdp/xsk.c | 62 > - > net/xdp/xsk_queue.c | 58 ++ > net/xdp/xsk_queue.h | 38 +++ > 7 files changed, 180 insertions(+), 2 deletions(-) > create mode 100644 net/xdp/xsk_queue.c > create mode 100644 net/xdp/xsk_queue.h > > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h > index 41252135a0fe..975661e1baca 100644 > --- a/include/uapi/linux/if_xdp.h > +++ b/include/uapi/linux/if_xdp.h > @@ -23,6 +23,7 @@ > > /* XDP socket options */ > #define XDP_UMEM_REG 3 > +#define XDP_UMEM_FILL_RING 4 > > struct xdp_umem_reg { > __u64 addr; /* Start of packet data area */ > @@ -31,4 +32,18 @@ struct xdp_umem_reg { > __u32 frame_headroom; /* Frame head room */ > }; > > +/* Pgoff for mmaping the rings */ > +#define XDP_UMEM_PGOFF_FILL_RING 0x1 > + > +struct xdp_ring { > + __u32 producer __attribute__((aligned(64))); > + __u32 consumer __attribute__((aligned(64))); > +}; Why 64? And do you still need these guys in uapi? 
> + > +/* Used for the fill and completion queues for buffers */ > +struct xdp_umem_ring { > + struct xdp_ring ptrs; > + __u32 desc[0] __attribute__((aligned(64))); > +}; > + > #endif /* _LINUX_IF_XDP_H */ > diff --git a/net/xdp/Makefile b/net/xdp/Makefile > index a5d736640a0f..074fb2b2d51c 100644 > --- a/net/xdp/Makefile > +++ b/net/xdp/Makefile > @@ -1,2 +1,2 @@ > -obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o > +obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o > > diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c > index bff058f5a769..6fc233e03f30 100644 > --- a/net/xdp/xdp_umem.c > +++ b/net/xdp/xdp_umem.c > @@ -62,6 +62,11 @@ static void xdp_umem_release(struct xdp_umem *umem) > struct mm_struct *mm; > unsigned long diff; > > + if (umem->fq) { > + xskq_destroy(umem->fq); > + umem->fq = NULL; > + } > + > if (umem->pgs) { > xdp_umem_unpin_pages(umem); > > diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h > index 58714f4f7f25..3086091aebdd 100644 > --- a/net/xdp/xdp_umem.h > +++ b/net/xdp/xdp_umem.h > @@ -18,9 +18,11 @@ > #include > #include > > +#include "xsk_queue.h" > #include "xdp_umem_props.h" > > struct xdp_umem { > + struct xsk_queue *fq; > struct page **pgs; > struct xdp_umem_props props; > u32 npgs; > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > index 19fc719cbe0d..bf6a1151df28 100644 > --- a/net/xdp/xsk.c > +++ b/net/xdp/xsk.c > @@ -32,6 +32,7 @@ > #include > #include > > +#include "xsk_queue.h" > #include "xdp_umem.h" > > struct xdp_sock { > @@ -47,6 +48,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk) > return (struct xdp_sock *)sk; > } > > +static int xsk_init_queue(u32 entries, struct xsk_queue **queue) > +{ > + struct xsk_queue *q; > + > + if (entries == 0 || *queue || !is_power_of_2(entries)) > + return -EINVAL; > + > + q = xskq_create(entries); > + if (!q) > + return -ENOMEM; > + > + *queue = q; > + return 0; > +} > + > static int xsk_release(struct socket *sock) > { > struct sock *sk = sock->sk; > @@ -109,6 +125,23 @@ static 
int xsk_setsockopt(struct socket *sock, int > level, int optname, > mutex_unlock(&xs->mutex); > return 0; > } > + case XDP_UMEM_FILL_RING: > + { > + struct xsk_queue **q; > + int entries; > + > + if (!xs->umem) > + return -EINVAL; > + > + if (copy_from_user(&entries, optval, sizeof(entries))) > + return -EFAULT; > + > + mutex_lock(&xs->mutex); > + q = &xs->umem->fq; > + err = xsk_init_queue(entries, q); > + mutex_unlock(&xs->mutex); > + return err; > + } > default: > break; > } > @@ -116,6 +149,33 @@ static int xsk_setsockopt(struct socket *sock, int > level, int optname, > return -ENOPROTOOPT; > } > > +static int xsk_mmap(struct file *file, struct socket *sock, > +
Re: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
On Mon, 23 Apr 2018, Michal Hocko wrote: > On Mon 23-04-18 10:24:02, Mikulas Patocka wrote: > > > > Really, we have a fault injection framework and this sounds like > > > something to hook in there. > > > > The testing people won't set it up. They install the "kernel-debug" > > package and run the tests in it. > > > > If you introduce a hidden option that no one knows about, no one will use > > it. > > then make sure people know about it. Fuzzers already do test fault > injections. I think that in the long term we can introduce a kernel parameter like "debug_level=1", "debug_level=2", "debug_level=3" that will turn on debugging features across all kernel subsystems and we can teach users to use it to diagnose problems. But it won't work if every subsystem has different debug parameters. There are 192 distinct filenames in debugfs; if we add a 193rd one, hardly anyone notices it. Mikulas
Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs wrote: > On 2018-04-18 19:47, Paul Moore wrote: >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs wrote: >> > Implement the proc fs write to set the audit container ID of a process, >> > emitting an AUDIT_CONTAINER record to document the event. >> > >> > This is a write from the container orchestrator task to a proc entry of >> > the form /proc/PID/containerid where PID is the process ID of the newly >> > created task that is to become the first task in a container, or an >> > additional task added to a container. >> > >> > The write expects up to a u64 value (unset: 18446744073709551615). >> > >> > This will produce a record such as this: >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 >> > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 >> > ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0 >> > >> > The "op" field indicates an initial set. The "pid" to "ses" fields are >> > the orchestrator while the "opid" field is the object's PID, the process >> > being "contained". Old and new container ID values are given in the >> > "contid" fields, while res indicates its success. >> > >> > It is not permitted to self-set, unset or re-set the container ID. A >> > child inherits its parent's container ID, but then can be set only once >> > after. >> > >> > See: https://github.com/linux-audit/audit-kernel/issues/32 >> > >> > Signed-off-by: Richard Guy Briggs >> > --- >> > fs/proc/base.c | 37 >> > include/linux/audit.h | 16 + >> > include/linux/init_task.h | 4 ++- >> > include/linux/sched.h | 1 + >> > include/uapi/linux/audit.h | 2 ++ >> > kernel/auditsc.c | 84 >> > ++ >> > 6 files changed, 143 insertions(+), 1 deletion(-) ... 
>> > diff --git a/include/linux/sched.h b/include/linux/sched.h >> > index d258826..1b82191 100644 >> > --- a/include/linux/sched.h >> > +++ b/include/linux/sched.h >> > @@ -796,6 +796,7 @@ struct task_struct { >> > #ifdef CONFIG_AUDITSYSCALL >> > kuid_t loginuid; >> > unsigned intsessionid; >> > + u64 containerid; >> >> This one line addition to the task_struct scares me the most of >> anything in this patchset. Why? It's a field named "containerid" in >> a perhaps one of the most widely used core kernel structures; the >> possibilities for abuse are endless, and it's foolish to think we >> would ever be able to adequately police this. > > Fair enough. > >> Unfortunately, we can't add the field to audit_context as things >> currently stand because we don't always allocate an audit_context, >> it's dependent on the system's configuration, and we need to track the >> audit container ID for a given process, regardless of the audit >> configuration. Pretty much the same reason why loginuid and sessionid >> are located directly in task_struct now. As I stressed during the >> design phase, I really want to keep this as an *audit* container ID >> and not a general purpose kernel wide container ID. If the kernel >> ever grows a general purpose container ID token, I'll be the first in >> line to convert the audit code, but I don't want audit to be that >> general purpose mechanism ... audit is hated enough as-is ;) > > When would we need an audit container ID when audit is not enabled > enough to have an audit_context? I'm thinking of the audit_alloc() case where audit_filter_task() returns AUDIT_DISABLED. I believe this is the same reason why loginuid and sessionid live directly in the task_struct and not in the audit_context; they need to persist for the lifetime of the task. 
> If it is only used for audit, and audit is the only consumer, and audit > can only use it when it is enabled, then we can just return success to > any write to the proc filehandle, or not even present it. Nothing will > be able to know that value wasn't used. > > When are loginuid and sessionid used now when audit is not enabled (or > should I say, explicitly disabled)? See above. I think that should answer these questions. >> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h >> > index 4e61a9e..921a71f 100644 >> > --- a/include/uapi/linux/audit.h >> > +++ b/include/uapi/linux/audit.h >> > @@ -71,6 +71,7 @@ >> > #define AUDIT_TTY_SET 1017/* Set TTY auditing status */ >> > #define AUDIT_SET_FEATURE 1018/* Turn an audit feature on or off >> > */ >> > #define AUDIT_GET_FEATURE 1019/* Get which features are enabled >> > */ >> > +#define AUDIT_CONTAINER1020/* Define the container id >> > and information */ >> > >> > #define AUDIT_FIRST_USER_MSG 1100/* Userspace messages mostly >> > uninteresting to kernel */ >> > #define AUDIT_USER_AVC 1107/* We filter this differently */ >> > @@
Re: dev_loopback_xmit parameters
Ok, clear now. Even though I don't understand what to set to avoid triggering the WARN_ON(!skb_dst(skb)); inside dev_loopback_xmit. I just would like to send the skb in loopback, i.e. moving the packet from the sending to the receiving queue of a certain struct net_device. On 24/04/2018 00:36, Eric Dumazet wrote: On 04/23/2018 02:40 PM, Emanuele wrote: Hello, I don't know if this is the right place where to ask, but I was wondering why the dev_loopback_xmit function defined in /net/core/dev.c takes struct net * and struct sock * as parameters. They are never used, so I believe passing only the struct sk_buff * should be enough. Look at net/ipv6/ip6_output.c where NF_HOOK() uses dev_loopback_xmit(). In addition, I would like to know where I can read what an skb dst_entry is and how to set one, since I don't really understand it. Thanks a lot, Emanuele
Re: [PATCH bpf-next 02/15] xsk: add user memory registration support sockopt
On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel wrote: > From: Björn Töpel > > In this commit the base structure of the AF_XDP address family is set > up. Further, we introduce the ability to register a window of user memory > to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory > window is viewed by an AF_XDP socket as a set of equally large > frames. After a user memory registration all frames are "owned" by the > user application, and not the kernel. > > Co-authored-by: Magnus Karlsson > Signed-off-by: Magnus Karlsson > Signed-off-by: Björn Töpel > +static void xdp_umem_release(struct xdp_umem *umem) > +{ > + struct task_struct *task; > + struct mm_struct *mm; > + unsigned long diff; > + > + if (umem->pgs) { > + xdp_umem_unpin_pages(umem); > + > + task = get_pid_task(umem->pid, PIDTYPE_PID); > + put_pid(umem->pid); > + if (!task) > + goto out; > + mm = get_task_mm(task); > + put_task_struct(task); > + if (!mm) > + goto out; > + > + diff = umem->size >> PAGE_SHIFT; Need to round up or size must always be a multiple of PAGE_SIZE. > + > + down_write(&mm->mmap_sem); > + mm->pinned_vm -= diff; > + up_write(&mm->mmap_sem); When using user->locked_vm for resource limit checks, no need to also update mm->pinned_vm? > +static int __xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) > +{ > + u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom; > + u64 addr = mr->addr, size = mr->len; > + unsigned int nframes; > + int size_chk, err; > + > + if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) { > + /* Strictly speaking we could support this, if: > +* - huge pages, or > +* - using an IOMMU, or > +* - making sure the memory area is consecutive > +* but for now, we simply say "computer says no". > +*/ > + return -EINVAL; > + } Ideally, AF_XDP subsumes all packet socket use cases. It does not have packet v3's small packet optimizations of variable sized frames and block signaling. I don't suggest adding that now. 
But for the non-zerocopy case, it may make sense to ensure that nothing is blocking a later addition of these features. Especially for header-only (snaplen) workloads. So far, I don't see any issues. > + if (!is_power_of_2(frame_size)) > + return -EINVAL; > + > + if (!PAGE_ALIGNED(addr)) { > + /* Memory area has to be page size aligned. For > +* simplicity, this might change. > +*/ > + return -EINVAL; > + } > + > + if ((addr + size) < addr) > + return -EINVAL; > + > + nframes = size / frame_size; > + if (nframes == 0 || nframes > UINT_MAX) > + return -EINVAL; You may also want a check here that nframes * frame_size is at least PAGE_SIZE and probably a multiple of that. > + frame_headroom = ALIGN(frame_headroom, 64); > + > + size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM; > + if (size_chk < 0) > + return -EINVAL; > + > + umem->pid = get_task_pid(current, PIDTYPE_PID); > + umem->size = (size_t)size; > + umem->address = (unsigned long)addr; > + umem->props.frame_size = frame_size; > + umem->props.nframes = nframes; > + umem->frame_headroom = frame_headroom; > + umem->npgs = size / PAGE_SIZE; > + umem->pgs = NULL; > + umem->user = NULL; > + > + umem->frame_size_log2 = ilog2(frame_size); > + umem->nfpp_mask = (PAGE_SIZE / frame_size) - 1; > + umem->nfpplog2 = ilog2(PAGE_SIZE / frame_size); > + atomic_set(&umem->users, 1); > + > + err = xdp_umem_account_pages(umem); > + if (err) > + goto out; > + > + err = xdp_umem_pin_pages(umem); > + if (err) need to call xdp_umem_unaccount_pages on error > + goto out; > + return 0; > + > +out: > + put_pid(umem->pid); > + return err; > +}
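The validation rules discussed in this review (minimum frame size, power-of-two frame size, page alignment, overflow and zero-frame checks) can be sketched as a compact userspace model. The `MODEL_*` constants and `umem_reg_check()` below are illustrative stand-ins, not kernel symbols, and this covers only a subset of the checks in `__xdp_umem_reg()`:

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE      4096u
#define MODEL_MIN_FRAME_SIZE 2048u  /* stand-in for XDP_UMEM_MIN_FRAME_SIZE */

static bool power_of_two(uint32_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/* Returns 0 if the (addr, len, frame_size) triple would pass the
 * sanity checks described above, -1 otherwise. */
static int umem_reg_check(uint64_t addr, uint64_t len, uint32_t frame_size)
{
	if (frame_size < MODEL_MIN_FRAME_SIZE || frame_size > MODEL_PAGE_SIZE)
		return -1;	/* each frame must fit in a single page */
	if (!power_of_two(frame_size))
		return -1;	/* lets frame math use shifts and masks */
	if (addr & (MODEL_PAGE_SIZE - 1))
		return -1;	/* memory area must be page aligned */
	if (addr + len < addr)
		return -1;	/* reject address-range wraparound */
	if (len / frame_size == 0)
		return -1;	/* need at least one whole frame */
	return 0;
}
```

The power-of-two restriction is what makes the later `frame_size_log2`/`nfpp_mask` fields in the patch possible: frame lookups become shifts and masks instead of divisions.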
[PATCH v2] selftests: bpf: update .gitignore with missing file
Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests") Signed-off-by: Anders Roxell --- Rebased against bpf-next. tools/testing/selftests/bpf/.gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore index 9cf83f895d98..da19f0562bf8 100644 --- a/tools/testing/selftests/bpf/.gitignore +++ b/tools/testing/selftests/bpf/.gitignore @@ -12,3 +12,4 @@ test_tcpbpf_user test_verifier_log feature test_libbpf_open +test_btf -- 2.17.0
Re: [bpf PATCH v3 0/3] BPF, a couple sockmap fixes
On 04/24/2018 12:39 AM, John Fastabend wrote: > While testing sockmap with more programs (besides our test programs) > I found a couple issues. > > The attached series fixes an issue where pinned maps were not > working correctly, blocking sockets returned zero, and an error > path that when the sock hit an out of memory case resulted in a > double page_put() while doing ingress redirects. > > See individual patches for more details. > > v2: Incorporated Daniel's feedback to use map ops for uref put op > which also fixed the build error discovered in v1. > v3: rename map_put_uref to map_release_uref Applied to bpf tree, thanks John!
[PATCH net v2] net: ethtool: Add missing kernel doc for FEC parameters
While adding support for ethtool::get_fecparam and set_fecparam, kernel doc for these functions was missed, add those. Fixes: 1a5f3da20bd9 ("net: ethtool: add support for forward error correction modes") Signed-off-by: Florian Fainelli --- Changes in v2: - corrected set_fecparam in commit message include/linux/ethtool.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h index ebe41811ed34..b32cd2062f18 100644 --- a/include/linux/ethtool.h +++ b/include/linux/ethtool.h @@ -310,6 +310,8 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32, * fields should be ignored (use %__ETHTOOL_LINK_MODE_MASK_NBITS * instead of the latter), any change to them will be overwritten * by kernel. Returns a negative error code or zero. + * @get_fecparam: Get the network device Forward Error Correction parameters. + * @set_fecparam: Set the network device Forward Error Correction parameters. * * All operations are optional (i.e. the function pointer may be set * to %NULL) and callers must take this into account. Callers must -- 2.14.1
[bpf PATCH v3 3/3] bpf: sockmap, fix double page_put on ENOMEM error in redirect path
In the case where the socket memory boundary is hit the redirect path returns an ENOMEM error. However, before checking for this condition the redirect scatterlist buffer is set up with a valid page and length. This is never unwound so when the buffers are released later in the error path we do a put_page() and clear the scatterlist fields. But, because the initial error happens before completing the scatterlist buffer we end up with both the original buffer and the redirect buffer pointing to the same page resulting in duplicate put_page() calls. To fix this simply move the initial configuration of the redirect scatterlist buffer below the sock memory check. Found this while running TCP_STREAM test with netperf using Cilium. Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT") Signed-off-by: John Fastabend --- 0 files changed diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index aaf50ec..634415c 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -524,8 +524,6 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes, i = md->sg_start; do { - r->sg_data[i] = md->sg_data[i]; - size = (apply && apply_bytes < md->sg_data[i].length) ? apply_bytes : md->sg_data[i].length; @@ -536,6 +534,7 @@ static int bpf_tcp_ingress(struct sock *sk, int apply_bytes, } sk_mem_charge(sk, size); + r->sg_data[i] = md->sg_data[i]; r->sg_data[i].length = size; md->sg_data[i].length -= size; md->sg_data[i].offset += size;
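The ordering bug the patch fixes can be modeled in a few lines of userspace C. The page starts with one reference held by the original descriptor; copying the descriptor into the redirect buffer before the memory-accounting check means the error path releases two "owners" of a single reference, driving the count negative. All names here are illustrative, not the sockmap code:

```c
#include <stddef.h>

struct page { int ref; };
struct desc { struct page *pg; };

static void put_page(struct page *p) { p->ref--; }

/* Returns the final page refcount after the transfer. With the buggy
 * ordering and no memory, the error path puts the page twice. */
static int transfer(int have_mem, int copy_before_check)
{
	struct page page = { .ref = 1 };
	struct desc orig = { .pg = &page };
	struct desc redir = { .pg = NULL };

	if (copy_before_check)
		redir.pg = orig.pg;	/* buggy: share before we know we can */

	if (!have_mem) {
		/* error path: release every buffer that looks populated */
		if (redir.pg)
			put_page(redir.pg);
		put_page(orig.pg);
		return page.ref;	/* -1 with the buggy ordering */
	}

	redir.pg = orig.pg;		/* fixed: populate after the check */
	/* ... hand redir to the ingress queue, keeping the reference ... */
	return page.ref;
}
```

Moving the copy below the check, as the patch does, means an early failure leaves exactly one populated buffer per held reference.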
[bpf PATCH v3 1/3] bpf: sockmap, map_release does not hold refcnt for pinned maps
Relying on map_release hook to decrement the reference counts when a map is removed only works if the map is not being pinned. In the pinned case the ref is decremented immediately and the BPF programs released. After this BPF programs may not be in-use which is not what the user would expect. This patch moves the release logic into bpf_map_put_uref() and brings sockmap in-line with how a similar case is handled in prog array maps. Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not detached progs") Signed-off-by: John Fastabend --- 0 files changed diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 0ac4340..a5782d0 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -33,6 +33,7 @@ struct bpf_map_ops { void (*map_release)(struct bpf_map *map, struct file *map_file); void (*map_free)(struct bpf_map *map); int (*map_get_next_key)(struct bpf_map *map, void *key, void *next_key); + void (*map_release_uref)(struct bpf_map *map); /* funcs callable from userspace and from eBPF programs */ void *(*map_lookup_elem)(struct bpf_map *map, void *key); @@ -448,7 +449,6 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value, int bpf_fd_array_map_update_elem(struct bpf_map *map, struct file *map_file, void *key, void *value, u64 map_flags); int bpf_fd_array_map_lookup_elem(struct bpf_map *map, void *key, u32 *value); -void bpf_fd_array_map_clear(struct bpf_map *map); int bpf_fd_htab_map_update_elem(struct bpf_map *map, struct file *map_file, void *key, void *value, u64 map_flags); int bpf_fd_htab_map_lookup_elem(struct bpf_map *map, void *key, u32 *value); diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 02a1893..0fd8d8f 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -526,7 +526,7 @@ static u32 prog_fd_array_sys_lookup_elem(void *ptr) } /* decrement refcnt of all bpf_progs that are stored in this map */ -void bpf_fd_array_map_clear(struct bpf_map *map) +static void 
bpf_fd_array_map_clear(struct bpf_map *map) { struct bpf_array *array = container_of(map, struct bpf_array, map); int i; @@ -545,6 +545,7 @@ void bpf_fd_array_map_clear(struct bpf_map *map) .map_fd_get_ptr = prog_fd_array_get_ptr, .map_fd_put_ptr = prog_fd_array_put_ptr, .map_fd_sys_lookup_elem = prog_fd_array_sys_lookup_elem, + .map_release_uref = bpf_fd_array_map_clear, }; static struct bpf_event_entry *bpf_event_entry_gen(struct file *perf_file, diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index a3b2138..a73d484 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -1831,7 +1831,7 @@ static int sock_map_update_elem(struct bpf_map *map, return err; } -static void sock_map_release(struct bpf_map *map, struct file *map_file) +static void sock_map_release(struct bpf_map *map) { struct bpf_stab *stab = container_of(map, struct bpf_stab, map); struct bpf_prog *orig; @@ -1855,7 +1855,7 @@ static void sock_map_release(struct bpf_map *map, struct file *map_file) .map_get_next_key = sock_map_get_next_key, .map_update_elem = sock_map_update_elem, .map_delete_elem = sock_map_delete_elem, - .map_release = sock_map_release, + .map_release_uref = sock_map_release, }; BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock, diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index fe23dc5a..a22c26e 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -260,8 +260,8 @@ static void bpf_map_free_deferred(struct work_struct *work) static void bpf_map_put_uref(struct bpf_map *map) { if (atomic_dec_and_test(&map->usercnt)) { - if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY) - bpf_fd_array_map_clear(map); + if (map->ops->map_release_uref) + map->ops->map_release_uref(map); } }
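The shape of this refactor, replacing a hard-coded map-type check in the core with an optional per-type callback, can be sketched in plain C. Types and names below are illustrative, not the kernel's `struct bpf_map_ops`:

```c
#include <stddef.h>

struct map;

struct map_ops {
	/* optional hook, NULL when a map type has nothing to release */
	void (*release_uref)(struct map *m);
};

struct map {
	const struct map_ops *ops;
	int cleared;
};

static void clear_map(struct map *m) { m->cleared = 1; }

/* Core code no longer tests the map type; it just calls the hook if
 * the type provided one, so new map types (like sockmap) plug in
 * without editing the core. */
static void put_uref(struct map *m)
{
	if (m->ops->release_uref)
		m->ops->release_uref(m);
}
```

This is why the patch can make `bpf_fd_array_map_clear()` static and drop it from the header: prog array maps and sockmaps now both reach the release logic through the ops table.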
[bpf PATCH v3 0/3] BPF, a couple sockmap fixes
While testing sockmap with more programs (besides our test programs) I found a couple issues. The attached series fixes an issue where pinned maps were not working correctly, blocking sockets returned zero, and an error path that when the sock hit an out of memory case resulted in a double page_put() while doing ingress redirects. See individual patches for more details. v2: Incorporated Daniel's feedback to use map ops for uref put op which also fixed the build error discovered in v1. v3: rename map_put_uref to map_release_uref --- John Fastabend (3): bpf: sockmap, map_release does not hold refcnt for pinned maps bpf: sockmap, sk_wait_event needed to handle blocking cases bpf: sockmap, fix double page_put on ENOMEM error in redirect path 0 files changed -- Signature
[bpf PATCH v3 2/3] bpf: sockmap, sk_wait_event needed to handle blocking cases
In the recvmsg handler we need to add a wait event to support the blocking use cases. Without this we return zero and may confuse user applications. In the wait event any data received on the sk either via sk_receive_queue or the psock ingress list will wake up the sock. Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT") Signed-off-by: John Fastabend --- 0 files changed diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index a73d484..aaf50ec 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -43,6 +43,7 @@ #include #include #include +#include #define SOCK_CREATE_FLAG_MASK \ (BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY) @@ -732,6 +733,26 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock, return err; } +static int bpf_wait_data(struct sock *sk, +struct smap_psock *psk, int flags, +long timeo, int *err) +{ + int rc; + + DEFINE_WAIT_FUNC(wait, woken_wake_function); + + add_wait_queue(sk_sleep(sk), &wait); + sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); + rc = sk_wait_event(sk, &timeo, + !list_empty(&psk->ingress) || + !skb_queue_empty(&sk->sk_receive_queue), + &wait); + sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk); + remove_wait_queue(sk_sleep(sk), &wait); + + return rc; +} + static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, int flags, int *addr_len) { @@ -755,6 +776,7 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len); lock_sock(sk); +bytes_ready: while (copied != len) { struct scatterlist *sg; struct sk_msg_buff *md; @@ -809,6 +831,28 @@ static int bpf_tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, } } + if (!copied) { + long timeo; + int data; + int err = 0; + + timeo = sock_rcvtimeo(sk, nonblock); + data = bpf_wait_data(sk, psock, flags, timeo, &err); + + if (data) { + if (!skb_queue_empty(&sk->sk_receive_queue)) { + release_sock(sk); + smap_release_sock(psock, sk); + 
copied = tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len); + return copied; + } + goto bytes_ready; + } + + if (err) + copied = err; + } + release_sock(sk); smap_release_sock(psock, sk); return copied;
Re: dev_loopback_xmit parameters
On 04/23/2018 02:40 PM, Emanuele wrote: > Hello, > > I don't know if this is the right place where to ask, but I was wondering why > the dev_loopback_xmit function defined in /net/core/dev.c takes struct net * > and struct sock * as parameters. They are never used, so I believe passing > only the struct sk_buff * should be enough. > Look at net/ipv6/ip6_output.c where NF_HOOK() uses dev_loopback_xmit(). > In addition, I would like to know where I can read what an skb dst_entry is > and how to set it, since I don't really understand it. > > Thanks a lot, > > Emanuele >
Re: [PATCH] selftests: bpf: update .gitignore with missing file
On 04/24/2018 12:14 AM, Anders Roxell wrote: > On 23 April 2018 at 23:34, Daniel Borkmann wrote: >> On 04/23/2018 03:50 PM, Anders Roxell wrote: >>> Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests") >>> Signed-off-by: Anders Roxell >>> --- >>> tools/testing/selftests/bpf/.gitignore | 1 + >>> 1 file changed, 1 insertion(+) >>> >>> diff --git a/tools/testing/selftests/bpf/.gitignore >>> b/tools/testing/selftests/bpf/.gitignore >>> index 5e1ab2f0eb79..3e3b3ced3f7c 100644 >>> --- a/tools/testing/selftests/bpf/.gitignore >>> +++ b/tools/testing/selftests/bpf/.gitignore >>> @@ -15,3 +15,4 @@ test_libbpf_open >>> test_sock >>> test_sock_addr >>> urandom_read >>> +test_btf >> >> Against which tree is this? This doesn't apply to bpf-next. It would >> apply against bpf tree, but c0fa1b6c3efc ("bpf: btf: Add BTF tests") >> is part of bpf-next, so fits to neither of them. > > I'm sorry, > > I did it against this patch [1] that I thought was already applied to > the bpf tree. That was bpf tree since the original change already went to mainline; the BTF is in bpf-next however, so you need to rebase your change against that. Thanks, Daniel
[PATCH 1/4] rhashtable: remove outdated comments about grow_decision etc
grow_decision and shrink_decision no longer exist, so remove the remaining references to them. Acked-by: Herbert Xu Signed-off-by: NeilBrown --- include/linux/rhashtable.h | 33 ++--- 1 file changed, 14 insertions(+), 19 deletions(-) diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h index 1f8ad121eb43..87d443a5b11d 100644 --- a/include/linux/rhashtable.h +++ b/include/linux/rhashtable.h @@ -836,9 +836,8 @@ static inline void *__rhashtable_insert_fast( * * It is safe to call this function from atomic context. * - * Will trigger an automatic deferred table resizing if the size grows - * beyond the watermark indicated by grow_decision() which can be passed - * to rhashtable_init(). + * Will trigger an automatic deferred table resizing if residency in the + * table grows beyond 70%. */ static inline int rhashtable_insert_fast( struct rhashtable *ht, struct rhash_head *obj, @@ -866,9 +865,8 @@ static inline int rhashtable_insert_fast( * * It is safe to call this function from atomic context. * - * Will trigger an automatic deferred table resizing if the size grows - * beyond the watermark indicated by grow_decision() which can be passed - * to rhashtable_init(). + * Will trigger an automatic deferred table resizing if residency in the + * table grows beyond 70%. */ static inline int rhltable_insert_key( struct rhltable *hlt, const void *key, struct rhlist_head *list, @@ -890,9 +888,8 @@ static inline int rhltable_insert_key( * * It is safe to call this function from atomic context. * - * Will trigger an automatic deferred table resizing if the size grows - * beyond the watermark indicated by grow_decision() which can be passed - * to rhashtable_init(). + * Will trigger an automatic deferred table resizing if residency in the + * table grows beyond 70%. */ static inline int rhltable_insert( struct rhltable *hlt, struct rhlist_head *list, @@ -922,9 +919,8 @@ static inline int rhltable_insert( * * It is safe to call this function from atomic context.
* - * Will trigger an automatic deferred table resizing if the size grows - * beyond the watermark indicated by grow_decision() which can be passed - * to rhashtable_init(). + * Will trigger an automatic deferred table resizing if residency in the + * table grows beyond 70%. */ static inline int rhashtable_lookup_insert_fast( struct rhashtable *ht, struct rhash_head *obj, @@ -981,9 +977,8 @@ static inline void *rhashtable_lookup_get_insert_fast( * * Lookups may occur in parallel with hashtable mutations and resizing. * - * Will trigger an automatic deferred table resizing if the size grows - * beyond the watermark indicated by grow_decision() which can be passed - * to rhashtable_init(). + * Will trigger an automatic deferred table resizing if residency in the + * table grows beyond 70%. * * Returns zero on success. */ @@ -1134,8 +1129,8 @@ static inline int __rhashtable_remove_fast( * walk the bucket chain upon removal. The removal operation is thus * considerable slow if the hash table is not correctly sized. * - * Will automatically shrink the table via rhashtable_expand() if the - * shrink_decision function specified at rhashtable_init() returns true. + * Will automatically shrink the table if permitted when residency drops + * below 30%. * * Returns zero on success, -ENOENT if the entry could not be found. */ @@ -1156,8 +1151,8 @@ static inline int rhashtable_remove_fast( * walk the bucket chain upon removal. The removal operation is thus * considerable slow if the hash table is not correctly sized. * - * Will automatically shrink the table via rhashtable_expand() if the - * shrink_decision function specified at rhashtable_init() returns true. + * Will automatically shrink the table if permitted when residency drops + * below 30% * * Returns zero on success, -ENOENT if the entry could not be found. */
[PATCH 0/4] A few rhashtables cleanups
2 patches fix documentation, 1 fixes a bug in rhashtable_walk_start(), and 1 improves rhashtable_walk stability. All reviewed and Acked. Thanks, NeilBrown --- NeilBrown (4): rhashtable: remove outdated comments about grow_decision etc rhashtable: Revise incorrect comment on r{hl,hash}table_walk_enter() rhashtable: reset iter when rhashtable_walk_start sees new table rhashtable: improve rhashtable_walk stability when stop/start used. include/linux/rhashtable.h | 38 +++-- lib/rhashtable.c | 51 2 files changed, 63 insertions(+), 26 deletions(-) -- Signature
[PATCH 4/4] rhashtable: improve rhashtable_walk stability when stop/start used.
When a walk of an rhashtable is interrupted with rhashtable_walk_stop() and then rhashtable_walk_start(), the location to restart from is based on a 'skip' count in the current hash chain, and this can be incorrect if insertions or deletions have happened. This does not happen when the walk is not stopped and started as iter->p is a placeholder which is safe to use while holding the RCU read lock. In rhashtable_walk_start() we can revalidate that 'p' is still in the same hash chain. If it isn't then the current method is still used. With this patch, if a rhashtable walker ensures that the current object remains in the table over a stop/start period (possibly by elevating the reference count if that is sufficient), it can be sure that a walk will not miss objects that were in the hashtable for the whole time of the walk. rhashtable_walk_start() may not find the object even though it is still in the hashtable if a rehash has moved it to a new table. In this case it will (eventually) get -EAGAIN and will need to proceed through the whole table again to be sure to see everything at least once.
Acked-by: Herbert Xu Signed-off-by: NeilBrown --- lib/rhashtable.c | 44 +--- 1 file changed, 41 insertions(+), 3 deletions(-) diff --git a/lib/rhashtable.c b/lib/rhashtable.c index 81edf1ab38ab..9427b5766134 100644 --- a/lib/rhashtable.c +++ b/lib/rhashtable.c @@ -727,6 +727,7 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter) __acquires(RCU) { struct rhashtable *ht = iter->ht; + bool rhlist = ht->rhlist; rcu_read_lock(); @@ -735,13 +736,52 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter) list_del(&iter->walker.list); spin_unlock(&ht->lock); - if (!iter->walker.tbl && !iter->end_of_table) { + if (iter->end_of_table) + return 0; + if (!iter->walker.tbl) { iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht); iter->slot = 0; iter->skip = 0; return -EAGAIN; } + if (iter->p && !rhlist) { + /* +* We need to validate that 'p' is still in the table, and +* if so, update 'skip' +*/ + struct rhash_head *p; + int skip = 0; + rht_for_each_rcu(p, iter->walker.tbl, iter->slot) { + skip++; + if (p == iter->p) { + iter->skip = skip; + goto found; + } + } + iter->p = NULL; + } else if (iter->p && rhlist) { + /* Need to validate that 'list' is still in the table, and +* if so, update 'skip' and 'p'. +*/ + struct rhash_head *p; + struct rhlist_head *list; + int skip = 0; + rht_for_each_rcu(p, iter->walker.tbl, iter->slot) { + for (list = container_of(p, struct rhlist_head, rhead); +list; +list = rcu_dereference(list->next)) { + skip++; + if (list == iter->list) { + iter->p = p; + iter->skip = skip; + goto found; + } + } + } + iter->p = NULL; + } +found: return 0; } EXPORT_SYMBOL_GPL(rhashtable_walk_start_check); @@ -917,8 +957,6 @@ void rhashtable_walk_stop(struct rhashtable_iter *iter) iter->walker.tbl = NULL; spin_unlock(&ht->lock); - iter->p = NULL; - out: rcu_read_unlock(); }
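The core of the revalidation step in this patch is a rescan of the current (possibly mutated) chain for the remembered cursor. A userspace sketch of that idea, using a plain singly linked list rather than the kernel's bucket chains:

```c
#include <stddef.h>

struct node { struct node *next; };

/* Rescan the chain for the remembered cursor 'p'. Returns its 1-based
 * position so the walker can resume there with an up-to-date skip
 * count, or 0 if 'p' is gone and the caller must fall back to the old
 * skip-count heuristic. */
static int revalidate(struct node *head, struct node *p)
{
	int skip = 0;

	for (struct node *n = head; n; n = n->next) {
		skip++;
		if (n == p)
			return skip;
	}
	return 0;
}
```

As the commit message notes, this only helps while the object stays in the same chain; a rehash that moves it to a new table still forces the walker back to -EAGAIN handling.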
[PATCH 3/4] rhashtable: reset iter when rhashtable_walk_start sees new table
The documentation claims that when rhashtable_walk_start_check() detects a resize event, it will rewind back to the beginning of the table. This is not true. We need to set ->slot and ->skip to be zero for it to be true. Acked-by: Herbert Xu Signed-off-by: NeilBrown --- lib/rhashtable.c |2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/rhashtable.c b/lib/rhashtable.c index 6d490f51174e..81edf1ab38ab 100644 --- a/lib/rhashtable.c +++ b/lib/rhashtable.c @@ -737,6 +737,8 @@ int rhashtable_walk_start_check(struct rhashtable_iter *iter) if (!iter->walker.tbl && !iter->end_of_table) { iter->walker.tbl = rht_dereference_rcu(ht->tbl, ht); + iter->slot = 0; + iter->skip = 0; return -EAGAIN; }
[PATCH 2/4] rhashtable: Revise incorrect comment on r{hl, hash}table_walk_enter()
Neither rhashtable_walk_enter() nor rhltable_walk_enter() sleeps, though they do take a spinlock without irq protection. So revise the comments to accurately state the contexts in which these functions can be called. Acked-by: Herbert Xu Signed-off-by: NeilBrown --- include/linux/rhashtable.h |5 +++-- lib/rhashtable.c |5 +++-- 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h index 87d443a5b11d..4e1f535c2034 100644 --- a/include/linux/rhashtable.h +++ b/include/linux/rhashtable.h @@ -1268,8 +1268,9 @@ static inline int rhashtable_walk_init(struct rhashtable *ht, * For a completely stable walk you should construct your own data * structure outside the hash table. * - * This function may sleep so you must not call it from interrupt - * context or with spin locks held. + * This function may be called from any process context, including + * non-preemptable context, but cannot be called from softirq or + * hardirq context. * * You must call rhashtable_walk_exit after this function returns. */ diff --git a/lib/rhashtable.c b/lib/rhashtable.c index 2b2b79974b61..6d490f51174e 100644 --- a/lib/rhashtable.c +++ b/lib/rhashtable.c @@ -668,8 +668,9 @@ EXPORT_SYMBOL_GPL(rhashtable_insert_slow); * For a completely stable walk you should construct your own data * structure outside the hash table. * - * This function may sleep so you must not call it from interrupt - * context or with spin locks held. + * This function may be called from any process context, including + * non-preemptable context, but cannot be called from softirq or + * hardirq context. * * You must call rhashtable_walk_exit after this function returns. */
Re: [PATCH] selftests: bpf: update .gitignore with missing file
On 23 April 2018 at 23:34, Daniel Borkmann wrote: > On 04/23/2018 03:50 PM, Anders Roxell wrote: >> Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests") >> Signed-off-by: Anders Roxell >> --- >> tools/testing/selftests/bpf/.gitignore | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/tools/testing/selftests/bpf/.gitignore >> b/tools/testing/selftests/bpf/.gitignore >> index 5e1ab2f0eb79..3e3b3ced3f7c 100644 >> --- a/tools/testing/selftests/bpf/.gitignore >> +++ b/tools/testing/selftests/bpf/.gitignore >> @@ -15,3 +15,4 @@ test_libbpf_open >> test_sock >> test_sock_addr >> urandom_read >> +test_btf > > Against which tree is this? This doesn't apply to bpf-next. It would > apply against bpf tree, but c0fa1b6c3efc ("bpf: btf: Add BTF tests") > is part of bpf-next, so fits to neither of them. I'm sorry, I did it against this patch [1] that I thought was already applied to the bpf tree. Cheers, Anders [1] https://patchwork.kernel.org/patch/10332907/
Re: [bpf PATCH v2 1/3] bpf: sockmap, map_release does not hold refcnt for pinned maps
On 04/23/2018 08:29 PM, John Fastabend wrote: > Relying on map_release hook to decrement the reference counts when a > map is removed only works if the map is not being pinned. In the > pinned case the ref is decremented immediately and the BPF programs > released. After this BPF programs may not be in-use which is not > what the user would expect. > > This patch moves the release logic into bpf_map_put_uref() and brings > sockmap in-line with how a similar case is handled in prog array maps. > > Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not > detached progs") > Signed-off-by: John Fastabend Patches look good, but one trivial request below. [...] > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > index 4ca46df..4b70439 100644 > --- a/kernel/bpf/syscall.c > +++ b/kernel/bpf/syscall.c > @@ -257,8 +257,8 @@ static void bpf_map_free_deferred(struct work_struct > *work) > static void bpf_map_put_uref(struct bpf_map *map) > { > if (atomic_dec_and_test(&map->usercnt)) { > - if (map->map_type == BPF_MAP_TYPE_PROG_ARRAY) > - bpf_fd_array_map_clear(map); > + if (map->ops->map_put_uref) > + map->ops->map_put_uref(map); Could you change the callback name into something like 'map_release_uref'? Naming it 'map_put_uref' is a bit confusing since this is only called when the uref reference count already dropped to zero, and here we really release the last reference point to user space. Given this is BPF core infra, would be nice to still fix this up before applying. Thanks, Daniel
[PATCH v2] net: qrtr: Expose tunneling endpoint to user space
This implements a misc character device named "qrtr-tun" for the purpose of allowing user space applications to implement endpoints in the qrtr network. This allows more advanced (and dynamic) testing of the qrtr code as well as opens up the ability of tunneling qrtr over a network or USB link. Signed-off-by: Bjorn Andersson --- Changes since v1: - Dropped queue lock net/qrtr/Kconfig | 7 +++ net/qrtr/Makefile | 2 + net/qrtr/tun.c| 146 ++ 3 files changed, 155 insertions(+) create mode 100644 net/qrtr/tun.c diff --git a/net/qrtr/Kconfig b/net/qrtr/Kconfig index 326fd97444f5..1944834d225c 100644 --- a/net/qrtr/Kconfig +++ b/net/qrtr/Kconfig @@ -21,4 +21,11 @@ config QRTR_SMD Say Y here to support SMD based ipcrouter channels. SMD is the most common transport for IPC Router. +config QRTR_TUN + tristate "TUN device for Qualcomm IPC Router" + ---help--- + Say Y here to expose a character device that allows user space to + implement endpoints of QRTR, for purpose of tunneling data to other + hosts or testing purposes. 
+ endif # QRTR diff --git a/net/qrtr/Makefile b/net/qrtr/Makefile index ab09e40f7c74..be012bfd3e52 100644 --- a/net/qrtr/Makefile +++ b/net/qrtr/Makefile @@ -2,3 +2,5 @@ obj-$(CONFIG_QRTR) := qrtr.o obj-$(CONFIG_QRTR_SMD) += qrtr-smd.o qrtr-smd-y := smd.o +obj-$(CONFIG_QRTR_TUN) += qrtr-tun.o +qrtr-tun-y := tun.o diff --git a/net/qrtr/tun.c b/net/qrtr/tun.c new file mode 100644 index ..48b2147c98f8 --- /dev/null +++ b/net/qrtr/tun.c @@ -0,0 +1,146 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2018, Linaro Ltd */ + +#include +#include +#include +#include + +#include "qrtr.h" + +struct qrtr_tun { + struct qrtr_endpoint ep; + + struct sk_buff_head queue; + wait_queue_head_t readq; +}; + +static int qrtr_tun_send(struct qrtr_endpoint *ep, struct sk_buff *skb) +{ + struct qrtr_tun *tun = container_of(ep, struct qrtr_tun, ep); + + skb_queue_tail(&tun->queue, skb); + + /* wake up any blocking processes, waiting for new data */ + wake_up_interruptible(&tun->readq); + + return 0; +} + +static int qrtr_tun_open(struct inode *inode, struct file *filp) +{ + struct qrtr_tun *tun; + + tun = kzalloc(sizeof(*tun), GFP_KERNEL); + if (!tun) + return -ENOMEM; + + skb_queue_head_init(&tun->queue); + init_waitqueue_head(&tun->readq); + + tun->ep.xmit = qrtr_tun_send; + + filp->private_data = tun; + + return qrtr_endpoint_register(&tun->ep, QRTR_EP_NID_AUTO); +} + +static ssize_t qrtr_tun_read_iter(struct kiocb *iocb, struct iov_iter *to) +{ + struct file *filp = iocb->ki_filp; + struct qrtr_tun *tun = filp->private_data; + struct sk_buff *skb; + int count; + + while (!(skb = skb_dequeue(&tun->queue))) { + if (filp->f_flags & O_NONBLOCK) + return -EAGAIN; + + /* Wait until we get data or the endpoint goes away */ + if (wait_event_interruptible(tun->readq, +!skb_queue_empty(&tun->queue))) + return -ERESTARTSYS; + } + + count = min_t(size_t, iov_iter_count(to), skb->len); + if (copy_to_iter(skb->data, count, to) != count) + count = -EFAULT; + + kfree_skb(skb); + + return 
count; +} + +static ssize_t qrtr_tun_write_iter(struct kiocb *iocb, struct iov_iter *from) +{ + struct file *filp = iocb->ki_filp; + struct qrtr_tun *tun = filp->private_data; + size_t len = iov_iter_count(from); + ssize_t ret; + void *kbuf; + + kbuf = kzalloc(len, GFP_KERNEL); + if (!kbuf) + return -ENOMEM; + + if (!copy_from_iter_full(kbuf, len, from)) + return -EFAULT; + + ret = qrtr_endpoint_post(&tun->ep, kbuf, len); + + return ret < 0 ? ret : len; +} + +static int qrtr_tun_release(struct inode *inode, struct file *filp) +{ + struct qrtr_tun *tun = filp->private_data; + struct sk_buff *skb; + + qrtr_endpoint_unregister(&tun->ep); + + /* Discard all SKBs */ + while (!skb_queue_empty(&tun->queue)) { + skb = skb_dequeue(&tun->queue); + kfree_skb(skb); + } + + kfree(tun); + + return 0; +} + +static const struct file_operations qrtr_tun_ops = { + .owner = THIS_MODULE, + .open = qrtr_tun_open, + .read_iter = qrtr_tun_read_iter, + .write_iter = qrtr_tun_write_iter, + .release = qrtr_tun_release, +}; + +static struct miscdevice qrtr_tun_miscdev = { + MISC_DYNAMIC_MINOR, + "qrtr-tun", + &qrtr_tun_ops, +}; + +static int __init qrtr_tun_init(void) +{ + int ret; + + ret = misc_register(&qrtr_tun_miscdev); + if (ret) + pr_err("failed to register Qu
[PATCH net-next] tcp: md5: only call tp->af_specific->md5_lookup() for md5 sockets
RETPOLINE made calls to tp->af_specific->md5_lookup() quite expensive, given they have no result. We can omit the calls for sockets that have no md5 keys. Signed-off-by: Eric Dumazet --- net/ipv4/tcp_output.c | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 383cac0ff0ec059ca7dbc1a6304cc7f8183e008d..95feffb6d53f8a9eadfb15a2fffeec498d6e993a 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -585,14 +585,15 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb, unsigned int remaining = MAX_TCP_OPTION_SPACE; struct tcp_fastopen_request *fastopen = tp->fastopen_req; + *md5 = NULL; #ifdef CONFIG_TCP_MD5SIG - *md5 = tp->af_specific->md5_lookup(sk, sk); - if (*md5) { - opts->options |= OPTION_MD5; - remaining -= TCPOLEN_MD5SIG_ALIGNED; + if (unlikely(rcu_access_pointer(tp->md5sig_info))) { + *md5 = tp->af_specific->md5_lookup(sk, sk); + if (*md5) { + opts->options |= OPTION_MD5; + remaining -= TCPOLEN_MD5SIG_ALIGNED; + } } -#else - *md5 = NULL; #endif /* We always get an MSS option. The option bytes which will be seen in @@ -720,14 +721,15 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb opts->options = 0; + *md5 = NULL; #ifdef CONFIG_TCP_MD5SIG - *md5 = tp->af_specific->md5_lookup(sk, sk); - if (unlikely(*md5)) { - opts->options |= OPTION_MD5; - size += TCPOLEN_MD5SIG_ALIGNED; + if (unlikely(rcu_access_pointer(tp->md5sig_info))) { + *md5 = tp->af_specific->md5_lookup(sk, sk); + if (*md5) { + opts->options |= OPTION_MD5; + size += TCPOLEN_MD5SIG_ALIGNED; + } } -#else - *md5 = NULL; #endif if (likely(tp->rx_opt.tstamp_ok)) { -- 2.17.0.484.g0c8726318c-goog
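The optimization in this patch follows a general pattern: gate a retpoline-expensive indirect call behind a cheap direct test on the state that would make the call useful. A minimal userspace sketch of the pattern; the types and names below are illustrative, not the kernel's:

```c
#include <stddef.h>

typedef const char *(*md5_lookup_fn)(void);

struct conn {
	void *md5sig_info;	/* NULL on the common no-MD5 path */
	md5_lookup_fn lookup;	/* indirect call; costly under retpolines */
};

static const char *sample_lookup(void) { return "key"; }

/* Fast path takes no indirect branch at all: a NULL test on the state
 * pointer replaces the lookup call that could only have returned NULL
 * anyway. */
static const char *md5_key(const struct conn *c)
{
	if (!c->md5sig_info)
		return NULL;
	return c->lookup();
}
```

In the patch itself the cheap test is `rcu_access_pointer(tp->md5sig_info)`, which reads the pointer without an RCU dereference barrier, since only its NULL-ness is consulted before deciding whether the indirect `md5_lookup()` is worth making.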
Re: [bpf PATCH 1/2] bpf: Document sockmap '-target bpf' requirement for PROG_TYPE_SK_MSG
Applied both to bpf tree, thanks John!
dev_loopback_xmit parameters
Hello, I don't know if this is the right place to ask, but I was wondering why the dev_loopback_xmit() function defined in net/core/dev.c takes struct net * and struct sock * as parameters. They are never used, so I believe passing only the struct sk_buff * should be enough. In addition, I would like to know where I can read about what an skb dst_entry is and how to set one, since I don't really understand it. Thanks a lot, Emanuele
Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
Hi Andy On 04/23/2018 02:14 PM, Andy Lutomirski wrote: > On 04/20/2018 08:55 AM, Eric Dumazet wrote: >> This patch series provide a new mmap_hook to fs willing to grab >> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity. >> >> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem >> is held), and improve multi-threading scalability. >> > > I think that the right solution is to rework mmap() on TCP sockets a bit. > The current approach in net-next is very strange for a few reasons: > > 1. It uses mmap() as an operation that has side effects besides just creating > a mapping. If nothing else, it's surprising, since mmap() doesn't usually do > that. But it's also causing problems like what you're seeing. > > 2. The performance is worse than it needs to be. mmap() is slow, and I doubt > you'll find many mm developers who consider this particular abuse of mmap() > to be a valid thing to optimize for. > > 3. I'm not at all convinced the accounting is sane. As far as I can tell, > you're allowing unprivileged users to increment the count on network-owned > pages, limited only by available virtual memory, without obviously charging > it to the socket buffer limits. It looks like a program that simply forgot > to call munmap() would cause the system to run out of memory, and I see no > reason to expect the OOM killer to have any real chance of killing the right > task. > > 4. Error handling sucks. If I try to mmap() a large range (which is the > whole point -- using a small range will kill performance) and not quite all > of it can be mapped, then I waste a bunch of time in the kernel and get > *none* of the range mapped. > > I would suggest that you rework the interface a bit. First a user would call > mmap() on a TCP socket, which would create an empty VMA. (It would set > vm_ops to point to tcp_vm_ops or similar so that the TCP code could recognize > it, but it would have no effect whatsoever on the TCP state machine. 
> Reading the VMA would get SIGBUS.) Then a user would call a new ioctl() or > setsockopt() function and pass something like: > > struct tcp_zerocopy_receive { > void *address; > size_t length; > }; > > The kernel would verify that [address, address+length) is entirely inside a > single TCP VMA and then would do the vm_insert_range magic. I have no idea what the proper API for that would be. Where would the TCP VMA(s) be stored? In the TCP socket, or in the MM layer? And I am not sure why the error handling would be better (point 4), unless maybe we can return a smaller @length than requested? Also, how would the VMA space be accounted (point 3) when creating an empty VMA (no pages in there yet)? > On success, length is changed to the length that actually got mapped. The > kernel could do this while holding mmap_sem for *read*, and it could get the > lock ordering right. If and when mm range locks ever get merged, it could > switch to using a range lock. > > Then you could use MADV_DONTNEED or another ioctl/setsockopt to zap the part > of the mapping that you're done with. > > Does this seem reasonable? It should involve very little code change, it > will run faster, it will scale better, and it is much less weird IMO. Maybe, although I do not see the 'little code change' yet. But at least, this seems like a pretty nice idea, especially if it could allow us to fill the mmap()ed area later when packets are received.
Re: WARNING in refcount_dec
On Thu, Apr 19, 2018 at 2:55 PM, Willem de Bruijn wrote: > On Thu, Apr 19, 2018 at 2:32 AM, DaeRyong Jeong wrote: >> Hello. >> We have analyzed the cause of the crash in v4.16-rc3, WARNING in >> refcount_dec, >> which is found by RaceFuzzer (a modified version of Syzkaller). >> >> Since struct packet_sock's member variables, running, has_vnet_hdr, origdev >> and auxdata are declared as bitfields, accessing these variables can race if >> there is no synchronization mechanism. > > Great catch. > > These fields po->{running, auxdata, origdev, has_vnet_hdr} are > accessed without a uniform locking strategy. > > po->running is always accessed with po->bind_lock held (with the > exception of reading in packet_seq_show, but that is best effort). > > That is the only field written to outside setsockopt. If it is moved to > a separate word, it will no longer interfere with the others. > > The other fields are read lockless in the various recv and send > functions, but only set in setsockopt. We've had enough > locking bugs around setsockopt that I suggest we wrap all of > those in lock_sock, like the example I gave before for > has_vnet_hdr. Sent http://patchwork.ozlabs.org/patch/903190/
[net-next 2/2] ipv6: sr: Compute flowlabel of outer IPv6 header for seg6 encap mode
ECMP (equal-cost multipath) hashes are typically computed on the packet's 5-tuple (src IP, dst IP, src port, dst port, L4 proto). For encapsulated packets, the L4 data is not readily available and ECMP hashing will often revert to (src IP, dst IP). This will lead to traffic polarization on a single ECMP path, causing congestion and waste of network capacity. In IPv6, the 20-bit flow label field is also used as part of the ECMP hash. In the absence of L4 data, the hashing will be on (src IP, dst IP, flow label). Having a non-zero flow label is thus important for proper traffic load balancing when L4 data is unavailable (i.e., when packets are encapsulated). Currently, the seg6_do_srh_encap() function extracts the original packet's flow label and sets it as the outer IPv6 flow label. There are two issues with this behaviour: a) There is no guarantee that the inner flow label will be set by the source. b) If the original packet is not IPv6, the flow label will be set to zero (e.g., IPv4 or L2 encap). This patch adds a function, named seg6_make_flowlabel(), that computes a flow label from a given skb. It supports IPv6, IPv4 and L2 payloads, and leverages the per-namespace "seg6_flowlabel" sysctl value. This patch has been tested for IPv6, IPv4, and L2 traffic.
Signed-off-by: Ahmed Abdelsalam --- net/ipv6/seg6_iptunnel.c | 24 ++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c index 5fe1394..3d9cd86 100644 --- a/net/ipv6/seg6_iptunnel.c +++ b/net/ipv6/seg6_iptunnel.c @@ -91,6 +91,24 @@ static void set_tun_src(struct net *net, struct net_device *dev, rcu_read_unlock(); } +/* Compute flowlabel for outer IPv6 header */ +__be32 seg6_make_flowlabel(struct net *net, struct sk_buff *skb, + struct ipv6hdr *inner_hdr) +{ + int do_flowlabel = net->ipv6.sysctl.seg6_flowlabel; + __be32 flowlabel = 0; + u32 hash; + + if (do_flowlabel > 0) { + hash = skb_get_hash(skb); + hash = rol32(hash, 16); + flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK; + } else if (!do_flowlabel && skb->protocol == htons(ETH_P_IPV6)) { + flowlabel = ip6_flowlabel(inner_hdr); + } + return flowlabel; +} + /* encapsulate an IPv6 packet within an outer IPv6 header with a given SRH */ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto) { @@ -99,6 +117,7 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto) struct ipv6hdr *hdr, *inner_hdr; struct ipv6_sr_hdr *isrh; int hdrlen, tot_len, err; + __be32 flowlabel; hdrlen = (osrh->hdrlen + 1) << 3; tot_len = hdrlen + sizeof(*hdr); @@ -119,12 +138,13 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto) * decapsulation will overwrite inner hlim with outer hlim */ + flowlabel = seg6_make_flowlabel(net, skb, inner_hdr); if (skb->protocol == htons(ETH_P_IPV6)) { ip6_flow_hdr(hdr, ip6_tclass(ip6_flowinfo(inner_hdr)), -ip6_flowlabel(inner_hdr)); +flowlabel); hdr->hop_limit = inner_hdr->hop_limit; } else { - ip6_flow_hdr(hdr, 0, 0); + ip6_flow_hdr(hdr, 0, flowlabel); hdr->hop_limit = ip6_dst_hoplimit(skb_dst(skb)); } -- 2.1.4
[net-next 1/2] ipv6: sr: add a per namespace sysctl to control seg6 flowlabel
This patch adds a per-namespace sysctl, named 'seg6_flowlabel', to be used by seg6_do_srh_encap() to control the behaviour of setting the flowlabel value of the outer IPv6 header. The currently supported behaviours are as follows: -1: set the flowlabel to zero. 0: copy the flowlabel from the inner packet in case of an inner IPv6 header (0 for IPv4/L2). 1: compute the flowlabel using seg6_make_flowlabel(). Signed-off-by: Ahmed Abdelsalam --- include/net/netns/ipv6.h | 1 + net/ipv6/sysctl_net_ipv6.c | 8 2 files changed, 9 insertions(+) diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h index 97b3a54..c978a31 100644 --- a/include/net/netns/ipv6.h +++ b/include/net/netns/ipv6.h @@ -43,6 +43,7 @@ struct netns_sysctl_ipv6 { int max_hbh_opts_cnt; int max_dst_opts_len; int max_hbh_opts_len; + int seg6_flowlabel; }; struct netns_ipv6 { diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c index 6fbdef6..e15cd37 100644 --- a/net/ipv6/sysctl_net_ipv6.c +++ b/net/ipv6/sysctl_net_ipv6.c @@ -152,6 +152,13 @@ static struct ctl_table ipv6_table_template[] = { .extra1 = &zero, .extra2 = &one, }, + { + .procname = "seg6_flowlabel", + .data = &init_net.ipv6.sysctl.seg6_flowlabel, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec + }, { } }; @@ -217,6 +224,7 @@ static int __net_init ipv6_sysctl_net_init(struct net *net) ipv6_table[12].data = &net->ipv6.sysctl.max_dst_opts_len; ipv6_table[13].data = &net->ipv6.sysctl.max_hbh_opts_len; ipv6_table[14].data = &net->ipv6.sysctl.multipath_hash_policy, + ipv6_table[15].data = &net->ipv6.sysctl.seg6_flowlabel; ipv6_route_table = ipv6_route_sysctl_init(net); if (!ipv6_route_table) -- 2.1.4
[PATCH net] packet: fix bitfield update race
From: Willem de Bruijn Updates to the bitfields in struct packet_sock are not atomic. Serialize these read-modify-write cycles. Move po->running into a separate variable. Its writes are protected by po->bind_lock (except for one startup case at packet_create). Also replace a textual precondition warning with lockdep annotation. All others are set only in packet_setsockopt. Serialize these updates by holding the socket lock. Analogous to other field updates, also hold the lock when testing whether a ring is active (pg_vec). Fixes: 8dc419447415 ("[PACKET]: Add optional checksum computation for recvmsg") Reported-by: DaeRyong Jeong Reported-by: Byoungyoung Lee Signed-off-by: Willem de Bruijn --- net/packet/af_packet.c | 60 +++--- net/packet/internal.h | 10 +++ 2 files changed, 49 insertions(+), 21 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index c31b0687396a..01f3515cada0 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -329,11 +329,11 @@ static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb) skb_set_queue_mapping(skb, queue_index); } -/* register_prot_hook must be invoked with the po->bind_lock held, +/* __register_prot_hook must be invoked through register_prot_hook * or from a context in which asynchronous accesses to the packet * socket is not possible (packet_create()). */ -static void register_prot_hook(struct sock *sk) +static void __register_prot_hook(struct sock *sk) { struct packet_sock *po = pkt_sk(sk); @@ -348,8 +348,13 @@ static void register_prot_hook(struct sock *sk) } } -/* {,__}unregister_prot_hook() must be invoked with the po->bind_lock - * held. 
If the sync parameter is true, we will temporarily drop +static void register_prot_hook(struct sock *sk) +{ + lockdep_assert_held_once(&pkt_sk(sk)->bind_lock); + __register_prot_hook(sk); +} + +/* If the sync parameter is true, we will temporarily drop * the po->bind_lock and do a synchronize_net to make sure no * asynchronous packet processing paths still refer to the elements * of po->prot_hook. If the sync parameter is false, it is the @@ -359,6 +364,8 @@ static void __unregister_prot_hook(struct sock *sk, bool sync) { struct packet_sock *po = pkt_sk(sk); + lockdep_assert_held_once(&po->bind_lock); + po->running = 0; if (po->fanout) @@ -3252,7 +3259,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol, if (proto) { po->prot_hook.type = proto; - register_prot_hook(sk); + __register_prot_hook(sk); } mutex_lock(&net->packet.sklist_lock); @@ -3732,12 +3739,18 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv if (optlen != sizeof(val)) return -EINVAL; - if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) - return -EBUSY; if (copy_from_user(&val, optval, sizeof(val))) return -EFAULT; - po->tp_loss = !!val; - return 0; + + lock_sock(sk); + if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) { + ret = -EBUSY; + } else { + po->tp_loss = !!val; + ret = 0; + } + release_sock(sk); + return ret; } case PACKET_AUXDATA: { @@ -3748,7 +3761,9 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv if (copy_from_user(&val, optval, sizeof(val))) return -EFAULT; + lock_sock(sk); po->auxdata = !!val; + release_sock(sk); return 0; } case PACKET_ORIGDEV: @@ -3760,7 +3775,9 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv if (copy_from_user(&val, optval, sizeof(val))) return -EFAULT; + lock_sock(sk); po->origdev = !!val; + release_sock(sk); return 0; } case PACKET_VNET_HDR: @@ -3769,15 +3786,20 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user 
*optv if (sock->type != SOCK_RAW) return -EINVAL; - if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) - return -EBUSY; if (optlen < sizeof(val)) return -EINVAL; if (copy_from_user(&val, optval, sizeof(val))) return -EFAULT; - po->has_vnet_hdr = !!val; - return 0; + lock_sock(sk); + if (po->rx_ring.pg_vec || po->tx_ring.pg_vec) { +
Re: [PATCH] selftests: bpf: update .gitignore with missing file
On 04/23/2018 03:50 PM, Anders Roxell wrote: > Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests") > Signed-off-by: Anders Roxell > --- > tools/testing/selftests/bpf/.gitignore | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/tools/testing/selftests/bpf/.gitignore > b/tools/testing/selftests/bpf/.gitignore > index 5e1ab2f0eb79..3e3b3ced3f7c 100644 > --- a/tools/testing/selftests/bpf/.gitignore > +++ b/tools/testing/selftests/bpf/.gitignore > @@ -15,3 +15,4 @@ test_libbpf_open > test_sock > test_sock_addr > urandom_read > +test_btf Against which tree is this? This doesn't apply to bpf-next. It would apply against the bpf tree, but c0fa1b6c3efc ("bpf: btf: Add BTF tests") is part of bpf-next, so it fits neither of them.
Re: [PATCH net-next 0/2] net/sctp: Avoid allocating high order memory with kmalloc()
Hi, On Mon, Apr 23, 2018 at 09:41:04PM +0300, Oleg Babin wrote: > Each SCTP association can have up to 65535 input and output streams. > For each stream type an array of sctp_stream_in or sctp_stream_out > structures is allocated using the kmalloc_array() function. This function > allocates physically contiguous memory regions, so this can lead > to allocation of memory regions of very high order, i.e.: > > sizeof(struct sctp_stream_out) == 24, > ((65535 * 24) / 4096) == 383 memory pages (4096 bytes per page), > which means 9th memory order. > > This can lead to memory allocation failures on systems > under memory stress. Did you do performance tests while actually using these 65k streams and with 256 (so it gets 2 pages)? This will introduce another deref on each access to an element, but I'm not expecting any impact due to it. Marcelo
Re: [PATCH net-next 1/2] net/sctp: Make wrappers for accessing in/out streams
On Mon, Apr 23, 2018 at 09:41:05PM +0300, Oleg Babin wrote: > This patch introduces wrappers for accessing in/out streams indirectly. > This will enable to replace physically contiguous memory arrays > of streams with flexible arrays (or maybe any other appropriate > mechanism) which do memory allocation on a per-page basis. > > Signed-off-by: Oleg Babin > --- > include/net/sctp/structs.h | 30 +++- > net/sctp/chunk.c | 6 ++- > net/sctp/outqueue.c | 11 +++-- > net/sctp/socket.c| 4 +- > net/sctp/stream.c| 107 > +-- > net/sctp/stream_interleave.c | 2 +- > net/sctp/stream_sched.c | 13 +++--- > net/sctp/stream_sched_prio.c | 22 - > net/sctp/stream_sched_rr.c | 8 ++-- > 9 files changed, 116 insertions(+), 87 deletions(-) > > diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h > index a0ec462..578bb40 100644 > --- a/include/net/sctp/structs.h > +++ b/include/net/sctp/structs.h > @@ -394,37 +394,37 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 > outcnt, __u16 incnt, > > /* What is the current SSN number for this stream? */ > #define sctp_ssn_peek(stream, type, sid) \ > - ((stream)->type[sid].ssn) > + (sctp_stream_##type##_ptr((stream), (sid))->ssn) > > /* Return the next SSN number for this stream. */ > #define sctp_ssn_next(stream, type, sid) \ > - ((stream)->type[sid].ssn++) > + (sctp_stream_##type##_ptr((stream), (sid))->ssn++) > > /* Skip over this ssn and all below. */ > #define sctp_ssn_skip(stream, type, sid, ssn) \ > - ((stream)->type[sid].ssn = ssn + 1) > + (sctp_stream_##type##_ptr((stream), (sid))->ssn = ssn + 1) > > /* What is the current MID number for this stream? */ > #define sctp_mid_peek(stream, type, sid) \ > - ((stream)->type[sid].mid) > + (sctp_stream_##type##_ptr((stream), (sid))->mid) > > /* Return the next MID number for this stream. */ > #define sctp_mid_next(stream, type, sid) \ > - ((stream)->type[sid].mid++) > + (sctp_stream_##type##_ptr((stream), (sid))->mid++) > > /* Skip over this mid and all below. 
*/ > #define sctp_mid_skip(stream, type, sid, mid) \ > - ((stream)->type[sid].mid = mid + 1) > + (sctp_stream_##type##_ptr((stream), (sid))->mid = mid + 1) > > -#define sctp_stream_in(asoc, sid) (&(asoc)->stream.in[sid]) > +#define sctp_stream_in(asoc, sid) sctp_stream_in_ptr(&(asoc)->stream, (sid)) This will get confusing: - sctp_stream_in(asoc, sid) - sctp_stream_in_ptr(stream, sid) Considering all usages of sctp_stream_in(), seems you can just update them to do the ->stream deref and keep only the later implementation. Which then don't need the _ptr suffix. Marcelo
[bpf-next PATCH 2/4] bpf: sockmap, add a set of tests to run by default
If no options are passed to sockmap after this patch we run a set of tests using various options and sendmsg/sendpage sizes. This replaces the sockmap_test.sh script. Signed-off-by: John Fastabend --- samples/sockmap/sockmap_user.c | 563 1 file changed, 512 insertions(+), 51 deletions(-) diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c index 649f06a..22595b4 100644 --- a/samples/sockmap/sockmap_user.c +++ b/samples/sockmap/sockmap_user.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -53,9 +54,11 @@ #define S2_PORT 10001 #define BPF_FILENAME "sockmap_kern.o" +#define BPF_CGRP "/mnt/cgroup2/" /* global sockets */ int s1, s2, c1, c2, p1, p2; +int test_cnt; int txmsg_pass; int txmsg_noisy; @@ -109,7 +112,7 @@ static void usage(char *argv[]) printf("\n"); } -static int sockmap_init_sockets(void) +static int sockmap_init_sockets(int verbose) { int i, err, one = 1; struct sockaddr_in addr; @@ -209,9 +212,11 @@ static int sockmap_init_sockets(void) return errno; } - printf("connected sockets: c1 <-> p1, c2 <-> p2\n"); - printf("cgroups binding: c1(%i) <-> s1(%i) - - - c2(%i) <-> s2(%i)\n", - c1, s1, c2, s2); + if (verbose) { + printf("connected sockets: c1 <-> p1, c2 <-> p2\n"); + printf("cgroups binding: c1(%i) <-> s1(%i) - - - c2(%i) <-> s2(%i)\n", + c1, s1, c2, s2); + } return 0; } @@ -329,10 +334,12 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt, clock_gettime(CLOCK_MONOTONIC, &s->end); } else { int slct, recv, max_fd = fd; + int fd_flags = O_NONBLOCK; struct timeval timeout; float total_bytes; fd_set w; + fcntl(fd, fd_flags); total_bytes = (float)iov_count * (float)iov_length * (float)cnt; err = clock_gettime(CLOCK_MONOTONIC, &s->start); if (err < 0) @@ -351,7 +358,8 @@ static int msg_loop(int fd, int iov_count, int iov_length, int cnt, clock_gettime(CLOCK_MONOTONIC, &s->end); goto out_errno; } else if (!slct) { - fprintf(stderr, "unexpected timeout\n"); + if (opt->verbose) + 
fprintf(stderr, "unexpected timeout\n"); errno = -EIO; clock_gettime(CLOCK_MONOTONIC, &s->end); goto out_errno; @@ -440,7 +448,7 @@ static int sendmsg_test(struct sockmap_options *opt) iov_count = 1; err = msg_loop(rx_fd, iov_count, iov_buf, cnt, &s, false, opt); - if (err) + if (err && opt->verbose) fprintf(stderr, "msg_loop_rx: iov_count %i iov_buf %i cnt %i err %i\n", iov_count, iov_buf, cnt, err); @@ -450,10 +458,11 @@ static int sendmsg_test(struct sockmap_options *opt) sent_Bps = sentBps(s); recvd_Bps = recvdBps(s); } - fprintf(stdout, - "rx_sendmsg: TX: %zuB %fB/s %fGB/s RX: %zuB %fB/s %fGB/s\n", - s.bytes_sent, sent_Bps, sent_Bps/giga, - s.bytes_recvd, recvd_Bps, recvd_Bps/giga); + if (opt->verbose) + fprintf(stdout, + "rx_sendmsg: TX: %zuB %fB/s %fGB/s RX: %zuB %fB/s %fGB/s\n", + s.bytes_sent, sent_Bps, sent_Bps/giga, + s.bytes_recvd, recvd_Bps, recvd_Bps/giga); exit(1); } else if (rxpid == -1) { perror("msg_loop_rx: "); @@ -477,10 +486,11 @@ static int sendmsg_test(struct sockmap_options *opt) sent_Bps = sentBps(s); recvd_Bps = recvdBps(s); } - fprintf(stdout, - "tx_sendmsg: TX: %zuB %fB/s %f GB/s RX: %zuB %fB/s %fGB/s\n", - s.bytes_sent, sent_Bps, sent_Bps/giga, - s.bytes_recvd, recvd_Bps, recvd_Bps/giga); + if (opt->verbose) + fprintf(stdout, + "tx_sendmsg: TX: %zuB %fB/s %f GB/s RX: %zuB %fB/s %fGB/s\n", + s.bytes_sent, sent_Bps, sent_Bps/giga, + s.bytes_recvd, recvd_Bps, recvd_Bps/giga); exit(1); } else if (txpid == -1) { perror("msg_loop_tx: "); @@ -575,17 +585,9 @@ enum { SEN
[bpf-next PATCH 3/4] bpf: sockmap, add selftests
This adds a new test program test_sockmap which is the old sample sockmap program. By moving the sample program here we can now run it as part of the self tests suite. To support this a populate_progs() routine is added to load programs and maps which was previously done with load_bpf_file(). This is needed because self test libs do not provide a similar routine. Also we now use the cgroup_helpers routines to manage cgroup use instead of manually creating one and supplying it to the CLI. Notice we keep the CLI around though because it is useful for dbg and specialized testing. To run use ./test_sockmap and the result should be, Summary 660 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: John Fastabend --- tools/include/uapi/linux/bpf.h |1 tools/include/uapi/linux/if_link.h | 39 + tools/lib/bpf/libbpf.c |4 tools/lib/bpf/libbpf.h |2 tools/testing/selftests/bpf/Makefile|5 tools/testing/selftests/bpf/test_sockmap.c | 1465 +++ tools/testing/selftests/bpf/test_sockmap_kern.c | 340 + 7 files changed, 1852 insertions(+), 4 deletions(-) create mode 100644 tools/testing/selftests/bpf/test_sockmap.c create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.c diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 7f7fbb9..c8383a2 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -884,6 +884,7 @@ enum bpf_func_id { /* BPF_FUNC_skb_set_tunnel_key flags. */ #define BPF_F_ZERO_CSUM_TX (1ULL << 1) #define BPF_F_DONT_FRAGMENT(1ULL << 2) +#define BPF_F_SEQ_NUMBER (1ULL << 3) /* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and * BPF_FUNC_perf_event_read_value flags. 
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h index 6d94477..68699f6 100644 --- a/tools/include/uapi/linux/if_link.h +++ b/tools/include/uapi/linux/if_link.h @@ -941,4 +941,43 @@ enum { IFLA_EVENT_BONDING_OPTIONS, /* change in bonding options */ }; +/* tun section */ + +enum { + IFLA_TUN_UNSPEC, + IFLA_TUN_OWNER, + IFLA_TUN_GROUP, + IFLA_TUN_TYPE, + IFLA_TUN_PI, + IFLA_TUN_VNET_HDR, + IFLA_TUN_PERSIST, + IFLA_TUN_MULTI_QUEUE, + IFLA_TUN_NUM_QUEUES, + IFLA_TUN_NUM_DISABLED_QUEUES, + __IFLA_TUN_MAX, +}; + +#define IFLA_TUN_MAX (__IFLA_TUN_MAX - 1) + +/* rmnet section */ + +#define RMNET_FLAGS_INGRESS_DEAGGREGATION (1U << 0) +#define RMNET_FLAGS_INGRESS_MAP_COMMANDS (1U << 1) +#define RMNET_FLAGS_INGRESS_MAP_CKSUMV4 (1U << 2) +#define RMNET_FLAGS_EGRESS_MAP_CKSUMV4(1U << 3) + +enum { + IFLA_RMNET_UNSPEC, + IFLA_RMNET_MUX_ID, + IFLA_RMNET_FLAGS, + __IFLA_RMNET_MAX, +}; + +#define IFLA_RMNET_MAX (__IFLA_RMNET_MAX - 1) + +struct ifla_rmnet_flags { + __u32 flags; + __u32 mask; +}; + #endif /* _UAPI_LINUX_IF_LINK_H */ diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 6513e0b..7bcdca1 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -1961,8 +1961,8 @@ static bool bpf_program__is_type(struct bpf_program *prog, BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP); BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT); -static void bpf_program__set_expected_attach_type(struct bpf_program *prog, -enum bpf_attach_type type) +void bpf_program__set_expected_attach_type(struct bpf_program *prog, + enum bpf_attach_type type) { prog->expected_attach_type = type; } diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index d6ac4fa..197f9ce 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -193,6 +193,8 @@ int bpf_program__set_prep(struct bpf_program *prog, int nr_instance, int bpf_program__set_xdp(struct bpf_program *prog); int bpf_program__set_perf_event(struct bpf_program *prog); void 
bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type); +void bpf_program__set_expected_attach_type(struct bpf_program *prog, + enum bpf_attach_type type); bool bpf_program__is_socket_filter(struct bpf_program *prog); bool bpf_program__is_tracepoint(struct bpf_program *prog); diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 0b72cc7..094d507 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -24,7 +24,7 @@ urandom_read: urandom_read.c # Order correspond to 'make run_tests' order TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \ test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \ - test_sock test_sock_addr test_btf + test_sock test_sock_addr test_btf t
[bpf-next PATCH 1/4] bpf: sockmap, code sockmap_test in C
By moving sockmap_test from a shell script into C we can run it directly from selftests, and we can also push the input/output around in proper structures. However, we keep the CLI options around because they are useful for debugging when a particular pattern of msghdr or sockmap options trips up the sockmap code path. Signed-off-by: John Fastabend --- samples/sockmap/sockmap_user.c | 209 ++-- 1 file changed, 113 insertions(+), 96 deletions(-) diff --git a/samples/sockmap/sockmap_user.c b/samples/sockmap/sockmap_user.c index 6f23349..649f06a 100644 --- a/samples/sockmap/sockmap_user.c +++ b/samples/sockmap/sockmap_user.c @@ -52,6 +52,8 @@ #define S1_PORT 1 #define S2_PORT 10001 +#define BPF_FILENAME "sockmap_kern.o" + /* global sockets */ int s1, s2, c1, c2, p1, p2; @@ -226,6 +228,9 @@ struct sockmap_options { bool sendpage; bool data_test; bool drop_expected; + int iov_count; + int iov_length; + int rate; }; static int msg_loop_sendpage(int fd, int iov_length, int cnt, @@ -409,12 +414,14 @@ static inline float recvdBps(struct msg_stats s) return s.bytes_recvd / (s.end.tv_sec - s.start.tv_sec); } -static int sendmsg_test(int iov_count, int iov_buf, int cnt, - struct sockmap_options *opt) +static int sendmsg_test(struct sockmap_options *opt) { float sent_Bps = 0, recvd_Bps = 0; int rx_fd, txpid, rxpid, err = 0; struct msg_stats s = {0}; + int iov_count = opt->iov_count; + int iov_buf = opt->iov_length; + int cnt = opt->rate; int status; errno = 0; @@ -568,98 +575,13 @@ enum { SENDPAGE, }; -int main(int argc, char **argv) +static int run_options(struct sockmap_options options, int cg_fd, int test) { - int iov_count = 1, length = 1024, rate = 1, tx_prog_fd; - struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY}; - int opt, longindex, err, cg_fd = 0; - struct sockmap_options options = {0}; - int test = PING_PONG; + char *bpf_file = BPF_FILENAME; + int err, tx_prog_fd; char filename[256]; - while ((opt = getopt_long(argc, argv, ":dhvc:r:i:l:t:", - long_options, &longindex)) !=
-1) { - switch (opt) { - case 's': - txmsg_start = atoi(optarg); - break; - case 'e': - txmsg_end = atoi(optarg); - break; - case 'a': - txmsg_apply = atoi(optarg); - break; - case 'k': - txmsg_cork = atoi(optarg); - break; - case 'c': - cg_fd = open(optarg, O_DIRECTORY, O_RDONLY); - if (cg_fd < 0) { - fprintf(stderr, - "ERROR: (%i) open cg path failed: %s\n", - cg_fd, optarg); - return cg_fd; - } - break; - case 'r': - rate = atoi(optarg); - break; - case 'v': - options.verbose = 1; - break; - case 'i': - iov_count = atoi(optarg); - break; - case 'l': - length = atoi(optarg); - break; - case 'd': - options.data_test = true; - break; - case 't': - if (strcmp(optarg, "ping") == 0) { - test = PING_PONG; - } else if (strcmp(optarg, "sendmsg") == 0) { - test = SENDMSG; - } else if (strcmp(optarg, "base") == 0) { - test = BASE; - } else if (strcmp(optarg, "base_sendpage") == 0) { - test = BASE_SENDPAGE; - } else if (strcmp(optarg, "sendpage") == 0) { - test = SENDPAGE; - } else { - usage(argv); - return -1; - } - break; - case 0: - break; - case 'h': - default: - usage(argv); - return -1; - } - } - - if (!cg_fd) { - fprintf(stderr, "%s requires cgroup option: --cgroup \n", - argv[0]); - return -1; - } - - if (setrlimit(RLIMIT_MEMLOCK, &r)) { - perror("setrlimit(RLIMIT_MEMLOCK)"); - retur
[bpf-next PATCH 0/4] selftests for BPF sockmap use cases
This series moves ./samples/sockmap into BPF selftests. There are a few good reasons to do this. First, by pushing this into selftests the tests will be run automatically. Second, sockmap was not really a sample of anything anymore, but rather a large set of tests. Note: There are three recent fixes outstanding against the bpf branch that can occasionally be detected by the automated tests here. https://patchwork.ozlabs.org/patch/903138/ https://patchwork.ozlabs.org/patch/903139/ https://patchwork.ozlabs.org/patch/903140/ --- John Fastabend (4): bpf: sockmap, code sockmap_test in C bpf: sockmap, add a set of tests to run by default bpf: sockmap, add selftests bpf: sockmap, remove samples program samples/sockmap/Makefile| 75 - samples/sockmap/sockmap_kern.c | 341 - samples/sockmap/sockmap_test.sh | 488 samples/sockmap/sockmap_user.c | 894 -- tools/include/uapi/linux/bpf.h |1 tools/include/uapi/linux/if_link.h | 39 + tools/lib/bpf/libbpf.c |4 tools/lib/bpf/libbpf.h |2 tools/testing/selftests/bpf/Makefile|5 tools/testing/selftests/bpf/test_sockmap.c | 1465 +++ tools/testing/selftests/bpf/test_sockmap_kern.c | 340 + 11 files changed, 1852 insertions(+), 1802 deletions(-) delete mode 100644 samples/sockmap/Makefile delete mode 100644 samples/sockmap/sockmap_kern.c delete mode 100755 samples/sockmap/sockmap_test.sh delete mode 100644 samples/sockmap/sockmap_user.c create mode 100644 tools/testing/selftests/bpf/test_sockmap.c create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.c -- Signature