On 16.05.2019 2:27, William Tu wrote:
> Hi Ilya,
>
> Thanks for your feedback.
>
> On Mon, May 13, 2019 at 10:48 AM Ilya Maximets <i.maxim...@samsung.com> wrote:
>>
>> On 10.05.2019 2:54, William Tu wrote:
>>> The patch introduces experimental AF_XDP support for OVS netdev.
>>> AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
>>> built upon the eBPF and XDP technology. It aims to have comparable
>>> performance to DPDK but cooperates better with the existing kernel
>>> networking stack. An AF_XDP socket receives and sends packets from an
>>> eBPF/XDP program attached to the netdev, by-passing a couple of the Linux
>>> kernel's subsystems. As a result, the AF_XDP socket shows much better
>>> performance than AF_PACKET. For more details about AF_XDP, please see the
>>> Linux kernel's Documentation/networking/af_xdp.rst. Note that by default,
>>> this is not compiled in.
>>>
>>> Signed-off-by: William Tu <u9012...@gmail.com>
>>>
>>> ---
>
> snip
>
>>> +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args,
>>> +                        char **errp OVS_UNUSED)
>>> +{
>>> +    struct netdev_linux *dev = netdev_linux_cast(netdev);
>>> +    const char *xdpmode;
>>> +    int new_n_rxq;
>>> +
>>> +    ovs_mutex_lock(&dev->mutex);
>>> +
>>> +    new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
>>> +    if (new_n_rxq > MAX_XSKQ) {
>>> +        ovs_mutex_unlock(&dev->mutex);
>>> +        return EINVAL;
>>> +    }
>>> +
>>> +    if (new_n_rxq != netdev->n_rxq) {
>>> +        dev->requested_n_rxq = new_n_rxq;
>>> +        netdev_request_reconfigure(netdev);
>>> +    }
>>> +
>>> +    xdpmode = smap_get(args, "xdpmode");
>>> +    if (xdpmode && strncmp(xdpmode, "drv", 3) == 0) {
>>> +        dev->requested_xdpmode = XDP_ZEROCOPY;
>>> +        if (dev->xdpmode != dev->requested_xdpmode) {
>>> +            netdev_request_reconfigure(netdev);
>>> +        }
>>> +    } else {
>>> +        dev->requested_xdpmode = XDP_COPY;
>>> +        if (dev->xdpmode != dev->requested_xdpmode) {
>>> +            netdev_request_reconfigure(netdev);
>>> +        }
>>> +    }
>>
>> The code above will keep requesting reconfiguration until the
>> reconfiguration actually happens. This could cause multiple
>> reconfigurations in a row for the same configuration change.
>> A better version could look like this:
>>
>>     new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
>>     if (new_n_rxq > MAX_XSKQ) {
>>         ovs_mutex_unlock(&dev->mutex);
>>         VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).",
>>                  netdev_get_name(netdev), new_n_rxq, MAX_XSKQ);
>>         return EINVAL;
>>     }
>>
>>     str_xdpmode = smap_get_def(args, "xdpmode", "skb");
>>     if (!strcasecmp(str_xdpmode, "drv")) {
>>         xdpmode = XDP_ZEROCOPY;
>>     } else if (!strcasecmp(str_xdpmode, "skb")) {
>>         xdpmode = XDP_COPY;
>>     } else {
>>         VLOG_ERR("%s: Incorrect xdpmode (%s).",
>>                  netdev_get_name(netdev), str_xdpmode);
>>         ovs_mutex_unlock(&dev->mutex);
>>         return EINVAL;
>>     }
>>
>>     if (dev->requested_n_rxq != new_n_rxq
>>         || dev->requested_xdpmode != xdpmode) {
>>         dev->requested_n_rxq = new_n_rxq;
>>         dev->requested_xdpmode = xdpmode;
>>         netdev_request_reconfigure(netdev);
>>     }
>>
>> The main difference is comparing "new" with "requested", not "new" with
>> "current". This allows us to request reconfiguration only once for each
>> change. I also made a few cosmetic changes which you may find useful;
>> however, it's up to you.
>
> Thanks, will fix it in next version.
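
The reason comparing the new values against the 'requested_*' fields (rather
than against the current ones) is sufficient is that the deferred
reconfiguration is what eventually applies them: the reconfigure callback
copies requested_n_rxq and requested_xdpmode into the active configuration
and rebuilds the sockets, so a given change only needs to be requested once.
As a rough sketch of that side (not the actual patch code; it assumes the
xsk_destroy_all()/xsk_configure_all() helpers declared in netdev-afxdp.h),
netdev_afxdp_reconfigure() could look roughly like:

    static int
    netdev_afxdp_reconfigure(struct netdev *netdev_)
    {
        struct netdev_linux *dev = netdev_linux_cast(netdev_);
        int err = 0;

        ovs_mutex_lock(&dev->mutex);

        if (netdev_->n_rxq == dev->requested_n_rxq
            && dev->xdpmode == dev->requested_xdpmode) {
            /* Nothing changed since the last reconfiguration. */
            goto out;
        }

        /* Tear down the existing sockets, then apply the requested values
         * so that later set_config() calls compare against what will
         * actually be configured. */
        xsk_destroy_all(netdev_);

        netdev_->n_rxq = dev->requested_n_rxq;
        dev->xdpmode = dev->requested_xdpmode;

        err = xsk_configure_all(netdev_);

    out:
        ovs_mutex_unlock(&dev->mutex);
        return err;
    }
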
>
>>
>>> +    for (i = 0; i < rcvd; i++) {
>>> +        uint64_t addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
>>> +        uint32_t len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->len;
>>> +        char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
>>> +        uint64_t index;
>>> +
>>> +        struct dp_packet_afxdp *xpacket;
>>> +        struct dp_packet *packet;
>>> +
>>> +        index = addr >> FRAME_SHIFT;
>>> +        xpacket = UMEM2XPKT(xsk->umem->xpool.array, index);
>>> +
>>> +        packet = &xpacket->packet;
>>> +        xpacket->mpool = &xsk->umem->mpool;
>>> +
>>> +        /* Initialize the struct dp_packet */
>>> +        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
>>> +        dp_packet_set_size(packet, len);
>>> +
>>> +        /* Add packet into batch, increase batch->count */
>>> +        dp_packet_batch_add(batch, packet);
>>> +
>>> +        idx_rx++;
>>> +    }
>>> +
>>> +    /* We've consumed rcvd packets in RX, now re-fill the
>>> +     * same number back to the FILL queue.
>>> +     */
>>> +    ret = umem_elem_pop_n(&xsk->umem->mpool, rcvd, (void **)elems);
>>> +    if (OVS_UNLIKELY(ret)) {
>>> +        return -ENOMEM;
>>> +    }
>>
>> Can this be done before actually receiving packets? i.e. don't receive
>> anything if we can't refill.
>
> I'm not sure I understand your point.
> Do you suggest moving this umem_elem_pop_n to the beginning?
Yes. I meant this. It's just an optimization. We don't need to iterate over
'xsk->rx' if 'umem_elem_pop_n' failed, because you're returning an error and
the batch will not be used.

Actually, I looked at the code again and I see that in case of
'umem_elem_pop_n' failure you're not releasing the packets, i.e.
xsk_ring_cons__peek() reserved some packets, but xsk_ring_cons__release() is
never called. This will break the rx ring on the next call.
I think that code should look like this:

    rcvd = xsk_ring_cons__peek(&xsk->rx ...)
    if (!rcvd) {
        return 0;
    }

    ret = umem_elem_pop_n(rcvd)
    if (OVS_UNLIKELY(ret)) {
        xsk_ring_cons__release(&xsk->rx, rcvd);
        /* TODO: Account packets as dropped. */
        return -ENOMEM;
    }

    for (i = 0; i < rcvd; i++) {
        ...
        dp_packet_use_afxdp(packet, pkt, FRAME_SIZE - FRAME_HEADROOM);
        ...
        dp_packet_batch_add(batch, packet);
    }

    /* Release the RX queue */
    xsk_ring_cons__release(&xsk->rx, rcvd);

    for (i = 0; i < rcvd; i++) {
        ...
        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
    }
    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);

> I think at this point the rcvd is > 0, otherwise it will return.
> So we already know there are packets at the rx ring.
>
>>
>>> +
>>> +    for (i = 0; i < rcvd; i++) {
>>> +        uint64_t index;
>>> +        struct umem_elem *elem;
>>> +
>>> +        ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
>>> +        while (OVS_UNLIKELY(ret == 0)) {
>>> +            /* The FILL queue is full, so retry. (or skip)? */
>>> +            ret = xsk_ring_prod__reserve(&xsk->umem->fq, 1, &idx_fq);
>>> +        }
>
> And if we can't refill, we will keep trying.

One additional point here is that you don't need to call prod__reserve for
each packet. You could use:

    while (!xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq)) {
        /* The FILL queue is full, so retry. */
    }

xsk_ring_prod__reserve() doesn't support partial allocations, i.e. it will
reserve rcvd or 0 elements.

>
>>> +
>>> +        /* Get one free umem, program it into FILL queue */
>>> +        elem = elems[i];
>>> +        index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer);
>>> +        ovs_assert((index & FRAME_SHIFT_MASK) == 0);
>>> +        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq) = index;
>>> +
>>> +        idx_fq++;
>>> +    }
>>> +    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
>>> +
>>> +    /* Release the RX queue */
>>> +    xsk_ring_cons__release(&xsk->rx, rcvd);
>>> +    xsk->rx_npkts += rcvd;
>>> +
>>> +#ifdef AFXDP_DEBUG
>>> +    print_xsk_stat(xsk);
>>> +#endif
>>> +    return 0;
>>> +}
>>> +
>>> +static inline int kick_tx(struct xsk_socket_info *xsk)
>>> +{
>>> +    int ret;
>>> +
>>> +    /* This causes system call into kernel's xsk_sendmsg, and
>>> +     * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode).
>>> + */ >>> + ret = sendto(xsk_socket__fd(xsk->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0); >>> + if (OVS_UNLIKELY(ret < 0)) { >>> + if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) { >>> + return errno; >>> + } >>> + } >>> + /* no error, or EBUSY or EAGAIN */ >>> + return 0; >>> +} >>> + >>> +int >>> +netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk, >>> + struct dp_packet_batch *batch) >>> +{ >>> + struct umem_elem *elems_pop[BATCH_SIZE]; >>> + struct umem_elem *elems_push[BATCH_SIZE]; >>> + uint32_t tx_done, idx_cq = 0; >>> + struct dp_packet *packet; >>> + uint32_t idx = 0; >>> + int j, ret, retry_count = 0; >>> + const int max_retry = 4; >>> + >>> + ret = umem_elem_pop_n(&xsk->umem->mpool, batch->count, (void >>> **)elems_pop); >>> + if (OVS_UNLIKELY(ret)) { >>> + return EAGAIN; >>> + } >>> + >>> + /* Make sure we have enough TX descs */ >>> + ret = xsk_ring_prod__reserve(&xsk->tx, batch->count, &idx); >>> + if (OVS_UNLIKELY(ret == 0)) { >>> + umem_elem_push_n(&xsk->umem->mpool, batch->count, (void >>> **)elems_pop); >>> + return EAGAIN; >>> + } >>> + >>> + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { >>> + struct umem_elem *elem; >>> + uint64_t index; >>> + >>> + elem = elems_pop[i]; >>> + /* Copy the packet to the umem we just pop from umem pool. >>> + * We can avoid this copy if the packet and the pop umem >>> + * are located in the same umem. >>> + */ >>> + memcpy(elem, dp_packet_data(packet), dp_packet_size(packet)); >>> + >>> + index = (uint64_t)((char *)elem - (char *)xsk->umem->buffer); >>> + xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr = index; >>> + xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len >>> + = dp_packet_size(packet); >>> + } >>> + xsk_ring_prod__submit(&xsk->tx, batch->count); >>> + xsk->outstanding_tx += batch->count; >>> + >>> + ret = kick_tx(xsk); >>> + if (OVS_UNLIKELY(ret)) { >>> + umem_elem_push_n(&xsk->umem->mpool, batch->count, (void >>> **)elems_pop); >>> + VLOG_WARN_RL(&rl, "error sending AF_XDP packet: %s", >>> + ovs_strerror(ret)); >>> + return ret; >>> + } >>> + >>> +retry: >>> + /* Process CQ */ >>> + tx_done = xsk_ring_cons__peek(&xsk->umem->cq, batch->count, &idx_cq); >>> + if (tx_done > 0) { >>> + xsk->outstanding_tx -= tx_done; >>> + xsk->tx_npkts += tx_done; >>> + } >>> + >>> + /* Recycle back to umem pool */ >>> + for (j = 0; j < tx_done; j++) { >>> + struct umem_elem *elem; >>> + uint64_t addr; >>> + >>> + addr = *xsk_ring_cons__comp_addr(&xsk->umem->cq, idx_cq++); >>> + >>> + elem = ALIGNED_CAST(struct umem_elem *, >>> + (char *)xsk->umem->buffer + addr); >>> + elems_push[j] = elem; >>> + } >>> + >>> + ret = umem_elem_push_n(&xsk->umem->mpool, tx_done, (void >>> **)elems_push); >>> + ovs_assert(ret == 0); >>> + >>> + xsk_ring_cons__release(&xsk->umem->cq, tx_done); >>> + >>> + if (xsk->outstanding_tx > PROD_NUM_DESCS - (PROD_NUM_DESCS >> 2)) { >>> + /* If there are still a lot not transmitted, try harder. */ >>> + if (retry_count++ > max_retry) { >>> + return 0; >>> + } >>> + goto retry; >>> + } >>> + >>> + return 0; >>> +} >>> diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h >>> new file mode 100644 >>> index 000000000000..6518d8fca0b5 >>> --- /dev/null >>> +++ b/lib/netdev-afxdp.h >>> @@ -0,0 +1,53 @@ >>> +/* >>> + * Copyright (c) 2018 Nicira, Inc. >>> + * >>> + * Licensed under the Apache License, Version 2.0 (the "License"); >>> + * you may not use this file except in compliance with the License. 
>>> + * You may obtain a copy of the License at: >>> + * >>> + * http://www.apache.org/licenses/LICENSE-2.0 >>> + * >>> + * Unless required by applicable law or agreed to in writing, software >>> + * distributed under the License is distributed on an "AS IS" BASIS, >>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >>> + * See the License for the specific language governing permissions and >>> + * limitations under the License. >>> + */ >>> + >>> +#ifndef NETDEV_AFXDP_H >>> +#define NETDEV_AFXDP_H 1 >>> + >>> +#include <stdint.h> >>> +#include <stdbool.h> >>> + >>> +/* These functions are Linux AF_XDP specific, so they should be used >>> directly >>> + * only by Linux-specific code. */ >>> +#define MAX_XSKQ 16 >>> +struct netdev; >>> +struct xsk_socket_info; >>> +struct xdp_umem; >>> +struct dp_packet_batch; >>> +struct smap; >>> +struct dp_packet; >>> + >>> +struct dp_packet_afxdp * dp_packet_cast_afxdp(const struct dp_packet *d); >>> + >>> +int xsk_configure_all(struct netdev *netdev); >>> + >>> +void xsk_destroy_all(struct netdev *netdev); >>> + >>> +int netdev_linux_rxq_xsk(struct xsk_socket_info *xsk, >>> + struct dp_packet_batch *batch); >>> + >>> +int netdev_linux_afxdp_batch_send(struct xsk_socket_info *xsk, >>> + struct dp_packet_batch *batch); >>> + >>> +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args, >>> + char **errp); >>> +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap >>> *args); >>> +int netdev_afxdp_get_numa_id(const struct netdev *netdev); >>> + >>> +void free_afxdp_buf(struct dp_packet *p); >>> +void free_afxdp_buf_batch(struct dp_packet_batch *batch); >>> +int netdev_afxdp_reconfigure(struct netdev *netdev); >>> +#endif /* netdev-afxdp.h */ >>> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h >>> new file mode 100644 >>> index 000000000000..3dd3d902b3c4 >>> --- /dev/null >>> +++ b/lib/netdev-linux-private.h >>> @@ -0,0 +1,124 @@ >>> +/* >>> + * Copyright (c) 2019 Nicira, Inc. >>> + * >>> + * Licensed under the Apache License, Version 2.0 (the "License"); >>> + * you may not use this file except in compliance with the License. >>> + * You may obtain a copy of the License at: >>> + * >>> + * http://www.apache.org/licenses/LICENSE-2.0 >>> + * >>> + * Unless required by applicable law or agreed to in writing, software >>> + * distributed under the License is distributed on an "AS IS" BASIS, >>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >>> + * See the License for the specific language governing permissions and >>> + * limitations under the License. >>> + */ >>> + >>> +#ifndef NETDEV_LINUX_PRIVATE_H >>> +#define NETDEV_LINUX_PRIVATE_H 1 >>> + >>> +#include <config.h> >>> + >>> +#include <linux/filter.h> >>> +#include <linux/gen_stats.h> >>> +#include <linux/if_ether.h> >>> +#include <linux/if_tun.h> >>> +#include <linux/types.h> >>> +#include <linux/ethtool.h> >>> +#include <linux/mii.h> >>> +#include <stdint.h> >>> +#include <stdbool.h> >>> + >>> +#include "netdev-provider.h" >>> +#include "netdev-tc-offloads.h" >>> +#include "netdev-vport.h" >>> +#include "openvswitch/thread.h" >>> +#include "ovs-atomic.h" >>> +#include "timer.h" >>> + >>> +#if HAVE_AF_XDP >>> +#include "netdev-afxdp.h" >>> +#endif >>> + >>> +/* These functions are Linux specific, so they should be used directly >>> only by >>> + * Linux-specific code. 
*/ >>> + >>> +struct netdev; >>> + >>> +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag, >>> + const char *flag_name, bool enable); >>> +int linux_get_ifindex(const char *netdev_name); >>> + >>> +#define LINUX_FLOW_OFFLOAD_API \ >>> + .flow_flush = netdev_tc_flow_flush, \ >>> + .flow_dump_create = netdev_tc_flow_dump_create, \ >>> + .flow_dump_destroy = netdev_tc_flow_dump_destroy, \ >>> + .flow_dump_next = netdev_tc_flow_dump_next, \ >>> + .flow_put = netdev_tc_flow_put, \ >>> + .flow_get = netdev_tc_flow_get, \ >>> + .flow_del = netdev_tc_flow_del, \ >>> + .init_flow_api = netdev_tc_init_flow_api >>> + >>> +struct netdev_linux { >>> + struct netdev up; >>> + >>> + /* Protects all members below. */ >>> + struct ovs_mutex mutex; >>> + >>> + unsigned int cache_valid; >>> + >>> + bool miimon; /* Link status of last poll. */ >>> + long long int miimon_interval; /* Miimon Poll rate. Disabled if <= 0. >>> */ >>> + struct timer miimon_timer; >>> + >>> + int netnsid; /* Network namespace ID. */ >>> + /* The following are figured out "on demand" only. They are only valid >>> + * when the corresponding VALID_* bit in 'cache_valid' is set. */ >>> + int ifindex; >>> + struct eth_addr etheraddr; >>> + int mtu; >>> + unsigned int ifi_flags; >>> + long long int carrier_resets; >>> + uint32_t kbits_rate; /* Policing data. */ >>> + uint32_t kbits_burst; >>> + int vport_stats_error; /* Cached error code from >>> vport_get_stats(). >>> + 0 or an errno value. */ >>> + int netdev_mtu_error; /* Cached error code from SIOCGIFMTU >>> + * or SIOCSIFMTU. >>> + */ >>> + int ether_addr_error; /* Cached error code from set/get >>> etheraddr. */ >>> + int netdev_policing_error; /* Cached error code from set policing. */ >>> + int get_features_error; /* Cached error code from ETHTOOL_GSET. */ >>> + int get_ifindex_error; /* Cached error code from SIOCGIFINDEX. */ >>> + >>> + enum netdev_features current; /* Cached from ETHTOOL_GSET. */ >>> + enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */ >>> + enum netdev_features supported; /* Cached from ETHTOOL_GSET. */ >>> + >>> + struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO. */ >>> + struct tc *tc; >>> + >>> + /* For devices of class netdev_tap_class only. */ >>> + int tap_fd; >>> + bool present; /* If the device is present in the >>> namespace */ >>> + uint64_t tx_dropped; /* tap device can drop if the iface is >>> down */ >>> + >>> + /* LAG information. */ >>> + bool is_lag_master; /* True if the netdev is a LAG master. 
*/ >>> + >>> + /* AF_XDP information */ >>> +#ifdef HAVE_AF_XDP >>> + struct xsk_socket_info *xsk[MAX_XSKQ]; >>> + int requested_n_rxq; >>> + int xdpmode, requested_xdpmode; /* detect mode changed */ >>> + int xdp_flags, xdp_bind_flags; >>> +#endif >>> +}; >>> + >>> +static struct netdev_linux * >>> +netdev_linux_cast(const struct netdev *netdev) >>> +{ >>> + return CONTAINER_OF(netdev, struct netdev_linux, up); >>> +} >>> + >>> +#endif /* netdev-linux-private.h */ >>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c >>> index f75d73fd39f8..1f190406d145 100644 >>> --- a/lib/netdev-linux.c >>> +++ b/lib/netdev-linux.c >>> @@ -17,6 +17,7 @@ >>> #include <config.h> >>> >>> #include "netdev-linux.h" >>> +#include "netdev-linux-private.h" >>> >>> #include <errno.h> >>> #include <fcntl.h> >>> @@ -54,6 +55,7 @@ >>> #include "fatal-signal.h" >>> #include "hash.h" >>> #include "openvswitch/hmap.h" >>> +#include "netdev-afxdp.h" >>> #include "netdev-provider.h" >>> #include "netdev-tc-offloads.h" >>> #include "netdev-vport.h" >>> @@ -487,51 +489,6 @@ static int tc_calc_cell_log(unsigned int mtu); >>> static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu); >>> static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes); >>> >>> -struct netdev_linux { >>> - struct netdev up; >>> - >>> - /* Protects all members below. */ >>> - struct ovs_mutex mutex; >>> - >>> - unsigned int cache_valid; >>> - >>> - bool miimon; /* Link status of last poll. */ >>> - long long int miimon_interval; /* Miimon Poll rate. Disabled if <= 0. >>> */ >>> - struct timer miimon_timer; >>> - >>> - int netnsid; /* Network namespace ID. */ >>> - /* The following are figured out "on demand" only. They are only valid >>> - * when the corresponding VALID_* bit in 'cache_valid' is set. */ >>> - int ifindex; >>> - struct eth_addr etheraddr; >>> - int mtu; >>> - unsigned int ifi_flags; >>> - long long int carrier_resets; >>> - uint32_t kbits_rate; /* Policing data. */ >>> - uint32_t kbits_burst; >>> - int vport_stats_error; /* Cached error code from >>> vport_get_stats(). >>> - 0 or an errno value. */ >>> - int netdev_mtu_error; /* Cached error code from SIOCGIFMTU or >>> SIOCSIFMTU. */ >>> - int ether_addr_error; /* Cached error code from set/get >>> etheraddr. */ >>> - int netdev_policing_error; /* Cached error code from set policing. */ >>> - int get_features_error; /* Cached error code from ETHTOOL_GSET. */ >>> - int get_ifindex_error; /* Cached error code from SIOCGIFINDEX. */ >>> - >>> - enum netdev_features current; /* Cached from ETHTOOL_GSET. */ >>> - enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */ >>> - enum netdev_features supported; /* Cached from ETHTOOL_GSET. */ >>> - >>> - struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO. */ >>> - struct tc *tc; >>> - >>> - /* For devices of class netdev_tap_class only. */ >>> - int tap_fd; >>> - bool present; /* If the device is present in the >>> namespace */ >>> - uint64_t tx_dropped; /* tap device can drop if the iface is >>> down */ >>> - >>> - /* LAG information. */ >>> - bool is_lag_master; /* True if the netdev is a LAG master. 
*/ >>> -}; >>> >>> struct netdev_rxq_linux { >>> struct netdev_rxq up; >>> @@ -579,18 +536,23 @@ is_netdev_linux_class(const struct netdev_class >>> *netdev_class) >>> return netdev_class->run == netdev_linux_run; >>> } >>> >>> +#if HAVE_AF_XDP >>> static bool >>> -is_tap_netdev(const struct netdev *netdev) >>> +is_afxdp_netdev(const struct netdev *netdev) >>> { >>> - return netdev_get_class(netdev) == &netdev_tap_class; >>> + return netdev_get_class(netdev) == &netdev_afxdp_class; >>> } >>> - >>> -static struct netdev_linux * >>> -netdev_linux_cast(const struct netdev *netdev) >>> +#else >>> +static bool >>> +is_afxdp_netdev(const struct netdev *netdev OVS_UNUSED) >>> { >>> - ovs_assert(is_netdev_linux_class(netdev_get_class(netdev))); >>> - >>> - return CONTAINER_OF(netdev, struct netdev_linux, up); >>> + return false; >>> +} >>> +#endif >>> +static bool >>> +is_tap_netdev(const struct netdev *netdev) >>> +{ >>> + return netdev_get_class(netdev) == &netdev_tap_class; >>> } >>> >>> static struct netdev_rxq_linux * >>> @@ -1084,6 +1046,11 @@ netdev_linux_destruct(struct netdev *netdev_) >>> atomic_count_dec(&miimon_cnt); >>> } >>> >>> +#if HAVE_AF_XDP >>> + if (is_afxdp_netdev(netdev_)) { >>> + xsk_destroy_all(netdev_); >>> + } >>> +#endif >>> ovs_mutex_destroy(&netdev->mutex); >>> } >>> >>> @@ -1113,7 +1080,7 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_) >>> rx->is_tap = is_tap_netdev(netdev_); >>> if (rx->is_tap) { >>> rx->fd = netdev->tap_fd; >>> - } else { >>> + } else if (!is_afxdp_netdev(netdev_)) { >>> struct sockaddr_ll sll; >>> int ifindex, val; >>> /* Result of tcpdump -dd inbound */ >>> @@ -1318,10 +1285,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, >>> struct dp_packet_batch *batch, >>> { >>> struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_); >>> struct netdev *netdev = rx->up.netdev; >>> - struct dp_packet *buffer; >>> + struct dp_packet *buffer = NULL; >>> ssize_t retval; >>> int mtu; >>> >>> +#if HAVE_AF_XDP >>> + if (is_afxdp_netdev(netdev)) { >>> + struct netdev_linux *dev = netdev_linux_cast(netdev); >>> + int qid = rxq_->queue_id; >>> + >>> + return netdev_linux_rxq_xsk(dev->xsk[qid], batch); >>> + } >> >> Maybe it's better to just implement '.rxq_recv' inside netdev-afxdp.c ? >> Also, you missed clearing the '*qfill'. >> >>> +#endif >>> if (netdev_linux_get_mtu__(netdev_linux_cast(netdev), &mtu)) { >>> mtu = ETH_PAYLOAD_MAX; >>> } >>> @@ -1329,6 +1304,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct >>> dp_packet_batch *batch, >>> /* Assume Ethernet port. No need to set packet_type. */ >>> buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu, >>> DP_NETDEV_HEADROOM); >>> + >>> retval = (rx->is_tap >>> ? 
netdev_linux_rxq_recv_tap(rx->fd, buffer) >>> : netdev_linux_rxq_recv_sock(rx->fd, buffer)); >>> @@ -1480,7 +1456,8 @@ netdev_linux_send(struct netdev *netdev_, int qid >>> OVS_UNUSED, >>> int error = 0; >>> int sock = 0; >>> >>> - if (!is_tap_netdev(netdev_)) { >>> + if (!is_tap_netdev(netdev_) && >>> + !is_afxdp_netdev(netdev_)) { >>> if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) { >>> error = EOPNOTSUPP; >>> goto free_batch; >>> @@ -1499,6 +1476,36 @@ netdev_linux_send(struct netdev *netdev_, int qid >>> OVS_UNUSED, >>> } >>> >>> error = netdev_linux_sock_batch_send(sock, ifindex, batch); >>> +#if HAVE_AF_XDP >>> + } else if (is_afxdp_netdev(netdev_)) { >>> + struct netdev_linux *dev = netdev_linux_cast(netdev_); >>> + struct dp_packet_afxdp *xpacket; >>> + struct umem_pool *first_mpool; >>> + struct dp_packet *packet; >>> + >>> + error = netdev_linux_afxdp_batch_send(dev->xsk[qid], batch); >>> + >>> + /* all packets must come frome the same umem pool >>> + * and has DPBUF_AFXDP type, otherwise free on-by-one >>> + */ >>> + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { >>> + if (packet->source != DPBUF_AFXDP) { >>> + goto free_batch; >>> + } >>> + >>> + xpacket = dp_packet_cast_afxdp(packet); >>> + if (i == 0) { >>> + first_mpool = xpacket->mpool; >>> + continue; >>> + } >>> + if (xpacket->mpool != first_mpool) { >>> + goto free_batch; >>> + } >>> + } >>> + /* free in batch */ >>> + free_afxdp_buf_batch(batch); >>> + return error; >> >> >> There are a lot of afxdp specific code here and 'netdev_linux_send' doesn't >> provide any magic, i.e. has no real code suitable for all netdev types. >> Maybe it's better to just implement own '.send' function inside >> netdev-afxdp.c ? > > Yes, I will do that. > >> >>> +#endif >>> } else { >>> error = netdev_linux_tap_batch_send(netdev_, batch); >>> } >>> @@ -3323,6 +3330,7 @@ const struct netdev_class netdev_linux_class = { >>> NETDEV_LINUX_CLASS_COMMON, >>> LINUX_FLOW_OFFLOAD_API, >>> .type = "system", >>> + .is_pmd = false, >>> .construct = netdev_linux_construct, >>> .get_stats = netdev_linux_get_stats, >>> .get_features = netdev_linux_get_features, >>> @@ -3333,6 +3341,7 @@ const struct netdev_class netdev_linux_class = { >>> const struct netdev_class netdev_tap_class = { >>> NETDEV_LINUX_CLASS_COMMON, >>> .type = "tap", >>> + .is_pmd = false, >>> .construct = netdev_linux_construct_tap, >>> .get_stats = netdev_tap_get_stats, >>> .get_features = netdev_linux_get_features, >>> @@ -3343,10 +3352,26 @@ const struct netdev_class netdev_internal_class = { >>> NETDEV_LINUX_CLASS_COMMON, >>> LINUX_FLOW_OFFLOAD_API, >>> .type = "internal", >>> + .is_pmd = false, >>> .construct = netdev_linux_construct, >>> .get_stats = netdev_internal_get_stats, >>> .get_status = netdev_internal_get_status, >>> }; >>> + >>> +#ifdef HAVE_AF_XDP >>> +const struct netdev_class netdev_afxdp_class = { >>> + NETDEV_LINUX_CLASS_COMMON, >>> + .type = "afxdp", >>> + .is_pmd = true, >>> + .construct = netdev_linux_construct, >>> + .get_stats = netdev_linux_get_stats, >>> + .get_status = netdev_linux_get_status, >>> + .set_config = netdev_afxdp_set_config, >>> + .get_config = netdev_afxdp_get_config, >>> + .reconfigure = netdev_afxdp_reconfigure, >>> + .get_numa_id = netdev_afxdp_get_numa_id, >>> +}; >>> +#endif >>> >>> >>> #define CODEL_N_QUEUES 0x0000 >>> diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h >>> index 17ca9120168a..b812e64cb078 100644 >>> --- a/lib/netdev-linux.h >>> +++ b/lib/netdev-linux.h >>> @@ -19,6 +19,20 @@ >>> >>> #include <stdint.h> 
>>> #include <stdbool.h> >>> +#include <linux/filter.h> >>> +#include <linux/gen_stats.h> >>> +#include <linux/if_ether.h> >>> +#include <linux/if_tun.h> >>> +#include <linux/types.h> >>> +#include <linux/ethtool.h> >>> +#include <linux/mii.h> >>> + >>> +#include "netdev-provider.h" >>> +#include "netdev-tc-offloads.h" >>> +#include "netdev-vport.h" >>> +#include "openvswitch/thread.h" >>> +#include "ovs-atomic.h" >>> +#include "timer.h" >>> >>> /* These functions are Linux specific, so they should be used directly >>> only by >>> * Linux-specific code. */ >>> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h >>> index fb0c27e6e8e8..d433818f7064 100644 >>> --- a/lib/netdev-provider.h >>> +++ b/lib/netdev-provider.h >>> @@ -902,7 +902,9 @@ extern const struct netdev_class netdev_linux_class; >>> #endif >>> extern const struct netdev_class netdev_internal_class; >>> extern const struct netdev_class netdev_tap_class; >>> - >>> +#if HAVE_AF_XDP >>> +extern const struct netdev_class netdev_afxdp_class; >>> +#endif >>> #ifdef __cplusplus >>> } >>> #endif >>> diff --git a/lib/netdev.c b/lib/netdev.c >>> index 7d7ecf6f0946..e2fae37d5a5e 100644 >>> --- a/lib/netdev.c >>> +++ b/lib/netdev.c >>> @@ -146,6 +146,9 @@ netdev_initialize(void) >>> netdev_register_provider(&netdev_internal_class); >>> netdev_register_provider(&netdev_tap_class); >>> netdev_vport_tunnel_register(); >>> +#ifdef HAVE_AF_XDP >>> + netdev_register_provider(&netdev_afxdp_class); >>> +#endif >>> #endif >>> #if defined(__FreeBSD__) || defined(__NetBSD__) >>> netdev_register_provider(&netdev_tap_class); >>> diff --git a/lib/xdpsock.c b/lib/xdpsock.c >>> new file mode 100644 >>> index 000000000000..2d80e74d69e4 >>> --- /dev/null >>> +++ b/lib/xdpsock.c >>> @@ -0,0 +1,239 @@ >>> +/* >>> + * Copyright (c) 2018, 2019 Nicira, Inc. >>> + * >>> + * Licensed under the Apache License, Version 2.0 (the "License"); >>> + * you may not use this file except in compliance with the License. >>> + * You may obtain a copy of the License at: >>> + * >>> + * http://www.apache.org/licenses/LICENSE-2.0 >>> + * >>> + * Unless required by applicable law or agreed to in writing, software >>> + * distributed under the License is distributed on an "AS IS" BASIS, >>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >>> + * See the License for the specific language governing permissions and >>> + * limitations under the License. 
>>> + */ >>> +#include <config.h> >>> + >>> +#include "xdpsock.h" >>> + >>> +#include <ctype.h> >>> +#include <errno.h> >>> +#include <fcntl.h> >>> +#include <stdarg.h> >>> +#include <stdlib.h> >>> +#include <string.h> >>> +#include <sys/stat.h> >>> +#include <sys/types.h> >>> +#include <syslog.h> >>> +#include <time.h> >>> +#include <unistd.h> >>> + >>> +#include "async-append.h" >>> +#include "coverage.h" >>> +#include "dirs.h" >>> +#include "dp-packet.h" >>> +#include "openvswitch/compiler.h" >>> +#include "openvswitch/vlog.h" >>> +#include "ovs-atomic.h" >>> +#include "ovs-thread.h" >>> +#include "sat-math.h" >>> +#include "socket-util.h" >>> +#include "svec.h" >>> +#include "syslog-direct.h" >>> +#include "syslog-libc.h" >>> +#include "syslog-provider.h" >>> +#include "timeval.h" >>> +#include "unixctl.h" >>> +#include "util.h" >>> + >>> +static inline void >>> +ovs_spinlock_init(ovs_spinlock_t *sl) >>> +{ >>> + atomic_init(&sl->locked, 0); >>> +} >>> + >>> +static inline void >>> +ovs_spin_lock(ovs_spinlock_t *sl) >>> +{ >>> + int exp = 0, locked = 0; >>> + >>> + while (!atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1, >>> + memory_order_acquire, >>> + memory_order_relaxed)) { >>> + locked = 1; >>> + while (locked) { >>> + atomic_read_relaxed(&sl->locked, &locked); >>> + } >>> + exp = 0; >>> + } >>> +} >>> + >>> +static inline void >>> +ovs_spin_unlock(ovs_spinlock_t *sl) >>> +{ >>> + atomic_store_explicit(&sl->locked, 0, memory_order_release); >>> +} >>> + >>> +static inline int OVS_UNUSED >>> +ovs_spin_trylock(ovs_spinlock_t *sl) >>> +{ >>> + int exp = 0; >>> + return atomic_compare_exchange_strong_explicit(&sl->locked, &exp, 1, >>> + memory_order_acquire, >>> + memory_order_relaxed); >>> +} >>> + >>> +inline int >>> +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs) >>> +{ >>> + void *ptr; >>> + >>> + if (OVS_UNLIKELY(umemp->index + n > umemp->size)) { >>> + return -ENOMEM; >>> + } >>> + >>> + ptr = &umemp->array[umemp->index]; >>> + memcpy(ptr, addrs, n * sizeof(void *)); >>> + umemp->index += n; >>> + >>> + return 0; >>> +} >>> + >>> +int umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs) >>> +{ >>> + int ret; >>> + >>> + ovs_spin_lock(&umemp->mutex); >>> + ret = __umem_elem_push_n(umemp, n, addrs); >>> + ovs_spin_unlock(&umemp->mutex); >>> + >>> + return ret; >>> +} >>> + >>> +inline void >>> +__umem_elem_push(struct umem_pool *umemp, void *addr) >>> +{ >>> + umemp->array[umemp->index++] = addr; >>> +} >>> + >>> +void >>> +umem_elem_push(struct umem_pool *umemp, void *addr) >>> +{ >>> + >>> + if (OVS_UNLIKELY(umemp->index >= umemp->size)) { >>> + /* stack is overflow, this should not happen */ >>> + OVS_NOT_REACHED(); >>> + } >>> + >>> + ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0); >>> + >>> + ovs_spin_lock(&umemp->mutex); >>> + __umem_elem_push(umemp, addr); >>> + ovs_spin_unlock(&umemp->mutex); >>> +} >>> + >>> +inline int >>> +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs) >>> +{ >>> + void *ptr; >>> + >>> + if (OVS_UNLIKELY(umemp->index - n < 0)) { >>> + return -ENOMEM; >>> + } >>> + >>> + umemp->index -= n; >>> + ptr = &umemp->array[umemp->index]; >>> + memcpy(addrs, ptr, n * sizeof(void *)); >>> + >>> + return 0; >>> +} >>> + >>> +int >>> +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs) >>> +{ >>> + int ret; >>> + >>> + ovs_spin_lock(&umemp->mutex); >>> + ret = __umem_elem_pop_n(umemp, n, addrs); >>> + ovs_spin_unlock(&umemp->mutex); >>> + >>> + return ret; >>> +} >>> + >>> +inline void * >>> 
+__umem_elem_pop(struct umem_pool *umemp) >>> +{ >>> + return umemp->array[--umemp->index]; >>> +} >>> + >>> +void * >>> +umem_elem_pop(struct umem_pool *umemp) >>> +{ >>> + void *ptr; >>> + >>> + ovs_spin_lock(&umemp->mutex); >>> + ptr = __umem_elem_pop(umemp); >>> + ovs_spin_unlock(&umemp->mutex); >>> + >>> + return ptr; >>> +} >>> + >>> +void ** >>> +__umem_pool_alloc(unsigned int size) >>> +{ >>> + void *bufs; >>> + >>> + ovs_assert(posix_memalign(&bufs, getpagesize(), >>> + size * sizeof(void *)) == 0); >>> + memset(bufs, 0, size * sizeof(void *)); >>> + return (void **)bufs; >>> +} >>> + >>> +unsigned int >>> +umem_elem_count(struct umem_pool *mpool) >>> +{ >>> + return mpool->index; >>> +} >>> + >>> +int >>> +umem_pool_init(struct umem_pool *umemp, unsigned int size) >>> +{ >>> + umemp->array = __umem_pool_alloc(size); >>> + if (!umemp->array) { >>> + OVS_NOT_REACHED(); >>> + } >>> + >>> + umemp->size = size; >>> + umemp->index = 0; >>> + ovs_spinlock_init(&umemp->mutex); >>> + return 0; >>> +} >>> + >>> +void >>> +umem_pool_cleanup(struct umem_pool *umemp) >>> +{ >>> + free(umemp->array); >>> +} >>> + >>> +/* AF_XDP metadata init/destroy */ >>> +int >>> +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size) >>> +{ >>> + void *bufs; >>> + >>> + /* TODO: check HAVE_POSIX_MEMALIGN */ >>> + ovs_assert(posix_memalign(&bufs, getpagesize(), >>> + size * sizeof(struct dp_packet_afxdp)) == 0); >>> + memset(bufs, 0, size * sizeof(struct dp_packet_afxdp)); >>> + >>> + xp->array = bufs; >>> + xp->size = size; >>> + return 0; >>> +} >>> + >>> +void >>> +xpacket_pool_cleanup(struct xpacket_pool *xp) >>> +{ >>> + free(xp->array); >>> +} >>> diff --git a/lib/xdpsock.h b/lib/xdpsock.h >>> new file mode 100644 >>> index 000000000000..aabaa8e5df24 >>> --- /dev/null >>> +++ b/lib/xdpsock.h >>> @@ -0,0 +1,123 @@ >>> +/* >>> + * Copyright (c) 2018, 2019 Nicira, Inc. >>> + * >>> + * Licensed under the Apache License, Version 2.0 (the "License"); >>> + * you may not use this file except in compliance with the License. >>> + * You may obtain a copy of the License at: >>> + * >>> + * http://www.apache.org/licenses/LICENSE-2.0 >>> + * >>> + * Unless required by applicable law or agreed to in writing, software >>> + * distributed under the License is distributed on an "AS IS" BASIS, >>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >>> + * See the License for the specific language governing permissions and >>> + * limitations under the License. 
>>> + */
>>> +
>>> +#ifndef XDPSOCK_H
>>> +#define XDPSOCK_H 1
>>> +
>>> +#include <bpf/libbpf.h>
>>> +#include <bpf/xsk.h>
>>> +#include <errno.h>
>>> +#include <getopt.h>
>>> +#include <libgen.h>
>>> +#include <linux/bpf.h>
>>> +#include <linux/if_link.h>
>>> +#include <linux/if_xdp.h>
>>> +#include <linux/if_ether.h>
>>> +#include <locale.h>
>>> +#include <net/if.h>
>>> +#include <poll.h>
>>> +#include <pthread.h>
>>> +#include <signal.h>
>>> +#include <stdbool.h>
>>> +#include <stdio.h>
>>> +#include <stdlib.h>
>>> +#include <string.h>
>>> +#include <sys/resource.h>
>>> +#include <sys/socket.h>
>>> +#include <sys/types.h>
>>> +#include <sys/mman.h>
>>> +#include <time.h>
>>> +#include <unistd.h>
>>> +
>>> +#include "openvswitch/thread.h"
>>> +#include "ovs-atomic.h"
>>> +
>>> +#define FRAME_HEADROOM  XDP_PACKET_HEADROOM
>>> +#define FRAME_SIZE      XSK_UMEM__DEFAULT_FRAME_SIZE
>>> +#define BATCH_SIZE      NETDEV_MAX_BURST
>>> +#define FRAME_SHIFT     XSK_UMEM__DEFAULT_FRAME_SHIFT
>>> +#define FRAME_SHIFT_MASK ((1 << FRAME_SHIFT) - 1)
>>> +
>>> +#define NUM_FRAMES      4096
>>> +#define PROD_NUM_DESCS  512
>>> +#define CONS_NUM_DESCS  512
>>> +
>>> +#ifdef USE_XSK_DEFAULT
>>> +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
>>> +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
>>> +#endif
>>
>> Should there be ifdef-else-endif?
>
> good catch, thanks.
> William
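
For reference, the ifdef-else-endif arrangement being suggested would
presumably look like this (a sketch of the fix, not code from the patch),
keeping the hard-coded 512-entry rings as the default and using the libbpf
defaults only when USE_XSK_DEFAULT is set:

    #ifdef USE_XSK_DEFAULT
    #define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS
    #define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS
    #else
    #define PROD_NUM_DESCS 512
    #define CONS_NUM_DESCS 512
    #endif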