[PATCH 2/2] vhost: forbid IOTLB invalidation when not enabled
When IOTLB is not enabled, we should forbid IOTLB invalidation to
avoid a NULL pointer dereference.

Signed-off-by: Jason Wang
---
 drivers/vhost/vhost.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c6f2d89..7d338d5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -959,6 +959,10 @@ int vhost_process_iotlb_msg(struct vhost_dev *dev,
 		vhost_iotlb_notify_vq(dev, msg);
 		break;
 	case VHOST_IOTLB_INVALIDATE:
+		if (!dev->iotlb) {
+			ret = -EFAULT;
+			break;
+		}
 		vhost_del_umem_range(dev->iotlb, msg->iova,
 				     msg->iova + msg->size - 1);
 		break;
--
2.7.4
RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable
Mark Lord [mailto:ml...@pobox.com]
> Sent: Thursday, November 17, 2016 9:42 PM
[...]
> What the above sample shows, is the URB transfer buffer ran out of space
> in the middle of a packet, and the hardware then tried to just continue
> that same packet in the next URB, without an rx_desc header inserted.
> The r8152.c driver always assumes the URB buffer begins with an rx_desc,
> so of course this behaviour produces really weird effects, and system
> crashes, etc..

The USB device wouldn't know the address and size of the buffer. Only the
USB host controller knows. Therefore, the device sends the data to the
host, and the host fills the memory. According to your description, it
seems the host splits the data from the device into two different buffers
(or URB transfers). I wonder whether that can occur. As far as I know, the
host wouldn't allow a buffer smaller than the data length.

Our hw engineers need a log from a USB analyzer to confirm what the device
actually sends to the host. However, I don't think you have a USB analyzer
to do this. I will try to reproduce the issue, but I am busy, so I may not
respond quickly.

Besides, the maximum data length which the RTL8152 would send to the host
is 16KB. That is, if the agg_buf_sz is 16KB, the host wouldn't split it.
However, you still see problems with it.

[...]
> It is not clear to me how the chip decides when to forward an rx URB to
> the host. If you could describe how that part works for us, then it would
> help in further understanding why fast systems (eg. a PC) don't generally
> notice the issue, while much slower embedded systems do see the issue
> regularly.

The driver expects the rx buffer to be laid out as:

  rx_desc + a packet + padding to 8-byte alignment +
  rx_desc + a packet + padding to 8-byte alignment + ...

Therefore, when a urb transfer is completed, the driver parses the buffer
this way. After the buffer is handled, it is submitted to the host again,
until the transfer completes again. If the submission fails, the driver
tries again later.

The urb->actual_length means how much data the host filled in. The driver
uses it to find the end of the data. The urb->status indicates whether the
transfer was successful. The driver resubmits the urb to the host directly
if the status is not successful.

Best Regards,
Hayes
pull-request: mac80211 2016-11-18
Hi Dave,

Due to travel/vacation, this is a bit late, but there aren't that many
fixes either. Most interesting/important are the fixes from Felix and
perhaps the scan entry limit.

Please pull and let me know if there's any problem.

Thanks,
johannes

The following changes since commit 269ebce4531b8edc4224259a02143181a1c1d77c:

  xen-netfront: cast grant table reference first to type int (2016-11-02 15:33:36 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2016-11-18

for you to fetch changes up to 9853a55ef1bb66d7411136046060bbfb69c714fa:

  cfg80211: limit scan results cache size (2016-11-18 08:44:44 +0100)

A few more bugfixes:
 * limit # of scan results stored in memory - this is a long-standing
   bug Jouni and I only noticed while discussing other things in Santa Fe
 * revert AP_LINK_PS patch that was causing issues (Felix)
 * various A-MSDU/A-MPDU fixes for TXQ code (Felix)
 * interoperability workaround for peers with broken VHT capabilities
   (Filip Matusiak)
 * add bitrate definition for a VHT MCS that's supposed to be invalid
   but gets used by some hardware anyway (Thomas Pedersen)
 * beacon timer fix in hwsim (Benjamin Beichler)

Benjamin Beichler (1):
      mac80211_hwsim: fix beacon delta calculation

Felix Fietkau (4):
      Revert "mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE"
      mac80211: update A-MPDU flag on tx dequeue
      mac80211: remove bogus skb vif assignment
      mac80211: fix A-MSDU aggregation with fast-xmit + txq

Filip Matusiak (1):
      mac80211: Ignore VHT IE from peer with wrong rx_mcs_map

Johannes Berg (1):
      cfg80211: limit scan results cache size

Pedersen, Thomas (1):
      cfg80211: add bitrate for 20MHz MCS 9

 drivers/net/wireless/mac80211_hwsim.c |  2 +-
 net/mac80211/sta_info.c               |  2 +-
 net/mac80211/tx.c                     | 14 +--
 net/mac80211/vht.c                    | 16
 net/wireless/core.h                   |  1 +
 net/wireless/scan.c                   | 69 +++
 net/wireless/util.c                   |  3 +-
 7 files changed, 100 insertions(+), 7 deletions(-)
[PATCH net-next] bridge: add igmpv3 and mldv2 query support
Add bridge IGMPv3 and MLDv2 query support. But before we think it is stable enough, only enable it when declare in force_igmp/mld_version. Signed-off-by: Hangbin Liu --- net/bridge/br_multicast.c | 203 -- 1 file changed, 194 insertions(+), 9 deletions(-) diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c index 2136e45..9fb47f3 100644 --- a/net/bridge/br_multicast.c +++ b/net/bridge/br_multicast.c @@ -35,6 +35,10 @@ #include "br_private.h" +#define IGMP_V3_SEEN(in_dev) \ + (IPV4_DEVCONF_ALL(dev_net(in_dev->dev), FORCE_IGMP_VERSION) == 3 || \ +IN_DEV_CONF_GET((in_dev), FORCE_IGMP_VERSION) == 3) + static void br_multicast_start_querier(struct net_bridge *br, struct bridge_mcast_own_query *query); static void br_multicast_add_router(struct net_bridge *br, @@ -360,9 +364,8 @@ static int br_mdb_rehash(struct net_bridge_mdb_htable __rcu **mdbp, int max, return 0; } -static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br, - __be32 group, - u8 *igmp_type) +static struct sk_buff *br_ip4_alloc_query_v2(struct net_bridge *br, +__be32 group, u8 *igmp_type) { struct sk_buff *skb; struct igmphdr *ih; @@ -428,10 +431,82 @@ static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br, return skb; } +static struct sk_buff *br_ip4_alloc_query_v3(struct net_bridge *br, +__be32 group, u8 *igmp_type) +{ + struct sk_buff *skb; + struct igmpv3_query *ih3; + struct ethhdr *eth; + struct iphdr *iph; + + skb = netdev_alloc_skb_ip_align(br->dev, sizeof(*eth) + sizeof(*iph) + +sizeof(*ih3) + 4); + if (!skb) + goto out; + + skb->protocol = htons(ETH_P_IP); + + skb_reset_mac_header(skb); + eth = eth_hdr(skb); + + ether_addr_copy(eth->h_source, br->dev->dev_addr); + eth->h_dest[0] = 1; + eth->h_dest[1] = 0; + eth->h_dest[2] = 0x5e; + eth->h_dest[3] = 0; + eth->h_dest[4] = 0; + eth->h_dest[5] = 1; + eth->h_proto = htons(ETH_P_IP); + skb_put(skb, sizeof(*eth)); + + skb_set_network_header(skb, skb->len); + iph = ip_hdr(skb); + + iph->version = 4; + 
iph->ihl = 6; + iph->tos = 0xc0; + iph->tot_len = htons(sizeof(*iph) + sizeof(*ih3) + 4); + iph->id = 0; + iph->frag_off = htons(IP_DF); + iph->ttl = 1; + iph->protocol = IPPROTO_IGMP; + iph->saddr = br->multicast_query_use_ifaddr ? +inet_select_addr(br->dev, 0, RT_SCOPE_LINK) : 0; + iph->daddr = htonl(INADDR_ALLHOSTS_GROUP); + ((u8 *)&iph[1])[0] = IPOPT_RA; + ((u8 *)&iph[1])[1] = 4; + ((u8 *)&iph[1])[2] = 0; + ((u8 *)&iph[1])[3] = 0; + ip_send_check(iph); + skb_put(skb, 24); + + skb_set_transport_header(skb, skb->len); + ih3 = igmpv3_query_hdr(skb); + + *igmp_type = IGMP_HOST_MEMBERSHIP_QUERY; + ih3->type = IGMP_HOST_MEMBERSHIP_QUERY; + ih3->code = (group ? br->multicast_last_member_interval : + br->multicast_query_response_interval) / + (HZ / IGMP_TIMER_SCALE); + ih3->csum = 0; + ih3->group = group; + ih3->resv = 0; + ih3->suppress = 0; + ih3->qrv= 2; + ih3->qqic = br->multicast_query_interval / HZ; + ih3->nsrcs = 0; + ih3->csum = ip_compute_csum((void *)ih3, sizeof(struct igmpv3_query )); + skb_put(skb, sizeof(*ih3)); + + __skb_pull(skb, sizeof(*eth)); + +out: + return skb; +} #if IS_ENABLED(CONFIG_IPV6) -static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br, - const struct in6_addr *grp, - u8 *igmp_type) +static struct sk_buff *br_ip6_alloc_query_v1(struct net_bridge *br, +const struct in6_addr *grp, +u8 *igmp_type) { struct sk_buff *skb; struct ipv6hdr *ip6h; @@ -514,19 +589,129 @@ static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br, out: return skb; } + +static struct sk_buff *br_ip6_alloc_query_v2(struct net_bridge *br, +const struct in6_addr *grp, +u8 *igmp_type) +{ + struct sk_buff *skb; + struct ipv6hdr *ip6h; + struct mld2_query *mld2q; + struct ethhdr *eth; + u8 *hopopt; + unsigned long interval; + + skb = netdev_alloc_skb_ip_align(br->dev, sizeof(*eth) + sizeof(*ip6h) + +
[PATCH net-next 2/4] geneve: Merge ipv4 and ipv6 geneve_build_skb()
There are minimal difference in building Geneve header between ipv4 and ipv6 geneve tunnels. Following patch refactors code to unify it. Signed-off-by: Pravin B Shelar --- drivers/net/geneve.c | 100 ++- 1 file changed, 26 insertions(+), 74 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 622cf3b..9a4351c 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -630,67 +630,34 @@ static int geneve_stop(struct net_device *dev) } static void geneve_build_header(struct genevehdr *geneveh, - __be16 tun_flags, u8 vni[3], - u8 options_len, u8 *options) + const struct ip_tunnel_info *info) { geneveh->ver = GENEVE_VER; - geneveh->opt_len = options_len / 4; - geneveh->oam = !!(tun_flags & TUNNEL_OAM); - geneveh->critical = !!(tun_flags & TUNNEL_CRIT_OPT); + geneveh->opt_len = info->options_len / 4; + geneveh->oam = !!(info->key.tun_flags & TUNNEL_OAM); + geneveh->critical = !!(info->key.tun_flags & TUNNEL_CRIT_OPT); geneveh->rsvd1 = 0; - memcpy(geneveh->vni, vni, 3); + tunnel_id_to_vni(info->key.tun_id, geneveh->vni); geneveh->proto_type = htons(ETH_P_TEB); geneveh->rsvd2 = 0; - memcpy(geneveh->options, options, options_len); + ip_tunnel_info_opts_get(geneveh->options, info); } -static int geneve_build_skb(struct rtable *rt, struct sk_buff *skb, - __be16 tun_flags, u8 vni[3], u8 opt_len, u8 *opt, - bool xnet) -{ - bool udp_sum = !!(tun_flags & TUNNEL_CSUM); - struct genevehdr *gnvh; - int min_headroom; - int err; - - skb_scrub_packet(skb, xnet); - - min_headroom = LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len - + GENEVE_BASE_HLEN + opt_len + sizeof(struct iphdr); - err = skb_cow_head(skb, min_headroom); - if (unlikely(err)) - goto free_rt; - - err = udp_tunnel_handle_offloads(skb, udp_sum); - if (err) - goto free_rt; - - gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh) + opt_len); - geneve_build_header(gnvh, tun_flags, vni, opt_len, opt); - - skb_set_inner_protocol(skb, htons(ETH_P_TEB)); - return 0; - -free_rt: - 
ip_rt_put(rt); - return err; -} - -#if IS_ENABLED(CONFIG_IPV6) -static int geneve6_build_skb(struct dst_entry *dst, struct sk_buff *skb, -__be16 tun_flags, u8 vni[3], u8 opt_len, u8 *opt, -bool xnet) +static int geneve_build_skb(struct dst_entry *dst, struct sk_buff *skb, + const struct ip_tunnel_info *info, + bool xnet, int ip_hdr_len) { - bool udp_sum = !!(tun_flags & TUNNEL_CSUM); + bool udp_sum = !!(info->key.tun_flags & TUNNEL_CSUM); struct genevehdr *gnvh; int min_headroom; int err; + skb_reset_mac_header(skb); skb_scrub_packet(skb, xnet); - min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len - + GENEVE_BASE_HLEN + opt_len + sizeof(struct ipv6hdr); + min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len + + GENEVE_BASE_HLEN + info->options_len + ip_hdr_len; err = skb_cow_head(skb, min_headroom); if (unlikely(err)) goto free_dst; @@ -699,9 +666,9 @@ static int geneve6_build_skb(struct dst_entry *dst, struct sk_buff *skb, if (err) goto free_dst; - gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh) + opt_len); - geneve_build_header(gnvh, tun_flags, vni, opt_len, opt); - + gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh) + + info->options_len); + geneve_build_header(gnvh, info); skb_set_inner_protocol(skb, htons(ETH_P_TEB)); return 0; @@ -709,12 +676,11 @@ static int geneve6_build_skb(struct dst_entry *dst, struct sk_buff *skb, dst_release(dst); return err; } -#endif static struct rtable *geneve_get_v4_rt(struct sk_buff *skb, struct net_device *dev, struct flowi4 *fl4, - struct ip_tunnel_info *info) + const struct ip_tunnel_info *info) { bool use_cache = ip_tunnel_dst_cache_usable(skb, info); struct geneve_dev *geneve = netdev_priv(dev); @@ -738,7 +704,7 @@ static struct rtable *geneve_get_v4_rt(struct sk_buff *skb, } fl4->flowi4_tos = RT_TOS(tos); - dst_cache = &info->dst_cache; + dst_cache = (struct dst_cache *)&info->dst_cache; if (use_cache) { rt = dst_cache_get_ip4(dst_cache, &fl4->saddr)
[PATCH net-next 0/4] geneve: Use LWT more effectively.
The following patch series makes use of the geneve LWT code path for the
geneve netdev type of device as well. This allows us to simplify the
geneve module.

Pravin B Shelar (4):
  geneve: Unify LWT and netdev handling.
  geneve: Merge ipv4 and ipv6 geneve_build_skb()
  geneve: Remove redundant socket checks.
  geneve: Optimize geneve device lookup.

 drivers/net/geneve.c | 678 +--
 1 file changed, 273 insertions(+), 405 deletions(-)

--
1.8.3.1
[PATCH net-next 1/4] geneve: Unify LWT and netdev handling.
Current geneve implementation has two separate cases to handle. 1. netdev xmit 2. LWT xmit. In case of netdev, geneve configuration is stored in various struct geneve_dev members. For example geneve_addr, ttl, tos, label, flags, dst_cache, etc. For LWT ip_tunnel_info is passed to the device in ip_tunnel_info. Following patch uses ip_tunnel_info struct to store almost all of configuration of a geneve netdevice. This allows us to unify most of geneve driver code around ip_tunnel_info struct. This dramatically simplify geneve code, since it does not need to handle two different configuration cases. Removes duplicate code, single code path can handle either type of geneve devices. Signed-off-by: Pravin B Shelar --- drivers/net/geneve.c | 611 ++- 1 file changed, 262 insertions(+), 349 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 85a423a..622cf3b 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -45,41 +45,22 @@ struct geneve_net { static int geneve_net_id; -union geneve_addr { - struct sockaddr_in sin; - struct sockaddr_in6 sin6; - struct sockaddr sa; -}; - -static union geneve_addr geneve_remote_unspec = { .sa.sa_family = AF_UNSPEC, }; - /* Pseudo network device */ struct geneve_dev { struct hlist_node hlist; /* vni hash table */ struct net *net;/* netns for packet i/o */ struct net_device *dev;/* netdev for geneve tunnel */ + struct ip_tunnel_info info; struct geneve_sock __rcu *sock4;/* IPv4 socket used for geneve tunnel */ #if IS_ENABLED(CONFIG_IPV6) struct geneve_sock __rcu *sock6;/* IPv6 socket used for geneve tunnel */ #endif - u8 vni[3]; /* virtual network ID for tunnel */ - u8 ttl; /* TTL override */ - u8 tos; /* TOS override */ - union geneve_addr remote; /* IP address for link partner */ struct list_head next;/* geneve's per namespace list */ - __be32 label; /* IPv6 flowlabel override */ - __be16 dst_port; - bool collect_md; struct gro_cells gro_cells; - u32flags; - struct dst_cache dst_cache; + bool collect_md; + 
bool use_udp6_rx_checksums; }; -/* Geneve device flags */ -#define GENEVE_F_UDP_ZERO_CSUM_TX BIT(0) -#define GENEVE_F_UDP_ZERO_CSUM6_TX BIT(1) -#define GENEVE_F_UDP_ZERO_CSUM6_RX BIT(2) - struct geneve_sock { boolcollect_md; struct list_headlist; @@ -87,7 +68,6 @@ struct geneve_sock { struct rcu_head rcu; int refcnt; struct hlist_head vni_list[VNI_HASH_SIZE]; - u32 flags; }; static inline __u32 geneve_net_vni_hash(u8 vni[3]) @@ -109,6 +89,20 @@ static __be64 vni_to_tunnel_id(const __u8 *vni) #endif } +/* Convert 64 bit tunnel ID to 24 bit VNI. */ +static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni) +{ +#ifdef __BIG_ENDIAN + vni[0] = (__force __u8)(tun_id >> 16); + vni[1] = (__force __u8)(tun_id >> 8); + vni[2] = (__force __u8)tun_id; +#else + vni[0] = (__force __u8)((__force u64)tun_id >> 40); + vni[1] = (__force __u8)((__force u64)tun_id >> 48); + vni[2] = (__force __u8)((__force u64)tun_id >> 56); +#endif +} + static sa_family_t geneve_get_sk_family(struct geneve_sock *gs) { return gs->sock->sk->sk_family; @@ -117,6 +111,7 @@ static sa_family_t geneve_get_sk_family(struct geneve_sock *gs) static struct geneve_dev *geneve_lookup(struct geneve_sock *gs, __be32 addr, u8 vni[]) { + __be64 id = vni_to_tunnel_id(vni); struct hlist_head *vni_list_head; struct geneve_dev *geneve; __u32 hash; @@ -125,8 +120,8 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs, hash = geneve_net_vni_hash(vni); vni_list_head = &gs->vni_list[hash]; hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) { - if (!memcmp(vni, geneve->vni, sizeof(geneve->vni)) && - addr == geneve->remote.sin.sin_addr.s_addr) + if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) && + addr == geneve->info.key.u.ipv4.dst) return geneve; } return NULL; @@ -136,6 +131,7 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs, static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs, struct in6_addr addr6, u8 vni[]) { + __be64 id = vni_to_tunnel_id(vni); struct hlist_head 
*vni_list_head; struct geneve_dev *geneve; __u32 hash; @@ -144,8 +140,8 @@ static struct geneve_d
[PATCH net-next 4/4] geneve: Optimize geneve device lookup.
Rather than comparing the 64-bit tunnel id, compare the tunnel VNI,
which is a 24-bit id. This also saves the conversion from VNI to tunnel
id on each received tunnel packet.

Signed-off-by: Pravin B Shelar
---
 drivers/net/geneve.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 8e6c659..21f3270 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -103,6 +103,17 @@ static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni)
 #endif
 }

+static bool cmp_tunnel_id_and_vni(u8 *tun_id, u8 *vni)
+{
+#ifdef __BIG_ENDIAN
+	return (vni[0] == tun_id[2]) &&
+	       (vni[1] == tun_id[1]) &&
+	       (vni[2] == tun_id[0]);
+#else
+	return !memcmp(vni, &tun_id[5], 3);
+#endif
+}
+
 static sa_family_t geneve_get_sk_family(struct geneve_sock *gs)
 {
 	return gs->sock->sk->sk_family;
@@ -111,7 +122,6 @@ static sa_family_t geneve_get_sk_family(struct geneve_sock *gs)
 static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
 					__be32 addr, u8 vni[])
 {
-	__be64 id = vni_to_tunnel_id(vni);
 	struct hlist_head *vni_list_head;
 	struct geneve_dev *geneve;
 	__u32 hash;
@@ -120,7 +130,7 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
 	hash = geneve_net_vni_hash(vni);
 	vni_list_head = &gs->vni_list[hash];
 	hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
-		if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) &&
+		if (cmp_tunnel_id_and_vni((u8 *)&geneve->info.key.tun_id, vni) &&
 		    addr == geneve->info.key.u.ipv4.dst)
 			return geneve;
 	}
@@ -131,7 +141,6 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
 static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs,
 					 struct in6_addr addr6, u8 vni[])
 {
-	__be64 id = vni_to_tunnel_id(vni);
 	struct hlist_head *vni_list_head;
 	struct geneve_dev *geneve;
 	__u32 hash;
@@ -140,7 +149,7 @@ static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs,
 	hash = geneve_net_vni_hash(vni);
 	vni_list_head = &gs->vni_list[hash];
 	hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
-		if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) &&
+		if (cmp_tunnel_id_and_vni((u8 *)&geneve->info.key.tun_id, vni) &&
 		    ipv6_addr_equal(&addr6, &geneve->info.key.u.ipv6.dst))
 			return geneve;
 	}
--
1.8.3.1
[PATCH net-next 3/4] geneve: Remove redundant socket checks.
Geneve already has a check for the device socket in the route lookup
functions, so there is no need to check it again in the xmit functions.

Signed-off-by: Pravin B Shelar
---
 drivers/net/geneve.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 9a4351c..8e6c659 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -785,14 +785,11 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	struct geneve_sock *gs4 = rcu_dereference(geneve->sock4);
 	const struct ip_tunnel_key *key = &info->key;
 	struct rtable *rt;
-	int err = -EINVAL;
 	struct flowi4 fl4;
 	__u8 tos, ttl;
 	__be16 sport;
 	__be16 df;
-
-	if (!gs4)
-		return err;
+	int err;

 	rt = geneve_get_v4_rt(skb, dev, &fl4, info);
 	if (IS_ERR(rt))
@@ -828,13 +825,10 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
 	struct geneve_sock *gs6 = rcu_dereference(geneve->sock6);
 	const struct ip_tunnel_key *key = &info->key;
 	struct dst_entry *dst = NULL;
-	int err = -EINVAL;
 	struct flowi6 fl6;
 	__u8 prio, ttl;
 	__be16 sport;
-
-	if (!gs6)
-		return err;
+	int err;

 	dst = geneve_get_v6_dst(skb, dev, &fl6, info);
 	if (IS_ERR(dst))
--
1.8.3.1
Re: [PATCH net-next v3 4/7] vxlan: improve vxlan route lookup checks.
On Thu, Nov 17, 2016 at 2:17 AM, David Laight wrote:
> From: Of Jiri Benc
>> Sent: 15 November 2016 14:40
>> On Sun, 13 Nov 2016 20:43:55 -0800, Pravin B Shelar wrote:
>> > @@ -1929,8 +1951,8 @@ static void vxlan_xmit_one(struct sk_buff *skb,
>> > struct net_device *dev,
>> >  	union vxlan_addr *src;
>> >  	struct vxlan_metadata _md;
>> >  	struct vxlan_metadata *md = &_md;
>> > -	struct dst_entry *ndst = NULL;
>> >  	__be16 src_port = 0, dst_port;
>> > +	struct dst_entry *ndst = NULL;
>> >  	__be32 vni, label;
>> >  	__be16 df = 0;
>> >  	__u8 tos, ttl;
>>
>> This looks kind of arbitrary. You might want to remove this hunk or
>> merge it into patch 3.
>
> Worse than arbitrary, it adds 4 bytes of padding on 64-bit systems.
>
OK. I will send out a patch. But this is not a real issue in the vxlan
module today ;)
Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM
Hi Rafal,

On Thu, Nov 17, 2016 at 7:05 PM, Rafal Ozieblo wrote:
> -----Original Message-----
> From: Nicolas Ferre [mailto:nicolas.fe...@atmel.com]
> Sent: 17 listopada 2016 14:29
> To: Harini Katakam; Rafal Ozieblo
> Cc: harini.kata...@xilinx.com; netdev@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM
>
>> Le 17/11/2016 à 13:21, Harini Katakam a écrit :
>> > Hi Rafal,
>> >
>> > On Thu, Nov 17, 2016 at 5:20 PM, Rafal Ozieblo wrote:
>> >> Hello,
>> >> I think there could be a bug in your patch.
>> >>
>> >>> +
>> >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>> >>> +	dmacfg |= GEM_BIT(ADDR64);
>> >>> +#endif
>> >>
>> >> You enable 64 bit addressing (64b dma bus width) always when the
>> >> appropriate architecture config option is enabled.
>> >> But there are some legacy controllers which do not support that
>> >> feature. According to the Cadence hardware team:
>> >> "64 bit addressing was added in July 2013. Earlier versions do not
>> >> have it. This feature was enhanced in release August 2014 to have
>> >> separate upper address values for transmit and receive."
>> >>
>> >>> /* Bitfields in NSR */
>> >>> @@ -474,6 +479,10 @@
>> >>> struct macb_dma_desc {
>> >>>	u32 addr;
>> >>>	u32 ctrl;
>> >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>> >>> +	u32 addrh;
>> >>> +	u32 resvd;
>> >>> +#endif
>> >>> };
>> >>
>> >> It will not work for legacy hardware. The old descriptor is 2 words
>> >> wide, the new one is 4 words wide. If you enable
>> >> CONFIG_ARCH_DMA_ADDR_T_64BIT but the hardware doesn't support it at
>> >> all, you will miss every second descriptor.
>> >>
>> >
>> > True, this feature is not available in all of Cadence's IP versions.
>> > In fact, the IP version in Zynq does not support this, but the one in
>> > ZynqMP does. So, we enable the kernel config for 64 bit DMA addressing
>> > for this SoC and hence the driver picks it up. My assumption was that
>> > if the legacy IP does not support 64 bit addressing, then this DMA
>> > option wouldn't be enabled.
>> >
>> > There is a design config register in the Cadence IP which is read to
>> > check for 64 bit address support - the DMA mask is set based on that.
>> > But the addition of the two descriptor words cannot be based on this
>> > runtime check. For this reason, all the static changes were placed
>> > under this config check.
>>
>> We have quite a bunch of options in this driver to determine what the
>> real capabilities of the underlying hardware are.
>> If the HW configuration registers are not appropriate, and it seems they
>> are not, I would advise to simply use the DT compatibility string.
>>
>> Best regards,
>> --
>> Nicolas Ferre
>
> The HW configuration registers are appropriate. The issue is that this
> code doesn't use the capability bit to switch between different DMA
> descriptors (2 words vs. 4 words). The DMA descriptor size is chosen
> based on the kernel configuration, not based on the hardware
> capabilities.

The HW configuration register does give the appropriate information, but
the addition of the two address words to the macb descriptor structure is
a static change:

+static inline void macb_set_addr(struct macb_dma_desc *desc, dma_addr_t addr)
+{
+	desc->addr = (u32)addr;
+#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
+	desc->addrh = (u32)(addr >> 32);
+#endif
+

Even if the #ifdef condition here is changed to a HW config check, addr
and addrh are different fields, and the "addrh" entry has to be present
for the 64 bit descriptor case to be handled separately. Can you please
tell me how you propose changing the DMA descriptor structure from 4 to 2
or 2 to 4 words *after* reading the DCFG register?

Regards,
Harini
[PATCH] net: fec: Detect and recover receive queue hangs
This corrects a problem that appears to be similar to ERR006358. But
while ERR006358 is a race when the tx queue transitions from empty to
not empty, this problem is a race when the rx queue transitions from
full to not full.

The symptom is a receive queue that is stuck. The ENET_RDAR register
will read 0, indicating that there are no empty receive descriptors in
the receive ring. Since no additional frames can be queued, no RXF
interrupts occur.

This problem can be triggered with a 1 Gb link and about 400 Mbps of
traffic.

This patch detects this condition, sets the work_rx bit, and
reschedules the poll method.

Signed-off-by: Chris Lesiak
---
 drivers/net/ethernet/freescale/fec_main.c | 31 +++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index fea0f33..8a87037 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1588,6 +1588,34 @@ fec_enet_interrupt(int irq, void *dev_id)
 	return ret;
 }

+static inline bool
+fec_enet_recover_rxq(struct fec_enet_private *fep, u16 queue_id)
+{
+	int work_bit = (queue_id == 0) ? 2 : ((queue_id == 1) ? 0 : 1);
+
+	if (readl(fep->rx_queue[queue_id]->bd.reg_desc_active))
+		return false;
+
+	dev_notice_once(&fep->pdev->dev, "Recovered rx queue\n");
+
+	fep->work_rx |= 1 << work_bit;
+
+	return true;
+}
+
+static inline bool fec_enet_recover_rxqs(struct fec_enet_private *fep)
+{
+	unsigned int q;
+	bool ret = false;
+
+	for (q = 0; q < fep->num_rx_queues; q++) {
+		if (fec_enet_recover_rxq(fep, q))
+			ret = true;
+	}
+
+	return ret;
+}
+
 static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 {
 	struct net_device *ndev = napi->dev;
@@ -1601,6 +1629,9 @@ static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 	if (pkts < budget) {
 		napi_complete(napi);
 		writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+
+		if (fec_enet_recover_rxqs(fep) && napi_reschedule(napi))
+			writel(FEC_NAPI_IMASK, fep->hwp + FEC_IMASK);
 	}
 	return pkts;
 }
--
2.5.5
[iproute2] iproute2: fix link group name lookup
In the situation where more than one entry lives in the same hash
bucket, loop to find the correct one.

Before:
$ cat /etc/iproute2/group
0	default
256	test
$ sudo ip link set group test dummy1
$ ip link show type dummy
11: dummy0: mtu 1500 qdisc noop state DOWN mode DEFAULT group 0 qlen 1000
    link/ether 4e:3b:d3:6c:f0:e6 brd ff:ff:ff:ff:ff:ff
12: dummy1: mtu 1500 qdisc noop state DOWN mode DEFAULT group test qlen 1000
    link/ether d6:9c:a4:1f:e7:e5 brd ff:ff:ff:ff:ff:ff

After:
$ ip link show type dummy
11: dummy0: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 4e:3b:d3:6c:f0:e6 brd ff:ff:ff:ff:ff:ff
12: dummy1: mtu 1500 qdisc noop state DOWN mode DEFAULT group test qlen 1000
    link/ether d6:9c:a4:1f:e7:e5 brd ff:ff:ff:ff:ff:ff

Signed-off-by: Zhang Shengju
---
 lib/rt_names.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/rt_names.c b/lib/rt_names.c
index b665d3e..c66cb1e 100644
--- a/lib/rt_names.c
+++ b/lib/rt_names.c
@@ -559,8 +559,12 @@ const char *rtnl_group_n2a(int id, char *buf, int len)

 	for (i = 0; i < 256; i++) {
 		entry = rtnl_group_hash[i];
-		if (entry && entry->id == id)
-			return entry->name;
+
+		while (entry) {
+			if (entry->id == id)
+				return entry->name;
+			entry = entry->next;
+		}
 	}
 	snprintf(buf, len, "%d", id);
--
1.8.3.1
Re: Netperf UDP issue with connected sockets
On 11/17/2016 04:37 PM, Julian Anastasov wrote:
> On Thu, 17 Nov 2016, Rick Jones wrote:
>
>> raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
>>
>> ...
>>
>> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
>> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
>> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
>> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>> bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
>> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
>
> 	A connected socket can benefit from the dst cached in the
> socket, but not if SO_DONTROUTE is set. If we do not want to send
> packets via a gateway, this -l 1 should help, but I don't see an
> IP_TTL setsockopt in your first example with connect() to 127.0.0.1.
> Also, maybe there can be another default: if -l is used to specify
> the TTL, then SO_DONTROUTE should not be set. I.e. we should avoid
> SO_DONTROUTE, if possible.

The global -l option specifies the duration of the test. It doesn't
specify the TTL of the IP datagrams being generated by the actions of
the test.

I resisted setting SO_DONTROUTE for a number of years after the first
instance of UDP_STREAM being used in link up/down testing took out a
company's network (including security camera feeds to galactic HQ), but
at this point I'm likely to keep it in there, because there ended up
being a second such incident. It is set only for UDP_STREAM. It isn't
set for UDP_RR or TCP_*. And for UDP_STREAM it can be overridden by the
test-specific -R option.

happy benchmarking,

rick jones
Re: Netperf UDP issue with connected sockets
Hello,

On Thu, 17 Nov 2016, Rick Jones wrote:

> raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
>
> ...
>
> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0

	A connected socket can benefit from the dst cached in the
socket, but not if SO_DONTROUTE is set. If we do not want to send
packets via a gateway, this -l 1 should help, but I don't see an
IP_TTL setsockopt in your first example with connect() to 127.0.0.1.
Also, maybe there can be another default: if -l is used to specify
the TTL, then SO_DONTROUTE should not be set. I.e. we should avoid
SO_DONTROUTE, if possible.

Regards

--
Julian Anastasov
Re: Long delays creating a netns after deleting one (possibly RCU related)
> On Nov 14, 2016, at 3:09 PM, Eric Dumazet wrote:
>
> On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
>> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>>
>>> synchronize_rcu_expedited is not enough if you have multiple network
>>> devices in play.
>>>
>>> Looking at the code it comes down to this commit, and it appears there
>>> is a promise to add rcu grace period combining by Eric Dumazet.
>>>
>>> Eric, since people are hitting noticeable stalls because of the rcu
>>> grace period taking a long time, do you think you could look at this
>>> code path a bit more?
>>>
>>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
>>> Author: Eric Dumazet
>>> Date:   Wed Nov 18 06:31:03 2015 -0800
>>
>> Absolutely, I will take a look asap.
>
> The worst offender should be fixed by the following patch.
>
> busy poll needs to poll the physical device, not a virtual one...
>
> diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
> index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
> --- a/include/net/gro_cells.h
> +++ b/include/net/gro_cells.h
> @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *de
>  		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
>
>  		__skb_queue_head_init(&cell->napi_skbs);
> +
> +		set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
> +
>  		netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
>  		napi_enable(&cell->napi);
>  	}
>

This solved a ~20 second slowdown between OVS datapath unit tests for me.

Jarno
[Patch net v2] af_unix: conditionally use freezable blocking calls in read
Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read") converts schedule_timeout() to its freezable version, it was probably correct at that time, but later, commit 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets") breaks the strong requirement for a freezable sleep, according to commit 0f9548ca1091: We shouldn't try_to_freeze if locks are held. Holding a lock can cause a deadlock if the lock is later acquired in the suspend or hibernate path (e.g. by dpm). Holding a lock can also cause a deadlock in the case of cgroup_freezer if a lock is held inside a frozen cgroup that is later acquired by a process outside that group. The pipe_lock is still held at that point. So use freezable version only for the recvmsg call path, avoid impact for Android. Fixes: 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets") Reported-by: Dmitry Vyukov Cc: Tejun Heo Cc: Colin Cross Cc: Rafael J. Wysocki Cc: Hannes Frederic Sowa Signed-off-by: Cong Wang --- net/unix/af_unix.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 5d1c14a..2358f26 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -2199,7 +2199,8 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, * Sleep until more data has arrived. But check for races.. 
*/ static long unix_stream_data_wait(struct sock *sk, long timeo, - struct sk_buff *last, unsigned int last_len) + struct sk_buff *last, unsigned int last_len, + bool freezable) { struct sk_buff *tail; DEFINE_WAIT(wait); @@ -2220,7 +2221,10 @@ static long unix_stream_data_wait(struct sock *sk, long timeo, sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); unix_state_unlock(sk); - timeo = freezable_schedule_timeout(timeo); + if (freezable) + timeo = freezable_schedule_timeout(timeo); + else + timeo = schedule_timeout(timeo); unix_state_lock(sk); if (sock_flag(sk, SOCK_DEAD)) @@ -2250,7 +2254,8 @@ struct unix_stream_read_state { unsigned int splice_flags; }; -static int unix_stream_read_generic(struct unix_stream_read_state *state) +static int unix_stream_read_generic(struct unix_stream_read_state *state, + bool freezable) { struct scm_cookie scm; struct socket *sock = state->socket; @@ -2330,7 +2335,7 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state) mutex_unlock(&u->iolock); timeo = unix_stream_data_wait(sk, timeo, last, - last_len); + last_len, freezable); if (signal_pending(current)) { err = sock_intr_errno(timeo); @@ -2472,7 +2477,7 @@ static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg, .flags = flags }; - return unix_stream_read_generic(&state); + return unix_stream_read_generic(&state, true); } static int unix_stream_splice_actor(struct sk_buff *skb, @@ -2503,7 +2508,7 @@ static ssize_t unix_stream_splice_read(struct socket *sock, loff_t *ppos, flags & SPLICE_F_NONBLOCK) state.flags = MSG_DONTWAIT; - return unix_stream_read_generic(&state); + return unix_stream_read_generic(&state, false); } static int unix_shutdown(struct socket *sock, int mode) -- 2.1.0
Re: Virtio_net support vxlan encapsulation package TSO offload discuss
I worked on the same issue a few months back. I rebased my proof-of-concept code to the current net-next and posted an RFC patch a moment ago. I have zero experience on QEMU feature negotiation or extending the virtio_net spec. Since the virtio_net handling code is now all done using shared code, this should work for macvtap as well; I'm not sure if macvtap needs some control plane changes. I posted a separate patch to make af_packet also use the shared infra for virtio_net handling yesterday. My RFC patch assumes that af_packet need not be touched, i.e., assumes the af_packet patch is applied, even though the patches apply to net-next in either order. Jarno > On Nov 16, 2016, at 11:27 PM, Jason Wang wrote: > > > > On 2016年11月17日 09:31, Zhangming (James, Euler) wrote: >> On 2016年11月15日 11:28, Jason Wang wrote: >>> On 2016年11月10日 14:19, Zhangming (James, Euler) wrote: On 2016年11月09日 15:14, Jason Wang wrote: > On 2016年11月08日 19:58, Zhangming (James, Euler) wrote: >> On 2016年11月08日 19:17, Jason Wang wrote: >> >>> On 2016年11月08日 19:13, Jason Wang wrote: Cc Michael On 2016年11月08日 16:34, Zhangming (James, Euler) wrote: > In the container scenario, OVS is installed in the virtual machine, > and all the containers connected to the OVS will communicate > through VXLAN encapsulation. > > By now, virtio_net does not support TSO offload for VXLAN > encapsulated TSO packages. In this condition, the performance is > not good; the sender is the bottleneck. > > I googled this scenario, but I didn't find any information. Will > virtio_net support VXLAN encapsulation package TSO offload later? > Yes, and for both sender and receiver. > My idea is that virtio_net opens encapsulated TSO offload and > transports the encapsulation info to TUN; TUN will parse the info and > build the skb with encapsulation info. > > OVS or the kernel on the host should be modified to support this. > Using this method, the TCP performance is more than 2x as before.
> > Any advice and suggestions for this idea or new idea will be > greatly appreciated! > > Best regards, > > James zhang > Sounds very good. And we may also need features bits (VIRTIO_NET_F_GUEST|HOST_GSO_X) for this. This is in fact one of items in networking todo list. (See http://www.linux-kvm.org/page/NetworkingTodo). While at it, we'd better support not only VXLAN but also other tunnels. >>> Cc Vlad who is working on extending virtio-net headers. >>> We can start with the spec work, or if you've already had some bits you can post them as RFC for early review. Thanks >> Below is my demo code >> Virtio_net.c >> static int virtnet_probe(struct virtio_device *vdev), add belows codes: >> if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) || >> // avoid gso segment, it should be negotiation >> later, because in the demo I reuse num_buffers. >> virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) { >> dev->hw_enc_features |= NETIF_F_TSO; >> dev->hw_enc_features |= NETIF_F_ALL_CSUM; >> dev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL; >> dev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM; >> dev->hw_enc_features |= >> NETIF_F_GSO_TUNNEL_REMCSUM; >> >> dev->features |= NETIF_F_GSO_UDP_TUNNEL; >> dev->features |= NETIF_F_GSO_UDP_TUNNEL_CSUM; >> dev->features |= NETIF_F_GSO_TUNNEL_REMCSUM; >> } >> >> static int xmit_skb(struct send_queue *sq, struct sk_buff *skb), add >> below to pieces of codes >> >> if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL) >> hdr->hdr.gso_type |= VIRTIO_NET_HDR_GSO_TUNNEL; >> if (skb_shinfo(skb)->gso_type & >> SKB_GSO_UDP_TUNNEL_CSUM) >> hdr->hdr.gso_type |= >> VIRTIO_NET_HDR_GSO_TUNNEL_CSUM; >> if (skb_shinfo(skb)->gso_type & SKB_GSO_TUNNEL_REMCSUM) >> hdr->hdr.gso_type |= >> VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM; >> >> if (skb->encapsulation && skb_is_gso(skb)) { >> inner_mac_len = skb_inner_network_header(skb) - >> skb_inner_mac_header(skb); >> tnl_len = skb_inner_mac_header(skb) - >> skb_mac_header(skb); >> if ( !(inner_mac_len >> DATA_LEN_SHIFT) 
&& !(tnl_len >> >> DATA_LEN_SHIFT) ) { >> hdr->hdr.flags |= >> VIRTIO
Re: Netperf UDP issue with connected sockets
On 11/17/2016 01:44 PM, Eric Dumazet wrote: because netperf sends the same message over and over... Well, sort of, by default. That can be altered to a degree. The global -F option should cause netperf to fill the buffers in its send ring with data from the specified file. The number of buffers in the send ring can be controlled via the global -W option. The number of elements in the ring will default to one more than the initial SO_SNDBUF size divided by the send size. raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472 ... socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4 getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0 getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0 setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0 setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0 open("src/nettest_omni.c", O_RDONLY)= 5 fstat(5, {st_dev=makedev(8, 2), st_ino=82075297, st_mode=S_IFREG|0664, st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=456, st_size=230027, st_atime=2016/11/16-09:49:29, st_mtime=2016/11/16-09:49:24, st_ctime=2016/11/16-09:49:24}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3099f62000 read(5, "#ifdef HAVE_CONFIG_H\n#include rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f30994a7cb0}, NULL, 8) = 0 rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 0x7f30994a7cb0}, NULL, 8) = 0 alarm(1)= 0 sendto(4, "#ifdef HAVE_CONFIG_H\n#include {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472 sendto(4, " used\\n\\\n-m local,remote S"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472 sendto(4, " do here but clear the legacy fl"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472 sendto(4, "e before we scan the test-specif"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472 sendto(4, "\n\n\tfprintf(where,\n\t\ttput_fmt_1_l"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472 Of course, it will continue to send the same messages from the send_ring over and over instead of putting different data into the buffers each time, but if one has a sufficiently large -W option specified... happy benchmarking, rick jones
[RFC PATCH net-next] virtio_net: Support UDP Tunnel offloads.
This patch is a proof-of-concept I did a few months ago for UDP tunnel offload support in virtio_net interface, and rebased on to the current net-next. Real implementation needs to extend the virtio_net header rather than piggy-backing on existing fields. Inner MAC length (or inner network offset) also needs to be passed as a new field. Control plane (QEMU) also needs to be updated. All testing was done using Geneve, but this should work for all UDP tunnels the same. Signed-off-by: Jarno Rajahalme --- drivers/net/tun.c | 7 - drivers/net/virtio_net.c| 16 +++--- include/linux/skbuff.h | 5 include/linux/virtio_net.h | 66 ++--- include/uapi/linux/virtio_net.h | 7 + 5 files changed, 78 insertions(+), 23 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 1588469..36f3219 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -198,7 +198,9 @@ struct tun_struct { struct net_device *dev; netdev_features_t set_features; #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \ - NETIF_F_TSO6|NETIF_F_UFO) + NETIF_F_TSO6|NETIF_F_UFO|NETIF_F_GSO_UDP_TUNNEL| \ + NETIF_F_GSO_UDP_TUNNEL_CSUM| \ + NETIF_F_GSO_TUNNEL_REMCSUM) int align; int vnet_hdr_sz; @@ -1877,6 +1879,9 @@ static int set_offload(struct tun_struct *tun, unsigned long arg) if (arg & TUN_F_UFO) { features |= NETIF_F_UFO; +#if 1 + features |= NETIF_F_GSO_UDP_TUNNEL|NETIF_F_GSO_UDP_TUNNEL_CSUM|NETIF_F_GSO_TUNNEL_REMCSUM; +#endif arg &= ~TUN_F_UFO; } } diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ca5239a..eb8d887 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1789,7 +1789,10 @@ static int virtnet_probe(struct virtio_device *vdev) if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) { dev->hw_features |= NETIF_F_TSO | NETIF_F_UFO - | NETIF_F_TSO_ECN | NETIF_F_TSO6; + | NETIF_F_TSO_ECN | NETIF_F_TSO6 + | NETIF_F_GSO_UDP_TUNNEL + | NETIF_F_GSO_UDP_TUNNEL_CSUM + | NETIF_F_GSO_TUNNEL_REMCSUM; } /* Individual feature bits: what can host 
handle? */ if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_TSO4)) @@ -1798,13 +1801,18 @@ static int virtnet_probe(struct virtio_device *vdev) dev->hw_features |= NETIF_F_TSO6; if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_ECN)) dev->hw_features |= NETIF_F_TSO_ECN; - if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_UFO)) + if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_UFO)) { dev->hw_features |= NETIF_F_UFO; - +#if 1 + dev->hw_features |= NETIF_F_GSO_UDP_TUNNEL; + dev->hw_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM; + dev->hw_features |= NETIF_F_GSO_TUNNEL_REMCSUM; +#endif + } dev->features |= NETIF_F_GSO_ROBUST; if (gso) - dev->features |= dev->hw_features & (NETIF_F_ALL_TSO|NETIF_F_UFO); + dev->features |= dev->hw_features & (NETIF_F_ALL_TSO|NETIF_F_UFO|NETIF_F_GSO_UDP_TUNNEL|NETIF_F_GSO_UDP_TUNNEL_CSUM|NETIF_F_GSO_TUNNEL_REMCSUM); /* (!csum && gso) case will be fixed by register_netdev() */ } if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM)) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index a4aeeca..992ad30 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2115,6 +2115,11 @@ static inline unsigned char *skb_inner_mac_header(const struct sk_buff *skb) return skb->head + skb->inner_mac_header; } +static inline int skb_inner_mac_offset(const struct sk_buff *skb) +{ + return skb_inner_mac_header(skb) - skb->data; +} + static inline void skb_reset_inner_mac_header(struct sk_buff *skb) { skb->inner_mac_header = skb->data - skb->head; diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 1c912f8..17384d1 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -8,10 +8,19 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, const struct virtio_net_hdr *hdr, bool little_endian) { - unsigned short gso_type = 0; + u16 start = __virtio16_to_cpu(little_endian, hdr->csum_start); + + if (hdr->flags & VIRTIO_NET_HDR_F_
Re: [Patch net] af_unix: revert "af_unix: use freezable blocking calls in read"
On Thu, Nov 17, 2016 at 2:30 PM, Colin Cross wrote: > On Thu, Nov 17, 2016 at 2:09 PM, Cong Wang wrote: >> Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read") >> converts schedule_timeout() to its freezable version, it was probably >> correct at that time, but later, commit 2b514574f7e8 >> ("net: af_unix: implement splice for stream af_unix sockets") breaks >> the strong requirement for a freezable sleep, according to >> commit 0f9548ca1091: >> >> We shouldn't try_to_freeze if locks are held. Holding a lock can cause a >> deadlock if the lock is later acquired in the suspend or hibernate path >> (e.g. by dpm). Holding a lock can also cause a deadlock in the case of >> cgroup_freezer if a lock is held inside a frozen cgroup that is later >> acquired by a process outside that group. >> >> The pipe_lock is still held at that point. So just revert commit 2b15af6f95. > > On my phone 77 threads are blocked in unix_stream_recvmsg. A simple > revert of this patch will cause every one of those threads to wake up > twice per suspend cycle, which can be multiple times a second. How > about adding a freezable flag to unix_stream_read_state so > unix_stream_recvmsg can stay freezable, and unix_stream_splice_read > can be unfreezable? Fair enough, I didn't know it could have such an impact. I will send v2. Thanks.
Re: Netperf UDP issue with connected sockets
On Thu, Nov 17, 2016 at 9:34 AM, David Laight wrote: > From: Jesper Dangaard Brouer >> Sent: 17 November 2016 14:58 >> On Thu, 17 Nov 2016 06:17:38 -0800 >> Eric Dumazet wrote: >> >> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote: >> > >> > > I can see that qdisc layer does not activate xmit_more in this case. >> > > >> > >> > Sure. Not enough pressure from the sender(s). >> > >> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL >> > limit is kept at a small value. >> > >> > (BTW not all NIC have expensive doorbells) >> >> I believe this NIC mlx5 (50G edition) does. >> >> I'm seeing UDP TX of 1656017.55 pps, which is per packet: >> 2414 cycles(tsc) 603.86 ns >> >> Perf top shows (with my own udp_flood, that avoids __ip_select_ident): >> >> Samples: 56K of event 'cycles', Event count (approx.): 51613832267 >>Overhead CommandShared ObjectSymbol >> +8.92% udp_flood [kernel.vmlinux] [k] _raw_spin_lock >>- _raw_spin_lock >> + 90.78% __dev_queue_xmit >> + 7.83% dev_queue_xmit >> + 1.30% ___slab_alloc >> +5.59% udp_flood [kernel.vmlinux] [k] skb_set_owner_w >> +4.77% udp_flood [mlx5_core] [k] mlx5e_sq_xmit >> +4.09% udp_flood [kernel.vmlinux] [k] fib_table_lookup >> +4.00% swapper[mlx5_core] [k] mlx5e_poll_tx_cq >> +3.11% udp_flood [kernel.vmlinux] [k] >> __ip_route_output_key_hash >> +2.49% swapper[kernel.vmlinux] [k] __slab_free >> >> In this setup the spinlock in __dev_queue_xmit should be uncongested. >> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system. >> >> But 8.92% of the time is spend on it, which corresponds to a cost of 215 >> cycles (2414*0.0892). This cost is too high, thus something else is >> going on... I claim this mysterious extra cost is the tailptr/doorbell. > > Try adding code to ring the doorbell twice. > If this doesn't slow things down then it isn't (likely to be) responsible > for the delay you are seeing. > > David > The problem isn't only the doorbell. 
It is the doorbell plus a locked transaction on x86 that results in a long wait until the doorbell write has been completed. You could batch a bunch of doorbell writes together and it isn't an issue, but if you do something like writel(), wmb(), writel(), wmb(), then you will see the effect double, since the write memory barrier is what is forcing the delays. - Alex
Re: [Patch net] af_unix: revert "af_unix: use freezable blocking calls in read"
On Thu, Nov 17, 2016 at 2:09 PM, Cong Wang wrote: > Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read") > converts schedule_timeout() to its freezable version, it was probably > correct at that time, but later, commit 2b514574f7e8 > ("net: af_unix: implement splice for stream af_unix sockets") breaks > the strong requirement for a freezable sleep, according to > commit 0f9548ca1091: > > We shouldn't try_to_freeze if locks are held. Holding a lock can cause a > deadlock if the lock is later acquired in the suspend or hibernate path > (e.g. by dpm). Holding a lock can also cause a deadlock in the case of > cgroup_freezer if a lock is held inside a frozen cgroup that is later > acquired by a process outside that group. > > The pipe_lock is still held at that point. So just revert commit 2b15af6f95. On my phone 77 threads are blocked in unix_stream_recvmsg. A simple revert of this patch will cause every one of those threads to wake up twice per suspend cycle, which can be multiple times a second. How about adding a freezable flag to unix_stream_read_state so unix_stream_recvmsg can stay freezable, and unix_stream_splice_read can be unfreezable?
Re: net: BUG still has locks held in unix_stream_splice_read
On 17.11.2016 22:44, Cong Wang wrote: > On Sun, Oct 9, 2016 at 8:14 PM, Al Viro wrote: >> E.g. what will happen if some code does a read on an AF_UNIX socket with >> some local mutex held? AFAICS, there are exactly two callers of >> freezable_schedule_timeout() - this one and one in XFS; the latter is >> in a kernel thread where we do have good warranties about the locking >> environment, but here it's in the bleeding ->recvmsg/->splice_read and >> for those the assumption that the caller doesn't hold any locks is pretty >> strong, especially since it's not documented anywhere. >> >> What's going on there? > > Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read") > converts schedule_timeout() to its freezable version, it was probably correct > at that time, but later, commit 2b514574f7e88c8498027ee366 > ("net: af_unix: implement splice for stream af_unix sockets") breaks its > requirement for a freezable sleep: > > commit 0f9548ca10916dec166eaf74c816bded7d8e611d > > lockdep: check that no locks held at freeze time > > We shouldn't try_to_freeze if locks are held. Holding a lock can cause a > deadlock if the lock is later acquired in the suspend or hibernate path > (e.g. by dpm). Holding a lock can also cause a deadlock in the case of > cgroup_freezer if a lock is held inside a frozen cgroup that is later > acquired by a process outside that group. > > So probably we just need to revert commit 2b15af6f95 now. > > I am going to send a revert for at least -net and -stable, since Dmitry > saw this warning again. I am not an expert on freezing, but this looks about right from the freezer code's point of view. Awesome, thanks a lot for spotting this one!
Re: Debugging Ethernet issues
Florian Fainelli writes: > On 11/14/2016 11:00 AM, Måns Rullgård wrote: >> Florian Fainelli writes: >> >>> On 11/14/2016 10:20 AM, Florian Fainelli wrote: On 11/14/2016 09:59 AM, Sebastian Frias wrote: > On 11/14/2016 06:32 PM, Florian Fainelli wrote: >> On 11/14/2016 07:33 AM, Mason wrote: >>> On 14/11/2016 15:58, Mason wrote: >>> nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx vs nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control off I'm not sure whether "flow control" is relevant... >>> >>> Based on phy_print_status() >>> phydev->pause ? "rx/tx" : "off" >>> I added the following patch. >>> >>> diff --git a/drivers/net/ethernet/aurora/nb8800.c >>> b/drivers/net/ethernet/aurora/nb8800.c >>> index defc22a15f67..4e758c1cfa4e 100644 >>> --- a/drivers/net/ethernet/aurora/nb8800.c >>> +++ b/drivers/net/ethernet/aurora/nb8800.c >>> @@ -667,6 +667,8 @@ static void nb8800_link_reconfigure(struct >>> net_device *dev) >>> struct phy_device *phydev = priv->phydev; >>> int change = 0; >>> >>> + printk("%s from %pf\n", __func__, __builtin_return_address(0)); >>> + >>> if (phydev->link) { >>> if (phydev->speed != priv->speed) { >>> priv->speed = phydev->speed; >>> @@ -1274,9 +1276,9 @@ static int nb8800_hw_init(struct net_device *dev) >>> nb8800_writeb(priv, NB8800_PQ2, val & 0xff); >>> >>> /* Auto-negotiate by default */ >>> - priv->pause_aneg = true; >>> - priv->pause_rx = true; >>> - priv->pause_tx = true; >>> + priv->pause_aneg = false; >>> + priv->pause_rx = false; >>> + priv->pause_tx = false; >>> >>> nb8800_mc_init(dev, 0); >>> >>> >> >> [...] >> >> And the time difference is clearly accounted for auto-negotiation time >> here, as you can see it takes about 3 seconds for Gigabit Ethernet to >> auto-negotiate and that seems completely acceptable and normal to me >> since it is a more involved process than lower speeds. >> >>> >>> >>> OK, so now it works (by accident?) 
even on 100 Mbps switch, but it still >>> prints "flow control rx/tx"... >> >> Because your link partner advertises flow control, and that's what >> phydev->pause and phydev->asym_pause report (I know it's confusing, but >> that's what it is at the moment). > > Thanks. > Could you confirm that Mason's patch is correct and/or that it does not > have negative side-effects? The patch is neither correct nor incorrect per se; it changes the default policy of having pause frames advertised by default to not having them advertised by default. >> >> I was advised to advertise flow control by default back when I was >> working on the driver, and I think it makes sense to do so. >> This influences both your Ethernet MAC and the link partner in that the result is either flow control is enabled (before) or it is not (with the patch). There must be something amiss if you see packet loss or some kind of problem like that with an early exchange such as DHCP. Flow control tends to kick in under higher packet rates (at least, that's what you expect). > > Right now we know that Mason's patch makes this work, but we do not > understand why nor its implications. You need to understand why; right now, the way this problem is presented, you came up with a workaround, not with the root cause or the solution. What does your link partner (switch?) report, that is, what is the ethtool output when you have a link up from your nb8800 adapter? >>> >>> Actually, nb8800_pause_config() seems to be doing a complete MAC/DMA >>> reconfiguration when pause frames get auto-negotiated while the link is >>> UP, >> >> This is due to a silly hardware limitation. The register containing the >> flow control bits can't be written while rx is enabled. > > You do a DMA stop, but you don't disable the MAC receiver unlike what > nb8800_stop() does, why is not calling nb8800_mac_rx() necessary here? Oh, right.
That's because the RXC_CR register (where the flow control bits are) can't be modified when the RCR_EN bit (rx dma enable) is set. The MAC core register controlled by nb8800_mac_rx() doesn't matter here. There is no way of changing the flow control setting without briefly stopping rx dma. None of this should be relevant here though since everything should be all set up before dma is enabled the first time. >>> and it does not differentiate being called from >>> ethtool::set_pauseparam or the PHYLIB adjust_link callback (which it >>> probably should), >> >> Differentiate how? > > Differentiate in that when you are called from
Re: [PATCH v2 net-next] lan78xx: relocate mdix setting to phy driver
On 11/17/2016 02:10 PM, woojung@microchip.com wrote: > From: Woojung Huh > > Relocate mdix code to phy driver to be called at config_init(). > > Signed-off-by: Woojung Huh Reviewed-by: Florian Fainelli -- Florian
[PATCH v2 net-next] lan78xx: relocate mdix setting to phy driver
From: Woojung Huh Relocate mdix code to phy driver to be called at config_init(). Signed-off-by: Woojung Huh --- drivers/net/phy/microchip.c | 36 +- drivers/net/usb/lan78xx.c | 73 ++--- 2 files changed, 38 insertions(+), 71 deletions(-) diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c index 7c00e50..eb4db22 100644 --- a/drivers/net/phy/microchip.c +++ b/drivers/net/phy/microchip.c @@ -106,6 +106,40 @@ static int lan88xx_set_wol(struct phy_device *phydev, return 0; } +static void lan88xx_set_mdix(struct phy_device *phydev) +{ + int buf; + int val; + + switch (phydev->mdix) { + case ETH_TP_MDI: + val = LAN88XX_EXT_MODE_CTRL_MDI_; + break; + case ETH_TP_MDI_X: + val = LAN88XX_EXT_MODE_CTRL_MDI_X_; + break; + case ETH_TP_MDI_AUTO: + val = LAN88XX_EXT_MODE_CTRL_AUTO_MDIX_; + break; + default: + return; + } + + phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_1); + buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL); + buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_; + buf |= val; + phy_write(phydev, LAN88XX_EXT_MODE_CTRL, buf); + phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_0); +} + +static int lan88xx_config_aneg(struct phy_device *phydev) +{ + lan88xx_set_mdix(phydev); + + return genphy_config_aneg(phydev); +} + static struct phy_driver microchip_phy_driver[] = { { .phy_id = 0x0007c130, @@ -120,7 +154,7 @@ static struct phy_driver microchip_phy_driver[] = { .remove = lan88xx_remove, .config_init= genphy_config_init, - .config_aneg= genphy_config_aneg, + .config_aneg= lan88xx_config_aneg, .read_status= genphy_read_status, .ack_interrupt = lan88xx_phy_ack_interrupt, diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c index cf2857f..0c459e9 100644 --- a/drivers/net/usb/lan78xx.c +++ b/drivers/net/usb/lan78xx.c @@ -1471,62 +1471,12 @@ static void lan78xx_set_msglevel(struct net_device *net, u32 level) dev->msg_enable = level; } -static int lan78xx_get_mdix_status(struct net_device *net) -{ - struct phy_device 
*phydev = net->phydev; - int buf; - - phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_1); - buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL); - phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_0); - - return buf; -} - -static void lan78xx_set_mdix_status(struct net_device *net, __u8 mdix_ctrl) -{ - struct lan78xx_net *dev = netdev_priv(net); - struct phy_device *phydev = net->phydev; - int buf; - - if (mdix_ctrl == ETH_TP_MDI) { - phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, - LAN88XX_EXT_PAGE_SPACE_1); - buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL); - buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_; - phy_write(phydev, LAN88XX_EXT_MODE_CTRL, - buf | LAN88XX_EXT_MODE_CTRL_MDI_); - phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, - LAN88XX_EXT_PAGE_SPACE_0); - } else if (mdix_ctrl == ETH_TP_MDI_X) { - phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, - LAN88XX_EXT_PAGE_SPACE_1); - buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL); - buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_; - phy_write(phydev, LAN88XX_EXT_MODE_CTRL, - buf | LAN88XX_EXT_MODE_CTRL_MDI_X_); - phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, - LAN88XX_EXT_PAGE_SPACE_0); - } else if (mdix_ctrl == ETH_TP_MDI_AUTO) { - phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, - LAN88XX_EXT_PAGE_SPACE_1); - buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL); - buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_; - phy_write(phydev, LAN88XX_EXT_MODE_CTRL, - buf | LAN88XX_EXT_MODE_CTRL_AUTO_MDIX_); - phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, - LAN88XX_EXT_PAGE_SPACE_0); - } - dev->mdix_ctrl = mdix_ctrl; -} - static int lan78xx_get_link_ksettings(struct net_device *net, struct ethtool_link_ksettings *cmd) { struct lan78xx_net *dev = netdev_priv(net); struct phy_device *phydev = net->phydev; int ret; - int buf; ret = usb_autopm_get_interface(dev->intf); if (ret < 0) @@ -1534,20 +1484,6 @@ static int lan78xx_get_link_ksettings(struct net_device *net, ret = phy_ethtool_ksettings_get(phydev, cmd); - buf = lan78xx_get_mdix_status(net); - - buf
[Patch net] af_unix: revert "af_unix: use freezable blocking calls in read"
Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read") converts schedule_timeout() to its freezable version, it was probably correct at that time, but later, commit 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets") breaks the strong requirement for a freezable sleep, according to commit 0f9548ca1091: We shouldn't try_to_freeze if locks are held. Holding a lock can cause a deadlock if the lock is later acquired in the suspend or hibernate path (e.g. by dpm). Holding a lock can also cause a deadlock in the case of cgroup_freezer if a lock is held inside a frozen cgroup that is later acquired by a process outside that group. The pipe_lock is still held at that point. So just revert commit 2b15af6f95. Reported-by: Dmitry Vyukov Cc: Tejun Heo Cc: Colin Cross Cc: Rafael J. Wysocki Signed-off-by: Cong Wang --- net/unix/af_unix.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 5d1c14a..6f37f9e 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -116,7 +116,6 @@ #include #include #include -#include struct hlist_head unix_socket_table[2 * UNIX_HASH_SIZE]; EXPORT_SYMBOL_GPL(unix_socket_table); @@ -2220,7 +2219,7 @@ static long unix_stream_data_wait(struct sock *sk, long timeo, sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk); unix_state_unlock(sk); - timeo = freezable_schedule_timeout(timeo); + timeo = schedule_timeout(timeo); unix_state_lock(sk); if (sock_flag(sk, SOCK_DEAD)) -- 2.1.0
Re: [net-next PATCH v2] net: dummy: Introduce dummy virtual functions
On Mon, Nov 14, 2016 at 3:02 PM, Phil Sutter wrote: > Due to the assumption that all PFs are PCI devices, this implementation > is not completely straightforward: In order to allow for > rtnl_fill_ifinfo() to see the dummy VFs, a fake PCI parent device is > attached to the dummy netdev. This has to happen at the right spot so > register_netdevice() does not get confused. This patch abuses > ndo_fix_features callback for that. In ndo_uninit callback, the fake > parent is removed again for the same purpose. So you did some mimicking of the PCI interface. How do you let the user configure the number of VFs? Through a module param? Why? If the module param only serves to say how many VFs the device supports, maybe support the maximum possible by the PCI spec and skip the module param? > +module_param(num_vfs, int, 0); > +MODULE_PARM_DESC(num_vfs, "Number of dummy VFs per dummy device"); > + > @@ -190,6 +382,7 @@ static int __init dummy_init_one(void) > err = register_netdevice(dev_dummy); > if (err < 0) > goto err; > + > return 0; nit, remove this added blank line..
Re: [PATCH net 1/3] net: phy: realtek: add eee advertisement disable options
On Thu, 2016-11-17 at 23:30 +0530, Anand Moon wrote:
> Hi Jerome,
>
>>> How about adding callback functionality for .soft_reset to handle
>>> BMCR where we update the Auto-Negotiation for the phy,
>>> as per the datasheet of the rtl8211f.

I think BMCR is already pretty well handled by the genphy, don't you think?

>> I'm not sure I understand how this would help with our issue (and EEE).
>> Am I missing something, or is it something unrelated that you would
>> like to see happening on this driver?

[snip]

> I was just trying other phy modules to understand the feature.
> But in order to improve the throughput I tried to integrate the below
> u-boot commit.
>
> commit 3d6af748ebd831524cb22a29433e9092af469ec7
> Author: Shengzhou Liu
> Date: Thu Mar 12 18:54:59 2015 +0800
>
>     net/phy: Add support for realtek RTL8211F
>
>     RTL8211F has different registers from RTL8211E.
>     This patch adds support for RTL8211F PHY which
>     can be found on Freescale's T1023 RDB board.
>
> And added the similar functionality to .config_aneg = &rtl8211f_config_aneg,

I assume this is the commit you are referring to:
http://git.denx.de/?p=u-boot.git;a=commit;h=3d6af748ebd831524cb22a29433e9092af469ec7

I tried looking at this particular commit and the other ones in the realtek.c history of u-boot. I don't really see anything it does that linux is not already doing.

> And I seem to have better results in throughput, with periodic drops,
> but it recovers.
> -
> odroid@odroid64:~$ iperf3 -c 10.0.0.102 -p 2006 -i 1 -t 100 -V
> iperf 3.0.11
> Linux odroid64 4.9.0-rc5-xc2ml #18 SMP PREEMPT Thu Nov 17 22:56:00
[...]
> Test Complete. Summary Results:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4] 0.00-100.00 sec    10.5 GBytes  902 Mbits/sec   4   sender
> [  4] 0.00-100.00 sec    10.5 GBytes  902 Mbits/sec       receiver
> CPU Utilization: local/sender 5.6% (0.2%u/5.4%s), remote/receiver
> 17.1% (1.2%u/15.9%s)

That's the kind of throughput we have on the C2 once the link is reliable (with the EEE switch off for GbE).

> Can you confirm this at your end?
> Once confirmed, I will try to send this as a fix for this issue.

I'm testing the code to disable EEE in a generic way. I'll post the RFC for it asap.

> -Best Regards
> Anand Moon
Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
On Thu, 2016-11-17 at 19:44 +0100, André Roth wrote:
> Hi all,
>
>> I checked again the kernel at
>> https://github.com/hardkernel/linux/tree/odroidc2-3.14.y. The
>> version you mention (3.14.65-73) seems to be:
>> sha1: c75d5f4d1516cdd86d90a9d1c565bb0ed9251036 tag: jenkins-deb s905
>> kernel-73
>
> I downloaded the prebuilt image from hardkernel, I did not build the
> kernel myself. but hardkernel has an earlier release of the same kernel
> version, which works fine too. I assume they would have committed the
> change in the newer version..
>
>> In this particular version, both realtek drivers:
>> - drivers/net/phy/realtek.c
>> - drivers/amlogic/ethernet/phy/am_realtek.c
>>
>> have the hack to disable 1000M advertisement. I don't understand how
>> it is possible for you to have 1000Base-T Full Duplex with this; maybe
>> I'm missing something here?
>
> that's what I don't understand as well...
>
> the patched kernel shows the following:
>
> $ uname -a
> Linux T-06 4.9.0-rc4+ #21 SMP PREEMPT Sun Nov 13 12:07:19 UTC 2016
>
> $ sudo ethtool eth0
> Settings for eth0:
>         Supported ports: [ TP MII ]
>         Supported link modes:   10baseT/Half 10baseT/Full
>                                 100baseT/Half 100baseT/Full
>                                 1000baseT/Full
>         Supported pause frame use: No
>         Supports auto-negotiation: Yes
>         Advertised link modes:  10baseT/Half 10baseT/Full
>                                 100baseT/Half 100baseT/Full
>                                 1000baseT/Full
>         Advertised pause frame use: No
>         Advertised auto-negotiation: Yes
>         Link partner advertised link modes:  10baseT/Half 10baseT/Full
>                                              100baseT/Half 100baseT/Full
>                                              1000baseT/Half 1000baseT/Full
>         Link partner advertised pause frame use: Symmetric Receive-only
>         Link partner advertised auto-negotiation: Yes
>         Speed: 1000Mb/s
>         Duplex: Full
>         Port: MII
>         PHYAD: 0
>         Transceiver: external
>         Auto-negotiation: on
>         Supports Wake-on: ug
>         Wake-on: d
>         Current message level: 0x003f (63)
>                                drv probe link timer ifdown ifup
>         Link detected: yes
>
> $ sudo ethtool --show-eee eth0
> EEE Settings for eth0:
>         EEE status: disabled
>         Tx LPI: disabled
>         Supported EEE link modes:  100baseT/Full
>                                    1000baseT/Full
>         Advertised EEE link modes:  100baseT/Full
>         Link partner advertised EEE link modes:  100baseT/Full
>                                                  1000baseT/Full
>
> can it be that "EEE link modes" and the "normal" link modes are two
> different things ?

Exactly, they are. Hardkernel's code disables both. With hardkernel's kernel, you should not have 1000baseT/Full in "Advertised link modes" and you would have nothing reported in "Advertised EEE link modes".

>> If you did compile the kernel yourself, could you check the 2 files
>> mentioned above? Just to be sure there was no patch applied at the
>> last minute, which would not show up in the git history of
>> hardkernel?
>
> I cannot check this easily at the moment..
>
> Regards,
>
> André
Re: net: BUG still has locks held in unix_stream_splice_read
On Sun, Oct 9, 2016 at 8:14 PM, Al Viro wrote: > E.g what will happen if some code does a read on AF_UNIX socket with > some local mutex held? AFAICS, there are exactly two callers of > freezable_schedule_timeout() - this one and one in XFS; the latter is > in a kernel thread where we do have good warranties about the locking > environment, but here it's in the bleeding ->recvmsg/->splice_read and > for those assumption that caller doesn't hold any locks is pretty > strong, especially since it's not documented anywhere. > > What's going on there? Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read") converts schedule_timeout() to its freezable version, it was probably correct at that time, but later, commit 2b514574f7e88c8498027ee366 ("net: af_unix: implement splice for stream af_unix sockets") breaks its requirement for a freezable sleep: commit 0f9548ca10916dec166eaf74c816bded7d8e611d lockdep: check that no locks held at freeze time We shouldn't try_to_freeze if locks are held. Holding a lock can cause a deadlock if the lock is later acquired in the suspend or hibernate path (e.g. by dpm). Holding a lock can also cause a deadlock in the case of cgroup_freezer if a lock is held inside a frozen cgroup that is later acquired by a process outside that group. So probably we just need to revert commit 2b15af6f95 now. I am going to send a revert for at least -net and -stable, since Dmitry saw this warning again.
Re: Netperf UDP issue with connected sockets
On Thu, 2016-11-17 at 22:19 +0100, Jesper Dangaard Brouer wrote:
>
> Maybe you can share your udp flood "udpsnd" program source?

Very ugly. This is based on what I wrote when tracking the UDP v6 checksum bug (4f2e4ad56a65f3b7d64c258e373cb71e8d2499f4 net: mangle zero checksum in skb_checksum_help()), because netperf sends the same message over and over...

Use -d 2 to remove the ip_idents_reserve() overhead.

#define _GNU_SOURCE
#include <errno.h>
#include <error.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

char buffer[1400];

int main(int argc, char** argv)
{
	int fd, i;
	struct sockaddr_in6 addr;
	char *host = "2002:af6:798::1";
	int family = AF_INET6;
	int discover = -1;

	while ((i = getopt(argc, argv, "4H:d:")) != -1) {
		switch (i) {
		case 'H': host = optarg; break;
		case '4': family = AF_INET; break;
		case 'd': discover = atoi(optarg); break;
		}
	}
	fd = socket(family, SOCK_DGRAM, 0);
	if (fd < 0)
		error(1, errno, "failed to create socket");
	if (discover != -1)
		setsockopt(fd, SOL_IP, IP_MTU_DISCOVER,
			   &discover, sizeof(discover));

	memset(&addr, 0, sizeof(addr));
	if (family == AF_INET6) {
		addr.sin6_family = AF_INET6;
		addr.sin6_port = htons(9);
		inet_pton(family, host, (void *)&addr.sin6_addr.s6_addr);
	} else {
		struct sockaddr_in *in = (struct sockaddr_in *)&addr;
		in->sin_family = family;
		in->sin_port = htons(9);
		inet_pton(family, host, &in->sin_addr);
	}
	connect(fd, (struct sockaddr *)&addr,
		(family == AF_INET6) ? sizeof(addr) :
				       sizeof(struct sockaddr_in));
	memset(buffer, 1, 1400);
	for (i = 0; i < 65536; i++) {
		memcpy(buffer, &i, sizeof(i));
		send(fd, buffer, 100 + rand() % 200, 0);
	}
	return 0;
}
Re: Netperf UDP issue with connected sockets
On Thu, 17 Nov 2016 10:51:23 -0800 Eric Dumazet wrote:
> On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
>
> > The point is I can see a socket Send-Q forming, thus we do know the
> > application has something to send. Thus, a possibility for
> > non-opportunistic bulking. Allowing/implementing bulk enqueue from
> > socket layer into qdisc layer should be fairly simple (and rest of
> > xmit_more is already in place).
>
> As I said, you are fooled by TX completions.
>
> Please make sure to increase the sndbuf limits !
>
> echo 2129920 >/proc/sys/net/core/wmem_default
>
> lpaa23:~# sar -n DEV 1 10|grep eth1
> 10:49:25  eth1  7.00  9273283.00  0.61  2187214.90  0.00  0.00  0.00
> 10:49:26  eth1  1.00  9230795.00  0.06  2176787.57  0.00  0.00  1.00
> 10:49:27  eth1  2.00  9247906.00  0.17  2180915.45  0.00  0.00  0.00
> 10:49:28  eth1  3.00  9246542.00  0.23  2180790.38  0.00  0.00  1.00
> 10:49:29  eth1  1.00  9239218.00  0.06  2179044.83  0.00  0.00  0.00
> 10:49:30  eth1  3.00  9248775.00  0.23  2181257.84  0.00  0.00  1.00
> 10:49:31  eth1  4.00  9225471.00  0.65  2175772.75  0.00  0.00  0.00
> 10:49:32  eth1  2.00  9253536.00  0.33  2182666.44  0.00  0.00  1.00
> 10:49:33  eth1  1.00  9265900.00  0.06  2185598.40  0.00  0.00  0.00
> 10:49:34  eth1  1.00  6949031.00  0.06  1638889.63  0.00  0.00  1.00
> Average:  eth1  2.50  9018045.70  0.25  2126893.82  0.00  0.00  0.50
>
> lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
>      xmit_more: 2251366909
>      xmit_more: 2256011392
>
> lpaa23:~# echo 2256011392-2251366909 | bc
> 4644483

xmit_more does not happen that frequently for my setup; it does happen sometimes. And I do monitor with "ethtool -S".

 ~/git/network-testing/bin/ethtool_stats.pl --sec 2 --dev mlx5p2

Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
Ethtool(mlx5p2 ) stat: 92900913 ( 92,900,913) <= tx0_bytes /sec
Ethtool(mlx5p2 ) stat:    36073 (     36,073) <= tx0_nop /sec
Ethtool(mlx5p2 ) stat:  1548349 (  1,548,349) <= tx0_packets /sec
Ethtool(mlx5p2 ) stat:        1 (          1) <= tx0_xmit_more /sec
Ethtool(mlx5p2 ) stat: 92884899 ( 92,884,899) <= tx_bytes /sec
Ethtool(mlx5p2 ) stat: 99297696 ( 99,297,696) <= tx_bytes_phy /sec
Ethtool(mlx5p2 ) stat:  1548082 (  1,548,082) <= tx_csum_partial /sec
Ethtool(mlx5p2 ) stat:  1548082 (  1,548,082) <= tx_packets /sec
Ethtool(mlx5p2 ) stat:  1551527 (  1,551,527) <= tx_packets_phy /sec
Ethtool(mlx5p2 ) stat: 99076658 ( 99,076,658) <= tx_prio1_bytes /sec
Ethtool(mlx5p2 ) stat:  1548073 (  1,548,073) <= tx_prio1_packets /sec
Ethtool(mlx5p2 ) stat: 92936078 ( 92,936,078) <= tx_vport_unicast_bytes /sec
Ethtool(mlx5p2 ) stat:  1548934 (  1,548,934) <= tx_vport_unicast_packets /sec
Ethtool(mlx5p2 ) stat:        1 (          1) <= tx_xmit_more /sec

(after several attempts I got:)

$ ethtool -S mlx5p2|grep more; sleep 1;ethtool -S mlx5p2|grep more
     tx_xmit_more: 14048
     tx0_xmit_more: 14048
     tx_xmit_more: 14049
     tx0_xmit_more: 14049

This was with:

$ grep -H . /proc/sys/net/core/wmem_default
/proc/sys/net/core/wmem_default:2129920

>   PerfTop:   76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp], (all, 48 CPUs)
> -
>
>     11.64%  [kernel]  [k] skb_set_owner_w
>      6.21%  [kernel]  [k] queued_spin_lock_slowpath
>      4.76%  [kernel]  [k] _raw_spin_lock
>      4.40%  [kernel]  [k] __ip_make_skb
>      3.10%  [kernel]  [k] sock_wfree
>      2.87%  [kernel]  [k] ipt_do_table
>      2.76%  [kernel]  [k] fq_dequeue
>      2.71%  [kernel]  [k] mlx4_en_xmit
>      2.50%  [kernel]  [k] __dev_queue_xmit
>      2.29%  [kernel]  [k] __ip_append_data.isra.40
>      2.28%  [kernel]  [k] udp_sendmsg
>      2.01%  [kernel]  [k] __alloc_skb
>      1.90%  [kernel]  [k] napi_consume_skb
>      1.63%  [kernel]  [k] udp_send_skb
>      1.62%  [kernel]  [k] skb_release_data
>      1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath
>      1.56%  [kernel]  [k] dev_hard_start_xmit
>      1.55%  udpsnd    [.] __libc_send
>      1.48%  [kernel]  [k] netif_skb_features
>      1.42%  [kernel]  [k] __qdisc_run
>      1.35%  [kernel]  [k] sk_dst_check
>
Re: [PATCH net 1/1] net sched filters: pass netlink message flags in event notification
On Wed, Nov 16, 2016 at 2:16 PM, Roman Mashak wrote:
> Userland client should be able to read an event, and reflect it back to
> the kernel, therefore it needs to extract complete set of netlink flags.
>
> For example, this will allow "tc monitor" to distinguish Add and Replace
> operations.
>
> Signed-off-by: Roman Mashak
> Signed-off-by: Jamal Hadi Salim
> ---
>  net/sched/cls_api.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 2b2a797..8e93d4a 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -112,7 +112,7 @@ static void tfilter_notify_chain(struct net *net, struct sk_buff *oskb,
>
>  	for (it_chain = chain; (tp = rtnl_dereference(*it_chain)) != NULL;
>  	     it_chain = &tp->next)
> -		tfilter_notify(net, oskb, n, tp, 0, event, false);
> +		tfilter_notify(net, oskb, n, tp, n->nlmsg_flags, event,
> 			       false);

I must be missing something: why does it make sense to pass n->nlmsg_flags as 'fh' to tfilter_notify()?

>  }
>
>  /* Select new prio value from the range, managed by kernel. */
> @@ -430,7 +430,8 @@ static int tfilter_notify(struct net *net, struct sk_buff *oskb,
>  	if (!skb)
>  		return -ENOBUFS;
>
> -	if (tcf_fill_node(net, skb, tp, fh, portid, n->nlmsg_seq, 0, event) <= 0) {
> +	if (tcf_fill_node(net, skb, tp, fh, portid, n->nlmsg_seq,
> +			  n->nlmsg_flags, event) <= 0) {

This part makes sense.
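For reference, the reason the patch wants the original flags echoed back: a userspace monitor can only tell an add from a replace by looking at nlmsg_flags. A minimal sketch of such a classifier follows; the flag values are copied from linux/netlink.h so the snippet is self-contained, but the function classify_tfilter_event() and its category names are hypothetical, not part of tc or the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flag values as defined in <linux/netlink.h>, duplicated here so the
 * sketch compiles standalone. */
#define NLM_F_REPLACE 0x100
#define NLM_F_EXCL    0x200
#define NLM_F_CREATE  0x400

/* Hypothetical helper: classify a filter notification the way a
 * monitor such as "tc monitor" could, once the kernel passes the
 * original nlmsg_flags through in its event messages. */
static const char *classify_tfilter_event(uint16_t nlmsg_flags)
{
	if ((nlmsg_flags & NLM_F_CREATE) && (nlmsg_flags & NLM_F_EXCL))
		return "add";             /* create, fail if it already exists */
	if (nlmsg_flags & NLM_F_REPLACE)
		return "replace";         /* replace an existing filter */
	if (nlmsg_flags & NLM_F_CREATE)
		return "create-or-update";
	return "change";
}
```

With the pre-patch behavior (flags always 0 in notifications), every event would fall into the last bucket, which is exactly the ambiguity the patch removes.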
Re: [PATCH net-next v2 2/4] bpf, mlx5: fix various refcount issues in mlx5e_xdp_set
On 11/17/2016 10:48 AM, Saeed Mahameed wrote:

On Wed, Nov 16, 2016 at 5:26 PM, Daniel Borkmann wrote:

On 11/16/2016 03:30 PM, Saeed Mahameed wrote:

On Wed, Nov 16, 2016 at 3:54 PM, Daniel Borkmann wrote:

On 11/16/2016 01:25 PM, Saeed Mahameed wrote:

On Wed, Nov 16, 2016 at 2:04 AM, Daniel Borkmann wrote:

There are multiple issues in mlx5e_xdp_set():

1) The batched bpf_prog_add() is currently not checked for errors! When
   doing so, it should be done at an earlier point in time to make sure
   that we cannot fail anymore at the time we want to set the program
   for each channel. This only means that we have to undo the
   bpf_prog_add() in case we return due to required reset.

2) When swapping the priv->xdp_prog, then no extra reference count must
   be taken since we got that from call path via dev_change_xdp_fd()
   already. Otherwise, we'd never be able to free the program. Also,
   bpf_prog_add() without checking the return code could fail.

Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 25 ++-
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ab0c336..cf26672 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3142,6 +3142,17 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 		goto unlock;
 	}

+	if (prog) {
+		/* num_channels is invariant here, so we can take the
+		 * batched reference right upfront.
+		 */
+		prog = bpf_prog_add(prog, priv->params.num_channels);
+		if (IS_ERR(prog)) {
+			err = PTR_ERR(prog);
+			goto unlock;
+		}
+	}
+

With this approach you will end up taking a ref count twice per RQ on the first time xdp_set is called, i.e. (old_prog = NULL, prog != NULL).
One ref will be taken per RQ/Channel from the above code, and since reset will be TRUE mlx5e_open_locked will be called and another ref count will be taken on mlx5e_create_rq. The issue here is that we have two places for ref count accounting, xdp_set and mlx5e_create_rq. Having ref-count updates in mlx5e_create_rq is essential for num_channels configuration changes (mlx5e_set_ringparam). Our previous approach made sure that only one path will do the ref counting (mlx5e_open_locked vs. mlx5e_xdp_set batch ref update when reset == false). That is correct, for a short time bpf_prog_add() was charged also when we reset. When we reset, we will then jump to unlock_put and safely undo this since we took ref from mlx5e_create_rq() already in that case, and return successfully. That was intentional, so that eventually we end up just taking a /single/ ref per RQ when we exit mlx5e_xdp_set(), but more below ... [...] 2. Keep the current approach and make sure to not update the ref count twice, you can batch update only if (!reset && was_open) otherwise you can rely on mlx5e_open_locked to take the num_channels ref count for you. Personally I prefer option 2, since we will keep the current logic which will allow configuration updates such as (mlx5e_set_ringparam) to not worry about ref counting since it will be done in the reset flow. ... agree on keeping current approach. I actually like the idea, so we'd end up with this simpler version for the batched ref then. Great :). So let's do the batched update only if we are not going to reset (we already know that in advance), instead of the current patch where you batch update unconditionally and then unlock_put in case reset was performed (which is just redundant and confusing). Please note that if open_locked fails you need to goto unlock_put. 
Sorry, I'm not quite sure I follow you here; are you /now/ commenting on the original patch or on the already updated diff I did from my very last email, that is, http://patchwork.ozlabs.org/patch/695564/ ?

I am commenting on this patch "V2 2/4". This patch will take the batched ref count unconditionally and release the extra refs ("unlock_put") if a reset was performed. I am asking to take the batched ref only if we are not going to reset, which we know in advance, to save the extra "unlock_put".

Okay, sure, it's clear and we discussed that already earlier, and my response back then was the diff in the /end/ of this email here:
https://www.spinics.net/lists/netdev/msg404709.html

Hence my confusion why we were back on the original patch after that. ;) Anyway, I'll get the updated set with the suggestion out tomorrow, thanks for the review!

Note, your "bpf_prog_add(prog, 1) // one ref for the device." is not needed since we already got that one through dev_change_xdp_fd() as mentioned. If
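The refcount pattern being negotiated above (take the per-channel references in one batch up front so nothing can fail later, and drop the batch again when the reset path takes its own references) can be illustrated with a toy model. Everything here is a stand-in: toy_prog_add()/toy_prog_put() mimic the shape of the kernel's bpf_prog_add()/bpf_prog_put() but are not them, and xdp_set_sketch() is an invented name, not mlx5e code.

```c
#include <assert.h>

/* Toy stand-in for struct bpf_prog with its reference count. */
struct toy_prog { int refcnt; };

/* Mimics bpf_prog_add(): takes i references, may fail (the real one
 * fails when the refcount would overflow). */
static int toy_prog_add(struct toy_prog *p, int i)
{
	if (p->refcnt + i < 0)          /* stand-in for overflow failure */
		return -1;
	p->refcnt += i;
	return 0;
}

/* Mimics a batched bpf_prog_put(). */
static void toy_prog_put(struct toy_prog *p, int i) { p->refcnt -= i; }

/* The pattern under discussion: charge the batch first so the setup
 * cannot fail halfway through, then undo it when the reset path will
 * take one reference per RQ on its own. */
static int xdp_set_sketch(struct toy_prog *prog, int num_channels, int need_reset)
{
	if (toy_prog_add(prog, num_channels))
		return -1;              /* fail early, nothing to undo */
	if (need_reset) {
		/* reset re-opens the channels, and that path charges
		 * one ref per RQ itself; drop our batch again */
		toy_prog_put(prog, num_channels);
		return 0;
	}
	/* non-reset path: hand one pre-charged ref to each channel */
	return 0;
}
```

Saeed's suggestion amounts to hoisting the need_reset check before toy_prog_add(), so the batch is only charged when it will actually be consumed.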
[PATCH net-next v4 2/5] net: fsl: Allow most drivers to be built with COMPILE_TEST
There are only a handful of Freescale Ethernet drivers that don't actually build with COMPILE_TEST: * FEC, for which we would need to define a default register layout if no supported architecture is defined * UCC_GETH which depends on PowerPC cpm.h header (which could be moved to a generic location) * GIANFAR needs to depend on HAS_DMA to fix linking failures on some architectures (like m32r) We need to fix an unmet dependency to get there though: warning: (FSL_XGMAC_MDIO) selects OF_MDIO which has unmet direct dependencies (OF && PHYLIB) which would result in CONFIG_OF_MDIO=[ym] without CONFIG_OF to be set. Signed-off-by: Florian Fainelli --- drivers/net/ethernet/freescale/Kconfig | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/freescale/Kconfig b/drivers/net/ethernet/freescale/Kconfig index aa3f615886b4..0d415516b577 100644 --- a/drivers/net/ethernet/freescale/Kconfig +++ b/drivers/net/ethernet/freescale/Kconfig @@ -8,7 +8,7 @@ config NET_VENDOR_FREESCALE depends on FSL_SOC || QUICC_ENGINE || CPM1 || CPM2 || PPC_MPC512x || \ M523x || M527x || M5272 || M528x || M520x || M532x || \ ARCH_MXC || ARCH_MXS || (PPC_MPC52xx && PPC_BESTCOMM) || \ - ARCH_LAYERSCAPE + ARCH_LAYERSCAPE || COMPILE_TEST ---help--- If you have a network (Ethernet) card belonging to this class, say Y. @@ -65,6 +65,7 @@ config FSL_PQ_MDIO config FSL_XGMAC_MDIO tristate "Freescale XGMAC MDIO" select PHYLIB + depends on OF select OF_MDIO ---help--- This driver supports the MDIO bus on the Fman 10G Ethernet MACs, and @@ -85,6 +86,7 @@ config UGETH_TX_ON_DEMAND config GIANFAR tristate "Gianfar Ethernet" + depends on HAS_DMA select FSL_PQ_MDIO select PHYLIB select CRC32 -- 2.9.3
[PATCH net-next v4 3/5] bus: mvebu-bus: Provide inline stub for mvebu_mbus_get_dram_win_info
In preparation for allowing CONFIG_MVNETA_BM to build with COMPILE_TEST, provide an inline stub for mvebu_mbus_get_dram_win_info(). Signed-off-by: Florian Fainelli --- include/linux/mbus.h | 8 1 file changed, 8 insertions(+) diff --git a/include/linux/mbus.h b/include/linux/mbus.h index 2931aa43dab1..0d3f14fd2621 100644 --- a/include/linux/mbus.h +++ b/include/linux/mbus.h @@ -82,6 +82,7 @@ static inline int mvebu_mbus_get_io_win_info(phys_addr_t phyaddr, u32 *size, } #endif +#ifdef CONFIG_MVEBU_MBUS int mvebu_mbus_save_cpu_target(u32 __iomem *store_addr); void mvebu_mbus_get_pcie_mem_aperture(struct resource *res); void mvebu_mbus_get_pcie_io_aperture(struct resource *res); @@ -97,5 +98,12 @@ int mvebu_mbus_init(const char *soc, phys_addr_t mbus_phys_base, size_t mbus_size, phys_addr_t sdram_phys_base, size_t sdram_size); int mvebu_mbus_dt_init(bool is_coherent); +#else +static inline int mvebu_mbus_get_dram_win_info(phys_addr_t phyaddr, u8 *target, + u8 *attr) +{ + return -EINVAL; +} +#endif /* CONFIG_MVEBU_MBUS */ #endif /* __LINUX_MBUS_H */ -- 2.9.3
[PATCH net-next v4 0/5] net: Enable COMPILE_TEST for Marvell & Freescale drivers
Hi all, This patch series allows building the Freescale and Marvell Ethernet network drivers with COMPILE_TEST. Changes in v4: - add proper HAS_DMA to fix build errors on m32r - provide an inline stub for mvebu_mbus_get_dram_win_info - added an additional patch to fix build errors with mv88e6xxx on m32r Changes in v3: - reorder patches to avoid introducing a build warning between commits Changes in v2: - rename register define clash when building for i386 (spotted by LKP) Florian Fainelli (5): net: gianfar_ptp: Rename FS bit to FIPERST net: fsl: Allow most drivers to be built with COMPILE_TEST bus: mvebu-bus: Provide inline stub for mvebu_mbus_get_dram_win_info net: marvell: Allow drivers to be built with COMPILE_TEST net: dsa: mv88e6xxx: Select IRQ_DOMAIN drivers/net/dsa/mv88e6xxx/Kconfig| 1 + drivers/net/ethernet/freescale/Kconfig | 4 +++- drivers/net/ethernet/freescale/gianfar_ptp.c | 4 ++-- drivers/net/ethernet/marvell/Kconfig | 11 +++ include/linux/mbus.h | 8 5 files changed, 21 insertions(+), 7 deletions(-) -- 2.9.3
[PATCH net-next v4 4/5] net: marvell: Allow drivers to be built with COMPILE_TEST
All Marvell Ethernet drivers actually build fine with COMPILE_TEST with a few warnings. We need to add a few HAS_DMA dependencies to fix linking failures on problematic architectures like m32r. Signed-off-by: Florian Fainelli --- drivers/net/ethernet/marvell/Kconfig | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig index 2664827ddecd..d74d4e6f0b34 100644 --- a/drivers/net/ethernet/marvell/Kconfig +++ b/drivers/net/ethernet/marvell/Kconfig @@ -5,7 +5,7 @@ config NET_VENDOR_MARVELL bool "Marvell devices" default y - depends on PCI || CPU_PXA168 || MV64X60 || PPC32 || PLAT_ORION || INET + depends on PCI || CPU_PXA168 || MV64X60 || PPC32 || PLAT_ORION || INET || COMPILE_TEST ---help--- If you have a network (Ethernet) card belonging to this class, say Y. @@ -18,7 +18,8 @@ if NET_VENDOR_MARVELL config MV643XX_ETH tristate "Marvell Discovery (643XX) and Orion ethernet support" - depends on (MV64X60 || PPC32 || PLAT_ORION) && INET + depends on (MV64X60 || PPC32 || PLAT_ORION || COMPILE_TEST) && INET + depends on HAS_DMA select PHYLIB select MVMDIO ---help--- @@ -55,7 +56,8 @@ config MVNETA_BM_ENABLE config MVNETA tristate "Marvell Armada 370/38x/XP network interface support" - depends on PLAT_ORION + depends on PLAT_ORION || COMPILE_TEST + depends on HAS_DMA select MVMDIO select FIXED_PHY ---help--- @@ -77,7 +79,8 @@ config MVNETA_BM config MVPP2 tristate "Marvell Armada 375 network interface support" - depends on MACH_ARMADA_375 + depends on MACH_ARMADA_375 || COMPILE_TEST + depends on HAS_DMA select MVMDIO ---help--- This driver supports the network interface units in the -- 2.9.3
[PATCH net-next v4 5/5] net: dsa: mv88e6xxx: Select IRQ_DOMAIN
Some architectures may not define IRQ_DOMAIN (like m32r), fixes undefined references to IRQ_DOMAIN functions. Fixes: dc30c35be720 ("net: dsa: mv88e6xxx: Implement interrupt support.") Signed-off-by: Florian Fainelli --- drivers/net/dsa/mv88e6xxx/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/dsa/mv88e6xxx/Kconfig b/drivers/net/dsa/mv88e6xxx/Kconfig index 486668813e15..1aaa7a95ebc4 100644 --- a/drivers/net/dsa/mv88e6xxx/Kconfig +++ b/drivers/net/dsa/mv88e6xxx/Kconfig @@ -1,6 +1,7 @@ config NET_DSA_MV88E6XXX tristate "Marvell 88E6xxx Ethernet switch fabric support" depends on NET_DSA + select IRQ_DOMAIN select NET_DSA_TAG_EDSA select NET_DSA_TAG_DSA help -- 2.9.3
[PATCH net-next v4 1/5] net: gianfar_ptp: Rename FS bit to FIPERST
FS is a global symbol used by the x86 32-bit architecture, fixes builds re-definitions: >> drivers/net/ethernet/freescale/gianfar_ptp.c:75:0: warning: "FS" >> redefined #define FS(1<<28) /* FIPER start indication */ In file included from arch/x86/include/uapi/asm/ptrace.h:5:0, from arch/x86/include/asm/ptrace.h:6, from arch/x86/include/asm/math_emu.h:4, from arch/x86/include/asm/processor.h:11, from include/linux/mutex.h:19, from include/linux/kernfs.h:13, from include/linux/sysfs.h:15, from include/linux/kobject.h:21, from include/linux/device.h:17, from drivers/net/ethernet/freescale/gianfar_ptp.c:23: arch/x86/include/uapi/asm/ptrace-abi.h:15:0: note: this is the location of the previous definition #define FS 9 Reported-by: kbuild test robot Signed-off-by: Florian Fainelli --- drivers/net/ethernet/freescale/gianfar_ptp.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/freescale/gianfar_ptp.c b/drivers/net/ethernet/freescale/gianfar_ptp.c index 57798814160d..3e8d1fffe34e 100644 --- a/drivers/net/ethernet/freescale/gianfar_ptp.c +++ b/drivers/net/ethernet/freescale/gianfar_ptp.c @@ -72,7 +72,7 @@ struct gianfar_ptp_registers { /* Bit definitions for the TMR_CTRL register */ #define ALM1P (1<<31) /* Alarm1 output polarity */ #define ALM2P (1<<30) /* Alarm2 output polarity */ -#define FS(1<<28) /* FIPER start indication */ +#define FIPERST (1<<28) /* FIPER start indication */ #define PP1L (1<<27) /* Fiper1 pulse loopback mode enabled. */ #define PP2L (1<<26) /* Fiper2 pulse loopback mode enabled. */ #define TCLK_PERIOD_SHIFT (16) /* 1588 timer reference clock period. 
*/ @@ -502,7 +502,7 @@ static int gianfar_ptp_probe(struct platform_device *dev) gfar_write(&etsects->regs->tmr_fiper1, etsects->tmr_fiper1); gfar_write(&etsects->regs->tmr_fiper2, etsects->tmr_fiper2); set_alarm(etsects); - gfar_write(&etsects->regs->tmr_ctrl, tmr_ctrl|FS|RTPE|TE|FRD); + gfar_write(&etsects->regs->tmr_ctrl, tmr_ctrl|FIPERST|RTPE|TE|FRD); spin_unlock_irqrestore(&etsects->lock, flags); -- 2.9.3
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
On 17.11.2016 19:32, Ido Schimmel wrote:
> On Thu, Nov 17, 2016 at 06:20:39PM +0100, Hannes Frederic Sowa wrote:
>> On 17.11.2016 17:45, David Miller wrote:
>>> From: Hannes Frederic Sowa
>>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>>
>>>> The other way is the journal idea I had, which uses an rb-tree with
>>>> timestamps as keys (can be lamport timestamps). You insert into the
>>>> tree until the dump is finished and use it as queue later to shuffle
>>>> stuff into the hardware.
>>>
>>> If you have this "place" where pending inserts are stored, you have
>>> a policy decision to make.
>>>
>>> First of all what do other lookups see when there are pending entries?
>>
>> I think this is a problem with the current approach already, as the
>> delayed work queue already postpones the insert for an undecidable
>> amount of time (and reorders depending on which CPU the entry was
>> inserted and the fib notifier was called).
>
> The delayed work queue only postpones the insert into the listener's
> table, but the entries will be present in the kernel's table as usual.
> Therefore, other lookups made on the kernel's table will see the pending
> entries.
>
> Note that both listeners that currently call the dump API do that during
> their init, before any lookups can be made on their tables. If you think
> it's critical, then we can make sure the workqueues are flushed before
> the listeners register their netdevs and effectively expose their tables
> to lookups.

I do see routing anyway as a best-effort. ;) That means that before too long the hardware needs to be fully synchronized with the software path and either have all correct entries, or abort must have been called.

> I'm looking into the reordering issues you mentioned. I believe it's a
> valid point.

Thanks!

>> For user space queries we would still query the in-kernel table.
>
> Right. No changes here.
>>> If, once inserted into the pending queue, you return success to the
>>> inserting entity, then you must make those pending entries visible
>>> to lookups.
>>
>> I think this same problem is in this patch set already. The way I
>> understood it is that if a problem during insert emerges, the driver
>> calls abort and every packet will be sent to user space, as the routing
>> table cannot be offloaded, and it won't try it again, Ido?
>
> First of all, this abort mechanism is already in place and isn't part of
> this patchset. Secondly, why is this a problem? The all-or-nothing
> approach is a hard requirement, and the current implementation is
> infinitely better than the previous approach, in which the kernel's
> tables were flushed upon route insertion failure. It rendered the system
> unusable. The current implementation of the abort mechanism keeps the
> system usable, but with reduced performance.

Yes, I argued below that I am totally fine with this approach.

>> Probably this is an artifact of the mellanox implementation and we can
>> implement this differently for other cards, but the scheme of aborting
>> everything if a modification doesn't work seems to be fundamental (I
>> think we require the all-or-nothing approach for now).
>
> Yes, it's a hard requirement for all implementations. mlxsw implements
> it by evicting all routes from its tables and inserting a default route
> that traps packets to CPU.

Correct, I don't see how fib offloading can do anything else, besides semantically looking at the routing table and figuring out where to insert "exceptions" for subtrees while keeping most packets flowing over the hw directly. But that is for another time... ;)

>>> If you block the inserting entity, well that doesn't make a lot of
>>> sense. If blocking is a workable solution, then we can just block the
>>> entire insert while this FIB dump to the device driver is happening.
>>
>> I don't think we should really block (as in kernel-block) at any time.
>> I was suggesting something like:
>>
>> rtnl_lock();
>> synchronize_rcu_expedited(); // barrier: all routing modifications are
>>                              // stable and no new ones can be added
>>                              // due to rtnl_lock
>> register_notifier();         // notifier adds entries also into journal
>> rtnl_unlock();
>> walk_fib_tree_rcu_into_journal();
>> // walk finished
>> start_syncing_journal_to_hw(); // if new entries show up we sync them
>>                                // asap after this point
>>
>> The journal would need a spin lock to protect its internal state and
>> order events correctly.
>>
>>> But I am pretty sure the idea of blocking modifications for so long
>>> was considered undesirable.
>>
>> Yes, this is also still my opinion.
>
> It was indeed rejected :)
> https://marc.info/?l=linux-netdev&m=147794848224884&w=2

Bye,
Hannes
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
On 17.11.2016 19:16, David Miller wrote:
> From: Hannes Frederic Sowa
> Date: Thu, 17 Nov 2016 18:20:39 +0100
>
>> Hi,
>>
>> On 17.11.2016 17:45, David Miller wrote:
>>> From: Hannes Frederic Sowa
>>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>>
>>>> The other way is the journal idea I had, which uses an rb-tree with
>>>> timestamps as keys (can be lamport timestamps). You insert into the
>>>> tree until the dump is finished and use it as queue later to shuffle
>>>> stuff into the hardware.
>>>
>>> If you have this "place" where pending inserts are stored, you have
>>> a policy decision to make.
>>>
>>> First of all what do other lookups see when there are pending entries?
>>
>> I think this is a problem with the current approach already, as the
>> delayed work queue already postpones the insert for an undecidable
>> amount of time (and reorders depending on which CPU the entry was
>> inserted and the fib notifier was called).
>>
>> For user space queries we would still query the in-kernel table.
>
> Ok, I think I might misunderstand something.
>
> What is going into this journal exactly? The actual full software and
> hardware insert operation, or just the notification to the hardware
> device driver notifiers?

The journal is only used as a timely ordered queue for updating the hardware in the correct order. The enqueue side is the fib notifier only. If no fib notifier is registered, we don't use this code at all (and also don't hit the lock protecting this journal in the fib insertion/deletion path; the fast in-kernel path is untouched, and otherwise it is just a spin_lock already taken under rtnl_lock in the slow path).

The fib notifier enqueues the entry with a timestamp into this journal and can also merge entries while they are in the queue. E.g. if we get a delete from the fib notifier but the rcu walk indicated an addition of the entry, we can merge the two at this point and, depending on the timestamps, remove the entry or drop the deletion event.
We start dequeueing the fib entries into the hardware as soon as the rcu dump is finished; at this point the queue is up to date with all events. New events can be added to the journal (with appropriate locking) during this time: since the queue was once in a properly synced state, we stay properly synchronized. After the dump we keep up with the queue in steady state, so syncing happens ASAP; maybe we can even drop the journal then.

Something like the described queue is implemented here (I haven't checked whether it exactly matches this specification; it certainly provides more features):

https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.h
https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.c

For this to work the config path needs to add timestamps to the fib_infos or fib_aliases.

> The "lookup" I'm mostly concerned with is the fast path where the
> packets being processed actually look up a route.

This doesn't change at all. All of this code will be hidden in a library that gets attached to the fib notifier, which is in the configuration code path.

> I do not think we can return success on the insert to the user yet
> have the route lookup dataplane not return that route on a lookup.

We don't change the kernel fast path at all. If we add/delete a route in software and hardware, the kernel indicates success as soon as the software has the entry added; the entry also gets queued up in the journal. The journal is processed lazily, and if an error happens during that (e.g. the hardware signals that its table is full), abort is called and all packets go to user space ASAP. User space will always see the route exactly as it was added in the first place, and after the driver calls abort the packets are also processed accordingly.

I can imagine this can get very complicated. David's approach with a counter to check for modifications and a limited number of retries probably works too, especially because the hardware will probably be initialized before the routing daemons start up and will hopefully stay synced up all the time.
So maybe this is over-engineered, but I have no idea how long the hardware needs to sync up e.g. a full IPv4 routing table (if that is actually the current goal here).

Bye,
Hannes
Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM
Hi Rafal,

On Thu, Nov 17, 2016 at 5:20 PM, Rafal Ozieblo wrote:
> Hello,
> I think there could be a bug in your patch.
>
>> +
>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>> +	dmacfg |= GEM_BIT(ADDR64);
>> +#endif
>
> You enable 64 bit addressing (64b dma bus width) always when the
> appropriate architecture config option is enabled.
> But there are some legacy controllers which do not support that feature.
> According to the Cadence hardware team:
> "64 bit addressing was added in July 2013. Earlier versions do not have it.
> This feature was enhanced in the August 2014 release to have separate upper
> address values for transmit and receive."
>
>> /* Bitfields in NSR */
>> @@ -474,6 +479,10 @@
>>  struct macb_dma_desc {
>>  	u32	addr;
>>  	u32	ctrl;
>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>> +	u32	addrh;
>> +	u32	resvd;
>> +#endif
>>  };
>
> It will not work for legacy hardware. The old descriptor is 2 words wide,
> the new one is 4 words wide.
> If you enable CONFIG_ARCH_DMA_ADDR_T_64BIT but the hardware doesn't support
> it at all, you will miss every second descriptor.

True, this feature is not available in all Cadence IP versions. In fact, the IP version in Zynq does not support this, but the one in ZynqMP does. So, we enable the kernel config for 64 bit DMA addressing for this SoC and hence the driver picks it up. My assumption was that if the legacy IP does not support 64 bit addressing, then this DMA option wouldn't be enabled.

There is a design config register in the Cadence IP which is read to check for 64 bit address support, and the DMA mask is set based on that. But the addition of the two descriptor words cannot be based on this runtime check. For this reason, all the static changes were placed under this config check.

Regards,
Harini
Re: Netperf UDP issue with connected sockets
On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
> The point is I can see a socket Send-Q forming, thus we do know the
> application have something to send. Thus, and possibility for
> non-opportunistic bulking. Allowing/implementing bulk enqueue from
> socket layer into qdisc layer, should be fairly simple (and rest of
> xmit_more is already in place).

As I said, you are fooled by TX completions.

Please make sure to increase the sndbuf limits !

echo 2129920 >/proc/sys/net/core/wmem_default

lpaa23:~# sar -n DEV 1 10|grep eth1
10:49:25  eth1  7.00  9273283.00  0.61  2187214.90  0.00  0.00  0.00
10:49:26  eth1  1.00  9230795.00  0.06  2176787.57  0.00  0.00  1.00
10:49:27  eth1  2.00  9247906.00  0.17  2180915.45  0.00  0.00  0.00
10:49:28  eth1  3.00  9246542.00  0.23  2180790.38  0.00  0.00  1.00
10:49:29  eth1  1.00  9239218.00  0.06  2179044.83  0.00  0.00  0.00
10:49:30  eth1  3.00  9248775.00  0.23  2181257.84  0.00  0.00  1.00
10:49:31  eth1  4.00  9225471.00  0.65  2175772.75  0.00  0.00  0.00
10:49:32  eth1  2.00  9253536.00  0.33  2182666.44  0.00  0.00  1.00
10:49:33  eth1  1.00  9265900.00  0.06  2185598.40  0.00  0.00  0.00
10:49:34  eth1  1.00  6949031.00  0.06  1638889.63  0.00  0.00  1.00
Average:  eth1  2.50  9018045.70  0.25  2126893.82  0.00  0.00  0.50

lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
     xmit_more: 2251366909
     xmit_more: 2256011392
lpaa23:~# echo 2256011392-2251366909 | bc
4644483

   PerfTop:  76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp],  (all, 48 CPUs)
---

    11.64%  [kernel]  [k] skb_set_owner_w
     6.21%  [kernel]  [k] queued_spin_lock_slowpath
     4.76%  [kernel]  [k] _raw_spin_lock
     4.40%  [kernel]  [k] __ip_make_skb
     3.10%  [kernel]  [k] sock_wfree
     2.87%  [kernel]  [k] ipt_do_table
     2.76%  [kernel]  [k] fq_dequeue
     2.71%  [kernel]  [k] mlx4_en_xmit
     2.50%  [kernel]  [k] __dev_queue_xmit
     2.29%  [kernel]  [k] __ip_append_data.isra.40
     2.28%  [kernel]  [k] udp_sendmsg
     2.01%  [kernel]  [k] __alloc_skb
     1.90%  [kernel]  [k] napi_consume_skb
     1.63%  [kernel]  [k] udp_send_skb
     1.62%  [kernel]  [k] skb_release_data
     1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath
     1.56%  [kernel]  [k] dev_hard_start_xmit
     1.55%  udpsnd    [.] __libc_send
     1.48%  [kernel]  [k] netif_skb_features
     1.42%  [kernel]  [k] __qdisc_run
     1.35%  [kernel]  [k] sk_dst_check
     1.33%  [kernel]  [k] sock_def_write_space
     1.30%  [kernel]  [k] kmem_cache_alloc_node_trace
     1.29%  [kernel]  [k] __local_bh_enable_ip
     1.21%  [kernel]  [k] copy_user_enhanced_fast_string
     1.08%  [kernel]  [k] __kmalloc_reserve.isra.40
     1.08%  [kernel]  [k] SYSC_sendto
     1.07%  [kernel]  [k] kmem_cache_alloc_node
     0.95%  [kernel]  [k] ip_finish_output2
     0.95%  [kernel]  [k] ktime_get
     0.91%  [kernel]  [k] validate_xmit_skb
     0.88%  [kernel]  [k] sock_alloc_send_pskb
     0.82%  [kernel]  [k] sock_sendmsg
Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
Hi all,

> I checked again the kernel
> at https://github.com/hardkernel/linux/tree/odroidc2-3.14.y. The
> version you mention (3.14.65-73) seems to be:
> sha1: c75d5f4d1516cdd86d90a9d1c565bb0ed9251036 tag: jenkins-deb s905
> kernel-73

I downloaded the prebuilt image from hardkernel, I did not build the kernel myself. But hardkernel has an earlier release of the same kernel version, which works fine too. I assume they would have committed the change in the newer version.

> In this particular version, both realtek drivers:
> - drivers/net/phy/realtek.c
> - drivers/amlogic/ethernet/phy/am_realtek.c
>
> have the hack to disable 1000M advertisement. I don't understand how
> it is possible for you to have 1000Base-T Full Duplex with this, maybe
> I'm missing something here ?

That's what I don't understand as well... The patched kernel shows the following:

$ uname -a
Linux T-06 4.9.0-rc4+ #21 SMP PREEMPT Sun Nov 13 12:07:19 UTC 2016

$ sudo ethtool eth0
Settings for eth0:
	Supported ports: [ TP MII ]
	Supported link modes:   10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Full
	Supported pause frame use: No
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full
	                        100baseT/Half 100baseT/Full
	                        1000baseT/Full
	Advertised pause frame use: No
	Advertised auto-negotiation: Yes
	Link partner advertised link modes:  10baseT/Half 10baseT/Full
	                                     100baseT/Half 100baseT/Full
	                                     1000baseT/Half 1000baseT/Full
	Link partner advertised pause frame use: Symmetric Receive-only
	Link partner advertised auto-negotiation: Yes
	Speed: 1000Mb/s
	Duplex: Full
	Port: MII
	PHYAD: 0
	Transceiver: external
	Auto-negotiation: on
	Supports Wake-on: ug
	Wake-on: d
	Current message level: 0x003f (63)
			       drv probe link timer ifdown ifup
	Link detected: yes

$ sudo ethtool --show-eee eth0
EEE Settings for eth0:
	EEE status: disabled
	Tx LPI: disabled
	Supported EEE link modes:  100baseT/Full
	                           1000baseT/Full
	Advertised EEE link modes:  100baseT/Full
	Link partner advertised EEE link modes:  100baseT/Full
	                                         1000baseT/Full

Can it be that the "EEE link modes" and the "normal" link modes are two different things?

> If you did compile the kernel yourself, could you check the 2 files
> mentioned above ? Just to be sure there was no patch applied at the
> last minute, which would not show up in the git history of
> hardkernel ?

I cannot check this easily at the moment.

Regards,
André
Re: [PATCH net 1/1] net sched filters: pass netlink message flags in event notification
From: Roman Mashak Date: Wed, 16 Nov 2016 17:16:10 -0500 > Userland client should be able to read an event, and reflect it back to > the kernel, therefore it needs to extract complete set of netlink flags. > > For example, this will allow "tc monitor" to distinguish Add and Replace > operations. > > Signed-off-by: Roman Mashak > Signed-off-by: Jamal Hadi Salim Looks good, applied.
[PATCH 12/20] net/iucv: Convert to hotplug state machine
Install the callbacks via the state machine and let the core invoke the callbacks on the already online CPUs. The smp function calls in the online/downprep callbacks are not required as the callback is guaranteed to be invoked on the upcoming/outgoing cpu. Cc: Ursula Braun Cc: "David S. Miller" Cc: linux-s...@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior --- include/linux/cpuhotplug.h | 1 + net/iucv/iucv.c| 118 + 2 files changed, 45 insertions(+), 74 deletions(-) diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h index fd5598b8353a..69abf2c09f6c 100644 --- a/include/linux/cpuhotplug.h +++ b/include/linux/cpuhotplug.h @@ -63,6 +63,7 @@ enum cpuhp_state { CPUHP_X86_THERM_PREPARE, CPUHP_X86_CPUID_PREPARE, CPUHP_X86_MSR_PREPARE, + CPUHP_NET_IUCV_PREPARE, CPUHP_TIMERS_DEAD, CPUHP_NOTF_ERR_INJ_PREPARE, CPUHP_MIPS_SOC_PREPARE, diff --git a/net/iucv/iucv.c b/net/iucv/iucv.c index 88a2a3ba4212..f0d6afc5d4a9 100644 --- a/net/iucv/iucv.c +++ b/net/iucv/iucv.c @@ -639,7 +639,7 @@ static void iucv_disable(void) put_online_cpus(); } -static void free_iucv_data(int cpu) +static int iucv_cpu_dead(unsigned int cpu) { kfree(iucv_param_irq[cpu]); iucv_param_irq[cpu] = NULL; @@ -647,9 +647,10 @@ static void free_iucv_data(int cpu) iucv_param[cpu] = NULL; kfree(iucv_irq_data[cpu]); iucv_irq_data[cpu] = NULL; + return 0; } -static int alloc_iucv_data(int cpu) +static int iucv_cpu_prepare(unsigned int cpu) { /* Note: GFP_DMA used to get memory below 2G */ iucv_irq_data[cpu] = kmalloc_node(sizeof(struct iucv_irq_data), @@ -671,58 +672,38 @@ static int alloc_iucv_data(int cpu) return 0; out_free: - free_iucv_data(cpu); + iucv_cpu_dead(cpu); return -ENOMEM; } -static int iucv_cpu_notify(struct notifier_block *self, -unsigned long action, void *hcpu) +static int iucv_cpu_online(unsigned int cpu) { - cpumask_t cpumask; - long cpu = (long) hcpu; - - switch (action) { - case CPU_UP_PREPARE: - case CPU_UP_PREPARE_FROZEN: - if 
(alloc_iucv_data(cpu)) - return notifier_from_errno(-ENOMEM); - break; - case CPU_UP_CANCELED: - case CPU_UP_CANCELED_FROZEN: - case CPU_DEAD: - case CPU_DEAD_FROZEN: - free_iucv_data(cpu); - break; - case CPU_ONLINE: - case CPU_ONLINE_FROZEN: - case CPU_DOWN_FAILED: - case CPU_DOWN_FAILED_FROZEN: - if (!iucv_path_table) - break; - smp_call_function_single(cpu, iucv_declare_cpu, NULL, 1); - break; - case CPU_DOWN_PREPARE: - case CPU_DOWN_PREPARE_FROZEN: - if (!iucv_path_table) - break; - cpumask_copy(&cpumask, &iucv_buffer_cpumask); - cpumask_clear_cpu(cpu, &cpumask); - if (cpumask_empty(&cpumask)) - /* Can't offline last IUCV enabled cpu. */ - return notifier_from_errno(-EINVAL); - smp_call_function_single(cpu, iucv_retrieve_cpu, NULL, 1); - if (cpumask_empty(&iucv_irq_cpumask)) - smp_call_function_single( - cpumask_first(&iucv_buffer_cpumask), - iucv_allow_cpu, NULL, 1); - break; - } - return NOTIFY_OK; + if (!iucv_path_table) + return 0; + iucv_declare_cpu(NULL); + return 0; } -static struct notifier_block __refdata iucv_cpu_notifier = { - .notifier_call = iucv_cpu_notify, -}; +static int iucv_cpu_down_prep(unsigned int cpu) +{ + cpumask_t cpumask; + + if (!iucv_path_table) + return 0; + + cpumask_copy(&cpumask, &iucv_buffer_cpumask); + cpumask_clear_cpu(cpu, &cpumask); + if (cpumask_empty(&cpumask)) + /* Can't offline last IUCV enabled cpu. */ + return -EINVAL; + + iucv_retrieve_cpu(NULL); + if (!cpumask_empty(&iucv_irq_cpumask)) + return 0; + smp_call_function_single(cpumask_first(&iucv_buffer_cpumask), +iucv_allow_cpu, NULL, 1); + return 0; +} /** * iucv_sever_pathid @@ -2027,6 +2008,7 @@ struct iucv_interface iucv_if = { }; EXPORT_SYMBOL(iucv_if); +static enum cpuhp_state iucv_online; /** * iucv_init * @@ -2035,7 +2017,6 @@ EXPORT_SYMBOL(iucv_if); static int __init iucv_init(void) { int rc; - int cpu; if (!MACHINE_IS_VM) { rc = -EPROTONOSUPPORT; @@ -2054,23 +2035,19 @@ static int __init iucv_init(void) goto out_int; } - cpu_notifier_register_b
Re: [PATCH net-next 0/3] RDS: TCP: HA/Failover fixes
From: Sowmini Varadhan Date: Wed, 16 Nov 2016 13:29:47 -0800 > This series contains a set of fixes for bugs exposed when > we ran the following in a loop between a test machine pair: > > while (1); do ># modprobe rds-tcp on test nodes ># run rds-stress in bi-dir mode between test machine pair ># modprobe -r rds-tcp on test nodes > done > > rds-stress in bi-dir mode will cause both nodes to initiate > RDS-TCP connections at almost the same instant, exposing the > bugs fixed in this series. > > Without the fixes, rds-stress reports sporadic packet drops, > and packets arriving out of sequence. After the fixes,we have > been able to run the test overnight, without any issues. > > Each patch has a detailed description of the root-cause fixed > by the patch. Series applied, thank you.
Re: [PATCH 2/3] net: stmmac: replace hardcoded function name by __func__
From: Corentin Labbe Date: Wed, 16 Nov 2016 20:09:40 +0100 > From: LABBE Corentin > > Some printing have the function name hardcoded. > It is better to use __func__ instead. > > Signed-off-by: Corentin Labbe Applied to net-next.
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
On Thu, Nov 17, 2016 at 06:20:39PM +0100, Hannes Frederic Sowa wrote:
> On 17.11.2016 17:45, David Miller wrote:
> > From: Hannes Frederic Sowa
> > Date: Thu, 17 Nov 2016 15:36:48 +0100
> >
> >> The other way is the journal idea I had, which uses an rb-tree with
> >> timestamps as keys (can be lamport timestamps). You insert into the
> >> tree until the dump is finished and use it as queue later to shuffle
> >> stuff into the hardware.
> >
> > If you have this "place" where pending inserts are stored, you have
> > a policy decision to make.
> >
> > First of all what do other lookups see when there are pending entries?
>
> I think this is a problem with the current approach already, as the
> delayed work queue already postpones the insert for an undecidable
> amount of time (and reorders depending on which CPU the entry was
> inserted and the fib notifier was called).

The delayed work queue only postpones the insert into the listener's table; the entries will be present in the kernel's table as usual. Therefore, other lookups made on the kernel's table will see the pending entries.

Note that both listeners that currently call the dump API do that during their init, before any lookups can be made on their tables. If you think it's critical, then we can make sure the workqueues are flushed before the listeners register their netdevs and effectively expose their tables to lookups.

I'm looking into the reordering issues you mentioned. I believe it's a valid point.

> For user space queries we would still query the in-kernel table.

Right. No changes here.

> > If, once inserted into the pending queue, you return success to the
> > inserting entity, then you must make those pending entries visible
> > to lookups.
>
> I think this same problem is in this patch set already.
> The way I understood it is that if a problem during insert emerges, the
> driver calls abort and every packet will be sent to user space, as the
> routing table cannot be offloaded and it won't try it again, Ido?

First of all, this abort mechanism is already in place and isn't part of this patchset. Secondly, why is this a problem? The all-or-nothing approach is a hard requirement, and the current implementation is infinitely better than the previous approach, in which the kernel's tables were flushed upon route insertion failure; that rendered the system unusable. The current implementation of the abort mechanism keeps the system usable, but with reduced performance.

> Probably this is an artifact of the mellanox implementation and we can
> implement this differently for other cards, but the schema to abort all
> if the modification doesn't work, seems to be fundamental (I think we
> require the all-or-nothing approach for now).

Yes, it's a hard requirement for all implementations. mlxsw implements it by evicting all routes from its tables and inserting a default route that traps packets to the CPU.

> > If you block the inserting entity, well that doesn't make a lot of
> > sense. If blocking is a workable solution, then we can just block the
> > entire insert while this FIB dump to the device driver is happening.
>
> I don't think we should really block (as in kernel-block) at any time.
>
> I was suggesting something like:
>
> rtnl_lock();
> synchronize_rcu_expedited(); // barrier, all routing modifications are
> // stable and no new ones can be added due to rtnl_lock
> register_notifier(); // notifier adds entries also into journal
> rtnl_unlock();
> walk_fib_tree_rcu_into_journal();
> // walk finished
> start_syncing_journal_to_hw(); // if new entries show up we sync them
> // asap after this point
>
> The journal would need a spin lock to protect its internal state and
> order events correctly.
> > But I am pretty sure the idea of blocking modifications for so long
> > was considered undesirable.
>
> Yes, this is also still my opinion.

It was indeed rejected :)
https://marc.info/?l=linux-netdev&m=147794848224884&w=2
Re: [PATCH 1/3] net: stmmac: replace all pr_xxx by their netdev_xxx counterpart
From: Corentin Labbe Date: Wed, 16 Nov 2016 20:09:39 +0100 > From: LABBE Corentin > > The stmmac driver use lots of pr_xxx functions to print information. > This is bad since we cannot know which device logs the information. > (moreover if two stmmac device are present) > > Furthermore, it seems that it assumes wrongly that all logs will always > be subsequent by using a dev_xxx then some indented pr_xxx like this: > kernel: sun7i-dwmac 1c5.ethernet: no reset control found > kernel: Ring mode enabled > kernel: No HW DMA feature register supported > kernel: Normal descriptors > kernel: TX Checksum insertion supported > > So this patch replace all pr_xxx by their netdev_xxx counterpart. > Excepts for some printing where netdev "cause" unpretty output like: > sun7i-dwmac 1c5.ethernet (unnamed net_device) (uninitialized): no reset > control found > In those case, I keep dev_xxx. > > In the same time I remove some "stmmac:" print since > this will be a duplicate with that dev_xxx displays. > > Signed-off-by: Corentin Labbe Applied to net-next.
Re: [PATCH 3/3] net: stmmac: replace if (netif_msg_type) by their netif_xxx counterpart
From: Corentin Labbe Date: Wed, 16 Nov 2016 20:09:41 +0100 > From: LABBE Corentin > > As sugested by Joe Perches, we could replace all > if (netif_msg_type(priv)) dev_xxx(priv->devices, ...) > by the simpler macro netif_xxx(priv, hw, priv->dev, ...) > > Signed-off-by: Corentin Labbe Applied to net-next.
Re: Netperf UDP issue with connected sockets
On Thu, 17 Nov 2016 08:21:19 -0800
Eric Dumazet wrote:

> On Thu, 2016-11-17 at 15:57 +0100, Jesper Dangaard Brouer wrote:
> > On Thu, 17 Nov 2016 06:17:38 -0800
> > Eric Dumazet wrote:
> >
> > > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> > >
> > > > I can see that qdisc layer does not activate xmit_more in this case.
> > >
> > > Sure. Not enough pressure from the sender(s).
> > >
> > > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > > limit is kept at a small value.
> > >
> > > (BTW not all NIC have expensive doorbells)
> >
> > I believe this NIC mlx5 (50G edition) does.
> >
> > I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> > 2414 cycles(tsc) 603.86 ns
> >
> > Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> >
> >  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
> >    Overhead  Command    Shared Object     Symbol
> >  +   8.92%  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
> >     - _raw_spin_lock
> >        + 90.78% __dev_queue_xmit
> >        + 7.83% dev_queue_xmit
> >        + 1.30% ___slab_alloc
> >  +   5.59%  udp_flood  [kernel.vmlinux]  [k] skb_set_owner_w
>
> Does TX completion happens on same cpu ?
>
> >  +   4.77%  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
> >  +   4.09%  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup
>
> Why fib_table_lookup() is used with connected UDP flow ???

Don't know. I would be interested in hints how to avoid this!...

I also see it with netperf, and my udp_flood code is here:
https://github.com/netoptimizer/network-testing/blob/master/src/udp_flood.c

> >  +   4.00%  swapper    [mlx5_core]       [k] mlx5e_poll_tx_cq
> >  +   3.11%  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash
>
> Same here, this is suspect.

It is the function calling fib_table_lookup(), and it is called by
ip_route_output_flow().

> >  +   2.49%  swapper    [kernel.vmlinux]  [k] __slab_free
> >
> > In this setup the spinlock in __dev_queue_xmit should be uncongested.
> > An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this
> > system.
> >
> > But 8.92% of the time is spend on it, which corresponds to a cost of
> > 215 cycles (2414*0.0892). This cost is too high, thus something else
> > is going on... I claim this mysterious extra cost is the
> > tailptr/doorbell.
>
> Well, with no pressure, doorbell is triggered for each packet.
>
> Since we can not predict that your application is going to send yet
> another packet one usec later, we can not avoid this.

The point is I can see a socket Send-Q forming, thus we do know the application has something to send; thus, a possibility for non-opportunistic bulking. Allowing/implementing bulk enqueue from the socket layer into the qdisc layer should be fairly simple (and the rest of xmit_more is already in place).

> Note that with the patches I am working on (busypolling extentions),
> we could decide to let the busypoller doing the doorbells, say one every
> 10 usec. (or after ~16 packets were queued)

Sounds interesting, but an opportunistic approach.

> But mlx4 uses two different NAPI for TX and RX, maybe mlx5 has the same
> strategy .

It is a possibility that TX completions were happening on another CPU (but I don't think so for mlx5). To rule that out, I reran the experiment making sure to pin everything to CPU-0, and the results are the same.

sudo ethtool -L mlx5p2 combined 1
sudo sh -c '\
 for x in /proc/irq/*/mlx5*/../smp_affinity; do \
   echo 01 > $x; grep . -H $x; done \
'

$ taskset -c 0 ./udp_flood --sendto 198.18.50.1 --count $((10**9))

Perf report validating the CPU in use:

$ perf report -g --no-children --sort cpu,comm,dso,symbol --stdio --call-graph none
# Overhead  CPU  Command    Shared Object     Symbol
# ........  ...  .........  ................  ......
     9.97%  000  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
     4.37%  000  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup
     3.97%  000  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
     3.06%  000  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash
     2.51%  000  udp_flood  [mlx5_core]       [k] mlx5e_poll_tx_cq
     2.48%  000  udp_flood  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     2.47%  000  udp_flood  [kernel.vmlinux]  [k] entry_SYSCALL_64
     2.42%  000  udp_flood  [kernel.vmlinux]  [k] udp_sendmsg
     2.39%  000  udp_flood  [kernel.vmlinux]  [k] __ip_append_data.isra.47
     2.29%  000  udp_flood  [k
Re: [PATCH net-next] net_sched: sch_fq: use hash_ptr()
From: Eric Dumazet Date: Thu, 17 Nov 2016 09:48:30 -0800 > From: Eric Dumazet > > When I wrote sch_fq.c, hash_ptr() on 64bit arches was awful, > and I chose hash_32(). > > Linus Torvalds and George Spelvin fixed this issue, so we can > use hash_ptr() to get more entropy on 64bit arches with Terabytes > of memory, and avoid the cast games. > > Signed-off-by: Eric Dumazet > Cc: Hugh Dickins Applied, thanks Eric.
[PATCH v8 0/6] Add eBPF hooks for cgroups
This is v8 of the patch set to allow eBPF programs for network filtering and accounting to be attached to cgroups, so that they apply to all sockets of all tasks placed in that cgroup. The logic can also be extended for other cgroup-based eBPF logic.

Again, only minor details are updated in this version.

Thanks,
Daniel

Changes from v7:

* Replace the static inline function cgroup_bpf_run_filter() with two
  specific macros for ingress and egress. This addresses David Miller's
  concern regarding skb->sk vs. sk in the egress path. Thanks a lot to
  Daniel Borkmann and Alexei Starovoitov for the suggestions.

Changes from v6:

* Rebased to 4.9-rc2
* Add EXPORT_SYMBOL(__cgroup_bpf_run_filter). The kbuild test robot now
  succeeds in building this version of the patch set.
* Switch from bpf_prog_run_save_cb() to bpf_prog_run_clear_cb() to not
  tamper with the contents of skb->cb[]. Pointed out by Daniel Borkmann.
* Use sk_to_full_sk() in the egress path, as suggested by Daniel
  Borkmann.
* Renamed BPF_PROG_TYPE_CGROUP_SOCKET to BPF_PROG_TYPE_CGROUP_SKB, as
  requested by David Ahern.
* Added Alexei's Acked-by tags.

Changes from v5:

* The eBPF programs now operate on L3 rather than on L2 of the packets,
  and the egress hooks were moved from __dev_queue_xmit() to
  ip*_output().
* For BPF_PROG_TYPE_CGROUP_SOCKET, disallow direct access to the skb
  through BPF_LD_[ABS|IND] instructions, but hook up the
  bpf_skb_load_bytes() access helper instead. Thanks to Daniel Borkmann
  for the help.

Changes from v4:

* Plug an skb leak when dropping packets due to eBPF verdicts in
  __dev_queue_xmit(). Spotted by Daniel Borkmann.
* Check for sk_fullsock(sk) in __cgroup_bpf_run_filter() so we don't
  operate on timewait or request sockets. Suggested by Daniel Borkmann.
* Add missing @parent parameter in kerneldoc of __cgroup_bpf_update().
  Spotted by Rami Rosen.
* Include linux/jump_label.h from bpf-cgroup.h to fix a kbuild error.

Changes from v3:

* Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
  renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
  BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
  __BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.
* Dropped the attach_flags member from the anonymous struct for BPF
  attach operations in union bpf_attr. They can be added later on via
  CHECK_ATTR. Requested by Daniel Borkmann and Alexei.
* Release old_prog at the end of __cgroup_bpf_update rather than at the
  beginning to fix a race gap between program updates and their users.
  Spotted by Daniel Borkmann.
* Plugged an skb leak when dropping packets on the egress path. Spotted
  by Daniel Borkmann.
* Add cgro...@vger.kernel.org to the loop, as suggested by Rami Rosen.
* Some minor coding style adoptions not worth mentioning in particular.

Changes from v2:

* Fixed the RCU locking details Tejun pointed out.
* Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.

Changes from v1:

* Moved all bpf specific cgroup code into its own file, and stub out
  related functions for !CONFIG_CGROUP_BPF as static inline nops. This
  way, the call sites are not cluttered with #ifdef guards while the
  feature remains compile-time configurable.
* Implemented the new scheme proposed by Tejun. Per cgroup, store one
  set of pointers that are pinned to the cgroup, and one for the
  programs that are effective. When a program is attached or detached,
  the change is propagated to all the cgroup's descendants. If a
  subcgroup has its own pinned program, skip the whole subbranch in
  order to allow delegation models.
* The hookup for egress packets is now done from __dev_queue_xmit().
* A static key is now used in both the ingress and egress fast paths to
  keep performance penalties close to zero if the feature is not in use.
* Overall cleanup to make the accessors use the program arrays. This
  should make it much easier to add new program types, which will then
  automatically follow the pinned vs. effective logic.
* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with xchg()
  and are protected by cgroup_mutex.
* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.
* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. In case
  other socket families should be supported, this can be extended in
  the future.
* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.

Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: ipv4, ipv6: run cgroup eBPF egress programs
  samples: bpf: add use
[PATCH v8 2/6] cgroup: add support for eBPF programs
This patch adds two sets of eBPF program pointers to struct cgroup. One for such that are directly pinned to a cgroup, and one for such that are effective for it. To illustrate the logic behind that, assume the following example cgroup hierarchy. A - B - C \ D - E If only B has a program attached, it will be effective for B, C, D and E. If D then attaches a program itself, that will be effective for both D and E, and the program in B will only affect B and C. Only one program of a given type is effective for a cgroup. Attaching and detaching programs will be done through the bpf(2) syscall. For now, ingress and egress inet socket filtering are the only supported use-cases. Signed-off-by: Daniel Mack Acked-by: Alexei Starovoitov --- include/linux/bpf-cgroup.h | 79 + include/linux/cgroup-defs.h | 4 ++ init/Kconfig| 12 kernel/bpf/Makefile | 1 + kernel/bpf/cgroup.c | 167 kernel/cgroup.c | 18 + 6 files changed, 281 insertions(+) create mode 100644 include/linux/bpf-cgroup.h create mode 100644 kernel/bpf/cgroup.c diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h new file mode 100644 index 000..ec80d0c --- /dev/null +++ b/include/linux/bpf-cgroup.h @@ -0,0 +1,79 @@ +#ifndef _BPF_CGROUP_H +#define _BPF_CGROUP_H + +#include +#include +#include + +struct sock; +struct cgroup; +struct sk_buff; + +#ifdef CONFIG_CGROUP_BPF + +extern struct static_key_false cgroup_bpf_enabled_key; +#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key) + +struct cgroup_bpf { + /* +* Store two sets of bpf_prog pointers, one for programs that are +* pinned directly to this cgroup, and one for those that are effective +* when this cgroup is accessed. 
+*/ + struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE]; + struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE]; +}; + +void cgroup_bpf_put(struct cgroup *cgrp); +void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent); + +void __cgroup_bpf_update(struct cgroup *cgrp, +struct cgroup *parent, +struct bpf_prog *prog, +enum bpf_attach_type type); + +/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */ +void cgroup_bpf_update(struct cgroup *cgrp, + struct bpf_prog *prog, + enum bpf_attach_type type); + +int __cgroup_bpf_run_filter(struct sock *sk, + struct sk_buff *skb, + enum bpf_attach_type type); + +/* Wrappers for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled. */ +#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) \ +({ \ + int __ret = 0; \ + if (cgroup_bpf_enabled) \ + __ret = __cgroup_bpf_run_filter(sk, skb,\ + BPF_CGROUP_INET_INGRESS); \ + \ + __ret; \ +}) + +#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) \ +({ \ + int __ret = 0; \ + if (cgroup_bpf_enabled && sk && sk == skb->sk) {\ + typeof(sk) __sk = sk_to_full_sk(sk);\ + if (sk_fullsock(__sk)) \ + __ret = __cgroup_bpf_run_filter(__sk, skb, \ + BPF_CGROUP_INET_EGRESS); \ + } \ + __ret; \ +}) + +#else + +struct cgroup_bpf {}; +static inline void cgroup_bpf_put(struct cgroup *cgrp) {} +static inline void cgroup_bpf_inherit(struct cgroup *cgrp, + struct cgroup *parent) {} + +#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; }) +#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; }) + +#endif /* CONFIG_CGROUP_BPF */ + +#endif /* _BPF_CGROUP_H */ diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 5b17de6..861b467 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -16,6 +16,7 @@ #include #include #include +#include #ifdef CONFIG_CGROUPS @@ -300,6 +301,9 @@ struct cgroup { /* used to schedule release agent */ struct work_struct release_agent_work; + /* used to store eBPF programs */ + struct cgroup_bpf bpf; + /* ids of the ances
[PATCH v8 4/6] net: filter: run cgroup eBPF ingress programs
If the cgroup associated with the receiving socket has eBPF programs installed, run them from sk_filter_trim_cap(). eBPF programs used in this context are expected to either return 1 to let the packet pass, or != 1 to drop them. The programs have access to the skb through bpf_skb_load_bytes(), and the payload starts at the network headers (L3). Note that cgroup_bpf_run_filter() is stubbed out as a static inline nop for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if the feature is unused. Signed-off-by: Daniel Mack Acked-by: Alexei Starovoitov --- net/core/filter.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/net/core/filter.c b/net/core/filter.c index e3813d6..474b486 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -78,6 +78,10 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap) if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) return -ENOMEM; + err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb); + if (err) + return err; + err = security_sock_rcv_skb(sk, skb); if (err) return err; -- 2.7.4
[PATCH v8 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and BPF_PROG_DETACH, which allow attaching eBPF programs to, and detaching them from, a target. On the API level, the target could be anything that has an fd in userspace, hence the field in union bpf_attr is called 'target_fd'. When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is expected to be a valid file descriptor of a cgroup v2 directory which has the bpf controller enabled. These are the only use-cases implemented by this patch at this point, but more can be added. If a program of the given type already exists in the given cgroup, the program is swapped atomically, so userspace does not have to drop an existing program first before installing a new one, which would otherwise leave a gap in which no program is attached. For more information on the propagation logic to subcgroups, please refer to the bpf cgroup controller implementation. The API is guarded by CAP_NET_ADMIN. Signed-off-by: Daniel Mack Acked-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 8 + kernel/bpf/syscall.c | 81 2 files changed, 89 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1f3e6f1..f31b655 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -73,6 +73,8 @@ enum bpf_cmd { BPF_PROG_LOAD, BPF_OBJ_PIN, BPF_OBJ_GET, + BPF_PROG_ATTACH, + BPF_PROG_DETACH, }; enum bpf_map_type { @@ -150,6 +152,12 @@ union bpf_attr { __aligned_u64 pathname; __u32 bpf_fd; }; + + struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */ + __u32 target_fd; /* container object to attach to */ + __u32 attach_bpf_fd; /* eBPF program to attach */ + __u32 attach_type; + }; } __attribute__((aligned(8))); /* integer value in 'imm' field of BPF_CALL instruction selects which helper diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 228f962..1814c01 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -822,6 +822,77 @@ static int
bpf_obj_get(const union bpf_attr *attr) return bpf_obj_get_user(u64_to_ptr(attr->pathname)); } +#ifdef CONFIG_CGROUP_BPF + +#define BPF_PROG_ATTACH_LAST_FIELD attach_type + +static int bpf_prog_attach(const union bpf_attr *attr) +{ + struct bpf_prog *prog; + struct cgroup *cgrp; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + if (CHECK_ATTR(BPF_PROG_ATTACH)) + return -EINVAL; + + switch (attr->attach_type) { + case BPF_CGROUP_INET_INGRESS: + case BPF_CGROUP_INET_EGRESS: + prog = bpf_prog_get_type(attr->attach_bpf_fd, +BPF_PROG_TYPE_CGROUP_SKB); + if (IS_ERR(prog)) + return PTR_ERR(prog); + + cgrp = cgroup_get_from_fd(attr->target_fd); + if (IS_ERR(cgrp)) { + bpf_prog_put(prog); + return PTR_ERR(cgrp); + } + + cgroup_bpf_update(cgrp, prog, attr->attach_type); + cgroup_put(cgrp); + break; + + default: + return -EINVAL; + } + + return 0; +} + +#define BPF_PROG_DETACH_LAST_FIELD attach_type + +static int bpf_prog_detach(const union bpf_attr *attr) +{ + struct cgroup *cgrp; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + if (CHECK_ATTR(BPF_PROG_DETACH)) + return -EINVAL; + + switch (attr->attach_type) { + case BPF_CGROUP_INET_INGRESS: + case BPF_CGROUP_INET_EGRESS: + cgrp = cgroup_get_from_fd(attr->target_fd); + if (IS_ERR(cgrp)) + return PTR_ERR(cgrp); + + cgroup_bpf_update(cgrp, NULL, attr->attach_type); + cgroup_put(cgrp); + break; + + default: + return -EINVAL; + } + + return 0; +} +#endif /* CONFIG_CGROUP_BPF */ + SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size) { union bpf_attr attr = {}; @@ -888,6 +959,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz case BPF_OBJ_GET: err = bpf_obj_get(&attr); break; + +#ifdef CONFIG_CGROUP_BPF + case BPF_PROG_ATTACH: + err = bpf_prog_attach(&attr); + break; + case BPF_PROG_DETACH: + err = bpf_prog_detach(&attr); + break; +#endif + default: err = -EINVAL; break; -- 2.7.4
[PATCH v8 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups
Add a simple userspace program to demonstrate the new API to attach eBPF programs to cgroups. This is what it does:

* Create an arraymap in the kernel with 4 byte keys and 8 byte values

* Load an eBPF program

  The eBPF program accesses the map passed in to store two pieces of information. The number of invocations of the program, which maps to the number of packets received, is stored to key 0. Key 1 is incremented on each iteration by the number of bytes stored in the skb.

* Detach any eBPF program previously attached to the cgroup

* Attach the new program to the cgroup using BPF_PROG_ATTACH

* Once a second, read map[0] and map[1] to see how many bytes and packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress" or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument, which will make the generated eBPF program return 0 instead of 1, so the kernel will drop the packet. libbpf gained two new wrappers for the new syscall commands.
Signed-off-by: Daniel Mack Acked-by: Alexei Starovoitov --- samples/bpf/Makefile| 2 + samples/bpf/libbpf.c| 21 ++ samples/bpf/libbpf.h| 3 + samples/bpf/test_cgrp2_attach.c | 147 4 files changed, 173 insertions(+) create mode 100644 samples/bpf/test_cgrp2_attach.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 12b7304..e4cdc74 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -22,6 +22,7 @@ hostprogs-y += spintest hostprogs-y += map_perf_test hostprogs-y += test_overhead hostprogs-y += test_cgrp2_array_pin +hostprogs-y += test_cgrp2_attach hostprogs-y += xdp1 hostprogs-y += xdp2 hostprogs-y += test_current_task_under_cgroup @@ -49,6 +50,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o +test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o xdp1-objs := bpf_load.o libbpf.o xdp1_user.o # reuse xdp1 source intentionally xdp2-objs := bpf_load.o libbpf.o xdp1_user.o diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c index 9969e35..9ce707b 100644 --- a/samples/bpf/libbpf.c +++ b/samples/bpf/libbpf.c @@ -104,6 +104,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type, return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr)); } +int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type) +{ + union bpf_attr attr = { + .target_fd = target_fd, + .attach_bpf_fd = prog_fd, + .attach_type = type, + }; + + return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr)); +} + +int bpf_prog_detach(int target_fd, enum bpf_attach_type type) +{ + union bpf_attr attr = { + .target_fd = target_fd, + .attach_type = type, + }; + + return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr)); +} + int bpf_obj_pin(int fd, const char *pathname) { union bpf_attr attr = { diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h index 
ac6edb6..d0a799a 100644 --- a/samples/bpf/libbpf.h +++ b/samples/bpf/libbpf.h @@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns, int insn_len, const char *license, int kern_version); +int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type); +int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type); + int bpf_obj_pin(int fd, const char *pathname); int bpf_obj_get(const char *pathname); diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c new file mode 100644 index 000..63ef208 --- /dev/null +++ b/samples/bpf/test_cgrp2_attach.c @@ -0,0 +1,147 @@ +/* eBPF example program: + * + * - Creates arraymap in kernel with 4 bytes keys and 8 byte values + * + * - Loads eBPF program + * + * The eBPF program accesses the map passed in to store two pieces of + * information. The number of invocations of the program, which maps + * to the number of packets received, is stored to key 0. Key 1 is + * incremented on each iteration by the number of bytes stored in + * the skb. + * + * - Detaches any eBPF program previously attached to the cgroup + * + * - Attaches the new program to a cgroup using BPF_PROG_ATTACH + * + * - Every second, reads map[0] and map[1] to see how many bytes and + * packets were seen on any socket of tasks in the given cgroup. + */ + +#define _GNU_SOURCE + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "libbpf.h" + +enum { + MAP_KEY_PACKETS, + MAP_KEY_BYTES, +}; + +static int prog_load(int map_fd, int verdict) +{ + struct bpf_insn prog[] = { +
[PATCH v8 1/6] bpf: add new prog type for cgroup socket filtering
This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that it does not allow BPF_LD_[ABS|IND] instructions and hooks up the bpf_skb_load_bytes() helper. Programs of this type will be attached to cgroups for network filtering and accounting. Signed-off-by: Daniel Mack Acked-by: Alexei Starovoitov --- include/uapi/linux/bpf.h | 9 + net/core/filter.c| 23 +++ 2 files changed, 32 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index f09c70b..1f3e6f1 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -96,8 +96,17 @@ enum bpf_prog_type { BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_XDP, BPF_PROG_TYPE_PERF_EVENT, + BPF_PROG_TYPE_CGROUP_SKB, }; +enum bpf_attach_type { + BPF_CGROUP_INET_INGRESS, + BPF_CGROUP_INET_EGRESS, + __MAX_BPF_ATTACH_TYPE +}; + +#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE + #define BPF_PSEUDO_MAP_FD 1 /* flags for BPF_MAP_UPDATE_ELEM command */ diff --git a/net/core/filter.c b/net/core/filter.c index 00351cd..e3813d6 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2576,6 +2576,17 @@ xdp_func_proto(enum bpf_func_id func_id) } } +static const struct bpf_func_proto * +cg_skb_func_proto(enum bpf_func_id func_id) +{ + switch (func_id) { + case BPF_FUNC_skb_load_bytes: + return &bpf_skb_load_bytes_proto; + default: + return sk_filter_func_proto(func_id); + } +} + static bool __is_valid_access(int off, int size, enum bpf_access_type type) { if (off < 0 || off >= sizeof(struct __sk_buff)) @@ -2938,6 +2949,12 @@ static const struct bpf_verifier_ops xdp_ops = { .convert_ctx_access = xdp_convert_ctx_access, }; +static const struct bpf_verifier_ops cg_skb_ops = { + .get_func_proto = cg_skb_func_proto, + .is_valid_access= sk_filter_is_valid_access, + .convert_ctx_access = sk_filter_convert_ctx_access, +}; + static struct bpf_prog_type_list sk_filter_type __read_mostly = { .ops= &sk_filter_ops, .type = BPF_PROG_TYPE_SOCKET_FILTER, @@ -2958,12 +2975,18 @@ static struct bpf_prog_type_list 
xdp_type __read_mostly = { .type = BPF_PROG_TYPE_XDP, }; +static struct bpf_prog_type_list cg_skb_type __read_mostly = { + .ops= &cg_skb_ops, + .type = BPF_PROG_TYPE_CGROUP_SKB, +}; + static int __init register_sk_filter_ops(void) { bpf_register_prog_type(&sk_filter_type); bpf_register_prog_type(&sched_cls_type); bpf_register_prog_type(&sched_act_type); bpf_register_prog_type(&xdp_type); + bpf_register_prog_type(&cg_skb_type); return 0; } -- 2.7.4
[PATCH v8 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
If the cgroup associated with the socket has eBPF programs installed, run them from ip_output(), ip6_output() and ip_mc_output(). In the mentioned functions we have two socket contexts available, as per commit 7026b1ddb6b8 ("netfilter: Pass socket pointer down through okfn()."). We explicitly need to use sk instead of skb->sk here, since otherwise the same program would run multiple times on egress when encap devices are involved, which is not desired in our case. eBPF programs used in this context are expected to either return 1 to let the packet pass, or != 1 to drop them. The programs have access to the skb through bpf_skb_load_bytes(), and the payload starts at the network headers (L3). Note that cgroup_bpf_run_filter() is stubbed out as a static inline nop for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if the feature is unused. Signed-off-by: Daniel Mack Acked-by: Alexei Starovoitov --- net/ipv4/ip_output.c | 15 +++ net/ipv6/ip6_output.c | 8 2 files changed, 23 insertions(+) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 03e7f73..5914006 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -74,6 +74,7 @@ #include #include #include +#include #include #include #include @@ -303,6 +304,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb) { struct rtable *rt = skb_rtable(skb); struct net_device *dev = rt->dst.dev; + int ret; /* * If the indicated interface is up and running, send the packet.
@@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb) skb->dev = dev; skb->protocol = htons(ETH_P_IP); + ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb); + if (ret) { + kfree_skb(skb); + return ret; + } + /* * Multicasts are looped back for other local users */ @@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb) int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb) { struct net_device *dev = skb_dst(skb)->dev; + int ret; IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len); skb->dev = dev; skb->protocol = htons(ETH_P_IP); + ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb); + if (ret) { + kfree_skb(skb); + return ret; + } + return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, net, sk, skb, NULL, dev, ip_finish_output, diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 6001e78..483f91b 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -39,6 +39,7 @@ #include #include +#include #include #include @@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb) { struct net_device *dev = skb_dst(skb)->dev; struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb)); + int ret; if (unlikely(idev->cnf.disable_ipv6)) { IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS); @@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb) return 0; } + ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb); + if (ret) { + kfree_skb(skb); + return ret; + } + return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING, net, sk, skb, NULL, dev, ip6_finish_output, -- 2.7.4
Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump
From: Hannes Frederic Sowa Date: Thu, 17 Nov 2016 18:20:39 +0100 > Hi, > > On 17.11.2016 17:45, David Miller wrote: >> From: Hannes Frederic Sowa >> Date: Thu, 17 Nov 2016 15:36:48 +0100 >> >>> The other way is the journal idea I had, which uses an rb-tree with >>> timestamps as keys (can be lamport timestamps). You insert into the tree >>> until the dump is finished and use it as queue later to shuffle stuff >>> into the hardware. >> >> If you have this "place" where pending inserts are stored, you have >> a policy decision to make. >> >> First of all what do other lookups see when there are pending entries? > > I think this is a problem with the current approach already, as the > delayed work queue already postpones the insert for an undecidable > amount of time (and reorders depending on which CPU the entry was > inserted on and the fib notifier was called). > > For user space queries we would still query the in-kernel table. Ok, I think I might misunderstand something. What is going into this journal exactly? The actual full software and hardware insert operation, or just the notification to the hardware device driver notifiers? The "lookup" I'm mostly concerned with is the fast path where the packets being processed actually look up a route. I do not think we can return success on the insert to the user and yet have the route lookup dataplane not return that route on a lookup.
[PATCH ipsec] xfrm: unbreak xfrm_sk_policy_lookup
If we succeed in grabbing the refcount, then err is zero, so the condition (!err && !xfrm_pol_hold_rcu(pol)) evaluates to false and we fall through to the last else branch, which then sets the policy to ERR_PTR(0). Fixes: ae33786f73a7ce ("xfrm: policy: only use rcu in xfrm_sk_policy_lookup") Reported-by: Nicolas Dichtel Tested-by: Nicolas Dichtel Signed-off-by: Florian Westphal --- net/xfrm/xfrm_policy.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index fd6986634e6f..5bf7e1bfeac7 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -1268,12 +1268,14 @@ static struct xfrm_policy *xfrm_sk_policy_lookup(const struct sock *sk, int dir, err = security_xfrm_policy_lookup(pol->security, fl->flowi_secid, policy_to_flow_dir(dir)); - if (!err && !xfrm_pol_hold_rcu(pol)) - goto again; - else if (err == -ESRCH) + if (!err) { + if (!xfrm_pol_hold_rcu(pol)) + goto again; + } else if (err == -ESRCH) { pol = NULL; - } else { + } else { pol = ERR_PTR(err); + } } else pol = NULL; } -- 2.7.3
[PATCH -next] tcp: make undo_cwnd mandatory for congestion modules
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh, which un-does reno's halving behaviour. It seems more appropriate to let congctl algorithms pair .ssthresh and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it up for all congestion algorithms that used to rely on the fallback. highspeed, illinois, scalable, veno and yeah use 'reno undo' while their .ssthresh implementations don't halve the slow-start threshold; this might point to a similar issue as the one fixed for dctcp in ce6dd23329b1e ("dctcp: avoid bogus doubling of cwnd after loss"). Cc: Eric Dumazet Cc: Yuchung Cheng Cc: Neal Cardwell Signed-off-by: Florian Westphal --- include/net/tcp.h| 1 + net/ipv4/tcp_cong.c | 14 -- net/ipv4/tcp_dctcp.c | 1 + net/ipv4/tcp_highspeed.c | 2 +- net/ipv4/tcp_hybla.c | 1 + net/ipv4/tcp_illinois.c | 1 + net/ipv4/tcp_input.c | 5 + net/ipv4/tcp_lp.c| 1 + net/ipv4/tcp_scalable.c | 1 + net/ipv4/tcp_vegas.c | 1 + net/ipv4/tcp_veno.c | 1 + net/ipv4/tcp_westwood.c | 1 + net/ipv4/tcp_yeah.c | 1 + 13 files changed, 24 insertions(+), 7 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 123979fe12bf..7de80739adab 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -958,6 +958,7 @@ u32 tcp_slow_start(struct tcp_sock *tp, u32 acked); void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked); u32 tcp_reno_ssthresh(struct sock *sk); +u32 tcp_reno_undo_cwnd(struct sock *sk); void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 acked); extern struct tcp_congestion_ops tcp_reno; diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c index 1294af4e0127..38905ec5f508 100644 --- a/net/ipv4/tcp_cong.c +++ b/net/ipv4/tcp_cong.c @@ -68,8 +68,9 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *ca) { int ret = 0; - /* all algorithms must implement ssthresh and cong_avoid ops */ - if (!ca->ssthresh || !(ca->cong_avoid || ca->cong_control)) { + /* all algorithms must implement these */ + if (!ca->ssthresh || !ca->undo_cwnd
|| + !(ca->cong_avoid || ca->cong_control)) { pr_err("%s does not implement required ops\n", ca->name); return -EINVAL; } @@ -441,10 +442,19 @@ u32 tcp_reno_ssthresh(struct sock *sk) } EXPORT_SYMBOL_GPL(tcp_reno_ssthresh); +u32 tcp_reno_undo_cwnd(struct sock *sk) +{ + const struct tcp_sock *tp = tcp_sk(sk); + + return max(tp->snd_cwnd, tp->snd_ssthresh << 1); +} +EXPORT_SYMBOL_GPL(tcp_reno_undo_cwnd); + struct tcp_congestion_ops tcp_reno = { .flags = TCP_CONG_NON_RESTRICTED, .name = "reno", .owner = THIS_MODULE, .ssthresh = tcp_reno_ssthresh, .cong_avoid = tcp_reno_cong_avoid, + .undo_cwnd = tcp_reno_undo_cwnd, }; diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c index 51139175bf61..bde22ebb92a8 100644 --- a/net/ipv4/tcp_dctcp.c +++ b/net/ipv4/tcp_dctcp.c @@ -342,6 +342,7 @@ static struct tcp_congestion_ops dctcp __read_mostly = { static struct tcp_congestion_ops dctcp_reno __read_mostly = { .ssthresh = tcp_reno_ssthresh, .cong_avoid = tcp_reno_cong_avoid, + .undo_cwnd = tcp_reno_undo_cwnd, .get_info = dctcp_get_info, .owner = THIS_MODULE, .name = "dctcp-reno", diff --git a/net/ipv4/tcp_highspeed.c b/net/ipv4/tcp_highspeed.c index db7842495a64..1eb8fefd9bd0 100644 --- a/net/ipv4/tcp_highspeed.c +++ b/net/ipv4/tcp_highspeed.c @@ -161,7 +161,7 @@ static struct tcp_congestion_ops tcp_highspeed __read_mostly = { .init = hstcp_init, .ssthresh = hstcp_ssthresh, .cong_avoid = hstcp_cong_avoid, - + .undo_cwnd = tcp_reno_undo_cwnd, .owner = THIS_MODULE, .name = "highspeed" }; diff --git a/net/ipv4/tcp_hybla.c b/net/ipv4/tcp_hybla.c index 083831e359df..0f7175c3338e 100644 --- a/net/ipv4/tcp_hybla.c +++ b/net/ipv4/tcp_hybla.c @@ -166,6 +166,7 @@ static void hybla_cong_avoid(struct sock *sk, u32 ack, u32 acked) static struct tcp_congestion_ops tcp_hybla __read_mostly = { .init = hybla_init, .ssthresh = tcp_reno_ssthresh, + .undo_cwnd = tcp_reno_undo_cwnd, .cong_avoid = hybla_cong_avoid, .set_state = hybla_state, diff --git a/net/ipv4/tcp_illinois.c 
b/net/ipv4/tcp_illinois.c index c8e6d86be114..7c843578f233 100644 --- a/net/ipv4/tcp_illinois.c +++ b/net/ipv4/tcp_illinois.c @@ -327,6 +327,7 @@ static size_t tcp_illinois_info(struct sock *sk, u32 ext, int *attr, static struct tcp_congestion_ops tcp_illinois __read_mostly = { .init = tcp_illinois_init, .ssthresh = tcp_illinois_ssthresh, + .undo_cwnd = tcp_reno_undo_cwnd, .cong_avoid = tcp_illinois_cong_avoid,
[PATCH net-next v3 1/5] ethtool: (uapi) Add ETHTOOL_PHY_GTUNABLE and ETHTOOL_PHY_STUNABLE
From: Raju Lakkaraju Defines a generic API to get/set phy tunables. The API is using the existing ethtool_tunable/tunable_type_id types which is already being used for mac level tunables. Signed-off-by: Raju Lakkaraju Reviewed-by: Andrew Lunn Signed-off-by: Allan W. Nielsen --- include/uapi/linux/ethtool.h | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h index 8e54723..42f696f 100644 --- a/include/uapi/linux/ethtool.h +++ b/include/uapi/linux/ethtool.h @@ -248,6 +248,16 @@ struct ethtool_tunable { void*data[0]; }; +enum phy_tunable_id { + ETHTOOL_PHY_ID_UNSPEC, + + /* +* Add your fresh new phy tunable attribute above and remember to update +* phy_tunable_strings[] in net/core/ethtool.c +*/ + __ETHTOOL_PHY_TUNABLE_COUNT, +}; + /** * struct ethtool_regs - hardware register dump * @cmd: Command number = %ETHTOOL_GREGS @@ -548,6 +558,7 @@ struct ethtool_pauseparam { * @ETH_SS_FEATURES: Device feature names * @ETH_SS_RSS_HASH_FUNCS: RSS hush function names * @ETH_SS_PHY_STATS: Statistic names, for use with %ETHTOOL_GPHYSTATS + * @ETH_SS_PHY_TUNABLES: PHY tunable names */ enum ethtool_stringset { ETH_SS_TEST = 0, @@ -558,6 +569,7 @@ enum ethtool_stringset { ETH_SS_RSS_HASH_FUNCS, ETH_SS_TUNABLES, ETH_SS_PHY_STATS, + ETH_SS_PHY_TUNABLES, }; /** @@ -1313,7 +1325,8 @@ struct ethtool_per_queue_op { #define ETHTOOL_GLINKSETTINGS 0x004c /* Get ethtool_link_settings */ #define ETHTOOL_SLINKSETTINGS 0x004d /* Set ethtool_link_settings */ - +#define ETHTOOL_PHY_GTUNABLE 0x004e /* Get PHY tunable configuration */ +#define ETHTOOL_PHY_STUNABLE 0x004f /* Set PHY tunable configuration */ /* compatibility with older code */ #define SPARC_ETH_GSET ETHTOOL_GSET -- 2.7.3
RE: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM
-Original Message- From: Nicolas Ferre [mailto:nicolas.fe...@atmel.com] Sent: 17 November 2016 14:29 To: Harini Katakam; Rafal Ozieblo Cc: harini.kata...@xilinx.com; netdev@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM > On 17/11/2016 at 13:21, Harini Katakam wrote: > > Hi Rafal, > > > > On Thu, Nov 17, 2016 at 5:20 PM, Rafal Ozieblo wrote: > >> Hello, > >> I think there could be a bug in your patch. > >> > >>> + > >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT > >>> + dmacfg |= GEM_BIT(ADDR64); #endif > >> > >> You enable 64 bit addressing (64b dma bus width) whenever the appropriate > >> architecture config option is enabled. > >> But there are some legacy controllers which do not support that feature. > >> According to the Cadence hardware team: > >> "64 bit addressing was added in July 2013. Earlier version do not have it. > >> This feature was enhanced in release August 2014 to have separate upper > >> address values for transmit and receive." > >> > >>> /* Bitfields in NSR */ > >>> @@ -474,6 +479,10 @@ > >>> struct macb_dma_desc { > >> > u32 addr; > >>> u32 ctrl; > >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT > >>> + u32 addrh; > >>> + u32 resvd; > >>> +#endif > >>> }; > >> > >> It will not work for legacy hardware. The old descriptor is 2 words wide, the > >> new one is 4 words wide. > >> If you enable CONFIG_ARCH_DMA_ADDR_T_64BIT but the hardware doesn't > >> support it at all, you will miss every second descriptor. > >> > > > > True, this feature is not available in all Cadence IP versions. > > In fact, the IP version in Zynq does not support this. But the one in ZynqMP > > does. > > So, we enable the kernel config for 64 bit DMA addressing for this SoC and > > hence the driver picks it up. My assumption was that if the legacy IP > > does not support > > 64 bit addressing, then this DMA option wouldn't be enabled.
> > There is a design config register in Cadence IP which is being read to > > check for 64 bit address support - the DMA mask is set based on that. > > But the addition of two descriptor words cannot be based on this runtime > > check. > > For this reason, all the static changes were placed under this check. > > We have quite a bunch of options in this driver to determine the > real capabilities of the underlying hardware. > If HW configuration registers are not appropriate, and it seems they are not, > I would advise simply using the DT compatibility string. > > Best regards, > -- > Nicolas Ferre HW configuration registers are appropriate. The issue is that this code doesn't use the capability bit to switch between different dma descriptors (2 words vs. 4 words). The DMA descriptor size is chosen based on kernel configuration, not based on hardware capabilities. Regards, Rafal Ozieblo
Re: [PATCH net-next v3 4/7] vxlan: improve vxlan route lookup checks.
From: Jiri Benc Date: Thu, 17 Nov 2016 16:59:49 +0100 > On Thu, 17 Nov 2016 10:17:01 +, David Laight wrote: >> Worse than arbitrary, it adds 4 bytes of pad on 64bit systems. > > It does not, this is not a struct. He is talking about the padding on the function stack.
Re: net: BUG still has locks held in unix_stream_splice_read
On Mon, Oct 10, 2016 at 10:01 AM, Dmitry Vyukov wrote: > On Mon, Oct 10, 2016 at 5:14 AM, Al Viro wrote: >> On Mon, Oct 10, 2016 at 03:46:07AM +0100, Al Viro wrote: >>> On Sun, Oct 09, 2016 at 12:06:14PM +0200, Dmitry Vyukov wrote: >>> > I suspect this is: >>> > >>> > commit 25869262ef7af24ccde988867ac3eb1c3d4b88d4 >>> > Author: Al Viro >>> > Date: Sat Sep 17 21:02:10 2016 -0400 >>> > skb_splice_bits(): get rid of callback >>> > since pipe_lock is the outermost now, we don't need to drop/regain >>> > socket locks around the call of splice_to_pipe() from >>> > skb_splice_bits(), >>> > which kills the need to have a socket-specific callback; we can just >>> > call splice_to_pipe() and be done with that. >>> >>> Unlikely, since that particular commit removes unlocking/relocking ->iolock >>> around the call of splice_to_pipe(). Original would've retaken the same >>> lock on the way out; it's not as if we could leave the syscall there. >>> >>> It might be splice-related, but I don't believe that you've got the right >>> commit here. >> >> It's not that commit > > It's highly likely. Sorry for falsely pointing to your commit. > > >> , all right - it's "can't call unix_stream_read_generic() >> with any locks held" stepped onto a couple of commits prior by >> "splice: lift pipe_lock out of splice_to_pipe()". Could somebody explain >> what is that about? >> >> E.g what will happen if some code does a read on AF_UNIX socket with >> some local mutex held? AFAICS, there are exactly two callers of >> freezable_schedule_timeout() - this one and one in XFS; the latter is >> in a kernel thread where we do have good warranties about the locking >> environment, but here it's in the bleeding ->recvmsg/->splice_read and >> for those assumption that caller doesn't hold any locks is pretty >> strong, especially since it's not documented anywhere. >> >> What's going on there? > > I never saw that warning before. 
There is some possibility that fuzzer > has discovered some new paths, but it's much more likely that > something has changed recently (the stack looks quite simple -- just a > splice from unix socket). And my previous pull was like a week ago. Ping. Just hit it again on 4.9-rc5 [ BUG: syz-executor/15922 still has locks held! ] 4.9.0-rc5+ #43 Not tainted - 1 lock held by syz-executor/15922: #0: [ 1441.143288] ( [< inline >] pipe_lock_nested fs/pipe.c:66 [] pipe_lock+0x5b/0x70 fs/pipe.c:74 stack backtrace: CPU: 3 PID: 15922 Comm: syz-executor Not tainted 4.9.0-rc5+ #43 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 88004a98f568 834c2a19 0003 110009531e40 ed0009531e38 41b58ab3 895758b0 834c272b 0003 880035256640 0003 88006d122cd8 Call Trace: [< inline >] __dump_stack lib/dump_stack.c:15 [] dump_stack+0x2ee/0x3f5 lib/dump_stack.c:51 [< inline >] print_held_locks_bug kernel/locking/lockdep.c:4296 [] debug_check_no_locks_held+0x125/0x140 kernel/locking/lockdep.c:4302 [< inline >] try_to_freeze include/linux/freezer.h:65 [< inline >] freezer_count include/linux/freezer.h:127 [< inline >] freezable_schedule_timeout include/linux/freezer.h:192 [] unix_stream_data_wait+0x4fd/0x910 net/unix/af_unix.c:2223 [] unix_stream_read_generic+0x11e2/0x2240 net/unix/af_unix.c:2332 [] unix_stream_splice_read+0x27f/0x400 net/unix/af_unix.c:2506 [] sock_splice_read+0xbe/0x100 net/socket.c:772 [] do_splice_to+0x10f/0x170 fs/splice.c:897 [< inline >] do_splice fs/splice.c:1185 [< inline >] SYSC_splice fs/splice.c:1409 [] SyS_splice+0xfaa/0x16a0 fs/splice.c:1392 [] entry_SYSCALL_64_fastpath+0x23/0xc6
[PATCH net-next] net_sched: sch_fq: use hash_ptr()
From: Eric Dumazet When I wrote sch_fq.c, hash_ptr() on 64bit arches was awful, and I chose hash_32(). Linus Torvalds and George Spelvin fixed this issue, so we can use hash_ptr() to get more entropy on 64bit arches with Terabytes of memory, and avoid the cast games. Signed-off-by: Eric Dumazet Cc: Hugh Dickins --- net/sched/sch_fq.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c index 18e752439f6f..86309a3156a5 100644 --- a/net/sched/sch_fq.c +++ b/net/sched/sch_fq.c @@ -245,7 +245,7 @@ static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q) skb_orphan(skb); } - root = &q->fq_root[hash_32((u32)(long)sk, q->fq_trees_log)]; + root = &q->fq_root[hash_ptr(sk, q->fq_trees_log)]; if (q->flows >= (2U << q->fq_trees_log) && q->inactive_flows > q->flows/2) @@ -599,7 +599,7 @@ static void fq_rehash(struct fq_sched_data *q, kmem_cache_free(fq_flow_cachep, of); continue; } - nroot = &new_array[hash_32((u32)(long)of->sk, new_log)]; + nroot = &new_array[hash_ptr(of->sk, new_log)]; np = &nroot->rb_node; parent = NULL;
Fwd: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN
From: Baicar, Tyler [mailto:tbai...@codeaurora.org] Sent: Tuesday, November 15, 2016 11:50 PM To: Neftin, Sasha ; Kirsher, Jeffrey T ; intel-wired-...@lists.osuosl.org; netdev@vger.kernel.org; linux-ker...@vger.kernel.org; ok...@codeaurora.org; ti...@codeaurora.org Subject: Re: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN On 11/13/2016 2:25 AM, Neftin, Sasha wrote: > On 11/13/2016 10:34 AM, Neftin, Sasha wrote: >> On 11/11/2016 12:35 AM, Baicar, Tyler wrote: >>> Hello Sasha, >>> >>> On 11/9/2016 11:19 PM, Neftin, Sasha wrote: On 11/9/2016 11:41 PM, Tyler Baicar wrote: > Move IRQ free code so that it will happen regardless of the > __E1000_DOWN bit. Currently the e1000e driver only releases its > IRQ if the __E1000_DOWN bit is cleared. This is not sufficient > because it is possible for __E1000_DOWN to be set without releasing the > IRQ. > In such a situation, we will hit a kernel bug later in > e1000_remove because the IRQ still has action since it was never > freed. A secondary bus reset can cause this case to happen. > > Signed-off-by: Tyler Baicar > --- >drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++- >1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c > b/drivers/net/ethernet/intel/e1000e/netdev.c > index 7017281..36cfcb0 100644 > --- a/drivers/net/ethernet/intel/e1000e/netdev.c > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c > @@ -4679,12 +4679,13 @@ int e1000e_close(struct net_device *netdev) > if (!test_bit(__E1000_DOWN, &adapter->state)) { >e1000e_down(adapter, true); > -e1000_free_irq(adapter); > /* Link status message must follow this format */ >pr_info("%s NIC Link is Down\n", adapter->netdev->name); >} >+e1000_free_irq(adapter); > + >napi_disable(&adapter->napi); > e1000e_free_tx_resources(adapter->tx_ring); > I would like not recommend insert this change. This change related driver state machine, we afraid from lot of synchronization problem and issues. 
We need keep e1000_free_irq in loop and check for 'test_bit' ready. >>> What do you mean here? There is no loop. If __E1000_DOWN is set then >>> we will never free the IRQ. >>> Another point, does before execute secondary bus reset your SW back up pcie configuration space as properly? >>> After a secondary bus reset, the link needs to recover and go back >>> to a working state after 1 second. >>> >>> From the callstack, the issue is happening while removing the >>> endpoint from the system, before applying the secondary bus reset. >>> >>> The order of events is >>> 1. remove the drivers >>> 2. cause a secondary bus reset >>> 3. wait 1 second >> Actually, this is too much, usually link up in less than 100ms.You >> can check Data Link Layer indication. >>> 4. recover the link >>> >>> callstack: >>> free_msi_irqs+0x6c/0x1a8 >>> pci_disable_msi+0xb0/0x148 >>> e1000e_reset_interrupt_capability+0x60/0x78 >>> e1000_remove+0xc8/0x180 >>> pci_device_remove+0x48/0x118 >>> __device_release_driver+0x80/0x108 >>> device_release_driver+0x2c/0x40 >>> pci_stop_bus_device+0xa0/0xb0 >>> pci_stop_bus_device+0x3c/0xb0 >>> pci_stop_root_bus+0x54/0x80 >>> acpi_pci_root_remove+0x28/0x64 >>> acpi_bus_trim+0x6c/0xa4 >>> acpi_device_hotplug+0x19c/0x3f4 >>> acpi_hotplug_work_fn+0x28/0x3c >>> process_one_work+0x150/0x460 >>> worker_thread+0x50/0x4b8 >>> kthread+0xd4/0xe8 >>> ret_from_fork+0x10/0x50 >>> >>> Thanks, >>> Tyler >>> >> Hello Tyler, >> Okay, we need consult more about this suggestion. >> May I ask what is setup you run? Is there NIC or on board LAN? I >> would like try reproduce this issue in our lab's too. >> Also, is same issue observed with same scenario and others NIC's too? >> Sasha >> ___ >> Intel-wired-lan mailing list >> intel-wired-...@lists.osuosl.org >> http://lists.osuosl.org/mailman/listinfo/intel-wired-lan >> > Please, specify what is device used. Hello Sasha, This was on a QDF2432 using an Intel PRO/1000 PT Dual Port server adapter. 
I have not tried other e1000e PCIe cards, but have not seen any similar
issues with Mellanox cards. I'm able to reproduce it with just pulling the
card out. Here is the lspci -vvv output for this card:

0004:00:00.0 PCI bridge: Airgo Networks, Inc. Device 0400 (prog-if 00 [Normal decode])
	Physical Slot: 5
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D
Re: [PATCH 1/3] net: stmmac: replace all pr_xxx by their netdev_xxx counterpart
On 11/16/2016 8:09 PM, Corentin Labbe wrote:
> From: LABBE Corentin
>
> The stmmac driver uses lots of pr_xxx functions to print information.
> This is bad since we cannot know which device logged the information
> (moreover if two stmmac devices are present).
>
> Furthermore, it seems to assume wrongly that all logs will always be
> subsequent, by using a dev_xxx followed by some indented pr_xxx, like
> this:
>
> kernel: sun7i-dwmac 1c5.ethernet: no reset control found
> kernel:  Ring mode enabled
> kernel:  No HW DMA feature register supported
> kernel:  Normal descriptors
> kernel:  TX Checksum insertion supported
>
> So this patch replaces all pr_xxx by their netdev_xxx counterpart,
> except for some printing where netdev "causes" unpretty output like:
>
> sun7i-dwmac 1c5.ethernet (unnamed net_device) (uninitialized): no reset control found
>
> In those cases, I keep dev_xxx. At the same time I remove some "stmmac:"
> prefixes since these would duplicate what dev_xxx displays.
>
> Signed-off-by: Corentin Labbe

Thanks for these changes.

Acked-by: Giuseppe Cavallaro

> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 204 --
>  drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c |  14 +-
>  2 files changed, 123 insertions(+), 95 deletions(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index 8eb12353..791daf4 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -305,7 +305,7 @@ bool stmmac_eee_init(struct stmmac_priv *priv)
>  	 */
>  	spin_lock_irqsave(&priv->lock, flags);
>  	if (priv->eee_active) {
> -		pr_debug("stmmac: disable EEE\n");
> +		netdev_dbg(priv->dev, "disable EEE\n");
>  		del_timer_sync(&priv->eee_ctrl_timer);
>  		priv->hw->mac->set_eee_timer(priv->hw, 0,
>  					     tx_lpi_timer);
> @@ -334,7 +334,7 @@ bool stmmac_eee_init(struct stmmac_priv *priv)
>  		ret = true;
>  		spin_unlock_irqrestore(&priv->lock, flags);
>
> -		pr_debug("stmmac: Energy-Efficient Ethernet initialized\n");
> +		netdev_dbg(priv->dev, "Energy-Efficient Ethernet initialized\n");
>  	}
>  out:
>  	return ret;
> @@ -456,8 +456,8 @@ static int stmmac_hwtstamp_ioctl(struct net_device *dev, struct ifreq *ifr)
>  			   sizeof(struct hwtstamp_config)))
>  		return -EFAULT;
>
> -	pr_debug("%s config flags:0x%x, tx_type:0x%x, rx_filter:0x%x\n",
> -		 __func__, config.flags, config.tx_type, config.rx_filter);
> +	netdev_dbg(priv->dev, "%s config flags:0x%x, tx_type:0x%x, rx_filter:0x%x\n",
> +		   __func__, config.flags, config.tx_type, config.rx_filter);
>
>  	/* reserved for future extensions */
>  	if (config.flags)
> @@ -756,8 +756,9 @@ static void stmmac_adjust_link(struct net_device *dev)
>  			break;
>  		default:
>  			if (netif_msg_link(priv))
> -				pr_warn("%s: Speed (%d) not 10/100\n",
> -					dev->name, phydev->speed);
> +				netdev_warn(priv->dev,
> +					    "Speed (%d) not 10/100\n",
> +					    phydev->speed);
>  			break;
>  		}
>
> @@ -810,10 +811,10 @@ static void stmmac_check_pcs_mode(struct stmmac_priv *priv)
>  	    (interface == PHY_INTERFACE_MODE_RGMII_ID) ||
>  	    (interface == PHY_INTERFACE_MODE_RGMII_RXID) ||
>  	    (interface == PHY_INTERFACE_MODE_RGMII_TXID)) {
> -		pr_debug("STMMAC: PCS RGMII support enable\n");
> +		netdev_dbg(priv->dev, "PCS RGMII support enabled\n");
>  		priv->hw->pcs = STMMAC_PCS_RGMII;
>  	} else if (interface == PHY_INTERFACE_MODE_SGMII) {
> -		pr_debug("STMMAC: PCS SGMII support enable\n");
> +		netdev_dbg(priv->dev, "PCS SGMII support enabled\n");
>  		priv->hw->pcs = STMMAC_PCS_SGMII;
>  	}
>  }
> @@ -848,15 +849,15 @@ static int stmmac_init_phy(struct net_device *dev)
>  		snprintf(phy_id_fmt, MII_BUS_ID_SIZE + 3, PHY_ID_FMT, bus_id,
>  			 priv->plat->phy_addr);
>
> -	pr_debug("stmmac_init_phy: trying to attach to %s\n",
> -		 phy_id_fmt);
> +	netdev_dbg(priv->dev, "stmmac_init_phy: trying to attach to %s\n",
> +		   phy_id_fmt);
>
>  	phydev = phy_connect(dev, phy_id_fmt, &stmmac_adjust_link,
>  			     interface);
[PATCH net v2 3/7] net: ethernet: ti: cpsw: fix deferred probe
Make sure to deregister all child devices also on probe errors to avoid leaks and to fix probe deferral: cpsw 4a10.ethernet: omap_device: omap_device_enable() called from invalid state 1 cpsw 4a10.ethernet: use pm_runtime_put_sync_suspend() in driver? cpsw: probe of 4a10.ethernet failed with error -22 Add generic helper to undo the effects of cpsw_probe_dt(), which will also be used in a follow-on patch to fix further leaks that have been introduced more recently. Note that the platform device is now runtime-resumed before registering any child devices in order to make sure that it is synchronously suspended after having deregistered the children in the error path. Fixes: 1fb19aa730e4 ("net: cpsw: Add parent<->child relation support between cpsw and mdio") Signed-off-by: Johan Hovold --- drivers/net/ethernet/ti/cpsw.c | 41 - 1 file changed, 24 insertions(+), 17 deletions(-) diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c index 84c5d214557e..5d14abb06486 100644 --- a/drivers/net/ethernet/ti/cpsw.c +++ b/drivers/net/ethernet/ti/cpsw.c @@ -2441,6 +2441,11 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data, return 0; } +static void cpsw_remove_dt(struct platform_device *pdev) +{ + of_platform_depopulate(&pdev->dev); +} + static int cpsw_probe_dual_emac(struct cpsw_priv *priv) { struct cpsw_common *cpsw = priv->cpsw; @@ -2585,10 +2590,19 @@ static int cpsw_probe(struct platform_device *pdev) /* Select default pin state */ pinctrl_pm_select_default_state(&pdev->dev); + /* Need to enable clocks with runtime PM api to access module +* registers +*/ + ret = pm_runtime_get_sync(&pdev->dev); + if (ret < 0) { + pm_runtime_put_noidle(&pdev->dev); + goto clean_runtime_disable_ret; + } + if (cpsw_probe_dt(&cpsw->data, pdev)) { dev_err(&pdev->dev, "cpsw: platform data missing\n"); ret = -ENODEV; - goto clean_runtime_disable_ret; + goto clean_dt_ret; } data = &cpsw->data; cpsw->rx_ch_num = 1; @@ -2609,7 +2623,7 @@ static int cpsw_probe(struct 
platform_device *pdev) GFP_KERNEL); if (!cpsw->slaves) { ret = -ENOMEM; - goto clean_runtime_disable_ret; + goto clean_dt_ret; } for (i = 0; i < data->slaves; i++) cpsw->slaves[i].slave_num = i; @@ -2621,7 +2635,7 @@ static int cpsw_probe(struct platform_device *pdev) if (IS_ERR(clk)) { dev_err(priv->dev, "fck is not found\n"); ret = -ENODEV; - goto clean_runtime_disable_ret; + goto clean_dt_ret; } cpsw->bus_freq_mhz = clk_get_rate(clk) / 100; @@ -2629,25 +2643,17 @@ static int cpsw_probe(struct platform_device *pdev) ss_regs = devm_ioremap_resource(&pdev->dev, ss_res); if (IS_ERR(ss_regs)) { ret = PTR_ERR(ss_regs); - goto clean_runtime_disable_ret; + goto clean_dt_ret; } cpsw->regs = ss_regs; - /* Need to enable clocks with runtime PM api to access module -* registers -*/ - ret = pm_runtime_get_sync(&pdev->dev); - if (ret < 0) { - pm_runtime_put_noidle(&pdev->dev); - goto clean_runtime_disable_ret; - } cpsw->version = readl(&cpsw->regs->id_ver); res = platform_get_resource(pdev, IORESOURCE_MEM, 1); cpsw->wr_regs = devm_ioremap_resource(&pdev->dev, res); if (IS_ERR(cpsw->wr_regs)) { ret = PTR_ERR(cpsw->wr_regs); - goto clean_pm_runtime_put_ret; + goto clean_dt_ret; } memset(&dma_params, 0, sizeof(dma_params)); @@ -2684,7 +2690,7 @@ static int cpsw_probe(struct platform_device *pdev) default: dev_err(priv->dev, "unknown version 0x%08x\n", cpsw->version); ret = -ENODEV; - goto clean_pm_runtime_put_ret; + goto clean_dt_ret; } for (i = 0; i < cpsw->data.slaves; i++) { struct cpsw_slave *slave = &cpsw->slaves[i]; @@ -2713,7 +2719,7 @@ static int cpsw_probe(struct platform_device *pdev) if (!cpsw->dma) { dev_err(priv->dev, "error initializing dma\n"); ret = -ENOMEM; - goto clean_pm_runtime_put_ret; + goto clean_dt_ret; } cpsw->txch[0] = cpdma_chan_create(cpsw->dma, 0, cpsw_tx_handler, 0); @@ -2823,7 +2829,8 @@ static int cpsw_probe(struct platform_device *pdev) cpsw_ale_destroy(cpsw->ale); clean_dma_ret: cpdma_ctlr_destroy(cpsw->dma); -clean_pm_runtime_put_ret: 
+clean_dt_ret: + cpsw_remove_dt(pdev); pm_runtime_put_sync(&pde
Re: [PATCH net 1/3] net: phy: realtek: add eee advertisement disable options
Hi Jerone, On 17 November 2016 at 15:50, Jerome Brunet wrote: > On Wed, 2016-11-16 at 22:36 +0530, Anand Moon wrote: >> Hi Jerome. >> >> On 15 November 2016 at 19:59, Jerome Brunet >> wrote: >> > >> > On some platforms, energy efficient ethernet with rtl8211 devices >> > is >> > causing issue, like throughput drop or broken link. >> > >> > This was reported on the OdroidC2 (DWMAC + RTL8211F). While the >> > issue root >> > cause is not fully understood yet, disabling EEE advertisement >> > prevent auto >> > negotiation from enabling EEE. >> > >> > This patch provides options to disable 1000T and 100TX EEE >> > advertisement >> > individually for the realtek phys supporting this feature. >> > >> > Reported-by: Martin Blumenstingl > > m> >> > Cc: Giuseppe Cavallaro >> > Cc: Alexandre TORGUE >> > Signed-off-by: Jerome Brunet >> > Signed-off-by: Neil Armstrong >> > Tested-by: Andre Roth >> > --- >> > drivers/net/phy/realtek.c | 65 >> > ++- >> > 1 file changed, 64 insertions(+), 1 deletion(-) >> > >> > diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c >> > index aadd6e9f54ad..77235fd5faaf 100644 >> > --- a/drivers/net/phy/realtek.c >> > +++ b/drivers/net/phy/realtek.c >> > @@ -15,6 +15,12 @@ >> > */ >> > #include >> > #include >> > +#include >> > + >> > +struct rtl8211x_phy_priv { >> > + bool eee_1000t_disable; >> > + bool eee_100tx_disable; >> > +}; >> > >> > #define RTL821x_PHYSR 0x11 >> > #define RTL821x_PHYSR_DUPLEX 0x2000 >> > @@ -93,12 +99,44 @@ static int rtl8211f_config_intr(struct >> > phy_device *phydev) >> > return err; >> > } >> > >> > +static void rtl8211x_clear_eee_adv(struct phy_device *phydev) >> > +{ >> > + struct rtl8211x_phy_priv *priv = phydev->priv; >> > + u16 val; >> > + >> > + if (priv->eee_1000t_disable || priv->eee_100tx_disable) { >> > + val = phy_read_mmd_indirect(phydev, >> > MDIO_AN_EEE_ADV, >> > + MDIO_MMD_AN); >> > + >> > + if (priv->eee_1000t_disable) >> > + val &= ~MDIO_AN_EEE_ADV_1000T; >> > + if 
(priv->eee_100tx_disable) >> > + val &= ~MDIO_AN_EEE_ADV_100TX; >> > + >> > + phy_write_mmd_indirect(phydev, MDIO_AN_EEE_ADV, >> > + MDIO_MMD_AN, val); >> > + } >> > +} >> > + >> > +static int rtl8211x_config_init(struct phy_device *phydev) >> > +{ >> > + int ret; >> > + >> > + ret = genphy_config_init(phydev); >> > + if (ret < 0) >> > + return ret; >> > + >> > + rtl8211x_clear_eee_adv(phydev); >> > + >> > + return 0; >> > +} >> > + >> > static int rtl8211f_config_init(struct phy_device *phydev) >> > { >> > int ret; >> > u16 reg; >> > >> > - ret = genphy_config_init(phydev); >> > + ret = rtl8211x_config_init(phydev); >> > if (ret < 0) >> > return ret; >> > >> > @@ -115,6 +153,26 @@ static int rtl8211f_config_init(struct >> > phy_device *phydev) >> > return 0; >> > } >> > >> > +static int rtl8211x_phy_probe(struct phy_device *phydev) >> > +{ >> > + struct device *dev = &phydev->mdio.dev; >> > + struct device_node *of_node = dev->of_node; >> > + struct rtl8211x_phy_priv *priv; >> > + >> > + priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL); >> > + if (!priv) >> > + return -ENOMEM; >> > + >> > + priv->eee_1000t_disable = >> > + of_property_read_bool(of_node, "realtek,disable- >> > eee-1000t"); >> > + priv->eee_100tx_disable = >> > + of_property_read_bool(of_node, "realtek,disable- >> > eee-100tx"); >> > + >> > + phydev->priv = priv; >> > + >> > + return 0; >> > +} >> > + >> > static struct phy_driver realtek_drvs[] = { >> > { >> > .phy_id = 0x8201, >> > @@ -140,7 +198,9 @@ static struct phy_driver realtek_drvs[] = { >> > .phy_id_mask= 0x001f, >> > .features = PHY_GBIT_FEATURES, >> > .flags = PHY_HAS_INTERRUPT, >> > + .probe = &rtl8211x_phy_probe, >> > .config_aneg= genphy_config_aneg, >> > + .config_init= &rtl8211x_config_init, >> > .read_status= genphy_read_status, >> > .ack_interrupt = rtl821x_ack_interrupt, >> > .config_intr= rtl8211e_config_intr, >> > @@ -152,7 +212,9 @@ static struct phy_driver realtek_drvs[] = { >> > .phy_id_mask= 0x001f, >> > .features = 
PHY_GBIT_FEATURES, >> > .flags = PHY_HAS_INTERRUPT, >> > + .probe = &rtl8211x_phy_probe, >> > .config_aneg= &genphy_config_aneg, >> > + .config_init= &rtl8211x_config_i
[PATCH net v2 6/7] net: ethernet: ti: cpsw: add missing sanity check
Make sure to check for allocation failures during probe to avoid
dereferencing a NULL pointer.

Fixes: 649a1688c960 ("net: ethernet: ti: cpsw: create common struct to hold shared driver data")
Signed-off-by: Johan Hovold
---
 drivers/net/ethernet/ti/cpsw.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 11b2daef3158..1387299030e4 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2588,6 +2588,9 @@ static int cpsw_probe(struct platform_device *pdev)
 	int irq;
 
 	cpsw = devm_kzalloc(&pdev->dev, sizeof(struct cpsw_common), GFP_KERNEL);
+	if (!cpsw)
+		return -ENOMEM;
+
 	cpsw->dev = &pdev->dev;
 
 	ndev = alloc_etherdev_mq(sizeof(struct cpsw_priv), CPSW_MAX_QUEUES);
-- 
2.7.3
[PATCH ethtool v3 2/2] Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY downshift
From: Raju Lakkaraju Add ethtool get and set tunable to access PHY drivers. Ethtool Help: ethtool -h for PHY tunables ethtool --set-phy-tunable DEVNAME Set PHY tunable [ downshift on|off [count N] ] ethtool --get-phy-tunable DEVNAME Get PHY tunable [ downshift ] Ethtool ex: ethtool --set-phy-tuanble eth0 downshift on ethtool --set-phy-tuanble eth0 downshift off ethtool --set-phy-tuanble eth0 downshift on count 2 ethtool --get-phy-tunable eth0 downshift Signed-off-by: Raju Lakkaraju Signed-off-by: Allan W. Nielsen --- ethtool.8.in | 39 ethtool.c| 144 +++ 2 files changed, 183 insertions(+) diff --git a/ethtool.8.in b/ethtool.8.in index 9631847..337d0cf 100644 --- a/ethtool.8.in +++ b/ethtool.8.in @@ -340,6 +340,18 @@ ethtool \- query or control network driver and hardware settings .B2 tx-lpi on off .BN tx-timer .BN advertise +.HP +.B ethtool \-\-set\-phy\-tunable +.I devname +.RB [ +.B downshift +.A1 on off +.BN count +.RB ] +.HP +.B ethtool \-\-get\-phy\-tunable +.I devname +.RB [ downshift ] . .\" Adjust lines (i.e. full justification) and hyphenate. .ad @@ -947,6 +959,33 @@ Values are as for Sets the amount of time the device should stay in idle mode prior to asserting its Tx LPI (in microseconds). This has meaning only when Tx LPI is enabled. .RE +.TP +.B \-\-set\-phy\-tunable +Sets the PHY tunable parameters. +.RS 4 +.TP +.A2 downshift on off +Specifies whether downshift should be enabled +.TS +nokeep; +lB l. +.BI count \ N +Sets the PHY downshift re-tries count. +.TE +.PD +.RE +.TP +.B \-\-get\-phy\-tunable +Gets the PHY tunable parameters. +.RS 4 +.TP +.B downshift +For operation in cabling environments that are incompatible with 1000BASE-T, +PHY device provides an automatic link speed downshift operation. +Link speed downshift after N failed 1000BASE-T auto-negotiation attempts. + +Gets the PHY downshift count/status. +.RE .SH BUGS Not supported (in part or whole) on all network drivers. 
.SH AUTHOR diff --git a/ethtool.c b/ethtool.c index 49ac94e..7dcd005 100644 --- a/ethtool.c +++ b/ethtool.c @@ -4520,6 +4520,146 @@ static int do_seee(struct cmd_context *ctx) return 0; } +static int do_get_phy_tunable(struct cmd_context *ctx) +{ + int argc = ctx->argc; + char **argp = ctx->argp; + int err, i; + u8 downshift_changed = 0; + + if (argc < 1) + exit_bad_args(); + for (i = 0; i < argc; i++) { + if (!strcmp(argp[i], "downshift")) { + downshift_changed = 1; + i += 1; + if (i < argc) + exit_bad_args(); + } else { + exit_bad_args(); + } + } + + if (downshift_changed) { + struct ethtool_tunable ds; + u8 count = 0; + + ds.cmd = ETHTOOL_PHY_GTUNABLE; + ds.id = ETHTOOL_PHY_DOWNSHIFT; + ds.type_id = ETHTOOL_TUNABLE_U8; + ds.len = 1; + ds.data[0] = &count; + err = send_ioctl(ctx, &ds); + if (err < 0) { + perror("Cannot Get PHY downshift count"); + return 87; + } + count = *((u8 *)&ds.data[0]); + if (count) + fprintf(stdout, "Downshift count: %d\n", count); + else + fprintf(stdout, "Downshift disabled\n"); + } + + return err; +} + +static int parse_named_bool(struct cmd_context *ctx, const char *name, u8 *on) +{ + if (ctx->argc < 2) + return 0; + + if (strcmp(*ctx->argp, name)) + return 0; + + if (!strcmp(*(ctx->argp + 1), "on")) { + *on = 1; + } else if (!strcmp(*(ctx->argp + 1), "off")) { + *on = 0; + } else { + fprintf(stderr, "Invalid boolean\n"); + exit_bad_args(); + } + + ctx->argc -= 2; + ctx->argp += 2; + + return 1; +} + +static int parse_named_u8(struct cmd_context *ctx, const char *name, u8 *val) +{ + if (ctx->argc < 2) + return 0; + + if (strcmp(*ctx->argp, name)) + return 0; + + *val = get_uint_range(*(ctx->argp + 1), 0, 0xff); + + ctx->argc -= 2; + ctx->argp += 2; + + return 1; +} + +static int do_set_phy_tunable(struct cmd_context *ctx) +{ + int err = 0; + u8 ds_cnt = DOWNSHIFT_DEV_DEFAULT_COUNT; + u8 ds_changed = 0, ds_has_cnt = 0, ds_enable = 0; + + if (ctx->argc == 0) + exit_bad_args(); + + /* Parse arguments */ + while (ctx->argc) { + if 
(parse_named_bool(ctx, "downshift", &ds_enable)) { + ds_changed = 1; + ds_has_cnt = parse
[PATCH net-next v3 4/5] ethtool: Core impl for ETHTOOL_PHY_DOWNSHIFT tunable
From: Raju Lakkaraju

Adding validation support for the ETHTOOL_PHY_DOWNSHIFT. Functional
implementation needs to be done in the individual PHY drivers.

Signed-off-by: Raju Lakkaraju
Reviewed-by: Andrew Lunn
Signed-off-by: Allan W. Nielsen
---
 net/core/ethtool.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 61aebdf..e9b45567 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -122,6 +122,7 @@ tunable_strings[__ETHTOOL_TUNABLE_COUNT][ETH_GSTRING_LEN] = {
 static const char
 phy_tunable_strings[__ETHTOOL_PHY_TUNABLE_COUNT][ETH_GSTRING_LEN] = {
 	[ETHTOOL_ID_UNSPEC]     = "Unspec",
+	[ETHTOOL_PHY_DOWNSHIFT] = "phy-downshift",
 };
 
 static int ethtool_get_features(struct net_device *dev, void __user *useraddr)
@@ -2435,6 +2436,11 @@ static int ethtool_set_per_queue(struct net_device *dev, void __user *useraddr)
 static int ethtool_phy_tunable_valid(const struct ethtool_tunable *tuna)
 {
 	switch (tuna->id) {
+	case ETHTOOL_PHY_DOWNSHIFT:
+		if (tuna->len != sizeof(u8) ||
+		    tuna->type_id != ETHTOOL_TUNABLE_U8)
+			return -EINVAL;
+		break;
 	default:
 		return -EINVAL;
 	}
-- 
2.7.3
[PATCH net-next v3 2/5] ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE
From: Raju Lakkaraju Adding get_tunable/set_tunable function pointer to the phy_driver structure, and uses these function pointers to implement the ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE ioctls. Signed-off-by: Raju Lakkaraju Reviewed-by: Andrew Lunn Signed-off-by: Allan W. Nielsen --- include/linux/phy.h | 7 + net/core/ethtool.c | 87 + 2 files changed, 94 insertions(+) diff --git a/include/linux/phy.h b/include/linux/phy.h index 9880d73..3d35c36 100644 --- a/include/linux/phy.h +++ b/include/linux/phy.h @@ -611,6 +611,13 @@ struct phy_driver { void (*get_strings)(struct phy_device *dev, u8 *data); void (*get_stats)(struct phy_device *dev, struct ethtool_stats *stats, u64 *data); + + /* Get and Set PHY tunables */ + int (*get_tunable)(struct phy_device *dev, + struct ethtool_tunable *tuna, void *data); + int (*set_tunable)(struct phy_device *dev, + struct ethtool_tunable *tuna, + const void *data); }; #define to_phy_driver(d) container_of(to_mdio_common_driver(d), \ struct phy_driver, mdiodrv) diff --git a/net/core/ethtool.c b/net/core/ethtool.c index 9774898..61aebdf 100644 --- a/net/core/ethtool.c +++ b/net/core/ethtool.c @@ -119,6 +119,11 @@ tunable_strings[__ETHTOOL_TUNABLE_COUNT][ETH_GSTRING_LEN] = { [ETHTOOL_TX_COPYBREAK] = "tx-copybreak", }; +static const char +phy_tunable_strings[__ETHTOOL_PHY_TUNABLE_COUNT][ETH_GSTRING_LEN] = { + [ETHTOOL_ID_UNSPEC] = "Unspec", +}; + static int ethtool_get_features(struct net_device *dev, void __user *useraddr) { struct ethtool_gfeatures cmd = { @@ -227,6 +232,9 @@ static int __ethtool_get_sset_count(struct net_device *dev, int sset) if (sset == ETH_SS_TUNABLES) return ARRAY_SIZE(tunable_strings); + if (sset == ETH_SS_PHY_TUNABLES) + return ARRAY_SIZE(phy_tunable_strings); + if (sset == ETH_SS_PHY_STATS) { if (dev->phydev) return phy_get_sset_count(dev->phydev); @@ -253,6 +261,8 @@ static void __ethtool_get_strings(struct net_device *dev, sizeof(rss_hash_func_strings)); else if (stringset == ETH_SS_TUNABLES) memcpy(data, 
tunable_strings, sizeof(tunable_strings)); + else if (stringset == ETH_SS_PHY_TUNABLES) + memcpy(data, phy_tunable_strings, sizeof(phy_tunable_strings)); else if (stringset == ETH_SS_PHY_STATS) { struct phy_device *phydev = dev->phydev; @@ -2422,6 +2432,76 @@ static int ethtool_set_per_queue(struct net_device *dev, void __user *useraddr) }; } +static int ethtool_phy_tunable_valid(const struct ethtool_tunable *tuna) +{ + switch (tuna->id) { + default: + return -EINVAL; + } + + return 0; +} + +static int get_phy_tunable(struct net_device *dev, void __user *useraddr) +{ + int ret; + struct ethtool_tunable tuna; + struct phy_device *phydev = dev->phydev; + void *data; + + if (!(phydev && phydev->drv && phydev->drv->get_tunable)) + return -EOPNOTSUPP; + + if (copy_from_user(&tuna, useraddr, sizeof(tuna))) + return -EFAULT; + ret = ethtool_phy_tunable_valid(&tuna); + if (ret) + return ret; + data = kmalloc(tuna.len, GFP_USER); + if (!data) + return -ENOMEM; + ret = phydev->drv->get_tunable(phydev, &tuna, data); + if (ret) + goto out; + useraddr += sizeof(tuna); + ret = -EFAULT; + if (copy_to_user(useraddr, data, tuna.len)) + goto out; + ret = 0; + +out: + kfree(data); + return ret; +} + +static int set_phy_tunable(struct net_device *dev, void __user *useraddr) +{ + int ret; + struct ethtool_tunable tuna; + struct phy_device *phydev = dev->phydev; + void *data; + + if (!(phydev && phydev->drv && phydev->drv->set_tunable)) + return -EOPNOTSUPP; + if (copy_from_user(&tuna, useraddr, sizeof(tuna))) + return -EFAULT; + ret = ethtool_phy_tunable_valid(&tuna); + if (ret) + return ret; + data = kmalloc(tuna.len, GFP_USER); + if (!data) + return -ENOMEM; + useraddr += sizeof(tuna); + ret = -EFAULT; + if (copy_from_user(data, useraddr, tuna.len)) + goto out; + ret = phydev->drv->set_tunable(phydev, &tuna, data); + +out: + kfree(data); + return ret; +} + /* The main entry point in this file. 
Called from net/core/dev_ioctl.c */ int dev_ethtool(struct net *net, struct ifreq *ifr) @@ -2479,6 +2559,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr) case ETHTOOL_GET_TS_INFO:
Re: Netperf UDP issue with connected sockets
On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:
>> time to try IP_MTU_DISCOVER ;)
>
> To Rick, maybe you can find a good solution or option with Eric's hint,
> to send appropriately sized UDP packets with Don't Fragment (DF).

Well, I suppose adding another setsockopt() to the data socket creation
wouldn't be too difficult, along with another command-line option to
cause it to happen.

Could we leave things as "make sure you don't need fragmentation when you
use this" or would netperf have to start processing ICMP messages?

happy benchmarking,

rick jones
[PATCH net-next v3 5/5] net: phy: Add downshift get/set support in Microsemi PHYs driver
From: Raju Lakkaraju Implements the phy tunable function pointers and implement downshift functionality for MSCC PHYs. Signed-off-by: Raju Lakkaraju Reviewed-by: Andrew Lunn Signed-off-by: Allan W. Nielsen --- drivers/net/phy/mscc.c | 100 + 1 file changed, 100 insertions(+) diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c index d0026ab..92018ba 100644 --- a/drivers/net/phy/mscc.c +++ b/drivers/net/phy/mscc.c @@ -46,8 +46,15 @@ enum rgmii_rx_clock_delay { #define MSCC_EXT_PAGE_ACCESS 31 #define MSCC_PHY_PAGE_STANDARD 0x /* Standard registers */ +#define MSCC_PHY_PAGE_EXTENDED 0x0001 /* Extended registers */ #define MSCC_PHY_PAGE_EXTENDED_2 0x0002 /* Extended reg - page 2 */ +/* Extended Page 1 Registers */ +#define MSCC_PHY_ACTIPHY_CNTL20 +#define DOWNSHIFT_CNTL_MASK 0x001C +#define DOWNSHIFT_EN 0x0010 +#define DOWNSHIFT_CNTL_POS 2 + /* Extended Page 2 Registers */ #define MSCC_PHY_RGMII_CNTL 20 #define RGMII_RX_CLK_DELAY_MASK 0x0070 @@ -75,6 +82,8 @@ enum rgmii_rx_clock_delay { #define MSCC_VDDMAC_2500 2500 #define MSCC_VDDMAC_3300 3300 +#define DOWNSHIFT_COUNT_MAX 5 + struct vsc8531_private { int rate_magic; }; @@ -101,6 +110,66 @@ static int vsc85xx_phy_page_set(struct phy_device *phydev, u8 page) return rc; } +static int vsc85xx_downshift_get(struct phy_device *phydev, u8 *count) +{ + int rc; + u16 reg_val; + + mutex_lock(&phydev->lock); + rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED); + if (rc != 0) + goto out_unlock; + + reg_val = phy_read(phydev, MSCC_PHY_ACTIPHY_CNTL); + reg_val &= DOWNSHIFT_CNTL_MASK; + if (!(reg_val & DOWNSHIFT_EN)) + *count = DOWNSHIFT_DEV_DISABLE; + else + *count = ((reg_val & ~DOWNSHIFT_EN) >> DOWNSHIFT_CNTL_POS) + 2; + rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD); + +out_unlock: + mutex_unlock(&phydev->lock); + + return rc; +} + +static int vsc85xx_downshift_set(struct phy_device *phydev, u8 count) +{ + int rc; + u16 reg_val; + + if (count == DOWNSHIFT_DEV_DEFAULT_COUNT) { + /* Default downshift 
count 3 (i.e. Bit3:2 = 0b01) */ + count = ((1 << DOWNSHIFT_CNTL_POS) | DOWNSHIFT_EN); + } else if (count > DOWNSHIFT_COUNT_MAX || count == 1) { + phydev_err(phydev, "Downshift count should be 2,3,4 or 5\n"); + return -ERANGE; + } else if (count) { + /* Downshift count is either 2,3,4 or 5 */ + count = (((count - 2) << DOWNSHIFT_CNTL_POS) | DOWNSHIFT_EN); + } + + mutex_lock(&phydev->lock); + rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED); + if (rc != 0) + goto out_unlock; + + reg_val = phy_read(phydev, MSCC_PHY_ACTIPHY_CNTL); + reg_val &= ~(DOWNSHIFT_CNTL_MASK); + reg_val |= count; + rc = phy_write(phydev, MSCC_PHY_ACTIPHY_CNTL, reg_val); + if (rc != 0) + goto out_unlock; + + rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD); + +out_unlock: + mutex_unlock(&phydev->lock); + + return rc; +} + static int vsc85xx_wol_set(struct phy_device *phydev, struct ethtool_wolinfo *wol) { @@ -329,6 +398,29 @@ static int vsc85xx_default_config(struct phy_device *phydev) return rc; } +static int vsc85xx_get_tunable(struct phy_device *phydev, + struct ethtool_tunable *tuna, void *data) +{ + switch (tuna->id) { + case ETHTOOL_PHY_DOWNSHIFT: + return vsc85xx_downshift_get(phydev, (u8 *)data); + default: + return -EINVAL; + } +} + +static int vsc85xx_set_tunable(struct phy_device *phydev, + struct ethtool_tunable *tuna, + const void *data) +{ + switch (tuna->id) { + case ETHTOOL_PHY_DOWNSHIFT: + return vsc85xx_downshift_set(phydev, *(u8 *)data); + default: + return -EINVAL; + } +} + static int vsc85xx_config_init(struct phy_device *phydev) { int rc; @@ -418,6 +510,8 @@ static struct phy_driver vsc85xx_driver[] = { .probe = &vsc85xx_probe, .set_wol= &vsc85xx_wol_set, .get_wol= &vsc85xx_wol_get, + .get_tunable= &vsc85xx_get_tunable, + .set_tunable= &vsc85xx_set_tunable, }, { .phy_id = PHY_ID_VSC8531, @@ -437,6 +531,8 @@ static struct phy_driver vsc85xx_driver[] = { .probe = &vsc85xx_probe, .set_wol= &vsc85xx_wol_set, .get_wol= &vsc85xx_wol_get, + .get_tunable= 
&vsc85xx_get_tunable, +
[PATCH] net: sky2: Fix shutdown crash
The sky2 driver frequently crashes during machine shutdown with:

 sky2_get_stats+0x60/0x3d8 [sky2]
 dev_get_stats+0x68/0xd8
 rtnl_fill_stats+0x54/0x140
 rtnl_fill_ifinfo+0x46c/0xc68
 rtmsg_ifinfo_build_skb+0x7c/0xf0
 rtmsg_ifinfo.part.22+0x3c/0x70
 rtmsg_ifinfo+0x50/0x5c
 netdev_state_change+0x4c/0x58
 linkwatch_do_dev+0x50/0x88
 __linkwatch_run_queue+0x104/0x1a4
 linkwatch_event+0x30/0x3c
 process_one_work+0x140/0x3e0
 worker_thread+0x60/0x44c
 kthread+0xdc/0xf0
 ret_from_fork+0x10/0x50

This is caused by the sky2 driver being called after it has been shut down.
A previous thread about this can be found here:

https://lkml.org/lkml/2016/4/12/410

An alternative fix is to assure that IFF_UP gets cleared by calling
dev_close() during shutdown. This is similar to what the bnx2/tg3/xgene
and maybe others are doing to assure that the driver isn't being called
following _shutdown().

Signed-off-by: Jeremy Linton
---
 drivers/net/ethernet/marvell/sky2.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c
index f05ea56..941c8e2 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -5220,6 +5220,19 @@ static SIMPLE_DEV_PM_OPS(sky2_pm_ops, sky2_suspend, sky2_resume);
 
 static void sky2_shutdown(struct pci_dev *pdev)
 {
+	struct sky2_hw *hw = pci_get_drvdata(pdev);
+	int port;
+
+	for (port = 0; port < hw->ports; port++) {
+		struct net_device *ndev = hw->dev[port];
+
+		rtnl_lock();
+		if (netif_running(ndev)) {
+			dev_close(ndev);
+			netif_device_detach(ndev);
+		}
+		rtnl_unlock();
+	}
 	sky2_suspend(&pdev->dev);
 	pci_wake_from_d3(pdev, device_may_wakeup(&pdev->dev));
 	pci_set_power_state(pdev, PCI_D3hot);
-- 
2.5.5
[PATCH net v2 1/7] net: ethernet: ti: cpsw: fix bad register access in probe error path
Make sure to keep the platform device runtime-resumed throughout probe to avoid accessing the CPSW registers in the error path (e.g. for deferred probe) with clocks disabled:

Unhandled fault: external abort on non-linefetch (0x1008) at 0xd0872d08
...
[] (cpsw_ale_control_set) from [] (cpsw_ale_destroy+0x2c/0x44)
[] (cpsw_ale_destroy) from [] (cpsw_probe+0xbd0/0x10c4)
[] (cpsw_probe) from [] (platform_drv_probe+0x5c/0xc0)

Fixes: df828598a755 ("netdev: driver: ethernet: Add TI CPSW driver")
Signed-off-by: Johan Hovold
---
 drivers/net/ethernet/ti/cpsw.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index c6cff3d2ff05..f60f8ab7c1e3 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2641,13 +2641,12 @@ static int cpsw_probe(struct platform_device *pdev)
 		goto clean_runtime_disable_ret;
 	}
 	cpsw->version = readl(&cpsw->regs->id_ver);
-	pm_runtime_put_sync(&pdev->dev);
 
 	res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
 	cpsw->wr_regs = devm_ioremap_resource(&pdev->dev, res);
 	if (IS_ERR(cpsw->wr_regs)) {
 		ret = PTR_ERR(cpsw->wr_regs);
-		goto clean_runtime_disable_ret;
+		goto clean_pm_runtime_put_ret;
 	}
 
 	memset(&dma_params, 0, sizeof(dma_params));
@@ -2684,7 +2683,7 @@ static int cpsw_probe(struct platform_device *pdev)
 	default:
 		dev_err(priv->dev, "unknown version 0x%08x\n", cpsw->version);
 		ret = -ENODEV;
-		goto clean_runtime_disable_ret;
+		goto clean_pm_runtime_put_ret;
 	}
 	for (i = 0; i < cpsw->data.slaves; i++) {
 		struct cpsw_slave *slave = &cpsw->slaves[i];
@@ -2713,7 +2712,7 @@ static int cpsw_probe(struct platform_device *pdev)
 	if (!cpsw->dma) {
 		dev_err(priv->dev, "error initializing dma\n");
 		ret = -ENOMEM;
-		goto clean_runtime_disable_ret;
+		goto clean_pm_runtime_put_ret;
 	}
 
 	cpsw->txch[0] = cpdma_chan_create(cpsw->dma, 0, cpsw_tx_handler, 0);
@@ -2815,12 +2814,16 @@ static int cpsw_probe(struct platform_device *pdev)
 		}
 	}
 
+	pm_runtime_put(&pdev->dev);
+
 	return 0;
 
 clean_ale_ret:
 	cpsw_ale_destroy(cpsw->ale);
 clean_dma_ret:
 	cpdma_ctlr_destroy(cpsw->dma);
+clean_pm_runtime_put_ret:
+	pm_runtime_put_sync(&pdev->dev);
 clean_runtime_disable_ret:
 	pm_runtime_disable(&pdev->dev);
 clean_ndev_ret:
--
2.7.3
RE: Netperf UDP issue with connected sockets
From: Jesper Dangaard Brouer
> Sent: 17 November 2016 14:58
> On Thu, 17 Nov 2016 06:17:38 -0800
> Eric Dumazet wrote:
>
> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> >
> > > I can see that qdisc layer does not activate xmit_more in this case.
> > >
> >
> > Sure. Not enough pressure from the sender(s).
> >
> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > limit is kept at a small value.
> >
> > (BTW not all NIC have expensive doorbells)
>
> I believe this NIC mlx5 (50G edition) does.
>
> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> 2414 cycles(tsc) 603.86 ns
>
> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
>
> Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>   Overhead  Command    Shared Object     Symbol
> +    8.92%  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
>    - _raw_spin_lock
>       + 90.78% __dev_queue_xmit
>       + 7.83% dev_queue_xmit
>       + 1.30% ___slab_alloc
> +    5.59%  udp_flood  [kernel.vmlinux]  [k] skb_set_owner_w
> +    4.77%  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
> +    4.09%  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup
> +    4.00%  swapper    [mlx5_core]       [k] mlx5e_poll_tx_cq
> +    3.11%  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash
> +    2.49%  swapper    [kernel.vmlinux]  [k] __slab_free
>
> In this setup the spinlock in __dev_queue_xmit should be uncongested.
> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
>
> But 8.92% of the time is spend on it, which corresponds to a cost of 215
> cycles (2414*0.0892). This cost is too high, thus something else is
> going on... I claim this mysterious extra cost is the tailptr/doorbell.

Try adding code to ring the doorbell twice.
If this doesn't slow things down then it isn't (likely to be) responsible for the delay you are seeing.

	David
[PATCH net v2 4/7] net: ethernet: ti: cpsw: fix of_node and phydev leaks
Make sure to drop references taken and deregister devices registered during probe on probe errors (including deferred probe) and driver unbind.

Specifically, PHY of-node references were never released and fixed-link PHY devices were never deregistered.

Fixes: 9e42f715264f ("drivers: net: cpsw: add phy-handle parsing")
Fixes: 1f71e8c96fc6 ("drivers: net: cpsw: Add support for fixed-link PHY")
Signed-off-by: Johan Hovold
---
 drivers/net/ethernet/ti/cpsw.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 5d14abb06486..c3b78bc4fe58 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2443,6 +2443,41 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 
 static void cpsw_remove_dt(struct platform_device *pdev)
 {
+	struct net_device *ndev = platform_get_drvdata(pdev);
+	struct cpsw_common *cpsw = ndev_to_cpsw(ndev);
+	struct cpsw_platform_data *data = &cpsw->data;
+	struct device_node *node = pdev->dev.of_node;
+	struct device_node *slave_node;
+	int i = 0;
+
+	for_each_available_child_of_node(node, slave_node) {
+		struct cpsw_slave_data *slave_data = &data->slave_data[i];
+
+		if (strcmp(slave_node->name, "slave"))
+			continue;
+
+		if (of_phy_is_fixed_link(slave_node)) {
+			struct phy_device *phydev;
+
+			phydev = of_phy_find_device(slave_node);
+			if (phydev) {
+				fixed_phy_unregister(phydev);
+				/* Put references taken by
+				 * of_phy_find_device() and
+				 * of_phy_register_fixed_link().
+				 */
+				phy_device_free(phydev);
+				phy_device_free(phydev);
+			}
+		}
+
+		of_node_put(slave_data->phy_node);
+
+		i++;
+		if (i == data->slaves)
+			break;
+	}
+
 	of_platform_depopulate(&pdev->dev);
 }
--
2.7.3
[PATCH net-next v3 3/5] ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables
From: Raju Lakkaraju

For operation in cabling environments that are incompatible with 1000BASE-T, a PHY device may provide an automatic link-speed downshift operation. When enabled, the device automatically changes its 1000BASE-T auto-negotiation to the next slower speed after a configured number of failed attempts at 1000BASE-T. This feature is useful in networks using older cable installations that include only pairs A and B, and not pairs C and D.

Signed-off-by: Raju Lakkaraju
Signed-off-by: Allan W. Nielsen
Reviewed-by: Andrew Lunn
---
 include/uapi/linux/ethtool.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 42f696f..f0db778 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -248,9 +248,12 @@ struct ethtool_tunable {
 	void	*data[0];
 };
 
+#define DOWNSHIFT_DEV_DEFAULT_COUNT	0xff
+#define DOWNSHIFT_DEV_DISABLE	0
+
 enum phy_tunable_id {
 	ETHTOOL_PHY_ID_UNSPEC,
-
+	ETHTOOL_PHY_DOWNSHIFT,
 	/*
 	 * Add your fresh new phy tunable attribute above and remember to update
 	 * phy_tunable_strings[] in net/core/ethtool.c
--
2.7.3
[PATCH ethtool v3 1/2] ethtool-copy.h:sync with net
From: Raju Lakkaraju

This covers kernel changes up to:

commit f5a4732f85613b3fb43f8bc33a017e3db3b3605a
Author: Raju Lakkaraju
Date: Wed Nov 9 16:33:09 2016 +0530

    ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables

    For operation in cabling environments that are incompatible with
    1000BASE-T, PHY device may provide an automatic link speed downshift
    operation. When enabled, the device automatically changes its 1000BASE-T
    auto-negotiation to the next slower speed after a configured number of
    failed attempts at 1000BASE-T. This feature is useful in setting up in
    networks using older cable installations that include only pairs A and B,
    and not pairs C and D.

Signed-off-by: Raju Lakkaraju
Signed-off-by: Allan W. Nielsen
---
 ethtool-copy.h | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 70748f5..2e2448f 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -247,6 +247,19 @@ struct ethtool_tunable {
 	void	*data[0];
 };
 
+#define DOWNSHIFT_DEV_DEFAULT_COUNT	0xff
+#define DOWNSHIFT_DEV_DISABLE	0
+
+enum phy_tunable_id {
+	ETHTOOL_PHY_ID_UNSPEC,
+	ETHTOOL_PHY_DOWNSHIFT,
+	/*
+	 * Add your fresh new phy tunable attribute above and remember to update
+	 * phy_tunable_strings[] in net/core/ethtool.c
+	 */
+	__ETHTOOL_PHY_TUNABLE_COUNT,
+};
+
 /**
  * struct ethtool_regs - hardware register dump
  * @cmd: Command number = %ETHTOOL_GREGS
@@ -547,6 +560,7 @@ struct ethtool_pauseparam {
  * @ETH_SS_FEATURES: Device feature names
  * @ETH_SS_RSS_HASH_FUNCS: RSS hush function names
  * @ETH_SS_PHY_STATS: Statistic names, for use with %ETHTOOL_GPHYSTATS
+ * @ETH_SS_PHY_TUNABLES: PHY tunable names
  */
 enum ethtool_stringset {
 	ETH_SS_TEST		= 0,
@@ -557,6 +571,7 @@ enum ethtool_stringset {
 	ETH_SS_RSS_HASH_FUNCS,
 	ETH_SS_TUNABLES,
 	ETH_SS_PHY_STATS,
+	ETH_SS_PHY_TUNABLES,
 };
 
 /**
@@ -1312,7 +1327,8 @@ struct ethtool_per_queue_op {
 #define ETHTOOL_GLINKSETTINGS	0x004c /* Get ethtool_link_settings */
 #define ETHTOOL_SLINKSETTINGS	0x004d /* Set ethtool_link_settings */
-
+#define ETHTOOL_PHY_GTUNABLE	0x004e /* Get PHY tunable configuration */
+#define ETHTOOL_PHY_STUNABLE	0x004f /* Set PHY tunable configuration */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET		ETHTOOL_GSET
--
2.7.3
[PATCH net-next v3 0/5] Adding PHY-Tunables and downshift support
Hi All,

(This is a re-post of the v3 patch set with a new cover letter - I was not aware that cover letters were used as commit comments in merge commits.)

This series adds support for PHY tunables, and uses this facility to configure downshifting. The downshifting mechanism is implemented for MSCC PHYs.

This series tries to address the comments provided back in mid October when this feature was posted along with fast-link-failure. Fast-link-failure has been separated out, but we would like to continue on that if/when we agree on how the PHY tunables and downshifting should be done.

The proposed generic interface is similar to ETHTOOL_GTUNABLE/ETHTOOL_STUNABLE: it uses the same types (ethtool_tunable/tunable_type_id), but a new enum (phy_tunable_id) is added to represent the PHY tunable. The implementation just calls the newly added get_tunable/set_tunable function pointers in the phy_device structure.

To configure downshifting, the ethtool_tunable structure is used. 'id' must be set to 'ETHTOOL_PHY_DOWNSHIFT', 'type_id' must be set to 'ETHTOOL_TUNABLE_U8', and the 'data' value configures the number of downshift retries. If configured to DOWNSHIFT_DEV_DISABLE, downshift is disabled. If configured to DOWNSHIFT_DEV_DEFAULT_COUNT, it is up to the device to choose a device-specific retry count.

Tested on Beaglebone Black with VSC 8531 PHY.

Change set:
v0:
- Link speed downshift and fast-link-failure-2 features configured via device tree.
v1:
- Split the downshift and FLF2 features into different sets of patches.
- Removed DT access and implemented ioctl access as suggested by Andrew.
- Added get_tunable/set_tunable function pointers to the phy_device structure.
v2:
- Added a trace message with a hint that is printed when downshifting could not be enabled with the requested count.
- (ethtool) Syntax changed from "--set-phy-tunable downshift on|off|%d" to "--set-phy-tunable [downshift on|off [count N]]", as requested by Andrew.
v3:
- Fixed spelling in "net: phy: Add downshift get/set support in Microsemi PHYs driver"

Raju Lakkaraju (5):
  ethtool: (uapi) Add ETHTOOL_PHY_GTUNABLE and ETHTOOL_PHY_STUNABLE
  ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE
  ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables
  ethtool: Core impl for ETHTOOL_PHY_DOWNSHIFT tunable
  net: phy: Add downshift get/set support in Microsemi PHYs driver

 drivers/net/phy/mscc.c       | 100 +++++++++++++++++++++++++++++
 include/linux/phy.h          |   7 +++
 include/uapi/linux/ethtool.h |  18 +++++-
 net/core/ethtool.c           |  93 ++++++++++++++++++++++++++
 4 files changed, 217 insertions(+), 1 deletion(-)

--
2.7.4
[PATCH net v2 0/7] net: cpsw: fix leaks and probe deferral
This series fixes as number of leaks and issues in the cpsw probe-error and driver-unbind paths, some which specifically prevented deferred probing. Johan v2 - Keep platform device runtime-resumed throughout probe instead of resuming in the probe error path as suggested by Grygorii (patch 1/7). - Runtime-resume platform device before registering any children in order to make sure it is synchronously suspended after deregistering children in the error path (patch 3/7). Johan Hovold (7): net: ethernet: ti: cpsw: fix bad register access in probe error path net: ethernet: ti: cpsw: fix mdio device reference leak net: ethernet: ti: cpsw: fix deferred probe net: ethernet: ti: cpsw: fix of_node and phydev leaks net: ethernet: ti: cpsw: fix secondary-emac probe error path net: ethernet: ti: cpsw: add missing sanity check net: ethernet: ti: cpsw: fix fixed-link phy probe deferral drivers/net/ethernet/ti/cpsw.c | 95 -- 1 file changed, 74 insertions(+), 21 deletions(-) -- 2.7.3
[PATCH ethtool v3 0/2] Adding downshift support to ethtool
Hi All,

(This is a re-post of the v3 patch set with a new cover letter - I was not aware that the cover letters were used as commit comments in merge commits.)

This patch set implements set/get of downshifting. Downshifting can either be turned on/off, or it can be configured to a specific count. "count" is optional.

Tested on Beaglebone Black with VSC 8531 PHY.

Change set:
v1:
- Initial version of set/get PHY tunable with downshift feature.
v2:
- (ethtool) Syntax changed from "--set-phy-tunable downshift on|off|%d" to "--set-phy-tunable [downshift on|off [count N]]", as requested by Andrew.
v3:
- Fixed spelling in "ethtool-copy.h:sync with net"
- Fixed "if send_ioctl() returns an error, print the error message and then still print the value of count".

Raju Lakkaraju (2):
  ethtool-copy.h:sync with net
  Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY downshift

 ethtool-copy.h |  18 +++-
 ethtool.8.in   |  39 ++++++++
 ethtool.c      | 144 +++++++++++++++++++++++++++
 3 files changed, 200 insertions(+), 1 deletion(-)

--
2.7.4
[PATCH net v2 2/7] net: ethernet: ti: cpsw: fix mdio device reference leak
Make sure to drop the reference taken by of_find_device_by_node() when looking up an mdio device from a phy_id property during probe.

Fixes: 549985ee9c72 ("cpsw: simplify the setup of the register pointers")
Signed-off-by: Johan Hovold
---
 drivers/net/ethernet/ti/cpsw.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index f60f8ab7c1e3..84c5d214557e 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2397,6 +2397,7 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 		}
 		snprintf(slave_data->phy_id, sizeof(slave_data->phy_id),
 			 PHY_ID_FMT, mdio->name, phyid);
+		put_device(&mdio->dev);
 	} else {
 		dev_err(&pdev->dev,
 			"No slave[%d] phy_id, phy-handle, or fixed-link property\n",
--
2.7.3
Re: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN
On 11/13/2016 10:34 AM, Neftin, Sasha wrote:
> On 11/11/2016 12:35 AM, Baicar, Tyler wrote:
>> Hello Sasha,
>>
>> On 11/9/2016 11:19 PM, Neftin, Sasha wrote:
>>> On 11/9/2016 11:41 PM, Tyler Baicar wrote:

 Move IRQ free code so that it will happen regardless of the
 __E1000_DOWN bit. Currently the e1000e driver only releases its IRQ
 if the __E1000_DOWN bit is cleared. This is not sufficient because
 it is possible for __E1000_DOWN to be set without releasing the IRQ.
 In such a situation, we will hit a kernel bug later in e1000_remove
 because the IRQ still has action since it was never freed. A
 secondary bus reset can cause this case to happen.

 Signed-off-by: Tyler Baicar
 ---
  drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

 diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
 index 7017281..36cfcb0 100644
 --- a/drivers/net/ethernet/intel/e1000e/netdev.c
 +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
 @@ -4679,12 +4679,13 @@ int e1000e_close(struct net_device *netdev)

  	if (!test_bit(__E1000_DOWN, &adapter->state)) {
  		e1000e_down(adapter, true);
 -		e1000_free_irq(adapter);

  		/* Link status message must follow this format */
  		pr_info("%s NIC Link is Down\n", adapter->netdev->name);
  	}

 +	e1000_free_irq(adapter);
 +
  	napi_disable(&adapter->napi);

  	e1000e_free_tx_resources(adapter->tx_ring);

>>> I would like not recommend insert this change. This change related
>>> driver state machine, we afraid from lot of synchronization problem and
>>> issues. We need keep e1000_free_irq in loop and check for 'test_bit' ready.
>>
>> What do you mean here? There is no loop. If __E1000_DOWN is set then we
>> will never free the IRQ.
>>
>>> Another point, does before execute secondary bus reset your SW back up
>>> pcie configuration space as properly?
>>
>> After a secondary bus reset, the link needs to recover and go back to a
>> working state after 1 second.
>>
>> From the callstack, the issue is happening while removing the endpoint
>> from the system, before applying the secondary bus reset.
>>
>> The order of events is
>> 1. remove the drivers
>> 2. cause a secondary bus reset
>> 3. wait 1 second
> Actually, this is too much, usually link up in less than 100ms. You can
> check Data Link Layer indication.
>> 4. recover the link
>>
>> callstack:
>> free_msi_irqs+0x6c/0x1a8
>> pci_disable_msi+0xb0/0x148
>> e1000e_reset_interrupt_capability+0x60/0x78
>> e1000_remove+0xc8/0x180
>> pci_device_remove+0x48/0x118
>> __device_release_driver+0x80/0x108
>> device_release_driver+0x2c/0x40
>> pci_stop_bus_device+0xa0/0xb0
>> pci_stop_bus_device+0x3c/0xb0
>> pci_stop_root_bus+0x54/0x80
>> acpi_pci_root_remove+0x28/0x64
>> acpi_bus_trim+0x6c/0xa4
>> acpi_device_hotplug+0x19c/0x3f4
>> acpi_hotplug_work_fn+0x28/0x3c
>> process_one_work+0x150/0x460
>> worker_thread+0x50/0x4b8
>> kthread+0xd4/0xe8
>> ret_from_fork+0x10/0x50
>>
>> Thanks,
>> Tyler
>>
> Hello Tyler,
> Okay, we need consult more about this suggestion.
> May I ask what is setup you run? Is there NIC or on board LAN? I would
> like try reproduce this issue in our lab's too.
> Also, is same issue observed with same scenario and others NIC's too?
> Sasha
> ___
> Intel-wired-lan mailing list
> intel-wired-...@lists.osuosl.org
> http://lists.osuosl.org/mailman/listinfo/intel-wired-lan

Hello Tyler,
I see some inconsistent implementation of __*_close methods in our drivers. Do you have any igb NIC to check if the same problem persists there?
Thanks,
Sasha
Re: Netperf UDP issue with connected sockets
On Thu, 17 Nov 2016 06:17:38 -0800
Eric Dumazet wrote:

> On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
>
> > I can see that qdisc layer does not activate xmit_more in this case.
> >
>
> Sure. Not enough pressure from the sender(s).
>
> The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> limit is kept at a small value.
>
> (BTW not all NIC have expensive doorbells)

I believe this NIC mlx5 (50G edition) does.

I'm seeing UDP TX of 1656017.55 pps, which is per packet:
2414 cycles(tsc) 603.86 ns

Perf top shows (with my own udp_flood, that avoids __ip_select_ident):

 Samples: 56K of event 'cycles', Event count (approx.): 51613832267
   Overhead  Command    Shared Object     Symbol
 +    8.92%  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
    - _raw_spin_lock
       + 90.78% __dev_queue_xmit
       + 7.83% dev_queue_xmit
       + 1.30% ___slab_alloc
 +    5.59%  udp_flood  [kernel.vmlinux]  [k] skb_set_owner_w
 +    4.77%  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
 +    4.09%  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup
 +    4.00%  swapper    [mlx5_core]       [k] mlx5e_poll_tx_cq
 +    3.11%  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash
 +    2.49%  swapper    [kernel.vmlinux]  [k] __slab_free

In this setup the spinlock in __dev_queue_xmit should be uncongested.
An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.

But 8.92% of the time is spent on it, which corresponds to a cost of 215
cycles (2414*0.0892). This cost is too high, thus something else is
going on... I claim this mysterious extra cost is the tailptr/doorbell.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH] net: sky2: Fix shutdown crash
On 17/11/16 15:14, Jeremy Linton wrote:

 The sky2 frequently crashes during machine shutdown with:

 sky2_get_stats+0x60/0x3d8 [sky2]
 dev_get_stats+0x68/0xd8
 rtnl_fill_stats+0x54/0x140
 rtnl_fill_ifinfo+0x46c/0xc68
 rtmsg_ifinfo_build_skb+0x7c/0xf0
 rtmsg_ifinfo.part.22+0x3c/0x70
 rtmsg_ifinfo+0x50/0x5c
 netdev_state_change+0x4c/0x58
 linkwatch_do_dev+0x50/0x88
 __linkwatch_run_queue+0x104/0x1a4
 linkwatch_event+0x30/0x3c
 process_one_work+0x140/0x3e0
 worker_thread+0x60/0x44c
 kthread+0xdc/0xf0
 ret_from_fork+0x10/0x50

 This is caused by the sky2 being called after it has been shutdown. A
 previous thread about this can be found here:

 https://lkml.org/lkml/2016/4/12/410

 An alternative fix is to assure that IFF_UP gets cleared by calling
 dev_close() during shutdown. This is similar to what the bnx2/tg3/xgene
 and maybe others are doing to assure that the driver isn't being called
 following _shutdown().

 Signed-off-by: Jeremy Linton

Since this issue has been very random and on/off recently, it's quite hard to test this and confirm. However I did around 20 reboots/shutdowns and could not reproduce the issue after applying this patch. So,

Tested-by: Sudeep Holla

--
Regards,
Sudeep
RE: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM
-Original Message-
From: Harini Katakam [mailto:harinikatakamli...@gmail.com]
Sent: 17 listopada 2016 13:22
To: Rafal Ozieblo
Cc: harini.kata...@xilinx.com; nicolas.fe...@atmel.com; netdev@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM

> Hi Rafal,
>
> On Thu, Nov 17, 2016 at 5:20 PM, Rafal Ozieblo wrote:
> > Hello,
> > I think, there could a bug in your patch.
> >
> >> +
> >> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> >> +	dmacfg |= GEM_BIT(ADDR64);
> >> #endif
> >
> > You enable 64 bit addressing (64b dma bus width) always when appropriate
> > architecture config option is enabled.
> > But there are some legacy controllers which do not support that feature.
> > According Cadence hardware team:
> > "64 bit addressing was added in July 2013. Earlier version do not have it.
> > This feature was enhanced in release August 2014 to have separate upper
> > address values for transmit and receive."
> >
> >> /* Bitfields in NSR */
> >> @@ -474,6 +479,10 @@
> >> struct macb_dma_desc {
> >>	u32 addr;
> >>	u32 ctrl;
> >> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> >> +	u32 addrh;
> >> +	u32 resvd;
> >> +#endif
> >> };
> >
> > It will not work for legacy hardware. Old descriptor is 2 words wide, the
> > new one is 4 words wide. If you enable CONFIG_ARCH_DMA_ADDR_T_64BIT
> > but hardware doesn't support it at all, you will miss every second descriptor.
> >
>
> True, this feature is not available in all of Cadence IP versions.
> In fact, the IP version Zynq does not support this. But the one in ZynqMP does.
> So, we enable kernel config for 64 bit DMA addressing for this SoC and hence
> the driver picks it up. My assumption was that if the legacy IP does not support
> 64 bit addressing, then this DMA option wouldn't be enabled.

What happens, for example, with arm64 (which enables CONFIG_ARCH_DMA_ADDR_T_64BIT by default) and a legacy IP controller? Or systems with multiple IPs, both with and without 64-bit DMA capability? It might result in unpredictable behavior (explained below).

> There is a design config register in Cadence IP which is being read to check
> for 64 bit address support - DMA mask is set based on that.
> But the addition of two descriptor words cannot be based on this runtime check.
> For this reason, all the static changes were placed under this check.

Are you talking about this?

+#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
+	if (GEM_BFEXT(DBWDEF, gem_readl(bp, DCFG1)) > GEM_DBW32)
+		dma_set_mask(&pdev->dev, DMA_BIT_MASK(44));
+#endif
+

It only sets the maximum address which can be seen on the address bus (its mask, exactly).

+static inline void macb_set_addr(struct macb_dma_desc *desc, dma_addr_t addr)
+{
+	desc->addr = (u32)addr;
+#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
+	desc->addrh = (u32)(addr >> 32);
+#endif
+}
+

Because addr is 32b wide, (u32)(addr >> 32) equals 0. The IP controller uses 2 words for a DMA descriptor, so desc->addr from the second hardware descriptor will be overwritten by desc->addrh from the first software descriptor (and likewise desc->ctrl from the second hardware descriptor will be overwritten by desc->resvd).

Regards,
Harini
[PATCH net v2 5/7] net: ethernet: ti: cpsw: fix secondary-emac probe error path
Make sure to deregister the primary device in case the secondary emac fails to probe.

kernel BUG at /home/johan/work/omicron/src/linux/net/core/dev.c:7743!
...
[] (free_netdev) from [] (cpsw_probe+0x9cc/0xe50)
[] (cpsw_probe) from [] (platform_drv_probe+0x5c/0xc0)

Fixes: d9ba8f9e6298 ("driver: net: ethernet: cpsw: dual emac interface implementation")
Signed-off-by: Johan Hovold
---
 drivers/net/ethernet/ti/cpsw.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index c3b78bc4fe58..11b2daef3158 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2852,7 +2852,7 @@ static int cpsw_probe(struct platform_device *pdev)
 		ret = cpsw_probe_dual_emac(priv);
 		if (ret) {
 			cpsw_err(priv, probe, "error probe slave 2 emac interface\n");
-			goto clean_ale_ret;
+			goto clean_unregister_netdev_ret;
 		}
 	}
@@ -2860,6 +2860,8 @@ static int cpsw_probe(struct platform_device *pdev)
 
 	return 0;
 
+clean_unregister_netdev_ret:
+	unregister_netdev(ndev);
 clean_ale_ret:
 	cpsw_ale_destroy(cpsw->ale);
 clean_dma_ret:
--
2.7.3
Re: [Patch net] net: check dead netns for peernet2id_alloc()
From: Cong Wang
Date: Wed, 16 Nov 2016 10:27:02 -0800

> Andrei reports we still allocate netns ID from idr after we destroy
> it in cleanup_net().
>
> cleanup_net():
> ...
> idr_destroy(&net->netns_ids);
> ...
> list_for_each_entry_reverse(ops, &pernet_list, list)
> ops_exit_list(ops, &net_exit_list);
> -> rollback_registered_many()
> -> rtmsg_ifinfo_build_skb()
> -> rtnl_fill_ifinfo()
>    -> peernet2id_alloc()
>
> After that point we should not even access net->netns_ids, we
> should check the death of the current netns as early as we can in
> peernet2id_alloc().
>
> For net-next we can consider to avoid sending rtmsg totally,
> it is a good optimization for netns teardown path.
>
> Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids")
> Reported-by: Andrei Vagin
> Cc: Nicolas Dichtel
> Signed-off-by: Cong Wang

Applied and queued up for -stable, thanks.
Re: [PATCH net-next] net/mlx4_en: remove napi_hash_del() call
From: Eric Dumazet
Date: Wed, 16 Nov 2016 05:49:22 -0800

> From: Eric Dumazet
>
> There is no need calling napi_hash_del()+synchronize_rcu() before
> calling netif_napi_del()
>
> netif_napi_del() does this already.
>
> Using napi_hash_del() in a driver is useful only when dealing with
> a batch of NAPI structures, so that a single synchronize_rcu() can
> be used. mlx4_en_deactivate_cq() is deactivating a single NAPI.
>
> Signed-off-by: Eric Dumazet

Applied, thanks Eric.
Re: [PATCH net 0/3] net: phy: fix of_node and device leaks
From: Johan Hovold
Date: Wed, 16 Nov 2016 15:20:35 +0100

> These patches fix a couple of of_node leaks in the fixed-link code and a
> device reference leak in a phy helper.

Series applied, thanks.