Re: [PATCH 3/3] 3c59x: Use setup_timer()
On Sun, 28 Feb 2016, Amitoj Kaur Chawla wrote:

> On Sun, Feb 28, 2016 at 12:18 AM, Stafford Horne wrote:
>>
>> On Thu, 25 Feb 2016, David Miller wrote:
>>
>>> From: Amitoj Kaur Chawla
>>> Date: Wed, 24 Feb 2016 19:28:19 +0530
>>>
>>>> Convert a call to init_timer and accompanying initializations of
>>>> the timer's data and function fields to a call to setup_timer.
>>>>
>>>> The Coccinelle semantic patch that fixes this problem is
>>>> as follows:
>>>>
>>>> //
>>>> @@
>>>> expression t,f,d;
>>>> @@
>>>>
>>>> -init_timer(&t);
>>>> +setup_timer(&t,f,d);
>>>> ...
>>>> -t.data = d;
>>>> -t.function = f;
>>>> //
>>>>
>>>> Signed-off-by: Amitoj Kaur Chawla
>>>
>>> Applied.
>>
>> Hi David, Amitoj,
>>
>> The patch here seemed to remove the call to add_timer(&vp->timer) which
>> applies the expires time. Would that be an issue?
>>
>> -Stafford
>
> I'm sorry. This is my mistake. How can I rectify it now that the patch
> is applied? Should I send a patch adding it back?

I sent a patch just now which could help restore the behavior. It is
applied on top of your patch, which I pulled from Dave's tree here:

git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git

-Stafford
[PATCH] 3c59x: Ensure the expires time is applied
In commit 5b6490def9168af6a ("3c59x: Use setup_timer()") Amitoj removed
the add_timer() call which arms the expires timer. This patch restores
that behavior, but uses mod_timer(), which is a bit more compact.

Signed-off-by: Stafford Horne
---
I think a patch like this will help restore the behavior. It is also a
small cleanup, since we don't need a separate assignment to expires and
a call to add_timer(). But that's a style preference.

 drivers/net/ethernet/3com/3c59x.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/3com/3c59x.c b/drivers/net/ethernet/3com/3c59x.c
index c377607..7b881ed 100644
--- a/drivers/net/ethernet/3com/3c59x.c
+++ b/drivers/net/ethernet/3com/3c59x.c
@@ -1602,7 +1602,7 @@ vortex_up(struct net_device *dev)
 	}

 	setup_timer(&vp->timer, vortex_timer, (unsigned long)dev);
-	vp->timer.expires = RUN_AT(media_tbl[dev->if_port].wait);
+	mod_timer(&vp->timer, RUN_AT(media_tbl[dev->if_port].wait));
 	setup_timer(&vp->rx_oom_timer, rx_oom_timer, (unsigned long)dev);

 	if (vortex_debug > 1)
--
2.5.0
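For readers less familiar with the timer API, a minimal sketch of the
equivalence being discussed (illustrative names, not the 3c59x code):
setup_timer() replaces the manual field assignments done after
init_timer(), and mod_timer() both sets the expires time and arms the
timer, which is what the dropped add_timer() used to do.

	/* Old idiom: initialize, fill in the fields, arm explicitly. */
	init_timer(&t);
	t.function = my_timer_fn;		/* hypothetical callback */
	t.data     = (unsigned long)dev;
	t.expires  = jiffies + HZ;
	add_timer(&t);

	/* Equivalent idiom after the conversion plus this fix:
	 * setup_timer() fills in function/data, and mod_timer() sets
	 * expires and arms the timer in one call. */
	setup_timer(&t, my_timer_fn, (unsigned long)dev);
	mod_timer(&t, jiffies + HZ);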
Re: [PATCH 3/3] 3c59x: Use setup_timer()
On Sun, Feb 28, 2016 at 12:18 AM, Stafford Horne wrote:
>
> On Thu, 25 Feb 2016, David Miller wrote:
>
>> From: Amitoj Kaur Chawla
>> Date: Wed, 24 Feb 2016 19:28:19 +0530
>>
>>> Convert a call to init_timer and accompanying initializations of
>>> the timer's data and function fields to a call to setup_timer.
>>>
>>> The Coccinelle semantic patch that fixes this problem is
>>> as follows:
>>>
>>> //
>>> @@
>>> expression t,f,d;
>>> @@
>>>
>>> -init_timer(&t);
>>> +setup_timer(&t,f,d);
>>> ...
>>> -t.data = d;
>>> -t.function = f;
>>> //
>>>
>>> Signed-off-by: Amitoj Kaur Chawla
>>
>> Applied.
>
> Hi David, Amitoj,
>
> The patch here seemed to remove the call to add_timer(&vp->timer) which
> applies the expires time. Would that be an issue?
>
> -Stafford

I'm sorry. This is my mistake. How can I rectify it now that the patch
is applied? Should I send a patch adding it back?

Amitoj
Re: Softirq priority inversion from "softirq: reduce latencies"
On Sat, 2016-02-27 at 10:19 -0800, Peter Hurley wrote:
> Hi Eric,
>
> For a while now, we've been struggling to understand why we've been
> observing missed uart rx DMA.
>
> Because both the uart driver (omap8250) and the dmaengine driver
> (edma) were (relatively) new, we assumed there was some race between
> starting a new rx DMA and processing the previous one.

Hrm, relatively new + tasklet woes rings a bell. Ah, that.. What's
worse is that at the point where this code was written, it was already
well known that tasklets are a steaming pile of crap and should die.

Source thereof: https://lwn.net/Articles/588457/

	-Mike
Re: [PATCH net-next 09/10] net/mlx5: Fix global UAR mapping
From: Saeed Mahameed
Date: Thu, 25 Feb 2016 18:33:19 +0200

> @@ -246,11 +246,11 @@ int mlx5_alloc_map_uar(struct mlx5_core_dev *mdev, struct mlx5_uar *uar)
>  		err = -ENOMEM;
>  		goto err_free_uar;
>  	}
> -
> -	if (mdev->priv.bf_mapping)
> -		uar->bf_map = io_mapping_map_wc(mdev->priv.bf_mapping,
> -						uar->index << PAGE_SHIFT);
> -
> +#ifdef ARCH_HAS_IOREMAP_WC
> +	uar->bf_map = ioremap_wc(pfn << PAGE_SHIFT, PAGE_SIZE);
> +	if (!uar->bf_map)
> +		mlx5_core_warn(mdev, "ioremap_wc() failed\n");
> +#endif

Sorry, this looks very wrong to me.

It makes no sense to only map this resource if ARCH_HAS_IOREMAP_WC is
defined. The interface _always_ exists; ARCH_HAS_IOREMAP_WC is an
internal symbol that include/asm-generic/iomap.h uses to determine
whether to provide a generic implementation of the interface or not.

I'm not applying this series until you either fix this or explain what
you are doing here in the commit message.

Thanks.
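A sketch of the shape being asked for here (hypothetical driver code,
not the actual mlx5 fix): ioremap_wc() is defined on every
architecture, with asm-generic falling back to an uncached mapping
where write-combining is unsupported, so the call can simply be made
unconditionally.

	/* ioremap_wc() always exists; on architectures without WC
	 * support, asm-generic maps it to an uncached variant, so no
	 * preprocessor guard is needed around the call. */
	uar->bf_map = ioremap_wc(pfn << PAGE_SHIFT, PAGE_SIZE);
	if (!uar->bf_map)
		mlx5_core_warn(mdev, "ioremap_wc() failed\n");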
Re: Softirq priority inversion from "softirq: reduce latencies"
From: Peter Hurley
Date: Sat, 27 Feb 2016 18:10:27 -0800

> That tasklet should run before any process.

You never have this guarantee, even before Eric's patch. Under load,
tasklets run from ksoftirqd just like any other softirq.

Please fix your driver and stop blaming Eric's change.

Thank you.
Re: [net-next PATCH v3 1/3] net: sched: consolidate offload decision in cls_u32
On Fri, Feb 26, 2016 at 8:24 PM, John Fastabend wrote:
> On 16-02-26 09:39 AM, Cong Wang wrote:
>> On Fri, Feb 26, 2016 at 7:53 AM, John Fastabend wrote:
>>> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
>>> index 2121df5..e64d20b 100644
>>> --- a/include/net/pkt_cls.h
>>> +++ b/include/net/pkt_cls.h
>>> @@ -392,4 +392,9 @@ struct tc_cls_u32_offload {
>>>  	};
>>> };
>>>
>>> +static inline bool tc_should_offload(struct net_device *dev)
>>> +{
>>> +	return dev->netdev_ops->ndo_setup_tc;
>>> +}
>>> +
>>
>> These should be protected by CONFIG_NET_CLS_U32, no?
>>
>
> It's not necessary; it is a completely general function, and I only
> lifted it out of cls_u32 so that the cls_flower classifier could
> also use it.
>
> I don't see the need off-hand to have it wrapped in an OR'd ifdef
> statement where it's (CONFIG_NET_CLS_U32 | CONFIG_NET_CLS_X ...).
> Any particular reason you were thinking it should be wrapped in ifdefs?

Not a big deal. I just feel these don't need to compile when I have
CONFIG_NET_CLS_U32=n.

Thanks.
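For context, a sketch of how a classifier call site would use this
helper (illustrative only, assuming the net-next ndo_setup_tc()
signature from this series; the real cls_u32/cls_flower call sites
differ in detail). Since tc_should_offload() is a static inline in a
header, it only generates code where a classifier actually calls it,
which is why an ifdef guard buys little:

	/* Illustrative call site: only attempt hardware offload when
	 * the device actually implements ndo_setup_tc. */
	if (tc_should_offload(dev)) {
		struct tc_to_netdev offload = { .type = TC_SETUP_CLSU32 };

		dev->netdev_ops->ndo_setup_tc(dev, handle, protocol,
					      &offload);
	}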
Re: [net-next-2.6 PATCH v4 3/3] Support to encoding decoding skb prio on IFE action
On Sat, Feb 27, 2016 at 5:08 AM, Jamal Hadi Salim wrote:
> From: Jamal Hadi Salim
>
> Example usage:
> Set the skb priority using skbedit, then allow it to be encoded:
>
> sudo tc qdisc add dev $ETH root handle 1: prio
> sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
>	u32 match ip protocol 1 0xff flowid 1:2 \
>	action skbedit prio 17 \
>	action ife encode \
>	allow prio \
>	dst 02:15:15:15:15:15
>
> Note: You don't need the skbedit action if you are already encoding the
> skb priority earlier. A zero skb priority will not be sent.
>
> Alternatively, hard-code a static priority of decimal 33 (unlike skbedit)
> and then a mark of 0x12 every time the filter matches:
>
> sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
>	u32 match ip protocol 1 0xff flowid 1:2 \
>	action ife encode \
>	type 0xDEAD \
>	use prio 33 \
>	use mark 0x12 \
>	dst 02:15:15:15:15:15
>
> Signed-off-by: Jamal Hadi Salim

Acked-by: Cong Wang
Re: [net-next-2.6 PATCH v4 2/3] Support to encoding decoding skb mark on IFE action
On Sat, Feb 27, 2016 at 5:08 AM, Jamal Hadi Salim wrote:
> From: Jamal Hadi Salim
>
> Example usage:
> Set the skb mark using skbedit, then allow it to be encoded:
>
> sudo tc qdisc add dev $ETH root handle 1: prio
> sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
>	u32 match ip protocol 1 0xff flowid 1:2 \
>	action skbedit mark 17 \
>	action ife encode \
>	allow mark \
>	dst 02:15:15:15:15:15
>
> Note: You don't need the skbedit action if you are already encoding the
> skb mark earlier. A zero skb mark, when seen, will not be encoded.
>
> Alternatively, hard-code a static mark of 0x12 every time the filter
> matches:
>
> sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
>	u32 match ip protocol 1 0xff flowid 1:2 \
>	action ife encode \
>	type 0xDEAD \
>	use mark 0x12 \
>	dst 02:15:15:15:15:15
>
> Signed-off-by: Jamal Hadi Salim

Acked-by: Cong Wang
Re: [net-next-2.6 PATCH v4 1/3] introduce IFE action
On Sat, Feb 27, 2016 at 5:08 AM, Jamal Hadi Salim wrote:
> From: Jamal Hadi Salim
>
> This action allows a sending side to encapsulate arbitrary metadata,
> which is decapsulated by the receiving end.
> The sender runs in encoding mode and the receiver in decode mode.
> Both sender and receiver must specify the same ethertype.
> At some point we hope to have a registered ethertype, and we'll
> then provide a default so the user doesn't have to specify it.
> For now we enforce that the user specify it.
> [...]
>
> Signed-off-by: Jamal Hadi Salim

Acked-by: Cong Wang

Thanks for updating it!
[Patch net-next] net: remove skb_sender_cpu_clear()
After commit 52bd2d62ce67 ("net: better skb->sender_cpu and skb->napi_id
cohabitation") skb_sender_cpu_clear() becomes empty and can be removed.

Cc: Eric Dumazet
Signed-off-by: Cong Wang
---
 include/linux/skbuff.h          | 4 ----
 net/bridge/br_forward.c         | 1 -
 net/core/filter.c               | 2 --
 net/core/skbuff.c               | 1 -
 net/ipv4/ip_forward.c           | 1 -
 net/ipv6/ip6_output.c           | 1 -
 net/netfilter/ipvs/ip_vs_xmit.c | 6 ------
 net/netfilter/nf_dup_netdev.c   | 1 -
 net/sched/act_mirred.c          | 1 -
 9 files changed, 18 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index eab4f8f..797cefb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1161,10 +1161,6 @@ static inline void skb_copy_hash(struct sk_buff *to, const struct sk_buff *from)
 	to->l4_hash = from->l4_hash;
 };

-static inline void skb_sender_cpu_clear(struct sk_buff *skb)
-{
-}
-
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
 static inline unsigned char *skb_end_pointer(const struct sk_buff *skb)
 {

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index fcdb86d..f47759f 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -44,7 +44,6 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb
 	skb_push(skb, ETH_HLEN);
 	br_drop_fake_rtable(skb);
-	skb_sender_cpu_clear(skb);

 	if (skb->ip_summed == CHECKSUM_PARTIAL &&
 	    (skb->protocol == htons(ETH_P_8021Q) ||

diff --git a/net/core/filter.c b/net/core/filter.c
index a3aba15..5e2a3b5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1597,7 +1597,6 @@ static u64 bpf_clone_redirect(u64 r1, u64 ifindex, u64 flags, u64 r4, u64 r5)
 	}

 	skb2->dev = dev;
-	skb_sender_cpu_clear(skb2);
 	return dev_queue_xmit(skb2);
 }
@@ -1650,7 +1649,6 @@ int skb_do_redirect(struct sk_buff *skb)
 	}

 	skb->dev = dev;
-	skb_sender_cpu_clear(skb);
 	return dev_queue_xmit(skb);
 }

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 488566b..7af7ec6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4302,7 +4302,6 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 	skb->skb_iif = 0;
 	skb->ignore_df = 0;
 	skb_dst_drop(skb);
-	skb_sender_cpu_clear(skb);
 	secpath_reset(skb);
 	nf_reset(skb);
 	nf_reset_trace(skb);

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index da0d7ce..af18f1e 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -71,7 +71,6 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
 	if (unlikely(opt->optlen))
 		ip_forward_options(skb);

-	skb_sender_cpu_clear(skb);
 	return dst_output(net, sk, skb);
 }

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a163102..9428345 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -332,7 +332,6 @@ static int ip6_forward_proxy_check(struct sk_buff *skb)
 static inline int ip6_forward_finish(struct net *net, struct sock *sk,
 				     struct sk_buff *skb)
 {
-	skb_sender_cpu_clear(skb);
 	return dst_output(net, sk, skb);
 }

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index a3f5cd9..dc196a0 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -531,8 +531,6 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff *skb,
 	if (ret == NF_ACCEPT) {
 		nf_reset(skb);
 		skb_forward_csum(skb);
-		if (!skb->sk)
-			skb_sender_cpu_clear(skb);
 	}
 	return ret;
 }
@@ -573,8 +571,6 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct sk_buff *skb,
 	if (!local) {
 		skb_forward_csum(skb);
-		if (!skb->sk)
-			skb_sender_cpu_clear(skb);
 		NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb,
 			NULL, skb_dst(skb)->dev, dst_output);
 	} else
@@ -595,8 +591,6 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff *skb,
 	if (!local) {
 		ip_vs_drop_early_demux_sk(skb);
 		skb_forward_csum(skb);
-		if (!skb->sk)
-			skb_sender_cpu_clear(skb);
 		NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb,
 			NULL, skb_dst(skb)->dev, dst_output);
 	} else

diff --git a/net/netfilter/nf_dup_netdev.c b/net/netfilter/nf_dup_netdev.c
index 8414ee1..7ec6972 100644
--- a/net/netfilter/nf_dup_netdev.c
+++ b/net/netfilter/nf_dup_netdev.c
@@ -31,7 +31,6 @@ void nf_dup_netdev_egress(const struct nft_pktinfo *pkt, int oif)
 	skb_push(skb, skb->mac_len);

 	skb->dev = dev;
-	skb_sender_cpu_clear(skb);
 	dev_queue_xmit(skb);
 }
 EXPORT_SY
[PATCH net] sctp: sctp_remaddr_seq_show uses the wrong variable to dump transport info
Now in sctp_remaddr_seq_show(), we use the variable *tsp to get the
parameter *v, but *tsp is also used to traverse transport_addr_list,
which overwrites the previous value and makes sctp_transport_put() work
on the wrong transport.

So fix it by adding a new variable to hold the parameter *v.

Fixes: fba4c330c5b9 ("sctp: hold transport before we access t->asoc in sctp proc")
Signed-off-by: Xin Long
---
 net/sctp/proc.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/sctp/proc.c b/net/sctp/proc.c
index ded7d93..963dffc 100644
--- a/net/sctp/proc.c
+++ b/net/sctp/proc.c
@@ -482,7 +482,7 @@ static void sctp_remaddr_seq_stop(struct seq_file *seq, void *v)
 static int sctp_remaddr_seq_show(struct seq_file *seq, void *v)
 {
 	struct sctp_association *assoc;
-	struct sctp_transport *tsp;
+	struct sctp_transport *transport, *tsp;

 	if (v == SEQ_START_TOKEN) {
 		seq_printf(seq, "ADDR ASSOC_ID HB_ACT RTO MAX_PATH_RTX "
@@ -490,10 +490,10 @@ static int sctp_remaddr_seq_show(struct seq_file *seq, void *v)
 		return 0;
 	}

-	tsp = (struct sctp_transport *)v;
-	if (!sctp_transport_hold(tsp))
+	transport = (struct sctp_transport *)v;
+	if (!sctp_transport_hold(transport))
 		return 0;
-	assoc = tsp->asoc;
+	assoc = transport->asoc;

 	list_for_each_entry_rcu(tsp, &assoc->peer.transport_addr_list,
 				transports) {
@@ -546,7 +546,7 @@ static int sctp_remaddr_seq_show(struct seq_file *seq, void *v)
 		seq_printf(seq, "\n");
 	}

-	sctp_transport_put(tsp);
+	sctp_transport_put(transport);

 	return 0;
 }
--
2.1.0
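The underlying pitfall is generic: reusing one pointer both as a held
reference and as a list_for_each_entry_rcu() cursor. A minimal sketch
of the bug pattern being fixed (simplified from the code above; error
handling omitted):

	struct sctp_transport *tsp = v;

	sctp_transport_hold(tsp);	/* reference taken on v */

	/* tsp is clobbered here: it serves as the iteration cursor
	 * and ends up pointing at the last list entry visited. */
	list_for_each_entry_rcu(tsp, &assoc->peer.transport_addr_list,
				transports) {
		/* ... dump each transport ... */
	}

	sctp_transport_put(tsp);	/* BUG: drops a reference on the
					 * cursor value, not on v */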
[net-next][PATCH v2 00/13] RDS: Major clean-up with couple of new features for 4.6
v2:
Dropped module parameter from [PATCH 11/13] as suggested by David Miller.

Series is generated against net-next but also applies against Linus's
tip cleanly. The entire patchset is available at the git tree below:

git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.6/net-next/rds_v2

The diff-stat looks a bit scary since almost ~4K lines of code are
getting removed.

Brief summary of the series:

- Drop the stale iWARP support:
  RDS iWARP support code has become stale and non-testable for some
  time. As discussed and agreed earlier on the list, am dropping its
  support for good. If new iWARP user(s) show up in future, the plan
  is to adapt the existing IB RDMA with a special sink case.

- RDS gets SO_TIMESTAMP support.

- The long-due RDS maintainer entry gets updated.

- Some RDS IB code refactoring towards the new FastReg memory
  registration (FRMR).

- Lastly, the initial support for FRMR.

RDS IB RDMA performance with FRMR is not yet as good as FMR, and I do
have some patches in progress to address that. But they are not ready
for 4.6, so I left them out of this series. Also, am keeping an eye on
the new CQ API adaptations like other ULPs are doing, and will try to
adapt RDS for the same, most likely in the 4.7+ timeframe.

Santosh Shilimkar (12):
  RDS: Drop stale iWARP RDMA transport
  RDS: Add support for SO_TIMESTAMP for incoming messages
  MAINTAINERS: update RDS entry
  RDS: IB: Remove the RDS_IB_SEND_OP dependency
  RDS: IB: Re-organise ibmr code
  RDS: IB: create struct rds_ib_fmr
  RDS: IB: move FMR code to its own file
  RDS: IB: add connection info to ibmr
  RDS: IB: handle the RDMA CM time wait event
  RDS: IB: add mr reused stats
  RDS: IB: add Fastreg MR (FRMR) detection support
  RDS: IB: allocate extra space on queues for FRMR support

Avinash Repaka (1):
  RDS: IB: Support Fastreg MR (FRMR) memory registration mode

 Documentation/networking/rds.txt |    4 +-
 MAINTAINERS                      |    6 +-
 net/rds/Kconfig                  |    7 +-
 net/rds/Makefile                 |    4 +-
 net/rds/af_rds.c                 |   26 ++
 net/rds/ib.c                     |   47 +-
 net/rds/ib.h                     |   37 +-
 net/rds/ib_cm.c                  |   59 ++-
 net/rds/ib_fmr.c                 |  248 ++
 net/rds/ib_frmr.c                |  376 +++
 net/rds/ib_mr.h                  |  148 ++
 net/rds/ib_rdma.c                |  495 ++--
 net/rds/ib_send.c                |    6 +-
 net/rds/ib_stats.c               |    2 +
 net/rds/iw.c                     |  312 -
 net/rds/iw.h                     |  398
 net/rds/iw_cm.c                  |  769 --
 net/rds/iw_rdma.c                |  837 -
 net/rds/iw_recv.c                |  904
 net/rds/iw_ring.c                |  169 ---
 net/rds/iw_send.c                |  981 ---
 net/rds/iw_stats.c               |   95
 net/rds/iw_sysctl.c              |  123 -
 net/rds/rdma_transport.c         |   21 +-
 net/rds/rdma_transport.h         |    5 -
 net/rds/rds.h                    |    1 +
 net/rds/recv.c                   |   20 +-
 27 files changed, 1065 insertions(+), 5035 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_frmr.c
 create mode 100644 net/rds/ib_mr.h
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c
--
1.9.1
[net-next][PATCH v2 01/13] RDS: Drop stale iWARP RDMA transport
RDS iWARP support code has become stale and non-testable. As indicated
earlier, am dropping the support for it.

If new iWARP user(s) show up in future, we can adapt the RDS IB
transport for the special RDMA READ sink case. iWARP needs an MR for
the RDMA READ sink.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 Documentation/networking/rds.txt |   4 +-
 net/rds/Kconfig                  |   7 +-
 net/rds/Makefile                 |   4 +-
 net/rds/iw.c                     | 312 -----
 net/rds/iw.h                     | 398 ----
 net/rds/iw_cm.c                  | 769 --
 net/rds/iw_rdma.c                | 837 -
 net/rds/iw_recv.c                | 904 ----
 net/rds/iw_ring.c                | 169 ---
 net/rds/iw_send.c                | 981 ---
 net/rds/iw_stats.c               |  95 ----
 net/rds/iw_sysctl.c              | 123 -
 net/rds/rdma_transport.c         |  13 +-
 net/rds/rdma_transport.h         |   5 -
 14 files changed, 7 insertions(+), 4614 deletions(-)
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c

diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
index e1a3d59..9d219d8 100644
--- a/Documentation/networking/rds.txt
+++ b/Documentation/networking/rds.txt
@@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like TCP.

 RDS is not Infiniband-specific; it was designed to support different
 transports.  The current implementation used to support RDS over TCP as well
-as IB. Work is in progress to support RDS over iWARP, and using DCE to
-guarantee no dropped packets on Ethernet, it may be possible to use RDS over
-UDP in the future.
+as IB.

 The high-level semantics of RDS from the application's point of view are

diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index f2c670b..bffde4b 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -4,14 +4,13 @@ config RDS
 	depends on INET
 	---help---
 	  The RDS (Reliable Datagram Sockets) protocol provides reliable,
-	  sequenced delivery of datagrams over Infiniband, iWARP,
-	  or TCP.
+	  sequenced delivery of datagrams over Infiniband or TCP.

 config RDS_RDMA
-	tristate "RDS over Infiniband and iWARP"
+	tristate "RDS over Infiniband"
 	depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
 	---help---
-	  Allow RDS to use Infiniband and iWARP as a transport.
+	  Allow RDS to use Infiniband as a transport.
 	  This transport supports RDMA operations.

 config RDS_TCP

diff --git a/net/rds/Makefile b/net/rds/Makefile
index 56d3f60..19e5485 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,9 +6,7 @@ rds-y :=	af_rds.o bind.o cong.o connection.o info.o message.o \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=	rdma_transport.o \
 	ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-	ib_sysctl.o ib_rdma.o \
-	iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \
-	iw_sysctl.o iw_rdma.o
+	ib_sysctl.o ib_rdma.o

 obj-$(CONFIG_RDS_TCP) += rds_tcp.o

diff --git a/net/rds/iw.c b/net/rds/iw.c
deleted file mode 100644
index f4a9fff..0000000
diff --git a/net/rds/iw.h b/net/rds/iw.h
deleted file mode 100644
index 5af01d1..0000000
diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c
deleted file mode 100644
index aea4c91..0000000
diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c
deleted file mode 100644
index b09a40c..0000000
diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c
deleted file mode 100644
index a66d179..0000000
diff --git a/net/rds/iw_ring.c b/net/rds/iw_ring.c
deleted file mode 100644
index da8e3b6..0000000
diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c
deleted file mode 100644
index e20bd50..0000000
diff --git a/net/rds/iw_stats.c b/net/rds/iw_stats.c
deleted file mode 100644
index 5fe67f6..0000000
diff --git a/net/rds/iw_sysctl.c b/net/rds/iw_sysctl.c
deleted file mode 100644
index 139239d..0000000

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 9c1fed8..4f4b3d8 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -49,9 +49,7 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
 	rdsdebug("conn %p id %p handling event %u (%s)\n", conn, cm_id,
 		 event->event, rdma_event_msg(event->event));

-	if (cm_id->device->node_type == RDMA_NODE_RNIC)
-		trans = &rds_iw_transport;
-	else
+	if (cm_id->device->node_type == RDMA_NODE_IB
[net-next][PATCH v2 02/13] RDS: Add support for SO_TIMESTAMP for incoming messages
The SO_TIMESTAMP socket option generates a time stamp for each incoming
RDS message. A user app can enable it by using the SO_TIMESTAMP
setsockopt() at the SOL_SOCKET level. The CMSG data of cmsg type
SO_TIMESTAMP contains the time stamp in struct timeval format.

Reviewed-by: Sowmini Varadhan
Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/af_rds.c | 26 ++++++++++++++++++++++++++
 net/rds/rds.h    |  1 +
 net/rds/recv.c   | 20 ++++++++++++++++++--
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b5476aeb..6beaeb1 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -277,6 +277,27 @@ static int rds_set_transport(struct rds_sock *rs, char __user *optval,
 	return rs->rs_transport ? 0 : -ENOPROTOOPT;
 }

+static int rds_enable_recvtstamp(struct sock *sk, char __user *optval,
+				 int optlen)
+{
+	int val, valbool;
+
+	if (optlen != sizeof(int))
+		return -EFAULT;
+
+	if (get_user(val, (int __user *)optval))
+		return -EFAULT;
+
+	valbool = val ? 1 : 0;
+
+	if (valbool)
+		sock_set_flag(sk, SOCK_RCVTSTAMP);
+	else
+		sock_reset_flag(sk, SOCK_RCVTSTAMP);
+
+	return 0;
+}
+
 static int rds_setsockopt(struct socket *sock, int level, int optname,
 			  char __user *optval, unsigned int optlen)
 {
@@ -312,6 +333,11 @@ static int rds_setsockopt(struct socket *sock, int level, int optname,
 		ret = rds_set_transport(rs, optval, optlen);
 		release_sock(sock->sk);
 		break;
+	case SO_TIMESTAMP:
+		lock_sock(sock->sk);
+		ret = rds_enable_recvtstamp(sock->sk, optval, optlen);
+		release_sock(sock->sk);
+		break;
 	default:
 		ret = -ENOPROTOOPT;
 	}

diff --git a/net/rds/rds.h b/net/rds/rds.h
index 0e2797b..80256b0 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -222,6 +222,7 @@ struct rds_incoming {
 	__be32			i_saddr;

 	rds_rdma_cookie_t	i_rdma_cookie;
+	struct timeval		i_rx_tstamp;
 };

 struct rds_mr {

diff --git a/net/rds/recv.c b/net/rds/recv.c
index a00462b..c0be1ec 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -35,6 +35,8 @@
 #include
 #include
 #include
+#include
+#include

 #include "rds.h"

@@ -46,6 +48,8 @@ void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
 	inc->i_conn = conn;
 	inc->i_saddr = saddr;
 	inc->i_rdma_cookie = 0;
+	inc->i_rx_tstamp.tv_sec = 0;
+	inc->i_rx_tstamp.tv_usec = 0;
 }
 EXPORT_SYMBOL_GPL(rds_inc_init);

@@ -228,6 +232,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
 		rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
 				      be32_to_cpu(inc->i_hdr.h_len),
 				      inc->i_hdr.h_dport);
+		if (sock_flag(sk, SOCK_RCVTSTAMP))
+			do_gettimeofday(&inc->i_rx_tstamp);
 		rds_inc_addref(inc);
 		list_add_tail(&inc->i_item, &rs->rs_recv_queue);
 		__rds_wake_sk_sleep(sk);

@@ -381,7 +387,8 @@ static int rds_notify_cong(struct rds_sock *rs, struct msghdr *msghdr)
 /*
  * Receive any control messages.
  */
-static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg)
+static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg,
+			 struct rds_sock *rs)
 {
 	int ret = 0;

@@ -392,6 +399,15 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg)
 		return ret;
 	}

+	if ((inc->i_rx_tstamp.tv_sec != 0) &&
+	    sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) {
+		ret = put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+			       sizeof(struct timeval),
+			       &inc->i_rx_tstamp);
+		if (ret)
+			return ret;
+	}
+
 	return 0;
 }

@@ -474,7 +490,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 			msg->msg_flags |= MSG_TRUNC;
 		}

-		if (rds_cmsg_recv(inc, msg)) {
+		if (rds_cmsg_recv(inc, msg, rs)) {
 			ret = -EFAULT;
 			goto out;
 		}
--
1.9.1
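A sketch of how a user application consumes this from the receive path
(standard SO_TIMESTAMP/SCM_TIMESTAMP usage, not RDS-specific code; the
function name is hypothetical):

	#include <sys/socket.h>
	#include <sys/time.h>
	#include <string.h>

	/* Enable receive time stamps, then pull them out of the
	 * control-message data returned by recvmsg(). */
	static void recv_with_tstamp(int fd)
	{
		int on = 1;
		char buf[1024], ctrl[CMSG_SPACE(sizeof(struct timeval))];
		struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
		struct msghdr msg = {
			.msg_iov = &iov, .msg_iovlen = 1,
			.msg_control = ctrl, .msg_controllen = sizeof(ctrl),
		};
		struct cmsghdr *cmsg;
		struct timeval tv;

		setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof(on));

		if (recvmsg(fd, &msg, 0) < 0)
			return;

		for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
		     cmsg = CMSG_NXTHDR(&msg, cmsg)) {
			if (cmsg->cmsg_level == SOL_SOCKET &&
			    cmsg->cmsg_type == SCM_TIMESTAMP) {
				memcpy(&tv, CMSG_DATA(cmsg), sizeof(tv));
				/* tv now holds the kernel receive time */
			}
		}
	}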
[net-next][PATCH v2 06/13] RDS: IB: create struct rds_ib_fmr
Keep the FMR-related fields in their own struct. A fastreg MR (FRMR)
structure will be added to the union.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib_fmr.c  | 17 ++---
 net/rds/ib_mr.h   | 11 +--
 net/rds/ib_rdma.c | 14 ++
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index d4f200d..74f2c21 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -36,6 +36,7 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
 {
 	struct rds_ib_mr_pool *pool;
 	struct rds_ib_mr *ibmr = NULL;
+	struct rds_ib_fmr *fmr;
 	int err = 0, iter = 0;

 	if (npages <= RDS_MR_8K_MSG_SIZE)
@@ -99,15 +100,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
 		goto out_no_cigar;
 	}

-	ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
+	fmr = &ibmr->u.fmr;
+	fmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
 			(IB_ACCESS_LOCAL_WRITE |
 			 IB_ACCESS_REMOTE_READ |
 			 IB_ACCESS_REMOTE_WRITE |
 			 IB_ACCESS_REMOTE_ATOMIC),
 			&pool->fmr_attr);
-	if (IS_ERR(ibmr->fmr)) {
-		err = PTR_ERR(ibmr->fmr);
-		ibmr->fmr = NULL;
+	if (IS_ERR(fmr->fmr)) {
+		err = PTR_ERR(fmr->fmr);
+		fmr->fmr = NULL;
 		pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, err);
 		goto out_no_cigar;
 	}
@@ -122,8 +124,8 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)

 out_no_cigar:
 	if (ibmr) {
-		if (ibmr->fmr)
-			ib_dealloc_fmr(ibmr->fmr);
+		if (fmr->fmr)
+			ib_dealloc_fmr(fmr->fmr);
 		kfree(ibmr);
 	}
 	atomic_dec(&pool->item_count);
@@ -134,6 +136,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr,
 		   struct scatterlist *sg, unsigned int nents)
 {
 	struct ib_device *dev = rds_ibdev->dev;
+	struct rds_ib_fmr *fmr = &ibmr->u.fmr;
 	struct scatterlist *scat = sg;
 	u64 io_addr = 0;
 	u64 *dma_pages;
@@ -190,7 +193,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct rds_ib_mr *ibmr,
 			(dma_addr & PAGE_MASK) + j;
 	}

-	ret = ib_map_phys_fmr(ibmr->fmr, dma_pages, page_cnt, io_addr);
+	ret = ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr);
 	if (ret)
 		goto out;

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index d88724f..309ad59 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -43,11 +43,15 @@
 #define RDS_MR_8K_SCALE	(256 / (RDS_MR_8K_MSG_SIZE + 1))
 #define RDS_MR_8K_POOL_SIZE	(RDS_MR_8K_SCALE * (8192 / 2))

+struct rds_ib_fmr {
+	struct ib_fmr	*fmr;
+	u64		*dma;
+};
+
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
 	struct rds_ib_device	*device;
 	struct rds_ib_mr_pool	*pool;
-	struct ib_fmr		*fmr;

 	struct llist_node	llnode;

@@ -57,8 +61,11 @@ struct rds_ib_mr {

 	struct scatterlist	*sg;
 	unsigned int		sg_len;
-	u64			*dma;
 	int			sg_dma_len;
+
+	union {
+		struct rds_ib_fmr	fmr;
+	} u;
 };

 /* Our own little MR pool */

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index c594519..9e608d9 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -334,6 +334,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all, struct rds_ib_mr **ibmr_ret)
 {
 	struct rds_ib_mr *ibmr, *next;
+	struct rds_ib_fmr *fmr;
 	struct llist_node *clean_nodes;
 	struct llist_node *clean_tail;
 	LIST_HEAD(unmap_list);
@@ -395,8 +396,10 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 		goto out;

 	/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
-	list_for_each_entry(ibmr, &unmap_list, unmap_list)
-		list_add(&ibmr->fmr->list, &fmr_list);
+	list_for_each_entry(ibmr, &unmap_list, unmap_list) {
+		fmr = &ibmr->u.fmr;
+		list_add(&fmr->fmr->list, &fmr_list);
+	}

 	ret = ib_unmap_fmr(&fmr_list);
 	if (ret)
@@ -405,6 +408,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 	/* Now we can destroy the DMA mapping and unpin any pages */
 	list_for_each_entry_safe(ibmr, next, &unmap_list, unmap_list) {
 		unpinned += ibmr->sg_len;
+		fmr = &ibmr->u.fmr;
 		__rds_ib_teardown_mr(ibmr);
 		if (nfreed < free_goal || ibmr->remap_count >= pool
[net-next][PATCH v2 03/13] MAINTAINERS: update RDS entry
Acked-by: Chien Yen
Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 MAINTAINERS | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 27393cf..08b084a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9067,10 +9067,14 @@ S:	Maintained
 F:	drivers/net/ethernet/rdc/r6040.c

 RDS - RELIABLE DATAGRAM SOCKETS
-M:	Chien Yen
+M:	Santosh Shilimkar
+L:	netdev@vger.kernel.org
+L:	linux-r...@vger.kernel.org
 L:	rds-de...@oss.oracle.com (moderated for non-subscribers)
+W:	https://oss.oracle.com/projects/rds/
 S:	Supported
 F:	net/rds/
+F:	Documentation/networking/rds.txt

 READ-COPY UPDATE (RCU)
 M:	"Paul E. McKenney"
--
1.9.1
[net-next][PATCH v2 05/13] RDS: IB: Re-organise ibmr code
No functional changes. This is in preparation for adding fastreg memory
registration (FRMR) support.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/Makefile  |   2 +-
 net/rds/ib.c      |  37 +++---
 net/rds/ib.h      |  25 +---
 net/rds/ib_fmr.c  | 217 +++
 net/rds/ib_mr.h   | 109 ++++
 net/rds/ib_rdma.c | 379 +++---
 6 files changed, 422 insertions(+), 347 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_mr.h

diff --git a/net/rds/Makefile b/net/rds/Makefile
index 19e5485..bcf5591 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=	af_rds.o bind.o cong.o connection.o info.o message.o \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=	rdma_transport.o \
 	ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-	ib_sysctl.o ib_rdma.o
+	ib_sysctl.o ib_rdma.o ib_fmr.o

 obj-$(CONFIG_RDS_TCP) += rds_tcp.o

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 9481d55..bb32cb9 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -42,15 +42,16 @@

 #include "rds.h"
 #include "ib.h"
+#include "ib_mr.h"

-unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE;
-unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE;
+unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
+unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;

-module_param(rds_ib_fmr_1m_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA");
-module_param(rds_ib_fmr_8k_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA");
+module_param(rds_ib_mr_1m_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
+module_param(rds_ib_mr_8k_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting an error");

@@ -140,13 +141,13 @@ static void rds_ib_add_one(struct ib_device *device)
 	rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);

 	rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
-	rds_ibdev->max_1m_fmrs = device->attrs.max_mr ?
+	rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
 		min_t(unsigned int, (device->attrs.max_mr / 2),
-		      rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size;
+		      rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;

-	rds_ibdev->max_8k_fmrs = device->attrs.max_mr ?
+	rds_ibdev->max_8k_mrs = device->attrs.max_mr ?
 		min_t(unsigned int, ((device->attrs.max_mr / 2) * RDS_MR_8K_SCALE),
-		      rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size;
+		      rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size;

 	rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
 	rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
@@ -172,10 +173,10 @@ static void rds_ib_add_one(struct ib_device *device)
 		goto put_dev;
 	}

-	rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n",
+	rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, fmr_max_remaps = %d, max_1m_mrs = %d, max_8k_mrs = %d\n",
 		 device->attrs.max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
-		 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs,
-		 rds_ibdev->max_8k_fmrs);
+		 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
+		 rds_ibdev->max_8k_mrs);

 	INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
 	INIT_LIST_HEAD(&rds_ibdev->conn_list);
@@ -364,7 +365,7 @@ void rds_ib_exit(void)
 	rds_ib_sysctl_exit();
 	rds_ib_recv_exit();
 	rds_trans_unregister(&rds_ib_transport);
-	rds_ib_fmr_exit();
+	rds_ib_mr_exit();
 }

 struct rds_transport rds_ib_transport = {
@@ -400,13 +401,13 @@ int rds_ib_init(void)

 	INIT_LIST_HEAD(&rds_ib_devices);

-	ret = rds_ib_fmr_init();
+	ret = rds_ib_mr_init();
 	if (ret)
 		goto out;

 	ret = ib_register_client(&rds_ib_client);
 	if (ret)
-		goto out_fmr_exit;
+		goto out_mr_exit;

 	ret = rds_ib_sysctl_init();
 	if (ret)
@@ -430,8 +431,8 @@ out_sysctl:
 	rds_ib_sysctl_exit();
 out_ibreg:
 	rds_ib_unregister_client();
-out_fmr_exit:
-	rds_ib_fmr_exit();
+out_mr_exit:
+	rds_ib_mr_exit();
 out:
 	return ret;
 }

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 09cd8e3..c88cb22 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@
[net-next][PATCH v2 04/13] RDS: IB: Remove the RDS_IB_SEND_OP dependency
This helps to combine the asynchronous fastreg MR completion handler
with the send completion handler. No functional change.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib.h      |  1 -
 net/rds/ib_cm.c   | 42 +++---
 net/rds/ib_send.c |  6 ++----
 3 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index b3fdebb..09cd8e3 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -28,7 +28,6 @@
 #define RDS_IB_RECYCLE_BATCH_COUNT	32

 #define RDS_IB_WC_MAX	32
-#define RDS_IB_SEND_OP	BIT_ULL(63)

 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index da5a7fb..7f68abc 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -236,12 +236,10 @@ static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, void *context)
 	tasklet_schedule(&ic->i_recv_tasklet);
 }

-static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq,
-		    struct ib_wc *wcs,
-		    struct rds_ib_ack_state *ack_state)
+static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq,
+		     struct ib_wc *wcs)
 {
-	int nr;
-	int i;
+	int nr, i;
 	struct ib_wc *wc;

 	while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
@@ -251,10 +249,7 @@ static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq,
 				 (unsigned long long)wc->wr_id, wc->status,
 				 wc->byte_len, be32_to_cpu(wc->ex.imm_data));

-			if (wc->wr_id & RDS_IB_SEND_OP)
-				rds_ib_send_cqe_handler(ic, wc);
-			else
-				rds_ib_recv_cqe_handler(ic, wc, ack_state);
+			rds_ib_send_cqe_handler(ic, wc);
 		}
 	}
 }
@@ -263,14 +258,12 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
 {
 	struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
 	struct rds_connection *conn = ic->conn;
-	struct rds_ib_ack_state state;

 	rds_ib_stats_inc(s_ib_tasklet_call);

-	memset(&state, 0, sizeof(state));
-	poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+	poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
 	ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
-	poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+	poll_scq(ic, ic->i_send_cq, ic->i_send_wc);

 	if (rds_conn_up(conn) &&
 	    (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ||
@@ -278,6 +271,25 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
 		rds_send_xmit(ic->conn);
 }

+static void poll_rcq(struct rds_ib_connection *ic, struct ib_cq *cq,
+		     struct ib_wc *wcs,
+		     struct rds_ib_ack_state *ack_state)
+{
+	int nr, i;
+	struct ib_wc *wc;
+
+	while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
+		for (i = 0; i < nr; i++) {
+			wc = wcs + i;
+			rdsdebug("wc wr_id 0x%llx status %u byte_len %u imm_data %u\n",
+				 (unsigned long long)wc->wr_id, wc->status,
+				 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
+
+			rds_ib_recv_cqe_handler(ic, wc, ack_state);
+		}
+	}
+}
+
 static void rds_ib_tasklet_fn_recv(unsigned long data)
 {
 	struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
@@ -291,9 +303,9 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
 	rds_ib_stats_inc(s_ib_tasklet_call);

 	memset(&state, 0, sizeof(state));
-	poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+	poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
 	ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
-	poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+	poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);

 	if (state.ack_next_valid)
 		rds_ib_set_ack(ic, state.ack_next, state.ack_required);

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index eac30bf..f27d2c8 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -195,7 +195,7 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic)
 		send->s_op = NULL;

-		send->s_wr.wr_id = i | RDS_IB_SEND_OP;
+		send->s_wr.wr_id = i;
 		send->s_wr.sg_list = send->s_sge;
 		send->s_wr.ex.imm_data = 0;

@@ -263,9 +263,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)

 	oldest = rds_ib_ring_oldest(&ic->i_send_ring);

-	completed = rds_ib_ring_completed(&ic->i_send_ring,
-					  (wc->wr_id & ~RDS_IB_SEND_OP),
-					  oldest);
+	compl
[net-next][PATCH v2 09/13] RDS: IB: handle the RDMA CM time wait event
Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that it can
reconnect and resume.

While testing fastreg, this error happened in a couple of tests but was
going unnoticed.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/rdma_transport.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 4f4b3d8..7220beb 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
 		rds_conn_drop(conn);
 		break;

+	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+		if (conn) {
+			pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: dropping connection %pI4->%pI4\n",
+				&conn->c_laddr, &conn->c_faddr);
+			rds_conn_drop(conn);
+		}
+		break;
+
 	default:
 		/* things like device disconnect? */
 		printk(KERN_ERR "RDS: unknown event %u (%s)!\n",
--
1.9.1
[net-next][PATCH v2 08/13] RDS: IB: add connection info to ibmr
Preparatory patch for FRMR support. From the connection info, we can
retrieve the cm_id, which contains the qp handle needed for work
request posting. We also need to drop the RDS connection on QP error
states, where the connection handle becomes useful.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib_mr.h | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index f5c1fcb..add7725 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -50,18 +50,19 @@ struct rds_ib_fmr {
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
-	struct rds_ib_device	*device;
-	struct rds_ib_mr_pool	*pool;
+	struct rds_ib_device		*device;
+	struct rds_ib_mr_pool		*pool;
+	struct rds_ib_connection	*ic;

-	struct llist_node	llnode;
+	struct llist_node		llnode;

 	/* unmap_list is for freeing */
-	struct list_head	unmap_list;
-	unsigned int		remap_count;
+	struct list_head		unmap_list;
+	unsigned int			remap_count;

-	struct scatterlist	*sg;
-	unsigned int		sg_len;
-	int			sg_dma_len;
+	struct scatterlist		*sg;
+	unsigned int			sg_len;
+	int				sg_dma_len;

 	union {
 		struct rds_ib_fmr	fmr;
--
1.9.1
[net-next][PATCH v2 07/13] RDS: IB: move FMR code to its own file
No functional change.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib_fmr.c  | 126 +-
 net/rds/ib_mr.h   |   6 +++
 net/rds/ib_rdma.c | 108 ++
 3 files changed, 134 insertions(+), 106 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 74f2c21..4fe8f4f 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device *rds_ibdev, int npages)
 	struct rds_ib_mr_pool *pool;
 	struct rds_ib_mr *ibmr = NULL;
 	struct rds_ib_fmr *fmr;
-	int err = 0, iter = 0;
+	int err = 0;

 	if (npages <= RDS_MR_8K_MSG_SIZE)
 		pool = rds_ibdev->mr_8k_pool;
 	else
 		pool = rds_ibdev->mr_1m_pool;

-	if (atomic_read(&pool->dirty_count) >= pool->max_items / 10)
-		queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
-
-	/* Switch pools if one of the pool is reaching upper limit */
-	if (atomic_read(&pool->dirty_count) >= pool->max_items * 9 / 10) {
-		if (pool->pool_type == RDS_IB_MR_8K_POOL)
-			pool = rds_ibdev->mr_1m_pool;
-		else
-			pool = rds_ibdev->mr_8k_pool;
-	}
-
-	while (1) {
-		ibmr = rds_ib_reuse_mr(pool);
-		if (ibmr)
-			return ibmr;
-
-		/* No clean MRs - now we have the choice of either
-		 * allocating a fresh MR up to the limit imposed by the
-		 * driver, or flush any dirty unused MRs.
-		 * We try to avoid stalling in the send path if possible,
-		 * so we allocate as long as we're allowed to.
-		 *
-		 * We're fussy with enforcing the FMR limit, though. If the
-		 * driver tells us we can't use more than N fmrs, we shouldn't
-		 * start arguing with it
-		 */
-		if (atomic_inc_return(&pool->item_count) <= pool->max_items)
-			break;
-
-		atomic_dec(&pool->item_count);
-
-		if (++iter > 2) {
-			if (pool->pool_type == RDS_IB_MR_8K_POOL)
-				rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted);
-			else
-				rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted);
-			return ERR_PTR(-EAGAIN);
-		}
-
-		/* We do have some empty MRs. Flush them out. */
-		if (pool->pool_type == RDS_IB_MR_8K_POOL)
-			rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait);
-		else
-			rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait);
-		rds_ib_flush_mr_pool(pool, 0, &ibmr);
-		if (ibmr)
-			return ibmr;
-	}
+	ibmr = rds_ib_try_reuse_ibmr(pool);
+	if (ibmr)
+		return ibmr;

 	ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
 			    rdsibdev_to_node(rds_ibdev));
@@ -218,3 +173,76 @@ out:

 	return ret;
 }
+
+struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev,
+				 struct scatterlist *sg,
+				 unsigned long nents,
+				 u32 *key)
+{
+	struct rds_ib_mr *ibmr = NULL;
+	struct rds_ib_fmr *fmr;
+	int ret;
+
+	ibmr = rds_ib_alloc_fmr(rds_ibdev, nents);
+	if (IS_ERR(ibmr))
+		return ibmr;
+
+	ibmr->device = rds_ibdev;
+	fmr = &ibmr->u.fmr;
+	ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents);
+	if (ret == 0)
+		*key = fmr->fmr->rkey;
+	else
+		rds_ib_free_mr(ibmr, 0);
+
+	return ibmr;
+}
+
+void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed,
+		      unsigned long *unpinned, unsigned int goal)
+{
+	struct rds_ib_mr *ibmr, *next;
+	struct rds_ib_fmr *fmr;
+	LIST_HEAD(fmr_list);
+	int ret = 0;
+	unsigned int freed = *nfreed;
+
+	/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
+	list_for_each_entry(ibmr, list, unmap_list) {
+		fmr = &ibmr->u.fmr;
+		list_add(&fmr->fmr->list, &fmr_list);
+	}
+
+	ret = ib_unmap_fmr(&fmr_list);
+	if (ret)
+		pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret);
+
+	/* Now we can destroy the DMA mapping and unpin any pages */
+	list_for_each_entry_safe(ibmr, next, list, unmap_list) {
+		fmr = &ibmr->u.fmr;
+		*unpinned += ibmr->sg_len;
+		__rds_ib_teardown_mr(ibmr);
+		if (freed < goal ||
+		    ibmr->remap_count >= ibmr->pool->fmr_attr.max_maps) {
+			if (ibmr->
[net-next][PATCH v2 10/13] RDS: IB: add mr reused stats
Add MR reuse statistics to RDS IB transport.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib.h       | 2 ++
 net/rds/ib_rdma.c  | 7 ++++++-
 net/rds/ib_stats.c | 2 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c88cb22..62fe7d5 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -259,6 +259,8 @@ struct rds_ib_statistics {
 	uint64_t	s_ib_rdma_mr_1m_pool_flush;
 	uint64_t	s_ib_rdma_mr_1m_pool_wait;
 	uint64_t	s_ib_rdma_mr_1m_pool_depleted;
+	uint64_t	s_ib_rdma_mr_8k_reused;
+	uint64_t	s_ib_rdma_mr_1m_reused;
 	uint64_t	s_ib_atomic_cswp;
 	uint64_t	s_ib_atomic_fadd;
 };

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 0e84843..ec7ea32 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -188,8 +188,13 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool *pool)
 	flag = this_cpu_ptr(&clean_list_grace);
 	set_bit(CLEAN_LIST_BUSY_BIT, flag);
 	ret = llist_del_first(&pool->clean_list);
-	if (ret)
+	if (ret) {
 		ibmr = llist_entry(ret, struct rds_ib_mr, llnode);
+		if (pool->pool_type == RDS_IB_MR_8K_POOL)
+			rds_ib_stats_inc(s_ib_rdma_mr_8k_reused);
+		else
+			rds_ib_stats_inc(s_ib_rdma_mr_1m_reused);
+	}

 	clear_bit(CLEAN_LIST_BUSY_BIT, flag);
 	preempt_enable();

diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index d77e044..7e78dca 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -73,6 +73,8 @@ static const char *const rds_ib_stat_names[] = {
 	"ib_rdma_mr_1m_pool_flush",
 	"ib_rdma_mr_1m_pool_wait",
 	"ib_rdma_mr_1m_pool_depleted",
+	"ib_rdma_mr_8k_reused",
+	"ib_rdma_mr_1m_reused",
 	"ib_atomic_cswp",
 	"ib_atomic_fadd",
 };
--
1.9.1
[net-next][PATCH v2 13/13] RDS: IB: Support Fastreg MR (FRMR) memory registration mode
From: Avinash Repaka

Fastreg MR (FRMR) is another method with which one can register memory
with an HCA. Some of the newer HCAs support only the fastreg MR mode,
so we need to add support for it to have RDS functional on them.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Avinash Repaka
Signed-off-by: Santosh Shilimkar
---
 net/rds/Makefile  |   2 +-
 net/rds/ib.h      |   1 +
 net/rds/ib_cm.c   |   7 +-
 net/rds/ib_frmr.c | 376 ++
 net/rds/ib_mr.h   |  24 ++++
 net/rds/ib_rdma.c |  17 ++-
 6 files changed, 422 insertions(+), 5 deletions(-)
 create mode 100644 net/rds/ib_frmr.c

diff --git a/net/rds/Makefile b/net/rds/Makefile
index bcf5591..0e72bec 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=	af_rds.o bind.o cong.o connection.o info.o message.o \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=	rdma_transport.o \
 	ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-	ib_sysctl.o ib_rdma.o ib_fmr.o
+	ib_sysctl.o ib_rdma.o ib_fmr.o ib_frmr.o

 obj-$(CONFIG_RDS_TCP) += rds_tcp.o

diff --git a/net/rds/ib.h b/net/rds/ib.h
index eeb0d6c..627fb79 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -349,6 +349,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr);
 void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn);
 void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn);
 void rds_ib_destroy_nodev_conns(void);
+void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc);

 /* ib_recv.c */
 int rds_ib_recv_init(void);

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 83f4673..8764970 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -249,7 +249,12 @@ static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq,
 				 (unsigned long long)wc->wr_id, wc->status,
 				 wc->byte_len, be32_to_cpu(wc->ex.imm_data));

-			rds_ib_send_cqe_handler(ic, wc);
+			if (wc->wr_id <= ic->i_send_ring.w_nr ||
+			    wc->wr_id == RDS_IB_ACK_WR_ID)
+				rds_ib_send_cqe_handler(ic, wc);
+			else
+				rds_ib_mr_cqe_handler(ic, wc);
+
 		}
 	}
 }

diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
new file mode 100644
index 0000000..93ff038
--- /dev/null
+++ b/net/rds/ib_frmr.c
@@ -0,0 +1,376 @@
+/*
+ * Copyright (c) 2016 Oracle. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer in the documentation and/or other materials
+ *   provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "ib_mr.h"
+
+static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
+					   int npages)
+{
+	struct rds_ib_mr_pool *pool;
+	struct rds_ib_mr *ibmr = NULL;
+	struct rds_ib_frmr *frmr;
+	int err = 0;
+
+	if (npages <= RDS_MR_8K_MSG_SIZE)
+		pool = rds_ibdev->mr_8k_pool;
+	else
+		pool = rds_ibdev->mr_1m_pool;
+
+	ibmr = rds_ib_try_reuse_ibmr(pool);
+	if (ibmr)
+		return ibmr;
+
+	ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
+			    rdsibdev_to_node(rds_ibdev));
+	if (!ibmr) {
+		err = -ENOMEM;
+		goto out_no_cigar;
+	}
+
+	frmr = &ibmr->u.frmr;
+	frmr->mr = ib_alloc_mr(rds_ibdev->pd, IB_MR_TYPE_MEM_REG,
+			       pool->fmr_attr.max_pages);
+	if (IS_ERR(frmr->mr)) {
+		pr_warn(
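Since the full ib_frmr.c body is truncated above, here is a heavily
simplified sketch of the core FRMR registration step against the
4.5-era verbs API (ib_map_mr_sg() plus an IB_WR_REG_MR work request).
This is an illustration under those assumptions, not the series' exact
code; error handling, the invalidation path, and locking are omitted,
and the ib_map_mr_sg() signature changed in later kernels:

	static int frmr_register_sketch(struct ib_qp *qp, struct ib_mr *mr,
					struct scatterlist *sg, int sg_nents)
	{
		struct ib_send_wr *failed_wr;
		struct ib_reg_wr reg_wr = { };
		int n;

		/* Map the scatterlist into the MR's page list. */
		n = ib_map_mr_sg(mr, sg, sg_nents, PAGE_SIZE);
		if (n < sg_nents)
			return n < 0 ? n : -EINVAL;

		/* Post a fast-registration work request on the send
		 * queue; its completion is reported through the send
		 * CQ, which is why this series reserves extra send and
		 * completion queue space. */
		reg_wr.wr.opcode = IB_WR_REG_MR;
		reg_wr.wr.send_flags = IB_SEND_SIGNALED;
		reg_wr.mr = mr;
		reg_wr.key = mr->rkey;
		reg_wr.access = IB_ACCESS_LOCAL_WRITE |
				IB_ACCESS_REMOTE_READ |
				IB_ACCESS_REMOTE_WRITE;

		return ib_post_send(qp, &reg_wr.wr, &failed_wr);
	}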
[net-next][PATCH v2 11/13] RDS: IB: add Fastreg MR (FRMR) detection support
Discover Fast Memory Registration support using the IB device
capability flag IB_DEVICE_MEM_MGT_EXTENSIONS. A given HCA might support
just FRMR, just FMR, or both FMR and FRMR.

In case both MR types are supported, FMR is used by default; this is
against what everyone else is following, but the default will be
changed to FRMR once RDS performance with FRMR is comparable with FMR.
The work for that is in progress.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
v2: Dropped the module parameter as suggested by David Miller

 net/rds/ib.c    | 10 ++++++++++
 net/rds/ib.h    |  4 ++++
 net/rds/ib_mr.h |  1 +
 3 files changed, 15 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index bb32cb9..b5342fd 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -140,6 +140,12 @@ static void rds_ib_add_one(struct ib_device *device)
 	rds_ibdev->max_wrs = device->attrs.max_qp_wr;
 	rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);

+	rds_ibdev->has_fr = (device->attrs.device_cap_flags &
+			     IB_DEVICE_MEM_MGT_EXTENSIONS);
+	rds_ibdev->has_fmr = (device->alloc_fmr && device->dealloc_fmr &&
+			      device->map_phys_fmr && device->unmap_fmr);
+	rds_ibdev->use_fastreg = (rds_ibdev->has_fr && !rds_ibdev->has_fmr);
+
 	rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
 	rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
 		min_t(unsigned int, (device->attrs.max_mr / 2),
@@ -178,6 +184,10 @@ static void rds_ib_add_one(struct ib_device *device)
 		 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
 		 rds_ibdev->max_8k_mrs);

+	pr_info("RDS/IB: %s: %s supported and preferred\n",
+		device->name,
+		rds_ibdev->use_fastreg ? "FRMR" : "FMR");
+
 	INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
 	INIT_LIST_HEAD(&rds_ibdev->conn_list);

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 62fe7d5..c5eddc2 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -200,6 +200,10 @@ struct rds_ib_device {
 	struct list_head	conn_list;
 	struct ib_device	*dev;
 	struct ib_pd		*pd;
+	bool			has_fmr;
+	bool			has_fr;
+	bool			use_fastreg;
+
 	unsigned int		max_mrs;
 	struct rds_ib_mr_pool	*mr_1m_pool;
 	struct rds_ib_mr_pool	*mr_8k_pool;

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index add7725..2f9b9c3 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -93,6 +93,7 @@ struct rds_ib_mr_pool {
 extern struct workqueue_struct *rds_ib_mr_wq;
 extern unsigned int rds_ib_mr_1m_pool_size;
 extern unsigned int rds_ib_mr_8k_pool_size;
+extern bool prefer_frmr;

 struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev,
 					     int npages);
--
1.9.1
[net-next][PATCH v2 12/13] RDS: IB: allocate extra space on queues for FRMR support
Fastreg MR (FRMR) memory registration and invalidation make use of work
request and completion queues for their operation. This patch allocates
extra queue space for those operations.

Signed-off-by: Santosh Shilimkar
Signed-off-by: Santosh Shilimkar
---
 net/rds/ib.h    |  4 ++++
 net/rds/ib_cm.c | 16 ++++++++++++----
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c5eddc2..eeb0d6c 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -14,6 +14,7 @@

 #define RDS_IB_DEFAULT_RECV_WR		1024
 #define RDS_IB_DEFAULT_SEND_WR		256
+#define RDS_IB_DEFAULT_FR_WR		512

 #define RDS_IB_DEFAULT_RETRY_COUNT	2

@@ -122,6 +123,9 @@ struct rds_ib_connection {
 	struct ib_wc		i_send_wc[RDS_IB_WC_MAX];
 	struct ib_wc		i_recv_wc[RDS_IB_WC_MAX];

+	/* To control the number of wrs from fastreg */
+	atomic_t		i_fastreg_wrs;
+
 	/* interrupt handling */
 	struct tasklet_struct	i_send_tasklet;
 	struct tasklet_struct	i_recv_tasklet;

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 7f68abc..83f4673 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -363,7 +363,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	struct ib_qp_init_attr attr;
 	struct ib_cq_init_attr cq_attr = {};
 	struct rds_ib_device *rds_ibdev;
-	int ret;
+	int ret, fr_queue_space;

 	/*
 	 * It's normal to see a null device if an incoming connection races
@@ -373,6 +373,12 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	if (!rds_ibdev)
 		return -EOPNOTSUPP;

+	/* The fr_queue_space is currently set to 512, to add extra space on
+	 * completion queue and send queue. This extra space is used for FRMR
+	 * registration and invalidation work requests
+	 */
+	fr_queue_space = (rds_ibdev->use_fastreg ? RDS_IB_DEFAULT_FR_WR : 0);
+
 	/* add the conn now so that connection establishment has the dev */
 	rds_ib_add_conn(rds_ibdev, conn);

@@ -384,7 +390,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	/* Protection domain and memory range */
 	ic->i_pd = rds_ibdev->pd;

-	cq_attr.cqe = ic->i_send_ring.w_nr + 1;
+	cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1;

 	ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 				     rds_ib_cq_event_handler, conn,
@@ -424,7 +430,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	attr.event_handler = rds_ib_qp_event_handler;
 	attr.qp_context = conn;
 	/* + 1 to allow for the single ack message */
-	attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1;
+	attr.cap.max_send_wr = ic->i_send_ring.w_nr + fr_queue_space + 1;
 	attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1;
 	attr.cap.max_send_sge = rds_ibdev->max_sge;
 	attr.cap.max_recv_sge = RDS_IB_RECV_SGE;
@@ -432,6 +438,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 	attr.qp_type = IB_QPT_RC;
 	attr.send_cq = ic->i_send_cq;
 	attr.recv_cq = ic->i_recv_cq;
+	atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);

 	/*
 	 * XXX this can fail if max_*_wr is too large? Are we supposed
@@ -751,7 +758,8 @@ void rds_ib_conn_shutdown(struct rds_connection *conn)
 		 */
 		wait_event(rds_ib_ring_empty_wait,
 			   rds_ib_ring_empty(&ic->i_recv_ring) &&
-			   (atomic_read(&ic->i_signaled_sends) == 0));
+			   (atomic_read(&ic->i_signaled_sends) == 0) &&
+			   (atomic_read(&ic->i_fastreg_wrs) == RDS_IB_DEFAULT_FR_WR));
 		tasklet_kill(&ic->i_send_tasklet);
 		tasklet_kill(&ic->i_recv_tasklet);
--
1.9.1
Re: Softirq priority inversion from "softirq: reduce latencies"
On Sat, 2016-02-27 at 18:10 -0800, Peter Hurley wrote:
> On 02/27/2016 05:59 PM, Eric Dumazet wrote:
> > On Sat, 2016-02-27 at 15:33 -0800, Peter Hurley wrote:
> >> On 02/27/2016 03:04 PM, David Miller wrote:
> >>> From: Peter Hurley
> >>> Date: Sat, 27 Feb 2016 12:29:39 -0800
> >>>
> >>>> Not really. softirq raised from interrupt context will always execute
> >>>> on this cpu and not in ksoftirqd, unless load forces softirq loop abort.
> >>>
> >>> That guarantee never was specified.
> >>
> >> ??
> >>
> >> Neither is running network socket servers at normal priority as if they're
> >> higher priority than softirq.
> >>
> >>> Or are you saying that by design, on a system under load, your UART
> >>> will not function properly?
> >>>
> >>> Surely you don't mean that.
> >>
> >> No, that's not what I mean.
> >>
> >> What I mean is that bypassing the entire SOFTIRQ priority so that
> >> sshd can process one network packet makes a mockery of the point of
> >> softirq.
> >>
> >> This hack to workaround NET_RX looping over-and-over-and-over affects every
> >> subsystem, not just one uart.
> >>
> >> HI, TIMER, BLOCK; all of these are skipped: that's straight-up, a bug.
> >
> > No idea what you talk about.
> >
> > All pending softirq interrupts are processed. _Nothing_ is skipped.
>
> An interrupt that schedules HI softirq while in NET_RX softirq should
> still run the HI softirq. But with your patch that won't happen.

Stop saying this. This has never been the case. I am glad my patch
finally shows you are wrong.

> > Really, your system stability seems to depend on a completely
> > undocumented behavior of linux kernels before linux-3.8
> >
> > If I understood, you expect that a tasklet activated from a softirq
> > handler is run from the same __do_softirq() loop. This never has been
> > the case.
>
> No.
>
> The *interrupt handler* for DMA goes off while NET_RX softirq is running.
> That's what schedules the *DMA tasklet*.
>
> That tasklet should run before any process.
>
> But it doesn't because your patch bails out early from softirq.

Fine. Fix your driver.
Re: [net-next][PATCH 00/13] RDS: Major clean-up with couple of new features for 4.6
Hi Dave, On 2/26/16 9:43 PM, Santosh Shilimkar wrote: The series is generated against net-next but also applies cleanly against Linus's tip. The diff-stat looks a bit scary since ~4K lines of code are getting removed. [...] The entire patchset is available in the git tree below: git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_4.6/net-next/rds Just noticed that I accidentally posted the patches from the older (v1) folder instead of the updated v2. Sorry about that. Please discard this entire series. I will post the intended v2 right after this email. Regards, Santosh
Re: Softirq priority inversion from "softirq: reduce latencies"
On 02/27/2016 05:59 PM, Eric Dumazet wrote: > On sam., 2016-02-27 at 15:33 -0800, Peter Hurley wrote: >> On 02/27/2016 03:04 PM, David Miller wrote: >>> From: Peter Hurley >>> Date: Sat, 27 Feb 2016 12:29:39 -0800 >>> Not really. softirq raised from interrupt context will always execute on this cpu and not in ksoftirqd, unless load forces softirq loop abort. >>> >>> That guarantee never was specified. >> >> ?? >> >> Neither is running network socket servers at normal priority as if they're >> higher priority than softirq. >> >> >>> Or are you saying that by design, on a system under load, your UART >>> will not function properly? >>> >>> Surely you don't mean that. >> >> No, that's not what I mean. >> >> What I mean is that bypassing the entire SOFTIRQ priority so that >> sshd can process one network packet makes a mockery of the point of softirq. >> >> This hack to workaround NET_RX looping over-and-over-and-over affects every >> subsystem, not just one uart. >> >> HI, TIMER, BLOCK; all of these are skipped: that's straight-up, a bug. > > No idea what you talk about. > > All pending softirq interrupts are processed. _Nothing_ is skipped. An interrupt that schedules HI softirq while in NET_RX softirq should still run the HI softirq. But with your patch that won't happen. > Really, your system stability seems to depend on a completely > undocumented behavior of linux kernels before linux-3.8 > > If I understood, you expect that a tasklet activated from a softirq > handler is run from the same __do_softirq() loop. This never has been > the case. No. The *interrupt handler* for DMA goes off while NET_RX softirq is running. That's what schedules the *DMA tasklet*. That tasklet should run before any process. But it doesn't because your patch bails out early from softirq. > My change simply triggers the bug in your driver earlier. As David > pointed out, your bug should trigger the same on a loaded machine, even > if you revert my patch. > > I honestly do not know why you arm a tasklet from NET_RX, why don't you > simply process this directly, so that you do not rely on some scheduler > decision ?
[PATCH net] sctp: lack the check for ports in sctp_v6_cmp_addr
As the member .cmp_addr of sctp_af_inet6, sctp_v6_cmp_addr should also check the port of addresses, just like sctp_v4_cmp_addr, cause it's invoked by sctp_cmp_addr_exact(). Now sctp_v6_cmp_addr just check the port when two addresses have different family, and lack the port check for two ipv6 addresses. that will make sctp_hash_cmp() cannot work well. so fix it by adding ports comparison in sctp_v6_cmp_addr(). Signed-off-by: Xin Long --- net/sctp/ipv6.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c index ec52912..ce46f1c 100644 --- a/net/sctp/ipv6.c +++ b/net/sctp/ipv6.c @@ -526,6 +526,8 @@ static int sctp_v6_cmp_addr(const union sctp_addr *addr1, } return 0; } + if (addr1->v6.sin6_port != addr2->v6.sin6_port) + return 0; if (!ipv6_addr_equal(&addr1->v6.sin6_addr, &addr2->v6.sin6_addr)) return 0; /* If this is a linklocal address, compare the scope_id. */ -- 2.1.0
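For context, the caller named above delegates entirely to the per-family callback, which is why sctp_v6_cmp_addr must do the port check itself. Roughly (a simplified sketch of sctp_cmp_addr_exact(), not a verbatim copy):

/* Sketch: the exact-match helper has no port logic of its own;
 * it trusts the address-family cmp_addr callback completely. */
int sctp_cmp_addr_exact(const union sctp_addr *ss1,
			const union sctp_addr *ss2)
{
	struct sctp_af *af;

	af = sctp_get_af_specific(ss1->sa.sa_family);
	if (unlikely(!af))
		return 0;

	return af->cmp_addr(ss1, ss2);
}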
Re: Softirq priority inversion from "softirq: reduce latencies"
On sam., 2016-02-27 at 15:33 -0800, Peter Hurley wrote: > On 02/27/2016 03:04 PM, David Miller wrote: > > From: Peter Hurley > > Date: Sat, 27 Feb 2016 12:29:39 -0800 > > > >> Not really. softirq raised from interrupt context will always execute > >> on this cpu and not in ksoftirqd, unless load forces softirq loop abort. > > > > That guarantee never was specified. > > ?? > > Neither is running network socket servers at normal priority as if they're > higher priority than softirq. > > > > Or are you saying that by design, on a system under load, your UART > > will not function properly? > > > > Surely you don't mean that. > > No, that's not what I mean. > > What I mean is that bypassing the entire SOFTIRQ priority so that > sshd can process one network packet makes a mockery of the point of softirq. > > This hack to workaround NET_RX looping over-and-over-and-over affects every > subsystem, not just one uart. > > HI, TIMER, BLOCK; all of these are skipped: that's straight-up, a bug. No idea what you talk about. All pending softirq interrupts are processed. _Nothing_ is skipped. Really, your system stability seems to depend on a completely undocumented behavior of linux kernels before linux-3.8 If I understood, you expect that a tasklet activated from a softirq handler is run from the same __do_softirq() loop. This never has been the case. My change simply triggers the bug in your driver earlier. As David pointed out, your bug should trigger the same on a loaded machine, even if you revert my patch. I honestly do not know why you arm a tasklet from NET_RX, why don't you simply process this directly, so that you do not rely on some scheduler decision ?
Re: Softirq priority inversion from "softirq: reduce latencies"
On 02/27/2016 03:04 PM, David Miller wrote: > From: Peter Hurley > Date: Sat, 27 Feb 2016 12:29:39 -0800 > >> Not really. softirq raised from interrupt context will always execute >> on this cpu and not in ksoftirqd, unless load forces softirq loop abort. > > That guarantee never was specified. ?? Neither is running network socket servers at normal priority as if they're higher priority than softirq. > Or are you saying that by design, on a system under load, your UART > will not function properly? > > Surely you don't mean that. No, that's not what I mean. What I mean is that bypassing the entire SOFTIRQ priority so that sshd can process one network packet makes a mockery of the point of softirq. This hack to workaround NET_RX looping over-and-over-and-over affects every subsystem, not just one uart. HI, TIMER, BLOCK; all of these are skipped: that's straight-up, a bug. Regards, Peter Hurley
Re: Softirq priority inversion from "softirq: reduce latencies"
From: Peter Hurley Date: Sat, 27 Feb 2016 12:29:39 -0800 > Not really. softirq raised from interrupt context will always execute > on this cpu and not in ksoftirqd, unless load forces softirq loop abort. That guarantee never was specified. Or are you saying that by design, on a system under load, your UART will not function properly? Surely you don't mean that.
Re: Sending short raw packets using sendmsg() broke
On Fri, Feb 26, 2016 at 12:46 PM, David Miller wrote: > From: Willem de Bruijn > Date: Fri, 26 Feb 2016 12:33:13 -0500 > >> Right. The simplest, if hacky, fix is to add something along the lines of >> >> static unsigned short netdev_min_hard_header_len(struct net_device *dev) >> { >> if (unlikely(dev->type == ARPHRD_AX25)) >> return AX25_KISS_HEADER_LEN; >> else >> return dev->hard_header_len; >> } >> >> Depending on how the variable encoding scheme works, a basic min >> length check may still produce buggy headers that confuse the stack or >> driver. I need to read up on AX25. If so, then extending header_ops >> with an optional validate() function is a more generic approach of >> checking header sanity. > > I suspect we will need some kind of header ops for this. To return the device type minimum length or to do full header validation? Looking at drivers/net/hamradio, I don't see any driver output paths interpreting the header fields, in which case the first is sufficient. A minimum U/S frame is AX25_KISS_HEADER_LEN + 2 * AX25_ADDR_LEN + 3 (control + FCS) == AX25_KISS_HEADER_LEN + AX25_HEADER_LEN Heikki, you gave this number + 3. Where does that constant come from? More thorough validation of the header contents is not necessarily hard. The following validates the address, including optional repeaters. static bool ax25_validate_hard_header(const char *ll_header, unsigned short len) { ax25_digi digi; return !ax25_addr_parse(ll_header, len, NULL, NULL, &digi, NULL, NULL); } The major drawback of full validation, from the point of view of fixing the original bug, is that it requires the header to have already been copied to the kernel. The ll_header_truncated check is currently performed before allocation + copy, based solely on len. So this might become a relatively complex patch that is not easy to backport to stable branches. I can send a simple minimal-length validation patch to net to solve the reported bug. Then optionally follow up with a header_ops->validate() extension in net-next, if there is value in that.
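To make the second option concrete, a hypothetical shape for such an extension (the names and signature here are assumptions; no such hook exists in the kernel at this point in the thread):

/* Sketch: let link-layer code veto short or malformed headers. */
struct header_ops {
	/* ... existing ops (create, parse, cache, ...) ... */
	bool	(*validate)(const char *ll_header, unsigned int len);
};

static inline bool dev_validate_header(const struct net_device *dev,
				       const char *ll_header, int len)
{
	if (likely(len >= dev->hard_header_len))
		return true;		/* full header present */
	if (dev->header_ops && dev->header_ops->validate)
		return dev->header_ops->validate(ll_header, len);
	return false;			/* truncated and no validator */
}

packet_snd() and packet_sendmsg_spkt() would then call dev_validate_header() where they currently compare len against dev->hard_header_len.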
Re: [patch] rocker: fix an error code
From: Dan Carpenter Date: Sat, 27 Feb 2016 14:31:43 +0300 > We intended to return PTR_ERR() here instead of 1. > > Fixes: 1f9993f6825f ('rocker: fix a neigh entry leak issue') > Signed-off-by: Dan Carpenter Applied, thanks.
Re: [PATCH net-next 1/5] vxlan: implement GPE in L2 mode
On Sat, Feb 27, 2016 at 12:54 PM, Tom Herbert wrote: > On Sat, Feb 27, 2016 at 11:31 AM, Jiri Benc wrote: >> On Fri, 26 Feb 2016 15:51:29 -0800, Tom Herbert wrote: >>> I don't think this is right. VXLAN-GPE is a separate protocol than >>> VXLAN, they are not compatible on the wire and don't share flags or >>> fields (for instance GPB uses bits in VXLAN that hold the next >>> protocol in VXLAN-GPE). Neither is there a VXLAN_F_GPE flag defined in >>> VXLAN to differentiate the two. So VXLAN-GPE would be used on a >>> different port >> >> Yes, and that's exactly what this patchset does. If there's the >> VXLAN_F_GPE flag defined while creating the interface, the used UDP >> port defaults to the VXLAN-GPE UDP port (but can be overriden) and the >> driver expects that all packets received are VXLAN-GPE. >> >> Note also that you can't define both GPE and GBP together, because as >> you noted, they're not compatible. The driver correctly refuses such >> combination. >> > Yes, but RCO has not been specified for VXLAN-GPE either so the patch > does not correctly refuse setting those two together. Inevitably > though, those and other extensions will defined for VXLAN-GPE and new > ones for VXLAN. Again, the protocols are fundamentally incompatible, > so instead of trying to enforce each valid combination at > configuration or performing multiple checks for flavor each time we > look at a packet, it seems easier to split the parsing with at most > one check for the protocol variant. For instance in > vxlan_udp_encap_recv just do: > > if (vs->flags & VXLAN_F_GPE) >if (!vxlan_parse_gpe_hdr(&unparsed, skb, vs->flags)) >goto drop; > else >if (!vxlan_parse_gpe(&unparsed, skb, vs->flags)) >goto drop; > I meant if (vs->flags & VXLAN_F_GPE) if (!vxlan_parse_gpe_hdr(&unparsed, skb, vs->flags)) goto drop; else if (!vxlan_parse_hdr(&unparsed, skb, vs->flags)) goto drop; > > And then move REMCSUM and GPB and other protocol specific checks to > the right function. > > Tom
Re: [PATCH net-next 1/5] vxlan: implement GPE in L2 mode
On Sat, Feb 27, 2016 at 11:31 AM, Jiri Benc wrote: > On Fri, 26 Feb 2016 15:51:29 -0800, Tom Herbert wrote: >> I don't think this is right. VXLAN-GPE is a separate protocol than >> VXLAN, they are not compatible on the wire and don't share flags or >> fields (for instance GPB uses bits in VXLAN that hold the next >> protocol in VXLAN-GPE). Neither is there a VXLAN_F_GPE flag defined in >> VXLAN to differentiate the two. So VXLAN-GPE would be used on a >> different port > > Yes, and that's exactly what this patchset does. If there's the > VXLAN_F_GPE flag defined while creating the interface, the used UDP > port defaults to the VXLAN-GPE UDP port (but can be overriden) and the > driver expects that all packets received are VXLAN-GPE. > > Note also that you can't define both GPE and GBP together, because as > you noted, they're not compatible. The driver correctly refuses such > combination. > Yes, but RCO has not been specified for VXLAN-GPE either so the patch does not correctly refuse setting those two together. Inevitably though, those and other extensions will defined for VXLAN-GPE and new ones for VXLAN. Again, the protocols are fundamentally incompatible, so instead of trying to enforce each valid combination at configuration or performing multiple checks for flavor each time we look at a packet, it seems easier to split the parsing with at most one check for the protocol variant. For instance in vxlan_udp_encap_recv just do: if (vs->flags & VXLAN_F_GPE) if (!vxlan_parse_gpe_hdr(&unparsed, skb, vs->flags)) goto drop; else if (!vxlan_parse_gpe(&unparsed, skb, vs->flags)) goto drop; And then move REMCSUM and GPB and other protocol specific checks to the right function. Tom
Re: Softirq priority inversion from "softirq: reduce latencies"
On 02/27/2016 12:13 PM, Eric Dumazet wrote: > On sam., 2016-02-27 at 10:19 -0800, Peter Hurley wrote: >> Hi Eric, >> >> For a while now, we've been struggling to understand why we've been >> observing missed uart rx DMA. >> >> Because both the uart driver (omap8250) and the dmaengine driver >> (edma) were (relatively) new, we assumed there was some race between >> starting a new rx DMA and processing the previous one. >> >> However, after instrumenting both the uart driver and the dmaengine >> driver, what we've observed is huge anomalous latencies between receiving >> the DMA interrupt and servicing the DMA tasklet. >> >> For example, at 3Mbaud we recorded the following distribution of >> softirq[TASKLET] service latency for this specific DMA channel: >> >> root@black:/sys/kernel/debug/edma# cat 35 >> latency(us): 0+ 20+ 40+ 60+ 80+ 100+ 120+ 140+ 160+ 180+ >> 200+ 220+ 240+ 260+ 280+ 300+ 320+ 340+ 360+ 380+ >>195681335315 7 4 3 1 0 0 >> 0 1 4 6 1 0 0 0 0 0 >> >> As you can see, the vast majority of tasklet service happens immediately, >> tapering off to 140+us. >> >> However, note the island of distribution at 220~300 [latencies beyond 300+ >> are not recorded because the uart fifo has filled again by this point and >> dma must be aborted]. >> >> So I cribbed together a latency tracer to catch what was happening at >> the extreme, and what it caught was a priority inversion stemming from >> your commit: >> >>commit c10d73671ad30f54692f7f69f0e09e75d3a8926a >>Author: Eric Dumazet >>Date: Thu Jan 10 15:26:34 2013 -0800 >> >>softirq: reduce latencies >> >>In various network workloads, __do_softirq() latencies can be up >>to 20 ms if HZ=1000, and 200 ms if HZ=100. >> >>This is because we iterate 10 times in the softirq dispatcher, >>and some actions can consume a lot of cycles. >> >> >> In the trace below [1], the trace begins in the edma completion interrupt >> handler when the tasklet is scheduled; the edma interrupt has occurred during >> NET_RX softirq context. >> >> However, instead of causing a restart of the softirq loop to process the >> tasklet _which occurred before sshd was scheduled_, the softirq loop is >> aborted and deferred for ksoftirqd. The tasklet is not serviced for 521us, >> which is way too long, so DMA was aborted. >> >> Your patch has effectively inverted the priority of tasklets with normal >> pri/nice processes that have merely received a network packet. >> >> ISTM, the problem you're trying to solve here was caused by NET_RX softirq >> to begin with, and maybe that thing needs a diet. >> >> But rather than outright reverting your patch, what if more selective >> conditions are used to abort the softirq restart? What would those conditions >> be? In the netperf benchmark you referred to in that commit, is it just >> NET_TX/NET_RX softirqs that are causing scheduling latencies? >> >> It just doesn't make sense to special case for a workload that isn't >> even running. 
>> >> >> Regards, >> Peter Hurley >> >> >> [1] softirq tasklet latency trace (apologies that it's only events - full >> function trace introduces too much overhead) >> >> # tracer: latency >> # >> # latency latency trace v1.1.5 on 4.5.0-rc2+ >> # >> # latency: 476 us, #59/59, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0) >> #- >> #| task: sshd-750 (uid:1000 nice:0 policy:0 rt_prio:0) >> #- >> # => started at: __tasklet_schedule >> # => ended at: tasklet_action >> # >> # >> # _--=> CPU# >> # / _-=> irqs-off >> #| / _=> need-resched >> #|| / _---=> hardirq/softirq >> #||| / _--=> preempt-depth >> # / delay >> # cmd pid | time | caller >> # \ / | \| / >> -0 0d.H31us : __tasklet_schedule >> -0 0d.H43us : softirq_raise: vec=6 [action=TASKLET] >> -0 0d.H36us : irq_handler_exit: irq=20 ret=handled >> -0 0..s2 15us : kmem_cache_alloc: call_site=c08378e4 >> ptr=de55d7c0 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_ATOMIC >> -0 0..s2 23us : netif_receive_skb_entry: dev=eth0 >> napi_id=0x0 queue_mapping=0 skbaddr=dca04400 vlan_tagged=0 vlan_proto=0x >> vlan_tci=0x000 >> 0 protocol=0x0800 ip_summed=0 hash=0x l4_hash=0 len=88 data_len=0 >> truesize=1984 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 >> gso_type=0x0 >> -0 0..s2 30us+: netif_receive_skb: dev=eth0 skbaddr=dca04400 >> len=88 >> -0 0d.s5 98us : sched_waking: comm=sshd pid=750 prio=120 >> target_cpu=000 >> -0 0d.s6 105us : sched_stat_sleep: comm=sshd pid=750 >> delay=3125230447 [ns] >> -0 0dns6 110us+: sche
Re: Softirq priority inversion from "softirq: reduce latencies"
On sam., 2016-02-27 at 10:19 -0800, Peter Hurley wrote: > Hi Eric, > > For a while now, we've been struggling to understand why we've been > observing missed uart rx DMA. > > Because both the uart driver (omap8250) and the dmaengine driver > (edma) were (relatively) new, we assumed there was some race between > starting a new rx DMA and processing the previous one. > > However, after instrumenting both the uart driver and the dmaengine > driver, what we've observed is huge anomalous latencies between receiving > the DMA interrupt and servicing the DMA tasklet. > > For example, at 3Mbaud we recorded the following distribution of > softirq[TASKLET] service latency for this specific DMA channel: > > root@black:/sys/kernel/debug/edma# cat 35 > latency(us): 0+ 20+ 40+ 60+ 80+ 100+ 120+ 140+ 160+ 180+ 200+ > 220+ 240+ 260+ 280+ 300+ 320+ 340+ 360+ 380+ >195681335315 7 4 3 1 0 0 0 > 1 4 6 1 0 0 0 0 0 > > As you can see, the vast majority of tasklet service happens immediately, > tapering off to 140+us. > > However, note the island of distribution at 220~300 [latencies beyond 300+ > are not recorded because the uart fifo has filled again by this point and > dma must be aborted]. > > So I cribbed together a latency tracer to catch what was happening at > the extreme, and what it caught was a priority inversion stemming from > your commit: > >commit c10d73671ad30f54692f7f69f0e09e75d3a8926a >Author: Eric Dumazet >Date: Thu Jan 10 15:26:34 2013 -0800 > >softirq: reduce latencies > >In various network workloads, __do_softirq() latencies can be up >to 20 ms if HZ=1000, and 200 ms if HZ=100. > >This is because we iterate 10 times in the softirq dispatcher, >and some actions can consume a lot of cycles. > > > In the trace below [1], the trace begins in the edma completion interrupt > handler when the tasklet is scheduled; the edma interrupt has occurred during > NET_RX softirq context. > > However, instead of causing a restart of the softirq loop to process the > tasklet _which occurred before sshd was scheduled_, the softirq loop is > aborted and deferred for ksoftirqd. The tasklet is not serviced for 521us, > which is way too long, so DMA was aborted. > > Your patch has effectively inverted the priority of tasklets with normal > pri/nice processes that have merely received a network packet. > > ISTM, the problem you're trying to solve here was caused by NET_RX softirq > to begin with, and maybe that thing needs a diet. > > But rather than outright reverting your patch, what if more selective > conditions are used to abort the softirq restart? What would those conditions > be? In the netperf benchmark you referred to in that commit, is it just > NET_TX/NET_RX softirqs that are causing scheduling latencies? > > It just doesn't make sense to special case for a workload that isn't > even running. 
> > > Regards, > Peter Hurley > > > [1] softirq tasklet latency trace (apologies that it's only events - full > function trace introduces too much overhead) > > # tracer: latency > # > # latency latency trace v1.1.5 on 4.5.0-rc2+ > # > # latency: 476 us, #59/59, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0) > #- > #| task: sshd-750 (uid:1000 nice:0 policy:0 rt_prio:0) > #- > # => started at: __tasklet_schedule > # => ended at: tasklet_action > # > # > # _--=> CPU# > # / _-=> irqs-off > #| / _=> need-resched > #|| / _---=> hardirq/softirq > #||| / _--=> preempt-depth > # / delay > # cmd pid | time | caller > # \ / | \| / > -0 0d.H31us : __tasklet_schedule > -0 0d.H43us : softirq_raise: vec=6 [action=TASKLET] > -0 0d.H36us : irq_handler_exit: irq=20 ret=handled > -0 0..s2 15us : kmem_cache_alloc: call_site=c08378e4 > ptr=de55d7c0 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_ATOMIC > -0 0..s2 23us : netif_receive_skb_entry: dev=eth0 napi_id=0x0 > queue_mapping=0 skbaddr=dca04400 vlan_tagged=0 vlan_proto=0x > vlan_tci=0x000 > 0 protocol=0x0800 ip_summed=0 hash=0x l4_hash=0 len=88 data_len=0 > truesize=1984 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 > gso_type=0x0 > -0 0..s2 30us+: netif_receive_skb: dev=eth0 skbaddr=dca04400 > len=88 > -0 0d.s5 98us : sched_waking: comm=sshd pid=750 prio=120 > target_cpu=000 > -0 0d.s6 105us : sched_stat_sleep: comm=sshd pid=750 > delay=3125230447 [ns] > -0 0dns6 110us+: sched_wakeup: comm=sshd pid=750 prio=120 > target_cpu=000 > -0 0dns4 123us+: timer_start: timer=dc940e9c > function=tcp_delack_timer
Re: net: mv643xx: interface does not transmit after some time
On 11/02/16 14:38, Ezequiel Garcia wrote: (let's expand the Cc a bit) On 10 February 2016 at 19:57, Andrew Lunn wrote: On Wed, Feb 10, 2016 at 07:40:54PM +0100, Thomas Schlöter wrote: On 08.02.2016 at 19:49, Thomas Schlöter wrote: On 07.02.2016 at 22:07, Thomas Schlöter wrote: On 07.02.2016 at 21:35, Andrew Lunn wrote: FWIW, we had a similar bug report in Debian recently: https://lists.debian.org/debian-arm/2016/01/msg00098.html Hi Thomas, In this thread, Ian Campbell mentions a patch. Please could you try that patch and see if it fixes your problem. Thanks Andrew Hi Andrew, I just applied the patch and the NAS is now running it. I'll try to crash it tonight and keep you informed whether it worked. Thanks Thomas Hi Andrew, the patch did not fix the problem. After 1.2 GiB RX and 950 MiB TX, the interface crashed again. Now I switched off RX/TX offload just to make sure we are talking about the same problem. If we are, the interface should be stable without offload, right? Thomas Okay, so I have installed ethtool and switched off all available offload features. Now the NAS has been running rock solid for two days. I backed up my Mac using Time Machine / netatalk (450 GiB transferred) and some Linux machines via NFS (100 GiB total) without a problem. How much code is used for mv643xx offload functionality? Is it possible to debug things in the driver and figure out what happens during the crash? Is the hardware offload interface proprietary or reverse engineered, or is it a well-known API that can be analyzed? Hi Thomas Ezequiel Garcia probably knows this part of the driver and hardware the best... The TCP segmentation offload (TSO) implemented in this driver is mostly a software thing. I'm CCing Karl and Philipp, who have fixed subtle issues in the TSO path, and may be able to help figure this one out. Hi, Had this issue occur again today. In my case it seems to be triggered by large NFSv4 transfers. I'm running 4.4 plus Nicolas Schichan's patch at https://patchwork.ozlabs.org/patch/573334/ There is a thread at http://forum.doozan.com/read.php?2,17404 suggesting that this has been broken since at least 3.16. I first spotted the issue when upgrading from 3.11 to 4.4. Looking at https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/drivers/net/ethernet/marvell/mv643xx_eth.c I see 2014-05-22 as the date TSO support was first added, which is shortly before the merge window opened for 3.16. I'm therefore guessing that TSO has been problematic since its introduction. Regards Adam
[PATCH] mld, igmp: Fix reserved tailroom calculation
The current reserved_tailroom calculation fails to take hlen and tlen into account.

skb:
[__hlen__|__data|__tlen___|__extra__]
^                                   ^
head                                skb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb. "extra" is the extra space made available in __alloc_skb because of rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data|__extra__|__tlen___]
^                                   ^
head                                skb_end_offset

The maximum space available for ip headers and payload without fragmentation is min(mtu, data + extra). Therefore,

reserved_tailroom = data + extra + tlen - min(mtu, data + extra)
                  = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
                  = skb_tailroom - min(mtu, skb_tailroom - tlen)   ; after skb_reserve(hlen)

Compare the second line to the current expression: reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset) and we can see that hlen and tlen are not taken into account. Depending on hlen, tlen, mtu and the number of multicast address records, the current code may output skbs that have less tailroom than dev->needed_tailroom or it may output more skbs than needed because not all space available is used. Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs") Signed-off-by: Benjamin Poirier --- net/ipv4/igmp.c | 4 ++-- net/ipv6/mcast.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c index 05e4cba..b5d28a4 100644 --- a/net/ipv4/igmp.c +++ b/net/ipv4/igmp.c @@ -356,9 +356,9 @@ static struct sk_buff *igmpv3_newpack(struct net_device *dev, unsigned int mtu) skb_dst_set(skb, &rt->dst); skb->dev = dev; - skb->reserved_tailroom = skb_end_offset(skb) - -min(mtu, skb_end_offset(skb)); skb_reserve(skb, hlen); + skb->reserved_tailroom = skb_tailroom(skb) - + min_t(int, mtu, skb_tailroom(skb) - tlen); skb_reset_network_header(skb); pip = ip_hdr(skb); diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c index 5ee56d0..c157edc 100644 --- a/net/ipv6/mcast.c +++ b/net/ipv6/mcast.c @@ -1574,9 +1574,9 @@ static struct sk_buff *mld_newpack(struct inet6_dev *idev, unsigned int mtu) return NULL; skb->priority = TC_PRIO_CONTROL; - skb->reserved_tailroom = skb_end_offset(skb) - -min(mtu, skb_end_offset(skb)); skb_reserve(skb, hlen); + skb->reserved_tailroom = skb_tailroom(skb) - + min_t(int, mtu, skb_tailroom(skb) - tlen); if (__ipv6_get_lladdr(idev, &addr_buf, IFA_F_TENTATIVE)) { /* : -- 2.7.0
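A worked example with invented numbers: let hlen = 16, tlen = 8, mtu = 1500, and suppose kmalloc rounding yields skb_end_offset = 1600, so skb_tailroom = 1584 after skb_reserve(16). The current expression gives reserved_tailroom = 1600 - min(1500, 1600) = 100, leaving 1584 - 100 = 1484 usable bytes, 16 bytes short of what the mtu allows, so more skbs are emitted than needed. On a large-mtu device with mtu = 1600 it gives reserved_tailroom = 1600 - 1600 = 0, so all 1584 bytes get used and the 8 bytes meant for tlen are silently consumed. The fixed expression yields 1584 - min(1500, 1576) = 84 and 1584 - min(1600, 1576) = 8 respectively: exactly min(mtu, data + extra) bytes are available and tlen stays reserved in both cases.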
Re: [PATCH net-next 5/5] vxlan: implement GPE in L3 mode
On Sat, 27 Feb 2016 20:21:59 +0100, Jiri Benc wrote: > You mean returning ETH_P_TEB in skb->protocol? That's not much useful, > unfortunately. You won't get such packet processed by the kernel IP > stack, rendering the VXLAN-GPE device unusable outside of ovs. It would > effectively became a packet sink when used standalone, as it cannot be > added to bridge and received packets are not processed by anything - > there's no protocol handler for ETH_P_TEB. Actually, I can do that in the L3 mode (or whatever we'll call it). It won't hurt anything and may be useful for openvswitch. Ovs will have to special case VXLAN-GPE vport (or perhaps any ARPHRD_NONE port) and set skb->protocol to ETH_P_TEB on xmit and dissect correctly the ETH_P_TEB packet on rcv. The L2 mode (or whatever we'll call it) will need to stay, though, for non-ovs use cases. Jiri
Re: [PATCH net-next 1/5] vxlan: implement GPE in L2 mode
On Fri, 26 Feb 2016 15:51:29 -0800, Tom Herbert wrote: > I don't think this is right. VXLAN-GPE is a separate protocol than > VXLAN, they are not compatible on the wire and don't share flags or > fields (for instance GPB uses bits in VXLAN that hold the next > protocol in VXLAN-GPE). Neither is there a VXLAN_F_GPE flag defined in > VXLAN to differentiate the two. So VXLAN-GPE would be used on a > different port Yes, and that's exactly what this patchset does. If the VXLAN_F_GPE flag is set while creating the interface, the UDP port defaults to the VXLAN-GPE UDP port (but can be overridden) and the driver expects that all packets received are VXLAN-GPE. Note also that you can't define both GPE and GBP together because, as you noted, they're not compatible. The driver correctly refuses such a combination. > and probably needs its own rcv functions. I don't see the need for code duplication. This patchset does exactly what you described and reuses the code, as most of it is really the same for all VXLAN modes. I also made sure this is as clean as possible in the driver, which was the reason for the previous 4 cleanup patchsets. Jiri
Re: [PATCH net-next 5/5] vxlan: implement GPE in L3 mode
On Fri, 26 Feb 2016 15:42:29 -0800, Tom Herbert wrote: > Agreed, and I don't see why there even needs to be modes. VXLAN-GPE > can carry arbitrary protocols with a next-header field. For Ethernet, > MPLS, IPv4, and IPv6 it should just be a simple mapping of the next > header to Ethertype for purposes of processing the payload. That's exactly what this patchset does, Tom. The mapping is done in vxlan_parse_gpe_hdr and vxlan_build_gpe_hdr. Ethernet is special, though. It needs to be a standalone mode, otherwise frames encapsulated together with an Ethernet header wouldn't be processed and there would be no way to send such packets - the only distinction the driver can use is skb->protocol, and that won't become ETH_P_TEB magically. Jiri
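For readers without the draft at hand, the mapping in question is essentially a translation table between the GPE next-protocol field and ethertypes. A sketch (the constant names follow the obvious convention and may differ from the actual patch):

/* Sketch: translate the VXLAN-GPE next-protocol field into the
 * ethertype the rest of the stack understands. */
static __be16 vxlan_gpe_np_to_protocol(u8 next_protocol)
{
	switch (next_protocol) {
	case VXLAN_GPE_NP_IPV4:
		return htons(ETH_P_IP);
	case VXLAN_GPE_NP_IPV6:
		return htons(ETH_P_IPV6);
	case VXLAN_GPE_NP_ETHERNET:
		return htons(ETH_P_TEB);
	default:
		return 0;	/* unknown payload: caller drops */
	}
}

The ETH_P_TEB case is exactly why Ethernet needs its own mode: nothing in the plain IP stack consumes ETH_P_TEB skbs.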
Re: [PATCH net-next 5/5] vxlan: implement GPE in L3 mode
On Fri, 26 Feb 2016 14:22:03 -0800, Jesse Gross wrote: > Given that VXLAN_GPE_MODE_L3 will eventually come to be used by NSH, > MPLS, etc. in addition to IPv4/v6, most of which are not really L3, it > seems like something along the lines of NO_ARP might be better since > that's what it really indicates. I have no problem naming this differently. Not sure NO_ARP is the best name, though - this is more about the absence of the L2 header in received packets than about ARP. > Once that is in, I don't really see > the need to explicitly block Ethernet packets from being handled in > this mode. If they are received, then they can just be handed off to > the stack - at that point it would look like an extra header, the same > as if an NSH packet is received. You mean returning ETH_P_TEB in skb->protocol? That's not very useful, unfortunately. You won't get such a packet processed by the kernel IP stack, rendering the VXLAN-GPE device unusable outside of ovs. It would effectively become a packet sink when used standalone, as it cannot be added to a bridge and received packets are not processed by anything - there's no protocol handler for ETH_P_TEB. With this patchset, you can create a VXLAN-GPE interface and use it as any other point-to-point interface, and it works as expected with routing etc. The distinction between Ethernet and no Ethernet is needed; the interface won't work otherwise. Jiri
Re: [PATCH 3/3] 3c59x: Use setup_timer()
On Thu, 25 Feb 2016, David Miller wrote: > From: Amitoj Kaur Chawla > Date: Wed, 24 Feb 2016 19:28:19 +0530 > >> Convert a call to init_timer and accompanying intializations of >> the timer's data and function fields to a call to setup_timer. >> >> The Coccinelle semantic patch that fixes this problem is >> as follows: >> >> // >> @@ >> expression t,f,d; >> @@ >> >> -init_timer(&t); >> +setup_timer(&t,f,d); >> ... >> -t.data = d; >> -t.function = f; >> // >> >> Signed-off-by: Amitoj Kaur Chawla > > Applied. Hi David, Amitoj, The patch here seemed to remove the call to add_timer(&vp->timer) which applies the expires time. Would that be an issue? -Stafford
RE: [PATCH net 1/3] r8169:fix nic sometimes doesn't work after changing the mac address.
> Instead of taking the device out of suspended mode to perform the required > action, the driver is moving to a model where 1) said action may be > scheduled to a later time - or result from past time work - and 2) rpm handler > must handle a lot of pm unrelated work. > > rtl8169_ethtool_ops.{get_wol, get_regs, get_settings} aren't even fixed yet > (what about the .set_xyz handlers ?). > > I can't help thinking that the driver should return to a state where it > stupidly > does what it is asked to. No software caching, plain device access, resume > when needed, suspend as "suspend" instead of suspend as "anticipate > whatever may happen to avoid waking up". > These rpm-related patches are just workarounds for the issues reported by end users. As you say, the Linux kernel should handle these events when the driver is in the runtime suspend state.
Softirq priority inversion from "softirq: reduce latencies"
Hi Eric, For a while now, we've been struggling to understand why we've been observing missed uart rx DMA. Because both the uart driver (omap8250) and the dmaengine driver (edma) were (relatively) new, we assumed there was some race between starting a new rx DMA and processing the previous one. However, after instrumenting both the uart driver and the dmaengine driver, what we've observed is huge anomalous latencies between receiving the DMA interrupt and servicing the DMA tasklet. For example, at 3Mbaud we recorded the following distribution of softirq[TASKLET] service latency for this specific DMA channel: root@black:/sys/kernel/debug/edma# cat 35 latency(us): 0+ 20+ 40+ 60+ 80+ 100+ 120+ 140+ 160+ 180+ 200+ 220+ 240+ 260+ 280+ 300+ 320+ 340+ 360+ 380+ 195681335315 7 4 3 1 0 0 0 1 4 6 1 0 0 0 0 0 As you can see, the vast majority of tasklet service happens immediately, tapering off to 140+us. However, note the island of distribution at 220~300 [latencies beyond 300+ are not recorded because the uart fifo has filled again by this point and dma must be aborted]. So I cribbed together a latency tracer to catch what was happening at the extreme, and what it caught was a priority inversion stemming from your commit: commit c10d73671ad30f54692f7f69f0e09e75d3a8926a Author: Eric Dumazet Date: Thu Jan 10 15:26:34 2013 -0800 softirq: reduce latencies In various network workloads, __do_softirq() latencies can be up to 20 ms if HZ=1000, and 200 ms if HZ=100. This is because we iterate 10 times in the softirq dispatcher, and some actions can consume a lot of cycles. In the trace below [1], the trace begins in the edma completion interrupt handler when the tasklet is scheduled; the edma interrupt has occurred during NET_RX softirq context. However, instead of causing a restart of the softirq loop to process the tasklet _which occurred before sshd was scheduled_, the softirq loop is aborted and deferred for ksoftirqd. The tasklet is not serviced for 521us, which is way too long, so DMA was aborted. Your patch has effectively inverted the priority of tasklets with normal pri/nice processes that have merely received a network packet. ISTM, the problem you're trying to solve here was caused by NET_RX softirq to begin with, and maybe that thing needs a diet. But rather than outright reverting your patch, what if more selective conditions are used to abort the softirq restart? What would those conditions be? In the netperf benchmark you referred to in that commit, is it just NET_TX/NET_RX softirqs that are causing scheduling latencies? It just doesn't make sense to special case for a workload that isn't even running.
Regards, Peter Hurley [1] softirq tasklet latency trace (apologies that it's only events - full function trace introduces too much overhead) # tracer: latency # # latency latency trace v1.1.5 on 4.5.0-rc2+ # # latency: 476 us, #59/59, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0) #- #| task: sshd-750 (uid:1000 nice:0 policy:0 rt_prio:0) #- # => started at: __tasklet_schedule # => ended at: tasklet_action # # # _--=> CPU# # / _-=> irqs-off #| / _=> need-resched #|| / _---=> hardirq/softirq #||| / _--=> preempt-depth # / delay # cmd pid | time | caller # \ / | \| / -0 0d.H31us : __tasklet_schedule -0 0d.H43us : softirq_raise: vec=6 [action=TASKLET] -0 0d.H36us : irq_handler_exit: irq=20 ret=handled -0 0..s2 15us : kmem_cache_alloc: call_site=c08378e4 ptr=de55d7c0 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_ATOMIC -0 0..s2 23us : netif_receive_skb_entry: dev=eth0 napi_id=0x0 queue_mapping=0 skbaddr=dca04400 vlan_tagged=0 vlan_proto=0x vlan_tci=0x000 0 protocol=0x0800 ip_summed=0 hash=0x l4_hash=0 len=88 data_len=0 truesize=1984 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 gso_type=0x0 -0 0..s2 30us+: netif_receive_skb: dev=eth0 skbaddr=dca04400 len=88 -0 0d.s5 98us : sched_waking: comm=sshd pid=750 prio=120 target_cpu=000 -0 0d.s6 105us : sched_stat_sleep: comm=sshd pid=750 delay=3125230447 [ns] -0 0dns6 110us+: sched_wakeup: comm=sshd pid=750 prio=120 target_cpu=000 -0 0dns4 123us+: timer_start: timer=dc940e9c function=tcp_delack_timer expires=9746 [timeout=10] flags=0x -0 0dnH3 150us : irq_handler_entry: irq=176 name=4a10.ethernet -0 0dnH3 153us : softirq_raise: vec=3 [action=NET_RX] -0 0dnH3 155us : irq_handler_exit: irq=176 ret=handled -0 0dnH3 160us : irq_handler_entry:
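For reference, the deferral Peter describes comes down to the tail of __do_softirq() after the commit above (a simplified sketch, not the verbatim source):

restart:
	/* run each pending softirq action once, in priority order */
	...

	pending = local_softirq_pending();
	if (pending) {
		/* end = jiffies + MAX_SOFTIRQ_TIME (2 ms),
		 * max_restart = MAX_SOFTIRQ_RESTART (10) */
		if (time_before(jiffies, end) && !need_resched() &&
		    --max_restart)
			goto restart;

		/* anything still pending -- e.g. a tasklet raised by a
		 * hardirq during the NET_RX action -- is left for
		 * ksoftirqd */
		wakeup_softirqd();
	}

The need_resched() test is the piece the trace above is hitting: the sshd wakeup sets the resched flag, the loop declines to restart, and the freshly raised TASKLET softirq waits behind the woken process until ksoftirqd runs.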
[PATCH net-next] net: ipv6/l3mdev: Move host route on saved address if necessary
Commit f1705ec197e70 allows IPv6 addresses to be retained on a link down. The address can have a cached host route which can point to the wrong FIB table if the L3 enslavement is changed (e.g., route can point to local table instead of VRF table if device is added to an L3 domain). On link up check the table of the cached host route against the FIB table associated with the device and correct if needed. Signed-off-by: David Ahern --- Normally the 'if CONFIG_NET_L3_MASTER_DEV is enabled' checks are all done in l3mdev.h. In this case putting the functions in the l3mdev header requires adding ipv6 header files which blows up compiles. net/ipv6/addrconf.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index a2d6f6c242af..afab4c359b5b 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -3170,9 +3170,35 @@ static void addrconf_gre_config(struct net_device *dev) } #endif +#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV) +/* If the host route is cached on the addr struct make sure it is associated + * with the proper table. e.g., enslavement can change and if so the cached + * host route needs to move to the new table. + */ +static void l3mdev_check_host_rt(struct inet6_dev *idev, + struct inet6_ifaddr *ifp) +{ + if (ifp->rt) { + u32 tb_id = l3mdev_fib_table(idev->dev) ? : RT6_TABLE_LOCAL; + + if (tb_id != ifp->rt->rt6i_table->tb6_id) { + ip6_del_rt(ifp->rt); + ifp->rt = NULL; + } + } +} +#else +static void l3mdev_check_host_rt(struct inet6_dev *idev, + struct inet6_ifaddr *ifp) +{ +} +#endif + static int fixup_permanent_addr(struct inet6_dev *idev, struct inet6_ifaddr *ifp) { + l3mdev_check_host_rt(idev, ifp); + if (!ifp->rt) { struct rt6_info *rt; -- 2.1.4
[PATCH] wan: lmc: Switch to using managed resources
Use managed resource functions devm_kzalloc and pcim_enable_device to simplify error handling. Subsequently, remove unnecessary kfree, pci_disable_device and pci_release_regions. To be compatible with the change, various gotos are replaced with direct returns and unneeded labels are dropped. Also, `sc` was only being freed in the probe function and not the remove function before the change. By using devm_kzalloc this patch also fixes this memory leak. Signed-off-by: Amitoj Kaur Chawla --- I was not able to find anywhere that `sc` might be freed. However, if a free has been overlooked, there will be a double free, due to the free implicitly performed by devm_kzalloc. drivers/net/wan/lmc/lmc_main.c | 27 +++ 1 file changed, 7 insertions(+), 20 deletions(-) diff --git a/drivers/net/wan/lmc/lmc_main.c b/drivers/net/wan/lmc/lmc_main.c index 317bc79..bb33b24 100644 --- a/drivers/net/wan/lmc/lmc_main.c +++ b/drivers/net/wan/lmc/lmc_main.c @@ -826,7 +826,7 @@ static int lmc_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) /* lmc_trace(dev, "lmc_init_one in"); */ - err = pci_enable_device(pdev); + err = pcim_enable_device(pdev); if (err) { printk(KERN_ERR "lmc: pci enable failed: %d\n", err); return err; @@ -835,23 +835,20 @@ static int lmc_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) err = pci_request_regions(pdev, "lmc"); if (err) { printk(KERN_ERR "lmc: pci_request_region failed\n"); - goto err_req_io; + return err; } /* * Allocate our own device structure */ - sc = kzalloc(sizeof(lmc_softc_t), GFP_KERNEL); - if (!sc) { - err = -ENOMEM; - goto err_kzalloc; - } + sc = devm_kzalloc(&pdev->dev, sizeof(lmc_softc_t), GFP_KERNEL); + if (!sc) + return -ENOMEM; dev = alloc_hdlcdev(sc); if (!dev) { printk(KERN_ERR "lmc:alloc_netdev for device failed\n"); - err = -ENOMEM; - goto err_hdlcdev; + return -ENOMEM; } @@ -888,7 +885,7 @@ static int lmc_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) if (err) { printk(KERN_ERR "%s: register_netdev failed.\n", dev->name); free_netdev(dev); - goto err_hdlcdev; + return err; } sc->lmc_cardtype = LMC_CARDTYPE_UNKNOWN; @@ -971,14 +968,6 @@ static int lmc_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) lmc_trace(dev, "lmc_init_one out"); return 0; - -err_hdlcdev: - kfree(sc); -err_kzalloc: - pci_release_regions(pdev); -err_req_io: - pci_disable_device(pdev); - return err; } /* @@ -992,8 +981,6 @@ static void lmc_remove_one(struct pci_dev *pdev) printk(KERN_DEBUG "%s: removing...\n", dev->name); unregister_hdlc_device(dev); free_netdev(dev); - pci_release_regions(pdev); - pci_disable_device(pdev); } } -- 1.9.1
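For anyone new to the managed-resource idiom this relies on, the general shape is below (a generic sketch, not this driver's code). Resources tied to the struct device are torn down automatically when probe fails or the device is unbound, which is what makes the explicit error labels and the frees in remove() unnecessary:

struct foo_priv {
	int dummy;	/* driver state would live here */
};

static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct foo_priv *priv;
	int err;

	err = pcim_enable_device(pdev);	/* undone automatically by devres */
	if (err)
		return err;

	priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
	if (!priv)
		return -ENOMEM;		/* nothing to unwind */

	pci_set_drvdata(pdev, priv);
	return 0;
}

static void foo_remove(struct pci_dev *pdev)
{
	/* no kfree()/pci_disable_device(): devres handles both */
}

Note that once pcim_enable_device() has marked the device managed, a subsequent pci_request_regions() is tracked by devres as well, which is why the patch can drop pci_release_regions() too.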
Re: [V9fs-developer] [PATCH] net/9p: convert to new CQ API
Hi, A couple of checkpatch complaints: Christoph Hellwig wrote on Sat, Feb 27, 2016: > -struct p9_rdma_context { > - enum ib_wc_opcode wc_op; > +struct p9_rdma_context { trailing tab > - p9_debug(P9_DEBUG_ERROR, "req %p err %d status %d\n", req, err, status); > + p9_debug(P9_DEBUG_ERROR, "req %p err %d status %d\n", req, err, > wc->status); line over 80 chars That aside, it looks good; I need to check on the new API (hadn't noticed the change) but it looks nice. Will likely do the actual testing only next week, though; Eric has been taking my patches for 9p/RDMA so I suspect he'll take yours as well eventually (get_maintainer.pl has a long-ish list of CCs for us usually) BTW I think it's easy enough to do the testing if you have a server that can dish it out. diod[1] and nfs-ganesha[2] are the only two I'm aware of but there might be more (using ganesha myself; happy to help you set it up in private if you need) [1] https://github.com/chaos/diod [2] https://github.com/nfs-ganesha/nfs-ganesha -- Dominique Martinet
[net-next-2.6 v4 0/3] net_sched: Add support for IFE action
From: Jamal Hadi Salim As agreed at netconf in Seville, here's the patch finally (1 year was just too long to wait for an ethertype. Now we are just going to have the user configure one). Described in netdev01 paper: "Distributing Linux Traffic Control Classifier-Action Subsystem" Authors: Jamal Hadi Salim and Damascene M. Joachimpillai The original motivation and deployment of this work was to horizontally scale packet processing at the scope of a chassis or rack. This means one could take a tc policy and split it across machines connected over L2. The paper refers to this as "pipeline stage indexing". Other use cases which evolved out of the original intent include but are not limited to carrying OAM information, carrying exception handling metadata, carrying programmed authentication and authorization information, encapsulating programmed compliance information, service IDs etc. Read the referenced paper for more details. The architecture allows for incremental updates for new metadatum support to cover different use cases. This patch set includes support for basic skb metadata. Followup patches will have more examples of metadata and other features. v4 changes: Integrate more feedback from Cong v3 changes: Integrate with the new namespace changes Remove skbhash and queue mapping metadata (but keep their claim for ids) Integrate feedback from Cong Integrate feedback from Daniel v2 changes: Remove module option for an upper bound of metadata Integrate feedback from Cong Integrate feedback from Daniel Jamal Hadi Salim (3): introduce IFE action Support to encoding decoding skb mark on IFE action Support to encoding decoding skb prio on IFE action include/net/tc_act/tc_ife.h | 61 +++ include/uapi/linux/tc_act/tc_ife.h | 38 ++ net/sched/Kconfig | 22 + net/sched/Makefile | 3 + net/sched/act_ife.c | 883 + net/sched/act_meta_mark.c | 79 net/sched/act_meta_skbprio.c | 76 7 files changed, 1162 insertions(+) create mode 100644 include/net/tc_act/tc_ife.h create mode 100644 include/uapi/linux/tc_act/tc_ife.h create mode 100644 net/sched/act_ife.c create mode 100644 net/sched/act_meta_mark.c create mode 100644 net/sched/act_meta_skbprio.c -- 1.9.1
[net-next-2.6 PATCH v4 2/3] Support to encoding decoding skb mark on IFE action
From: Jamal Hadi Salim Example usage: Set the skb using skbedit then allow it to be encoded sudo tc qdisc add dev $ETH root handle 1: prio sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action skbedit mark 17 \ action ife encode \ allow mark \ dst 02:15:15:15:15:15 Note: You dont need the skbedit action if you are already encoding the skb mark earlier. A zero skb mark, when seen, will not be encoded. Alternative hard code static mark of 0x12 every time the filter matches sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action ife encode \ type 0xDEAD \ use mark 0x12 \ dst 02:15:15:15:15:15 Signed-off-by: Jamal Hadi Salim --- net/sched/Kconfig | 5 +++ net/sched/Makefile| 1 + net/sched/act_meta_mark.c | 79 +++ 3 files changed, 85 insertions(+) create mode 100644 net/sched/act_meta_mark.c diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 4d48ef5..85854c0 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -751,6 +751,11 @@ config NET_ACT_IFE To compile this code as a module, choose M here: the module will be called act_ife. +config NET_IFE_SKBMARK +tristate "Support to encoding decoding skb mark on IFE action" +depends on NET_ACT_IFE +---help--- + config NET_CLS_IND bool "Incoming device classification" depends on NET_CLS_U32 || NET_CLS_FW diff --git a/net/sched/Makefile b/net/sched/Makefile index 3d17667..3f7a182 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -20,6 +20,7 @@ obj-$(CONFIG_NET_ACT_VLAN)+= act_vlan.o obj-$(CONFIG_NET_ACT_BPF) += act_bpf.o obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.o obj-$(CONFIG_NET_ACT_IFE) += act_ife.o +obj-$(CONFIG_NET_IFE_SKBMARK) += act_meta_mark.o obj-$(CONFIG_NET_SCH_FIFO) += sch_fifo.o obj-$(CONFIG_NET_SCH_CBQ) += sch_cbq.o obj-$(CONFIG_NET_SCH_HTB) += sch_htb.o diff --git a/net/sched/act_meta_mark.c b/net/sched/act_meta_mark.c new file mode 100644 index 000..8289217 --- /dev/null +++ b/net/sched/act_meta_mark.c @@ -0,0 +1,79 @@ +/* + * net/sched/act_meta_mark.c IFE skb->mark metadata module + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. 
+ * + * copyright Jamal Hadi Salim (2015) + * +*/ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static int skbmark_encode(struct sk_buff *skb, void *skbdata, + struct tcf_meta_info *e) +{ + u32 ifemark = skb->mark; + + return ife_encode_meta_u32(ifemark, skbdata, e); +} + +static int skbmark_decode(struct sk_buff *skb, void *data, u16 len) +{ + u32 ifemark = *(u32 *)data; + + skb->mark = ntohl(ifemark); + return 0; +} + +static int skbmark_check(struct sk_buff *skb, struct tcf_meta_info *e) +{ + return ife_check_meta_u32(skb->mark, e); +} + +static struct tcf_meta_ops ife_skbmark_ops = { + .metaid = IFE_META_SKBMARK, + .metatype = NLA_U32, + .name = "skbmark", + .synopsis = "skb mark 32 bit metadata", + .check_presence = skbmark_check, + .encode = skbmark_encode, + .decode = skbmark_decode, + .get = ife_get_meta_u32, + .alloc = ife_alloc_meta_u32, + .release = ife_release_meta_gen, + .validate = ife_validate_meta_u32, + .owner = THIS_MODULE, +}; + +static int __init ifemark_init_module(void) +{ + return register_ife_op(&ife_skbmark_ops); +} + +static void __exit ifemark_cleanup_module(void) +{ + unregister_ife_op(&ife_skbmark_ops); +} + +module_init(ifemark_init_module); +module_exit(ifemark_cleanup_module); + +MODULE_AUTHOR("Jamal Hadi Salim(2015)"); +MODULE_DESCRIPTION("Inter-FE skb mark metadata module"); +MODULE_LICENSE("GPL"); +MODULE_ALIAS_IFE_META(IFE_META_SKBMARK); -- 1.9.1
[net-next-2.6 PATCH v4 3/3] Support to encoding decoding skb prio on IFE action
From: Jamal Hadi Salim Example usage: Set the skb priority using skbedit then allow it to be encoded sudo tc qdisc add dev $ETH root handle 1: prio sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action skbedit prio 17 \ action ife encode \ allow prio \ dst 02:15:15:15:15:15 Note: You dont need the skbedit action if you are already encoding the skb priority earlier. A zero skb priority will not be sent Alternative hard code static priority of decimal 33 (unlike skbedit) then mark of 0x12 every time the filter matches sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \ u32 match ip protocol 1 0xff flowid 1:2 \ action ife encode \ type 0xDEAD \ use prio 33 \ use mark 0x12 \ dst 02:15:15:15:15:15 Signed-off-by: Jamal Hadi Salim --- net/sched/Kconfig| 5 +++ net/sched/Makefile | 1 + net/sched/act_meta_skbprio.c | 76 3 files changed, 82 insertions(+) create mode 100644 net/sched/act_meta_skbprio.c diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 85854c0..b148302 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -756,6 +756,11 @@ config NET_IFE_SKBMARK depends on NET_ACT_IFE ---help--- +config NET_IFE_SKBPRIO +tristate "Support to encoding decoding skb prio on IFE action" +depends on NET_ACT_IFE +---help--- + config NET_CLS_IND bool "Incoming device classification" depends on NET_CLS_U32 || NET_CLS_FW diff --git a/net/sched/Makefile b/net/sched/Makefile index 3f7a182..84bddb3 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -21,6 +21,7 @@ obj-$(CONFIG_NET_ACT_BPF) += act_bpf.o obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.o obj-$(CONFIG_NET_ACT_IFE) += act_ife.o obj-$(CONFIG_NET_IFE_SKBMARK) += act_meta_mark.o +obj-$(CONFIG_NET_IFE_SKBPRIO) += act_meta_skbprio.o obj-$(CONFIG_NET_SCH_FIFO) += sch_fifo.o obj-$(CONFIG_NET_SCH_CBQ) += sch_cbq.o obj-$(CONFIG_NET_SCH_HTB) += sch_htb.o diff --git a/net/sched/act_meta_skbprio.c b/net/sched/act_meta_skbprio.c new file mode 100644 index 000..26bf4d8 --- /dev/null +++ b/net/sched/act_meta_skbprio.c @@ -0,0 +1,76 @@ +/* + * net/sched/act_meta_prio.c IFE skb->priority metadata module + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. 
+ * + * copyright Jamal Hadi Salim (2015) + * +*/ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static int skbprio_check(struct sk_buff *skb, struct tcf_meta_info *e) +{ + return ife_check_meta_u32(skb->priority, e); +} + +static int skbprio_encode(struct sk_buff *skb, void *skbdata, + struct tcf_meta_info *e) +{ + u32 ifeprio = skb->priority; /* avoid having to cast skb->priority*/ + + return ife_encode_meta_u32(ifeprio, skbdata, e); +} + +static int skbprio_decode(struct sk_buff *skb, void *data, u16 len) +{ + u32 ifeprio = *(u32 *)data; + + skb->priority = ntohl(ifeprio); + return 0; +} + +static struct tcf_meta_ops ife_prio_ops = { + .metaid = IFE_META_PRIO, + .metatype = NLA_U32, + .name = "skbprio", + .synopsis = "skb prio metadata", + .check_presence = skbprio_check, + .encode = skbprio_encode, + .decode = skbprio_decode, + .get = ife_get_meta_u32, + .alloc = ife_alloc_meta_u32, + .owner = THIS_MODULE, +}; + +static int __init ifeprio_init_module(void) +{ + return register_ife_op(&ife_prio_ops); +} + +static void __exit ifeprio_cleanup_module(void) +{ + unregister_ife_op(&ife_prio_ops); +} + +module_init(ifeprio_init_module); +module_exit(ifeprio_cleanup_module); + +MODULE_AUTHOR("Jamal Hadi Salim(2015)"); +MODULE_DESCRIPTION("Inter-FE skb prio metadata action"); +MODULE_LICENSE("GPL"); +MODULE_ALIAS_IFE_META(IFE_META_PRIO); -- 1.9.1
[net-next-2.6 PATCH v4 1/3] introduce IFE action
From: Jamal Hadi Salim This action allows for a sending side to encapsulate arbitrary metadata which is decapsulated by the receiving end. The sender runs in encoding mode and the receiver in decode mode. Both sender and receiver must specify the same ethertype. At some point we hope to have a registered ethertype and we'll then provide a default so the user doesnt have to specify it. For now we enforce the user specify it. Lets show example usage where we encode icmp from a sender towards a receiver with an skbmark of 17; both sender and receiver use ethertype of 0xdead to interop. : Lets start with Receiver-side policy config: xxx: add an ingress qdisc sudo tc qdisc add dev $ETH ingress xxx: any packets with ethertype 0xdead will be subjected to ife decoding xxx: we then restart the classification so we can match on icmp at prio 3 sudo $TC filter add dev $ETH parent : prio 2 protocol 0xdead \ u32 match u32 0 0 flowid 1:1 \ action ife decode reclassify xxx: on restarting the classification from above if it was an icmp xxx: packet, then match it here and continue to the next rule at prio 4 xxx: which will match based on skb mark of 17 sudo tc filter add dev $ETH parent : prio 3 protocol ip \ u32 match ip protocol 1 0xff flowid 1:1 \ action continue xxx: match on skbmark of 0x11 (decimal 17) and accept sudo tc filter add dev $ETH parent : prio 4 protocol ip \ handle 0x11 fw flowid 1:1 \ action ok xxx: Lets show the decoding policy sudo tc -s filter ls dev $ETH parent : protocol 0xdead xxx: filter pref 2 u32 filter pref 2 u32 fh 800: ht divisor 1 filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 (rule hit 0 success 0) match / at 0 (success 0 ) action order 1: ife decode action reclassify index 1 ref 1 bind 1 installed 14 sec used 14 sec type: 0x0 Metadata: allow mark allow hash allow prio allow qmap Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 xxx: Observe that above lists all metadatum it can decode. Typically these submodules will already be compiled into a monolithic kernel or loaded as modules : Lets show the sender side now .. 
xxx: Add an egress qdisc on the sender netdev sudo tc qdisc add dev $ETH root handle 1: prio xxx: xxx: Match all icmp packets to 192.168.122.237/24, then xxx: tag the packet with skb mark of decimal 17, then xxx: Encode it with: xxx:ethertype 0xdead xxx:add skb->mark to whitelist of metadatum to send xxx:rewrite target dst MAC address to 02:15:15:15:15:15 xxx: sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 u32 \ match ip dst 192.168.122.237/24 \ match ip protocol 1 0xff \ flowid 1:2 \ action skbedit mark 17 \ action ife encode \ type 0xDEAD \ allow mark \ dst 02:15:15:15:15:15 xxx: Lets show the encoding policy sudo tc -s filter ls dev $ETH parent 1: protocol ip xxx: filter pref 10 u32 filter pref 10 u32 fh 800: ht divisor 1 filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2 (rule hit 0 success 0) match c0a87aed/ at 16 (success 0 ) match 0001/00ff at 8 (success 0 ) action order 1: skbedit mark 17 index 6 ref 1 bind 1 Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 action order 2: ife encode action pipe index 3 ref 1 bind 1 dst MAC: 02:15:15:15:15:15 type: 0xDEAD Metadata: allow mark Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 xxx: test by sending ping from sender to destination Signed-off-by: Jamal Hadi Salim --- include/net/tc_act/tc_ife.h| 61 +++ include/uapi/linux/tc_act/tc_ife.h | 38 ++ net/sched/Kconfig | 12 + net/sched/Makefile | 1 + net/sched/act_ife.c| 870 + 5 files changed, 982 insertions(+) create mode 100644 include/net/tc_act/tc_ife.h create mode 100644 include/uapi/linux/tc_act/tc_ife.h create mode 100644 net/sched/act_ife.c diff --git a/include/net/tc_act/tc_ife.h b/include/net/tc_act/tc_ife.h new file mode 100644 index 000..dc9a09a --- /dev/null +++ b/include/net/tc_act/tc_ife.h @@ -0,0 +1,61 @@ +#ifndef __NET_TC_IFE_H +#define __NET_TC_IFE_H + +#include +#include +#include +#include + +#define IFE_METAHDRLEN 2 +struct tcf_ife_info { + struct tcf_common common; + u8 eth_dst[ETH_ALEN]; + u8 eth_src[ETH_ALEN]; + u16 eth_type; + u16 flags; + /* list of metaids allowed */ + struct list_head metalist; +}; +#define to_ife(a) \ + container_of(a->priv, struct tcf_ife_info, common) + +struct tcf_meta_info { + const struct tcf_meta_ops *ops; + void *metaval; + u16 metaid; + struct list_head metalist; +}; + +struct tcf_meta_ops { +
Re: [net-next-2.6 v3 1/3] introduce IFE action
On 16-02-26 06:49 PM, Cong Wang wrote:
> On Fri, Feb 26, 2016 at 2:43 PM, Jamal Hadi Salim wrote:
> [...]
>
> Just some quick reviews... ;)

;->
OK, I'll send an update in a little while, after some basic testing...

cheers,
jamal
Re: [patch] rocker: fix an error code
Sat, Feb 27, 2016 at 12:31:43PM CET, dan.carpen...@oracle.com wrote:
> We intended to return PTR_ERR() here instead of 1.
>
> Fixes: 1f9993f6825f ('rocker: fix a neigh entry leak issue')
> Signed-off-by: Dan Carpenter

Acked-by: Jiri Pirko
[patch] rocker: fix an error code
We intended to return PTR_ERR() here instead of 1.

Fixes: 1f9993f6825f ('rocker: fix a neigh entry leak issue')
Signed-off-by: Dan Carpenter
---
We recently moved rocker files around so this only applies to -next.
Probably returning the wrong error code is harmless.

diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 099008a..07218c3 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -1449,7 +1449,7 @@ static int ofdpa_port_ipv4_resolve(struct ofdpa_port *ofdpa_port,
 	if (!n) {
 		n = neigh_create(&arp_tbl, &ip_addr, dev);
 		if (IS_ERR(n))
-			return IS_ERR(n);
+			return PTR_ERR(n);
 	}

 	/* If the neigh is already resolved, then go ahead and
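The fix matters because IS_ERR() only reports *whether* the pointer
encodes an error (yielding 0 or 1), while PTR_ERR() recovers the
negative errno packed into the pointer. A self-contained userspace
re-creation of the kernel's ERR_PTR convention, with simplified
macros, shows what the old code actually returned:

#include <stdio.h>
#include <errno.h>

/* Simplified userspace versions of the kernel's err.h helpers:
 * errno values are stored in the top 4095 addresses. */
#define MAX_ERRNO	4095
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

int main(void)
{
	/* what a failing neigh_create()-style call might hand back */
	void *n = ERR_PTR(-ENOBUFS);

	if (IS_ERR(n)) {
		printf("IS_ERR(n)  = %d  (what the old code returned)\n",
		       IS_ERR(n));
		printf("PTR_ERR(n) = %ld (the actual error code)\n",
		       PTR_ERR(n));
	}
	return 0;
}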
net/9p: convert to new CQ API
Hi all,

who is maintaining the "RDMA transport" [1] for 9p? The patch below
converts it to your new CQ API. It's fairly trivial, but untested, as
I can't figure out how to actually test this code.

[1] "RDMA" seems a bit of a misnomer, as it never does RDMA data
transfers, but that's a separate story :)
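For context, the new API moves CQ polling and rearming into the RDMA
core: instead of stashing a context pointer in wc->wr_id and
demultiplexing on an opcode, each work request embeds a struct ib_cqe
whose ->done callback is invoked per completion. A rough,
driver-agnostic sketch of the pattern (my_ctx and my_send_done are
made-up names; struct ib_cqe, wr_cqe, ib_alloc_cq and IB_POLL_SOFTIRQ
are the real 4.5-era interfaces):

#include <rdma/ib_verbs.h>
#include <linux/slab.h>

/* Per-work-request context: embed the ib_cqe so the completion
 * handler can recover it, instead of casting wc->wr_id. */
struct my_ctx {
	struct ib_cqe cqe;
	/* ... per-request state: DMA address, request pointer ... */
};

/* Called once per completion by the CQ core; no more hand-rolled
 * ib_poll_cq() loop or opcode switch. */
static void my_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct my_ctx *ctx = container_of(wc->wr_cqe, struct my_ctx, cqe);

	if (wc->status != IB_WC_SUCCESS)
		pr_err("send failed, status %d\n", wc->status);
	kfree(ctx);
}

/* Setup and posting, elsewhere in the driver:
 *
 *	cq = ib_alloc_cq(device, client, nr_cqe, 0, IB_POLL_SOFTIRQ);
 *	...
 *	ctx->cqe.done = my_send_done;
 *	wr.wr_cqe = &ctx->cqe;
 *	err = ib_post_send(qp, &wr, &bad_wr);
 */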
[PATCH] net/9p: convert to new CQ API
Trivial conversion to the new RDMA CQ API.

Signed-off-by: Christoph Hellwig
---
 net/9p/trans_rdma.c | 87 +++--
 1 file changed, 31 insertions(+), 56 deletions(-)

diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
index 52b4a2f..668c3be 100644
--- a/net/9p/trans_rdma.c
+++ b/net/9p/trans_rdma.c
@@ -109,14 +109,13 @@ struct p9_trans_rdma {
 /**
  * p9_rdma_context - Keeps track of in-process WR
  *
- * @wc_op: The original WR op for when the CQE completes in error.
 * @busa: Bus address to unmap when the WR completes
 * @req: Keeps track of requests (send)
 * @rc: Keepts track of replies (receive)
 */
 struct p9_rdma_req;
-struct p9_rdma_context {
-	enum ib_wc_opcode wc_op;
+struct p9_rdma_context {
+	struct ib_cqe cqe;
 	dma_addr_t busa;
 	union {
 		struct p9_req_t *req;
@@ -284,9 +283,12 @@ p9_cm_event_handler(struct rdma_cm_id *id, struct rdma_cm_event *event)
 }

 static void
-handle_recv(struct p9_client *client, struct p9_trans_rdma *rdma,
-	    struct p9_rdma_context *c, enum ib_wc_status status, u32 byte_len)
+recv_done(struct ib_cq *cq, struct ib_wc *wc)
 {
+	struct p9_client *client = cq->cq_context;
+	struct p9_trans_rdma *rdma = client->trans;
+	struct p9_rdma_context *c =
+		container_of(wc->wr_cqe, struct p9_rdma_context, cqe);
 	struct p9_req_t *req;
 	int err = 0;
 	int16_t tag;
@@ -295,7 +297,7 @@ handle_recv(struct p9_client *client, struct p9_trans_rdma *rdma,
 	ib_dma_unmap_single(rdma->cm_id->device, c->busa, client->msize,
			    DMA_FROM_DEVICE);

-	if (status != IB_WC_SUCCESS)
+	if (wc->status != IB_WC_SUCCESS)
 		goto err_out;

 	err = p9_parse_header(c->rc, NULL, NULL, &tag, 1);
@@ -316,21 +318,31 @@ handle_recv(struct p9_client *client, struct p9_trans_rdma *rdma,
 	req->rc = c->rc;
 	p9_client_cb(client, req, REQ_STATUS_RCVD);

+ out:
+	up(&rdma->rq_sem);
+	kfree(c);
 	return;

 err_out:
-	p9_debug(P9_DEBUG_ERROR, "req %p err %d status %d\n", req, err, status);
+	p9_debug(P9_DEBUG_ERROR, "req %p err %d status %d\n",
+		 req, err, wc->status);
 	rdma->state = P9_RDMA_FLUSHING;
 	client->status = Disconnected;
+	goto out;
 }

 static void
-handle_send(struct p9_client *client, struct p9_trans_rdma *rdma,
-	    struct p9_rdma_context *c, enum ib_wc_status status, u32 byte_len)
+send_done(struct ib_cq *cq, struct ib_wc *wc)
 {
+	struct p9_client *client = cq->cq_context;
+	struct p9_trans_rdma *rdma = client->trans;
+	struct p9_rdma_context *c =
+		container_of(wc->wr_cqe, struct p9_rdma_context, cqe);
+
 	ib_dma_unmap_single(rdma->cm_id->device,
			    c->busa, c->req->tc->size,
			    DMA_TO_DEVICE);
+	up(&rdma->sq_sem);
+	kfree(c);
 }

 static void qp_event_handler(struct ib_event *event, void *context)
@@ -339,42 +351,6 @@ static void qp_event_handler(struct ib_event *event, void *context)
		 event->event, context);
 }

-static void cq_comp_handler(struct ib_cq *cq, void *cq_context)
-{
-	struct p9_client *client = cq_context;
-	struct p9_trans_rdma *rdma = client->trans;
-	int ret;
-	struct ib_wc wc;
-
-	ib_req_notify_cq(rdma->cq, IB_CQ_NEXT_COMP);
-	while ((ret = ib_poll_cq(cq, 1, &wc)) > 0) {
-		struct p9_rdma_context *c = (void *) (unsigned long) wc.wr_id;
-
-		switch (c->wc_op) {
-		case IB_WC_RECV:
-			handle_recv(client, rdma, c, wc.status, wc.byte_len);
-			up(&rdma->rq_sem);
-			break;
-
-		case IB_WC_SEND:
-			handle_send(client, rdma, c, wc.status, wc.byte_len);
-			up(&rdma->sq_sem);
-			break;
-
-		default:
-			pr_err("unexpected completion type, c->wc_op=%d, wc.opcode=%d, status=%d\n",
-			       c->wc_op, wc.opcode, wc.status);
-			break;
-		}
-		kfree(c);
-	}
-}
-
-static void cq_event_handler(struct ib_event *e, void *v)
-{
-	p9_debug(P9_DEBUG_ERROR, "CQ event %d context %p\n", e->event, v);
-}
-
 static void rdma_destroy_trans(struct p9_trans_rdma *rdma)
 {
 	if (!rdma)
@@ -387,7 +363,7 @@ static void rdma_destroy_trans(struct p9_trans_rdma *rdma)
 		ib_dealloc_pd(rdma->pd);

 	if (rdma->cq && !IS_ERR(rdma->cq))
-		ib_destroy_cq(rdma->cq);
+		ib_free_cq(rdma->cq);

 	if (rdma->cm_id && !IS_ERR(rdma->cm_id))
 		rdma_destroy_id(rdma->cm_id);
@@ -408,13 +384,14 @@ post_recv(struct p9_client *client, struct p9_rdma_context *c)
 	if (ib_dma_mapping_error(rd
[PATCH v2 2/3] net: ipv4: tcp_probe: Replace timespec with timespec64
TCP probe log timestamps use struct timespec, which is not y2038 safe.
Even though timespec might be good enough here, as it is used to
represent delta time, the plan is to get rid of all uses of timespec
in the kernel. Replace it with struct timespec64, which is y2038 safe.

Prints still use the unsigned long format and type.

Signed-off-by: Deepa Dinamani
Reviewed-by: Arnd Bergmann
Cc: "David S. Miller"
Cc: Alexey Kuznetsov
Cc: James Morris
Cc: Hideaki YOSHIFUJI
Cc: Patrick McHardy
---
 net/ipv4/tcp_probe.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
index ebf5ff5..f6c50af 100644
--- a/net/ipv4/tcp_probe.c
+++ b/net/ipv4/tcp_probe.c
@@ -187,13 +187,13 @@ static int tcpprobe_sprint(char *tbuf, int n)
 {
 	const struct tcp_log *p = tcp_probe.log + tcp_probe.tail;
-	struct timespec tv
-		= ktime_to_timespec(ktime_sub(p->tstamp, tcp_probe.start));
+	struct timespec64 ts
+		= ktime_to_timespec64(ktime_sub(p->tstamp, tcp_probe.start));

 	return scnprintf(tbuf, n,
			"%lu.%09lu %pISpc %pISpc %d %#x %#x %u %u %u %u %u\n",
-			(unsigned long)tv.tv_sec,
-			(unsigned long)tv.tv_nsec,
+			(unsigned long)ts.tv_sec,
+			(unsigned long)ts.tv_nsec,
			&p->src, &p->dst, p->length, p->snd_nxt, p->snd_una,
			p->snd_cwnd, p->ssthresh, p->snd_wnd, p->srtt,
			p->rcv_wnd);
 }
--
1.9.1
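The y2038 problem in one screenful: on 32-bit architectures struct
timespec stores tv_sec in a 32-bit signed long, which runs out on
2038-01-19; timespec64's 64-bit time64_t does not. A standalone
illustration (nothing here is kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int32_t sec32 = INT32_MAX;	/* 2038-01-19 03:14:07 UTC */
	int64_t sec64 = sec32;

	/* one more second: the 32-bit counter wraps to 1901;
	 * wrap computed via unsigned arithmetic to stay defined */
	int32_t wrapped = (int32_t)((uint32_t)sec32 + 1);

	printf("32-bit tv_sec + 1 = %d (back to 1901)\n", (int)wrapped);
	printf("64-bit tv_sec + 1 = %lld\n", (long long)(sec64 + 1));
	return 0;
}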
[PATCH v2 3/3] net: sctp: Convert log timestamps to be y2038 safe
SCTP probe log timestamps use struct timespec, which is not y2038
safe. Use struct timespec64, which is y2038 safe, instead. Use
monotonic time instead of real time, as only time differences are
logged.

Signed-off-by: Deepa Dinamani
Reviewed-by: Arnd Bergmann
Acked-by: Neil Horman
Cc: Vlad Yasevich
Cc: Neil Horman
Cc: "David S. Miller"
Cc: linux-s...@vger.kernel.org
---
 net/sctp/probe.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/sctp/probe.c b/net/sctp/probe.c
index 5e68b94..6cc2152 100644
--- a/net/sctp/probe.c
+++ b/net/sctp/probe.c
@@ -65,7 +65,7 @@ static struct {
 	struct kfifo	  fifo;
 	spinlock_t	  lock;
 	wait_queue_head_t wait;
-	struct timespec	  tstart;
+	struct timespec64 tstart;
 } sctpw;

 static __printf(1, 2) void printl(const char *fmt, ...)
@@ -85,7 +85,7 @@ static __printf(1, 2) void printl(const char *fmt, ...)
 static int sctpprobe_open(struct inode *inode, struct file *file)
 {
 	kfifo_reset(&sctpw.fifo);
-	getnstimeofday(&sctpw.tstart);
+	ktime_get_ts64(&sctpw.tstart);
 	return 0;
 }
@@ -138,7 +138,7 @@ static sctp_disposition_t jsctp_sf_eat_sack(struct net *net,
 	struct sk_buff *skb = chunk->skb;
 	struct sctp_transport *sp;
 	static __u32 lcwnd = 0;
-	struct timespec now;
+	struct timespec64 now;

 	sp = asoc->peer.primary_path;
@@ -149,8 +149,8 @@ static sctp_disposition_t jsctp_sf_eat_sack(struct net *net,
 	    (full || sp->cwnd != lcwnd)) {
 		lcwnd = sp->cwnd;

-		getnstimeofday(&now);
-		now = timespec_sub(now, sctpw.tstart);
+		ktime_get_ts64(&now);
+		now = timespec64_sub(now, sctpw.tstart);

 		printl("%lu.%06lu ", (unsigned long) now.tv_sec,
		       (unsigned long) now.tv_nsec / NSEC_PER_USEC);
--
1.9.1
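Why monotonic for deltas: the realtime clock can be stepped backwards
(settimeofday(), NTP corrections), which would make logged differences
negative or nonsensical, while the monotonic clock only moves forward.
A userspace analogue of the ktime_get_ts64() + timespec64_sub() pair,
with nothing sctp-specific in it:

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	/* ... workload being measured ... */
	clock_gettime(CLOCK_MONOTONIC, &now);

	long sec = now.tv_sec - start.tv_sec;
	long nsec = now.tv_nsec - start.tv_nsec;

	if (nsec < 0) {		/* borrow, like timespec64_sub() */
		sec -= 1;
		nsec += 1000000000L;
	}
	printf("elapsed: %ld.%09ld s\n", sec, nsec);
	return 0;
}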
[PATCH v2 1/3] net: ipv4: Convert IP network timestamps to be y2038 safe
ICMP timestamp messages and IP source route options require timestamps
to be in milliseconds modulo 24 hours from midnight UT format.

Add an inet_current_timestamp() function to support this. The function
returns the required timestamp in network byte order.

Timestamp calculation is also changed to call ktime_get_real_ts64(),
which uses struct timespec64. struct timespec64 is y2038 safe.
Previously it called getnstimeofday(), which uses struct timespec.
struct timespec is not y2038 safe.

Signed-off-by: Deepa Dinamani
Cc: "David S. Miller"
Cc: Alexey Kuznetsov
Cc: Hideaki YOSHIFUJI
Cc: James Morris
Cc: Patrick McHardy
---
 include/net/ip.h      |  2 ++
 net/ipv4/af_inet.c    | 26 ++++++++++++++++++++++++++
 net/ipv4/icmp.c       |  5 +----
 net/ipv4/ip_options.c | 14 ++++++--------
 4 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 1a98f1c..5d3a9eb 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -240,6 +240,8 @@ static inline int inet_is_local_reserved_port(struct net *net, int port)
 }
 #endif

+__be32 inet_current_timestamp(void);
+
 /* From inetpeer.c */
 extern int inet_peer_threshold;
 extern int inet_peer_minttl;

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index eade66d..408e2b3 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1386,6 +1386,32 @@ out:
 	return pp;
 }

+#define SECONDS_PER_DAY	86400
+
+/* inet_current_timestamp - Return IP network timestamp
+ *
+ * Return milliseconds since midnight in network byte order.
+ */
+__be32 inet_current_timestamp(void)
+{
+	u32 secs;
+	u32 msecs;
+	struct timespec64 ts;
+
+	ktime_get_real_ts64(&ts);
+
+	/* Get secs since midnight. */
+	(void)div_u64_rem(ts.tv_sec, SECONDS_PER_DAY, &secs);
+	/* Convert to msecs. */
+	msecs = secs * MSEC_PER_SEC;
+	/* Convert nsec to msec. */
+	msecs += (u32)ts.tv_nsec / NSEC_PER_MSEC;
+
+	/* Convert to network byte order. */
+	return htonl(msecs);
+}
+EXPORT_SYMBOL(inet_current_timestamp);
+
 int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
 {
 	if (sk->sk_family == AF_INET)

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 36e2697..6333489 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -931,7 +931,6 @@ static bool icmp_echo(struct sk_buff *skb)
  */
 static bool icmp_timestamp(struct sk_buff *skb)
 {
-	struct timespec tv;
 	struct icmp_bxm icmp_param;
 	/*
	 *	Too short.
@@ -942,9 +941,7 @@ static bool icmp_timestamp(struct sk_buff *skb)
 	/*
	 *	Fill in the current time as ms since midnight UT:
	 */
-	getnstimeofday(&tv);
-	icmp_param.data.times[1] = htonl((tv.tv_sec % 86400) * MSEC_PER_SEC +
-					 tv.tv_nsec / NSEC_PER_MSEC);
+	icmp_param.data.times[1] = inet_current_timestamp();
 	icmp_param.data.times[2] = icmp_param.data.times[1];
 	if (skb_copy_bits(skb, 0, &icmp_param.data.times[0], 4))
 		BUG();

diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index bd24679..4d158ff 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -58,10 +58,9 @@ void ip_options_build(struct sk_buff *skb, struct ip_options *opt,
 	if (opt->ts_needaddr)
 		ip_rt_get_source(iph+opt->ts+iph[opt->ts+2]-9, skb, rt);
 	if (opt->ts_needtime) {
-		struct timespec tv;
 		__be32 midtime;
-		getnstimeofday(&tv);
-		midtime = htonl((tv.tv_sec % 86400) * MSEC_PER_SEC + tv.tv_nsec / NSEC_PER_MSEC);
+
+		midtime = inet_current_timestamp();
 		memcpy(iph+opt->ts+iph[opt->ts+2]-5, &midtime, 4);
 	}
 	return;
@@ -415,11 +414,10 @@ int ip_options_compile(struct net *net,
				break;
			}
			if (timeptr) {
-				struct timespec tv;
-				u32 midtime;
-				getnstimeofday(&tv);
-				midtime = (tv.tv_sec % 86400) * MSEC_PER_SEC + tv.tv_nsec / NSEC_PER_MSEC;
-				put_unaligned_be32(midtime, timeptr);
+				__be32 midtime;
+
+				midtime = inet_current_timestamp();
+				memcpy(timeptr, &midtime, 4);
				opt->is_changed = 1;
			}
		} else if ((optptr[3]&0xF) != IPOPT_TS_PRESPEC) {
--
1.9.1
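To sanity-check the arithmetic the helper centralizes: reduce the
seconds modulo 86400, scale to milliseconds, then add the millisecond
part of tv_nsec. A standalone userspace rework of the same computation
(ms_since_midnight() is a made-up name, not the kernel function):

#include <stdio.h>
#include <stdint.h>

#define SECONDS_PER_DAY	86400
#define MSEC_PER_SEC	1000
#define NSEC_PER_MSEC	1000000

/* Same steps as inet_current_timestamp(), minus the byte swap. */
static uint32_t ms_since_midnight(int64_t tv_sec, long tv_nsec)
{
	uint32_t secs = (uint32_t)(tv_sec % SECONDS_PER_DAY);

	return secs * MSEC_PER_SEC + (uint32_t)(tv_nsec / NSEC_PER_MSEC);
}

int main(void)
{
	/* 2016-02-28 12:34:56.789 UTC -> 45296789 ms after midnight */
	printf("%u\n", ms_since_midnight(1456662896, 789000000));
	return 0;
}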
[PATCH v2 0/3] Convert network timestamps to be y2038 safe
Introduction:

The series is aimed at transitioning network timestamps to being
y2038 safe. All patches can be reviewed and merged independently.
Socket timestamps and ioctl calls will be handled separately.

Thanks to Arnd Bergmann for discussing solution options with me.

Solution:

Data type struct timespec is not y2038 safe. Replace timespec with
struct timespec64, which is y2038 safe.

Changes v1 -> v2:
 * Move and rename inet_current_time() as discussed
 * Squash patches 1 and 2
 * Reword commit text for patch 2/3
 * Carry over review tags

Deepa Dinamani (3):
  net: ipv4: Convert IP network timestamps to be y2038 safe
  net: ipv4: tcp_probe: Replace timespec with timespec64
  net: sctp: Convert log timestamps to be y2038 safe

 include/net/ip.h      |  2 ++
 net/ipv4/af_inet.c    | 26 ++++++++++++++++++++++++++
 net/ipv4/icmp.c       |  5 +----
 net/ipv4/ip_options.c | 14 ++++++--------
 net/ipv4/tcp_probe.c  |  8 ++++----
 net/sctp/probe.c      | 10 +++++-----
 6 files changed, 44 insertions(+), 21 deletions(-)

--
1.9.1

Cc: Vlad Yasevich
Cc: Neil Horman
Cc: "David S. Miller"
Cc: Alexey Kuznetsov
Cc: James Morris
Cc: Hideaki YOSHIFUJI
Cc: Patrick McHardy
Cc: linux-s...@vger.kernel.org
Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64
> +{
> +	asm("lea 40f(, %[slen], 4), %%r11\n\t"
> +	    "clc\n\t"
> +	    "jmpq *%%r11\n\t"
> +	    "adcq 7*8(%[src]),%[res]\n\t"
> +	    "adcq 6*8(%[src]),%[res]\n\t"
> +	    "adcq 5*8(%[src]),%[res]\n\t"
> +	    "adcq 4*8(%[src]),%[res]\n\t"
> +	    "adcq 3*8(%[src]),%[res]\n\t"
> +	    "adcq 2*8(%[src]),%[res]\n\t"
> +	    "adcq 1*8(%[src]),%[res]\n\t"
> +	    "adcq 0*8(%[src]),%[res]\n\t"
> +	    "nop\n\t"
> +	    "40: adcq $0,%[res]"
> +	    : [res] "=r" (sum)
> +	    : [src] "r" (buff),
> +	      [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
> +	    : "r11");
> +

With this patch I cannot mix and match different-length checksums
without things failing. In perf, the jmpq in the loop above seems to
be set to a fixed value, so perhaps it is something in how the
compiler is interpreting the inline assembler.

- Alex
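For readers decoding the trick: [slen] is -(len >> 3), and each adcq
with a one-byte displacement appears to encode to 4 bytes (the nop
seemingly pads the shorter zero-displacement form), so the lea
computes an entry point len/8 instructions before the local label 40:,
executing exactly that many add-with-carry steps before the final
adcq $0 folds the last carry back in. The same accumulation in
portable C, as a rough sketch rather than the patch's code
(csum_add64 is a made-up helper, and the sub-8-byte tail is ignored):

#include <stdint.h>
#include <stddef.h>

/* Emulate the adcq ladder: each 64-bit add feeds its carry-out into
 * the next addition, matching "sum += v; sum += carry". */
static uint64_t csum_add64(const uint64_t *src, size_t nwords, uint64_t sum)
{
	for (size_t i = 0; i < nwords; i++) {
		uint64_t v = src[i];

		sum += v;
		sum += (sum < v);	/* carry out of the 64-bit add */
	}
	return sum;
}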