date:20160227

From: Saeed Mahameed 
Date: Thu, 25 Feb 2016 18:33:19 +0200

> @@ -246,11 +246,11 @@ int mlx5_alloc_map_uar(struct mlx5_core_dev *mdev, 
> struct mlx5_uar *uar)
>   err = -ENOMEM;
>   goto err_free_uar;
>   }
> -
> - if (mdev->priv.bf_mapping)
> - uar->bf_map = io_mapping_map_wc(mdev->priv.bf_mapping,
> - uar->index << PAGE_SHIFT);
> -
> +#ifdef ARCH_HAS_IOREMAP_WC
> + uar->bf_map = ioremap_wc(pfn << PAGE_SHIFT, PAGE_SIZE);
> + if (!uar->bf_map)
> + mlx5_core_warn(mdev, "ioremap_wc() failed\n");
> +#endif

Sorry, this looks very wrong to me.

It makes no sense to only map this resource if ARCH_HAS_IOREMAP_WC
defined.

The interface _always_ exists, and ARCH_HAS_IOREMAP_WC is an internal
symbol that include/asm-generic/iomap.h uses to determine whether to
provide a generic implementation of the interface or not.

I'm not applying this series until you either fix or explain what
you are doing here in the commit message.

Thanks.

Re: Softirq priority inversion from "softirq: reduce latencies"

From: Peter Hurley 
Date: Sat, 27 Feb 2016 18:10:27 -0800

> That tasklet should run before any process.

You never have this guarantee, even before Eric's patch.
Under load tasklets run from ksoftirqd just like any other
softirq.

Please fix your driver and stop blaming Eric's change.

Thank you.

Re: [net-next PATCH v3 1/3] net: sched: consolidate offload decision in cls_u32

On Fri, Feb 26, 2016 at 8:24 PM, John Fastabend
 wrote:
> On 16-02-26 09:39 AM, Cong Wang wrote:
>> On Fri, Feb 26, 2016 at 7:53 AM, John Fastabend
>>  wrote:
>>> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
>>> index 2121df5..e64d20b 100644
>>> --- a/include/net/pkt_cls.h
>>> +++ b/include/net/pkt_cls.h
>>> @@ -392,4 +392,9 @@ struct tc_cls_u32_offload {
>>> };
>>>  };
>>>
>>> +static inline bool tc_should_offload(struct net_device *dev)
>>> +{
>>> +   return dev->netdev_ops->ndo_setup_tc;
>>> +}
>>> +
>>
>> These should be protected by CONFIG_NET_CLS_U32, no?
>>
>
> Its not necessary it is a completely general function and I only
> lifted it out of cls_u32 so that the cls_flower classifier could
> also use it.
>
> I don't see the need off-hand to have it wrapped in an ORd ifdef
> statement where its (CONFIG_NET_CLS_U32 | CONFIG_NET_CLS_X ...).
> Any particular reason you were thnking it should be wrapped in ifdefs?
>

Not a big deal.

I just feel these don't need to compile when I have CONFIG_NET_CLS_U32=n.

Thanks.

Re: [net-next-2.6 PATCH v4 3/3] Support to encoding decoding skb prio on IFE action

On Sat, Feb 27, 2016 at 5:08 AM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> Example usage:
> Set the skb priority using skbedit then allow it to be encoded
>
> sudo tc qdisc add dev $ETH root handle 1: prio
> sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
> u32 match ip protocol 1 0xff flowid 1:2 \
> action skbedit prio 17 \
> action ife encode \
> allow prio \
> dst 02:15:15:15:15:15
>
> Note: You dont need the skbedit action if you are already encoding the
> skb priority earlier. A zero skb priority will not be sent
>
> Alternative hard code static priority of decimal 33 (unlike skbedit)
> then mark of 0x12 every time the filter matches
>
> sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
> u32 match ip protocol 1 0xff flowid 1:2 \
> action ife encode \
> type 0xDEAD \
> use prio 33 \
> use mark 0x12 \
> dst 02:15:15:15:15:15
>
> Signed-off-by: Jamal Hadi Salim 

Acked-by: Cong Wang

Re: [net-next-2.6 PATCH v4 2/3] Support to encoding decoding skb mark on IFE action

On Sat, Feb 27, 2016 at 5:08 AM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> Example usage:
> Set the skb using skbedit then allow it to be encoded
>
> sudo tc qdisc add dev $ETH root handle 1: prio
> sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
> u32 match ip protocol 1 0xff flowid 1:2 \
> action skbedit mark 17 \
> action ife encode \
> allow mark \
> dst 02:15:15:15:15:15
>
> Note: You dont need the skbedit action if you are already encoding the
> skb mark earlier. A zero skb mark, when seen, will not be encoded.
>
> Alternative hard code static mark of 0x12 every time the filter matches
>
> sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
> u32 match ip protocol 1 0xff flowid 1:2 \
> action ife encode \
> type 0xDEAD \
> use mark 0x12 \
> dst 02:15:15:15:15:15
>
> Signed-off-by: Jamal Hadi Salim 

Acked-by: Cong Wang

Re: [net-next-2.6 PATCH v4 1/3] introduce IFE action

On Sat, Feb 27, 2016 at 5:08 AM, Jamal Hadi Salim  wrote:
> From: Jamal Hadi Salim 
>
> This action allows for a sending side to encapsulate arbitrary metadata
> which is decapsulated by the receiving end.
> The sender runs in encoding mode and the receiver in decode mode.
> Both sender and receiver must specify the same ethertype.
> At some point we hope to have a registered ethertype and we'll
> then provide a default so the user doesnt have to specify it.
> For now we enforce the user specify it.
>

[...]

>
> Signed-off-by: Jamal Hadi Salim 

Acked-by: Cong Wang 

Thanks for updating it!

[Patch net-next] net: remove skb_sender_cpu_clear()

After commit 52bd2d62ce67 ("net: better skb->sender_cpu and skb->napi_id 
cohabitation")
skb_sender_cpu_clear() becomes empty and can be removed.

Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 include/linux/skbuff.h  | 4 
 net/bridge/br_forward.c | 1 -
 net/core/filter.c   | 2 --
 net/core/skbuff.c   | 1 -
 net/ipv4/ip_forward.c   | 1 -
 net/ipv6/ip6_output.c   | 1 -
 net/netfilter/ipvs/ip_vs_xmit.c | 6 --
 net/netfilter/nf_dup_netdev.c   | 1 -
 net/sched/act_mirred.c  | 1 -
 9 files changed, 18 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index eab4f8f..797cefb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1161,10 +1161,6 @@ static inline void skb_copy_hash(struct sk_buff *to, 
const struct sk_buff *from)
to->l4_hash = from->l4_hash;
 };
 
-static inline void skb_sender_cpu_clear(struct sk_buff *skb)
-{
-}
-
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
 static inline unsigned char *skb_end_pointer(const struct sk_buff *skb)
 {
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index fcdb86d..f47759f 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -44,7 +44,6 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, 
struct sk_buff *skb
 
skb_push(skb, ETH_HLEN);
br_drop_fake_rtable(skb);
-   skb_sender_cpu_clear(skb);
 
if (skb->ip_summed == CHECKSUM_PARTIAL &&
(skb->protocol == htons(ETH_P_8021Q) ||
diff --git a/net/core/filter.c b/net/core/filter.c
index a3aba15..5e2a3b5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1597,7 +1597,6 @@ static u64 bpf_clone_redirect(u64 r1, u64 ifindex, u64 
flags, u64 r4, u64 r5)
}
 
skb2->dev = dev;
-   skb_sender_cpu_clear(skb2);
return dev_queue_xmit(skb2);
 }
 
@@ -1650,7 +1649,6 @@ int skb_do_redirect(struct sk_buff *skb)
}
 
skb->dev = dev;
-   skb_sender_cpu_clear(skb);
return dev_queue_xmit(skb);
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 488566b..7af7ec6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4302,7 +4302,6 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
skb->skb_iif = 0;
skb->ignore_df = 0;
skb_dst_drop(skb);
-   skb_sender_cpu_clear(skb);
secpath_reset(skb);
nf_reset(skb);
nf_reset_trace(skb);
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index da0d7ce..af18f1e 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -71,7 +71,6 @@ static int ip_forward_finish(struct net *net, struct sock 
*sk, struct sk_buff *s
if (unlikely(opt->optlen))
ip_forward_options(skb);
 
-   skb_sender_cpu_clear(skb);
return dst_output(net, sk, skb);
 }
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a163102..9428345 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -332,7 +332,6 @@ static int ip6_forward_proxy_check(struct sk_buff *skb)
 static inline int ip6_forward_finish(struct net *net, struct sock *sk,
 struct sk_buff *skb)
 {
-   skb_sender_cpu_clear(skb);
return dst_output(net, sk, skb);
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index a3f5cd9..dc196a0 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -531,8 +531,6 @@ static inline int ip_vs_tunnel_xmit_prepare(struct sk_buff 
*skb,
if (ret == NF_ACCEPT) {
nf_reset(skb);
skb_forward_csum(skb);
-   if (!skb->sk)
-   skb_sender_cpu_clear(skb);
}
return ret;
 }
@@ -573,8 +571,6 @@ static inline int ip_vs_nat_send_or_cont(int pf, struct 
sk_buff *skb,
 
if (!local) {
skb_forward_csum(skb);
-   if (!skb->sk)
-   skb_sender_cpu_clear(skb);
NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb,
NULL, skb_dst(skb)->dev, dst_output);
} else
@@ -595,8 +591,6 @@ static inline int ip_vs_send_or_cont(int pf, struct sk_buff 
*skb,
if (!local) {
ip_vs_drop_early_demux_sk(skb);
skb_forward_csum(skb);
-   if (!skb->sk)
-   skb_sender_cpu_clear(skb);
NF_HOOK(pf, NF_INET_LOCAL_OUT, cp->ipvs->net, NULL, skb,
NULL, skb_dst(skb)->dev, dst_output);
} else
diff --git a/net/netfilter/nf_dup_netdev.c b/net/netfilter/nf_dup_netdev.c
index 8414ee1..7ec6972 100644
--- a/net/netfilter/nf_dup_netdev.c
+++ b/net/netfilter/nf_dup_netdev.c
@@ -31,7 +31,6 @@ void nf_dup_netdev_egress(const struct nft_pktinfo *pkt, int 
oif)
skb_push(skb, skb->mac_len);
 
skb->dev = dev;
-   skb_sender_cpu_clear(skb);
dev_queue_xmit(skb);
 }
 EXPORT_SY

[PATCH net] sctp: sctp_remaddr_seq_show use the wrong variable to dump transport info

2016-02-27 Thread Xin Long

Now in sctp_remaddr_seq_show(), we use variable *tsp to get the param *v.
but *tsp is also used to traversal transport_addr_list, which will cover
the previous value, and make sctp_transport_put work on the wrong transport.

So fix it by adding a new variable to get the param *v.

Fixes: fba4c330c5b9 ("sctp: hold transport before we access t->asoc in sctp 
proc")
Signed-off-by: Xin Long 
---
 net/sctp/proc.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/sctp/proc.c b/net/sctp/proc.c
index ded7d93..963dffc 100644
--- a/net/sctp/proc.c
+++ b/net/sctp/proc.c
@@ -482,7 +482,7 @@ static void sctp_remaddr_seq_stop(struct seq_file *seq, 
void *v)
 static int sctp_remaddr_seq_show(struct seq_file *seq, void *v)
 {
struct sctp_association *assoc;
-   struct sctp_transport *tsp;
+   struct sctp_transport *transport, *tsp;
 
if (v == SEQ_START_TOKEN) {
seq_printf(seq, "ADDR ASSOC_ID HB_ACT RTO MAX_PATH_RTX "
@@ -490,10 +490,10 @@ static int sctp_remaddr_seq_show(struct seq_file *seq, 
void *v)
return 0;
}
 
-   tsp = (struct sctp_transport *)v;
-   if (!sctp_transport_hold(tsp))
+   transport = (struct sctp_transport *)v;
+   if (!sctp_transport_hold(transport))
return 0;
-   assoc = tsp->asoc;
+   assoc = transport->asoc;
 
list_for_each_entry_rcu(tsp, &assoc->peer.transport_addr_list,
transports) {
@@ -546,7 +546,7 @@ static int sctp_remaddr_seq_show(struct seq_file *seq, void 
*v)
seq_printf(seq, "\n");
}
 
-   sctp_transport_put(tsp);
+   sctp_transport_put(transport);
 
return 0;
 }
-- 
2.1.0

[net-next][PATCH v2 00/13] RDS: Major clean-up with couple of new features for 4.6

v2:
Dropped module parameter from [PATCH 11/13] as suggested by David Miller

Series is generated against net-next but also applies against Linus's tip
cleanly. Entire patchset is available at below git tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.6/net-next/rds_v2

The diff-stat looks bit scary since almost ~4K lines of code is
getting removed. Brief summary of the series:

- Drop the stale iWARP support:
RDS iWarp support code has become stale and non testable for
sometime.  As discussed and agreed earlier on list, am dropping
its support for good. If new iWarp user(s) shows up in future,
the plan is to adapt existing IB RDMA with special sink case.
- RDS gets SO_TIMESTAMP support
- Long due RDS maintainer entry gets updated
- Some RDS IB code refactoring towards new FastReg Memory registration (FRMR)
- Lastly the initial support for FRMR

RDS IB RDMA performance with FRMR is not yet as good as FMR and I do have
some patches in progress to address that. But they are not ready for 4.6
so I left them out of this series. 

Also am keeping eye on new CQ API adaptations like other ULPs doing and
will try to adapt RDS for the same most likely in 4.7+ timeframe. 

Santosh Shilimkar (12):
  RDS: Drop stale iWARP RDMA transport
  RDS: Add support for SO_TIMESTAMP for incoming messages
  MAINTAINERS: update RDS entry
  RDS: IB: Remove the RDS_IB_SEND_OP dependency
  RDS: IB: Re-organise ibmr code
  RDS: IB: create struct rds_ib_fmr
  RDS: IB: move FMR code to its own file
  RDS: IB: add connection info to ibmr
  RDS: IB: handle the RDMA CM time wait event
  RDS: IB: add mr reused stats
  RDS: IB: add Fastreg MR (FRMR) detection support
  RDS: IB: allocate extra space on queues for FRMR support

Avinash Repaka (1):
  RDS: IB: Support Fastreg MR (FRMR) memory registration mode

 Documentation/networking/rds.txt |   4 +-
 MAINTAINERS  |   6 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/af_rds.c |  26 ++
 net/rds/ib.c |  47 +-
 net/rds/ib.h |  37 +-
 net/rds/ib_cm.c  |  59 ++-
 net/rds/ib_fmr.c | 248 ++
 net/rds/ib_frmr.c| 376 +++
 net/rds/ib_mr.h  | 148 ++
 net/rds/ib_rdma.c| 495 ++--
 net/rds/ib_send.c|   6 +-
 net/rds/ib_stats.c   |   2 +
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  21 +-
 net/rds/rdma_transport.h |   5 -
 net/rds/rds.h|   1 +
 net/rds/recv.c   |  20 +-
 27 files changed, 1065 insertions(+), 5035 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_frmr.c
 create mode 100644 net/rds/ib_mr.h
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c

-- 
1.9.1

[net-next][PATCH v2 01/13] RDS: Drop stale iWARP RDMA transport

RDS iWarp support code has become stale and non testable. As
indicated earlier, am dropping the support for it.

If new iWarp user(s) shows up in future, we can adapat the RDS IB
transprt for the special RDMA READ sink case. iWarp needs an MR
for the RDMA READ sink.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 Documentation/networking/rds.txt |   4 +-
 net/rds/Kconfig  |   7 +-
 net/rds/Makefile |   4 +-
 net/rds/iw.c | 312 -
 net/rds/iw.h | 398 
 net/rds/iw_cm.c  | 769 --
 net/rds/iw_rdma.c| 837 -
 net/rds/iw_recv.c| 904 
 net/rds/iw_ring.c| 169 ---
 net/rds/iw_send.c| 981 ---
 net/rds/iw_stats.c   |  95 
 net/rds/iw_sysctl.c  | 123 -
 net/rds/rdma_transport.c |  13 +-
 net/rds/rdma_transport.h |   5 -
 14 files changed, 7 insertions(+), 4614 deletions(-)
 delete mode 100644 net/rds/iw.c
 delete mode 100644 net/rds/iw.h
 delete mode 100644 net/rds/iw_cm.c
 delete mode 100644 net/rds/iw_rdma.c
 delete mode 100644 net/rds/iw_recv.c
 delete mode 100644 net/rds/iw_ring.c
 delete mode 100644 net/rds/iw_send.c
 delete mode 100644 net/rds/iw_stats.c
 delete mode 100644 net/rds/iw_sysctl.c

diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
index e1a3d59..9d219d8 100644
--- a/Documentation/networking/rds.txt
+++ b/Documentation/networking/rds.txt
@@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like 
TCP.
 
 RDS is not Infiniband-specific; it was designed to support different
 transports.  The current implementation used to support RDS over TCP as well
-as IB. Work is in progress to support RDS over iWARP, and using DCE to
-guarantee no dropped packets on Ethernet, it may be possible to use RDS over
-UDP in the future.
+as IB.
 
 The high-level semantics of RDS from the application's point of view are
 
diff --git a/net/rds/Kconfig b/net/rds/Kconfig
index f2c670b..bffde4b 100644
--- a/net/rds/Kconfig
+++ b/net/rds/Kconfig
@@ -4,14 +4,13 @@ config RDS
depends on INET
---help---
  The RDS (Reliable Datagram Sockets) protocol provides reliable,
- sequenced delivery of datagrams over Infiniband, iWARP,
- or TCP.
+ sequenced delivery of datagrams over Infiniband or TCP.
 
 config RDS_RDMA
-   tristate "RDS over Infiniband and iWARP"
+   tristate "RDS over Infiniband"
depends on RDS && INFINIBAND && INFINIBAND_ADDR_TRANS
---help---
- Allow RDS to use Infiniband and iWARP as a transport.
+ Allow RDS to use Infiniband as a transport.
  This transport supports RDMA operations.
 
 config RDS_TCP
diff --git a/net/rds/Makefile b/net/rds/Makefile
index 56d3f60..19e5485 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,9 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o \
-   iw.o iw_cm.o iw_recv.o iw_ring.o iw_send.o iw_stats.o \
-   iw_sysctl.o iw_rdma.o
+   ib_sysctl.o ib_rdma.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/iw.c b/net/rds/iw.c
deleted file mode 100644
index f4a9fff..000
diff --git a/net/rds/iw.h b/net/rds/iw.h
deleted file mode 100644
index 5af01d1..000
diff --git a/net/rds/iw_cm.c b/net/rds/iw_cm.c
deleted file mode 100644
index aea4c91..000
diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c
deleted file mode 100644
index b09a40c..000
diff --git a/net/rds/iw_recv.c b/net/rds/iw_recv.c
deleted file mode 100644
index a66d179..000
diff --git a/net/rds/iw_ring.c b/net/rds/iw_ring.c
deleted file mode 100644
index da8e3b6..000
diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c
deleted file mode 100644
index e20bd50..000
diff --git a/net/rds/iw_stats.c b/net/rds/iw_stats.c
deleted file mode 100644
index 5fe67f6..000
diff --git a/net/rds/iw_sysctl.c b/net/rds/iw_sysctl.c
deleted file mode 100644
index 139239d..000
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 9c1fed8..4f4b3d8 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -49,9 +49,7 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rdsdebug("conn %p id %p handling event %u (%s)\n", conn, cm_id,
 event->event, rdma_event_msg(event->event));
 
-   if (cm_id->device->node_type == RDMA_NODE_RNIC)
-   trans = &rds_iw_transport;
-   else
+   if (cm_id->device->node_type == RDMA_NODE_IB

[net-next][PATCH v2 02/13] RDS: Add support for SO_TIMESTAMP for incoming messages

The SO_TIMESTAMP generates time stamp for each incoming RDS messages
User app can enable it by using SO_TIMESTAMP setsocketopt() at
SOL_SOCKET level. CMSG data of cmsg type SO_TIMESTAMP contains the
time stamp in struct timeval format.

Reviewed-by: Sowmini Varadhan 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/af_rds.c | 26 ++
 net/rds/rds.h|  1 +
 net/rds/recv.c   | 20 ++--
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index b5476aeb..6beaeb1 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -277,6 +277,27 @@ static int rds_set_transport(struct rds_sock *rs, char 
__user *optval,
return rs->rs_transport ? 0 : -ENOPROTOOPT;
 }
 
+static int rds_enable_recvtstamp(struct sock *sk, char __user *optval,
+int optlen)
+{
+   int val, valbool;
+
+   if (optlen != sizeof(int))
+   return -EFAULT;
+
+   if (get_user(val, (int __user *)optval))
+   return -EFAULT;
+
+   valbool = val ? 1 : 0;
+
+   if (valbool)
+   sock_set_flag(sk, SOCK_RCVTSTAMP);
+   else
+   sock_reset_flag(sk, SOCK_RCVTSTAMP);
+
+   return 0;
+}
+
 static int rds_setsockopt(struct socket *sock, int level, int optname,
  char __user *optval, unsigned int optlen)
 {
@@ -312,6 +333,11 @@ static int rds_setsockopt(struct socket *sock, int level, 
int optname,
ret = rds_set_transport(rs, optval, optlen);
release_sock(sock->sk);
break;
+   case SO_TIMESTAMP:
+   lock_sock(sock->sk);
+   ret = rds_enable_recvtstamp(sock->sk, optval, optlen);
+   release_sock(sock->sk);
+   break;
default:
ret = -ENOPROTOOPT;
}
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 0e2797b..80256b0 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -222,6 +222,7 @@ struct rds_incoming {
__be32  i_saddr;
 
rds_rdma_cookie_t   i_rdma_cookie;
+   struct timeval  i_rx_tstamp;
 };
 
 struct rds_mr {
diff --git a/net/rds/recv.c b/net/rds/recv.c
index a00462b..c0be1ec 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -35,6 +35,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "rds.h"
 
@@ -46,6 +48,8 @@ void rds_inc_init(struct rds_incoming *inc, struct 
rds_connection *conn,
inc->i_conn = conn;
inc->i_saddr = saddr;
inc->i_rdma_cookie = 0;
+   inc->i_rx_tstamp.tv_sec = 0;
+   inc->i_rx_tstamp.tv_usec = 0;
 }
 EXPORT_SYMBOL_GPL(rds_inc_init);
 
@@ -228,6 +232,8 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 
saddr, __be32 daddr,
rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong,
  be32_to_cpu(inc->i_hdr.h_len),
  inc->i_hdr.h_dport);
+   if (sock_flag(sk, SOCK_RCVTSTAMP))
+   do_gettimeofday(&inc->i_rx_tstamp);
rds_inc_addref(inc);
list_add_tail(&inc->i_item, &rs->rs_recv_queue);
__rds_wake_sk_sleep(sk);
@@ -381,7 +387,8 @@ static int rds_notify_cong(struct rds_sock *rs, struct 
msghdr *msghdr)
 /*
  * Receive any control messages.
  */
-static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg)
+static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg,
+struct rds_sock *rs)
 {
int ret = 0;
 
@@ -392,6 +399,15 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct 
msghdr *msg)
return ret;
}
 
+   if ((inc->i_rx_tstamp.tv_sec != 0) &&
+   sock_flag(rds_rs_to_sk(rs), SOCK_RCVTSTAMP)) {
+   ret = put_cmsg(msg, SOL_SOCKET, SCM_TIMESTAMP,
+  sizeof(struct timeval),
+  &inc->i_rx_tstamp);
+   if (ret)
+   return ret;
+   }
+
return 0;
 }
 
@@ -474,7 +490,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, 
size_t size,
msg->msg_flags |= MSG_TRUNC;
}
 
-   if (rds_cmsg_recv(inc, msg)) {
+   if (rds_cmsg_recv(inc, msg, rs)) {
ret = -EFAULT;
goto out;
}
-- 
1.9.1

[net-next][PATCH v2 06/13] RDS: IB: create struct rds_ib_fmr

Keep fmr related filed in its own struct. Fastreg MR structure
will be added to the union.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 17 ++---
 net/rds/ib_mr.h   | 11 +--
 net/rds/ib_rdma.c | 14 ++
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index d4f200d..74f2c21 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -36,6 +36,7 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
 {
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
int err = 0, iter = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
@@ -99,15 +100,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
goto out_no_cigar;
}
 
-   ibmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
+   fmr = &ibmr->u.fmr;
+   fmr->fmr = ib_alloc_fmr(rds_ibdev->pd,
(IB_ACCESS_LOCAL_WRITE |
 IB_ACCESS_REMOTE_READ |
 IB_ACCESS_REMOTE_WRITE |
 IB_ACCESS_REMOTE_ATOMIC),
&pool->fmr_attr);
-   if (IS_ERR(ibmr->fmr)) {
-   err = PTR_ERR(ibmr->fmr);
-   ibmr->fmr = NULL;
+   if (IS_ERR(fmr->fmr)) {
+   err = PTR_ERR(fmr->fmr);
+   fmr->fmr = NULL;
pr_warn("RDS/IB: %s failed (err=%d)\n", __func__, err);
goto out_no_cigar;
}
@@ -122,8 +124,8 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
 
 out_no_cigar:
if (ibmr) {
-   if (ibmr->fmr)
-   ib_dealloc_fmr(ibmr->fmr);
+   if (fmr->fmr)
+   ib_dealloc_fmr(fmr->fmr);
kfree(ibmr);
}
atomic_dec(&pool->item_count);
@@ -134,6 +136,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct 
rds_ib_mr *ibmr,
   struct scatterlist *sg, unsigned int nents)
 {
struct ib_device *dev = rds_ibdev->dev;
+   struct rds_ib_fmr *fmr = &ibmr->u.fmr;
struct scatterlist *scat = sg;
u64 io_addr = 0;
u64 *dma_pages;
@@ -190,7 +193,7 @@ int rds_ib_map_fmr(struct rds_ib_device *rds_ibdev, struct 
rds_ib_mr *ibmr,
(dma_addr & PAGE_MASK) + j;
}
 
-   ret = ib_map_phys_fmr(ibmr->fmr, dma_pages, page_cnt, io_addr);
+   ret = ib_map_phys_fmr(fmr->fmr, dma_pages, page_cnt, io_addr);
if (ret)
goto out;
 
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index d88724f..309ad59 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -43,11 +43,15 @@
 #define RDS_MR_8K_SCALE(256 / (RDS_MR_8K_MSG_SIZE + 1))
 #define RDS_MR_8K_POOL_SIZE(RDS_MR_8K_SCALE * (8192 / 2))
 
+struct rds_ib_fmr {
+   struct ib_fmr   *fmr;
+   u64 *dma;
+};
+
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
struct rds_ib_device*device;
struct rds_ib_mr_pool   *pool;
-   struct ib_fmr   *fmr;
 
struct llist_node   llnode;
 
@@ -57,8 +61,11 @@ struct rds_ib_mr {
 
struct scatterlist  *sg;
unsigned intsg_len;
-   u64 *dma;
int sg_dma_len;
+
+   union {
+   struct rds_ib_fmr   fmr;
+   } u;
 };
 
 /* Our own little MR pool */
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index c594519..9e608d9 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -334,6 +334,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 int free_all, struct rds_ib_mr **ibmr_ret)
 {
struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
struct llist_node *clean_nodes;
struct llist_node *clean_tail;
LIST_HEAD(unmap_list);
@@ -395,8 +396,10 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
goto out;
 
/* String all ib_mr's onto one list and hand them to ib_unmap_fmr */
-   list_for_each_entry(ibmr, &unmap_list, unmap_list)
-   list_add(&ibmr->fmr->list, &fmr_list);
+   list_for_each_entry(ibmr, &unmap_list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
 
ret = ib_unmap_fmr(&fmr_list);
if (ret)
@@ -405,6 +408,7 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
/* Now we can destroy the DMA mapping and unpin any pages */
list_for_each_entry_safe(ibmr, next, &unmap_list, unmap_list) {
unpinned += ibmr->sg_len;
+   fmr = &ibmr->u.fmr;
__rds_ib_teardown_mr(ibmr);
if (nfreed < free_goal ||
ibmr->remap_count >= pool

[net-next][PATCH v2 03/13] MAINTAINERS: update RDS entry

Acked-by: Chien Yen 
Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 MAINTAINERS | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 27393cf..08b084a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9067,10 +9067,14 @@ S:  Maintained
 F: drivers/net/ethernet/rdc/r6040.c
 
 RDS - RELIABLE DATAGRAM SOCKETS
-M: Chien Yen 
+M: Santosh Shilimkar 
+L: netdev@vger.kernel.org
+L: linux-r...@vger.kernel.org
 L: rds-de...@oss.oracle.com (moderated for non-subscribers)
+W: https://oss.oracle.com/projects/rds/
 S: Supported
 F: net/rds/
+F: Documentation/networking/rds.txt
 
 READ-COPY UPDATE (RCU)
 M: "Paul E. McKenney" 
-- 
1.9.1

[net-next][PATCH v2 05/13] RDS: IB: Re-organise ibmr code

No functional changes. This is in preperation towards adding
fastreg memory resgitration support.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/Makefile  |   2 +-
 net/rds/ib.c  |  37 +++---
 net/rds/ib.h  |  25 +---
 net/rds/ib_fmr.c  | 217 +++
 net/rds/ib_mr.h   | 109 
 net/rds/ib_rdma.c | 379 +++---
 6 files changed, 422 insertions(+), 347 deletions(-)
 create mode 100644 net/rds/ib_fmr.c
 create mode 100644 net/rds/ib_mr.h

diff --git a/net/rds/Makefile b/net/rds/Makefile
index 19e5485..bcf5591 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 9481d55..bb32cb9 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -42,15 +42,16 @@
 
 #include "rds.h"
 #include "ib.h"
+#include "ib_mr.h"
 
-unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE;
-unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE;
+unsigned int rds_ib_mr_1m_pool_size = RDS_MR_1M_POOL_SIZE;
+unsigned int rds_ib_mr_8k_pool_size = RDS_MR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
 
-module_param(rds_ib_fmr_1m_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA");
-module_param(rds_ib_fmr_8k_pool_size, int, 0444);
-MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA");
+module_param(rds_ib_mr_1m_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_1m_pool_size, " Max number of 1M mr per HCA");
+module_param(rds_ib_mr_8k_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_mr_8k_pool_size, " Max number of 8K mr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
 
@@ -140,13 +141,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
-   rds_ibdev->max_1m_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
- rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size;
+ rds_ib_mr_1m_pool_size) : rds_ib_mr_1m_pool_size;
 
-   rds_ibdev->max_8k_fmrs = device->attrs.max_mr ?
+   rds_ibdev->max_8k_mrs = device->attrs.max_mr ?
min_t(unsigned int, ((device->attrs.max_mr / 2) * 
RDS_MR_8K_SCALE),
- rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size;
+ rds_ib_mr_8k_pool_size) : rds_ib_mr_8k_pool_size;
 
rds_ibdev->max_initiator_depth = device->attrs.max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = device->attrs.max_qp_rd_atom;
@@ -172,10 +173,10 @@ static void rds_ib_add_one(struct ib_device *device)
goto put_dev;
}
 
-   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n",
+   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_mrs = %d, max_8k_mrs = %d\n",
 device->attrs.max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
-rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs,
-rds_ibdev->max_8k_fmrs);
+rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
+rds_ibdev->max_8k_mrs);
 
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
@@ -364,7 +365,7 @@ void rds_ib_exit(void)
rds_ib_sysctl_exit();
rds_ib_recv_exit();
rds_trans_unregister(&rds_ib_transport);
-   rds_ib_fmr_exit();
+   rds_ib_mr_exit();
 }
 
 struct rds_transport rds_ib_transport = {
@@ -400,13 +401,13 @@ int rds_ib_init(void)
 
INIT_LIST_HEAD(&rds_ib_devices);
 
-   ret = rds_ib_fmr_init();
+   ret = rds_ib_mr_init();
if (ret)
goto out;
 
ret = ib_register_client(&rds_ib_client);
if (ret)
-   goto out_fmr_exit;
+   goto out_mr_exit;
 
ret = rds_ib_sysctl_init();
if (ret)
@@ -430,8 +431,8 @@ out_sysctl:
rds_ib_sysctl_exit();
 out_ibreg:
rds_ib_unregister_client();
-out_fmr_exit:
-   rds_ib_fmr_exit();
+out_mr_exit:
+   rds_ib_mr_exit();
 out:
return ret;
 }
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 09cd8e3..c88cb22 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@

[net-next][PATCH v2 04/13] RDS: IB: Remove the RDS_IB_SEND_OP dependency

This helps to combine asynchronous fastreg MR completion handler
with send completion handler.

No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h  |  1 -
 net/rds/ib_cm.c   | 42 +++---
 net/rds/ib_send.c |  6 ++
 3 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index b3fdebb..09cd8e3 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -28,7 +28,6 @@
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
 #define RDS_IB_WC_MAX  32
-#define RDS_IB_SEND_OP BIT_ULL(63)
 
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index da5a7fb..7f68abc 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -236,12 +236,10 @@ static void rds_ib_cq_comp_handler_recv(struct ib_cq *cq, 
void *context)
tasklet_schedule(&ic->i_recv_tasklet);
 }
 
-static void poll_cq(struct rds_ib_connection *ic, struct ib_cq *cq,
-   struct ib_wc *wcs,
-   struct rds_ib_ack_state *ack_state)
+static void poll_scq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs)
 {
-   int nr;
-   int i;
+   int nr, i;
struct ib_wc *wc;
 
while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
@@ -251,10 +249,7 @@ static void poll_cq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   if (wc->wr_id & RDS_IB_SEND_OP)
-   rds_ib_send_cqe_handler(ic, wc);
-   else
-   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   rds_ib_send_cqe_handler(ic, wc);
}
}
 }
@@ -263,14 +258,12 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
struct rds_connection *conn = ic->conn;
-   struct rds_ib_ack_state state;
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
-   memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
-   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   poll_scq(ic, ic->i_send_cq, ic->i_send_wc);
 
if (rds_conn_up(conn) &&
(!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ||
@@ -278,6 +271,25 @@ static void rds_ib_tasklet_fn_send(unsigned long data)
rds_send_xmit(ic->conn);
 }
 
+static void poll_rcq(struct rds_ib_connection *ic, struct ib_cq *cq,
+struct ib_wc *wcs,
+struct rds_ib_ack_state *ack_state)
+{
+   int nr, i;
+   struct ib_wc *wc;
+
+   while ((nr = ib_poll_cq(cq, RDS_IB_WC_MAX, wcs)) > 0) {
+   for (i = 0; i < nr; i++) {
+   wc = wcs + i;
+   rdsdebug("wc wr_id 0x%llx status %u byte_len %u 
imm_data %u\n",
+(unsigned long long)wc->wr_id, wc->status,
+wc->byte_len, be32_to_cpu(wc->ex.imm_data));
+
+   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+   }
+   }
+}
+
 static void rds_ib_tasklet_fn_recv(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
@@ -291,9 +303,9 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
rds_ib_stats_inc(s_ib_tasklet_call);
 
memset(&state, 0, sizeof(state));
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
-   poll_cq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
+   poll_rcq(ic, ic->i_recv_cq, ic->i_recv_wc, &state);
 
if (state.ack_next_valid)
rds_ib_set_ack(ic, state.ack_next, state.ack_required);
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index eac30bf..f27d2c8 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -195,7 +195,7 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic)
 
send->s_op = NULL;
 
-   send->s_wr.wr_id = i | RDS_IB_SEND_OP;
+   send->s_wr.wr_id = i;
send->s_wr.sg_list = send->s_sge;
send->s_wr.ex.imm_data = 0;
 
@@ -263,9 +263,7 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, 
struct ib_wc *wc)
 
oldest = rds_ib_ring_oldest(&ic->i_send_ring);
 
-   completed = rds_ib_ring_completed(&ic->i_send_ring,
- (wc->wr_id & ~RDS_IB_SEND_OP),
- oldest);
+   compl

[net-next][PATCH v2 09/13] RDS: IB: handle the RDMA CM time wait event

Drop the RDS connection on RDMA_CM_EVENT_TIMEWAIT_EXIT so that
it can reconnect and resume.

While testing fastreg, this error happened in couple of tests but
was getting un-noticed.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/rdma_transport.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 4f4b3d8..7220beb 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -117,6 +117,14 @@ int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
rds_conn_drop(conn);
break;
 
+   case RDMA_CM_EVENT_TIMEWAIT_EXIT:
+   if (conn) {
+   pr_info("RDS: RDMA_CM_EVENT_TIMEWAIT_EXIT event: 
dropping connection %pI4->%pI4\n",
+   &conn->c_laddr, &conn->c_faddr);
+   rds_conn_drop(conn);
+   }
+   break;
+
default:
/* things like device disconnect? */
printk(KERN_ERR "RDS: unknown event %u (%s)!\n",
-- 
1.9.1

[net-next][PATCH v2 08/13] RDS: IB: add connection info to ibmr

Preperatory patch for FRMR support. From connection info,
we can retrieve cm_id which contains qp handled needed for
work request posting.

We also need to drop the RDS connection on QP error states
where connection handle becomes useful.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_mr.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index f5c1fcb..add7725 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -50,18 +50,19 @@ struct rds_ib_fmr {
 
 /* This is stored as mr->r_trans_private. */
 struct rds_ib_mr {
-   struct rds_ib_device*device;
-   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_device*device;
+   struct rds_ib_mr_pool   *pool;
+   struct rds_ib_connection*ic;
 
-   struct llist_node   llnode;
+   struct llist_node   llnode;
 
/* unmap_list is for freeing */
-   struct list_headunmap_list;
-   unsigned intremap_count;
+   struct list_headunmap_list;
+   unsigned intremap_count;
 
-   struct scatterlist  *sg;
-   unsigned intsg_len;
-   int sg_dma_len;
+   struct scatterlist  *sg;
+   unsigned intsg_len;
+   int sg_dma_len;
 
union {
struct rds_ib_fmr   fmr;
-- 
1.9.1

[net-next][PATCH v2 07/13] RDS: IB: move FMR code to its own file

No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_fmr.c  | 126 +-
 net/rds/ib_mr.h   |   6 +++
 net/rds/ib_rdma.c | 108 ++
 3 files changed, 134 insertions(+), 106 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 74f2c21..4fe8f4f 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -37,61 +37,16 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
struct rds_ib_mr_pool *pool;
struct rds_ib_mr *ibmr = NULL;
struct rds_ib_fmr *fmr;
-   int err = 0, iter = 0;
+   int err = 0;
 
if (npages <= RDS_MR_8K_MSG_SIZE)
pool = rds_ibdev->mr_8k_pool;
else
pool = rds_ibdev->mr_1m_pool;
 
-   if (atomic_read(&pool->dirty_count) >= pool->max_items / 10)
-   queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
-
-   /* Switch pools if one of the pool is reaching upper limit */
-   if (atomic_read(&pool->dirty_count) >=  pool->max_items * 9 / 10) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   pool = rds_ibdev->mr_1m_pool;
-   else
-   pool = rds_ibdev->mr_8k_pool;
-   }
-
-   while (1) {
-   ibmr = rds_ib_reuse_mr(pool);
-   if (ibmr)
-   return ibmr;
-
-   /* No clean MRs - now we have the choice of either
-* allocating a fresh MR up to the limit imposed by the
-* driver, or flush any dirty unused MRs.
-* We try to avoid stalling in the send path if possible,
-* so we allocate as long as we're allowed to.
-*
-* We're fussy with enforcing the FMR limit, though. If the
-* driver tells us we can't use more than N fmrs, we shouldn't
-* start arguing with it
-*/
-   if (atomic_inc_return(&pool->item_count) <= pool->max_items)
-   break;
-
-   atomic_dec(&pool->item_count);
-
-   if (++iter > 2) {
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_depleted);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_depleted);
-   return ERR_PTR(-EAGAIN);
-   }
-
-   /* We do have some empty MRs. Flush them out. */
-   if (pool->pool_type == RDS_IB_MR_8K_POOL)
-   rds_ib_stats_inc(s_ib_rdma_mr_8k_pool_wait);
-   else
-   rds_ib_stats_inc(s_ib_rdma_mr_1m_pool_wait);
-   rds_ib_flush_mr_pool(pool, 0, &ibmr);
-   if (ibmr)
-   return ibmr;
-   }
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   return ibmr;
 
ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
rdsibdev_to_node(rds_ibdev));
@@ -218,3 +173,76 @@ out:
 
return ret;
 }
+
+struct rds_ib_mr *rds_ib_reg_fmr(struct rds_ib_device *rds_ibdev,
+struct scatterlist *sg,
+unsigned long nents,
+u32 *key)
+{
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_fmr *fmr;
+   int ret;
+
+   ibmr = rds_ib_alloc_fmr(rds_ibdev, nents);
+   if (IS_ERR(ibmr))
+   return ibmr;
+
+   ibmr->device = rds_ibdev;
+   fmr = &ibmr->u.fmr;
+   ret = rds_ib_map_fmr(rds_ibdev, ibmr, sg, nents);
+   if (ret == 0)
+   *key = fmr->fmr->rkey;
+   else
+   rds_ib_free_mr(ibmr, 0);
+
+   return ibmr;
+}
+
+void rds_ib_unreg_fmr(struct list_head *list, unsigned int *nfreed,
+ unsigned long *unpinned, unsigned int goal)
+{
+   struct rds_ib_mr *ibmr, *next;
+   struct rds_ib_fmr *fmr;
+   LIST_HEAD(fmr_list);
+   int ret = 0;
+   unsigned int freed = *nfreed;
+
+   /* String all ib_mr's onto one list and hand them to  ib_unmap_fmr */
+   list_for_each_entry(ibmr, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   list_add(&fmr->fmr->list, &fmr_list);
+   }
+
+   ret = ib_unmap_fmr(&fmr_list);
+   if (ret)
+   pr_warn("RDS/IB: FMR invalidation failed (err=%d)\n", ret);
+
+   /* Now we can destroy the DMA mapping and unpin any pages */
+   list_for_each_entry_safe(ibmr, next, list, unmap_list) {
+   fmr = &ibmr->u.fmr;
+   *unpinned += ibmr->sg_len;
+   __rds_ib_teardown_mr(ibmr);
+   if (freed < goal ||
+   ibmr->remap_count >= ibmr->pool->fmr_attr.max_maps) {
+   if (ibmr->

[net-next][PATCH v2 10/13] RDS: IB: add mr reused stats

Add MR reuse statistics to RDS IB transport.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   | 2 ++
 net/rds/ib_rdma.c  | 7 ++-
 net/rds/ib_stats.c | 2 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c88cb22..62fe7d5 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -259,6 +259,8 @@ struct rds_ib_statistics {
uint64_ts_ib_rdma_mr_1m_pool_flush;
uint64_ts_ib_rdma_mr_1m_pool_wait;
uint64_ts_ib_rdma_mr_1m_pool_depleted;
+   uint64_ts_ib_rdma_mr_8k_reused;
+   uint64_ts_ib_rdma_mr_1m_reused;
uint64_ts_ib_atomic_cswp;
uint64_ts_ib_atomic_fadd;
 };
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 0e84843..ec7ea32 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -188,8 +188,13 @@ struct rds_ib_mr *rds_ib_reuse_mr(struct rds_ib_mr_pool 
*pool)
flag = this_cpu_ptr(&clean_list_grace);
set_bit(CLEAN_LIST_BUSY_BIT, flag);
ret = llist_del_first(&pool->clean_list);
-   if (ret)
+   if (ret) {
ibmr = llist_entry(ret, struct rds_ib_mr, llnode);
+   if (pool->pool_type == RDS_IB_MR_8K_POOL)
+   rds_ib_stats_inc(s_ib_rdma_mr_8k_reused);
+   else
+   rds_ib_stats_inc(s_ib_rdma_mr_1m_reused);
+   }
 
clear_bit(CLEAN_LIST_BUSY_BIT, flag);
preempt_enable();
diff --git a/net/rds/ib_stats.c b/net/rds/ib_stats.c
index d77e044..7e78dca 100644
--- a/net/rds/ib_stats.c
+++ b/net/rds/ib_stats.c
@@ -73,6 +73,8 @@ static const char *const rds_ib_stat_names[] = {
"ib_rdma_mr_1m_pool_flush",
"ib_rdma_mr_1m_pool_wait",
"ib_rdma_mr_1m_pool_depleted",
+   "ib_rdma_mr_8k_reused",
+   "ib_rdma_mr_1m_reused",
"ib_atomic_cswp",
"ib_atomic_fadd",
 };
-- 
1.9.1

[net-next][PATCH v2 13/13] RDS: IB: Support Fastreg MR (FRMR) memory registration mode

From: Avinash Repaka 

Fastreg MR(FRMR) is another method with which one can
register memory to HCA. Some of the newer HCAs supports only fastreg
mr mode, so we need to add support for it to have RDS functional
on them.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Avinash Repaka 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/Makefile  |   2 +-
 net/rds/ib.h  |   1 +
 net/rds/ib_cm.c   |   7 +-
 net/rds/ib_frmr.c | 376 ++
 net/rds/ib_mr.h   |  24 
 net/rds/ib_rdma.c |  17 ++-
 6 files changed, 422 insertions(+), 5 deletions(-)
 create mode 100644 net/rds/ib_frmr.c

diff --git a/net/rds/Makefile b/net/rds/Makefile
index bcf5591..0e72bec 100644
--- a/net/rds/Makefile
+++ b/net/rds/Makefile
@@ -6,7 +6,7 @@ rds-y :=af_rds.o bind.o cong.o connection.o info.o 
message.o   \
 obj-$(CONFIG_RDS_RDMA) += rds_rdma.o
 rds_rdma-y :=  rdma_transport.o \
ib.o ib_cm.o ib_recv.o ib_ring.o ib_send.o ib_stats.o \
-   ib_sysctl.o ib_rdma.o ib_fmr.o
+   ib_sysctl.o ib_rdma.o ib_fmr.o ib_frmr.o
 
 
 obj-$(CONFIG_RDS_TCP) += rds_tcp.o
diff --git a/net/rds/ib.h b/net/rds/ib.h
index eeb0d6c..627fb79 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -349,6 +349,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, 
__be32 ipaddr);
 void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection 
*conn);
 void rds_ib_remove_conn(struct rds_ib_device *rds_ibdev, struct rds_connection 
*conn);
 void rds_ib_destroy_nodev_conns(void);
+void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc);
 
 /* ib_recv.c */
 int rds_ib_recv_init(void);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 83f4673..8764970 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -249,7 +249,12 @@ static void poll_scq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
 
-   rds_ib_send_cqe_handler(ic, wc);
+   if (wc->wr_id <= ic->i_send_ring.w_nr ||
+   wc->wr_id == RDS_IB_ACK_WR_ID)
+   rds_ib_send_cqe_handler(ic, wc);
+   else
+   rds_ib_mr_cqe_handler(ic, wc);
+
}
}
 }
diff --git a/net/rds/ib_frmr.c b/net/rds/ib_frmr.c
new file mode 100644
index 000..93ff038
--- /dev/null
+++ b/net/rds/ib_frmr.c
@@ -0,0 +1,376 @@
+/*
+ * Copyright (c) 2016 Oracle.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include "ib_mr.h"
+
+static struct rds_ib_mr *rds_ib_alloc_frmr(struct rds_ib_device *rds_ibdev,
+  int npages)
+{
+   struct rds_ib_mr_pool *pool;
+   struct rds_ib_mr *ibmr = NULL;
+   struct rds_ib_frmr *frmr;
+   int err = 0;
+
+   if (npages <= RDS_MR_8K_MSG_SIZE)
+   pool = rds_ibdev->mr_8k_pool;
+   else
+   pool = rds_ibdev->mr_1m_pool;
+
+   ibmr = rds_ib_try_reuse_ibmr(pool);
+   if (ibmr)
+   return ibmr;
+
+   ibmr = kzalloc_node(sizeof(*ibmr), GFP_KERNEL,
+   rdsibdev_to_node(rds_ibdev));
+   if (!ibmr) {
+   err = -ENOMEM;
+   goto out_no_cigar;
+   }
+
+   frmr = &ibmr->u.frmr;
+   frmr->mr = ib_alloc_mr(rds_ibdev->pd, IB_MR_TYPE_MEM_REG,
+pool->fmr_attr.max_pages);
+   if (IS_ERR(frmr->mr)) {
+   pr_warn(

[net-next][PATCH v2 11/13] RDS: IB: add Fastreg MR (FRMR) detection support

Discovere Fast Memmory Registration support using IB device
IB_DEVICE_MEM_MGT_EXTENSIONS. Certain HCA might support just FRMR
or FMR or both FMR and FRWR. In case both mr type are supported,
default FMR is used.

Default MR is still kept as FMR against what everyone else
is following. Default will be changed to FRMR once the
RDS performance with FRMR is comparable with FMR. The
work is in progress for the same.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
v2: Dropped the module parameter as suggested by David Miller

 net/rds/ib.c| 10 ++
 net/rds/ib.h|  4 
 net/rds/ib_mr.h |  1 +
 3 files changed, 15 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index bb32cb9..b5342fd 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -140,6 +140,12 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_wrs = device->attrs.max_qp_wr;
rds_ibdev->max_sge = min(device->attrs.max_sge, RDS_IB_MAX_SGE);
 
+   rds_ibdev->has_fr = (device->attrs.device_cap_flags &
+ IB_DEVICE_MEM_MGT_EXTENSIONS);
+   rds_ibdev->has_fmr = (device->alloc_fmr && device->dealloc_fmr &&
+   device->map_phys_fmr && device->unmap_fmr);
+   rds_ibdev->use_fastreg = (rds_ibdev->has_fr && !rds_ibdev->has_fmr);
+
rds_ibdev->fmr_max_remaps = device->attrs.max_map_per_fmr?: 32;
rds_ibdev->max_1m_mrs = device->attrs.max_mr ?
min_t(unsigned int, (device->attrs.max_mr / 2),
@@ -178,6 +184,10 @@ static void rds_ib_add_one(struct ib_device *device)
 rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_mrs,
 rds_ibdev->max_8k_mrs);
 
+   pr_info("RDS/IB: %s: %s supported and preferred\n",
+   device->name,
+   rds_ibdev->use_fastreg ? "FRMR" : "FMR");
+
INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 62fe7d5..c5eddc2 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -200,6 +200,10 @@ struct rds_ib_device {
struct list_headconn_list;
struct ib_device*dev;
struct ib_pd*pd;
+   boolhas_fmr;
+   boolhas_fr;
+   booluse_fastreg;
+
unsigned intmax_mrs;
struct rds_ib_mr_pool   *mr_1m_pool;
struct rds_ib_mr_pool   *mr_8k_pool;
diff --git a/net/rds/ib_mr.h b/net/rds/ib_mr.h
index add7725..2f9b9c3 100644
--- a/net/rds/ib_mr.h
+++ b/net/rds/ib_mr.h
@@ -93,6 +93,7 @@ struct rds_ib_mr_pool {
 extern struct workqueue_struct *rds_ib_mr_wq;
 extern unsigned int rds_ib_mr_1m_pool_size;
 extern unsigned int rds_ib_mr_8k_pool_size;
+extern bool prefer_frmr;
 
 struct rds_ib_mr_pool *rds_ib_create_mr_pool(struct rds_ib_device *rds_dev,
 int npages);
-- 
1.9.1

[net-next][PATCH v2 12/13] RDS: IB: allocate extra space on queues for FRMR support

Fastreg MR(FRMR) memory registration and invalidation makes use
of work request and completion queues for its operation. Patch
allocates extra queue space towards these operation(s).

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h|  4 
 net/rds/ib_cm.c | 16 
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index c5eddc2..eeb0d6c 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -14,6 +14,7 @@
 
 #define RDS_IB_DEFAULT_RECV_WR 1024
 #define RDS_IB_DEFAULT_SEND_WR 256
+#define RDS_IB_DEFAULT_FR_WR   512
 
 #define RDS_IB_DEFAULT_RETRY_COUNT 2
 
@@ -122,6 +123,9 @@ struct rds_ib_connection {
struct ib_wci_send_wc[RDS_IB_WC_MAX];
struct ib_wci_recv_wc[RDS_IB_WC_MAX];
 
+   /* To control the number of wrs from fastreg */
+   atomic_ti_fastreg_wrs;
+
/* interrupt handling */
struct tasklet_struct   i_send_tasklet;
struct tasklet_struct   i_recv_tasklet;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 7f68abc..83f4673 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -363,7 +363,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
struct ib_qp_init_attr attr;
struct ib_cq_init_attr cq_attr = {};
struct rds_ib_device *rds_ibdev;
-   int ret;
+   int ret, fr_queue_space;
 
/*
 * It's normal to see a null device if an incoming connection races
@@ -373,6 +373,12 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!rds_ibdev)
return -EOPNOTSUPP;
 
+   /* The fr_queue_space is currently set to 512, to add extra space on
+* completion queue and send queue. This extra space is used for FRMR
+* registration and invalidation work requests
+*/
+   fr_queue_space = (rds_ibdev->use_fastreg ? RDS_IB_DEFAULT_FR_WR : 0);
+
/* add the conn now so that connection establishment has the dev */
rds_ib_add_conn(rds_ibdev, conn);
 
@@ -384,7 +390,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
/* Protection domain and memory range */
ic->i_pd = rds_ibdev->pd;
 
-   cq_attr.cqe = ic->i_send_ring.w_nr + 1;
+   cq_attr.cqe = ic->i_send_ring.w_nr + fr_queue_space + 1;
 
ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 rds_ib_cq_event_handler, conn,
@@ -424,7 +430,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.event_handler = rds_ib_qp_event_handler;
attr.qp_context = conn;
/* + 1 to allow for the single ack message */
-   attr.cap.max_send_wr = ic->i_send_ring.w_nr + 1;
+   attr.cap.max_send_wr = ic->i_send_ring.w_nr + fr_queue_space + 1;
attr.cap.max_recv_wr = ic->i_recv_ring.w_nr + 1;
attr.cap.max_send_sge = rds_ibdev->max_sge;
attr.cap.max_recv_sge = RDS_IB_RECV_SGE;
@@ -432,6 +438,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
attr.qp_type = IB_QPT_RC;
attr.send_cq = ic->i_send_cq;
attr.recv_cq = ic->i_recv_cq;
+   atomic_set(&ic->i_fastreg_wrs, RDS_IB_DEFAULT_FR_WR);
 
/*
 * XXX this can fail if max_*_wr is too large?  Are we supposed
@@ -751,7 +758,8 @@ void rds_ib_conn_shutdown(struct rds_connection *conn)
 */
wait_event(rds_ib_ring_empty_wait,
   rds_ib_ring_empty(&ic->i_recv_ring) &&
-  (atomic_read(&ic->i_signaled_sends) == 0));
+  (atomic_read(&ic->i_signaled_sends) == 0) &&
+  (atomic_read(&ic->i_fastreg_wrs) == 
RDS_IB_DEFAULT_FR_WR));
tasklet_kill(&ic->i_send_tasklet);
tasklet_kill(&ic->i_recv_tasklet);
 
-- 
1.9.1

Re: Softirq priority inversion from "softirq: reduce latencies"

2016-02-27 Thread Eric Dumazet

On sam., 2016-02-27 at 18:10 -0800, Peter Hurley wrote:
> On 02/27/2016 05:59 PM, Eric Dumazet wrote:
> > On sam., 2016-02-27 at 15:33 -0800, Peter Hurley wrote:
> >> On 02/27/2016 03:04 PM, David Miller wrote:
> >>> From: Peter Hurley 
> >>> Date: Sat, 27 Feb 2016 12:29:39 -0800
> >>>
>  Not really. softirq raised from interrupt context will always execute
>  on this cpu and not in ksoftirqd, unless load forces softirq loop abort.
> >>>
> >>> That guarantee never was specified.
> >>
> >> ??
> >>
> >> Neither is running network socket servers at normal priority as if they're
> >> higher priority than softirq.
> >>
> >>
> >>> Or are you saying that by design, on a system under load, your UART
> >>> will not function properly?
> >>>
> >>> Surely you don't mean that.
> >>
> >> No, that's not what I mean.
> >>
> >> What I mean is that bypassing the entire SOFTIRQ priority so that
> >> sshd can process one network packet makes a mockery of the point of 
> >> softirq.
> >>
> >> This hack to workaround NET_RX looping over-and-over-and-over affects every
> >> subsystem, not just one uart.
> >>
> >> HI, TIMER, BLOCK; all of these are skipped: that's straight-up, a bug.
> > 
> > No idea what you talk about.
> > 
> > All pending softirq interrupts are processed. _Nothing_ is skipped.
> 
> An interrupt that schedules HI softirq while in NET_RX softirq should
> still run the HI softirq. But with your patch that won't happen.

Stop saying this. This never had been the case. I am glad my patch
finally show you are wrong.

> 
> 
> > Really, your system stability seems to depend on a completely
> > undocumented behavior of linux kernels before linux-3.8
> > 
> > If I understood, you expect that a tasklet activated from a softirq
> > handler is run from the same  __do_softirq() loop. This never has been
> > the case.
> 
> No.
> 
> The *interrupt handler* for DMA goes off while NET_RX softirq is running.
> That's what schedules the *DMA tasklet*.
> 
> That tasklet should run before any process.
> 
> But it doesn't because your patch bails out early from softirq.

Fine. Fix your driver.

Re: [net-next][PATCH 00/13] RDS: Major clean-up with couple of new features for 4.6

2016-02-27 Thread santosh.shilim...@oracle.com


Hi Dave,

On 2/26/16 9:43 PM, Santosh Shilimkar wrote:

Series is generated against net-next but also applies against Linus's tip
cleanly. The diff-stat looks bit scary since almost ~4K lines of code is
getting removed.



[...]



Entire patchset is available below git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git 
for_4.6/net-next/rds


Just noticed that, I accidentally posted the patches from older(v1)
folder, instead of updated v2. Sorry about it. Please discard
this entire series. Will post intended v2 right after this email.

Regards,
Santosh

Re: Softirq priority inversion from "softirq: reduce latencies"

On 02/27/2016 05:59 PM, Eric Dumazet wrote:
> On sam., 2016-02-27 at 15:33 -0800, Peter Hurley wrote:
>> On 02/27/2016 03:04 PM, David Miller wrote:
>>> From: Peter Hurley 
>>> Date: Sat, 27 Feb 2016 12:29:39 -0800
>>>
 Not really. softirq raised from interrupt context will always execute
 on this cpu and not in ksoftirqd, unless load forces softirq loop abort.
>>>
>>> That guarantee never was specified.
>>
>> ??
>>
>> Neither is running network socket servers at normal priority as if they're
>> higher priority than softirq.
>>
>>
>>> Or are you saying that by design, on a system under load, your UART
>>> will not function properly?
>>>
>>> Surely you don't mean that.
>>
>> No, that's not what I mean.
>>
>> What I mean is that bypassing the entire SOFTIRQ priority so that
>> sshd can process one network packet makes a mockery of the point of softirq.
>>
>> This hack to workaround NET_RX looping over-and-over-and-over affects every
>> subsystem, not just one uart.
>>
>> HI, TIMER, BLOCK; all of these are skipped: that's straight-up, a bug.
> 
> No idea what you talk about.
> 
> All pending softirq interrupts are processed. _Nothing_ is skipped.

An interrupt that schedules HI softirq while in NET_RX softirq should
still run the HI softirq. But with your patch that won't happen.


> Really, your system stability seems to depend on a completely
> undocumented behavior of linux kernels before linux-3.8
> 
> If I understood, you expect that a tasklet activated from a softirq
> handler is run from the same  __do_softirq() loop. This never has been
> the case.

No.

The *interrupt handler* for DMA goes off while NET_RX softirq is running.
That's what schedules the *DMA tasklet*.

That tasklet should run before any process.

But it doesn't because your patch bails out early from softirq.


> My change simply triggers the bug in your driver earlier. As David
> pointed out, your bug should trigger the same on a loaded machine, even
> if you revert my patch.
> 
> I honestly do not know why you arm a tasklet from NET_RX, why don't you
> simply process this directly, so that you do not rely on some scheduler
> decision ?

[PATCH net] sctp: lack the check for ports in sctp_v6_cmp_addr

2016-02-27 Thread Xin Long

As the member .cmp_addr of sctp_af_inet6, sctp_v6_cmp_addr should also check
the port of addresses, just like sctp_v4_cmp_addr, cause it's invoked by
sctp_cmp_addr_exact().

Now sctp_v6_cmp_addr just check the port when two addresses have different
family, and lack the port check for two ipv6 addresses. that will make
sctp_hash_cmp() cannot work well.

so fix it by adding ports comparison in sctp_v6_cmp_addr().

Signed-off-by: Xin Long 
---
 net/sctp/ipv6.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index ec52912..ce46f1c 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -526,6 +526,8 @@ static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
}
return 0;
}
+   if (addr1->v6.sin6_port != addr2->v6.sin6_port)
+   return 0;
if (!ipv6_addr_equal(&addr1->v6.sin6_addr, &addr2->v6.sin6_addr))
return 0;
/* If this is a linklocal address, compare the scope_id. */
-- 
2.1.0

Re: Softirq priority inversion from "softirq: reduce latencies"

2016-02-27 Thread Eric Dumazet

On sam., 2016-02-27 at 15:33 -0800, Peter Hurley wrote:
> On 02/27/2016 03:04 PM, David Miller wrote:
> > From: Peter Hurley 
> > Date: Sat, 27 Feb 2016 12:29:39 -0800
> > 
> >> Not really. softirq raised from interrupt context will always execute
> >> on this cpu and not in ksoftirqd, unless load forces softirq loop abort.
> > 
> > That guarantee never was specified.
> 
> ??
> 
> Neither is running network socket servers at normal priority as if they're
> higher priority than softirq.
> 
> 
> > Or are you saying that by design, on a system under load, your UART
> > will not function properly?
> > 
> > Surely you don't mean that.
> 
> No, that's not what I mean.
> 
> What I mean is that bypassing the entire SOFTIRQ priority so that
> sshd can process one network packet makes a mockery of the point of softirq.
> 
> This hack to workaround NET_RX looping over-and-over-and-over affects every
> subsystem, not just one uart.
> 
> HI, TIMER, BLOCK; all of these are skipped: that's straight-up, a bug.

No idea what you talk about.

All pending softirq interrupts are processed. _Nothing_ is skipped.

Really, your system stability seems to depend on a completely
undocumented behavior of linux kernels before linux-3.8

If I understood, you expect that a tasklet activated from a softirq
handler is run from the same  __do_softirq() loop. This never has been
the case.

My change simply triggers the bug in your driver earlier. As David
pointed out, your bug should trigger the same on a loaded machine, even
if you revert my patch.

I honestly do not know why you arm a tasklet from NET_RX, why don't you
simply process this directly, so that you do not rely on some scheduler
decision ?

КЛИЕНТСКИЕ БАЗЫ! Тел\Viber\Whatsapp: +79133913837 Email: mloginov...@gmail.com Skype: prodawez389

2016-02-27 Thread netdev@vger.kernel.org

КЛИЕНТСКИЕ БАЗЫ!

Соберем для Вас по интернет базу данных потенциальных клиентов 
для Вашего Бизнеса!
Много! Быстро! Недорого! 
Узнайте об этом подробнее по 
Тел: +79133913837
Viber: +79133913837
Whatsapp: +79133913837
Skype: prodawez389
Email: mloginov...@gmail.com

[no subject]

2016-02-27 Thread David and Carol Martin



-- 
We are donating to you 1,500,000 GBP, from David and Carol Martin £33million 
lottery, contact : davidcar...@yahoo.com.hk 
view link: 
http://www.ibtimes.co.uk/lotto-winners-david-carol-martin-want-blast-into-space-after-buying-new-shoes-1537851

Re: Softirq priority inversion from "softirq: reduce latencies"

On 02/27/2016 03:04 PM, David Miller wrote:
> From: Peter Hurley 
> Date: Sat, 27 Feb 2016 12:29:39 -0800
> 
>> Not really. softirq raised from interrupt context will always execute
>> on this cpu and not in ksoftirqd, unless load forces softirq loop abort.
> 
> That guarantee never was specified.

??

Neither is running network socket servers at normal priority as if they're
higher priority than softirq.

> Or are you saying that by design, on a system under load, your UART
> will not function properly?
> 
> Surely you don't mean that.

No, that's not what I mean.

What I mean is that bypassing the entire SOFTIRQ priority so that
sshd can process one network packet makes a mockery of the point of softirq.

This hack to workaround NET_RX looping over-and-over-and-over affects every
subsystem, not just one uart.

HI, TIMER, BLOCK; all of these are skipped: that's straight-up, a bug.

Regards,
Peter Hurley

Re: Softirq priority inversion from "softirq: reduce latencies"

From: Peter Hurley 
Date: Sat, 27 Feb 2016 12:29:39 -0800

> Not really. softirq raised from interrupt context will always execute
> on this cpu and not in ksoftirqd, unless load forces softirq loop abort.

That guarantee never was specified.

Or are you saying that by design, on a system under load, your UART
will not function properly?

Surely you don't mean that.

Re: Sending short raw packets using sendmsg() broke

2016-02-27 Thread Willem de Bruijn

On Fri, Feb 26, 2016 at 12:46 PM, David Miller  wrote:
> From: Willem de Bruijn 
> Date: Fri, 26 Feb 2016 12:33:13 -0500
>
>> Right. The simplest, if hacky, fix is to add something along the lines of
>>
>>   static unsigned short netdev_min_hard_header_len(struct net_device *dev)
>>   {
>>   if (unlikely(dev->type ==ARPHDR_AX25))
>> return AX25_KISS_HEADER_LEN;
>>   else
>> return dev->hard_header_len;
>>   }
>>
>> Depending on how the variable encoding scheme works, a basic min
>> length check may still produce buggy headers that confuse the stack or
>> driver. I need to read up on AX25. If so, then extending header_ops
>> with an optional validate() function is a more generic approach of
>> checking header sanity.
>
> I suspect we will need some kind of header ops for this.

To return the device type minimum length or to do full header validation?

Looking at drivers/net/hamradio, I don't see any driver output paths
interpreting the header fields, in which case the first is sufficient.
A minimum U/S frame is

  AX25_KISS_HEADER_LEN + 2* AX25_ADDR_LEN + 3 (control + FCS) ==
  AX25_KISS_HEADER_LEN + AX25_HEADER_LEN

Heikki, you gave this number + 3. Where does that constant come from?

More thorough validation of the header contents is not necessarily
hard. The following validates the address, including optional
repeaters.

  static bool ax25_validate_hard_header(const char *ll_header,
   unsigned short len)
  {
 ax25_digi digi;

 return !ax25_addr_parse(ll_header, len, NULL, NULL, &digi, NULL, NULL);
  }

The major drawback of full validation from the point of fixing the
original bug that it requires the header already having been copied to
the kernel. The ll_header_truncated check is currently performed
before allocation + copy, based solely on len. So this might become a
relatively complex patch that is not easy to backport to stable
branches.

I can send simple minimal length validation patch to net to solve the
reported bug. Then optionally follow up with a header_ops->validate()
extension in net-next, if there is value in that.

Re: [patch] rocker: fix an error code

From: Dan Carpenter 
Date: Sat, 27 Feb 2016 14:31:43 +0300

> We intended to return PTR_ERR() here instead of 1.
> 
> Fixes: 1f9993f6825f ('rocker: fix a neigh entry leak issue')
> Signed-off-by: Dan Carpenter 

Applied, thanks.

Re: [PATCH net-next 1/5] vxlan: implement GPE in L2 mode

2016-02-27 Thread Tom Herbert

On Sat, Feb 27, 2016 at 12:54 PM, Tom Herbert  wrote:
> On Sat, Feb 27, 2016 at 11:31 AM, Jiri Benc  wrote:
>> On Fri, 26 Feb 2016 15:51:29 -0800, Tom Herbert wrote:
>>> I don't think this is right. VXLAN-GPE is a separate protocol than
>>> VXLAN, they are not compatible on the wire and don't share flags or
>>> fields (for instance GPB uses bits in VXLAN that hold the next
>>> protocol in VXLAN-GPE). Neither is there a VXLAN_F_GPE flag defined in
>>> VXLAN to differentiate the two. So VXLAN-GPE would be used on a
>>> different port
>>
>> Yes, and that's exactly what this patchset does. If there's the
>> VXLAN_F_GPE flag defined while creating the interface, the used UDP
>> port defaults to the VXLAN-GPE UDP port (but can be overriden) and the
>> driver expects that all packets received are VXLAN-GPE.
>>
>> Note also that you can't define both GPE and GBP together, because as
>> you noted, they're not compatible. The driver correctly refuses such
>> combination.
>>
> Yes, but RCO has not been specified for VXLAN-GPE either so the patch
> does not correctly refuse setting those two together. Inevitably
> though, those and other extensions will defined for VXLAN-GPE and new
> ones for VXLAN. Again, the protocols are fundamentally incompatible,
> so instead of trying to enforce each valid combination at
> configuration or performing multiple checks for flavor each time we
> look at a packet, it seems easier to split the parsing with at most
> one check for the protocol variant. For instance in
> vxlan_udp_encap_recv just do:
>
> if (vs->flags & VXLAN_F_GPE)
>if (!vxlan_parse_gpe_hdr(&unparsed, skb, vs->flags))
>goto drop;
> else
>if (!vxlan_parse_gpe(&unparsed, skb, vs->flags))
>goto drop;
>

I meant

if (vs->flags & VXLAN_F_GPE)
   if (!vxlan_parse_gpe_hdr(&unparsed, skb, vs->flags))
   goto drop;
else
   if (!vxlan_parse_hdr(&unparsed, skb, vs->flags))
   goto drop;

>
> And then move REMCSUM and GPB and other protocol specific checks to
> the right function.
>
> Tom

Re: [PATCH net-next 1/5] vxlan: implement GPE in L2 mode

2016-02-27 Thread Tom Herbert

On Sat, Feb 27, 2016 at 11:31 AM, Jiri Benc  wrote:
> On Fri, 26 Feb 2016 15:51:29 -0800, Tom Herbert wrote:
>> I don't think this is right. VXLAN-GPE is a separate protocol than
>> VXLAN, they are not compatible on the wire and don't share flags or
>> fields (for instance GPB uses bits in VXLAN that hold the next
>> protocol in VXLAN-GPE). Neither is there a VXLAN_F_GPE flag defined in
>> VXLAN to differentiate the two. So VXLAN-GPE would be used on a
>> different port
>
> Yes, and that's exactly what this patchset does. If there's the
> VXLAN_F_GPE flag defined while creating the interface, the used UDP
> port defaults to the VXLAN-GPE UDP port (but can be overriden) and the
> driver expects that all packets received are VXLAN-GPE.
>
> Note also that you can't define both GPE and GBP together, because as
> you noted, they're not compatible. The driver correctly refuses such
> combination.
>
Yes, but RCO has not been specified for VXLAN-GPE either so the patch
does not correctly refuse setting those two together. Inevitably
though, those and other extensions will defined for VXLAN-GPE and new
ones for VXLAN. Again, the protocols are fundamentally incompatible,
so instead of trying to enforce each valid combination at
configuration or performing multiple checks for flavor each time we
look at a packet, it seems easier to split the parsing with at most
one check for the protocol variant. For instance in
vxlan_udp_encap_recv just do:

if (vs->flags & VXLAN_F_GPE)
   if (!vxlan_parse_gpe_hdr(&unparsed, skb, vs->flags))
   goto drop;
else
   if (!vxlan_parse_gpe(&unparsed, skb, vs->flags))
   goto drop;

And then move REMCSUM and GPB and other protocol specific checks to
the right function.

Tom

Re: Softirq priority inversion from "softirq: reduce latencies"

On 02/27/2016 12:13 PM, Eric Dumazet wrote:
> On sam., 2016-02-27 at 10:19 -0800, Peter Hurley wrote:
>> Hi Eric,
>>
>> For a while now, we've been struggling to understand why we've been
>> observing missed uart rx DMA.
>>
>> Because both the uart driver (omap8250) and the dmaengine driver
>> (edma) were (relatively) new, we assumed there was some race between
>> starting a new rx DMA and processing the previous one.
>>
>> However, after instrumenting both the uart driver and the dmaengine
>> driver, what we've observed is huge anomalous latencies between receiving
>> the DMA interrupt and servicing the DMA tasklet.
>>
>> For example, at 3Mbaud we recorded the following distribution of
>> softirq[TASKLET] service latency for this specific DMA channel:
>>
>> root@black:/sys/kernel/debug/edma# cat 35
>> latency(us):   0+   20+   40+   60+   80+  100+  120+  140+  160+  180+  
>> 200+  220+  240+  260+  280+  300+  320+  340+  360+  380+
>>195681335315 7 4 3 1 0 0 
>> 0 1 4 6 1 0 0 0 0 0
>>
>> As you can see, the vast majority of tasklet service happens immediately,
>> tapering off to 140+us.
>>
>> However, note the island of distribution at 220~300 [latencies beyond 300+
>> are not recorded because the uart fifo has filled again by this point and
>> dma must be aborted].
>>
>> So I cribbed together a latency tracer to catch what was happening at
>> the extreme, and what it caught was a priority inversion stemming from
>> your commit:
>>
>>commit c10d73671ad30f54692f7f69f0e09e75d3a8926a
>>Author: Eric Dumazet 
>>Date:   Thu Jan 10 15:26:34 2013 -0800
>>
>>softirq: reduce latencies
>> 
>>In various network workloads, __do_softirq() latencies can be up
>>to 20 ms if HZ=1000, and 200 ms if HZ=100.
>> 
>>This is because we iterate 10 times in the softirq dispatcher,
>>and some actions can consume a lot of cycles.
>>
>>
>> In the trace below [1], the trace begins in the edma completion interrupt
>> handler when the tasklet is scheduled; the edma interrupt has occurred during
>> NET_RX softirq context.
>>
>> However, instead of causing a restart of the softirq loop to process the
>> tasklet _which occurred before sshd was scheduled_, the softirq loop is
>> aborted and deferred for ksoftirqd. The tasklet is not serviced for 521us,
>> which is way too long, so DMA was aborted.
>>
>> Your patch has effectively inverted the priority of tasklets with normal
>> pri/nice processes that have merely received a network packet.
>>
>> ISTM, the problem you're trying to solve here was caused by NET_RX softirq
>> to begin with, and maybe that thing needs a diet.
>>
>> But rather than outright reverting your patch, what if more selective
>> conditions are used to abort the softirq restart? What would those conditions
>> be? In the netperf benchmark you referred to in that commit, is it just
>> NET_TX/NET_RX softirqs that are causing scheduling latencies?
>>
>> It just doesn't make sense to special case for a workload that isn't
>> even running.
>>
>>
>> Regards,
>> Peter Hurley
>>
>>
>> [1] softirq tasklet latency trace  (apologies that it's only events - full
>> function trace introduces too much overhead)
>>
>> # tracer: latency
>> #
>> # latency latency trace v1.1.5 on 4.5.0-rc2+
>> # 
>> # latency: 476 us, #59/59, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0)
>> #-
>> #| task: sshd-750 (uid:1000 nice:0 policy:0 rt_prio:0)
>> #-
>> #  => started at: __tasklet_schedule  
>> #  => ended at:   tasklet_action
>> #
>> #
>> #  _--=> CPU#
>> # / _-=> irqs-off 
>> #| / _=> need-resched
>> #|| / _---=> hardirq/softirq
>> #||| / _--=> preempt-depth
>> # / delay
>> #  cmd pid   | time  |   caller
>> # \   /  |  \|   /
>>   -0   0d.H31us : __tasklet_schedule
>>   -0   0d.H43us : softirq_raise: vec=6 [action=TASKLET]
>>   -0   0d.H36us : irq_handler_exit: irq=20 ret=handled
>>   -0   0..s2   15us : kmem_cache_alloc: call_site=c08378e4 
>> ptr=de55d7c0 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_ATOMIC
>>   -0   0..s2   23us : netif_receive_skb_entry: dev=eth0 
>> napi_id=0x0 queue_mapping=0 skbaddr=dca04400 vlan_tagged=0 vlan_proto=0x 
>> vlan_tci=0x000
>> 0 protocol=0x0800 ip_summed=0 hash=0x l4_hash=0 len=88 data_len=0 
>> truesize=1984 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 
>> gso_type=0x0
>>   -0   0..s2   30us+: netif_receive_skb: dev=eth0 skbaddr=dca04400 
>> len=88
>>   -0   0d.s5   98us : sched_waking: comm=sshd pid=750 prio=120 
>> target_cpu=000
>>   -0   0d.s6  105us : sched_stat_sleep: comm=sshd pid=750 
>> delay=3125230447 [ns]
>>   -0   0dns6  110us+: sche

Re: Softirq priority inversion from "softirq: reduce latencies"

2016-02-27 Thread Eric Dumazet

On sam., 2016-02-27 at 10:19 -0800, Peter Hurley wrote:
> Hi Eric,
> 
> For a while now, we've been struggling to understand why we've been
> observing missed uart rx DMA.
> 
> Because both the uart driver (omap8250) and the dmaengine driver
> (edma) were (relatively) new, we assumed there was some race between
> starting a new rx DMA and processing the previous one.
> 
> However, after instrumenting both the uart driver and the dmaengine
> driver, what we've observed is huge anomalous latencies between receiving
> the DMA interrupt and servicing the DMA tasklet.
> 
> For example, at 3Mbaud we recorded the following distribution of
> softirq[TASKLET] service latency for this specific DMA channel:
> 
> root@black:/sys/kernel/debug/edma# cat 35
> latency(us):   0+   20+   40+   60+   80+  100+  120+  140+  160+  180+  200+ 
>  220+  240+  260+  280+  300+  320+  340+  360+  380+
>195681335315 7 4 3 1 0 0 0 
> 1 4 6 1 0 0 0 0 0
> 
> As you can see, the vast majority of tasklet service happens immediately,
> tapering off to 140+us.
> 
> However, note the island of distribution at 220~300 [latencies beyond 300+
> are not recorded because the uart fifo has filled again by this point and
> dma must be aborted].
> 
> So I cribbed together a latency tracer to catch what was happening at
> the extreme, and what it caught was a priority inversion stemming from
> your commit:
> 
>commit c10d73671ad30f54692f7f69f0e09e75d3a8926a
>Author: Eric Dumazet 
>Date:   Thu Jan 10 15:26:34 2013 -0800
> 
>softirq: reduce latencies
> 
>In various network workloads, __do_softirq() latencies can be up
>to 20 ms if HZ=1000, and 200 ms if HZ=100.
> 
>This is because we iterate 10 times in the softirq dispatcher,
>and some actions can consume a lot of cycles.
> 
> 
> In the trace below [1], the trace begins in the edma completion interrupt
> handler when the tasklet is scheduled; the edma interrupt has occurred during
> NET_RX softirq context.
> 
> However, instead of causing a restart of the softirq loop to process the
> tasklet _which occurred before sshd was scheduled_, the softirq loop is
> aborted and deferred for ksoftirqd. The tasklet is not serviced for 521us,
> which is way too long, so DMA was aborted.
> 
> Your patch has effectively inverted the priority of tasklets with normal
> pri/nice processes that have merely received a network packet.
> 
> ISTM, the problem you're trying to solve here was caused by NET_RX softirq
> to begin with, and maybe that thing needs a diet.
> 
> But rather than outright reverting your patch, what if more selective
> conditions are used to abort the softirq restart? What would those conditions
> be? In the netperf benchmark you referred to in that commit, is it just
> NET_TX/NET_RX softirqs that are causing scheduling latencies?
> 
> It just doesn't make sense to special case for a workload that isn't
> even running.
> 
> 
> Regards,
> Peter Hurley
> 
> 
> [1] softirq tasklet latency trace  (apologies that it's only events - full
> function trace introduces too much overhead)
> 
> # tracer: latency
> #
> # latency latency trace v1.1.5 on 4.5.0-rc2+
> # 
> # latency: 476 us, #59/59, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0)
> #-
> #| task: sshd-750 (uid:1000 nice:0 policy:0 rt_prio:0)
> #-
> #  => started at: __tasklet_schedule  
> #  => ended at:   tasklet_action
> #
> #
> #  _--=> CPU#
> # / _-=> irqs-off 
> #| / _=> need-resched
> #|| / _---=> hardirq/softirq
> #||| / _--=> preempt-depth
> # / delay
> #  cmd pid   | time  |   caller
> # \   /  |  \|   /
>   -0   0d.H31us : __tasklet_schedule
>   -0   0d.H43us : softirq_raise: vec=6 [action=TASKLET]
>   -0   0d.H36us : irq_handler_exit: irq=20 ret=handled
>   -0   0..s2   15us : kmem_cache_alloc: call_site=c08378e4 
> ptr=de55d7c0 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_ATOMIC
>   -0   0..s2   23us : netif_receive_skb_entry: dev=eth0 napi_id=0x0 
> queue_mapping=0 skbaddr=dca04400 vlan_tagged=0 vlan_proto=0x 
> vlan_tci=0x000
> 0 protocol=0x0800 ip_summed=0 hash=0x l4_hash=0 len=88 data_len=0 
> truesize=1984 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 
> gso_type=0x0
>   -0   0..s2   30us+: netif_receive_skb: dev=eth0 skbaddr=dca04400 
> len=88
>   -0   0d.s5   98us : sched_waking: comm=sshd pid=750 prio=120 
> target_cpu=000
>   -0   0d.s6  105us : sched_stat_sleep: comm=sshd pid=750 
> delay=3125230447 [ns]
>   -0   0dns6  110us+: sched_wakeup: comm=sshd pid=750 prio=120 
> target_cpu=000
>   -0   0dns4  123us+: timer_start: timer=dc940e9c 
> function=tcp_delack_timer

Re: net: mv643xx: interface does not transmit after some time

2016-02-27 Thread Adam Baker

On 11/02/16 14:38, Ezequiel Garcia wrote:

(let's expand the Cc a bit)

On 10 February 2016 at 19:57, Andrew Lunn wrote:

On Wed, Feb 10, 2016 at 07:40:54PM +0100, Thomas Schlöter wrote:

Am 08.02.2016 um 19:49 schrieb Thomas Schlöter :

Am 07.02.2016 um 22:07 schrieb Thomas Schlöter :

Am 07.02.2016 um 21:35 schrieb Andrew Lunn :

FWIW, we had a similar bug report in Debian recently:
https://lists.debian.org/debian-arm/2016/01/msg00098.html

Hi Thomas

I this thread, Ian Campbell mentions a patch. Please could you try
that patch and see if it fixes your problem.

Thanks
Andrew

Hi Andrew,

I just applied the patch and the NAS is now running it. I???ll try to crash it
tonight and keep you informed whether it worked.

Thanks
Thomas

Hi Andrew,

the patch did not fix the problem. After 1.2 GiB RX and 950 MiB TX, the
interface crashed again.

Now I switched off RX/TX offload just to make sure we are talking about the
same problem. If we are, the interface should be stable without offload, right?

Thomas

Okay, so I have installed ethtool and switched off all offload features
available. Now the NAS is running rock solid for two days. I backed up my Mac
using Time Machine / netatalk (450 GiB transferred) and some Linux machines via
NFS (100 GiB total) without a problem.

How much code is used for mv643xx offload functionality?
Is it possible to debug things in the driver and figure out what happens during
the crash?
Is the hardware offload interface proprietary or reverse engineered or is it a
well known API that can be analyzed?

Hi Thomas

Ezequiel Garcia probably knows this part of the driver and hardware
the best...

The TCP segmentation offload (TSO) implemented in this driver is
mostly a software thing.

I'm CCing Karl and Philipp, who have fixed subtle issues in the TSO
path, and may be able to help figure this one out.

Hi,

Had this issue occur again today. In my case it seems to be triggered by
large NFSv4 transfers.

I'm running 4.4 plus Nicolas Schichan's patch at
https://patchwork.ozlabs.org/patch/573334/

There is a thread a http://forum.doozan.com/read.php?2,17404 suggesting
that this has been broken since at least 3.16.

I first spotted the issue when upgrading from 3.11 to 4.4.

Looking at
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/drivers/net/ethernet/marvell/mv643xx_eth.c
I see 2014-05-22 as the date TSO support was first added which is
shortly before the merge window opened for 3.16. I'm therefore guessing
that TSO has been problematic since it's introduction.

Regards

Adam

[PATCH] mld, igmp: Fix reserved tailroom calculation

2016-02-27 Thread Benjamin Poirier

The current reserved_tailroom calculation fails to take hlen and tlen into
account.

skb:
[__hlen__|__data|__tlen___|__extra__]
^   ^
headskb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data|__extra__|__tlen___]
^   ^
headskb_end_offset

The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.

Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.

Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large 
MTUs")
Signed-off-by: Benjamin Poirier 
---
 net/ipv4/igmp.c  | 4 ++--
 net/ipv6/mcast.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 05e4cba..b5d28a4 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -356,9 +356,9 @@ static struct sk_buff *igmpv3_newpack(struct net_device 
*dev, unsigned int mtu)
skb_dst_set(skb, &rt->dst);
skb->dev = dev;
 
-   skb->reserved_tailroom = skb_end_offset(skb) -
-min(mtu, skb_end_offset(skb));
skb_reserve(skb, hlen);
+   skb->reserved_tailroom = skb_tailroom(skb) -
+   min_t(int, mtu, skb_tailroom(skb) - tlen);
 
skb_reset_network_header(skb);
pip = ip_hdr(skb);
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 5ee56d0..c157edc 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -1574,9 +1574,9 @@ static struct sk_buff *mld_newpack(struct inet6_dev 
*idev, unsigned int mtu)
return NULL;
 
skb->priority = TC_PRIO_CONTROL;
-   skb->reserved_tailroom = skb_end_offset(skb) -
-min(mtu, skb_end_offset(skb));
skb_reserve(skb, hlen);
+   skb->reserved_tailroom = skb_tailroom(skb) -
+   min_t(int, mtu, skb_tailroom(skb) - tlen);
 
if (__ipv6_get_lladdr(idev, &addr_buf, IFA_F_TENTATIVE)) {
/* :
-- 
2.7.0

Re: [PATCH net-next 5/5] vxlan: implement GPE in L3 mode

On Sat, 27 Feb 2016 20:21:59 +0100, Jiri Benc wrote:
> You mean returning ETH_P_TEB in skb->protocol? That's not much useful,
> unfortunately. You won't get such packet processed by the kernel IP
> stack, rendering the VXLAN-GPE device unusable outside of ovs. It would
> effectively became a packet sink when used standalone, as it cannot be
> added to bridge and received packets are not processed by anything -
> there's no protocol handler for ETH_P_TEB.

Actually, I can do that in the L3 mode (or whatever we'll call it). It
won't hurt anything and may be useful for openvswitch. Ovs will have to
special case VXLAN-GPE vport (or perhaps any ARPHRD_NONE port) and set
skb->protocol to ETH_P_TEB on xmit and dissect correctly the ETH_P_TEB
packet on rcv.

The L2 mode (or whatever we'll call it) will need to stay, though, for
non-ovs use cases.

 Jiri

Re: [PATCH net-next 1/5] vxlan: implement GPE in L2 mode

On Fri, 26 Feb 2016 15:51:29 -0800, Tom Herbert wrote:
> I don't think this is right. VXLAN-GPE is a separate protocol than
> VXLAN, they are not compatible on the wire and don't share flags or
> fields (for instance GPB uses bits in VXLAN that hold the next
> protocol in VXLAN-GPE). Neither is there a VXLAN_F_GPE flag defined in
> VXLAN to differentiate the two. So VXLAN-GPE would be used on a
> different port

Yes, and that's exactly what this patchset does. If there's the
VXLAN_F_GPE flag defined while creating the interface, the used UDP
port defaults to the VXLAN-GPE UDP port (but can be overriden) and the
driver expects that all packets received are VXLAN-GPE.

Note also that you can't define both GPE and GBP together, because as
you noted, they're not compatible. The driver correctly refuses such
combination.

> and probably needs its own rcv functions.

I don't see the need for code duplication. This patchset does exactly
what you described and reuses the code, as most of it is really the
same for all VXLAN modes. I also made sure this is as clean as possible
in the driver which was the reason for the previous 4 cleanup patchsets.

 Jiri

Re: [PATCH net-next 5/5] vxlan: implement GPE in L3 mode

On Fri, 26 Feb 2016 15:42:29 -0800, Tom Herbert wrote:
> Agreed, and I don't see why there even needs to be modes. VXLAN-GPE
> can carry arbitrary protocols with a next-header field. For Ethernet,
> MPLS, IPv4, and IPv6 it should just be a simple mapping of the next
> header to Ethertype for purposes of processing the payload.

That's exactly what this patchset does, Tom. The mapping is done in
vxlan_parse_gpe_hdr and vxlan_build_gpe_hdr.

Ethernet is special, though. It needs to be a standalone mode,
otherwise frames encapsulated including an Ethernet header wouldn't be
processed and there would be no way to send such packets - the only
distinction the driver can use is skb->protocol and that won't become
ETH_P_TEB magically.

 Jiri

Re: [PATCH net-next 5/5] vxlan: implement GPE in L3 mode

On Fri, 26 Feb 2016 14:22:03 -0800, Jesse Gross wrote:
> Given that VXLAN_GPE_MODE_L3 will eventually come to be used by NSH,
> MPLS, etc. in addition to IPv4/v6, most of which are not really L3, it
> seems like something along the lines of NO_ARP might be better since
> that's what it really indicates.

I have no problem naming this differently. Not sure NO_ARP is the best
name, though - this is more about absence of the L2 header in received
packets than about ARP.

> Once that is in, I don't really see
> the need to explicitly block Ethernet packets from being handled in
> this mode. If they are received, then they can just be handed off to
> the stack - at that point it would look like an extra header, the same
> as if an NSH packet is received.

You mean returning ETH_P_TEB in skb->protocol? That's not much useful,
unfortunately. You won't get such packet processed by the kernel IP
stack, rendering the VXLAN-GPE device unusable outside of ovs. It would
effectively became a packet sink when used standalone, as it cannot be
added to bridge and received packets are not processed by anything -
there's no protocol handler for ETH_P_TEB.

With this patchset, you can create a VXLAN-GPE interface and use it as
any other point to point interface, and it works as expected with
routing etc.

The distinction between Ethernet and no Ethernet is needed, the
interface won't work otherwise.

 Jiri

Re: [PATCH 3/3] 3c59x: Use setup_timer()

2016-02-27 Thread Stafford Horne

On Thu, 25 Feb 2016, David Miller wrote:

From: Amitoj Kaur Chawla 
Date: Wed, 24 Feb 2016 19:28:19 +0530

Convert a call to init_timer and accompanying intializations of
the timer's data and function fields to a call to setup_timer.

The Coccinelle semantic patch that fixes this problem is
as follows:

// 
@@
expression t,f,d;
@@

-init_timer(&t);
+setup_timer(&t,f,d);
 ...
-t.data = d;
-t.function = f;
// 

Signed-off-by: Amitoj Kaur Chawla 

Applied.

Hi David, Amitoj,

The patch here seemed to remove the call to add_timer(&vp->timer) which
applies the expires time. Would that be an issue?

-Stafford

RE: [PATCH net 1/3] r8169:fix nic sometimes doesn't work after changing the mac address.

2016-02-27 Thread Hau

 > Instead of taking the device out of suspended mode to perform the required
> action, the driver is moving to a model where 1) said action may be
> scheduled to a later time - or result from past time work - and 2) rpm handler
> must handle a lot of pm unrelated work.
> 
> rtl8169_ethtool_ops.{get_wol, get_regs, get_settings} aren't even fixed yet
> (what about the .set_xyz handlers ?).
> 
> I can't help thinking that the driver should return to a state where it 
> stupidly
> does what it is asked to. No software caching, plain device access, resume
> when needed, suspend as "suspend" instead of suspend as "anticipate
> whatever may happen to avoid waking up".
>

This rpm related patches just the workaround for the issues reported by end 
users.  As you say, the Linux kernel should handle these events when driver is 
in runtime suspend state.
 
--Please consider the environment before printing this e-mail.

Softirq priority inversion from "softirq: reduce latencies"

Hi Eric,

For a while now, we've been struggling to understand why we've been
observing missed uart rx DMA.

Because both the uart driver (omap8250) and the dmaengine driver
(edma) were (relatively) new, we assumed there was some race between
starting a new rx DMA and processing the previous one.

However, after instrumenting both the uart driver and the dmaengine
driver, what we've observed is huge anomalous latencies between receiving
the DMA interrupt and servicing the DMA tasklet.

For example, at 3Mbaud we recorded the following distribution of
softirq[TASKLET] service latency for this specific DMA channel:

root@black:/sys/kernel/debug/edma# cat 35
latency(us):   0+   20+   40+   60+   80+  100+  120+  140+  160+  180+  200+  
220+  240+  260+  280+  300+  320+  340+  360+  380+
   195681335315 7 4 3 1 0 0 0   
  1 4 6 1 0 0 0 0 0

As you can see, the vast majority of tasklet service happens immediately,
tapering off to 140+us.

However, note the island of distribution at 220~300 [latencies beyond 300+
are not recorded because the uart fifo has filled again by this point and
dma must be aborted].

So I cribbed together a latency tracer to catch what was happening at
the extreme, and what it caught was a priority inversion stemming from
your commit:

   commit c10d73671ad30f54692f7f69f0e09e75d3a8926a
   Author: Eric Dumazet 
   Date:   Thu Jan 10 15:26:34 2013 -0800

   softirq: reduce latencies

   In various network workloads, __do_softirq() latencies can be up
   to 20 ms if HZ=1000, and 200 ms if HZ=100.

   This is because we iterate 10 times in the softirq dispatcher,
   and some actions can consume a lot of cycles.


In the trace below [1], the trace begins in the edma completion interrupt
handler when the tasklet is scheduled; the edma interrupt has occurred during
NET_RX softirq context.

However, instead of causing a restart of the softirq loop to process the
tasklet _which occurred before sshd was scheduled_, the softirq loop is
aborted and deferred for ksoftirqd. The tasklet is not serviced for 521us,
which is way too long, so DMA was aborted.

Your patch has effectively inverted the priority of tasklets with normal
pri/nice processes that have merely received a network packet.

ISTM, the problem you're trying to solve here was caused by NET_RX softirq
to begin with, and maybe that thing needs a diet.

But rather than outright reverting your patch, what if more selective
conditions are used to abort the softirq restart? What would those conditions
be? In the netperf benchmark you referred to in that commit, is it just
NET_TX/NET_RX softirqs that are causing scheduling latencies?

It just doesn't make sense to special case for a workload that isn't
even running.


Regards,
Peter Hurley


[1] softirq tasklet latency trace  (apologies that it's only events - full
function trace introduces too much overhead)

# tracer: latency
#
# latency latency trace v1.1.5 on 4.5.0-rc2+
# 
# latency: 476 us, #59/59, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0)
#-
#| task: sshd-750 (uid:1000 nice:0 policy:0 rt_prio:0)
#-
#  => started at: __tasklet_schedule  
#  => ended at:   tasklet_action
#
#
#  _--=> CPU#
# / _-=> irqs-off 
#| / _=> need-resched
#|| / _---=> hardirq/softirq
#||| / _--=> preempt-depth
# / delay
#  cmd pid   | time  |   caller
# \   /  |  \|   /
  -0   0d.H31us : __tasklet_schedule
  -0   0d.H43us : softirq_raise: vec=6 [action=TASKLET]
  -0   0d.H36us : irq_handler_exit: irq=20 ret=handled
  -0   0..s2   15us : kmem_cache_alloc: call_site=c08378e4 
ptr=de55d7c0 bytes_req=192 bytes_alloc=192 gfp_flags=GFP_ATOMIC
  -0   0..s2   23us : netif_receive_skb_entry: dev=eth0 napi_id=0x0 
queue_mapping=0 skbaddr=dca04400 vlan_tagged=0 vlan_proto=0x vlan_tci=0x000
0 protocol=0x0800 ip_summed=0 hash=0x l4_hash=0 len=88 data_len=0 
truesize=1984 mac_header_valid=1 mac_header=-14 nr_frags=0 gso_size=0 
gso_type=0x0
  -0   0..s2   30us+: netif_receive_skb: dev=eth0 skbaddr=dca04400 
len=88
  -0   0d.s5   98us : sched_waking: comm=sshd pid=750 prio=120 
target_cpu=000
  -0   0d.s6  105us : sched_stat_sleep: comm=sshd pid=750 
delay=3125230447 [ns]
  -0   0dns6  110us+: sched_wakeup: comm=sshd pid=750 prio=120 
target_cpu=000
  -0   0dns4  123us+: timer_start: timer=dc940e9c 
function=tcp_delack_timer expires=9746 [timeout=10] flags=0x
  -0   0dnH3  150us : irq_handler_entry: irq=176 
name=4a10.ethernet
  -0   0dnH3  153us : softirq_raise: vec=3 [action=NET_RX]
  -0   0dnH3  155us : irq_handler_exit: irq=176 ret=handled
  -0   0dnH3  160us : irq_handler_entry:

[PATCH net-next] net: ipv6/l3mdev: Move host route on saved address if necessary

2016-02-27 Thread David Ahern

Commit f1705ec197e70 allows IPv6 addresses to be retained on a link down.
The address can have a cached host route which can point to the wrong
FIB table if the L3 enslavement is changed (e.g., route can point to local
table instead of VRF table if device is added to an L3 domain).

On link up check the table of the cached host route against the FIB
table associated with the device and correct if needed.

Signed-off-by: David Ahern 
---
Normally the 'if CONFIG_NET_L3_MASTER_DEV is enabled' checks are all
done in l3mdev.h. In this case putting the functions in the l3mdev
header requires adding ipv6 header files which blows up compiles.

 net/ipv6/addrconf.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index a2d6f6c242af..afab4c359b5b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3170,9 +3170,35 @@ static void addrconf_gre_config(struct net_device *dev)
 }
 #endif
 
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+/* If the host route is cached on the addr struct make sure it is associated
+ * with the proper table. e.g., enslavement can change and if so the cached
+ * host route needs to move to the new table.
+ */
+static void l3mdev_check_host_rt(struct inet6_dev *idev,
+ struct inet6_ifaddr *ifp)
+{
+   if (ifp->rt) {
+   u32 tb_id = l3mdev_fib_table(idev->dev) ? : RT6_TABLE_LOCAL;
+
+   if (tb_id != ifp->rt->rt6i_table->tb6_id) {
+   ip6_del_rt(ifp->rt);
+   ifp->rt = NULL;
+   }
+   }
+}
+#else
+static void l3mdev_check_host_rt(struct inet6_dev *idev,
+ struct inet6_ifaddr *ifp)
+{
+}
+#endif
+
 static int fixup_permanent_addr(struct inet6_dev *idev,
struct inet6_ifaddr *ifp)
 {
+   l3mdev_check_host_rt(idev, ifp);
+
if (!ifp->rt) {
struct rt6_info *rt;
 
-- 
2.1.4

[PATCH] wan: lmc: Switch to using managed resources

2016-02-27 Thread Amitoj Kaur Chawla

Use managed resource functions devm_kzalloc and pcim_enable_device
to simplify error handling. Subsequently, remove unnecessary
kfree, pci_disable_device and pci_release_regions.

To be compatible with the change, various gotos are replaced with
direct returns and unneeded labels are dropped.

Also, `sc` was only being freed in the probe function and not the
remove function before the change. By using devm_kzalloc this patch
also fixes this memory leak.

Signed-off-by: Amitoj Kaur Chawla 
---
I was not able to find anywhere that `sc` might be freed. However, 
if a free has been overlooked, there will be a double free, due to 
the free implicitly performed by devm_kzalloc.

 drivers/net/wan/lmc/lmc_main.c | 27 +++
 1 file changed, 7 insertions(+), 20 deletions(-)

diff --git a/drivers/net/wan/lmc/lmc_main.c b/drivers/net/wan/lmc/lmc_main.c
index 317bc79..bb33b24 100644
--- a/drivers/net/wan/lmc/lmc_main.c
+++ b/drivers/net/wan/lmc/lmc_main.c
@@ -826,7 +826,7 @@ static int lmc_init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 
/* lmc_trace(dev, "lmc_init_one in"); */
 
-   err = pci_enable_device(pdev);
+   err = pcim_enable_device(pdev);
if (err) {
printk(KERN_ERR "lmc: pci enable failed: %d\n", err);
return err;
@@ -835,23 +835,20 @@ static int lmc_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
err = pci_request_regions(pdev, "lmc");
if (err) {
printk(KERN_ERR "lmc: pci_request_region failed\n");
-   goto err_req_io;
+   return err;
}
 
/*
 * Allocate our own device structure
 */
-   sc = kzalloc(sizeof(lmc_softc_t), GFP_KERNEL);
-   if (!sc) {
-   err = -ENOMEM;
-   goto err_kzalloc;
-   }
+   sc = devm_kzalloc(&pdev->dev, sizeof(lmc_softc_t), GFP_KERNEL);
+   if (!sc)
+   return -ENOMEM;
 
dev = alloc_hdlcdev(sc);
if (!dev) {
printk(KERN_ERR "lmc:alloc_netdev for device failed\n");
-   err = -ENOMEM;
-   goto err_hdlcdev;
+   return -ENOMEM;
}
 
 
@@ -888,7 +885,7 @@ static int lmc_init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
if (err) {
printk(KERN_ERR "%s: register_netdev failed.\n", dev->name);
free_netdev(dev);
-   goto err_hdlcdev;
+   return err;
}
 
 sc->lmc_cardtype = LMC_CARDTYPE_UNKNOWN;
@@ -971,14 +968,6 @@ static int lmc_init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 
 lmc_trace(dev, "lmc_init_one out");
 return 0;
-
-err_hdlcdev:
-   kfree(sc);
-err_kzalloc:
-   pci_release_regions(pdev);
-err_req_io:
-   pci_disable_device(pdev);
-   return err;
 }
 
 /*
@@ -992,8 +981,6 @@ static void lmc_remove_one(struct pci_dev *pdev)
printk(KERN_DEBUG "%s: removing...\n", dev->name);
unregister_hdlc_device(dev);
free_netdev(dev);
-   pci_release_regions(pdev);
-   pci_disable_device(pdev);
}
 }
 
-- 
1.9.1

Re: [V9fs-developer] [PATCH] net/9p: convert to new CQ API

2016-02-27 Thread Dominique Martinet

Hi,

Couple of checkpatch complains:

Christoph Hellwig wrote on Sat, Feb 27, 2016:
> -struct p9_rdma_context {
> - enum ib_wc_opcode wc_op;
> +struct p9_rdma_context { 

trailing tab

> - p9_debug(P9_DEBUG_ERROR, "req %p err %d status %d\n", req, err, status);
> + p9_debug(P9_DEBUG_ERROR, "req %p err %d status %d\n", req, err, 
> wc->status);

line over 80 chars


That aside it looks good ; I need to check on the new API (hadn't
noticed the change) but it looks nice.

Will do the actual testing likely only next week only though;
Eric has been taking my patches for 9p/RDMA so I suspect he'll take
your's as well eventually (get_maintainer.pl has a long-ish list of CC
for us usually)


BTW I think it's easy enough to do the testing if you have a server that
can dish it out. diod[1] and nfs-ganesha[2] are the only two I'm aware
of but there might be more (using ganesha myself; happy to help you set
it up in private if you need)

[1] https://github.com/chaos/diod
[2] https://github.com/nfs-ganesha/nfs-ganesha

-- 
Dominique Martinet

[net-next-2.6 v4 0/3] net_sched: Add support for IFE action

From: Jamal Hadi Salim 

As agreed at netconf in Seville, here's the patch finally (1 year
was just too long to wait for an ethertype. Now we are just going
have the user configure one).
Described in netdev01 paper:
"Distributing Linux Traffic Control Classifier-Action Subsystem"
 Authors: Jamal Hadi Salim and Damascene M. Joachimpillai

The original motivation and deployment of this work was to horizontally
scale packet processing at scope of a chasis or rack. This means one
could take a tc policy and split it across machines connected over
L2. The paper refers to this as "pipeline stage indexing". Other
use cases which evolved out of the original intent include but are
not limited to carrying OAM information, carrying exception handling
metadata, carrying programmed authentication and authorization information,
encapsulating programmed compliance information, service IDs etc.
Read the referenced paper for more details.

The architecture allows for incremental updates for new metadatum support
to cover different use cases.
This patch set includes support for basic skb metadatum.
Followup patches will have more examples of metadata and other features.

v4 changes:
Integrate more feedback from Cong 

v3 changes:
Integrate with the new namespace changes 
Remove skbhash and queue mapping metadata (but keep their claim for ids)
Integrate feedback from Cong 
Integrate feedback from Daniel

v2 changes:
Remove module option for an upper bound of metadata
Integrate feedback from Cong 
Integrate feedback from Daniel

Jamal Hadi Salim (3):
  introduce IFE action
  Support to encoding decoding skb mark on IFE action
  Support to encoding decoding skb prio on IFE action

 include/net/tc_act/tc_ife.h|  61 +++
 include/uapi/linux/tc_act/tc_ife.h |  38 ++
 net/sched/Kconfig  |  22 +
 net/sched/Makefile |   3 +
 net/sched/act_ife.c| 883 +
 net/sched/act_meta_mark.c  |  79 
 net/sched/act_meta_skbprio.c   |  76 
 7 files changed, 1162 insertions(+)
 create mode 100644 include/net/tc_act/tc_ife.h
 create mode 100644 include/uapi/linux/tc_act/tc_ife.h
 create mode 100644 net/sched/act_ife.c
 create mode 100644 net/sched/act_meta_mark.c
 create mode 100644 net/sched/act_meta_skbprio.c

-- 
1.9.1

[net-next-2.6 PATCH v4 2/3] Support to encoding decoding skb mark on IFE action

From: Jamal Hadi Salim 

Example usage:
Set the skb using skbedit then allow it to be encoded

sudo tc qdisc add dev $ETH root handle 1: prio
sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbedit mark 17 \
action ife encode \
allow mark \
dst 02:15:15:15:15:15

Note: You dont need the skbedit action if you are already encoding the
skb mark earlier. A zero skb mark, when seen, will not be encoded.

Alternative hard code static mark of 0x12 every time the filter matches

sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action ife encode \
type 0xDEAD \
use mark 0x12 \
dst 02:15:15:15:15:15

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/Kconfig |  5 +++
 net/sched/Makefile|  1 +
 net/sched/act_meta_mark.c | 79 +++
 3 files changed, 85 insertions(+)
 create mode 100644 net/sched/act_meta_mark.c

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 4d48ef5..85854c0 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -751,6 +751,11 @@ config NET_ACT_IFE
  To compile this code as a module, choose M here: the
  module will be called act_ife.
 
+config NET_IFE_SKBMARK
+tristate "Support to encoding decoding skb mark on IFE action"
+depends on NET_ACT_IFE
+---help---
+
 config NET_CLS_IND
bool "Incoming device classification"
depends on NET_CLS_U32 || NET_CLS_FW
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 3d17667..3f7a182 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_NET_ACT_VLAN)+= act_vlan.o
 obj-$(CONFIG_NET_ACT_BPF)  += act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.o
 obj-$(CONFIG_NET_ACT_IFE)  += act_ife.o
+obj-$(CONFIG_NET_IFE_SKBMARK)  += act_meta_mark.o
 obj-$(CONFIG_NET_SCH_FIFO) += sch_fifo.o
 obj-$(CONFIG_NET_SCH_CBQ)  += sch_cbq.o
 obj-$(CONFIG_NET_SCH_HTB)  += sch_htb.o
diff --git a/net/sched/act_meta_mark.c b/net/sched/act_meta_mark.c
new file mode 100644
index 000..8289217
--- /dev/null
+++ b/net/sched/act_meta_mark.c
@@ -0,0 +1,79 @@
+/*
+ * net/sched/act_meta_mark.c IFE skb->mark metadata module
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * copyright Jamal Hadi Salim (2015)
+ *
+*/
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int skbmark_encode(struct sk_buff *skb, void *skbdata,
+ struct tcf_meta_info *e)
+{
+   u32 ifemark = skb->mark;
+
+   return ife_encode_meta_u32(ifemark, skbdata, e);
+}
+
+static int skbmark_decode(struct sk_buff *skb, void *data, u16 len)
+{
+   u32 ifemark = *(u32 *)data;
+
+   skb->mark = ntohl(ifemark);
+   return 0;
+}
+
+static int skbmark_check(struct sk_buff *skb, struct tcf_meta_info *e)
+{
+   return ife_check_meta_u32(skb->mark, e);
+}
+
+static struct tcf_meta_ops ife_skbmark_ops = {
+   .metaid = IFE_META_SKBMARK,
+   .metatype = NLA_U32,
+   .name = "skbmark",
+   .synopsis = "skb mark 32 bit metadata",
+   .check_presence = skbmark_check,
+   .encode = skbmark_encode,
+   .decode = skbmark_decode,
+   .get = ife_get_meta_u32,
+   .alloc = ife_alloc_meta_u32,
+   .release = ife_release_meta_gen,
+   .validate = ife_validate_meta_u32,
+   .owner = THIS_MODULE,
+};
+
+static int __init ifemark_init_module(void)
+{
+   return register_ife_op(&ife_skbmark_ops);
+}
+
+static void __exit ifemark_cleanup_module(void)
+{
+   unregister_ife_op(&ife_skbmark_ops);
+}
+
+module_init(ifemark_init_module);
+module_exit(ifemark_cleanup_module);
+
+MODULE_AUTHOR("Jamal Hadi Salim(2015)");
+MODULE_DESCRIPTION("Inter-FE skb mark metadata module");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_IFE_META(IFE_META_SKBMARK);
-- 
1.9.1

[net-next-2.6 PATCH v4 3/3] Support to encoding decoding skb prio on IFE action

From: Jamal Hadi Salim 

Example usage:
Set the skb priority using skbedit then allow it to be encoded

sudo tc qdisc add dev $ETH root handle 1: prio
sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action skbedit prio 17 \
action ife encode \
allow prio \
dst 02:15:15:15:15:15

Note: You dont need the skbedit action if you are already encoding the
skb priority earlier. A zero skb priority will not be sent

Alternative hard code static priority of decimal 33 (unlike skbedit)
then mark of 0x12 every time the filter matches

sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
u32 match ip protocol 1 0xff flowid 1:2 \
action ife encode \
type 0xDEAD \
use prio 33 \
use mark 0x12 \
dst 02:15:15:15:15:15

Signed-off-by: Jamal Hadi Salim 
---
 net/sched/Kconfig|  5 +++
 net/sched/Makefile   |  1 +
 net/sched/act_meta_skbprio.c | 76 
 3 files changed, 82 insertions(+)
 create mode 100644 net/sched/act_meta_skbprio.c

diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 85854c0..b148302 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -756,6 +756,11 @@ config NET_IFE_SKBMARK
 depends on NET_ACT_IFE
 ---help---
 
+config NET_IFE_SKBPRIO
+tristate "Support to encoding decoding skb prio on IFE action"
+depends on NET_ACT_IFE
+---help---
+
 config NET_CLS_IND
bool "Incoming device classification"
depends on NET_CLS_U32 || NET_CLS_FW
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 3f7a182..84bddb3 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -21,6 +21,7 @@ obj-$(CONFIG_NET_ACT_BPF) += act_bpf.o
 obj-$(CONFIG_NET_ACT_CONNMARK) += act_connmark.o
 obj-$(CONFIG_NET_ACT_IFE)  += act_ife.o
 obj-$(CONFIG_NET_IFE_SKBMARK)  += act_meta_mark.o
+obj-$(CONFIG_NET_IFE_SKBPRIO)  += act_meta_skbprio.o
 obj-$(CONFIG_NET_SCH_FIFO) += sch_fifo.o
 obj-$(CONFIG_NET_SCH_CBQ)  += sch_cbq.o
 obj-$(CONFIG_NET_SCH_HTB)  += sch_htb.o
diff --git a/net/sched/act_meta_skbprio.c b/net/sched/act_meta_skbprio.c
new file mode 100644
index 000..26bf4d8
--- /dev/null
+++ b/net/sched/act_meta_skbprio.c
@@ -0,0 +1,76 @@
+/*
+ * net/sched/act_meta_prio.c IFE skb->priority metadata module
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * copyright Jamal Hadi Salim (2015)
+ *
+*/
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int skbprio_check(struct sk_buff *skb, struct tcf_meta_info *e)
+{
+   return ife_check_meta_u32(skb->priority, e);
+}
+
+static int skbprio_encode(struct sk_buff *skb, void *skbdata,
+ struct tcf_meta_info *e)
+{
+   u32 ifeprio = skb->priority; /* avoid having to cast skb->priority*/
+
+   return ife_encode_meta_u32(ifeprio, skbdata, e);
+}
+
+static int skbprio_decode(struct sk_buff *skb, void *data, u16 len)
+{
+   u32 ifeprio = *(u32 *)data;
+
+   skb->priority = ntohl(ifeprio);
+   return 0;
+}
+
+static struct tcf_meta_ops ife_prio_ops = {
+   .metaid = IFE_META_PRIO,
+   .metatype = NLA_U32,
+   .name = "skbprio",
+   .synopsis = "skb prio metadata",
+   .check_presence = skbprio_check,
+   .encode = skbprio_encode,
+   .decode = skbprio_decode,
+   .get = ife_get_meta_u32,
+   .alloc = ife_alloc_meta_u32,
+   .owner = THIS_MODULE,
+};
+
+static int __init ifeprio_init_module(void)
+{
+   return register_ife_op(&ife_prio_ops);
+}
+
+static void __exit ifeprio_cleanup_module(void)
+{
+   unregister_ife_op(&ife_prio_ops);
+}
+
+module_init(ifeprio_init_module);
+module_exit(ifeprio_cleanup_module);
+
+MODULE_AUTHOR("Jamal Hadi Salim(2015)");
+MODULE_DESCRIPTION("Inter-FE skb prio metadata action");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_IFE_META(IFE_META_PRIO);
-- 
1.9.1

[net-next-2.6 PATCH v4 1/3] introduce IFE action

From: Jamal Hadi Salim 

This action allows for a sending side to encapsulate arbitrary metadata
which is decapsulated by the receiving end.
The sender runs in encoding mode and the receiver in decode mode.
Both sender and receiver must specify the same ethertype.
At some point we hope to have a registered ethertype and we'll
then provide a default so the user doesnt have to specify it.
For now we enforce the user specify it.

Lets show example usage where we encode icmp from a sender towards
a receiver with an skbmark of 17; both sender and receiver use
ethertype of 0xdead to interop.

: Lets start with Receiver-side policy config:
xxx: add an ingress qdisc
sudo tc qdisc add dev $ETH ingress

xxx: any packets with ethertype 0xdead will be subjected to ife decoding
xxx: we then restart the classification so we can match on icmp at prio 3
sudo $TC filter add dev $ETH parent : prio 2 protocol 0xdead \
u32 match u32 0 0 flowid 1:1 \
action ife decode reclassify

xxx: on restarting the classification from above if it was an icmp
xxx: packet, then match it here and continue to the next rule at prio 4
xxx: which will match based on skb mark of 17
sudo tc filter add dev $ETH parent : prio 3 protocol ip \
u32 match ip protocol 1 0xff flowid 1:1 \
action continue

xxx: match on skbmark of 0x11 (decimal 17) and accept
sudo tc filter add dev $ETH parent : prio 4 protocol ip \
handle 0x11 fw flowid 1:1 \
action ok

xxx: Lets show the decoding policy
sudo tc -s filter ls dev $ETH parent : protocol 0xdead
xxx:
filter pref 2 u32
filter pref 2 u32 fh 800: ht divisor 1
filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1  (rule hit 
0 success 0)
  match / at 0 (success 0 )
action order 1: ife decode action reclassify
 index 1 ref 1 bind 1 installed 14 sec used 14 sec
 type: 0x0
 Metadata: allow mark allow hash allow prio allow qmap
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
xxx:
Observe that above lists all metadatum it can decode. Typically these
submodules will already be compiled into a monolithic kernel or
loaded as modules

: Lets show the sender side now ..

xxx: Add an egress qdisc on the sender netdev
sudo tc qdisc add dev $ETH root handle 1: prio
xxx:
xxx: Match all icmp packets to 192.168.122.237/24, then
xxx: tag the packet with skb mark of decimal 17, then
xxx: Encode it with:
xxx:ethertype 0xdead
xxx:add skb->mark to whitelist of metadatum to send
xxx:rewrite target dst MAC address to 02:15:15:15:15:15
xxx:
sudo $TC filter add dev $ETH parent 1: protocol ip prio 10  u32 \
match ip dst 192.168.122.237/24 \
match ip protocol 1 0xff \
flowid 1:2 \
action skbedit mark 17 \
action ife encode \
type 0xDEAD \
allow mark \
dst 02:15:15:15:15:15

xxx: Lets show the encoding policy
sudo tc -s filter ls dev $ETH parent 1: protocol ip
xxx:
filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2  (rule 
hit 0 success 0)
  match c0a87aed/ at 16 (success 0 )
  match 0001/00ff at 8 (success 0 )

action order 1:  skbedit mark 17
 index 6 ref 1 bind 1
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

action order 2: ife encode action pipe
 index 3 ref 1 bind 1
 dst MAC: 02:15:15:15:15:15 type: 0xDEAD
 Metadata: allow mark
Action statistics:
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
xxx:

test by sending ping from sender to destination

Signed-off-by: Jamal Hadi Salim 
---
 include/net/tc_act/tc_ife.h|  61 +++
 include/uapi/linux/tc_act/tc_ife.h |  38 ++
 net/sched/Kconfig  |  12 +
 net/sched/Makefile |   1 +
 net/sched/act_ife.c| 870 +
 5 files changed, 982 insertions(+)
 create mode 100644 include/net/tc_act/tc_ife.h
 create mode 100644 include/uapi/linux/tc_act/tc_ife.h
 create mode 100644 net/sched/act_ife.c

diff --git a/include/net/tc_act/tc_ife.h b/include/net/tc_act/tc_ife.h
new file mode 100644
index 000..dc9a09a
--- /dev/null
+++ b/include/net/tc_act/tc_ife.h
@@ -0,0 +1,61 @@
+#ifndef __NET_TC_IFE_H
+#define __NET_TC_IFE_H
+
+#include 
+#include 
+#include 
+#include 
+
+#define IFE_METAHDRLEN 2
+struct tcf_ife_info {
+   struct tcf_common common;
+   u8 eth_dst[ETH_ALEN];
+   u8 eth_src[ETH_ALEN];
+   u16 eth_type;
+   u16 flags;
+   /* list of metaids allowed */
+   struct list_head metalist;
+};
+#define to_ife(a) \
+   container_of(a->priv, struct tcf_ife_info, common)
+
+struct tcf_meta_info {
+   const struct tcf_meta_ops *ops;
+   void *metaval;
+   u16 metaid;
+   struct list_head metalist;
+};
+
+struct tcf_meta_ops {
+

Re: [net-next-2.6 v3 1/3] introduce IFE action


On 16-02-26 06:49 PM, Cong Wang wrote:

On Fri, Feb 26, 2016 at 2:43 PM, Jamal Hadi Salim  wrote:

[...]


Just some quick reviews... ;)


;->

Ok, update in a little while after some basic testing...

cheers,
jamal

Re: [patch] rocker: fix an error code

2016-02-27 Thread Jiri Pirko

Sat, Feb 27, 2016 at 12:31:43PM CET, dan.carpen...@oracle.com wrote:
>We intended to return PTR_ERR() here instead of 1.
>
>Fixes: 1f9993f6825f ('rocker: fix a neigh entry leak issue')
>Signed-off-by: Dan Carpenter 

Acked-by: Jiri Pirko

[patch] rocker: fix an error code

2016-02-27 Thread Dan Carpenter

We intended to return PTR_ERR() here instead of 1.

Fixes: 1f9993f6825f ('rocker: fix a neigh entry leak issue')
Signed-off-by: Dan Carpenter 
---
We recently moved rocker files around so this only applies to -next.
Probably returning the wrong error code is harmless.

diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c 
b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 099008a..07218c3 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -1449,7 +1449,7 @@ static int ofdpa_port_ipv4_resolve(struct ofdpa_port 
*ofdpa_port,
if (!n) {
n = neigh_create(&arp_tbl, &ip_addr, dev);
if (IS_ERR(n))
-   return IS_ERR(n);
+   return PTR_ERR(n);
}
 
/* If the neigh is already resolved, then go ahead and

net/9p: convert to new CQ API

2016-02-27 Thread Christoph Hellwig

Hi all,

who is maintaining the "RDMA transport" (1) for 9p?  Below patch converts
it to your new CQ API.  It's fairly trivial, but untested as I can't figure
out how to actually test this code.

[1] RDMA seems a bit of a misowner as it's never doing RDMA data transfers,
but that's a separate story :)

[PATCH] net/9p: convert to new CQ API

2016-02-27 Thread Christoph Hellwig

Trivial conversion to the new RDMA CQ API.

Signed-off-by: Christoph Hellwig 
---
 net/9p/trans_rdma.c | 87 +++--
 1 file changed, 31 insertions(+), 56 deletions(-)

diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
index 52b4a2f..668c3be 100644
--- a/net/9p/trans_rdma.c
+++ b/net/9p/trans_rdma.c
@@ -109,14 +109,13 @@ struct p9_trans_rdma {
 /**
  * p9_rdma_context - Keeps track of in-process WR
  *
- * @wc_op: The original WR op for when the CQE completes in error.
  * @busa: Bus address to unmap when the WR completes
  * @req: Keeps track of requests (send)
  * @rc: Keepts track of replies (receive)
  */
 struct p9_rdma_req;
-struct p9_rdma_context {
-   enum ib_wc_opcode wc_op;
+struct p9_rdma_context {   
+   struct ib_cqe cqe;
dma_addr_t busa;
union {
struct p9_req_t *req;
@@ -284,9 +283,12 @@ p9_cm_event_handler(struct rdma_cm_id *id, struct 
rdma_cm_event *event)
 }
 
 static void
-handle_recv(struct p9_client *client, struct p9_trans_rdma *rdma,
-   struct p9_rdma_context *c, enum ib_wc_status status, u32 byte_len)
+recv_done(struct ib_cq *cq, struct ib_wc *wc)
 {
+   struct p9_client *client = cq->cq_context;
+   struct p9_trans_rdma *rdma = client->trans;
+   struct p9_rdma_context *c =
+   container_of(wc->wr_cqe, struct p9_rdma_context, cqe);
struct p9_req_t *req;
int err = 0;
int16_t tag;
@@ -295,7 +297,7 @@ handle_recv(struct p9_client *client, struct p9_trans_rdma 
*rdma,
ib_dma_unmap_single(rdma->cm_id->device, c->busa, client->msize,
 DMA_FROM_DEVICE);
 
-   if (status != IB_WC_SUCCESS)
+   if (wc->status != IB_WC_SUCCESS)
goto err_out;
 
err = p9_parse_header(c->rc, NULL, NULL, &tag, 1);
@@ -316,21 +318,31 @@ handle_recv(struct p9_client *client, struct 
p9_trans_rdma *rdma,
req->rc = c->rc;
p9_client_cb(client, req, REQ_STATUS_RCVD);
 
+ out:
+   up(&rdma->rq_sem);
+   kfree(c);
return;
 
  err_out:
-   p9_debug(P9_DEBUG_ERROR, "req %p err %d status %d\n", req, err, status);
+   p9_debug(P9_DEBUG_ERROR, "req %p err %d status %d\n", req, err, 
wc->status);
rdma->state = P9_RDMA_FLUSHING;
client->status = Disconnected;
+   goto out;
 }
 
 static void
-handle_send(struct p9_client *client, struct p9_trans_rdma *rdma,
-   struct p9_rdma_context *c, enum ib_wc_status status, u32 byte_len)
+send_done(struct ib_cq *cq, struct ib_wc *wc)
 {
+   struct p9_client *client = cq->cq_context;
+   struct p9_trans_rdma *rdma = client->trans;
+   struct p9_rdma_context *c =
+   container_of(wc->wr_cqe, struct p9_rdma_context, cqe);
+
ib_dma_unmap_single(rdma->cm_id->device,
c->busa, c->req->tc->size,
DMA_TO_DEVICE);
+   up(&rdma->sq_sem);
+   kfree(c);
 }
 
 static void qp_event_handler(struct ib_event *event, void *context)
@@ -339,42 +351,6 @@ static void qp_event_handler(struct ib_event *event, void 
*context)
 event->event, context);
 }
 
-static void cq_comp_handler(struct ib_cq *cq, void *cq_context)
-{
-   struct p9_client *client = cq_context;
-   struct p9_trans_rdma *rdma = client->trans;
-   int ret;
-   struct ib_wc wc;
-
-   ib_req_notify_cq(rdma->cq, IB_CQ_NEXT_COMP);
-   while ((ret = ib_poll_cq(cq, 1, &wc)) > 0) {
-   struct p9_rdma_context *c = (void *) (unsigned long) wc.wr_id;
-
-   switch (c->wc_op) {
-   case IB_WC_RECV:
-   handle_recv(client, rdma, c, wc.status, wc.byte_len);
-   up(&rdma->rq_sem);
-   break;
-
-   case IB_WC_SEND:
-   handle_send(client, rdma, c, wc.status, wc.byte_len);
-   up(&rdma->sq_sem);
-   break;
-
-   default:
-   pr_err("unexpected completion type, c->wc_op=%d, 
wc.opcode=%d, status=%d\n",
-  c->wc_op, wc.opcode, wc.status);
-   break;
-   }
-   kfree(c);
-   }
-}
-
-static void cq_event_handler(struct ib_event *e, void *v)
-{
-   p9_debug(P9_DEBUG_ERROR, "CQ event %d context %p\n", e->event, v);
-}
-
 static void rdma_destroy_trans(struct p9_trans_rdma *rdma)
 {
if (!rdma)
@@ -387,7 +363,7 @@ static void rdma_destroy_trans(struct p9_trans_rdma *rdma)
ib_dealloc_pd(rdma->pd);
 
if (rdma->cq && !IS_ERR(rdma->cq))
-   ib_destroy_cq(rdma->cq);
+   ib_free_cq(rdma->cq);
 
if (rdma->cm_id && !IS_ERR(rdma->cm_id))
rdma_destroy_id(rdma->cm_id);
@@ -408,13 +384,14 @@ post_recv(struct p9_client *client, struct 
p9_rdma_context *c)
if (ib_dma_mapping_error(rd

[PATCH v2 2/3] net: ipv4: tcp_probe: Replace timespec with timespec64

TCP probe log timestamps use struct timespec which is
not y2038 safe. Even though timespec might be good enough here
as it is used to represent delta time, the plan is to get rid
of all uses of timespec in the kernel.
Replace with struct timespec64 which is y2038 safe.

Prints still use unsigned long format and type.

Signed-off-by: Deepa Dinamani 
Reviewed-by: Arnd Bergmann 
Cc: "David S. Miller" 
Cc: Alexey Kuznetsov 
Cc: James Morris 
Cc: Hideaki YOSHIFUJI 
Cc: Patrick McHardy 
---
 net/ipv4/tcp_probe.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
index ebf5ff5..f6c50af 100644
--- a/net/ipv4/tcp_probe.c
+++ b/net/ipv4/tcp_probe.c
@@ -187,13 +187,13 @@ static int tcpprobe_sprint(char *tbuf, int n)
 {
const struct tcp_log *p
= tcp_probe.log + tcp_probe.tail;
-   struct timespec tv
-   = ktime_to_timespec(ktime_sub(p->tstamp, tcp_probe.start));
+   struct timespec64 ts
+   = ktime_to_timespec64(ktime_sub(p->tstamp, tcp_probe.start));
 
return scnprintf(tbuf, n,
"%lu.%09lu %pISpc %pISpc %d %#x %#x %u %u %u %u %u\n",
-   (unsigned long)tv.tv_sec,
-   (unsigned long)tv.tv_nsec,
+   (unsigned long)ts.tv_sec,
+   (unsigned long)ts.tv_nsec,
&p->src, &p->dst, p->length, p->snd_nxt, p->snd_una,
p->snd_cwnd, p->ssthresh, p->snd_wnd, p->srtt, 
p->rcv_wnd);
 }
-- 
1.9.1

[PATCH v2 3/3] net: sctp: Convert log timestamps to be y2038 safe

SCTP probe log timestamps use struct timespec which is
not y2038 safe.
Use struct timespec64 which is 2038 safe instead.

Use monotonic time instead of real time as only time
differences are logged.

Signed-off-by: Deepa Dinamani 
Reviewed-by: Arnd Bergmann 
Acked-by: Neil Horman 
Cc: Vlad Yasevich 
Cc: Neil Horman 
Cc: "David S. Miller" 
Cc: linux-s...@vger.kernel.org
---
 net/sctp/probe.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/sctp/probe.c b/net/sctp/probe.c
index 5e68b94..6cc2152 100644
--- a/net/sctp/probe.c
+++ b/net/sctp/probe.c
@@ -65,7 +65,7 @@ static struct {
struct kfifo  fifo;
spinlock_tlock;
wait_queue_head_t wait;
-   struct timespec   tstart;
+   struct timespec64 tstart;
 } sctpw;
 
 static __printf(1, 2) void printl(const char *fmt, ...)
@@ -85,7 +85,7 @@ static __printf(1, 2) void printl(const char *fmt, ...)
 static int sctpprobe_open(struct inode *inode, struct file *file)
 {
kfifo_reset(&sctpw.fifo);
-   getnstimeofday(&sctpw.tstart);
+   ktime_get_ts64(&sctpw.tstart);
 
return 0;
 }
@@ -138,7 +138,7 @@ static sctp_disposition_t jsctp_sf_eat_sack(struct net *net,
struct sk_buff *skb = chunk->skb;
struct sctp_transport *sp;
static __u32 lcwnd = 0;
-   struct timespec now;
+   struct timespec64 now;
 
sp = asoc->peer.primary_path;
 
@@ -149,8 +149,8 @@ static sctp_disposition_t jsctp_sf_eat_sack(struct net *net,
(full || sp->cwnd != lcwnd)) {
lcwnd = sp->cwnd;
 
-   getnstimeofday(&now);
-   now = timespec_sub(now, sctpw.tstart);
+   ktime_get_ts64(&now);
+   now = timespec64_sub(now, sctpw.tstart);
 
printl("%lu.%06lu ", (unsigned long) now.tv_sec,
   (unsigned long) now.tv_nsec / NSEC_PER_USEC);
-- 
1.9.1

[PATCH v2 1/3] net: ipv4: Convert IP network timestamps to be y2038 safe

ICMP timestamp messages and IP source route options require
timestamps to be in milliseconds modulo 24 hours from
midnight UT format.

Add inet_current_timestamp() function to support this. The function
returns the required timestamp in network byte order.

Timestamp calculation is also changed to call ktime_get_real_ts64()
which uses struct timespec64. struct timespec64 is y2038 safe.
Previously it called getnstimeofday() which uses struct timespec.
struct timespec is not y2038 safe.

Signed-off-by: Deepa Dinamani 
Cc: "David S. Miller" 
Cc: Alexey Kuznetsov 
Cc: Hideaki YOSHIFUJI 
Cc: James Morris 
Cc: Patrick McHardy 
---
 include/net/ip.h  |  2 ++
 net/ipv4/af_inet.c| 26 ++
 net/ipv4/icmp.c   |  5 +
 net/ipv4/ip_options.c | 14 ++
 4 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 1a98f1c..5d3a9eb 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -240,6 +240,8 @@ static inline int inet_is_local_reserved_port(struct net 
*net, int port)
 }
 #endif
 
+__be32 inet_current_timestamp(void);
+
 /* From inetpeer.c */
 extern int inet_peer_threshold;
 extern int inet_peer_minttl;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index eade66d..408e2b3 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1386,6 +1386,32 @@ out:
return pp;
 }
 
+#define SECONDS_PER_DAY86400
+
+/* inet_current_timestamp - Return IP network timestamp
+ *
+ * Return milliseconds since midnight in network byte order.
+ */
+__be32 inet_current_timestamp(void)
+{
+   u32 secs;
+   u32 msecs;
+   struct timespec64 ts;
+
+   ktime_get_real_ts64(&ts);
+
+   /* Get secs since midnight. */
+   (void)div_u64_rem(ts.tv_sec, SECONDS_PER_DAY, &secs);
+   /* Convert to msecs. */
+   msecs = secs * MSEC_PER_SEC;
+   /* Convert nsec to msec. */
+   msecs += (u32)ts.tv_nsec / NSEC_PER_MSEC;
+
+   /* Convert to network byte order. */
+   return htons(msecs);
+}
+EXPORT_SYMBOL(inet_current_timestamp);
+
 int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int 
*addr_len)
 {
if (sk->sk_family == AF_INET)
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 36e2697..6333489 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -931,7 +931,6 @@ static bool icmp_echo(struct sk_buff *skb)
  */
 static bool icmp_timestamp(struct sk_buff *skb)
 {
-   struct timespec tv;
struct icmp_bxm icmp_param;
/*
 *  Too short.
@@ -942,9 +941,7 @@ static bool icmp_timestamp(struct sk_buff *skb)
/*
 *  Fill in the current time as ms since midnight UT:
 */
-   getnstimeofday(&tv);
-   icmp_param.data.times[1] = htonl((tv.tv_sec % 86400) * MSEC_PER_SEC +
-tv.tv_nsec / NSEC_PER_MSEC);
+   icmp_param.data.times[1] = inet_current_timestamp();
icmp_param.data.times[2] = icmp_param.data.times[1];
if (skb_copy_bits(skb, 0, &icmp_param.data.times[0], 4))
BUG();
diff --git a/net/ipv4/ip_options.c b/net/ipv4/ip_options.c
index bd24679..4d158ff 100644
--- a/net/ipv4/ip_options.c
+++ b/net/ipv4/ip_options.c
@@ -58,10 +58,9 @@ void ip_options_build(struct sk_buff *skb, struct ip_options 
*opt,
if (opt->ts_needaddr)
ip_rt_get_source(iph+opt->ts+iph[opt->ts+2]-9, skb, rt);
if (opt->ts_needtime) {
-   struct timespec tv;
__be32 midtime;
-   getnstimeofday(&tv);
-   midtime = htonl((tv.tv_sec % 86400) * MSEC_PER_SEC + 
tv.tv_nsec / NSEC_PER_MSEC);
+
+   midtime = inet_current_timestamp();
memcpy(iph+opt->ts+iph[opt->ts+2]-5, &midtime, 4);
}
return;
@@ -415,11 +414,10 @@ int ip_options_compile(struct net *net,
break;
}
if (timeptr) {
-   struct timespec tv;
-   u32  midtime;
-   getnstimeofday(&tv);
-   midtime = (tv.tv_sec % 86400) * 
MSEC_PER_SEC + tv.tv_nsec / NSEC_PER_MSEC;
-   put_unaligned_be32(midtime, timeptr);
+   __be32 midtime;
+
+   midtime = inet_current_timestamp();
+   memcpy(timeptr, &midtime, 4);
opt->is_changed = 1;
}
} else if ((optptr[3]&0xF) != IPOPT_TS_PRESPEC) {
-- 
1.9.1

[PATCH v2 0/3] Convert network timestamps to be y2038 safe