Re: [PATCH v2 16/16] NFS: Enable client side NFSv4.1 backchannel to use other transports

2015-10-07 Thread Leon Romanovsky
On Tue, Oct 6, 2015 at 6:00 PM, Chuck Lever  wrote:
> Pass the correct backchannel transport class to svc_create_xprt()
> when setting up an NFSv4.1 backchannel transport.
>
> Signed-off-by: Chuck Lever 
> ---
>  fs/nfs/callback.c   |   33 +
>  include/linux/sunrpc/xprt.h |1 +
>  net/sunrpc/xprtrdma/transport.c |1 +
>  net/sunrpc/xprtsock.c   |1 +
>  4 files changed, 24 insertions(+), 12 deletions(-)
>
> diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
> index 75f7c0a..46ed2c5 100644
> --- a/fs/nfs/callback.c
> +++ b/fs/nfs/callback.c
> @@ -99,15 +99,22 @@ nfs4_callback_up(struct svc_serv *serv)
>  }
>
>  #if defined(CONFIG_NFS_V4_1)
> -static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net)
> +/*
> + * Create an svc_sock for the back channel service that shares the
> + * fore channel connection.
> + * Returns the input port (0) and sets the svc_serv bc_xprt on success
> + */
> +static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net,
> +struct rpc_xprt *xprt)
>  {
> -   /*
> -* Create an svc_sock for the back channel service that shares the
> -* fore channel connection.
> -* Returns the input port (0) and sets the svc_serv bc_xprt on success
> -*/
> -   return svc_create_xprt(serv, "tcp-bc", net, PF_INET, 0,
> - SVC_SOCK_ANONYMOUS);
> +   int ret = -EPROTONOSUPPORT;
> +
> +   if (xprt->bc_name)
> +   ret = svc_create_xprt(serv, xprt->bc_name, net, PF_INET, 0,
> + SVC_SOCK_ANONYMOUS);
> +   dprintk("NFS: svc_create_xprt(%s) returned %d\n",
> +   xprt->bc_name, ret);

Maybe we should convert it to more general and common pr_debug(..)
interface [1]?
Despite the fact that the other parts of code already use the dprintk macro.

> +   return ret;
>  }
>
>  /*
> @@ -184,7 +191,8 @@ static inline void nfs_callback_bc_serv(u32 minorversion, 
> struct rpc_xprt *xprt,
> xprt->bc_serv = serv;
>  }
>  #else
> -static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net)
> +static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net,
> +struct rpc_xprt *xprt)
>  {
> return 0;
>  }
> @@ -259,7 +267,8 @@ static void nfs_callback_down_net(u32 minorversion, 
> struct svc_serv *serv, struc
> svc_shutdown_net(serv, net);
>  }
>
> -static int nfs_callback_up_net(int minorversion, struct svc_serv *serv, 
> struct net *net)
> +static int nfs_callback_up_net(int minorversion, struct svc_serv *serv,
> +  struct net *net, struct rpc_xprt *xprt)
>  {
> struct nfs_net *nn = net_generic(net, nfs_net_id);
> int ret;
> @@ -281,7 +290,7 @@ static int nfs_callback_up_net(int minorversion, struct 
> svc_serv *serv, struct n
> break;
> case 1:
> case 2:
> -   ret = nfs41_callback_up_net(serv, net);
> +   ret = nfs41_callback_up_net(serv, net, xprt);
> break;
> default:
> printk(KERN_ERR "NFS: unknown callback version: %d\n",
It can be pr_err(..) [1].

> @@ -364,7 +373,7 @@ int nfs_callback_up(u32 minorversion, struct rpc_xprt 
> *xprt)
> goto err_create;
> }
>
> -   ret = nfs_callback_up_net(minorversion, serv, net);
> +   ret = nfs_callback_up_net(minorversion, serv, net, xprt);
> if (ret < 0)
> goto err_net;
>
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index 82c0839..88bae18 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -170,6 +170,7 @@ struct rpc_xprt {
> struct sockaddr_storage addr;   /* server address */
> size_t  addrlen;/* size of server address */
> int prot;   /* IP protocol */
> +   char*bc_name;   /* backchannel transport */
>
> unsigned long   cong;   /* current congestion */
> unsigned long   cwnd;   /* congestion window */
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index 845278e..c60bbc8 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -337,6 +337,7 @@ xprt_setup_rdma(struct xprt_create *args)
> /* Ensure xprt->addr holds valid server TCP (not RDMA)
>  * address, for any side protocols which peek at it */
> xprt->prot = IPPROTO_TCP;
> +   xprt->bc_name = "rdma-bc";
> xprt->addrlen = args->addrlen;
> memcpy(&xprt->addr, sap, xprt->addrlen);
>
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 3084163..d82d00b 100644
> --- a/net/sunrpc/xprtsock

Re: [PATCH v1 00/24] New fast registration API

2015-10-07 Thread Christoph Hellwig
On Tue, Oct 06, 2015 at 11:37:40AM +0300, Sagi Grimberg wrote:
> The issue is that the device requires the MR page array to have
> an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
> page array allocation to be non-coherent I didn't take care of
> alignment.

Just curious:  why did you switch away from the coheret dma allocations
anyway?  Seems like the page lists are mapped as long as they are
allocated so the coherent allocator would seem like a nice fit.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 00/24] New fast registration API

2015-10-07 Thread Sagi Grimberg

On 10/7/2015 12:20 PM, Christoph Hellwig wrote:

On Tue, Oct 06, 2015 at 11:37:40AM +0300, Sagi Grimberg wrote:

The issue is that the device requires the MR page array to have
an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
page array allocation to be non-coherent I didn't take care of
alignment.


Just curious:  why did you switch away from the coheret dma allocations
anyway?  Seems like the page lists are mapped as long as they are
allocated so the coherent allocator would seem like a nice fit.



Bart suggested that having to sync once for the entire page list might
perform better than coherent memory. I'll settle either way since using
non-coherent memory might cause higher-order allocations due to
alignment, so it's not free-of-charge.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 00/24] New fast registration API

2015-10-07 Thread Christoph Hellwig
On Wed, Oct 07, 2015 at 12:25:25PM +0300, Sagi Grimberg wrote:
> Bart suggested that having to sync once for the entire page list might
> perform better than coherent memory. I'll settle either way since using
> non-coherent memory might cause higher-order allocations due to
> alignment, so it's not free-of-charge.

I don't really care either way, it just seemed like an odd change hiding
in here that I missed when reviewing earlier.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 00/24] New fast registration API

2015-10-07 Thread Sagi Grimberg

I don't really care either way, it just seemed like an odd change hiding
in here that I missed when reviewing earlier.


OK, so I'm sticking with it until someone suggests otherwise.

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] RDMA/cxgb4: re-fix 32-bit build warning

2015-10-07 Thread Arnd Bergmann
Casting a pointer to __be64 produces a warning on 32-bit architectures:

drivers/infiniband/hw/cxgb4/mem.c:147:20: warning: cast from pointer to integer 
of different size [-Wpointer-to-int-cast]
req->wr.wr_lo = (__force __be64)&wr_wait;

This was fixed at least twice for this driver in different places,
and accidentally reverted once more. This puts the correct version
back in place.

Signed-off-by: Arnd Bergmann 
Fixes: 6198dd8d7a6a7 ("iw_cxgb4: 32b platform fixes")

diff --git a/drivers/infiniband/hw/cxgb4/mem.c 
b/drivers/infiniband/hw/cxgb4/mem.c
index 026b91ebd5e2..140415d31bcc 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -144,7 +144,7 @@ static int _c4iw_write_mem_inline(struct c4iw_rdev *rdev, 
u32 addr, u32 len,
if (i == (num_wqe-1)) {
req->wr.wr_hi = cpu_to_be32(FW_WR_OP_V(FW_ULPTX_WR) |
FW_WR_COMPL_F);
-   req->wr.wr_lo = (__force __be64)&wr_wait;
+   req->wr.wr_lo = (__force __be64)(unsigned long)&wr_wait;
} else
req->wr.wr_hi = cpu_to_be32(FW_WR_OP_V(FW_ULPTX_WR));
req->wr.wr_mid = cpu_to_be32(

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IB/core: avoid 32-bit warning

2015-10-07 Thread Arnd Bergmann
The INIT_UDATA() macro requires a pointer or unsigned long argument for
both input and output buffer, and all callers had a cast from when
the code was merged until a recent restructuring, so now we get

core/uverbs_cmd.c: In function 'ib_uverbs_create_cq':
core/uverbs_cmd.c:1481:66: warning: cast to pointer from integer of different 
size [-Wint-to-pointer-cast]

This makes the code behave as before by adding back the cast to
unsigned long.

Signed-off-by: Arnd Bergmann 
Fixes: 565197dd8fb1 ("IB/core: Extend ib_uverbs_create_cq")

diff --git a/drivers/infiniband/core/uverbs_cmd.c 
b/drivers/infiniband/core/uverbs_cmd.c
index be4cb9f04be3..88b3b78340f2 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1478,7 +1478,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file,
if (copy_from_user(&cmd, buf, sizeof(cmd)))
return -EFAULT;
 
-   INIT_UDATA(&ucore, buf, cmd.response, sizeof(cmd), sizeof(resp));
+   INIT_UDATA(&ucore, buf, (unsigned long)cmd.response, sizeof(cmd), 
sizeof(resp));
 
INIT_UDATA(&uhw, buf + sizeof(cmd),
   (unsigned long)cmd.response + sizeof(resp),

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/core: avoid 32-bit warning

2015-10-07 Thread Yann Droneaud
Hi,

Le mercredi 07 octobre 2015 à 14:29 +0200, Arnd Bergmann a écrit :
> The INIT_UDATA() macro requires a pointer or unsigned long argument 
> for both input and output buffer, and all callers had a cast from 
> when the code was merged until a recent restructuring, so now we get
> 
> core/uverbs_cmd.c: In function 'ib_uverbs_create_cq':
> core/uverbs_cmd.c:1481:66: warning: cast to pointer from integer of
> different size [-Wint-to-pointer-cast]
> 
> This makes the code behave as before by adding back the cast to
> unsigned long.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: 565197dd8fb1 ("IB/core: Extend ib_uverbs_create_cq")
> 

Reviewed-by: Yann Droneaud 

> diff --git a/drivers/infiniband/core/uverbs_cmd.c
> b/drivers/infiniband/core/uverbs_cmd.c
> index be4cb9f04be3..88b3b78340f2 100644
> --- a/drivers/infiniband/core/uverbs_cmd.c
> +++ b/drivers/infiniband/core/uverbs_cmd.c
> @@ -1478,7 +1478,7 @@ ssize_t ib_uverbs_create_cq(struct
> ib_uverbs_file *file,
>   if (copy_from_user(&cmd, buf, sizeof(cmd)))
>   return -EFAULT;
>  
> - INIT_UDATA(&ucore, buf, cmd.response, sizeof(cmd),
> sizeof(resp));
> + INIT_UDATA(&ucore, buf, (unsigned long)cmd.response,
> sizeof(cmd), sizeof(resp));
>  
>   INIT_UDATA(&uhw, buf + sizeof(cmd),
>  (unsigned long)cmd.response + sizeof(resp),
> 

Regards.

-- 
Yann Droneaud
OPTEYA

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/core: avoid 32-bit warning

2015-10-07 Thread Sagi Grimberg

On 10/7/2015 3:29 PM, Arnd Bergmann wrote:

The INIT_UDATA() macro requires a pointer or unsigned long argument for
both input and output buffer, and all callers had a cast from when
the code was merged until a recent restructuring, so now we get

core/uverbs_cmd.c: In function 'ib_uverbs_create_cq':
core/uverbs_cmd.c:1481:66: warning: cast to pointer from integer of different 
size [-Wint-to-pointer-cast]

This makes the code behave as before by adding back the cast to
unsigned long.

Signed-off-by: Arnd Bergmann 
Fixes: 565197dd8fb1 ("IB/core: Extend ib_uverbs_create_cq")

diff --git a/drivers/infiniband/core/uverbs_cmd.c 
b/drivers/infiniband/core/uverbs_cmd.c
index be4cb9f04be3..88b3b78340f2 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1478,7 +1478,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file,
if (copy_from_user(&cmd, buf, sizeof(cmd)))
return -EFAULT;

-   INIT_UDATA(&ucore, buf, cmd.response, sizeof(cmd), sizeof(resp));
+   INIT_UDATA(&ucore, buf, (unsigned long)cmd.response, sizeof(cmd), 
sizeof(resp));


Would it make sense to cast inside INIT_UDATA() and not have callers
worry about it?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/core: avoid 32-bit warning

2015-10-07 Thread Arnd Bergmann
On Wednesday 07 October 2015 16:19:27 Sagi Grimberg wrote:
> On 10/7/2015 3:29 PM, Arnd Bergmann wrote:
> > The INIT_UDATA() macro requires a pointer or unsigned long argument for
> > both input and output buffer, and all callers had a cast from when
> > the code was merged until a recent restructuring, so now we get
> >
> > core/uverbs_cmd.c: In function 'ib_uverbs_create_cq':
> > core/uverbs_cmd.c:1481:66: warning: cast to pointer from integer of 
> > different size [-Wint-to-pointer-cast]
> >
> > This makes the code behave as before by adding back the cast to
> > unsigned long.
> >
> > Signed-off-by: Arnd Bergmann 
> > Fixes: 565197dd8fb1 ("IB/core: Extend ib_uverbs_create_cq")
> >
> > diff --git a/drivers/infiniband/core/uverbs_cmd.c 
> > b/drivers/infiniband/core/uverbs_cmd.c
> > index be4cb9f04be3..88b3b78340f2 100644
> > --- a/drivers/infiniband/core/uverbs_cmd.c
> > +++ b/drivers/infiniband/core/uverbs_cmd.c
> > @@ -1478,7 +1478,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file 
> > *file,
> >   if (copy_from_user(&cmd, buf, sizeof(cmd)))
> >   return -EFAULT;
> >
> > - INIT_UDATA(&ucore, buf, cmd.response, sizeof(cmd), sizeof(resp));
> > + INIT_UDATA(&ucore, buf, (unsigned long)cmd.response, sizeof(cmd), 
> > sizeof(resp));
> 
> Would it make sense to cast inside INIT_UDATA() and not have callers
> worry about it?

If we want to do this properly, INIT_UDATA should be converted into a function
call that has the right types.

My patch doesn't try to do that here, it just restores the code to how it
was for the past 10 years.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/core: avoid 32-bit warning

2015-10-07 Thread Yann Droneaud
Hi,

Le mercredi 07 octobre 2015 à 16:19 +0300, Sagi Grimberg a écrit :
> On 10/7/2015 3:29 PM, Arnd Bergmann wrote:
> > The INIT_UDATA() macro requires a pointer or unsigned long argument
> > for
> > both input and output buffer, and all callers had a cast from when
> > the code was merged until a recent restructuring, so now we get
> > 
> > core/uverbs_cmd.c: In function 'ib_uverbs_create_cq':
> > core/uverbs_cmd.c:1481:66: warning: cast to pointer from integer of
> > different size [-Wint-to-pointer-cast]
> > 
> > This makes the code behave as before by adding back the cast to
> > unsigned long.
> > 
> > Signed-off-by: Arnd Bergmann 
> > Fixes: 565197dd8fb1 ("IB/core: Extend ib_uverbs_create_cq")
> > 
> > diff --git a/drivers/infiniband/core/uverbs_cmd.c
> > b/drivers/infiniband/core/uverbs_cmd.c
> > index be4cb9f04be3..88b3b78340f2 100644
> > --- a/drivers/infiniband/core/uverbs_cmd.c
> > +++ b/drivers/infiniband/core/uverbs_cmd.c
> > @@ -1478,7 +1478,7 @@ ssize_t ib_uverbs_create_cq(struct
> > ib_uverbs_file *file,
> > if (copy_from_user(&cmd, buf, sizeof(cmd)))
> > return -EFAULT;
> > 
> > -   INIT_UDATA(&ucore, buf, cmd.response, sizeof(cmd),
> > sizeof(resp));
> > +   INIT_UDATA(&ucore, buf, (unsigned long)cmd.response,
> > sizeof(cmd), sizeof(resp));
> 
> Would it make sense to cast inside INIT_UDATA() and not have callers
> worry about it?

It's ... complicated. See INIT_UDATA_BUF_OR_NULL().

Awayway, I have patch to do the opposite, eg. explicitly cast u64 value
to (void __user *)(unsigned long) in the caller function instead, as it
allows some safer / new operations on the response buffer.

Regards.

-- 
Yann Droneaud
OPTEYA

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 16/16] NFS: Enable client side NFSv4.1 backchannel to use other transports

2015-10-07 Thread Chuck Lever

> On Oct 7, 2015, at 5:14 AM, Leon Romanovsky  wrote:
> 
> On Tue, Oct 6, 2015 at 6:00 PM, Chuck Lever  wrote:
>> Pass the correct backchannel transport class to svc_create_xprt()
>> when setting up an NFSv4.1 backchannel transport.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>> fs/nfs/callback.c   |   33 +
>> include/linux/sunrpc/xprt.h |1 +
>> net/sunrpc/xprtrdma/transport.c |1 +
>> net/sunrpc/xprtsock.c   |1 +
>> 4 files changed, 24 insertions(+), 12 deletions(-)
>> 
>> diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
>> index 75f7c0a..46ed2c5 100644
>> --- a/fs/nfs/callback.c
>> +++ b/fs/nfs/callback.c
>> @@ -99,15 +99,22 @@ nfs4_callback_up(struct svc_serv *serv)
>> }
>> 
>> #if defined(CONFIG_NFS_V4_1)
>> -static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net)
>> +/*
>> + * Create an svc_sock for the back channel service that shares the
>> + * fore channel connection.
>> + * Returns the input port (0) and sets the svc_serv bc_xprt on success
>> + */
>> +static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net,
>> +struct rpc_xprt *xprt)
>> {
>> -   /*
>> -* Create an svc_sock for the back channel service that shares the
>> -* fore channel connection.
>> -* Returns the input port (0) and sets the svc_serv bc_xprt on 
>> success
>> -*/
>> -   return svc_create_xprt(serv, "tcp-bc", net, PF_INET, 0,
>> - SVC_SOCK_ANONYMOUS);
>> +   int ret = -EPROTONOSUPPORT;
>> +
>> +   if (xprt->bc_name)
>> +   ret = svc_create_xprt(serv, xprt->bc_name, net, PF_INET, 0,
>> + SVC_SOCK_ANONYMOUS);
>> +   dprintk("NFS: svc_create_xprt(%s) returned %d\n",
>> +   xprt->bc_name, ret);
> 
> Maybe we should convert it to more general and common pr_debug(..)
> interface [1]?
> Despite the fact that the other parts of code already use the dprintk macro.

dprintk in this code is deprecated, and I used it here out
of laziness. I guess the maintainer might prefer a trace point
instead, or even no output.

In any event, converting dprintk to pr_debug() is out of scope
for this patch series.


>> +   return ret;
>> }
>> 
>> /*
>> @@ -184,7 +191,8 @@ static inline void nfs_callback_bc_serv(u32 
>> minorversion, struct rpc_xprt *xprt,
>>xprt->bc_serv = serv;
>> }
>> #else
>> -static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net)
>> +static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net,
>> +struct rpc_xprt *xprt)
>> {
>>return 0;
>> }
>> @@ -259,7 +267,8 @@ static void nfs_callback_down_net(u32 minorversion, 
>> struct svc_serv *serv, struc
>>svc_shutdown_net(serv, net);
>> }
>> 
>> -static int nfs_callback_up_net(int minorversion, struct svc_serv *serv, 
>> struct net *net)
>> +static int nfs_callback_up_net(int minorversion, struct svc_serv *serv,
>> +  struct net *net, struct rpc_xprt *xprt)
>> {
>>struct nfs_net *nn = net_generic(net, nfs_net_id);
>>int ret;
>> @@ -281,7 +290,7 @@ static int nfs_callback_up_net(int minorversion, struct 
>> svc_serv *serv, struct n
>>break;
>>case 1:
>>case 2:
>> -   ret = nfs41_callback_up_net(serv, net);
>> +   ret = nfs41_callback_up_net(serv, net, xprt);
>>break;
>>default:
>>printk(KERN_ERR "NFS: unknown callback version: %d\n",
> It can be pr_err(..) [1].

It can be, but this patch isn’t changing this line, nor
is it’s purpose to clean up error messages.

It would be better to submit a separate patch to clean
these up, if the maintainer suggests this is something
he will accept.


> 
>> @@ -364,7 +373,7 @@ int nfs_callback_up(u32 minorversion, struct rpc_xprt 
>> *xprt)
>>goto err_create;
>>}
>> 
>> -   ret = nfs_callback_up_net(minorversion, serv, net);
>> +   ret = nfs_callback_up_net(minorversion, serv, net, xprt);
>>if (ret < 0)
>>goto err_net;
>> 
>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>> index 82c0839..88bae18 100644
>> --- a/include/linux/sunrpc/xprt.h
>> +++ b/include/linux/sunrpc/xprt.h
>> @@ -170,6 +170,7 @@ struct rpc_xprt {
>>struct sockaddr_storage addr;   /* server address */
>>size_t  addrlen;/* size of server address */
>>int prot;   /* IP protocol */
>> +   char*bc_name;   /* backchannel transport */
>> 
>>unsigned long   cong;   /* current congestion */
>>unsigned long   cwnd;   /* congestion window */
>> diff --git a/net/sunrpc/xprtrdma/transport.c 
>> b/ne

Re: [PATCH v2 06/16] xprtrdma: Use workqueue to process RPC/RDMA replies

2015-10-07 Thread Sagi Grimberg

On 10/6/2015 5:59 PM, Chuck Lever wrote:

The reply tasklet is fast, but it's single threaded. After reply
traffic saturates a single CPU, there's no more reply processing
capacity.

Replace the tasklet with a workqueue to spread reply handling across
all CPUs.  This also moves RPC/RDMA reply handling out of the soft
IRQ context and into a context that allows sleeps.


Hi Chuck,

I'm probably missing something here, but do you ever schedule in
the workqueue context? Don't you need to explicitly schedule after
a jiffie or so the code works also in a non fully preemptable kernel?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 06/16] xprtrdma: Use workqueue to process RPC/RDMA replies

2015-10-07 Thread Chuck Lever

> On Oct 7, 2015, at 10:39 AM, Sagi Grimberg  wrote:
> 
> On 10/6/2015 5:59 PM, Chuck Lever wrote:
>> The reply tasklet is fast, but it's single threaded. After reply
>> traffic saturates a single CPU, there's no more reply processing
>> capacity.
>> 
>> Replace the tasklet with a workqueue to spread reply handling across
>> all CPUs.  This also moves RPC/RDMA reply handling out of the soft
>> IRQ context and into a context that allows sleeps.
> 
> Hi Chuck,
> 
> I'm probably missing something here, but do you ever schedule in
> the workqueue context? Don't you need to explicitly schedule after
> a jiffie or so the code works also in a non fully preemptable kernel?

Each RPC reply gets its own work request. This is unlike
the tasklet, which continues to run as long as there are
items on xprtrdma’s global tasklet queue.

I can’t think of anything in the current reply handler
that would take longer than a few microseconds to run,
unless there is lock contention on the transport_lock.

wake_up_bit can also be slow sometimes, but it schedules
internally.

Eventually the reply handler will also synchronously
perform R_key invalidation. In that case, I think there
will be an implicit schedule while waiting for the
invalidation to finish.


-—
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 06/16] xprtrdma: Use workqueue to process RPC/RDMA replies

2015-10-07 Thread Sagi Grimberg

On 10/7/2015 5:48 PM, Chuck Lever wrote:



On Oct 7, 2015, at 10:39 AM, Sagi Grimberg  wrote:

On 10/6/2015 5:59 PM, Chuck Lever wrote:

The reply tasklet is fast, but it's single threaded. After reply
traffic saturates a single CPU, there's no more reply processing
capacity.

Replace the tasklet with a workqueue to spread reply handling across
all CPUs.  This also moves RPC/RDMA reply handling out of the soft
IRQ context and into a context that allows sleeps.


Hi Chuck,

I'm probably missing something here, but do you ever schedule in
the workqueue context? Don't you need to explicitly schedule after
a jiffie or so the code works also in a non fully preemptable kernel?


Each RPC reply gets its own work request. This is unlike
the tasklet, which continues to run as long as there are
items on xprtrdma’s global tasklet queue.


OK I understand now.

So this and the rest of the series looks good to me.

Reviewed-by: Sagi Grimberg 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 for-next 0/7] Add support for multicast loopback prevention to mlx4

2015-10-07 Thread Doug Ledford
On 10/06/2015 05:49 PM, Or Gerlitz wrote:
> On Wed, Oct 7, 2015 at 12:26 AM, Doug Ledford  wrote:
> 
>> Nothing so simple unfortunately.  And it isn't an IB/RoCE cluster, it's
>> IB/IB/OPA/RoCE/IWARP cluster.  Regardless though, that's not my problem
>> and what I'm chasing.
> 
> To be precise no two transports out of IB/RoCE/iWARP/OPA are
> inter-operable, so these are "just" different cards/transports under
> the same IB core on this cluster.

Except that some machines have links to as many as four of the different
fabrics and so a problem in one can effect testing of others.

>> Yes, I know how to do DOA testing.
> 
> So what's dead in your env after (say) 59m of examination?

It's not dead after 59m, it's DOA immediately.  And it's iSER.

But the details are much more complex than iSER is DOA.  It was DOA when
running a rhel7 kernel (internal, for next kernel, not a release
kernel).  That kernel is pretty close to upstream.  When I went to put
an upstream kernel on there to see if it had the same issue, the
upstream kernel on that machine oopses on boot.  It oopses in list_add,
but the backtrace doesn't list any usable information about who called
list_add with bogus data.  However, reliably, right before the oops, the
ciostor driver fails to load properly, so I'm going with that being the
likely culprit.  But each iteration is slow because when the rhel7
kernel iSER does it's thing, it causes a hung reboot, but it also
crashes the iDRAC in the machine (errant drivers crashing a baseboard
management controller is never a good sign), so the reboot must be done
via a hard power cycle.  When the upstream kernel oopses on boot, at
least the iDRAC is still working.  As a result, each test iteration is
pretty slow.

There we go, bootup on a 4.3-rc4 kernel with cxgb4 FCoE driver disabled
succeeded.  A hurdle passed.  Now I can test upstream iSER.

With an upstream kernel, the drive is still read-only with iSER (it's
not configured that way to the best of my knowledge, but I'm using auto
generate ACLs, I'm getting ready to switch the system to specific ACLs
instead), but the thread isn't stuck in D state, so that's an improvement.

However, the machine is still crashing the iDRAC on reboot.  I can't be
certain if it's the SRP target or iSER target causing this as they both
were brought up live at the same time and reboot cycles without either
of these work fine.  So I have more investigation to go before I know
exactly what's going on.  And as I pointed out, each iteration is slow :-/

>>> What we do know that needs fixing for 4.3-rc
>>> --> RoCE, you need the patch re-posted by Haggai few hours ago
>>> "IB/cma: Accept connection without a valid netdev on RoCE" -- without
>>> it, RoCE isn't working.
> 
>> I have that already.  It's available on both github and k.o and just
>> waiting for a pull request.
> 
> Maybe wait to get the fixes for the non-default pkey on mlx5 (see more below)?
> 
> Did you actually note that before Haggai posted the patch?!

No.

> once I realized how deep was the breakage, I became sort of very
> worried re your testing env not shouting hard on us this something is
> broken even before 4.3-rc1

My test environment has been down for upgrades.  In the last little bit
we've brought a second rack online, added 10 new machines, 3 new
switches, and moved existing machines around between the two racks in
order to more evenly balance the need for each port type across switches
in the two racks.  There's been more than that going on behind the
scenes here too, but it's not really worth getting into all of it.
Suffice it to say I've been working on A) expanding the cluster, B)
expanding the things the cluster is configured to do and therefore able
to test, and C) finding a way to get upstream code into this testing
framework since it was previously all rhel/fedora centric.

And this test infrastructure goes down by COB Thursday of this week and
won't be back for a week because it's being used for NFSoRDMA testing at
this fall's Bake-a-thon.

>>> --> **mlx5** devices and no-default IB pkeys, Haggai and Co are
>>> working on a fix since this isn't working since 4.3-rc1. I told them
>>> we need it till rc5.5 (i.e few days before rc6 and if not, will have
>>> to revert some 4.3-rc1 bits.
> 
>> I already have on patch related to this in my repo as well.  The 0day
>> testing just came back and it's all good.
> 
> I suspect that you don't...

I meant build tests passed, not run tests.

> do you have rping up and running between
> mlx4 and mlx5 on non default pkey? the breakage is a bit tricky and
> you might not see it if you run mlx5 against mlx5, BTW which patch is
> that?


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: OpenPGP digital signature


Re: [PATCH v1 for-next 0/7] Add support for multicast loopback prevention to mlx4

2015-10-07 Thread Or Gerlitz

On 10/7/2015 6:28 PM, Doug Ledford wrote:

However, the machine is still crashing the iDRAC on reboot.  I can't be
certain if it's the SRP target or iSER target causing this as they both
were brought up live at the same time and reboot cycles without either
of these work fine.  So I have more investigation to go before I know
exactly what's going on.  And as I pointed out, each iteration is slow :-/


You may want to move to run with the lightweight user-space iSER target 
(TGT) it's provided with Fedora

and also with RHEL7 using the unsupported addons (forgot how you call that)

And anyway, Sagi suggested to help, so, please make use of his brainpower

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 00/24] New fast registration API

2015-10-07 Thread Bart Van Assche

On 10/06/2015 11:42 PM, Sagi Grimberg wrote:

On 10/6/2015 9:49 PM, Bart Van Assche wrote:

On 10/06/2015 01:37 AM, Sagi Grimberg wrote:

I see now the error you are referring to.

The issue is that the device requires the MR page array to have
an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
page array allocation to be non-coherent I didn't take care of
alignment.

Taking care of this alignment may result in a higher order allocation
as we'd need to add (alignment - 1) to the allocation size.

e.g. a 512 pages on mlx4 will become:
512 * 8 + 0x40 - 1 = 4159

I'm leaning towards this approach. Any preference?

I think this patch should take care of mlx4:
[ ... ]


Hello Sagi,

Thanks for the patch. But since the patch included in the previous
e-mail mapped a memory range that could be outside the bounds of the
allocated memory I have been testing the patch below:


Thanks! I correct the patches.

Can I take it as your Tested-by on srp?


Sure :-) But please keep in mind that I currently only have access to 
ConnectX-3 HCA's for testing RDMA software and not to any other RDMA HCA 
model.


Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 00/24] New fast registration API

2015-10-07 Thread Sagi Grimberg

On 10/7/2015 6:46 PM, Bart Van Assche wrote:

On 10/06/2015 11:42 PM, Sagi Grimberg wrote:

On 10/6/2015 9:49 PM, Bart Van Assche wrote:

On 10/06/2015 01:37 AM, Sagi Grimberg wrote:

I see now the error you are referring to.

The issue is that the device requires the MR page array to have
an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
page array allocation to be non-coherent I didn't take care of
alignment.

Taking care of this alignment may result in a higher order allocation
as we'd need to add (alignment - 1) to the allocation size.

e.g. a 512 pages on mlx4 will become:
512 * 8 + 0x40 - 1 = 4159

I'm leaning towards this approach. Any preference?

I think this patch should take care of mlx4:
[ ... ]


Hello Sagi,

Thanks for the patch. But since the patch included in the previous
e-mail mapped a memory range that could be outside the bounds of the
allocated memory I have been testing the patch below:


Thanks! I correct the patches.

Can I take it as your Tested-by on srp?


Sure :-) But please keep in mind that I currently only have access to
ConnectX-3 HCA's for testing RDMA software and not to any other RDMA HCA
model.


Thanks Bart. For what its worth, I've tested srp (and iser + nfs) on
both CX3 and CX4 with your config file.

Cheers,
Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Seeing WARN_ON in ib_dealloc_pd from ipoib in kernel 4.3-rc1-debug

2015-10-07 Thread Sagi Grimberg

This started popping up (not sure if it's new to 4.3-rc1).

Happens when unloading the provider driver (mlx4/mlx5 in my case).

Has anyone seen this?

kernel: [ cut here ]
kernel: WARNING: CPU: 2 PID: 6012 at drivers/infiniband/core/verbs.c:283 
ib_dealloc_pd+0x5b/0xa0 [ib_core]()
kernel: Modules linked in: rpcrdma ib_srp scsi_transport_srp ib_iser 
rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_umad ib_uverbs ib_ipoib 
ib_cm mlx4_ib ib_sa ib_mad mlx4_core mlx5_ib(-) mlx5_core ib_core 
ib_addr mst_pciconf(O) mst_pci(O) nfsv3 nfs af_packet coretemp 
x86_pkg_temp_thermal crct10dif_pclmul crc32c_intel aesni_intel 
aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd microcode 
ipmi_ssif pcspkr lpc_ich i2c_i801 mfd_core ioatdma wmi ipmi_si 
ipmi_msghandler processor button nfsd auth_rpcgss oid_registry nfs_acl 
lockd grace sunrpc ip_tables x_tables ext4 crc16 mbcache jbd2 sd_mod 
hid_generic usbhid hid ahci libahci libata igb ehci_pci hwmon ehci_hcd 
ptp usbcore pps_core scsi_mod i2c_algo_bit usb_common i2c_core dca 
autofs4 [last unloaded: mlx4_core] 

kernel: CPU: 2 PID: 6012 Comm: modprobe Tainted: G   O L 
4.3.0-rc3-debug+ #67

kernel: Hardware name: Supermicro SYS-1027R-WRF/X9DRW, BIOS 3.0a 08/08/2013
kernel:  011b 8807a99afbe8 8129915b 0009
kernel:   8807a99afc28 810752b5 880827d7c2a0
kernel:  8807b0d03260 880827d7c2a0 880827d7cc60 
kernel: Call Trace:
kernel:  [] dump_stack+0x4f/0x74
kernel:  [] warn_slowpath_common+0x95/0xe0
kernel:  [] warn_slowpath_null+0x1a/0x20
kernel:  [] ib_dealloc_pd+0x5b/0xa0 [ib_core]
kernel:  [] ipoib_transport_dev_cleanup+0x9e/0xf0 
[ib_ipoib]

kernel:  [] ipoib_ib_dev_cleanup+0x5e/0x80 [ib_ipoib]
kernel:  [] ipoib_dev_cleanup+0x2a4/0x3b0 [ib_ipoib]
kernel:  [] ? __local_bh_enable_ip+0x6d/0xd0
kernel:  [] ipoib_uninit+0xe/0x10 [ib_ipoib]
kernel:  [] rollback_registered_many+0x1a7/0x2c0
kernel:  [] rollback_registered+0x31/0x40
kernel:  [] unregister_netdevice_queue+0x58/0xb0
kernel:  [] unregister_netdev+0x20/0x30
kernel:  [] ipoib_remove_one+0xa1/0xe0 [ib_ipoib]
kernel:  [] ib_unregister_device+0xc1/0x160 [ib_core]
kernel:  [] mlx5_ib_remove+0x19/0x50 [mlx5_ib]
kernel:  [] mlx5_remove_device+0x68/0x80 [mlx5_core]
kernel:  [] mlx5_unregister_interface+0x3e/0x70 
[mlx5_core]

kernel:  [] mlx5_ib_cleanup+0x10/0x694 [mlx5_ib]
kernel:  [] SyS_delete_module+0x17a/0x1c0
kernel:  [] ? trace_hardirqs_on_thunk+0x17/0x19
kernel:  [] ? generic_show_options+0x180/0x180
kernel:  [] entry_SYSCALL_64_fastpath+0x12/0x76
kernel: ---[ end trace 31339c7283574ccb ]---
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Seeing WARN_ON in ib_dealloc_pd from ipoib in kernel 4.3-rc1-debug

2015-10-07 Thread Doug Ledford
On 10/07/2015 11:51 AM, Sagi Grimberg wrote:
> This started popping up (not sure if it's new to 4.3-rc1).
> 
> Happens when unloading the provider driver (mlx4/mlx5 in my case).
> 
> Has anyone seen this?
> 
> kernel: [ cut here ]
> kernel: WARNING: CPU: 2 PID: 6012 at drivers/infiniband/core/verbs.c:283
> ib_dealloc_pd+0x5b/0xa0 [ib_core]()
> kernel: Modules linked in: rpcrdma ib_srp scsi_transport_srp ib_iser
> rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_umad ib_uverbs ib_ipoib
> ib_cm mlx4_ib ib_sa ib_mad mlx4_core mlx5_ib(-) mlx5_core ib_core
> ib_addr mst_pciconf(O) mst_pci(O) nfsv3 nfs af_packet coretemp
> x86_pkg_temp_thermal crct10dif_pclmul crc32c_intel aesni_intel
> aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd microcode
> ipmi_ssif pcspkr lpc_ich i2c_i801 mfd_core ioatdma wmi ipmi_si
> ipmi_msghandler processor button nfsd auth_rpcgss oid_registry nfs_acl
> lockd grace sunrpc ip_tables x_tables ext4 crc16 mbcache jbd2 sd_mod
> hid_generic usbhid hid ahci libahci libata igb ehci_pci hwmon ehci_hcd
> ptp usbcore pps_core scsi_mod i2c_algo_bit usb_common i2c_core dca
> autofs4 [last unloaded: mlx4_core]
> kernel: CPU: 2 PID: 6012 Comm: modprobe Tainted: G   O L
> 4.3.0-rc3-debug+ #67
> kernel: Hardware name: Supermicro SYS-1027R-WRF/X9DRW, BIOS 3.0a 08/08/2013
> kernel:  011b 8807a99afbe8 8129915b
> 0009
> kernel:   8807a99afc28 810752b5
> 880827d7c2a0
> kernel:  8807b0d03260 880827d7c2a0 880827d7cc60
> 
> kernel: Call Trace:
> kernel:  [] dump_stack+0x4f/0x74
> kernel:  [] warn_slowpath_common+0x95/0xe0
> kernel:  [] warn_slowpath_null+0x1a/0x20
> kernel:  [] ib_dealloc_pd+0x5b/0xa0 [ib_core]
> kernel:  [] ipoib_transport_dev_cleanup+0x9e/0xf0
> [ib_ipoib]
> kernel:  [] ipoib_ib_dev_cleanup+0x5e/0x80 [ib_ipoib]
> kernel:  [] ipoib_dev_cleanup+0x2a4/0x3b0 [ib_ipoib]
> kernel:  [] ? __local_bh_enable_ip+0x6d/0xd0
> kernel:  [] ipoib_uninit+0xe/0x10 [ib_ipoib]
> kernel:  [] rollback_registered_many+0x1a7/0x2c0
> kernel:  [] rollback_registered+0x31/0x40
> kernel:  [] unregister_netdevice_queue+0x58/0xb0
> kernel:  [] unregister_netdev+0x20/0x30
> kernel:  [] ipoib_remove_one+0xa1/0xe0 [ib_ipoib]
> kernel:  [] ib_unregister_device+0xc1/0x160 [ib_core]
> kernel:  [] mlx5_ib_remove+0x19/0x50 [mlx5_ib]
> kernel:  [] mlx5_remove_device+0x68/0x80 [mlx5_core]
> kernel:  [] mlx5_unregister_interface+0x3e/0x70
> [mlx5_core]
> kernel:  [] mlx5_ib_cleanup+0x10/0x694 [mlx5_ib]
> kernel:  [] SyS_delete_module+0x17a/0x1c0
> kernel:  [] ? trace_hardirqs_on_thunk+0x17/0x19
> kernel:  [] ? generic_show_options+0x180/0x180
> kernel:  [] entry_SYSCALL_64_fastpath+0x12/0x76
> kernel: ---[ end trace 31339c7283574ccb ]---

Yes.  I'm seeing this too.  The last time this popped up I fixed it by
adding the code for reaping ahs.  I suspect that the new code to timeout
sendonly multicast joins combined with us now creating and joining what
used to be sendonly groups is the likely culprit here.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: OpenPGP digital signature


Re: Seeing WARN_ON in ib_dealloc_pd from ipoib in kernel 4.3-rc1-debug

2015-10-07 Thread santosh.shilim...@oracle.com

Sagi,

On 10/7/15 8:51 AM, Sagi Grimberg wrote:

This started popping up (not sure if it's new to 4.3-rc1).

Happens when unloading the provider driver (mlx4/mlx5 in my case).

Has anyone seen this?


Not sure it is useful but yes I have seen similar dump with
RDS on 4.3-rc1. I later found that RDS code had mr leak(s) in
normal operation which lead to WARNS on module clean up.
I believe the leaks lead to pd 'usecnt' getting messed up.
Once i avoided that, I stopped seeing it.



kernel: [ cut here ]
kernel: WARNING: CPU: 2 PID: 6012 at drivers/infiniband/core/verbs.c:283
ib_dealloc_pd+0x5b/0xa0 [ib_core]()
kernel: Modules linked in: rpcrdma ib_srp scsi_transport_srp ib_iser
rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_umad ib_uverbs ib_ipoib
ib_cm mlx4_ib ib_sa ib_mad mlx4_core mlx5_ib(-) mlx5_core ib_core
ib_addr mst_pciconf(O) mst_pci(O) nfsv3 nfs af_packet coretemp
x86_pkg_temp_thermal crct10dif_pclmul crc32c_intel aesni_intel
aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd microcode
ipmi_ssif pcspkr lpc_ich i2c_i801 mfd_core ioatdma wmi ipmi_si
ipmi_msghandler processor button nfsd auth_rpcgss oid_registry nfs_acl
lockd grace sunrpc ip_tables x_tables ext4 crc16 mbcache jbd2 sd_mod
hid_generic usbhid hid ahci libahci libata igb ehci_pci hwmon ehci_hcd
ptp usbcore pps_core scsi_mod i2c_algo_bit usb_common i2c_core dca
autofs4 [last unloaded: mlx4_core]
kernel: CPU: 2 PID: 6012 Comm: modprobe Tainted: G   O L
4.3.0-rc3-debug+ #67
kernel: Hardware name: Supermicro SYS-1027R-WRF/X9DRW, BIOS 3.0a 08/08/2013
kernel:  011b 8807a99afbe8 8129915b
0009
kernel:   8807a99afc28 810752b5
880827d7c2a0
kernel:  8807b0d03260 880827d7c2a0 880827d7cc60

kernel: Call Trace:
kernel:  [] dump_stack+0x4f/0x74
kernel:  [] warn_slowpath_common+0x95/0xe0
kernel:  [] warn_slowpath_null+0x1a/0x20
kernel:  [] ib_dealloc_pd+0x5b/0xa0 [ib_core]
kernel:  [] ipoib_transport_dev_cleanup+0x9e/0xf0
[ib_ipoib]
kernel:  [] ipoib_ib_dev_cleanup+0x5e/0x80 [ib_ipoib]
kernel:  [] ipoib_dev_cleanup+0x2a4/0x3b0 [ib_ipoib]
kernel:  [] ? __local_bh_enable_ip+0x6d/0xd0
kernel:  [] ipoib_uninit+0xe/0x10 [ib_ipoib]
kernel:  [] rollback_registered_many+0x1a7/0x2c0
kernel:  [] rollback_registered+0x31/0x40
kernel:  [] unregister_netdevice_queue+0x58/0xb0
kernel:  [] unregister_netdev+0x20/0x30
kernel:  [] ipoib_remove_one+0xa1/0xe0 [ib_ipoib]
kernel:  [] ib_unregister_device+0xc1/0x160 [ib_core]
kernel:  [] mlx5_ib_remove+0x19/0x50 [mlx5_ib]
kernel:  [] mlx5_remove_device+0x68/0x80 [mlx5_core]
kernel:  [] mlx5_unregister_interface+0x3e/0x70
[mlx5_core]
kernel:  [] mlx5_ib_cleanup+0x10/0x694 [mlx5_ib]
kernel:  [] SyS_delete_module+0x17a/0x1c0
kernel:  [] ? trace_hardirqs_on_thunk+0x17/0x19
kernel:  [] ? generic_show_options+0x180/0x180
kernel:  [] entry_SYSCALL_64_fastpath+0x12/0x76
kernel: ---[ end trace 31339c7283574ccb ]---
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 00/24] New fast registration API

2015-10-07 Thread Bart Van Assche

On 10/07/2015 02:20 AM, Christoph Hellwig wrote:

On Tue, Oct 06, 2015 at 11:37:40AM +0300, Sagi Grimberg wrote:

The issue is that the device requires the MR page array to have
an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
page array allocation to be non-coherent I didn't take care of
alignment.


Just curious:  why did you switch away from the coheret dma allocations
anyway?  Seems like the page lists are mapped as long as they are
allocated so the coherent allocator would seem like a nice fit.


Hello Christoph,

My concern is that caching and/or write combining might be disabled for 
DMA coherent memory regions. This is why I assume that calling 
dma_map_single() and dma_unmap_single() will be faster for registering 
multiple pages as a single memory region instead of using DMA coherent 
memory.


Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RESEND] svcrdma: handle rdma read with a non-zero initial page offset

2015-10-07 Thread J. Bruce Fields
On Tue, Oct 06, 2015 at 01:44:25PM -0400, Doug Ledford wrote:
> On 09/28/2015 05:46 PM, Steve Wise wrote:
> > The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
> > were not taking into account the initial page_offset when determining
> > the rdma read length.  This resulted in a read who's starting address
> > and length exceeded the base/bounds of the frmr.
> > 
> > Most work loads don't tickle this bug apparently, but one test hit it
> > every time: building the linux kernel on a 16 core node with 'make -j
> > 16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.
> > 
> > This bug seems to only be tripped with devices having small fastreg page
> > list depths.  I didn't see it with mlx4, for instance.
> 
> Bruce, what's you're take on this?  Do you want to push this through or
> would you care if I push it through my tree?

Whoops, sorry, I meant to send a pull request for that last week.  Uh, I
think I'll go ahead and do that now if it's OK with you.

--b.

> 
> > Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
> > Signed-off-by: Steve Wise 
> > Tested-by: Chuck Lever 
> > ---
> > 
> >  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |6 --
> >  1 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
> > b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> > index cb51742..5f6ca47 100644
> > --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> > +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> > @@ -136,7 +136,8 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
> > ctxt->direction = DMA_FROM_DEVICE;
> > ctxt->read_hdr = head;
> > pages_needed = min_t(int, pages_needed, xprt->sc_max_sge_rd);
> > -   read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> > +   read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
> > +rs_length);
> >  
> > for (pno = 0; pno < pages_needed; pno++) {
> > int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
> > @@ -235,7 +236,8 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
> > ctxt->direction = DMA_FROM_DEVICE;
> > ctxt->frmr = frmr;
> > pages_needed = min_t(int, pages_needed, xprt->sc_frmr_pg_list_len);
> > -   read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> > +   read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
> > +rs_length);
> >  
> > frmr->kva = page_address(rqstp->rq_arg.pages[pg_no]);
> > frmr->direction = DMA_FROM_DEVICE;
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> 
> -- 
> Doug Ledford 
>   GPG KeyID: 0E572FDD
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RESEND] svcrdma: handle rdma read with a non-zero initial page offset

2015-10-07 Thread Doug Ledford
On 10/07/2015 01:01 PM, J. Bruce Fields wrote:
> On Tue, Oct 06, 2015 at 01:44:25PM -0400, Doug Ledford wrote:
>> On 09/28/2015 05:46 PM, Steve Wise wrote:
>>> The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
>>> were not taking into account the initial page_offset when determining
>>> the rdma read length.  This resulted in a read who's starting address
>>> and length exceeded the base/bounds of the frmr.
>>>
>>> Most work loads don't tickle this bug apparently, but one test hit it
>>> every time: building the linux kernel on a 16 core node with 'make -j
>>> 16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.
>>>
>>> This bug seems to only be tripped with devices having small fastreg page
>>> list depths.  I didn't see it with mlx4, for instance.
>>
>> Bruce, what's you're take on this?  Do you want to push this through or
>> would you care if I push it through my tree?
> 
> Whoops, sorry, I meant to send a pull request for that last week.  Uh, I
> think I'll go ahead and do that now if it's OK with you.

Fine with me.  I was just trying to make sure it didn't get forgotten ;-)

> --b.
> 
>>
>>> Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
>>> Signed-off-by: Steve Wise 
>>> Tested-by: Chuck Lever 
>>> ---
>>>
>>>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |6 --
>>>  1 files changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
>>> b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> index cb51742..5f6ca47 100644
>>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> @@ -136,7 +136,8 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
>>> ctxt->direction = DMA_FROM_DEVICE;
>>> ctxt->read_hdr = head;
>>> pages_needed = min_t(int, pages_needed, xprt->sc_max_sge_rd);
>>> -   read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
>>> +   read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
>>> +rs_length);
>>>  
>>> for (pno = 0; pno < pages_needed; pno++) {
>>> int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
>>> @@ -235,7 +236,8 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
>>> ctxt->direction = DMA_FROM_DEVICE;
>>> ctxt->frmr = frmr;
>>> pages_needed = min_t(int, pages_needed, xprt->sc_frmr_pg_list_len);
>>> -   read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
>>> +   read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
>>> +rs_length);
>>>  
>>> frmr->kva = page_address(rqstp->rq_arg.pages[pg_no]);
>>> frmr->direction = DMA_FROM_DEVICE;
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
>> -- 
>> Doug Ledford 
>>   GPG KeyID: 0E572FDD
>>
>>
> 
> 


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: OpenPGP digital signature


Re: Seeing WARN_ON in ib_dealloc_pd from ipoib in kernel 4.3-rc1-debug

2015-10-07 Thread Or Gerlitz
On Wed, Oct 7, 2015 at 6:51 PM, Sagi Grimberg  wrote:
> This started popping up (not sure if it's new to 4.3-rc1).
> Happens when unloading the provider driver (mlx4/mlx5 in my case).
> Has anyone seen this?

yes, I think to see it over the last 1-2 years

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RESEND] svcrdma: handle rdma read with a non-zero initial page offset

2015-10-07 Thread J. Bruce Fields
On Wed, Oct 07, 2015 at 01:33:05PM -0400, Doug Ledford wrote:
> On 10/07/2015 01:01 PM, J. Bruce Fields wrote:
> > On Tue, Oct 06, 2015 at 01:44:25PM -0400, Doug Ledford wrote:
> >> On 09/28/2015 05:46 PM, Steve Wise wrote:
> >>> The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
> >>> were not taking into account the initial page_offset when determining
> >>> the rdma read length.  This resulted in a read who's starting address
> >>> and length exceeded the base/bounds of the frmr.
> >>>
> >>> Most work loads don't tickle this bug apparently, but one test hit it
> >>> every time: building the linux kernel on a 16 core node with 'make -j
> >>> 16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.
> >>>
> >>> This bug seems to only be tripped with devices having small fastreg page
> >>> list depths.  I didn't see it with mlx4, for instance.
> >>
> >> Bruce, what's you're take on this?  Do you want to push this through or
> >> would you care if I push it through my tree?
> > 
> > Whoops, sorry, I meant to send a pull request for that last week.  Uh, I
> > think I'll go ahead and do that now if it's OK with you.
> 
> Fine with me.  I was just trying to make sure it didn't get forgotten ;-)

Understood, thanks!  I've sent the pull request.--b.

> 
> > --b.
> > 
> >>
> >>> Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
> >>> Signed-off-by: Steve Wise 
> >>> Tested-by: Chuck Lever 
> >>> ---
> >>>
> >>>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |6 --
> >>>  1 files changed, 4 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
> >>> b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >>> index cb51742..5f6ca47 100644
> >>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >>> @@ -136,7 +136,8 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
> >>>   ctxt->direction = DMA_FROM_DEVICE;
> >>>   ctxt->read_hdr = head;
> >>>   pages_needed = min_t(int, pages_needed, xprt->sc_max_sge_rd);
> >>> - read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> >>> + read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
> >>> +  rs_length);
> >>>  
> >>>   for (pno = 0; pno < pages_needed; pno++) {
> >>>   int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
> >>> @@ -235,7 +236,8 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
> >>>   ctxt->direction = DMA_FROM_DEVICE;
> >>>   ctxt->frmr = frmr;
> >>>   pages_needed = min_t(int, pages_needed, xprt->sc_frmr_pg_list_len);
> >>> - read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> >>> + read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
> >>> +  rs_length);
> >>>  
> >>>   frmr->kva = page_address(rqstp->rq_arg.pages[pg_no]);
> >>>   frmr->direction = DMA_FROM_DEVICE;
> >>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> >>> the body of a message to majord...@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>
> >>
> >> -- 
> >> Doug Ledford 
> >>   GPG KeyID: 0E572FDD
> >>
> >>
> > 
> > 
> 
> 
> -- 
> Doug Ledford 
>   GPG KeyID: 0E572FDD
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IPoIB: serialize changing on tx_outstanding

2015-10-07 Thread Wengang Wang

Hi,
Any comment on this patch?

thanks,
wengang

在 2015年09月28日 13:42, Wengang Wang 写道:

The changing on tx_outstanding should be protected by spinlock or to be
atomic operations.

Such log is found in dmesg:

Sep 16 14:20:53 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034733, 
tx_tail 1034733, tx_outstanding 359 ipoib_sendq_size: 512
Sep 16 14:21:33 naep11x06 kernel: ib0: transmit timeout: latency 9560 msecs
Sep 16 14:21:33 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034854, 
tx_tail 1034854, tx_outstanding 511 ipoib_sendq_size: 512
Sep 16 14:21:38 naep11x06 kernel: ib0: transmit timeout: latency 14568 msecs
Sep 16 14:21:38 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034854, 
tx_tail 1034854, tx_outstanding 511 ipoib_sendq_size: 512

And the send queue of ib0 kept full. When transmit timeout is reported,
queue is reported as "stopped", but the IPoIB stuff tx_head and tx_tail
points to same value. I am not able to see such numbers in ipoib_cm_tx
(for CM) because I have no vmcore. Though I am not quite sure it's caused
by parallel access of tx_outstanding(send path VS interrup path), we really
need to serialize the changeing on tx_outstanding.

This patch also make sure the increase of tx_outstanding prior to the
calling of post_send to avoid the possible decreasing before increasing in
case the running of increasing is scheduled later than the interrupt
handler.

Signed-off-by: Wengang Wang 
---
  drivers/infiniband/ulp/ipoib/ipoib_cm.c | 40 +++--
  drivers/infiniband/ulp/ipoib/ipoib_ib.c | 24 ++--
  2 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index c78dc16..044da94 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -710,6 +710,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff 
*skb, struct ipoib_cm_
struct ipoib_dev_priv *priv = netdev_priv(dev);
struct ipoib_tx_buf *tx_req;
int rc;
+   unsigned long flags;
  
  	if (unlikely(skb->len > tx->mtu)) {

ipoib_warn(priv, "packet len %d (> %d) too long to send, 
dropping\n",
@@ -742,27 +743,36 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff 
*skb, struct ipoib_cm_
skb_orphan(skb);
skb_dst_drop(skb);
  
+	spin_lock_irqsave(&priv->lock, flags);

+   if (++priv->tx_outstanding == ipoib_sendq_size) {
+   ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net 
queue\n",
+ tx->qp->qp_num);
+   netif_stop_queue(dev);
+   }
+   spin_unlock_irqrestore(&priv->lock, flags);
+   if (netif_queue_stopped(dev)) {
+   rc = ib_req_notify_cq(priv->send_cq,
+   IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
+   if (rc < 0)
+   ipoib_warn(priv, "request notify on send CQ failed\n");
+   else if (rc)
+   ipoib_send_comp_handler(priv->send_cq, dev);
+   }
+
rc = post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), tx_req);
if (unlikely(rc)) {
ipoib_warn(priv, "post_send failed, error %d\n", rc);
++dev->stats.tx_errors;
+   spin_lock_irqsave(&priv->lock, flags);
+   --priv->tx_outstanding;
+   if (netif_queue_stopped(dev))
+   netif_wake_queue(dev);
+   spin_unlock_irqrestore(&priv->lock, flags);
ipoib_dma_unmap_tx(priv, tx_req);
dev_kfree_skb_any(skb);
} else {
dev->trans_start = jiffies;
++tx->tx_head;
-
-   if (++priv->tx_outstanding == ipoib_sendq_size) {
-   ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net 
queue\n",
- tx->qp->qp_num);
-   netif_stop_queue(dev);
-   rc = ib_req_notify_cq(priv->send_cq,
-   IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
-   if (rc < 0)
-   ipoib_warn(priv, "request notify on send CQ 
failed\n");
-   else if (rc)
-   ipoib_send_comp_handler(priv->send_cq, dev);
-   }
}
  }
  
@@ -796,10 +806,13 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)

netif_tx_lock(dev);
  
  	++tx->tx_tail;

+
+   spin_lock_irqsave(&priv->lock, flags);
if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
netif_queue_stopped(dev) &&
test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
netif_wake_queue(dev);
+   spin_unlock_irqrestore(&priv->lock, flags);
  
  	if (wc->status != IB_WC_SUCCESS &&

wc->status != IB_WC_WR_FLUSH_ERR) {
@@ -1169,6 +1182,7 @@ static void ipoib_cm_tx

Re: [PATCH] IB/mlx4: correct order of variables in log

2015-10-07 Thread Wengang Wang

Hi,

Any comment on this patch?

thanks,
wengang

在 2015年09月28日 10:08, Wengang Wang 写道:

There is a mis-order in mlx4 log. Fix it.

Signed-off-by: Wengang Wang 
---
  drivers/net/ethernet/mellanox/mlx4/cmd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 0a32020..150fbb3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -1010,7 +1010,7 @@ static int mlx4_MAD_IFC_wrapper(struct mlx4_dev *dev, int 
slave,
if (!(smp->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED &&
  smp->method == IB_MGMT_METHOD_GET) || network_view) {
mlx4_err(dev, "Unprivileged slave %d is trying to execute a 
Subnet MGMT MAD, class 0x%x, method 0x%x, view=%s for attr 0x%x. Rejecting\n",
-slave, smp->method, smp->mgmt_class,
+slave, smp->mgmt_class, smp->method,
 network_view ? "Network" : "Host",
 be16_to_cpu(smp->attr_id));
return -EPERM;


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IPoIB: serialize changing on tx_outstanding

2015-10-07 Thread Leon Romanovsky
On Mon, Sep 28, 2015 at 01:42:10PM +0800, Wengang Wang wrote:
> The changing on tx_outstanding should be protected by spinlock or to be
> atomic operations.
> 
> Such log is found in dmesg:
> 
> Sep 16 14:20:53 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034733, 
> tx_tail 1034733, tx_outstanding 359 ipoib_sendq_size: 512
> Sep 16 14:21:33 naep11x06 kernel: ib0: transmit timeout: latency 9560 msecs
> Sep 16 14:21:33 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034854, 
> tx_tail 1034854, tx_outstanding 511 ipoib_sendq_size: 512
> Sep 16 14:21:38 naep11x06 kernel: ib0: transmit timeout: latency 14568 msecs
> Sep 16 14:21:38 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034854, 
> tx_tail 1034854, tx_outstanding 511 ipoib_sendq_size: 512
> 
> And the send queue of ib0 kept full. When transmit timeout is reported,
> queue is reported as "stopped", but the IPoIB stuff tx_head and tx_tail
> points to same value. I am not able to see such numbers in ipoib_cm_tx
> (for CM) because I have no vmcore. Though I am not quite sure it's caused
> by parallel access of tx_outstanding(send path VS interrup path), we really
> need to serialize the changeing on tx_outstanding.
> 
> This patch also make sure the increase of tx_outstanding prior to the
> calling of post_send to avoid the possible decreasing before increasing in
> case the running of increasing is scheduled later than the interrupt
> handler.
> 
> Signed-off-by: Wengang Wang 
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_cm.c | 40 
> +++--
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c | 24 ++--
>  2 files changed, 50 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
> b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> index c78dc16..044da94 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> @@ -710,6 +710,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff 
> *skb, struct ipoib_cm_
>   struct ipoib_dev_priv *priv = netdev_priv(dev);
>   struct ipoib_tx_buf *tx_req;
>   int rc;
> + unsigned long flags;
>  
>   if (unlikely(skb->len > tx->mtu)) {
>   ipoib_warn(priv, "packet len %d (> %d) too long to send, 
> dropping\n",
> @@ -742,27 +743,36 @@ void ipoib_cm_send(struct net_device *dev, struct 
> sk_buff *skb, struct ipoib_cm_
>   skb_orphan(skb);
>   skb_dst_drop(skb);
>  
> + spin_lock_irqsave(&priv->lock, flags);
> + if (++priv->tx_outstanding == ipoib_sendq_size) {
> + ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net 
> queue\n",
> +   tx->qp->qp_num);
> + netif_stop_queue(dev);
> + }
> + spin_unlock_irqrestore(&priv->lock, flags);
> + if (netif_queue_stopped(dev)) {
> + rc = ib_req_notify_cq(priv->send_cq,
> + IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
> + if (rc < 0)
> + ipoib_warn(priv, "request notify on send CQ failed\n");
> + else if (rc)
> + ipoib_send_comp_handler(priv->send_cq, dev);
> + }
> +
>   rc = post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), tx_req);
>   if (unlikely(rc)) {
>   ipoib_warn(priv, "post_send failed, error %d\n", rc);
>   ++dev->stats.tx_errors;
> + spin_lock_irqsave(&priv->lock, flags);
> + --priv->tx_outstanding;
> + if (netif_queue_stopped(dev))
> + netif_wake_queue(dev);
> + spin_unlock_irqrestore(&priv->lock, flags);
Why are you locking the netif_* calls?
>   ipoib_dma_unmap_tx(priv, tx_req);
>   dev_kfree_skb_any(skb);
>   } else {
>   dev->trans_start = jiffies;
>   ++tx->tx_head;
> -
> - if (++priv->tx_outstanding == ipoib_sendq_size) {
> - ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net 
> queue\n",
> -   tx->qp->qp_num);
> - netif_stop_queue(dev);
> - rc = ib_req_notify_cq(priv->send_cq,
> - IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
> - if (rc < 0)
> - ipoib_warn(priv, "request notify on send CQ 
> failed\n");
> - else if (rc)
> - ipoib_send_comp_handler(priv->send_cq, dev);
> - }
>   }
>  }
>  
> @@ -796,10 +806,13 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, 
> struct ib_wc *wc)
>   netif_tx_lock(dev);
>  
>   ++tx->tx_tail;
> +
> + spin_lock_irqsave(&priv->lock, flags);
>   if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
>   netif_queue_stopped(dev) &&
>   test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
>   netif_wake_queue(dev);
> + spin_unlock_irqrestore(&priv->lock, flags);
>  
>

Re: [PATCH] IB/mlx4: correct order of variables in log

2015-10-07 Thread Or Gerlitz

On 9/28/2015 5:08 AM, Wengang Wang wrote:

There is a mis-order in mlx4 log. Fix it.

Signed-off-by: Wengang Wang

Acked-by: Or Gerlitz 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/mlx4: correct order of variables in log

2015-10-07 Thread Or Gerlitz

On 9/28/2015 5:08 AM, Wengang Wang wrote:

There is a mis-order in mlx4 log. Fix it.

Signed-off-by: Wengang Wang

I wanted to ack it, but wait...

We want commits to our driver to start with Capital letter so please
resubmit with this  title

IB/mlx4: Use correct order of variables in log message

You can add Acked-by: Or Gerlitz 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IPoIB: serialize changing on tx_outstanding

2015-10-07 Thread Wengang Wang

Hi Leon,

thanks for review.

在 2015年10月08日 12:33, Leon Romanovsky 写道:

On Mon, Sep 28, 2015 at 01:42:10PM +0800, Wengang Wang wrote:

The changing on tx_outstanding should be protected by spinlock or to be
atomic operations.

Such log is found in dmesg:

Sep 16 14:20:53 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034733, 
tx_tail 1034733, tx_outstanding 359 ipoib_sendq_size: 512
Sep 16 14:21:33 naep11x06 kernel: ib0: transmit timeout: latency 9560 msecs
Sep 16 14:21:33 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034854, 
tx_tail 1034854, tx_outstanding 511 ipoib_sendq_size: 512
Sep 16 14:21:38 naep11x06 kernel: ib0: transmit timeout: latency 14568 msecs
Sep 16 14:21:38 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034854, 
tx_tail 1034854, tx_outstanding 511 ipoib_sendq_size: 512

And the send queue of ib0 kept full. When transmit timeout is reported,
queue is reported as "stopped", but the IPoIB stuff tx_head and tx_tail
points to same value. I am not able to see such numbers in ipoib_cm_tx
(for CM) because I have no vmcore. Though I am not quite sure it's caused
by parallel access of tx_outstanding(send path VS interrup path), we really
need to serialize the changeing on tx_outstanding.

This patch also make sure the increase of tx_outstanding prior to the
calling of post_send to avoid the possible decreasing before increasing in
case the running of increasing is scheduled later than the interrupt
handler.

Signed-off-by: Wengang Wang 
---
  drivers/infiniband/ulp/ipoib/ipoib_cm.c | 40 +++--
  drivers/infiniband/ulp/ipoib/ipoib_ib.c | 24 ++--
  2 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index c78dc16..044da94 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -710,6 +710,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff 
*skb, struct ipoib_cm_
struct ipoib_dev_priv *priv = netdev_priv(dev);
struct ipoib_tx_buf *tx_req;
int rc;
+   unsigned long flags;
  
  	if (unlikely(skb->len > tx->mtu)) {

ipoib_warn(priv, "packet len %d (> %d) too long to send, 
dropping\n",
@@ -742,27 +743,36 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff 
*skb, struct ipoib_cm_
skb_orphan(skb);
skb_dst_drop(skb);
  
+	spin_lock_irqsave(&priv->lock, flags);

+   if (++priv->tx_outstanding == ipoib_sendq_size) {
+   ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net 
queue\n",
+ tx->qp->qp_num);
+   netif_stop_queue(dev);
+   }
+   spin_unlock_irqrestore(&priv->lock, flags);
+   if (netif_queue_stopped(dev)) {
+   rc = ib_req_notify_cq(priv->send_cq,
+   IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
+   if (rc < 0)
+   ipoib_warn(priv, "request notify on send CQ failed\n");
+   else if (rc)
+   ipoib_send_comp_handler(priv->send_cq, dev);
+   }
+
rc = post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), tx_req);
if (unlikely(rc)) {
ipoib_warn(priv, "post_send failed, error %d\n", rc);
++dev->stats.tx_errors;
+   spin_lock_irqsave(&priv->lock, flags);
+   --priv->tx_outstanding;
+   if (netif_queue_stopped(dev))
+   netif_wake_queue(dev);
+   spin_unlock_irqrestore(&priv->lock, flags);

Why are you locking the netif_* calls?


Yes, I intended to do that.   This make the accessing on tx_outstanding 
and the reopening of the send queue in the same atomic session which is 
the expected behavior.

Otherwise,  we may have the following problem:
#time order

thread1(on cpu1) thread2(on cpu2)
lock
modify/check tx_outstanding
unlock


lock
modify/check tx_outstanding
unlock

reopen queue


   stop queue


So that we actually want reopen the send queue, but the result is we 
stopped it.


thanks,
wengang


ipoib_dma_unmap_tx(priv, tx_req);
dev_kfree_skb_any(skb);
} else {
dev->trans_start = jiffies;
++tx->tx_head;
-
-   if (++priv->tx_outstanding == ipoib_sendq_size) {
-   ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net 
queue\n",
- tx->qp->qp_num);
-   netif_stop_queue(dev);
-   rc = ib_req_notify_cq(priv->send_cq,
-   IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
-   if (rc < 0)
-   ipoib_warn(priv, "request notify on send CQ 
failed\n");
-   else if (rc)
-   ipoib_send_comp_handler(priv->send_cq, dev);
-   }
}
  }
 

Re: [PATCH] IB/mlx4: correct order of variables in log

2015-10-07 Thread Wengang Wang

Thanks Or.
I will resend the revised(title) the patch with your Ack.

thanks,
wengang

在 2015年10月08日 12:52, Or Gerlitz 写道:

On 9/28/2015 5:08 AM, Wengang Wang wrote:

There is a mis-order in mlx4 log. Fix it.

Signed-off-by: Wengang Wang

I wanted to ack it, but wait...

We want commits to our driver to start with Capital letter so please
resubmit with this  title

IB/mlx4: Use correct order of variables in log message

You can add Acked-by: Or Gerlitz 


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/mlx4: Use vmalloc for WR buffers when needed

2015-10-07 Thread Or Gerlitz

On 9/24/2015 1:49 PM, Wengang Wang wrote:

@@ -1050,8 +1057,9 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, 
struct mlx4_ib_qp *qp,
  &qp->db);
ib_umem_release(qp->umem);
} else {
-   kfree(qp->sq.wrid);
-   kfree(qp->rq.wrid);
+   kvfree(qp->sq.wrid);
+   kvfree(qp->rq.wrid);
+
if (qp->mlx4_ib_qp_type & (MLX4_IB_QPT_PROXY_SMI_OWNER |
MLX4_IB_QPT_PROXY_SMI | MLX4_IB_QPT_PROXY_GSI))
free_proxy_bufs(&dev->ib_dev, qp);


when you fix issue A you don't conduct cleanup for issue B, remove the 
new line added here andadd


Acked-by: Or Gerlitz 

When you re-submit patches makes sure to maintain version, e.g the 
re-post would be [PATCH V2] (I believe

this is going to be the 3rd version, so V0... V1 ... and now V2)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2] IB/mlx4: Use correct order of variables in log message

2015-10-07 Thread Wengang Wang
There is a mis-order in mlx4 log. Fix it.

Signed-off-by: Wengang Wang 
Acked-by: Or Gerlitz 
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 0a32020..150fbb3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -1010,7 +1010,7 @@ static int mlx4_MAD_IFC_wrapper(struct mlx4_dev *dev, int 
slave,
if (!(smp->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED &&
  smp->method == IB_MGMT_METHOD_GET) || network_view) {
mlx4_err(dev, "Unprivileged slave %d is trying to 
execute a Subnet MGMT MAD, class 0x%x, method 0x%x, view=%s for attr 0x%x. 
Rejecting\n",
-slave, smp->method, smp->mgmt_class,
+slave, smp->mgmt_class, smp->method,
 network_view ? "Network" : "Host",
 be16_to_cpu(smp->attr_id));
return -EPERM;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V3] IB/mlx4: Use vmalloc for WR buffers when needed

2015-10-07 Thread Wengang Wang
There are several hits that WR buffer allocation(kmalloc) failed.
It failed at order 3 and/or 4 contigous pages allocation. At the same time
there are actually 100MB+ free memory but well fragmented.
So try vmalloc when kmalloc failed.

Signed-off-by: Wengang Wang 
Acked-by: Or Gerlitz 
---
 drivers/infiniband/hw/mlx4/qp.c  | 19 +--
 drivers/infiniband/hw/mlx4/srq.c | 11 ---
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 4ad9be3..3ccbd3a 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -786,8 +787,14 @@ static int create_qp_common(struct mlx4_ib_dev *dev, 
struct ib_pd *pd,
if (err)
goto err_mtt;
 
-   qp->sq.wrid  = kmalloc(qp->sq.wqe_cnt * sizeof (u64), gfp);
-   qp->rq.wrid  = kmalloc(qp->rq.wqe_cnt * sizeof (u64), gfp);
+   qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(u64), gfp);
+   if (!qp->sq.wrid)
+   qp->sq.wrid = __vmalloc(qp->sq.wqe_cnt * sizeof(u64),
+   gfp, PAGE_KERNEL);
+   qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(u64), gfp);
+   if (!qp->rq.wrid)
+   qp->rq.wrid = __vmalloc(qp->rq.wqe_cnt * sizeof(u64),
+   gfp, PAGE_KERNEL);
if (!qp->sq.wrid || !qp->rq.wrid) {
err = -ENOMEM;
goto err_wrid;
@@ -874,8 +881,8 @@ err_wrid:
if (qp_has_rq(init_attr))

mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), &qp->db);
} else {
-   kfree(qp->sq.wrid);
-   kfree(qp->rq.wrid);
+   kvfree(qp->sq.wrid);
+   kvfree(qp->rq.wrid);
}
 
 err_mtt:
@@ -1050,8 +1057,8 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, 
struct mlx4_ib_qp *qp,
  &qp->db);
ib_umem_release(qp->umem);
} else {
-   kfree(qp->sq.wrid);
-   kfree(qp->rq.wrid);
+   kvfree(qp->sq.wrid);
+   kvfree(qp->rq.wrid);
if (qp->mlx4_ib_qp_type & (MLX4_IB_QPT_PROXY_SMI_OWNER |
MLX4_IB_QPT_PROXY_SMI | MLX4_IB_QPT_PROXY_GSI))
free_proxy_bufs(&dev->ib_dev, qp);
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index dce5dfe..8d133c4 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mlx4_ib.h"
 #include "user.h"
@@ -172,8 +173,12 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
 
srq->wrid = kmalloc(srq->msrq.max * sizeof (u64), GFP_KERNEL);
if (!srq->wrid) {
-   err = -ENOMEM;
-   goto err_mtt;
+   srq->wrid = __vmalloc(srq->msrq.max * sizeof(u64),
+ GFP_KERNEL, PAGE_KERNEL);
+   if (!srq->wrid) {
+   err = -ENOMEM;
+   goto err_mtt;
+   }
}
}
 
@@ -204,7 +209,7 @@ err_wrid:
if (pd->uobject)
mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), 
&srq->db);
else
-   kfree(srq->wrid);
+   kvfree(srq->wrid);
 
 err_mtt:
mlx4_mtt_cleanup(dev->dev, &srq->mtt);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2 00/26] New fast registration API

2015-10-07 Thread Selvin Xavier
Sagi,

With the RDMA CM patch to accept RoCE
connections(https://patchwork.kernel.org/patch/7335451/) on top of
reg_api.5 branch,  I have tested  nfs rdma over ocrdma. Looks good.

Tested-by: Selvin Xavier 

Thanks,
Selvin Xavier

> -Original Message-
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> ow...@vger.kernel.org] On Behalf Of Sagi Grimberg
> Sent: Thursday, September 24, 2015 11:05 PM
> To: linux-rdma@vger.kernel.org
> Cc: linux-...@vger.kernel.org; Nicholas A. Bellinger
> Subject: [PATCH v2 00/26] New fast registration API
>
> Hi all,
>
> As discussed on the linux-rdma list, there is plenty of room for
improvement
> in our memory registration APIs. We keep finding ULPs that are
duplicating
> code, sometimes use wrong strategies and mis-use our current API.
>
> As a first step, this patch set replaces the fast registration API to
accept a
> kernel common struct scatterlist and takes care of the page vector
> construction in the core layer with hooks for the drivers HW specific
> assignments. This allows to remove a common code duplication as it was
> done in each and every ULP driver.
>
> Changes from v1:
> - Add ib_map_mr_sg_zbva() for RDS which uses it (preferred it over
>   polluting the API).
>
> - Replaced coherent allocations in mlx4, mlx5 with DMA streaming
>   APIs (Bart)
>
> - Changed ib_map_mr_sg description (Bart)
>
> - Split SRP driver patches (Bart)
>
> - Added missing wr->next = NULL from various ULPs (Steve, Santosh)
>
> - Fixed 0-day testing errors in nes driver, xprtrdma and svcrdma
>
> - Fixed checkpatch issues
>
> Changes from v0:
> - Rebased on top of 4.3-rc1 + Christoph's ib_send_wr conversion patches
>
> - Allow the ULP to pass page_size argument to ib_map_mr_sg in order
>   to have it work better in some specific workloads. This suggestion
>   came from Bart Van Assche which pointed out that some applications
>   might use page sizes significantly smaller than the system PAGE_SIZE
>   of specific architectures
>
> - Fixed some logical bugs in ib_sg_to_pages
>
> - Added a set_page function pointer for drivers to pass to
ib_sg_to_pages
>   so some drivers (e.g mlx4, mlx5, nes) can avoid keeping a second page
>   vector and/or re-iterate on the page vector in order to perform HW
specific
>   assignments (big/little endian conversion, extra flags)
>
> - Converted SRP initiator and RDS iwarp ULPs to the new API
>
> - Removed fast registration code from hfi1 driver (as it isn't supported
>   anyway). I assume that the correct place to get the support back would
>   be in a shared SW library (hfi1, qib, rxe).
>
> - Updated the change logs
>
> So far my tests covered:
> - ULPs:
>   * iser initiator
>   * iser target
>   * xprtrdma
>   * svcrdma
> - Drivers:
>   * mlx4
>   * mlx5
>   * Steve Wise was kind enough to run NFS client/server over cxgb4
> and
> reported good results.
>   * Santosh reports that RDS IB conversion is functional and working
> with ib_map_mr_sg_zbva
>
> I don't have access to other HW devices (qib, nes) nor iwarp devices so
RDS is
> compile tested only. Santosh reports
>
> I'm targeting this to 4.4 so I'll appreciate more feedback and a bigger
testing
> coverage.
>
> The code is available at:
> https://github.com/sagigrimberg/linux/tree/reg_api.4
>
> Sagi Grimberg (26):
>   IB/core: Introduce new fast registration API
>   IB/mlx5: Remove dead fmr code
>   IB/mlx5: Support the new memory registration API
>   IB/mlx4: Support the new memory registration API
>   RDMA/ocrdma: Support the new memory registration API
>   RDMA/cxgb3: Support the new memory registration API
>   iw_cxgb4: Support the new memory registration API
>   IB/qib: Support the new memory registration API
>   RDMA/nes: Support the new memory registration API
>   IB/iser: Port to new fast registration API
>   iser-target: Port to new memory registration API
>   xprtrdma: Port to new memory registration API
>   svcrdma: Port to new memory registration API
>   RDS/IW: Convert to new memory registration API
>   IB/srp: Split srp_map_sg
>   IB/srp: Convert to new registration API
>   IB/srp: Dont allocate a page vector when using fast_reg
>   IB/mlx5: Remove old FRWR API support
>   IB/mlx4: Remove old FRWR API support
>   RDMA/ocrdma: Remove old FRWR API
>   RDMA/cxgb3: Remove old FRWR API
>   iw_cxgb4: Remove old FRWR API
>   IB/qib: Remove old FRWR API
>   RDMA/nes: Remove old FRWR API
>   IB/hfi1: Remove Old fast registraion API support
>   IB/core: Remove old fast registration API
>
>  drivers/infiniband/core/verbs.c | 132 +++---
>  drivers/infiniband/hw/cxgb3/iwch_cq.c   |   2 +-
>  drivers/infiniband/hw/cxgb3/iwch_provider.c |  39 +++--
>  drivers/infiniband/hw/cxgb3/iwch_provider.h |   2 +
>  drivers/infiniband/hw/cxgb3/iwch_qp.c   |  37 ++--
>  drivers/infiniband/hw/cxgb4/cq.c|   2 +-
>  drivers/infiniband/hw/cxgb4/iw_cxgb4.h  |  25 +--
>  drivers/infiniband/hw/cxgb4/mem.c

Re: [PATCH] IB/mlx4: Use vmalloc for WR buffers when needed

2015-10-07 Thread Leon Romanovsky
On Fri, Sep 25, 2015 at 08:51:22AM +0800, Wengang Wang wrote:
> Hi Or,
> 
> 在 2015年09月24日 19:57, Or Gerlitz 写道:
> >On Thu, Sep 24, 2015 at 1:49 PM, Wengang Wang  
> >wrote:
> >>@@ -786,8 +787,14 @@ static int create_qp_common(struct mlx4_ib_dev *dev, 
> >>struct ib_pd *pd,
> >> if (err)
> >> goto err_mtt;
> >>
> >>-   qp->sq.wrid  = kmalloc(qp->sq.wqe_cnt * sizeof (u64), gfp);
> >>-   qp->rq.wrid  = kmalloc(qp->rq.wqe_cnt * sizeof (u64), gfp);
> >>+   qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(u64), gfp);
> >>+   if (!qp->sq.wrid)
> >>+   qp->sq.wrid = __vmalloc(qp->sq.wqe_cnt * 
> >>sizeof(u64),
> >>+   gfp, PAGE_KERNEL);
> >On other spots of mlx4, we're using vmalloc and not __vmalloc, any
> >pros/cons for going that way too here?
> 
> vmalloc is just using GFP_KERNEL | __GFP_HIGHMEM, we can't pass in the flag
> gfp with it.  We should respect orginal code which needs to pass the flag.
Additionally, I want to spot Or's attention on the following discussion
in MM-subsystem about kmalloc/vmalloc and general function to fallback
from one to another.

[1] [PATCH 2/7] mm: introduce kvmalloc and kvmalloc_node
https://lkml.org/lkml/2015/7/7/548
[2] [PATCH 0/7] mm: reliable memory allocation with kvmalloc
https://lkml.org/lkml/2015/7/7/545

> 
> thanks,
> wengang
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/mlx4: Use vmalloc for WR buffers when needed

2015-10-07 Thread Or Gerlitz

On 10/8/2015 9:06 AM, Leon Romanovsky wrote:

Additionally, I want to spot Or's attention on the following discussion
in MM-subsystem about kmalloc/vmalloc and general function to fallback
from one to another.




too busy to read them now, if you have review comments speak now and 
provide them to the author.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/mlx4: Use vmalloc for WR buffers when needed

2015-10-07 Thread Leon Romanovsky
On Thu, Oct 8, 2015 at 9:14 AM, Or Gerlitz  wrote:
> On 10/8/2015 9:06 AM, Leon Romanovsky wrote:
>>
>> Additionally, I want to spot Or's attention on the following discussion
>> in MM-subsystem about kmalloc/vmalloc and general function to fallback
>> from one to another.
>>
>>
>
> too busy to read them now, if you have review comments speak now and provide
> them to the author.
My comments to author that from my point of view this patch is a
correct to fix current behaviour.

The more general solution (I doubt if it is feasible) is to decrease
the dependency on high order allocations.

>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IPoIB: serialize changing on tx_outstanding

2015-10-07 Thread Leon Romanovsky
>>> +   spin_lock_irqsave(&priv->lock, flags);
>>> +   --priv->tx_outstanding;
>>> +   if (netif_queue_stopped(dev))
>>> +   netif_wake_queue(dev);
>>> +   spin_unlock_irqrestore(&priv->lock, flags);
>>
>> Why are you locking the netif_* calls?
>
>
> Yes, I intended to do that.   This make the accessing on tx_outstanding and
> the reopening of the send queue in the same atomic session which is the
> expected behavior.
> Otherwise,  we may have the following problem:
> #time order
>
> thread1(on cpu1) thread2(on cpu2)
> lock
> modify/check tx_outstanding
> unlock
>
>
> lock
> modify/check tx_outstanding
> unlock
>
> reopen queue
>
>
>stop queue
>
>
> So that we actually want reopen the send queue, but the result is we stopped
> it.
Thanks for the explanation.

>
> thanks,
> wengang
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html