Re: nfsrdma broken on 2.6.34-rc1?

2010-03-29 Thread Tom Tucker

Roland Dreier wrote:

 > This issue is really not in the NFS-RDMA code.  the nfsd code is doing
 > the binding.  See commit:

I think the really relevant thing is 7d21c0f9 ("SUNRPC: Set IPV6ONLY
flag on PF_INET6 RPC listener sockets") and followups.  NFS expects to
have one IPv6-only socket and one IPv4-only socket.

  


Yes, this new behavior is common to both sockets and rdma transports.


It seems RDMA CM should create a similar V6ONLY option for binding (and
probably default to the /proc/sys/net/ipv6/bindv6only sysctl value) to
handle this.
  

This makes sense to me.

Tom

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: nfsrdma broken on 2.6.34-rc1?

2010-03-29 Thread Roland Dreier
 > This issue is really not in the NFS-RDMA code.  the nfsd code is doing
 > the binding.  See commit:

I think the really relevant thing is 7d21c0f9 ("SUNRPC: Set IPV6ONLY
flag on PF_INET6 RPC listener sockets") and followups.  NFS expects to
have one IPv6-only socket and one IPv4-only socket.

It seems RDMA CM should create a similar V6ONLY option for binding (and
probably default to the /proc/sys/net/ipv6/bindv6only sysctl value) to
handle this.
-- 
Roland Dreier  || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: nfsrdma broken on 2.6.34-rc1?

2010-03-29 Thread Jason Gunthorpe
On Mon, Mar 29, 2010 at 02:51:46PM -0500, Steve Wise wrote:

>> Yeah, exactly, it is very complex and there is a real need for
>> things pretending to be IP to capture all this subtlety. The details
>> can't just be skipped over, people will notice :(
>>
>> Though, I'm also not entirely certain that NFS-RDMA is right to bind
>> to both AFs, generally speaking on Linux for a multi-protocol app you
>> only want to bind to v6 addresses.. Or is it using IPV6_V6ONLY or alike?

> This issue is really not in the NFS-RDMA code.  the nfsd code is doing  
> the binding.  See commit:
>
> 37498292aa97658a5d0a9bb84699ce8c1016bb74
> Author: Chuck Lever 
> Date:   Tue Jan 26 14:04:22 2010 -0500
>
>NFSD: Create PF_INET6 listener in write_ports

Sure.. but it relies on the behavior of svcsock.c which does this:

if (family == PF_INET6)
kernel_setsockopt(sock, SOL_IPV6, IPV6_V6ONLY,
(char *)&val, sizeof(val));

And the NFS-RDMA has no equivalent. Having the common code explicitly
rely on IPV6_V6ONLY is quite troublesome when you can't implement it
:)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: nfsrdma broken on 2.6.34-rc1?

2010-03-29 Thread Steve Wise

Jason Gunthorpe wrote:

On Mon, Mar 29, 2010 at 12:01:07PM -0700, Roland Dreier wrote:
  

 > > The rdma_cm might be able to support this if the port space were separated 
based
 > > on the address family, depending on how PS IB ends up.
 > 
 > I think separate port spaces is the correct solution.


This gets a bit tricky -- for normal IP stuff, there's the "bindv6only"
sysctl (and the IPV6_V6ONLY socket option).  Without that, you can't
bind an IPv4 socket to the same port as an IPv6 socket, since the IPv6
socket will accept IPv4 connections via an v4->v6 mapped address.  (You
can look at inet_csk_bind_conflict() to see the full complexity of the
checking done when binding an IPv4 socket)



Yeah, exactly, it is very complex and there is a real need for
things pretending to be IP to capture all this subtlety. The details
can't just be skipped over, people will notice :(

Though, I'm also not entirely certain that NFS-RDMA is right to bind
to both AFs, generally speaking on Linux for a multi-protocol app you
only want to bind to v6 addresses.. Or is it using IPV6_V6ONLY or alike?

  


This issue is really not in the NFS-RDMA code.  the nfsd code is doing 
the binding.  See commit:



37498292aa97658a5d0a9bb84699ce8c1016bb74
Author: Chuck Lever 
Date:   Tue Jan 26 14:04:22 2010 -0500

   NFSD: Create PF_INET6 listener in write_ports


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: nfsrdma broken on 2.6.34-rc1?

2010-03-29 Thread Jason Gunthorpe
On Mon, Mar 29, 2010 at 12:01:07PM -0700, Roland Dreier wrote:
>  > > The rdma_cm might be able to support this if the port space were 
> separated based
>  > > on the address family, depending on how PS IB ends up.
>  > 
>  > I think separate port spaces is the correct solution.
> 
> This gets a bit tricky -- for normal IP stuff, there's the "bindv6only"
> sysctl (and the IPV6_V6ONLY socket option).  Without that, you can't
> bind an IPv4 socket to the same port as an IPv6 socket, since the IPv6
> socket will accept IPv4 connections via an v4->v6 mapped address.  (You
> can look at inet_csk_bind_conflict() to see the full complexity of the
> checking done when binding an IPv4 socket)

Yeah, exactly, it is very complex and there is a real need for
things pretending to be IP to capture all this subtlety. The details
can't just be skipped over, people will notice :(

Though, I'm also not entirely certain that NFS-RDMA is right to bind
to both AFs, generally speaking on Linux for a multi-protocol app you
only want to bind to v6 addresses.. Or is it using IPV6_V6ONLY or alike?

> I wonder what the right way from the RDMA CM to stay close to Linux
> sockets semantics without adding too much horror is.  (Adding Jason to
> the CC list since he usually has an opinion about things like this :)

Clearly the best way is to figure out some way to work with the
existing routines in the kernel. This stuff is complex and duplicating
all of it in rdma_cm would be annoying..

To match the semantics each CM ID would still register to one SID but
an incoming connection request on a v4 PF SID could be matched to a v6
SID, etc.

I don't think new port spaces in the API are desirable.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] IB/ipoib: fix dangling pointer references to ipoib_neigh and ipoib_path

2010-03-29 Thread Roland Dreier
 > > > @@ -249,44 +252,26 @@ static int __path_add(struct net_device *dev, 
 > > > struct ipoib_path *path)
 > > >  n = &pn->rb_left;
 > > >  else if (ret > 0)
 > > >  n = &pn->rb_right;
 > > > -else
 > > > -return -EEXIST;
 > > > +else /* Should never happen since we always search 
 > > > first */
 > > > +return;
 > > >  }
 > > 
 > > Why not remove the last else and change the "else if" into else?
 > 
 > I don't understand. This is left, right, or return.
 > I'm only changing the return value to void since it is
 > never used.

It would probably be better to split that cleanup out into a separate
patch.  And since we have a "should never happen" condition in the code,
I guess we should either trust our code and just assume that it really
never happens (ie have just an if-else as Eli suggests), or add
something like a WARN_ONCE() if it actually does happen (probably safer)
-- 
Roland Dreier  || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: nfsrdma broken on 2.6.34-rc1?

2010-03-29 Thread Roland Dreier
 > > The rdma_cm might be able to support this if the port space were separated 
 > > based
 > > on the address family, depending on how PS IB ends up.
 > 
 > I think separate port spaces is the correct solution.

This gets a bit tricky -- for normal IP stuff, there's the "bindv6only"
sysctl (and the IPV6_V6ONLY socket option).  Without that, you can't
bind an IPv4 socket to the same port as an IPv6 socket, since the IPv6
socket will accept IPv4 connections via an v4->v6 mapped address.  (You
can look at inet_csk_bind_conflict() to see the full complexity of the
checking done when binding an IPv4 socket)

I wonder what the right way from the RDMA CM to stay close to Linux
sockets semantics without adding too much horror is.  (Adding Jason to
the CC list since he usually has an opinion about things like this :)

 - R.
-- 
Roland Dreier  || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: nfsrdma broken on 2.6.34-rc1?

2010-03-29 Thread Steve Wise

Sean Hefty wrote:


The rdma_cm might be able to support this if the port space were separated based
on the address family, depending on how PS IB ends up.


I think separate port spaces is the correct solution.

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: nfsrdma broken on 2.6.34-rc1?

2010-03-29 Thread Sean Hefty
>Does the rdma-cm allow concurrent binds to PF_INET, INADDR_ANY, port=X
>and PF_INET6, IN6ADDR_ANY_INIT, port=X ?

No, since this shows up in the rdma_cm as using the same port space.

The rdma_cm might be able to support this if the port space were separated based
on the address family, depending on how PS IB ends up.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] IB/ipoib: fix dangling pointer references to ipoib_neigh and ipoib_path

2010-03-29 Thread Ralph Campbell
On Sun, 2010-03-28 at 09:02 -0700, Eli Cohen wrote:
> On Thu, Mar 04, 2010 at 10:58:27AM -0800, Ralph Campbell wrote:
> > Subject: [PATCH v3] IB/ipoib: fix dangling pointer references to 
> > ipoib_neigh and ipoib_path
> > 
> > When using connected mode, ipoib_cm_create_tx() kmallocs a
> > struct ipoib_cm_tx which contains pointers to ipoib_neigh and
> > ipoib_path. If the paths are flushed or the struct neighbour is
> > destroyed, the pointers held by struct ipoib_cm_tx can reference
> > freed memory.
> > 
> I could use some more content here as this is quite a large patch.
> Anyway below are some comments. I think besides reviewing this patch
> we need to make sure it has been checked in real life.

Do you mean more details about how ipoib_cm_tx can be used after
being freed or more about how the changes fix the problem?

I have been waiting for our internal regression tests to finish
and to collect review feedback before sending out the final version
of the patch.

> > +/*
> > + * Search the list of connections to be started and remove any entries
> > + * which match the path being destroyed.
> > + *
> > + * This should be called with netif_tx_lock_bh() and priv->lock held.
> > + */
> > +void ipoib_cm_flush_path(struct net_device *dev, struct ipoib_path *path)
> > +{
> > +   struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +   struct ipoib_cm_tx *tx;
> > +
> > +   list_for_each_entry(tx, &priv->cm.start_list, list) {
> > +   tx = list_entry(priv->cm.start_list.next, typeof(*tx), list);
> > +   if (tx->path == path) {
> > +   tx->path = NULL;
> > +   list_del_init(&tx->list);
> > +   break;
> > +   }
> > }
> >  }
> 
> Looks to me like we may find struct ipoib_cm_tx objects hanging and
> not freed. This could happen if they were just added to start_list and
> removed by ipoib_cm_flush_path() before being processed by
> ipoib_cm_tx_start().

Quite right. I should also use list_for_each_entry_safe().
I will fix this.

> > -static int __path_add(struct net_device *dev, struct ipoib_path *path)
> > +static void __path_add(struct net_device *dev, struct ipoib_path *path)
> >  {
> > struct ipoib_dev_priv *priv = netdev_priv(dev);
> > struct rb_node **n = &priv->path_tree.rb_node;
> > @@ -249,44 +252,26 @@ static int __path_add(struct net_device *dev, struct 
> > ipoib_path *path)
> > n = &pn->rb_left;
> > else if (ret > 0)
> > n = &pn->rb_right;
> > -   else
> > -   return -EEXIST;
> > +   else /* Should never happen since we always search first */
> > +   return;
> > }
> 
> Why not remove the last else and change the "else if" into else?

I don't understand. This is left, right, or return.
I'm only changing the return value to void since it is
never used.

> > }
> > @@ -460,19 +446,13 @@ static void path_rec_completion(int status,
> > memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw,
> >sizeof(union ib_gid));
> >  
> > -   if (ipoib_cm_enabled(dev, neigh->neighbour)) {
> > -   if (!ipoib_cm_get(neigh))
> > -   ipoib_cm_set(neigh, 
> > ipoib_cm_create_tx(dev,
> > -  
> > path,
> > -  
> > neigh));
> > -   if (!ipoib_cm_get(neigh)) {
> > -   list_del(&neigh->list);
> > -   if (neigh->ah)
> > -   ipoib_put_ah(neigh->ah);
> > -   ipoib_neigh_free(dev, neigh);
> > -   continue;
> > -   }
> > -   }
> ipoib_cm_set() is called only once in neigh_alloc(), setting neigh->cm
> to  NULL, and never again so all calls to ipoib_cm_get() will return
> NULL.

ipoib_cm_set() is only called once since I changed
ipoib_cm_create_tx() to initialize neigh->cm and return void.
To me, it is more clear to let ipoib_cm.c do as much of
the CM specific work as possible instead of defining two
versions of a function for when CM is compiled in or not.
I guess I could change neigh_alloc() to call kzmalloc()
and remove ipoib_cm_set().
Normally, I would define set/get pairs but ipoib_main.c
doesn't really need to use the pointer, just know that it
has been allocated and be able to pass it to ipoib_cm.c
I guess we could hide more of the details by just passing
neigh and letting the CM functions get the ipoib_cm_tx pointer.
It is all a matter of style so I'm OK with changing it back
to the original or making it a separate patch (probably wiser).

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.

Re: nfsrdma broken on 2.6.34-rc1?

2010-03-29 Thread Steve Wise
Actually, this looks like a recent nfs change that now creates an INET 
and INET6 transport when you add one via the /proc/fs/nfsd/portlist 
file.  Looking at 2.6.30, which works, the nfsctl.c code is very 
different and only creates an INET transport...





Steve Wise wrote:

Hey Sean,

I'm trying NFSRDMA on net-next and the server side fails when 
registering the rdma transport.  I think its due to the INET6 support 
added to the rdma-cm.  I'm still debugging though.


In fs/nfsd/nfsctl.c:__write_ports_addxprt(), it tries to create a new 
svc transport for PF_INET, and PF_INET6 using the same port and the 
wildcard address.  If the INET6 fails with anything other than 
-EAFNOSUPPORT, then the entire transport registration fails (ie no 
RDMA/INET support is added).
When I do echo "rdma 20049" > /proc/fs/nfsd/portlist  I see the 
PF_INET transport get created successfully, but the INET6 transport 
create fails with -EADDRNOTAVAIL.
Does the rdma-cm allow concurrent binds to PF_INET, INADDR_ANY, port=X 
and PF_INET6, IN6ADDR_ANY_INIT, port=X ?  Apparently the native stack 
allows this (which makes sense seeing as how they are different 
protocol families).



Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


nfsrdma broken on 2.6.34-rc1?

2010-03-29 Thread Steve Wise

Hey Sean,

I'm trying NFSRDMA on net-next and the server side fails when 
registering the rdma transport.  I think its due to the INET6 support 
added to the rdma-cm.  I'm still debugging though.


In fs/nfsd/nfsctl.c:__write_ports_addxprt(), it tries to create a new 
svc transport for PF_INET, and PF_INET6 using the same port and the 
wildcard address.  If the INET6 fails with anything other than 
-EAFNOSUPPORT, then the entire transport registration fails (ie no 
RDMA/INET support is added). 

When I do echo "rdma 20049" > /proc/fs/nfsd/portlist  I see the PF_INET 
transport get created successfully, but the INET6 transport create fails 
with -EADDRNOTAVAIL. 

Does the rdma-cm allow concurrent binds to PF_INET, INADDR_ANY, port=X 
and PF_INET6, IN6ADDR_ANY_INIT, port=X ?  Apparently the native stack 
allows this (which makes sense seeing as how they are different protocol 
families).



Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/12] ehca: convert cpu notifier to return encapsulate errno value

2010-03-29 Thread Alexander Schmidt
On Thu, 18 Mar 2010 18:05:21 +0900
Akinobu Mita  wrote:

> By the previous modification, the cpu notifier can return encapsulate
> errno value. This converts the cpu notifiers for ehca.
> 
> Signed-off-by: Akinobu Mita 
> Cc: Hoang-Nam Nguyen 
> Cc: Christoph Raisch 
> Cc: linux-rdma@vger.kernel.org

Acked-by: Alexander Schmidt 

> ---
>  drivers/infiniband/hw/ehca/ehca_irq.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/ehca/ehca_irq.c 
> b/drivers/infiniband/hw/ehca/ehca_irq.c
> index b2b6fea..31f3ee1 100644
> --- a/drivers/infiniband/hw/ehca/ehca_irq.c
> +++ b/drivers/infiniband/hw/ehca/ehca_irq.c
> @@ -845,7 +845,7 @@ static int __cpuinit comp_pool_callback(struct 
> notifier_block *nfb,
>   ehca_gen_dbg("CPU: %x (CPU_PREPARE)", cpu);
>   if (!create_comp_task(pool, cpu)) {
>   ehca_gen_err("Can't create comp_task for cpu: %x", cpu);
> - return NOTIFY_BAD;
> + return notifier_from_errno(-ENOMEM);
>   }
>   break;
>   case CPU_UP_CANCELED:
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html