[PATCH] IB/ipoib: Do not turn on carrier to a non active port

2009-09-17 Thread Moni Shoua

This patch fixes https://bugs.openfabrics.org/show_bug.cgi?id=1726
Multicast join can succeed even if IB port is down. This happens when OpenSM
runs on the same port as the requesting port. IPoIB on the other hand, calls
netif_carrier_on() when join succeeded without caring about the state of
the IB port. The result is - an IPoIB interface in RUNNING state but without
active IB port to support it. If a bonding interface uses this IPoIB interface
as a slave it might not detect that this slave is almost useless and failover
functionality will be damaged.
The fix here is to check the state of the IB port in the carrier_task before
calling netif_carrier_on().

Signed-off-by: Moni Shoua 
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |2 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c|2 ++
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |2 +-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   13 +++--
 4 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 753a983..f29ce14 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -292,7 +292,7 @@ struct ipoib_dev_priv {
 
struct delayed_work pkey_poll_task;
struct delayed_work mcast_task;
-   struct work_struct carrier_on_task;
+   struct delayed_work  carrier_on_task;
struct work_struct flush_light;
struct work_struct flush_normal;
struct work_struct flush_heavy;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index e35f4a0..c452089 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -724,6 +724,8 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush)
ipoib_dbg(priv, "downing ib_dev\n");
 
clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
+   cancel_delayed_work(&priv->carrier_on_task);
+
netif_carrier_off(dev);
 
/* Shutdown the P_Key thread if still active */
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 2bf5116..5242e0d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1079,7 +1079,7 @@ static void ipoib_setup(struct net_device *dev)
 
INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll);
INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
-   INIT_WORK(&priv->carrier_on_task, ipoib_mcast_carrier_on_task);
+   INIT_DELAYED_WORK(&priv->carrier_on_task, ipoib_mcast_carrier_on_task);
INIT_WORK(&priv->flush_light,   ipoib_ib_dev_flush_light);
INIT_WORK(&priv->flush_normal,   ipoib_ib_dev_flush_normal);
INIT_WORK(&priv->flush_heavy,   ipoib_ib_dev_flush_heavy);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 25874fc..b4b4016 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -361,13 +361,22 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast 
*mcast)
 void ipoib_mcast_carrier_on_task(struct work_struct *work)
 {
struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
-  carrier_on_task);
+  carrier_on_task.work);
+   struct ib_port_attr attr;
 
/*
 * Take rtnl_lock to avoid racing with ipoib_stop() and
 * turning the carrier back on while a device is being
 * removed.
 */
+
+   if (ib_query_port(priv->ca, priv->port, &attr) ||
+   attr.state != IB_PORT_ACTIVE) {
+   ipoib_dbg(priv, "wait with carrier until IB port is active\n");
+   if (test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
+   queue_delayed_work(ipoib_workqueue, 
&priv->carrier_on_task, HZ);
+   return;
+   }
rtnl_lock();
netif_carrier_on(priv->dev);
rtnl_unlock();
@@ -403,7 +412,7 @@ static int ipoib_mcast_join_complete(int status,
 * deadlock on rtnl_lock here.
 */
if (mcast == priv->broadcast)
-   queue_work(ipoib_workqueue, &priv->carrier_on_task);
+   queue_delayed_work(ipoib_workqueue, 
&priv->carrier_on_task, 0);
 
return 0;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/ipoib: Do not turn on carrier to a non active port

2009-09-21 Thread Moni Shoua
Roland Dreier wrote:
>  > +  if (ib_query_port(priv->ca, priv->port, &attr) ||
>  > +  attr.state != IB_PORT_ACTIVE) {
>  > +  ipoib_dbg(priv, "wait with carrier until IB port is active\n");
>  > +  if (test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
>  > +  queue_delayed_work(ipoib_workqueue, 
> &priv->carrier_on_task, HZ);
>  > +  return;
>  > +  }
> 
> This queueing delayed work to poll the port state seems a bit odd to
> me... we get an event when the port changes state anyway, right?  So
> can't we just turn the carrier on when we get an active event?
> 
>  - R.
You're right. I've complicated things where I shouldn't need.
The call to __ipoib_ib_dev_flush() from ipoib_event() will requeue the 
carrier_on_task in join completion of
the broadcast group.

I'll resend

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/ipoib: Do not turn on carrier to a non active port

2009-09-21 Thread Moni Shoua
Roland Dreier wrote:
> And by the way, this current patch has a deadlock I think:
> 
>  > @@ -724,6 +724,8 @@ int ipoib_ib_dev_down(struct net_device *dev, int 
> flush)
>  >ipoib_dbg(priv, "downing ib_dev\n");
>  >  
>  >clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
>  > +  cancel_delayed_work(&priv->carrier_on_task);
> 
> ipoib_ib_dev_down() is called with rtnl held but carrier_on_task() does
> rtn_lock().  So if carrier_on_task() is running but about to take the
> rtnl when we try to do cancel_delayed_work() here, then it will wait
> forever.
> 
> I think using lockdep on a new enough kernel (2.6.30 or maybe 2.6.31)
> will report workqueue / timer vs. lock deadlocks.
> 
>  - R.
I may miss this but I don't  see how ipoib_ib_dev_down() is called with rtnl 
held.
Anyway, the new patch doesn't use delayed work.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2] IB/ipoib: Do not turn on carrier to a non active port

2009-09-21 Thread Moni Shoua
This patch fixes https://bugs.openfabrics.org/show_bug.cgi?id=1726
Multicast join can succeed even if IB port is down. This happens when OpenSM
runs on the same port with the requesting port. IPoIB on the other hand, calls
netif_carrier_on() when join succeeded without caring about the state of
the IB port. The result is an IPoIB interface in RUNNING state but without
active IB port to support it. If a bonding interface uses this IPoIB interface
as a slave it might not detect that this slave is almost useless and failover
functionality will be damaged.
The fix  checks the state of the IB port in the carrier_task before
calling netif_carrier_on().

Signed-off-by: Moni Shoua 
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 25874fc..9ace51d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -362,12 +362,19 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 {
struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
   carrier_on_task);
+   struct ib_port_attr attr;
 
/*
 * Take rtnl_lock to avoid racing with ipoib_stop() and
 * turning the carrier back on while a device is being
 * removed.
 */
+
+   if (ib_query_port(priv->ca, priv->port, &attr) ||
+   attr.state != IB_PORT_ACTIVE) {
+   ipoib_dbg(priv, "wait with carrier until IB port is active\n");
+   return;
+   }
rtnl_lock();
netif_carrier_on(priv->dev);
rtnl_unlock();
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


How to handle illegal multicast addresses in IPoIB?

2009-09-21 Thread Moni Shoua
Hi,
The problem in short: Illegal multicast addresses can be passed to ib_ipoib 
which will add them to the queue of multicast addresses it needs to join. From 
this point on ib_ipoib will get stuck in the attempt to join an illegal 
multicast group and without a giving a chance for other groups to try.

Recently 2 patches were sent to to deal with this problem. Both of them were 
rejectedm mainly from technical reasons.
Since I think that it is important to have the problem fixed I want to revive 
the discussion and resubmit one of them or both, according to the opinions here.

One patch 
(http://lists.openfabrics.org/pipermail/general/2009-August/061663.html) checks 
each multicast address for validity before it lets it get into the queue. 

The other patch 
(http://lists.openfabrics.org/pipermail/general/2009-August/061693.html) queues 
all addresses it gets but adds fairness to the join attempts.

My questions is: Do we need both patches? If not, which one to choose?

I think that the first patch (check validity) is good enough.

thanks
 MoniS
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to handle illegal multicast addresses in IPoIB?

2009-09-21 Thread Moni Shoua
Or Gerlitz wrote:
> Moni Shoua wrote:
>> One patch 
>> (http://lists.openfabrics.org/pipermail/general/2009-August/061663.html) 
>> checks each 
>> multicast address for validity before it lets it get into the queue. 
> 
> isn't it the below commit which appears in Linus tree?
> 
> Or.
>> commit 5e47596bee12597824a3b5b21e20f80b61e58a35
>> Author: Jason Gunthorpe 
Yes, thanks you...
I must have missed it.

So there is only one question left. Do we also need the other patch?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] OFED 1.5 rc4 release is available

2009-12-16 Thread Moni Shoua

> Limitations:
> 
> - SLES10 SP3 on IA64 is not supported yet
> 
In what way is it unsupported?
Our QA say that OFED-1.5 can be compiled on this platform.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IPoIB issues

2010-03-10 Thread Moni Shoua
Eli Cohen wrote:
> I just posted a patch which might fix your problem. Please try it and
> let us know if it fixed anything.
> 
Hi Eli
Although Josh already reported that the patch seems to fix the issue I have a 
question though.

"post_send failed" prints were during work in datagram mode. I don't know if 
Josh verified 
that but I don't expect that these prints would go away, even with the patch. 
Am I right?

BTW, what could be the reason for UD QP post_send() failures?

>>
>> In datagram mode, I see errors on the boot servers of the form.
>>
>> ib0: post_send failed
>> ib0: post_send failed
>> ib0: post_send failed
>>
>>
>> When using connected mode, I hit a different error:
>>
>> NETDEV WATCHDOG: ib0: transmit timed out
>> ib0: transmit timeout: latency 1999 msecs
>> ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
>> NETDEV WATCHDOG: ib0: transmit timed out
>> ib0: transmit timeout: latency 2999 msecs
>> ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
>> ...
>> ...
>> NETDEV WATCHDOG: ib0: transmit timed out
>> ib0: transmit timeout: latency 61824999 msecs
>> ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
>>
>>
>> The errors seem to hit only after NFS comes into play.  Once it
>> starts, the NETDEV WATCHDOG messages continue until I run
>> 'ifconfig ib0 down up'.  I've tried tuning send_queue_size and
>> recv_queue_size on both sides, the txqueuelen of the ib0 interface, the
>> NFS rsize/wsize.  None of it seems to help greatly.  Does anyone have
>> any ideas about what can I do to try to fix
>> these problems?
>>
>> -JE
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "Unicast, no dst" warning from IPoIB

2010-03-22 Thread Moni Shoua

> Why did we get this? It could happen since the IPoIB neighbour that
> the specific instance of IPoIB CM is pointing to might have SKBs in
> its queue. When REP arrives for this connection, it will re-queue all
> the queued SKBs again but there may be no dst for them anymore.
> 
Do you mean that it ("no dst for them anymore.") happened due to aging of the 
neigh?
My intuition tells me that it should be a very rare scenario, if I guess your 
intension correctly.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH] ipoib: remove addrlen check for mc addresses

2010-03-23 Thread Moni Shoua

> Could you send a link to the git tree where I can find this commit and
> the related fixes?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
See commit 5e47596bee12597824a3b5b21e20f80b61e58a35 for the fix prior to this 
one.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH] ipoib: remove addrlen check for mc addresses

2010-03-23 Thread Moni Shoua
Eli Cohen wrote:
> On Tue, Mar 23, 2010 at 12:34:13PM +0200, Or Gerlitz wrote:
>> basically, as the subject line suggests, it should be in Dave's net-next tree
>>
> I just need to clone this tree and need the url. Can you give it to me
> from .git/config?
> Thanks.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
maybe your'e looking for this one
http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commit;h=32a806c194ea112cfab00f558482dd97bee5e44e

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH] ipoib: remove addrlen check for mc addresses

2010-03-23 Thread Moni Shoua

> maybe your'e looking for this one
> http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commit;h=32a806c194ea112cfab00f558482dd97bee5e44e

This is the link to the related patch of course but it also tells you which 
tree to clone
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: are IB port counters functioning over --IB-- with the RDMAoE patch set?

2010-06-03 Thread Moni Shoua
Did you try OFED-1.5.1 or even better, OFED-1.5.2?
I know patches for counters with RoCEE were submitted since OFED-1.5 and
I saw it working.


-Original Message-
From: Or Gerlitz 
Sent: Thursday, June 03, 2010 14:49
To: Eli Cohen; Vladimir Sokolovsky; Eli Cohen
Cc: Linux RDMA list; Moni Shoua; Shiri Franchi
Subject: Re: are IB port counters functioning over --IB-- with the
RDMAoE patch set?


> Hi Eli, Vlad,  Using some lab node here, I noticed that the IB port 
> counters were not functioning over --IB-- with the RDMAoE patches 
> present in an OFED drop named OFED-RDMAoE-1.5-rc6, see below.
> Was/is this a known issue? if yes, was it fixed? is/was there an ofa 
> bz case open for this? I didn't find such... I was told that under 
> ofed 1.5.1 this doesn't happen, but I wanted to 1st clarify with you.
Eli, Vlad, anything you can comment here?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ewg] [PATCH v4] IB Core: RAW ETH support

2010-06-16 Thread Moni Shoua
Hi Roland,

> What the hell is the thinking behind introducing IB_QPT_RAW_ETH?  You're
> inserting an enum value before IB_QPT_RAW_ETY, so any old userspace
> passing in IB_QPT_RAW_ETY will silently get different behavior depending
> on the kernel version.  And you're creating two constands that differ in
> a single letter (IB_QPT_RAW_ETY vs. IB_QPT_RAW_ETH).  How are you going
> to explain that to users?  How is anyone ever going to get it right?
> For that matter, what exactly does IB_QPT_RAW_ETH mean?
> 
There is no qp type IBV_QPT_RAW_ETY in user space (at least not in the 
definitions
coming with libibverbs). In fact, libibverbs that comes with OFED defines (in 
verbs.h)
a qp type called IBV_QPT_RAW_ETT which equals to 7.
The patch that is under discussion here adds a new qp type IB_QPT_RAW_ETH and 
equals it to 7
to match the definition in user space. This indeed changes the value of 
IB_QPT_RAW_ETY to 8
but I don't see who can be affected since
1. No user space program that uses IB_QPT_RAW_ETY exists
2. kernel is compiled as one piece of code.

> This all seems to be a symptom of how broken our development process
> is.  Yes, unfortunately I can't spend as much time reviewing and
> applying patches as I might like, and I apologize for that.  But if we
> have all the RDMA developers piling up shit in their little area and
> then sending it on to be merged as soon as it kind of works, without
> thinking about design or maintainability and without ever doing any
> review, then I'm always going to have an expanding review backlog.
> 
I want very much to push the RAW_ETH patches to upstream and we didn't intend 
to skip
this stage. However,
1. It seems that upstream kernel still can't use ConnectX ETH ports. Is it 
right?
2. From practical (commercial) reasons it was important to us that the RAW_ETH 
patches 
will be in OFED. AFAIK, OFED recommends that patches will be reviewed and added 
first in 
upstream kernel but exceptions are allowed. Anyway, we didn't think that a 
review is not
necessary, we just used the OFED group for that matter.
> And then we have OFED compounding problems -- "Oh that's a nice pile of
> shit you've built there.  We better ship it to users while it's still
> steaming."  How about if OFED developers take a little time to think
> things through?
> 
> 
> 
> In other words, can someone explain the plan for this raw QP stuff to me?
I hope I understand what needs to be explained but in short, we want a QP type
that takes the data in post_send() and puts it on the wire as is without adding
headers or footers. This allows the user of this kind of QP to build an entire 
Ethernet
packet and send it to an Ethernet switch port.

UD/RC QP with RoCEE is no good because we want to be compliant with simple ETH 
devices (e.g. remote
server with standard 10G NIC)
Infiniband RAW packets is no good (if I understand the spec correctly) because 
it adds LRH and is not
suitable for fabrics with Ethernet switch

This is why RAW_ETH was invented.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ewg] [PATCH v4] IB Core: RAW ETH support

2010-06-16 Thread Moni Shoua
 
> 
> 
> It doesn't even look like this patch and the mlx4 patch were ever posted
> to linux-rdma. Only to the EWG list.
> 
Not 100% correct. See thread from April 30.
Patches to core, libibverbs and NES driver were presented there.

> Granted our dev process may not be documented, but I always assumed the
> general idea was to get changes accepted upstream, then pull into ofed.
> OFED is just a mechanism to make top-of-tree linux work on distro
> kernels. There are some exceptions, but this stuff shouldn't be an
> exception.
> 
> We should all follow this "upstream first" process IMO.
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type

2010-06-21 Thread Moni Shoua
Walukiewicz, Miroslaw wrote:
> No, no more changes are necessary. It is a standalone application.
> 
> Mirek
> 
Are you sure?
AFAIK, patches to kernel and libibverbs are required which were not accepted 
yet.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Name for a new type of QP

2010-06-23 Thread Moni Shoua
Hi,
This message follows a discussion in the EWG mailing list.

We want to promote a patch that enables use of a new QP type.
This QP type lets the user post_send() data to its SQ and treat it as the 
entire packet, including headers.
An example of use with this QP is sending Ethernet packets from userspace (and 
enjoying kernel bypass).

An open question in this matter it how should we call this QP type.
The first name IBV_QPT_RAW_ETH seems to be too similar to the existing type 
IBV_QPT_RAW_ETY.

My suggestion (that were posted in a different thread) are

IBV_QPT_FRAME
IBV_QPT_PACKET
IBV_QPT_NOHDR

Please make your comments and send your suggestions.

When we decide about a name we will send a patch that enables the use of this 
QP type.


thanks

Moni
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [ewg] [PATCH v4] IB Core: RAW ETH support

2010-06-24 Thread Moni Shoua
Thanks. I accept this remark.
We will try to address all the bothering issues regarding the patch set
of RAW QP support and resubmit them linux-rdma.
Since it won't work with Mellanox the patches will be only for Intel
(nes) driver. When time comes and Mellanox cards  can be tested for RAQ
QO feature they will benefit from the common part.



---
  Moni Shoua|  +972-54-5567934  


-Original Message-
From: Roland Dreier [mailto:rdre...@cisco.com] 
Sent: Wednesday, June 23, 2010 8:32 PM
To: Moni Shoua
Cc: Aleksey Senin; Eli Cohen; e...@openfabrics.org;
linux-rdma@vger.kernel.org; Tziporet Koren; Yiftah Shahar
Subject: Re: [ewg] [PATCH v4] IB Core: RAW ETH support

 > There is no qp type IBV_QPT_RAW_ETY in user space (at least not in
the definitions  > coming with libibverbs). In fact, libibverbs that
comes with OFED defines (in verbs.h)  > a qp type called IBV_QPT_RAW_ETT
which equals to 7.
 > The patch that is under discussion here adds a new qp type
IB_QPT_RAW_ETH and equals it to 7  > to match the definition in user
space. This indeed changes the value of IB_QPT_RAW_ETY to 8  > but I
don't see who can be affected since  > 1. No user space program that
uses IB_QPT_RAW_ETY exists  > 2. kernel is compiled as one piece of
code.

Why renumber the _ETY enum?  Maybe it doesn't break anything serious but
why risk it?
--
Roland Dreier  || For corporate legal information go
to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] naming for new QP types

2010-07-01 Thread Moni Shoua
Hi,
This message follows a discussion about names for a new QP type (see 
http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg04512.html) and 
precedes a patch set for RAW QP that we intend to send.

The patch requires changes in enum ib_qp_type
1. Rename IB_QPT_RAW_ETY to IB_QPT_RAW_ETHERTYPE - this is not really a 
requirement but an opportunity to improve an existing name.
2. Add new QP type - IBV_QPT_RAW_PACKET. I'll delay the comments for the new 
name to later, when the patches are submitted, but ask another question instead.

The xrc branch also includes modifications to the same structure and it adds a 
new type IBV_QPT_XRC before the end.

In general, will it be acceptable if we submit the patches so that after they 
are applied the structure will look like this

enum ib_qp_type {
/*
 * IB_QPT_SMI and IB_QPT_GSI have to be the first two entries
 * here (and in that order) since the MAD layer uses them as
 * indices into a 2-entry table.
 */
IB_QPT_SMI,
IB_QPT_GSI,

IB_QPT_RC,
IB_QPT_UC,
IB_QPT_UD,
IB_QPT_RAW_IPV6,
IB_QPT_RAW_ETHERTYPE = 7,
IB_QPT_RAW_PACKET = 8
};

So when XRC is accepted we still keep the same values for the last 2?

thanks

MoniS
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC] naming for new QP types

2010-07-01 Thread Moni Shoua
Along with the patches to the kernel we will send a patch to libibverbs
to expose the new values

---
  Moni Shoua|  +972-54-5567934  


-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Friday, July 02, 2010 3:34 AM
To: Moni Shoua; Roland Dreier
Cc: linux-rdma; Or Gerlitz; Walukiewicz, Miroslaw; Aleksey Senin
Subject: RE: [RFC] naming for new QP types

> In general, will it be acceptable if we submit the patches so that 
> after they are applied the structure will look like this
> 
> enum ib_qp_type {
> /*
>  * IB_QPT_SMI and IB_QPT_GSI have to be the first two entries
>  * here (and in that order) since the MAD layer uses them as
>  * indices into a 2-entry table.
>  */
> IB_QPT_SMI,
> IB_QPT_GSI,
> 
> IB_QPT_RC,
> IB_QPT_UC,
> IB_QPT_UD,
> IB_QPT_RAW_IPV6,
> IB_QPT_RAW_ETHERTYPE = 7,
> IB_QPT_RAW_PACKET = 8
> };
> 
> So when XRC is accepted we still keep the same values for the last 2?

Have the existing values been exposed to user space in any way, even if
not defined by libibverbs?

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [RFC] naming for new QP types

2010-07-04 Thread Moni Shoua
If I understand you right this time then you ask what about will happen
if user calls create_qp() from user space with qp_type=6. Right?

I guess that this should be handled in code like with any other illegal
value.

---
  Moni Shoua|  +972-54-5567934  


-Original Message-
From: Hefty, Sean [mailto:sean.he...@intel.com] 
Sent: Friday, July 02, 2010 6:37 PM
To: Moni Shoua; Roland Dreier
Cc: linux-rdma; Or Gerlitz; Walukiewicz, Miroslaw; Aleksey Senin
Subject: RE: [RFC] naming for new QP types

> Along with the patches to the kernel we will send a patch to 
> libibverbs to expose the new values

I was actually asking if a user space app could access the existing
definitions/values in any way.  For example, if a user space app called
create qp with a hard-coded value of 6, what would happen?

- Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2] IB/ipoib: Leave stale send-only multicast groups

2011-01-17 Thread Moni Shoua
Unlike with send/receive multicast groups, there is no indication for IPoIB
that a send-only multicast group is useless. Therefore, even a single packet
to a multicast destination leaves a multicast entry on the fabric until the
host interface is down. This causes an MGID leakage in the SM.

Here, a garbage-collection task will be scheduled once a minute and will leave
stale multicast groups.

V1 of the patch below was sent to the list a long ago by Yossi Etigin and from 
some
reason the discussion about it was stopped without a conclusion.

Link to V1:
 - http://www.mail-archive.com/general@lists.openfabrics.org/msg18928.html

Changes from V1:
 - Add a module parameter to control the amount of time that an idle send-only
   group is allowed to stay joined.

Signed-off-by: Yossi Etigin 
Signed-off-by: Moni Shoua 

--
 drivers/infiniband/ulp/ipoib/ipoib.h   |8 +++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |8 +++-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   50 +
 3 files changed, 56 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index ab97f92..fb1714f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -92,6 +92,7 @@ enum {
IPOIB_FLAG_ADMIN_CM   = 9,
IPOIB_FLAG_UMCAST = 10,
IPOIB_FLAG_CSUM   = 11,
+   IPOIB_MCAST_RUN_GC= 12,
 
IPOIB_MAX_BACKOFF_SECONDS = 16,
 
@@ -132,6 +133,7 @@ struct ipoib_mcast {
struct list_head  list;
 
unsigned long created;
+   unsigned long used;
unsigned long backoff;
 
unsigned long flags;
@@ -283,7 +285,8 @@ struct ipoib_dev_priv {
struct rb_root multicast_tree;
 
struct delayed_work pkey_poll_task;
-   struct delayed_work mcast_task;
+   struct delayed_work mcast_join_task;
+   struct delayed_work mcast_leave_task;
struct work_struct carrier_on_task;
struct work_struct flush_light;
struct work_struct flush_normal;
@@ -411,6 +414,8 @@ void ipoib_neigh_free(struct net_device *dev, struct 
ipoib_neigh *neigh);
 
 extern struct workqueue_struct *ipoib_workqueue;
 
+extern int ipoib_mc_sendonly_timeout;
+
 /* functions */
 
 int ipoib_poll(struct napi_struct *napi, int budget);
@@ -453,6 +458,7 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device 
*ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
 
 void ipoib_mcast_join_task(struct work_struct *work);
+void ipoib_mcast_leave_task(struct work_struct *work);
 void ipoib_mcast_carrier_on_task(struct work_struct *work);
 void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb);
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 7a07a72..563370e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -67,6 +67,11 @@ module_param_named(debug_level, ipoib_debug_level, int, 
0644);
 MODULE_PARM_DESC(debug_level, "Enable debug tracing if > 0");
 #endif
 
+int ipoib_mc_sendonly_timeout;
+
+module_param_named(mc_sendonly_timeout, ipoib_mc_sendonly_timeout, int, 0644);
+MODULE_PARM_DESC(mc_sendonly_timeout, "Enable debug tracing if > 0");
+
 struct ipoib_path_iter {
struct net_device *dev;
struct ipoib_path  path;
@@ -1020,7 +1025,8 @@ static void ipoib_setup(struct net_device *dev)
INIT_LIST_HEAD(&priv->multicast_list);
 
INIT_DELAYED_WORK(&priv->pkey_poll_task, ipoib_pkey_poll);
-   INIT_DELAYED_WORK(&priv->mcast_task,   ipoib_mcast_join_task);
+   INIT_DELAYED_WORK(&priv->mcast_join_task,   ipoib_mcast_join_task);
+   INIT_DELAYED_WORK(&priv->mcast_leave_task, ipoib_mcast_leave_task);
INIT_WORK(&priv->carrier_on_task, ipoib_mcast_carrier_on_task);
INIT_WORK(&priv->flush_light,   ipoib_ib_dev_flush_light);
INIT_WORK(&priv->flush_normal,   ipoib_ib_dev_flush_normal);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 3871ac6..87928c1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -117,6 +117,7 @@ static struct ipoib_mcast *ipoib_mcast_alloc(struct 
net_device *dev,
 
mcast->dev = dev;
mcast->created = jiffies;
+   mcast->used = jiffies;
mcast->backoff = 1;
 
INIT_LIST_HEAD(&mcast->list);
@@ -403,7 +404,7 @@ static int ipoib_mcast_join_complete(int status,
mutex_lock(&mcast_mutex);
if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
queue_delayed_work(ipoib_workqueue,
-  &priv->mcast_task, 0);
+

Re: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an MCG

2011-02-02 Thread Moni Shoua
Mike Heinz wrote:
> Wouldn't the BUSY patch I proposed last year deal with this situation?
Can you please send a link to this patch?

> 
> -Original Message-
> From: ewg-boun...@lists.openfabrics.org 
> [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Moni Shoua
> Sent: Wednesday, February 02, 2011 10:10 AM
> To: Vlad
> Cc: n...@voltaire.com; ewg
> Subject: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an 
> MCG
> 
> This patch helps when SM is busy and so an MC group is left joined
> while the host bellies that it is was left.
> 
> Note: the patch below is not to driver/infiniband/core but it generates
> a patch under kernel_patches/fixes.
> 
> Index: 
> ofa_kernel-1.5.3/kernel_patches/fixes/core_0290_sysfs_mcast_leave_retries.patch
> ===
> --- /dev/null   1970-01-01 00:00:00.0 +
> +++ 
> ofa_kernel-1.5.3/kernel_patches/fixes/core_0290_sysfs_mcast_leave_retries.patch
>  2011-02-02 16:52:02.0 +0200
> @@ -0,0 +1,46 @@
> +Add a multicast leave maximum retry setting in 
> sys/module/ib_sa/parameters/mcast_leave_retries.
> +Add a debug print when the maximum retry count is reached.
> +
> +Signed-off-by: Nir Muchtar 
> +Reviewed-by:   Moni Shoua  
> +--
> +
> +Index: ofa_kernel-1.5.2/drivers/infiniband/core/multicast.c
> +===
> +--- ofa_kernel-1.5.2.orig/drivers/infiniband/core/multicast.c  2010-08-17 
> 12:56:06.0 +0300
>  ofa_kernel-1.5.2/drivers/infiniband/core/multicast.c   2010-08-17 
> 13:15:38.0 +0300
> +@@ -40,6 +40,12 @@
> + #include 
> + #include "sa.h"
> +
> ++static int mcast_leave_retries = 3;
> ++
> ++module_param_call(mcast_leave_retries, param_set_int, param_get_int,
> ++&mcast_leave_retries, 0644);
> ++MODULE_PARM_DESC(mcast_leave_retries, "Number of retries for multicast 
> leave requests before giving up");
> ++
> + static void mcast_add_one(struct ib_device *device);
> + static void mcast_remove_one(struct ib_device *device);
> +
> +@@ -520,8 +526,11 @@
> +   if (status && (group->retries > 0) &&
> +   !send_leave(group, group->leave_state))
> +   group->retries--;
> +-  else
> ++  else {
> ++  if (status && group->retries <= 0)
> ++  printk("reached max retry count. status=%d  .Giving 
> up\n", status);
> +   mcast_work_handler(&group->work);
> ++  }
> + }
> +
> + static struct mcast_group *acquire_group(struct mcast_port *port,
> +@@ -544,7 +553,7 @@
> +   if (!group)
> +   return NULL;
> +
> +-  group->retries = 3;
> ++  group->retries = mcast_leave_retries;
> +   group->port = port;
> +   group->rec.mgid = *mgid;
> +   group->pkey_index = MCAST_INVALID_PKEY_INDEX;
> ___
> ewg mailing list
> e...@lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> 
> 
> This message and any attached documents contain information from QLogic 
> Corporation or its wholly-owned subsidiaries that may be confidential. If you 
> are not the intended recipient, you may not read, copy, distribute, or use 
> this information. If you have received this transmission in error, please 
> notify the sender immediately by reply e-mail and then delete this message.
> 
> ___
> ewg mailing list
> e...@lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ewg] [PATCH] IB/core: Control number of retries for SA to leave an MCG

2011-02-02 Thread Moni Shoua
Mike Heinz wrote:
> It was discussed in the Linux-RDMA list for many months. You can find a list 
> of the archived messages here:
> 
> http://www.mail-archive.com/search?q=SA+Busy&l=linux-rdma@vger.kernel.org
> 
> The most recent version of the patch is here:
> 
> http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg06644.html
> 
> Basically, the spec permits an SM to reply "busy" instead of simply tossing 
> packets on the floor, but OFED does not handle this case right now.
> 
I took a look and I'm not sure that your patch solves the issue I was trying to 
handle.
Nothing will prevent from number of retries to reach zero and no one will 
report the upper layer (e.g IPoIB)
that leaving the multicast group had failed.
The worst thing is that there is no indication to the user about this state (a 
host is joined without no one to ever try and make it leave)
The patch I sent also puts a message in the kernel log so users can read and 
react.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IB/rdma_cm: cancel pending lap message when destroying an ID

2011-03-02 Thread Moni Shoua

When destroying a cm_id from a context of a work queue and if the lap_state of
this cm_id is IB_CM_LAP_SENT, we need to release the reference of this id that
was taken upon the send of the lap message.  Otherwise, if the expected apr
message gets lost, it is only after a long time that the reference will be
released, while during that the work handler thread is not available to process
other things.
Also, a sigle variable 'msg' was used to hold the last snet mad. However, this
variable serves 2 state machines (CM and LAP) which can cause to memory leaks or
to unexpected behavior. This patch adds another variable to hold the lap mad.

Signed-off-by: Moni Shoua 
Signed-off-by: Amir Vadai 
--
 drivers/infiniband/core/cm.c |   25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 64e0903..3c00646 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -210,6 +210,7 @@ struct cm_id_private {
atomic_t refcount;
 
struct ib_mad_send_buf *msg;
+   struct ib_mad_send_buf *lap_msg;
struct cm_timewait_info *timewait_info;
/* todo: use alternate port on send failure */
struct cm_av av;
@@ -839,6 +840,24 @@ static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
cm_id_priv = container_of(cm_id, struct cm_id_private, id);
 retest:
spin_lock_irq(&cm_id_priv->lock);
+
+   /* handle lap states first */
+   switch (cm_id->lap_state) {
+   case IB_CM_LAP_UNINIT:
+   case IB_CM_LAP_IDLE:
+   break;
+   case IB_CM_LAP_SENT:
+   cm_id_priv->id.lap_state = IB_CM_LAP_IDLE;
+   ib_cancel_mad(cm_id_priv->av.port->mad_agent, 
cm_id_priv->lap_msg);
+   cm_id_priv->lap_msg = NULL;
+   break;
+   case IB_CM_LAP_RCVD:
+   case IB_CM_MRA_LAP_SENT:
+   case IB_CM_MRA_LAP_RCVD:
+   default:
+   break;
+   }
+
switch (cm_id->state) {
case IB_CM_LISTEN:
cm_id->state = IB_CM_IDLE;
@@ -2615,7 +2634,7 @@ int ib_send_cm_lap(struct ib_cm_id *cm_id,
}
 
cm_id->lap_state = IB_CM_LAP_SENT;
-   cm_id_priv->msg = msg;
+   cm_id_priv->lap_msg = msg;
 
 out:   spin_unlock_irqrestore(&cm_id_priv->lock, flags);
return ret;
@@ -2811,8 +2830,8 @@ static int cm_apr_handler(struct cm_work *work)
goto out;
}
cm_id_priv->id.lap_state = IB_CM_LAP_IDLE;
-   ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg);
-   cm_id_priv->msg = NULL;
+   ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->lap_msg);
+   cm_id_priv->lap_msg = NULL;
 
ret = atomic_inc_and_test(&cm_id_priv->work_count);
if (!ret)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/CM: cancel pending lap message when destroying an ID

2011-03-02 Thread Moni Shoua
Sorry. The patch is to the CM. Subject was changed


On Wed, Mar 2, 2011 at 5:45 PM, Moni Shoua  wrote:
>
> When destroying a cm_id from a context of a work queue and if the lap_state of
> this cm_id is IB_CM_LAP_SENT, we need to release the reference of this id that
> was taken upon the send of the lap message.  Otherwise, if the expected apr
> message gets lost, it is only after a long time that the reference will be
> released, while during that the work handler thread is not available to 
> process
> other things.
> Also, a sigle variable 'msg' was used to hold the last snet mad. However, this
> variable serves 2 state machines (CM and LAP) which can cause to memory leaks 
> or
> to unexpected behavior. This patch adds another variable to hold the lap mad.
>
> Signed-off-by: Moni Shoua 
> Signed-off-by: Amir Vadai 
> --
>  drivers/infiniband/core/cm.c |   25 ++---
>  1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index 64e0903..3c00646 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -210,6 +210,7 @@ struct cm_id_private {
>        atomic_t refcount;
>
>        struct ib_mad_send_buf *msg;
> +       struct ib_mad_send_buf *lap_msg;
>        struct cm_timewait_info *timewait_info;
>        /* todo: use alternate port on send failure */
>        struct cm_av av;
> @@ -839,6 +840,24 @@ static void cm_destroy_id(struct ib_cm_id *cm_id, int 
> err)
>        cm_id_priv = container_of(cm_id, struct cm_id_private, id);
>  retest:
>        spin_lock_irq(&cm_id_priv->lock);
> +
> +       /* handle lap states first */
> +       switch (cm_id->lap_state) {
> +       case IB_CM_LAP_UNINIT:
> +       case IB_CM_LAP_IDLE:
> +               break;
> +       case IB_CM_LAP_SENT:
> +               cm_id_priv->id.lap_state = IB_CM_LAP_IDLE;
> +               ib_cancel_mad(cm_id_priv->av.port->mad_agent, 
> cm_id_priv->lap_msg);
> +               cm_id_priv->lap_msg = NULL;
> +               break;
> +       case IB_CM_LAP_RCVD:
> +       case IB_CM_MRA_LAP_SENT:
> +       case IB_CM_MRA_LAP_RCVD:
> +       default:
> +               break;
> +       }
> +
>        switch (cm_id->state) {
>        case IB_CM_LISTEN:
>                cm_id->state = IB_CM_IDLE;
> @@ -2615,7 +2634,7 @@ int ib_send_cm_lap(struct ib_cm_id *cm_id,
>        }
>
>        cm_id->lap_state = IB_CM_LAP_SENT;
> -       cm_id_priv->msg = msg;
> +       cm_id_priv->lap_msg = msg;
>
>  out:   spin_unlock_irqrestore(&cm_id_priv->lock, flags);
>        return ret;
> @@ -2811,8 +2830,8 @@ static int cm_apr_handler(struct cm_work *work)
>                goto out;
>        }
>        cm_id_priv->id.lap_state = IB_CM_LAP_IDLE;
> -       ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->msg);
> -       cm_id_priv->msg = NULL;
> +       ib_cancel_mad(cm_id_priv->av.port->mad_agent, cm_id_priv->lap_msg);
> +       cm_id_priv->lap_msg = NULL;
>
>        ret = atomic_inc_and_test(&cm_id_priv->work_count);
>        if (!ret)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ib/cm: Cancel pending lap message when destroying an ID

2011-03-03 Thread Moni Shoua
> Good catch, although, I think we can simplify the fix to the patch below
> (completely untested).  Please let me know if this solves the issue for you.
>
>  drivers/infiniband/core/cm.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index 1d9616b..79da42d 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -888,6 +888,8 @@ retest:
>                               NULL, 0, NULL, 0);
>                break;
>        case IB_CM_ESTABLISHED:
> +               if (cm_id->lap_state == IB_CM_LAP_SENT)
> +                       ib_cancel_mad(cm_id_priv->av.port->mad_agent, 
> cm_id_priv->msg);
>                spin_unlock_irq(&cm_id_priv->lock);
>                ib_send_cm_dreq(cm_id, NULL, 0);
>                goto retest;
>
>

Hi Sean
I tried this and it doesn't fix the issue in my test. It operates only
in IB_CM_ESTABLISHED but in my tests the order of operations in client
side is
1. rdma_connect() was called
2. lap was sent (for which apr will never be received - according to test case)
3. rdma_disconnect() was called
4. rdma_destroy_id() was called (state here is IB_CM_DREQ_SENT)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ib/cm: Cancel pending LAP message when exiting IB_CM_ESTABLISH state

2011-03-04 Thread Moni Shoua
> Moni, I don't have a good way to test that this really fixes your problem,
> but it looks like it should.
> I considered merging the cm state and lap states together, but I wasn't
> convinced that that made things any simpler.
>
Sean,
I have a test that can verify the first hunk of the patch
(ib_send_cm_dreq()). I'll probably have results by Monday and I will
send them to you. For the second and third hunks (cm_dreq_handler()
and cm_rej_handler()) I don't have test thats covers them but I'll try
to think of something.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HA mode bonding for IB over SDP

2011-04-06 Thread Moni Shoua
If you want a reference code that works you can browse cmpost (get
from git clone git://git.openfabrics.org/~shefty/rdma-dev.git -
chechout branch test-apps).

There are several design issues that you need to resolve before you
start to implement.
Examples:
1. Where to add the code to
   - SDP: via direct calls to APM functions in CM
   - RDMA_CM: enhance interface
2. How to determine endpoints of alt path
3. What to do when migration occurs
4. exchange alt path info during connection establishment or after
connection establishment (LAP)

If you have specific questions then we'd be happy to share our
experience and try to answer.


If you want a reference code that works you can browse cmpost (get
from git clone git://git.openfabrics.org/~shefty/rdma-dev.git -
chechout branch test-apps).

There are several design issues that you need to resolve before you
start to implement.
Examples:
1. Where to add the code to
   - SDP: via direct calls to APM functions in CM
   - RDMA_CM: enhance interface
2. How to determine endpoints of alt path
3. What to do when migration occurs
4. exchange alt path info during connection establishment or after
connection establishment (LAP)

If you have specific questions then we'd be happy to share our
experience and try to answer.



On Fri, Apr 1, 2011 at 8:11 AM, Bhavin  wrote:
> Hi,
>
> I am a student pursuing masters. I have chosen to work on infiniband for my
> final year project. As part of my project I am trying to implement infiniband
> over SDP. I have also setup two systems for this purpose having Mallanox IB
> cards. I found that OFED doesn’t have support for HA mode failover for IB over
> SDP when searching for limitations of this configuration. So I have decided to
> work on this issue.
>
> Right now I am going through the OFED source code to understand how different
> functionalities are implemented. I also found there is something called
> APM(Automatic Path Migration) is present in OFED.
>
> Is it possible to use APM for failover with IB over SDP? If yes, please guide 
> me
> on how can I use APM for IB over SDP.
>
> Please Help.
>
> Thanks
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: HA mode bonding for IB over SDP

2011-04-11 Thread Moni Shoua

Thanks a lot for providing me such useful details.

Besides the APM functionality one thing I don't understand is that why is the 
IB-bonding not working with SDP, as SDP also uses IPoIB for addressing. Can you 
please describe about the problem which doesn't allow the fail-over to take 
place with IB-bonding driver? This would be more helpful in tackling the 
problem.

Again, I appreciate for your valuable information.

Thanking You,
Bhavin  

Hi Bhavin  
Bonding enslaves netdev devices (i.e. IPoIB interfaces) but SDP doesn't 
send/recv traffic on these interfaces.
SDP uses to IB interfaces to send/recv traffic to the network and socket 
interface to send/recv data from/to application.
However, IPoIb interfaces are used for address resolution so you can't give 
them up completely.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next 00/10] mlx4: Add Memory Windows support

2013-02-23 Thread Moni Shoua
+ Shani@Mellanox

-Original Message-
From: Or Gerlitz [mailto:or.gerl...@gmail.com] 
Sent: Monday, February 18, 2013 6:48 PM
To: Roland Dreier
Cc: linux-rdma; Haggai Eran; Moni Shoua; sha...@mellano.com
Subject: Re: [PATCH for-next 00/10] mlx4: Add Memory Windows support

On Wed, Feb 6, 2013 at 6:19 PM, Or Gerlitz  wrote:

> Here's a series from Shani Michaeli and Haggai Eran adds mlx4 driver 
> support for Memory Windows.

Hi Roland,

So things are warming up for the 3.9 merge window, I see that this series is 
marked as "new" in your patchwork - any comments? is it be condisred in for 3.9?

Or.


> The first entries in this set are "pre patches" preparing the grounds 
> for the actual implementation of MWs. Later there're two core patches, 
> one to ib_verbs.h adding support for type 2 MWs and another one to 
> uverbs that exposes MW commands to user space. And finally the actual 
> mlx4 driver MWs patches.

> Shani Michaeli (10):
>   IB/mlx4_ib: Remove local invalidate segment unused fields
>   net/mlx4_core: Rename MPT related service routines to have mpt_ prefix
>   net/mlx4_core: Propogate MR deregistration failure
>   net/mlx4_core: Disable memory windows for VFs
>   net/mlx4_core: Enable memory windows in {INIT,QUERY}_HCA
>   IB/core: Enhance memory windows support
>   IB/uverbs: Implement memory windows support in uverbs
>   mlx4: Implement memory windows allocation and deallocation
>   IB/mlx4_ib: Support memory window binding
>   IB/mlx4_ib: Advertize MW support
>
>  drivers/infiniband/core/uverbs.h   |2 +
>  drivers/infiniband/core/uverbs_cmd.c   |  121 +
>  drivers/infiniband/core/uverbs_main.c  |   13 ++-
>  drivers/infiniband/core/verbs.c|5 +-
>  drivers/infiniband/hw/cxgb3/iwch_provider.c|5 +-
>  drivers/infiniband/hw/cxgb3/iwch_qp.c  |   15 +-
>  drivers/infiniband/hw/cxgb4/iw_cxgb4.h |2 +-
>  drivers/infiniband/hw/cxgb4/mem.c  |5 +-
>  drivers/infiniband/hw/ehca/ehca_iverbs.h   |2 +-
>  drivers/infiniband/hw/ehca/ehca_mrmw.c |5 +-
>  drivers/infiniband/hw/mlx4/main.c  |   19 ++
>  drivers/infiniband/hw/mlx4/mlx4_ib.h   |   14 ++
>  drivers/infiniband/hw/mlx4/mr.c|   87 +-
>  drivers/infiniband/hw/mlx4/qp.c|   41 -
>  drivers/infiniband/hw/nes/nes_verbs.c  |   19 ++-
>  drivers/net/ethernet/mellanox/mlx4/en_main.c   |4 +-
>  drivers/net/ethernet/mellanox/mlx4/fw.c|   14 ++-
>  drivers/net/ethernet/mellanox/mlx4/fw.h|1 +
>  drivers/net/ethernet/mellanox/mlx4/main.c  |4 +
>  drivers/net/ethernet/mellanox/mlx4/mlx4.h  |   34 +++-
>  drivers/net/ethernet/mellanox/mlx4/mr.c|  186
> +++-
>  .../net/ethernet/mellanox/mlx4/resource_tracker.c  |   63 ++-
>  include/linux/mlx4/device.h|   22 ++-
>  include/linux/mlx4/qp.h|   19 ++-
>  include/rdma/ib_verbs.h|   73 +++-
>  include/uapi/rdma/ib_user_verbs.h  |   16 ++
>  net/sunrpc/xprtrdma/verbs.c|   20 +-
>  27 files changed, 683 insertions(+), 128 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IB/CMA: fail qkey setting for RDMA_PS_IPOIB and IB_LINK_LAYER_ETHERNET

2011-07-07 Thread Moni Shoua
From: Moni Shoua 

In general, when link layer is ETHERNET it is wrong to use IPoIB port space 
since
no IPoIB interface is available. Specifically, setting qkey when port space is
RDMA_PS_IPOIB, requires SA query which is impossible when link layer is 
IB_LINK_LAYER_ETHERNET.

Signed-off-by: Moni Shoua 
---
 drivers/infiniband/core/cma.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index b6a33b3..9bf83d5 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -314,6 +314,9 @@ static int cma_set_qkey(struct rdma_id_private *id_priv)
id_priv->qkey = RDMA_UDP_QKEY;
break;
case RDMA_PS_IPOIB:
+   if (rdma_port_get_link_layer(id_priv->cma_dev->device, 
id_priv->id.port_num) ==
+   IB_LINK_LAYER_ETHERNET)
+   return -EINVAL;
ib_addr_get_mgid(&id_priv->id.route.addr.dev_addr, &rec.mgid);
ret = ib_sa_get_mcmember_rec(id_priv->id.device,
 id_priv->id.port_num, &rec.mgid,
-- 
1.7.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/CMA: fail qkey setting for RDMA_PS_IPOIB and IB_LINK_LAYER_ETHERNET

2011-07-08 Thread Moni Shoua
Yes, the kernel crashed after watchdog detected a lockup. This
happened while running udaddy -p 0x2 with RoCE. The crash isn't 100%
reproducible but I have a pair of nodes where that used to crash with
probability of 50%.
The fix you suggest seems to detect the mismatch between port space
and link layer earlier, so I guess it handles the bug better. I'll
rewrite, retest and send a fix on Sunday.

- monis

On Thu, Jul 7, 2011 at 7:55 PM, Hefty, Sean  wrote:
>> In general, when link layer is ETHERNET it is wrong to use IPoIB port space
>> since
>> no IPoIB interface is available. Specifically, setting qkey when port space 
>> is
>> RDMA_PS_IPOIB, requires SA query which is impossible when link layer is
>> IB_LINK_LAYER_ETHERNET.
>
> Can you describe the problem that the current code causes?  Does it lead to a 
> kernel crash?
>
> Maybe the issue is that port space ipoib should never be associated with a 
> non-IB device, with a check added to cma_acquire_dev().
>
> - Sean
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] IB/CMA: Invalidate connections in port space IPoIB over Ethernet link layer

2011-07-11 Thread Moni Shoua
From: Moni Shoua 

This patch fixes a kernel crash in cma_set_qkey().
When link layer is ETHERNET it is wrong to use IPoIB port space since no IPoIB
interface is available. Specifically, setting qkey when port space is
RDMA_PS_IPOIB, requires mgid calculation and SA query which is illegal over 
Ethernet.

Signed-off-by: Moni Shoua 
---
 drivers/infiniband/core/cma.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index b6a33b3..2d60130 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -359,6 +359,10 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
enum rdma_link_layer dev_ll = dev_addr->dev_type == ARPHRD_INFINIBAND ?
IB_LINK_LAYER_INFINIBAND : IB_LINK_LAYER_ETHERNET;
 
+   if (dev_ll == IB_LINK_LAYER_ETHERNET &&
+   id_priv->id.ps == RDMA_PS_IPOIB)
+   return -EINVAL;
+
mutex_lock(&lock);
iboe_addr_get_sgid(dev_addr, &iboe_gid);
memcpy(&gid, dev_addr->src_dev_addr +
-- 
1.7.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] IB/CMA: Invalidate connections in port space IPoIB over Ethernet link layer

2011-07-12 Thread Moni Shoua
This patch fixes a kernel crash in cma_set_qkey().
When link layer is ETHERNET it is wrong to use IPoIB port space since no IPoIB
interface is available. Specifically, setting qkey when port space is
RDMA_PS_IPOIB, requires mgid calculation and SA query which is illegal over 
Ethernet.

Signed-off-by: Moni Shoua 
---
 drivers/infiniband/core/cma.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index b6a33b3..4530eb8 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -359,6 +359,10 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
enum rdma_link_layer dev_ll = dev_addr->dev_type == ARPHRD_INFINIBAND ?
IB_LINK_LAYER_INFINIBAND : IB_LINK_LAYER_ETHERNET;
 
+   if (dev_ll != IB_LINK_LAYER_INFINIBAND &&
+   id_priv->id.ps == RDMA_PS_IPOIB)
+   return -EINVAL;
+
mutex_lock(&lock);
iboe_addr_get_sgid(dev_addr, &iboe_gid);
memcpy(&gid, dev_addr->src_dev_addr +
-- 
1.7.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-08 Thread Moni Shoua
On Wed, Apr 8, 2015 at 2:30 AM, Hefty, Sean  wrote:
>> In order to manage multiple types, vlans and MACs per GID, we
>> need to store them along the GID itself. We store the net device
>> as well, as sometimes GIDs should be handled according to the
>> net device they came from. Since populating the GID table should
>> be identical for every RoCE provider, the GIDs table should be
>> handled in ib_core.
>>
>> Adding a GID cache table that supports a lockless find, add and
>> delete gids. The lockless nature comes from using a unique
>> sequence number per table entry and detecting that while reading/
>> writing this sequence wasn't changed.
>>
>> By using this RoCE GID cache table, providers must implement a
>> modify_gid callback. The table is managed exclusively by
>> this roce_gid_cache and the provider just need to write
>> the data to the hardware.
>>
>> Signed-off-by: Matan Barak 
>> Signed-off-by: Somnath Kotur 
>> ---
>>  drivers/infiniband/core/Makefile |   3 +-
>>  drivers/infiniband/core/core_priv.h  |  24 ++
>>  drivers/infiniband/core/roce_gid_cache.c | 518
>
> Why does RoCE need such a complex gid cache?  If a gid cache is needed at 
> all, why should it be restricted to RoCE only?  And why is such a complex 
> synchronization scheme needed?  Seriously, how many times will GIDs change 
> and how many readers at once do you expect to have?
>
GID cache is also implemented for link layer IB. Howver, for RoCE the
GID cache is also the manager of the table. This means that adding or
removing entries from the GID table is under the responsibility of the
cache and not the HW/device driver. This is a new scheme that frees
each vendor's driver to deal with net and inet events.
Content of the GID table is much more dynamic for RoCE than for IB and
so is access to the table so I guess that extra mechanism is required.
The fact that GID entry is associated with net_device and inet_addr
objects that can be modified/deleted at any time is an example.
>
>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>> index 65994a1..1866595 100644
>> --- a/include/rdma/ib_verbs.h
>> +++ b/include/rdma/ib_verbs.h
>> @@ -64,6 +64,36 @@ union ib_gid {
>>   } global;
>>  };
>>
>> +extern union ib_gid zgid;
>> +
>> +enum ib_gid_type {
>> + /* If link layer is Ethernet, this is RoCE V1 */
>
> I don't understand this comment.  Does RoCE v2 not run on Ethernet?
>
>> + IB_GID_TYPE_IB= 0,
>> + IB_GID_TYPE_ROCE_V2   = 1,
>> + IB_GID_TYPE_SIZE
>> +};
>
> Can you explain the purpose of defining a 'GID type'.  A GID is just a global 
> address.  Why does it matter to anyone using it how it was constructed?
>
>> +
>> +struct ib_gid_attr {
>> + enum ib_gid_typegid_type;
>> + struct net_device   *ndev;
>> +};
>> +
>> +struct ib_roce_gid_cache_entry {
>> + /* seq number of 0 indicates entry being changed. */
>> + unsigned intseq;
>> + union ib_gidgid;
>> + struct ib_gid_attr  attr;
>> + void   *context;
>> +};
>> +
>> +struct ib_roce_gid_cache {
>> + int  active;
>> + int  sz;
>> + /* locking against multiple writes in data_vec */
>> + struct mutex lock;
>> + struct ib_roce_gid_cache_entry *data_vec;
>> +};
>> +
>>  enum rdma_node_type {
>>   /* IB values map to NodeInfo:NodeType. */
>>   RDMA_NODE_IB_CA = 1,
>> @@ -265,7 +295,9 @@ enum ib_port_cap_flags {
>>   IB_PORT_BOOT_MGMT_SUP   = 1 << 23,
>>   IB_PORT_LINK_LATENCY_SUP= 1 << 24,
>>   IB_PORT_CLIENT_REG_SUP  = 1 << 25,
>> - IB_PORT_IP_BASED_GIDS   = 1 << 26
>> + IB_PORT_IP_BASED_GIDS   = 1 << 26,
>> + IB_PORT_ROCE= 1 << 27,
>> + IB_PORT_ROCE_V2 = 1 << 28,
>
> Why does RoCE suddenly require a port capability bit?  RoCE runs today 
> without setting any bit.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 for-next 01/33] IB/core: Add RoCE GID cache

2015-04-16 Thread Moni Shoua
On Wed, Apr 15, 2015 at 7:08 PM, Hefty, Sean  wrote:
>> It does not  break every app, the choice of which GID type to use is made
>> by the RDMA-CM based on network topology hint obtained from the IP stack.
>> Please refer to patch 15/33: IB/Core: Changes to the IB Core
>> infrastructure for RoCEv2 support.
>> Of course, if the user does not want to go with this choice made by the
>> RDMA-CM, then there is the option of overriding it using the configfs
>> patch (PATCH 14/33)
>> Hope that clarifies?
>
> RoCE v2 is really Infiniband over UDP over IP.  Why don't we just call it 
> IBoUDP like it is?
RoCEv2 is the name in the IBTA spec (Annex 17)
>
> IBoUDP changes the Ethertype, replaces the network header, adds a new 
> transport protocol header, and layers IB over that.  This change should be 
> exposed properly and not as just a new GID type.
I don't understand what do you suggest here. Can you give an example?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IB/CMA: Fix condition to search for InfiniBand route

2015-06-08 Thread Moni Shoua
The flag RDMA_CORE_CAP_IB_SA is set when link layer is InfiniBand but
also when link layer is Ethernet. Therefore, link layer is not implied by
this flag.

---
 drivers/infiniband/core/cma.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 3b943b7..f96d6fd 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1993,7 +1993,7 @@ int rdma_resolve_route(struct rdma_cm_id *id, int 
timeout_ms)
return -EINVAL;
 
atomic_inc(&id_priv->refcount);
-   if (rdma_cap_ib_sa(id->device, id->port_num))
+   if (rdma_protocol_ib(id->device, id->port_num))
ret = cma_resolve_ib_route(id_priv, timeout_ms);
else if (rdma_protocol_roce(id->device, id->port_num))
ret = cma_resolve_iboe_route(id_priv);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IB/core: Don't advertise SA in RoCE port capabilities

2015-06-09 Thread Moni Shoua
The Subnet Administrator (SA) is not a component of the RoCE spec.
Therefore, it should not be a capability of a RoCE port.

Change-Id: Iadfaa56bdc9f6e28f46d009064c2d15969293cf7
Signed-off-by: Moni Shoua 
---
 include/rdma/ib_verbs.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7d78794..7725cee 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -381,7 +381,6 @@ union rdma_protocol_stats {
 #define RDMA_CORE_PORT_IBA_ROCE(RDMA_CORE_CAP_PROT_ROCE \
| RDMA_CORE_CAP_IB_MAD  \
| RDMA_CORE_CAP_IB_CM   \
-   | RDMA_CORE_CAP_IB_SA   \
| RDMA_CORE_CAP_AF_IB   \
| RDMA_CORE_CAP_ETH_AH)
 #define RDMA_CORE_PORT_IWARP   (RDMA_CORE_CAP_PROT_IWARP \
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IB/core: Don't advertise SA in RoCE port capabilities

2015-06-10 Thread Moni Shoua
The Subnet Administrator (SA) is not a component of the RoCE spec.
Therefore, it should not be a capability of a RoCE port.

Signed-off-by: Moni Shoua 
---
 include/rdma/ib_verbs.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7d78794..7725cee 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -381,7 +381,6 @@ union rdma_protocol_stats {
 #define RDMA_CORE_PORT_IBA_ROCE(RDMA_CORE_CAP_PROT_ROCE \
| RDMA_CORE_CAP_IB_MAD  \
| RDMA_CORE_CAP_IB_CM   \
-   | RDMA_CORE_CAP_IB_SA   \
| RDMA_CORE_CAP_AF_IB   \
| RDMA_CORE_CAP_ETH_AH)
 #define RDMA_CORE_PORT_IWARP   (RDMA_CORE_CAP_PROT_IWARP \
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IB/core: Don't warn on no SA support in event handler

2015-06-10 Thread Moni Shoua
Registering an event handler is done for a device. This device may have
one RoCE port (no SA cap) and one InfiniBand port (has SA cap).
Therefore, warning from the event handler about a specific port that
doesn't have SA cap is correct but pollutes the kernel log without a
need.

Signed-off-by: Moni Shoua 
---
 drivers/infiniband/core/multicast.c | 2 +-
 drivers/infiniband/core/sa_query.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/multicast.c 
b/drivers/infiniband/core/multicast.c
index 605f20a..1244f02 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -780,7 +780,7 @@ static void mcast_event_handler(struct ib_event_handler 
*handler,
int index;
 
dev = container_of(handler, struct mcast_device, event_handler);
-   if (WARN_ON(!rdma_cap_ib_mcast(dev->device, event->element.port_num)))
+   if (!rdma_cap_ib_mcast(dev->device, event->element.port_num))
return;
 
index = event->element.port_num - dev->start_port;
diff --git a/drivers/infiniband/core/sa_query.c 
b/drivers/infiniband/core/sa_query.c
index 7f7c8c9..3d0b7b2 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -450,7 +450,7 @@ static void ib_sa_event(struct ib_event_handler *handler, 
struct ib_event *event
struct ib_sa_port *port =
&sa_dev->port[event->element.port_num - 
sa_dev->start_port];
 
-   if (WARN_ON(!rdma_cap_ib_sa(handler->device, port->port_num)))
+   if (!rdma_cap_ib_sa(handler->device, port->port_num))
return;
 
spin_lock_irqsave(&port->ah_lock, flags);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V5 10/12] IB/mlx4: Implement ib_device callbacks

2015-06-11 Thread Moni Shoua
On Thu, Jun 11, 2015 at 9:31 AM, Jason Gunthorpe
 wrote:
> On Mon, Jun 08, 2015 at 05:12:13PM +0300, Matan Barak wrote:
>
>> +static struct net_device *mlx4_ib_get_netdev(struct ib_device *device, u8 
>> port_num)
>> +{
>
> This function is never referenced in this patch, so we get compile
> warnings?
>
The motivation here was to add an implementation for the new RoCE GID
table management in one patch (this one) and replace it with the other
in the next patch. This way we have a working RoCE after each patch
while avoiding from having one big patch that adds the new
implementation and removes the old one.
The price is compilation warnings

> Warnings are not a huge deal, but you did compile test every patch in
> the series??
Yes we did and we are aware of the warnings.
>
>> + if (mlx4_is_bonded(ibdev->dev)) {
>> + struct net_device *dev;
>> + struct net_device *upper = NULL;
>
> So, I see this code in mlx4 touching bonding, but I don't see similar
> code in ocrdma..
>
> If bonding is a general feature why is this bit in the driver? Should
> the core code be doing this?
>
> Does bonding work in ocrdma at the end of this series? I was expecting
> it to..
>
This relates to feature in mlx4 called RoCE port aggregation. I guess
that this is not relevant to OCRDMA
Other bonding aspects (like populating addresses of an upper dev or
populating only on the port of the active slave) is being done in the
core the same way for all vendors

>> + gid_tbl = mailbox->buf;
>> +
>> + for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i)
>> + memcpy(&gid_tbl[i], &gids[i].gid, sizeof(union ib_gid));
>> +
>> + err = mlx4_cmd(dev, mailbox->dma,
>> +MLX4_SET_PORT_GID_TABLE << 8 | port_num,
>> +1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
>> +MLX4_CMD_WRAPPED);
>> + if (mlx4_is_bonded(dev))
>> + err += mlx4_cmd(dev, mailbox->dma,
>> + MLX4_SET_PORT_GID_TABLE << 8 | 2,
>> + 1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
>> + MLX4_CMD_WRAPPED);
>
> Again, wonder why the driver is sensitive to bonding, and ocrdma is not..
>
Same answer
>> @@ -477,11 +673,22 @@ out:
>>  static int iboe_query_gid(struct ib_device *ibdev, u8 port, int index,
>> union ib_gid *gid)
>>  {
>> - struct mlx4_ib_dev *dev = to_mdev(ibdev);
>> + int ret;
>>
>> - *gid = dev->iboe.gid_table[port - 1][index];
>> + if (!rdma_cap_roce_gid_table(ibdev, port)) {
>> + struct mlx4_ib_dev *dev = to_mdev(ibdev);
>>
>> - return 0;
>> + *gid = dev->iboe.gid_table[port - 1][index];
>> + return 0;
>> + }
>> +
>> + ret = ib_get_cached_gid(ibdev, port, index, gid);
>> + if (ret == -EAGAIN) {
>> + memcpy(gid, &zgid, sizeof(*gid));
>> + return 0;
>> + }
>> +
>> + return ret;
>>  }
>
> Hum, is it Ok to change iboe_query_gid like this at this point in the
> series? Should this be in patch 11?
You are right. I'll fix.
>
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Create a common verbs transport library

2015-10-12 Thread Moni Shoua
Hi Denny,



We initially thought to implement a shared library that contains the
transport logic.

However, it seems that a SW Verbs transport driver would allow better
code sharing.

In fact, the VT driver would need only a single user-space driver for
all "backends". Any direct HW access from user-space should be exposed
by the corresponding backend driver and accessed by a different
library (e.g., psm).



At a high-level, it seems that we should do as follows:

- Decide on an initial code base for VT (rxe/hfi/qib), clone it, and
rename to VT

- Split the code to VT and backend and create the initial backend APIs, e.g.:

-- Send packet

-- Deliver packet (receive)

-- Attach multicast

-- Packet buffer allocation

-- Notify when more send space is available

- In parallel, prepare the backends of other drivers while enhancing
VT as needed.



Do you have any preferences to the initial code base?

Do you already have some code that we can look at?



Please advise as we are starting to develop a VT driver for RoCE now.

I suggest that we set up common user+kernel git repos for the initial work.



Thanks,

-Moni

On Tue, Sep 29, 2015 at 3:56 PM, Dennis Dalessandro
 wrote:
> Hi All,
>
> One of the conditions to move the hfi1 driver from staging into the normal
> drivers/infiniband/hw directory is to handle the code duplication in our
> verbs layer. This is going to be done by creating a new kmod which we will
> call rdmavt, for RDMA verbs transport. This will eventually live in the
> existing drivers/infiniband tree in a new sw directory:
> drivers/infiniband/sw/vt. This new directory can serve as a home for soft
> roce when its ready as well.
>
> The verbs library will start out life in drivers/staging/rdma/vt alongside
> hfi1. We (Intel) will push incremental patches to keep the community
> apprised of the development and allow for early and more continuous
> feedback. Once complete the plan would be to move out of staging along with
> hfi1.
>
> The current verbs support in the IB core should not need to be modified,
> rdmavt is just another verbs provider. Drivers will not use rdmavt directly.
> Rather, rdmavt will use the drivers to abstract away the hardware
> differences. Here is a diagram of what this will look like.
>
>   +---+
>   |Ib Core|
>   +---+
>   +
>   |
> +--v+
> |Verbs Transport|
> +-+--+--+
>  |  |
>  |  |
> +-v--+ +-v--+
> |qib | |hfi1|
> ++ ++
>
> -Denny
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Create a common verbs transport library

2015-10-14 Thread Moni Shoua
>
>
> I assume by user-space driver we are talking about libverbs? We have
> separate libraries for ipath/qib and hfi. We should probably coalesce these
> into a single library but that is a separate issue. PSM is also unrelated to
> the work here since PSM is not verbs.
>
Not libibverbs but lib (e.g. libmlx4, libqib,...), like for
any other kernel verbs provider.
In other wordsm if the name of the SW verbs provider is called VT
(verbs transport) than the library we need to develop will be libVT.
However, the challenge here is larger than for other providers due to
different style of backends.
Take for example the ibv_post_send() verb. For some totally incapable
kernel bypass backends this requires that this verb go down to the
kernel (Soft RoCE for example) but for others it needs to put the WQE
directly on the HW in some HW specific manner.
This needs some architectural effort.

>> At a high-level, it seems that we should do as follows:
>>
>> - Decide on an initial code base for VT (rxe/hfi/qib), clone it, and
>> rename to VT
>>
>> - Split the code to VT and backend and create the initial backend APIs,
>> e.g.:
>
>
> We have been planning a bit of a different approach. My thoughts are we make
> VT a completely new kmod. It will start out life lettings verbs calls from
> the core go into the drivers to do their thing, but will contain a bunch of
> the duplicated code that we have in hfi1/qib/ipath. The next step is to move
> piece by piece the rest of the verbs code.
>
We agree with the kmod approach. However
1. Until we get rid completely from code duplication I don't think we
can get out from being in staging phase so the initial work better be
complete. Don't you think so?
2. Soft RoCE is a candidate driver (backend) for this framework and we
would like to make sure that the new VT provider meets out needs.
>> -- Send packet
>>
>> -- Deliver packet (receive)
>>
>> -- Attach multicast
>>
>> -- Packet buffer allocation
>>
>> -- Notify when more send space is available
>>
>> - In parallel, prepare the backends of other drivers while enhancing
>> VT as needed.
>
>
> Yes, we need to come up with an API, I'm not fully sure what that should
> look like yet, it is a work in progress.
I think that this will be a good place to start before implementation.
Can you please share API between backends and VT so we can comment on?
>
>> Do you have any preferences to the initial code base?
>>
>> Do you already have some code that we can look at?
>
>
> We'll be starting out with making changes to hfi1 and qib to follow shortly
> behind. No code just yet, but I should have something to post as an RFC very
> soon (in the next two weeks).
I suggest that implementation will start after we close the arch. Like
I said we want to make sure that Soft RoCE requirements are met in the
design.
I strongly recommend that we figure out a mechanism to allows us to cooperate.
I hope that you see it this way too,
>
> Thanks
>
>
> -Denny
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Create a common verbs transport library

2015-10-15 Thread Moni Shoua
On Thu, Oct 15, 2015 at 4:07 PM, Dennis Dalessandro
 wrote:
> On Thu, Oct 15, 2015 at 08:40:25AM +0300, Moni Shoua wrote:
>>>
>>> I assume by user-space driver we are talking about libverbs? We have
>>> separate libraries for ipath/qib and hfi. We should probably coalesce
>>> these
>>> into a single library but that is a separate issue. PSM is also unrelated
>>> to
>>> the work here since PSM is not verbs.
>>>
>> Not libibverbs but lib (e.g. libmlx4, libqib,...), like for
>> any other kernel verbs provider.
>> In other wordsm if the name of the SW verbs provider is called VT
>> (verbs transport) than the library we need to develop will be libVT.
>> However, the challenge here is larger than for other providers due to
>> different style of backends.
>> Take for example the ibv_post_send() verb. For some totally incapable
>> kernel bypass backends this requires that this verb go down to the
>> kernel (Soft RoCE for example) but for others it needs to put the WQE
>> directly on the HW in some HW specific manner.
>> This needs some architectural effort.
>
>
> Ok, I'm with you now. Yes, this is an issue and agree that it will take some
> architectural work.
>
>>>> At a high-level, it seems that we should do as follows:
>>>>
>>>> - Decide on an initial code base for VT (rxe/hfi/qib), clone it, and
>>>> rename to VT
>>>>
>>>> - Split the code to VT and backend and create the initial backend APIs,
>>>> e.g.:
>>>
>>>
>>>
>>> We have been planning a bit of a different approach. My thoughts are we
>>> make
>>> VT a completely new kmod. It will start out life lettings verbs calls
>>> from
>>> the core go into the drivers to do their thing, but will contain a bunch
>>> of
>>> the duplicated code that we have in hfi1/qib/ipath. The next step is to
>>> move
>>> piece by piece the rest of the verbs code.
>>>
>> We agree with the kmod approach. However
>> 1. Until we get rid completely from code duplication I don't think we
>> can get out from being in staging phase so the initial work better be
>> complete. Don't you think so?
>
>
> Yes, I completely agree. Code duplication issue should be solved before we
> move out of staging.
> I could have been more clear on my intentions here. The initial set of
> patches that we will submit for RFC will not be complete.  That is not meant
> to get us out of staging and satisfy the hfi1 TODO item. It is a stepping
> stone to get things moving and start getting feedback from the community. We
> want to engage early on. So the sooner we have something tangible that we
> can all look at the better.
>
>> 2. Soft RoCE is a candidate driver (backend) for this framework and we
>> would like to make sure that the new VT provider meets out needs.
>
>
> Yes, while my focus is of course hfi1 and qib, we certainly realize there is
> a larger community here that has a vested interest, and this is why we want
> to engage early.
>
>>>> -- Send packet
>>>>
>>>> -- Deliver packet (receive)
>>>>
>>>> -- Attach multicast
>>>>
>>>> -- Packet buffer allocation
>>>>
>>>> -- Notify when more send space is available
>>>>
>>>> - In parallel, prepare the backends of other drivers while enhancing
>>>> VT as needed.
>>>
>>>
>>>
>>> Yes, we need to come up with an API, I'm not fully sure what that should
>>> look like yet, it is a work in progress.
>>
>> I think that this will be a good place to start before implementation.
>> Can you please share API between backends and VT so we can comment on?
>
>
> This will be coming very soon, and we anticipate needing to make changes so
> that all drivers are best supported. Don't think of it as, "this is what we
> are going to do". Think of it as, "this is what we are thinking, does it
> work for you?".
>
Sure. I don't mind talking in 'C' and I suggest that we start with the
header files first.

>>>> Do you have any preferences to the initial code base?
>>>>
>>>> Do you already have some code that we can look at?
>>>
>>>
>>>
>>> We'll be starting out with making changes to hfi1 and qib to follow
>>> shortly
>>> behind. No code just yet, but I should have something to post as an RFC
>>> very
>>> soon (in the next two weeks).
>>
&

Re: [PATCH 00/23] Update driver to 0.9-294

2015-10-20 Thread Moni Shoua
>
> Perhaps I did not chose my words carefully enough.
>
> The largest issue on the TODO list is the refactoring of the code to be 
> shared between the hfi1 and qib driver.  While the hardware between hfi1 and 
> qib is similar and thus the initial code looked similar, our performance 
> tuning on hfi1 has revealed that some changes will be required to the hfi1 
> code to fully utilize the hardware.  We would like to get these updates in to 
> make sure that the refactoring work is done properly for the 2 hardware types.
Please keep in mind that there are 3 HW types (our SoftRoCE driver)
that needs to be part of the framework.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/23] Update driver to 0.9-294

2015-10-20 Thread Moni Shoua
> Understood, however, unlike SoftRoCE, qib and hfi currently share a lot of 
> code
> to drive the hardware.
>
> The underlying reason for the TODO item "Remove software processing of IB
> protocol..." is because we have a large amount of duplicated code between 
> these
> drivers.  _Some_ of which, at a high level, is sharable with SoftRoCE.
>
After studying the qib driver (briefly I must admit) I found that it
has a lot in common with SoftRoCE driver and both can be redesigned to
share a fair amount of code.
> These patches (and more to follow), further differentiate the 2 drivers along
> hardware lines.  Accepting these patches will help us make sure that we don't
> create some common code between qib and hfi which, because of our testing we
> now know, needs to be separated out.
>
I'm sorry but I'm not sure that I understand the plan. What we had in
mind is that we start following the idea which was presented by Deny,
build a detailed design plan and implement it. All that can be done
without fixing bugs, adding functionality and improving performance
but I understand that you also need to move forward.
Us, in Mellanox are willing to contribute to the effort and take
responsibility over the Software Verbs Transport.
> This is a separate issue from the higher level code sharing which will be done
> with SoftRoCE.
I'm not sure I agree. Can you explain the plan to remove code sharing
between qib and hfi that is not part of the Software Verbs Transport
module?
>
> Ira
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Create a common verbs transport library

2015-10-25 Thread Moni Shoua
Thanks, we'll look into it today

On Fri, Oct 23, 2015 at 11:49 PM, Dennis Dalessandro
 wrote:
> On Thu, Oct 15, 2015 at 05:15:38PM +0300, Moni Shoua wrote:
>>>
>>> I'm thinking perhaps we'll post some stuff to GitHub and you folks can
>>> take
>>> a look. We can do some back and forth before even posting patches to the
>>> list. I'll check into the logistics and share the details soon?
>>
>>
>> This will be great. Just send us the ref.
>> If you'd like we can have an online meeting to have an initial talk.
>> From my experience this can be very useful to align all on the
>> subject.
>
>
> I have created a branch on a GitHub repo. You can find it at:
> https://github.com/weiny2/linux-kernel/tree/rdmavt
>
> This is a starting point to get the ball rolling. I have a few more patches
> to further refine the skeleton rdmavt module. I should be able to post those
> early next week.  At that point we will have a more clear idea of what the
> API between the drivers and rdmavt will need to look like for hfi1 and qib.
> We will look for your feedback on how soft roce is going to fit in as well.
>
>
> -Denny
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] RDMA verbs transport design notes

2015-11-02 Thread Moni Shoua
On Thu, Oct 29, 2015 at 9:41 PM, Dennis Dalessandro
 wrote:
> Hi Folks,
>
> I had previously posted a notice about the very beginnings of the rdmavt
> driver which is the software verbs consolidation for multiple drivers [1].
> I have  now pushed another set of updates to a GitHub repo [2] which
> contains more details.  What this latest batch of patches entail is a
> stubbed out version with annotations in the comments as to what the
> interaction between drivers and rdmavt will look like. Look for lines like:
>
Thanks Denny
We will pull from GitHub, learn the design and hopefully contribute.
Some minor comments below

> VT-DRIVER-API
>
> The following is a summary of the current posted code and the direction
> which we are thinking of going based on knowledge of qib and hfi1 drivers.
> Feedback and suggestions are welcome.
>
> Design Goals
> 
> - Remove duplication of software verbs code present in multiple drivers.
> - Do not regress performance.
>
> Registration and general code flow
> --
> Instead of registering directly with the IB core like they do now, drivers
> will register with the rdmavt, referred to as rvt in the code. Drivers will
> build up the ib_device_attr and pass in to the registration by way of the
> rvt_dev_info struct. This will also contain any other driver specific
> settings that rvt will need to know about.
>
> Currently allocation of the ib_device is done by the driver. This is merely
> a stepping stone, and eventually the allocation will move up to rvt. The
> driver should not need to know about the ib_device structure eventually,
> other than for those functions it chooses to override
>
> In addition to describing its properties, drivers will supply a mapping of
> function pointers for use by rvt. The idea is that most of the verbs code
> lives in rvt, but there are some device specific functions which drivers
> will need to perform, such as pushing packets to the wire. Rvt will
> accomplish its tasks by calling into the drivers for these.
>
> There are also times when drivers will need to call back into rvt. We should
> aim to limit this as much as possible. For things like a packet arriving
> from the wire we have no choice but for the driver to initiate the
> processing and call into (or signal in some way) rvt.
>
> Driver override
> ---
> Drivers need to be able to override functions that would normally be done by
> the rvt. In the current set of patches this is accomplished by filling in a
> value in the ib_device_attr function pointer map. If the value is NULL then
> rvt uses its function, otherwise rvt is bypassed by the core and the driver
> is called directly.  Performance optimizations could be one reason, another
> is incremental development. We can work on moving a driver over to rvt in
> stages.
>
> Driver provided functionality
> -
> This list will likely grow as the code evolves but as a first pass through
> these are the things which I see as needing to be provided by the driver:
>
> query_port_state()
> Returns pretty much what is in ib_port_attr
> Will differ based on driver
>
> set_link_state()
> For rvt to have the driver set the state of the link
>
> get_lid()
> Provides the LID
>
LID is a special case of L2 address (MAC is another special case)
Maybe change this to het_l2()?

> qp_mtu()
> Using the SL determines the MTU (this varies per VL in OPA)
>
Again, this is too tied to the InfiniBand protocol.
Also, It doesn't make sense to me that a driver won't know how to
create a QP (this is done by the rvt) but will know how to answer
about QP mtu. Does it?

> make_qpn()
> QPN ranges differ for drivers
>
> flush_qp()
> Flush out all pending operations for a QP that have not made it the
> wire, and wait for that flush to finish.
Again, needs generalization

> do_send()
> Take a fully constructed packet and place on the wire
>
The hardest operation of all IMO.
Should  be efficient but yet general
> Other functions for things like maintaining MAD counters perhaps.
>
> Driver notification or upcall to rvt
> 
> Certain things will require the driver to notify the rvt or execute some
> function. For instance, the driver needs to hand the packet to rvt after it
> pulls it off the wire.
>
> There are also event which the driver needs to let rvt know have happened.
> Things that currently generate IB_EVENT_PORT_ERROR, or IB_EVENT_PORT_ACTIVE,
> etc. There are likely other events as well.
>
> Next steps
> --
> We will continue posting code to GitHub [2] while we field feedback. Note
> the repo has been moved from my previous announcement [1]. I have placed it
> under my GitHub. Once there is more significant development and folks are
> generally happy with the design we will begin posting to this mailing list
> (linux-rdma).
What's the minimal progress in the rvt and the drivers be

Re: [PATCH for-next V1 9/9] IB/cma: Join and leave multicast groups with IGMP

2015-11-24 Thread Moni Shoua
On Mon, Nov 23, 2015 at 11:25 PM, Jason Gunthorpe
 wrote:
> On Thu, Oct 15, 2015 at 07:07:12PM +0300, Matan Barak wrote:
>> diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h
>> index 0a40ed2..5bea0e8 100644
>> +++ b/include/rdma/ib_sa.h
>> @@ -206,6 +206,9 @@ struct ib_sa_mcmember_rec {
>>   u8   scope;
>>   u8   join_state;
>>   int  proxy_join;
>> + int  ifindex;
>> + struct net  *net;
>> + enum ib_gid_type gid_type;
>>  };
>
> This is really gross.
>
> Make ib_init_ah_from_mcmember accept a QP and get the above stuff from
> the QP.
>
> Jason

Which QP is that. You might not have any existing QP when you want to
create the AH or you might have 10.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V1 libibverbs 2/3] Use neighbour lookup for RoCE UD QPs Eth L2 resolution

2014-02-11 Thread Moni Shoua
We don't really care if the UDP packet is sent or not.
All we want is to trigger ARP.

 - Moni S.

-Original Message-
From: Yann Droneaud [mailto:ydrone...@opteya.com] 
Sent: Tuesday, February 11, 2014 5:20 PM
To: Or Gerlitz
Cc: rol...@kernel.org; linux-rdma@vger.kernel.org; Moni Shoua; Matan Barak; 
Amir Vadai; Yishai Hadas; dledf...@redhat.com
Subject: Re: [PATCH V1 libibverbs 2/3] Use neighbour lookup for RoCE UD QPs Eth 
L2 resolution

Le mardi 11 février 2014 à 14:31 +0200, Or Gerlitz a écrit :
> From: Matan Barak 
> 
> In order to implement RoCE IP based addressing for UD QPs, without 
> introducing uverbs changes, we need a way to resolve the L2 Ethernet 
> addresses from user-space.
> 
> This is done with netlink through libnl, and in libibverbs such that 
> multiple vendor provider libraries can use the code.
> 
> The Ethernet L2 params are passed to the provider driver using an 
> extension verb drv_ibv_create_ah_ex call. This resolution is done only 
> for providers that support this extension verb, otherwise, there's not real 
> reason to do that.
> 
> The steps for resolution:
> 
> 1. get sgid
> 
> 2. from sgid, get the interface
> 
> 3. query route from interface to the destination
> 
> 4. query the neigh table, if the MAC for the destination is already known, 
> done.
> 
> 5. if not, loop until timeout:
> a. send a UDP packet to that IP targeted to the "discard" port

What will happen if the system is configured to refuse to send packet to this 
port: eg it exists an netfilter rules that disallow this:

  iptables -t filter -A OUTPUT -p udp --dport 9 -j REJECT --reject-with 
icmp-admin-prohibited

(not tested)

And why send a packet in the first place, would it be enough to connect to the 
socket to target address:9 ? It's possible to call connect() on a dgram socket 
to bind the destination address, but I don't know if it's trigger a neighbor 
resolution.

> b. listen to events from the neigh table
> c. query neigh table in order to know if the neigh has shown up between
>query until we started monitoring it
> 

Such userspace ping must be configurable: timeout and number of tries should 
probably be configurable by userspace.

> 6. query vlan id from the interface
> 
> This solution depends on libnl v1.
> 

According to http://www.infradead.org/~tgr/libnl/

"1.1.x releases
Support for 1.1.x releases is limited, backports are only done upon request. Do 
not develop new applications based on libnl1 and consider porting your 
applications to libnl3"

Please provide some rational for using "obsolete" version 1.

> Signed-off-by: Matan Barak 
> Signed-off-by: Or Gerlitz 
> ---
>  Makefile.am|3 +
>  configure.ac   |   11 +
>  include/infiniband/verbs.h |   35 ++-
>  src/neigh.c|  866 
> 
>  src/neigh.h|   42 +++
>  src/verbs.c|  160 -
>  6 files changed, 1114 insertions(+), 3 deletions(-)  create mode 
> 100644 src/neigh.c  create mode 100644 src/neigh.h
> 
> diff --git a/Makefile.am b/Makefile.am index a6767de..f21697f 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -12,6 +12,9 @@ libibverbs_version_script = 
> @LIBIBVERBS_VERSION_SCRIPT@  src_libibverbs_la_SOURCES = src/cmd.c 
> src/compat-1_0.c src/device.c src/init.c \
>   src/marshall.c src/memory.c src/sysfs.c src/verbs.c 
> \
>   src/enum_strs.c
> +if ! NO_RESOLVE_NEIGH
> +src_libibverbs_la_SOURCES += src/neigh.c endif
>  src_libibverbs_la_LDFLAGS = -version-info 1 -export-dynamic \
>  $(libibverbs_version_script)
>  src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/libibverbs.map diff 
> --git a/configure.ac b/configure.ac index b8d4cea..ba67b0a 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -28,6 +28,17 @@ else
>  fi
>  fi
>  
> +AC_ARG_WITH([resolve-neigh],
> +AC_HELP_STRING([--with-resolve-neigh],
> +[Enable neighbour resolution in Ethernet (default YES)])) if  
> +test x$with_resolve_neigh = x || test x$with_resolve_neigh = xyes; then
> + AC_CHECK_LIB(nl, rtnl_link_vlan_get_id, [],
> + AC_MSG_ERROR([rtnl_link_vlan_get_id not found.  libibverbs 
> requires 
> +libnl.])) else
> +AC_DEFINE([NRESOLVE_NEIGH], 1, [Define to 1 to disable resovle 
> +neigh annotations.]) fi AM_CONDITIONAL(NO_RESOLVE_NEIGH, test 
> +x$with_resolve_neigh = xno)
> +
>  dnl Checks for libraries
>  AC_CHECK_LIB(dl, dlsym, [],
>  AC_MSG_ERROR([dlsym() not found.  libibverbs requires libdl.])) 
> diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h 
> index a79a1de..63c4756 10064

Re: IB/mlx4: Build the port IBoE GID table properly under bonding

2014-02-18 Thread Moni Shoua

On 2/17/2014 1:52 PM, Dan Carpenter wrote:

Hello Moni Shoua,

This is a semi-automatic email about new static checker warnings.

The patch ad4885d279b6: "IB/mlx4: Build the port IBoE GID table
properly under bonding" from Feb 5, 2014, leads to the following
Smatch complaint:

drivers/infiniband/hw/mlx4/main.c:1749 mlx4_ib_scan_netdevs()
 error: we previously assumed 'curr_netdev' could be null (see line 
1722)

drivers/infiniband/hw/mlx4/main.c
   1721 
   1722 if (curr_netdev) {
 ^^^
Check.

   1723 port_state = (netif_running(curr_netdev) && 
netif_carrier_ok(curr_netdev)) ?
   1724 IB_PORT_ACTIVE : 
IB_PORT_DOWN;
   1725 mlx4_ib_set_default_gid(ibdev, curr_netdev, 
port);
   1726 } else {
   1727 reset_gid_table(ibdev, port);
   1728 }
   1729 /* if using bonding/team and a slave port is down, we 
don't the bond IP
   1730  * based gids in the table since flows that select port 
by gid may get
   1731  * the down port.
   1732  */
   1733 if (curr_master && (port_state == IB_PORT_DOWN)) {
   1734 reset_gid_table(ibdev, port);
   1735 mlx4_ib_set_default_gid(ibdev, curr_netdev, 
port);
   1736 }
   1737 /* if bonding is used it is possible that we add it to 
masters
   1738  * only after IP address is assigned to the net bonding
   1739  * interface.
   1740 */
   1741 if (curr_master && (old_master != curr_master)) {
   1742 reset_gid_table(ibdev, port);
   1743 mlx4_ib_set_default_gid(ibdev, curr_netdev, 
port);
   1744 mlx4_ib_get_dev_addr(curr_master, ibdev, port);
   1745 }
   1746 
   1747 if (!curr_master && (old_master != curr_master)) {
   1748 reset_gid_table(ibdev, port);
   1749 mlx4_ib_set_default_gid(ibdev, curr_netdev, 
port);
^^^
Dereferenced without checking.

   1750 mlx4_ib_get_dev_addr(curr_netdev, ibdev, port);
   1751 }

regards,
dan carpenter

Not really a bug. (curr_master != NULL) implies (curr_netdev != NULL)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/mlx4: Build the port IBoE GID table properly under bonding

2014-02-18 Thread Moni Shoua
Ha ha. Take another look. That's what I was just explaining about! :) On 
line 1743 when curr_master is non-NULL then Smatch doesn't complain 
because it understands about the relationship between curr_master and 
curr_netdev. But here it is complaining about line 1749 where 
curr_master is NULL so the implication doesn't apply. Nice, huh? 
regards, dan carpenter


You're right :)
I'll write the fix.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IB/mlx4: Build the port IBoE GID table properly under bonding

2014-03-12 Thread Moni Shoua

On 3/12/2014 12:17 PM, Bart Van Assche wrote:

On 02/18/14 15:32, Moni Shoua wrote:

Ha ha. Take another look. That's what I was just explaining about! :) On
line 1743 when curr_master is non-NULL then Smatch doesn't complain
because it understands about the relationship between curr_master and
curr_netdev. But here it is complaining about line 1749 where
curr_master is NULL so the implication doesn't apply. Nice, huh?
regards, dan carpenter

You're right :)
I'll write the fix.

Hello Moni,

Have you already had a chance to look further into this issue ?

Thanks,

Bart.


Hi
Yes, a patch is waiting for internal review before submission.
I hope it will be out soon.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/core: Suppress a sparse warning

2014-03-12 Thread Moni Shoua

On 3/12/2014 12:15 PM, Bart Van Assche wrote:

On 03/10/14 17:08, Paul E. McKenney wrote:

On Mon, Mar 10, 2014 at 04:02:13PM +0100, Yann Droneaud wrote:

Hi,

Le lundi 10 mars 2014 à 15:26 +0100, Bart Van Assche a écrit :

On 03/10/14 14:33, Yann Droneaud wrote:

Le lundi 10 mars 2014 à 13:22 +0100, Bart Van Assche a écrit :

Suppress the following sparse warning:
include/rdma/ib_addr.h:187:24: warning: cast removes address space of expression

You should explain why there's a warning here, and why is it safe to
cast. (I believe it's related to RCU domain ?)

Hello Yann,

Now that I've had a closer look at the code in include/rdma/ib_addr.h,
that code probably isn't safe. How about the (untested) patch below ?


Thanks for investigating.

I'm not an expert in RCU, but I believe it then miss the RCU annotations
around the RCU reader section (ensure correct ordering if I recall
correctly).

Cc: "Paul E. McKenney" 

If the rcu_read_lock() isn't supplied by all callers to this function,
then yes, it needs to be supplied as Yann shows below.

The CONFIG_PROVE_RCU=y Kconfig option can help determine that they are
needed, but of course cannot prove that they are not needed, at least
not unless you have a workload that exercises all possible calls to
this function.

Hello Moni,

I think this warning got introduced via commit "IB/cma: IBoE (RoCE)
IP-based GID addressing" (7b85627b9f02f9b0fb2ef5f021807f4251135857;
December 12, 2013). Can you follow this up further ?

Thanks,

Bart.

Sure. I'll look into it.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Adding support for RoCE V2 specification

2014-11-27 Thread Moni Shoua
On Thu, Nov 27, 2014 at 2:56 AM, Somnath Kotur  wrote:
> Please apply.
>
> Somnath Kotur (2):
>   IB/Core: Changes to the IB Core infrastructure for RoCE V2.
>   IB/ocrdma: Changes to comply with RoCEv2 Specification changes
>
>  drivers/infiniband/core/addr.c  |   13 +-
>  drivers/infiniband/core/cm.c|   15 ---
>  drivers/infiniband/core/cma.c   |4 +-
>  drivers/infiniband/core/sa_query.c  |1 +
>  drivers/infiniband/core/verbs.c |   47 +---
>  drivers/infiniband/hw/ocrdma/ocrdma.h   |2 +
>  drivers/infiniband/hw/ocrdma/ocrdma_ah.c|   64 +-
>  drivers/infiniband/hw/ocrdma/ocrdma_hw.c|   26 +++
>  drivers/infiniband/hw/ocrdma/ocrdma_sli.h   |   16 ++-
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |   11 -
>  include/rdma/ib_addr.h  |3 +-
>  include/rdma/ib_sa.h|1 +
>  include/rdma/ib_verbs.h |   15 ++
>  13 files changed, 187 insertions(+), 31 deletions(-)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Few comments to the IB/core part

1.   RoCEv2 spec (Annex A17) sees RoCE type (or as the patch name
it - network_hdr_type) as an attribute of the source GID. If so, there
is no point to for address structures (ib_ah_attr, ib_sa_path_rec, …)
to have both sgid and RoCE type. The alternative is to keep an extra
info in each GID table entry about the RoCE header (IPv4/IPv6 or GRH)
and use it when modifying QP from INIT to RTR (RC/UC) or creating AH
(UD).

2.   Population of  table in for RoCE should be a common code to
all drivers and needs to be moved to IB/core (unlike today when it is
specific to a vendor). We plan to push such a change soon.

3.   If I get it right, the purpose of the change in ib_addr is to
that determine RoCE type based on the resolution of address from OS.
If gateway is used (dest address is not in the subnet of source
address) then the protocol will be V2, otherwise it will be V1. This
means that IB over UDP is not possible on the same network. I don't
think that this is acceptable.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Adding support for RoCE V2 specification

2014-11-29 Thread Moni Shoua
> Yes, that is correct.
>
> The overarching goal behind this patch is to keep RDMA-CM as the central 
> entity that decides the protocol (ROCEV2 /ROCEV1)
> and hides all the address resolution details from applications.
I agree but with one comment. RDMA-CM, although preferred,  is not
mandatory to establish a connection. The solution needs to take care
also for
apps that don't use RDMA-CM

> This  is just a continuation of the existing RDMA-CM design philosophy for IP 
> Based GIDs that lets applications operate over any form
> of RDMA Service in a completely transparent way without any changes to the 
> applications themselves.
>
> Protocol Resolution(RoCEV2 or V1) must be done even before we
> send out the 1st connection request packet, the corresponding MAD pkt must be 
> V2 for a destination that is behind  a router, correct?
> I am not sure we want to try sending out RoCEv2 connection requests first and 
> if that fails ,then fallback/retry with RoCEv1 request.
I agree but when you send the first packet (request or response) you
know the SGID index to be used. Knowing that and assuming you chose
the SGID index correctly there is no need to guess.
A suggestion for choosing the SGID index is:
1. First, populate GID table with GIDs matching the IP address list of
the corresponding net devices. Each GID will appear twice, one for
RoCEv1 and one for RoCEv2.
2. When connection is being established RDMA-CM will choose one of the
2 matching indices based on a
preferred protocol attribute for the rmda_id (new attr.)
3. Proffered protocol is
 a) global for the node/fabric based on the administrator settings
 b) can be changed by applications that want to override the
default (requires extra API)


> That would needlessly complicate and slow down the connection-establishment 
> process, do you agree ?
> Instead it's much more seamless to take help of the IP stack to select the 
> network header type and subsequently the ROCE pkt format while
> attempting to establish the connection.
>
> The one thing that my patch does miss out though is the ability for local 
> subnet end nodes to be able to communicate using RoCEv2 packet formats.
> (The default policy in this patch is to use ROCEV1 for nodes connected 
> directly /within the same local subnet.)
> I believe that can be achieved by extending RDMA-CM to override the default 
> protocol selection policy proposed in this patch by making use of
> the GID Entry attribute type in another patch.
Right, and this seems to fit the proposal above.


> This scheme should still work with your proposal  in point# 2 , where GID 
> Table management would move to a central/stack module instead of being
> vendor specific.
>
> Thanks
> Som
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Adding support for RoCE V2 specification

2014-12-01 Thread Moni Shoua
On Tue, Dec 2, 2014 at 7:35 AM, Somnath Kotur  wrote:
> HI Moni,
> Thank you for your comments , please find my response inline.
>
>
>> > The overarching goal behind this patch is to keep RDMA-CM as the
>> > central entity that decides the protocol (ROCEV2 /ROCEV1) and hides all the
>> address resolution details from applications.
>> I agree but with one comment. RDMA-CM, although preferred,  is not
>> mandatory to establish a connection. The solution needs to take care also for
>> apps that don't use RDMA-CM
>
> Agreed and this patch takes care of that as well thanks to the solution that 
> was put in place
> for IP-based GIDs anyway in `modify_qp` effectively by means of the 
> ib_resolve_l2_attrs().
>
>>
>> > This  is just a continuation of the existing RDMA-CM design philosophy
>> > for IP Based GIDs that lets applications operate over any form of RDMA
>> Service in a completely transparent way without any changes to the
>> applications themselves.
>> >
>> > Protocol Resolution(RoCEV2 or V1) must be done even before we send out
>> > the 1st connection request packet, the corresponding MAD pkt must be V2
>> for a destination that is behind  a router, correct?
>> > I am not sure we want to try sending out RoCEv2 connection requests first
>> and if that fails ,then fallback/retry with RoCEv1 request.
>> I agree but when you send the first packet (request or response) you know
>> the SGID index to be used. Knowing that and assuming you chose the SGID
>> index correctly there is no need to guess.
>> A suggestion for choosing the SGID index is:
>> 1. First, populate GID table with GIDs matching the IP address list of the
>> corresponding net devices. Each GID will appear twice, one for
>> RoCEv1 and one for RoCEv2.
>> 2. When connection is being established RDMA-CM will choose one of the
>> 2 matching indices based on a preferred protocol attribute for the rmda_id
>> (new attr.) 3. Proffered protocol is
>>  a) global for the node/fabric based on the administrator settings
>>  b) can be changed by applications that want to override the default
>> (requires extra API)
>
> Agree to point 1 which is what we were thinking of doing as well as part of 
> the 'override' option.
> Still feel that the 'default' protocol resolution should be based on help 
> from the IP/Networking stack
> as that would straight away enable existing applications and majority of 
> use-case scenarios that use RDMA-CM to seamlessly
> work without any change for RoCE V2.
> What this means is now RDMA-CM will choose one of the 2 matching GID indices 
> based on help from the IP stack.
>
> For this to happen , the 'gid_cache' structure will have to undergo a change 
> to now incorporate a new 'gid_type' field.
> Accordingly the helper functions that find a cached gid entry - 
> ib_find_cached_gid() will also undergo a change to take into account this new 
> field.
>
> I plan to resubmit the patch series making the following changes
> a) Populating GID table with 2 entries per IP address
> b) Changes to the GID cache structure along with it's helper functions
> c) Make sure that RDMA-CM selects/finds the corresponding GID once a network 
> type is chosen based on help from the IP stack.
>
>
> The only thing missing is the ability for end-nodes connected locally inside 
> same subnet to communicate using V2 and that
> could be provided as an 'override' option using the 'preferred protocol' 
> attribute setting on a system-wide basis or on an application basis
> as you have proposed.
> This I propose as an additional patch after the above patch goes in first.
>
> Does this sound like a plan ?
>
> Thanks
> Som
>
Hi Som
Generally, I agree with the changes you suggested. To summarize (and
make sure I got it right) your plan for V2 is to:

1. Associate gid_index with RoCE type in GID table and GID cache.
Requires helper functions to search for gid_index according to GID
value and type
2. Move the code that monitors IP address changes to populate GID
table from device driver to IB/core. Requires each RoCE device driver
to implement add_gid_entry/rem_gid_entry functions
3. Remove the RoCE type from address structure (e.g. ib_ah_attr).
Leave it only in rdma_addr to keep the hint from the IP stack
4. Choose sgid_index (from all matching entries) in RDMA-CM based on
hint from IP stack, user preference and admin policy

If so, I guess that we have a plan.

Thanks
Moni
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Adding support for RoCE V2 specification

2014-12-02 Thread Moni Shoua
>
> Agree with all your points above , except for point# 2.  That IMHO would be 
> more of a code refactoring/cleanup patch
> that can be taken up later once the baseline version is checked in as it's 
> not really related to this patch ?
> Also, if we do that , then we also have to take care of  the IB case as well 
> where GID is something that is obtained from the port/card(query_gid)
> as opposed to something that is pushed down from the stack as you are now 
> proposing for IP-based GIDs.. i.e implications are much more than
> what this patch intends to address.
> So I'd rather do this at a later point in time and not club it with V2, sound 
> good ?
>
> Thanks
> Som

Hi Som
I think that it can be done in 2 stages. Moving GID table management
to IB/core is indeed not mandatory for RoCEv2.
However, you might want to consider it anyway since RoCEv2 support is
going to add changes to GID table management anyhow.
So, instead of modifying each device driver sperately, do it once
centrally and add a plugin function in struct ib_device (e.g.
ib_write_gid(union ib_gid, struct ib_gid_info *info)). We can work in
parallel and implement the mlx4 function while you concentrate in
IB/core and ocrdma.

I leave it for you to decide how to proceed from here.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] IB/Core: Changes to the IB Core infrastructure for RoCEv2 support

2014-12-25 Thread Moni Shoua
On Thu, Dec 25, 2014 at 2:59 AM, Somnath Kotur  wrote:
> From: Somnath kotur 
>
> 1. Associate gid_index with a Type (as per V2 SPEC) in GID Table
>and GID Cache.
> 2. Modify GID Cache helper functions to search for gid_index based on
>GID Value and Type.
> 3. Choose sgid_index from all the matching entries in RDMA-CM based on
>hint from the IP stack.
> 4. Set hop_limit for the IP Packet based on above hint from IP stack
> 5. Modify GID table population for all device drivers to add the GID Type
>for each IP address entry. Each GID will appear twice, one for RoCEV1
>and another for RoCEV2.
> 6. Introduce a new API/ Driver hook to query GID Type.
>Default to GID_TYPE_V1 for all drivers/cards that don't support V2 so
>all drivers for now don't have to mandatorily implement the new hook.
> 7. Introduce a new Port Capability flag for a HW vendor to indicate
>RoCEV2 Based GIDs support.
> 8. Introduce a new API to report Port Type (RoCEV2 or V1).
>
> Signed-off-by: Somnath Kotur 
> Signed-off-by: Devesh Sharma 
> ---
>  drivers/infiniband/core/addr.c  |8 +++
>  drivers/infiniband/core/cache.c |   14 --
>  drivers/infiniband/core/cm.c|   15 --
>  drivers/infiniband/core/cma.c   |   56 +++
>  drivers/infiniband/core/device.c|   23 +
>  drivers/infiniband/core/multicast.c |3 +-
>  drivers/infiniband/core/sa_query.c  |3 +-
>  drivers/infiniband/core/verbs.c |   68 
>  include/rdma/ib_addr.h  |1 +
>  include/rdma/ib_cache.h |2 +
>  include/rdma/ib_sa.h|1 +
>  include/rdma/ib_verbs.h |   85 
> ++-
>  12 files changed, 248 insertions(+), 31 deletions(-)
>
> diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
> index f80da50..9a7e38c 100644
> --- a/drivers/infiniband/core/addr.c
> +++ b/drivers/infiniband/core/addr.c
> @@ -257,6 +257,9 @@ static int addr4_resolve(struct sockaddr_in *src_in,
> goto put;
> }
>
> +   if (rt->rt_uses_gateway)
> +   addr->network = RDMA_NETWORK_IPv4;
> +
I see that you use this info to make a decision about roce_type. I
think you should make this a hint instead of a rule.
Also, did you consider complicated topologies where rt_uses_gateway is
false but IP header is required for the packet to reach?
I'm not sure but maybe proxy ARP can be such an example


> ret = dst_fetch_ha(&rt->dst, addr, &fl4.daddr);
>  put:
> ip_rt_put(rt);
> @@ -271,6 +274,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
>  {
> struct flowi6 fl6;
> struct dst_entry *dst;
> +   struct rt6_info *rt;
> int ret;
>
> memset(&fl6, 0, sizeof fl6);
> @@ -282,6 +286,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
> if ((ret = dst->error))
> goto put;
>
> +   rt = (struct rt6_info *)dst;
> if (ipv6_addr_any(&fl6.saddr)) {
> ret = ipv6_dev_get_saddr(&init_net, ip6_dst_idev(dst)->dev,
>  &fl6.daddr, 0, &fl6.saddr);
> @@ -305,6 +310,9 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
> goto put;
> }
>
> +   if (rt->rt6i_flags & RTF_GATEWAY)
> +   addr->network = RDMA_NETWORK_IPv6;
> +
> ret = dst_fetch_ha(dst, addr, &fl6.daddr);
>  put:
> dst_release(dst);
> diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
> index 80f6cf2..c31c71b 100644
> --- a/drivers/infiniband/core/cache.c
> +++ b/drivers/infiniband/core/cache.c
> @@ -48,8 +48,8 @@ struct ib_pkey_cache {
>  };
>
>  struct ib_gid_cache {
> -   int table_len;
> -   union ib_gidtable[0];
> +   int table_len;
> +   struct ib_gid_entry table[0];
>  };
>
>  struct ib_update_work {
> @@ -88,7 +88,7 @@ int ib_get_cached_gid(struct ib_device *device,
> if (index < 0 || index >= cache->table_len)
> ret = -EINVAL;
> else
> -   *gid = cache->table[index];
> +   *gid = cache->table[index].gid;
>
> read_unlock_irqrestore(&device->cache.lock, flags);
>
> @@ -98,6 +98,7 @@ EXPORT_SYMBOL(ib_get_cached_gid);
>
>  int ib_find_cached_gid(struct ib_device *device,
>union ib_gid *gid,
> +  u8   gid_type,
>u8   *port_num,
>u16  *index)
>  {
> @@ -115,7 +116,8 @@ int ib_find_cached_gid(struct ib_device *device,
> for (p = 0; p <= end_port(device) - start_port(device); ++p) {
> cache = device->cache.gid_cache[p];
> for (i = 0; i < cache->table_len; ++i) {
> -   if (!memcmp(gid, &cache->table[i], sizeof *gid)) {
> +   if (!memcmp(gid,

Re: [PATCH 1/6] IB/Core: Changes to the IB Core infrastructure for RoCEv2 support

2014-12-30 Thread Moni Shoua
> Although you follow the spec here, 3 types of RDMA_NETWORK are not
> really required. Maybe we can get rid of this
> Maybe we can get rid of this duplication
>
> [SOM]: Not sure i understood the duplication here or why it's not required?
> We now have a new 'network/L3' layer on top of L2 - that was the reason, does 
> it not make sense?
>
Once you know pair (type of RoCE and GID value) you know what is the
network type
And once you know network type you know the pair.
So, it is not really necessary to keep them all stored.
My idea is to get rid if the type that describes network type

> [SOM]: Well, partly the use of get_port_type() was motivated by the 
> SPEC(Query HCA - 17.5.x.x IIRC) talking about the need to have a port_type
> attribute as part of RoCEV2 that would indicate if a HW device/port supported 
> RoCEV2 or not.  It was also serving another purpose
> in cma_acqure_dev() as you can see above in the patch where it was helping 
> the use case of devices that only support V2 and not V1.
> Still feel it doesn't make sense?
> Not sure how/where did you want point 2 coming from - sysfs/proc/debugfs?
> I'd prefer to have that in the next stage of the patchset

Spec is confusing. Under the section QUERY HCA it describes changes to
the port attr.
Anyway, I see that spec points to the ability of querying capabilities
and comparing them against decisions but not as a method to take
decisions.
What I would do is to add 2 flags to ib_port_cap_flags,
IB_PORT_ROCE_SUP and IB_PORT_ROCE_V2_SUP.

I think that the main difference between approaches is how to decide
about the RoCE type to use for a session.
This is also why I think that we should not postpone the change in
cma_acquire_dev to later.

thanks
Moni
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 9/9] IB/cma: Join and leave multicast groups with IGMP

2015-11-25 Thread Moni Shoua
On Tue, Nov 24, 2015 at 8:15 PM, Jason Gunthorpe
 wrote:
> On Tue, Nov 24, 2015 at 11:41:10AM +0200, Moni Shoua wrote:
>> On Mon, Nov 23, 2015 at 11:25 PM, Jason Gunthorpe
>>  wrote:
>> > On Thu, Oct 15, 2015 at 07:07:12PM +0300, Matan Barak wrote:
>> >> diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h
>> >> index 0a40ed2..5bea0e8 100644
>> >> +++ b/include/rdma/ib_sa.h
>> >> @@ -206,6 +206,9 @@ struct ib_sa_mcmember_rec {
>> >>   u8   scope;
>> >>   u8   join_state;
>> >>   int  proxy_join;
>> >> + int  ifindex;
>> >> + struct net  *net;
>> >> + enum ib_gid_type gid_type;
>> >>  };
>> >
>> > This is really gross.
>> >
>> > Make ib_init_ah_from_mcmember accept a QP and get the above stuff from
>> > the QP.
>> >
>> > Jason
>>
>> Which QP is that. You might not have any existing QP when you want to
>> create the AH or you might have 10.
>
> roce multicast is only done with the CM and the CM always has a QP.
>
> Jason
I don't see why you can't join before having a QP and anyway,
rdma_create_qp() is optional
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-06 Thread Moni Shoua
On Sun, Dec 6, 2015 at 4:03 PM, Christoph Hellwig  wrote:
> The ULPs couldn't really care less about the network type.  The problem

In WC, ULPs care about network type as much as they care for dgid or port_num
So, network_type is indeed an information that is relevant only to
RoCE but it only follows the tradition
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-06 Thread Moni Shoua
On Mon, Dec 7, 2015 at 8:02 AM, Jason Gunthorpe
 wrote:
> Why? The sgid index must tell you the network type.

struct ib_wc doesn't have sgid or sgid_index field.
What you have though is the sgid taken from the GRH that is scattered
to the first 40/20 bytes of the receive WQE.  This is not enough to
determine the network type.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-06 Thread Moni Shoua
On Mon, Dec 7, 2015 at 8:34 AM, Jason Gunthorpe
 wrote:
> On Mon, Dec 07, 2015 at 08:15:40AM +0200, Moni Shoua wrote:
>
>> What you have though is the sgid taken from the GRH that is scattered
>> to the first 40/20 bytes of the receive WQE.  This is not enough to
>> determine the network type.
>
> It is enough to discover the sgid index which will tell you the type.
but how? all you have in hand is the sgid which can appear several
times in the GID table in different indices.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-07 Thread Moni Shoua
Well, just tell me how you want to discover gid_index when you poll
the WC out of the CQ.
Like I said, the sgid itself is in the GRH that is scattered to the
buffers in the receive queue. When ib_poll_cq() is called the pointer
to GRH is not passed so there is no way to determine the gid_index
inside the polling context.
BTW, why do you argue about this only now? This is not RoCE specific
issue. sgid_index was always resolved outside the polling context. The
only difference between then and now is that with InfiniBand there was
no conflict about sgid_index once you had the sgid. Now, when same gid
value can be present in multiple entries you need an extra parameter
to get distinguish between them - that would be the netwrok_type


On Mon, Dec 7, 2015 at 7:12 PM, Jason Gunthorpe
 wrote:
> On Mon, Dec 07, 2015 at 08:37:41AM +0200, Moni Shoua wrote:
>> On Mon, Dec 7, 2015 at 8:34 AM, Jason Gunthorpe
>>  wrote:
>> > On Mon, Dec 07, 2015 at 08:15:40AM +0200, Moni Shoua wrote:
>> >
>> >> What you have though is the sgid taken from the GRH that is scattered
>> >> to the first 40/20 bytes of the receive WQE.  This is not enough to
>> >> determine the network type.
>> >
>> > It is enough to discover the sgid index which will tell you the type.
>> but how? all you have in hand is the sgid which can appear several
>> times in the GID table in different indices.
>
> Eh? Reliably recovering the gid index is not optional. The network
> namespace stuff that is already in the kernel hard requires that
> ability. (This is the same argument we went around on already why pkey
> had to come from the wc, not from the payload)
>
> If your position is it cannot be done from a WC,QP as is, then
> gid_index needs to be added to the wc, or something else to remove the
> ambiguity.
>
> In either case, network_type is absolutely the wrong thing to have in
> the wc.
>
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-07 Thread Moni Shoua
On Mon, Dec 7, 2015 at 8:19 PM,   wrote:
> Moni Shoua asked:
>
>
>
>> but how? all you have in hand is the sgid which can appear several
>
>> times in the GID table in different indices.
>
>
> Step back and review the context.
>
> Adding an extra field to the WC is *not* going to redesign the hardware.

This is where you have a mistake. Hardware is already designed that
way. Mellanox NICs completion info have network_type indication of the
packet for which the completion refers to.
BTW, hardware that doesn't support network_type will still work. In
that case it is the SW indeed that finds the network_type but the
method for doing this should be avoided as much as possible.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-07 Thread Moni Shoua
On Mon, Dec 7, 2015 at 8:48 PM, Jason Gunthorpe
 wrote:
> On Mon, Dec 07, 2015 at 08:34:43PM +0200, Moni Shoua wrote:
>> Well, just tell me how you want to discover gid_index when you poll
>> the WC out of the CQ.
>
> Hey, I'm not desiging this rocev2 stuff, this is something the rocev2
> community needs to sort out.
>
>> Like I said, the sgid itself is in the GRH that is scattered to the
>> buffers in the receive queue. When ib_poll_cq() is called the pointer
>> to GRH is not passed so there is no way to determine the gid_index
>> inside the polling context.
>
> Then have an API to let the consumer recover it.
>
>> BTW, why do you argue about this only now? This is not RoCE specific
>> issue. sgid_index was always resolved outside the polling context. The
>> only difference between then and now is that with InfiniBand there was
>> no conflict about sgid_index once you had the sgid.
>
> I think you answered your own question. In IB and rocev1 the sgid
> index is not ambiguous, rocev2 breaks that requirement without
> adaquately fixing it.
>
> It is probably also the case that rocev1 has similar problems,
> depending on how vland and macvlan ended up being implemented.
>
>> Now, when same gid value can be present in multiple entries you need
>> an extra parameter to get distinguish between them - that would be
>> the netwrok_type
>
> Eh?  You are only worried about duplicates between rocev1 mode which
> uses the GUID and v2 mode which uses the host's IP address?
>
> What about duplicates entirely within v2 mode where vlan and macvlan
> can create duplicate host IP addresses.
In that case these entries will be distinguished  byt its eth devices
(eth device is part of the gid attributes structure)

>
> You can also just not create duplicate entries in the first place. For
> instance there probably isn't a really a good reason to use rocev2 for
> IPv6 link local addreses, that trivially eliminates the v1/v2
> collision.
First, spec says that I must.
Second, even if you have an example that makes it unnecessary for some
it is still necessary for others and therefore the netwotk_type in WC
is still needed
Third, regarding the IPv6 link local example,  IPv6 link local packet
is almost identical to RoCEv1 but not entirely. The former is an IP
protocol packet while the later isn't. Switches and network admins may
care about this.

>
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 20/37] IB/rdamvt: Add post send and recv stubs

2015-12-08 Thread Moni Shoua
Fix typo in the subject: IB/rdam --> IB/rdma
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/10] IB/qib: Use rdmavt protection domain

2015-12-08 Thread Moni Shoua
On Mon, Dec 7, 2015 at 10:49 PM, Dennis Dalessandro
 wrote:
> Remove protection domain datastructure from qib and use rdmavts version.
>
I failed to apply this patch
tried with rc3 and rc4
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-09 Thread Moni Shoua
> Eh? I think you've missed the point, there is no net device when
> looking at a wc.
>
> Look, here is a concrete direction:
>
> Replace all the crap in
> ib_init_ah_from_wc/get_sgid_index_from_eth/rdma_addr_find_dmac_by_grh
>
> with a straightforward
>
>rdma_dgid_index_from_wc(
> const struct ib_qp *qp,
> const struct ib_wc *wc,
> const struct ib_grh *grh,
> u16 *gid_index)
>
> Sort of function that reads the GRH and wc and returns the unambiguous
> gid index that was used to receive that packet on the UD QP.
>

I already answered this to but I'll do it again
RoCEv2 spec says that L3 header will be scattered to receive WQE in
the following way
IPv6 and RoCEv1 - 40 bytes of the L3 header (GRH or IPv6) to the first
40 bytes of the receive bufs
IPv4 - 20 bytes of the L3 header to the second half of the first 40
bytes of the receive bufs. The first 20 bytes remain undefined.

Now, if you think how you deduce network_type from GRH you'll see that
it requires tools like checksum validation and other validations and
you end up with a method that is not 100% error free. So,to eliminate
the need for heavy computation (with regards to the other option) and
be free from false deductions you have the option of getting
network_type from the hardware. So, if you do have hardware that
supports it why give it up?



> The gid cache is not allowed to create an ambiguity the driver cannot
> resolve.
>
> That said, I wouldn't object to vendor-specific bits in the wc. Ie if
> mlx hardware needs a network_type bit to implement
> rdma_find_dgid_index_from_wc, then fine - define a vendor specific
> place to put it. In this case rdma_find_dgid_index_from_wc would be a
> driver call back, which is fine, and what Caitlin was talking about.
>

This is not a Mellanox specific flag. See a quote from the spec

A17.4.5.1 UD COMPLETION QUEUE ENTRIES (CQES)
For UD, the Completion Queue Entry (CQE) includes remote address
information (InfiniBand Specification Vol. 1 Rev 1.2.1 Section
11.4.2.1). For RoCEv2, the remote address information comprises the
source L2 Address and a flag that indicates if the received frame is
an IPv4, IPv6 or RoCE packet.



> But, it is not part of our verbs API, and I'd *strongly* encourage
> other vendors and future hardware to simply return the gid index that
> the hardware matched instead of requiring the software to try and
> guess after the fact.

Could be problematic for virtual machine architectures that give a
portion of the entire GID table to a VM that index it 0..N
>
> This is the same issue/argument we went around and around on the
> lladdr ipoib details when working on the namespace patches, about how
> important it is to resolve the namespace from the hardware headers.
>
> Of course once we have the gid index we now have the net device and
> other information needed to make namespaces work.
>
> .. and this is part of what I mean what I said the interface from the
> gid cache code is not a sane API and needs to be changed.
>
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-09 Thread Moni Shoua
On Mon, Dec 7, 2015 at 8:48 PM, Jason Gunthorpe
 wrote:
> On Mon, Dec 07, 2015 at 08:34:43PM +0200, Moni Shoua wrote:
>> Well, just tell me how you want to discover gid_index when you poll
>> the WC out of the CQ.
>
> Hey, I'm not desiging this rocev2 stuff, this is something the rocev2
> community needs to sort out.
>
>> Like I said, the sgid itself is in the GRH that is scattered to the
>> buffers in the receive queue. When ib_poll_cq() is called the pointer
>> to GRH is not passed so there is no way to determine the gid_index
>> inside the polling context.
>
> Then have an API to let the consumer recover it.
We do but it has it disadvantages. We'll use it when there it no other
choice (i.e. not supporting HW)
>
>> BTW, why do you argue about this only now? This is not RoCE specific
>> issue. sgid_index was always resolved outside the polling context. The
>> only difference between then and now is that with InfiniBand there was
>> no conflict about sgid_index once you had the sgid.
>
> I think you answered your own question. In IB and rocev1 the sgid
> index is not ambiguous, rocev2 breaks that requirement without
> adaquately fixing it.

ambiguity is not the issue here. You say that sgid_index needs to be
known during poll so why raise it now?
We want to add information to the CQE to help resolve the conflict but
we leave the resolution of sgid index to the caller, like for RoCEv1
and like for native InfiniBand
>
> It is probably also the case that rocev1 has similar problems,
> depending on how vland and macvlan ended up being implemented.
>
>> Now, when same gid value can be present in multiple entries you need
>> an extra parameter to get distinguish between them - that would be
>> the netwrok_type
>
> Eh?  You are only worried about duplicates between rocev1 mode which
> uses the GUID and v2 mode which uses the host's IP address?
>
> What about duplicates entirely within v2 mode where vlan and macvlan
> can create duplicate host IP addresses.
>
> You can also just not create duplicate entries in the first place. For
> instance there probably isn't a really a good reason to use rocev2 for
> IPv6 link local addreses, that trivially eliminates the v1/v2
> collision.
>
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/37] Add rdma verbs transport library

2015-12-09 Thread Moni Shoua
On Mon, Dec 7, 2015 at 10:42 PM, Dennis Dalessandro
 wrote:
> The following series implements rdmavt. This is the rdma verbs transport
> software library which will help to solve the problem of code duplication
> between hardware drivers when it comes to a verbs implementation.
>
> Rdmavt is basically just another verbs provider and lives in the Infiniband 
> tree
> in a new sw directory. It provides a software implementation of the Infiniband
> verbs API. More details can be found in the following threads:
>
> http://www.spinics.net/lists/linux-rdma/msg29064.html
> http://www.spinics.net/lists/linux-rdma/msg29922.html
>
> This patch series is based on 4.4-rc3.
>
> ---
Hi Denny

Can you tell how you see the end of this project?
Wouldn't it be right to start with goals and design RFC with more
details than the one you have in the links above, like target drivers
and interfaces?
But since you've already posted this, can we at least have this
discussion now before you proceed so we can all have a chance to
influence the final result?

To be more specific, I always imagined the send path like this
[1] Application -> [2] ib_core -> [3] rdmavt -> [4] rvt-backend
The interfaces from [1] to [3] are like the ones we have today and IMO
the one from [3] t0 [4] should look like

int qib_verbs_send(struct qib_qp *qp, struct qib_ib_header *hdr,
   u32 hdrwords, struct qib_sge_state *ss, u32 len);

With what I see now, what RDMAVT does is to forward the call to to the
rvt-backend  as is. I'm sure that you are not going to keep it this
way or the goal for removing code duplication won't be achieved.
However, I still would like to know about the plans for being there.

Also, you  you've out yourself as the maintainer of this project. You
deserve it technically of course but since RDMA_VT is not going to
serve just Intel's backends, wouldn't it be right if we add other
co-maintainers that have an interest on this project?

To conclude, RDMA_VT is a project with importance for more than one
vendor and as such it needs to be developed in cooperation or else it
would not serve its purpose.

Thanks

Moni
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-09 Thread Moni Shoua
>>
>> A17.4.5.1 UD COMPLETION QUEUE ENTRIES (CQES)
>> For UD, the Completion Queue Entry (CQE) includes remote address
>> information (InfiniBand Specification Vol. 1 Rev 1.2.1 Section
>> 11.4.2.1). For RoCEv2, the remote address information comprises the
>> source L2 Address and a flag that indicates if the received frame is
>> an IPv4, IPv6 or RoCE packet.
>
> rdma_dgid_index_from_wc satisfies the above spec requirements without
How? Please explain.

> creating a horrible API, or requiring all vendors to implement a
> network type flag.
>
Actually you haven't suggested anything yet besides a name to the function.
I already said that calculating gid_index from wc and grh requires
extra CPU cycles and is prone to mistakes. But, I might be wrong and
you have a better idea. Do you?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-10 Thread Moni Shoua
> No, I am defining an API that *make sense* and doesn't leak useless
> details. Of course that doesn't force code duplication or anyhting
> like that, just implement it smartly.
>
> I think mlx made a big mistake returning network_type instead of gid
> index, and I don't want to see that error enshrined in our API.
>
returning gid_index is wrong because it forces CQ pollers to be aware
of the entire table. Like I already mentioned, the GID table is a HW
resource that can be divided and handed to multiple VMs,

>> The Verbs are a low-level API, that should report exactly what was
>> received from the wire.  In the RoCEv2 case, it should be the GID/IP
>> addresses and the protocol type.  The addressing information is not
>> intended to be used directly by applications; it is the raw bits
>> that were accepted from the wire.
>
> Low level details isn't what any in kernel consumer needs. Everything
> in kernel needs the gid index to determine the namespace, routing and
> other details. It is not optional. A common API is thus needed to do
> this conversion.
>
> API-wise, once you get the gid index then it is trivial to make easy
> extractors for everything else. ie for example:
> rdma_get_ud_src_sockaddr(gid_index,&addr,wc,grh)
> rdma_get_ud_dst_sockaddr(gid_index,&addr,wc,grh)
>
>> ib_init_ah_from_wc() and friends is exactly the place that you want
>> to create an address handle based on completion and packet fields.
>
> CMA needs exactly the same logic as well, the fact it doesn't have it
> is a bug in this series.
>
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-14 Thread Moni Shoua
On Thu, Dec 3, 2015 at 3:47 PM, Matan Barak  wrote:
> Hi Doug,
>
> This series adds the support for RoCE v2. In order to support RoCE v2,
> we add gid_type attribute to every GID. When the RoCE GID management
> populates the GID table, it duplicates each GID with all supported types.
> This gives the user the ability to communicate over each supported
> type.
>
> Patch 0001, 0002 and 0003 add support for multiple GID types to the
> cache and related APIs. The third patch exposes the GID attributes
> information is sysfs.
>
> Patch 0004 adds the RoCE v2 GID type and the capabilities required
> from the vendor in order to implement RoCE v2. These capabilities
> are grouped together as RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP.
>
> RoCE v2 could work at IPv4 and IPv6 networks. When receiving ib_wc, this
> information should come from the vendor's driver. In case the vendor
> doesn't supply this information, we parse the packet headers and resolve
> its network type. Patch 0005 adds this information and required utilities.
>
> Patches 0006 and 0007 adds route validation. This is mandatory to ensure
> that we send packets using GIDS which corresponds to a net-device that
> can be routed to the destination.
>
> Patches 0008 and 0009 add configfs support (and the required
> infrastructure) for CMA. The administrator should be able to set the
> default RoCE type. This is done through a new per-port
> default_roce_mode configfs file.
>
> Patch 0010 formats a QP1 packet in order to support RoCE v2 CM
> packets. This is required for vendors which implement their
> QP1 as a Raw QP.
>
> Patch 0011 adds support for IPv4 multicast as an IPv4 network
> requires IGMP to be sent in order to join multicast groups.
>
> Vendors code aren't part of this patch-set. Soft-Roce will be
> sent soon and depends on these patches. Other vendors, like
> mlx4, ocrdma and mlx5 will follow.
>
> This patch is applied on "Change per-entry locks in GID cache to table lock"
> which was sent to the mailing list.
>
> Thanks,
> Matan
>
> Changed from V1:
>  - Rebased against Linux 4.4-rc2 master branch.
>  - Add route validation
>  - ConfigFS - avoid compiling INFINIBAND=y and CONFIGFS_FS=m
>  - Add documentation for configfs and sysfs ABI
>  - Remove ifindex and gid_type from mcmember
>
> Changes from V0:
>  - Rebased patches against Doug's latest k.o/for-4.4 tree.
>  - Fixed a bug in configfs (rmdir caused an incorrect free).
>
> Matan Barak (8):
>   IB/core: Add gid_type to gid attribute
>   IB/cm: Use the source GID index type
>   IB/core: Add gid attributes to sysfs
>   IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type
>   IB/core: Move rdma_is_upper_dev_rcu to header file
>   IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path
>   IB/rdma_cm: Add wrapper for cma reference count
>   IB/cma: Add configfs for rdma_cm
>
> Moni Shoua (2):
>   IB/core: Initialize UD header structure with IP and UDP headers
>   IB/cma: Join and leave multicast groups with IGMP
>
> Somnath Kotur (1):
>   IB/core: Add rdma_network_type to wc
>
>  Documentation/ABI/testing/configfs-rdma_cm   |  22 ++
>  Documentation/ABI/testing/sysfs-class-infiniband |  16 ++
>  drivers/infiniband/Kconfig   |   9 +
>  drivers/infiniband/core/Makefile |   2 +
>  drivers/infiniband/core/addr.c   | 185 +
>  drivers/infiniband/core/cache.c  | 169 
>  drivers/infiniband/core/cm.c |  31 ++-
>  drivers/infiniband/core/cma.c| 261 --
>  drivers/infiniband/core/cma_configfs.c   | 321 
> +++
>  drivers/infiniband/core/core_priv.h  |  45 
>  drivers/infiniband/core/device.c |  10 +-
>  drivers/infiniband/core/multicast.c  |  17 +-
>  drivers/infiniband/core/roce_gid_mgmt.c  |  81 --
>  drivers/infiniband/core/sa_query.c   |  76 +-
>  drivers/infiniband/core/sysfs.c  | 184 -
>  drivers/infiniband/core/ud_header.c  | 155 ++-
>  drivers/infiniband/core/uverbs_marshall.c|   1 +
>  drivers/infiniband/core/verbs.c  | 170 ++--
>  drivers/infiniband/hw/mlx4/qp.c  |   7 +-
>  drivers/infiniband/hw/mthca/mthca_qp.c   |   2 +-
>  drivers/infiniband/hw/ocrdma/ocrdma_ah.c |   2 +-
>  include/rdma/ib_addr.h   |  11 +-
>  include/rdma/ib_cache.h  |   4 +
>  include/rdma/ib_pack.h   |  45 +++-
>  include/rdma/ib_sa.h |   3 

Re: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-15 Thread Moni Shoua
> The part that bothers me about this is that this statement makes sense
> when just thinking about the spec, as you say.  However, once you
> consider namespaces, security implications make this statement spec
> compliant, but still unacceptable.  The spec itself is silent on
> namespaces.  But, you guys wanted, and you got, namespace support.
> Since that's beyond spec, and carries security requirements, I think
> it's fair to say that from now on, the Linux kernel RDMA stack can no
> longer *just* be spec compliant.  There are additional concerns that
> must always be addressed with new changes, and those are the namespace
> constraint preservation concerns.

I can't object to that but I really would like to get an example of a
security risk.
So far, besides hearing that the way we choose to handle completions
is wrong, I didn't get a convincing example of how where it doesn't
work. Moreover, regarding security, all we wanted is for HW to report
the L3 protocol (IB, IPv4, or IPv6) in the packet. This is data that
with some extra CPU cycles can be obtained from the 40 bytes that are
scattered to the receive bufs anyway. So, if there is a security hole
it exists from day one of the IB stack and this is not the time we
should insist on fixing it.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-17 Thread Moni Shoua
> No, not true.  You are implementing RoCEv2 support, which is an entirely
> new feature.  So this feature can't have had a security hole since
> forever as it has never been in the kernel before now.  The objections
> are arising because of the ordering of events.  Specifically, we added
> the core namespace support (even though it isn't complete, so far it's
> the infrastructure ready for various upper portions of the stack to
> start using, but it isn't a complete stack wide solution yet) first, and
> so this new feature, which will need to be a part of that namespace
> infrastructure that other parts of the IB stack can use, should have its
> namespace support already enabled (ideally, but if it didn't, it should
> at least have a clear plan for how to enable it in the future).  Jason's
> objection is based upon this premise and the fact that a technical
> review of the code makes it look like the core namespace infrastructure
> becomes less complete, not more, with the inclusion of these patches.
>
> As I understand it, prior to these patches there would always be a 1:1
> mapping of GID to gid_index because you would never have duplicate GIDs
> in the GID table.  That allowed an easy, definitive 1:1 mapping of GID
> to namespace via the existing infrastructure for any received packet [1].
>

Incorrect. Even with RoCEv1 you can have 2 GID table entries with the same
GID value but with different VLANs. The conflict in looking for a matching index
was resolved by making the pair  as a key for this table. The value
of VLAN was obtained from the completion BTW. All we do now is widen the
definition to include network_type (or L3 protocol if you like) to the
key (and therefore
to the completion structure)

> These patches add the concept of duplicate GIDs that are differentiated
> by their RoCE version (also called network type).  So, now, an incoming
> packet could match a couple different gid_indexes and we need additional
> information to get back to the definitive 1:1 mapping.  The submitted
> patches are designed around a lazy resolution of the namespace,
> preferring to defer the work of mapping the incoming packet to a unique
> namespace until that information is actually needed.  To enable this
> lazy resolution, it provides the network_type so that the resolution can
> be done.
>
You could say that even for native IB. Now, you resolve sgid_index
only when you really
need it (with ib_init_ah_from_wc()) and you don't fill the completion
structure with gid_index
every time you poll.

> This is a fair assessment of the current state of things and what these
> patches do, yes?
>
> Jason's objections are this:
>
> 1)  The lazy resolution is wrong.
> 2)  The use of network_type as the additional information to get to the
> unique namespace is vendor specific cruft that shouldn't be part of the
> core kernel API.

Not true. This is not vendor specific information. Spec says that network_type
be a part of the completion. Honestly, I really don't understand the resistance
from implementing the spec as it is written.
>
> Jason's preference would be that the above issues be resolved by
> skipping the lazy resolution and instead doing proactive resolution on
> receipt of a packet and then probably just pass the namespace around
> instead of passing around the information needed to resolve the
> namespace.  Or, at a minimum, at least make the information added to the
> core API not something vendor specific like network_type, which is a
> detail of the Mellanox implementation.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] Generic InfiniBand transport done in software

2015-12-22 Thread Moni Shoua
Hi,

In the past months the need for a kernel module that implements the InfiniBand
transport in software and unify all the InfiniBand software drivers has
been raised. Since then, nobody has submitted any design proposal that satisfy
the initial thoughts and can serve various back-ends.

The following is a RFC that presents a solution made of a single
generic InfiniBand
driver and many hardware specific back-end drivers. The RFC defines
the requirements
and the interfaces that have to be part of any generic InfiniBand driver.
A generic InfiniBand driver that is not compliant with this RFC wouldn't be able
to serve different back-ends and therefore would miss its target.



A. Introduction

In Linux kernel, the responsibility to implement the InfiniBand protocol is
roughly divided between 2 elements. The first are the core drivers which expose
an abstract verbs interface to the upper layer as well as interfaces to some
common IB services like MAD, SA and CM. The second are vendor drivers and
hardware which implement the abstract verbs interface.

A high level view of the model is in Figure A1

 +-+
 | |
 |IB core  |
 |drivers  |
 | |
 +++
  | Common

  | Vendor
 +++
 | |
 |  Hardware   |
 |  drivers|
 | |
 +++
 | |
 |  Hardware   |
 | |
 +-+

A1 - IB implementation model in Linux kernel

In the vendor part of the model, the devision of work between software and
hardware is undefined and is usually one of the two below

- Context and logic are  managed in software. Hardware role is limited to
  lower layer protocols (depending on the link layer) and maybe some offloads
- Context and logic are managed in hardware while software role is to create
  or destroy a context in the hardware and gets notified when hardware reports
  about a completions tasks.

The following examples demonstrates the difference between the approaches above.

- Send flow: application calls post_send() with QP and a WR. In the software
  based approach the QP context is retrieved, the WR is parsed and a proper IB
  packet is formed to be put in hardware buffers. In hardware based approach the
  driver puts the WR in hardware buffers with a handle to the QP and hardware
  does the rest.
- Receive flow: a data packet is received and put in hardware buffers (assume
  that the destination is a RC QP). In software based approach the packet is
  handed to the driver. The driver retrieves the context by QPN, decides if
  ACK/NACK packet is required (if so, generates it) and decides if CQE is
  required on the CQ of the QP. In the other approach the hardware compares
  the packet to the state in the context, generates the ACK/NACK and raises an
  event to the driver that CQE is ready for reading.

Figure A2 illustrates hardware/software relationship in the implementation of a
IB transport solution in the software based approach.

  ++
  |  IB transport  |
  |  capable driver|
  ||
  |  +--+  |
  |  |Context (deep)|  |
  |  |  |  |
  |  +--+  |
  ||
  |  +--+  |
  |  | Logic (IB transport) |  |
  |  |  |  |
  |  +--+  |
  ||
  |  +--+  |
  |  | Interface|  |
  |  |  |  |
  |  +--+  |
  ++
|
|
  ++
  |  +--+  |
  |  | Interface|  |
  |  |  |  |
  |  +--+  |
  ||
  |  +--+  |
  |  | Context (shallow)|  |
  |  |  |  |
  |  +--+  |
  ||
  |  +--+  |
  |  | Logic (link layer)   |  |
  |  |  |  |
  |  +--+  |
  ||
  |  IB transport  |
  |  incapable 

Re: [RFC] Generic InfiniBand transport done in software

2015-12-23 Thread Moni Shoua
>
>
> This makes no mention of the already posted work which aims to consolidate
> the qib and hfi1 drivers verbs implementation. However it does seem to be
> mostly in line with what we have already presented for rdmavt and the
> direction the next set of patches is going in. Have you seen something in
> rdmavt that contradicts this? I have not seen the need to write a detailed
> RFC because it's pretty straight forward what needs to happen. Take the
> common parts of qib and hfi and move to another kmod. Don't do anything that
> prevents soft-roce from using the same infrastructure. I think we are
> accomplishing that and have been very open during the process.
>
> Specifically this RFC does not capture much more detail beyond what has
> already been posted. There are also a number of details regarding hardware
> specifics that are not dealt with here but are in the current rdmavt patch
> series, as well as patches that are yet to be posted. The bottom line is we
> can't sacrifice performance for a perfect API.
>
> Previous discussion threads:
> http://marc.info/?l=linux-rdma&m=144353141420065&w=2
> http://marc.info/?l=linux-rdma&m=144563342718705&w=2
> http://marc.info/?l=linux-rdma&m=144614769419528&w=2
>
> Rdmavt & Qib code submissions:
> http://marc.info/?l=linux-rdma&m=144968107508268&w=2
> http://marc.info/?l=linux-rdma&m=144952133926970&w=2
> Feedback has been received on these patches and unless otherwise noted in
> the relevant threads will be incorporated into the V2 of the patch sets.
>

Hi Denny

http://marc.info/?l=linux-rdma&m=144614769419528&w=2 introduced a
concept (RVT) which was never followed by a discussion and never
agreed upon. Moreover,
http://marc.info/?l=linux-rdma&m=144952098726776&w=2 presents a work
that besides keeping the name RVT is far from the immature concept I
mentioned earlier and its scope was changed from general purpose
solution to Intel and HFI/QIB only. I therefore conclude that the
concept of RVT, as it was supposed to be, was abandoned.

Moni
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Generic InfiniBand transport done in software

2015-12-24 Thread Moni Shoua
>
>
> There were discussions, and Mellanox even contributed code to the effort.
> See Kamal's patches in the patch set I provided.
>
As far as I see it discussions were shallow and never produced an
agreement. Kamal's patches should not be considered as as such.

>> http://marc.info/?l=linux-rdma&m=144952098726776&w=2 presents a work
>> that besides keeping the name RVT is far from the immature concept I
>> mentioned earlier and its scope was changed from general purpose
>> solution to Intel and HFI/QIB only.
>
>
> The scope has never changed. Our goal is, and has always been to remove the
> code duplication between qib and hfi1. We are doing that by way of rdmavt.
> It is limited in scope to Intel's drivers currently for what I hope are
> obvious reasons.
>
So you actually agree that rdmavt was intended to be a solution to
Intel's specific drivers.
Fair, but IMO this is not what we aimed for.
In fact, if this is an Intel specific solution then why put it in
drivers/infiniband/sw and why publish it when it is not ready?

> I think it makes sense that soft-roce be added as well and hope that
> Mellanox decides to contribute rather than reinventing the wheel.
>
> Is there something in rdmavt that would not work for soft-roce, or is
> something fundamental missing? I have asked this a number of times and
> nothing has been raised so I assume there are no issues. If there are lets
> discuss them.
>
Interfaces between rdmavt and its backends are missing. I consider
this as fundamental.
Concerns were raised but answers were not provided, at least not
satisfying answers.

> Reading through your RFC here, perhaps something like the multicast add
> and delete is concerning?  This is something that is not really needed by
> qib and hfi1 but may be for soft-roce. All that means is soft-roce needs to
> provide it and it would be optional for qib and hfi1. The rdmavt
> architecture is flexible and allows exactly this.
>
>> I therefore conclude that the
>> concept of RVT, as it was supposed to be, was abandoned.
>
>
> This is absolutely incorrect. As mentioned above, nothing has changed.
>
>
> -Denny
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Generic InfiniBand transport done in software

2015-12-24 Thread Moni Shoua
> I'm not sure I understand what you mean.  Denny has posted several high level
> emails similar to this one and has asked for public feedback.  We have been
> addressing all the feedback as it has come in and we continue to work toward
> this common layer.
>
> To that end we have been asking for you to identify any place you see an
> incompatibility with the rdmavt layer for SoftRoCE.  Patches and comments to
> rdmavt are encouraged and we have already incorporated patches from outside
> Intel.
>
I really don't know how to answer to that because I don't expect that
the stub interface to last and if it does, how the problem of code
duplication is going to be solved. When I know how interfaces look
like plus some other documentation I will be able to answer this.

> We are very sensitive to the fact that other drivers will want to use this
> layer in the future.  But I don't think this layer is set in stone.  To
> that end, any functionality which is specific to SoftRoCE should be added when
> SoftRoCE is submitted to the kernel.  To date I have not seen any show stopper
> objections except Sagi's comment on the lack of IB_WR_LOCAL_INV which we will
> drop in the next submission.
>
SoftRoCE was submitted today to staging. Now, if I want to take it out
of staging by using rdmavt I actually can't unless I'll use the stubs
and watch for rdmavt changes. This is unacceptable. Missing final
interfaces and documentation of how to use them is a show stopper IMO

> Also, please be aware that we are being very careful with this work to not
> sacrifice the performance of either qib or hfi1.  There are a couple of items
> you mention below which seem to indicate you would like a more "pure"
> separation of the driver.  I hope you understand that any work in this area
> which affects our performance must be accounted for and may not result in as
> "pure" a separation as you may like.  If that is a show stopper for SoftRoCE
> lets work through the specific examples.
Eventually you'll have to break WR to packets of MTU size, wouldn't you?
Anyway, this is what we had to finalize in the early discussion and
not after code posting

Ira,
I've read your comments to the RFC and I find them worth a thought
before I answer.
In general, finding a good abstraction of the back-end is the hardest
thing in this project and affects how the interface looks like.
I think that most if not all your comments fall in the category of abstraction.


>
> More inline.
>
>>
>> The following is a RFC that presents a solution made of a single
>> generic InfiniBand
>> driver and many hardware specific back-end drivers. The RFC defines
>> the requirements
>> and the interfaces that have to be part of any generic InfiniBand driver.
>> A generic InfiniBand driver that is not compliant with this RFC wouldn't be 
>> able
>> to serve different back-ends and therefore would miss its target.
>>
>> 
>>
>> A. Introduction
>> 
>> In Linux kernel, the responsibility to implement the InfiniBand protocol is
>> roughly divided between 2 elements. The first are the core drivers which 
>> expose
>> an abstract verbs interface to the upper layer as well as interfaces to some
>> common IB services like MAD, SA and CM. The second are vendor drivers and
>> hardware which implement the abstract verbs interface.
>>
>> A high level view of the model is in Figure A1
>>
>>  +-+
>>  | |
>>  |IB core  |
>>  |drivers  |
>>  | |
>>  +++
>>   | Common
>> 
>>   | Vendor
>>  +++
>>  | |
>>  |  Hardware   |
>>  |  drivers|
>>  | |
>>  +++
>>  | |
>>  |  Hardware   |
>>  | |
>>  +-+
>>
>> A1 - IB implementation model in Linux kernel
>>
>> In the vendor part of the model, the devision of work between software and
>> hardware is undefined and is usually one of the two below
>>
>> - Context and logic are  managed in software. Hardware role is limited to
>>   lower layer protocols (depending on the link layer) and maybe some offloads
>> - Context and logic are managed in hardware while software role is to create
>>   or destroy a context in the hardware and gets notified when hardware 
>> reports
>>   about a completions tasks.
>>
>> The following examples demonstrates the difference between the approaches 
>> above.
>>
>> - Send flow: application calls post_send() with QP and a WR. In the software
>>   based approach the QP context is retrieved, the WR is parsed and a proper 
>> IB
>>   packet 

Re: [RFC] Generic InfiniBand transport done in software

2015-12-27 Thread Moni Shoua
>
> Point is others have looked at the code. No issues have been called out to
> date as to why what is there won't work for everyone.
>
http://marc.info/?l=linux-rdma&m=144968107508268&w=2
Your answer to the send() issue is first an agreement with the comment
and later says that it can't be done because of how PIO and SDMA
(Intel specific implementation)
This is an example for a discussion that never ended with an agreement.


>
> Yes it is specific to Intel *now*, that doesn't mean it should stay that
> way. Rdmavt could, and in my opinion should, be extended to support
> soft-roce. I don't think replicating the same thing is a great idea.
>
But you post *now* a so called generic driver so it must now fit any
possible driver (including Soft RoCE)
> As to the location, where do you think it should go. drivers/infiniband/sw
> makes the most sense to me, but open to suggestions.
>
> And for the question of why publish when it's not ready, the better question
> is why not?  Is it not good to see the work in progress as it evolves so the
> community can provide feedback?
>
What kind of a feedback you expect when I don't have an idea about
your plans for rdmavt
Interfaces, flows, data structures... all is missing from the
documentation to rdmavt.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Generic InfiniBand transport done in software

2015-12-29 Thread Moni Shoua
> it's never going to fit any possible future driver.  Dennis and folks
> have done great work to move code outside the drivers into a shared
> library.  So far it's been driven just by the Intel drivers as that's
> the only thing they were interested in.
>

If it's not going to be a solution for anything else but Intel then
why declare it as such?
Where is that shared library? There amount of shared code in rdmavt
that can be considered as shared is very little.

> If you are interested in supporting SoftROCE please work with them
> by adjusting the code towards your requirements.  In Linux we have
> great results with iterative appoaches and I'd suggest you try it
> as well.
>
Exactly. All you asked for  is in the RFC I posted.

>
> You've got the code, so let's work based on that.
> --
I say let's agree on the interfaces and start writing code.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 00/36] Add rdma verbs transport library

2015-12-29 Thread Moni Shoua
> Changes since v1:
> Removed driver specific version
> Fixed license text to remove copyright and put on top
> Return 0 in rvt_map_sg instead of BAD_DMA_AGGRESS
> Remove #include of dma.h from dma.c
> Update comment about protection domain limit
> Remove comment on alternative design for private data
> Rename CDR macro to CHECK_DRIVER_OVERRIDE
> Change all the stubs to return EOPNOTSUPP
> Fix comment style for rvt_query_port
> Fix typo in subject
> Rename rdi.lk_table to rdi.lkey_table
> Rename rvt_sge.m => rvt_sge.cur_map (Sean)
> Rename rvt_sge.n => rvt_sge.cur_seg (Sean)
> Remove rvt_reg_phys_mr
> Drop support for commit 38071a461f0a ("IB/qib: Support the new memory
>registration API")
>
I don't understand what in this change log justifies a V2 for this patch set
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Generic InfiniBand transport done in software

2015-12-29 Thread Moni Shoua
> No. PIO and SDMA is driver specific and lives in the driver. Rdmavt has no
> concept of this. I'm agreeing that the send will be generic and have no hw
> specific stuff.
>
I understand that PIO/SDMA are not a concept of RVT. However, making
the send from RVT to driver exactly as the interface from ib_core to
RVT raises the question: What exactly do we achieve by this?
>
>
> As I've stated a number of times across multiple threads: It must not do
> anything that would prevent another driver from using it.
>
The question is not how Soft RoCE fits into this framework but how
does this framework achieve its goals.

>
>
> I expect feedback based on the code submissions. More will be coming
> shortly. I have taken all the feedback from the first post and will be
> sending a v2 shortly.
>
Again, I have no idea about the complete interfaces between both
pieces of the suggested solution.
- If you have them then please publish
- if you don't but plan to have them then why did you submit a half baked idea
- If you say that final interface is what we see now then I say that
the problem of code duplication isn't going to be resolved
So, what it is from the 3?

>
> -Denny
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Generic InfiniBand transport done in software

2015-12-29 Thread Moni Shoua
I think that my point is missed. See my answers inline


> This is incorrect.  This isn't some public API that we are exporting to
> user space.  Nor is it an API that out of tree drivers are using.  This
> is a purely kernel internal API for use by a limited number of drivers.
>  As such, it need not be finalized before it is submitted or used.  It
> can be taken one piece at a time, and if, at some point, it is
> determined that there are shortcomings to the API, it can be updated in
> place with all of the drivers that use it in a single patch or patch
> series.  So a finalized design prior to putting code in place is
> specifically *not* needed.
>
This is not a question of future backward comparability where
interfaces must be kept forever. I agree that kernel interfaces may be
changed with kernel moving forward. However, this is not what I'm
arguing against.

When one submits a RFC for a generic Infrastructure he must state what
are the interfaces between blocks of the design.
Soft RoCE block can't start until I know how the final interfaces look
like. This is an unacceptable method of work.

>
> They released it so that you can start hooking SoftRoCE into it.  As you
> hook it in, if it needs changes to work with SoftRoCE, simply make the
> changes needed and move on.

This is not a question if I can hook Soft RoCE driver into this framework.
In fact, I can't think of an IB driver that can't use this framework. What this
framework offers is just another hop from ib_core the real driver.
Where is the removal of duplicated code? This is a list of functions
that for now
must be implemented in the low level driver.

create_cq
destroy_cq
poll_cq
req_notify_cq
resize_cq
create_srq
modify_srq
destroy_srq
query_srq
create_qp
query_device
query_gid
alloc_ucontext
modify_device
modify_qp
dealloc_ucontext
query_port
destroy_qp
get_port_immutable
modify_port
query_qp
post_send
post_recv
post_srq_recv

Most if not all of them have common part in all drivers.
What are the plans to get rid of them? When?
Don't you think that this should be known in advance?

I already asked and never been answered seriously: what was
the purpose of the submission in this premature state of the code
It can't be for feedback because what kind of feedback can you provide
for just a skeleton? Moreover, today they submitted V2 with a changelog
that is almost 100% cosmetic changes. I really don't understand this kind
of work.




>
> I think Dennis' point, and I agree with him, is that you are over
> complicating the issue here.  This need not be a highly designed item,
> it needs to be a functional item, and we can build it as we go.  If you
> have to make changes to rdmavt in order to hook up SoftRoCE, that's
> fine, post them to the list, they will get reviewed.  As long as the
> change doesn't break or otherwise negatively impact qib and/or hfi1,
> then it should be fine.  If it does, then I'm sure Intel will work with
> you to find a solution that doesn't negatively impact them.

A reminder of what the initial goal was - remove code duplicates between
all IB transport drivers. This goal is complicated and in my RFC I explained
why. So, for start, I am not complicating anything that was simple before.

Second, what you are saying here is actually: "this is a project to serves
Intel's needs". So why treat it as a generic infrastructure? I'm not aiming to
hurt performance but Intel should aim for achieving the goals we agreed on
in the begging.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 5/7] IB/mlx4: Enable send of RoCE QP1 packets with IP/UDP headers

2015-12-30 Thread Moni Shoua
>
> The last hunk that you removed had a role and was by no means
> dead-code, right? so... (1) why it's correct to remove it? (2) if you
> want to introduce different way to implement what was done here, why
> in this patch? maybe add pre-patch for that

In a way you are right. This hunk does not insert a bug and even
improves correctness but it acutally belongs to an earlier patch
(dbf727de7440f73c4b92be4b958cbc24977e8ca2 IB/core: Use GID table in AH
creation and dmac resolution)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] RXE: A Soft RoCE back-end for the RVT

2016-01-06 Thread Moni Shoua
This patch introduces an implementation of a back-end that works with
RVT to make RoCE Verbs transport over any Ethernet network device.

Example:

After loading ib_rxe_net.ko
echo eth1 > /sys/module/ib_rxe_net/parameters/add
will create rvt0 IB device in RVT with Ethernet link layer
---
 drivers/infiniband/Kconfig|1 +
 drivers/infiniband/sw/Makefile|1 +
 drivers/infiniband/sw/rxe/Kconfig |   23 ++
 drivers/infiniband/sw/rxe/Makefile|5 +
 drivers/infiniband/sw/rxe/rxe_net.c   |  580 +
 drivers/infiniband/sw/rxe/rxe_net.h   |   89 +
 drivers/infiniband/sw/rxe/rxe_sysfs.c |  167 ++
 7 files changed, 866 insertions(+), 0 deletions(-)
 create mode 100644 drivers/infiniband/sw/rxe/Kconfig
 create mode 100644 drivers/infiniband/sw/rxe/Makefile
 create mode 100644 drivers/infiniband/sw/rxe/rxe_net.c
 create mode 100644 drivers/infiniband/sw/rxe/rxe_net.h
 create mode 100644 drivers/infiniband/sw/rxe/rxe_sysfs.c

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 1e82984..ef23047 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -75,6 +75,7 @@ source "drivers/infiniband/hw/ocrdma/Kconfig"
 source "drivers/infiniband/hw/usnic/Kconfig"
 
 source "drivers/infiniband/sw/rdmavt/Kconfig"
+source "drivers/infiniband/sw/rxe/Kconfig"
 
 source "drivers/infiniband/ulp/ipoib/Kconfig"
 
diff --git a/drivers/infiniband/sw/Makefile b/drivers/infiniband/sw/Makefile
index c1f6377..9b9f68d 100644
--- a/drivers/infiniband/sw/Makefile
+++ b/drivers/infiniband/sw/Makefile
@@ -1,2 +1,3 @@
 obj-$(CONFIG_INFINIBAND_RDMAVT)+= rdmavt/
+obj-$(CONFIG_INFINIBAND_RXE)   += rxe/
 
diff --git a/drivers/infiniband/sw/rxe/Kconfig 
b/drivers/infiniband/sw/rxe/Kconfig
new file mode 100644
index 000..d32843a
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/Kconfig
@@ -0,0 +1,23 @@
+config INFINIBAND_RXE
+   tristate "RoCE RDMAVT backend driver"
+   depends on INET && PCI && INFINIBAND_RDMAVT
+   ---help---
+   This driver implements the InfiniBand RDMA transport over
+   the Linux network stack. It enables a system with a
+   standard Ethernet adapter to interoperate with a RoCE
+   adapter or with another system running the RXE driver.
+   Documentation on InfiniBand and RoCE can be downloaded at
+   www.infinibandta.org and www.openfabrics.org. (See also
+   siw which is a similar software driver for iWARP.)
+
+   The driver is split into two layers, one interfaces with the
+   Linux RDMA stack and implements a kernel or user space
+   verbs API. The user space verbs API requires a support
+   library named librxe which is loaded by the generic user
+   space verbs API, libibverbs. The other layer interfaces
+   with the Linux network stack at layer 3.
+
+   To configure and work with soft-RoCE driver please use the
+   following wiki page under "configure Soft-RoCE (RXE)" section:
+
+   https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home
diff --git a/drivers/infiniband/sw/rxe/Makefile 
b/drivers/infiniband/sw/rxe/Makefile
new file mode 100644
index 000..3f6c05b
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/Makefile
@@ -0,0 +1,5 @@
+obj-$(CONFIG_INFINIBAND_RXE) += ib_rxe_net.o
+
+ib_rxe_net-y := \
+   rxe_net.o \
+   rxe_sysfs.o
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c 
b/drivers/infiniband/sw/rxe/rxe_net.c
new file mode 100644
index 000..84c9606
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -0,0 +1,580 @@
+/*
+ * Copyright (c) 2015 Mellanox Technologies Ltd. All rights reserved.
+ * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer in the documentation and/or other materials
+ *   provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CON