Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-15 Thread Ralph Campbell
On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
> Pradeep Satyanarayana wrote:
> > Pradeep Satyanarayana wrote:
> >> Roland Dreier wrote:
> >>>  > I guess I came to a premature conclusion. One set of tests ran fine 
> >>> and I made that
> >>>  > conclusion. Another set of tests caused the following crash:
> >>>
> >>> I don't really know how to interpret this.  Is this crash new, or is it
> >>> the same crash you were hoping this patch fixed?
> >> This is a new crash.
> > 
> > I see other manifestations resulting in different crashes :
> > 
> > :mon> t
> > [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> > [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> > [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 
> > [ib_ipoib]
> > [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0
> > [c0074603be00] c00c7190 .worker_thread+0xc0/0x180
> > [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0
> > [c0074603bf90] c00309fc .kernel_thread+0x54/0x70
> > 9:mon> e
> > cpu 0x9: Vector: 300 (Data Access) at [c0074603b720]
> > pc: c05ac390: ._spin_lock+0x20/0xc8
> > lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> > sp: c0074603b9a0
> >msr: 80009032
> >dar: 3a0
> >  dsisr: 4000
> >   current = 0xc00756ce8b00
> >   paca= 0xc0f63800
> > pid   = 18095, comm = ipoib
> > 9:mon>
> 
> Recreating the crash has been tricky. I have tried several hundred 
> times today
> to unload and reload IPoIB while there is traffic and no crashes happened. I 
> took
> a closer look at the IPoIB CM code and I see a few things that look 
> suspicious.
> 
> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is 
> held before 
> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch 
> (fix dangling pointer).
> Is this a potential race?

ipoib_cm_send() is only called by ipoib_start_xmit() so it is protected
by netif_tx_lock(dev) or stopping the ipoib network device.
It all depends on what pointer or data structure you think is being
accessed after it is freed or modified without the proper protection.

> In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, 
> &tx->flags) in 
> ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's 
> patch there is a 
> clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling 
> ipoib_cm_destroy_tx() only in 
> select cases. Was that intended?

The v4 patch comments explain the changes:
http://www.spinics.net/lists/linux-rdma/msg03733.html
Basically, IPOIB_FLAG_INITIALIZED now means that the struct ipoib_cm_tx
has completed the RC QP creation process via the CM instead of simply
when ipoib_cm_create_tx() allocates the structure.
The test and clear was used to indicate the struct ipoib_cm_tx
had been put on the destroy list and the reaper thread woken up.
Now ipoib_cm_destroy_tx() uses the tx->neigh pointer != NULL to
indicate that ipoib_cm_destroy_tx() has started the destroy process.
ipoib_cm_destroy_tx() is only called when netif_tx_lock() and priv->lock
are held to protect tx->neigh.
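
To make the rule above concrete, here is a minimal userspace sketch, not the
driver code. The names (tx_model, priv_model, tx_destroy_model) are
hypothetical, the lock is modelled as a plain flag, and it assumes tx->neigh
is cleared when the destroy process begins, which the description implies but
does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's struct definitions. */
struct neigh_model { int id; };

struct tx_model {
    struct neigh_model *neigh;  /* non-NULL until destroy has started */
    bool on_reap_list;          /* models moving tx onto the reap list */
};

struct priv_model {
    bool lock_held;             /* models priv->lock being held */
    int reap_requests;          /* how many times the reaper was kicked */
};

/*
 * Model of the "destroy once" guard. The caller must hold the lock.
 * Returns true if this call started the destroy, false if an earlier
 * call already did.
 */
static bool tx_destroy_model(struct priv_model *priv, struct tx_model *tx)
{
    assert(priv->lock_held);    /* locking contract, checked in the model */

    if (tx->neigh == NULL)
        return false;           /* destroy already in progress */

    tx->neigh = NULL;           /* mark: destroy has started */
    tx->on_reap_list = true;    /* hand the tx to the reaper */
    priv->reap_requests++;
    return true;
}
```

Calling tx_destroy_model() twice on the same tx queues it for the reaper only
once, which is the property the flag test used to provide.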

> Thanks
> Pradeep

The longer write up on locking is turning out to be very complex.
I will keep working on it but I think it will be just as hard
to understand as slogging through the code.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: When IBoE will be merged to upstream?

2010-07-15 Thread Roland Dreier
 > Small correction needed regarding the multicast forwarding.
 > Since we are talking about IPv6 multicast groups, which translate to
 > 33:33:xx:xx:xx:xx MAC address, the router listener notification protocol
 > is going to be MLD and not IGMP. Still there are switches which support
 > MLD forwarding to prevent the network flooding.

Well as I said the mapping of IBoE MGID to Ethernet address is not
specified.  However I agree that using the same mapping as IPv6 so we
end up with 33:33:... addresses makes sense.
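
As a rough illustration of what "the same mapping as IPv6" would mean, here
is a small sketch. It assumes the IPv6-over-Ethernet rule (33:33 followed by
the last four bytes of the 16-byte address) is applied to an MGID unchanged;
the function name is hypothetical and nothing here comes from the IBoE spec.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Map a 16-byte multicast GID to an Ethernet multicast address the way
 * IPv6 maps multicast addresses: 33:33 plus the low 32 bits of the GID.
 * Many MGIDs share one MAC, so the mapping is not reversible.
 */
static void mgid_to_ipv6_style_mac(const uint8_t mgid[16], uint8_t mac[6])
{
    assert(mgid != NULL && mac != NULL);

    mac[0] = 0x33;
    mac[1] = 0x33;
    memcpy(&mac[2], &mgid[12], 4);  /* last four bytes of the GID */
}
```

Because only 32 bits survive, a switch filtering on the resulting MAC cannot
tell apart MGIDs that differ only in their upper bytes.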

Yes, you are right that MLD snooping is the mechanism for switches to
discover IPv6 multicast group membership.  However for the IBoE case
there is no requirement that IPv6 multicast group membership corresponds
in any way to the IBoE multicast group membership for the interface (and
indeed as far as I can tell from the IBoE spec, there is no requirement
that any IPv6 interface be configured on an IBoE port).

Furthermore, even if an IBoE interface sends MLD messages for a given
IPv6 group, there is no requirement that a switch use the membership
information for that group to forward multicast packets with a non-IPv6
ethertype.

 - R.
-- 
Roland Dreier  || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-15 Thread Ralph Campbell
I will write up a description of the locking as I
understand it and the changes I made. Give
me a day or two to write it up and check it.

On Thu, 2010-07-15 at 04:56 -0700, Pradeep Satyanarayana wrote:
> Pradeep Satyanarayana wrote:
> > Pradeep Satyanarayana wrote:
> >> Roland Dreier wrote:
> >>>  > I guess I came to a premature conclusion. One set of tests ran fine 
> >>> and I made that
> >>>  > conclusion. Another set of tests caused the following crash:
> >>>
> >>> I don't really know how to interpret this.  Is this crash new, or is it
> >>> the same crash you were hoping this patch fixed?
> >> This is a new crash.
> > 
> > I see other manifestations resulting in different crashes :
> > 
> > :mon> t
> > [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> > [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> > [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 
> > [ib_ipoib]
> > [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0
> > [c0074603be00] c00c7190 .worker_thread+0xc0/0x180
> > [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0
> > [c0074603bf90] c00309fc .kernel_thread+0x54/0x70
> > 9:mon> e
> > cpu 0x9: Vector: 300 (Data Access) at [c0074603b720]
> > pc: c05ac390: ._spin_lock+0x20/0xc8
> > lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> > sp: c0074603b9a0
> >msr: 80009032
> >dar: 3a0
> >  dsisr: 4000
> >   current = 0xc00756ce8b00
> >   paca= 0xc0f63800
> > pid   = 18095, comm = ipoib
> > 9:mon>
> 
> Recreating the crash has been tricky. I have tried several hundred 
> times today
> to unload and reload IPoIB while there is traffic and no crashes happened. I 
> took
> a closer look at the IPoIB CM code and I see a few things that look 
> suspicious.
> 
> In the ipoib_cm_send() path no priv->lock is held, whereas the priv->lock is 
> held before 
> calling ipoib_cm_destroy_tx(). This is true with and without Ralph's patch 
> (fix dangling pointer).
> Is this a potential race?
> 
> In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, 
> &tx->flags) in 
> ipoib_cm_destroy_tx() which seems to be missing in Ralph's patch. In Ralph's 
> patch there is a 
> clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling 
> ipoib_cm_destroy_tx() only in 
> select cases. Was that intended?
> 
> Thanks
> Pradeep
> 




RE: When IBoE will be merged to upstream?

2010-07-15 Thread Alex Rosenbaum
> No, I think all switches will flood unknown multicast packets.  But
> there is a reason that IGMP snooping was invented -- it is inefficient
> (to say the least) to flood all multicast traffic.
>
> - R.


Small correction needed regarding the multicast forwarding.
Since we are talking about IPv6 multicast groups, which translate to
33:33:xx:xx:xx:xx MAC address, the router listener notification protocol
is going to be MLD and not IGMP. Still there are switches which support
MLD forwarding to prevent the network flooding.

Alex



[PATCH 17/25] infiniband/hw/nes: Convert pci_table entries to PCI_VDEVICE (if PCI_ANY_ID is used)

2010-07-15 Thread Peter Hüwe
From: Peter Huewe 

This patch converts pci_table entries, where .subvendor=PCI_ANY_ID and
.subdevice=PCI_ANY_ID, .class=0 and .class_mask=0, to use the
PCI_VDEVICE macro, and thus improves readability.

Signed-off-by: Peter Huewe 
---
 drivers/infiniband/hw/nes/nes.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c
index de7b9d7..1f26927 100644
--- a/drivers/infiniband/hw/nes/nes.c
+++ b/drivers/infiniband/hw/nes/nes.c
@@ -110,8 +110,8 @@ static unsigned int sysfs_nonidx_addr;
 static unsigned int sysfs_idx_addr;
 
 static struct pci_device_id nes_pci_table[] = {
-	{PCI_VENDOR_ID_NETEFFECT, PCI_DEVICE_ID_NETEFFECT_NE020, PCI_ANY_ID, PCI_ANY_ID},
-	{PCI_VENDOR_ID_NETEFFECT, PCI_DEVICE_ID_NETEFFECT_NE020_KR, PCI_ANY_ID, PCI_ANY_ID},
+	{PCI_VDEVICE(NETEFFECT, PCI_DEVICE_ID_NETEFFECT_NE020), },
+	{PCI_VDEVICE(NETEFFECT, PCI_DEVICE_ID_NETEFFECT_NE020_KR), },
 	{0}
 };
 
-- 
1.7.1
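
For readers unfamiliar with the macro, the following userspace sketch shows
why the two table forms should be equivalent. Everything prefixed DEMO_ or
demo_ is a hypothetical stand-in written for this example (including the ID
values); it mirrors the positional form of the macro and is not copied from
the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct pci_device_id, with the same field order. */
struct demo_pci_device_id {
    uint32_t vendor, device;
    uint32_t subvendor, subdevice;
    uint32_t class_code, class_mask;
    unsigned long driver_data;
};

#define DEMO_PCI_ANY_ID             (~0u)
#define DEMO_PCI_VENDOR_ID_ACME     0x1234u  /* made-up vendor ID */
#define DEMO_PCI_DEVICE_ID_ACME_NIC 0x0100u  /* made-up device ID */

/* Pastes the vendor name and wildcards the subsystem IDs and class. */
#define DEMO_PCI_VDEVICE(vend, dev) \
    DEMO_PCI_VENDOR_ID_##vend, (dev), DEMO_PCI_ANY_ID, DEMO_PCI_ANY_ID, 0, 0

/* Old style: every matched field spelled out by hand. */
static const struct demo_pci_device_id demo_old_style = {
    DEMO_PCI_VENDOR_ID_ACME, DEMO_PCI_DEVICE_ID_ACME_NIC,
    DEMO_PCI_ANY_ID, DEMO_PCI_ANY_ID
};

/* New style: the macro supplies the same values. */
static const struct demo_pci_device_id demo_new_style = {
    DEMO_PCI_VDEVICE(ACME, DEMO_PCI_DEVICE_ID_ACME_NIC),
};

/* Compare every field that takes part in device matching. */
static int demo_same_match_fields(const struct demo_pci_device_id *a,
                                  const struct demo_pci_device_id *b)
{
    assert(a != NULL && b != NULL);
    return a->vendor == b->vendor && a->device == b->device &&
           a->subvendor == b->subvendor && a->subdevice == b->subdevice &&
           a->class_code == b->class_code && a->class_mask == b->class_mask;
}
```

Fields left out of the old-style initializer are zero-initialized by C, which
is why class and class_mask match the explicit zeros the macro writes.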



RE: some dapl assistance

2010-07-15 Thread Davis, Arlin R
Itay,

Can you add "-env I_MPI_DAPL_PROVIDER ofa-v2-mthca0-1" to your mpiexec options
to make sure we pick up the correct v2 provider with pkey support? Also bump
up I_MPI_DEBUG to 5 so I can see the provider selection from MPI output.

Thanks,

-arlin

>-----Original Message-----
>From: Itay Berman [mailto:it...@voltaire.com] 
>Sent: Thursday, July 15, 2010 10:19 AM
>To: Davis, Arlin R; Or Gerlitz
>Cc: linux-rdma
>Subject: RE: some dapl assistance
>
>No. Same warning:
>
>[r...@dodly0 compat-dapl-1.2.18]# mpiexec -ppn 1 -n 2 -env 
>I_MPI_FABRICS dapl:dapl -env I_MPI_DEBUG 2 -env 
>I_MPI_CHECK_DAPL_PROVIDER_MISMATCH none -env DAPL_DBG_TYPE 0x -env 
>DAPL_IB_PKEY 0x8002 /tmp/osu
>dodly4:625b: dapl_init: dbg_type=0x,dbg_dest=0x1
>dodly0:2c17: dapl_init: dbg_type=0x,dbg_dest=0x1
>dodly0:2c17:  open_hca: device mlx4_0 not found
>dodly0:2c17:  open_hca: device mlx4_0 not found
>dodly4:625b:  Warning: new pkey(32770), query (Success) err or 
>key !found, using defaults
>dodly4:625b:  query_hca: port.link_layer = 0x1
>dodly4:625b:  query_hca: (a0.0) eps 262076, sz 16351 evds 
>65408, sz 4194303 mtu 2048 - pkey 32770 p_idx 0 sl 0
>dodly4:625b:  query_hca: msg 1073741824 rdma 1073741824 iov 32 
>lmr 524272 rmr 0 ack_time 16 mr 4294967295
>dodly0:2c17:  Warning: new pkey(32770), query (Success) err or 
>key !found, using defaults
>dodly0:2c17:  query_hca: port.link_layer = 0x1
>dodly0:2c17:  query_hca: (a0.0) eps 64512, sz 16384 evds 
>65408, sz 131071 mtu 2048 - pkey 32770 p_idx 0 sl 0
>dodly0:2c17:  query_hca: msg 2147483648 rdma 2147483648 iov 27 
>lmr 131056 rmr 0 ack_time 16 mr 4294967295
>dodly0:2c17:  Warning: new pkey(32770), query (Success) err or 
>key !found, using defaults
>dodly0:2c17:  query_hca: port.link_layer = 0x1
>dodly0:2c17:  query_hca: (a0.0) eps 64512, sz 16384 evds 
>65408, sz 131071 mtu 2048 - pkey 32770 p_idx 0 sl 0
>dodly0:2c17:  query_hca: msg 2147483648 rdma 2147483648 iov 27 
>lmr 131056 rmr 0 ack_time 16 mr 4294967295
>dodly4:625b:  Warning: new pkey(32770), query (Success) err or 
>key !found, using defaults
>dodly4:625b:  query_hca: port.link_layer = 0x1
>dodly4:625b:  query_hca: (a0.0) eps 262076, sz 16351 evds 
>65408, sz 4194303 mtu 2048 - pkey 32770 p_idx 0 sl 0
>dodly4:625b:  query_hca: msg 1073741824 rdma 1073741824 iov 32 
>lmr 524272 rmr 0 ack_time 16 mr 4294967295
>dodly0:2c17:  Warning: new pkey(32770), query (Success) err or 
>key !found, using defaults
>dodly0:2c17:  query_hca: port.link_layer = 0x1
>dodly0:2c17:  query_hca: (a0.0) eps 64512, sz 16384 evds 
>65408, sz 131071 mtu 2048 - pkey 32770 p_idx 0 sl 0
>dodly0:2c17:  query_hca: msg 2147483648 rdma 2147483648 iov 27 
>lmr 131056 rmr 0 ack_time 16 mr 4294967295
>dodly0:2c17:  dapl_poll: fd=17 ret=1, evnts=0x1
>[0] MPI startup(): DAPL provider ofa-v2-mthca0-1
>dodly0:2c17:  dapl_poll: fd=17 ret=0, evnts=0x0
>[0] MPI startup(): dapl data transfer mode
>dodly0:2c17:  dapl_poll: fd=14 ret=0, evnts=0x0
>dodly4:625b:  Warning: new pkey(32770), query (Success) err or 
>key !found, using defaults
>dodly4:625b:  query_hca: port.link_layer = 0x1
>dodly4:625b:  query_hca: (a0.0) eps 262076, sz 16351 evds 
>65408, sz 4194303 mtu 2048 - pkey 32770 p_idx 0 sl 0
>dodly4:625b:  query_hca: msg 1073741824 rdma 1073741824 iov 32 
>lmr 524272 rmr 0 ack_time 16 mr 4294967295
>
>[r...@dodly0 compat-dapl-1.2.18]# cat 
>/sys/class/infiniband/mthca0/ports/1/pkeys/1
>0x8002
>
>-----Original Message-----
>From: Davis, Arlin R [mailto:arlin.r.da...@intel.com] 
>Sent: Thursday, July 15, 2010 18:56
>To: Itay Berman; Or Gerlitz
>Cc: linux-rdma
>Subject: RE: some dapl assistance
>
>
>>OK, we got Intel MPI to run. To test the pkey usage we 
>>configured it to run over pkey that is not configured on the 
>>node. In this case the MPI should have failed, but it didn't.
>>The dapl debug reports the given pkey (0x8001 = 32769).
>>How can that be?
>>
>
>Itay,
>
>If the pkey override is not valid it uses default idx of 0 and 
>ignores pkey value given. 
>
>Notice the Warning message:
>
>odly0:3b37:  Warning: new pkey(32769), query (Success) err or 
>key !found, using defaults 
>odly0:3b37:  query_hca: (a0.0) eps 64512, sz 16384 evds 65408, 
>sz 131071 mtu 2048 - pkey 32769 p_idx 0 sl 0
>
>When you override with a correct value of 8002 does it move to 
>p_idx=1 and work?
>
>-arlin
>
>
>


RE: some dapl assistance

2010-07-15 Thread Davis, Arlin R
 
>No. Same warning:
>
>dodly4:625b:  Warning: new pkey(32770), query (Success) err or 
>key !found, using defaults
>dodly4:625b:  query_hca: port.link_layer = 0x1
>dodly4:625b:  query_hca: (a0.0) eps 262076, sz 16351 evds 
>65408, sz 4194303 mtu 2048 - pkey 32770 p_idx 0 sl 0
>
>[r...@dodly0 compat-dapl-1.2.18]# cat 
>/sys/class/infiniband/mthca0/ports/1/pkeys/1
>0x8002

Sorry, I only have mlx4 adapters. 
Can you check ibv_devinfo -v and look for max_pkeys?

Thanks,

-arlin





Re: When IBoE will be merged to upstream?

2010-07-15 Thread Roland Dreier
 > No, I think all switches will flood unknown multicast packets.  But
 > there is a reason that IGMP snooping was invented -- it is inefficient
 > (to say the least) to flood all multicast traffic.

And by the way I view the fact that the IBoE spec does not say anything
at all about how to map MGIDs to Ethernet addresses as another serious
shortcoming of the spec.

 - R.
-- 
Roland Dreier  || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


Re: When IBoE will be merged to upstream?

2010-07-15 Thread Roland Dreier
 > > I agree -- the current spec is rather broken for multicast.  
 > > Choosing a different ethertype and then saying that all 
 > > switches will just flood multicast traffic is half-baked at best.

 > It is a realistic approach. Do you claim that there are switches that will 
 > not forward the packets?

No, I think all switches will flood unknown multicast packets.  But
there is a reason that IGMP snooping was invented -- it is inefficient
(to say the least) to flood all multicast traffic.

 - R.
-- 
Roland Dreier  || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html


RE: some dapl assistance

2010-07-15 Thread Itay Berman
No. Same warning:

[r...@dodly0 compat-dapl-1.2.18]# mpiexec -ppn 1 -n 2 -env I_MPI_FABRICS 
dapl:dapl -env I_MPI_DEBUG 2 -env I_MPI_CHECK_DAPL_PROVIDER_MISMATCH none -env 
DAPL_DBG_TYPE 0x -env DAPL_IB_PKEY 0x8002 /tmp/osu
dodly4:625b: dapl_init: dbg_type=0x,dbg_dest=0x1
dodly0:2c17: dapl_init: dbg_type=0x,dbg_dest=0x1
dodly0:2c17:  open_hca: device mlx4_0 not found
dodly0:2c17:  open_hca: device mlx4_0 not found
dodly4:625b:  Warning: new pkey(32770), query (Success) err or key !found, 
using defaults
dodly4:625b:  query_hca: port.link_layer = 0x1
dodly4:625b:  query_hca: (a0.0) eps 262076, sz 16351 evds 65408, sz 4194303 mtu 
2048 - pkey 32770 p_idx 0 sl 0
dodly4:625b:  query_hca: msg 1073741824 rdma 1073741824 iov 32 lmr 524272 rmr 0 
ack_time 16 mr 4294967295
dodly0:2c17:  Warning: new pkey(32770), query (Success) err or key !found, 
using defaults
dodly0:2c17:  query_hca: port.link_layer = 0x1
dodly0:2c17:  query_hca: (a0.0) eps 64512, sz 16384 evds 65408, sz 131071 mtu 
2048 - pkey 32770 p_idx 0 sl 0
dodly0:2c17:  query_hca: msg 2147483648 rdma 2147483648 iov 27 lmr 131056 rmr 0 
ack_time 16 mr 4294967295
dodly0:2c17:  Warning: new pkey(32770), query (Success) err or key !found, 
using defaults
dodly0:2c17:  query_hca: port.link_layer = 0x1
dodly0:2c17:  query_hca: (a0.0) eps 64512, sz 16384 evds 65408, sz 131071 mtu 
2048 - pkey 32770 p_idx 0 sl 0
dodly0:2c17:  query_hca: msg 2147483648 rdma 2147483648 iov 27 lmr 131056 rmr 0 
ack_time 16 mr 4294967295
dodly4:625b:  Warning: new pkey(32770), query (Success) err or key !found, 
using defaults
dodly4:625b:  query_hca: port.link_layer = 0x1
dodly4:625b:  query_hca: (a0.0) eps 262076, sz 16351 evds 65408, sz 4194303 mtu 
2048 - pkey 32770 p_idx 0 sl 0
dodly4:625b:  query_hca: msg 1073741824 rdma 1073741824 iov 32 lmr 524272 rmr 0 
ack_time 16 mr 4294967295
dodly0:2c17:  Warning: new pkey(32770), query (Success) err or key !found, 
using defaults
dodly0:2c17:  query_hca: port.link_layer = 0x1
dodly0:2c17:  query_hca: (a0.0) eps 64512, sz 16384 evds 65408, sz 131071 mtu 
2048 - pkey 32770 p_idx 0 sl 0
dodly0:2c17:  query_hca: msg 2147483648 rdma 2147483648 iov 27 lmr 131056 rmr 0 
ack_time 16 mr 4294967295
dodly0:2c17:  dapl_poll: fd=17 ret=1, evnts=0x1
[0] MPI startup(): DAPL provider ofa-v2-mthca0-1
dodly0:2c17:  dapl_poll: fd=17 ret=0, evnts=0x0
[0] MPI startup(): dapl data transfer mode
dodly0:2c17:  dapl_poll: fd=14 ret=0, evnts=0x0
dodly4:625b:  Warning: new pkey(32770), query (Success) err or key !found, 
using defaults
dodly4:625b:  query_hca: port.link_layer = 0x1
dodly4:625b:  query_hca: (a0.0) eps 262076, sz 16351 evds 65408, sz 4194303 mtu 
2048 - pkey 32770 p_idx 0 sl 0
dodly4:625b:  query_hca: msg 1073741824 rdma 1073741824 iov 32 lmr 524272 rmr 0 
ack_time 16 mr 4294967295

[r...@dodly0 compat-dapl-1.2.18]# cat 
/sys/class/infiniband/mthca0/ports/1/pkeys/1
0x8002

-----Original Message-----
From: Davis, Arlin R [mailto:arlin.r.da...@intel.com] 
Sent: Thursday, July 15, 2010 18:56
To: Itay Berman; Or Gerlitz
Cc: linux-rdma
Subject: RE: some dapl assistance


>OK, we got Intel MPI to run. To test the pkey usage we 
>configured it to run over pkey that is not configured on the 
>node. In this case the MPI should have failed, but it didn't.
>The dapl debug reports the given pkey (0x8001 = 32769).
>How can that be?
>

Itay,

If the pkey override is not valid it uses default idx of 0 and ignores pkey 
value given. 

Notice the Warning message:

odly0:3b37:  Warning: new pkey(32769), query (Success) err or key !found, using 
defaults 
odly0:3b37:  query_hca: (a0.0) eps 64512, sz 16384 evds 65408, sz 131071 mtu 
2048 - pkey 32769 p_idx 0 sl 0

When you override with a correct value of 8002 does it move to p_idx=1 and work?

-arlin





Re: When IBoE will be merged to upstream?

2010-07-15 Thread Jason Gunthorpe
On Thu, Jul 15, 2010 at 01:55:32PM +0300, Liran Liss wrote:

> > I objected to the draft spec leaving this area 
> > absent, even.
> 
> You should submit a comment on this matter using the IBTA comment
> tracker database if you intend your concern to be taken into account.

The position of IBTA is that the L2 layer is not specified as part of
the spec, so of course there is no talk of how to get/create L2
information. The spec is *silent* on the issue of L2 addressing, so,
IMHO, it is completely wrong to assume it specs one approach over
another, just because it omits L2 addressing related
discussion/fields/etc.

It, unfortunately, becomes implementation defined - and if that means
an implementation chooses to extend the AH, then so be it.

This is the problem with rushing incomplete specs through :)

> > It wouldn't be adding another L3 identifier it would be an L2 
> > next hop MAC address for the router. It would be nice to do 
> > this from the start but if growing the AH is really that 
> > scary then it should wait until someone figures out how to 
> > solve the lossless routing problem on ethernet.

> Augmenting the AH has a significant cost. There is a tradeoff here
> between preserving the verbs api vs. dealing with the implementation
> challenges associated with doing address resolution below the
> verbs. The RoCE spec deliberately chooses one direction. You seem to
> favor the other one. But in the interest of progress and since we
> all seem to agree on the way things work when we use link local
> GIDs, let us move forward with that approach for now. And we can get
> back to non local GIDs later.

You still have to solve the problem with vlan tags, and either each
vlan interface has a separate rdma interface or the tag has to flow
into the AH from the RDMA-CM.

Jason


RE: some dapl assistance

2010-07-15 Thread Davis, Arlin R

>OK, we got Intel MPI to run. To test the pkey usage we 
>configured it to run over pkey that is not configured on the 
>node. In this case the MPI should have failed, but it didn't.
>The dapl debug reports the given pkey (0x8001 = 32769).
>How can that be?
>

Itay,

If the pkey override is not valid it uses default idx of 0 and ignores pkey 
value given. 

Notice the Warning message:

odly0:3b37:  Warning: new pkey(32769), query (Success) err or key !found, using 
defaults 
odly0:3b37:  query_hca: (a0.0) eps 64512, sz 16384 evds 65408, sz 131071 mtu 
2048 - pkey 32769 p_idx 0 sl 0

When you override with a correct value of 8002 does it move to p_idx=1 and work?
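
The fallback described above can be pictured with a small sketch. This is not
the uDAPL provider code: the function name and the table contents are
hypothetical, and it simply compares the full 16-bit value against each table
entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Look up pkey in a port's pkey table.
 * Returns the matching index and sets *found to 1, or returns the
 * default index 0 and sets *found to 0 when the key is not present.
 */
static size_t pkey_index_or_default(const uint16_t *table, size_t len,
                                    uint16_t pkey, int *found)
{
    assert(table != NULL && found != NULL);

    for (size_t i = 0; i < len; i++) {
        if (table[i] == pkey) {
            *found = 1;
            return i;
        }
    }

    *found = 0;     /* caller would print the "using defaults" warning */
    return 0;       /* fall back to the default index */
}
```

With a table of { 0xffff, 0x8002 }, a request for 0x8001 (32769) falls back to
index 0 and the job still runs on the default partition, while a request for
0x8002 would be expected to resolve to index 1.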

-arlin





[PATCH resend] RDMA/nes: corrected link type for nes card

2010-07-15 Thread miroslaw . walukiewicz
Now correct interface link type is set for ibv_query_port()

Signed-off-by: Mirek Walukiewicz 
---

 drivers/infiniband/hw/nes/nes_verbs.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)


diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index f179586..45bf56c 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -599,7 +599,7 @@ static int nes_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr
 	props->active_width = IB_WIDTH_4X;
 	props->active_speed = 1;
 	props->max_msg_sz = 0x8000;
-
+	props->link_layer = IB_LINK_LAYER_ETHERNET;
 	return 0;
 }
 




[PATCH resend] RDMA/nes: corrected firmware version update

2010-07-15 Thread miroslaw . walukiewicz
Now firmware version is read from correct place

Signed-off-by: Mirek Walukiewicz 
---

 drivers/infiniband/hw/nes/nes_verbs.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)


diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 0abd4f2..f179586 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -520,7 +520,7 @@ static int nes_query_device(struct ib_device *ibdev, struct ib_device_attr *prop
 	memset(props, 0, sizeof(*props));
 	memcpy(&props->sys_image_guid, nesvnic->netdev->dev_addr, 6);
 
-	props->fw_ver = nesdev->nesadapter->fw_ver;
+	props->fw_ver = nesdev->nesadapter->firmware_version;
 	props->device_cap_flags = nesdev->nesadapter->device_cap_flags;
 	props->vendor_id = nesdev->nesadapter->vendor_id;
 	props->vendor_part_id = nesdev->nesadapter->vendor_part_id;




RE: some dapl assistance

2010-07-15 Thread Itay Berman
Hello Arlin,

I am Or's colleague, whom he is assisting with this matter.

OK, we got Intel MPI to run. To test the pkey usage we configured it to run 
over pkey that is not configured on the node. In this case the MPI should have 
failed, but it didn't.
The dapl debug reports the given pkey (0x8001 = 32769).
How can that be?

See the attached output from the different MPI runs. I believe the devices are the correct ones 
(ofa-v2*). 

Itay
   

-----Original Message-----
From: Davis, Arlin R [mailto:arlin.r.da...@intel.com] 
Sent: Tuesday, July 13, 2010 19:19
To: Or Gerlitz
Cc: Itay Berman; linux-rdma
Subject: RE: some dapl assistance

Sorry, Intel MPI requires development packages which include libdat.so and 
libdat2.so  

Please see the install instructions on 
http://www.openfabrics.org/downloads/dapl/

---

For 1.2 and 2.0 support on the same system, including development, install RPM 
packages as follows: 

dapl-2.0.29-1 
dapl-utils-2.0.29-1 
dapl-devel-2.0.29-1  
dapl-debuginfo-2.0.29-1 
compat-dapl-1.2.18-1 
compat-dapl-devel-1.2.18-1  

---

Thanks for the heads up on dat.conf manpage. I will fix the conflict in next 
release.

-arlin

>-----Original Message-----
>From: Or Gerlitz [mailto:ogerl...@voltaire.com] 
>Sent: Tuesday, July 13, 2010 4:41 AM
>To: Davis, Arlin R
>Cc: Itay Berman; linux-rdma
>Subject: Re: some dapl assistance
>
>Davis, Arlin R wrote:
>> There is limited debug in the non-debug builds. If you want 
>full debugging capabilities
>> you can install the source RPM and configure and make as 
>follows [..] (OFED target example):
>
>okay, got that, once I built the sources by hand as you 
>suggested I could see debug prints
>but things didn't really work, so I stepped back and installed 
>the latest rpms - dapl-2.0.29-1
>and compat-dapl-1.2.18-1, now I couldn't get intel-mpi to run:
>
>> [r...@dodly0 ~]# rpm -qav | grep dapl
>> dapl-utils-2.0.29-1
>> dapl-2.0.29-1
>> compat-dapl-1.2.18-1
>
>> [r...@dodly0 ~]# ldconfig -p | grep libdat
>> libdat2.so.2 (libc6,x86-64) => /usr/lib64/libdat2.so.2
>> libdat.so.1 (libc6,x86-64) => /usr/lib64/libdat.so.1
>
>> [r...@dodly0 ~]# rpm -qf /usr/lib64/libdat.so.1
>> compat-dapl-1.2.18-1
>> [r...@dodly0 ~]# rpm -qf /usr/lib64/libdat2.so.2
>> dapl-2.0.29-1
>
>> [r...@dodly0 ~]# 
>/opt/intel/impi/4.0.0.027/intel64/bin/mpiexec -ppn 1 -n 2  
>-env DAPL_IB_PKEY 0x8002 -env DAPL_DBG_TYPE 0xff -env 
>DAPL_DBG_DEST 0x3  -env I_MPI_DEBUG 3 -env 
>I_MPI_CHECK_DAPL_PROVIDER_MISMATCH none -env I_MPI_FABRICS 
>dapl:dapl /tmp/osu
>> [0] MPI startup(): cannot open dynamic library libdat.so
>> [1] MPI startup(): cannot open dynamic library libdat.so
>> [0] MPI startup(): cannot open dynamic library libdat2.so
>> [0] dapl fabric is not available and fallback fabric is not enabled
>> [1] MPI startup(): cannot open dynamic library libdat2.so
>> [1] dapl fabric is not available and fallback fabric is not enabled
>> rank 1 in job 5  dodly0_54941   caused collective abort of all ranks
>>   exit status of rank 1: return code 254
>> rank 0 in job 5  dodly0_54941   caused collective abort of all ranks
>>   exit status of rank 0: return code 254
>
>Any idea what we're doing wrong?
>
>BTW - before things stopped working, exporting LD_DEBUG=libs 
>to the MPI rank, 
>I noticed that it used the compat-1.2 rpm ...
>
>Now, I can run dapltest fine,
>> [r...@dodly0 ~]# dapltest -T S -D ofa-v2-mthca0-1
>> Dapltest: Service Point Ready - ofa-v2-mthca0-1
>> Dapltest: Service Point Ready - ofa-v2-mthca0-1
>> Server: Transaction Test Finished for this client
>
>> [r...@dodly4 ~]# dapltest -T T -D ofa-v2-mlx4_0-1 -s dodly0 
>-i 1000 server SR 65536 4 client SR 65536 4
>> Server Name: dodly0
>> Server Net Address: 172.30.3.230
>> DT_cs_Client: Starting Test ...
>> - Stats  : 1 threads, 1 EPs
>> Total WQE:2919.70 WQE/Sec
>> Total Time   :   0.68 sec
>> Total Send   : 262.14 MB - 382.69 MB/Sec
>> Total Recv   : 262.14 MB - 382.69 MB/Sec
>> Total RDMA Read  :   0.00 MB -   0.00 MB/Sec
>> Total RDMA Write :   0.00 MB -   0.00 MB/Sec
>> DT_cs_Client: == End of Work -- Client Exiting
>
>I also noted that the dapl-utils and the compat-dapl-utils are 
>mutually exclusive as both 
>attempt to install the same man page for dat.conf
>> # rpm -Uvh 
>/usr/src/redhat/RPMS/x86_64/compat-dapl-utils-1.2.18-1.x86_64.rpm
>> Preparing...
>### [100%]
>> file /usr/share/man/man5/dat.conf.5.gz from install 
>of compat-dapl-utils-1.2.18-1.x86_64 conflicts with file from 
>package dapl-utils-2.0.29-1.x86_64
>
>Or.
>

[r...@dodly0 compat-dapl-1.2.18]# mpiexec -ppn 1 -n 2 -env I_MPI_FABRICS 
dapl:dapl -env I_MPI_DEBUG 2 -env I_MPI_CHECK_DAPL_PROVIDER_MISMATCH none -env 
DAPL_DBG_TYPE 0x /tmp/osu
dodly0:47ba: dapl_init: dbg_type=0x,dbg_dest=0x1
dodly4:e32: dapl_init: dbg_type=0x,dbg_dest=0x1
dodly0:47ba:  open_hca: device mlx4_0 not found
dodly0:47ba:  open_hca: device mlx4_0 not

Re: IB/ipoib: fix dangling pointer reference to ipoib_neigh and ipoib_path -when will it go upstream?

2010-07-15 Thread Pradeep Satyanarayana
Pradeep Satyanarayana wrote:
> Pradeep Satyanarayana wrote:
>> Roland Dreier wrote:
>>>  > I guess I came to a premature conclusion. One set of tests ran fine and 
>>> I made that
>>>  > conclusion. Another set of tests caused the following crash:
>>>
>>> I don't really know how to interpret this.  Is this crash new, or is it
>>> the same crash you were hoping this patch fixed?
>> This is a new crash.
> 
> I see other manifestations resulting in different crashes :
> 
> :mon> t
> [c0074603ba20] d000193527ac .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> [c0074603bb10] d00019356dac .ipoib_mcast_free+0x74/0x2a0 [ib_ipoib]
> [c0074603bbe0] d00019358558 .ipoib_mcast_restart_task+0x3d0/0x560 
> [ib_ipoib]
> [c0074603bd40] c00c6fe4 .run_workqueue+0xf4/0x1e0
> [c0074603be00] c00c7190 .worker_thread+0xc0/0x180
> [c0074603bed0] c00ccf4c .kthread+0xb4/0xc0
> [c0074603bf90] c00309fc .kernel_thread+0x54/0x70
> 9:mon> e
> cpu 0x9: Vector: 300 (Data Access) at [c0074603b720]
> pc: c05ac390: ._spin_lock+0x20/0xc8
> lr: d000193527ac: .ipoib_neigh_flush+0x6c/0x350 [ib_ipoib]
> sp: c0074603b9a0
>msr: 80009032
>dar: 3a0
>  dsisr: 4000
>   current = 0xc00756ce8b00
>   paca= 0xc0f63800
> pid   = 18095, comm = ipoib
> 9:mon>

Recreating the crash has been tricky. I tried several hundred times today
to unload and reload IPoIB while there was traffic, and no crashes happened. I took
a closer look at the IPoIB CM code and I see a few things that look suspicious.

In the ipoib_cm_send() path priv->lock is not held, whereas priv->lock is
held before calling ipoib_cm_destroy_tx(). This is true with and without Ralph's
patch (fix dangling pointer). Is this a potential race?

In Roland's git tree I do see a test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags) in
ipoib_cm_destroy_tx(), which seems to be missing in Ralph's patch. In Ralph's patch there is a
clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags) called before calling ipoib_cm_destroy_tx(), but only in
select cases. Was that intended?

Thanks
Pradeep

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: When IBoE will be merged to upstream?

2010-07-15 Thread Liran Liss
>  > A quibble about multicast - AFAIK this is unsolved. I think some spec
>  > needs to be agreed that documents what sort of multicast snooping
>  > operations switches need to do, ie if IGMP joins imply that IBoE
>  > traffic for the same DMAC is included in the join, or if IBoE requires
>  > a separate IGMP type process on its own ether-type. That would make it
>  > much clearer what to do with MGIDs.
>

It would be quite naïve to require *new* snooping functionality in Ethernet
switches. Some switches will gracefully apply the filtering information acquired
through IGMP snooping to non-IP traffic. Others will just flood non-IP MC
frames within the corresponding VLAN, which is benign (e.g. that is the way FIP
works). A cleaner solution would be based on MMRP, but that, AFAIK, is not very
widely deployed, so it is less practical at this stage.
 
> I agree -- the current spec is rather broken for multicast.  
> Choosing a different ethertype and then saying that all 
> switches will just flood multicast traffic is half-baked at best.
>

It is a realistic approach. Do you claim that there are switches that will not 
forward the packets?
 
>  > It would be nice to at least have a plan on how to integrate a
>  > non-link local address, if that is ever necessary in future. An
>  > extended AH with an additional 48 DMAC field seems reasonable to me?
> 
> You mean have a next-hop destination + a final destination?  
> Could be done I guess.  But I'm not sure how having a routing 
> table where you have to look up 48-bit Ethernet addresses is 
> all that different from just having a standard Ethernet 
> forwarding table.

I guess Jason is suggesting that we regard the GID as a true L3 address and use
a newly added L2 field for the next-hop L2 address.

> 
> I suppose something based on MAC-in-MAC (a la 802.1ah) could 
> be done but to be honest the IBoE spec that the IBTA came up 
> with looks rather broken for routing.

Routing is out of the scope of the current RoCE spec.
And I do not see how .1ah would be relevant for this purpose.


RE: When IBoE will be merged to upstream?

2010-07-15 Thread Liran Liss
>  > But, we can't mandate an overload of the GID in a way that it
>  > prevents its use as a true L3 address (eventually routable).
> 
> Actually I'm beginning to think that the only possible way we 
> can use the GID in IBoE is as a link-local IPv6 addresses 
> containing an Ethernet address.  Trying to hide neighbour 
> discovery or ARP below the verbs doesn't seem workable -- 
> being forced to change the locking rules we've had for the 
> past 5+ years about create_ah is just the beginning.  We get 
> further problems if a remote address should ever change and 
> I'm probably missing other issues.
> 

We believe the problems are workable. But let us stop arguing for a while and
make progress with link-local addresses, since we all seem to agree on that.
We can get back to non-local GIDs later.

> So the best solution I can see is to declare that an IBoE GID 
> must be an
> IPv6 address coming from an EUI-64 Ethernet address for the 
> corresponding port; for MGIDs I guess we use the standard 
> IPv6 mapping to Ethernet address 33:33:xx:xx:xx:xx.
> 
> I'm not sure how we want to handle IPv4 -- presumably unicast 
> ARP can be done within the RDMA CM, which will then create a 
> DGID with the appropriate Ethernet address.  However it's not 
> clear to me whether we need a way to create IPv4 
> (01:00:5e:xx:xx:xx) multicast addresses.
> 
> Also, since there is no way to map a link-local IPv6 address 
> to a particular interface, then I guess we need a way to pass 
> in the VLAN tag to be used -- presumably we can steal some 
> other field for the 12 bits.
> (The fact that the IBoE annex does not mention VLANs or 
> 802.1q a single time is just another thing that shows how 
> rushed and incomplete it is)
> 
> With all this said, I think it means we do not need to do the 
> mapping from GID to Ethernet address in the kernel for IBoE 
> user verbs, since it is so simple -- we can simply add a 
> fairly trivial helper to libibverbs.
> 
>  - R.
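
For reference, the mappings described in the quoted proposal can be sketched roughly as follows. This is Python rather than a libibverbs helper, and all function names are made up for illustration. It assumes the standard modified EUI-64 construction for link-local IPv6 addresses, the standard IPv6 multicast mapping (33:33 followed by the low 32 bits of the group address), and the standard IPv4 multicast mapping (01:00:5e followed by the low 23 bits of the group address). Whether IBoE needs the IPv4 case at all is left open in the message above.

```python
import ipaddress

LINK_LOCAL_PREFIX = b"\xfe\x80" + bytes(6)  # fe80::/64


def mac_to_link_local_gid(mac: str) -> ipaddress.IPv6Address:
    """Build fe80::/64 plus a modified EUI-64 interface ID from a 48-bit MAC."""
    b = bytes.fromhex(mac.replace(":", ""))
    if len(b) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Flip the universal/local bit and insert ff:fe in the middle.
    iid = bytes([b[0] ^ 0x02, b[1], b[2], 0xFF, 0xFE, b[3], b[4], b[5]])
    return ipaddress.IPv6Address(LINK_LOCAL_PREFIX + iid)


def link_local_gid_to_mac(gid: ipaddress.IPv6Address) -> str:
    """Recover the MAC from a link-local, EUI-64 derived address."""
    p = gid.packed
    if p[:8] != LINK_LOCAL_PREFIX or p[11:13] != b"\xff\xfe":
        raise ValueError("not a link-local EUI-64 derived address")
    mac = bytes([p[8] ^ 0x02, p[9], p[10], p[13], p[14], p[15]])
    return ":".join(f"{x:02x}" for x in mac)


def mgid_to_mac(mgid: ipaddress.IPv6Address) -> str:
    """Map an IPv6 multicast address to 33:33:xx:xx:xx:xx."""
    p = mgid.packed
    if p[0] != 0xFF:
        raise ValueError("not an IPv6 multicast address")
    return "33:33:" + ":".join(f"{x:02x}" for x in p[12:])


def ipv4_mcast_to_mac(group: ipaddress.IPv4Address) -> str:
    """Map an IPv4 multicast group to 01:00:5e plus its low 23 bits."""
    if not group.is_multicast:
        raise ValueError("not an IPv4 multicast address")
    low23 = int(group) & 0x7FFFFF
    return "01:00:5e:%02x:%02x:%02x" % (
        (low23 >> 16) & 0x7F, (low23 >> 8) & 0xFF, low23 & 0xFF)


gid = mac_to_link_local_gid("00:02:c9:01:02:03")
print(gid)                         # fe80::202:c9ff:fe01:203
print(link_local_gid_to_mac(gid))  # 00:02:c9:01:02:03
print(mgid_to_mac(ipaddress.IPv6Address("ff02::1:ff01:203")))    # 33:33:ff:01:02:03
print(ipv4_mcast_to_mac(ipaddress.IPv4Address("239.255.0.1")))   # 01:00:5e:7f:00:01
```

Note that the sketch says nothing about where a VLAN tag would come from; the mapping only yields the destination MAC.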


RE: When IBoE will be merged to upstream?

2010-07-15 Thread Liran Liss
> > > The text is saying that the specification does not use any of the 
> > > LID fields in the verbs interface, that is it. It isn't talking 
> > > about MAC addresses.
> > > 
> > > Exactly how and where the MAC address comes about was never decided,
> > > and at least some participants thought it should be a 1:1
> > > algorithmic mapping from the GID.
> > > 
> > > Ditto for VLANs, how and where the vlan tag comes about is not part
> > > of the spec.
> 
> > You are trying to rewrite history.
> > Read the spec, address handles fields are fixed.
> 
> Not really, this was all discussed on this list before the 
> IBxoE working group was formed,

The paragraph above is about the RoCE spec. And *this list* did not write the 
RoCE spec.

> it was discussed in the 
> working group,

The RoCE spec adopts the verbs defined in the base IB spec and does not add any 
new input modifiers to the AH verb. You may not agree with it but that does not 
change the spec.

> I objected to the draft spec leaving this area 
> absent, even.

You should submit a comment on this matter using the IBTA comment tracker 
database if you intend your concern to be taken into account.

> The spec doesn't say squat about how MAC and 
> VLAN values get into the AH,

True. The spec does not say it because there are no MAC and VLAN input 
modifiers to the "create AH" verb. The spec assumes the resolution from the L3 
address happens below the channel interface.

> and you have already heard how 
> my opinion on this subject differs from others.

I never attempted to misrepresent your opinion. I am just pointing out what the 
RoCE spec says.

> 
> > > But, even if we do get there some day then we could extend the AH.
> > 
> > This is unacceptable - we are not going to add another L3 identifier.
> 
> It wouldn't be adding another L3 identifier, it would be an L2 
> next hop MAC address for the router. It would be nice to do 
> this from the start but if growing the AH is really that 
> scary then it should wait until someone figures out how to 
> solve the lossless routing problem on ethernet.

Augmenting the AH has a significant cost. There is a tradeoff here between 
preserving the verbs api vs. dealing with the implementation challenges 
associated with doing address resolution below the verbs. The RoCE spec 
deliberately chooses one direction. You seem to favor the other one. But in the 
interest of progress and since we all seem to agree on the way things work when 
we use link local GIDs, let us move forward with that approach for now. And we 
can get back to non local GIDs later.