Works!

[r...@dodly0 OMB-3.1.1]# mpiexec -ppn 1 -n 2 -env I_MPI_FABRICS dapl:dapl -env I_MPI_DEBUG 5 -env I_MPI_CHECK_DAPL_PROVIDER_MISMATCH none -env DAPL_DBG_TYPE 0xffff -env DAPL_IB_PKEY 0x0280 -env DAPL_IB_SL 4 /tmp/osu_long
dodly0:5bc3: dapl_init: dbg_type=0xffff,dbg_dest=0x1
dodly0:5bc3:  open_hca: device mlx4_0 not found
dodly0:5bc3:  open_hca: device mlx4_0 not found
dodly0:5bc3:  query_hca: port.link_layer = 0x1
dodly0:5bc3:  query_hca: (a0.0) eps 64512, sz 16384 evds 65408, sz 131071 mtu 2048 - pkey 640 p_idx 1 sl 4
dodly0:5bc3:  query_hca: msg 2147483648 rdma 2147483648 iov 27 lmr 131056 rmr 0 ack_time 16 mr 4294967295
dodly0:5bc3:  query_hca: port.link_layer = 0x1
dodly0:5bc3:  query_hca: (a0.0) eps 64512, sz 16384 evds 65408, sz 131071 mtu 2048 - pkey 640 p_idx 1 sl 4
dodly0:5bc3:  query_hca: msg 2147483648 rdma 2147483648 iov 27 lmr 131056 rmr 0 ack_time 16 mr 4294967295
dodly0:5bc3:  query_hca: port.link_layer = 0x1
dodly0:5bc3:  query_hca: (a0.0) eps 64512, sz 16384 evds 65408, sz 131071 mtu 2048 - pkey 640 p_idx 1 sl 4
dodly0:5bc3:  query_hca: msg 2147483648 rdma 2147483648 iov 27 lmr 131056 rmr 0 ack_time 16 mr 4294967295
dodly0:5bc3:  dapl_poll: fd=17 ret=1, evnts=0x1
dodly0:5bc3:  dapl_poll: fd=17 ret=0, evnts=0x0
dodly0:5bc3:  dapl_poll: fd=14 ret=0, evnts=0x0
dodly4:1e8d: dapl_init: dbg_type=0xffff,dbg_dest=0x1
[0] MPI startup(): DAPL provider ofa-v2-mthca0-1
[0] MPI startup(): dapl data transfer mode
dodly4:1e8d:  query_hca: port.link_layer = 0x1
dodly4:1e8d:  query_hca: (a0.0) eps 262076, sz 16351 evds 65408, sz 4194303 mtu 2048 - pkey 640 p_idx 1 sl 4
dodly4:1e8d:  query_hca: msg 1073741824 rdma 1073741824 iov 32 lmr 524272 rmr 0 ack_time 16 mr 4294967295
dodly4:1e8d:  query_hca: port.link_layer = 0x1
dodly4:1e8d:  query_hca: (a0.0) eps 262076, sz 16351 evds 65408, sz 4194303 mtu 2048 - pkey 640 p_idx 1 sl 4
dodly4:1e8d:  query_hca: msg 1073741824 rdma 1073741824 iov 32 lmr 524272 rmr 0 ack_time 16 mr 4294967295
dodly4:1e8d:  query_hca: port.link_layer = 0x1
dodly4:1e8d:  query_hca: (a0.0) eps 262076, sz 16351 evds 65408, sz 4194303 mtu 2048 - pkey 640 p_idx 1 sl 4
dodly4:1e8d:  query_hca: msg 1073741824 rdma 1073741824 iov 32 lmr 524272 rmr 0 ack_time 16 mr 4294967295
dodly4:1e8d:  dapl_poll: fd=15 ret=1, evnts=0x1
dodly4:1e8d:  dapl_poll: fd=15 ret=0, evnts=0x0
dodly4:1e8d:  dapl_poll: fd=13 ret=0, evnts=0x0
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[1] MPI startup(): dapl data transfer mode
[0] MPI startup(): static connections storm algo
dodly0:5bc3:  dapl_poll: fd=17 ret=1, evnts=0x1
dodly0:5bc3:  dapl_poll: fd=17 ret=0, evnts=0x0
dodly0:5bc3:  dapl_poll: fd=14 ret=0, evnts=0x0
dodly0:5bc3:  dapl_poll: fd=19 ret=0, evnts=0x0
dodly0:5bc3:  dapl_poll: fd=17 ret=0, evnts=0x0
dodly0:5bc3:  dapl_poll: fd=14 ret=0, evnts=0x0
dodly0:5bc3:  dapl_poll: fd=19 ret=1, evnts=0x4
dodly0:5bc3:  dapl_poll: fd=17 ret=0, evnts=0x0
dodly0:5bc3:  dapl_poll: fd=14 ret=0, evnts=0x0
dodly0:5bc3:  dapl_poll: fd=19 ret=0, evnts=0x0
dodly4:1e8d:  dapl_poll: fd=15 ret=0, evnts=0x0
dodly4:1e8d:  dapl_poll: fd=13 ret=1, evnts=0x1
dodly4:1e8d:  dapl_poll: fd=13 ret=0, evnts=0x0
dodly4:1e8d:  dapl_poll: fd=15 ret=1, evnts=0x1
dodly4:1e8d:  dapl_poll: fd=15 ret=0, evnts=0x0
dodly4:1e8d:  dapl_poll: fd=13 ret=0, evnts=0x0
dodly4:1e8d:  dapl_poll: fd=17 ret=1, evnts=0x1
dodly0:5bc3:  dapl_poll: fd=17 ret=0, evnts=0x0
dodly0:5bc3:  dapl_poll: fd=14 ret=0, evnts=0x0
dodly0:5bc3:  dapl_poll: fd=19 ret=1, evnts=0x1
[0] MPI startup(): I_MPI_CHECK_DAPL_PROVIDER_MISMATCH=none
[0] MPI startup(): I_MPI_DEBUG=5
dodly4:1e8d:  dapl_poll: fd=15 ret=0, evnts=0x0
dodly4:1e8d:  dapl_poll: fd=13 ret=0, evnts=0x0
dodly4:1e8d:  dapl_poll: fd=17 ret=1, evnts=0x1
[0] MPI startup(): I_MPI_FABRICS=dapl:dapl
[0] MPI startup(): set domain to {0,1,2,3} on node dodly0
[1] MPI startup(): set domain to {0,1,2,3} on node dodly4
[0] Rank    Pid      Node name  Pin cpu
[0] 0       23491    dodly0     {0,1,2,3}
[0] 1       7821     dodly4     {0,1,2,3}
# OSU MPI Bandwidth Test v3.1.1
# Size        Bandwidth (MB/s)
4194304                 978.30
4194304                 978.45
4194304                 978.69
4194304                 978.24
dodly0:5bc3: dapl async_event: DEV ERR 12
dodly4:1e8d: dapl async_event: DEV ERR 12
dodly4:1e8d:  DTO completion ERROR: 12: op 0xff
dodly4:1e8d: DTO completion ERR: status 12, op OP_RDMA_READ, vendor_err 0x81 - 172.30.3.230
[1:dodly4][../../dapl_module_poll.c:3972] Intel MPI fatal error: ofa-v2-mlx4_0-1 DTO operation posted for [0:dodly0] completed with error. status=0x8. cookie=0x40000
Assertion failed in file ../../dapl_module_poll.c at line 3973: 0
internal ABORT - process 1
rank 1 in job 41  dodly0_54941   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9

DAPL reports p_idx 1. The output above is from an OSU test during which I removed the configured pkey; at that point the MPI job died, so it was indeed running over that pkey.
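
For reference, the pkey-to-index resolution that produces the p_idx value above works roughly like the sketch below. Only ibv_query_pkey() is real libibverbs API here; the wrapper name and the fall-back-to-index-0 behavior are illustrative, following the logic visible in Arlin's patch further down:

/* Illustrative sketch, not DAPL source: resolve a consumer pkey
 * (network order) to a pkey index by scanning the port's pkey table. */
#include <infiniband/verbs.h>
#include <stdint.h>

static int find_pkey_index(struct ibv_context *ctx, uint8_t port,
                           int max_pkeys, uint16_t wanted_be)
{
        uint16_t pkey;
        int i;

        for (i = 0; i < max_pkeys; i++) {
                if (ibv_query_pkey(ctx, port, i, &pkey))
                        break;            /* query error */
                if (pkey == wanted_be)
                        return i;         /* e.g. the p_idx 1 reported above */
        }
        return 0;                         /* key not found: default index 0 */
}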

To test the SL I will have to change my configuration a bit.

We would be happy to get a new build of DAPL if possible.

Thanks,

Itay.  

-----Original Message-----
From: Davis, Arlin R [mailto:arlin.r.da...@intel.com] 
Sent: Monday, 19 July 2010 22:04
To: Itay Berman
Cc: linux-rdma; Or Gerlitz
Subject: RE: some dapl assistance - [PATCH] dapl-2.0 improperly handles pkey check/query in host order

 
Itay,

>>>OK, we got Intel MPI to run. To test the pkey usage we 
>>>configured it to run over a pkey that is not configured on the 
>>>node. In this case the MPI should have failed, but it didn't.
>>>The dapl debug reports the given pkey (0x8001 = 32769).
>>>How can that be?
>>
>>If the pkey override is not valid, it uses the default idx of 0 
>>and ignores the pkey value given. 

Sorry, the verbs pkey query returns network order while the
consumer variable is assumed to be in host order. Please try
the following v2.0 patch (or use 0x0280 without the patch):

---

scm, ucm: improperly handles pkey check/query in host order

Convert consumer input to network order before verbs
query pkey check.

Signed-off-by: Arlin Davis <arlin.r.da...@intel.com>

diff --git a/dapl/openib_common/util.c b/dapl/openib_common/util.c
index a69261f..73730ef 100644
--- a/dapl/openib_common/util.c
+++ b/dapl/openib_common/util.c
@@ -326,7 +326,7 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA * hca_ptr,
 
                /* set SL, PKEY values, defaults = 0 */
                hca_ptr->ib_trans.pkey_idx = 0;
-               hca_ptr->ib_trans.pkey = dapl_os_get_env_val("DAPL_IB_PKEY", 0);
+               hca_ptr->ib_trans.pkey = htons(dapl_os_get_env_val("DAPL_IB_PKEY", 0));
                hca_ptr->ib_trans.sl = dapl_os_get_env_val("DAPL_IB_SL", 0);
 
                /* index provided, get pkey; pkey provided, get index */
@@ -345,10 +345,10 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA * hca_ptr,
                                }
                        }
                        if (i == dev_attr.max_pkeys) {
-                               dapl_log(DAPL_DBG_TYPE_WARN,
-                                        " Warning: new pkey(%d), query (%s)"
-                                        " err or key !found, using defaults\n",
-                                        hca_ptr->ib_trans.pkey, strerror(errno));
+                               dapl_log(DAPL_DBG_TYPE_ERR,
+                                        " ERR: new pkey(0x%x), query (%s)"
+                                        " err or key !found, using default pkey_idx=0\n",
+                                        ntohs(hca_ptr->ib_trans.pkey), strerror(errno));
                        }
                }
 skip_ib:
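
A tiny standalone demo of the byte-order point (my own illustration, not part of the patch), assuming the partition key here is 0x8002, whose byte-swapped form is the 0x0280 used in the run above:

/* Illustration only: why a host-order DAPL_IB_PKEY never matches the
 * network-order pkey table on a little-endian host without the patch. */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint16_t consumer = 0x8002;      /* pkey as the user would type it */
        uint16_t wire = htons(consumer); /* 0x0280 on little-endian hosts */

        /* Unpatched code compares the raw host-order value against the
         * network-order table entry, so the lookup fails and pkey_idx
         * silently falls back to index 0. */
        printf("host 0x%04x -> network 0x%04x\n", consumer, wire);
        return 0;
}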

