Works!

[r...@dodly0 OMB-3.1.1]# mpiexec -ppn 1 -n 2 -env I_MPI_FABRICS dapl:dapl -env I_MPI_DEBUG 5 -env I_MPI_CHECK_DAPL_PROVIDER_MISMATCH none -env DAPL_DBG_TYPE 0xffff -env DAPL_IB_PKEY 0x0280 -env DAPL_IB_SL 4 /tmp/osu_long
dodly0:5bc3: dapl_init: dbg_type=0xffff,dbg_dest=0x1
dodly0:5bc3: open_hca: device mlx4_0 not found
dodly0:5bc3: open_hca: device mlx4_0 not found
dodly0:5bc3: query_hca: port.link_layer = 0x1
dodly0:5bc3: query_hca: (a0.0) eps 64512, sz 16384 evds 65408, sz 131071 mtu 2048 - pkey 640 p_idx 1 sl 4
dodly0:5bc3: query_hca: msg 2147483648 rdma 2147483648 iov 27 lmr 131056 rmr 0 ack_time 16 mr 4294967295
dodly0:5bc3: query_hca: port.link_layer = 0x1
dodly0:5bc3: query_hca: (a0.0) eps 64512, sz 16384 evds 65408, sz 131071 mtu 2048 - pkey 640 p_idx 1 sl 4
dodly0:5bc3: query_hca: msg 2147483648 rdma 2147483648 iov 27 lmr 131056 rmr 0 ack_time 16 mr 4294967295
dodly0:5bc3: query_hca: port.link_layer = 0x1
dodly0:5bc3: query_hca: (a0.0) eps 64512, sz 16384 evds 65408, sz 131071 mtu 2048 - pkey 640 p_idx 1 sl 4
dodly0:5bc3: query_hca: msg 2147483648 rdma 2147483648 iov 27 lmr 131056 rmr 0 ack_time 16 mr 4294967295
dodly0:5bc3: dapl_poll: fd=17 ret=1, evnts=0x1
dodly0:5bc3: dapl_poll: fd=17 ret=0, evnts=0x0
dodly0:5bc3: dapl_poll: fd=14 ret=0, evnts=0x0
dodly4:1e8d: dapl_init: dbg_type=0xffff,dbg_dest=0x1
[0] MPI startup(): DAPL provider ofa-v2-mthca0-1
[0] MPI startup(): dapl data transfer mode
dodly4:1e8d: query_hca: port.link_layer = 0x1
dodly4:1e8d: query_hca: (a0.0) eps 262076, sz 16351 evds 65408, sz 4194303 mtu 2048 - pkey 640 p_idx 1 sl 4
dodly4:1e8d: query_hca: msg 1073741824 rdma 1073741824 iov 32 lmr 524272 rmr 0 ack_time 16 mr 4294967295
dodly4:1e8d: query_hca: port.link_layer = 0x1
dodly4:1e8d: query_hca: (a0.0) eps 262076, sz 16351 evds 65408, sz 4194303 mtu 2048 - pkey 640 p_idx 1 sl 4
dodly4:1e8d: query_hca: msg 1073741824 rdma 1073741824 iov 32 lmr 524272 rmr 0 ack_time 16 mr 4294967295
dodly4:1e8d: query_hca: port.link_layer = 0x1
dodly4:1e8d: query_hca: (a0.0) eps 262076, sz 16351 evds 65408, sz 4194303 mtu 2048 - pkey 640 p_idx 1 sl 4
dodly4:1e8d: query_hca: msg 1073741824 rdma 1073741824 iov 32 lmr 524272 rmr 0 ack_time 16 mr 4294967295
dodly4:1e8d: dapl_poll: fd=15 ret=1, evnts=0x1
dodly4:1e8d: dapl_poll: fd=15 ret=0, evnts=0x0
dodly4:1e8d: dapl_poll: fd=13 ret=0, evnts=0x0
[1] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
[1] MPI startup(): dapl data transfer mode
[0] MPI startup(): static connections storm algo
dodly0:5bc3: dapl_poll: fd=17 ret=1, evnts=0x1
dodly0:5bc3: dapl_poll: fd=17 ret=0, evnts=0x0
dodly0:5bc3: dapl_poll: fd=14 ret=0, evnts=0x0
dodly0:5bc3: dapl_poll: fd=19 ret=0, evnts=0x0
dodly0:5bc3: dapl_poll: fd=17 ret=0, evnts=0x0
dodly0:5bc3: dapl_poll: fd=14 ret=0, evnts=0x0
dodly0:5bc3: dapl_poll: fd=19 ret=1, evnts=0x4
dodly0:5bc3: dapl_poll: fd=17 ret=0, evnts=0x0
dodly0:5bc3: dapl_poll: fd=14 ret=0, evnts=0x0
dodly0:5bc3: dapl_poll: fd=19 ret=0, evnts=0x0
dodly4:1e8d: dapl_poll: fd=15 ret=0, evnts=0x0
dodly4:1e8d: dapl_poll: fd=13 ret=1, evnts=0x1
dodly4:1e8d: dapl_poll: fd=13 ret=0, evnts=0x0
dodly4:1e8d: dapl_poll: fd=15 ret=1, evnts=0x1
dodly4:1e8d: dapl_poll: fd=15 ret=0, evnts=0x0
dodly4:1e8d: dapl_poll: fd=13 ret=0, evnts=0x0
dodly4:1e8d: dapl_poll: fd=17 ret=1, evnts=0x1
dodly0:5bc3: dapl_poll: fd=17 ret=0, evnts=0x0
dodly0:5bc3: dapl_poll: fd=14 ret=0, evnts=0x0
dodly0:5bc3: dapl_poll: fd=19 ret=1, evnts=0x1
[0] MPI startup(): I_MPI_CHECK_DAPL_PROVIDER_MISMATCH=none
[0] MPI startup(): I_MPI_DEBUG=5
dodly4:1e8d: dapl_poll: fd=15 ret=0, evnts=0x0
dodly4:1e8d: dapl_poll: fd=13 ret=0, evnts=0x0
dodly4:1e8d: dapl_poll: fd=17 ret=1, evnts=0x1
[0] MPI startup(): I_MPI_FABRICS=dapl:dapl
[0] MPI startup(): set domain to {0,1,2,3} on node dodly0
[1] MPI startup(): set domain to {0,1,2,3} on node dodly4
[0] Rank    Pid     Node name    Pin cpu
[0] 0       23491   dodly0       {0,1,2,3}
[0] 1       7821    dodly4       {0,1,2,3}
# OSU MPI Bandwidth Test v3.1.1
# Size        Bandwidth (MB/s)
4194304       978.30
4194304       978.45
4194304       978.69
4194304       978.24
dodly0:5bc3: dapl async_event: DEV ERR 12
dodly4:1e8d: dapl async_event: DEV ERR 12
dodly4:1e8d: DTO completion ERROR: 12: op 0xff
dodly4:1e8d: DTO completion ERR: status 12, op OP_RDMA_READ, vendor_err 0x81 - 172.30.3.230
[1:dodly4][../../dapl_module_poll.c:3972] Intel MPI fatal error: ofa-v2-mlx4_0-1 DTO operation posted for [0:dodly0] completed with error. status=0x8. cookie=0x40000
Assertion failed in file ../../dapl_module_poll.c at line 3973: 0
internal ABORT - process 1
rank 1 in job 41 dodly0_54941 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
DAPL reports p_idx 1. The output above is from an OSU test run during which I removed the configured pkey; at that moment the MPI job died, so it was indeed running over that pkey. To test the SL I will have to change my configuration a bit. We would be happy to get a new build of DAPL if possible.

Thanks,
Itay.

-----Original Message-----
From: Davis, Arlin R [mailto:arlin.r.da...@intel.com]
Sent: Monday, 19 July 2010 22:04
To: Itay Berman
Cc: linux-rdma; Or Gerlitz
Subject: RE: some dapl assistance - [PATCH] dapl-2.0 improperly handles pkey check/query in host order

Itay,

>>> OK, we got Intel MPI to run. To test the pkey usage we
>>> configured it to run over a pkey that is not configured on the
>>> node. In this case the MPI should have failed, but it didn't.
>>> The dapl debug reports the given pkey (0x8001 = 32769).
>>> How can that be?
>>
>> If the pkey override is not valid it uses the default idx of 0 and
>> ignores the pkey value given.

Sorry, verbs pkey_query is network order and the consumer variable is assumed host order. Please try the following v2.0 patch (or use 0x0280 without the patch):

---

scm, ucm: improperly handles pkey check/query in host order

Convert consumer input to network order before verbs query pkey check.
Signed-off-by: Arlin Davis <arlin.r.da...@intel.com>

diff --git a/dapl/openib_common/util.c b/dapl/openib_common/util.c
index a69261f..73730ef 100644
--- a/dapl/openib_common/util.c
+++ b/dapl/openib_common/util.c
@@ -326,7 +326,7 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA * hca_ptr,
 	/* set SL, PKEY values, defaults = 0 */
 	hca_ptr->ib_trans.pkey_idx = 0;
-	hca_ptr->ib_trans.pkey = dapl_os_get_env_val("DAPL_IB_PKEY", 0);
+	hca_ptr->ib_trans.pkey = htons(dapl_os_get_env_val("DAPL_IB_PKEY", 0));
 	hca_ptr->ib_trans.sl = dapl_os_get_env_val("DAPL_IB_SL", 0);
 
 	/* index provided, get pkey; pkey provided, get index */
@@ -345,10 +345,10 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA * hca_ptr,
 		}
 	}
 	if (i == dev_attr.max_pkeys) {
-		dapl_log(DAPL_DBG_TYPE_WARN,
-			 " Warning: new pkey(%d), query (%s)"
-			 " err or key !found, using defaults\n",
-			 hca_ptr->ib_trans.pkey, strerror(errno));
+		dapl_log(DAPL_DBG_TYPE_ERR,
+			 " ERR: new pkey(0x%x), query (%s)"
+			 " err or key !found, using default pkey_idx=0\n",
+			 ntohs(hca_ptr->ib_trans.pkey), strerror(errno));
 	}
 }
 skip_ib: