Pasha,

Thanks for the patch. Unfortunately, it doesn't seem to have fixed the problem. I realized I didn't mention earlier which version of Open MPI I was trying: it's 1.2.6. Should I be trying 1.2.7 with this patch?

Thanks,
Matt

2008/10/7 Pavel Shamis (Pasha) <pa...@dev.mellanox.co.il>

> Matt,
> Can you please try the attached patch? I guess it will resolve this issue.
>
> Thanks,
> Pasha
>
> Matt Burgess wrote:
>
>> Lenny,
>>
>> Thanks for the info. It still doesn't seem to be working. My command
>> line is:
>>
>> /opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -H d2-ib,d3-ib -mca btl openib,self
>> -mca btl_openib_of_pkey_val 33033 /cluster/pallas/x86_64-ib/IMB-MPI1
>>
>> I don't have a "/sys/class/infiniband/mthca0/ports/1/pkeys/" but I do have
>> "/sys/class/infiniband/mlx4_0/ports/1/pkeys/". Its contents are:
>>
>> 0 106 114 122 16 24 32 40 49 57 65 73 81 9 98
>> 1 107 115 123 17 25 33 41 5 58 66 74 82 90 99
>> 10 108 116 124 18 26 34 42 50 59 67 75 83 91 100
>> 109 117 125 19 27 35 43 51 6 68 76 84 92 101 11
>> 118 126 2 28 36 44 52 60 69 77 85 93 102 110 119
>> 127 20 29 37 45 53 61 7 78 86 94 103 111 12 13
>> 21 3 38 46 54 62 70 79 87 95 104 112 120 14 22 30
>> 39 47 55 63 71 8 88 96 105 113 121 15 23 31 4
>> 48 56 64 72 80 89 97
>>
>> We aren't using opensm, but Voltaire's SM on a 2012 switch.
>>
>> Thanks again,
>> Matt
>>
>>
>> On Tue, Oct 7, 2008 at 9:37 AM, Lenny Verkhovsky
>> <lenny.verkhov...@gmail.com> wrote:
>>
>> Hi Matt,
>>
>> It seems that the right way to do it is the following:
>>
>> -mca btl openib,self -mca btl_openib_ib_pkey_val 33033
>>
>> where the value is the decimal number of the pkey, in your case
>> 0x8109 = 33033, and there is no need for the btl_openib_ib_pkey_ix value.
>>
>> ex.
>> mpirun -np 2 -H witch2,witch3 -mca btl openib,self -mca
>> btl_openib_ib_pkey_val 32769 ./mpi_p1_4_1_2 -t lt
>> LT (2) (size min max avg) 1 3.511429 3.511429 3.511429
>>
>> If it's not working, check cat
>> /sys/class/infiniband/mthca0/ports/1/pkeys/* for the pkeys and the SM;
>> maybe it's a setup issue.
>>
>> Pasha is currently checking this issue.
>>
>> Best regards,
>>
>> Lenny.
>>
>>
>> On 10/7/08, *Jeff Squyres* <jsquy...@cisco.com> wrote:
>>
>> FWIW, if this configuration is for all of your users, you
>> might want to specify these MCA params in the default MCA
>> param file, or the environment, etc., just so that you
>> don't have to specify them on every mpirun command line.
>>
>> See
>> http://www.open-mpi.org/faq/?category=tuning#setting-mca-params.
>>
>>
>> On Oct 7, 2008, at 5:43 AM, Lenny Verkhovsky wrote:
>>
>> Sorry, I misunderstood the question.
>>
>> Thanks to Pasha, the right command line will be
>>
>> -mca btl openib,self -mca btl_openib_of_pkey_val 0x8109
>> -mca btl_openib_of_pkey_ix 1
>>
>> ex.
>>
>> #mpirun -np 2 -H witch2,witch3 -mca btl openib,self -mca
>> btl_openib_of_pkey_val 0x8001 -mca btl_openib_of_pkey_ix 1
>> ./mpi_p1_4_TRUNK -t lt
>> LT (2) (size min max avg) 1 3.443480 3.443480 3.443480
>>
>>
>> Best regards
>>
>> Lenny.
>>
>>
>> On 10/6/08, Jeff Squyres <jsquy...@cisco.com> wrote:
>> On Oct 5, 2008, at 1:22 PM, Lenny Verkhovsky wrote:
>>
>> you should probably use -mca tcp,self -mca
>> btl_openib_if_include ib0.8109
>>
>>
>> Really? I thought we only took OpenFabrics device names
>> in the openib_if_include MCA param...? It looks like
>> ib0.8109 is an IPoIB device name.
>>
>>
>> Lenny.
>>
>>
>> On 10/3/08, Matt Burgess <burgess.m...@gmail.com> wrote:
>> Hi,
>>
>>
>> I'm trying to get openmpi working over openib partitions.
>> On this cluster, the partition number is 0x109. The ib
>> interfaces are pingable over the appropriate ib0.8109
>> interface:
>>
>> d2:/opt/openmpi-ib # ifconfig ib0.8109
>> ib0.8109  Link encap:UNSPEC  HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
>>           inet addr:10.21.48.2  Bcast:10.21.255.255  Mask:255.255.0.0
>>           inet6 addr: fe80::202:c902:26:ca01/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>>           RX packets:16811 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:15848 errors:0 dropped:1 overruns:0 carrier:0
>>           collisions:0 txqueuelen:256
>>           RX bytes:102229428 (97.4 Mb)  TX bytes:102324172 (97.5 Mb)
>>
>>
>> I have tried the following:
>>
>> /opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile
>> machinefile -mca btl openib,self -mca btl_openib_max_btls
>> 1 -mca btl_openib_ib_pkey_val 0x8109 -mca
>> btl_openib_ib_pkey_ix 1 /cluster/pallas/x86_64-ib/IMB-MPI1
>>
>> but I just get a RETRY EXCEEDED ERROR. Is there an MCA
>> parameter I am missing?
>>
>> I was successful using tcp only:
>>
>> /opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile
>> machinefile -mca btl tcp,self -mca btl_openib_max_btls 1
>> -mca btl_openib_ib_pkey_val 0x8109
>> /cluster/pallas/x86_64-ib/IMB-MPI1
>>
>>
>> Thanks,
>> Matt Burgess
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> -- Jeff Squyres
>> Cisco Systems
>>
>>
>> -- Jeff Squyres
>> Cisco Systems
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
> --
> Pavel Shamis (Pasha)
> Mellanox Technologies LTD.
>
> Index: ompi/mca/btl/openib/btl_openib_component.c
> ===================================================================
> --- ompi/mca/btl/openib/btl_openib_component.c (revision 19490)
> +++ ompi/mca/btl/openib/btl_openib_component.c (working copy)
> @@ -558,7 +558,7 @@ static int init_one_hca(opal_list_t *btl
>          goto dealloc_pd;
>      }
>
> -    ret = OMPI_SUCCESS;
> +    ret = OMPI_SUCCESS;
>      /* Note ports are 1 based hence j = 1 */
>      for(i = 1; i <= hca->ib_dev_attr.phys_port_cnt; i++){
>          struct ibv_port_attr ib_port_attr;
> @@ -580,7 +580,7 @@ static int init_one_hca(opal_list_t *btl
>                  uint16_t pkey,j;
>                  for (j=0; j < hca->ib_dev_attr.max_pkeys; j++) {
>                      ibv_query_pkey(hca->ib_dev_context, i, j, &pkey);
> -                    pkey=ntohs(pkey);
> +                    pkey=ntohs(pkey) & 0x7fff;
>                      if(pkey == mca_btl_openib_component.ib_pkey_val){
>                          ret = init_one_port(btl_list, hca, i, j, &ib_port_attr);
>                          break;
> Index: ompi/mca/btl/openib/btl_openib_ini.c
> ===================================================================
> --- ompi/mca/btl/openib/btl_openib_ini.c (revision 19490)
> +++ ompi/mca/btl/openib/btl_openib_ini.c (working copy)
> @@ -90,8 +90,6 @@ static int parse_line(parsed_section_val
>  static void reset_section(bool had_previous_value, parsed_section_values_t *s);
>  static void reset_values(ompi_btl_openib_ini_values_t *v);
>  static int save_section(parsed_section_values_t *s);
> -static int intify(char *string);
> -static int intify_list(char *str, uint32_t **values, int *len);
>  static inline void show_help(const char *topic);
>
>
> @@ -364,14 +362,14 @@ static int parse_line(parsed_section_val
>         all whitespace at the beginning and ending of the value. */
>
>      if (0 == strcasecmp(key_buffer, "vendor_id")) {
> -        if (OMPI_SUCCESS != (ret = intify_list(value, &sv->vendor_ids,
> +        if (OMPI_SUCCESS != (ret = ompi_btl_openib_ini_intify_list(value, &sv->vendor_ids,
>                                                 &sv->vendor_ids_len))) {
>              return ret;
>          }
>      }
>
>      else if (0 == strcasecmp(key_buffer, "vendor_part_id")) {
> -        if (OMPI_SUCCESS != (ret = intify_list(value, &sv->vendor_part_ids,
> +        if (OMPI_SUCCESS != (ret = ompi_btl_openib_ini_intify_list(value, &sv->vendor_part_ids,
>                                                 &sv->vendor_part_ids_len))) {
>              return ret;
>          }
> @@ -379,13 +377,13 @@ static int parse_line(parsed_section_val
>
>      else if (0 == strcasecmp(key_buffer, "mtu")) {
>          /* Single value */
> -        sv->values.mtu = (uint32_t) intify(value);
> +        sv->values.mtu = (uint32_t) ompi_btl_openib_ini_intify(value);
>          sv->values.mtu_set = true;
>      }
>
>      else if (0 == strcasecmp(key_buffer, "use_eager_rdma")) {
>          /* Single value */
> -        sv->values.use_eager_rdma = (uint32_t) intify(value);
> +        sv->values.use_eager_rdma = (uint32_t) ompi_btl_openib_ini_intify(value);
>          sv->values.use_eager_rdma_set = true;
>      }
>
> @@ -547,7 +545,7 @@ static int save_section(parsed_section_v
>  /*
>   * Do string-to-integer conversion, for both hex and decimal numbers
>   */
> -static int intify(char *str)
> +int ompi_btl_openib_ini_intify(char *str)
>  {
>      while (isspace(*str)) {
>          ++str;
> @@ -568,7 +566,7 @@ static int intify(char *str)
>  /*
>   * Take a comma-delimited list and infity them all
>   */
> -static int intify_list(char *value, uint32_t **values, int *len)
> +int ompi_btl_openib_ini_intify_list(char *value, uint32_t **values, int *len)
>  {
>      char *comma;
>      char *str = value;
> @@ -584,7 +582,7 @@ static int intify_list(char *value, uint
>          if (NULL == *values) {
>              return OMPI_ERR_OUT_OF_RESOURCE;
>          }
> -        *values[0] = (uint32_t) intify(str);
> +        *values[0] = (uint32_t) ompi_btl_openib_ini_intify(str);
>          *len = 1;
>      } else {
>          /* If we found a comma, loop over all the values. Be a
> @@ -594,7 +592,7 @@ static int intify_list(char *value, uint
>          do {
>              *comma = '\0';
>              *values = realloc(*values, sizeof(uint32_t) * (*len + 2));
> -            (*values)[*len] = (int32_t) intify(str);
> +            (*values)[*len] = (int32_t) ompi_btl_openib_ini_intify(str);
>              ++(*len);
>              str = comma + 1;
>              comma = strchr(str, ',');
> @@ -602,7 +600,7 @@ static int intify_list(char *value, uint
>          /* Get the last value (i.e., the value after the last
>             comma, because it won't have been snarfed in the
>             loop) */
> -        (*values)[*len] = (uint32_t) intify(str);
> +        (*values)[*len] = (uint32_t) ompi_btl_openib_ini_intify(str);
>          ++(*len);
>      }
>
> Index: ompi/mca/btl/openib/btl_openib_ini.h
> ===================================================================
> --- ompi/mca/btl/openib/btl_openib_ini.h (revision 19490)
> +++ ompi/mca/btl/openib/btl_openib_ini.h (working copy)
> @@ -49,6 +49,9 @@ extern "C" {
>       */
>      int ompi_btl_openib_ini_finalize(void);
>
> +    int ompi_btl_openib_ini_intify(char *string);
> +    int ompi_btl_openib_ini_intify_list(char *str, uint32_t **values, int *len);
> +
>  #if defined(c_plusplus) || defined(__cplusplus)
>  }
>  #endif
> Index: ompi/mca/btl/openib/btl_openib_mca.c
> ===================================================================
> --- ompi/mca/btl/openib/btl_openib_mca.c (revision 19490)
> +++ ompi/mca/btl/openib/btl_openib_mca.c (working copy)
> @@ -27,6 +27,7 @@
>  #include "opal/mca/base/mca_base_param.h"
>  #include "btl_openib.h"
>  #include "btl_openib_mca.h"
> +#include "btl_openib_ini.h"
>
>  /*
>   * Local flags
> @@ -97,7 +98,7 @@ static inline int reg_int(const char* pa
>   */
>  int btl_openib_register_mca_params(void)
>  {
> -    char *msg, *str;
> +    char *msg, *str, *pkey;
>      int ival, ival2, ret, tmp;
>
>      ret = OMPI_SUCCESS;
> @@ -192,13 +193,15 @@ int btl_openib_register_mca_params(void)
>                    0, &ival, REGINT_GE_ZERO));
>      mca_btl_openib_component.ib_pkey_ix = (uint32_t) ival;
>
> -    CHECK(reg_int("ib_pkey_val", "InfiniBand pkey value"
> +    CHECK(reg_string("ib_pkey_val", "InfiniBand pkey value"
>                    "(must be > 0 and < 0xffff)",
> -                  0, &ival, REGINT_GE_ZERO));
> -    if (ival > 0xffff) {
> +                  "0", &pkey, 0));
> +    mca_btl_openib_component.ib_pkey_val = ompi_btl_openib_ini_intify(pkey) & 0x7fff;
> +    if (mca_btl_openib_component.ib_pkey_val > 0xffff ||
> +        mca_btl_openib_component.ib_pkey_val < 0) {
>          ret = OMPI_ERR_BAD_PARAM;
>      }
> -    mca_btl_openib_component.ib_pkey_val = (uint32_t) ival;
> +    free(pkey);
>
>      CHECK(reg_int("ib_psn", "InfiniBand packet sequence starting number "
>                    "(must be >= 0)",
>