Re: [OMPI devel] [1.10.3rc2] testing summary
On Sat, May 21, 2016 at 3:14 PM, Paul Hargrove wrote:
>
> I did encounter one issue that still needs more investigation before I can
> attribute it to Solaris or to Open MPI:
> Since I last built Open MPI there, the x86-64/Solaris systems were updated
> from Solaris 11.2 to 11.3.
> I can build this release tarball fine using "gmake", but "gmake -j4" is
> failing.
> It appears that libtool is failing to create a ".libs" directory.
> My current guess is some issue between the updated Solaris and the NFS
> server for the filesystem where I am building.

This issue, which was reproducible yesterday, does not occur today. So I am
dropping my investigation.

-Paul

-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] modex getting corrupted
Hello Ralph

Thanks for your input. The routine that does the send is this:

static int btl_lf_modex_send(lfgroup lfgroup)
{
    char *grp_name = lf_get_group_name(lfgroup, NULL, 0);
    btl_lf_modex_t lf_modex;
    int rc;

    strncpy(lf_modex.grp_name, grp_name, GRP_NAME_MAX_LEN);
    OPAL_MODEX_SEND(rc, OPAL_PMIX_GLOBAL,
                    &mca_btl_lf_component.super.btl_version,
                    (char *)&lf_modex, sizeof(lf_modex));
    return rc;
}

This routine is called from the component init routine
(mca_btl_lf_component_init()). I have verified that the values in the modex
(lf_modex) are correct.

The receive happens in proc_create, and I call it like this:

    OPAL_MODEX_RECV(rc, &mca_btl_lf_component.super.btl_version,
                    &opal_proc->proc_name,
                    (uint8_t **)&module_proc->proc_modex, &size);

Here, I get junk values in proc_modex. If I pass a buffer that was
malloc()'ed in place of module_proc->proc_modex, I still get bad data.

Thanks again for your help.

Durga

We learn from history that we never learn from history.

On Sat, May 21, 2016 at 8:38 PM, Ralph Castain wrote:
> Please provide the exact code used for both send/recv - you likely have an
> error in the syntax
>
> On May 20, 2016, at 9:36 PM, dpchoudh . wrote:
>
> Hello all
>
> I have a naive question:
>
> My 'cluster' consists of two nodes, connected back to back with a
> proprietary link as well as GbE (over a switch).
> I am calling OPAL_MODEX_SEND() and the modex consists of just this:
>
> struct modex
> {
>     char name[20];
>     unsigned mtu;
> };
>
> The mtu field is not currently being used. I bzero() the struct and have
> verified that the value being written to the 'name' field (this is similar
> to a PKEY for InfiniBand; the driver will translate it to a unique integer)
> is correct at the sending end.
>
> When I do an OPAL_MODEX_RECV(), the value is completely corrupted. However,
> the size of the modex message is still correct (24 bytes).
> What could I be doing wrong? (Both nodes are little-endian x86_64 machines.)
>
> Thanks in advance
> Durga
>
> We learn from history that we never learn from history.
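To make the intended pairing concrete, here is a minimal sketch (not the exact
code from this thread) of how OPAL_MODEX_SEND/OPAL_MODEX_RECV are typically
used together in a BTL: publish a fixed-size, fully zeroed struct at component
init, then fetch it in proc_create and sanity-check the returned size. The
component object (mca_btl_lf_component), the function names, and the struct
layout are modeled on the example above and are assumptions, not real Open MPI
symbols; the include paths have moved between Open MPI releases, so adjust
them for your tree.

/*
 * Minimal sketch only.  It assumes an existing "lf" BTL whose component
 * object (mca_btl_lf_component) is declared elsewhere; the pmix.h path
 * below is roughly where the modex macros lived in the 1.10/2.x time
 * frame and may differ in other releases.
 */
#include <string.h>

#include "opal/constants.h"        /* OPAL_SUCCESS, OPAL_ERROR */
#include "opal/util/proc.h"        /* opal_proc_t */
#include "opal/mca/pmix/pmix.h"    /* OPAL_MODEX_SEND/RECV, OPAL_PMIX_GLOBAL */

typedef struct {
    char     grp_name[20];   /* 20 + 4 bytes == the 24-byte blob mentioned above */
    unsigned mtu;
} btl_lf_modex_t;

/* Sender side (component init): zero the struct first so every byte of the
 * sizeof(lf_modex) blob that gets published is defined. */
static int lf_modex_publish(const char *grp_name)
{
    btl_lf_modex_t lf_modex;
    int rc;

    memset(&lf_modex, 0, sizeof(lf_modex));
    strncpy(lf_modex.grp_name, grp_name, sizeof(lf_modex.grp_name) - 1);

    OPAL_MODEX_SEND(rc, OPAL_PMIX_GLOBAL,
                    &mca_btl_lf_component.super.btl_version,
                    (char *)&lf_modex, sizeof(lf_modex));
    return rc;
}

/* Receiver side (proc_create): the macro hands back a pointer to a buffer it
 * allocates; compare the returned size with the struct size before using it. */
static int lf_modex_fetch(opal_proc_t *opal_proc, btl_lf_modex_t **modex_out)
{
    size_t size = 0;
    int rc;

    OPAL_MODEX_RECV(rc, &mca_btl_lf_component.super.btl_version,
                    &opal_proc->proc_name,
                    (uint8_t **)modex_out, &size);
    if (OPAL_SUCCESS == rc && sizeof(btl_lf_modex_t) != size) {
        rc = OPAL_ERROR;   /* sender and receiver disagree on the blob layout */
    }
    return rc;
}

Since the receive macro returns a buffer it allocates itself (the caller frees
it), there is no need to pass a pre-malloc()'ed buffer; checking the returned
size against sizeof(btl_lf_modex_t) is a cheap way to catch a sender/receiver
mismatch before interpreting the bytes.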
Re: [OMPI devel] modex getting corrupted
Hello Ralph and all

Please ignore this mail. It was indeed due to a syntax error in my code.
Sorry for the noise; I'll be more careful with my homework from now on.

Best regards
Durga

We learn from history that we never learn from history.

On Mon, May 23, 2016 at 2:13 AM, dpchoudh . wrote:
> Hello Ralph
>
> Thanks for your input. The routine that does the send is this:
>
> static int btl_lf_modex_send(lfgroup lfgroup)
> {
>     char *grp_name = lf_get_group_name(lfgroup, NULL, 0);
>     btl_lf_modex_t lf_modex;
>     int rc;
>
>     strncpy(lf_modex.grp_name, grp_name, GRP_NAME_MAX_LEN);
>     OPAL_MODEX_SEND(rc, OPAL_PMIX_GLOBAL,
>                     &mca_btl_lf_component.super.btl_version,
>                     (char *)&lf_modex, sizeof(lf_modex));
>     return rc;
> }
>
> This routine is called from the component init routine
> (mca_btl_lf_component_init()). I have verified that the values in the modex
> (lf_modex) are correct.
>
> The receive happens in proc_create, and I call it like this:
>
>     OPAL_MODEX_RECV(rc, &mca_btl_lf_component.super.btl_version,
>                     &opal_proc->proc_name,
>                     (uint8_t **)&module_proc->proc_modex, &size);
>
> Here, I get junk values in proc_modex. If I pass a buffer that was
> malloc()'ed in place of module_proc->proc_modex, I still get bad data.
>
> Thanks again for your help.
>
> Durga
>
> We learn from history that we never learn from history.
>
> On Sat, May 21, 2016 at 8:38 PM, Ralph Castain wrote:
>> Please provide the exact code used for both send/recv - you likely have
>> an error in the syntax
>>
>> On May 20, 2016, at 9:36 PM, dpchoudh . wrote:
>>
>> Hello all
>>
>> I have a naive question:
>>
>> My 'cluster' consists of two nodes, connected back to back with a
>> proprietary link as well as GbE (over a switch).
>> I am calling OPAL_MODEX_SEND() and the modex consists of just this:
>>
>> struct modex
>> {
>>     char name[20];
>>     unsigned mtu;
>> };
>>
>> The mtu field is not currently being used. I bzero() the struct and have
>> verified that the value being written to the 'name' field (this is similar
>> to a PKEY for InfiniBand; the driver will translate it to a unique integer)
>> is correct at the sending end.
>>
>> When I do an OPAL_MODEX_RECV(), the value is completely corrupted.
>> However, the size of the modex message is still correct (24 bytes).
>> What could I be doing wrong? (Both nodes are little-endian x86_64
>> machines.)
>>
>> Thanks in advance
>> Durga
>>
>> We learn from history that we never learn from history.
[OMPI devel] Github having issues ATM...
FYI: Github is having a minor outage (i.e., delays). Looks like they're
working on it: https://status.github.com/

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] [1.10.3rc2] testing summary
On Sat, May 21, 2016 at 3:14 PM, Paul Hargrove wrote:
> I will note that I was not able to test IA64 (yet?) because that system is
> down.
> I've emailed the owners of that system and am hopeful that I can test it
> in the next few days.

The IA64 tests ran today and PASSED.

-Paul

-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900