Re: [OMPI devel] Question about tight integration with not-yet-supported queuing systems

2014-12-01 Thread marc . hoeppner
Hi, sorry for the late reply - I've been traveling with limited email access. I think you can leave this issue be. I was hoping for a way to just launch mpirun and have it create the allocation by itself. It's not super important right now, more something I was wondering about. Thank

Re: [OMPI devel] Question about tight integration with not-yet-supported queuing systems

2014-12-01 Thread Gilles Gouaillardet
Marc, I am not aware of any MPI implementation in which mpirun does the job allocation. Instead, mpirun gets job info from the batch manager (e.g. number of nodes) so the job can be launched seamlessly and be properly killed in case of a job abort (bkill or equivalent).
Cheers,
Gilles
On 2014/1
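To illustrate the point about mpirun taking its job info from the batch manager, here is a minimal Python sketch of how a launcher could tell that it is running inside an existing allocation. It only inspects environment variables that Slurm, LSF and PBS/Torque export inside a job (SLURM_JOB_ID, LSB_JOBID, PBS_JOBID, SLURM_JOB_NUM_NODES). The function names and the lookup table are hypothetical; this is not how Open MPI implements its scheduler support.

```python
import os

# Job-id variables that batch managers export inside a running job.
# The table itself is an illustrative assumption, not Open MPI's logic.
SCHEDULER_MARKERS = {
    "slurm": "SLURM_JOB_ID",
    "lsf": "LSB_JOBID",
    "pbs": "PBS_JOBID",
}


def detect_scheduler(env=None):
    """Return (scheduler_name, job_id), or (None, None) outside any job."""
    env = os.environ if env is None else env
    for name, var in SCHEDULER_MARKERS.items():
        job_id = env.get(var)
        if job_id:
            return name, job_id
    return None, None


def slurm_node_count(env=None):
    """Return the node count Slurm reports for the job, or None if unset."""
    env = os.environ if env is None else env
    value = env.get("SLURM_JOB_NUM_NODES") or env.get("SLURM_NNODES")
    return int(value) if value else None
```

A launcher built this way never creates the allocation itself; it only reads what the batch manager has already granted, which matches the division of labour described above.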

Re: [OMPI devel] Open MPI 1.8: link problem when Fortran+C+Platform LSF

2014-12-01 Thread Jeff Squyres (jsquyres)
Paul -- Sorry for the delay -- SC and the US Thanksgiving holiday last week got in the way of responding to this properly. I talked with Dave Goodell about this issue a bunch today. Going back to the original email in this thread (http://www.open-mpi.org/community/lists/devel/2014/10/16064.p

Re: [OMPI devel] OMPI devel] OMPI devel] race condition in abort can cause mpirun v1.8 hang

2014-12-01 Thread Ralph Castain
Looks like this should be fixed in my PR #101 - could you please review it?
Thanks
Ralph
> On Nov 26, 2014, at 8:14 PM, Ralph Castain wrote:
>
> Aha - I see what happened. I have that param set to false in my default mca
> param file. If I set it to true on the cmd line, then I run without
>

[OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Howard Pritchard
Hi ompi developers, If you always configure ompi with --disable-dlopen you can delete this message now. There has been some discussion of end case situations with use of dlopen in the ompi mca framework that can lead to unresolved symbols when subsequent shared libraries are dlopen'd that might need symbols from a library that had been opened earlier. The rest of the message is cut off in the archive listing: that might n
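As background for the flag in the subject line: when a library is loaded with RTLD_LOCAL, its symbols are not used to resolve references in libraries loaded later, whereas RTLD_GLOBAL makes them available for that purpose. The sketch below shows the two modes through Python's ctypes wrapper around dlopen. The helper name open_library is hypothetical, and the snippet says nothing about how the Open MPI MCA framework opens its components.

```python
import ctypes


def open_library(path, make_global):
    """dlopen `path` with RTLD_GLOBAL or RTLD_LOCAL symbol visibility.

    Passing path=None returns a handle for the running program itself.
    """
    mode = ctypes.RTLD_GLOBAL if make_global else ctypes.RTLD_LOCAL
    return ctypes.CDLL(path, mode=mode)


# Hypothetical usage (library name given only as an example):
#   lib = open_library("libslurm.so", make_global=True)
# Libraries loaded afterwards could then resolve their undefined
# symbols against the exports of the library opened here.
```

Whether global visibility is the right remedy for a given unresolved symbol depends on which library is actually missing a link-time dependency.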

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Jeff Squyres (jsquyres)
On Dec 1, 2014, at 4:07 PM, Howard Pritchard wrote:
> There has been some discussion of end case situations with use of dlopen
> in the ompi mca framework that can lead to unresolved symbols when
> subsequent shared libraries are dlopen'd that might needs symbols from
> a library that had been op

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its dependencies (the pmi-2 one is correct). Moe is aware of the problem and fixing it on their side. This won’t help existing installations until they upgrade, but I tend to agree with Jeff about not fixing other people’s prob

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Jeff Squyres (jsquyres)
On Dec 1, 2014, at 5:07 PM, Ralph Castain wrote:
> FWIW: It’s Slurm’s pmi-1 library that isn’t linked correctly against its
> dependencies (the pmi-2 one is correct). Moe is aware of the problem and
> fixing it on their side. This won’t help existing installations until they
> upgrade, but I

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
Easy enough to explain. We link libpmi into the pmix/s1 component. This library is missing the linkage to libslurm that contains the linkage to libauth where munge resides. So when we call a PMI function, libpmi references a call to munge for authentication and hits an “unresolved symbol” error.

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Jeff Squyres (jsquyres)
Ok, if the problem is moot, great.
(sidenote: this is moot, so ignore this if you want: with this explanation, I'm still not sure how RTLD_GLOBAL fixes the issue)
On Dec 1, 2014, at 5:15 PM, Ralph Castain wrote:
> Easy enough to explain. We link libpmi into the pmix/s1 component. This
> libr

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Gilles Gouaillardet
Jeff, FWIW, you can read my analysis of what is going wrong at http://www.open-mpi.org/community/lists/pmix-devel/2014/11/0293.php
Bottom line: I agree this is a Slurm issue (the Slurm plugins should depend on libslurm, but they do not yet).
A possible workaround would be to make the pmi component a

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
Another option is to simply add the -lslurm -lauth flags to the pmix/s1 component as this is the only place that requires it, and it won’t hurt anything to do so.
> On Dec 1, 2014, at 6:03 PM, Gilles Gouaillardet wrote:
>
> Jeff,
>
> FWIW, you can read my analysis of what is going wrong a

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Gilles Gouaillardet
I'd like to take a step back ...
I previously tested with Slurm 2.6.0, and it complained about the slurm_verbose symbol, which is defined in libslurm.so, so with Slurm 2.6.0, RTLD_GLOBAL or relinking is OK.
Now I tested with Slurm 2.6.6 and it complains about the slurm_auth_get_arg_desc symbol, and t

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
Out of curiosity - how are you testing these? I have more current versions of Slurm and would like to test the observations there.
> On Dec 1, 2014, at 7:49 PM, Gilles Gouaillardet wrote:
>
> I d like to make a step back ...
>
> i previously tested with slurm 2.6.0, and it complained about

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Gilles Gouaillardet
$ srun --version
slurm 2.6.6-VENDOR_PROVIDED
$ srun --mpi=pmi2 -n 1 ~/hw
I am 0 / 1
$ srun -n 1 ~/hw
/csc/home1/gouaillardet/hw: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_verbose
srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
srun: error

Re: [OMPI devel] RTLD_GLOBAL question

2014-12-01 Thread Ralph Castain
If it isn’t too much trouble, it would be good to confirm that it remains broken. I strongly suspect it is, based on Moe’s comments. Obviously, other people are making this work. For Intel MPI, all you do is point it at libpmi and they can run. However, they do explicitly dlopen it in their code

Re: [OMPI devel] Setting up debug environment on Eclipse PTP

2014-12-01 Thread Alvyn Liang
Hi Ralph, Yes, Eclipse is currently being actively developed. To my understanding, https://www.eclipse.org/ptp/ is also active. I posted a question on the Eclipse forum, but I got no response: https://www.eclipse.org/forums/index.php/t/869298/ I am still looking for answers. Hopefully I will find an