Re: [OMPI users] can't run MPI job under SGE

2019-07-25 Thread Reuti via users


On 25.07.2019 at 18:59, David Laidlaw via users wrote:

> I have been trying to run some MPI jobs under SGE for almost a year without 
> success.  What seems like a very simple test program fails; the ingredients 
> of it are below.  Any suggestions on any piece of the test, reasons for 
> failure, requests for additional info, configuration thoughts, etc. would be 
> much appreciated.  I suspect the linkage between SGE and MPI, but can't 
> identify the problem.  We do have SGE support built into MPI.  We also have 
> the SGE parallel environment (PE) set up as described in several places on 
> the web.
> 
> Many thanks for any input!

Did you compile Open MPI on your own or was it delivered with the Linux 
distribution? That it tries to use `ssh` is quite strange, as nowadays Open MPI 
and others have built-in support to detect that they are running under the 
control of a queuing system. It should use `qrsh` in your case.

What does:

mpiexec --version
ompi_info | grep grid

reveal? What does:

qconf -sconf | egrep "(command|daemon)"

show?

-- Reuti


> Cheers,
> 
> -David Laidlaw
> 
> 
> 
> 
> Here is how I submit the job:
> 
>/usr/bin/qsub /gpfs/main/home/dhl/liggghtsTest/hello2/runme
> 
> 
> Here is what is in runme:
> 
>   #!/bin/bash
>   #$ -cwd
>   #$ -pe orte_fill 1
>   env PATH="$PATH" /usr/bin/mpirun --mca plm_base_verbose 1 -display-allocation ./hello
> 
> 
> Here is hello.c:
> 
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> 
> int main(int argc, char** argv) {
> // Initialize the MPI environment
> MPI_Init(NULL, NULL);
> 
> // Get the number of processes
> int world_size;
> MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> 
> // Get the rank of the process
> int world_rank;
> MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> 
> // Get the name of the processor
> char processor_name[MPI_MAX_PROCESSOR_NAME];
> int name_len;
> MPI_Get_processor_name(processor_name, &name_len);
> 
> // Print off a hello world message
> printf("Hello world from processor %s, rank %d out of %d processors\n",
>processor_name, world_rank, world_size);
> // system("printenv");
> 
> sleep(15); // sleep for 15 seconds
> 
> // Finalize the MPI environment.
> MPI_Finalize();
> }
> 
> 
> This command will build it:
> 
>  mpicc hello.c -o hello
> 
> 
> Running produces the following:
> 
> /var/spool/gridengine/execd/dblade01/active_jobs/1895308.1/pe_hostfile
> dblade01.cs.brown.edu 1 shor...@dblade01.cs.brown.edu UNDEFINED
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --
> 
> 
> and:
> 
> [dblade01:10902] [[37323,0],0] plm:rsh: final template argv:
> /usr/bin/ssh  set path = ( /usr/bin $path ) ; if ( $?
> LD_LIBRARY_PATH == 1 ) set OMPI_have_llp ; if ( $?LD_LIBRARY_PATH
>  == 0 ) setenv LD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_llp == 1 ) setenv
> LD_LIBRARY_PATH /usr/lib:$LD_LIBRARY_PATH ; if ( $?DYLD_LIBRARY
> _PATH == 1 ) set OMPI_have_dllp ; if ( $?DYLD_LIBRARY_PATH == 0 ) setenv
> DYLD_LIBRARY_PATH /usr/lib ; if ( $?OMPI_have_dllp == 1 ) setenv DY
> LD_LIBRARY_PATH /usr/lib:$DYLD_LIBRARY_PATH ;   /usr/bin/orted --hnp-topo-sig
> 0N:2S:0L3:4L2:4L1:4C:4H:x86_64 -mca ess "env" -mca ess_base_jo
> bid "2446000128" -mca ess_base_vpid "" -mca ess_base_num_procs "2" -
> mca orte_hnp_uri "2446000128.0;usock;tcp://10.116.85.90:44791"
>  --mca plm_base_verbose "1" -mca plm "rsh" -mca orte_display_alloc "1" -mca
> pmix "^s1,s2,cray"
> ssh_exchange_identification: read: Connection reset by peer
> 
> 
> 



Re: [OMPI users] can't run MPI job under SGE

2019-07-25 Thread Reuti via users


On 25.07.2019 at 23:00, David Laidlaw wrote:

> Here is most of the command output when run on a grid machine:
> 
> dblade65.dhl(101) mpiexec --version
> mpiexec (OpenRTE) 2.0.2

This is somewhat old. I would suggest installing a fresh one. You can even 
compile one in your home directory and install it e.g. in 
$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared (by --prefix=…intended path…) and 
then use this for all your jobs (adjust for your version of gcc). In your 
~/.bash_profile and the job script:

DEFAULT_MANPATH="$(manpath -q)"
MY_OMPI="$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared"
export PATH="$MY_OMPI/bin:$PATH"
export LD_LIBRARY_PATH="$MY_OMPI/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export MANPATH="$MY_OMPI/share/man${DEFAULT_MANPATH:+:$DEFAULT_MANPATH}"
unset MY_OMPI
unset DEFAULT_MANPATH

This way there is no conflict with the version that is already installed system-wide.
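
For completeness, the build itself is just the usual procedure (a rough sketch, 
assuming the GCC you want is first in the PATH; --with-sge makes sure the 
gridengine support is compiled in):

tar xjf openmpi-3.1.4.tar.bz2 && cd openmpi-3.1.4
./configure --prefix=$HOME/local/openmpi-3.1.4_gcc-7.4.0_shared --with-sge
make -j4
make install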


> dblade65.dhl(102) ompi_info | grep grid
>  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component 
> v2.0.2)
> dblade65.dhl(103) c
> denied: host "dblade65.cs.brown.edu" is neither submit nor admin host
> dblade65.dhl(104) 

On a compute node this is expected, as it is neither a submit nor an admin host.


> Does that suggest anything?
> 
> qconf is restricted to sysadmins, which I am not.

What error do you get if you try it anyway? Usually viewing the configuration is 
possible for everyone.


> I would note that we are running debian stretch on the cluster machines.  On 
> some of our other (non-grid) machines, running debian buster, the output is:
> 
> cslab3d.dhl(101) mpiexec --version
> mpiexec (OpenRTE) 3.1.3
> Report bugs to http://www.open-mpi.org/community/help/
> cslab3d.dhl(102) ompi_info | grep grid
>  MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component 
> v3.1.3)

If you compile on such a machine and intend to run it in the cluster, it won't 
work, as the versions don't match. Hence the suggestion above: use a personal 
version in your $HOME for both compiling and running the applications.

Side note: Open MPI binds the processes to cores by default. In case more than 
one MPI job is running on a node, one will have to use `mpiexec --bind-to none 
…`, as otherwise all jobs on this node will be bound to core 0 upwards and 
compete for the same cores.

-- Reuti


> Thanks!
> 
> -David Laidlaw
> 

Re: [OMPI users] TMPDIR for running openMPI job under grid

2019-07-26 Thread Reuti via users
Hi,

On 26.07.2019 at 21:12, Kulshrestha, Vipul via users wrote:

> Hi,
>  
> I am trying to setup my open-mpi application to run under grid.
>  
> It works sometimes, but sometimes I get the below error. I have contacted my 
> grid site administrator and the message from them is that they cannot change 
> the TMPDIR path used in the grid configuration.
>  
> I have tried setting TMPDIR, but it does not help (probably because grid 
> engine resets it).
>  
> What other alternatives do I have?
>  
> One other curious question is why open-mpi creates such a long name. I 
> understand that part of this path depends on the TMPDIR value, but even after 
> that it adds additional unnecessary characters like 
> “openmpi-sessions-<5 digit number>@<machine name>_0/<5 digit number>”, which 
> could have been shortened to something like 
> “omp-<5 digit number>@<machine name>_0/<5 digit number>”, saving 14 characters 
> (almost 15% of the possible length).
>  
> Thanks,
> Vipul
>  
> PMIx has detected a temporary directory name that results
> in a path that is too long for the Unix domain socket:
>  
> Temp dir: /var/spool/sge/wv2/tmp/<9 digit grid job id>.1.<16 character 
> queuename>.q/openmpi-sessions-43757@<12character machine name>_0/50671

Personally I find it quite unusual to have the scratch directory located in 
/var. Often it's a plain /scratch.

Could a symbolic link help? I mean: create one in /tmp and point it to 
/var/spool/sge/wv2/tmp/<9 digit grid job id>.1.<16 character queuename>.q. Then 
/tmp/$(mktemp -u ) would be a shorter path, which you define as TMPDIR before 
starting `mpiexec`.
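
Roughly like this inside the job script (untested sketch; the application name 
is only a placeholder):

# create a short path in /tmp pointing to the long SGE scratch directory
SHORT_TMPDIR=$(mktemp -u /tmp/ompi.XXXXXX)
ln -s "$TMPDIR" "$SHORT_TMPDIR"
export TMPDIR="$SHORT_TMPDIR"
mpiexec ./your_application
rm "$SHORT_TMPDIR"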

===

If it happens only occasionally, maybe it depends on the length of the hostname 
it's running on?

If the admins are nice, they could define a symbolic link /scratch pointing 
directly to /var/spool/sge/wv2/tmp and set up /scratch as TMPDIR in the queue 
configuration. Same effect and location as now, but it saves some characters.

-- Reuti


Re: [OMPI users] can't run MPI job under SGE

2019-07-29 Thread Reuti via users


> On 29.07.2019 at 17:17, David Laidlaw wrote:
> 
> I will try building a newer ompi version in my home directory, but that will 
> take me some time.
> 
> qconf is not available to me on any machine.  It provides that same error 
> wherever I am able to try it:
> > denied: host "..." is neither submit nor admin host
> 
> Here is what it produces when I have a sysadmin run it:
> $ qconf -sconf | egrep "(command|daemon)"
> qlogin_command   /sysvol/sge.test/bin/qlogin-wrapper
> qlogin_daemon/sysvol/sge.test/bin/grid-sshd -i
> rlogin_command   builtin
> rlogin_daemonbuiltin
> rsh_command  builtin
> rsh_daemon   builtin

That's fine. I wondered whether rsh_* would contain a redirection to `ssh` (to 
find the source of the `ssh` used in your error output).

-- Reuti


Re: [OMPI users] mpirun --output-filename behavior

2019-11-01 Thread Reuti via users


> On 01.11.2019 at 14:46, Jeff Squyres (jsquyres) via users wrote:
> 
>> On Nov 1, 2019, at 9:34 AM, Jeff Squyres (jsquyres) via users wrote:
>> 
>>> Point to make: it would be nice to have an option to suppress the output on 
>>> stdout and/or stderr when output redirection to file is requested. In my 
>>> case, having stdout still visible on the terminal is desirable but having a 
>>> way to suppress output of stderr to the terminal would be immensely helpful.
>> 
>> I do believe that --output-file will write to a *local* file on the node 
>> where it is running (vs. being sent to mpirun, and mpirun writing to the 
>> output file).  So snipping off the output from being sent to mpirun in the 
>> first place would actually be an efficiency-gaining feature.
> 
> 
> Guess what?  It turns out that this is another 
> previously-undocumented-but-already-existing feature.  :-)
> 
>mpirun --output-filename foo:nocopy ...
> 
> The ":nocopy" suffix will not emit to stdout/stderr; it will *only* write to 
> the files.
> 
> You can also comma-delimit / mix this with "nojobid" behavior.  For example:
> 
>mpirun --output-filename foo:nocopy,nojobid ...
>(ordering of the tokens doesn't matter in the comma-delimited list)
> 
> (I have to admit that I actually LOL'ed when I looked in the code and found 
> that the feature was already there!)
> 
> For the most part, this whole thing needs to get documented.

In particular the fact that the colon thereby becomes a disallowed character in 
the directory name. Any suffix “:foo” will just be removed, AFAICS, without any 
error output about “foo” being an unknown option.

-- Reuti


>  I don't know the timing of when this will happen, but we should probably 
> also rename this to --output-directory to be a bit more accurate (and 
> probably keep --output-filename as a deprecated synonym for at least the 
> duration of the 4.0.x series).
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 



Re: [OMPI users] Univa Grid Engine and OpenMPI 1.8.7

2020-01-12 Thread Reuti via users
Hi,

On 12.01.2020 at 07:15, Lane, William via users wrote:

> I'm having problems w/an old openMPI test program which I re-compiled using 
> OpenMPI 1.8.7 for CentOS 6.3 running Univa Grid Engine 8.6.4.

IIRC at that time it was necessary to compile Open MPI explicitly with 
--with-sge. Wouldn't a newer version of Open MPI work for you, which detects 
UGE automatically?


>   • Are the special PE requirements for Son of Grid Engine needed for 
> Univa Grid Engine 8.6.4 (in particular qsort_args and/or control_slaves both 
> being present and set to TRUE)

Yes, as "control_slaves" allows `qrsh -inherit …` only when it's set.
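
For reference, a PE set up for a tight integration typically looks roughly like 
this in `qconf -sp` (a sketch; the PE name and slot count are site specific):

pe_name            orte
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE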


>   •  is LD_LIBRARY_PATH required to be set for openMPI 1.8.7 to run 
> (Univa Grid Engine specifically reports that it has: "removed environment 
> variable LD_LIBRARY_PATH from submit environment - it is considered a 
> security issue" when I run my test openMPI 1.8.7 job.

I usually don't like `qsub -V …` at all (which I think you are using), as a 
changed interactive environment might break the job at a later point in time, 
and that is hard to investigate. I prefer setting all necessary environment 
variables inside the job script itself, so that it is self-contained.
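
E.g. the job script could start like this (a sketch; the Open MPI location is 
only a placeholder):

#!/bin/bash
#$ -cwd
# set everything needed here instead of relying on `qsub -V`
export PATH=/path/to/openmpi-1.8.7/bin:$PATH
export LD_LIBRARY_PATH=/path/to/openmpi-1.8.7/lib:$LD_LIBRARY_PATH
mpirun ./your_application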

Maybe they judge it a security issue, as this variable would also be present in 
case you run a queue prolog/epilog as a different user. For the plain job 
itself it wouldn't matter IMO.

And for any further investigation: which problem do you face in detail?

-- Reuti

Re: [OMPI users] running mpirun with grid

2020-02-06 Thread Reuti via users
Hi,

> On 06.02.2020 at 21:47, Kulshrestha, Vipul via users wrote:
> 
> Hi,
>  
> I need to launch my openmpi application on grid.
>  
> My application is designed to run N processes, where each process would have 
> M threads.
>  
> To run it without grid, I run it as (say N = 7, M = 2)
> % mpirun -np 7 
>  
> The above works well and runs N processes. I am also able to submit it on 
> grid using below command and it works.
>  
> % qsub -pe orte 7 -l os-redhat6.7* -V -j y -b y -shell no mpirun -np 7 
> 
>  
> However, the above job allocates only N slots on the grid, when it really is 
> consuming N*M slots. How do I submit the qsub command so that it reserves the 
> N*M slots, while starting up N processes? I tried below but I get some weird 
> error from ORTE as pasted below.
>  
> % qsub -pe orte 14 -l os-redhat6.7* -V -j y -b y -shell no mpirun -np 7 
> 

a) You will first have to ask the admin to provide a fixed allocation rule on 
all involved nodes, e.g. "allocation_rule 2", and name this PE "orte2". This way 
you can be sure to always get 2 slots on each node.

b) Instead of submitting a binary, you will need a job script in which you 
mangle the provided PE_HOSTFILE so that each node is listed with a slot count of 
1 only, i.e. Open MPI should think it has to start only one process per node. 
You can then use the remaining core for an additional thread. As the original 
file can't be changed, it has to be copied, adjusted, and PE_HOSTFILE then reset 
to point to the new file (see the sketch below).

c) It would be nice if the admins could already prepare a mangled PE_HOSTFILE 
(maybe by dividing the slot count by the last digit in the PE name) in a 
parallel prolog and put it in the $TMPDIR of the job. As the environment 
variables won't be inherited by the job, you will have to point the environment 
variable PE_HOSTFILE to the mangled one in your job script in this case too.

d) SGE should get the real number of needed slots for your job during 
submission, i.e. 14 in your case.

This way you will get an allocation of 14 slots; due to the fixed allocation 
rule of "orte2" they are equally distributed. The mangled PE_HOSTFILE will 
include only one slot per node, and Open MPI will start only one process per 
node, for a total of 7. Then you can use OMP_NUM_THREADS=2 or alike to tell your 
application to start an additional thread per node. The environment variable 
OMP_NUM_THREADS should also be distributed to the nodes with the option "-x" to 
`mpirun` (or use MPI itself to distribute this information).

Note that in contrast to Torque you are sure to get each node only once. AFAIR 
there was a setting in Torque to allow or disallow multiple selections of the 
fixed allocation rule per node.
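
A rough sketch of such a job script for the example of 7 processes with 2 
threads each (untested; the PE name and application are placeholders):

#!/bin/bash
#$ -cwd
#$ -pe orte2 14
# copy the PE_HOSTFILE and reduce every slot count to 1, so that
# Open MPI starts only one process per node
MANGLED_HOSTFILE="$TMPDIR/pe_hostfile.one_slot"
awk '{ $2 = 1; print }' "$PE_HOSTFILE" > "$MANGLED_HOSTFILE"
export PE_HOSTFILE="$MANGLED_HOSTFILE"
export OMP_NUM_THREADS=2
mpirun -np 7 -x OMP_NUM_THREADS ./your_application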

HTH -- Reuti

Re: [OMPI users] Moving an installation

2020-07-24 Thread Reuti via users

Hi,

On 24.07.2020 at 18:55, Lana Deere via users wrote:

> I have open-mpi 4.0.4 installed on my desktop and my small test programs are 
> working.
> 
> I would like to migrate the open-mpi to a cluster and run a larger program 
> there.  When moved, the open-mpi installation is in a different pathname than 
> it was on my desktop and it doesn't seem to work any longer.  I can make the 
> libraries visible via LD_LIBRARY_PATH but this seems insufficient.  Is there 
> an environment variable which can be used to tell the open-mpi where it is 
> installed?

There is OPAL_PREFIX to be set:

https://www.open-mpi.org/faq/?category=building#installdirs
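
Roughly like this on the cluster (the path is only a placeholder for the new 
installation directory):

export OPAL_PREFIX=/cluster/opt/openmpi-4.0.4
export PATH="$OPAL_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OPAL_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"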

-- Reuti


Re: [OMPI users] segfault in libibverbs.so

2020-07-27 Thread Reuti via users


On 27.07.2020 at 21:18, Prentice Bisbal via users wrote:

> Can anyone explain why my job still calls libibverbs when I run it with '-mca 
> btl ^openib'?

I observed a similar behavior in a mixed cluster where some nodes have 
InfiniBand and others don't. Even checking the node beforehand and applying 
'-mca btl ^openib' didn't suppress the warnings about the missing libibverbs. 
While in the case of IB even more libs are required, at least libibverbs seems 
to be needed to avoid the warning about its absence in any case (the job 
continued despite the warning).

[node01:119439] mca_base_component_repository_open: unable to open mca_oob_ud: 
libibverbs.so.1: cannot open shared object file: No such file or directory 
(ignored)


> If I instead use '-mca btl tcp', my jobs don't segfault. I would assume '-mca 
> btl ^openib' and '-mca btl tcp' to essentially be equivalent, but there's 
> obviously a difference in the two.

I didn't check this but just ignored the warning later on. Would '-mca btl tcp' 
also allow local communication without the network being involved, and/or 
replace vader? This was the reason I found '-mca btl ^openib' more appealing 
than listing all the others.
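
If one really wants to list components instead of excluding openib, something 
like this should keep local and shared-memory communication available (a 
sketch; component names as of the Open MPI 4.x series):

mpirun --mca btl self,vader,tcp ./mpihello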

-- Reuti


> Prentice
> 
> On 7/23/20 3:34 PM, Prentice Bisbal wrote:
>> I manage a cluster that is very heterogeneous. Some nodes have InfiniBand, 
>> while others have 10 Gb/s Ethernet. We recently upgraded to CentOS 7, and 
>> built a new software stack for CentOS 7. We are using OpenMPI 4.0.3, and we 
>> are using Slurm 19.05.5 as our job scheduler.
>> 
>> We just noticed that when jobs are sent to the nodes with IB, they segfault 
>> immediately, with the segfault appearing to come from libibverbs.so. This is 
>> what I see in the stderr output for one of these failed jobs:
>> 
>> srun: error: greene021: tasks 0-3: Segmentation fault
>> 
>> And here is what I see in the log messages of the compute node where that 
>> segfault happened:
>> 
>> Jul 23 15:19:41 greene021 kernel: mpihello[7911]: segfault at 7f0635f38910 
>> ip 7f0635f49405 sp 7ffe354485a0 error 4
>> Jul 23 15:19:41 greene021 kernel: mpihello[7912]: segfault at 7f23d51ea910 
>> ip 7f23d51fb405 sp 7ffef250a9a0 error 4
>> Jul 23 15:19:41 greene021 kernel: in 
>> libibverbs.so.1.5.22.4[7f23d51ec000+18000]
>> Jul 23 15:19:41 greene021 kernel:
>> Jul 23 15:19:41 greene021 kernel: mpihello[7909]: segfault at 7ff504ba5910 
>> ip 7ff504bb6405 sp 7917ccb0 error 4
>> Jul 23 15:19:41 greene021 kernel: in 
>> libibverbs.so.1.5.22.4[7ff504ba7000+18000]
>> Jul 23 15:19:41 greene021 kernel:
>> Jul 23 15:19:41 greene021 kernel: mpihello[7910]: segfault at 7fa58abc5910 
>> ip 7fa58abd6405 sp 7ffdde50c0d0 error 4
>> Jul 23 15:19:41 greene021 kernel: in 
>> libibverbs.so.1.5.22.4[7fa58abc7000+18000]
>> Jul 23 15:19:41 greene021 kernel:
>> Jul 23 15:19:41 greene021 kernel: in 
>> libibverbs.so.1.5.22.4[7f0635f3a000+18000]
>> Jul 23 15:19:41 greene021 kernel
>> 
>> Any idea what is going on here, or how to debug further? I've been using 
>> OpenMPI for years, and it usually just works.
>> 
>> I normally start my job with srun like this:
>> 
>> srun ./mpihello
>> 
>> But even if I try to take IB out of the equation by starting the job like 
>> this:
>> 
>> mpirun -mca btl ^openib ./mpihello
>> 
>> I still get a segfault issue, although the message to stderr is now a little 
>> different:
>> 
>> -- 
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> -- 
>> -- 
>> mpirun noticed that process rank 1 with PID 8502 on node greene021 exited on 
>> signal 11 (Segmentation fault).
>> -- 
>> 
>> The segfaults happens immediately. It seems to happen as soon as MPI_Init() 
>> is called. The program I'm running is very simple MPI "Hello world!" program.
>> 
>> The output of  ompi_info is below my signature, in case that helps.
>> 
>> Prentice
>> 
>> $ ompi_info
>>  Package: Open MPI u...@host.example.com Distribution
>> Open MPI: 4.0.3
>>   Open MPI repo revision: v4.0.3
>>Open MPI release date: Mar 03, 2020
>> Open RTE: 4.0.3
>>   Open RTE repo revision: v4.0.3
>>Open RTE release date: Mar 03, 2020
>> OPAL: 4.0.3
>>   OPAL repo revision: v4.0.3
>>OPAL release date: Mar 03, 2020
>>  MPI API: 3.1.0
>> Ident string: 4.0.3
>>   Prefix: /usr/pppl/gcc/9.3-pkgs/openmpi-4.0.3
>>  Configured architecture: x86_64-unknown-linux-gnu
>>   Configure host: dawson027.pppl.gov
>>Configured by: lglant
>>Configured on: Mon Jun  1 12:37:07 EDT

Re: [OMPI users] Issues with compilers

2021-01-22 Thread Reuti via users
Hi,

What about putting "-static-intel" into a configuration file for the Intel 
compiler? Besides the default configuration, one can have a local one and put 
its path in the environment variable IFORTCFG (there are similar ones for C/C++).

$ cat myconf 
--version
$ export IFORTCFG=/home/reuti/myconf
$ ifort
ifort (IFORT) 19.0.5.281 20190815
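
For the case at hand it would be along these lines (untested sketch; the file 
name is arbitrary):

$ cat ~/ifort_openmpi.cfg
-static-intel
$ export IFORTCFG=$HOME/ifort_openmpi.cfg
$ ./configure CC=gcc CXX=g++ FC=ifort 'FCFLAGS=-O2 -m64' …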

-- Reuti


> On 22.01.2021 at 15:49, Alvaro Payero Pinto via users wrote:
> 
> Dear Open MPI support team,
> 
> I am trying to install Open MPI with Intel compiler suite for the Fortran 
> side and GNU compiler suite for the C side. For factors that don’t depend 
> upon me, I’m not allowed to change the C compiler suite to Intel one since 
> that would mean an additional license.
> 
> Problem arises with the fact that the installation should not dynamically 
> depend on Intel libraries, so the flag “-static-intel” (or similar) should be 
> passed to the Fortran compiler. I’ve seen in the FAQ that this problem is 
> solved by passing an Autotools option “-Wc,-static-intel” to the variable 
> LDFLAGS when invoking configure with Intel compilers. This works if both 
> C/C++ and Fortran compilers are from Intel. However, it crashes if the 
> compiler suite is mixed since GNU C/C++ does not recognise the 
> “-static-intel” option.
> 
> Is there any way to bypass this crash and to indicate that such option should 
> only be passed when using Fortran compiler?
> 
> Configure call to reproduce the crash is made as follows:
> 
> ./configure --prefix=/usr/local/ --libdir=/usr/local/lib64/ 
> --includedir=/usr/local/include/ CC=gcc CXX=g++ 'FLAGS=-O2 -m64' 'CFLAGS=-O2 
> -m64' 'CXXFLAGS=-O2 -m64' FC=ifort 'FCFLAGS=-O2 -m64' 
> LDFLAGS=-Wc,-static-intel
> 
> Please, find attached the output from configure and config.log.
> 
> Additional data:
> 
> · Operating system SLES12 SP3.
> · Open MPI version 4.0.5
> · Intel Fortran compiler version 17.0.6
> · GNU C/C++ compiler version 4.8.5.
> 
> I’ll very much appreciate any help provided to solve this problem.
> 
> Kind regards,
> 
> Álvaro
> 
>  
> 



[OMPI users] RUNPATH vs. RPATH

2022-07-22 Thread Reuti via users
Hi,

using Open MPI 4.1.4

$ mpicc --show …

tells me that the command line contains "… -Wl,--enable-new-dtags …" so that 
even older linkers will include RUNPATH instead of RPATH in the created dynamic 
binary. On the other hand, Open MPI itself doesn't use this option for its own 
libraries:

./liboshmem.so.40.30.2
./libmpi_mpifh.so.40.30.0
./libmpi.so.40.30.4
./libmpi_usempi_ignore_tkr.so.40.30.0
./libopen-rte.so.40.30.2

Is this intended?

Setting LD_LIBRARY_PATH will instruct the created binary to look for libraries 
first in that location and resolve it, but the loaded library in turn will then 
use RPATH inside itself first to load additional libraries.

(I compile Open MPI in my home directory and move it then to the final 
destination for the group; setting OPAL_PREFIX of course. I see a mix of 
library locations when I run the created binary on my own with `ldd`.)
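
Which tag a binary or library actually carries can be checked with readelf from 
binutils, e.g.:

readelf -d libmpi.so.40.30.4 | egrep "RPATH|RUNPATH"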

Looks like I can get the intended behavior while configuring Open MPI on this 
(older) system:

$ ./configure …  LDFLAGS=-Wl,--enable-new-dtags

-- Reuti

Re: [OMPI users] RUNPATH vs. RPATH

2022-08-09 Thread Reuti via users
Hi Jeff,

> On 09.08.2022 at 16:17, Jeff Squyres (jsquyres) via users wrote:
> 
> Just to follow up on this thread...
> 
> Reuti: I merged the PR on to the main docs branch.  They're now live -- we 
> changed the text:
>   • here: 
> https://docs.open-mpi.org/en/main/installing-open-mpi/configure-cli-options/installation.html

On this page I read:

Using --disable-wrapper-rpath will disable both “runpath” and “rpath” behavior 
in the wrapper compilers.

I would phrase it:

Using --disable-wrapper-rpath in addition will disable both “runpath” and 
“rpath” behavior in the wrapper compilers.

(otherwise I get a "configure: error: --enable-wrapper-runpath cannot be 
selected with --disable-wrapper-rpath")


>   • and here: 
> https://docs.open-mpi.org/en/main/installing-open-mpi/configure-cli-options/rpath-and-runpath.html

The last command reads `shell$ ./configure LDFLAGS=--enable-new-dtags ...`. But 
the LDFLAGS will be given to the compiler wrapper, hence it seems to need 
-Wl,--enable-new-dtags, which I used initially to avoid:

configure:6591: checking whether the C compiler works
configure:6613: gcc   --enable-new-dtags conftest.c  >&5
cc1: error: unknown pass new-dtags specified in -fenable


> Here's the corresponding PR to update the v5.0.x docs: 
> https://github.com/open-mpi/ompi/pull/10640
> 
> Specifically, the answer to your original question is twofold:
>   • It's complicated. 🙂
>   • It looks like you did the Right Thing for your environment, but you 
> might want to check the output of "readelf -d ..." to be sure.
> Does that additional text help explain things?

Yes, thx a lot for the clarification and update of the documentation.

-- Reuti


> --
> Jeff Squyres
> jsquy...@cisco.com
> From: Jeff Squyres (jsquyres) 
> Sent: Saturday, August 6, 2022 9:36 AM
> To: Open MPI Users 
> Subject: Re: [OMPI users] RUNPATH vs. RPATH
> 
> Reuti --
> 
> See my disclaimers on other posts about apologies for taking so long to reply!
> 
> This code was written forever ago; I had to dig through it a bit, read the 
> comments and commit messages, and try to remember why it was done this way.  
> What I thought would be a 5-minute search turned into a few hours of digging 
> through code, multiple conversations with Brian, and one pull request (so 
> far).  We don't have a definitive answer yet, but we think we're getting 
> closer.
> 
> The short version is that what you did appears to be correct:
> 
> ./configure LDFLAGS=-Wl,--enable-new-dtags ...
> 
> The longer answer is that whenever you think you understand the shared 
> library and run-time linkers, you inevitably find out that you don't.  The 
> complicated cases come from the fact that the handling of rpath and runpath 
> can be different on different platforms, and there are subtle differences in 
> their behavior (beyond the initial "search before or after LD_LIBRARY_PATH, 
> such as the handling of primary and secondary/transitive dependencies).
> 
> The pull request I have so far is https://github.com/open-mpi/ompi/pull/10624 
> (rendered here: 
> https://ompi--10624.org.readthedocs.build/en/10624/installing-open-mpi/configure-cli-options/installation.html).
>   We're not 100% confident in that text yet, but I think we're close to at 
> least documenting what the current behavior is.  Once we nail that down, we 
> can talk about whether we want to change that behavior.
> 


