Re: [OMPI users] Slides from the Open MPI SC'15 State of the Union BOF
Received from Jeff Squyres (jsquyres) on Thu, Nov 19, 2015 at 10:03:33AM EST:

> Thanks to the over 100 people who came to the Open MPI State of the Union
> BOF yesterday. George Bosilca from U. Tennessee, Nathan Hjelm from Los
> Alamos National Lab, and I presented where we are with Open MPI
> development, and where we're going.
>
> If you weren't able to join us, feel free to read through the slides:
>
>     http://www.open-mpi.org/papers/sc-2015/
>
> Thank you!

FYI, there seems to be some problem with the posted PDF file - when I tried
to view it in Firefox 42 and 3 other PDF viewers (on Linux, at least), all
of the programs claimed that the file is either corrupted or misformatted.
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/
Re: [OMPI users] 1.10.1 appears to break mpi4py
Received from Gilles Gouaillardet on Mon, Nov 09, 2015 at 07:20:51PM EST:

> Orion and Lev,
>
> here is the minimal patch that makes mpi4py tests happy again
>
> there might not be a v1.10.2, so you might have to manually apply
> that patch until v2.0.0

Confirming that the scatter/gather mpi4py test errors are eliminated by the
above patch.

Thanks,
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/
Re: [OMPI users] 1.10.1 appears to break mpi4py
Received from Orion Poplawski on Mon, Nov 09, 2015 at 06:36:05PM EST:

> We're seeing test failures after bumping to 1.10.1 in Fedora (see below).
> Is anyone else seeing this? Any suggestions for debugging?

I see similar errors - you might want to mention it on the
mpi...@googlegroups.com mailing list.
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/
Re: [OMPI users] reported number of processes emitting error much larger than number started/spawned by mpiexec?
Received from Ralph Castain on Sun, Sep 20, 2015 at 06:54:41PM EDT:

(snip)

> > On a closer look, it seems that the "17" corresponds to the number of
> > times the error was emitted after its occurrence regardless of how many
> > actual MPI processes were running (each of the MPI processes spawned by
> > my program iterates a certain number of times and causes the error to
> > occur during each iteration).
>
> That is correct - if you tell us the error, we'd be happy to help
> diagnose. Otherwise, your analysis is correct.

I'm already in communication with Rolf vandeVaart regarding the error [1].
Unfortunately, neither of us has made much headway finding the source of
the problem as of the present time.

[1] http://www.open-mpi.org/community/lists/users/2015/09/27526.php
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Re: [OMPI users] reported number of processes emitting error much larger than number started/spawned by mpiexec?
Received from Ralph Castain on Sun, Sep 20, 2015 at 05:08:10PM EDT:

> > On Sep 20, 2015, at 12:57 PM, Lev Givon wrote:
> >
> > While debugging a problem that is causing emission of a non-fatal
> > OpenMPI error message to stderr, the error message is followed by a
> > line similar to the following (I have help message aggregation turned
> > on):
> >
> > [myhost:10008] 17 more processes have sent help message some_file.txt /
> > blah blah failed
> >
> > The job that I am running is started as a single process (via SLURM
> > using PMI) that spawns 2 processes via MPI_Comm_spawn; the number of
> > processes reported in the above line, however, is much larger than 2.
> > Why would the number of processes reporting an error be so big? When I
> > examine the MPI processes in real time as they run (e.g., via top),
> > there never appear to be that many processes running.
> >
> > I'm using OpenMPI 1.10.0 built on Ubuntu 14.04.3; as indicated by
> > ompi_info, I don't have multiple MPI threads enabled:
> >
> > posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI progress: no,
> > ORTE progress: yes, Event lib: yes)
>
> Just to be clear: you are starting the single process using
> "srun -n 1 ./app", and the app calls MPI_Comm_spawn?

Yes.

> I'm not sure that's really supported... I think there might be something
> in Slurm behind that call, but I have no idea if it really works.

Well, the same question applies if I don't use SLURM and launch with
mpiexec -np 1.

On a closer look, it seems that the "17" corresponds to the number of times
the error was emitted after its occurrence regardless of how many actual
MPI processes were running (each of the MPI processes spawned by my program
iterates a certain number of times and causes the error to occur during
each iteration).
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
[OMPI users] reported number of processes emitting error much larger than number started/spawned by mpiexec?
While debugging a problem that is causing emission of a non-fatal OpenMPI
error message to stderr, the error message is followed by a line similar to
the following (I have help message aggregation turned on):

[myhost:10008] 17 more processes have sent help message some_file.txt /
blah blah failed

The job that I am running is started as a single process (via SLURM using
PMI) that spawns 2 processes via MPI_Comm_spawn; the number of processes
reported in the above line, however, is much larger than 2. Why would the
number of processes reporting an error be so big? When I examine the MPI
processes in real time as they run (e.g., via top), there never appear to
be that many processes running.

I'm using OpenMPI 1.10.0 built on Ubuntu 14.04.3; as indicated by
ompi_info, I don't have multiple MPI threads enabled:

posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI progress: no, ORTE
progress: yes, Event lib: yes)
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
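For anyone else puzzled by the aggregated count: it reflects the number of
emitted help-message instances, not distinct processes. One way to confirm
this (a sketch; `./my_app` is a placeholder for the actual program) is to
disable aggregation so each occurrence prints separately:

```shell
# Disable Open MPI's help-message aggregation so every occurrence is
# printed individually instead of being summarized as
# "N more processes have sent help message ...".
# `./my_app` is a placeholder program name.
mpiexec -np 1 --mca orte_base_help_aggregate 0 ./my_app
```

With aggregation off, a message repeated once per loop iteration in a
single process will visibly print once per iteration, which matches the
behavior described above.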
[OMPI users] tracking down what's causing a cuIpcOpenMemHandle error emitted by OpenMPI
I recently noticed the following error when running a Python program I'm
developing that repeatedly performs GPU-to-GPU data transfers via OpenMPI:

The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x602e75000
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.

The system is running Ubuntu 14.04.3 and contains several Tesla S2050 GPUs.
I'm using the following software:

- Linux kernel 3.19.0 (backported to Ubuntu 14.04.3 from 15.04)
- CUDA 7.0 (installed via NVIDIA's deb packages)
- NVIDIA kernel driver 346.82
- OpenMPI 1.10.0 (manually compiled with CUDA support)
- Python 2.7.10
- pycuda 2015.1.3 (manually compiled against CUDA 7.0)
- mpi4py (manually compiled git revision 1d8ab22)

OpenMPI, Python, pycuda, and mpi4py are all locally installed in a conda
environment.

Judging from my program's logs, the error pops up during one of the
program's first few iterations. The error isn't fatal, however - the
program continues running to completion after the message appears. Running
mpiexec with --mca plm_base_verbose 10 doesn't seem to produce any
additional debug info of use in tracking this down. I did notice, though,
that there are undeleted cuda.shm.* files in /run/shm after the error
message appears and my program exits. Deleting the files does not prevent
the error from recurring if I subsequently rerun the program.

Oddly, the above problem doesn't crop up when I run the same code on an
Ubuntu 14.04.3 system with the exact same software containing 2 non-Tesla
GPUs (specifically, a GTX 470 and 750). The error seems to have started
occurring over the past two weeks, but none of the changes I made to my
code over that time seem to be related to the problem (i.e., running an
older revision resulted in the same errors).

I also tried running my code using older releases of OpenMPI (e.g., 1.8.5)
and mpi4py (e.g., from about 4 weeks ago), but the error message still
occurs.

Both Ubuntu systems are 64-bit and have been kept up to date with the
latest package updates.

Any thoughts as to what could be causing the problem?
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
Received from Lev Givon on Thu, May 21, 2015 at 11:32:33AM EDT:

> Received from Rolf vandeVaart on Wed, May 20, 2015 at 07:48:15AM EDT:
>
> (snip)
>
> > I see that you mentioned you are starting 4 MPS daemons. Are you
> > following the instructions here?
> >
> > http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html
>
> Yes - also
> https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
>
> > This relies on setting CUDA_VISIBLE_DEVICES which can cause problems
> > for CUDA IPC. Since you are using CUDA 7 there is no more need to start
> > multiple daemons. You simply leave CUDA_VISIBLE_DEVICES untouched and
> > start a single MPS control daemon which will handle all GPUs. Can you
> > try that?
>
> I assume that this means that only one CUDA_MPS_PIPE_DIRECTORY value
> should be passed to all MPI processes.
>
> Several questions related to your comment above:
>
> - Should the MPI processes select and initialize the GPUs they
>   respectively need to access as they normally would when MPS is not in
>   use?
> - Can CUDA_VISIBLE_DEVICES be used to control what GPUs are visible to
>   MPS (and hence the client processes)? I ask because SLURM uses
>   CUDA_VISIBLE_DEVICES to control GPU resource allocation, and I would
>   like to run my program (and the MPS control daemon) on a cluster via
>   SLURM.
> - Does the clash between setting CUDA_VISIBLE_DEVICES and CUDA IPC imply
>   that MPS and CUDA IPC cannot reliably be used simultaneously in a
>   multi-GPU setting with CUDA 6.5 even when one starts multiple MPS
>   control daemons as described in the aforementioned blog post?

Using a single control daemon with CUDA_VISIBLE_DEVICES unset appears to
solve the problem when IPC is enabled.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
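For readers landing on this thread, the working configuration described
above can be sketched roughly as follows. This assumes CUDA 7.0+; the /tmp
paths are conventional defaults chosen for illustration, not values
confirmed in the thread:

```shell
# Start ONE MPS control daemon that manages all GPUs (CUDA 7.0+).
# Leave CUDA_VISIBLE_DEVICES unset so the daemon sees every GPU.
unset CUDA_VISIBLE_DEVICES
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps        # illustrative path
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log     # illustrative path
nvidia-cuda-mps-control -d    # -d: run as a background daemon

# MPI client processes need the same CUDA_MPS_PIPE_DIRECTORY in their
# environment and select GPUs as usual (e.g., cudaSetDevice / pycuda).

# Shut the daemon down when finished:
echo quit | nvidia-cuda-mps-control
```

Unlike the per-GPU setup in the blog post cited above, no
CUDA_VISIBLE_DEVICES partitioning (and hence no clash with CUDA IPC) is
involved.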
Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
Received from Rolf vandeVaart on Wed, May 20, 2015 at 07:48:15AM EDT:

(snip)

> I see that you mentioned you are starting 4 MPS daemons. Are you
> following the instructions here?
>
> http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html

Yes - also
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

> This relies on setting CUDA_VISIBLE_DEVICES which can cause problems for
> CUDA IPC. Since you are using CUDA 7 there is no more need to start
> multiple daemons. You simply leave CUDA_VISIBLE_DEVICES untouched and
> start a single MPS control daemon which will handle all GPUs. Can you
> try that?

I assume that this means that only one CUDA_MPS_PIPE_DIRECTORY value should
be passed to all MPI processes.

Several questions related to your comment above:

- Should the MPI processes select and initialize the GPUs they respectively
  need to access as they normally would when MPS is not in use?
- Can CUDA_VISIBLE_DEVICES be used to control what GPUs are visible to MPS
  (and hence the client processes)? I ask because SLURM uses
  CUDA_VISIBLE_DEVICES to control GPU resource allocation, and I would like
  to run my program (and the MPS control daemon) on a cluster via SLURM.
- Does the clash between setting CUDA_VISIBLE_DEVICES and CUDA IPC imply
  that MPS and CUDA IPC cannot reliably be used simultaneously in a
  multi-GPU setting with CUDA 6.5 even when one starts multiple MPS control
  daemons as described in the aforementioned blog post?

> Because of this question, we realized we need to update our documentation
> as well.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:

> >-----Original Message-----
> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
> >Sent: Tuesday, May 19, 2015 6:30 PM
> >To: us...@open-mpi.org
> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
> >1.8.5 with CUDA 7.0 and Multi-Process Service
> >
> >I'm encountering intermittent errors while trying to use the
> >Multi-Process Service with CUDA 7.0 for improving concurrent access to
> >a Kepler K20Xm GPU by multiple MPI processes that perform GPU-to-GPU
> >communication with each other (i.e., GPU pointers are passed to the MPI
> >transmission primitives). I'm using GitHub revision 41676a1 of mpi4py
> >built against OpenMPI 1.8.5, which is in turn built against CUDA 7.0.
> >In my current configuration, I have 4 MPS server daemons running, each
> >of which controls access to one of 4 GPUs; the MPI processes spawned by
> >my program are partitioned into 4 groups (which might contain different
> >numbers of processes) that each talk to a separate daemon. For certain
> >transmission patterns between these processes, the program runs without
> >any problems. For others (e.g., 16 processes partitioned into 4
> >groups), however, it dies with the following error:
> >
> >[node05:20562] Failed to register remote memory, rc=-1
> >--------------------------------------------------------------------------
> >The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
> >and will cause the program to abort.
> >  cuIpcOpenMemHandle return value:   21199360
> >  address: 0x1
> >Check the cuda.h file for what the return value means. Perhaps a reboot
> >of the node will clear the problem.

(snip)

> >After the above error occurs, I notice that /dev/shm/ is littered with
> >cuda.shm.* files. I tried cleaning up /dev/shm before running my
> >program, but that doesn't seem to have any effect upon the problem.
> >Rebooting the machine also doesn't have any effect. I should also add
> >that my program runs without any error if the groups of MPI processes
> >talk directly to the GPUs instead of via MPS.
> >
> >Does anyone have any ideas as to what could be going on?
>
> I am not sure why you are seeing this. One thing that is clear is that
> you have found a bug in the error reporting. The error message is a
> little garbled and I see a bug in what we are reporting. I will fix
> that.
>
> If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0.
> My expectation is that you will not see any errors, but may lose some
> performance.

The error does indeed go away when IPC is disabled, although I do want to
avoid degrading the performance of data transfers between GPU memory
locations.

> What does your hardware configuration look like? Can you send me output
> from "nvidia-smi topo -m"
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
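The diagnostic suggested above can be written out explicitly. This is a
sketch; `my_program.py` is a placeholder for the actual launch command:

```shell
# Disable CUDA IPC in the smcuda BTL so GPU buffers are staged through
# host memory instead of being mapped with cuIpcOpenMemHandle. Slower for
# GPU-to-GPU transfers, but useful to isolate IPC-related failures.
mpiexec -np 16 --mca btl_smcuda_use_cuda_ipc 0 python my_program.py
```

If the failure disappears with this flag, the problem is confined to the
CUDA IPC path rather than the application's communication pattern.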
Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:

> >-----Original Message-----
> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
> >Sent: Tuesday, May 19, 2015 6:30 PM
> >To: us...@open-mpi.org
> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
> >1.8.5 with CUDA 7.0 and Multi-Process Service
> >
> >I'm encountering intermittent errors while trying to use the
> >Multi-Process Service with CUDA 7.0 for improving concurrent access to
> >a Kepler K20Xm GPU by multiple MPI processes that perform GPU-to-GPU
> >communication with each other (i.e., GPU pointers are passed to the MPI
> >transmission primitives). I'm using GitHub revision 41676a1 of mpi4py
> >built against OpenMPI 1.8.5, which is in turn built against CUDA 7.0.
> >In my current configuration, I have 4 MPS server daemons running, each
> >of which controls access to one of 4 GPUs; the MPI processes spawned by
> >my program are partitioned into 4 groups (which might contain different
> >numbers of processes) that each talk to a separate daemon. For certain
> >transmission patterns between these processes, the program runs without
> >any problems. For others (e.g., 16 processes partitioned into 4
> >groups), however, it dies with the following error:
> >
> >[node05:20562] Failed to register remote memory, rc=-1
> >--------------------------------------------------------------------------
> >The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
> >and will cause the program to abort.
> >  cuIpcOpenMemHandle return value:   21199360
> >  address: 0x1
> >Check the cuda.h file for what the return value means. Perhaps a reboot
> >of the node will clear the problem.

(snip)

> >After the above error occurs, I notice that /dev/shm/ is littered with
> >cuda.shm.* files. I tried cleaning up /dev/shm before running my
> >program, but that doesn't seem to have any effect upon the problem.
> >Rebooting the machine also doesn't have any effect. I should also add
> >that my program runs without any error if the groups of MPI processes
> >talk directly to the GPUs instead of via MPS.
> >
> >Does anyone have any ideas as to what could be going on?
>
> I am not sure why you are seeing this. One thing that is clear is that
> you have found a bug in the error reporting. The error message is a
> little garbled and I see a bug in what we are reporting. I will fix
> that.
>
> If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0.
> My expectation is that you will not see any errors, but may lose some
> performance.
>
> What does your hardware configuration look like? Can you send me output
> from "nvidia-smi topo -m"

        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SOC     SOC     0-23
GPU1    PHB      X      SOC     SOC     0-23
GPU2    SOC     SOC      X      PHB     0-23
GPU3    SOC     SOC     PHB      X      0-23

Legend:
  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
[OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
I'm encountering intermittent errors while trying to use the Multi-Process
Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
by multiple MPI processes that perform GPU-to-GPU communication with each
other (i.e., GPU pointers are passed to the MPI transmission primitives).
I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
which is in turn built against CUDA 7.0. In my current configuration, I
have 4 MPS server daemons running, each of which controls access to one of
4 GPUs; the MPI processes spawned by my program are partitioned into 4
groups (which might contain different numbers of processes) that each talk
to a separate daemon. For certain transmission patterns between these
processes, the program runs without any problems. For others (e.g., 16
processes partitioned into 4 groups), however, it dies with the following
error:

[node05:20562] Failed to register remote memory, rc=-1
--------------------------------------------------------------------------
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
will cause the program to abort.
  cuIpcOpenMemHandle return value:   21199360
  address: 0x1
Check the cuda.h file for what the return value means. Perhaps a reboot of
the node will clear the problem.
--------------------------------------------------------------------------
[node05:20562] [[58522,2],4] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned a non-zero exit
code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[node05][[58522,2],5][btl_tcp_frag.c:142:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev failed: Connection reset by peer (104)
[node05:20564] Failed to register remote memory, rc=-1
[node05:20564] [[58522,2],6] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05:20566] Failed to register remote memory, rc=-1
[node05:20566] [[58522,2],8] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05:20567] Failed to register remote memory, rc=-1
[node05:20567] [[58522,2],9] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05][[58522,2],11][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node05:20569] Failed to register remote memory, rc=-1
[node05:20569] [[58522,2],11] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05:20571] Failed to register remote memory, rc=-1
[node05:20571] [[58522,2],13] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
[node05:20572] Failed to register remote memory, rc=-1
[node05:20572] [[58522,2],14] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477

After the above error occurs, I notice that /dev/shm/ is littered with
cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
but that doesn't seem to have any effect upon the problem. Rebooting the
machine also doesn't have any effect. I should also add that my program
runs without any error if the groups of MPI processes talk directly to the
GPUs instead of via MPS.

Does anyone have any ideas as to what could be going on?
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Re: [OMPI users] getting OpenMPI 1.8.4 w/ CUDA to look for absolute path to libcuda.so.1
Received from Rolf vandeVaart on Wed, Apr 29, 2015 at 11:14:15AM EDT:

> >-----Original Message-----
> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
> >Sent: Wednesday, April 29, 2015 10:54 AM
> >To: us...@open-mpi.org
> >Subject: [OMPI users] getting OpenMPI 1.8.4 w/ CUDA to look for
> >absolute path to libcuda.so.1
> >
> >I'm trying to build/package OpenMPI 1.8.4 with CUDA support enabled on
> >Linux x86_64 so that the compiled software can be downloaded/installed
> >as one of the dependencies of a project I'm working on with no further
> >user configuration. I noticed that MPI programs built with the above
> >will try to access /usr/lib/i386-linux-gnu/libcuda.so.1 (and obviously
> >complain about it being the wrong ELF class) if /usr/lib/i386-linux-gnu
> >precedes /usr/lib/x86_64-linux-gnu in one's ld.so cache. While one can
> >get around this by modifying one's ld.so configuration (or tweaking
> >LD_LIBRARY_PATH), is there some way to compile OpenMPI such that
> >programs built with it (on x86_64) look for the absolute path of
> >libcuda.so.1 - i.e., /usr/lib/x86_64-linux-gnu/libcuda.so.1 - rather
> >than fall back on ld.so? I tried setting the rpath of MPI programs
> >built with the above (by modifying the OpenMPI compiler wrappers to
> >include -Wl,-rpath -Wl,/usr/lib/x86_64-linux-gnu), but that doesn't
> >seem to help.
>
> Hi Lev:
> Any chance you can try Open MPI 1.8.5rc3 and see if you see the same
> behavior? That code has changed a bit from the 1.8.4 series and I am
> curious if you will still see the same issue.
>
> http://www.open-mpi.org/software/ompi/v1.8/downloads/openmpi-1.8.5rc3.tar.gz

The issue does not occur with 1.8.5rc3 using the same configure options as
used with 1.8.4. Since 1.8.5 is almost ready for stable release, I'll
switch over now.

Thanks!
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
[OMPI users] getting OpenMPI 1.8.4 w/ CUDA to look for absolute path to libcuda.so.1
I'm trying to build/package OpenMPI 1.8.4 with CUDA support enabled on
Linux x86_64 so that the compiled software can be downloaded/installed as
one of the dependencies of a project I'm working on with no further user
configuration. I noticed that MPI programs built with the above will try to
access /usr/lib/i386-linux-gnu/libcuda.so.1 (and obviously complain about
it being the wrong ELF class) if /usr/lib/i386-linux-gnu precedes
/usr/lib/x86_64-linux-gnu in one's ld.so cache.

While one can get around this by modifying one's ld.so configuration (or
tweaking LD_LIBRARY_PATH), is there some way to compile OpenMPI such that
programs built with it (on x86_64) look for the absolute path of
libcuda.so.1 - i.e., /usr/lib/x86_64-linux-gnu/libcuda.so.1 - rather than
fall back on ld.so? I tried setting the rpath of MPI programs built with
the above (by modifying the OpenMPI compiler wrappers to include
-Wl,-rpath -Wl,/usr/lib/x86_64-linux-gnu), but that doesn't seem to help.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Re: [OMPI users] parsability of ompi_info --parsable output
Received from Ralph Castain on Wed, Apr 08, 2015 at 12:23:28PM EDT:

> Sounds reasonable - I don't have time to work thru it right now, but we
> can look at it once Jeff returns as he wrote all that stuff and might see
> where to make the changes more readily than me.

Made a note of the suggestion here:
https://github.com/open-mpi/ompi/issues/515

Thanks,
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
Re: [OMPI users] parsability of ompi_info --parsable output
Received from Ralph Castain on Wed, Apr 08, 2015 at 10:46:58AM EDT:

> > On Apr 8, 2015, at 7:23 AM, Lev Givon wrote:
> >
> > The output of ompi_info --parsable is somewhat difficult to parse
> > programmatically because it doesn't escape or quote fields that
> > contain colons, e.g.,
> >
> > build:timestamp:Tue Dec 23 15:47:28 EST 2014
> > option:threads:posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI
> > progress: no, ORTE progress: yes, Event lib: yes)
> >
> > Is there some way to facilitate machine parsing of the output of
> > ompi_info without having to special-case those options/parameters
> > whose data fields might contain colons? If not, it would be nice to
> > quote such fields in future releases of ompi_info.
>
> I think the assumption was that people would parse this as follows:
>
> * entry before the first colon is the category
> * entry between first and second colons is the subcategory
> * everything past the second colon is the value

Given that the "value" as defined above may still contain colons, it's
still necessary to process it to extract the various data in it, e.g., the
various MCA parameters, their values, types, etc.

> You are right, however, that the current format precludes the use of an
> automatic tokenizer looking for colon. I don't think quoting the value
> field would really solve that problem - do you have any suggestions?

Why wouldn't quoting the value field address the parsing problem? Quoting a
field that contains colons would effectively permit the output of
ompi_info --parsable to be processed just like a CSV file; most CSV readers
seem to support inclusion of the separator character in data fields via
quoting.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
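A minimal sketch of the CSV-style quoting being proposed here. The quoted
sample line is hypothetical output illustrating the proposal, not what
ompi_info --parsable actually emits:

```python
import csv
import io

# Hypothetical ompi_info output in which the value field is quoted; the
# colons inside the quotes would then not act as field separators.
quoted = 'option:threads:"posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes)"\n'

# A stock CSV reader configured with ':' as the delimiter handles this
# without any special-casing of particular options/parameters.
row = next(csv.reader(io.StringIO(quoted), delimiter=':', quotechar='"'))
print(row)
```

Any off-the-shelf CSV reader with configurable delimiter and quote
characters would parse such output the same way.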
[OMPI users] parsability of ompi_info --parsable output
The output of ompi_info --parsable is somewhat difficult to parse
programmatically because it doesn't escape or quote fields that contain
colons, e.g.,

build:timestamp:Tue Dec 23 15:47:28 EST 2014
option:threads:posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI
progress: no, ORTE progress: yes, Event lib: yes)

Is there some way to facilitate machine parsing of the output of ompi_info
without having to special-case those options/parameters whose data fields
might contain colons? If not, it would be nice to quote such fields in
future releases of ompi_info.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
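For reference, the two-colon parsing convention discussed in the replies to
this thread can be sketched in Python. The sample lines are the ones quoted
above; the limitation is exactly the one raised here - colons inside the
value survive unsplit but cannot themselves be tokenized naively:

```python
def parse_ompi_info(lines):
    """Split each `ompi_info --parsable` line into (category, subcategory, value).

    Everything past the second colon is treated as the value, so values
    containing colons are preserved intact - but extracting data from
    inside such a value still needs ad hoc handling.
    """
    records = []
    for line in lines:
        parts = line.rstrip('\n').split(':', 2)
        if len(parts) == 3:
            records.append(tuple(parts))
    return records

# Sample lines from the message above:
sample = [
    "build:timestamp:Tue Dec 23 15:47:28 EST 2014",
    "option:threads:posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, "
    "OMPI progress: no, ORTE progress: yes, Event lib: yes)",
]
for category, subcategory, value in parse_ompi_info(sample):
    print(category, '|', subcategory)
```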
Re: [OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs
Received from Rolf vandeVaart on Fri, Mar 27, 2015 at 04:09:58PM EDT:

> >-----Original Message-----
> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
> >Sent: Friday, March 27, 2015 3:47 PM
> >To: us...@open-mpi.org
> >Subject: [OMPI users] segfault during MPI_Isend when transmitting GPU
> >arrays between multiple GPUs
> >
> >I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded today)
> >built against OpenMPI 1.8.4 with CUDA support activated to
> >asynchronously send GPU arrays between multiple Tesla GPUs (Fermi
> >generation). Each MPI process is associated with a single GPU; the
> >process has a run loop that starts several Isends to transmit the
> >contents of GPU arrays to destination processes and several Irecvs to
> >receive data from source processes into GPU arrays on the process' GPU.
> >Some of the sends/recvs use one tag, while the remainder use a second
> >tag. A single Waitall invocation is used to wait for all of these sends
> >and receives to complete before the next iteration of the loop can
> >commence. All GPU arrays are preallocated before the run loop starts.
> >While this pattern works most of the time, it sometimes fails with a
> >segfault that appears to occur during an Isend:

(snip)

> >Any ideas as to what could be causing this problem?
> >
> >I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
>
> Hi Lev:
>
> I am not sure what is happening here but there are a few things we can
> do to try and narrow things down.
>
> 1. If you run with --mca btl_smcuda_use_cuda_ipc 0 then I assume this
>    error will go away?

Yes - that appears to be the case.

> 2. Do you know if when you see this error it happens on the first pass
>    through your communications? That is, you mention how there are
>    multiple iterations through the loop and I am wondering when you see
>    failures if it is the first pass through the loop.

When the segfault occurs, it appears to always happen during the second
iteration of the loop, i.e., at least one slew of Isends (and presumably
Irecvs) is successfully performed.

Some more details regarding the Isends: each process starts two Isends for
each destination process to which it transmits data. The Isends use two
different tags, respectively; one is passed None (by design), while the
other is passed the pointer to a GPU array with nonzero length. The
segfault appears to occur during the latter Isend.
--
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/
[OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs
I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded today) built against OpenMPI 1.8.4 with CUDA support activated to asynchronously send GPU arrays between multiple Tesla GPUs (Fermi generation). Each MPI process is associated with a single GPU; the process has a run loop that starts several Isends to transmit the contents of GPU arrays to destination processes and several Irecvs to receive data from source processes into GPU arrays on the process' GPU. Some of the sends/recvs use one tag, while the remainder use a second tag. A single Waitall invocation is used to wait for all of these sends and receives to complete before the next iteration of the loop can commence. All GPU arrays are preallocated before the run loop starts. While this pattern works most of the time, it sometimes fails with a segfault that appears to occur during an Isend:

[myhost:05471] *** Process received signal ***
[myhost:05471] Signal: Segmentation fault (11)
[myhost:05471] Signal code: (128)
[myhost:05471] Failing at address: (nil)
[myhost:05471] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x2ac2bb176340]
[myhost:05471] [ 1] /usr/lib/x86_64-linux-gnu/libcuda.so.1(+0x1f6b18)[0x2ac2c48bfb18]
[myhost:05471] [ 2] /usr/lib/x86_64-linux-gnu/libcuda.so.1(+0x16dcc3)[0x2ac2c4836cc3]
[myhost:05471] [ 3] /usr/lib/x86_64-linux-gnu/libcuda.so.1(cuIpcGetEventHandle+0x5d)[0x2ac2c480bccd]
[myhost:05471] [ 4] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_common_cuda_construct_event_and_handle+0x27)[0x2ac2c27d3087]
[myhost:05471] [ 5] /opt/openmpi-1.8.4/lib/libmpi.so.1(ompi_free_list_grow+0x199)[0x2ac2c277b8e9]
[myhost:05471] [ 6] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_mpool_gpusm_register+0xf4)[0x2ac2c28c9fd4]
[myhost:05471] [ 7] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_pml_ob1_rdma_cuda_btls+0xcd)[0x2ac2c28f8afd]
[myhost:05471] [ 8] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_pml_ob1_send_request_start_cuda+0xbf)[0x2ac2c28f8d5f]
[myhost:05471] [ 9] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_pml_ob1_isend+0x60e)[0x2ac2c28eb6fe]
[myhost:05471] [10] /opt/openmpi-1.8.4/lib/libmpi.so.1(MPI_Isend+0x137)[0x2ac2c27b7cc7]
[myhost:05471] [11] /home/lev/Work/miniconda/envs/MYENV/lib/python2.7/site-packages/mpi4py/MPI.so(+0xd3bb2)[0x2ac2c24b3bb2]

(Python-related debug lines omitted.)

Any ideas as to what could be causing this problem? I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04. -- Lev Givon Bionet Group | Neurokernel Project http://www.columbia.edu/~lev/ http://lebedov.github.io/ http://neurokernel.github.io/
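For reference, the run-loop pattern described above can be sketched in pseudocode modeled on the mpi4py API (the rank lists, buffers, and tag names are illustrative placeholders, not the actual program):

```python
# Pseudocode sketch of the failing pattern; dest_ranks, src_ranks,
# out_bufs, in_bufs, TAG_A, and TAG_B are illustrative placeholders.
from mpi4py import MPI

comm = MPI.COMM_WORLD

# GPU arrays (pycuda GPUArrays) preallocated once, before the loop:
out_bufs = [...]  # one send buffer per destination rank
in_bufs = [...]   # one recv buffer per source rank

while True:
    reqs = []
    for rank, buf in zip(dest_ranks, out_bufs):
        reqs.append(comm.Isend(buf, dest=rank, tag=TAG_A))  # some use TAG_B
    for rank, buf in zip(src_ranks, in_bufs):
        reqs.append(comm.Irecv(buf, source=rank, tag=TAG_A))
    # block until every transfer for this iteration completes:
    MPI.Request.Waitall(reqs)
    # ... launch kernels on the received GPU arrays ...
```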
Re: [OMPI users] compiling OpenMPI 1.8.4 on system with multiarched SLURM libs (Ubuntu 15.04 prerelease)
Received from Ralph Castain on Wed, Mar 04, 2015 at 10:03:06AM EST: > > On Mar 3, 2015, at 9:41 AM, Lev Givon wrote: > > > > Received from Ralph Castain on Sun, Mar 01, 2015 at 10:31:15AM EST: > >>> On Feb 26, 2015, at 1:19 PM, Lev Givon wrote: > >>> > >>> Received from Ralph Castain on Thu, Feb 26, 2015 at 04:14:05PM EST: > >>>>> On Feb 26, 2015, at 1:07 PM, Lev Givon wrote: > >>>>> > >>>>> I recently tried to build OpenMPI 1.8.4 on a daily release of what will > >>>>> eventually become Ubuntu 15.04 (64-bit) with the --with-slurm and > >>>>> --with-pmi > >>>>> options on. I noticed that the libpmi.so.0.0.0 library in Ubuntu 15.04 > >>>>> is now > >>>>> in the multiarch location /usr/lib/x86_64-linux-gnu rather than > >>>>> /usr/lib; this > >>>>> causes the configure script to complain that it can't find > >>>>> libpmi/libpmi2 in > >>>>> /usr/lib or /usr/lib64. Setting LDFLAGS=-L/usr/lib/x86_64-linux-gnu > >>>>> and/or > >>>>> LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu doesn't seem to help. How can > >>>>> I get > >>>>> configure find the pmi library when it is in a multiarch location? > >>>> > >>>> Looks like we don’t have a separate pmi-libdir configure option, so it > >>>> may not > >>>> work. I can add one to the master and set to pull it across to 1.8.5. > >>> > >>> That would be great. Another possibility is to add > >>> /usr/lib/x86_64-linux-gnu and > >>> /usr/lib/i386-linux-gnu to the default libdirs searched when testing for > >>> pmi. > >> > >> > >> Could you please check the nightly 1.8 tarball? I added the pmi-libdir > >> option. Having it default to look for x86 etc subdirs is a little too > >> system-specific - if that ever becomes a broader standard way of installing > >> things, then I'd be more inclined to add it to the default search algo. > >> > >> http://www.open-mpi.org/nightly/v1.8/ > > > > The libpmi library file in Ubuntu 15.04 is in /usr/lib/x86_64-linux-gnu, not > > /usr/lib/x86_64-linux-gnu/lib or /usr/lib/x86_64-linux-gnu/lib64. 
> > Could the pmi-libdir option be modified to use the specified directory
> > as-is rather than appending lib or lib64 to it?
>
> Rats - the backport missed that part. I'll fix it.

Thanks! FYI, I was able to successfully compile nightly build openmpi-v1.8.4-134-g9ad2aa8.tar.bz2 on Ubuntu 15.04 with the latest dev packages (as of today) and --with-pmi-libdir=/usr/lib/x86_64-linux-gnu

Thanks,

-- Lev Givon Bionet Group | Neurokernel Project http://www.columbia.edu/~lev/ http://lebedov.github.io/ http://neurokernel.github.io/
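For anyone hitting the same multiarch issue, the working build can be sketched roughly as follows (the tarball name and pmi-libdir value are the ones reported in this thread; the unpacked directory name, install prefix, and make flags are illustrative assumptions):

```shell
# Illustrative sketch only: build a 1.8 nightly against Ubuntu 15.04's
# multiarch SLURM PMI libraries. The directory name and --prefix value
# are assumed examples.
tar xjf openmpi-v1.8.4-134-g9ad2aa8.tar.bz2
cd openmpi-v1.8.4-134-g9ad2aa8
./configure --with-slurm --with-pmi \
    --with-pmi-libdir=/usr/lib/x86_64-linux-gnu \
    --prefix=/opt/openmpi-1.8
make -j4 && make install
```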
Re: [OMPI users] compiling OpenMPI 1.8.4 on system with multiarched SLURM libs (Ubuntu 15.04 prerelease)
Received from Ralph Castain on Sun, Mar 01, 2015 at 10:31:15AM EST: > > On Feb 26, 2015, at 1:19 PM, Lev Givon wrote: > > > > Received from Ralph Castain on Thu, Feb 26, 2015 at 04:14:05PM EST: > >>> On Feb 26, 2015, at 1:07 PM, Lev Givon wrote: > >>> > >>> I recently tried to build OpenMPI 1.8.4 on a daily release of what will > >>> eventually become Ubuntu 15.04 (64-bit) with the --with-slurm and > >>> --with-pmi > >>> options on. I noticed that the libpmi.so.0.0.0 library in Ubuntu 15.04 > >>> is now > >>> in the multiarch location /usr/lib/x86_64-linux-gnu rather than /usr/lib; > >>> this > >>> causes the configure script to complain that it can't find libpmi/libpmi2 > >>> in > >>> /usr/lib or /usr/lib64. Setting LDFLAGS=-L/usr/lib/x86_64-linux-gnu and/or > >>> LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu doesn't seem to help. How can I > >>> get > >>> configure find the pmi library when it is in a multiarch location? > >> > >> Looks like we don’t have a separate pmi-libdir configure option, so it may > >> not > >> work. I can add one to the master and set to pull it across to 1.8.5. > > > > That would be great. Another possibility is to add > > /usr/lib/x86_64-linux-gnu and > > /usr/lib/i386-linux-gnu to the default libdirs searched when testing for > > pmi. > > > Could you please check the nightly 1.8 tarball? I added the pmi-libdir > option. Having it default to look for x86 etc subdirs is a little too > system-specific - if that ever becomes a broader standard way of installing > things, then I'd be more inclined to add it to the default search algo. > > http://www.open-mpi.org/nightly/v1.8/ The libpmi library file in Ubuntu 15.04 is in /usr/lib/x86_64-linux-gnu, not /usr/lib/x86_64-linux-gnu/lib or /usr/lib/x86_64-linux-gnu/lib64. Could the pmi-libdir option be modified to use the specified directory as-is rather than appending lib or lib64 to it? 
-- Lev Givon Bionet Group | Neurokernel Project http://www.columbia.edu/~lev/ http://lebedov.github.io/ http://neurokernel.github.io/
Re: [OMPI users] compiling OpenMPI 1.8.4 on system with multiarched SLURM libs (Ubuntu 15.04 prerelease)
Received from Ralph Castain on Thu, Feb 26, 2015 at 04:14:05PM EST: > > On Feb 26, 2015, at 1:07 PM, Lev Givon wrote: > > > > I recently tried to build OpenMPI 1.8.4 on a daily release of what will > > eventually become Ubuntu 15.04 (64-bit) with the --with-slurm and --with-pmi > > options on. I noticed that the libpmi.so.0.0.0 library in Ubuntu 15.04 is > > now > > in the multiarch location /usr/lib/x86_64-linux-gnu rather than /usr/lib; > > this > > causes the configure script to complain that it can't find libpmi/libpmi2 in > > /usr/lib or /usr/lib64. Setting LDFLAGS=-L/usr/lib/x86_64-linux-gnu and/or > > LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu doesn't seem to help. How can I > > get > > configure find the pmi library when it is in a multiarch location? > > Looks like we don’t have a separate pmi-libdir configure option, so it may not > work. I can add one to the master and set to pull it across to 1.8.5. That would be great. Another possibility is to add /usr/lib/x86_64-linux-gnu and /usr/lib/i386-linux-gnu to the default libdirs searched when testing for pmi. Thanks, -- Lev Givon Bionet Group | Neurokernel Project http://www.columbia.edu/~lev/ http://lebedov.github.io/ http://neurokernel.github.io/
[OMPI users] compiling OpenMPI 1.8.4 on system with multiarched SLURM libs (Ubuntu 15.04 prerelease)
I recently tried to build OpenMPI 1.8.4 on a daily release of what will eventually become Ubuntu 15.04 (64-bit) with the --with-slurm and --with-pmi options on. I noticed that the libpmi.so.0.0.0 library in Ubuntu 15.04 is now in the multiarch location /usr/lib/x86_64-linux-gnu rather than /usr/lib; this causes the configure script to complain that it can't find libpmi/libpmi2 in /usr/lib or /usr/lib64. Setting LDFLAGS=-L/usr/lib/x86_64-linux-gnu and/or LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu doesn't seem to help. How can I get configure to find the pmi library when it is in a multiarch location? -- Lev Givon Bionet Group | Neurokernel Project http://www.columbia.edu/~lev/ http://lebedov.github.io/ http://neurokernel.github.io/
[OMPI users] using MPI_Comm_spawn in OpenMPI 1.8.4 with SLURM
I've been using OpenMPI 1.8.4 manually built on Ubuntu 14.04.2 against the PMI libraries provided by the stock SLURM 2.6.5 Ubuntu packages. Although I am able to successfully run MPI jobs that use MPI_Comm_spawn via mpi4py 1.3.1 (also manually built against OpenMPI 1.8.4) to dynamically create processes when I launch those jobs via mpiexec directly, I can't seem to get SLURM to start them (I am able to use SLURM to successfully start jobs with a fixed number of processes, however). For example, attempting to run a job that spawns more than one process with srun -n 1 python myprogram.py results in the following error:

[huxley:24037] [[5176,1],0] ORTE_ERROR_LOG: Not available in file dpm_orte.c at line 1100
[huxley:24037] *** An error occurred in MPI_Comm_spawn
[huxley:24037] *** reported by process [339214337,0]
[huxley:24037] *** on communicator MPI_COMM_SELF
[huxley:24037] *** MPI_ERR_UNKNOWN: unknown error
[huxley:24037] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[huxley:24037] *** and potentially your MPI job)

Running the same program with mpiexec -np 1 python myprogram.py works properly. Has anyone successfully used SLURM (possibly a more recent version than 2.6.5) to submit spawning OpenMPI jobs? If so, what might be causing the above error? -- Lev Givon Bionet Group | Neurokernel Project http://www.columbia.edu/~lev/ http://lebedov.github.io/ http://neurokernel.github.io/
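For context, the spawning part of such a program looks roughly like the following pseudocode modeled on the mpi4py API (the worker script name and process count are placeholders, not the actual program):

```python
# Pseudocode sketch (mpi4py-style); 'worker.py' and n_workers are
# placeholders for illustration only.
from mpi4py import MPI

n_workers = 4  # placeholder

# The parent spawns worker processes and receives an intercommunicator
# connecting it to them; this is the MPI_Comm_spawn call that fails
# under srun but works under mpiexec:
child_comm = MPI.COMM_SELF.Spawn('python', args=['worker.py'],
                                 maxprocs=n_workers)
# ... exchange data with the workers over child_comm ...
child_comm.Disconnect()
```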
Re: [OMPI users] another mpirun + xgrid question
Received from Neeraj Chourasia on Mon, Sep 10, 2007 at 11:49:03PM EDT:
> On Mon, 2007-09-10 at 15:35 -0400, Lev Givon wrote:
> > When launching an MPI program with mpirun on an xgrid cluster, is
> > there a way to cause the program being run to be temporarily copied to
> > the compute nodes in the cluster when executed (i.e., similar to what the
> > xgrid command line tool does)? Or is it necessary to make the program
> > being run available on every compute node (e.g., using NFS data
> > partitions)?
> >
> > L.G.
>
> If you are using a scheduler like PBS or SGE over MPI, there is an option
> called prolog and epilog, where you can give scripts which do the copy
> operation. This script is called before and after job execution, as the
> name suggests.
>
> Without it, in MPI itself, I have to see if it can be done.
>
> The alternative way is to keep a copy of the program at the same location
> on all compute nodes and launch mpirun.

Of course, but one of the advantages of XGrid is that one can submit an executable or script for execution on the compute nodes even if it wasn't installed on those machines. Moreover, it also isn't necessary to have multiple user accounts on the compute nodes in an XGrid cluster because the job scheduler runs as the nobody user. Since OpenMPI makes use of the XGrid scheduler, I was curious whether its mpirun command could also somehow submit the program being run to the compute nodes (a bit of experimenting seems to suggest that it is not possible).

> If the executable location is different on compute nodes, you have to
> specify the same as the mpirun command-line arguments.

L.G.
[OMPI users] another mpirun + xgrid question
When launching an MPI program with mpirun on an xgrid cluster, is there a way to cause the program being run to be temporarily copied to the compute nodes in the cluster when executed (i.e., similar to what the xgrid command line tool does)? Or is it necessary to make the program being run available on every compute node (e.g., using NFS data partitions)? L.G.
Re: [OMPI users] running jobs on a remote XGrid cluster via mpirun
Received from Brian Barrett on Tue, Aug 28, 2007 at 05:07:51PM EDT: > On Aug 28, 2007, at 10:59 AM, Lev Givon wrote: > > > Received from Brian Barrett on Tue, Aug 28, 2007 at 12:22:29PM EDT: > >> On Aug 27, 2007, at 3:14 PM, Lev Givon wrote: > >> > >>> I have OpenMPI 1.2.3 installed on an XGrid cluster and a separate > >>> Mac > >>> client that I am using to submit jobs to the head (controller) > >>> node of > >>> the cluster. The cluster's compute nodes are all connected to the > >>> head > >>> node via a private network and are not running any firewalls. When I > >>> try running jobs with mpirun directly on the cluster's head node, > >>> they > >>> execute successfully; if I attempt to submit the jobs from the > >>> client > >>> (which can run jobs on the cluster using the xgrid command line > >>> tool) > >>> with mpirun, however, they appear to hang indefinitely (i.e., a > >>> job ID > >>> is created, but the mpirun itself never returns or terminates). > >>> Is it > >>> nececessary to configure the firewall on the submission client to > >>> grant access to the cluster head node in order to remotely submit > >>> jobs > >>> to the cluster's head node? > >> > >> Currently, every node on which an MPI process is launched must be > >> able to open a connection to a random port on the machine running > >> mpirun. So in your case, you'd have to configure the network on the > >> cluster to be able to connect back to your workstation (and the > >> workstation would have to allow connections from all your cluster > >> nodes). Far from ideal, but it's what it is. > >> > >> Brian > > > > Can this be avoided by submitting the "mpirun -n 10 myProg" command > > directly to the controller node with the xgrid command line tool? For > > some reason, sending the above command to the cluster results in a > > "task: failed with status 255" error even though I can successfully > > run other programs or commands to the cluster with the xgrid tool. 
> > I know that OpenMPI on the cluster is running properly because I can run
> > programs with mpirun successfully when logged into the controller node
> > itself.
>
> Open MPI was designed to be the one calling XGrid's scheduling
> algorithm, so I'm pretty sure that you can't submit a job that just
> runs Open MPI's mpirun. That wasn't really in our original design
> space as an option.
>
> Brian

I see. Apart from employing some grid package with more features than Xgrid (e.g., perhaps Sun GridEngine), is anyone aware of a mechanism that would allow for the submission of MPI jobs to a cluster's head node from remote submit hosts without having to provide every user with an actual Unix account on the head node?

L.G.
Re: [OMPI users] OpenMPI and Port Range
Received from George Bosilca on Thu, Aug 30, 2007 at 07:42:52PM EDT: > I have a patch for this, but I never felt a real need for it, so I > never push it in the trunk. I'm not completely convinced that we need > it, except in some really strange situations (read grid). Why do you > need a port range ? For avoiding firewalls ? > >Thanks, > george. I imagine that allowing for more security-conscious firewall configurations would be the main motivation (although I suspect that there are more folks who run MPI on tightly coupled clusters linked by a secure/private network than on grids of machines spread across an insecure network). L.G.
Re: [OMPI users] OpenMPI and Port Range
Received from Simon Hammond on Thu, Aug 30, 2007 at 12:31:15PM EDT:
> Hi all,
>
> Is there any way to specify the ports that OpenMPI can use?
>
> I'm using a TCP/IP network in a closed environment; only certain ports
> can be used.
>
> Thanks,
>
> Si Hammond
> University of Warwick

I don't believe so. See http://www.open-mpi.org/community/lists/users/2006/02/0624.php

L.G.
Re: [OMPI users] running jobs on a remote XGrid cluster via mpirun
Received from Brian Barrett on Tue, Aug 28, 2007 at 12:22:29PM EDT: > On Aug 27, 2007, at 3:14 PM, Lev Givon wrote: > > > I have OpenMPI 1.2.3 installed on an XGrid cluster and a separate Mac > > client that I am using to submit jobs to the head (controller) node of > > the cluster. The cluster's compute nodes are all connected to the head > > node via a private network and are not running any firewalls. When I > > try running jobs with mpirun directly on the cluster's head node, they > > execute successfully; if I attempt to submit the jobs from the client > > (which can run jobs on the cluster using the xgrid command line tool) > > with mpirun, however, they appear to hang indefinitely (i.e., a job ID > > is created, but the mpirun itself never returns or terminates). Is it > > nececessary to configure the firewall on the submission client to > > grant access to the cluster head node in order to remotely submit jobs > > to the cluster's head node? > > Currently, every node on which an MPI process is launched must be > able to open a connection to a random port on the machine running > mpirun. So in your case, you'd have to configure the network on the > cluster to be able to connect back to your workstation (and the > workstation would have to allow connections from all your cluster > nodes). Far from ideal, but it's what it is. > > Brian Can this be avoided by submitting the "mpirun -n 10 myProg" command directly to the controller node with the xgrid command line tool? For some reason, sending the above command to the cluster results in a "task: failed with status 255" error even though I can successfully run other programs or commands to the cluster with the xgrid tool. I know that OpenMPI on the cluster is running properly because I can run programs with mpirun successfully when logged into the controller node itself. L.G.
[OMPI users] running jobs on a remote XGrid cluster via mpirun
I have OpenMPI 1.2.3 installed on an XGrid cluster and a separate Mac client that I am using to submit jobs to the head (controller) node of the cluster. The cluster's compute nodes are all connected to the head node via a private network and are not running any firewalls. When I try running jobs with mpirun directly on the cluster's head node, they execute successfully; if I attempt to submit the jobs from the client (which can run jobs on the cluster using the xgrid command line tool) with mpirun, however, they appear to hang indefinitely (i.e., a job ID is created, but the mpirun itself never returns or terminates). Is it necessary to configure the firewall on the submission client to grant access to the cluster head node in order to remotely submit jobs to the cluster's head node? L.G.
Re: [OMPI users] building static and shared OpenMPI libraries on MacOSX
Received from Brian Barrett on Wed, Aug 22, 2007 at 10:50:09AM EDT:
> On Aug 21, 2007, at 10:52 PM, Lev Givon wrote:
>
> > (Running ompi_info after installing the build confirms the absence of
> > said components). My concern, unsurprisingly, is motivated by a desire
> > to use OpenMPI on an xgrid cluster (i.e., not with rsh/ssh); unless I
> > am misconstruing the above observations, building OpenMPI with
> > --enable-static seems to preclude this. Should xgrid functionality
> > still be present when OpenMPI is built with --enable-static?
>
> Ah, yes. Due to some issues with our build system, you have to build
> shared libraries to use the XGrid support.
>
> Brian

It might be desirable to make a note of this in the FAQ (http://www.open-mpi.org/faq/?category=building#build-rte) or the package README.

Thanks,

L.G.
Re: [OMPI users] building static and shared OpenMPI libraries on MacOSX
Received from Brian Barrett on Wed, Aug 22, 2007 at 12:05:32AM EDT:
> On Aug 21, 2007, at 3:32 PM, Lev Givon wrote:
>
> > configure: WARNING: *** Shared libraries have been disabled (--disable-shared)
> > configure: WARNING: *** Building MCA components as DSOs automatically disabled
> > checking which components should be static... none
> > checking for projects containing MCA frameworks... opal, orte, ompi
> >
> > Specifying --enable-shared --enable-static results in the same
> > behavior, incidentally. Is the above to be expected?
>
> Yes, this is expected. This is just a warning that we build
> components into the library rather than as run-time loadable
> components when static libraries are enabled. This is probably not
> technically necessary on Linux and OS X, but in general is the
> easiest thing for us to do. So you should have a perfectly working
> build with this setup.
>
> Brian

When compiled with --enable-static, the resulting build does indeed work, but some of the components appear to be disabled because they apparently cannot be built statically. Of particular interest to me are pls:xgrid and ras:xgrid:

--- MCA component pls:xgrid (m4 configuration macro)
checking for MCA component pls:xgrid compile mode... static
checking if C and Objective C are link compatible... yes
checking for XGridFoundation Framework... yes
configure: WARNING: XGrid components must be built as DSOs. Disabling
checking if MCA component pls:xgrid can compile... no
...
--- MCA component ras:xgrid (m4 configuration macro)
checking for MCA component ras:xgrid compile mode... static
checking if C and Objective C are link compatible... (cached) yes
checking for XGridFoundation Framework... (cached) yes
configure: WARNING: XGrid components must be built as DSOs. Disabling
checking if MCA component ras:xgrid can compile... no

(Running ompi_info after installing the build confirms the absence of said components).
My concern, unsurprisingly, is motivated by a desire to use OpenMPI on an xgrid cluster (i.e., not with rsh/ssh); unless I am misconstruing the above observations, building OpenMPI with --enable-static seems to preclude this. Should xgrid functionality still be present when OpenMPI is built with --enable-static? L.G.
[OMPI users] building static and shared OpenMPI libraries on MacOSX
According to the OpenMPI FAQ, specifying the config option --enable-static without specifying --disable-shared should build both shared and static versions of the libraries. When I tried these options on MacOSX 10.4.10 with OpenMPI 1.2.3, however, the following lines in the config output seem to imply otherwise:

== Modular Component Architecture (MCA) setup
checking for subdir args... '--enable-static'
checking for gcc... gcc
checking whether we are using the GNU Objective C compiler... yes
checking dependency style of gcc... gcc3
checking which components should be disabled...
checking which components should be direct-linked into the library...
checking which components should be run-time loadable... none
configure: WARNING: *** Shared libraries have been disabled (--disable-shared)
configure: WARNING: *** Building MCA components as DSOs automatically disabled
checking which components should be static... none
checking for projects containing MCA frameworks... opal, orte, ompi

Specifying --enable-shared --enable-static results in the same behavior, incidentally. Is the above to be expected?

L.G.