Re: [OMPI users] libmpi_cxx
Hi Durga, This is only my interpretation, but they were never that appealing, nor very C++-like, and people mostly kept using the C interface. If you want to have a real C++ interface for MPI, have a look at Boost MPI (http://www.boost.org/doc/libs/1_64_0/doc/html/mpi.html ) If the C++ MPI bindings had been similar to Boost MPI, they would probably have been adopted more widely and may still be alive. -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Président - Comité de coordination du soutien à la recherche de Calcul Québec Team lead - Research Support National Team, Compute Canada Instructeur Software Carpentry Ph. D. en physique On 18-03-29 01:08, dpchoudh . wrote: Hello Gilles and all Sorry if this is a bit off topic, but I am curious as to why C++bindings were dropped? Any pointers would be appreciated. Best regards Durga $man why dump woman? man: too many arguments On Wed, Mar 28, 2018 at 11:43 PM, Gilles Gouaillardet wrote: Arthur, Try to configure --enable-mpi-xxx Note c++ bindings have been removed from the MPI standard long time ago, so you might want to consider modernizing your app. Cheers, Gilles "Arthur H. Edwards" wrote: I have built openmpi 3.0 on an ubuntu 16.04 system I have used --with-cuda. There is no libmpi_cxx.so generated, yet the code I'm using requires it. There is a libmpi_cxx.so in the ubuntu installed version. Any insight, or instruction on how to configure so that the build generates this library would be greatly appreciated. Art Edwards -- Arthur H. Edwards edwards...@fastmail.fm ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Running mpi with different account
Le 2015-04-13 09:54, Ralph Castain a écrit : On Apr 13, 2015, at 6:52 AM, Maxime Boissonneault wrote: Just out of curiosity... how will OpenMPI start processes under different accounts ? Through SSH while specifying different user names ? I am assuming that no resource manager or scheduler will allow this. I’m assuming he just plans to run the entire job as the other user. Essentially, it would be the same as if his friend ran the job for him. From this comment : My problem is that my account is limited to use 4 machines (I need more machines to process data). I can borrow my friend's account and thus have access to another 4 machines but I am not sure whether it works. I assumed that he wants to run the job under _both_ accounts at the same time. My recommendation would be to contact your sysadmin and ask for an exception instead of going through with this insanity (forgive the judgement here). Agreed! Maxime Le 2015-04-13 09:47, Ralph Castain a écrit : Let’s hope you sys admin doesn’t find out about it - they tend to take a dim view of sharing accounts! So long as the path and library path are set correctly, we won’t care. On Apr 12, 2015, at 10:33 PM, XingFENG wrote: Hi all, I am wondering if it is possible that MPI programs can be run on machines with different account? I am doing experiments with some MPI programs on a cluster. My problem is that my account is limited to use 4 machines (I need more machines to process data). I can borrow my friend's account and thus have access to another 4 machines but I am not sure whether it works. -- Best Regards. --- Xing FENG PhD Candidate Database Research Group School of Computer Science and Engineering University of New South Wales NSW 2052, Sydney Phone: (+61) 413 857 288 ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26687.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26690.php -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26691.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26692.php -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Running mpi with different account
Just out of curiosity... how will OpenMPI start processes under different accounts ? Through SSH while specifying different user names ? I am assuming that no resource manager or scheduler will allow this. My recommendation would be to contact your sysadmin and ask for an exception instead of going through with this insanity (forgive the judgement here). Maxime Le 2015-04-13 09:47, Ralph Castain a écrit : Let’s hope you sys admin doesn’t find out about it - they tend to take a dim view of sharing accounts! So long as the path and library path are set correctly, we won’t care. On Apr 12, 2015, at 10:33 PM, XingFENG wrote: Hi all, I am wondering if it is possible that MPI programs can be run on machines with different account? I am doing experiments with some MPI programs on a cluster. My problem is that my account is limited to use 4 machines (I need more machines to process data). I can borrow my friend's account and thus have access to another 4 machines but I am not sure whether it works. -- Best Regards. --- Xing FENG PhD Candidate Database Research Group School of Computer Science and Engineering University of New South Wales NSW 2052, Sydney Phone: (+61) 413 857 288 ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26687.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2015/04/26690.php -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Compiling OpenMPI 1.8.3 with PGI 14.9
I figured it out. It seems like setting CPP to pgprepro isn't the right variable. Thanks, Maxime Le 2014-10-03 10:39, Maxime Boissonneault a écrit : Hi, I am trying to compile OpenMPI 1.8.3 with PGI 14.9 I am getting a severe errors here : 1956 PGC-S-0039-Use of undeclared variable INT64_T (ompi_datatype_module.c: 278) 1957 PGC-S-0039-Use of undeclared variable AINT (ompi_datatype_module.c: 278) 1958 PGC-S-0074-Non-constant expression in initializer (ompi_datatype_module.c: 278) 1959 PGC-W-0093-Type cast required for this conversion of constant (ompi_datatype_module.c: 278) 1960 PGC/x86-64 Linux 14.9-0: compilation completed with severe errors 1961 make[2]: *** [ompi_datatype_module.lo] Erreur 1 Any idea what might be going on ? Attached is the output of my configure and make lines. Thanks,
[OMPI users] Compiling OpenMPI 1.8.3 with PGI 14.9
Hi, I am trying to compile OpenMPI 1.8.3 with PGI 14.9 I am getting a severe errors here : 1956 PGC-S-0039-Use of undeclared variable INT64_T (ompi_datatype_module.c: 278) 1957 PGC-S-0039-Use of undeclared variable AINT (ompi_datatype_module.c: 278) 1958 PGC-S-0074-Non-constant expression in initializer (ompi_datatype_module.c: 278) 1959 PGC-W-0093-Type cast required for this conversion of constant (ompi_datatype_module.c: 278) 1960 PGC/x86-64 Linux 14.9-0: compilation completed with severe errors 1961 make[2]: *** [ompi_datatype_module.lo] Erreur 1 Any idea what might be going on ? Attached is the output of my configure and make lines. Thanks, -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique config-make.log.tar.gz Description: GNU Zip compressed data
Re: [OMPI users] Strange affinity messages with 1.8 and torque 5
Do you know the topology of the cores allocated by Torque (i.e. were they all on the same nodes, or 8 per node, or a heterogenous distribution for example ?) Le 2014-09-23 15:05, Brock Palen a écrit : Yes the request to torque was procs=64, We are using cpusets. the mpirun without -np 64 creates 64 spawned hostnames. Brock Palen www.umich.edu/~brockp CAEN Advanced Computing XSEDE Campus Champion bro...@umich.edu (734)936-1985 On Sep 23, 2014, at 3:02 PM, Ralph Castain wrote: FWIW: that warning has been removed from the upcoming 1.8.3 release On Sep 23, 2014, at 11:45 AM, Reuti wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Am 23.09.2014 um 19:53 schrieb Brock Palen: I found a fun head scratcher, with openmpi 1.8.2 with torque 5 built with TM support, on hereto core layouts I get the fun thing: mpirun -report-bindings hostname< Works And you get 64 lines of output? mpirun -report-bindings -np 64 hostname <- Wat? -- A request was made to bind to that would result in binding more processes than cpus on a resource: Bind to: CORE Node:nyx5518 #processes: 2 #cpus: 1 You can override this protection by adding the "overload-allowed" option to your binding directive. -- How many cores are physically installed on this machine - two as mentioned above? - -- Reuti I ran with --oversubscribed and got the expected host list, which matched $PBS_NODEFILE and was 64 entires long: mpirun -overload-allowed -report-bindings -np 64 --oversubscribe hostname What did I do wrong? I'm stumped why one works one doesn't but the one that doesn't if your force it appears correct. Brock Palen www.umich.edu/~brockp CAEN Advanced Computing XSEDE Campus Champion bro...@umich.edu (734)936-1985 ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/09/25375.php -BEGIN PGP SIGNATURE- Version: GnuPG/MacGPG2 v2.0.20 (Darwin) Comment: GPGTools - http://gpgtools.org iEYEARECAAYFAlQhv7IACgkQo/GbGkBRnRr3HgCgjZoD9l9a+WThl5CDaGF1jawx PWIAmwWnZwQdytNgAJgbir6V7yCyBt5D =NG0H -END PGP SIGNATURE- ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/09/25376.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/09/25378.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/09/25379.php -- --------- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Strange affinity messages with 1.8 and torque 5
Hi, Just an idea here. Do you use cpusets within Torque ? Did you request enough cores to torque ? Maxime Boissonneault Le 2014-09-23 13:53, Brock Palen a écrit : I found a fun head scratcher, with openmpi 1.8.2 with torque 5 built with TM support, on hereto core layouts I get the fun thing: mpirun -report-bindings hostname< Works mpirun -report-bindings -np 64 hostname <- Wat? -- A request was made to bind to that would result in binding more processes than cpus on a resource: Bind to: CORE Node:nyx5518 #processes: 2 #cpus: 1 You can override this protection by adding the "overload-allowed" option to your binding directive. -- I ran with --oversubscribed and got the expected host list, which matched $PBS_NODEFILE and was 64 entires long: mpirun -overload-allowed -report-bindings -np 64 --oversubscribe hostname What did I do wrong? I'm stumped why one works one doesn't but the one that doesn't if your force it appears correct. Brock Palen www.umich.edu/~brockp CAEN Advanced Computing XSEDE Campus Champion bro...@umich.edu (734)936-1985 ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/09/25375.php -- --------- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] about using mpi-thread-multiple
Hi, You need to compile OpenMPI with --enable-mpi-thread-multiple. However, OpenMPI used to have problem with that level of threading. Is that still the case in the 1.8 series ? I know in 1.6 series, that was a no go. It caused all sorts of hanging in the openib BTL. If the problems are not solved in the 1.8 series and you really need that level of threading, you may want to take a look to mvapich2 which I believe supports thread multiple. Maxime Le 2014-09-12 14:43, etcamargo a écrit : Hi, I would like to know what is the mpi version recomended for running multiple mpi call per process, i.e., MPI_THREAD_MULTIPLE in MPI_Init_thread(); Thanks, Edson ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/09/25326.php
Re: [OMPI users] Weird error with OMPI 1.6.3
It is still there in 1.6.5 (we also have it). I am just wondering if there is something wrong in our installation that makes MPI unabled to detect that there are two sockets per node if we do not include a npernode directive. Maxime Le 2014-08-29 12:31, Ralph Castain a écrit : No, it isn't - but we aren't really maintaining the 1.6 series any more. You might try updating to 1.6.5 and see if it remains there On Aug 29, 2014, at 9:12 AM, Maxime Boissonneault <mailto:maxime.boissonnea...@calculquebec.ca>> wrote: It looks like -npersocket 1 cannot be used alone. If I do mpiexec -npernode 2 -npersocket 1 ls -la then I get no error message. Is this expected behavior ? Maxime Le 2014-08-29 11:53, Maxime Boissonneault a écrit : Hi, I am having a weird error with OpenMPI 1.6.3. I run a non-MPI command just to exclude any code error. Here is the error I get (I run with set -x to get the exact command that are run). ++ mpiexec -npersocket 1 ls -la -- The requested stdin target is out of range for this job - it points to a process rank that is greater than the number of processes in the job. Specified target: 0 Number of procs: 0 This could be caused by specifying a negative number for the stdin target, or by mistyping the desired rank. Remember that MPI ranks begin with 0, not 1. Please correct the cmd line and try again. How can I debug that ? Thanks, -- --------- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique ___ users mailing list us...@open-mpi.org <mailto:us...@open-mpi.org> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post:http://www.open-mpi.org/community/lists/users/2014/08/25190.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25191.php -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Weird error with OMPI 1.6.3
It looks like -npersocket 1 cannot be used alone. If I do mpiexec -npernode 2 -npersocket 1 ls -la then I get no error message. Is this expected behavior ? Maxime Le 2014-08-29 11:53, Maxime Boissonneault a écrit : Hi, I am having a weird error with OpenMPI 1.6.3. I run a non-MPI command just to exclude any code error. Here is the error I get (I run with set -x to get the exact command that are run). ++ mpiexec -npersocket 1 ls -la -- The requested stdin target is out of range for this job - it points to a process rank that is greater than the number of processes in the job. Specified target: 0 Number of procs: 0 This could be caused by specifying a negative number for the stdin target, or by mistyping the desired rank. Remember that MPI ranks begin with 0, not 1. Please correct the cmd line and try again. How can I debug that ? Thanks, -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
[OMPI users] Weird error with OMPI 1.6.3
Hi, I am having a weird error with OpenMPI 1.6.3. I run a non-MPI command just to exclude any code error. Here is the error I get (I run with set -x to get the exact command that are run). ++ mpiexec -npersocket 1 ls -la -- The requested stdin target is out of range for this job - it points to a process rank that is greater than the number of processes in the job. Specified target: 0 Number of procs: 0 This could be caused by specifying a negative number for the stdin target, or by mistyping the desired rank. Remember that MPI ranks begin with 0, not 1. Please correct the cmd line and try again. How can I debug that ? Thanks, -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
[OMPI users] OpenMPI 1.8.1 to 1.8.2rc4
Hi, Would you say that softwares compiled using OpenMPI 1.8.1 need to be recompiled using OpenMPI 1.8.2rc4 to work properly ? Maxime
Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
I am also filing a bug at Adaptive Computing since, while I do set CUDA_VISIBLE_DEVICES myself, the default value set by Torque in that case is also wrong. Maxime Le 2014-08-19 10:47, Rolf vandeVaart a écrit : Glad it was solved. I will submit a bug at NVIDIA as that does not seem like a very friendly way to handle that error. -Original Message- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault Sent: Tuesday, August 19, 2014 10:39 AM To: Open MPI Users Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes Hi, I believe I found what the problem was. My script set the CUDA_VISIBLE_DEVICES based on the content of $PBS_GPUFILE. Since the GPUs were listed twice in the $PBS_GPUFILE because of the two nodes, I had CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7 instead of CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 Sorry for the false bug and thanks for directing me toward the solution. Maxime Le 2014-08-19 09:15, Rolf vandeVaart a écrit : Hi: This problem does not appear to have anything to do with MPI. We are getting a SEGV during the initial call into the CUDA driver. Can you log on to gpu-k20-08, compile your simple program without MPI, and run it there? Also, maybe run dmesg on gpu-k20-08 and see if there is anything in the log? Also, does your program run if you just run it on gpu-k20-07? Can you include the output from nvidia-smi on each node? Thanks, Rolf -Original Message- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault Sent: Tuesday, August 19, 2014 8:55 AM To: Open MPI Users Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes Hi, I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not give me much more information. [mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug Prefix: /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda Internal debug support: yes Memory debugging support: no Is there something I need to do at run time to get more information out of it ? [mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple [gpu-k20-08:46045] *** Process received signal *** [gpu-k20-08:46045] Signal: Segmentation fault (11) [gpu-k20-08:46045] Signal code: Address not mapped (1) [gpu-k20-08:46045] Failing at address: 0x8 [gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710] [gpu-k20-08:46045] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf] [gpu-k20-08:46045] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83] [gpu-k20-08:46045] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da] [gpu-k20-08:46045] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933] [gpu-k20-08:46045] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df 4965] [gpu-k20-08:46045] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df 4a0a] [gpu-k20-08:46045] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df 4a3b] [gpu-k20-08:46045] [ 8] /software- gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0 f647] [gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae] [gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal *** [gpu-k20-07:61816] Signal: Segmentation fault (11) [gpu-k20-07:61816] Signal code: Address not mapped (1) [gpu-k20-07:61816] Failing at address: 0x8 [gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710] [gpu-k20-07:61816] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf] [gpu-k20-07:61816] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83] [gpu-k20-07:61816] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da] [gpu-k20-07:61816] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933] [gpu-k20-07:61816] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b 6965] [gpu-k20-07:61816] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b 6a0a] [gpu-k20-07:61816] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b 6a3b] [gpu-k20-07:61816] [ 8] /software- gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d 1647 ] [gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae] [gpu-k20-07:61816] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d] [gpu-k20-07:61816] [11] cudampi_simple[0x400699] [gpu-k20-07:61816] *** End of error message *** /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d] [gpu-k20-08:46045] [11] cudampi_simple[0x400699] [gpu-k20-08:46045] *** End of error message *** - - mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08 exited on signal 11 (Segmentation fault). - - Thanks, Maxime Le 2014-08-18 16:45, Rolf vandeVaart a écrit : Just to help reduce the scope
Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Hi, I believe I found what the problem was. My script set the CUDA_VISIBLE_DEVICES based on the content of $PBS_GPUFILE. Since the GPUs were listed twice in the $PBS_GPUFILE because of the two nodes, I had CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7 instead of CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 Sorry for the false bug and thanks for directing me toward the solution. Maxime Le 2014-08-19 09:15, Rolf vandeVaart a écrit : Hi: This problem does not appear to have anything to do with MPI. We are getting a SEGV during the initial call into the CUDA driver. Can you log on to gpu-k20-08, compile your simple program without MPI, and run it there? Also, maybe run dmesg on gpu-k20-08 and see if there is anything in the log? Also, does your program run if you just run it on gpu-k20-07? Can you include the output from nvidia-smi on each node? Thanks, Rolf -Original Message- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault Sent: Tuesday, August 19, 2014 8:55 AM To: Open MPI Users Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes Hi, I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not give me much more information. [mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug Prefix: /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda Internal debug support: yes Memory debugging support: no Is there something I need to do at run time to get more information out of it ? [mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple [gpu-k20-08:46045] *** Process received signal *** [gpu-k20-08:46045] Signal: Segmentation fault (11) [gpu-k20-08:46045] Signal code: Address not mapped (1) [gpu-k20-08:46045] Failing at address: 0x8 [gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710] [gpu-k20-08:46045] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf] [gpu-k20-08:46045] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83] [gpu-k20-08:46045] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da] [gpu-k20-08:46045] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933] [gpu-k20-08:46045] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965] [gpu-k20-08:46045] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a] [gpu-k20-08:46045] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b] [gpu-k20-08:46045] [ 8] /software- gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647] [gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae] [gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal *** [gpu-k20-07:61816] Signal: Segmentation fault (11) [gpu-k20-07:61816] Signal code: Address not mapped (1) [gpu-k20-07:61816] Failing at address: 0x8 [gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710] [gpu-k20-07:61816] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf] [gpu-k20-07:61816] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83] [gpu-k20-07:61816] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da] [gpu-k20-07:61816] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933] [gpu-k20-07:61816] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965] [gpu-k20-07:61816] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a] [gpu-k20-07:61816] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b] [gpu-k20-07:61816] [ 8] /software- gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647 ] [gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae] [gpu-k20-07:61816] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d] [gpu-k20-07:61816] [11] cudampi_simple[0x400699] [gpu-k20-07:61816] *** End of error message *** /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d] [gpu-k20-08:46045] [11] cudampi_simple[0x400699] [gpu-k20-08:46045] *** End of error message *** -- mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08 exited on signal 11 (Segmentation fault). -- Thanks, Maxime Le 2014-08-18 16:45, Rolf vandeVaart a écrit : Just to help reduce the scope of the problem, can you retest with a non CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace? -Original Message- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault Sent: Monday, August 18, 2014 4:23 PM To: Open MPI Users Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes Hi, Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and
Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Hi, I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not give me much more information. [mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug Prefix: /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda Internal debug support: yes Memory debugging support: no Is there something I need to do at run time to get more information out of it ? [mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple [gpu-k20-08:46045] *** Process received signal *** [gpu-k20-08:46045] Signal: Segmentation fault (11) [gpu-k20-08:46045] Signal code: Address not mapped (1) [gpu-k20-08:46045] Failing at address: 0x8 [gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710] [gpu-k20-08:46045] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf] [gpu-k20-08:46045] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83] [gpu-k20-08:46045] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da] [gpu-k20-08:46045] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933] [gpu-k20-08:46045] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965] [gpu-k20-08:46045] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a] [gpu-k20-08:46045] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b] [gpu-k20-08:46045] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647] [gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae] [gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal *** [gpu-k20-07:61816] Signal: Segmentation fault (11) [gpu-k20-07:61816] Signal code: Address not mapped (1) [gpu-k20-07:61816] Failing at address: 0x8 [gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710] [gpu-k20-07:61816] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf] [gpu-k20-07:61816] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83] [gpu-k20-07:61816] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da] [gpu-k20-07:61816] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933] [gpu-k20-07:61816] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965] [gpu-k20-07:61816] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a] [gpu-k20-07:61816] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b] [gpu-k20-07:61816] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647] [gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae] [gpu-k20-07:61816] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d] [gpu-k20-07:61816] [11] cudampi_simple[0x400699] [gpu-k20-07:61816] *** End of error message *** /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d] [gpu-k20-08:46045] [11] cudampi_simple[0x400699] [gpu-k20-08:46045] *** End of error message *** -- mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08 exited on signal 11 (Segmentation fault). -- Thanks, Maxime Le 2014-08-18 16:45, Rolf vandeVaart a écrit : Just to help reduce the scope of the problem, can you retest with a non CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace? -Original Message- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault Sent: Monday, August 18, 2014 4:23 PM To: Open MPI Users Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes Hi, Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simple one. I reduced the code to the minimal that would reproduce the bug. I have pasted it here : http://pastebin.com/1uAK4Z8R Basically, it is a program that initializes MPI and cudaMalloc memory, and then free memory and finalize MPI. Nothing else. When I compile and run this on a single node, everything works fine. When I compile and run this on more than one node, I get the following stack trace : [gpu-k20-07:40041] *** Process received signal *** [gpu-k20-07:40041] Signal: Segmentation fault (11) [gpu-k20-07:40041] Signal code: Address not mapped (1) [gpu-k20-07:40041] Failing at address: 0x8 [gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710] [gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf] [gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83] [gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da] [gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933] [gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965
Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Indeed, there were those to problems. I took the code from here and simplified it. http://cudamusing.blogspot.ca/2011/08/cuda-mpi-and-infiniband.html However, even with the modified code here http://pastebin.com/ax6g10GZ The symptoms are still the same. Maxime Le 2014-08-19 07:59, Alex A. Granovsky a écrit : Also you need to check return code from cudaMalloc before calling cudaFree - the pointer may be invalid as you did not initialized cuda properly. Alex -Original Message- From: Maxime Boissonneault Sent: Tuesday, August 19, 2014 2:19 AM To: Open MPI Users Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes Same thing : [mboisson@gpu-k20-07 simple_cuda_mpi]$ export MALLOC_CHECK_=1 [mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple malloc: using debugging hooks malloc: using debugging hooks [gpu-k20-07:47628] *** Process received signal *** [gpu-k20-07:47628] Signal: Segmentation fault (11) [gpu-k20-07:47628] Signal code: Address not mapped (1) [gpu-k20-07:47628] Failing at address: 0x8 [gpu-k20-07:47628] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b14cf850710] [gpu-k20-07:47628] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2b14d4e9facf] [gpu-k20-07:47628] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2b14d4e65a83] [gpu-k20-07:47628] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2b14d4d972da] [gpu-k20-07:47628] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2b14d4d83933] [gpu-k20-07:47628] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b14cf0cf965] [gpu-k20-07:47628] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b14cf0cfa0a] [gpu-k20-07:47628] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b14cf0cfa3b] [gpu-k20-07:47628] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2b14cf0f0532] [gpu-k20-07:47628] [ 9] cudampi_simple[0x4007b5] [gpu-k20-07:47628] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b14cfa7cd1d] [gpu-k20-07:47628] [11] cudampi_simple[0x400699] [gpu-k20-07:47628] *** End of error message *** ... (same segfault from the other node) Maxime Le 2014-08-18 16:52, Alex A. Granovsky a écrit : Try the following: export MALLOC_CHECK_=1 and then run it again Kind regards, Alex Granovsky -Original Message- From: Maxime Boissonneault Sent: Tuesday, August 19, 2014 12:23 AM To: Open MPI Users Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes Hi, Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simple one. I reduced the code to the minimal that would reproduce the bug. I have pasted it here : http://pastebin.com/1uAK4Z8R Basically, it is a program that initializes MPI and cudaMalloc memory, and then free memory and finalize MPI. Nothing else. When I compile and run this on a single node, everything works fine. When I compile and run this on more than one node, I get the following stack trace : [gpu-k20-07:40041] *** Process received signal *** [gpu-k20-07:40041] Signal: Segmentation fault (11) [gpu-k20-07:40041] Signal code: Address not mapped (1) [gpu-k20-07:40041] Failing at address: 0x8 [gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710] [gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf] [gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83] [gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da] [gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933] [gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965] [gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a] [gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b] [gpu-k20-07:40041] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532] [gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5] [gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d] [gpu-k20-07:40041] [11] cudampi_simple[0x400699] [gpu-k20-07:40041] *** End of error message *** The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware) or OpenMPI 1.8.1 (cuda aware). I know this is more than likely a problem with Cuda than with OpenMPI (since it does the same for two different versions), but I figured I would ask here if somebody has a clue of what might be going on. I have yet to be able to fill a bug report on NVidia's website for Cuda. Thanks, -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
It's building... to be continued tomorrow morning. Le 2014-08-18 16:45, Rolf vandeVaart a écrit : Just to help reduce the scope of the problem, can you retest with a non CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace? -Original Message- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault Sent: Monday, August 18, 2014 4:23 PM To: Open MPI Users Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes Hi, Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simple one. I reduced the code to the minimal that would reproduce the bug. I have pasted it here : http://pastebin.com/1uAK4Z8R Basically, it is a program that initializes MPI and cudaMalloc memory, and then free memory and finalize MPI. Nothing else. When I compile and run this on a single node, everything works fine. When I compile and run this on more than one node, I get the following stack trace : [gpu-k20-07:40041] *** Process received signal *** [gpu-k20-07:40041] Signal: Segmentation fault (11) [gpu-k20-07:40041] Signal code: Address not mapped (1) [gpu-k20-07:40041] Failing at address: 0x8 [gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710] [gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf] [gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83] [gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da] [gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933] [gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965] [gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a] [gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b] [gpu-k20-07:40041] [ 8] /software- gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532] [gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5] [gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d] [gpu-k20-07:40041] [11] cudampi_simple[0x400699] [gpu-k20-07:40041] *** End of error message *** The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware) or OpenMPI 1.8.1 (cuda aware). I know this is more than likely a problem with Cuda than with OpenMPI (since it does the same for two different versions), but I figured I would ask here if somebody has a clue of what might be going on. I have yet to be able to fill a bug report on NVidia's website for Cuda. Thanks, -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open- mpi.org/community/lists/users/2014/08/25064.php --- This email message is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message. --- ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25065.php -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Same thing : [mboisson@gpu-k20-07 simple_cuda_mpi]$ export MALLOC_CHECK_=1 [mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple malloc: using debugging hooks malloc: using debugging hooks [gpu-k20-07:47628] *** Process received signal *** [gpu-k20-07:47628] Signal: Segmentation fault (11) [gpu-k20-07:47628] Signal code: Address not mapped (1) [gpu-k20-07:47628] Failing at address: 0x8 [gpu-k20-07:47628] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b14cf850710] [gpu-k20-07:47628] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2b14d4e9facf] [gpu-k20-07:47628] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2b14d4e65a83] [gpu-k20-07:47628] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2b14d4d972da] [gpu-k20-07:47628] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2b14d4d83933] [gpu-k20-07:47628] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b14cf0cf965] [gpu-k20-07:47628] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b14cf0cfa0a] [gpu-k20-07:47628] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b14cf0cfa3b] [gpu-k20-07:47628] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2b14cf0f0532] [gpu-k20-07:47628] [ 9] cudampi_simple[0x4007b5] [gpu-k20-07:47628] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b14cfa7cd1d] [gpu-k20-07:47628] [11] cudampi_simple[0x400699] [gpu-k20-07:47628] *** End of error message *** ... (same segfault from the other node) Maxime Le 2014-08-18 16:52, Alex A. Granovsky a écrit : Try the following: export MALLOC_CHECK_=1 and then run it again Kind regards, Alex Granovsky -Original Message- From: Maxime Boissonneault Sent: Tuesday, August 19, 2014 12:23 AM To: Open MPI Users Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes Hi, Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simple one. I reduced the code to the minimal that would reproduce the bug. I have pasted it here : http://pastebin.com/1uAK4Z8R Basically, it is a program that initializes MPI and cudaMalloc memory, and then free memory and finalize MPI. Nothing else. When I compile and run this on a single node, everything works fine. When I compile and run this on more than one node, I get the following stack trace : [gpu-k20-07:40041] *** Process received signal *** [gpu-k20-07:40041] Signal: Segmentation fault (11) [gpu-k20-07:40041] Signal code: Address not mapped (1) [gpu-k20-07:40041] Failing at address: 0x8 [gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710] [gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf] [gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83] [gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da] [gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933] [gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965] [gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a] [gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b] [gpu-k20-07:40041] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532] [gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5] [gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d] [gpu-k20-07:40041] [11] cudampi_simple[0x400699] [gpu-k20-07:40041] *** End of error message *** The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware) or OpenMPI 1.8.1 (cuda aware). I know this is more than likely a problem with Cuda than with OpenMPI (since it does the same for two different versions), but I figured I would ask here if somebody has a clue of what might be going on. I have yet to be able to fill a bug report on NVidia's website for Cuda. Thanks, -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
[OMPI users] Segfault with MPI + Cuda on multiple nodes
Hi, Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simple one. I reduced the code to the minimal that would reproduce the bug. I have pasted it here : http://pastebin.com/1uAK4Z8R Basically, it is a program that initializes MPI and cudaMalloc memory, and then free memory and finalize MPI. Nothing else. When I compile and run this on a single node, everything works fine. When I compile and run this on more than one node, I get the following stack trace : [gpu-k20-07:40041] *** Process received signal *** [gpu-k20-07:40041] Signal: Segmentation fault (11) [gpu-k20-07:40041] Signal code: Address not mapped (1) [gpu-k20-07:40041] Failing at address: 0x8 [gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710] [gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf] [gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83] [gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da] [gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933] [gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965] [gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a] [gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b] [gpu-k20-07:40041] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532] [gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5] [gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d] [gpu-k20-07:40041] [11] cudampi_simple[0x400699] [gpu-k20-07:40041] *** End of error message *** The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware) or OpenMPI 1.8.1 (cuda aware). I know this is more than likely a problem with Cuda than with OpenMPI (since it does the same for two different versions), but I figured I would ask here if somebody has a clue of what might be going on. I have yet to be able to fill a bug report on NVidia's website for Cuda. Thanks, -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Ok, I confirm that with mpiexec -mca oob_tcp_if_include lo ring_c it works. It also works with mpiexec -mca oob_tcp_if_include ib0 ring_c We have 4 interfaces on this node. - lo, the local loop - ib0, infiniband - eth2, a management network - eth3, the public network It seems that mpiexec attempts to use the two addresses that do not work (eth2, eth3) and does not use the two that do work (ib0 and lo). However, according to the logs sent previously, it does see ib0 (despite not seeing lo), but does not attempt to use it. On the compute nodes, we have eth0 (management), ib0 and lo, and it works. I am unsure why it does work on the compute nodes and not on the login nodes. The only difference is the presence of a public interface on the login node. Maxime Le 2014-08-18 13:37, Ralph Castain a écrit : Yeah, there are some issues with the internal connection logic that need to get fixed. We haven't had many cases where it's been an issue, but a couple like this have cropped up - enough that I need to set aside some time to fix it. My apologies for the problem. On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault wrote: Indeed, that makes sense now. Why isn't OpenMPI attempting to connect with the local loop for same node ? This used to work with 1.6.5. Maxime Le 2014-08-18 13:11, Ralph Castain a écrit : Yep, that pinpointed the problem: [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11 [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Connection refused (111) [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 state CONNECTING [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer [[63019,0],0] The apps are trying to connect back to mpirun using the following addresses: tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237 The initial attempt is here [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries I know there is a failover bug in the 1.8 series, and so if that connection got rejected the proc would abort. Should we be using a different network? If so, telling us via the oob_tcp_if_include param would be the solution. On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault wrote: Here it is. Maxime Le 2014-08-18 12:59, Ralph Castain a écrit : Ah...now that showed the problem. To pinpoint it better, please add -mca oob_base_verbose 10 and I think we'll have it On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote: This is all one one node indeed. Attached is the output of mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt Maxime Le 2014-08-18 12:48, Ralph Castain a écrit : This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lot of garbage, but should tell us what is going on. On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: Here it is Le 2014-08-18 12:30, Joshua Ladd a écrit : mpirun -np 4 --mca plm_base_verbose 10 [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c [helios-login1:27853] mca: base: components_register: registering plm components [helios-login1:27853] mca: base: components_register: found loaded component isolated [helios-login1:27853] mca: base: components_register: component isolated has no register or open function [helios-login1:27853] mca: base: components_register: found loaded component rsh [helios-login1:27853] mca: base: components_register: component rsh register function successful [helios-login1:27853] mca: base: components_register: found loaded component tm [helios-login1:27853] mca: base: components_register: component tm register function successful [helios-login1:27853] mca: base: components_open: opening plm components [helios-login1:27853] mca: base: components_open: found loaded component isolated [helios-login1:27853] mca: base: components_open: component isolated open function successful [helios-login1:27853] mca: base: components_open: found loaded component rsh [helios-login1:27853] mca: base: components_open: component rsh open function successful [helios-login1:27853] mca: base: components_open: found loaded component tm [helios-login1:27853] mca: base: components_open: component tm open function successful [helios-login1:27853] mca:base:select: Auto-selecting plm components [helios-login1:27853] mca:base:select:( plm) Querying component [isolated] [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0 [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] [helios-login1:27853] mca:base:select:( plm) Quer
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Indeed, that makes sense now. Why isn't OpenMPI attempting to connect with the local loop for same node ? This used to work with 1.6.5. Maxime Le 2014-08-18 13:11, Ralph Castain a écrit : Yep, that pinpointed the problem: [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11 [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Connection refused (111) [helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 state CONNECTING [helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer [[63019,0],0] The apps are trying to connect back to mpirun using the following addresses: tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237 The initial attempt is here [helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries I know there is a failover bug in the 1.8 series, and so if that connection got rejected the proc would abort. Should we be using a different network? If so, telling us via the oob_tcp_if_include param would be the solution. On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault wrote: Here it is. Maxime Le 2014-08-18 12:59, Ralph Castain a écrit : Ah...now that showed the problem. To pinpoint it better, please add -mca oob_base_verbose 10 and I think we'll have it On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote: This is all one one node indeed. Attached is the output of mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt Maxime Le 2014-08-18 12:48, Ralph Castain a écrit : This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lot of garbage, but should tell us what is going on. On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: Here it is Le 2014-08-18 12:30, Joshua Ladd a écrit : mpirun -np 4 --mca plm_base_verbose 10 [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c [helios-login1:27853] mca: base: components_register: registering plm components [helios-login1:27853] mca: base: components_register: found loaded component isolated [helios-login1:27853] mca: base: components_register: component isolated has no register or open function [helios-login1:27853] mca: base: components_register: found loaded component rsh [helios-login1:27853] mca: base: components_register: component rsh register function successful [helios-login1:27853] mca: base: components_register: found loaded component tm [helios-login1:27853] mca: base: components_register: component tm register function successful [helios-login1:27853] mca: base: components_open: opening plm components [helios-login1:27853] mca: base: components_open: found loaded component isolated [helios-login1:27853] mca: base: components_open: component isolated open function successful [helios-login1:27853] mca: base: components_open: found loaded component rsh [helios-login1:27853] mca: base: components_open: component rsh open function successful [helios-login1:27853] mca: base: components_open: found loaded component tm [helios-login1:27853] mca: base: components_open: component tm open function successful [helios-login1:27853] mca:base:select: Auto-selecting plm components [helios-login1:27853] mca:base:select:( plm) Querying component [isolated] [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0 [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set priority to 10 [helios-login1:27853] mca:base:select:( plm) Querying component [tm] [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module [helios-login1:27853] mca:base:select:( plm) Selected component [rsh] [helios-login1:27853] mca: base: close: component isolated closed [helios-login1:27853] mca: base: close: unloading component isolated [helios-login1:27853] mca: base: close: component tm closed [helios-login1:27853] mca: base: close: unloading component tm [helios-login1:27853] mca: base: close: component rsh closed [helios-login1:27853] mca: base: close: unloading component rsh [mboisson@helios-login1 examples]$ echo $? 65 Maxime ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25052.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Here it is. Maxime Le 2014-08-18 12:59, Ralph Castain a écrit : Ah...now that showed the problem. To pinpoint it better, please add -mca oob_base_verbose 10 and I think we'll have it On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote: This is all one one node indeed. Attached is the output of mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt Maxime Le 2014-08-18 12:48, Ralph Castain a écrit : This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lot of garbage, but should tell us what is going on. On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: Here it is Le 2014-08-18 12:30, Joshua Ladd a écrit : mpirun -np 4 --mca plm_base_verbose 10 [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c [helios-login1:27853] mca: base: components_register: registering plm components [helios-login1:27853] mca: base: components_register: found loaded component isolated [helios-login1:27853] mca: base: components_register: component isolated has no register or open function [helios-login1:27853] mca: base: components_register: found loaded component rsh [helios-login1:27853] mca: base: components_register: component rsh register function successful [helios-login1:27853] mca: base: components_register: found loaded component tm [helios-login1:27853] mca: base: components_register: component tm register function successful [helios-login1:27853] mca: base: components_open: opening plm components [helios-login1:27853] mca: base: components_open: found loaded component isolated [helios-login1:27853] mca: base: components_open: component isolated open function successful [helios-login1:27853] mca: base: components_open: found loaded component rsh [helios-login1:27853] mca: base: components_open: component rsh open function successful [helios-login1:27853] mca: base: components_open: found loaded component tm [helios-login1:27853] mca: base: components_open: component tm open function successful [helios-login1:27853] mca:base:select: Auto-selecting plm components [helios-login1:27853] mca:base:select:( plm) Querying component [isolated] [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0 [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set priority to 10 [helios-login1:27853] mca:base:select:( plm) Querying component [tm] [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module [helios-login1:27853] mca:base:select:( plm) Selected component [rsh] [helios-login1:27853] mca: base: close: component isolated closed [helios-login1:27853] mca: base: close: unloading component isolated [helios-login1:27853] mca: base: close: component tm closed [helios-login1:27853] mca: base: close: unloading component tm [helios-login1:27853] mca: base: close: component rsh closed [helios-login1:27853] mca: base: close: unloading component rsh [mboisson@helios-login1 examples]$ echo $? 65 Maxime ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25052.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25053.php -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25054.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25055.php -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique output_ringc_verbose2.txt.gz Description: GNU Zip compressed data
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
This is all one one node indeed. Attached is the output of mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt Maxime Le 2014-08-18 12:48, Ralph Castain a écrit : This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lot of garbage, but should tell us what is going on. On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: Here it is Le 2014-08-18 12:30, Joshua Ladd a écrit : mpirun -np 4 --mca plm_base_verbose 10 [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c [helios-login1:27853] mca: base: components_register: registering plm components [helios-login1:27853] mca: base: components_register: found loaded component isolated [helios-login1:27853] mca: base: components_register: component isolated has no register or open function [helios-login1:27853] mca: base: components_register: found loaded component rsh [helios-login1:27853] mca: base: components_register: component rsh register function successful [helios-login1:27853] mca: base: components_register: found loaded component tm [helios-login1:27853] mca: base: components_register: component tm register function successful [helios-login1:27853] mca: base: components_open: opening plm components [helios-login1:27853] mca: base: components_open: found loaded component isolated [helios-login1:27853] mca: base: components_open: component isolated open function successful [helios-login1:27853] mca: base: components_open: found loaded component rsh [helios-login1:27853] mca: base: components_open: component rsh open function successful [helios-login1:27853] mca: base: components_open: found loaded component tm [helios-login1:27853] mca: base: components_open: component tm open function successful [helios-login1:27853] mca:base:select: Auto-selecting plm components [helios-login1:27853] mca:base:select:( plm) Querying component [isolated] [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0 [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set priority to 10 [helios-login1:27853] mca:base:select:( plm) Querying component [tm] [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module [helios-login1:27853] mca:base:select:( plm) Selected component [rsh] [helios-login1:27853] mca: base: close: component isolated closed [helios-login1:27853] mca: base: close: unloading component isolated [helios-login1:27853] mca: base: close: component tm closed [helios-login1:27853] mca: base: close: unloading component tm [helios-login1:27853] mca: base: close: component rsh closed [helios-login1:27853] mca: base: close: unloading component rsh [mboisson@helios-login1 examples]$ echo $? 65 Maxime ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25052.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25053.php -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique output_ringc_verbose.txt.gz Description: GNU Zip compressed data
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Here it is Le 2014-08-18 12:30, Joshua Ladd a écrit : mpirun -np 4 --mca plm_base_verbose 10 [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c [helios-login1:27853] mca: base: components_register: registering plm components [helios-login1:27853] mca: base: components_register: found loaded component isolated [helios-login1:27853] mca: base: components_register: component isolated has no register or open function [helios-login1:27853] mca: base: components_register: found loaded component rsh [helios-login1:27853] mca: base: components_register: component rsh register function successful [helios-login1:27853] mca: base: components_register: found loaded component tm [helios-login1:27853] mca: base: components_register: component tm register function successful [helios-login1:27853] mca: base: components_open: opening plm components [helios-login1:27853] mca: base: components_open: found loaded component isolated [helios-login1:27853] mca: base: components_open: component isolated open function successful [helios-login1:27853] mca: base: components_open: found loaded component rsh [helios-login1:27853] mca: base: components_open: component rsh open function successful [helios-login1:27853] mca: base: components_open: found loaded component tm [helios-login1:27853] mca: base: components_open: component tm open function successful [helios-login1:27853] mca:base:select: Auto-selecting plm components [helios-login1:27853] mca:base:select:( plm) Querying component [isolated] [helios-login1:27853] mca:base:select:( plm) Query of component [isolated] set priority to 0 [helios-login1:27853] mca:base:select:( plm) Querying component [rsh] [helios-login1:27853] mca:base:select:( plm) Query of component [rsh] set priority to 10 [helios-login1:27853] mca:base:select:( plm) Querying component [tm] [helios-login1:27853] mca:base:select:( plm) Skipping component [tm]. Query failed to return a module [helios-login1:27853] mca:base:select:( plm) Selected component [rsh] [helios-login1:27853] mca: base: close: component isolated closed [helios-login1:27853] mca: base: close: unloading component isolated [helios-login1:27853] mca: base: close: component tm closed [helios-login1:27853] mca: base: close: unloading component tm [helios-login1:27853] mca: base: close: component rsh closed [helios-login1:27853] mca: base: close: unloading component rsh [mboisson@helios-login1 examples]$ echo $? 65 Maxime
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Hi, I just did compile without Cuda, and the result is the same. No output, exits with code 65. [mboisson@helios-login1 examples]$ ldd ring_c linux-vdso.so.1 => (0x7fff3ab31000) libmpi.so.1 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1 (0x7fab9ec2a000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00381c00) libc.so.6 => /lib64/libc.so.6 (0x00381bc0) librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00381c80) libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00381c40) libopen-rte.so.7 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-rte.so.7 (0x7fab9e932000) libtorque.so.2 => /usr/lib64/libtorque.so.2 (0x00391820) libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x003917e0) libz.so.1 => /lib64/libz.so.1 (0x00381cc0) libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00382100) libssl.so.10 => /usr/lib64/libssl.so.10 (0x00382300) libopen-pal.so.6 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-pal.so.6 (0x7fab9e64a000) libdl.so.2 => /lib64/libdl.so.2 (0x00381b80) librt.so.1 => /lib64/librt.so.1 (0x0035b360) libm.so.6 => /lib64/libm.so.6 (0x003c25a0) libutil.so.1 => /lib64/libutil.so.1 (0x003f7100) /lib64/ld-linux-x86-64.so.2 (0x00381b40) libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x003917a0) libgcc_s.so.1 => /software6/compilers/gcc/4.8/lib64/libgcc_s.so.1 (0x7fab9e433000) libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00382240) libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00382140) libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00381e40) libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00382180) libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x003821c0) libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00382200) libresolv.so.2 => /lib64/libresolv.so.2 (0x00381dc0) libselinux.so.1 => /lib64/libselinux.so.1 (0x00381d00) [mboisson@helios-login1 examples]$ mpiexec ring_c [mboisson@helios-login1 examples]$ echo $? 65 Maxime Le 2014-08-16 06:22, Jeff Squyres (jsquyres) a écrit : Just out of curiosity, I saw that one of the segv stack traces involved the cuda stack. Can you try a build without CUDA and see if that resolves the problem? On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault wrote: Hi Jeff, Le 2014-08-15 17:50, Jeff Squyres (jsquyres) a écrit : On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault wrote: Correct. Can it be because torque (pbs_mom) is not running on the head node and mpiexec attempts to contact it ? Not for Open MPI's mpiexec, no. Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM stuff (i.e., Torque stuff) if it sees the environment variable markers indicating that it's inside a Torque job. If not, it just uses rsh/ssh (or localhost launch in your case, since you didn't specify any hosts). If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI "hostname" command from Linux), then something is seriously borked with your Open MPI installation. mpirun -np 4 hostname works fine : [mboisson@helios-login1 ~]$ which mpirun /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun [mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $? helios-login1 helios-login1 helios-login1 helios-login1 0 Try running with: mpirun -np 4 --mca plm_base_verbose 10 hostname This should show the steps OMPI is trying to take to launch the 4 copies of "hostname" and potentially give some insight into where it's hanging. Also, just to make sure, you have ensured that you're compiling everything with a single compiler toolchain, and the support libraries from that specific compiler toolchain are available on any server on which you're running (to include the head node and compute nodes), right? Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6 with the same results). Almost every software (that is compiler, toolchain, etc.) is installed on lustre, from sources and is the same on both the login (head) node and the compute. The few differences between the head node and the compute : 1) Computes are in RAMFS - login is installed on disk 2) Computes and login node have different hardware configuration (computes have GPUs, head node does not). 3) Login node has MORE CentOS6 packages than computes (such as the -devel packages, some fonts/X11 libraries, etc.), but all the packages that are on the computes are also on the login node. And you've verified that PAT
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
There is indeed also a problem with MPI + Cuda. This problem however is deeper, since it happens with Mvapich2 1.9, OpenMPI 1.6.5/1.8.1/1.8.2rc4, Cuda 5.5.22/6.0.37. From my tests, everything works fine with MPI + Cuda on a single node, but as soon as I got to MPI + Cuda accross nodes, I get segv. I suspect something either with the ofed (we use linux ofed rdma, not the Mellanox stack) or the nvidia drivers (we are a couple minor versions behind). My next step is to try and upgrade those. I do not think this problem is related to not being able to run ring_c on the head node however, because it runs fine with 1.6.5 and ring_c does not involve cuda. Maxime Le 2014-08-16 06:22, Jeff Squyres (jsquyres) a écrit : Just out of curiosity, I saw that one of the segv stack traces involved the cuda stack. Can you try a build without CUDA and see if that resolves the problem? On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault wrote: Hi Jeff, Le 2014-08-15 17:50, Jeff Squyres (jsquyres) a écrit : On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault wrote: Correct. Can it be because torque (pbs_mom) is not running on the head node and mpiexec attempts to contact it ? Not for Open MPI's mpiexec, no. Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM stuff (i.e., Torque stuff) if it sees the environment variable markers indicating that it's inside a Torque job. If not, it just uses rsh/ssh (or localhost launch in your case, since you didn't specify any hosts). If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI "hostname" command from Linux), then something is seriously borked with your Open MPI installation. mpirun -np 4 hostname works fine : [mboisson@helios-login1 ~]$ which mpirun /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun [mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $? helios-login1 helios-login1 helios-login1 helios-login1 0 Try running with: mpirun -np 4 --mca plm_base_verbose 10 hostname This should show the steps OMPI is trying to take to launch the 4 copies of "hostname" and potentially give some insight into where it's hanging. Also, just to make sure, you have ensured that you're compiling everything with a single compiler toolchain, and the support libraries from that specific compiler toolchain are available on any server on which you're running (to include the head node and compute nodes), right? Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6 with the same results). Almost every software (that is compiler, toolchain, etc.) is installed on lustre, from sources and is the same on both the login (head) node and the compute. The few differences between the head node and the compute : 1) Computes are in RAMFS - login is installed on disk 2) Computes and login node have different hardware configuration (computes have GPUs, head node does not). 3) Login node has MORE CentOS6 packages than computes (such as the -devel packages, some fonts/X11 libraries, etc.), but all the packages that are on the computes are also on the login node. And you've verified that PATH and LD_LIBRARY_PATH are pointing to the right places -- i.e., to the Open MPI installation that you expect it to point to. E.g., if you "ldd ring_c", it shows the libmpi.so that you expect. And "which mpiexec" shows the mpirun that you expect. Etc. As per the content of "env.out" in the archive, yes. They point to the OMPI 1.8.2rc4 installation directories, on lustre, and are shared between the head node and the compute nodes. Maxime ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25043.php -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Hi Jeff, Le 2014-08-15 17:50, Jeff Squyres (jsquyres) a écrit : On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault wrote: Correct. Can it be because torque (pbs_mom) is not running on the head node and mpiexec attempts to contact it ? Not for Open MPI's mpiexec, no. Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM stuff (i.e., Torque stuff) if it sees the environment variable markers indicating that it's inside a Torque job. If not, it just uses rsh/ssh (or localhost launch in your case, since you didn't specify any hosts). If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI "hostname" command from Linux), then something is seriously borked with your Open MPI installation. mpirun -np 4 hostname works fine : [mboisson@helios-login1 ~]$ which mpirun /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun [mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $? helios-login1 helios-login1 helios-login1 helios-login1 0 Try running with: mpirun -np 4 --mca plm_base_verbose 10 hostname This should show the steps OMPI is trying to take to launch the 4 copies of "hostname" and potentially give some insight into where it's hanging. Also, just to make sure, you have ensured that you're compiling everything with a single compiler toolchain, and the support libraries from that specific compiler toolchain are available on any server on which you're running (to include the head node and compute nodes), right? Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6 with the same results). Almost every software (that is compiler, toolchain, etc.) is installed on lustre, from sources and is the same on both the login (head) node and the compute. The few differences between the head node and the compute : 1) Computes are in RAMFS - login is installed on disk 2) Computes and login node have different hardware configuration (computes have GPUs, head node does not). 3) Login node has MORE CentOS6 packages than computes (such as the -devel packages, some fonts/X11 libraries, etc.), but all the packages that are on the computes are also on the login node. And you've verified that PATH and LD_LIBRARY_PATH are pointing to the right places -- i.e., to the Open MPI installation that you expect it to point to. E.g., if you "ldd ring_c", it shows the libmpi.so that you expect. And "which mpiexec" shows the mpirun that you expect. Etc. As per the content of "env.out" in the archive, yes. They point to the OMPI 1.8.2rc4 installation directories, on lustre, and are shared between the head node and the compute nodes. Maxime
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Correct. Can it be because torque (pbs_mom) is not running on the head node and mpiexec attempts to contact it ? Maxime Le 2014-08-15 17:31, Joshua Ladd a écrit : But OMPI 1.8.x does run the ring_c program successfully on your compute node, right? The error only happens on the front-end login node if I understood you correctly. Josh On Fri, Aug 15, 2014 at 5:20 PM, Maxime Boissonneault <mailto:maxime.boissonnea...@calculquebec.ca>> wrote: Here are the requested files. In the archive, you will find the output of configure, make, make install as well as the config.log, the environment when running ring_c and the ompi_info --all. Just for a reminder, the ring_c example compiled and ran, but produced no output when running and exited with code 65. Thanks, Maxime Le 2014-08-14 15:26, Joshua Ladd a écrit : One more, Maxime, can you please make sure you've covered everything here: http://www.open-mpi.org/community/help/ Josh On Thu, Aug 14, 2014 at 3:18 PM, Joshua Ladd mailto:jladd.m...@gmail.com>> wrote: And maybe include your LD_LIBRARY_PATH Josh On Thu, Aug 14, 2014 at 3:16 PM, Joshua Ladd mailto:jladd.m...@gmail.com>> wrote: Can you try to run the example code "ring_c" across nodes? Josh On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault mailto:maxime.boissonnea...@calculquebec.ca>> wrote: Yes, Everything has been built with GCC 4.8.x, although x might have changed between the OpenMPI 1.8.1 build and the gromacs build. For OpenMPI 1.8.2rc4 however, it was the exact same compiler for everything. Maxime Le 2014-08-14 14:57, Joshua Ladd a écrit : Hmmm...weird. Seems like maybe a mismatch between libraries. Did you build OMPI with the same compiler as you did GROMACS/Charm++? I'm stealing this suggestion from an old Gromacs forum with essentially the same symptom: "Did you compile Open MPI and Gromacs with the same compiler (i.e. both gcc and the same version)? You write you tried different OpenMPI versions and different GCC versions but it is unclear whether those match. Can you provide more detail how you compiled (including all options you specified)? Have you tested any other MPI program linked against those Open MPI versions? Please make sure (e.g. with ldd) that the MPI and pthread library you compiled against is also used for execution. If you compiled and run on different hosts, check whether the error still occurs when executing on the build host." http://redmine.gromacs.org/issues/1025 Josh On Thu, Aug 14, 2014 at 2:40 PM, Maxime Boissonneault mailto:maxime.boissonnea...@calculquebec.ca>> wrote: I just tried Gromacs with two nodes. It crashes, but with a different error. I get [gpu-k20-13:142156] *** Process received signal *** [gpu-k20-13:142156] Signal: Segmentation fault (11) [gpu-k20-13:142156] Signal code: Address not mapped (1) [gpu-k20-13:142156] Failing at address: 0x8 [gpu-k20-13:142156] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710] [gpu-k20-13:142156] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf] [gpu-k20-13:142156] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac5ddf82a83] [gpu-k20-13:142156] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac5ddeb42da] [gpu-k20-13:142156] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac5ddea0933] [gpu-k20-13:142156] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac5d0930965] [gpu-k20-13:142156] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac5d0930a0a] [gpu-k20-13:142156] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac5d0930a3b] [gpu-k20-13:142156] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaDriverGetVersion+0x4a)[0x2ac5d094602a] [gpu-k20-13:142156] [ 9] /software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_print_version
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Here are the requested files. In the archive, you will find the output of configure, make, make install as well as the config.log, the environment when running ring_c and the ompi_info --all. Just for a reminder, the ring_c example compiled and ran, but produced no output when running and exited with code 65. Thanks, Maxime Le 2014-08-14 15:26, Joshua Ladd a écrit : One more, Maxime, can you please make sure you've covered everything here: http://www.open-mpi.org/community/help/ Josh On Thu, Aug 14, 2014 at 3:18 PM, Joshua Ladd <mailto:jladd.m...@gmail.com>> wrote: And maybe include your LD_LIBRARY_PATH Josh On Thu, Aug 14, 2014 at 3:16 PM, Joshua Ladd mailto:jladd.m...@gmail.com>> wrote: Can you try to run the example code "ring_c" across nodes? Josh On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault mailto:maxime.boissonnea...@calculquebec.ca>> wrote: Yes, Everything has been built with GCC 4.8.x, although x might have changed between the OpenMPI 1.8.1 build and the gromacs build. For OpenMPI 1.8.2rc4 however, it was the exact same compiler for everything. Maxime Le 2014-08-14 14:57, Joshua Ladd a écrit : Hmmm...weird. Seems like maybe a mismatch between libraries. Did you build OMPI with the same compiler as you did GROMACS/Charm++? I'm stealing this suggestion from an old Gromacs forum with essentially the same symptom: "Did you compile Open MPI and Gromacs with the same compiler (i.e. both gcc and the same version)? You write you tried different OpenMPI versions and different GCC versions but it is unclear whether those match. Can you provide more detail how you compiled (including all options you specified)? Have you tested any other MPI program linked against those Open MPI versions? Please make sure (e.g. with ldd) that the MPI and pthread library you compiled against is also used for execution. If you compiled and run on different hosts, check whether the error still occurs when executing on the build host." http://redmine.gromacs.org/issues/1025 Josh On Thu, Aug 14, 2014 at 2:40 PM, Maxime Boissonneault mailto:maxime.boissonnea...@calculquebec.ca>> wrote: I just tried Gromacs with two nodes. It crashes, but with a different error. I get [gpu-k20-13:142156] *** Process received signal *** [gpu-k20-13:142156] Signal: Segmentation fault (11) [gpu-k20-13:142156] Signal code: Address not mapped (1) [gpu-k20-13:142156] Failing at address: 0x8 [gpu-k20-13:142156] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710] [gpu-k20-13:142156] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf] [gpu-k20-13:142156] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac5ddf82a83] [gpu-k20-13:142156] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac5ddeb42da] [gpu-k20-13:142156] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac5ddea0933] [gpu-k20-13:142156] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac5d0930965] [gpu-k20-13:142156] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac5d0930a0a] [gpu-k20-13:142156] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac5d0930a3b] [gpu-k20-13:142156] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaDriverGetVersion+0x4a)[0x2ac5d094602a] [gpu-k20-13:142156] [ 9] /software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_print_version_info_gpu+0x55)[0x2ac5cf9a90b5] [gpu-k20-13:142156] [10] /software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_log_open+0x17e)[0x2ac5cf54b9be] [gpu-k20-13:142156] [11] mdrunmpi(cmain+0x1cdb)[0x43b4bb] [gpu-k20-13:142156] [12] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac5d1534d1d] [gpu-k20-13:142156] [13] mdrunmpi[0x407be1] [gpu-k20-13:142156] *** End of error message *** -- mpiexec noticed that process rank 0 with PID 142156 on node gpu-k20-1
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Hi, I solved the warning that appeared with OpenMPI 1.6.5 on the login node. I increased the registrable memory. Now, with OpenMPI 1.6.5, it does not give any warning. Yet, with OpenMPI 1.8.1 and OpenMPI 1.8.2rc4, it still exits with error code 65 and does not produce the normal output. I will recompile it from scratch and provide all the information requested on the help webpage. Cheers, Maxime Le 2014-08-15 11:58, Maxime Boissonneault a écrit : Hi Josh, The ring_c example does not work on our login node : [mboisson@helios-login1 examples]$ mpiexec -np 10 ring_c [mboisson@helios-login1 examples]$ echo $? 65 [mboisson@helios-login1 examples]$ echo $LD_LIBRARY_PATH /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib:/usr/lib64/nvidia:/software-gpu/cuda/6.0.37/lib64:/software-gpu/cuda/6.0.37/lib:/software6/compilers/gcc/4.8/lib64:/software6/compilers/gcc/4.8/lib:/software6/apps/buildtools/20140527/lib64:/software6/apps/buildtools/20140527/lib It does work on our compute nodes however. If I compile and run this with OpenMPI 1.6.5, it gives a warning, but it does work on our login note : [mboisson@helios-login1 examples]$ mpiexec ring_c -- WARNING: It appears that your OpenFabrics subsystem is configured to only allow registering part of your physical memory. This can cause MPI jobs to run with erratic performance, hang, and/or crash. This may be caused by your OpenFabrics vendor limiting the amount of physical memory that can be registered. You should investigate the relevant Linux kernel module parameters that control how much physical memory can be registered, and increase them to allow registering all physical memory on your machine. See this Open MPI FAQ item for more information on these Linux kernel module parameters: http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages Local host: helios-login1 Registerable memory: 32768 MiB Total memory:65457 MiB Your MPI job will continue, but may be behave poorly and/or hang. -- Process 0 sending 10 to 0, tag 201 (1 processes in ring) Process 0 sent to 0 Process 0 decremented value: 9 Process 0 decremented value: 8 Process 0 decremented value: 7 Process 0 decremented value: 6 Process 0 decremented value: 5 Process 0 decremented value: 4 Process 0 decremented value: 3 Process 0 decremented value: 2 Process 0 decremented value: 1 Process 0 decremented value: 0 Process 0 exiting Could the warning be causing a failure with OpenMPI 1.8.x ? I suspect it does work on our compute nodes because they are configured to allow more locked pages. I do not understand however how a simple ring test should require that much memory. Maxime Le 2014-08-14 15:16, Joshua Ladd a écrit : Can you try to run the example code "ring_c" across nodes? Josh On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault <mailto:maxime.boissonnea...@calculquebec.ca>> wrote: Yes, Everything has been built with GCC 4.8.x, although x might have changed between the OpenMPI 1.8.1 build and the gromacs build. For OpenMPI 1.8.2rc4 however, it was the exact same compiler for everything. Maxime Le 2014-08-14 14:57, Joshua Ladd a écrit : Hmmm...weird. Seems like maybe a mismatch between libraries. Did you build OMPI with the same compiler as you did GROMACS/Charm++? I'm stealing this suggestion from an old Gromacs forum with essentially the same symptom: "Did you compile Open MPI and Gromacs with the same compiler (i.e. both gcc and the same version)? You write you tried different OpenMPI versions and different GCC versions but it is unclear whether those match. Can you provide more detail how you compiled (including all options you specified)? Have you tested any other MPI program linked against those Open MPI versions? Please make sure (e.g. with ldd) that the MPI and pthread library you compiled against is also used for execution. If you compiled and run on different hosts, check whether the error still occurs when executing on the build host." http://redmine.gromacs.org/issues/1025 Josh On Thu, Aug 14, 2014 at 2:40 PM, Maxime Boissonneault mailto:maxime.boissonnea...@calculquebec.ca>> wrote: I just tried Gromacs with two nodes. It crashes, but with a different error. I get [gpu-k20-13:142156] *** Process received signal *** [gpu-k20-13:142156] Signal: Segmentation fault (11) [gpu-k20-13:142156] Signal code: Address not mapped (1) [gpu-k20-13:142156] Failing at address: 0x8 [gpu-k20-13:142156] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710] [gpu-k20-13:142156] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf] [gp
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Hi Josh, The ring_c example does not work on our login node : [mboisson@helios-login1 examples]$ mpiexec -np 10 ring_c [mboisson@helios-login1 examples]$ echo $? 65 [mboisson@helios-login1 examples]$ echo $LD_LIBRARY_PATH /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib:/usr/lib64/nvidia:/software-gpu/cuda/6.0.37/lib64:/software-gpu/cuda/6.0.37/lib:/software6/compilers/gcc/4.8/lib64:/software6/compilers/gcc/4.8/lib:/software6/apps/buildtools/20140527/lib64:/software6/apps/buildtools/20140527/lib It does work on our compute nodes however. If I compile and run this with OpenMPI 1.6.5, it gives a warning, but it does work on our login note : [mboisson@helios-login1 examples]$ mpiexec ring_c -- WARNING: It appears that your OpenFabrics subsystem is configured to only allow registering part of your physical memory. This can cause MPI jobs to run with erratic performance, hang, and/or crash. This may be caused by your OpenFabrics vendor limiting the amount of physical memory that can be registered. You should investigate the relevant Linux kernel module parameters that control how much physical memory can be registered, and increase them to allow registering all physical memory on your machine. See this Open MPI FAQ item for more information on these Linux kernel module parameters: http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages Local host: helios-login1 Registerable memory: 32768 MiB Total memory:65457 MiB Your MPI job will continue, but may be behave poorly and/or hang. -- Process 0 sending 10 to 0, tag 201 (1 processes in ring) Process 0 sent to 0 Process 0 decremented value: 9 Process 0 decremented value: 8 Process 0 decremented value: 7 Process 0 decremented value: 6 Process 0 decremented value: 5 Process 0 decremented value: 4 Process 0 decremented value: 3 Process 0 decremented value: 2 Process 0 decremented value: 1 Process 0 decremented value: 0 Process 0 exiting Could the warning be causing a failure with OpenMPI 1.8.x ? I suspect it does work on our compute nodes because they are configured to allow more locked pages. I do not understand however how a simple ring test should require that much memory. Maxime Le 2014-08-14 15:16, Joshua Ladd a écrit : Can you try to run the example code "ring_c" across nodes? Josh On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault <mailto:maxime.boissonnea...@calculquebec.ca>> wrote: Yes, Everything has been built with GCC 4.8.x, although x might have changed between the OpenMPI 1.8.1 build and the gromacs build. For OpenMPI 1.8.2rc4 however, it was the exact same compiler for everything. Maxime Le 2014-08-14 14:57, Joshua Ladd a écrit : Hmmm...weird. Seems like maybe a mismatch between libraries. Did you build OMPI with the same compiler as you did GROMACS/Charm++? I'm stealing this suggestion from an old Gromacs forum with essentially the same symptom: "Did you compile Open MPI and Gromacs with the same compiler (i.e. both gcc and the same version)? You write you tried different OpenMPI versions and different GCC versions but it is unclear whether those match. Can you provide more detail how you compiled (including all options you specified)? Have you tested any other MPI program linked against those Open MPI versions? Please make sure (e.g. with ldd) that the MPI and pthread library you compiled against is also used for execution. If you compiled and run on different hosts, check whether the error still occurs when executing on the build host." http://redmine.gromacs.org/issues/1025 Josh On Thu, Aug 14, 2014 at 2:40 PM, Maxime Boissonneault mailto:maxime.boissonnea...@calculquebec.ca>> wrote: I just tried Gromacs with two nodes. It crashes, but with a different error. I get [gpu-k20-13:142156] *** Process received signal *** [gpu-k20-13:142156] Signal: Segmentation fault (11) [gpu-k20-13:142156] Signal code: Address not mapped (1) [gpu-k20-13:142156] Failing at address: 0x8 [gpu-k20-13:142156] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710] [gpu-k20-13:142156] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf] [gpu-k20-13:142156] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac5ddf82a83] [gpu-k20-13:142156] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac5ddeb42da] [gpu-k20-13:142156] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac5ddea0933] [gpu-k20-13:142156] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac5d0930965] [gpu-k20-13:142156] [ 6] /software-gpu/cud
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Yes, Everything has been built with GCC 4.8.x, although x might have changed between the OpenMPI 1.8.1 build and the gromacs build. For OpenMPI 1.8.2rc4 however, it was the exact same compiler for everything. Maxime Le 2014-08-14 14:57, Joshua Ladd a écrit : Hmmm...weird. Seems like maybe a mismatch between libraries. Did you build OMPI with the same compiler as you did GROMACS/Charm++? I'm stealing this suggestion from an old Gromacs forum with essentially the same symptom: "Did you compile Open MPI and Gromacs with the same compiler (i.e. both gcc and the same version)? You write you tried different OpenMPI versions and different GCC versions but it is unclear whether those match. Can you provide more detail how you compiled (including all options you specified)? Have you tested any other MPI program linked against those Open MPI versions? Please make sure (e.g. with ldd) that the MPI and pthread library you compiled against is also used for execution. If you compiled and run on different hosts, check whether the error still occurs when executing on the build host." http://redmine.gromacs.org/issues/1025 Josh On Thu, Aug 14, 2014 at 2:40 PM, Maxime Boissonneault <mailto:maxime.boissonnea...@calculquebec.ca>> wrote: I just tried Gromacs with two nodes. It crashes, but with a different error. I get [gpu-k20-13:142156] *** Process received signal *** [gpu-k20-13:142156] Signal: Segmentation fault (11) [gpu-k20-13:142156] Signal code: Address not mapped (1) [gpu-k20-13:142156] Failing at address: 0x8 [gpu-k20-13:142156] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710] [gpu-k20-13:142156] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf] [gpu-k20-13:142156] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac5ddf82a83] [gpu-k20-13:142156] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac5ddeb42da] [gpu-k20-13:142156] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac5ddea0933] [gpu-k20-13:142156] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac5d0930965] [gpu-k20-13:142156] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac5d0930a0a] [gpu-k20-13:142156] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac5d0930a3b] [gpu-k20-13:142156] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaDriverGetVersion+0x4a)[0x2ac5d094602a] [gpu-k20-13:142156] [ 9] /software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_print_version_info_gpu+0x55)[0x2ac5cf9a90b5] [gpu-k20-13:142156] [10] /software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_log_open+0x17e)[0x2ac5cf54b9be] [gpu-k20-13:142156] [11] mdrunmpi(cmain+0x1cdb)[0x43b4bb] [gpu-k20-13:142156] [12] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac5d1534d1d] [gpu-k20-13:142156] [13] mdrunmpi[0x407be1] [gpu-k20-13:142156] *** End of error message *** -- mpiexec noticed that process rank 0 with PID 142156 on node gpu-k20-13 exited on signal 11 (Segmentation fault). -- We do not have MPI_THREAD_MULTIPLE enabled in our build, so Charm++ cannot be using this level of threading. The configure line for OpenMPI was ./configure --prefix=$PREFIX \ --with-threads --with-verbs=yes --enable-shared --enable-static \ --with-io-romio-flags="--with-file-system=nfs+lustre" \ --without-loadleveler --without-slurm --with-tm \ --with-cuda=$(dirname $(dirname $(which nvcc))) Maxime Le 2014-08-14 14:20, Joshua Ladd a écrit : What about between nodes? Since this is coming from the OpenIB BTL, would be good to check this. Do you know what the MPI thread level is set to when used with the Charm++ runtime? Is it MPI_THREAD_MULTIPLE? The OpenIB BTL is not thread safe. Josh On Thu, Aug 14, 2014 at 2:17 PM, Maxime Boissonneault mailto:maxime.boissonnea...@calculquebec.ca>> wrote: Hi, I ran gromacs successfully with OpenMPI 1.8.1 and Cuda 6.0.37 on a single node, with 8 ranks and multiple OpenMP threads. Maxime Le 2014-08-14 14:15, Joshua Ladd a écrit : Hi, Maxime Just curious, are you able to run a vanilla MPI program? Can you try one one of the example programs in the "examples" subdirectory. Looks like a threading issue to me. Thanks, Josh ___ users mailing list us...@open-mpi.org <mailto:us...@open-mpi.org> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post:http://
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
I just tried Gromacs with two nodes. It crashes, but with a different error. I get [gpu-k20-13:142156] *** Process received signal *** [gpu-k20-13:142156] Signal: Segmentation fault (11) [gpu-k20-13:142156] Signal code: Address not mapped (1) [gpu-k20-13:142156] Failing at address: 0x8 [gpu-k20-13:142156] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710] [gpu-k20-13:142156] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf] [gpu-k20-13:142156] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac5ddf82a83] [gpu-k20-13:142156] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac5ddeb42da] [gpu-k20-13:142156] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac5ddea0933] [gpu-k20-13:142156] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac5d0930965] [gpu-k20-13:142156] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac5d0930a0a] [gpu-k20-13:142156] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac5d0930a3b] [gpu-k20-13:142156] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaDriverGetVersion+0x4a)[0x2ac5d094602a] [gpu-k20-13:142156] [ 9] /software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_print_version_info_gpu+0x55)[0x2ac5cf9a90b5] [gpu-k20-13:142156] [10] /software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_log_open+0x17e)[0x2ac5cf54b9be] [gpu-k20-13:142156] [11] mdrunmpi(cmain+0x1cdb)[0x43b4bb] [gpu-k20-13:142156] [12] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac5d1534d1d] [gpu-k20-13:142156] [13] mdrunmpi[0x407be1] [gpu-k20-13:142156] *** End of error message *** -- mpiexec noticed that process rank 0 with PID 142156 on node gpu-k20-13 exited on signal 11 (Segmentation fault). -- We do not have MPI_THREAD_MULTIPLE enabled in our build, so Charm++ cannot be using this level of threading. The configure line for OpenMPI was ./configure --prefix=$PREFIX \ --with-threads --with-verbs=yes --enable-shared --enable-static \ --with-io-romio-flags="--with-file-system=nfs+lustre" \ --without-loadleveler --without-slurm --with-tm \ --with-cuda=$(dirname $(dirname $(which nvcc))) Maxime Le 2014-08-14 14:20, Joshua Ladd a écrit : What about between nodes? Since this is coming from the OpenIB BTL, would be good to check this. Do you know what the MPI thread level is set to when used with the Charm++ runtime? Is it MPI_THREAD_MULTIPLE? The OpenIB BTL is not thread safe. Josh On Thu, Aug 14, 2014 at 2:17 PM, Maxime Boissonneault <mailto:maxime.boissonnea...@calculquebec.ca>> wrote: Hi, I ran gromacs successfully with OpenMPI 1.8.1 and Cuda 6.0.37 on a single node, with 8 ranks and multiple OpenMP threads. Maxime Le 2014-08-14 14:15, Joshua Ladd a écrit : Hi, Maxime Just curious, are you able to run a vanilla MPI program? Can you try one one of the example programs in the "examples" subdirectory. Looks like a threading issue to me. Thanks, Josh ___ users mailing list us...@open-mpi.org <mailto:us...@open-mpi.org> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post:http://www.open-mpi.org/community/lists/users/2014/08/25023.php ___ users mailing list us...@open-mpi.org <mailto:us...@open-mpi.org> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25024.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25025.php -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Hi, I ran gromacs successfully with OpenMPI 1.8.1 and Cuda 6.0.37 on a single node, with 8 ranks and multiple OpenMP threads. Maxime Le 2014-08-14 14:15, Joshua Ladd a écrit : Hi, Maxime Just curious, are you able to run a vanilla MPI program? Can you try one one of the example programs in the "examples" subdirectory. Looks like a threading issue to me. Thanks, Josh ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25023.php
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Hi, I just did with 1.8.2rc4 and it does the same : [mboisson@helios-login1 simplearrayhello]$ ./hello [helios-login1:11739] *** Process received signal *** [helios-login1:11739] Signal: Segmentation fault (11) [helios-login1:11739] Signal code: Address not mapped (1) [helios-login1:11739] Failing at address: 0x30 [helios-login1:11739] [ 0] /lib64/libpthread.so.0[0x381c00f710] [helios-login1:11739] [ 1] /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xfa238)[0x7f7166a04238] [helios-login1:11739] [ 2] /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xfbad4)[0x7f7166a05ad4] [helios-login1:11739] [ 3] /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xcf)[0x7f71669ffddf] [helios-login1:11739] [ 4] /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xe4773)[0x7f71669ee773] [helios-login1:11739] [ 5] /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_btl_base_select+0x168)[0x7f71669e46a8] [helios-login1:11739] [ 6] /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_r2_component_init+0x11)[0x7f71669e3fd1] [helios-login1:11739] [ 7] /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_base_init+0x7f)[0x7f71669e275f] [helios-login1:11739] [ 8] /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0x1e602f)[0x7f7166af002f] [helios-login1:11739] [ 9] /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_pml_base_select+0x3b6)[0x7f7166aedc26] [helios-login1:11739] [10] /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_mpi_init+0x4e3)[0x7f7166988863] [helios-login1:11739] [11] /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(MPI_Init_thread+0x15d)[0x7f71669a86fd] [helios-login1:11739] [12] ./hello(LrtsInit+0x72)[0x4fcf02] [helios-login1:11739] [13] ./hello(ConverseInit+0x70)[0x4ff680] [helios-login1:11739] [14] ./hello(main+0x27)[0x470767] [helios-login1:11739] [15] /lib64/libc.so.6(__libc_start_main+0xfd)[0x381bc1ed1d] [helios-login1:11739] [16] ./hello[0x470b71] [helios-login1:11739] *** End of error message Maxime Le 2014-08-14 10:04, Jeff Squyres (jsquyres) a écrit : Can you try the latest 1.8.2 rc tarball? (just released yesterday) http://www.open-mpi.org/software/ompi/v1.8/ On Aug 14, 2014, at 8:39 AM, Maxime Boissonneault wrote: Hi, I compiled Charm++ 6.6.0rc3 using ./build charm++ mpi-linux-x86_64 smp --with-production When compiling the simple example mpi-linux-x86_64-smp/tests/charm++/simplearrayhello/ I get a segmentation fault that traces back to OpenMPI : [mboisson@helios-login1 simplearrayhello]$ ./hello [helios-login1:01813] *** Process received signal *** [helios-login1:01813] Signal: Segmentation fault (11) [helios-login1:01813] Signal code: Address not mapped (1) [helios-login1:01813] Failing at address: 0x30 [helios-login1:01813] [ 0] /lib64/libpthread.so.0[0x381c00f710] [helios-login1:01813] [ 1] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xf78f8)[0x7f2cd1f6b8f8] [helios-login1:01813] [ 2] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xf8f64)[0x7f2cd1f6cf64] [helios-login1:01813] [ 3] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xcf)[0x7f2cd1f672af] [helios-login1:01813] [ 4] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xe1ad7)[0x7f2cd1f55ad7] [helios-login1:01813] [ 5] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_btl_base_select+0x168)[0x7f2cd1f4bf28] [helios-login1:01813] [ 6] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_r2_component_init+0x11)[0x7f2cd1f4b851] [helios-login1:01813] [ 7] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_base_init+0x7f)[0x7f2cd1f4a03f] [helios-login1:01813] [ 8] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0x1e0d17)[0x7f2cd2054d17] [helios-login1:01813] [ 9] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_pml_base_select+0x3b6)[0x7f2cd20529d6] [helios-login1:01813] [10] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_mpi_init+0x4e4)[0x7f2cd1ef0c14] [helios-login1:01813] [11] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(MPI_Init_thread+0x15d)[0x7f2cd1f1065d] [helios-login1:01813] [12] ./hello(LrtsInit+0x72)[0x4fcf02] [helios-login1:01813] [13] ./hello(ConverseInit+0x70)[0x4ff680] [helios-login1:01813] [14] ./hello(main+0x27)[0x470767] [helios-login1:01813] [15] /lib64/libc.so.6(__libc_start_main+0xfd)[0x381bc1ed1d] [helios-login1:01813] [16] ./hello[0x470b71] Anyone has a clue how to fix this ? Thanks, -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Running a hybrid MPI+openMP program
Hi, You DEFINITELY need to disable OpenMPI's new default binding. Otherwise, your N threads will run on a single core. --bind-to socket would be my recommendation for hybrid jobs. Maxime Le 2014-08-14 10:04, Jeff Squyres (jsquyres) a écrit : I don't know much about OpenMP, but do you need to disable Open MPI's default bind-to-core functionality (I'm assuming you're using Open MPI 1.8.x)? You can try "mpirun --bind-to none ...", which will have Open MPI not bind MPI processes to cores, which might allow OpenMP to think that it can use all the cores, and therefore it will spawn num_cores threads...? On Aug 14, 2014, at 9:50 AM, Oscar Mojica wrote: Hello everybody I am trying to run a hybrid mpi + openmp program in a cluster. I created a queue with 14 machines, each one with 16 cores. The program divides the work among the 14 processors with MPI and within each processor a loop is also divided into 8 threads for example, using openmp. The problem is that when I submit the job to the queue the MPI processes don't divide the work into threads and the program prints the number of threads that are working within each process as one. I made a simple test program that uses openmp and I logged in one machine of the fourteen. I compiled it using gfortran -fopenmp program.f -o exe, set the OMP_NUM_THREADS environment variable equal to 8 and when I ran directly in the terminal the loop was effectively divided among the cores and for example in this case the program printed the number of threads equal to 8 This is my Makefile # Start of the makefile # Defining variables objects = inv_grav3d.o funcpdf.o gr3dprm.o fdjac.o dsvd.o #f90comp = /opt/openmpi/bin/mpif90 f90comp = /usr/bin/mpif90 #switch = -O3 executable = inverse.exe # Makefile all : $(executable) $(executable) : $(objects) $(f90comp) -fopenmp -g -O -o $(executable) $(objects) rm $(objects) %.o: %.f $(f90comp) -c $< # Cleaning everything clean: rm $(executable) # rm $(objects) # End of the makefile and the script that i am using is #!/bin/bash #$ -cwd #$ -j y #$ -S /bin/bash #$ -pe orte 14 #$ -N job #$ -q new.q export OMP_NUM_THREADS=8 /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v -np $NSLOTS ./inverse.exe am I forgetting something? Thanks, Oscar Fabian Mojica Ladino Geologist M.S. in Geophysics ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25016.php -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1
Note that if I do the same build with OpenMPI 1.6.5, it works flawlessly. Maxime Le 2014-08-14 08:39, Maxime Boissonneault a écrit : Hi, I compiled Charm++ 6.6.0rc3 using ./build charm++ mpi-linux-x86_64 smp --with-production When compiling the simple example mpi-linux-x86_64-smp/tests/charm++/simplearrayhello/ I get a segmentation fault that traces back to OpenMPI : [mboisson@helios-login1 simplearrayhello]$ ./hello [helios-login1:01813] *** Process received signal *** [helios-login1:01813] Signal: Segmentation fault (11) [helios-login1:01813] Signal code: Address not mapped (1) [helios-login1:01813] Failing at address: 0x30 [helios-login1:01813] [ 0] /lib64/libpthread.so.0[0x381c00f710] [helios-login1:01813] [ 1] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xf78f8)[0x7f2cd1f6b8f8] [helios-login1:01813] [ 2] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xf8f64)[0x7f2cd1f6cf64] [helios-login1:01813] [ 3] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xcf)[0x7f2cd1f672af] [helios-login1:01813] [ 4] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xe1ad7)[0x7f2cd1f55ad7] [helios-login1:01813] [ 5] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_btl_base_select+0x168)[0x7f2cd1f4bf28] [helios-login1:01813] [ 6] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_r2_component_init+0x11)[0x7f2cd1f4b851] [helios-login1:01813] [ 7] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_base_init+0x7f)[0x7f2cd1f4a03f] [helios-login1:01813] [ 8] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0x1e0d17)[0x7f2cd2054d17] [helios-login1:01813] [ 9] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_pml_base_select+0x3b6)[0x7f2cd20529d6] [helios-login1:01813] [10] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_mpi_init+0x4e4)[0x7f2cd1ef0c14] [helios-login1:01813] [11] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(MPI_Init_thread+0x15d)[0x7f2cd1f1065d] [helios-login1:01813] [12] ./hello(LrtsInit+0x72)[0x4fcf02] [helios-login1:01813] [13] ./hello(ConverseInit+0x70)[0x4ff680] [helios-login1:01813] [14] ./hello(main+0x27)[0x470767] [helios-login1:01813] [15] /lib64/libc.so.6(__libc_start_main+0xfd)[0x381bc1ed1d] [helios-login1:01813] [16] ./hello[0x470b71] Anyone has a clue how to fix this ? Thanks, -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
[OMPI users] Segmentation fault in OpenMPI 1.8.1
Hi, I compiled Charm++ 6.6.0rc3 using ./build charm++ mpi-linux-x86_64 smp --with-production When compiling the simple example mpi-linux-x86_64-smp/tests/charm++/simplearrayhello/ I get a segmentation fault that traces back to OpenMPI : [mboisson@helios-login1 simplearrayhello]$ ./hello [helios-login1:01813] *** Process received signal *** [helios-login1:01813] Signal: Segmentation fault (11) [helios-login1:01813] Signal code: Address not mapped (1) [helios-login1:01813] Failing at address: 0x30 [helios-login1:01813] [ 0] /lib64/libpthread.so.0[0x381c00f710] [helios-login1:01813] [ 1] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xf78f8)[0x7f2cd1f6b8f8] [helios-login1:01813] [ 2] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xf8f64)[0x7f2cd1f6cf64] [helios-login1:01813] [ 3] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xcf)[0x7f2cd1f672af] [helios-login1:01813] [ 4] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xe1ad7)[0x7f2cd1f55ad7] [helios-login1:01813] [ 5] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_btl_base_select+0x168)[0x7f2cd1f4bf28] [helios-login1:01813] [ 6] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_r2_component_init+0x11)[0x7f2cd1f4b851] [helios-login1:01813] [ 7] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_base_init+0x7f)[0x7f2cd1f4a03f] [helios-login1:01813] [ 8] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0x1e0d17)[0x7f2cd2054d17] [helios-login1:01813] [ 9] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_pml_base_select+0x3b6)[0x7f2cd20529d6] [helios-login1:01813] [10] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_mpi_init+0x4e4)[0x7f2cd1ef0c14] [helios-login1:01813] [11] /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(MPI_Init_thread+0x15d)[0x7f2cd1f1065d] [helios-login1:01813] [12] ./hello(LrtsInit+0x72)[0x4fcf02] [helios-login1:01813] [13] ./hello(ConverseInit+0x70)[0x4ff680] [helios-login1:01813] [14] ./hello(main+0x27)[0x470767] [helios-login1:01813] [15] /lib64/libc.so.6(__libc_start_main+0xfd)[0x381bc1ed1d] [helios-login1:01813] [16] ./hello[0x470b71] Anyone has a clue how to fix this ? Thanks, -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
[OMPI users] Filem could not be found for one user
Hi, I am getting a weird error when running mpiexec with one user : [mboisson@gpu-k20-14 helios_test]$ mpiexec -np 2 mdrunmpi -ntomp 10 -s prod_s6_01kcal_bb_dr -deffnm testout -- A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that Open MPI stopped checking at the first component that it did not find. Host: gpu-k20-14 Framework: filem Component: rsh -- [gpu-k20-14:205673] mca: base: components_register: registering filem components [gpu-k20-14:205673] [[56298,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 673 -- It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): orte_filem_base_open failed --> Returned value Not found (-13) instead of ORTE_SUCCESS -- What is weird is that this same command works for other users, on the same node. Anyone know what might be going on here ? Thanks, -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] How to keep multiple installations at same time
The Environment Modules package user base is not negligible, including many universities, research centers, national labs, ans private companies, in the US and around the world. How does the user base of LMod compare? The user base certainly is much larger for Environment Modules than LMod. But, as a user of both Lmod and Environment Modules, I can tell you the following : Regardless of any virtues that LMod may have, currently I don't see any reason to switch to LMod, install everything over again Nothing needs reinstalling. Lmod understands Tcl modules and can work fine with your old module tree. , troubleshoot it, learn Lua, migrate my modules from Tcl, Again, migration to Lua is not required. Tcl modules gets converted on the fly. educate my users and convince them to use a new package to achieve the same exact thing that they currently have, Very little education has to be done. The commands are the same : module avail module load/add module unload/remove module use ... and in the end gain little if any relevant/useful/new functionality. If you do not want to make any changes, in the way you organize modules, then don't. You will also get no benefit from changing to Lmod in that situation. If you do want to use new features, then there are plenty. Most notably is - the possibility to organize modules in hierarchy (which you do not HAVE to do, but in my opinion, is much more intuitive). - the possibility to cache the module structure (and avoid reading it from a parallel filesystem every time a user type a module command). - the possibility to color-code modules so that users can find what they want easier out of hundreds of modules IF you do use hierarchy, you get the added benefit of avoiding user mistakes such as " module load gcc openmpi_gcc module unload gcc module load intel ... why is my MPI not working! " IF you do use hierarchy, you get the added benefit of not having silly module names such as fftw/3.3_gcc4.8_openmpi1.6.3 fftw/3.3_gcc4.6_openmpi1.8.1 ... Again, you do NOT have to, but the benefits much outweight the changes that need to be made to get them. My 2 cents, Maxime Boissonneault My two cents of opinion Gus Correa On 08/05/2014 12:54 PM, Ralph Castain wrote: Check the repo - hasn't been touched in a very long time On Aug 5, 2014, at 9:42 AM, Fabricio Cannini wrote: On 05-08-2014 13:10, Ralph Castain wrote: Since modules isn't a supported s/w package any more, you might consider using LMOD instead: https://www.tacc.utexas.edu/tacc-projects/lmod Modules isn't supported anymore? :O Could you please send a link about it ? ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24918.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24919.php ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24924.php -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] poor performance using the openib btl
Hi, I recovered the name of the option that caused problems for us. It is --enable-mpi-thread-multiple This option enables threading within OPAL, which was bugged (at least in 1.6.x series). I don't know if it has been fixed in 1.8 series. I do not see your configure line in the attached file, to see if it was enabled or not. Maxime Le 2014-06-25 10:46, Fischer, Greg A. a écrit : Attached are the results of "grep thread" on my configure output. There appears to be some amount of threading, but is there anything I should look for in particular? I see Mike Dubman's questions on the mailing list website, but his message didn't appear to make it to my inbox. The answers to his questions are: [binford:fischega] $ rpm -qa | grep ofed ofed-doc-1.5.4.1-0.11.5 ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5 ofed-1.5.4.1-0.11.5 Distro: SLES11 SP3 HCA: [binf102:fischega] $ /usr/sbin/ibstat CA 'mlx4_0' CA type: MT26428 Command line (path and LD_LIBRARY_PATH are set correctly): mpirun -x LD_LIBRARY_PATH -mca btl openib,sm,self -mca btl_openib_verbose 1 -np 31 $CTF_EXEC *From:*users [mailto:users-boun...@open-mpi.org] *On Behalf Of *Maxime Boissonneault *Sent:* Tuesday, June 24, 2014 6:41 PM *To:* Open MPI Users *Subject:* Re: [OMPI users] poor performance using the openib btl What are your threading options for OpenMPI (when it was built) ? I have seen OpenIB BTL completely lock when some level of threading is enabled before. Maxime Boissonneault Le 2014-06-24 18:18, Fischer, Greg A. a écrit : Hello openmpi-users, A few weeks ago, I posted to the list about difficulties I was having getting openib to work with Torque (see "openib segfaults with Torque", June 6, 2014). The issues were related to Torque imposing restrictive limits on locked memory, and have since been resolved. However, now that I've had some time to test the applications, I'm seeing abysmal performance over the openib layer. Applications run with the tcp btl execute about 10x faster than with the openib btl. Clearly something still isn't quite right. I tried running with "-mca btl_openib_verbose 1", but didn't see anything resembling a smoking gun. How should I go about determining the source of the problem? (This uses the same OpenMPI Version 1.8.1 / SLES11 SP3 / GCC 4.8.3 setup discussed previously.) Thanks, Greg ___ users mailing list us...@open-mpi.org <mailto:us...@open-mpi.org> Subscription:http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post:http://www.open-mpi.org/community/lists/users/2014/06/24697.php -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24700.php -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] poor performance using the openib btl
What are your threading options for OpenMPI (when it was built) ? I have seen OpenIB BTL completely lock when some level of threading is enabled before. Maxime Boissonneault Le 2014-06-24 18:18, Fischer, Greg A. a écrit : Hello openmpi-users, A few weeks ago, I posted to the list about difficulties I was having getting openib to work with Torque (see "openib segfaults with Torque", June 6, 2014). The issues were related to Torque imposing restrictive limits on locked memory, and have since been resolved. However, now that I've had some time to test the applications, I'm seeing abysmal performance over the openib layer. Applications run with the tcp btl execute about 10x faster than with the openib btl. Clearly something still isn't quite right. I tried running with "-mca btl_openib_verbose 1", but didn't see anything resembling a smoking gun. How should I go about determining the source of the problem? (This uses the same OpenMPI Version 1.8.1 / SLES11 SP3 / GCC 4.8.3 setup discussed previously.) Thanks, Greg ___ users mailing list us...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24697.php -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] affinity issues under cpuset torque 1.8.1
Hi, I've been following this thread because it may be relevant to our setup. Is there a drawback of having orte_hetero_nodes=1 as default MCA parameter ? Is there a reason why the most generic case is not assumed ? Maxime Boissonneault Le 2014-06-20 13:48, Ralph Castain a écrit : Put "orte_hetero_nodes=1" in your default MCA param file - uses can override by setting that param to 0 On Jun 20, 2014, at 10:30 AM, Brock Palen wrote: Perfection! That appears to do it for our standard case. Now I know how to set MCA options by env var or config file. How can I make this the default, that then a user can override? Brock Palen www.umich.edu/~brockp CAEN Advanced Computing XSEDE Campus Champion bro...@umich.edu (734)936-1985 On Jun 20, 2014, at 1:21 PM, Ralph Castain wrote: I think I begin to grok at least part of the problem. If you are assigning different cpus on each node, then you'll need to tell us that by setting --hetero-nodes otherwise we won't have any way to report that back to mpirun for its binding calculation. Otherwise, we expect that the cpuset of the first node we launch a daemon onto (or where mpirun is executing, if we are only launching local to mpirun) accurately represents the cpuset on every node in the allocation. We still might well have a bug in our binding computation - but the above will definitely impact what you said the user did. On Jun 20, 2014, at 10:06 AM, Brock Palen wrote: Extra data point if I do: [brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname -- A request was made to bind to that would result in binding more processes than cpus on a resource: Bind to: CORE Node:nyx5513 #processes: 2 #cpus: 1 You can override this protection by adding the "overload-allowed" option to your binding directive. -- [brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime 13:01:37 up 31 days, 23:06, 0 users, load average: 10.13, 10.90, 12.38 13:01:37 up 31 days, 23:06, 0 users, load average: 10.13, 10.90, 12.38 [brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get 0x0010 0x1000 [brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513 nyx5513 nyx5513 Interesting, if I force bind to core, MPI barfs saying there is only 1 cpu available, PBS says it gave it two, and if I force (this is all inside an interactive job) just on that node hwloc-bind --get I get what I expect, Is there a way to get a map of what MPI thinks it has on each host? Brock Palen www.umich.edu/~brockp CAEN Advanced Computing XSEDE Campus Champion bro...@umich.edu (734)936-1985 On Jun 20, 2014, at 12:38 PM, Brock Palen wrote: I was able to produce it in my test. orted affinity set by cpuset: [root@nyx5874 ~]# hwloc-bind --get --pid 103645 0xc002 This mask (1, 14,15) which is across sockets, matches the cpu set setup by the batch system. [root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus 1,14-15 The ranks though were then all set to the same core: [root@nyx5874 ~]# hwloc-bind --get --pid 103871 0x8000 [root@nyx5874 ~]# hwloc-bind --get --pid 103872 0x8000 [root@nyx5874 ~]# hwloc-bind --get --pid 103873 0x8000 Which is core 15: report-bindings gave me: You can see how a few nodes were bound to all the same core, the last one in each case. I only gave you the results for the hose nyx5874. [nyx5526.engin.umich.edu:23726] MCW rank 0 is not bound (or bound to all available processors) [nyx5878.engin.umich.edu:103925] MCW rank 8 is not bound (or bound to all available processors) [nyx5533.engin.umich.edu:123988] MCW rank 1 is not bound (or bound to all available processors) [nyx5879.engin.umich.edu:102808] MCW rank 9 is not bound (or bound to all available processors) [nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B] [nyx5874.engin.umich.edu:103645] MCW rank 42 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B] [nyx5874.engin.umich.edu:103645] MCW rank 43 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B] [nyx5888.engin.umich.edu:117400] MCW rank 11 is not bound (or bound to all available processors) [nyx5786.engin.umich.edu:30004] MCW rank 19 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B] [nyx5786.engin.umich.edu:30004] MCW rank 18 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B] [nyx5594.engin.umich.edu:33884] MCW rank 24 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B] [nyx5594.engin.umich.edu:33884] MCW rank 25 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B] [nyx5594.engin.umich.edu:33884] MCW rank 26 bound to socket 1[core 15[hwt 0]]: [./././././././.][./././././././B] [nyx5798
Re: [OMPI users] Advices for parameter tuning for CUDA-aware MPI
Answers inline too. 2) Is the absence of btl_openib_have_driver_gdr an indicator of something missing ? Yes, that means that somehow the GPU Direct RDMA is not installed correctly. All that check does is make sure that the file /sys/kernel/mm/memory_peers/nv_mem/version exists. Does that exist? It does not. There is no /sys/kernel/mm/memory_peers/ 3) Are the default parameters, especially the rdma limits and such, optimal for our configuration ? That is hard to say. GPU Direct RDMA does not work well when the GPU and IB card are not "close" on the system. Can you run "nvidia-smi topo -m" on your system? nvidia-smi topo -m gives me the error [mboisson@login-gpu01 ~]$ nvidia-smi topo -m Invalid combination of input arguments. Please run 'nvidia-smi -h' for help. I could not find anything related to topology in the help. However, I can tell you the following which I believe to be true - GPU0 and GPU1 are on PCIe bus 0, socket 0 - GPU2 and GPU3 are on PCIe bus 1, socket 0 - GPU4 and GPU5 are on PCIe bus 2, socket 1 - GPU6 and GPU7 are on PCIe bus 3, socket 1 There is one IB card which I believe is on socket 0. I know that we do not have the Mellanox Ofed. We use the Linux RDMA from CentOS 6.5. However, should that completely disable GDR within a single node ? i.e. does GDR _have_ to go through IB ? I would assume that our lack of Mellanox OFED would result in no-GDR inter-node, but GDR intra-node. Thanks -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
[OMPI users] Advices for parameter tuning for CUDA-aware MPI
async_send" (current value: "true", data source: default, level: 9 dev/all, type: bool) MCA btl: parameter "btl_openib_cuda_async_recv" (current value: "true", data source: default, level: 9 dev/all, type: bool) MCA btl: informational "btl_openib_have_cuda_gdr" (current value: "true", data source: default, level: 5 tuner/detail, type: bool) MCA btl: parameter "btl_openib_want_cuda_gdr" (current value: "false", data source: default, level: 9 dev/all, type: bool) MCA btl: parameter "btl_openib_cuda_eager_limit" (current value: "0", data source: default, level: 5 tuner/detail, type: size_t) MCA btl: parameter "btl_openib_cuda_rdma_limit" (current value: "18446744073709551615", data source: default, level: 5 tuner/detail, type: size_t) MCA btl: parameter "btl_vader_cuda_eager_limit" (current value: "0", data source: default, level: 5 tuner/detail, type: size_t) MCA btl: parameter "btl_vader_cuda_rdma_limit" (current value: "18446744073709551615", data source: default, level: 5 tuner/detail, type: size_t) MCA coll: parameter "coll_ml_config_file" (current value: "/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/share/openmpi/mca-coll-ml.config", data source: default, level: 9 dev/all, type: string) MCA io: informational "io_romio_complete_configure_params" (current value: "--with-file-system=nfs+lustre FROM_OMPI=yes CC='/software6/compilers/gcc/4.8/bin/gcc -std=gnu99' CFLAGS='-O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread' CPPFLAGS=' -I/software-gpu/src/openmpi-1.8.1/opal/mca/hwloc/hwloc172/hwloc/include -I/software-gpu/src/openmpi-1.8.1/opal/mca/event/libevent2021/libevent -I/software-gpu/src/openmpi-1.8.1/opal/mca/event/libevent2021/libevent/include' FFLAGS='' LDFLAGS=' ' --enable-shared --enable-static --with-file-system=nfs+lustre --prefix=/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37 --disable-aio", data source: default, level: 9 dev/all, type: string) [login-gpu01.calculquebec.ca:11486] mca: base: close: unloading component Q -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] openmpi configuration error?
Instead of using the outdated and not maintained Module environment, why not use Lmod : https://www.tacc.utexas.edu/tacc-projects/lmod It is a drop-in replacement for Module environment that supports all of their features and much, much more, such as : - module hierarchies - module properties and color highlighting (we use it to higlight bioinformatic modules or tools for example) - module caching (very useful for a parallel filesystem with tons of modules) - path priorities (useful to make sure personal modules take precendence over system modules) - export module tree to json It works like a charm, understand both TCL and Lua modules and is actively developped and debugged. There are litteraly new features every month or so. If it does not do what you want, odds are that the developper will add it shortly (I've had it happen). Maxime Le 2014-05-16 17:58, Douglas L Reeder a écrit : Ben, You might want to use module (source forge) to manage paths to different mpi implementations. It is fairly easy to set up and very robust for this type of problem. You would remove contentious application paths from you standard PATH and then use module to switch them in and out as needed. Doug Reeder On May 16, 2014, at 3:39 PM, Ben Lash <mailto:b...@rice.edu>> wrote: My cluster has just upgraded to a new version of MPI, and I'm using an old one. It seems that I'm having trouble compiling due to the compiler wrapper file moving (full error here: http://pastebin.com/EmwRvCd9) "Cannot open configuration file /opt/apps/openmpi/1.4.4-intel/share/openmpi/mpif90-wrapper-data.txt" I've found the file on the cluster at /opt/apps/openmpi/retired/1.4.4-intel/share/openmpi/mpif90-wrapper-data.txt How do I tell the old mpi wrapper where this file is? I've already corrected one link to mpich -> /opt/apps/openmpi/retired/1.4.4-intel/, which is in the software I'm trying to recompile's lib folder (/home/bl10/CMAQv5.0.1/lib/x86_64/ifort). Thanks for any ideas. I also tried changing $pkgdatadir based on what I read here: http://www.open-mpi.org/faq/?category=mpi-apps#default-wrapper-compiler-flags Thanks. --Ben L ___ users mailing list us...@open-mpi.org <mailto:us...@open-mpi.org> http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Question about scheduler support
Le 2014-05-16 09:06, Jeff Squyres (jsquyres) a écrit : On May 15, 2014, at 8:00 PM, Fabricio Cannini wrote: Nobody is disagreeing that one could find a way to make CMake work - all we are saying is that (a) CMake has issues too, just like autotools, and (b) we have yet to see a compelling reason to undertake the transition...which would have to be a *very* compelling one. I was simply agreeing with Maxime about why it could work. ;) But if you and the other devels are fine with it, i'm fine too. FWIW, simply for my own curiosity's sake, if someone could confirm deny whether cmake: 1. Supports the following compiler suites: GNU (that's a given, I assume), Clang, OS X native (which is variants of GNU and Clang), Absoft, PGI, Intel, Cray, HP-UX, Oracle Solaris (Linux and Solaris), Tru64, Microsoft Visual, IBM BlueGene (I think that's gcc, but am not entirely sure). (some of these matter mainly to hwloc, not necessarily OMPI) I have built projects with CMake using GNU, Intel, PGI, OS X native. CMake claims to make MSV projects, so I'm assuming MS Visual works. I can't say about the others. 2. Bootstrap a tarball such that an end user does not need to have cmake installed. That, I have no clue, but they do have a page about bootstrapping cmake itself http://www.cmake.org/cmake/help/install.html I am not sure if this is what you mean. If there is no existing CMake installation, a bootstrap script is provided: ./bootstrap make make install (Note: the make install step is optional, cmake will run from the build directory.) According to this, you could have a tarball including CMake and instruct the users to run some variant of (or make your own bootstrap script including this) ./bootstrap && make && ./cmake . && make && make install Now that I think about it, OpenFOAM uses CMake and bootstraps it if it is not install, so it is certainly possible. Maxime
Re: [OMPI users] Question about scheduler support
Le 2014-05-15 18:27, Jeff Squyres (jsquyres) a écrit : On May 15, 2014, at 6:14 PM, Fabricio Cannini wrote: Alright, but now I'm curious as to why you decided against it. Could please elaborate on it a bit ? OMPI has a long, deep history with the GNU Autotools. It's a very long, complicated story, but the high points are: 1. The GNU Autotools community has given us very good support over the years. 2. The GNU Autotools support all compilers that we want to support, including shared library support (others did not, back in 2004 when we started OMPI). 3. The GNU Autotools can fully bootstrap a tarball such that the end user does not need to have the GNU Autotools installed to build an OMPI tarball. You mean some people do NOT have GNU Autotools ? :P Jokes aside, CMake has certainly matured enough for point #2 and is used by very big projects (KDE for example). Not sure about point #3 though. I am wondering though, how do you handle Windows with OpenMPI and GNU Autotools ? I know CMake is famous for being cross-plateform (that's what the C means) and can generate builds for Windows, Visual Studio and such. In any case, I do not see any need to change from one toolchain to another, although I have seen many projects providing both and that did not seem to be too much of a hassle. It's still probably more work than what you want to get into though. Maxime
Re: [OMPI users] Question about scheduler support
Please allow me to chip in my $0.02 and suggest to not reinvent the wheel, but instead consider to migrate the build system to cmake : http://www.cmake.org/ I agree that menu-wise, CMake does a pretty good job with ccmake, and is much, much easier to create than autoconf/automake/m4 stuff (I speak from experience). However, for the command-line arguments, I find cmake non-intuitive and pretty cumbersome. As example, to say --with-tm=/usr/local/torque with CMAKE, you would have to do something like -DWITH_TM:STRING=/usr/local/torque Maxime
Re: [OMPI users] Question about scheduler support
A file would do the trick, but from my experience of building programs, I always prefer configure options. Maybe just an option --disable-optional that disables anything that is optional and non-explicitely requested. Maxime Le 2014-05-15 08:22, Bennet Fauber a écrit : Would a separate file that contains each scheduler option and is included by configure do the trick? It might read include-slurm=YES include-torque=YES etc. If all options are set to default to YES, then the people who want no options are satisfied, but those of us who would like to change the config would have an easy and scriptable way to change the option using sed or whatever. I agree with Maxime about requiring an interactive system to turn things off. It makes things difficult to script and document exactly what was done. I think providing the kitchen sink is fine for default, but a simple switch or config file that flips it to including nothing that wasn't requested might satisfy the other side. I suspect that something similar would (or could) be part of a menu configuration scheme, so the menu could be tacked on later, if it turns out to be desired, and the menu would just modify the list of things to build, so any work toward that scheme might not be lost. -- bennet On Thu, May 15, 2014 at 7:41 AM, Maxime Boissonneault wrote: Le 2014-05-15 06:29, Jeff Squyres (jsquyres) a écrit : I think Ralph's email summed it up pretty well -- we unfortunately have (at least) two distinct groups of people who install OMPI: a) those who know exactly what they want and don't want anything else b) those who don't know exactly what they want and prefer to have everything installed, and have OMPI auto-select at run time exactly what to use based on the system on which it's running We've traditionally catered to the b) crowd, and made some not-very-easy-to-use capabilities for the a) crowd (i.e., you can manually disable each plugin you don't want to build via configure, but the syntax is fairly laborious). Ralph and I talked about the possibility of something analogous to "make menuconfig" for Linux kernels, where you get a menu-like system (UI TBD) to pick exactly what options you want/don't want. That will output a text config file that can be fed to configure, something along the lines of ./configure --only-build-exactly-this-stuff=file-output-from-menuconfig This idea is *very* rough; I anticipate that it will change quite a bit over time, and it'll take us a bit of time to design and implement it. A menu-like system is not going to be very useful at least for us, since we script all of our installations. Scripting a menu is not very handy. Maxime On May 14, 2014, at 8:56 PM, Bennet Fauber wrote: I think Maxime's suggestion is sane and reasonable. Just in case you're taking ha'penny's worth from the groundlings. I think I would prefer not to have capability included that we won't use. -- bennet On Wed, May 14, 2014 at 7:43 PM, Maxime Boissonneault wrote: For the scheduler issue, I would be happy with something like, if I ask for support for X, disable support for Y, Z and W. I am assuming that very rarely will someone use more than one scheduler. Maxime Le 2014-05-14 19:09, Ralph Castain a écrit : Jeff and I have talked about this and are approaching a compromise. Still more thinking to do, perhaps providing new configure options to "only build what I ask for" and/or a tool to support a menu-driven selection of what to build - as opposed to today's "build everything you don't tell me to not-build" Tough set of compromises as it depends on the target audience. Sys admins prefer the "build only what I say", while users (who frequently aren't that familiar with the inners of a system) prefer the "build all" mentality. On May 14, 2014, at 3:16 PM, Ralph Castain wrote: Indeed, a quick review indicates that the new policy for scheduler support was not uniformly applied. I'll update it. To reiterate: we will only build support for a scheduler if the user specifically requests it. We did this because we are increasingly seeing distros include header support for various schedulers, and so just finding the required headers isn't enough to know that the scheduler is intended for use. So we wind up building a bunch of useless modules. On May 14, 2014, at 3:09 PM, Ralph Castain wrote: FWIW: I believe we no longer build the slurm support by default, though I'd have to check to be sure. The intent is definitely not to do so. The plan we adjusted to a while back was to *only* build support for schedulers upon request. Can't swear that they are all correctly updated, but that was the intent. On May 14, 2014, at 2:52 PM, Jeff Squyres (jsquyres) wrote: Here's a bit of our rational, from the README file: Note that for many of Open MPI
Re: [OMPI users] Question about scheduler support
Le 2014-05-15 06:29, Jeff Squyres (jsquyres) a écrit : I think Ralph's email summed it up pretty well -- we unfortunately have (at least) two distinct groups of people who install OMPI: a) those who know exactly what they want and don't want anything else b) those who don't know exactly what they want and prefer to have everything installed, and have OMPI auto-select at run time exactly what to use based on the system on which it's running We've traditionally catered to the b) crowd, and made some not-very-easy-to-use capabilities for the a) crowd (i.e., you can manually disable each plugin you don't want to build via configure, but the syntax is fairly laborious). Ralph and I talked about the possibility of something analogous to "make menuconfig" for Linux kernels, where you get a menu-like system (UI TBD) to pick exactly what options you want/don't want. That will output a text config file that can be fed to configure, something along the lines of ./configure --only-build-exactly-this-stuff=file-output-from-menuconfig This idea is *very* rough; I anticipate that it will change quite a bit over time, and it'll take us a bit of time to design and implement it. A menu-like system is not going to be very useful at least for us, since we script all of our installations. Scripting a menu is not very handy. Maxime On May 14, 2014, at 8:56 PM, Bennet Fauber wrote: I think Maxime's suggestion is sane and reasonable. Just in case you're taking ha'penny's worth from the groundlings. I think I would prefer not to have capability included that we won't use. -- bennet On Wed, May 14, 2014 at 7:43 PM, Maxime Boissonneault wrote: For the scheduler issue, I would be happy with something like, if I ask for support for X, disable support for Y, Z and W. I am assuming that very rarely will someone use more than one scheduler. Maxime Le 2014-05-14 19:09, Ralph Castain a écrit : Jeff and I have talked about this and are approaching a compromise. Still more thinking to do, perhaps providing new configure options to "only build what I ask for" and/or a tool to support a menu-driven selection of what to build - as opposed to today's "build everything you don't tell me to not-build" Tough set of compromises as it depends on the target audience. Sys admins prefer the "build only what I say", while users (who frequently aren't that familiar with the inners of a system) prefer the "build all" mentality. On May 14, 2014, at 3:16 PM, Ralph Castain wrote: Indeed, a quick review indicates that the new policy for scheduler support was not uniformly applied. I'll update it. To reiterate: we will only build support for a scheduler if the user specifically requests it. We did this because we are increasingly seeing distros include header support for various schedulers, and so just finding the required headers isn't enough to know that the scheduler is intended for use. So we wind up building a bunch of useless modules. On May 14, 2014, at 3:09 PM, Ralph Castain wrote: FWIW: I believe we no longer build the slurm support by default, though I'd have to check to be sure. The intent is definitely not to do so. The plan we adjusted to a while back was to *only* build support for schedulers upon request. Can't swear that they are all correctly updated, but that was the intent. On May 14, 2014, at 2:52 PM, Jeff Squyres (jsquyres) wrote: Here's a bit of our rational, from the README file: Note that for many of Open MPI's --with- options, Open MPI will, by default, search for header files and/or libraries for . If the relevant files are found, Open MPI will built support for ; if they are not found, Open MPI will skip building support for . However, if you specify --with- on the configure command line and Open MPI is unable to find relevant support for , configure will assume that it was unable to provide a feature that was specifically requested and will abort so that a human can resolve out the issue. In some cases, we don't need header or library files. For example, with SLURM and LSF, our native support is actually just fork/exec'ing the SLURM/LSF executables under the covers (e.g., as opposed to using rsh/ssh). So we can basically *always* build them. So we do. In general, OMPI builds support for everything that it can find on the rationale that a) we can't know ahead of time exactly what people want, and b) most people want to just "./configure && make -j 32 install" and be done with it -- so build as much as possible. On May 14, 2014, at 5:31 PM, Maxime Boissonneault wrote: Hi Gus, Oh, I know that, what I am refering to is that slurm and loadleveler support are enabled by default, and it seems that if we're using Torque/Moab, we have no use for slurm an
Re: [OMPI users] Question about scheduler support
For the scheduler issue, I would be happy with something like, if I ask for support for X, disable support for Y, Z and W. I am assuming that very rarely will someone use more than one scheduler. Maxime Le 2014-05-14 19:09, Ralph Castain a écrit : Jeff and I have talked about this and are approaching a compromise. Still more thinking to do, perhaps providing new configure options to "only build what I ask for" and/or a tool to support a menu-driven selection of what to build - as opposed to today's "build everything you don't tell me to not-build" Tough set of compromises as it depends on the target audience. Sys admins prefer the "build only what I say", while users (who frequently aren't that familiar with the inners of a system) prefer the "build all" mentality. On May 14, 2014, at 3:16 PM, Ralph Castain wrote: Indeed, a quick review indicates that the new policy for scheduler support was not uniformly applied. I'll update it. To reiterate: we will only build support for a scheduler if the user specifically requests it. We did this because we are increasingly seeing distros include header support for various schedulers, and so just finding the required headers isn't enough to know that the scheduler is intended for use. So we wind up building a bunch of useless modules. On May 14, 2014, at 3:09 PM, Ralph Castain wrote: FWIW: I believe we no longer build the slurm support by default, though I'd have to check to be sure. The intent is definitely not to do so. The plan we adjusted to a while back was to *only* build support for schedulers upon request. Can't swear that they are all correctly updated, but that was the intent. On May 14, 2014, at 2:52 PM, Jeff Squyres (jsquyres) wrote: Here's a bit of our rational, from the README file: Note that for many of Open MPI's --with- options, Open MPI will, by default, search for header files and/or libraries for . If the relevant files are found, Open MPI will built support for ; if they are not found, Open MPI will skip building support for . However, if you specify --with- on the configure command line and Open MPI is unable to find relevant support for , configure will assume that it was unable to provide a feature that was specifically requested and will abort so that a human can resolve out the issue. In some cases, we don't need header or library files. For example, with SLURM and LSF, our native support is actually just fork/exec'ing the SLURM/LSF executables under the covers (e.g., as opposed to using rsh/ssh). So we can basically *always* build them. So we do. In general, OMPI builds support for everything that it can find on the rationale that a) we can't know ahead of time exactly what people want, and b) most people want to just "./configure && make -j 32 install" and be done with it -- so build as much as possible. On May 14, 2014, at 5:31 PM, Maxime Boissonneault wrote: Hi Gus, Oh, I know that, what I am refering to is that slurm and loadleveler support are enabled by default, and it seems that if we're using Torque/Moab, we have no use for slurm and loadleveler support. My point is not that it is hard to compile it with torque support, my point is that it is compiling support for many schedulers while I'm rather convinced that very few sites actually use multiple schedulers at the same time. Maxime Le 2014-05-14 16:51, Gus Correa a écrit : On 05/14/2014 04:25 PM, Maxime Boissonneault wrote: Hi, I was compiling OpenMPI 1.8.1 today and I noticed that pretty much every single scheduler has its support enabled by default at configure (except the one I need, which is Torque). Is there a reason for that ? Why not have a single scheduler enabled and require to specify it at configure time ? Is there any reason for me to build with loadlever or slurm if we're using torque ? Thanks, Maxime Boisssonneault Hi Maxime I haven't tried 1.8.1 yet. However, for all previous versions of OMPI I tried, up to 1.6.5, all it took to configure it with Torque support was to point configure to the Torque installation directory (which is non-standard in my case): --with-tm=/opt/torque/bla/bla My two cents, Gus Correa _______ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ users mailing list us...@open-mpi.org http://www.open
Re: [OMPI users] Question about scheduler support
Hi Gus, Oh, I know that, what I am refering to is that slurm and loadleveler support are enabled by default, and it seems that if we're using Torque/Moab, we have no use for slurm and loadleveler support. My point is not that it is hard to compile it with torque support, my point is that it is compiling support for many schedulers while I'm rather convinced that very few sites actually use multiple schedulers at the same time. Maxime Le 2014-05-14 16:51, Gus Correa a écrit : On 05/14/2014 04:25 PM, Maxime Boissonneault wrote: Hi, I was compiling OpenMPI 1.8.1 today and I noticed that pretty much every single scheduler has its support enabled by default at configure (except the one I need, which is Torque). Is there a reason for that ? Why not have a single scheduler enabled and require to specify it at configure time ? Is there any reason for me to build with loadlever or slurm if we're using torque ? Thanks, Maxime Boisssonneault Hi Maxime I haven't tried 1.8.1 yet. However, for all previous versions of OMPI I tried, up to 1.6.5, all it took to configure it with Torque support was to point configure to the Torque installation directory (which is non-standard in my case): --with-tm=/opt/torque/bla/bla My two cents, Gus Correa ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
[OMPI users] Question about scheduler support
Hi, I was compiling OpenMPI 1.8.1 today and I noticed that pretty much every single scheduler has its support enabled by default at configure (except the one I need, which is Torque). Is there a reason for that ? Why not have a single scheduler enabled and require to specify it at configure time ? Is there any reason for me to build with loadlever or slurm if we're using torque ? Thanks, Maxime Boisssonneault
Re: [OMPI users] checkpoint/restart facility - blcr
I heard that c/r support in OpenMPI was being dropped after version 1.6.x. Is this not still the case ? Maxime Boissonneault Le 2014-02-27 13:09, George Bosilca a écrit : Both were supported at some point. I'm not sure if any is still in a workable state in the trunk today. However, there is an ongoing effort to reinstate the coordinated approach. George. On Feb 27, 2014, at 18:50 , basma a.azeem <mailto:basmaabdelaz...@hotmail.com>> wrote: i have a question about the checkpoint/restart facility of BLCR with OPEN MPI , does the checkpoint/restart solution as a whole can be considered as a coordinated or uncoordinated approach ___ users mailing list us...@open-mpi.org <mailto:us...@open-mpi.org> http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] slowdown with infiniband and latest CentOS kernel
Hi, Do you have thread multiples enabled in your OpenMPI installation ? Maxime Boissonneault Le 2013-12-16 17:40, Noam Bernstein a écrit : Has anyone tried to use openmpi 1.7.3 with the latest CentOS kernel (well, nearly latest: 2.6.32-431.el6.x86_64), and especially with infiniband? I'm seeing lots of weird slowdowns, especially when using infiniband, but even when running with "--mca btl self,sm" (it's much worse with IB, though), so I was wondering if anyone else has tested this kernel yet? Once I have some more detailed information I'll follow up. Noam ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Problem compiling against torque 4.2.4
Hi, You are probably missing the moab-torque-devel package (or torque-devel package if there is one). You need the *-devel to have the headers in order to compile against torque. Maxime Le 2013-12-04 15:06, Matt Burgess a écrit : Hello, I can't seem to compile openmpi version 1.6.5 against torque 4.2.4. Here's the configure line I'm using: ./configure --with-tm=/dg/local/cots/torque/torque-4.2.4/ The relevant portion of config.log appears to be: configure:92031: checking --with-tm value configure:92051: result: sanity check ok (/dg/local/cots/torque/torque-4.2.4/) configure:92076: checking for pbs-config configure:92086: result: /dg/local/cots/torque/torque-4.2.4//bin/pbs-config configure:92099: ess_tm_CPPFLAGS from pbs-config: configure:92122: ess_tm_LDFLAGS from pbs-config: configure:92145: ess_tm_LIBS from pbs-config: configure:92160: checking tm.h usability configure:92160: gcc -c -DNDEBUG -g -O2 -finline-functions -fno-strict-aliasing -pthread -I/root/openmpi-1.6.5/opal/mca/hwloc/hwloc132/hwloc/include conftest.c >&5 conftest.c:597:16: error: tm.h: No such file or directory configure:92160: $? = 1 Thanks in advance for any help anybody can provide. DigitalGlobe logo http://www.digitalglobe.com/images/dg_02.gif *Matt Burgess*** /Linux/HPC Engineer/ +1.303.684.1132 office +1.919.355.8672 mobile matt.burg...@digitalglobe.com <mailto:matt.burg...@digitalglobe.com> This electronic communication and any attachments may contain confidential and proprietary information of DigitalGlobe, Inc. If you are not the intended recipient, or an agent or employee responsible for delivering this communication to the intended recipient, or if you have received this communication in error, please do not print, copy, retransmit, disseminate or otherwise use the information. Please indicate to the sender that you have received this communication in error, and delete the copy you received. DigitalGlobe reserves the right to monitor any electronic communication sent or received by its employees, agents or representatives. ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- --------- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] MPI_THREAD_MULTIPLE causes deadlock in simple MPI_Barrier case (ompi 1.6.5 and 1.7.3)
Hi Jean-François ;) Do you have the same behavior if you disable openib at run time ? : --mca btl ^openib My experience with the OpenIB BTL is that its inner threading is bugged. Maxime Le 2013-11-28 19:21, Jean-Francois St-Pierre a écrit : Hi, I've compiled ompi1.6.5 with multi-thread support (using Intel compilers 12.0.4.191, but I get the same result with gcc) : ./configure --with-tm/opt/torque --with-openib --enable-mpi-thread-multiple CC=icc CXX=icpc F77=ifort FC=ifort And i've built a simple sample code that only does the Init and one MPI_Barrier. The core of the code is : setbuf(stdout, NULL); MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided); fprintf(stdout,"%6d: Provided thread support %d ", getpid(), provided); int flag,claimed; MPI_Is_thread_main( &flag ); MPI_Query_thread( &claimed ); fprintf(stdout,"%6d: Before Comm_rank, flag %d, claimed %d \n", getpid(), flag, claimed); MPI_Comm_rank(MPI_COMM_WORLD, &gRank); fprintf(stdout,"%6d: Comm_size rank %d\n",getpid(), gRank); MPI_Comm_size(MPI_COMM_WORLD, &gNTasks); fprintf(stdout,"%6d: Before Barrier\n", getpid()); MPI_Barrier( MPI_COMM_WORLD ); fprintf(stdout,"%6d: After Barrier\n", getpid()); MPI_Finalize(); When I launch it on 2 nodes (mono-core per node) I get this sample output : /*** Output mpirun -pernode -np 2 sample_code 7356: Provided thread support 3 MPI_THREAD_MULTIPLE 7356: Before Comm_rank, flag 1, claimed 3 7356: Comm_size rank 0 7356: Before Barrier 26277: Provided thread support 3 MPI_THREAD_MULTIPLE 26277: Before Comm_rank, flag 1, claimed 3 26277: Comm_size rank 1 26277: Before Barrier ^Cmpirun: killing job... */ The deadlock never gets over the MPI_Barrier when I use MPI_THREAD_MULTIPLE, but it runs fine using MPI_THREAD_SERIALIZED . I get the same behavior with ompi 1.7.3. I don't get a deadlock when the 2 MPI processes are hosted on the same node. In attachement, you'll find my config.out, config.log, environment variables on the execution node, both make.out, sample code and output etc. Thanks, Jeff ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Very high latency with openib btl
Hi again, I managed to reproduce the "bug" with a simple case (see the cpp file attached). I am running this on 2 nodes with 8 cores each. If I run with mpiexec ./test-mpi-latency.out then the MPI_Ssend operations take about ~1e-5 second for intra-node ranks, and ~11 seconds for inter-node ranks. Note that 11 seconds is roughly the time required to execute the loop that is after the MPI_Recv. The average time required for the MPI_Ssend to return is 5.1 seconds. If I run with : mpiexec --mca btl ^openib ./test-mpi-latency.out then intra-node communications take ~0.5-1e-5 seconds, while internode communications take ~1e-6 seconds, for an average of ~5e-5 seconds. I compiled this with gcc 4.7.2 + openmpi 1.6.3, as well as gcc 4.6.1 + openmpi 1.4.5. Both show the same behavior. However, on the same machine, with gcc 4.6.1 + mvapich2/1.8, the latency is always quite low. The fact that mvapich2 does not show this behavior points out to a problem with the openib btl within openmpi, and not with our setup. Can anyone try to reproduce this on a different machine ? Thanks, Maxime Boissonneault Le 2013-02-15 14:29, Maxime Boissonneault a écrit : Hi again, I found out that if I add an MPI_Barrier after the MPI_Recv part, then there is no minute-long latency. Is it possible that even if MPI_Recv returns, the openib btl does not guarantee that the acknowledgement is sent promptly ? In other words, is it possible that the computation following the MPI_Recv delays the acknowledgement ? If so, is it supposed to be this way, or is it normal, and why isn't the same behavior observed with the tcp btl ? Maxime Boissonneault Le 2013-02-14 11:50, Maxime Boissonneault a écrit : Hi, I have a strange case here. The application is "plink" (http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml). The computation/communication pattern of the application is the following : 1- MPI_Init 2- Some single rank computation 3- MPI_Bcast 4- Some single rank computation 5- MPI_Barrier 6- rank 0 sends data to each other rank with MPI_Ssend, one rank at a time. 6- other ranks use MPI_Recv 7- Some single rank computation 8- other ranks send result to rank 0 with MPI_Ssend 8- rank 0 receives data with MPI_Recv 9- rank 0 analyses result 10- MPI_Finalize The amount of data being sent is of the order of the kilobytes, and we have IB. The problem we observe is in step 6. I've output dates before and after each MPI operation. With the openib btl, the behavior I observe is that : - rank 0 starts sending - rank n receives almost instantly, and MPI_Recv returns. - rank 0's MPI_Ssend often returns _minutes_. It looks like the acknowledgement from rank n takes minutes to reach rank 0. Now, the tricky part is that if I disable the openib btl to use instead tcp over IB, there is no such latency and the acknowledgement comes back within a fraction of a second. Also, if rank 0 and rank n are on the same node, the acknowledgement is also quasi-instantaneous (I guess it goes through the SM btl instead of openib). I tried to reproduce this in a simple case, but I observed no such latency. The duration that I got for the whole communication is of the order of milliseconds. Does anyone have an idea of what could cause such very high latencies when using the OpenIB BTL ? Also, I tried replacing step 6 by explicitly sending a confirmation : - rank 0 does MPI_Isend to rank n followed by MPI_Recv from rank n - rank n does MPI_Recv from rank 0 followed by MPI_Isend to rank 0 In this case also, rank n's MPI_Isend executes quasi-instantaneously, and rank 0's MPI_Recv only returns a few minutes later. Thanks, Maxime Boissonneault -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique #include #include #include #include #include #include #include #include #include #include "mpi.h" using namespace std; static struct timespec start, end, duration; static int my_rank, nrank; static int my_mpi_tag_send=0; void clock_start() { clock_gettime(CLOCK_MONOTONIC,&start); } double clock_end(const string & op, int rank_print=0) { double duration_in_sec; clock_gettime(CLOCK_MONOTONIC,&end); duration.tv_sec = end.tv_sec - start.tv_sec; duration.tv_nsec = end.tv_nsec - start.tv_nsec; while (duration.tv_nsec > 10) { duration.tv_sec++; duration.tv_nsec -= 10; } while (duration.tv_nsec < 0) { duration.tv_sec--; duration.tv_nsec += 10; } duration_in_sec = duration.tv_sec + double(duration.tv_nsec)/10.; if (my_rank == rank_print) cout << "Operation \"" << op << "\" done. Took: " << duration_in_sec << " seconds." << endl; return durat
Re: [OMPI users] Very high latency with openib btl
Hi again, I found out that if I add an MPI_Barrier after the MPI_Recv part, then there is no minute-long latency. Is it possible that even if MPI_Recv returns, the openib btl does not guarantee that the acknowledgement is sent promptly ? In other words, is it possible that the computation following the MPI_Recv delays the acknowledgement ? If so, is it supposed to be this way, or is it normal, and why isn't the same behavior observed with the tcp btl ? Maxime Boissonneault Le 2013-02-14 11:50, Maxime Boissonneault a écrit : Hi, I have a strange case here. The application is "plink" (http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml). The computation/communication pattern of the application is the following : 1- MPI_Init 2- Some single rank computation 3- MPI_Bcast 4- Some single rank computation 5- MPI_Barrier 6- rank 0 sends data to each other rank with MPI_Ssend, one rank at a time. 6- other ranks use MPI_Recv 7- Some single rank computation 8- other ranks send result to rank 0 with MPI_Ssend 8- rank 0 receives data with MPI_Recv 9- rank 0 analyses result 10- MPI_Finalize The amount of data being sent is of the order of the kilobytes, and we have IB. The problem we observe is in step 6. I've output dates before and after each MPI operation. With the openib btl, the behavior I observe is that : - rank 0 starts sending - rank n receives almost instantly, and MPI_Recv returns. - rank 0's MPI_Ssend often returns _minutes_. It looks like the acknowledgement from rank n takes minutes to reach rank 0. Now, the tricky part is that if I disable the openib btl to use instead tcp over IB, there is no such latency and the acknowledgement comes back within a fraction of a second. Also, if rank 0 and rank n are on the same node, the acknowledgement is also quasi-instantaneous (I guess it goes through the SM btl instead of openib). I tried to reproduce this in a simple case, but I observed no such latency. The duration that I got for the whole communication is of the order of milliseconds. Does anyone have an idea of what could cause such very high latencies when using the OpenIB BTL ? Also, I tried replacing step 6 by explicitly sending a confirmation : - rank 0 does MPI_Isend to rank n followed by MPI_Recv from rank n - rank n does MPI_Recv from rank 0 followed by MPI_Isend to rank 0 In this case also, rank n's MPI_Isend executes quasi-instantaneously, and rank 0's MPI_Recv only returns a few minutes later. Thanks, Maxime Boissonneault -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
[OMPI users] Very high latency with openib btl
Hi, I have a strange case here. The application is "plink" (http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml). The computation/communication pattern of the application is the following : 1- MPI_Init 2- Some single rank computation 3- MPI_Bcast 4- Some single rank computation 5- MPI_Barrier 6- rank 0 sends data to each other rank with MPI_Ssend, one rank at a time. 6- other ranks use MPI_Recv 7- Some single rank computation 8- other ranks send result to rank 0 with MPI_Ssend 8- rank 0 receives data with MPI_Recv 9- rank 0 analyses result 10- MPI_Finalize The amount of data being sent is of the order of the kilobytes, and we have IB. The problem we observe is in step 6. I've output dates before and after each MPI operation. With the openib btl, the behavior I observe is that : - rank 0 starts sending - rank n receives almost instantly, and MPI_Recv returns. - rank 0's MPI_Ssend often returns _minutes_. It looks like the acknowledgement from rank n takes minutes to reach rank 0. Now, the tricky part is that if I disable the openib btl to use instead tcp over IB, there is no such latency and the acknowledgement comes back within a fraction of a second. Also, if rank 0 and rank n are on the same node, the acknowledgement is also quasi-instantaneous (I guess it goes through the SM btl instead of openib). I tried to reproduce this in a simple case, but I observed no such latency. The duration that I got for the whole communication is of the order of milliseconds. Does anyone have an idea of what could cause such very high latencies when using the OpenIB BTL ? Also, I tried replacing step 6 by explicitly sending a confirmation : - rank 0 does MPI_Isend to rank n followed by MPI_Recv from rank n - rank n does MPI_Recv from rank 0 followed by MPI_Isend to rank 0 In this case also, rank n's MPI_Isend executes quasi-instantaneously, and rank 0's MPI_Recv only returns a few minutes later. Thanks, Maxime Boissonneault
Re: [OMPI users] Checkpointing an MPI application with OMPI
Le 2013-01-29 21:02, Ralph Castain a écrit : On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault <mailto:maxime.boissonnea...@calculquebec.ca>> wrote: While our filesystem and management nodes are on UPS, our compute nodes are not. With one average generic (power/cooling mostly) failure every one or two months, running for weeks is just asking for trouble. If you add to that typical dimm/cpu/networking failures (I estimated about 1 node goes down per day because of some sort hardware failure, for a cluster of 960 nodes). With these numbers, a job running on 32 nodes for 7 days has a ~35% chance of failing before it is done. I've been running this in my head all day - it just doesn't fit experience, which really bothered me. So I spent a little time running the calculation, and I came up with a number much lower (more like around 5%). I'm not saying my rough number is correct, but it is at least a little closer to what we see in the field. Given that there are a lot of assumptions required when doing these calculations, I would like to suggest you conduct a very simply and quick experiment before investing tons of time on FT solutions. All you have to do is: Thanks for the calculation. However, this is a cluster that I manage, I do not use it per say, and running such statistical jobs on a large part of the cluster for a long period of time is impossible. We do have the numbers however. The cluster has 960 nodes. We experience roughly one power or cooling failure per month or two months. Assuming one such failure per two months, if you run for 1 month, you have a 50% chance your job will be killed before it ends. If you run for 2 weeks, 25%, etc. These are very rough estimates obviously, but it is way more than 5%. In addition to that, we have a failure rate of ~0.1%/day, meaning that out of 960, on average, one node will have a hardware failure every day. Most of the time, this is a failure of one of the dimms. Considering each node has 12 dimms of 2GB of memory, it means a dimm failure rate of ~0.0001 per day. I don't know if that's bad or not, but this is roughly what we have. If it turns out you see power failure problems, then a simple, low-cost, ride-thru power stabilizer might be a good solution. Flywheels and capacitor-based systems can provide support for momentary power quality issues at reasonably low costs for a cluster of your size. I doubt there is anything low cost for a 330 kW system, and in any case, hardware upgrade is not an option since this a mid-life cluster. Again, as I said, the filesystem (2 x 500 TB lustre partitions) and the management nodes are on UPS, but there is no way to put the compute nodes on UPS. If your node hardware is the problem, or you decide you do want/need to pursue an FT solution, then you might look at the OMPI-based solutions from parties such as http://fault-tolerance.org or the MPICH2 folks. Thanks for the tip. Best regards, Maxime
Re: [OMPI users] Checkpointing an MPI application with OMPI
Hi George, The problem here is not the bandwidth, but the number of IOPs. I wrote to the BLCR list, and they confirmed that : "While ideally the checkpoint would be written in sizable chunks, the current code in BLCR will issue a single write operation for each contiguous range of user memory, and many quite small writes for various meta-data and non-memory state (registers, signal-handlers,etc). As show in Table 1 of the paper cited above, the writes in the 10s of KB range will dominate performance." (Reference being : X. Ouyang, R. Rajachandrasekhar, X. Besseron, H. Wang, J. Huang and D. K. Panda, CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart, Int'l Conference on Parallel Processing (ICPP '11), Sept. 2011. (PDF <http://nowlab.cse.ohio-state.edu/publications/conf-papers/2011/ouyangx-icpp2011.pdf>)) We did run parallel IO benchmarks. Our filesystem can reach a speed of ~15GB/s, but only with large IO operations (at least bigger than 1MB, optimally in the 100MB-1GB range). For small (<1MB) operations, the filesystem is considerably slower. I believe this is precisely what is killing the performance here. Not sure there is anything to be done about it. Best regards, Maxime Le 2013-01-28 15:40, George Bosilca a écrit : At the scale you address you should have no trouble with the C/R if the file system is correctly configured. We get more bandwidth per node out of an NFS over 1Gb/s at 32 nodes. Have you run some parallel benchmarks on your cluster ? George. PS: You can some MPI I/O benchmarks at http://www.mcs.anl.gov/~thakur/pio-benchmarks.html On Mon, Jan 28, 2013 at 2:04 PM, Ralph Castain wrote: On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault wrote: Le 2013-01-28 13:15, Ralph Castain a écrit : On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault wrote: Le 2013-01-28 12:46, Ralph Castain a écrit : On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault wrote: Hello Ralph, I agree that ideally, someone would implement checkpointing in the application itself, but that is not always possible (commercial applications, use of complicated libraries, algorithms with no clear progression points at which you can interrupt the algorithm and start it back from there). Hmmm...well, most apps can be adjusted to support it - we have some very complex apps that were updated that way. Commercial apps are another story, but we frankly don't find much call for checkpointing those as they typically just don't run long enough - especially if you are only running 256 ranks, so your cluster is small. Failure rates just don't justify it in such cases, in our experience. Is there some particular reason why you feel you need checkpointing? This specific case is that the jobs run for days. The risk of a hardware or power failure for that kind of duration goes too high (we allow for no more than 48 hours of run time). I'm surprised by that - we run with UPS support on the clusters, but for a small one like you describe, we find the probability that a job will be interrupted even during a multi-week app is vanishingly small. FWIW: I do work with the financial industry where we regularly run apps that execute non-stop for about a month with no reported failures. Are you actually seeing failures, or are you anticipating them? While our filesystem and management nodes are on UPS, our compute nodes are not. With one average generic (power/cooling mostly) failure every one or two months, running for weeks is just asking for trouble. Wow, that is high If you add to that typical dimm/cpu/networking failures (I estimated about 1 node goes down per day because of some sort hardware failure, for a cluster of 960 nodes). That is incredibly high With these numbers, a job running on 32 nodes for 7 days has a ~35% chance of failing before it is done. I've never seen anything like that behavior in practice - a 32 node cluster typically runs for quite a few months without a failure. It is a typical size for the financial sector, so we have a LOT of experience with such clusters. I suspect you won't see anything like that behavior... Having 24GB of ram per node, even if a 32 nodes job takes close to 100% of the ram, that's merely 640 GB of data. Writing that on a lustre filesystem capable of reaching ~15GB/s should take no more than a few minutes if written correctly. Right now, I am getting a few minutes for a hundredth of this amount of data! Agreed - but I don't think you'll get that bandwidth for checkpointing. I suspect you'll find that checkpointing really has troubles when scaling, which is why you don't see it used in production (at least, I haven't). Mostly used for research by a handful of organizations, which is why we haven't been too concerned about its loss. While it is true we can dig through the code of the library to make
Re: [OMPI users] Checkpointing an MPI application with OMPI
Le 2013-01-28 13:15, Ralph Castain a écrit : On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault wrote: Le 2013-01-28 12:46, Ralph Castain a écrit : On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault wrote: Hello Ralph, I agree that ideally, someone would implement checkpointing in the application itself, but that is not always possible (commercial applications, use of complicated libraries, algorithms with no clear progression points at which you can interrupt the algorithm and start it back from there). Hmmm...well, most apps can be adjusted to support it - we have some very complex apps that were updated that way. Commercial apps are another story, but we frankly don't find much call for checkpointing those as they typically just don't run long enough - especially if you are only running 256 ranks, so your cluster is small. Failure rates just don't justify it in such cases, in our experience. Is there some particular reason why you feel you need checkpointing? This specific case is that the jobs run for days. The risk of a hardware or power failure for that kind of duration goes too high (we allow for no more than 48 hours of run time). I'm surprised by that - we run with UPS support on the clusters, but for a small one like you describe, we find the probability that a job will be interrupted even during a multi-week app is vanishingly small. FWIW: I do work with the financial industry where we regularly run apps that execute non-stop for about a month with no reported failures. Are you actually seeing failures, or are you anticipating them? While our filesystem and management nodes are on UPS, our compute nodes are not. With one average generic (power/cooling mostly) failure every one or two months, running for weeks is just asking for trouble. If you add to that typical dimm/cpu/networking failures (I estimated about 1 node goes down per day because of some sort hardware failure, for a cluster of 960 nodes). With these numbers, a job running on 32 nodes for 7 days has a ~35% chance of failing before it is done. Having 24GB of ram per node, even if a 32 nodes job takes close to 100% of the ram, that's merely 640 GB of data. Writing that on a lustre filesystem capable of reaching ~15GB/s should take no more than a few minutes if written correctly. Right now, I am getting a few minutes for a hundredth of this amount of data! While it is true we can dig through the code of the library to make it checkpoint, BLCR checkpointing just seemed easier. I see - just be aware that checkpoint support in OMPI will disappear in v1.7 and there is no clear timetable for restoring it. That is very good to know. Thanks for the information. It is too bad though. There certainly must be a better way to write the information down on disc though. Doing 500 IOPs seems to be completely broken. Why isn't there buffering involved ? I don't know - that's all done in BLCR, I believe. Either way, it isn't something we can address due to the loss of our supporter for c/r. I suppose I should contact BLCR instead then. For the disk op problem, I think that's the way to go - though like I said, I could be wrong and the disk writes could be something we do inside OMPI. I'm not familiar enough with the c/r code to state it with certainty. Thank you, Maxime Sorry we can't be of more help :-( Ralph Thanks, Maxime Le 2013-01-28 10:58, Ralph Castain a écrit : Our c/r person has moved on to a different career path, so we may not have anyone who can answer this question. What we can say is that checkpointing at any significant scale will always be a losing proposition. It just takes too long and hammers the file system. People have been working on extending the capability with things like "burst buffers" (basically putting an SSD in front of the file system to absorb the checkpoint surge), but that hasn't become very common yet. Frankly, what people have found to be the "best" solution is for your app to periodically write out its intermediate results, and then take a flag that indicates "read prior results" when it starts. This minimizes the amount of data being written to the disk. If done correctly, you would only lose whatever work was done since the last intermediate result was written - which is about equivalent to losing whatever works was done since the last checkpoint. HTH Ralph On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault wrote: Hello, I am doing checkpointing tests (with BLCR) with an MPI application compiled with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange. First, some details about the tests : - The only filesystem available on the nodes are 1) one tmpfs, 2) one lustre shared filesystem (tested to be able to provide ~15GB/s for writing and support ~40k IOPs). - The job was running with 8 or 16 MPI ranks on nodes wit
Re: [OMPI users] Checkpointing an MPI application with OMPI
Le 2013-01-28 12:46, Ralph Castain a écrit : On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault wrote: Hello Ralph, I agree that ideally, someone would implement checkpointing in the application itself, but that is not always possible (commercial applications, use of complicated libraries, algorithms with no clear progression points at which you can interrupt the algorithm and start it back from there). Hmmm...well, most apps can be adjusted to support it - we have some very complex apps that were updated that way. Commercial apps are another story, but we frankly don't find much call for checkpointing those as they typically just don't run long enough - especially if you are only running 256 ranks, so your cluster is small. Failure rates just don't justify it in such cases, in our experience. Is there some particular reason why you feel you need checkpointing? This specific case is that the jobs run for days. The risk of a hardware or power failure for that kind of duration goes too high (we allow for no more than 48 hours of run time). While it is true we can dig through the code of the library to make it checkpoint, BLCR checkpointing just seemed easier. There certainly must be a better way to write the information down on disc though. Doing 500 IOPs seems to be completely broken. Why isn't there buffering involved ? I don't know - that's all done in BLCR, I believe. Either way, it isn't something we can address due to the loss of our supporter for c/r. I suppose I should contact BLCR instead then. Thank you, Maxime Sorry we can't be of more help :-( Ralph Thanks, Maxime Le 2013-01-28 10:58, Ralph Castain a écrit : Our c/r person has moved on to a different career path, so we may not have anyone who can answer this question. What we can say is that checkpointing at any significant scale will always be a losing proposition. It just takes too long and hammers the file system. People have been working on extending the capability with things like "burst buffers" (basically putting an SSD in front of the file system to absorb the checkpoint surge), but that hasn't become very common yet. Frankly, what people have found to be the "best" solution is for your app to periodically write out its intermediate results, and then take a flag that indicates "read prior results" when it starts. This minimizes the amount of data being written to the disk. If done correctly, you would only lose whatever work was done since the last intermediate result was written - which is about equivalent to losing whatever works was done since the last checkpoint. HTH Ralph On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault wrote: Hello, I am doing checkpointing tests (with BLCR) with an MPI application compiled with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange. First, some details about the tests : - The only filesystem available on the nodes are 1) one tmpfs, 2) one lustre shared filesystem (tested to be able to provide ~15GB/s for writing and support ~40k IOPs). - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2 nodes). Each MPI rank was using approximately 200MB of memory. - I was doing checkpoints with ompi-checkpoint and restarting with ompi-restart. - I was starting with mpirun -am ft-enable-cr - The nodes are monitored by ganglia, which allows me to see the number of IOPs and the read/write speed on the filesystem. I tried a few different mca settings, but I consistently observed that : - The checkpoints lasted ~4-5 minutes each time - During checkpoint, each node (8 ranks) was doing ~500 IOPs, and writing at ~15MB/s. I am worried by the number of IOPs and the very slow writing speed. This was a very small test. We have jobs running with 128 or 256 MPI ranks, each using 1-2 GB of ram per rank. With such jobs, the overall number of IOPs would reach tens of thousands and would completely overload our lustre filesystem. Moreover, with 15MB/s per node, the checkpointing process would take hours. How can I improve on that ? Is there an MCA setting that I am missing ? Thanks, -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
Re: [OMPI users] Checkpointing an MPI application with OMPI
Hello Ralph, I agree that ideally, someone would implement checkpointing in the application itself, but that is not always possible (commercial applications, use of complicated libraries, algorithms with no clear progression points at which you can interrupt the algorithm and start it back from there). There certainly must be a better way to write the information down on disc though. Doing 500 IOPs seems to be completely broken. Why isn't there buffering involved ? Thanks, Maxime Le 2013-01-28 10:58, Ralph Castain a écrit : Our c/r person has moved on to a different career path, so we may not have anyone who can answer this question. What we can say is that checkpointing at any significant scale will always be a losing proposition. It just takes too long and hammers the file system. People have been working on extending the capability with things like "burst buffers" (basically putting an SSD in front of the file system to absorb the checkpoint surge), but that hasn't become very common yet. Frankly, what people have found to be the "best" solution is for your app to periodically write out its intermediate results, and then take a flag that indicates "read prior results" when it starts. This minimizes the amount of data being written to the disk. If done correctly, you would only lose whatever work was done since the last intermediate result was written - which is about equivalent to losing whatever works was done since the last checkpoint. HTH Ralph On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault wrote: Hello, I am doing checkpointing tests (with BLCR) with an MPI application compiled with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange. First, some details about the tests : - The only filesystem available on the nodes are 1) one tmpfs, 2) one lustre shared filesystem (tested to be able to provide ~15GB/s for writing and support ~40k IOPs). - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2 nodes). Each MPI rank was using approximately 200MB of memory. - I was doing checkpoints with ompi-checkpoint and restarting with ompi-restart. - I was starting with mpirun -am ft-enable-cr - The nodes are monitored by ganglia, which allows me to see the number of IOPs and the read/write speed on the filesystem. I tried a few different mca settings, but I consistently observed that : - The checkpoints lasted ~4-5 minutes each time - During checkpoint, each node (8 ranks) was doing ~500 IOPs, and writing at ~15MB/s. I am worried by the number of IOPs and the very slow writing speed. This was a very small test. We have jobs running with 128 or 256 MPI ranks, each using 1-2 GB of ram per rank. With such jobs, the overall number of IOPs would reach tens of thousands and would completely overload our lustre filesystem. Moreover, with 15MB/s per node, the checkpointing process would take hours. How can I improve on that ? Is there an MCA setting that I am missing ? Thanks, -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- ----- Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
[OMPI users] Checkpointing an MPI application with OMPI
Hello, I am doing checkpointing tests (with BLCR) with an MPI application compiled with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange. First, some details about the tests : - The only filesystem available on the nodes are 1) one tmpfs, 2) one lustre shared filesystem (tested to be able to provide ~15GB/s for writing and support ~40k IOPs). - The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2 nodes). Each MPI rank was using approximately 200MB of memory. - I was doing checkpoints with ompi-checkpoint and restarting with ompi-restart. - I was starting with mpirun -am ft-enable-cr - The nodes are monitored by ganglia, which allows me to see the number of IOPs and the read/write speed on the filesystem. I tried a few different mca settings, but I consistently observed that : - The checkpoints lasted ~4-5 minutes each time - During checkpoint, each node (8 ranks) was doing ~500 IOPs, and writing at ~15MB/s. I am worried by the number of IOPs and the very slow writing speed. This was a very small test. We have jobs running with 128 or 256 MPI ranks, each using 1-2 GB of ram per rank. With such jobs, the overall number of IOPs would reach tens of thousands and would completely overload our lustre filesystem. Moreover, with 15MB/s per node, the checkpointing process would take hours. How can I improve on that ? Is there an MCA setting that I am missing ? Thanks, -- - Maxime Boissonneault Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique