Re: [OMPI users] exited on signal 11 (Segmentation fault).
Hi Gus,

I have done as you suggested, but it still doesn't work! Many thanks for your help.

Mouhamad

Gus Correa <g...@ldeo.columbia.edu> wrote:

Hi Mouhamad

A stack of 10240 kB is probably the Linux default, not necessarily good for HPC and number crunching. I'd suggest that you change it to unlimited, unless your system administrator has a very good reason not to do so. We've seen many atmosphere/ocean/climate models crash because they couldn't allocate memory on the stack [automatic arrays in subroutines, etc.]. This has nothing to do with MPI; the programs can fail even when they run in serial mode because of this.

You can just append this line to /etc/security/limits.conf:

* - stack -1

I hope this helps,
Gus Correa

[...]
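To see why a 10240 kB stack can make a code like wrf segfault while the same code runs with an unlimited stack, here is a minimal C sketch of the "automatic array" problem Gus describes (hypothetical illustration, not taken from WRF or Open MPI): a subroutine-local array of roughly 16 MB lives on the stack and overflows a 10 MB limit.

    /* stack_demo.c - hypothetical illustration only, not part of WRF.
     * A ~16 MB automatic (stack) array, similar to a Fortran automatic
     * array inside a subroutine, overflows a 10240 kB stack limit.
     *
     * Build:  gcc stack_demo.c -o stack_demo
     * Run:    (ulimit -s 10240; ./stack_demo)      # likely SIGSEGV (11)
     *         (ulimit -s unlimited; ./stack_demo)  # prints the sum
     */
    #include <stdio.h>

    #define N (2000 * 1000)   /* 2,000,000 doubles = ~16 MB on the stack */

    static double fill_and_sum(void)
    {
        double work[N];       /* automatic array: allocated on the stack */
        double sum = 0.0;
        int i;

        for (i = 0; i < N; i++) {
            work[i] = (double)i;
            sum += work[i];
        }
        return sum;
    }

    int main(void)
    {
        printf("sum = %g\n", fill_and_sum());
        return 0;
    }

Raising the stack limit (or moving such arrays to the heap) lets the same code run, which is why the advice in this thread is to set the stack to unlimited on all compute nodes.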
Re: [OMPI users] exited on signal 11 (Segmentation fault).
Hi Gus Correa,

The output of ulimit -a is:

file(blocks)          unlimited
coredump(blocks)      2048
data(kbytes)          unlimited
stack(kbytes)         10240
lockedmem(kbytes)     unlimited
memory(kbytes)        unlimited
nofiles(descriptors)  1024
processes             256

Thanks
Mouhamad

Gus Correa <g...@ldeo.columbia.edu> wrote:

Hi Mouhamad

The locked memory is set to unlimited, but the lines about the stack are commented out. Have you tried to add this line:

* - stack -1

then run wrf again? [Note: no "#" hash character.]

Also, if you log in to the compute nodes, what is the output of 'limit' [csh, tcsh] or 'ulimit -a' [sh, bash]? This should tell you what limits are actually set.

I hope this helps,
Gus Correa

[...]
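Since the limits that matter are the ones in effect for the MPI processes on the compute nodes, not just in an interactive login shell, one way to check them is a tiny MPI program that reports RLIMIT_STACK from every rank. This is a minimal sketch, assuming mpicc and mpirun come from the same Open MPI installation; the file and program names are made up for the example.

    /* print_stack_limit.c - minimal sketch: print the soft/hard stack
     * limit seen by each MPI rank, to verify limits.conf changes on the
     * compute nodes themselves (-1 means unlimited).
     *
     * Build:  mpicc print_stack_limit.c -o print_stack_limit
     * Run:    mpirun -machinefile machines -np 4 ./print_stack_limit
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/resource.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        char host[256];
        struct rlimit rl;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        gethostname(host, sizeof(host));
        getrlimit(RLIMIT_STACK, &rl);

        printf("rank %d on %s: stack soft = %ld kB, hard = %ld kB\n",
               rank, host,
               rl.rlim_cur == RLIM_INFINITY ? -1L : (long)(rl.rlim_cur / 1024),
               rl.rlim_max == RLIM_INFINITY ? -1L : (long)(rl.rlim_max / 1024));

        MPI_Finalize();
        return 0;
    }

If a rank still reports 10240 kB after editing /etc/security/limits.conf, the limit is probably being reset somewhere else, for example by the batch system or the shell startup files on that node.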
Re: [OMPI users] exited on signal 11 (Segmentation fault).
Hi all,

I've checked the "limits.conf", and it contains these lines:

# Jcb 29.06.2007 : pbs wrf (Siji)
#* hard stack 100
#* soft stack 100
# Dr 14.02.2008 : pour voltaire mpi
* hard memlock unlimited
* soft memlock unlimited

Many thanks for your help
Mouhamad

Gus Correa <g...@ldeo.columbia.edu> wrote:

Hi Mouhamad, Ralph, Terry

Very often big programs like wrf crash with a segfault because they can't allocate memory on the stack, and assume the system doesn't impose any limits on it. This has nothing to do with MPI.

Mouhamad: check whether your stack size is set to unlimited on all compute nodes. The easy way to get it done is to change /etc/security/limits.conf, where you or your system administrator could add these lines:

* - memlock -1
* - stack -1
* - nofile 4096

My two cents,
Gus Correa

Ralph Castain wrote:

Looks like you are crashing in wrf - have you asked them for help?

[...]
Re: [OMPI users] exited on signal 11 (Segmentation fault).
Hi again,

This is exactly the error I have:

taskid: 0 hostname: part034.u-bourgogne.fr
[part034:21443] *** Process received signal ***
[part034:21443] Signal: Segmentation fault (11)
[part034:21443] Signal code: Address not mapped (1)
[part034:21443] Failing at address: 0xfffe01eeb340
[part034:21443] [ 0] /lib64/libpthread.so.0 [0x3612c0de70]
[part034:21443] [ 1] wrf.exe(__module_ra_rrtm_MOD_taugb3+0x418) [0x11cc9d8]
[part034:21443] [ 2] wrf.exe(__module_ra_rrtm_MOD_gasabs+0x260) [0x11cfca0]
[part034:21443] [ 3] wrf.exe(__module_ra_rrtm_MOD_rrtm+0xb31) [0x11e6e41]
[part034:21443] [ 4] wrf.exe(__module_ra_rrtm_MOD_rrtmlwrad+0x25ec) [0x11e9bcc]
[part034:21443] [ 5] wrf.exe(__module_radiation_driver_MOD_radiation_driver+0xe573) [0xcc4ed3]
[part034:21443] [ 6] wrf.exe(__module_first_rk_step_part1_MOD_first_rk_step_part1+0x40c5) [0xe0e4f5]
[part034:21443] [ 7] wrf.exe(solve_em_+0x22e58) [0x9b45c8]
[part034:21443] [ 8] wrf.exe(solve_interface_+0x80a) [0x902dda]
[part034:21443] [ 9] wrf.exe(__module_integrate_MOD_integrate+0x236) [0x4b2c4a]
[part034:21443] [10] wrf.exe(__module_wrf_top_MOD_wrf_run+0x24) [0x47a924]
[part034:21443] [11] wrf.exe(main+0x41) [0x4794d1]
[part034:21443] [12] /lib64/libc.so.6(__libc_start_main+0xf4) [0x361201d8b4]
[part034:21443] [13] wrf.exe [0x4793c9]
[part034:21443] *** End of error message ***

---
Mouhamad
Re: [OMPI users] exited on signal 11 (Segmentation fault).
Hello,

Can you run wrf successfully on one node?

No, it can't run on one node.

Can you run a simple code across your two nodes? I would try hostname, then some simple MPI program like the ring example.

Yes, I can run a simple code.

Many thanks
Mouhamad
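For reference, the ring test mentioned above (the ring_c.c program in Open MPI's examples directory) passes a token around all ranks. Below is a simplified sketch in the same spirit, not the shipped file, to run across the two nodes with at least two processes; the machinefile name is assumed.

    /* ring_min.c - simplified sketch of a ring test (not the shipped
     * ring_c.c): rank 0 starts a token, each rank forwards it to the
     * next, and it finally comes back to rank 0. Run with np >= 2.
     *
     * Build:  mpicc ring_min.c -o ring_min
     * Run:    mpirun -machinefile machines -np 4 ./ring_min
     */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, token;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            token = 42;   /* start the token at rank 0 */
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 got the token back: %d\n", token);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank %d received token %d\n", rank, token);
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

If a test like this runs across part031 and part033 but wrf still segfaults, that again points at wrf's own memory use (for instance the stack limit discussed earlier in the thread) rather than at the MPI layer.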
Re: [OMPI users] exited on signal 11 (Segmentation fault).
Hello,

- What version of ompi are you using?

I am using ompi version 1.4.1-1, compiled with gcc 4.5.

- What type of machine and os are you running on?

I'm using a 64-bit Linux machine.

- What does the machine file look like?

part033
part033
part031
part031

- Is there a stack trace left behind by the pid that seg faulted?

No, there is no stack trace.

Thanks for your help
Mouhamad Alsayed
Re: [OMPI users] Memory mapped memory
Hello,

I have tried to run the executable "wrf.exe" using

mpirun -machinefile /tmp/108388.1.par2/machines -np 4 wrf.exe

but I've got the following error:

--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 9942 on node part031.u-bourgogne.fr exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
11.54s real 6.03s user 0.32s system
Starter(9908): Return code=139
Starter end(9908)

Thanks for your help
Mouhamad Alsayed
Re: [OMPI users] no allocated resources for the application........(mpirun)
Hi Gus Correa,

Thank you for your response. I'll check what you're saying and see if it'll work.

Many thanks
Sincerely
Mouhamad

Gus Correa <g...@ldeo.columbia.edu> wrote:

Hi Mouhamad

Did you check if your Open MPI setup is working with the simple programs in the "examples" subdirectory (connectivity_c.c, hello_c.c, ring_c.c)? This will tell you if the problem is with Open MPI or with arpege (whatever program arpege is). You can compile the examples with mpicc and run them with mpirun, with the same hostfile that you used for arpege.

Also, make sure your PATH and LD_LIBRARY_PATH point to the bin and lib subdirectories of your Open MPI installation. You can set this in your .cshrc/.bashrc file. See:
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path

The Open MPI FAQ are worth reading, and may help you with this and other problems:
http://www.open-mpi.org/faq/

Just a suggestion.
Gus Correa

[...]
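Following Gus's suggestion, an even smaller check than the shipped examples is a hello-style program that only prints each rank and its hostname. This is a minimal sketch (not the hello_c.c from the examples directory); the run line assumes the machine file is the file named "file" from the original command, passed via --hostfile.

    /* hello_min.c - minimal sketch (not the shipped hello_c.c): print
     * rank, size and hostname, to confirm that Open MPI can launch
     * processes on every host listed in the hostfile.
     *
     * Build:  mpicc hello_min.c -o hello_min
     * Run:    mpirun -np 2 --hostfile file ./hello_min
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        char host[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        gethostname(host, sizeof(host));

        printf("hello from rank %d of %d on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }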
[OMPI users] no allocated resources for the application........(mpirun)
Hello all,

I have been trying to run the executable "arpege" with

mpirun -np 2 --host file arpege

where "file" contains the names of the machines. But I get the following error:

--------------------------------------------------------------------------
There are no allocated resources for the application arpege that match the requested mapping: Verify that you have mapped the allocated resources properly using the --host or --hostfile specification.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

Can anyone help me, please?

Sincerely
Mouhamad Al sayed ali