Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-26 Thread Mouhamad Al-Sayed-Ali

Hi Gus,

I have done as you suggested, but it still doesn't work!

Many thanks for your help


Mouhamad
Gus Correa <g...@ldeo.columbia.edu> wrote:


Hi Mouhamad

A stack of 10240 kB is probably the Linux default,
and not necessarily good for HPC and number crunching.
I'd suggest that you change it to unlimited,
unless your system administrator has a very good reason not to.
We've seen many atmosphere/ocean/climate models crash because
they couldn't allocate memory on the stack [automatic arrays
in subroutines, etc.].

This has nothing to do with MPI,
the programs can fail even when they run in serial mode
because of this.

You can just append this line to /etc/security/limits.conf:

*   -   stack   -1
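
After that change, a quick way to confirm it took effect is to open a new
login on a compute node and check the limit again (a minimal sketch; the
exact behaviour depends on how PAM applies limits.conf on your
distribution, and editing the file needs root):

   # append the soft/hard stack entry; "-1" means unlimited
   echo '*   -   stack   -1' >> /etc/security/limits.conf

   # then, in a NEW ssh session on the node:
   ulimit -s          # should now print "unlimited"

If you can't touch limits.conf, raising the limit inside the job script,
e.g. "ulimit -s unlimited" right before the mpirun line, also works as
long as the hard limit allows it.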


I hope this helps,
Gus Correa


Mouhamad Al-Sayed-Ali wrote:

Hi Gus Correa,

the output of ulimit -a is



file(blocks)         unlimited
coredump(blocks)     2048
data(kbytes)         unlimited
stack(kbytes)        10240
lockedmem(kbytes)    unlimited
memory(kbytes)       unlimited
nofiles(descriptors) 1024
processes            256



Thanks

Mouhamad
Gus Correa <g...@ldeo.columbia.edu> wrote:


Hi Mouhamad

The locked memory is set to unlimited, but the lines
about the stack are commented out.
Have you tried to add this line:

*   -   stack   -1

then run wrf again? [Note no "#" hash character]

Also, if you login to the compute nodes,
what is the output of 'limit' [csh,tcsh] or 'ulimit -a' [sh,bash]?
This should tell you what limits are actually set.
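
For example, something along these lines checks every node in one go (a
sketch, assuming password-less ssh to the nodes and a machine file, here
called "machines", with one hostname per line):

   for host in $(sort -u machines); do
       echo "== $host =="
       ssh "$host" 'ulimit -a'
   done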

I hope this helps,
Gus Correa

Mouhamad Al-Sayed-Ali wrote:

Hi all,

 I've checked the "limits.conf", and it contains these lines


# Jcb 29.06.2007 : pbs wrf (Siji)
#*   hard   stack    100
#*   soft   stack    100

# Dr 14.02.2008 : pour voltaire mpi
*    hard   memlock  unlimited
*    soft   memlock  unlimited



Many thanks for your help
Mouhamad

Gus Correa <g...@ldeo.columbia.edu> wrote:


Hi Mouhamad, Ralph, Terry

Very often big programs like wrf crash with a segfault because they
can't allocate memory on the stack; they assume the system doesn't
impose any limit on it.  This has nothing to do with MPI.

Mouhamad:  Check if your stack size is set to unlimited on all compute
nodes.  The easy way to get it done
is to change /etc/security/limits.conf,
where you or your system administrator could add these lines:

*   -   memlock -1
*   -   stack   -1
*   -   nofile  4096
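
You can also ask mpirun itself to report the limits the MPI processes will
actually inherit, which catches per-node differences (a sketch; the
hostnames are just the ones from your machine file):

   mpirun -np 2 -host part031,part033 bash -c 'hostname; ulimit -s -l -n'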

My two cents,
Gus Correa

Ralph Castain wrote:

Looks like you are crashing in wrf - have you asked them for help?

On Oct 25, 2011, at 7:53 AM, Mouhamad Al-Sayed-Ali wrote:


Hi again,

This is exactly the error I have:


taskid: 0 hostname: part034.u-bourgogne.fr
[part034:21443] *** Process received signal ***
[part034:21443] Signal: Segmentation fault (11)
[part034:21443] Signal code: Address not mapped (1)
[part034:21443] Failing at address: 0xfffe01eeb340
[part034:21443] [ 0] /lib64/libpthread.so.0 [0x3612c0de70]
[part034:21443] [ 1] wrf.exe(__module_ra_rrtm_MOD_taugb3+0x418) [0x11cc9d8]
[part034:21443] [ 2] wrf.exe(__module_ra_rrtm_MOD_gasabs+0x260) [0x11cfca0]
[part034:21443] [ 3] wrf.exe(__module_ra_rrtm_MOD_rrtm+0xb31) [0x11e6e41]
[part034:21443] [ 4] wrf.exe(__module_ra_rrtm_MOD_rrtmlwrad+0x25ec) [0x11e9bcc]
[part034:21443] [ 5] wrf.exe(__module_radiation_driver_MOD_radiation_driver+0xe573) [0xcc4ed3]
[part034:21443] [ 6] wrf.exe(__module_first_rk_step_part1_MOD_first_rk_step_part1+0x40c5) [0xe0e4f5]
[part034:21443] [ 7] wrf.exe(solve_em_+0x22e58) [0x9b45c8]
[part034:21443] [ 8] wrf.exe(solve_interface_+0x80a) [0x902dda]
[part034:21443] [ 9] wrf.exe(__module_integrate_MOD_integrate+0x236) [0x4b2c4a]
[part034:21443] [10] wrf.exe(__module_wrf_top_MOD_wrf_run+0x24) [0x47a924]
[part034:21443] [11] wrf.exe(main+0x41) [0x4794d1]
[part034:21443] [12] /lib64/libc.so.6(__libc_start_main+0xf4) [0x361201d8b4]
[part034:21443] [13] wrf.exe [0x4793c9]
[part034:21443] *** End of error message ***
---

Mouhamad







Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-26 Thread Mouhamad Al-Sayed-Ali

Hi Gus Correa,

 the output of ulimit -a is



file(blocks)         unlimited
coredump(blocks)     2048
data(kbytes)         unlimited
stack(kbytes)        10240
lockedmem(kbytes)    unlimited
memory(kbytes)       unlimited
nofiles(descriptors) 1024
processes            256



Thanks

Mouhamad
Gus Correa <g...@ldeo.columbia.edu> wrote:


Hi Mouhamad

The locked memory is set to unlimited, but the lines
about the stack are commented out.
Have you tried to add this line:

*   -   stack   -1

then run wrf again? [Note no "#" hash character]

Also, if you login to the compute nodes,
what is the output of 'limit' [csh,tcsh] or 'ulimit -a' [sh,bash]?
This should tell you what limits are actually set.

I hope this helps,
Gus Correa

Mouhamad Al-Sayed-Ali wrote:

Hi all,

  I've checked the "limits.conf", and it contains these lines


# Jcb 29.06.2007 : pbs wrf (Siji)
#*   hard   stack    100
#*   soft   stack    100

# Dr 14.02.2008 : pour voltaire mpi
*    hard   memlock  unlimited
*    soft   memlock  unlimited



Many thanks for your help
Mouhamad

Gus Correa <g...@ldeo.columbia.edu> wrote:


Hi Mouhamad, Ralph, Terry

Very often big programs like wrf crash with a segfault because they
can't allocate memory on the stack; they assume the system doesn't
impose any limit on it.  This has nothing to do with MPI.

Mouhamad:  Check if your stack size is set to unlimited on all compute
nodes.  The easy way to get it done
is to change /etc/security/limits.conf,
where you or your system administrator could add these lines:

*   -   memlock -1
*   -   stack   -1
*   -   nofile  4096

My two cents,
Gus Correa

Ralph Castain wrote:

Looks like you are crashing in wrf - have you asked them for help?

On Oct 25, 2011, at 7:53 AM, Mouhamad Al-Sayed-Ali wrote:


Hi again,

This is exactly the error I have:


taskid: 0 hostname: part034.u-bourgogne.fr
[part034:21443] *** Process received signal ***
[part034:21443] Signal: Segmentation fault (11)
[part034:21443] Signal code: Address not mapped (1)
[part034:21443] Failing at address: 0xfffe01eeb340
[part034:21443] [ 0] /lib64/libpthread.so.0 [0x3612c0de70]
[part034:21443] [ 1] wrf.exe(__module_ra_rrtm_MOD_taugb3+0x418) [0x11cc9d8]
[part034:21443] [ 2] wrf.exe(__module_ra_rrtm_MOD_gasabs+0x260) [0x11cfca0]
[part034:21443] [ 3] wrf.exe(__module_ra_rrtm_MOD_rrtm+0xb31) [0x11e6e41]
[part034:21443] [ 4] wrf.exe(__module_ra_rrtm_MOD_rrtmlwrad+0x25ec) [0x11e9bcc]
[part034:21443] [ 5] wrf.exe(__module_radiation_driver_MOD_radiation_driver+0xe573) [0xcc4ed3]
[part034:21443] [ 6] wrf.exe(__module_first_rk_step_part1_MOD_first_rk_step_part1+0x40c5) [0xe0e4f5]
[part034:21443] [ 7] wrf.exe(solve_em_+0x22e58) [0x9b45c8]
[part034:21443] [ 8] wrf.exe(solve_interface_+0x80a) [0x902dda]
[part034:21443] [ 9] wrf.exe(__module_integrate_MOD_integrate+0x236) [0x4b2c4a]
[part034:21443] [10] wrf.exe(__module_wrf_top_MOD_wrf_run+0x24) [0x47a924]
[part034:21443] [11] wrf.exe(main+0x41) [0x4794d1]
[part034:21443] [12] /lib64/libc.so.6(__libc_start_main+0xf4) [0x361201d8b4]
[part034:21443] [13] wrf.exe [0x4793c9]
[part034:21443] *** End of error message ***
---

Mouhamad







Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread Mouhamad Al-Sayed-Ali

Hi all,

   I've checked the "limits.conf", and it contains these lines


# Jcb 29.06.2007 : pbs wrf (Siji)
#*   hard   stack    100
#*   soft   stack    100

# Dr 14.02.2008 : pour voltaire mpi
*    hard   memlock  unlimited
*    soft   memlock  unlimited



Many thanks for your help
Mouhamad

Gus Correa <g...@ldeo.columbia.edu> wrote:


Hi Mouhamad, Ralph, Terry

Very often big programs like wrf crash with a segfault because they
can't allocate memory on the stack; they assume the system doesn't
impose any limit on it.  This has nothing to do with MPI.

Mouhamad:  Check if your stack size is set to unlimited on all compute
nodes.  The easy way to get it done
is to change /etc/security/limits.conf,
where you or your system administrator could add these lines:

*   -   memlock -1
*   -   stack   -1
*   -   nofile  4096

My two cents,
Gus Correa

Ralph Castain wrote:

Looks like you are crashing in wrf - have you asked them for help?

On Oct 25, 2011, at 7:53 AM, Mouhamad Al-Sayed-Ali wrote:


Hi again,

This is exactly the error I have:


taskid: 0 hostname: part034.u-bourgogne.fr
[part034:21443] *** Process received signal ***
[part034:21443] Signal: Segmentation fault (11)
[part034:21443] Signal code: Address not mapped (1)
[part034:21443] Failing at address: 0xfffe01eeb340
[part034:21443] [ 0] /lib64/libpthread.so.0 [0x3612c0de70]
[part034:21443] [ 1] wrf.exe(__module_ra_rrtm_MOD_taugb3+0x418) [0x11cc9d8]
[part034:21443] [ 2] wrf.exe(__module_ra_rrtm_MOD_gasabs+0x260) [0x11cfca0]
[part034:21443] [ 3] wrf.exe(__module_ra_rrtm_MOD_rrtm+0xb31) [0x11e6e41]
[part034:21443] [ 4] wrf.exe(__module_ra_rrtm_MOD_rrtmlwrad+0x25ec) [0x11e9bcc]
[part034:21443] [ 5] wrf.exe(__module_radiation_driver_MOD_radiation_driver+0xe573) [0xcc4ed3]
[part034:21443] [ 6] wrf.exe(__module_first_rk_step_part1_MOD_first_rk_step_part1+0x40c5) [0xe0e4f5]
[part034:21443] [ 7] wrf.exe(solve_em_+0x22e58) [0x9b45c8]
[part034:21443] [ 8] wrf.exe(solve_interface_+0x80a) [0x902dda]
[part034:21443] [ 9] wrf.exe(__module_integrate_MOD_integrate+0x236) [0x4b2c4a]
[part034:21443] [10] wrf.exe(__module_wrf_top_MOD_wrf_run+0x24) [0x47a924]
[part034:21443] [11] wrf.exe(main+0x41) [0x4794d1]
[part034:21443] [12] /lib64/libc.so.6(__libc_start_main+0xf4) [0x361201d8b4]
[part034:21443] [13] wrf.exe [0x4793c9]
[part034:21443] *** End of error message ***
---

Mouhamad







Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread Mouhamad Al-Sayed-Ali

Hi again,

 This is exactly the error I have:


taskid: 0 hostname: part034.u-bourgogne.fr
[part034:21443] *** Process received signal ***
[part034:21443] Signal: Segmentation fault (11)
[part034:21443] Signal code: Address not mapped (1)
[part034:21443] Failing at address: 0xfffe01eeb340
[part034:21443] [ 0] /lib64/libpthread.so.0 [0x3612c0de70]
[part034:21443] [ 1] wrf.exe(__module_ra_rrtm_MOD_taugb3+0x418) [0x11cc9d8]
[part034:21443] [ 2] wrf.exe(__module_ra_rrtm_MOD_gasabs+0x260) [0x11cfca0]
[part034:21443] [ 3] wrf.exe(__module_ra_rrtm_MOD_rrtm+0xb31) [0x11e6e41]
[part034:21443] [ 4] wrf.exe(__module_ra_rrtm_MOD_rrtmlwrad+0x25ec) [0x11e9bcc]
[part034:21443] [ 5] wrf.exe(__module_radiation_driver_MOD_radiation_driver+0xe573) [0xcc4ed3]
[part034:21443] [ 6] wrf.exe(__module_first_rk_step_part1_MOD_first_rk_step_part1+0x40c5) [0xe0e4f5]
[part034:21443] [ 7] wrf.exe(solve_em_+0x22e58) [0x9b45c8]
[part034:21443] [ 8] wrf.exe(solve_interface_+0x80a) [0x902dda]
[part034:21443] [ 9] wrf.exe(__module_integrate_MOD_integrate+0x236) [0x4b2c4a]
[part034:21443] [10] wrf.exe(__module_wrf_top_MOD_wrf_run+0x24) [0x47a924]
[part034:21443] [11] wrf.exe(main+0x41) [0x4794d1]
[part034:21443] [12] /lib64/libc.so.6(__libc_start_main+0xf4) [0x361201d8b4]
[part034:21443] [13] wrf.exe [0x4793c9]
[part034:21443] *** End of error message ***
---

Mouhamad


Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread Mouhamad Al-Sayed-Ali

Hello


can you run wrf successfully on one node?


No, it can't run on one node.

Can you run a simple code across your two nodes?  I would try  
hostname then some simple MPI program like the ring example.

Yes, I can run a simple code
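
For reference, the two-node smoke test suggested above would look roughly
like this (a sketch; the machine-file path and the location of the Open MPI
"examples" sources will differ on your system):

   mpirun -machinefile machines -np 4 hostname
   mpicc ring_c.c -o ring_c      # ring_c.c ships with the Open MPI sources
   mpirun -machinefile machines -np 4 ./ring_c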

many thanks

Mouhamad






Re: [OMPI users] exited on signal 11 (Segmentation fault).

2011-10-25 Thread Mouhamad Al-Sayed-Ali

hello,


-What version of ompi are you using

  I am using ompi version 1.4.1-1 compiled with gcc 4.5


-What type of machine and os are you running on

   I'm using a 64-bit Linux machine.


-What does the machine file look like

  part033
  part033
  part031
  part031


-Is there a stack trace left behind by the pid that seg faulted?

  No, there is no stack trace
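
One way to get a backtrace next time (a sketch; note that the coredump
limit of 2048 blocks shown earlier may be too small for wrf.exe, and core
file names vary by system):

   ulimit -c unlimited        # in the job script, before mpirun
   # after the crash, on the node that wrote a core file:
   gdb ./wrf.exe core         # or core.<pid>
   (gdb) bt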


Thanks for your help

Mouhamad Alsayed


Re: [OMPI users] Memory mapped memory

2011-10-25 Thread Mouhamad Al-Sayed-Ali

Hello,

I have tried to run the executable "wrf.exe", using

  mpirun -machinefile /tmp/108388.1.par2/machines -np 4 wrf.exe

but, I've got the following error:

--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 9942 on node
part031.u-bourgogne.fr exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
   11.54s real 6.03s user 0.32s system
Starter(9908): Return code=139
Starter end(9908)




Thanks for your help


Mouhamad Alsayed


Re: [OMPI users] no allocated resources for the application........(mpirun)

2011-07-19 Thread Mouhamad Al-Sayed-Ali

Hi Gus Correa,

 Thank you for your response.

 I'll check what you're saying and see if it works.

many thanks

Sincerely,

Mouhamad
Gus Correa <g...@ldeo.columbia.edu> wrote:


Hi Mouhamad

Did you check if your Open MPI setup is working with the simple
programs in the "examples" subdirectory (connectivity_c.c,  
hello_c.c, ring_c.c)?

This will tell you if the problem is with Open MPI or with arpege
(whatever program arpege is).
You can compile the examples with mpicc and run them with mpirun
with the same hostfile that you used for arpege.

Also, make sure your PATH and LD_LIBRARY_PATH point to the bin and
lib subdirectories of your Open MPI installation.
You can set this in your .cshrc/.bashrc file.
See:
http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
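
A concrete version of both suggestions might look like this (a sketch; the
install prefix /opt/openmpi and the hostfile name "file" are just
placeholders for whatever your system actually uses):

   export PATH=/opt/openmpi/bin:$PATH
   export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH

   mpicc hello_c.c -o hello_c        # from the Open MPI "examples" directory
   mpirun -np 2 --hostfile file ./hello_c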

The Open MPI FAQ are worth reading, and may help you with this and
other problems:
http://www.open-mpi.org/faq/

Just a suggestion.
Gus Correa

Mouhamad Al-Sayed-Ali wrote:

Hello all,

I have been trying to run the executable "arpege" with

mpirun -np 2 --host file arpege


where file contains the names of the machines. But I get the
following error:


--------------------------------------------------------------------------
There are no allocated resources for the application
  arpege
that match the requested mapping:

Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished


Can anyone help me, please?


Sincerely,


Mouhamad Al sayed ali







[OMPI users] no allocated resources for the application........(mpirun)

2011-07-19 Thread Mouhamad Al-Sayed-Ali

Hello all,

 I have been trying to run the executable "arpege" with

mpirun -np 2 --host file arpege


where file contains the names of the machines. But I get the following error:

--------------------------------------------------------------------------
There are no allocated resources for the application
  arpege
that match the requested mapping:

Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished


Can anyone help me, please?


Sincerely,


Mouhamad Al sayed ali