[Wien] Fwd: MPI segmentation fault

2010-01-29 Thread Laurence Marks
I've edited your information down (too large for the list), and am
including it so others can see if they run into similar problems.

In essence, you have a mess and you are going to have to talk to your
sysadmin (hikmpn) to get things sorted out. Issues:

a) You have openmpi-1.3.3. This works for small problems but fails for
large ones; it needs to be updated to 1.4.0 or 1.4.1 (the older
versions of openmpi have bugs).
b) The openmpi was compiled with ifort 10.1, but you are using 11.1.064
for Wien2k -- this could lead to problems.
c) The openmpi was compiled with gcc and ifort 10.1, not icc and ifort,
which could lead to problems.
d) The fftw library you are using was compiled with gcc, not icc; this
could lead to problems.
e) Some of the shared libraries are found via your LD_LIBRARY_PATH, so
you will need to add -x LD_LIBRARY_PATH to how mpirun is called (in
$WIENROOT/parallel_options) -- look at man mpirun and the sketch after
this list.
f) I still don't know what the stack limits are on your machine --
this can lead to severe problems in lapw0_mpi.
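
(For reference, a minimal sketch of the change in e); this is not the exact
line from this thread, and Wien2k substitutes _NP_, _HOSTS_ and _EXEC_ itself
when the parallel scripts run:)

# in $WIENROOT/parallel_options (csh syntax, as used by the Wien2k scripts)
# -x LD_LIBRARY_PATH exports the caller's library path to every remote rank,
# so the mpi binaries can find the MKL and Intel runtime libraries.
setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"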

-- Forwarded message --
From: Fhokrul Islam 
Date: Fri, Jan 29, 2010 at 9:16 AM
Subject: MPI segmentation fault
To: "L-marks at northwestern.edu" 

Below is the information that you requested. I would like to mention
that MPI worked fine when I used it for a bulk 8-atom system, but for a
surface supercell of 96 atoms it crashes at lapw0.

Thanks,
Fhokrul

>> 1) Please do "ompi_info " and paste the output to the end of your
>> response to this email.

1. [eishfh at milleotto s110]$ ompi_info
                 Package: Open MPI hikmpn at milleotto.local Distribution
                Open MPI: 1.3.3
                  Prefix: /home/hikmpn/local
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: milleotto.local
           Configured by: hikmpn
 Fortran90 bindings size: small
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: ifort
  Fortran77 compiler abs: /sw/pkg/intel/10.1/bin//ifort
      Fortran90 compiler: ifort
  Fortran90 compiler abs: /sw/pkg/intel/10.1/bin//ifort

>> 2) Also paste the output of "echo $LD_LIBRARY_PATH"

2. [eishfh at milleotto s110]$ echo $LD_LIBRARY_PATH
/home/eishfh/fftw-2.1.5-gcc/lib/:/home/hikmpn/local/lib/:/sw/pkg/intel/11.1.064//lib/intel64:/sw/pkg/mkl/10.0/lib/em64t:/lib64:/usr/lib64:/usr/X11R6/lib64:/lib:/usr/lib:/usr/X11R6/lib:/usr/local/lib

>> 3) If you have in your .bashrc a "ulimit -s unlimited" please edit
>> this (temporarily) out, then ssh into one of the child nodes.

After editing the .bashrc file I did the following from a child node:

3. [eishfh at mn012 ~]$ which mpirun
/home/hikmpn/local/bin/mpirun

4. [eishfh at mn012 ~]$ which lapw0_mpi
/disk/global/home/eishfh/Wien2k_09_2/lapw0_mpi

5. [eishfh at mn012 ~]$ echo $LD_LIBRARY_PATH
-bash: 
home/eishfh/fftw-2.1.5-gcc/lib/:/home/hikmpn/local/lib/:/sw/pkg/intel/11.1.064//lib/intel64:/sw/pkg/mkl/10.0/lib/em64t:/lib64:/usr/lib64:/usr/X11R6/lib64:/lib:/usr/lib:/usr/X11R6/lib:/usr/local/lib

6. [eishfh at mn012 ~]$ ldd $WIENROOT/lapw0_mpi
        libmkl_intel_lp64.so => /sw/pkg/mkl/10.0/lib/em64t/libmkl_intel_lp64.so (0x2ab5610d3000)
        libmkl_sequential.so => /sw/pkg/mkl/10.0/lib/em64t/libmkl_sequential.so (0x2ab5613d9000)
        libmkl_core.so => /sw/pkg/mkl/10.0/lib/em64t/libmkl_core.so (0x2ab561566000)
        libiomp5.so => /sw/pkg/intel/11.1.064//lib/intel64/libiomp5.so (0x2ab561738000)
        libsvml.so => /sw/pkg/intel/11.1.064//lib/intel64/libsvml.so (0x2ab5618e9000)
        libimf.so => /sw/pkg/intel/11.1.064//lib/intel64/libimf.so (0x2ab562694000)
        libifport.so.5 => /sw/pkg/intel/11.1.064//lib/intel64/libifport.so.5 (0x2ab562a28000)
        libifcoremt.so.5 => /sw/pkg/intel/11.1.064//lib/intel64/libifcoremt.so.5 (0x2ab562b61000)
        libintlc.so.5 => /sw/pkg/intel/11.1.064//lib/intel64/libintlc.so.5 (0x2ab562e05000)
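
(One additional check, not part of the original exchange, that bears on issues
b) and c) above: confirm which Open MPI and fftw installations the Wien2k mpi
binaries are actually linked against. A minimal sketch, run on a child node:)

ldd $WIENROOT/lapw0_mpi | grep -i -E 'mpi|fftw'
# anything resolving under /home/hikmpn/local would be the gcc/ifort-10.1 Open MPI build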

-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.


[Wien] Fwd: MPI segmentation fault

2010-01-29 Thread Md. Fhokrul Islam

Hi Marks,

Thanks for pointing out possible problems with our system. I will talk to the
system admin about these issues.

Fhokrul



[Wien] Fwd: MPI segmentation fault

2010-01-30 Thread Md. Fhokrul Islam






Hi Marks,

I have followed your suggestions: I am now using openmpi 1.4.1 compiled with
icc, I have compiled fftw with cc instead of gcc, and I have recompiled
Wien2k with the mpirun option in parallel_options:

current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_ -x LD_LIBRARY_PATH

Although I no longer get a segmentation fault, the job still crashes at lapw1
with a different error message. I have pasted case.dayfile and case.error
below, along with the ompi_info and stacksize info. I am not even sure where
to look for the solution. Please let me know if you have any suggestions
regarding this MPI problem.

Thanks,
Fhokrul 

case.dayfile:

cycle 1 (Sat Jan 30 16:49:55 CET 2010)  (200/99 to go)

>   lapw0 -p    (16:49:55) starting parallel lapw0 at Sat Jan 30 16:49:56 CET 2010
 .machine0 : 4 processors
1863.235u 21.743s 8:21.32 376.0%    0+0k 0+0io 1068pf+0w
>   lapw1  -c -up -p    (16:58:17) starting parallel lapw1 at Sat Jan 30 16:58:18 CET 2010
->  starting parallel LAPW1 jobs at Sat Jan 30 16:58:18 CET 2010
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
 mn117.mpi mn117.mpi mn117.mpi mn117.mpi(1) 1263.782u 28.214s 36:47.58 58.5%    0+0k 0+0io 49300pf+0w
**  LAPW1 crashed!
1266.358u 37.286s 36:53.31 58.8%    0+0k 0+0io 49425pf+0w
error: command   /disk/global/home/eishfh/Wien2k_09_2/lapw1cpara -up -c uplapw1.def   failed

Error file:

 LAPW0 END
 LAPW0 END
 LAPW0 END
 LAPW0 END
--
mpirun noticed that process rank 0 with PID 8837 on node mn117.local exited on signal 9 (Killed).
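
(Aside, not part of the original output: an exit on signal 9 means the process
was killed externally, most often by the kernel's OOM killer when the node runs
out of memory. A quick check, assuming access to the node named above:)

ssh mn117 "dmesg | grep -i -E 'out of memory|oom-killer|killed process'"
# any hits around the crash time point to memory exhaustion rather than a code bug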


[eishfh at milleotto s110]$ ompi_info

                 Package: Open MPI root at milleotto.local Distribution
                Open MPI: 1.4.1
                  Prefix: /sw/pkg/openmpi/1.4.1/intel/11.1
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: milleotto.local
           Configured by: root
           Configured on: Sat Jan 16 19:40:36 CET 2010
          Configure host: milleotto.local
              Built host: milleotto.local
 Fortran90 bindings size: small
              C compiler: icc
     C compiler absolute: /sw/pkg/intel/11.1.064//bin/intel64/icc
            C++ compiler: icpc
   C++ compiler absolute: /sw/pkg/intel/11.1.064//bin/intel64/icpc
      Fortran77 compiler: ifort
  Fortran77 compiler abs: /sw/pkg/intel/11.1.064//bin/intel64/ifort
      Fortran90 compiler: ifort
  Fortran90 compiler abs: /sw/pkg/intel/11.1.064//bin/intel64/ifort


stacksize:

 [eishfh at milleotto s110]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 73728
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 73728
virtual memory          (kbytes, -v) unlimited
file locks              (-x) unlimited








[Wien] Fwd: MPI segmentation fault

2010-01-30 Thread Md. Fhokrul Islam

Hi Marks,

   In addition to what I sent in my previous email, I would like to mention
that if I use 8 processors instead of 4, I get the segmentation error at lapw0.

Thanks,
Fhokrul 


From: fis...@hotmail.com
To: wien at zeus.theochem.tuwien.ac.at
Date: Sat, 30 Jan 2010 18:51:59 +
Subject: Re: [Wien] Fwd: MPI segmentation fault













Hi Marks,

I have followed your suggestions and have used openmpi 1.4.1 compiled with 
icc.
I also have compiled fftw with cc instead of gcc and recompiled Wien2k with 
mpirun option
in parallel_options:

current:MPIRUN:mpirun -np _NP_ -machinefile _HOSTS_ _EXEC_ -x LD_LIBRARY_PATH
 
Although I didn't get segmentation fault but the job still crashes at lapw1 
with a different error 
message. I have pasted case.dayfile and case.error below along with ompi_info 
and stacksize
info. I am not even sure where to look for the solution. Please let me know if 
you have any
suggestions regarding this MPI problem.

Thanks,
Fhokrul 

case.dayfile:

cycle 1 (Sat Jan 30 16:49:55 CET 2010)  (200/99 to go)

>   lapw0 -p(16:49:55) starting parallel lapw0 at Sat Jan 30 16:49:56 CET 
> 2010
 .machine0 : 4 processors
1863.235u 21.743s 8:21.32 376.0%0+0k 0+0io 1068pf+0w
>   lapw1  -c -up -p(16:58:17) starting parallel lapw1 at Sat Jan 30 
> 16:58:18 CET 2010
->  starting parallel LAPW1 jobs at Sat Jan 30 16:58:18 CET 2010
running LAPW1 in parallel mode (using .machines)
1 number_of_parallel_jobs
 mn117.mpi mn117.mpi mn117.mpi mn117.mpi(1) 1263.782u 28.214s 36:47.58 
58.5%0+0k 0+0io 49300pf+0w
**  LAPW1 crashed!
1266.358u 37.286s 36:53.31 58.8%0+0k 0+0io 49425pf+0w
error: command   /disk/global/home/eishfh/Wien2k_09_2/lapw1cpara -up -c 
uplapw1.def   failed

Error file:

 LAPW0 END
 LAPW0 END
 LAPW0 END
 LAPW0 END
--
mpirun noticed that process rank 0 with PID 8837 on node mn117.local exited on 
signal 9 (Killed).


[eishfh at milleotto
s110]$ ompi_info

 Package: Open MPI
root at milleotto.local Distribution

Open MPI: 1.4.1

  Prefix:
/sw/pkg/openmpi/1.4.1/intel/11.1

 Configured architecture:
x86_64-unknown-linux-gnu

  Configure host: milleotto.local

   Configured by: root

   Configured on: Sat Jan 16 19:40:36
CET 2010

  Configure host: milleotto.local

  Built host: milleotto.local

Fortran90 bindings
size: small

  C compiler: icc

 C compiler absolute:
/sw/pkg/intel/11.1.064//bin/intel64/icc

C++ compiler: icpc

   C++ compiler absolute:
/sw/pkg/intel/11.1.064//bin/intel64/icpc

  Fortran77 compiler: ifort

  Fortran77 compiler abs:
/sw/pkg/intel/11.1.064//bin/intel64/ifort

  Fortran90 compiler: ifort

  Fortran90 compiler abs:
/sw/pkg/intel/11.1.064//bin/intel64/ifort


stacksize:



 [eishfh at milleotto s110]$ ulimit -a

core file size  (blocks, -c) 0

data seg size   (kbytes, -d) unlimited

scheduling
priority (-e) 0

file size   (blocks, -f) unlimited

pending signals (-i) 73728

max locked
memory   (kbytes, -l) 32

max memory size (kbytes, -m) unlimited

open files  (-n) 1024

pipe size(512 bytes, -p) 8

POSIX message
queues (bytes, -q) 819200

real-time
priority  (-r) 0

stack size  (kbytes, -s) unlimited

cpu time   (seconds, -t) unlimited

max user
processes  (-u) 73728

virtual memory  (kbytes, -v) unlimited

file locks  (-x) unlimited





> 
> In essence, you have a mess and you are going to have to talk to your
> sysadmin (hikmpn) to get things sorted out. Issues:
> 
> a) You have openmpi-1.3.3. This works for small problems, fails for
> large ones. This needs to be updated to 1.4.0 or 1.4.1 (the older
> versions of openmpi have bugs).
> b) The openmpi was compiled with ifort 10.1 but you are using 11.1.064
> for Wien2k -- could lead to problems.
> c) The openmpi was compiled with gcc and ifort 10.1, not icc and ifort
> which could lead to problems.
> d) The fftw library you are using was compiled with gcc not icc, this
> could lead to problems.
> e) Some of the shared libraries are in your LD_LIBRARY_PATH, you will
> need to add -x LD_LIBRARY_PATH to how mpirun is called (in
> $WIENROOT/parallel_options) -- look at man mpirun.
> f) I still don't know what the stack limits are on your machine --
> this can lead to severe problems in lapw0_mpi

  
Hotmail: Trusted email with Microsoft?s powerful SPAM protection. Sign up now.  
  
_
Hotmail: Free, trusted and rich email service.
https://signup.live.com/

[Wien] Fwd: MPI segmentation fault

2010-01-30 Thread Laurence Marks
OK, it looks like you have cleaned up many of the issues. The SIGSEGV is
(I think) now one of two things:

a) memory limitations (how much do you have, 8 GB or 16-24 GB?)

While the process is running, do a "top" and see how much memory is
allocated and whether this is essentially all of it. If you have ganglia
available you can use that to see this readily. Similar information is also
available from "cat /proc/meminfo" or the nmon utility from IBM
(google it, it is easy to compile). I suspect that you are simply
running out of memory by running too many tasks at the same time on one
machine -- you would need to use more machines so the memory usage on
any one is smaller.
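
(A minimal sketch of such a check, not from the original mail; mn117 is the
node name taken from your dayfile:)

# total/free memory and swap on the compute node while lapw1_mpi is running
ssh mn117 'grep -E "MemTotal|MemFree|SwapFree" /proc/meminfo'
# the most memory-hungry processes on that node, largest resident size first
ssh mn117 'ps aux --sort=-rss | head -n 12'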

b) stacksize issue (less likely)

This is an issue with openmpi, see
http://www.open-mpi.org/community/lists/users/2008/09/6491.php . In a
nutshell, the stacksize limit is not an environment variable, so there is no
direct way to set it on the remote nodes with openmpi except to use
a wrapper. I have a patch for this, but let's try something simpler
first (which I think is OK, but I might have it slightly wrong).

* Create a file called wrap.sh in your search path (e.g. ~/bin or even
$WIENROOT) and put in it
#!/bin/bash
source $HOME/.bashrc
ulimit -s unlimited
#write a line so we know we got here
echo "Hello Fhorkul"
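# run the actual mpi command: the executable and (up to four) arguments passed by mpirun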
$1 $2 $3 $4

* Do a "chmod a+x wrap.sh" (appropriate location of course)

* Edit parallel_options in $WIENROOT so it reads
setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_
-machinefile _HOSTS_ wrap.sh _EXEC_"

This does the same as is described in the email link above, forcing
the Wien2k mpi commands to be executed from within a bash shell so the
environment and limits are set up. If this works then I can provide details
for a more general patch.
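
(A slightly more general form of the wrapper, not from the original mail,
forwards however many arguments mpirun passes instead of exactly four:)

#!/bin/bash
# set up the login environment and lift the stack limit on the remote node
source $HOME/.bashrc
ulimit -s unlimited
# replace the shell with the real command (e.g. lapw0_mpi plus its arguments)
exec "$@"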




[Wien] Fwd: MPI segmentation fault

2010-01-30 Thread Md. Fhokrul Islam

Marks,

   Thanks again for your quick reply. You are probably right that it's a
memory problem, since the system I am using for testing my jobs has very
little memory (only 1 GB per processor). I will try to run the job on a
better machine (4 GB per processor) that is available in our system.

Best,
Fhokrul

