[OMPI devel] large virtual memory consumption on smp nodes and gridengine problems

2007-06-10 Thread Markus Daene
Dear all,

I hope this is the right mailing list for my problem.
I am trying to run openmpi with the gridengine (6.0u10, 6.1). I compiled
openmpi (1.2.2) with gridengine support included, which I have checked
with ompi_info. In principle, openmpi runs well.

The gridengine is configured such that the user has to specify the
memory consumption via the h_vmem option. I noticed that with a larger
number of processes the job is killed by the gridengine for taking too
much memory.

To take a closer look at that, I wrote a small and simple Fortran MPI
program which has just an MPI_Init and a static array (50 MB in my
case). The program then goes into an infinite loop, because it takes
some time until the gridengine reports the maxvmem.

I found that if the processes all run on different nodes, there is only
an offset per process, so at least the scaling is linear. It becomes
worse when the processes all run on one node: there the scaling seems
to be quadratic in the offset, which is about 30 MB in my case. Below is
a list of the virtual memory reported by the gridengine when running on
a 16-processor node:

#N proc   virt. mem [MB]
 1          182
 2          468
 3          825
 4         1065
 5         1001
 6         1378
 7         1817
 8         2303
12         4927
16         8559

The program itself should need only N*50 MB, which is just 800 MB for 16
processes, but it takes 10 times more, over 7 GB!
Of course, the gridengine kills the job for excessive virtual memory
consumption if this overhead is not taken into account. The memory
consumption is not related to the gridengine; it is the same if I run
from the command line.
I guess it might be related to the 'sm' component of the btl.
Is it possible to avoid the quadratic scaling?
Of course I could use only the mvapi/tcp component, like
mpirun --mca btl mvapi -np 16 ./my_test_program
In this case the virtual memory is fine, but that is not what one
wants on an SMP node.
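
To make the scaling in the table above concrete, the following minimal
Python sketch subtracts the expected N*50 MB from each reported value and
lists what is left over per process. The numbers are copied from the
table; the function and variable names are illustrative only. If the
overhead were a fixed offset per process, the last column would stay
roughly constant.

```python
# Reported virtual memory (MB) per process count, copied from the table above.
REPORTED_MB = {1: 182, 2: 468, 3: 825, 4: 1065, 5: 1001,
               6: 1378, 7: 1817, 8: 2303, 12: 4927, 16: 8559}

ARRAY_MB = 50  # size of the static array in each process of the test program


def overhead_mb(nprocs, reported_mb):
    """Memory beyond the N * 50 MB that the arrays themselves need."""
    return reported_mb - nprocs * ARRAY_MB


def overhead_table(reported=REPORTED_MB):
    """Return (nprocs, total overhead, overhead per process) rows."""
    rows = []
    for nprocs in sorted(reported):
        extra = overhead_mb(nprocs, reported[nprocs])
        rows.append((nprocs, extra, extra / nprocs))
    return rows


if __name__ == "__main__":
    print("#N  overhead[MB]  overhead/proc[MB]")
    for nprocs, extra, per_proc in overhead_table():
        print(f"{nprocs:<3d} {extra:>12d}  {per_proc:>17.1f}")
```

For 16 processes this gives 7759 MB of overhead, about 485 MB per
process, compared with 132 MB for a single process.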


Then it becomes even worse:
openmpi nicely reports the (max./act.) used virtual memory to the
gridengine as the sum over all processes.
This value is then compared with the one the user has specified with the
h_vmem option, but the gridengine takes that value per process when
allocating the job (which works) and does not multiply it by the number
of processes. Maybe one should report this to the gridengine mailing
list, but it could also be related to the openmpi interface.

The last thing I noticed:
It seems that if the h_vmem option for gridengine jobs is specified as
'2.0G', my test job is immediately killed, but when I specify '2000M'
(which is obviously less) it works. The gridengine always puts the job
on the correct node as requested, but I think there might be a problem
in the openmpi interface.
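
For reference, the two limits can be compared in bytes with a short
Python sketch. It assumes the multiplier convention documented for
Grid Engine memory values (uppercase K, M, G are powers of 1024,
lowercase k, m, g are powers of 1000). The parser is hypothetical and
accepts a fractional value such as '2.0' purely for the sake of the
comparison; Grid Engine itself may not parse it that way.

```python
from decimal import Decimal

# Assumed multiplier convention for Grid Engine memory values.
MULTIPLIERS = {
    "k": 1000, "K": 1024,
    "m": 1000 ** 2, "M": 1024 ** 2,
    "g": 1000 ** 3, "G": 1024 ** 3,
}


def to_bytes(spec):
    """Convert a value like '2000M' or '2.0G' to a byte count (illustrative)."""
    spec = spec.strip()
    if spec and spec[-1] in MULTIPLIERS:
        number, factor = spec[:-1], MULTIPLIERS[spec[-1]]
    else:
        number, factor = spec, 1  # no suffix: plain bytes
    return int(Decimal(number) * factor)


if __name__ == "__main__":
    for spec in ("2.0G", "2000M"):
        print(f"{spec:>6s} = {to_bytes(spec)} bytes")
```

Under this convention '2.0G' comes to 2147483648 bytes and '2000M' to
2097152000 bytes, so the limit that works is indeed the smaller one.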


It would be nice if someone could give some hints on how to avoid the
quadratic scaling, or consider whether it is really necessary in
openmpi.


Thanks.
Markus Daene




My compile options:
./configure --prefix=/not_important --enable-static \
    --with-f90-size=medium --with-f90-max-array-dim=7 \
    --with-mpi-param-check=always --enable-cxx-exceptions --with-mvapi \
    --enable-mca-no-build=btl-tcp

ompi_info output:
                Open MPI: 1.2.2
   Open MPI SVN revision: r14613
                Open RTE: 1.2.2
   Open RTE SVN revision: r14613
                    OPAL: 1.2.2
       OPAL SVN revision: r14613
                  Prefix: /usrurz/openmpi/1.2.2/pathscale_3.0
 Configured architecture: x86_64-unknown-linux-gnu
           Configured by: root
           Configured on: Mon Jun  4 16:04:38 CEST 2007
          Configure host: GE1N01
                Built by: root
                Built on: Mon Jun  4 16:09:37 CEST 2007
              Built host: GE1N01
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: pathcc
     C compiler absolute: /usrurz/pathscale/bin/pathcc
            C++ compiler: pathCC
   C++ compiler absolute: /usrurz/pathscale/bin/pathCC
      Fortran77 compiler: pathf90
  Fortran77 compiler abs: /usrurz/pathscale/bin/pathf90
      Fortran90 compiler: pathf90
  Fortran90 compiler abs: /usrurz/pathscale/bin/pathf90
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: yes
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
     MPI parameter check: always
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
           MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.2)
              MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.2)
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.2)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.2)
           MCA maffinity: libnuma (MCA

Re: [OMPI devel] large virtual memory consumption on smp nodes and gridengine problems

2007-06-10 Thread Ralph Castain
Hi Markus

There are two MCA params that can help you, I believe:

1. You can set the maximum size of the shared memory file with

-mca mpool_sm_max_size xxx

where xxx is the maximum memory file you want, expressed in bytes. The
default value I see is 512MBytes.

2. You can set the size/peer of the file, again in bytes:

-mca mpool_sm_per_peer_size xxx

This will allocate a file that is xxx * num_procs_on_the_node on each node,
up to the maximum file size (either the 512MB default or whatever you
specified using the previous param). This defaults to 32MBytes/proc.


I see that there is also a minimum (total, not per-proc) file size that
defaults to 128MBytes. If that is still too large, you can adjust it using

-mca mpool_sm_min_size yyy
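
As a rough illustration of how these three parameters appear to combine,
here is a minimal Python sketch. It assumes the file size is the per-peer
size times the number of processes on the node, clamped between the
minimum and the maximum, using the defaults quoted above; this reading
and the function names are assumptions, not taken from the Open MPI
source. The second function further assumes that every local process
maps the whole file and that per-process virtual sizes are summed, which
would make the total grow with the square of the process count until the
maximum is reached.

```python
MB = 1024 * 1024

# Defaults as quoted in this message.
DEFAULT_PER_PEER = 32 * MB   # mpool_sm_per_peer_size
DEFAULT_MIN = 128 * MB       # mpool_sm_min_size
DEFAULT_MAX = 512 * MB       # mpool_sm_max_size


def sm_file_size(nprocs_on_node, per_peer=DEFAULT_PER_PEER,
                 min_size=DEFAULT_MIN, max_size=DEFAULT_MAX):
    """Assumed shared memory file size in bytes for one node."""
    wanted = per_peer * nprocs_on_node
    return min(max_size, max(min_size, wanted))


def summed_mapping(nprocs_on_node, **params):
    """Assumed total if each local process maps the whole file once."""
    return nprocs_on_node * sm_file_size(nprocs_on_node, **params)


if __name__ == "__main__":
    print("#N  file[MB]  summed over procs[MB]")
    for n in (1, 2, 4, 8, 12, 16):
        print(f"{n:<3d} {sm_file_size(n) // MB:>8d}  {summed_mapping(n) // MB:>21d}")
```

Under these assumptions, 16 processes on one node would map
16 x 512 MB = 8192 MB in total with the default settings, and lowering
mpool_sm_max_size reduces that figure proportionally.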


Hope that helps
Ralph



On 6/10/07 2:55 PM, "Markus Daene"  wrote:

> [...]