Dear Reuti,

No, I am not using ScaLAPACK for now. We do not have Intel MPI, and at the moment I just needed to get things going so that our new cluster is up and usable.

All our calculations are MPI based, not just VASP, and my own home-grown code does not run through SGE either, so I hope I can find the problem soon.

Best wishes

Marlies

On 04/11/2015 07:40 PM, Reuti wrote:
On 04/11/2015 03:16 AM, Marlies Hankel wrote:

Dear all,

Yes, I checked the paths and they looked OK. I also made sure that it finds the right MPI version, the VASP path, etc.

I do not think h_vmem is the problem, as I do not get any errors in the queue logs, for example. Also, in the end I changed h_vmem to be not consumable, and I also requested a lot of memory, and that made no difference.
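
For reference, this is roughly how I double-check how the complex is defined and what a test job actually requests (a minimal sketch for a standard SGE setup; the script name vasp_test.sh is just a placeholder):

# how is h_vmem defined (consumable? default value?)
qconf -sc | grep h_vmem

# what is set per execution host
qconf -se phi-0-3 | grep complex_values

# request it explicitly for a test job and check what was recorded
qsub -pe orte 40 -l h_vmem=4G vasp_test.sh
qstat -j <jobid> | grep 'hard resource_list'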

I will try an Open MPI 1.6.5 build and see if that makes any difference.

Would the network scan cause SGE to abort the job?
No. But there is a delay in startup.

BTW: Are you using ScaLAPACK for VASP?

-- Reuti


I do get some message about finding two IB interfaces, but I also get that when I run interactively (ssh to the node, not via a qlogin). I have switched that off too, via MCA parameters, to make sure it was not causing trouble.
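
For what it is worth, this is roughly what I mean by switching it off via MCA (a sketch only; mlx4_0 is just an example interface name, the executable name is illustrative, and the second line simply takes InfiniBand out of the picture for a test run):

# pin Open MPI to a single HCA/port
mpirun --mca btl_openib_if_include mlx4_0:1 -np 40 vasp

# or bypass the openib BTL entirely for a test
mpirun --mca btl self,sm,tcp -np 40 vasp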

Best wishes

Marlies


On 04/10/2015 08:12 PM, Reuti wrote:
On 04/10/2015 04:51 AM, Marlies Hankel <[email protected]> wrote:

Dear all,

I have a ROCKS 6.1.1 install, and I have also installed the SGE roll, so the base configuration was done via the ROCKS install. The only changes I have made are setting the h_vmem complex to consumable and setting up a scratch complex. I have also set h_vmem for all hosts.
And does the VASP job work without h_vmem? We are using VASP too and have no problems with h_vmem set.


I can run single CPU jobs fine and can execute simple things like

mpirun -np 40 hostname

but I cannot run proper MPI programs. I get the following error.

mpirun noticed that process rank 0 with PID 27465 on node phi-0-3 exited on 
signal 11 (Segmentation fault).
Are you using the correct `mpiexec` during the execution of a job too, i.e. between the nodes? Maybe the interactive login has a different $PATH set than the one inside a job script.

And if it is from Open MPI: was the application compiled with the same version of Open MPI whose `mpiexec` is used later on, on all nodes?
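
A quick way to check is to let the job itself report what it sees and compare that with an interactive ssh session (a minimal sketch; the orte PE name is taken from your description below):

#!/bin/sh
#$ -pe orte 40
#$ -cwd
#$ -j y
# report what the batch environment actually provides
echo "PATH inside job: $PATH"
which mpiexec
mpiexec --version

If the path or version differs from what `which mpiexec; mpiexec --version` prints in an interactive login, that mismatch would be a likely candidate for the crash at startup.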


Basically, the queue's error logs on the head node and the execution nodes show nothing (/opt/gridengine/default/spool/../messages), and the .e, .o, .pe and .po files show nothing either. The above error is in the standard output file of the program. I am trying VASP, but have also tried a home-grown MPI code. Both of these have been running out of the box via SGE for years on our old cluster (which was not ROCKS). I have tried the supplied orte PE (programs are compiled with Open MPI 1.8.4
The easiest would be to stay with Open MPI 1.6.5 as long as possible. In the 1.8 series they changed some things which might hinder proper use:

- Core binding is enabled by default in Open MPI 1.8. With two MPI jobs on a node, they may use the same cores and leave others idle. One can use "--bind-to none" and leave SGE's binding (if any) in effect; see the sketch after this list. The behavior differs in that SGE gives a job a set of cores and the Linux scheduler is free to move the processes around inside this set, whereas the native binding in Open MPI is per process (something SGE can't do, of course, as Open MPI forks additional processes after the initial startup of `orted`). (Sure, the set of cores given by SGE could be rearranged and passed as a list to Open MPI.)

- Open MPI may scan the network before the actual job starts, in order to discover all possible routes between the nodes. Depending on the network setup, this may take 1-2 minutes.
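
As a sketch of what the first point means in a job script (assuming the orte PE from your setup; the executable name is just a placeholder):

#!/bin/sh
#$ -pe orte 40
#$ -cwd
#$ -j y
# leave SGE's core binding (if configured) in effect and
# switch off Open MPI 1.8's own per-process binding
mpirun --bind-to none -np $NSLOTS ./vasp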

-- Reuti


compiled with the Intel compilers and with --with-sge and --with-verbs), and I have also tried a PE where I specify catch_rsh and the startmpi and stopmpi scripts, but it made no difference. It seems as if the program does not even start. I am not even trying to run over several nodes yet.
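
For reference, this is the kind of PE definition that Open MPI's tight SGE integration generally expects, as far as I understand it (a sketch only; slots and allocation_rule are site-specific, and the supplied orte PE may of course differ):

qconf -sp orte

pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min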

Adding to that, I can run the program (VASP) perfectly fine by ssh-ing to a node and just running it from the command line, and also over several nodes via a hostfile. So VASP itself is working fine.

I had a look at env and made sure the ulimits are set correctly (VASP needs ulimit -s unlimited to work), and all looks OK.
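
For completeness, a sketch of how I set and record the limits inside the job script itself, right before the MPI start-up (the executable path is just a placeholder):

# in the job script, before mpirun:
ulimit -s unlimited   # VASP needs an unlimited stack size
ulimit -a             # record the limits the batch job actually gets
mpirun -np $NSLOTS ./vasp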

Has anyone seen this problem before? Or do you have any suggestions on how to get some information on where it actually goes wrong?

Thanks in advance

Marlies

--

ccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccms

Please note change of work hours: Monday, Wednesday and Friday

Dr. Marlies Hankel
Research Fellow
High Performance Computing, Quantum Dynamics & Nanotechnology
Theory and Computational Molecular Sciences Group
Room 229 Australian Institute for Bioengineering and Nanotechnology  (75)
The University of Queensland
Qld 4072, Brisbane
Australia
Tel: +61 (0)7 3346 3996
Fax: +61 (0)7 3346 3992
Mobile: +61 (0)404 262 445
Email: [email protected]
http://web.aibn.uq.edu.au/cbn/

ccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccms
