On 11.04.2015 at 03:16, Marlies Hankel wrote:

> Dear all,
> 
> Yes, I checked the paths and they looked OK. I also made sure that it finds 
> the right MPI version, the VASP path, etc.
> 
> I do not think h_vmem is the problem, as I do not get any errors in the 
> queue logs, for example. Also, in the end I changed h_vmem to be not 
> consumable, and I also requested a lot of memory; neither made any difference.
> 
> I will try an Open MPI 1.6.5 build and see if that makes any 
> difference.
> 
> Would the network scan cause SGE to abort the job?

No. But there is a delay in startup.
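
If the startup delay itself is the concern, one workaround is to restrict Open MPI to the interfaces you actually want (a sketch, assuming the cluster's TCP traffic runs over eth0 and using a placeholder application name; adjust both to your setup):

# limit both the out-of-band and the TCP BTL traffic to one interface
mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np $NSLOTS ./your_app

The same MCA parameters can also be set cluster-wide in openmpi-mca-params.conf instead of on every command line.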

BTW: Are you using ScaLAPACK for VASP?
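
(If it helps to check: for a dynamically linked build, something like

ldd $(which vasp) | grep -i scalapack

shows whether ScaLAPACK is linked in - the binary name `vasp` is an assumption here. For a static build, grepping the makefile that was used for the -DscaLAPACK precompiler flag is another option.)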

-- Reuti


> I do get some message about finding two IBs, but I also get that when I run 
> interactively (ssh to the node, not via a qlogin). I have switched that off 
> too via an MCA parameter to make sure it was not causing trouble.
> 
> Best wishes
> 
> Marlies
> 
> 
> On 04/10/2015 08:12 PM, Reuti wrote:
>>> On 10.04.2015 at 04:51, Marlies Hankel <[email protected]> wrote:
>>> 
>>> Dear all,
>>> 
>>> I have a ROCKS 6.1.1 install and I have also installed the SGE roll. So the 
>>> base config was done via the ROCKS install. The only changes I have made 
>>> are setting the h_vmem complex to consumable and setting up a scratch 
>>> complex. I have also set the h_vmem for all hosts.
>> And does the VASP job work without h_vmem? We are using VASP too and have 
>> no problems with any h_vmem setting.
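>> 
>> A quick way to compare (a sketch; the job script name is hypothetical) would 
>> be to submit the same job once with and once without the memory request:
>> 
>> qsub -pe orte 40 run_vasp.sh
>> qsub -pe orte 40 -l h_vmem=4G run_vasp.sh
>> 
>> Keep in mind that with a consumable h_vmem the request counts per slot, so 
>> the effective limit on each host is multiplied by the slots granted there.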
>> 
>> 
>>> I can run single CPU jobs fine and can execute simple things like
>>> 
>>> mpirun -np 40 hostname
>>> 
>>> but I cannot run proper MPI programs. I get the following error.
>>> 
>>> mpirun noticed that process rank 0 with PID 27465 on node phi-0-3 exited on 
>>> signal 11 (Segmentation fault).
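>>> 
>>> (To narrow this down, a minimal reproduction under SGE might help; this is 
>>> a sketch which assumes the Open MPI source tree with its examples/ 
>>> directory is at hand:
>>> 
>>> mpicc examples/hello_c.c -o hello_c
>>> qsub -pe orte 2 -b y -cwd mpirun -np 2 ./hello_c
>>> 
>>> If even hello_c segfaults under SGE but runs fine interactively, the 
>>> problem lies in the job startup environment rather than in the 
>>> applications.)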
>> Are you using the correct `mpiexec` during the execution of a job as well, 
>> i.e. between the nodes? Maybe the interactive login has a different $PATH 
>> than the one set inside a job script.
>> 
>> And if it's from Open MPI: was the application compiled with the same 
>> version of Open MPI whose `mpiexec` is used later on all nodes?
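>> 
>> An easy check (a minimal sketch; the `orte` PE is taken from this thread, 
>> the rest is an assumption) is a test job that records what the job 
>> environment really sees:
>> 
>> #!/bin/sh
>> #$ -pe orte 2
>> #$ -cwd -j y
>> echo "PATH=$PATH"
>> which mpiexec
>> mpiexec --version
>> ompi_info | head -5
>> 
>> and then to compare the output with the same commands run in an interactive 
>> ssh session on a node.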
>> 
>> 
>>> Basically, the queue error logs on the head node and the execution nodes 
>>> show nothing (/opt/gridengine/default/spool/../messages), and the .e, .o, 
>>> .pe and .po files also show nothing. The above error is in the standard 
>>> output file of the program. I am trying VASP but have also tried a 
>>> home-grown MPI code. Both of these have been running out of the box via 
>>> SGE for years on our old cluster (which was not ROCKS). I have tried the 
>>> supplied orte PE (programs are compiled with Open MPI 1.8.4
>> The easiest would be to stay with Open MPI 1.6.5 as long as possible. In 
>> the 1.8 series they changed some things which might hinder proper use:
>> 
>> - Core binding is enabled by default in Open MPI 1.8. With two MPI jobs on 
>> a node, they may use the same cores and leave others idle. One can use 
>> "--bind-to none" and leave the binding by SGE in effect (if any); see the 
>> sketch after this list. The behavior differs in that SGE gives a job a set 
>> of cores and the Linux scheduler is free to move the processes around 
>> inside this set, while the native binding in Open MPI is per process 
>> (something SGE can't do, of course, as Open MPI creates additional forks 
>> after the initial startup of `orted`). Sure, the set of cores granted by 
>> SGE could be rearranged into a list to hand to Open MPI.
>> 
>> - Open MPI may scan the network before the actual job starts, to find all 
>> possible routes between the nodes. Depending on the network setup this may 
>> take 1-2 minutes.
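>> 
>> For the binding, a job script could look like this (a sketch; the `orte` PE 
>> and the 40 slots are taken from this thread, the binary name is an 
>> assumption):
>> 
>> #!/bin/sh
>> #$ -pe orte 40
>> mpirun --bind-to none -np $NSLOTS vasp
>> 
>> This disables Open MPI's own per-process binding and leaves any core set 
>> granted by SGE in effect.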
>> 
>> -- Reuti
>> 
>> 
>>> compiled with Intel and with --with-sge and --with-verbs) and have also 
>>> tried a PE where I specify catch_rsh and the startmpi/stopmpi scripts, but 
>>> it made no difference. It seems as if the program does not even start. I 
>>> am not even trying to run over several nodes yet.
>>> 
>>> Adding to that, I can run the program (VASP) perfectly fine by ssh'ing to 
>>> a node and just running it from the command line, and also over several 
>>> nodes via a hostfile. So VASP itself is working fine.
>>> 
>>> I had a look at env and made sure the ulimits are set OK (VASP needs 
>>> ulimit -s unlimited to work), and all looks OK.
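>>> 
>>> (For completeness: the limits inside a batch job can differ from those in 
>>> an interactive shell. A quick check, as a sketch:
>>> 
>>> echo 'ulimit -s; ulimit -a' | qsub -cwd -j y
>>> 
>>> shows what a job really gets; if the stack limit is too low there, it can 
>>> be raised per job with -l h_stack=... where that complex is configured.)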
>>> 
>>> Has anyone seen this problem before? Or do you have any suggestions on 
>>> what to do to get some info on where it actually goes wrong?
>>> 
>>> Thanks in advance
>>> 
>>> Marlies
>>> -- 
>>> 
>>> ------------------
>>> 
>>> Dr. Marlies Hankel
>>> Research Fellow, Theory and Computation Group
>>> Australian Institute for Bioengineering and Nanotechnology (Bldg 75)
>>> eResearch Analyst, Research Computing Centre and Queensland Cyber 
>>> Infrastructure Foundation
>>> The University of Queensland
>>> Qld 4072, Brisbane, Australia
>>> Tel: +61 7 334 63996 | Fax: +61 7 334 63992 | mobile: 0404262445
>>> Email: [email protected] | www.theory-computation.uq.edu.au
>>> 
> 
> -- 
> 
> ccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccms
> 
> Please note change of work hours: Monday, Wednesday and Friday
> 
> Dr. Marlies Hankel
> Research Fellow
> High Performance Computing, Quantum Dynamics & Nanotechnology
> Theory and Computational Molecular Sciences Group
> Room 229 Australian Institute for Bioengineering and Nanotechnology  (75)
> The University of Queensland
> Qld 4072, Brisbane
> Australia
> Tel: +61 (0)7-33463996
> Fax: +61 (0)7-334 63992
> mobile: +61 (0)404262445
> Email: [email protected]
> http://web.aibn.uq.edu.au/cbn/
> 
> ccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccmsccms
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
